Harmonia: Tenant-Provider Cooperation for Work-Conserving Bandwidth Guarantees

Frederick Douglas, Lucian Popa, Sujata Banerjee, Praveen Yalagandula, Jayaram Mudigonda, Matthew Caesar
Hewlett Packard Labs
HPE-2016-16

Keyword(s): Cloud bandwidth guarantees; multi-path networks

External Posting Date: February 19, 2016 [Fulltext]
Internal Posting Date: February 19, 2016 [Fulltext]
Copyright 2016 Hewlett Packard Enterprise Development LP

Harmonia: Tenant-Provider Cooperation for Work-Conserving Bandwidth Guarantees

Frederick Douglas (UIUC), Lucian Popa (Databricks), Sujata Banerjee (HP Labs), Praveen Yalagandula (Avi Networks), Jayaram Mudigonda (Google), Matthew Caesar (UIUC)

ABSTRACT

In current cloud datacenters, a tenant's share of the network bandwidth is determined by congestion control interactions with other tenants' traffic. This can lead to unfair bandwidth allocation and unpredictable application performance. Performance isolation can be achieved by providing bandwidth guarantees for each tenant. However, since networks are lightly loaded on average and providers do not want to waste network resources, a guarantee scheme should be work conserving, i.e., it should fully utilize the network even when some tenants are not active. Unfortunately, the work-conserving guarantee schemes proposed thus far suffer from two problems that have hindered their deployment: (a) they are complex and incur a significant overhead to the provider; (b) they have not been designed or tested with multi-path networks, even though all modern cloud datacenter networks have multiple paths and, as we will show, bandwidth guarantees must be split across multiple paths to fully utilize the network. To address these challenges, we propose a radical approach that departs from the belief that in a cloud with bandwidth guarantees, the provider alone should be in charge of spare bandwidth allocation. Unlike prior proposals, we empower tenants to control the work-conserving bandwidth, just as today, while the provider only has to enforce bandwidth guarantees as an additional service. Our solution, Harmonia, relies on two key ingredients: (a) we separate bandwidth-guaranteed traffic from work-conserving traffic through the use of priority queues, and (b) we use MPTCP to apportion traffic (at packet granularity) to be guaranteed or work conserving, as well as to split guarantees across multiple paths.

1. INTRODUCTION

Modern cloud datacenter networks are mostly free-for-alls. A tenant's share of the network bandwidth is determined by the interaction of its VMs' congestion control algorithms with the unpredictable and wildly varying network usage of other tenants. This can lead to unfair bandwidth allocation and unpredictable application performance [3, 4, 11, 12, 14, 15, 17, 18].

Providing bandwidth guarantees in the cloud would lead to predictable lower bounds on the performance of tenant applications [3, 4, 8, 11, 14, 15, 17]. Bandwidth guarantees would thus be an excellent addition to today's cloud offerings, putting the network on par with other resources such as CPU and memory, and making the cloud more attractive for network-constrained applications.

However, slicing cloud networks into statically guaranteed shares is insufficient. Average load in datacenter networks is typically low [5, 20], and clouds provide good average network performance, albeit with no lower bound on the worst case. Thus, the ability to be work-conserving (i.e., to redistribute idle bandwidth to the active flows) is crucial, perhaps even more so than guarantees. Therefore, cloud providers cannot offer just strict bandwidth guarantees; they must offer work-conserving bandwidth guarantees (from now on WCBG), where tenants can exceed their guarantees when there is free capacity [11, 14, 15, 17].

Unfortunately, providing WCBG is a much harder problem than providing non-work-conserving guarantees. Existing solutions for offering WCBG suffer from two problems that have hindered their deployment.

The first problem with existing solutions is that they are complex, with high overhead. To be work-conserving, current solutions implement complex adaptive algorithms [11, 15, 17].
Moreover, to be efficient and satisfy guarantees, these algorithms must operate on very small timescales (milliseconds), resulting in high overhead.

The second problem is that none of the existing methods for sharing bandwidth in the cloud have been designed for or tested with multi-path networks, even though (a) almost all modern cloud datacenter networks have multiple paths, and (b) reserving bandwidth guarantees across multiple paths is necessary to fully utilize the network, as we will show later in this paper (§3). We will argue that support for multiple paths cannot be easily retrofitted onto existing solutions. In fact, the problem of offering WCBG is so difficult that most approaches only provide guarantees when congestion occurs at the edge of the network [11, 17].

These reasons tip the balance for cloud providers, such as Amazon, Microsoft, IBM or HP, towards not offering a guaranteed bandwidth service, even after numerous efforts aimed at this goal.

This paper proposes a radical, lightweight approach that will make it much easier for providers to offer WCBG on today's networks. Our solution combines three key ideas:
1. We fully decouple the provision of guarantees from the goal of fully utilizing the network. Unlike previous proposals for achieving WCBG, e.g., [11, 14, 15, 17], work conservation in Harmonia is achieved just as it is today: by tenants' congestion control algorithms filling all available capacity, rather than by providers' control processes constantly tweaking bandwidth allocations. This decoupling is achieved by separating a flow into "guaranteed" and "work-conserving" portions, with the guaranteed portion rate-limited and directed to higher-priority queues in the network. This separation should be transparent to the tenant.

2. We make multi-pathing a primary design decision. In §3 we demonstrate that in a multi-path topology, such as a fat tree, bandwidth reservation schemes that cannot split a single VM-to-VM reservation across multiple paths are fundamentally inefficient. Intuitively, there are situations where single-path schemes are forced to duplicate one reservation onto two links, whereas a multi-path scheme could safely split the reservation load over the two links.

3. To make the previous two ideas transparent to the user, and to accommodate packet reordering, uneven congestion across paths and failed links, we propose the use of a multi-path transport protocol inside tenant VMs. In our prototype, we use MPTCP [16]. MPTCP is a mature standard beginning to see serious adoption (e.g., Apple iOS 7). In the future, other such protocols could be used instead of MPTCP.

In a nutshell, our solution works as follows. We completely separate guaranteed traffic from work-conserving traffic (we refer to the traffic sent by a tenant up to its bandwidth guarantee as guaranteed traffic, and to the traffic above its guarantee as work-conserving traffic) by using priority queuing: guaranteed traffic is forwarded using a queue with higher priority than the work-conserving, opportunistic traffic. We fully expose this decoupling at the level of tenant VMs, through multiple virtual network interfaces: one interface's traffic is given a high priority but is rate-limited to enforce guarantees, while the others are not rate-limited, but get a lower priority.

We spread all the guaranteed traffic uniformly across all paths. We make sure that the guaranteed traffic will not congest any link by relying on prior work for non-work-conserving guarantees, such as Oktopus [3] or a subset of ElasticSwitch [15]. At the same time, to achieve work conservation, we allow tenants to send lower-priority traffic at will on all paths.

Tenants who are content with non-work-conserving guarantees, or who want only best-effort traffic without any guarantees, can operate unmodified from today. However, we expect the tenants that want to achieve full WCBG to replace TCP with MPTCP [16] in their VMs. As MPTCP is a transparent, drop-in replacement for TCP, and very easy to install (just an apt-get on Ubuntu, for example), we do not believe this to be unduly demanding. (In fact, we expect providers to offer VM images with MPTCP pre-installed, so that users do not need to install it themselves.) Furthermore, we stress that installing MPTCP is the only change that users need to make to their VMs and applications: MPTCP handles the splitting of traffic onto the guaranteed and work-conserving paths, while presenting the application with what appears to be a single TCP connection, through the standard sockets API. Note that MPTCP has a dual role in our design: it enables both the split between priority queues (guaranteed vs. non-guaranteed traffic) and the split across multiple physical paths.

2. PRIOR WORK ON BANDWIDTH GUARANTEES IN THE CLOUD

There is a tradeoff between the benefits of offering bandwidth guarantees and the overhead to the cloud providers. With existing proposals, the overhead dominates; therefore, cloud providers do not offer bandwidth guarantees. We now discuss why existing proposals are so expensive.

Gatekeeper [17] and EyeQ [11] require datacenters with fully provisioned network cores; otherwise, they cannot provide guarantees. However, all cloud datacenters that we are aware of are oversubscribed, and congestion is known to typically occur in the core [5, 20]. Moreover, oversubscription may never disappear: many providers consider spending the money to build networks with higher bisection bandwidth to be wasteful, since average utilization is low.

FairCloud [14] (the PS-P allocation scheme) can be very expensive because it requires a large number of hardware queues on each switch, on the order of the number of hosted VMs. Current commodity switches provide only a few queues. Finally, ElasticSwitch [15] has a significant CPU overhead, requiring an extra core per VM in the worst case, and one core per 15 VMs on average.

Further, as we show in the next section, prior solutions are not well suited to run on multi-path networks. This is problematic, since most datacenter networks today have multiple paths. Moreover, trying to retrofit support for multiple paths onto these solutions would greatly increase their already large overhead.
3. RESERVING BANDWIDTH ON MULTI-PATH NETWORKS

In this section, we aim to answer the following question: how can bandwidth guarantees be reserved efficiently in a multi-path datacenter network? We first show that single-path reservations are not efficient (§3.1). Next, we describe efficient multi-path reservations, and show that existing solutions are hard to adapt for such reservations (§3.2).

We consider only symmetric, tree-like topologies, such as the fat tree (folded Clos) and derivatives, e.g., [1, 7]. All current and in-progress datacenter topologies we are aware of belong to this category. We leave the study of other topologies, such as random [19] and small-world [21] graphs, or topologies containing servers [9, 10], to future work.

Figure 1: Hose model example (VMs A, B, ..., X are connected to a single virtual switch by dedicated links of capacity BwminA, BwminB, ..., BwminX).

We focus on the most popular model for offering bandwidth guarantees, the hose model [3, 4, 11, 14, 15, 17]. Fig. 1 shows an example of the hose model, which provides the abstraction of a single switch that connects all the VMs of one tenant with dedicated links. The capacity of each virtual link represents the minimum bandwidth guaranteed to the VM connected to that link.
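Stated formally (our formulation of the standard hose-model semantics, not an equation from the paper): a tenant's traffic matrix must be supported whenever every VM stays within its guaranteed rate on both the sending and the receiving side.

\[
T = (T_{xy}) \text{ is hose-compliant} \iff \forall x:\;
\sum_{y \ne x} T_{xy} \le \mathrm{Bwmin}_x
\quad\text{and}\quad
\sum_{y \ne x} T_{yx} \le \mathrm{Bwmin}_x .
\]

A guarantee scheme satisfies the hose model if it can carry every hose-compliant matrix; this is the property the reservations discussed in the rest of this section must preserve.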
Most operators today use multiple paths by dynamically load balancing flows along different paths, e.g., [2, 7]. However, each flow is still physically routed along a single path. When providing bandwidth guarantees on multi-path networks, cloud providers can start with a similar approach and always route each VM-to-VM flow across a single path. When this path is selected at the time of VM placement, we call this approach static VM-to-VM routing. When it can be updated during runtime, we call it dynamic VM-to-VM routing.

Unfortunately, reserving bandwidth based on single-path VM-to-VM routing (either static or dynamic) is fundamentally inefficient and cannot effectively use the network infrastructure, as it leads to duplicate reservations: when each VM-to-VM reservation uses a single path, one VM's hose-model guarantee might need to be duplicated across multiple paths, as we will show. Dynamically reserving VM-to-VM guarantees on individual paths, instead of statically pinning them to a path, could potentially reduce the degree of inefficiency, but would be extremely complex and difficult. We discuss these issues in more detail next and show that, for multi-path networks, bandwidth guarantees should be split across multiple paths.

3.1 Single-Path Reservations are Wasteful

To show that single-path reservations are not efficient, we define the notion of an efficient reservation. On single-path networks (such as single-rooted trees), reserving bandwidth guarantees for the hose model on a link L requires reserving the minimum of the summed guarantees of the nodes in the two halves of the network connected by L [3, 4, 15]. For example, in Fig. 1, assume VMs A and B are on one side of L, and all other VMs are on the other. In this case, the bandwidth reserved on L should be RL = min(BwminA + BwminB, BwminC + ... + BwminX). RL is the maximum traffic that can be sent over L under the hose model, and we define an efficient reservation as one that reserves exactly this quantity RL on any such link L. Any reservation higher than RL would be wasteful.

The definition of an efficient reservation can be generalized to multi-path networks by replacing the single link L with a cut in the network topology. If a set of links S forms a cut, an efficient reservation reserves a total of R on S, with R defined analogously to the single-path case.
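To illustrate the definition, here is a minimal sketch (ours, not code from the paper) of computing the efficient hose-model reservation for a link or cut; guarantees maps each VM to its Bwmin, side_a is the set of the tenant's VMs on one side of the cut, and the numbers in the example are purely illustrative.

def efficient_reservation(guarantees, side_a):
    """Efficient hose-model reservation across a link or cut.

    guarantees: dict mapping VM id -> Bwmin (e.g., in Mbps)
    side_a:     set of VM ids on one side of the link/cut
    Returns min(sum of guarantees on one side, sum on the other side),
    i.e., the maximum traffic the hose model allows across the cut.
    """
    sum_a = sum(bw for vm, bw in guarantees.items() if vm in side_a)
    sum_b = sum(bw for vm, bw in guarantees.items() if vm not in side_a)
    return min(sum_a, sum_b)

# Example in the spirit of Fig. 1: VMs A and B on one side of link L, the rest on the other.
guarantees = {"A": 500, "B": 500, "C": 300, "D": 300, "X": 400}   # illustrative Bwmin values (Mbps)
print(efficient_reservation(guarantees, side_a={"A", "B"}))        # -> 1000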
We now present an example topology and guarantee request for which no efficient single-path reservation exists.

Figure 2: Example showing that reserving bandwidth guarantees using static single-path routing between VMs is inefficient. The red and green boxes represent servers that host VMs that request guarantees. The VMs on a fully shaded server request a total of 1 Gbps. Colored lines represent proposed link reservations.

Example of single-path inefficiency: Fig. 2 shows (half of) a quaternary fat tree [1]. All links are 1 Gbps, which makes this a fully-subscribed network (it is easy to derive similar examples for oversubscribed topologies). There are two tenants in this example, depicted as red and green. Each tenant requests a hose-model guarantee as in Fig. 1. The sum of the hose-model bandwidth guarantees requested by the VMs located on each server is proportional to the fraction of the server that is shaded; VMs on a fully shaded server request a total of 1 Gbps. (The network does indeed have enough capacity to sustain this reservation load: to see this, imagine aggregating the capacity of the parallel links of the multi-path tree into a single-path tree, and then map all hose models' virtual switches onto the core switch.) For example, the red VMs on H6 request 333 Mbps. Now let us try to find an efficient single-path static reservation.

To begin, consider only the servers H1, H5 and H8 belonging to the red tenant in Fig. 2. The efficient way to connect these VMs in the hose model is to fully reserve a tree embedded in the physical topology. Any other approach requires more reservations. There are four such trees to choose from; without loss of generality we choose the one marked with solid red lines.

Now, consider the green tenant with servers H3 and H7. Given the tree we just reserved for the red tenant, the efficient reservation must go through A4; we choose the path marked with the solid green line.

Finally, consider host H6. Satisfying the hose model for the red tenant means satisfying any hose-model-compliant communication pattern. For instance, VMs in H5 could send at 1 Gbps to H1 while VMs in H6 send 333 Mbps to H8. Thus, H6 must have a reservation to H8 that does not intersect H5's reservation to H1. Since E3-A3 is fully reserved, the reservation to H8 must go through A4, as shown by the dotted red line. However, A4-E4 is fully reserved by green. Therefore, this request pattern cannot be satisfied by static single-path reservations. (This argument remains valid no matter what non-zero bandwidth guarantee H6 requests; we chose 333 Mbps as an arbitrary example.)

Fundamentally, single-path reservations are inefficient because reserving different VM-to-VM flows of a single VM across different paths can lead to duplicate bandwidth reservations. This can be unavoidable when not all reservations fit in the capacity of a single link. For example, in Fig. 2, any single-path reservation for the red tenant would be inefficient (irrespective of the green tenant).

In trying to use single-path routing while providing bandwidth guarantees, providers can go beyond static reservations and dynamically reserve bandwidth. Specifically, when a new VM-to-VM communication starts, a centralized load balancer for guarantees may dynamically assign a path to this communication. This approach suffers from the same fundamental limitation described before (e.g., assume each shaded component in Fig. 2 is a single VM); however, when the number of VMs is large, it could mitigate the effect. Note that the guarantee load balancer needs to be centralized, since it must also perform admission control and needs to know the reservation status on each link that is going to be used by the reservation. However, this approach is significantly more complex than Hedera [2], an already complex and high-overhead load-balancing system. Essentially, a dynamic bandwidth reservation system must implement Hedera's load balancing algorithm [2], plus Oktopus's admission control algorithm [3], plus ElasticSwitch's Guarantee Partitioning [15] (or its equivalent in Oktopus). This would lead to an extremely complex and heavyweight system that would not even be work-conserving! Making such a system work-conserving would be an even more challenging task, due to the scale and high dynamicity of the traffic that needs to be centrally coordinated.
3.2 Multi-Path Reservations

We have shown that single-path reservations are inefficient. Now we discuss how to create and use efficient multi-path reservations.

For symmetric, tree-like multi-path topologies, there is an easy method for efficient reservations: reserve bandwidth uniformly across all paths. For example, on a fat tree such as the one presented in Fig. 2, we can apply the Oktopus [3] algorithm on the conceptual single-path tree that results from merging links on the same level, and then split each merged-link reservation evenly over the parallel physical links. For the red tenant in Fig. 2, this yields the following reservations: 0.5 Gbps on links E1-A1 and E1-A2, 0.25 Gbps on all links between the core and the aggregation switches, 0.5 Gbps on E4-A3 and E4-A4, and 1.33/2 Gbps (about 0.67 Gbps) on E3-A3 and E3-A4.

Since efficient bandwidth reservations require multiple paths, a VM pair communicating with a single TCP/UDP flow must have its flow split across multiple paths. Unfortunately, none of the prior proposals for providing WCBG is easily adaptable to such reservations.

For example, we can try applying prior works designed for single-path reservations, e.g., ElasticSwitch [15], to multi-path reservations by scattering packets uniformly on all paths. However, this approach becomes infeasible when there are failures in the network. In this case, uniformly scattering packets no longer works, because the network is no longer symmetric, and some links will become more loaded/congested than others [6, 16]. Thus, a complex mechanism must be used to keep each link uncongested. This mechanism must essentially implement something like MPTCP [16], or like ElasticSwitch's Rate Allocation [15], on each path. For providers, implementing such an approach inside hypervisors is a very complex and resource-intensive task. An alternative could be to rely on new hardware support [6], but this is not currently available.
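As a rough sketch of the uniform multi-path reservation described at the start of this subsection (our illustration, not the paper's code; the cut encoding is an assumption, and a full implementation would walk every level of the merged tree as Oktopus does): compute the hose-model reservation for the merged conceptual link, then divide it evenly over the parallel physical links.

def multipath_reservation(guarantees, side_a, parallel_links):
    """Uniform multi-path reservation for one cut of a symmetric tree.

    guarantees:     dict VM/host id -> hose guarantee (Gbps)
    side_a:         set of ids below the cut formed by `parallel_links`
    parallel_links: the physical links aggregated into one merged conceptual
                    link (e.g., the uplinks of an edge switch)
    Returns each parallel link's equal share of the merged-link reservation.
    """
    below = sum(bw for vm, bw in guarantees.items() if vm in side_a)
    above = sum(bw for vm, bw in guarantees.items() if vm not in side_a)
    merged = min(below, above)            # Oktopus-style reservation on the merged link
    share = merged / len(parallel_links)  # split evenly across the parallel links
    return {link: share for link in parallel_links}

# Red tenant of Fig. 2 (per-host totals consistent with the numbers in the text).
red = {"H1": 1.0, "H5": 1.0, "H6": 0.333, "H8": 1.0}
# Edge switch E3 connects H5 and H6; its uplinks are E3-A3 and E3-A4.
print(multipath_reservation(red, side_a={"H5", "H6"},
                            parallel_links=["E3-A3", "E3-A4"]))
# -> roughly 0.67 Gbps on each of E3-A3 and E3-A4

Applying the same computation at the other levels of the merged tree reproduces the 0.5 Gbps and 0.25 Gbps figures quoted above for the red tenant.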
4. HARMONIA

Our solution, Harmonia, addresses both of the concerns we have raised: it significantly reduces the overhead for providers to offer WCBG compared to previous approaches, and it works on multi-path networks. At a high level, Harmonia separates the guaranteed traffic from the work-conserving traffic by assigning them to separate priority queues in the network switches. In addition, both of these traffic types utilize all paths available to them.

Figure 3: Overview of Harmonia's architecture (each VM has an interface with bandwidth guarantees and several work-conserving interfaces; a rate-limiter in the hypervisor polices the guaranteed interface; the network provides high-priority and low-priority queues).

Fig. 3 presents an overview of our solution's architecture, which can be summarized by the four high-level architectural features described next.

(1) Separate VM interfaces for guaranteed and work-conserving traffic: Harmonia exposes multiple virtual network interfaces to each VM. The main interface (solid blue line in Fig. 3) is associated with guaranteed network bandwidth. (More precisely, rather than having one purely guaranteed interface, we discriminate among pairs of interfaces: only the subflow from the sender's first interface to the receiver's first interface is mapped to a guaranteed, high-priority VLAN.) The other interfaces are for sending work-conserving traffic, i.e., the volume of traffic that exceeds the VM's bandwidth guarantee. This traffic is best-effort; any or all packets may be dropped. The traffic splitting between the guaranteed and work-conserving interfaces is handled by MPTCP; as far as the application knows, its communications are just standard, single-path TCP connections.

(2) Rate limiting in hypervisors to enforce guarantees: The high-priority paths are only intended for satisfying guarantees; tenants cannot be allowed to use them at will. Therefore, Harmonia must rate-limit the traffic on these paths to enforce the guarantees. For this purpose we can use any non-work-conserving solution, such as the centralized Oktopus [3], or the distributed Guarantee Partitioning of ElasticSwitch [15]. Both of these solutions are implemented entirely in the hypervisor, and work with commodity switches. Providing non-work-conserving guarantees in the hypervisor is much simpler than providing WCBG, because of the different timescale on which rate-limiters must be updated. When providing WCBG, rate-limiters must be updated whenever conditions change on any link inside the network, while for non-work-conserving guarantees, rate-limiters need only be updated when traffic demands change at a given VM.

(3) Priority queuing in network switches: Most commodity network switches are equipped with a small number of static priority queues. Harmonia uses two of these queues to handle guaranteed and work-conserving traffic separately. Inside each network element, the guaranteed traffic is forwarded using a higher-priority queue, while the work-conserving traffic is forwarded using a lower-priority queue. Thus, work-conserving traffic is forwarded only when a link is not fully utilized by guaranteed traffic. Since our system uses only two priority queues on each switch, we do not expect this to be a barrier to adoption.

(4) Multi-path support: We need to both (a) utilize our multi-path reservations, and (b) allow tenants to use the spare capacity on all paths. To achieve (a), we scatter the guaranteed traffic uniformly across all the available paths. Since this traffic is rate-limited and the offered guarantees do not oversubscribe the network capacity, the guaranteed traffic should never congest any link in the network. Packet scattering has been shown to perform well when congestion is low [6, 16]. Then, to achieve (b), i.e., to fully utilize any spare bandwidth, we expose all physical paths to each VM using "work-conserving interfaces" (one for each path). The packets on these interfaces are tagged as lower priority. (Again, note that users do not use these paths explicitly; MPTCP on top of multiple interfaces presents itself as a vanilla TCP implementation through the sockets API.)

For example, consider a multi-rooted fat-tree topology in a datacenter with 8 core switches; this results in a maximum of 8 independent shortest paths between two VMs. In our current architecture, each VM then has 3 interfaces exposed to it: a main interface G, and two additional interfaces W1 and W2 for the work-conserving traffic. When a VM sends packets from its G interface to the G interface of another VM, it appears as if they are connected to the conceptual dedicated switch of the hose model, with link bandwidths set to the bandwidth guarantee, as shown in Fig. 1. Under the hood, this traffic is rate-limited, scattered across all the available paths, and forwarded using the high-priority queues. The other 8 combinations of interfaces, i.e., between a G and a W interface or between two W interfaces, are each mapped to the low-priority queue on one of the 8 paths. Thus, we need to expose at least ⌊√N⌋ + 1 virtual interfaces to each VM, where N is the maximum number of distinct paths between two VMs.

To fully utilize all these paths, and thus gain work conservation in addition to their guarantees, tenants must use a multi-path protocol, such as MPTCP [16]. We also envision providers implementing simple forms of multi-path UDP protocols; otherwise, each UDP flow can only be sent on one of these paths, either as part of the guaranteed bandwidth, or on one of the work-conserving paths.

To route along multiple paths, we use different VLANs, similar to SPAIN [13]. Each VLAN is rooted at a core switch. Each core switch has two VLANs, at two different priority levels (i.e., one for guaranteed traffic and one for work-conserving traffic). Thus, if the network has 8 core switches, Harmonia requires 16 VLANs. The capabilities of modern switches easily satisfy this requirement: standard 802.1Q supports 4094 VLANs, allowing our design to scale to over 2,000 core switches. (A typical datacenter fat tree might have 48 core switches, which would correspond to 7 virtual network interfaces in our design; a theoretical 2000-ary fat tree would support 2 billion hosts.) A kernel module in the hypervisor dynamically tags each packet with a VLAN: traffic between two guaranteed interfaces is scattered uniformly across all high-priority VLANs, while each work-conserving path is consistently mapped to the same lower-priority VLAN.
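The following sketch (ours; the names, VLAN ids, and per-packet random scattering policy are illustrative assumptions, not the paper's code) shows how a hypervisor module along these lines could classify a packet by its interface pair, enforce the guarantee with a token bucket on the guaranteed subflow, and pick a VLAN: guaranteed traffic is scattered across the high-priority VLANs, while every other interface pair is pinned to one low-priority VLAN.

import math
import random
import time

class TokenBucket:
    """Simple token-bucket rate limiter for the guaranteed subflow."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, pkt_bytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True
        return False

NUM_PATHS = 8                                 # e.g., 8 core switches
NUM_IFACES = math.isqrt(NUM_PATHS) + 1        # floor(sqrt(N)) + 1 = 3 interfaces: G, W1, W2
HIGH_PRIO_VLANS = list(range(100, 100 + NUM_PATHS))   # one high-priority VLAN per core switch
LOW_PRIO_VLANS = list(range(200, 200 + NUM_PATHS))    # one low-priority VLAN per core switch

# Every interface pair other than (G, G) is pinned to one low-priority VLAN/path.
wc_pairs = [(s, d) for s in range(NUM_IFACES) for d in range(NUM_IFACES) if (s, d) != (0, 0)]
WC_VLAN = {pair: LOW_PRIO_VLANS[i] for i, pair in enumerate(wc_pairs)}

def classify(src_iface, dst_iface, pkt_bytes, bucket):
    """Return (vlan, high_priority) for a packet, or None if the guarantee limit is exceeded."""
    if (src_iface, dst_iface) == (0, 0):                  # guaranteed subflow (G -> G)
        if not bucket.allow(pkt_bytes):                   # enforce the bandwidth guarantee
            return None
        return random.choice(HIGH_PRIO_VLANS), True       # scatter over high-priority VLANs
    return WC_VLAN[(src_iface, dst_iface)], False          # fixed low-priority VLAN per pair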
Harmonia copes well with failures. The rate limit for the guaranteed traffic of each VM is divided uniformly across all paths. When a link L fails, only the VMs that are routed through L are affected; moreover, no link in the network can be congested by the guaranteed traffic, which means packet scattering still provides good results. Therefore, all the other VMs whose traffic is not forwarded through L are unaffected by L's failure.

Tenants who do not care about WCBG can join a Harmonia datacenter without being exposed to our MPTCP-and-virtual-interfaces setup: their single normal interface can be treated either as a guaranteed one, giving them non-work-conserving guarantees, or as a work-conserving one, giving them traditional best-effort service. If the best-effort tenants are getting an unacceptably low share of the non-guaranteed bandwidth, providers could switch to three priority queues: (0) guaranteed traffic, (1) traffic of best-effort tenants, and (2) work-conserving traffic of tenants with guarantees. (Note that queues 0 and 2 are the ones discussed up to now.)
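If a provider adopted the three-queue variant above, the priority could be carried in the 802.1Q tag's 3-bit PCP field alongside the VLAN id. The mapping below is only an illustrative sketch; the specific PCP values and the helper are our assumptions, not part of Harmonia's design.

# Illustrative traffic classes for the three-queue variant.
PCP = {
    "guaranteed": 5,       # highest of the three classes (values are arbitrary examples)
    "best_effort": 3,      # tenants without guarantees
    "work_conserving": 1,  # above-guarantee traffic of tenants with guarantees
}

def vlan_tci(vlan_id: int, traffic_class: str) -> int:
    """Build an 802.1Q Tag Control Information field: PCP (3 bits) | DEI (1 bit) | VID (12 bits)."""
    assert 0 < vlan_id < 4095
    return (PCP[traffic_class] << 13) | (0 << 12) | vlan_id

print(hex(vlan_tci(101, "guaranteed")))   # -> 0xa065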
5. EVALUATION

We have implemented a prototype of Harmonia and we have tested it on a small multi-path datacenter testbed with 36 servers. Our topology is similar to the one in Fig. 2; we use 3 core switches and 6 racks, each with 6 servers. To implement non-work-conserving bandwidth guarantees (and rate-limit the guaranteed interfaces), we reuse part of ElasticSwitch [15], specifically the Guarantee Partitioning algorithm.

Our goals for this evaluation are to show that our solution: (1) provides guarantees on multi-path networks, (2) is work-conserving on multi-path networks, and (3) has low overhead.

To demonstrate the first two goals, we constructed the following experiment. We selected 6 servers in one rack (under a single ToR switch), and ran 2 VMs on each server: one receiving a single TCP flow, and the other receiving multiple full-blast UDP flows on all possible paths. All senders were located on servers in other racks. Each VM was given a 225 Mbps bandwidth guarantee. This fat-tree topology had 3 core switches, making a total of 3 Gbps available to the 12 receiving VMs; we leave 10% of the capacity unreserved, hence the 225 Mbps guarantees.

Figure 4: Harmonia provides bandwidth guarantees in the presence of malicious users. A tenant has a 225 Mbps hose-model bandwidth guarantee across its VMs. The tenant is able to grab the entire bandwidth when there are no other users, and his guarantee is respected even when he is using TCP and many other users blast UDP traffic on all interfaces.

Fig. 4 depicts the throughput of the TCP flows as we increase the number of UDP senders. With no UDP competition, the TCP flows fully utilized the available capacity, competing evenly on top of the guarantees. When UDP flows were added, both the UDP and TCP receivers' guarantees were honored, giving them both an average total throughput of roughly 225 Mbps. We repeated the experiment with a varying number of UDP senders, to ensure that a large number of active flows would not detract from a single flow's guarantee.

Harmonia adds very little CPU overhead to the servers it runs on, even when managing many flows. We compare Harmonia's CPU overhead with ElasticSwitch's overhead by repeating the ElasticSwitch overhead experiment [15]. This experiment starts a number of long flows, and measures the average CPU utilization of the WCBG system on one machine. The results are shown in Fig. 5.

Figure 5: CPU overhead: Harmonia vs. ElasticSwitch (y-axis: CPU overhead as a percentage of one core; x-axis: number of VM-to-VM flows).

Fig. 5 shows that Harmonia markedly improves upon the overhead of ElasticSwitch. Whereas ElasticSwitch quickly becomes a large burden on its servers as more flows are started, Harmonia maintains a fairly small footprint: it stays under 10% of one core on an Intel Xeon X3370 CPU, while ElasticSwitch reaches 50%.

6. SUMMARY

We close with two key design points about Harmonia. (1) Harmonia splits the task of bandwidth management between providers and tenants: the provider handles traffic up to the bandwidth guarantee, while the tenant handles traffic above it. This simplifies providers' efforts to deploy guarantees. We note that our solution is transparent to tenants, who do not need to change their applications but only need to use VMs with MPTCP installed (a drop-in replacement for standard TCP implementations). (2) This is the first work to consider multi-path reservations for work-conserving bandwidth guarantees in cloud datacenters. We showed that single-path reservations are inefficient, and that existing proposals for bandwidth guarantees are not easily adaptable to multi-path reservations.

Using experiments in a real datacenter, we showed preliminary results indicating that Harmonia indeed achieves its goals.

7. REFERENCES

[1] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In ACM SIGCOMM, 2008.
[2] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic Flow Scheduling for Data Center Networks. In USENIX NSDI, 2010.
[3] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards Predictable Datacenter Networks. In ACM SIGCOMM, 2011.
[4] H. Ballani, K. Jang, T. Karagiannis, C. Kim, D. Gunawardena, et al. Chatty Tenants and the Cloud Network Sharing Problem. In USENIX NSDI, 2013.
[5] T. Benson, A. Akella, and D. A. Maltz. Network Traffic Characteristics of Data Centers in the Wild. In ACM IMC, 2010.
[6] A. A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella. On the Impact of Packet Spraying in Data Center Networks. In IEEE INFOCOM, 2013.
[7] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In ACM SIGCOMM, 2009.
[8] C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and Y. Zhang. SecondNet: A Data Center Network Virtualization Architecture with Bandwidth Guarantees. In ACM CoNEXT, 2010.
[9] C. Guo, G. Lu, et al. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In ACM SIGCOMM, 2009.
[10] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A Scalable and Fault-tolerant Network Structure for Data Centers. In ACM SIGCOMM, 2008.
[11] V. Jeyakumar, M. Alizadeh, D. Mazières, B. Prabhakar, C. Kim, and A. Greenberg. EyeQ: Practical Network Performance Isolation at the Edge. In USENIX NSDI, 2013.
[12] T. Lam, S. Radhakrishnan, A. Vahdat, and G. Varghese. NetShare: Virtualizing Data Center Networks across Services. UCSD Technical Report, 2010.
[13] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul. SPAIN: COTS Data-center Ethernet for Multipathing over Arbitrary Topologies. In USENIX NSDI, 2010.
[14] L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. FairCloud: Sharing the Network in Cloud Computing. In ACM SIGCOMM, 2012.
[15] L. Popa, P. Yalagandula, S. Banerjee, J. C. Mogul, Y. Turner, and J. R. Santos. ElasticSwitch: Practical Work-Conserving Bandwidth Guarantees for Cloud Computing. In ACM SIGCOMM, 2013.
[16] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving Datacenter Performance and Robustness with Multipath TCP. In ACM SIGCOMM, 2011.
[17] H. Rodrigues, J. R. Santos, Y. Turner, P. Soares, and D. Guedes. Gatekeeper: Supporting Bandwidth Guarantees for Multi-tenant Datacenter Networks. In USENIX WIOV, 2011.
[18] A. Shieh, S. Kandula, A. Greenberg, C. Kim, and B. Saha. Sharing the Data Center Network. In USENIX NSDI, 2011.
[19] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish: Networking Data Centers Randomly. In USENIX NSDI, 2012.
[20] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Datacenter Traffic: Measurements & Analysis. In ACM IMC, 2009.
[21] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. S. E. Ng, M. Kozuch, and M. P. Ryan. c-Through: Part-time Optics in Data Centers. In ACM SIGCOMM, 2010.