Network Sharing

The story thus far: sharing (Omega + Mesos)
- How to share end-host resources: think CPU, memory, I/O.
- Different ways to share:
  - Fair sharing: the idealist view; everyone should get equal access.
  - Proportional sharing: ideal for the public cloud; you get access to an amount equal to how much you pay.
  - Priority/deadline-based sharing: ideal for the private data center; the company cares about completion times.

What about the network? Isn't this important?

Network Caring is Network Sharing
- The network is important to a job's completion time.
- Default network sharing is TCP:
  - Only a vague notion of fair sharing, and fairness is based on individual flows.
  - Work-conserving.
- Per-flow sharing is biased: VMs with many flows get a greater share of the network.

What is the best form of Network Sharing?
- Fair sharing:
  - Per-source fairness? Reducers can cheat: many flows to one destination.
  - Per-destination fairness? Mappers can cheat.
  - Fairness == bad: no one can predict anything, and we like things to be predictable; we want short and predictable latency.
  - [Figure, "Congestion Kills Predictability": mean, median, and 99th-percentile latency, with the tail around 200 ms.]
- Min-bandwidth guarantees: perfect! But:
  - The implementation can lead to inefficiency.
  - How do you predict bandwidth demands?

How can you share the network?
- End-host sharing schemes:
  - Use default TCP? Never!
  - Change the hypervisor, or change the end host's TCP stack: requires virtualization, and is invasive and undesirable.
- In-network sharing schemes:
  - Use queues and rate limiters; utilize ECN.
  - Limited to enforcing 7-8 different guarantees, and requires switches that support the ECN mechanism.
  - Other switch modifications: expensive and highly unlikely, except maybe OpenFlow.

ElasticSwitch: Practical Work-Conserving Bandwidth Guarantees for Cloud Computing
Lucian Popa, Praveen Yalagandula*, Sujata Banerjee, Jeffrey C. Mogul+, Yoshio Turner, Jose Renato Santos (HP Labs; * Avi Networks, + Google)

Goals
1. Provide Minimum Bandwidth Guarantees in Clouds
- Tenants can affect each other's traffic: MapReduce jobs can affect the performance of user-facing applications, and large MapReduce jobs can delay the completion of small jobs. Bandwidth guarantees offer predictable performance.
- Hose model: each VM of a tenant connects to a virtual (imaginary) switch VS through a link whose capacity is that VM's bandwidth guarantee (B_X for VM X, B_Y for Y, B_Z for Z).
- Other models, such as TAG [HotCloud'13], are based on the hose model. (A small sketch of the hose model follows.)
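To make the hose model concrete, here is a minimal Python sketch; the class and method names are illustrative, not from the paper. The key property is that a VM-to-VM pair can never be guaranteed more than the smaller of its two endpoint guarantees.

```python
from dataclasses import dataclass

@dataclass
class HoseModel:
    # VM name -> guaranteed bandwidth (Mbps) on its link to the virtual switch
    guarantees: dict

    def pair_cap(self, src, dst):
        # X-to-Y traffic traverses both hose links, so its guarantee can
        # never exceed the smaller endpoint guarantee.
        return min(self.guarantees[src], self.guarantees[dst])

tenant = HoseModel({"X": 100, "Y": 100, "Z": 100})
print(tenant.pair_cap("X", "Y"))  # 100 (Mbps)
```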
2. Work-Conserving Allocation
- Tenants can use spare bandwidth from unallocated or underutilized guarantees.
- This significantly increases performance: average traffic is low [IMC09, IMC10], and traffic is bursty.
- [Figure: X-to-Y bandwidth over time. When everything is reserved and used, X-to-Y gets its guarantee B_min; when there is free capacity, ElasticSwitch lets X-to-Y rise above B_min.]

3. Be Practical
- Topology independent: work with oversubscribed topologies.
- Inexpensive: per-VM/per-tenant queues are expensive; work with commodity switches.
- Scalable: a centralized controller can be a bottleneck; use a distributed solution.
- Hard to partition: VMs can cause bottlenecks anywhere in the network.

Prior Work
                                               Guarantees         Work-conserving   Practical
Seawall [NSDI'11], NetShare [TR],
  FairCloud (PS-L/N) [SIGCOMM'12]              X (fair sharing)   √                 √
SecondNet [CoNEXT'10]                          √                  X                 √
Oktopus [SIGCOMM'10]                           √                  X                 ~X (centralized)
Gatekeeper [WIOV'11], EyeQ [NSDI'13]           √                  √                 X (congestion-free core)
FairCloud (PS-P) [SIGCOMM'12]                  √                  √                 X (queue/VM)
Hadrian [NSDI'13]                              √                  √                 X (weighted RCP)
ElasticSwitch                                  √                  √                 √

Outline: Motivation and Goals; Overview; More Details (Guarantee Partitioning, Rate Allocation); Evaluation.

ElasticSwitch Overview: Operates At Runtime
- VM setup: the tenant selects bandwidth guarantees (models: hose, TAG, etc.; cf. Oktopus [SIGCOMM'10], Hadrian [NSDI'13], CloudMirror [HotCloud'13]). VMs are placed, and admission control ensures all guarantees can be met.
- Runtime: ElasticSwitch enforces bandwidth guarantees and provides work conservation.

ElasticSwitch Overview: Runs In Hypervisors
- Resides in the hypervisor of each host.
- Distributed: hypervisors communicate pairwise, following the data flows.

ElasticSwitch Overview: Two Layers
- Guarantee Partitioning: provides guarantees.
- Rate Allocation: provides work conservation.

ElasticSwitch Overview: Guarantee Partitioning
1. Guarantee Partitioning turns hose-model guarantees into VM-to-VM pipe guarantees (intra-tenant). VM-to-VM control is necessary; coarser granularity is not enough. The resulting guarantees (e.g., B_XY, B_XZ) provide bandwidths as if the tenant communicated on a physical hose network.

ElasticSwitch Overview: Rate Allocation
2. Rate Allocation uses rate limiters in the hypervisor and increases the rate between X and Y above B_XY (Rate_XY >= B_XY) when there is no congestion between X and Y, grabbing unreserved or unused capacity (work-conserving, inter-tenant).

ElasticSwitch Overview: Periodic Application
- Guarantee Partitioning is applied periodically and on new VM-to-VM pairs; it consumes demand estimates and feeds VM-to-VM guarantees to Rate Allocation.
- Rate Allocation is applied periodically, more often.

Guarantee Partitioning – Overview
- Each VM's hose guarantee is partitioned into VM-to-VM pair guarantees (B_XZ, B_XY, B_TY, B_QY, ...) using max-min allocation.
- Goals: A. Safety: don't violate the hose model. B. Efficiency: don't waste guarantee. C. No starvation: don't block traffic.
- Example: with B_X = ... = B_Q = 100 Mbps, X sending to Z and Y, and T and Q also sending to Y, max-min allocation yields X-to-Z = 66 Mbps and X-to-Y = T-to-Y = Q-to-Y = 33 Mbps.

Guarantee Partitioning – Operation
- Each hypervisor divides the guarantee of each hosted VM between that VM's VM-to-VM pairs, in each direction: B_X^XY is the part of X's guarantee that X's hypervisor allocates to the pair X-Y, and B_Y^XY is the part of Y's guarantee that Y's hypervisor allocates to the same pair.
- The source hypervisor then uses the minimum of the source and destination allocations:
  B_XY = min(B_X^XY, B_Y^XY)
- A sketch of this computation follows.
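A minimal sketch of Guarantee Partitioning, assuming per-pair demands are known up front (the paper estimates future demands from history); the function name and the demand values are illustrative, not the paper's code. It reproduces the slides' all-100-Mbps example.

```python
def max_min_share(total, demands):
    """Divide `total` max-min among pairs, capping each pair at its demand."""
    shares, remaining, budget = {}, dict(demands), total
    while remaining:
        fair = budget / len(remaining)
        satisfied = {p: d for p, d in remaining.items() if d <= fair}
        if not satisfied:
            # Everyone wants more than the fair share: split the rest evenly.
            shares.update({p: fair for p in remaining})
            break
        for p, d in satisfied.items():
            shares[p] = d          # low-demand pairs get exactly their demand
            budget -= d            # leftover budget goes to the others
            del remaining[p]
    return shares

# Slide example: every hose guarantee is 100 Mbps and demands are high.
# X's hypervisor splits B_X across pairs X-Z and X-Y; Y's hypervisor splits
# B_Y across its three incoming pairs X-Y, T-Y, Q-Y.
src = max_min_share(100, {("X", "Z"): 1e9, ("X", "Y"): 1e9})
dst = max_min_share(100, {("X", "Y"): 1e9, ("T", "Y"): 1e9, ("Q", "Y"): 1e9})
B_XY = min(src[("X", "Y")], dst[("X", "Y")])
print(round(B_XY))  # 33 Mbps: min(50 from X's side, 33 from Y's side)
```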
Guarantee Partitioning – Safety
- Safety: hose-model guarantees are never exceeded. Because B_XY = min(B_X^XY, B_Y^XY), a pair's guarantee cannot exceed the allocation made by either endpoint.

Guarantee Partitioning – Operation (example)
- With B_X = ... = B_Q = 100 Mbps: X splits its guarantee as B_X^XZ = B_X^XY = 50, while Y splits its guarantee among its three senders as B_Y^XY = B_Y^TY = B_Y^QY = 33. Hence B_XY = min(50, 33) = 33 Mbps.

Guarantee Partitioning – Efficiency
1. What happens when flows have low demands? The hypervisor divides guarantees max-min based on demands (future demands are estimated from history).
2. How do you avoid unallocated guarantees? The source considers the destination's allocation when the destination is the bottleneck: since B_XY is capped at 33 by Y, X raises B_X^XZ from 50 to 66 rather than wasting the difference. Guarantee Partitioning converges.

Rate Allocation
- Guarantee Partitioning hands B_XY to Rate Allocation, which drives a rate limiter R_XY in the source hypervisor using congestion data. [Figure: R_XY over time, staying at B_XY while the network is fully used and rising above B_XY when there is spare bandwidth.]
- R_XY = max(B_XY, R_TCP-like): the rate limit never falls below the guarantee, and a TCP-like algorithm probes above it when there is no congestion.
- [Figure: rate-limiter rate (Mbps) over a few seconds, plotted against the guarantee; the rate runs well above the guarantee until another tenant's traffic arrives, then falls back toward it.]
- Weighted variant: R_XY = max(B_XY, R_weighted-TCP-like), where the weight is the guarantee B_XY. Example: X-to-Y with B_XY = 100 Mbps and Z-to-T with B_ZT = 200 Mbps share a bottleneck link of L = 1 Gbps; sharing in proportion to the guarantees gives R_XY = 333 Mbps and R_ZT = 666 Mbps.

Rate Allocation – Congestion Detection
- Detect congestion through dropped packets: hypervisors add and monitor sequence numbers in packet headers.
- Use ECN, if available.

Rate Allocation – Adaptive Algorithm
- Use Seawall [NSDI'11] as the rate-allocation algorithm (TCP-Cubic-like).
- Essential improvement (needed when congestion is detected via dropped packets): many flows probing for spare bandwidth affect the guarantees of others.
- Hold-increase: after a congestion event, hold off probing for free bandwidth; the holding time is inversely proportional to the guarantee. [Figure: rate over time; after a congestion event the rate increase is delayed by a holding time that shrinks as the guarantee grows.] A sketch follows.
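A minimal sketch of the Rate Allocation loop with hold-increase. The Seawall / TCP-Cubic-like probe is simplified to additive increase / multiplicative decrease, and the constants and the exact hold-time formula are assumptions; the slides say only that the holding time is inversely proportional to the guarantee.

```python
import time

class RateAllocator:
    def __init__(self, guarantee_mbps, step_mbps=10.0, hold_const=1000.0):
        self.guarantee = guarantee_mbps   # B_XY from Guarantee Partitioning
        self.rate = guarantee_mbps        # R_XY starts at the guarantee
        self.step = step_mbps
        self.hold_const = hold_const      # assumed scaling for holding time
        self.hold_until = 0.0

    def on_no_congestion(self, now):
        # No congestion: probe for spare bandwidth, unless we are holding.
        if now >= self.hold_until:
            self.rate += self.step
        return max(self.rate, self.guarantee)   # R_XY = max(B_XY, R_TCP-like)

    def on_congestion(self, now):
        # Back off multiplicatively, but never below the guarantee, then
        # hold probing for a time inversely proportional to the guarantee:
        # flows with small guarantees probe less and so do not erode others'.
        self.rate = max(self.guarantee, self.rate * 0.5)
        self.hold_until = now + self.hold_const / self.guarantee
        return self.rate

ra = RateAllocator(guarantee_mbps=100)
print(ra.on_no_congestion(time.time()))  # 110.0: probing above the guarantee
print(ra.on_congestion(time.time()))     # 100.0: back to B_XY, probing held
```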
Evaluation – MapReduce Setup
- 44 servers, 4x-oversubscribed topology, 4 VMs per server.
- Each tenant runs one job; all VMs of all tenants have the same guarantee.
- Two scenarios:
  - Light: 10% of the VM slots are either a mapper or a reducer, randomly placed.
  - Heavy: 100% of the VM slots are either a mapper or a reducer; mappers are placed in one half of the datacenter.

Evaluation – MapReduce
- [Figure: CDFs of worst-case shuffle completion time normalized to a static reservation, ElasticSwitch vs. No Protection, for the Light and Heavy setups.]
- Light setup: work conservation pays off; jobs finish faster than with a static reservation, and the longest completion time is reduced relative to No Protection.
- Heavy setup: ElasticSwitch enforces guarantees even in the worst case, reducing the worst-case shuffle completion time by up to 160X compared to No Protection. Guarantees are useful in reducing worst-case shuffle completion.

ElasticSwitch Summary
- Properties:
  1. Bandwidth guarantees: hose model or derivatives.
  2. Work-conserving.
  3. Practical: oversubscribed topologies, commodity switches, decentralized.
- Design: two layers.
  - Guarantee Partitioning provides guarantees by transforming hose-model guarantees into VM-to-VM guarantees.
  - Rate Allocation enables work conservation by increasing rate limits above the guarantees when there is no congestion.

Future Work
- Reduce overhead: ElasticSwitch uses on average 1 core per 15 VMs, and in the worst case 1 core per VM.
- Multi-path solution: single-path reservations are inefficient, and no existing solution works on multi-path networks.
- VM placement: placing VMs in different locations impacts the guarantees that can be made.

Open Questions
- How do you integrate network sharing with end-host sharing?
- What are the implications of the different sharing mechanisms for each other?
- How does the network architecture affect network sharing?
- How do you do admission control?
- How do you detect demand?
- How does payment fit into this question? And if it does, when VMs from different people communicate, who dictates the price, and who gets charged?

ElasticSwitch – Detecting Demand
- Optimize for the bimodal distribution of flows: most flows are short, but a few flows carry most of the bytes; short flows care about latency, long flows care about throughput.
- Start with a small guarantee for a new VM-to-VM flow; if demand is not satisfied, increase the guarantee exponentially. (See the sketch below.)
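A minimal sketch of this demand-detection policy. The initial value, the growth factor, and the "demand satisfied" test are assumptions; the slides state only the start-small, increase-exponentially behavior.

```python
def update_pair_guarantee(current, measured_mbps, floor=1.0, cap=1000.0):
    if measured_mbps >= 0.9 * current:      # still filling the guarantee
        return min(cap, current * 2)        # demand unsatisfied: double it
    return max(floor, measured_mbps)        # shrink toward observed demand

g = 1.0
for rate in [1.0, 2.0, 4.0, 8.0, 8.0]:      # a flow ramping up, then flat
    g = update_pair_guarantee(g, rate)
    print(g)                                # 2, 4, 8, 16, then back to 8.0
```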
Perfect Network Architecture
- What happens in the perfect network architecture? Implication: no loss inside the network; losses occur only at the edge, i.e., on the uplinks between server and ToR, or on the hypervisor-to-VM virtual links.
- These loss-free core networks are real: VL2 at Azure, Clos at Google.
- [Figure: 99.9th-percentile utilization (%) at the Core vs. the Edge.] In the hottest storage cluster there were 1000x more drops at the edge than in the core; 16 of 17 clusters saw 0 drops in the core. (Timescales: measured over 2 weeks; the 99.9th percentile corresponds to several minutes.)

Transmit/Receive Modules (from the EyeQ [NSDI'13] design)
- Per-destination rate limiters at the sender, used only if the destination is congested; traffic bypasses the limiters otherwise.
- Congestion detectors at the receiver watch each VM's share of the server link (e.g., 1 and 2 Gb/s slices of an 8 Gb/s link).
- RCP-style rate feedback (R) is sent back every 10 KB, so no per-source state is needed; a feedback packet tells the sender what rate to use (e.g., "Rate: 1 Gb/s"). A sketch follows.
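A minimal sketch of these transmit/receive modules. The class names, the congestion test, and the advertised rate are assumptions; the slides specify only the per-destination limiters, the bypass rule, and RCP-style feedback every 10 KB with no per-source state.

```python
class ReceiveModule:
    def __init__(self, link_mbps, vm_share_mbps):
        self.link = link_mbps            # e.g., an 8 Gb/s server link
        self.share = vm_share_mbps       # this VM's slice, e.g., 2 Gb/s
        self.bytes_seen = 0

    def on_packet(self, size_bytes, measured_util_mbps):
        self.bytes_seen += size_bytes
        if self.bytes_seen < 10_000:     # rate feedback (R) every ~10 KB
            return None                  # no per-source state is kept
        self.bytes_seen = 0
        congested = measured_util_mbps > 0.9 * self.link   # assumed test
        return self.share if congested else None

class TransmitModule:
    def __init__(self):
        self.limiters = {}               # per-destination rate limits (Mbps)

    def on_feedback(self, dst, rate_mbps):
        if rate_mbps is None:
            self.limiters.pop(dst, None)     # dest. not congested: bypass
        else:
            self.limiters[dst] = rate_mbps   # limit only the congested path

rx, tx = ReceiveModule(link_mbps=8000, vm_share_mbps=2000), TransmitModule()
tx.on_feedback("VM-A", rx.on_packet(12_000, measured_util_mbps=7800))
print(tx.limiters)  # {'VM-A': 2000}: limiter installed while congested
```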