Data Center TCP (DCTCP)
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
Microsoft Research, Stanford University
Presented by liyong

Outline
• Introduction
• Communications in Data Centers
• The DCTCP Algorithm
• Analysis and Experiments
• Conclusion

Data Center Packet Transport
• Cloud computing service providers – Amazon, Microsoft, Google – need to build highly available, high-performance computing and storage infrastructure from low-cost, commodity components.
• We focus on soft real-time applications
  – supporting web search, retail, advertising, and recommendation systems,
  – which require three things from the data center network:
    low latency for short flows,
    high burst tolerance,
    high utilization for long flows.
• Two major contributions:
  – First, we measure and analyze production traffic, extracting application patterns and needs and identifying the impairments that hurt performance.
  – Second, we propose Data Center TCP (DCTCP) and evaluate it at 1 and 10 Gbps on ECN-capable commodity switches.

Production Traffic
• >150 TB of compressed data, collected over the course of a month from ~6,000 servers.
• The measurements reveal that 99.91% of the traffic in our data center is TCP traffic.
• The traffic consists of three kinds:
  – query traffic (2 KB to 20 KB in size),
  – delay-sensitive short messages (100 KB to 1 MB),
  – throughput-sensitive long flows (1 MB to 100 MB).
• Our key learning from these measurements is that, to meet the requirements of such a diverse mix of short and long flows, switch buffer occupancies need to be persistently low, while maintaining high throughput for the long flows.
• DCTCP is designed to do exactly this.

TCP in the Data Center
• We'll see that TCP does not meet the demands of these applications.
  – Incast: TCP suffers from bursty packet drops and is not fast enough to utilize spare bandwidth.
  – TCP builds up large queues: this adds significant latency and wastes precious buffers, which is especially bad with shallow-buffered switches.
• Operators work around TCP problems with ad-hoc, inefficient, often expensive solutions.
• Our solution: Data Center TCP.

Partition/Aggregate Application Structure
[Figure: a request fans out from a top-level aggregator (TLA) to mid-level aggregators (MLAs) and on to worker nodes; partial results are aggregated back up the tree. Example deadlines: 250 ms at the TLA, 50 ms at each MLA, 10 ms at each worker.]
• Time is money: each level operates under strict deadlines (SLAs).
• A missed deadline means a lower-quality result, because late worker answers are left out of the response.
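To make the partition/aggregate pattern concrete, here is a minimal sketch of an aggregator that fans a query out to its workers and returns whatever answers arrive within the deadline budget. It is illustrative only: the worker list, the 10 ms budget, and helpers such as query_worker are assumptions, not code from the paper.

```python
# Illustrative partition/aggregate sketch: fan out, then aggregate whatever
# comes back before the deadline. Late answers are dropped, which is the
# "lower quality result" described above. All names and values are hypothetical.
import concurrent.futures
import time

WORKERS = ["worker-1", "worker-2", "worker-3", "worker-4"]
DEADLINE_S = 0.010  # e.g. a 10 ms budget at the lowest level of the tree

def query_worker(worker: str, query: str) -> str:
    """Stand-in for an RPC to one worker that returns a partial result."""
    time.sleep(0.002)  # pretend network + compute time
    return f"{worker}: partial result for '{query}'"

def aggregate(query: str) -> list[str]:
    """Fan the query out and keep only the answers that beat the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(WORKERS))
    futures = [pool.submit(query_worker, w, query) for w in WORKERS]
    results = []
    try:
        for fut in concurrent.futures.as_completed(futures, timeout=DEADLINE_S):
            results.append(fut.result())
    except concurrent.futures.TimeoutError:
        pass  # deadline hit before every worker answered: degraded result
    pool.shutdown(wait=False)  # don't block on stragglers
    return results

if __name__ == "__main__":
    print(aggregate("data center tcp"))
```

Note that many workers answering the same aggregator at nearly the same instant is exactly what produces the incast bursts discussed later.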
Workloads
• Partition/Aggregate (query traffic): delay-sensitive.
• Short messages, 50 KB to 1 MB (coordination, control state): delay-sensitive.
• Large flows, 1 MB to 50 MB (data updates): throughput-sensitive.

Impairments
• Three impairments, all rooted in how the switches manage their buffers:
  – incast,
  – queue buildup,
  – buffer pressure.

Switches
• Like most commodity switches, the switches in our clusters are shared-memory switches that aim to exploit statistical multiplexing gain through a logically common packet buffer available to all switch ports.
• Packets arriving on an interface are stored in a high-speed multi-ported memory shared by all the interfaces.
• Memory from the shared pool is dynamically allocated to packets by an MMU, which attempts to give each interface as much memory as it needs while preventing unfairness by dynamically adjusting the maximum amount of memory any one interface can take.
• Building large multi-ported memories is very expensive, so most cheap switches are shallow-buffered, with packet buffer the scarcest resource. The shallow packet buffers cause three specific performance impairments, which we discuss next.

Incast
[Figure: several workers send their responses to the same aggregator at once.]
• Synchronized mice collide: the partition/aggregate pattern makes many workers answer the same aggregator at the same time.
• The burst overflows the shallow switch buffer; the resulting drops force TCP timeouts (RTOmin = 300 ms), far longer than the application deadlines.

Queue Buildup
• Big flows build up queues, which increases latency for the short flows sharing the same port.
• Measurements in a Bing cluster:
  – for 90% of packets, RTT < 1 ms;
  – for 10% of packets, 1 ms < RTT < 15 ms.

Buffer Pressure
• The loss rate of short flows in this traffic pattern depends on the number of long flows traversing other ports, because the long flows' queues eat into the buffer pool shared by all ports.
• The result is packet loss and timeouts, as in incast, but without requiring synchronized flows.

Data Center Transport Requirements
1. High burst tolerance – incast due to partition/aggregate is common.
2. Low latency – for short flows and queries.
3. High throughput – for large file transfers.
The challenge is to achieve all three together.

Balance Between Requirements
• High burst tolerance and high throughput pull toward big buffers; low latency pulls toward empty ones:
  – deep buffers: queuing delays increase latency;
  – shallow buffers: bad for bursts and throughput;
  – reduced RTOmin (SIGCOMM '09): doesn't help latency;
  – AQM with RED: marking on the average queue is not fast enough for incast.
• Objective: low queue occupancy and high throughput → DCTCP.

Review: TCP Congestion Control
• Four stages: slow start, congestion avoidance, fast retransmit, fast recovery.
• A router must maintain one or more queues per port, so it is important to control the queue.
  – Two classes of queue-control algorithms:
    queue management algorithms, which manage the queue length by dropping packets when necessary;
    queue scheduling algorithms, which determine the next packet to send.

Two Kinds of Queue Management Algorithms
• Passive queue management: drop packets only after the queue is full.
  – Traditional methods: drop-tail, random drop, drop-front.
  – Drawbacks:
    lock-out: a few flows occupy the queue exclusively, preventing packets from other flows from entering;
    full queues: congestion signals are sent only when the queue is full, so the queue stays full for long periods.
• Active queue management (AQM): act before the queue is full.

AQM (acting before the queue is full)
• RED (Random Early Detection) [RFC 2309]
  – Calculate the average queue length (avgQ) to estimate the degree of congestion.
  – Calculate a drop probability P according to the degree of congestion, using two thresholds minth and maxth:
    avgQ < minth: don't drop packets;
    minth < avgQ < maxth: drop packets with probability P;
    avgQ > maxth: drop all packets.
  – Drawback: sometimes drops packets even when the queue isn't full.
• ECN (Explicit Congestion Notification) [RFC 3168]
  – A method that signals congestion with explicit feedback (marking packets) instead of dropping them.
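To make the RED part of the comparison concrete, here is a minimal sketch, assuming the usual exponentially weighted average of the queue length and a linear drop (or mark) probability between the two thresholds. The parameter values are arbitrary illustrations, not values from the RFC or from the paper.

```python
# Minimal RED-style queue sketch: an EWMA of the queue length plus a linear
# drop/mark probability between min_th and max_th. Values are illustrative.
import random

class RedQueue:
    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.002, capacity=30):
        self.min_th = min_th      # below this average: never drop
        self.max_th = max_th      # above this average: always drop
        self.max_p = max_p        # drop probability reached at max_th
        self.weight = weight      # EWMA weight for the average queue length
        self.capacity = capacity  # physical buffer limit (packets)
        self.avg = 0.0
        self.queue = []

    def enqueue(self, pkt) -> bool:
        # Update the congestion estimate (average queue length).
        self.avg = (1 - self.weight) * self.avg + self.weight * len(self.queue)
        if self.avg < self.min_th:
            drop = False
        elif self.avg > self.max_th:
            drop = True
        else:
            # Linear ramp of the drop probability between the thresholds.
            p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
            drop = random.random() < p
        if drop or len(self.queue) >= self.capacity:
            return False          # dropped (an ECN switch would mark instead)
        self.queue.append(pkt)
        return True
```

Because the decision is driven by the slowly moving average rather than the instantaneous queue, a sudden incast burst can fill the buffer before RED reacts, which is the "not fast enough for incast" point above.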
ECN
• Routers or switches must support it (they must be ECN-capable).
  – Two bits form the ECN field in the IP header:
    ECT (ECN-Capable Transport): set by the sender to indicate whether its transport protocol supports ECN;
    CE (Congestion Experienced): set by routers or switches to indicate that congestion has occurred.
  – Two flags in the TCP header:
    ECN-Echo: the receiver notifies the sender that it has received a CE-marked packet;
    CWR (Congestion Window Reduced): the sender notifies the receiver that it has decreased its congestion window.
• In practice, ECN is integrated with RED: where RED would drop, an ECN-capable router marks instead.

ECN Working Principle
[Figure: a source host, a router, and a destination host. The source sends packets with ECT set in the IP header; the congested router sets CE; the destination echoes the mark back via the ECN-Echo flag in its ACKs; the source reduces its congestion window and sets CWR.]

Review: The TCP/ECN Control Loop
[Figure: senders, an ECN-capable switch that applies a 1-bit ECN mark to packets when its queue is congested, and a receiver that echoes the marks back to the senders.]

Two Key Ideas
1. React in proportion to the extent of congestion, not its presence.
   – This reduces the variance in sending rates, lowering queuing requirements.
     ECN marks     TCP                 DCTCP
     1011110111    cut window by 50%   cut window by 40%
     0000000001    cut window by 50%   cut window by 5%
2. Mark based on the instantaneous queue length.
   – Fast feedback, to better deal with bursts.

Data Center TCP Algorithm
Switch side:
• Mark arriving packets with CE when the instantaneous queue length exceeds a single threshold K; don't mark below K.
Sender side:
• Maintain an estimate of the fraction of packets marked (α). In each RTT:
  α ← (1 − g) × α + g × F    (1)
  – where F is the fraction of packets that were marked in the last window of data,
  – and 0 < g < 1 is the weight given to new samples against the past in the estimation of α.
• Adaptive window decrease:
  cwnd ← cwnd × (1 − α/2)    (2)
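The sender-side bookkeeping is simple enough to sketch. The model below follows update rules (1) and (2) at one step per window and ignores slow start, loss recovery, and per-ACK details; g = 1/16 and the traffic pattern are illustrative choices, not values taken from these slides.

```python
# Simplified model of DCTCP's sender reaction, following rules (1) and (2):
# an EWMA of the marked fraction and a window cut proportional to it.
# This is a sketch, not the Windows-stack implementation evaluated later.

class DctcpSender:
    def __init__(self, cwnd=10.0, g=1 / 16):
        self.cwnd = cwnd    # congestion window, in packets
        self.g = g          # EWMA gain (0 < g < 1)
        self.alpha = 0.0    # running estimate of the marked fraction

    def on_window_end(self, acked: int, marked: int) -> None:
        """Call once per window of data with ACKed and CE-marked packet counts."""
        F = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * F      # rule (1)
        if marked:
            # Congestion: cut in proportion to how much was marked, rule (2).
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # No marks: grow additively, like standard TCP.
            self.cwnd += 1.0

# Example: under persistent marking the cuts start gentle and deepen as the
# estimate alpha converges toward 1.
s = DctcpSender()
for _ in range(5):
    s.on_window_end(acked=10, marked=10)
    print(f"alpha={s.alpha:.3f} cwnd={s.cwnd:.2f}")
```

A standard TCP sender in the same situation would halve its window on the first marked window, which is the contrast drawn in the table above.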
DCTCP in Action
[Figure: instantaneous switch queue length (KBytes) over time.]
• Setup: Win 7 hosts, Broadcom 1 Gbps switch.
• Scenario: 2 long-lived flows, K = 30 KB.

Why it Works
1. High burst tolerance
   – Large buffer headroom → bursts fit.
   – Aggressive marking → sources react before packets are dropped.
2. Low latency
   – Small buffer occupancies → low queuing delay.
3. High throughput
   – ECN averaging → smooth rate adjustments, low variance in cwnd.

Analysis
[Figure: the sawtooth of a single flow's window size over time. The window grows to W* + 1; the packets sent in the last RTT (while the window is between W* and W* + 1) are marked; the window is then cut to (W* + 1)(1 − α/2).]
• We are interested in computing the following quantities:
  – the maximum queue size (Qmax),
  – the amplitude of the queue oscillations (A),
  – the period of the oscillations (TC).
• Consider N infinitely long-lived flows with identical round-trip times RTT, sharing a single bottleneck link of capacity C. We further assume that the N flows are synchronized.
• The queue size at time t is given by
  Q(t) = N·W(t) − C×RTT,    (3)
  where W(t) is the window size of a single source.
• Let S(W1, W2) denote the number of packets sent by a sender while its window size increases from W1 to W2 > W1. Since this takes W2 − W1 round-trip times, during which the average window size is (W1 + W2)/2,
  S(W1, W2) = (W2² − W1²)/2.    (4)
• Let W* = (C×RTT + K)/N. This is the critical window size at which the queue size reaches K and the switch starts marking packets with the CE codepoint. During the RTT it takes for the sender to react to these marks, its window size increases by one more packet, reaching W* + 1. Hence the fraction of marked packets is
  α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1).    (5)
• Plugging (4) into (5) and rearranging, we get
  α²(1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*.    (6)
• Assuming α is small, this can be simplified as
  α ≈ sqrt(2/W*).    (7)
  We can now compute A, TC and Qmax.
• The amplitude of oscillation in the window size of a single flow, D, is given by
  D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1)·α/2.    (8)
• Since there are N flows in total, the queue-size oscillations have amplitude
  A = N·D ≈ (1/2)·sqrt(2N(C×RTT + K)),
  and, since each window grows by one packet per RTT, period
  TC = D ≈ (1/2)·sqrt(2(C×RTT + K)/N) round-trip times.    (9)
• Finally, using (3), we have
  Qmax = N·(W* + 1) − C×RTT = K + N.    (10)

How Do We Set the DCTCP Parameters?
• Marking threshold K. The minimum value of the queue occupancy sawtooth is given by
  Qmin = Qmax − A = K + N − (1/2)·sqrt(2N(C×RTT + K)).
  Choose K so that this minimum is larger than zero, i.e. the queue does not underflow. This results in
  K > (C×RTT)/7.
• Estimation gain g. The estimation gain g must be chosen small enough to ensure that the exponential moving average (1) "spans" at least one congestion event. Since a congestion event occurs every TC round-trip times, we choose g such that
  (1 − g)^TC > 1/2.
  Plugging in (9) with the worst-case value N = 1 results in the following criterion:
  g < 1.386 / sqrt(2(C×RTT + K)).

Evaluation
• Implemented in the Windows stack.
• Real hardware, 1 Gbps and 10 Gbps experiments:
  – 90-server testbed;
  – Broadcom Triumph: 48 1G ports, 4 MB shared memory;
  – Cisco Catalyst 4948: 48 1G ports, 16 MB shared memory;
  – Broadcom Scorpion: 24 10G ports, 4 MB shared memory.
• Numerous benchmarks: incast, queue buildup, buffer pressure.

Experiment Setup
• 45 1G servers connected to a Triumph switch; a 10G server provides the external connection.
  – 1 Gbps links: K = 20.
  – 10 Gbps link: K = 65.
• Generate query and background traffic:
  – 10 minutes, 200,000 background flows, 188,000 queries.
• Metric: flow completion time for queries and background flows.
• We use RTOmin = 10 ms for both TCP and DCTCP.

Baseline
[Figures: completion times for background flows (95th percentile) and for query flows.]
• Low latency for short flows.
• High burst tolerance for query flows.

Scaled Background & Query
[Figure: completion times with 10x background and 10x query traffic (95th percentile).]

These results make three key points:
• First, if our data center used DCTCP it could handle 10X larger query responses and 10X larger background flows while performing better than it does with TCP today.
• Second, while using deep-buffered switches (without ECN) improves the performance of query traffic, it makes the performance of short transfers worse, due to queue buildup.
• Third, while RED improves the performance of short transfers, it does not improve the performance of query traffic, due to queue length variability.

Conclusions
• DCTCP satisfies all our requirements for data center packet transport:
  – handles bursts well,
  – keeps queuing delays low,
  – achieves high throughput.
• Features:
  – a very simple change to TCP plus a single switch parameter K;
  – based on ECN mechanisms already available in commodity switches.
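As a quick numerical check of the parameter guidelines from the analysis, the sketch below evaluates K > C×RTT/7 and g < 1.386/sqrt(2(C×RTT + K)) for the two link speeds and K values used in the testbed. The 100 µs RTT and 1,500-byte packet size are assumptions chosen for illustration, not values stated on these slides.

```python
# Worked check of the DCTCP parameter guidelines against the testbed settings
# (K = 20 packets at 1 Gbps, K = 65 packets at 10 Gbps). The RTT (100 us) and
# packet size (1500 bytes) are illustrative assumptions.
import math

PKT_BYTES = 1500
RTT_S = 100e-6

def guidelines(link_gbps: float, k_packets: int):
    bdp = link_gbps * 1e9 / 8 * RTT_S / PKT_BYTES       # C x RTT, in packets
    k_min = bdp / 7                                     # K > C*RTT / 7
    g_max = 1.386 / math.sqrt(2 * (bdp + k_packets))    # g < 1.386/sqrt(2(C*RTT+K))
    return bdp, k_min, g_max

for gbps, k in [(1, 20), (10, 65)]:
    bdp, k_min, g_max = guidelines(gbps, k)
    print(f"{gbps:>2} Gbps: C*RTT = {bdp:5.1f} pkts, "
          f"need K > {k_min:4.1f} (chosen K = {k}), need g < {g_max:.3f}")
```

Under these assumptions both chosen thresholds clear the K bound with a comfortable margin, and any reasonably small value of g satisfies the gain criterion.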