TCP & Data Center Networking
CSci5221: TCP and Data Center Networking

• TCP & Data Center Networking: Overview
• TCP Incast Problem & Possible Solutions
• DC-TCP
• MPTCP (multipath TCP)
• Please read the following papers: [InCast] [DC-TCP] [MPTCP]

TCP Congestion Control: Recap
• Designed to address the network congestion problem
  – reduce sending rates when the network is congested
• How to detect network congestion at end systems?
  – Assume packet losses (& re-ordering) ⇒ network congestion
• How to adjust sending rates dynamically?
  – AIMD (additive increase & multiplicative decrease):
    • no packet loss in one RTT: W ← W+1
    • packet loss in one RTT: W ← W/2
• How to determine the initial sending rates?
  – probe the network available bandwidth via “slow start”
    • W := 1; no loss in one RTT: W ← 2W
• Fairness: assume everyone will use the same algorithm

TCP Congestion Control: Devils in the Details
• How to detect packet losses?
  – e.g., as opposed to late-arriving packets?
  – estimate the (average) RTT, and set a time-out threshold
    • called the RTO (Retransmission Time-Out) timer
    • packets arriving very late are treated as if they were lost!
• RTT and RTO estimation: Jacobson’s algorithm
  – Compute estRTT and devRTT using exponential smoothing:
    • estRTT := (1-a)·estRTT + a·sampleRTT (a>0 small, e.g., a=0.125)
    • devRTT := (1-a)·devRTT + a·|sampleRTT-estRTT|
  – Set RTO conservatively:
    • RTO := max{minRTO, estRTT + 4×devRTT}, where minRTO = 200 ms
• Aside: many variants of TCP: Tahoe, Reno, Vegas, ...

But … Internet vs. data center network:
• Internet propagation delay: 10-100 ms
• data center propagation delay: 0.1 ms
• packet size 1 KB, link capacity 1 Gbps ⇒ packet transmission time is about 0.01 ms

What's Special about Data Center Transport
• Application requirements (particularly, low latency)
• Particular traffic patterns
  – customer facing vs. internal: often co-exist
  – internal: e.g., Google file system, Map-Reduce, …
• Commodity switches: shallow buffer
• And time is money!
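
The RTT/RTO estimation recap and the Internet vs. data center comparison can be tied together with a small sketch. This is an illustration under stated assumptions, not any particular TCP stack: the names `RtoEstimator` and `rto_after`, the first-sample initialization (devRTT = sample/2), and updating estRTT before devRTT are all choices made here; only a = 0.125 and minRTO = 200 ms come from the recap.

```python
class RtoEstimator:
    """Jacobson-style RTT/RTO estimator (illustrative sketch; times in seconds)."""

    def __init__(self, a: float = 0.125, min_rto: float = 0.200):
        self.a = a                # smoothing gain from the recap
        self.min_rto = min_rto    # minRTO floor (200 ms in the recap)
        self.est_rtt = None       # no sample seen yet
        self.dev_rtt = 0.0

    def update(self, sample_rtt: float) -> float:
        """Fold one RTT sample into the estimates and return the new RTO."""
        if self.est_rtt is None:
            # Assumed initialization for the very first sample.
            self.est_rtt = sample_rtt
            self.dev_rtt = sample_rtt / 2
        else:
            # Exponential smoothing; estRTT is updated first (an assumption).
            self.est_rtt = (1 - self.a) * self.est_rtt + self.a * sample_rtt
            self.dev_rtt = ((1 - self.a) * self.dev_rtt
                            + self.a * abs(sample_rtt - self.est_rtt))
        return self.rto()

    def rto(self) -> float:
        """RTO := max{minRTO, estRTT + 4 x devRTT}."""
        return max(self.min_rto, self.est_rtt + 4 * self.dev_rtt)


def rto_after(samples, **kwargs) -> float:
    """Feed a non-empty list of RTT samples and return the resulting RTO."""
    est = RtoEstimator(**kwargs)
    for s in samples:
        est.update(s)
    return est.rto()


if __name__ == "__main__":
    data_center = [100e-6] * 10          # ten samples of 0.1 ms
    print("RTO with minRTO = 200 ms:", rto_after(data_center))
    print("RTO with no minRTO:      ", rto_after(data_center, min_rto=0.0))
```

With data center RTT samples, the computed RTO is pinned at the 200 ms floor, roughly 2000 times the RTT itself; removing the floor lets the RTO follow the measured RTT.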
Partition/Aggregate Application Structure (how a search query is served)
[Figure: a query (Art is…) fans out from the TLA through MLAs to the worker nodes, with deadlines of 250 ms, 50 ms, and 10 ms at successive levels; each level returns ranked results (1. Art is a lie… 2. The chief… 3. …)]
• Time is money: strict deadlines (SLAs)
• Missed deadline ⇒ lower quality result
• Many requests per query ⇒ tail latency matters

Data Center Workloads
• Partition/Aggregate (Query): bursty, delay-sensitive
• Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive
• Large flows [1MB-100MB] (Data update): throughput-sensitive

Flow Size Distribution
[Figure: CDFs of flow size and of total bytes vs. flow size, 10^3 to 10^8 bytes]
• > 65% of flows are < 1 MB
• > 95% of bytes are from flows > 1 MB

A Simple Data Center Network Model
[Figure: N servers (1, 2, 3, …, N) connected through a switch to an aggregator]
• Logical data block (S), e.g., 1 MB
• Server Request Unit (SRU), e.g., 32 KB
• Packet size S_DATA
• Link capacity C; Ethernet: 1-10 Gbps
• Switch with small buffer B
• Round Trip Time (RTT): 10-100 µs

TCP Incast Problem
Vasudevan et al. (SIGCOMM’09)
• Synchronized fan-in congestion: caused by Partition/Aggregate.
[Figure: Workers 1-4 respond to one aggregator; RTOmin = 200 ms. Timeline: request sent; responses sent; responses 1-6 done, 7-8 dropped; link idle until the TCP timeout; 7-8 resent]

TCP Throughput Collapse
• Cluster setup: 1 Gbps Ethernet, unmodified TCP, S50 switch, 1 MB block size
[Figure: throughput collapse under TCP incast]
• Cause of throughput collapse: coarse-grained TCP timeouts

Incast in Bing
[Figure: MLA query completion time (ms)]

Problem Statement
• How to provide high goodput for data center applications?
• Obstacle: TCP retransmission timeouts
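
A back-of-the-envelope calculation shows why one coarse-grained timeout is enough to collapse goodput. The sketch below is only an illustration: the function name `goodput_mbps` is hypothetical, 1 MB is taken as 10^6 bytes, and the model assumes the link sits completely idle for one full RTO per timeout while the block otherwise transfers at line rate.

```python
def goodput_mbps(block_bytes: float, link_bps: float,
                 timeouts: int = 0, rto_s: float = 0.200) -> float:
    """Goodput (Mb/s) for one block when `timeouts` RTOs idle the link.

    Simplifying assumption: the block moves at full link rate except for
    the idle periods, each lasting exactly one RTO.
    """
    transfer_s = block_bytes * 8 / link_bps     # time on the wire
    total_s = transfer_s + timeouts * rto_s     # plus idle time spent in RTOs
    return block_bytes * 8 / total_s / 1e6


if __name__ == "__main__":
    block, link = 1_000_000, 1e9                # 1 MB block, 1 Gbps Ethernet
    print("no timeout:          ", goodput_mbps(block, link))
    print("one 200 ms timeout:  ", goodput_mbps(block, link, timeouts=1))
    print("one 200 us timeout:  ", goodput_mbps(block, link, timeouts=1, rto_s=200e-6))
```

Under these assumptions a 1 MB block needs 8 ms on a 1 Gbps link, so a single 200 ms timeout stretches the transfer to 208 ms and goodput falls to under 4% of line rate.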
• TCP retransmission timeouts cause TCP throughput degradation under these conditions:
  – High-speed, low-latency network (RTT ≤ 0.1 ms)
  – Highly-multiplexed link (e.g., 1000 flows)
  – Highly-synchronized flows on bottleneck link
  – Limited switch buffer size (e.g., 32 KB)

One Quick Fix: µsecond TCP + no minRTO
µsecond Retransmission Timeouts (RTO)
• RTO = max(minRTO, f(RTT))
  – minRTO: 200 ms today; lower it to 200 µs? To 0?
  – f(RTT): RTT is tracked in milliseconds; track RTT in µseconds instead

Solution: µsecond TCP + no minRTO
[Figure: throughput (Mbps) vs. number of servers, proposed solution vs. unmodified TCP]
• High throughput for up to 47 servers
• Simulation scales to thousands of servers

TCP in the Data Center
• TCP does not meet demands of applications.
  – Requires large queues for high throughput:
    • Adds significant latency.
    • Wastes buffer space, esp. bad with shallow-buffered switches.
• Operators work around TCP problems.
  – Ad-hoc, inefficient, often expensive solutions
  – No solid understanding of consequences, tradeoffs

Queue Buildup
[Figure: Sender 1 and Sender 2 share a switch queue toward one Receiver]
• Large flows build up queues.
  – This increases latency for short flows.
• How was this supported by measurements?
• Measurements in Bing cluster
  – For 90% of packets: RTT < 1 ms
  – For 10% of packets: 1 ms < RTT < 15 ms

Data Center Transport Requirements
1. High Burst Tolerance
   – Incast due to Partition/Aggregate is common.
2. Low Latency
   – Short flows, queries
3. High Throughput
   – Continuous data updates, large file transfers
The challenge is to achieve these three together.

DCTCP: Main Idea
React in proportion to the extent of congestion.
• Reduce window size based on fraction of marked packets.
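
The proportional reaction can be sketched as follows, using the DCTCP sender rule from this section (a running average α of the marked fraction F, and a cut of W by α/2). The function names are hypothetical, the gain g = 1/16 is an assumed value, and for simplicity the example treats α as already converged to F for a steady marking pattern.

```python
def marked_fraction(marks: str) -> float:
    """Fraction F of ECN-marked ACKs in one RTT, e.g. '1011110111' -> 0.8."""
    return marks.count("1") / len(marks)


def update_alpha(alpha: float, fraction: float, g: float = 1 / 16) -> float:
    """Running average: alpha <- (1 - g) * alpha + g * F (g is an assumed gain)."""
    return (1 - g) * alpha + g * fraction


def dctcp_window(cwnd: float, alpha: float) -> float:
    """DCTCP decrease: W <- (1 - alpha / 2) * W."""
    return cwnd * (1 - alpha / 2)


def tcp_window(cwnd: float, marks: str) -> float:
    """ECN-enabled TCP halves the window if any packet in the RTT was marked."""
    return cwnd / 2 if "1" in marks else cwnd


if __name__ == "__main__":
    for marks in ("1011110111", "0000000001"):
        alpha = 0.0
        for _ in range(200):                    # let alpha settle for a steady pattern
            alpha = update_alpha(alpha, marked_fraction(marks))
        print(marks, "TCP:", tcp_window(100.0, marks),
              "DCTCP:", round(dctcp_window(100.0, alpha), 2))
```

For a heavily marked RTT the two schemes react similarly, while a single mark makes TCP halve its window and DCTCP trim it only slightly.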
ECN Marks     TCP                  DCTCP
1011110111    Cut window by 50%    Cut window by 40%
0000000001    Cut window by 50%    Cut window by 5%

DCTCP: Algorithm
Switch side:
– Mark packets when Queue Length > K.
[Figure: switch buffer B with threshold K; mark above K, don’t mark below]
Sender side:
– Maintain a running average of the fraction of packets marked (α).
  Each RTT: $F = \frac{\#\text{ of marked ACKs}}{\text{total } \#\text{ of ACKs}}$, $\alpha \leftarrow (1-g)\,\alpha + g\,F$
– Adaptive window decrease: $W \leftarrow \left(1 - \frac{\alpha}{2}\right) W$
Note: since 0 ≤ α ≤ 1, the window is divided by a factor between 1 and 2.

DCTCP vs TCP
[Figure: DCTCP vs TCP, axis in KBytes]
• Setup: Win 7, Broadcom 1 Gbps switch
• Scenario: 2 long-lived flows, ECN marking threshold = 30 KB

Multi-path TCP (MPTCP)
In a data center with rich path diversity (e.g., Fat-Tree or BCube), can we use multipath to get higher throughput?
[Figure: BCube data center]
• Initially, there is one flow.
• A new flow starts. Its direct route collides with the first flow.
• But it also has longer routes available, which don’t collide.

The MPTCP protocol
MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.
• The sender stripes packets across paths.
• The receiver puts the packets in the correct order.
[Figure: protocol stacks below the socket API (user space on top): TCP over IP with one address (addr), MPTCP over IP with two (addr1, addr2)]

Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks
[Figure: a multipath TCP flow with two subflows shares a bottleneck with a regular TCP flow]
• To be fair, Multipath TCP should take as much capacity as TCP at a bottleneck link, no matter how many paths it is using.
• Strawman solution: run “½ TCP” on each path

Design goal 2: MPTCP should use efficient paths
[Figure: three 12 Mb/s links] Each flow has a choice of a 1-hop and a 2-hop path. How should we split its traffic?
• If each flow splits its traffic 1:1, each flow gets 8 Mb/s.
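
The split-ratio question reduces to a one-line capacity calculation. The sketch below assumes the topology the figure appears to show: three flows on a triangle of three 12 Mb/s links, where each link carries one flow's 1-hop traffic plus the 2-hop traffic of the other two flows, and a k:1 split means k parts on the 1-hop path for every part on the 2-hop path. The function name is hypothetical.

```python
from fractions import Fraction


def per_flow_throughput(k: int, link_mbps: int = 12) -> Fraction:
    """Per-flow rate x (Mb/s) when every flow splits k:1 (1-hop : 2-hop).

    Load on each link: x*k/(k+1) from its own 1-hop flow plus
    2 * x/(k+1) from the two flows that cross it on their 2-hop paths.
    Setting the load equal to the capacity C gives x = C*(k+1)/(k+2).
    """
    return Fraction(link_mbps) * (k + 1) / (k + 2)


if __name__ == "__main__":
    for k in (1, 2, 4, 1000):
        print(f"split {k}:1 ->", float(per_flow_throughput(k)), "Mb/s per flow")
```

As k grows, the per-flow rate approaches the full 12 Mb/s, because the 2-hop paths consume two units of link capacity for every unit of traffic they deliver.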
Design goal 2: MPTCP should use efficient paths
• If each flow splits its traffic 2:1, each flow gets 9 Mb/s.
• If each flow splits its traffic 4:1, each flow gets 10 Mb/s.
• If each flow splits its traffic ∞:1, each flow gets 12 Mb/s.

Theoretical solution (Kelly+Voice 2005; Han, Towsley et al. 2006)
• Theorem: MPTCP should send all its traffic on its least-congested paths.
• This will lead to the most efficient allocation possible, given a network topology and a set of available paths.

Design goal 3: MPTCP should be fair compared to TCP
• wifi path: high loss, small RTT
• 3G path: low loss, high RTT
• Design goal 2 says to send all your traffic on the least congested path, in this case 3G. But this has high RTT, hence it will give low throughput.
• Goal 3a. A Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths.
• Goal 3b. A Multipath TCP flow should take no more capacity on any link than a single-path TCP would.

Design goals
• Goal 1. Be fair to TCP at bottleneck links (marked redundant)
• Goal 2. Use efficient paths ...
• Goal 3. ... as much as we can, while being fair to TCP
• Goal 4. Adapt quickly when congestion changes
• Goal 5. Don’t oscillate
How does MPTCP try to achieve all this?

How does MPTCP congestion control work?
Maintain a congestion window $w_r$, one window for each path, where r ∊ R ranges over the set of available paths.
– Increase $w_r$ for each ACK on path r, by [formula not recovered]
– Decrease $w_r$ for each drop on path r, by $w_r/2$
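
The per-ACK increase is not legible in these slides, so the sketch below substitutes the linked-increases rule standardized in RFC 6356; treat that choice as an assumption rather than the exact rule the slides showed. The names `Subflow`, `lia_alpha`, `on_ack`, and `on_drop` are hypothetical, windows are counted in packets, and the one-packet floor on a drop is an added safeguard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subflow:
    cwnd: float   # congestion window w_r, in packets
    rtt: float    # round-trip time of path r, in seconds


def lia_alpha(subflows: List[Subflow]) -> float:
    """Aggressiveness factor alpha from RFC 6356 (assumed, not from the slides)."""
    total = sum(s.cwnd for s in subflows)
    best = max(s.cwnd / s.rtt ** 2 for s in subflows)
    denom = sum(s.cwnd / s.rtt for s in subflows) ** 2
    return total * best / denom


def on_ack(subflows: List[Subflow], r: int) -> None:
    """Per-ACK increase on path r: min(alpha / w_total, 1 / w_r)."""
    total = sum(s.cwnd for s in subflows)
    coupled = lia_alpha(subflows) / total       # coupled across all paths
    uncoupled = 1.0 / subflows[r].cwnd          # what regular TCP would add
    subflows[r].cwnd += min(coupled, uncoupled)


def on_drop(subflows: List[Subflow], r: int) -> None:
    """Per-drop decrease on path r: w_r <- w_r - w_r / 2 (floored at 1 packet)."""
    subflows[r].cwnd = max(subflows[r].cwnd / 2.0, 1.0)


if __name__ == "__main__":
    single = [Subflow(cwnd=10.0, rtt=0.1)]
    on_ack(single, 0)
    print("single path after one ACK:", single[0].cwnd)

    pair = [Subflow(10.0, 0.1), Subflow(10.0, 0.1)]
    on_ack(pair, 0)
    print("two equal paths, path 0 after one ACK:", pair[0].cwnd)
```

With a single subflow the rule reduces to the regular TCP increase of 1/w per ACK, and with two identical subflows each grows more slowly, so the pair together is no more aggressive than one TCP flow at a shared bottleneck.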
Design goals 3 and 2 in the window rules:
– Increase $w_r$ for each ACK on path r, by [formula not recovered]
  • Design goal 3: At any potential bottleneck S that path r might be in, look at the best that a single-path TCP could get, and compare to what I’m getting.
  • Design goal 2: We want to shift traffic away from congestion. To achieve this, we increase windows in proportion to their size.
– Decrease $w_r$ for each drop on path r, by $w_r/2$

MPTCP chooses efficient paths in a BCube data center, hence it gets high throughput.
• Initially, there is one flow.
• A new flow starts. Its direct route collides with the first flow.
• But it also has longer routes available, which don’t collide.
• MPTCP shifts its traffic away from the congested link.

[Figure: average throughput [Mb/s], 0 to 300, for ½ TCP and MPTCP under three traffic matrices: perm., sparse, local]
• Packet-level simulations of BCube (125 hosts, 25 switches, 100 Mb/s links), measuring average throughput for three traffic matrices.
• For two of the traffic matrices, MPTCP and ½ TCP (strawman) performed equally well.
• For one of the traffic matrices, MPTCP got 19% higher throughput.