TCP Congestion Control at the Network Edge
Jennifer Rexford
Fall 2014 (TTh 3:00-4:20 in CS 105)
COS 561: Advanced Computer Networks
http://www.cs.princeton.edu/courses/archive/fall14/cos561/

Original TCP Design
[Figure: Internet]

Wireless and Data-Center Networks
[Figure: data-center network, Internet, wireless network]

Data-Center Networks

Modular Network Topology
• Containers
• Racks
  – Multiple servers
  – Top-of-rack switches

Tree-Like Topologies
[Figure: tree topology with nodes labeled CR, AR, and S]
• Many equal-cost paths
• Small round-trip times (e.g., < 250 microseconds)

Commodity Switches
• Low-cost switches
  – Especially for top-of-rack switches
• Simple memory architecture
  – Small packet-buffer space
  – Shared buffer over all input ports
  – Simple drop-tail queues

Multi-Tier Applications
[Figure: front-end server, aggregators, and workers arranged in tiers]

Application Mix
• Partition-aggregate workflow
  – Multiple workers working in parallel
  – A straggler slows down the entire system
  – Many workers send responses at the same time
• Diverse mix of traffic
  – Low latency for short flows
  – High throughput for long flows
• Multi-tenancy
  – Many tenants sharing the same platform
  – Running the network at high levels of utilization
• Small number of large flows on links

TCP Incast Problem
• Multiple workers transmitting to one aggregator
  – Many flows traversing the same link
  – Burst of packets sent at (nearly) the same time
  – … into a relatively small switch memory
• Leading to high packet loss
  – Some results are slow to arrive
  – May be excluded from the final results
• Developer software changes
  – Limit the size of worker responses
  – Randomize the sending time for responses

Queue Buildup
• Mix of long and short flows
  – Long flows fill up the buffers in switches
  – … causing queuing delay (and loss) for the short flows
  – E.g., queuing delay of 1-14 milliseconds
• Large relative to propagation delay
  – E.g., 100 microseconds intra-rack
  – E.g., 250 microseconds inter-rack
  – Leading to RTT variance and a large drop in throughput
• Shared switch buffers
  – Short flows on one port
  – … affected by long flows on other ports

TCP Outcast Problem
• Mix of flows at two different input ports
  – Many inter-rack flows
  – Few intra-rack flows
  – Destined for the same output
• Burst of packet arrivals
  – Arriving on one input port
  – Causing bursty loss for flows on the other input port
[Figure: topology fragment with nodes labeled AR and S]
• Harmful to the intra-rack flows
  – Lose multiple packets
  – Loss detected by timeout
  – Irony: worse throughput despite lower RTT!

Delayed Acknowledgments
• Sending ACKs can be expensive
  – E.g., send a 40-byte ACK packet for each data packet
• Delayed ACKs reduce the overhead
  – Receiver waits before sending the ACK
  – … in the hope of piggybacking the ACK on a response
• Delayed-ACK mechanism
  – Set a timer when the data arrives (e.g., 200 msec)
  – Piggyback the ACK or send an ACK for every other packet
  – … or send an ACK after the timer expires
• Timeout for delayed ACK is an eternity!!
  – Disable delayed ACKs, or shorten the timer

Data-Center TCP (DCTCP)
• Key observation
  – TCP reacts to the presence of congestion
  – … not to the extent of congestion
• Measuring extent of congestion
  – Mark packets when buffer exceeds a threshold
• Reacting to congestion
  – Reduce cwnd in proportion to fraction of marked packets
• Benefits
  – React early, as queue starts to build
  – Prevent harm to packets on other ports
  – Get workers to reduce sending rate early

Poor Multi-Path Load Balancing
• Multiple shortest paths between pairs of hosts
  – Spread the load over multiple paths
• Equal-cost multipath
  – Round robin
  – Hash-based
• Uneven load
  – Elephant flows congest some paths
  – … while other paths are lightly loaded
• Reducing congestion
  – Careful routing of elephant flows

Wireless Networks

TCP Design Setting
• Relatively low packet loss
  – E.g., hopefully less than 1%
  – Okay to retransmit lost packets from the sender
• Loss is caused primarily by congestion
  – Use loss as an implicit signal of congestion
  – … and reduce the sending rate
• Relatively stable round-trip times
  – Use RTT estimate in retransmission timer
• End-points are always on
• Stable end-point IP addresses
  – Use IP addresses as end-point identifiers

Problems in Wireless Networks
• Limited bandwidth
• High latencies
• High bit-error rates
• Temporary disconnections
• Slow handoffs
• Mobile device disconnects to save energy, bearers, etc.

Link-Level Retransmission
• Retransmit over the wireless link
  – Hide packet losses from the end-to-end connection
  – … by retransmitting lost packets on the wireless link
  – Works for any transport protocol

Split Connection
• Two TCP connections
  – Between the fixed host and the base station
  – Between the base station and the mobile device
• Other optimizations
  – Compression, just-in-time delivery, etc.
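
Worked sketch for the DCTCP slide: the slide says only that the switch marks packets above a buffer threshold and that the sender cuts cwnd in proportion to the fraction of marked packets. The sketch below fills in one common way to do that, following the published DCTCP design: a running average `alpha` of the marked fraction, and a cut of `cwnd * alpha / 2`. The class and function names, the starting window, the one-packet additive increase, and the gain `g = 1/16` are assumptions chosen for illustration, not part of the slides.

```python
from dataclasses import dataclass


def switch_marks(queue_len: int, threshold: int) -> bool:
    """Switch side: mark a packet when the queue exceeds the threshold."""
    return queue_len > threshold


@dataclass
class DctcpSender:
    """Toy DCTCP-style sender state; the window is counted in packets."""
    cwnd: float = 10.0   # congestion window (illustrative starting value)
    alpha: float = 0.0   # running estimate of the fraction of marked packets
    g: float = 1 / 16    # averaging gain (assumed value)

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Update once per window: `marked` of the `acked` packets carried a mark."""
        if acked <= 0:
            raise ValueError("acked must be positive")
        frac = marked / acked
        # Smooth the marked fraction so one bursty window does not dominate.
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked > 0:
            # Cut in proportion to the extent of congestion, not just its presence.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # No marks: grow by one packet per window (simplified additive increase).
            self.cwnd += 1.0


if __name__ == "__main__":
    mild = DctcpSender()
    mild.on_window_acked(acked=10, marked=10)    # first sign of congestion
    print(mild.cwnd)                             # small cut while alpha is still low

    heavy = DctcpSender(alpha=1.0)
    heavy.on_window_acked(acked=10, marked=10)   # persistent congestion
    print(heavy.cwnd)                            # cut approaches a TCP-style halving
```

When every packet has been marked for a long time, `alpha` approaches 1 and the cut approaches the halving of standard TCP; with light marking the cut is much gentler, which is what lets the sender react early without giving up throughput.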

Burst Optimization
• Radio wakeup is expensive
  – Wake up
  – Establish a bearer
  – Use battery and signaling resources
• Burst optimization
  – Send bigger chunks less often
  – … to allow the mobile device to go to idle state

Lossless Handover
• Mobile moves from one base station to another
  – Packets in flight still arrive at the old base station
  – … and could lead to bursty loss (and TCP timeout)
• Old base station can buffer packets
  – Send buffered packets to the new base station

Freezing the Connection
• Mobile device can predict temporary disconnection
  – E.g., fading, handoff
• Mobile can ask the fixed host to stop sending
  – Advertise a receive window of 0
• Benefits
  – Avoids wasted transmission of data
  – Avoids loss that triggers timeouts, decrease in cwnd, etc.

Discussion
CUBIC paper
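
Worked sketch for the "Freezing the Connection" slide: a sender may have at most min(cwnd, rwnd) packets outstanding, so advertising a receive window of 0 pauses it without any loss and without touching cwnd. The toy model below illustrates only that interaction. `ToySender` and its methods are hypothetical names, windows are counted in packets, and the window probes a real TCP sender issues while the window is closed are not modeled.

```python
class ToySender:
    """Toy fixed-host sender limited by both cwnd and the advertised rwnd."""

    def __init__(self, cwnd: int = 8) -> None:
        self.cwnd = cwnd      # congestion window, in packets
        self.rwnd = cwnd      # last receive window advertised by the mobile
        self.in_flight = 0    # packets sent but not yet acknowledged

    def on_window_update(self, rwnd: int) -> None:
        """Record the receive window carried in the latest ACK."""
        self.rwnd = rwnd

    def sendable(self) -> int:
        """Number of new packets the sender may transmit right now."""
        return max(0, min(self.cwnd, self.rwnd) - self.in_flight)


if __name__ == "__main__":
    sender = ToySender(cwnd=8)
    print(sender.sendable())      # full window available before the handoff

    sender.on_window_update(0)    # mobile predicts a disconnection and freezes
    print(sender.sendable())      # sender transmits nothing while frozen

    sender.on_window_update(8)    # mobile reconnects and reopens the window
    print(sender.sendable())      # sender resumes with cwnd unchanged
```

Because nothing is sent during the freeze, nothing is lost, so no timeout fires and cwnd is still at its old value when the window reopens.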