Curbing Delays in Datacenters: Need Time to Save Time?
Mohammad Alizadeh (Insieme Networks)
Sachin Katti, Balaji Prabhakar (Stanford University)

Window-based rate control schemes (e.g., TCP) do not work at near-zero round-trip latency.

Datacenter Networks
• 10-40 Gbps links, 1-5 μs latency, 1000s of server ports
• Workloads: web, app, cache, db, mapreduce, HPC, monitoring
• Message latency is King: need very high throughput and very low latency

Transport in Datacenters
• TCP is widely used, but has poor performance
  – Buffer hungry: adds significant queuing latency
• Queuing latency today: TCP ~1-10 ms, DCTCP ~100 μs, versus a baseline fabric latency of 1-5 μs. How do we get to ~zero queuing latency?

Reducing Queuing: DCTCP vs TCP
• Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB
• [Plots: Queue Length (packets / KBytes) vs. Time (seconds) for TCP, 2 flows and DCTCP, 2 flows]

Towards Zero Queuing
• Senders S1…Sn share a switch that marks ECN at 90% utilization
• ns2 simulation: 10 DCTCP flows, 10 Gbps switch, ECN at 9 Gbps (90% util)
• [Plots vs. round-trip propagation time (μs): queueing and total latency (μs), with a floor ≈ 23 μs; throughput (Gbps) against the 9 Gbps target]

Window-based Rate Control
• Sender sends to Receiver over a link of capacity C = 1, with Cwnd = 1
• RTT = 10 → C×RTT = 10 pkts → Throughput = 1/RTT = 10%
• RTT = 2 → C×RTT = 2 pkts → Throughput = 1/RTT = 50%
• RTT = 1.01 → C×RTT = 1.01 pkts → Throughput = 1/RTT = 99%
• With two senders (each with Cwnd = 1) and RTT = 1.01: as the propagation time → 0, queue buildup is unavoidable

So What?
Window-based rate control needs lag in the loop.
Near-zero latency transport must:
1. Use timer-based rate control / pacing
2. Use small packet sizes
Both increase CPU overhead (not practical in software). Possible in hardware, but complex (e.g., HULL, NSDI'12).
Or… change the problem!

Changing the Problem…
• Switch port with a FIFO queue: queue buildup is costly → need precise rate control
• Switch port with a priority queue: queue buildup is irrelevant → coarse rate control is OK

pFabric

DC Fabric: Just a Giant Switch
• [Diagrams: hosts H1-H9; the datacenter fabric abstracted as one giant switch connecting TX ports to RX ports]
• DC transport = flow scheduling on a giant switch
• Objective? Minimize average FCT, subject to ingress & egress capacity constraints

"Ideal" Flow Scheduling
• Problem is NP-hard [Bar-Noy et al.]
  – Simple greedy algorithm: 2-approximation

pFabric in 1 Slide
Packets carry a single priority #
• e.g., prio = remaining flow size
pFabric Switches
• Very small buffers (~10-20 pkts for a 10 Gbps fabric)
• Send highest-priority / drop lowest-priority packets
pFabric Hosts
• Send/retransmit aggressively
• Minimal rate control: just prevent congestion collapse

Key Idea
Decouple flow scheduling from rate control:
• Switches implement flow scheduling via local mechanisms: queue buildup does not hurt performance, so window-based rate control is OK
• Hosts use simple window-based rate control (≈ TCP) to avoid high packet loss

pFabric Switch
• Each port keeps a small "bag" of packets, with prio = remaining flow size
• Priority scheduling: send the highest-priority packet first
• Priority dropping: drop the lowest-priority packets first
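The following is a minimal sketch, added for illustration and not taken from the pFabric paper, of the per-port behavior the "pFabric Switch" slide describes: keep a small "bag" of packets, transmit the highest-priority packet first, and drop the lowest-priority packet when the bag overflows. It assumes Python, the hypothetical names Packet and PFabricPort, and the convention that prio = remaining flow size, so a smaller number means higher priority.

```python
# Sketch of one pFabric output port (illustrative only; names are hypothetical).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    flow_id: int
    prio: int            # remaining flow size (bytes) when the packet was sent
    payload: bytes = b""


class PFabricPort:
    """One switch output port with a very small buffer (~10-20 packets)."""

    def __init__(self, capacity_pkts: int = 16):
        self.capacity = capacity_pkts
        self.bag: List[Packet] = []   # unordered "bag"; linear scans are cheap at this size

    def enqueue(self, pkt: Packet) -> None:
        self.bag.append(pkt)
        if len(self.bag) > self.capacity:
            # Priority dropping: evict the lowest-priority packet
            # (largest remaining flow size), possibly the new arrival itself.
            worst = max(self.bag, key=lambda p: p.prio)
            self.bag.remove(worst)

    def dequeue(self) -> Optional[Packet]:
        if not self.bag:
            return None
        # Priority scheduling: transmit the highest-priority packet
        # (smallest remaining flow size) first.
        best = min(self.bag, key=lambda p: p.prio)
        self.bag.remove(best)
        return best
```

In the deck's terms, queue buildup in this bag is harmless: whatever is backlogged, the highest-priority packet is always the next one out, and only the lowest-priority packets are ever at risk of being dropped.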
pFabric Switch Complexity
• Buffers are very small (~2×BDP per port)
  – e.g., C = 10 Gbps, RTT = 15 µs → buffer ~ 30 KB
  – Today's switch buffers are 10-30× larger
Priority Scheduling/Dropping
• Worst case: minimum-size packets (64 B)
  – 51.2 ns to find the min/max of ~600 numbers
  – Binary comparator tree: 10 clock cycles
  – Current ASICs: clock ~ 1 ns

Why does this work?
Invariant for ideal scheduling: at any instant, have the highest-priority packet (according to the ideal algorithm) available at the switch.
• Priority scheduling: high-priority packets traverse the fabric as quickly as possible
• What about dropped packets?
  – Lowest priority → not needed until all other packets depart
  – Buffer > BDP → enough time (> RTT) to retransmit

Evaluation (144-port fabric; Search traffic pattern)
• [Plot: FCT, normalized to optimal in an idle fabric, vs. load (0.1-0.8) for Ideal, pFabric, PDQ, DCTCP, TCP-DropTail]
• Recall: "Ideal" is REALLY idealized!
  – Centralized, with a full view of flows
  – No rate-control dynamics
  – No buffering
  – No packet drops
  – No load-balancing inefficiency

Mice FCT (<100 KB)
• [Plots: normalized FCT vs. load (0.1-0.8), average and 99th percentile, for Ideal, pFabric, PDQ, DCTCP, TCP-DropTail]

Conclusion
• Window-based rate control does not work at near-zero round-trip latency
• pFabric: simple, yet near-optimal
  – Decouples flow scheduling from rate control
  – Allows the use of coarse window-based rate control
• pFabric is within 10-15% of "ideal" for realistic DC workloads (SIGCOMM'13)

Thank You!
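Backup note (added for illustration): a back-of-the-envelope check of the numbers on the "pFabric Switch Complexity" slide, using the slide's constants; the variable names below are ours.

```python
# Check that priority scheduling/dropping fits in a 10 Gbps port's per-packet budget.
import math

LINE_RATE_BPS = 10e9     # 10 Gbps port (from the slide)
RTT_S = 15e-6            # 15 us round-trip time (from the slide)
MIN_PKT_BYTES = 64       # minimum-size packet (from the slide)
CLOCK_NS = 1.0           # current ASIC clock ~ 1 ns (from the slide)

bdp_bytes = LINE_RATE_BPS * RTT_S / 8       # bandwidth-delay product
buffer_bytes = 2 * bdp_bytes                # ~2 x BDP of buffering per port

# Worst-case per-packet time budget: one minimum-size packet at line rate.
budget_ns = MIN_PKT_BYTES * 8 / LINE_RATE_BPS * 1e9   # = 51.2 ns

# Worst case, the buffer is full of minimum-size packets (~600 entries).
entries = buffer_bytes / MIN_PKT_BYTES

# A binary comparator tree finds the min/max priority in ~log2(entries) levels,
# roughly one clock cycle per level.
tree_ns = math.ceil(math.log2(entries)) * CLOCK_NS

print(f"per-packet budget : {budget_ns:.1f} ns")
print(f"buffer entries    : {entries:.0f} minimum-size packets")
print(f"comparator tree   : {tree_ns:.0f} ns (well within the budget)")
```

With a ~1 ns clock, the ~10-level comparator tree finishes in about 10 ns, comfortably inside the 51.2 ns available per minimum-size packet, which is the feasibility argument the slide makes.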