FAST TCP for Multi-Gbps WAN: Experiments and Applications
Les Cottrell & Fabrizio Coccetti, SLAC
Prepared for Internet2, Washington, April 2003
http://www.slac.stanford.edu/grp/scs/net/talk/fast-i2-apr03.html
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM) and by the SciDAC base program.

Outline
• High throughput challenges
• New TCP stacks
• Tests on unloaded (testbed) links
– Performance of multiple streams
– Performance of various stacks
• Tests on production networks
– Stack comparisons with single streams
– Stack comparisons with multiple streams
– Fairness
• Where do I find out more?

High Speed Challenges
• PCI bus limitations (66 MHz * 64 bit = 4.2 Gbits/s at best)
• 2.5 Gbits/s at 180 ms RTT requires a 120 MByte window
• Some tools (e.g. bbcp) will not allow a large enough window
– (bbcp is limited to 2 MBytes)
• Slow start is a problem: at 1 Gbits/s it takes about 5-6 secs on a 180 ms link
– i.e. if we want 90% of the measurement in the stable (non-slow-start) phase, we need to measure for 60 secs
– and need to ship >700 MBytes at 1 Gbits/s
• After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1 Gbits/s
– i.e. a loss rate of 1 in ~2 Gpkts (3 Tbits), or a BER of 1 in 3.6*10^12
Plot: Sunnyvale-Geneva, 1500 Byte MTU, stock TCP

New TCP Stacks
• Reno (AIMD) based: loss indicates congestion
– Back off less when congestion is seen
– Recover more quickly after backing off
• Scalable TCP: exponential recovery
– Tom Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks", submitted for publication, December 2002.
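As a rough sketch (not from the talk itself), the per-ACK congestion-window updates behind "standard" Reno and Kelly's Scalable TCP can be written as follows; the constants a = 0.01 and b = 0.125 are the values suggested in Kelly's paper, and the function names are ours:

```python
def reno_update(w, loss):
    # AIMD: +1 packet per RTT (i.e. +1/w per ACK), halve on loss
    return w / 2 if loss else w + 1.0 / w

def scalable_update(w, loss, a=0.01, b=0.125):
    # MIMD: constant per-ACK increase -> ~(1+a) multiplicative growth
    # per RTT, with a gentler multiplicative decrease than Reno
    return w * (1 - b) if loss else w + a
```

Because Scalable's per-RTT increase is proportional to the window, recovery after a loss takes a roughly constant number of RTTs (~13 with these constants) independent of window size; Reno instead needs W/2 RTTs, which is what makes it so slow at multi-Gbps windows.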
• High Speed TCP: same as Reno at low performance, then increases the window more and more aggressively as the window grows, using a table
• Vegas based: RTT indicates congestion
– Caltech FAST TCP: quicker response to congestion, but …
Plot: cwnd trajectories for Standard, Scalable and High Speed TCP (cwnd = 38 pkts ~ 0.5 Mbits)

Typical testbed
Diagram: Geneva – Chicago – Sunnyvale at 2.5 Gbits/s (EU+US), OC192/POS (10 Gbits/s); Geneva: 6*2-cpu servers, 4 disk servers; Sunnyvale: 12*2-cpu servers, 4 disk servers; routers include Cisco 7609s, a GSR and a T640; SNV–CHI–AMS–GVA > 10,000 km. Sunnyvale section deployed for SC2002 (Nov 02).

Testbed Collaborators and sponsors
• Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn
• SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti
• LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart
• NIKHEF/UvA: Cees de Laat, Antony Antony
• CERN: Olivier Martin, Paolo Moroni
• ANL: Linda Winkler
• DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telekom, Information Society Technologies
• Cisco, Level(3), Intel
• DoE, European Commission, NSF

Windows and Streams
• It is well accepted that multiple streams (n) and/or big windows are important to achieve optimal throughput
• Multiple streams effectively reduce the impact of a loss by 1/n and improve recovery time by 1/n
• Optimum windows & streams change as the path changes (e.g.
utilization), so n is hard to optimize
• Can be unfriendly to others

Even with big windows (1 MB) we still need multiple streams with Standard TCP
• ANL, Caltech & RAL reach a knee (between 2 and 24 streams); above this the gain in throughput is slow
• Above the knee, performance still improves slowly, maybe due to squeezing out others and taking more than a fair share because of the large number of streams

Stock vs FAST TCP, MTU = 1500B
• Need to measure all parameters to understand their effects and configurations:
– Windows, streams, txqueuelen, TCP stack, MTU, NIC card
– A lot of variables
• Examples of 2 TCP stacks:
– FAST TCP no longer needs multiple streams; this is a major simplification (reduces the number of variables to tune by 1)
Plots: Stock TCP, 1500B MTU, 65 ms RTT; FAST TCP, 1500B MTU, 65 ms RTT

TCP stacks with 1500B MTU @ 1 Gbps
Plot: throughput vs txqueuelen for the various stacks

Jumbo frames, new TCP stacks at 1 Gbits/s, SNV-GVA
• But: jumbos are not part of the GE or 10GE standards
• Not widely deployed in end networks

Production network tests
• All 6 hosts have 1GE interfaces (2 SLAC hosts send simultaneously)
• Competing flows, no jumbos
Diagram: SLAC (Stanford) hosts, one running a "new" TCP stack and one running Reno, send via ESnet (OC48) and Abilene/CalREN to remote hosts: CERN, Geneva (RTT = 202 ms, OC192 via Chicago), NIKHEF, Amsterdam (RTT = 158 ms, SURFnet OC12), Caltech (RTT = 25 ms), and APAN via Seattle (RTT = 147 ms, OC12)

High Speed TCP vs Reno – 1 stream
• 2 separate hosts at SLAC sending simultaneously to 1 receiver (2 iperf processes), 8 MB window, pre-flush TCP config, 1500B MTU
• Bursty RTT = congestion?
• Checked Reno vs Reno with 2 hosts: very similar, as expected

Plot: N.b. large RTT = congestion?

Plot: Large RTTs => poor FAST

Scalable vs multi-streams
SLAC to CERN, duration 60 s, RTT 207 ms, 8 MB window

FAST & Scalable vs.
Multi-stream Reno (SLAC > CERN, ~230 ms)
• Bottleneck capacity 622 Mbits/s
• For short durations, very noisy and hard to distinguish; congestion events are often synchronized
• Reno 1 stream: 87 Mbits/s average
• FAST 1 stream: 244 Mbits/s average
• Reno 8 streams: 150 Mbits/s average
• FAST 1 stream: 200 Mbits/s average

Scalable & FAST TCP with 1 stream vs Reno with n streams

Fairness
FAST vs Reno, 1 stream, 16 MB window, SLAC to CERN
• Reno alone: 221 Mbits/s
• FAST alone: 240 Mbits/s
• Reno (45 Mbits/s) & FAST (285 Mbits/s) competing

Summary (very preliminary)
• With a single flow & an empty network:
– Can saturate 2.5 Gbits/s with standard TCP & jumbos
– Can saturate 1 Gbits/s with new stacks & 1500B frames, or with standard TCP & jumbos
• With a production network:
– FAST can take a while to get going
– Once going, FAST TCP with one stream looks good compared to multi-stream Reno
– FAST can back down early compared to Reno
– More work needed on fairness
• Scalable does not look as good against multi-stream Reno

What's next?
• Go beyond 2.5 Gbits/s
• Disk-to-disk throughput & useful applications
– Need faster CPUs (an extra 60% MHz/Mbit/s over TCP for disk-to-disk); understand how to use multi-processors
• Further evaluate new stacks with real-world links and other equipment:
– Other NICs
– Response to congestion, pathologies
– Fairness
– Deploy for some major (e.g. HENP/Grid) customer applications
• Understand how to make 10GE NICs work well with 1500B MTUs
• Move from "hero" demonstrations to commonplace use

More Information
• 10GE tests
– www-iepm.slac.stanford.edu/monitoring/bulk/10ge/
– sravot.home.cern.ch/sravot/Networking/10GbE/10GbE_test.html
• TCP stacks
– netlab.caltech.edu/FAST/
– datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf
– www.icir.org/floyd/hstcp.html
• Stack comparisons
– www-iepm.slac.stanford.edu/monitoring/bulk/fast/
– www.csm.ornl.gov/~dunigan/net100/floyd.html
– www-iepm.slac.stanford.edu/monitoring/bulk/tcpstacks/

Extras

FAST TCP vs. Reno – 1 stream
N.b.
the RTT curve for Caltech shows why FAST performs poorly against Reno (too polite?)

Scalable vs. Reno – 1 stream
8 MB windows, 2 hosts, competing

Other high speed gotchas
• Large windows and a large number of streams can cause the last stream to take a long time to close
• Linux memory leak
• Linux TCP configuration caching
• What window size is actually used/reported?
• 32-bit counters in iperf and routers wrap; need the latest releases with 64-bit counters
• Effects of txqueuelen (the number of packets queued for the NIC)
• Routers that do not pass jumbos
• Performance differs between drivers and NICs from different manufacturers
– May require tuning a lot of parameters
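As a closing back-of-envelope sketch, the numbers on the "High Speed Challenges" slide can be reproduced from the bandwidth-delay product. The rates, RTT and MTU (2.5 Gbits/s, 180 ms, 1500 B) come from the slide; the 1.5x-per-RTT slow-start growth factor (delayed ACKs) and the 1-packet-per-RTT Reno recovery model are our assumptions:

```python
import math

def bdp_bytes(rate_bps, rtt_s):
    # bandwidth-delay product: bytes that must be in flight to fill the pipe
    return rate_bps * rtt_s / 8

def slow_start_s(rate_bps, rtt_s, mtu=1500, growth=1.5):
    # cwnd grows ~1.5x per RTT with delayed ACKs (2x without)
    w_pkts = rate_bps * rtt_s / (8 * mtu)
    return math.log(w_pkts) / math.log(growth) * rtt_s

def reno_recovery_s(rate_bps, rtt_s, mtu=1500):
    # after a loss Reno halves cwnd, then adds ~1 packet per RTT
    w_pkts = rate_bps * rtt_s / (8 * mtu)
    return (w_pkts / 2) * rtt_s

print(bdp_bytes(2.5e9, 0.18) / 1e6)       # 56.25 MB in flight; the slide's 120 MB window is ~2x this
print(slow_start_s(1e9, 0.18))            # ~4.3 s at 1 Gbit/s (slide: "about 5-6 secs")
print(reno_recovery_s(2.5e9, 0.18) / 60)  # 56.25 minutes (slide: "over an hour")
```

These are idealized single-flow estimates, but they match the slide's figures to within its own rounding, and they show why multiple streams or the new stacks are needed at these speeds.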