TCP/IP and Other Transports for High Bandwidth Applications
Real Applications on Real Networks
Richard Hughes-Jones, University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks", then look for "Brasov"
Summer School, Brasov, Romania, July 2005

Slide 2: What we might cover!
This is what researchers find when they try to use high-performance networks.
- Real Applications on Real Networks
  - Disk-to-disk applications on real networks
    - Memory-to-memory tests
    - Comparison of different data-moving applications
    - The effect (improvement) of different TCP stacks
    - Transatlantic disk-to-disk at Gigabit speeds
  - Remote Computing Farms
    - The effect of distance
    - Protocol vs implementation
  - Radio Astronomy e-VLBI
    - Users with data that is random noise!
- Thanks to those who allowed me to use their slides: Sylvain Ravot (CERN), Les Cottrell (SLAC), Brian Tierney (LBL), Robin Tasker (DL), Ralph Spencer (Jodrell Bank)

Slide 3: "Server Quality" Motherboards
- SuperMicro P4DP8-2G (P4DP6)
- Dual Xeon, 400/522 MHz front-side bus
- 6 PCI / PCI-X slots on 4 independent PCI buses: 64 bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
- Dual Gigabit Ethernet
- Adaptec AIC-7899W dual-channel SCSI
- UDMA/100 bus-master EIDE channels: data transfer rates of 100 MB/s burst

Slide 4: "Server Quality" Motherboards
- Boston/Supermicro H8DAR
- Two dual-core Opterons
- 200 MHz DDR memory, theoretical bandwidth 6.4 Gbit/s
- HyperTransport
- 2 independent PCI buses: 133 MHz PCI-X
- 2 Gigabit Ethernet
- SATA
- (PCI-e)

Slide 5: UK Transfers, MB-NG and SuperJANET4
- Throughput for real users

Slide 6: Topology of the MB-NG Network
[Diagram: Manchester, UCL and RAL domains, each with hosts (man01-03, lon01-03, ral01-02, HW RAID) behind Cisco 7609 boundary/edge routers, joined over the UKERNA Development Network with MPLS between the administrative domains. Key: Gigabit Ethernet, 2.5 Gbit POS access.]

Slide 7: Topology of the Production Network
[Diagram: Manchester domain (man01 + HW RAID, 3 routers, 2 switches) to RAL domain (ral01 + HW RAID) over the production backbone. Key: Gigabit Ethernet, 2.5 Gbit POS access, 10 Gbit POS.]

Slide 8: iperf Throughput + Web100
- SuperMicro on the MB-NG network: HighSpeed TCP, line speed 940 Mbit/s (see the framing arithmetic sketched after slide 10), DupACKs <10 (expect ~400)
- BaBar on the production network: Standard TCP, 425 Mbit/s, DupACKs 350-400 - re-transmits

Slide 9: Applications: Throughput Mbit/s
- HighSpeed TCP, 2 GByte file, RAID5
- SuperMicro + SuperJANET: bbcp, bbftp, Apache, GridFTP
- Previous work used RAID0 (not disk limited)

Slide 10: bbftp: What else is going on?
- Scalable TCP; BaBar + SuperJANET and SuperMicro + SuperJANET
- Congestion window - duplicate ACKs
- Variation not TCP related? Disk speed / bus transfer / application
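The 940 Mbit/s "line speed" quoted for these Gigabit tests follows directly from the framing overheads with a 1500-byte MTU. A minimal sketch of that arithmetic, assuming the 12-byte TCP timestamp option (consistent with the 1448-byte segments seen in the tcpdump slides later in this talk):

```c
/* Why "wire speed" Gigabit TCP shows up as ~940 Mbit/s with a 1500-byte MTU.
 * A sketch; the 12-byte timestamp option is an assumption that matches the
 * 1448-byte segments seen in the tcpdump slides. */
#include <stdio.h>

int main(void)
{
    const double line_rate = 1e9;                   /* Gigabit Ethernet, bit/s  */
    const int mtu = 1500;                           /* IP MTU                   */
    const int eth_overhead = 8 + 14 + 4 + 12;       /* preamble+SFD, MAC, FCS,  */
                                                    /* inter-frame gap          */
    const int ip_hdr = 20, tcp_hdr = 20, tcp_ts = 12;

    int payload = mtu - ip_hdr - tcp_hdr - tcp_ts;  /* 1448-byte segments       */
    int on_wire = mtu + eth_overhead;               /* 1538 bytes per frame     */

    printf("TCP payload rate = %.1f Mbit/s\n",
           line_rate * payload / on_wire / 1e6);    /* ~941 Mbit/s              */
    return 0;
}
```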
Slide 11: bbftp: Host & Network Effects
- 2 GByte file, RAID5; disks: 1200 Mbit/s read, 600 Mbit/s write
- Scalable TCP
- BaBar + SuperJANET: instantaneous 220-625 Mbit/s
- SuperMicro + SuperJANET: instantaneous 400-665 Mbit/s for 6 s, then 0-480 Mbit/s
- SuperMicro + MB-NG: instantaneous 880-950 Mbit/s for 1.3 s, then 215-625 Mbit/s

Slide 12: Average Transfer Rates Mbit/s

  App      TCP Stack   SuperMicro   SuperMicro       BaBar on      SC2004 on
                       on MB-NG     on SuperJANET4   SuperJANET4   UKLight
  Iperf    Standard    940          350-370          425           940
           HighSpeed   940          510              570           940
           Scalable    940          580-650          605           940
  bbcp     Standard    434          290-310          290
           HighSpeed   435          385              360
           Scalable    432          400-430          380
  bbftp    Standard    400-410      325              320           825
           HighSpeed   370-390      380
           Scalable    430          345-532          380           875
  apache   Standard    425          260              300-360
           HighSpeed   430          370              315
           Scalable    428          400              317
  Gridftp  Standard    405          240
           HighSpeed   320
           Scalable    335

- New stacks give more throughput (though the Gridftp rate decreases with them)

Slide 13: Transatlantic Disk to Disk Transfers with UKLight - SuperComputing 2004

Slide 14: SC2004 UKLight Overview
[Diagram: SLAC and Caltech booths at SC2004 (SCINet, Cisco 6509, Caltech 7600) linked over the NLR lambda NLR-PITT-STAR-10GE-16 to Chicago StarLight, then via UKLight 10G to ULCC and on to the MB-NG 7600 OSR in Manchester, the UCL network and UCL HEP (four 1GE channels); SURFnet/EuroLink 10G to Amsterdam (two 1GE channels); UltraLight IP.]

Slide 15: Collaboration at SC2004
- Setting up the BW Bunker; the BW Challenge at the SLAC booth
- Working with S2io, Sun, Chelsio and SCINet

Slide 16: Transatlantic Ethernet: TCP Throughput Tests
- Supermicro X5DPE-G2 PCs: dual 2.9 GHz Xeon CPU, FSB 533 MHz, 1500-byte MTU, 2.6.6 Linux kernel
- Memory-to-memory TCP throughput, Standard TCP
- Wire-rate throughput of 940 Mbit/s
- [Web100 plots: instantaneous and average bandwidth with Cwnd vs time; the first 10 s shown separately]
- Work in progress to study: implementation detail, advanced stacks, effect of packet loss, sharing

Slide 17: SC2004 Disk-Disk bbftp
- The bbftp file transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 MBytes; rtt 177 ms; SACK off
- Move a 2 GByte file
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-Disk at 1 Gbit/s
- [Web100 plots: achievable throughput, average bandwidth and Cwnd vs time]
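For scale, a hedged back-of-envelope of what these average rates mean per file, assuming the 2 GByte test file is 2 x 2^30 bytes:

```c
/* Rough time to move the 2 GByte test file at the average rates quoted
 * above (user-data rates; assumes 2 GByte = 2 * 2^30 bytes). */
#include <stdio.h>

int main(void)
{
    const double file_bits = 2.0 * (1 << 30) * 8;   /* 2 GByte file */
    const double rates[] = { 425, 825, 875, 940 };  /* Mbit/s       */

    for (int i = 0; i < 4; i++)
        printf("%6.0f Mbit/s -> %5.1f s per file\n",
               rates[i], file_bits / (rates[i] * 1e6));
    return 0;                /* e.g. 825 Mbit/s -> ~21 s per file */
}
```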
Slide 18: Network & Disk Interactions (work in progress)
- Hosts: Supermicro X5DPE-G2 motherboards, dual 2.8 GHz Xeon CPUs with 512 kByte cache and 1 MByte memory
- 3Ware 8506-8 controller on a 133 MHz PCI-X bus, configured as RAID0
- Six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kByte stripe size
- Measure memory-to-RAID0 transfer rates with and without UDP traffic:
  - Disk write: 1735 Mbit/s
  - Disk write + 1500-byte MTU UDP: 1218 Mbit/s - a drop of 30%
  - Disk write + 9000-byte MTU UDP: 1400 Mbit/s - a drop of 19%
- [Plots: write throughput vs trial number for each case; % CPU kernel mode L3+4 vs L1+2 for 8k and 64k writes, with the y = 178 - 1.05x boundary]

Slide 19: Remote Computing Farms in the ATLAS TDAQ Experiment

Slide 20: Remote Computing Concepts
- ~PByte/s from the ATLAS detectors into the Level 1 Trigger
- ROBs feed the Data Collection Network; L2PUs form the Level 2 Trigger; SFIs (Event Builders) feed the Back End Network and the local Event Processing Farms (PFs); SFOs write to mass storage at CERN B513 at 320 MByte/s
- Remote Event Processing Farms (Copenhagen, Edmonton, Krakow, Manchester) are connected over lightpaths and GÉANT

Slide 21: ATLAS Remote Farms - Network Connectivity
[Diagram of the network connectivity to the remote farms]

Slide 22: ATLAS Application Protocol
- The Event Filter Daemon (EFD) talks to the SFI and SFO:
  - The EFD requests an event from the SFI; the SFI replies with the event (~2 MBytes)
  - The event is processed
  - To return the computation, the EF asks the SFO for buffer space, the SFO sends OK, and the EF transfers the results
- The Request-Response time is recorded as a histogram
- tcpmon: an instrumented TCP request-response program that emulates the EFD-to-SFI communication

Slide 23: Using Web100 TCP Stack Instrumentation to Analyse the Application Protocol - tcpmon
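The real tcpmon also samples Web100 stack variables; the sketch below shows only the request-response emulation idea it is built on: send a small request, time how long the large response takes to arrive, repeat. The host, port, message sizes and wire format here are hypothetical, for illustration only.

```c
/* Minimal sketch of a tcpmon-style request-response client (hypothetical
 * host/port and message framing; Web100 sampling not shown). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define REQ_SIZE   64              /* 64-byte request           */
#define RESP_SIZE  (1024 * 1024)   /* expect a 1 Mbyte response */

static double now_s(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct sockaddr_in srv = { 0 };
    char req[REQ_SIZE] = "EVENT_REQUEST";
    static char resp[RESP_SIZE];

    int s = socket(AF_INET, SOCK_STREAM, 0);
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5001);                     /* hypothetical port   */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr); /* hypothetical server */
    if (connect(s, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");
        return 1;
    }

    for (int ev = 0; ev < 100; ev++) {
        double t0 = now_s();
        ssize_t got = 0, n;

        send(s, req, sizeof req, 0);                /* "request event"     */
        while (got < RESP_SIZE &&                   /* read the full event */
               (n = recv(s, resp + got, RESP_SIZE - got, 0)) > 0)
            got += n;

        printf("event %3d: %.1f ms\n", ev, (now_s() - t0) * 1e3);
        usleep(50000);                              /* 50 ms wait, as in the tests */
    }
    close(s);
    return 0;
}
```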
Slide 24: tcpmon: TCP Activity, Manc-CERN Req-Resp
- Round-trip time 20 ms; 64-byte request (green), 1 MByte response (blue)
- TCP in slow start: the 1st event takes 19 rtt, or ~380 ms (an idealised round-trip estimate is sketched after slide 27)
- The TCP congestion window gets re-set on each request - a TCP stack implementation detail that reduces Cwnd after inactivity
- Even after 10 s, each response takes 13 rtt, or ~260 ms
- Transfer achievable throughput: 120 Mbit/s
- [Web100 plots: DataBytesOut/DataBytesIn, Cwnd and achievable throughput vs time]

Slide 25: tcpmon: TCP Activity, Manc-CERN Req-Resp, TCP stack tuned
- Round-trip time 20 ms; 64-byte request (green), 1 MByte response (blue)
- TCP starts in slow start: the 1st event takes 19 rtt, or ~380 ms
- The TCP congestion window grows nicely; the response takes 2 rtt after ~1.5 s
- Rate ~10/s (with a 50 ms wait)
- Transfer achievable throughput grows to 800 Mbit/s
- [Web100 plots: DataBytesOut/DataBytesIn, packets in/out, Cwnd and achievable throughput vs time]

Slide 26: tcpmon: TCP Activity, Alberta-CERN Req-Resp, TCP stack tuned
- Round-trip time 150 ms; 64-byte request (green), 1 MByte response (blue)
- TCP starts in slow start: the 1st event takes 11 rtt, or ~1.67 s
- The TCP congestion window is in slow start to ~1.8 s, then congestion avoidance
- Response in 2 rtt after ~2.5 s; rate 2.2/s (with a 50 ms wait)
- Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
- [Web100 plots: DataBytesOut/DataBytesIn, packets in/out, Cwnd and achievable throughput vs time]

Slide 27: Time Series of Request-Response Latency
- Manchester-CERN: round-trip time 20 ms, 1 MByte of data returned; stable for ~18 s at ~42.5 ms, then alternate points at 29 and 42.5 ms
- Alberta-CERN: round-trip time 150 ms, 1 MByte of data returned; stable for ~150 s at 300 ms, then falls to 160 ms with ~80 μs variation
- [Plots: round-trip latency vs request time for both paths]
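Slides 24-26 count the cost of slow start in round trips. As a rough illustration only (not the measured stack behaviour), a pure cwnd-doubling model with 1448-byte segments and an assumed initial window of two segments gives about 10 rtt for the first 1 MByte event; the measured 19 rtt is larger, plausibly because real stacks grow cwnd more slowly (delayed ACKs, byte counting, smaller effective initial window).

```c
/* Idealised slow-start model: how many round trips to deliver a 1 Mbyte
 * response if cwnd doubles every rtt?  A sketch only -- real stacks grow
 * cwnd more slowly, which is why the measured first event needed 19 rtt
 * rather than the ~10 computed below. */
#include <stdio.h>

int main(void)
{
    const int mss = 1448;                 /* segment payload seen in tcpdump */
    const int resp = 1024 * 1024;         /* 1 Mbyte response                */
    const double rtt_ms = 20.0;           /* Manchester - CERN               */

    long segs_needed = (resp + mss - 1) / mss;
    long cwnd = 2, sent = 0;              /* assumed initial window          */
    int rounds = 1;                       /* 1 rtt to deliver the request    */

    while (sent < segs_needed) {          /* one cwnd's worth per rtt        */
        sent += cwnd;
        cwnd *= 2;
        rounds++;
    }
    printf("~%d rtt = ~%.0f ms for the first 1 Mbyte event\n",
           rounds, rounds * rtt_ms);
    return 0;
}
```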
Slide 28: Using the Trigger DAQ Application

Slide 29: Time Series of T/DAQ Event Rate
- Manchester-CERN: round-trip time 20 ms, 1 MByte of data returned
- Remote nodes: 3 nodes = one Gigabit Ethernet + two 100 Mbit; 2 nodes = two 100 Mbit; 1 node = one 100 Mbit
- Event rate: using the tcpmon transfer time of ~42.5 ms and adding the time to return the data gives ~95 ms per event, so an expected rate of 10.5/s; we observe ~6/s for the gigabit node (the arithmetic is sketched after the conclusions below)
- Reason: the TCP buffers could not be set large enough in the T/DAQ application
- [Plots: event rate vs time for the three configurations, and the events/s frequency histogram]

Slide 30: Tcpdump of the Trigger DAQ Application

Slide 31: tcpdump of the T/DAQ Dataflow at the SFI (1)
- CERN-Manchester, 1.0 MByte event; the remote EFD requests an event from the SFI
- Incoming event request, followed by an ACK; the SFI sends the event
- Limited by the TCP receive buffer: time 115 ms (~4 ev/s)
- When the TCP ACKs arrive, more data is sent; N x 1448-byte packets

Slide 32: Tcpdump of TCP Slow Start at the SFI (2)
- CERN-Manchester, 1.0 MByte event; the remote EFD requests an event from the SFI
- First event request; the SFI sends the event
- Limited by TCP slow start: time 320 ms
- When ACKs arrive, more data is sent; N x 1448-byte packets

Slide 33: tcpdump of the T/DAQ Dataflow for SFI & SFO
- CERN-Manchester, another test run, 1.0 MByte event
- The remote EFD requests events from the SFI and sends the computation back to the SFO
- Link setup & TCP slow start visible; links closed by the application

Slide 34: Some Conclusions
- The TCP protocol dynamics strongly influence the behaviour of the application; care is required with the application design, e.g. the use of timeouts
- With the correct TCP buffer sizes:
  - It is not throughput but the round-trip nature of the application protocol that determines performance
  - Requesting the 1-2 MBytes of data takes 1 or 2 round trips
  - TCP slow start (the opening of Cwnd) considerably lengthens the time for the first block of data
  - Implementation "improvements" (Cwnd reduction) kill performance!
- When the TCP buffer sizes are too small (the default):
  - The amount of data sent is limited on each rtt
  - Data is sent, and arrives, in bursts
  - It takes many round trips to send 1 or 2 MBytes
- The end hosts themselves:
  - CPU power is required for the TCP/IP stack as well as the application
  - Packets can be lost in the IP stack due to lack of processing power
- Although the application is ATLAS-specific, the network interactions are applicable to other areas, including remote iSCSI, remote database accesses, and real-time Grid computing, e.g. real-time interactive medical image processing
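A back-of-envelope for the event-rate slide and the buffer problem behind it. The 52.5 ms "return" figure below is simply 95 ms minus the measured 42.5 ms transfer and is an assumption for illustration; the bandwidth-delay product shows the socket buffer needed to keep the 20 ms, 1 Gbit/s path full.

```c
/* Event-rate and buffer-size arithmetic for the Manchester-CERN tests. */
#include <stdio.h>

int main(void)
{
    const double get_event_s = 0.0425;   /* tcpmon transfer time             */
    const double per_event_s = 0.095;    /* incl. returning the results      */
    const double rate_bps    = 1e9;      /* Gigabit path                     */
    const double rtt_s       = 0.020;    /* Manchester - CERN                */

    printf("expected event rate : %.1f /s (observed ~6/s)\n", 1.0 / per_event_s);
    printf("time to return data : %.1f ms (assumed = 95 - 42.5)\n",
           (per_event_s - get_event_s) * 1e3);
    printf("buffer for full rate: %.2f Mbyte (rate * rtt / 8)\n",
           rate_bps * rtt_s / 8 / 1e6);  /* ~2.5 Mbyte per connection        */
    return 0;
}
```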
Slide 35: Radio Astronomy e-VLBI

Slide 36: Radio Astronomy (with help from Ralph Spencer, Jodrell Bank)
- The study of celestial objects at wavelengths from <1 mm to >1 m
- Sensitivity for continuum sources is proportional to 1/sqrt(B*t), where B = bandwidth and t = integration time
- High resolution is achieved with interferometers
- Some radio-emitting X-ray binary stars in our own galaxy: GRS 1915+105 (MERLIN), SS433 (MERLIN and European VLBI), Cygnus X-1 (VLBA)

Slide 37: Earth-Rotation Synthesis and Fringes
- Telescope data are correlated in pairs: N(N-1)/2 baselines
- Fringes are obtained with the correct signal phase
- MERLIN u-v coverage: ~12 hours are needed for full synthesis, though not necessarily collecting data for all that time
- NB the trade-off between B and t for sensitivity

Slide 38: The European VLBI Network: EVN
- Detailed radio imaging uses antenna networks over 100s-1000s of km
- At the faintest levels the sky teems with galaxies being formed; radio penetrates the cosmic dust, so the process can be seen clearly
- The telescopes are in place, with disk recording at 512 Mbit/s
- A real-time connection allows greater response, reliability and sensitivity

Slide 39: EVN-NREN
[Map: Gbit link to Onsala, Sweden (Chalmers University of Technology, Gothenburg); Gbit link to Torun, Poland; dedicated Gbit link to Jodrell Bank, UK (MERLIN, Cambridge); Westerbork, Netherlands; DWDM link to Dwingeloo; Medicina, Italy]

Slide 40: UDP Throughput Manchester-Dwingeloo (Nov 2003)
- Throughput vs packet spacing; Manchester: 2.0 GHz Xeon, Dwingeloo: 1.2 GHz PIII
- Near wire rate, 950 Mbit/s (NB the record stands at 6.6 Gbit/s, SLAC-CERN)
- [Plots (Gnt5-DwMk5 / DwMk5-Gnt5, 1472-byte packets, 11-13 Nov 03): received wire rate, % packet loss, and sender/receiver CPU kernel load vs inter-frame spacing]
- 4th-year project: Adam Mathews, Steve O'Toole
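The shape of the throughput-vs-spacing curves above can be sketched from the framing alone. This assumes the "wire rate" counts the full on-wire frame (payload + UDP/IP headers + Ethernet framing, preamble and inter-frame gap), which is an assumption about the plotting convention; below the serialization limit the rate is capped at line rate.

```c
/* Expected UDP "wire rate" versus inter-packet spacing for 1472-byte
 * payloads on Gigabit Ethernet -- the shape of the plots above. */
#include <stdio.h>

int main(void)
{
    const double line_rate = 1e9;                              /* bit/s       */
    const int payload = 1472, udp_ip = 28, eth = 38;           /* bytes       */
    const double wire_bits = (payload + udp_ip + eth) * 8.0;   /* 12304 bits  */
    const double min_spacing_us = wire_bits / line_rate * 1e6; /* ~12.3 us    */

    for (double spacing = 0; spacing <= 40; spacing += 5) {
        double s = spacing > min_spacing_us ? spacing : min_spacing_us;
        printf("spacing %4.1f us -> %6.1f Mbit/s\n",
               spacing, wire_bits / (s * 1e-6) / 1e6);
    }
    return 0;
}
```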
Slide 41: Packet Loss Distribution
- Cumulative distribution of the packet loss (the integral of p(t) dt beyond each time t), with each bin 12 ms wide, compared with a Poisson model
- Long-range effects in the data?

Slide 42: 26th January 2005 UDP Tests - Simon Casey (PhD project)
- Between JBO and JIVE in Dwingeloo, using the production network
- A period of high packet loss (3%) was seen

Slide 43: The GÉANT2 Launch, June 2005

Slide 44: e-VLBI at the GÉANT2 Launch, June 2005
[Map: Jodrell Bank UK, Dwingeloo (DWDM link), Medicina Italy, Torun Poland]

Slide 45: e-VLBI UDP Data Streams

Slide 46: UDP Performance: 3 Flows on GÉANT
- Throughput over a 5-hour run:
  - Jodrell -> JIVE (2.0 GHz dual Xeon -> 2.4 GHz dual Xeon): 670-840 Mbit/s
  - Medicina (Bologna) -> JIVE (800 MHz PIII -> mark623 1.2 GHz PIII): 330 Mbit/s, limited by the sending PC
  - Torun -> JIVE (2.4 GHz dual Xeon -> mark575 1.2 GHz PIII): 245-325 Mbit/s, limited by security policing (>400 Mbit/s -> 20 Mbit/s)?
- Over a 50-minute window the throughput oscillates with a period of ~17 min
- [Plots (14 Jun 05): received wire rate vs time, in 10 s steps]

Slide 47: UDP Performance: 3 Flows on GÉANT - Packet Loss & Re-ordering
- Jodrell (2.0 GHz Xeon): loss 0-12%, re-ordering significant
- Medicina (800 MHz PIII): loss ~6%, re-ordering insignificant
- Torun (2.4 GHz Xeon): loss 6-12%, re-ordering insignificant
- [Plots (14 Jun 05): packets lost and re-ordered vs time, in 10 s steps]

Slide 48: 18-Hour Flows on UKLight, Jodrell-JIVE, 26 June 2005
- Throughput Jodrell -> JIVE (2.4 GHz dual Xeon -> 2.4 GHz dual Xeon): 960-980 Mbit/s; traffic runs through SURFnet
- Packet loss: only 3 groups of 10-150 lost packets each, with no loss the rest of the time
- Packet re-ordering: none
- [Plots (man03-jivegig1, 26 Jun 05): received wire rate and packet loss vs time]

Slide 49: Summary, Conclusions
- The host is critical: motherboards, NICs, RAID controllers and disks matter
  - The NICs should be well designed: they should use 64 bit 133 MHz PCI-X (66 MHz PCI can be OK); the NIC/drivers need efficient CSR access, clean buffer management and good interrupt handling
  - Worry about the CPU-memory bandwidth as well as the PCI bandwidth - data crosses the memory bus at least 3 times
  - Separate the data transfers: use motherboards with multiple 64 bit PCI-X buses; 32 bit 33 MHz is too slow for Gigabit rates, and 64 bit 33 MHz is already >80% used
  - Choose a modern high-throughput RAID controller; consider software RAID0 across hardware RAID5 controllers
  - Plenty of CPU power is needed for sustained 1 Gbit/s transfers
- Packet loss is a killer: check the campus links & equipment, and the access links to the backbones
- The new TCP stacks are stable and give better response & performance, but you still need to set the TCP buffer sizes, and check other kernel settings, e.g. window scaling
- Application architecture & implementation are also important
- The interaction between hardware, protocol processing and the disk sub-system is complex
(MB-NG)
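A minimal sketch of the buffer-size point in the conclusions: size the socket buffers to the bandwidth-delay product before connecting. The 177 ms rtt is the London-Chicago-London path from slide 17, which gives the ~22 MByte socket size used at SC2004; the kernel must also allow buffers this large (net.core.rmem_max / wmem_max and the net.ipv4.tcp_rmem / tcp_wmem sysctls, with window scaling enabled).

```c
/* Size the TCP socket buffers to the bandwidth-delay product. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    const double rate_bps = 1e9;             /* 1 Gbit/s path              */
    const double rtt_s    = 0.177;           /* e.g. London-Chicago-London */
    int bdp = (int)(rate_bps * rtt_s / 8);   /* ~22 Mbyte, as at SC2004    */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof bdp) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof bdp) < 0)
        perror("SO_RCVBUF");

    int actual; socklen_t len = sizeof actual;
    getsockopt(s, SOL_SOCKET, SO_SNDBUF, &actual, &len);
    printf("asked for %d bytes, kernel granted %d\n", bdp, actual);
    /* ... connect() and transfer as usual ... */
    return 0;
}
```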
Slide 50: More Information - Some URLs
- Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
- UKLight web site: http://www.uklight.ac.uk
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
- Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/ ; "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Slide 51: Any Questions?

Slide 52: Backup Slides

Slide 53: Latency Measurements
- UDP/IP packets are sent between back-to-back systems: they are processed in a similar manner to TCP/IP but are not subject to the flow-control & congestion-avoidance algorithms; the UDPmon test program was used
- Latency: round-trip times measured with request-response UDP frames, as a function of frame size
- The slope is given by s = the sum over the data paths of 1/(db/dt), where the data path is: mem-mem copy(s) + PCI + Gigabit Ethernet + PCI + mem-mem copy(s)
- The intercept indicates the processing times + hardware latencies
- Histograms of 'singleton' measurements tell us about the behaviour of the IP stack, the way the hardware operates, and interrupt coalescence

Slide 54: Throughput Measurements
- UDP throughput: send a controlled stream of UDP frames spaced at regular intervals (a paced-sender sketch follows after the next slide)
- Protocol between sender and receiver: zero the statistics (OK / done); send a given number of data frames of n bytes at regular wait-time intervals; the receiver histograms the inter-packet time and records the time to receive, the sender records the time to send; then fetch the remote statistics - number received, number lost + loss pattern, number out-of-order, CPU load & number of interrupts, 1-way delay - and signal the end of the test (OK / done)

Slide 55: PCI Bus & Gigabit Ethernet Activity
- PCI activity measured with a logic analyser: PCI probe cards in the sending and receiving PCs and a Gigabit Ethernet fibre probe card in between
- Possible bottlenecks: CPU, chipset, memory, PCI bus, NIC
- [Logic analyser display]
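As promised under slide 54, a sketch of the paced UDP sender at the heart of the throughput test. The loss/reorder accounting lives in the receiver and is not shown; the destination host, port and packet count are hypothetical, and the microsecond pacing is done by spinning, as a simple stand-in for UDPmon's own timing.

```c
/* Paced UDP sender sketch: send npkt frames with a fixed inter-packet
 * spacing, carrying a sequence number for the receiver's loss/reorder
 * statistics, then report the achieved user-data rate. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    const int npkt = 10000, size = 1472;      /* 1472-byte payloads        */
    const double spacing_us = 15.0;           /* inter-packet wait         */
    char buf[1472] = { 0 };
    struct sockaddr_in dst = { 0 };

    int s = socket(AF_INET, SOCK_DGRAM, 0);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(14196);                     /* hypothetical */
    inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr); /* hypothetical */

    double start = now_us(), next = start;
    for (int i = 0; i < npkt; i++) {
        memcpy(buf, &i, sizeof i);            /* sequence number           */
        sendto(s, buf, size, 0, (struct sockaddr *)&dst, sizeof dst);
        next += spacing_us;
        while (now_us() < next)               /* spin for us-level pacing  */
            ;
    }
    double elapsed = now_us() - start;
    printf("sent %d x %d bytes in %.0f us -> %.1f Mbit/s user data\n",
           npkt, size, elapsed, npkt * size * 8.0 / elapsed);
    return 0;
}
```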
Slide 56: End Hosts & NICs, CERN-nat-Manc
- UDP packets are used to characterise the host, NIC & network: request-response latency, throughput, packet loss, re-ordering
- SuperMicro P4DP8 motherboard, dual 2.2 GHz Xeon CPUs, 400 MHz system bus, 64 bit 66 MHz PCI / 133 MHz PCI-X bus
- [Plots (pcatb121-nat-gig6, 13 Aug 04): latency histograms for 256, 512 and 1400-byte packets; received wire rate, % packet loss and number re-ordered vs inter-frame spacing for 50-1472 byte packets]
- The network can sustain 1 Gbit/s of UDP traffic
- The average server can lose the smaller packets
- Packet loss is caused by lack of power in the PC receiving the traffic
- Out-of-order packets are due to the WAN routers; lightpaths look like extended LANs and show no re-ordering

Slide 57: TCP (Reno) - Details
- Time for TCP to recover its throughput after one lost packet:
      tau = C * RTT^2 / (2 * MSS)
- [Plot: recovery time (s) vs rtt (0-200 ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links, with markers at UK 6 ms, Europe 20 ms and USA 150 ms]
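Worked numbers for the recovery-time formula above, for a 1 Gbit/s path and the example rtts on the slide; the MSS of 1460 bytes is an assumption for the calculation.

```c
/* tau = C * RTT^2 / (2 * MSS) for a 1 Gbit/s link. */
#include <stdio.h>

int main(void)
{
    const double C = 1e9;                            /* link capacity, bit/s */
    const double mss_bits = 1460 * 8;                /* assumed MSS          */
    const double rtts[]  = { 0.006, 0.020, 0.150 };  /* UK, Europe, USA      */
    const char *where[]  = { "UK 6 ms", "Europe 20 ms", "USA 150 ms" };

    for (int i = 0; i < 3; i++)
        printf("%-13s tau = %8.1f s\n", where[i],
               C * rtts[i] * rtts[i] / (2 * mss_bits));
    return 0;   /* ~1.5 s, ~17 s, ~960 s (about 16 minutes) */
}
```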
Slide 58: Network & Disk Interactions (backup) - Disk Write and Kernel CPU Load
- Memory-to-disk write alone: 1735 Mbit/s - the kernel load tends to be in one die
- Disk write + 1500-byte UDP: 1218 Mbit/s - both dies at ~80%
- Disk write + CPU load: 1334 Mbit/s
- Disk write + CPU memory-bandwidth load: 1341 Mbit/s
- Annotations on the panels: one CPU at ~60% with the other at ~20%; large user-mode usage; all CPUs saturated in user mode; below the cut line (y = 178 - 1.05x) corresponds to high bandwidth, and high bandwidth corresponds to die 1 being used
- [Plots: write throughput vs trial number; % CPU system mode L3+4 vs L1+2 for 8k and 64k writes in each case]