High Performance Networking for ALL
R. Hughes-Jones, Manchester - GridPP Meeting, Edinburgh, 4-5 Feb 04

Members of GridPP are in many network collaborations, with close links to:
- MB-NG
- SLAC
- UKERNA, SURFNET and other NRNs
- Dante
- Internet2
- Starlight, Netherlight
- GGF
- RIPE
- Industry ...

Network Monitoring [1] - Architecture
- DataGrid WP7 code extended by Gareth (Manc)
- Technology transfer to UK e-Science; developed by Mark Lees (DL)
- Fed back into DataGrid by Gareth
- Links to: GGF NM-WG, Dante, Internet2
- Characteristics, schema & web services
Success

Network Monitoring [2]
- 24 Jan to 4 Feb 04, TCP iperf RAL to HEP sites: only 2 sites >80 Mbit/s
- 24 Jan to 4 Feb 04, TCP iperf DL to HEP sites
HELP!

High bandwidth, long distance... Where is my throughput?
Robin Tasker, CCLRC, Daresbury Laboratory, UK [r.tasker@dl.ac.uk]
DataTAG is a project sponsored by the European Commission - EU Grant IST-2001-32459
RIPE-47, Amsterdam, 29 January 2004

Throughput... What's the problem?
One Terabyte of data transferred in less than an hour.
On February 27-28 2003, the transatlantic DataTAG network was extended: CERN - Chicago - Sunnyvale (>10000 km). For the first time, a terabyte of data was transferred across the Atlantic in less than one hour using a single TCP (Reno) stream. The transfer ran from Sunnyvale to Geneva at a rate of 2.38 Gbit/s.

Internet2 Land Speed Record
On October 1 2003, DataTAG set a new Internet2 Land Speed Record by transferring 1.1 Terabytes of data in less than 30 minutes from Geneva to Chicago across the DataTAG provision, corresponding to an average rate of 5.44 Gbit/s using a single TCP (Reno) stream.

So how did we do that? Management of the end-to-end connection:
- Memory-to-memory transfer; no disk system involved
- Processor speed and system bus characteristics
- TCP configuration - window size and frame size (MTU)
- Network Interface Card and associated driver and their configuration
- End-to-end "no loss" environment from CERN to Sunnyvale!
- At least a 2.5 Gbit/s capacity pipe on the end-to-end path
- A single TCP connection on the end-to-end path
- No real user application
That is to say - not the usual user experience!

Realistically - what's the problem & why do network research?
End system issues:
- Network Interface Card and driver and their configuration
- TCP and its configuration
- Operating system and its configuration
- Disk system
- Processor speed
- Bus speed and capability
Network infrastructure issues:
- Obsolete network equipment
- Configured bandwidth restrictions
- Topology
- Security restrictions (e.g. firewalls)
- Sub-optimal routing
Transport protocols
Network capacity and the influence of others!
- Many, many TCP connections
- Mice and elephants on the path
- Congestion

End Hosts: Buses, NICs and Drivers
Use UDP packets to characterise the Intel PRO/10GbE Server Adaptor:
- SuperMicro P4DP8-G2 motherboard
- Dual Xeon 2.2 GHz CPUs
- 400 MHz system bus
- 133 MHz PCI-X bus
Measure throughput, latency and bus activity (a minimal UDP probe sketch follows this slide).
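The UDP characterisation above follows the UDPmon idea: send a stream of evenly spaced UDP frames and see what arrives, at what rate, and with what spacing. Below is a minimal sketch of that approach, assuming illustrative packet sizes, counts and a port number of my choosing - it is not the real UDPmon code.

    # Minimal sketch of a UDPmon-style throughput/spacing probe (illustrative only;
    # packet size, count and port are assumptions, not the real UDPmon tool).
    # Usage:  udp_probe.py recv        on the receiving host
    #         udp_probe.py send <dest> [wait_us]   on the sending host
    import socket, sys, time

    PKT_SIZE = 1472          # UDP payload that fits a 1500-byte Ethernet frame
    N_PKTS   = 10000
    PORT     = 14159         # arbitrary

    def sender(dest, wait_us=0.0):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = b'\0' * PKT_SIZE
        t0 = time.perf_counter()
        for i in range(N_PKTS):
            s.sendto(payload, (dest, PORT))
            # busy-wait to give a controlled inter-packet spacing
            while (time.perf_counter() - t0) < (i + 1) * wait_us * 1e-6:
                pass
        s.sendto(b'END', (dest, PORT))   # end marker (may itself be lost - it's UDP)

    def receiver():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(('', PORT))
        arrivals = []
        while True:
            data, _ = s.recvfrom(65535)
            now = time.perf_counter()
            if data == b'END':
                break
            arrivals.append(now)
        if len(arrivals) > 1:
            span = arrivals[-1] - arrivals[0]
            rate = (len(arrivals) - 1) * PKT_SIZE * 8 / span / 1e6
            gaps = [(b - a) * 1e6 for a, b in zip(arrivals, arrivals[1:])]
            print(f"received {len(arrivals)}/{N_PKTS} packets, "
                  f"{rate:.1f} Mbit/s, mean spacing {sum(gaps)/len(gaps):.1f} us")

    if __name__ == '__main__':
        if sys.argv[1] == 'recv':
            receiver()
        else:
            sender(sys.argv[2], wait_us=float(sys.argv[3]) if len(sys.argv) > 3 else 0.0)

Sweeping the inter-packet wait on the sender and recording the received wire rate and spacing histogram gives exactly the kind of curves shown later for the VLBI tests.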
End Hosts: Understanding NIC Drivers
Linux driver basics - TX:
- Application system call
- Encapsulation in UDP/TCP and IP headers
- Enqueue on the device send queue
- Driver places the information in the DMA descriptor ring
- NIC reads the data from main memory via DMA and sends it on the wire
- NIC signals to the processor that the TX descriptor has been sent
Linux driver basics - RX:
- NIC places the data in main memory via DMA to a free RX descriptor
- NIC signals that the RX descriptor has data
- Driver passes the frame to the IP layer and cleans the RX descriptor
- IP layer passes the data to the application
Linux NAPI driver model:
- On receiving a packet, the NIC raises an interrupt
- Driver switches off RX interrupts and schedules an RX DMA ring poll
- Frames are pulled off the DMA ring and processed up to the application
- When all frames are processed, RX interrupts are re-enabled
- Dramatic reduction in RX interrupts under load
"Improving the performance of a Gigabit Ethernet driver under Linux"
http://datatag.web.cern.ch/datatag/papers/drafts/linux_kernel_map/

Protocols: TCP (Reno) - Performance
AIMD and high bandwidth, long distance networks: the poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm.
- For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (additive increase, a = 1)
- For each window experiencing loss: cwnd -> cwnd - b*cwnd (multiplicative decrease, b = 1/2)

Protocols: HighSpeed TCP & Scalable TCP
Adjusting the AIMD algorithm of TCP Reno (rules as above):
- HighSpeed TCP: a and b vary with the current cwnd. a increases more rapidly at larger cwnd, so the connection returns to the 'optimal' cwnd for the network path sooner; b decreases less aggressively and, as a consequence, so does cwnd. The effect is a smaller drop in throughput after loss.
- Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd, chosen so that the increase is greater than TCP Reno's and the decrease on loss is smaller than TCP Reno's.
(A toy cwnd recovery comparison follows the throughput measurements below.)

Protocols: HighSpeed TCP & Scalable TCP - implementations
- HighSpeed TCP implemented by Gareth (Manc) - Success
- Scalable TCP implemented by Tom Kelly (Camb)
- Integration of the stacks into the DataTAG kernel: Yee (UCL) + Gareth

Some Measurements of Throughput: CERN - SARA
Using the GÉANT backup link; 1 GByte file transfers (plots show interface rate and receive rate against time; blue = data, red = TCP ACKs):
- Standard TCP, txlen 100, 25 Jan 03: average throughput 167 Mbit/s - users see 5-50 Mbit/s!
- HighSpeed TCP, txlen 2000, 26 Jan 03: average throughput 345 Mbit/s
- Scalable TCP, txlen 2000, 27 Jan 03: average throughput ~340 Mbit/s
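To make the AIMD rules above concrete, here is a toy recovery-time comparison. Reno is modelled exactly as on the slide (a = 1, b = 1/2); for Scalable TCP the commonly quoted constants a = 0.01 per ACK and b = 0.125 are assumed, and HighSpeed TCP's cwnd-dependent tables are left out. The 2.5 Gbit/s, 150 ms example path is illustrative, not a measurement from this talk.

    # Toy illustration of the cwnd update rules on the "TCP (Reno)" and
    # "HighSpeed TCP & Scalable TCP" slides.  Only Reno and Scalable TCP are
    # modelled; a = 0.01, b = 0.125 for Scalable TCP are the values usually
    # quoted for Kelly's proposal and are assumptions here.

    def reno_rtts_to_recover(cwnd):
        """RTTs for Reno to climb back to cwnd after one loss (~1 segment/RTT growth)."""
        target, w, rtts = cwnd, cwnd * 0.5, 0       # multiplicative decrease, b = 1/2
        while w < target:
            w += 1.0                                # additive increase, a = 1 per RTT
            rtts += 1
        return rtts

    def scalable_rtts_to_recover(cwnd, a=0.01, b=0.125):
        """RTTs for Scalable TCP: cwnd grows by a*cwnd per RTT, drops by b*cwnd on loss."""
        target, w, rtts = cwnd, cwnd * (1 - b), 0
        while w < target:
            w *= (1 + a)                            # +1% of cwnd per RTT (a per ACK)
            rtts += 1
        return rtts

    # Example: a 2.5 Gbit/s path, 150 ms RTT, 1500-byte segments
    cwnd = int(2.5e9 * 0.150 / (1500 * 8))          # ~31250 segments in flight
    print("cwnd ~", cwnd, "segments")
    print("Reno recovery:    ", reno_rtts_to_recover(cwnd), "RTTs")
    print("Scalable recovery:", scalable_rtts_to_recover(cwnd), "RTTs")

On such a path Reno needs thousands of RTTs (tens of minutes) to refill the pipe after a single loss, while the Scalable rules recover in a handful of RTTs - which is the point of the two slides above.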
Users, The Campus & the MAN [1]  (Pete White, Pat Meyrs)
- NNW to SJ4 access: 2.5 Gbit PoS - hits 1 Gbit / 50%
- Man to NNW access: 2 * 1 Gbit Ethernet

Users, The Campus & the MAN [2]
(Traffic plots: ULCC-JANET traffic on 30/1/2004, in from SJ4 / out to SJ4, Mbit/s against time of day; LMN to site 1 and LMN to site 2, each with 1 Gbit Ethernet access, in/out traffic over 24-31 Jan 2004.)
Message: continue to work with your network group.
- Understand the traffic levels
- Understand the site network topology

10 GigEthernet: Tuning PCI-X
(Logic analyser plots of PCI-X activity - CSR accesses and data bursts - for a transfer of 16114 bytes with mmrbc set to 512, 1024, 2048 and 4096 bytes; see the burst-count sketch at the end of this slide group.)

10 GigEthernet at SC2003 BW Challenge (Phoenix)
- Three server systems with 10 GigEthernet NICs
- Used the DataTAG altAIMD stack, 9000 byte MTU
- Streams from the SLAC/FNAL booth in Phoenix to:
  - Palo Alto PAIX, 17 ms rtt
  - Chicago Starlight, 65 ms rtt
  - Amsterdam SARA, 175 ms rtt
(Plots: router traffic to LA/PAIX and to Abilene; throughput against time on 19 Nov 03, 15:59-17:25, for Phoenix-PAIX with HS-TCP and Scalable TCP, and for Phoenix-Chicago and Phoenix-Amsterdam, reaching about 10 Gbit/s in aggregate. The TCP window these paths need is set by rtt * bandwidth - see the bandwidth-delay sketch at the end of this slide group.)

Helping Real Users [1]: Radio Astronomy VLBI
- Proof of concept with NRNs & GEANT
- 1024 Mbit/s, 24 x 7 - NOW

VLBI Project: Throughput, Jitter & 1-way Delay
- 1472 byte packets, Manchester -> Dwingeloo (JIVE)
- Received wire rate against spacing between frames (Gnt5-DwMk5 11 Nov 03 / DwMk5-Gnt5 13 Nov 03)
- Jitter histogram, 1472 bytes, w=50 (Gnt5-DwMk5 28 Oct 03): FWHM 22 µs (back-to-back 3 µs)
- 1-way delay against packet number, 1472 bytes, w12 (Gnt5-DwMk5 21 Oct 03) - note the packet loss (points with zero 1-way delay)
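A back-of-envelope view of the mmrbc tuning above: the 16114-byte transfer must be split into mmrbc-sized bursts, each paying a fixed transaction overhead, so fewer, larger bursts mean a more efficient bus. The overhead figure in the sketch is an assumed constant purely for illustration; the real cost depends on the NIC and chipset.

    # Rough burst-count arithmetic behind the "Tuning PCI-X" plots: a 16114-byte
    # transfer split into maximum-memory-read-byte-count (mmrbc) sized bursts.
    # OVERHEAD_CYCLES is a guessed per-burst protocol cost for illustration only.
    import math

    TRANSFER = 16114            # bytes moved per frame in the slide's example
    BUS_BYTES_PER_CYCLE = 8     # 64-bit PCI-X
    OVERHEAD_CYCLES = 10        # assumed address/attribute/turnaround cost per burst

    for mmrbc in (512, 1024, 2048, 4096):
        bursts = math.ceil(TRANSFER / mmrbc)
        data_cycles = math.ceil(TRANSFER / BUS_BYTES_PER_CYCLE)
        total = data_cycles + bursts * OVERHEAD_CYCLES
        print(f"mmrbc {mmrbc:5d}: {bursts:2d} bursts, bus efficiency ~{data_cycles/total:.0%}")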
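The rtt figures on the SC2003 slide also set the TCP window needed to keep a 10 Gbit/s stream full, via the usual rtt * bandwidth product - the same arithmetic that gives the 750 kbyte figure quoted later for the MB-NG GridFTP tests. A small sketch:

    # Bandwidth-delay products for the SC2003 paths quoted above
    # (17 ms to PAIX, 65 ms to Chicago, 175 ms to Amsterdam).
    # The TCP window (and so the socket buffer) has to cover rtt * bandwidth
    # to keep the pipe full.
    def bdp_bytes(bandwidth_bit_s, rtt_s):
        return bandwidth_bit_s * rtt_s / 8

    paths = {"Phoenix-PAIX": 0.017, "Phoenix-Chicago": 0.065, "Phoenix-Amsterdam": 0.175}
    for name, rtt in paths.items():
        bdp = bdp_bytes(10e9, rtt)        # 10 Gbit/s streams in the BW challenge
        print(f"{name:18s} rtt {rtt*1e3:5.0f} ms  ->  window ~ {bdp/2**20:6.1f} MBytes")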
VLBI Project: Packet Loss Distribution
- Measure the time between lost packets in the time series of packets sent (lost 1410 packets in 0.6 s)
- Is it a Poisson process? Assume the process is stationary, lambda(t) = lambda, and use the probability density function P(t) = lambda * e^(-lambda*t)
- Mean lambda = 2360 /s [426 µs between losses]
- Histogram of the time between lost frames (12 µs bins), measured against the Poisson expectation
- On a log plot the measured slope is -0.0028 against an expected -0.0024 (fits y = 41.832e^(-0.0028x) and y = 39.762e^(-0.0024x)), so an additional process could be involved
(A numeric check of this fit follows this slide group.)

VLBI Traffic Flows - only testing!
- Manchester - NetNorthWest - SuperJANET access links
- Two 1 Gbit/s access links: SJ4 to GÉANT, GÉANT to SurfNet

Throughput & PCI Transactions on the Mark5 PC
The Mark5 uses a Supermicro P3TDLE:
- 1.2 GHz PIII, memory bus 133/100 MHz
- 2 * 64bit 66 MHz PCI, 4 * 32bit 33 MHz PCI
- Ethernet NIC, IDE disc pack, SuperStor input card
- Test: read/write n bytes with a programmable wait time, watched on a logic analyser display

PCI Activity: Read Multiple Data Blocks, 0 wait
- Read 999424 bytes; each data block: set up CSRs, data movement, update CSRs
- For 0 wait between reads: data blocks (131,072 bytes) are ~600 µs long, the read takes ~6 ms, then a 744 µs gap
- PCI transfer rate 1188 Mbit/s (148.5 MBytes/s); Read_sstor rate 778 Mbit/s (97 MBytes/s)
- PCI bus occupancy: 68.44%
- Concern about Ethernet traffic: 64 bit 33 MHz PCI needs ~82% occupancy for 930 Mbit/s, so expect ~360 Mbit/s
(Plot annotations: CSR access, PCI data transfer, 4096 byte bursts.)

PCI Activity: Read Throughput
- Flat, then a 1/t dependence; ~860 Mbit/s for read blocks >= 262144 bytes
- CPU load ~20%; concern about the CPU load needed to drive a Gigabit link

Helping Real Users [2]: HEP
BaBar & CMS application throughput

BaBar Case Study: Disk Performance
BaBar disk server:
- Tyan Tiger S2466N motherboard, 1 * 64bit 66 MHz PCI bus
- Athlon MP2000+ CPU, AMD-760 MPX chipset
- 3Ware 7500-8 RAID5, 8 * 200 GB Maxtor IDE 7200 rpm disks
- Note the VM parameter readahead max
- Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
- Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]
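A quick numeric check of the Poisson fit on the packet loss slide, using only the lambda = 2360 /s quoted there; the estimate_lambda helper at the end is a hypothetical illustration of how lambda would be extracted from a recorded loss time series.

    # Consistency check of the Poisson fit on the "Packet Loss Distribution" slide:
    # for a stationary Poisson loss process the gaps between lost packets are
    # exponential, P(t) = lam * exp(-lam * t), so the mean gap is 1/lam and a log
    # plot of the histogram has slope -lam.

    lam = 2360.0                       # losses per second, from the slide
    mean_gap_us = 1e6 / lam
    slope_per_us = -lam / 1e6
    print(f"mean time between losses ~ {mean_gap_us:.0f} us (slide quotes 426 us)")
    print(f"expected log-plot slope  ~ {slope_per_us:.4f} per us "
          f"(slide: expect -0.0024, measured -0.0028)")

    # With the recorded loss times one would estimate lam directly, e.g.:
    def estimate_lambda(loss_times_s):
        gaps = [b - a for a, b in zip(loss_times_s, loss_times_s[1:])]
        return 1.0 / (sum(gaps) / len(gaps))

The measured slope (-0.0028) being steeper than -lambda is what motivates the slide's remark that an additional process could be involved.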
BaBar: Serial ATA RAID Controllers
Compared a 3Ware 66 MHz PCI controller with an ICP 66 MHz PCI controller.
(Four plots: read and write throughput for RAID5 over 4 SATA disks on each controller, Mbit/s against file size from 0 to 2000 MBytes, for readahead max settings of 31, 63, 127, 256, 512 and 1200.)

BaBar Case Study: RAID Throughput & PCI Activity
- 3Ware 7500-8 RAID5, parallel EIDE; the 3Ware card forces the PCI bus to 33 MHz
- BaBar Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
- Disk-to-disk throughput with bbcp: 40-45 MBytes/s (320-360 Mbit/s)
- PCI bus effectively full!
(Plots: PCI activity for reads from and writes to the RAID5 disks.)

MB-NG BaBar Case Study
(Diagram: the MB-NG SuperJANET4 development network linking Manchester (MAN/MCC) and RAL through OSM1OC48POS-SS interfaces; PCs with 3ware RAID5; Gigabit Ethernet, 2.5 Gbit PoS access, 2.5 Gbit PoS core, MPLS, separate admin domains; SJ4 development hosts alongside.)
Status / tests:
- Manchester host has the DataTAG TCP stack; RAL host now available
- BaBar-BaBar mem-mem
- BaBar-BaBar real data over MB-NG and over SJ4
- mbng-mbng real data over MB-NG and over SJ4
- Different TCP stacks already installed

MB-NG: Study of Applications
(Diagram: the same MB-NG SuperJANET4 development network, here between Manchester (MAN/MCC) and UCL, with PCs using 3ware RAID0; Gigabit Ethernet, 2.5 Gbit PoS access and core, MPLS, separate admin domains.)

MB-NG: 24 Hours HighSpeed TCP mem-mem
- TCP mem-mem lon2-man1
- Interrupt coalescence: Tx 64, Tx-abs 64, Rx 64, Rx-abs 128
- 941.5 Mbit/s ± 0.5 Mbit/s

MB-NG: Gridftp Throughput
- HighSpeed TCP, interrupt coalescence 64/128, txqueuelen 2000
- TCP buffer 1 Mbyte (rtt * BW = 750 kbytes) - see the buffer-sizing sketch at the end of this slide group
- Plots of interface throughput, ACKs received and data moved: 520 Mbit/s
- Same for back-to-back tests - so it's not that simple!

MB-NG: Gridftp Throughput + Web100
- Throughput alternates between 600-800 Mbit/s and zero
- cwnd is smooth; no duplicate ACKs, send stalls or timeouts

MB-NG: http Data Transfers with HighSpeed TCP
- Apache web server out of the box!
- Prototype client using the curl http library (a rough pycurl stand-in follows this slide group)
- 1 Mbyte TCP buffers, 2 GByte file
- Throughput 72 MBytes/s
- cwnd shows some variation; no duplicate ACKs, send stalls or timeouts
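The "1 Mbyte buffer against rtt * BW = 750 kbytes" point on the GridFTP slide can be sketched as follows. The 6 ms rtt is inferred from 750 kbytes at 1 Gbit/s and is an assumption; the setsockopt calls show the standard way to request matching socket buffers, which the kernel may still clamp to its configured maxima.

    # Sketch of matching the TCP socket buffer to rtt * bandwidth, as on the
    # "Gridftp Throughput" slide (1 Mbyte buffer against a ~750 kbyte BDP).
    # The 6 ms rtt is inferred from 750 kbytes at 1 Gbit/s and is an assumption.
    import socket

    def bdp_bytes(bandwidth_bit_s, rtt_s):
        return int(bandwidth_bit_s * rtt_s / 8)

    bdp = bdp_bytes(1e9, 0.006)
    print(f"BDP ~ {bdp/1e3:.0f} kbytes -> round up to a 1 Mbyte socket buffer")

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request 1 Mbyte send/receive buffers; the kernel may clamp these to its
    # net.core.rmem_max / wmem_max limits, so read the values back to check.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
    print("granted:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
          s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))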
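The http transfer slide used a prototype client built on the curl http library. As a rough stand-in (not the original C client), the same fetch-and-measure idea in Python with pycurl, with a placeholder URL:

    # Rough stand-in for the prototype curl-library client on the "http data
    # transfers" slide: pull a large file over HTTP and report the mean rate.
    # pycurl replaces the original C client and the URL is a placeholder, so
    # treat this as an illustration of the approach only.
    import pycurl

    URL = "http://server.example.org/2GB-test-file"    # hypothetical test file

    with open("download.dat", "wb") as f:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, URL)
        c.setopt(pycurl.WRITEDATA, f)                   # stream straight to disk
        c.perform()
        size = c.getinfo(pycurl.SIZE_DOWNLOAD)          # bytes transferred
        speed = c.getinfo(pycurl.SPEED_DOWNLOAD)        # bytes per second
        c.close()

    print(f"moved {size/2**20:.0f} MBytes at {speed/2**20:.1f} MBytes/s "
          f"({speed*8/1e6:.0f} Mbit/s)")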
More Information - Some URLs
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/net
- Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html