20th-21st June 2005, NeSC Edinburgh
End-user systems: NICs, Motherboards, Disks, TCP Stacks & Applications
Richard Hughes-Jones
Work reported is from many networking collaborations (MB-NG)
http://gridmon.dl.ac.uk/nfnn/

Network Performance Issues
- End-system issues:
  - Network Interface Card and driver, and their configuration
  - Processor speed
  - Motherboard configuration, bus speed and capability
  - Disk system
  - TCP and its configuration
  - Operating system and its configuration
- Network infrastructure issues:
  - Obsolete network equipment
  - Configured bandwidth restrictions
  - Topology
  - Security restrictions (e.g. firewalls)
  - Sub-optimal routing
  - Transport protocols
- Network capacity and the influence of others:
  - Congestion – group, campus, access links
  - Many, many TCP connections
  - Mice and elephants on the path

Methodology used in testing NICs & Motherboards

Latency Measurements
- UDP/IP packets sent between back-to-back systems
  - Processed in a similar manner to TCP/IP
  - Not subject to flow control & congestion avoidance algorithms
  - Used the UDPmon test program
- Latency:
  - Round-trip times measured using request-response UDP frames
  - Latency measured as a function of frame size
  - The slope is the sum of the inverse data-transfer rates over the data paths: s = sum over data paths of 1/(db/dt), i.e. mem-mem copy(s) + PCI + Gigabit Ethernet + PCI + mem-mem copy(s)
  - The intercept indicates processing times + hardware latencies
- Histograms of 'singleton' measurements tell us about:
  - Behaviour of the IP stack
  - The way the hardware operates
  - Interrupt coalescence

Throughput Measurements
- UDP throughput: send a controlled stream of UDP frames, n bytes each, spaced at a regular wait time
- Sender: zeroes the remote statistics, sends data frames at regular intervals, signals the end of test, then requests the remote statistics
- Receiver: records the inter-packet time (histogram), time to receive, CPU load & number of interrupts, and 1-way delay
- Statistics returned: number received, number lost + loss pattern, number out-of-order

PCI Bus & Gigabit Ethernet Activity
- PCI activity measured with a logic analyser: PCI probe cards in the sending PC, a Gigabit Ethernet fibre probe card, and PCI probe cards in the receiving PC
- Possible bottlenecks on each host: CPU, chipset, memory, PCI bus, NIC

"Server Quality" Motherboards
- SuperMicro P4DP8-2G (P4DP6)
  - Dual Xeon, 400/533 MHz front-side bus
  - 6 PCI / PCI-X slots on 4 independent PCI buses: 64-bit 66 MHz PCI, 100 MHz PCI-X, 133 MHz PCI-X
  - Dual Gigabit Ethernet
  - Adaptec AIC-7899W dual-channel SCSI
  - UDMA/100 bus-master EIDE channels, data transfer rates of 100 MB/s burst

"Server Quality" Motherboards
- Boston/Supermicro H8DAR
  - Two dual-core Opterons
  - 200 MHz DDR memory, theoretical bandwidth 6.4 Gbit
  - HyperTransport
  - 2 independent PCI buses: 133 MHz PCI-X
  - 2 Gigabit Ethernet
  - SATA
  - (PCI-e)

NIC & Motherboard Evaluations

SuperMicro 370DLE: Latency: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; RedHat 7.1, kernel 2.4.14
- PCI: 32-bit 33 MHz
  - Latency small (~62 µs) and well behaved
  - Latency slope 0.0286 µs/byte; expect 0.0232 µs/byte (PCI 0.00758 + GigE 0.008 + PCI 0.00758)
- PCI: 64-bit 66 MHz
  - Latency small (~56 µs) and well behaved
  - Latency slope 0.0231 µs/byte; expect 0.0118 µs/byte (PCI 0.00188 + GigE 0.008 + PCI 0.00188)
  - Possible extra data moves?
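The "expect" slopes quoted for the 370DLE above come straight from the per-byte cost of each data path the frame crosses. A minimal sketch in Python, using nominal bus rates (not measured values):

    def per_byte_us(rate_bytes_per_s):
        # Time to move one byte across a path, in microseconds.
        return 1e6 / rate_bytes_per_s

    GIGE = 125e6          # 1 Gbit/s Ethernet = 125 Mbyte/s
    PCI_32_33 = 4 * 33e6  # 32-bit PCI at 33 MHz, bytes/s
    PCI_64_66 = 8 * 66e6  # 64-bit PCI at 66 MHz, bytes/s

    def expected_slope(pci_rate):
        # Send PCI + Gigabit Ethernet + receive PCI.
        return per_byte_us(pci_rate) + per_byte_us(GIGE) + per_byte_us(pci_rate)

    print(f"32-bit 33 MHz PCI: {expected_slope(PCI_32_33):.4f} us/byte")  # ~0.0232
    print(f"64-bit 66 MHz PCI: {expected_slope(PCI_64_66):.4f} us/byte")  # ~0.0118

A measured slope well above the expected value, as for the 64-bit case here, points to extra data moves or per-byte processing on the host.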
SuperMicro 370DLE: Throughput: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; RedHat 7.1, kernel 2.4.14
- PCI: 32-bit 33 MHz
  - Max throughput 584 Mbit/s
  - No packet loss for frame spacings > 18 µs
- PCI: 64-bit 66 MHz
  - Max throughput 720 Mbit/s
  - No packet loss for frame spacings > 17 µs
  - Packet loss during the bandwidth drop
  - Receiver CPU 95-100% in kernel mode at small spacings

SuperMicro 370DLE: PCI: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI: 64-bit 66 MHz; RedHat 7.1, kernel 2.4.14
- 1400 bytes sent, wait 100 µs: the logic-analyser trace shows send setup, send PCI transfer, the packet on the Ethernet fibre, and the receive PCI transfer
  - ~8 µs on the PCI bus for send or receive
  - Stack & application overhead ~10 µs per node
  - Total ~36 µs per frame

Signals on the PCI bus
- 1472-byte packets every 15 µs, Intel Pro/1000
- PCI: 64-bit 33 MHz – data transfers give 82% bus usage
- PCI: 64-bit 66 MHz – 65% bus usage; the data transfers take half as long

SuperMicro 370DLE: PCI: SysKonnect
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI: 64-bit 66 MHz; RedHat 7.1, kernel 2.4.14
- 1400 bytes sent, wait 20 µs: frames on the Ethernet fibre at 20 µs spacing
- 1400 bytes sent, wait 10 µs: frames are back-to-back
- The 800 MHz CPU can drive at line speed – it cannot go any faster!
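The shape of these throughput curves can be reproduced with a simple model: the received wire rate is the full frame size on the wire divided by the transmit spacing, capped at line rate; the plateaux at 584 and 720 Mbit/s appear where the host or PCI bus, rather than the wire, becomes the limit. A sketch in Python, assuming nominal Ethernet/IP/UDP overheads:

    # Model of received wire rate vs transmit spacing for UDP frames.
    # Overheads assumed: 8 B UDP + 20 B IP + 18 B Ethernet header/CRC
    # + 20 B preamble and inter-frame gap.
    LINE_RATE = 1e9                  # Gigabit Ethernet, bits/s
    OVERHEAD = 8 + 20 + 18 + 20      # bytes added to the UDP payload on the wire

    def wire_rate_mbps(payload_bytes, spacing_us):
        wire_bits = (payload_bytes + OVERHEAD) * 8
        back_to_back_s = wire_bits / LINE_RATE        # minimum possible spacing
        spacing_s = max(spacing_us * 1e-6, back_to_back_s)
        return wire_bits / spacing_s / 1e6

    for spacing in (40, 30, 20, 15, 12):
        print(f"1472-byte frames, {spacing} us spacing: "
              f"{wire_rate_mbps(1472, spacing):.0f} Mbit/s")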
SuperMicro 370DLE: Throughput: Intel Pro/1000
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI: 64-bit 66 MHz; RedHat 7.1, kernel 2.4.14
- Max throughput 910 Mbit/s
- No packet loss for frame spacings > 12 µs; packet loss during the bandwidth drop
- CPU load 65-90% for spacings < 13 µs

SuperMicro 370DLE: PCI: Intel Pro/1000
- Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI: 64-bit 66 MHz; RedHat 7.1, kernel 2.4.14
- Request-response: 64-byte request sent, 1400-byte response returned
- Trace shows the send and receive PCI transfers, the interrupt delay, and the send/receive interrupt processing
- Demonstrates interrupt coalescence: no processing occurs directly after each transfer

SuperMicro P4DP6: Latency: Intel Pro/1000
- Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI: 64-bit 66 MHz; RedHat 7.2, kernel 2.4.19
- Latency vs message length shows some steps
- Overall slope 0.009 µs/byte; slope of the flat sections 0.0146 µs/byte; expect 0.0118 µs/byte
- Latency histograms for 64, 512, 1024 and 1400-byte packets: no variation with packet size, FWHM 1.5 µs – confirms the timing is reliable

SuperMicro P4DP6: Throughput: Intel Pro/1000
- Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI: 64-bit 66 MHz; RedHat 7.2, kernel 2.4.19
- Max throughput 950 Mbit/s, no packet loss
- Averages are misleading: CPU utilisation on the receiving PC was ~25% for packets > 1000 bytes and 30-40% for smaller packets

SuperMicro P4DP6: PCI: Intel Pro/1000
- Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI: 64-bit 66 MHz; RedHat 7.2, kernel 2.4.19
- 1400 bytes sent, wait 12 µs
- ~5.14 µs on the send PCI bus; PCI bus ~68% occupancy; ~3 µs on the PCI bus for the data receive
- CSR access inserts PCI STOPs: the NIC takes ~1 µs per CSR access – the CPU is faster than the NIC!
- Similar effect with the SysKonnect NIC
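A rough way to see how close the PCI bus is to saturation is to estimate the time the data transfer plus CSR accesses occupy the bus in each packet interval. A hedged sketch in Python: the nominal bus rate and the ~1 µs per CSR figure come from the measurements above, but the number of CSR accesses and the setup time per packet are illustrative assumptions:

    # Rough PCI occupancy estimate for one packet interval.
    PCI_RATE = 8 * 66e6          # 64-bit bus at 66 MHz, bytes/s

    def pci_occupancy(payload_bytes, spacing_us, csr_accesses=4, csr_us=1.0,
                      setup_us=1.0):
        """Fraction of the packet interval the PCI bus is busy.
        csr_accesses and setup_us are illustrative assumptions."""
        data_us = payload_bytes / PCI_RATE * 1e6
        busy_us = data_us + csr_accesses * csr_us + setup_us
        return busy_us / spacing_us

    # 1400-byte frames every 12 us, as in the P4DP6 measurement above:
    print(f"occupancy ~ {pci_occupancy(1400, 12):.0%}")

The estimate lands in the same region as the ~68% occupancy measured with the logic analyser, which is why shrinking the packet spacing much further leaves no headroom on the bus.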
SuperMicro P4DP8-G2: Throughput: Intel onboard
- Motherboard: SuperMicro P4DP8-G2; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.4 GHz; PCI-X: 64-bit; RedHat 7.3, kernel 2.4.19
- Max throughput 995 Mbit/s, no packet loss
- Averages are misleading: ~20% CPU utilisation on the receiver for packets > 1000 bytes, ~30% for smaller packets

Interrupt Coalescence: Throughput
- Intel Pro/1000 on the 370DLE
- Throughput of 1000-byte and 1472-byte packets measured for interrupt-coalescence settings of 5, 10, 20, 40, 64 and 100, as a function of the delay between transmitted packets

Interrupt Coalescence Investigations
- Kernel parameters set so that socket buffer size = rtt × BW
- TCP mem-mem, lon2-man1:
  - Tx 64, Tx-abs 64, Rx 0, Rx-abs 128: 820-980 Mbit/s ± 50 Mbit/s
  - Tx 64, Tx-abs 64, Rx 20, Rx-abs 128: 937-940 Mbit/s ± 1.5 Mbit/s
  - Tx 64, Tx-abs 64, Rx 80, Rx-abs 128: 937-939 Mbit/s ± 1 Mbit/s

Supermarket Motherboards and other "Challenges"

Tyan Tiger S2466N
- Motherboard: Tyan Tiger S2466N; PCI: one 64-bit 66 MHz slot; CPU: Athlon MP2000+; chipset: AMD-760 MPX
- The 3Ware controller forces the PCI bus to 33 MHz
- Tyan to MB-NG SuperMicro network mem-mem: 619 Mbit/s

IBM das: Throughput: Intel Pro/1000
- Motherboard: IBM das; chipset: ServerWorks CNB20LE; CPU: dual PIII 1 GHz; PCI: 64-bit 33 MHz; RedHat 7.1, kernel 2.4.14
- Max throughput 930 Mbit/s; no packet loss for spacings > 12 µs; clean behaviour; packet loss during the bandwidth drop
- 1400 bytes sent at 11 µs spacing: signals clean, ~9.3 µs on the send PCI bus, PCI bus ~82% occupancy, ~5.9 µs on the PCI bus for the data receive

Network switch limits behaviour
- End-to-end UDP packets from udpmon
- Only 700 Mbit/s throughput, with lots of packet loss
- The packet-loss distribution and the 1-way delay plots (1400-byte packets at 12 µs spacing) show that the throughput is being limited in the path
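When an element in the path caps the rate, the expected loss fraction at a given offered load is simply the excess over that cap. A small sketch in Python, assuming the ~700 Mbit/s limit seen above and the same nominal wire overheads used earlier:

    # Expected packet-loss fraction when a switch/link caps the path at LIMIT_MBPS.
    LIMIT_MBPS = 700                 # observed cap in the measurement above
    OVERHEAD = 8 + 20 + 18 + 20      # UDP + IP + Ethernet + preamble/gap, bytes

    def offered_mbps(payload_bytes, spacing_us):
        return (payload_bytes + OVERHEAD) * 8 / spacing_us   # bits per us == Mbit/s

    def expected_loss(payload_bytes, spacing_us):
        offered = offered_mbps(payload_bytes, spacing_us)
        return max(0.0, 1.0 - LIMIT_MBPS / offered)

    for spacing in (20, 15, 12):
        print(f"1472-byte frames every {spacing} us: "
              f"offered {offered_mbps(1472, spacing):.0f} Mbit/s, "
              f"expected loss {expected_loss(1472, spacing):.0%}")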
10 Gigabit Ethernet

10 Gigabit Ethernet: UDP Throughput
- A 1500-byte MTU gives ~2 Gbit/s
- Used a 16144-byte MTU, maximum user data length 16080 bytes
- DataTAG Supermicro PCs: dual 2.2 GHz Xeon, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s
- SLAC Dell PCs: dual 3.0 GHz Xeon, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s

10 Gigabit Ethernet: Tuning PCI-X
- 16080-byte packets every 200 µs, Intel PRO/10GbE LR adapter
- PCI-X bus occupancy measured as a function of mmrbc (maximum memory read byte count): 512, 1024, 2048 and 4096 bytes
- Each PCI-X sequence consists of the data transfer followed by the interrupt and CSR update
- Measured transfer times compared with the times expected from the PCI-X signals on the logic analyser
- Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s (kernel 2.6.1 #17, HP Itanium, Intel 10GbE, Feb 2004)

Different TCP Stacks

Investigation of new TCP Stacks
- The AIMD algorithm – standard TCP (Reno):
  - For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (additive increase, a = 1)
  - For each window experiencing loss: cwnd -> cwnd - b*cwnd (multiplicative decrease, b = 1/2)
- High Speed TCP:
  - a and b vary with the current cwnd, using a table
  - a increases more rapidly at larger cwnd, so the flow returns to the 'optimal' cwnd for the path sooner
  - b decreases less aggressively, so the cwnd – and hence the throughput – drops less on loss
- Scalable TCP:
  - a and b are fixed adjustments for the increase and decrease of cwnd
  - a = 1/100: the increase is greater than TCP Reno
  - b = 1/8: the decrease on loss is less than TCP Reno
  - Scalable over any link speed
- Fast TCP:
  - Uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
(The Reno and Scalable update rules are sketched in code below.)
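A minimal sketch of the two simplest update rules listed above, expressed per ACK and per loss event. Reno's a and b and Scalable's fixed 1/100 and 1/8 are taken from the slide; treating cwnd in units of segments is an illustrative convention:

    # Congestion-window update rules, cwnd in segments.

    def reno_on_ack(cwnd, a=1.0):
        # Additive increase: ~a segments per RTT, spread over the ACKs in a window.
        return cwnd + a / cwnd

    def reno_on_loss(cwnd, b=0.5):
        # Multiplicative decrease: halve the window.
        return cwnd * (1 - b)

    def scalable_on_ack(cwnd, a=0.01):
        # Fixed increase per ACK: cwnd grows by ~1% per RTT, i.e. exponentially.
        return cwnd + a

    def scalable_on_loss(cwnd, b=0.125):
        # Smaller back-off than Reno.
        return cwnd * (1 - b)

Per RTT, Reno adds about one segment regardless of how large cwnd is, while Scalable adds about cwnd/100, which is why Scalable's recovery time is independent of the window size and Reno's grows with it.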
Comparison of TCP Stacks
- TCP response function: throughput vs loss rate – further to the right means faster recovery
- Packets dropped in the kernel
- Measured on MB-NG (rtt 6 ms) and DataTAG (rtt 120 ms)

High Throughput Demonstrations
- London (Chicago) to Manchester (Geneva): dual Xeon 2.2 GHz hosts lon01 and man03
- 1 GbE into a Cisco 7609 at each end; Cisco GSRs across the 2.5 Gbit SDH MB-NG core
- Drop packets, send data with TCP, monitor TCP with Web100

High Performance TCP – MB-NG
- Drop 1 in 25,000; rtt 6.2 ms
- Standard, HighSpeed and Scalable stacks compared; recovery in 1.6 s

High Performance TCP – DataTAG
- Different TCP stacks tested on the DataTAG network; rtt 128 ms; drop 1 in 10^6
- High-Speed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 minutes

Disk and RAID Sub-Systems
- Based on a talk at NFNN 2004 by Andrew Sansum, Tier 1 Manager at RAL

Reliability and Operation
- The performance of a system that is broken is 0.0 MB/s!
- When staff are fixing broken servers they cannot spend their time optimising performance
- With a lot of systems, anything that can break – will break
- Typical ATA failure rate is 2-3% per annum at RAL, CERN and CASPUR: that's 30-45 drives per annum on the RAL Tier1, i.e. about one per week
- One failure for every 10^15 bits read; MTBF 400k hours
- RAID arrays may not handle block re-maps satisfactorily, or take so long that the system has given up
- RAID5 is not perfect protection – there is a finite chance of two disks failing!
- SCSI interconnects corrupt data or give CRC errors
- Filesystems (ext2) corrupt for no apparent cause (EXT2 errors)
- Silent data corruption happens (checksums are vital)
- Fixing data corruption can take a huge amount of time (> 1 week); data loss is possible

You need to Benchmark – but how?
- Andrew's golden rule: benchmark, benchmark, benchmark
- Choose a benchmark that matches your application (we use Bonnie++ and IOZONE):
  - Single stream of sequential I/O
  - Random disk accesses
  - Multi-stream – 30 threads is more typical
- Watch out for caching effects (use large files and small memory)
- Use a standard protocol that gives reproducible results, and document what you did for future reference
- Stick with the same benchmark suite/protocol as long as possible to gain familiarity – it is much easier to spot significant changes

Hard Disk Performance
- Factors affecting performance: seek time (rotation speed is a big component), transfer rate, cache size, retry count (vibration)
- Watch out for unusual disk behaviours: write/verify (the drive verifies writes for the first <n> writes), drive spin-down, etc.
- Storage Review is often worth reading: http://www.storagereview.com/

Impact of Drive Transfer Rate
- A slower drive may not cause significant sequential performance degradation in a RAID array
- This 5400 rpm drive was 25% slower than the 7200 rpm drive (read rate in MB/s measured for 1-32 threads)
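The benchmarking advice above can be illustrated with a minimal sequential-read timer. This is only a sketch – the real measurements in this talk used Bonnie++ and IOZONE – and, as the slide warns, the file must be much larger than RAM or the page cache will flatter the result:

    # Minimal sequential-read benchmark: read a file in fixed-size blocks
    # and report MB/s.  Use a file much larger than RAM to avoid cache effects.
    import sys, time

    def sequential_read_mb_s(path, block_bytes=64 * 1024):
        start = time.time()
        total = 0
        with open(path, "rb") as f:
            while True:
                block = f.read(block_bytes)
                if not block:
                    break
                total += len(block)
        elapsed = time.time() - start
        return total / elapsed / 1e6

    if __name__ == "__main__":
        print(f"{sequential_read_mb_s(sys.argv[1]):.1f} MB/s")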
Disk Parameters
- Distribution of seek time: measured mean 14.1 ms; published seek time 9 ms (quoted with rotational latency taken off)
- Head position & transfer rate: the outside of the disk has more blocks per track so is faster – 58 MB/s falling to 35 MB/s across the disk, a 40% effect
- From: www.storagereview.com

Drive Interfaces
- Comparison of drive interfaces; see http://www.maxtor.com/

RAID levels
- Choose an appropriate RAID level:
  - RAID 0 (stripe) is fastest but has no redundancy
  - RAID 1 (mirror) gives single-disk performance
  - RAID 2, 3 and 4 are not very common/supported – sometimes worth testing with your application
  - RAID 5 – read is almost as fast as RAID 0, write substantially slower
  - RAID 6 – extra parity information: good for unreliable drives but could have a considerable impact on performance; rare but potentially very useful for large IDE disk arrays
  - RAID 50 & RAID 10 – RAID 0 across 2 (or more) controllers

Kernel Versions
- Kernel versions can make an enormous difference – it is not unusual for I/O to be badly broken in Linux
- In the 2.4 series there were virtual-memory subsystem problems: only 2.4.5, some 2.4.9, and > 2.4.17 are OK
- Red Hat Enterprise 3 seems badly broken (although the most recent RHEL kernel may be better)
- The 2.6 kernel contains new functionality and may be better (although first tests show little difference)
- Always check after an upgrade that performance remains good

Kernel Tuning
- Tunable kernel parameters can improve I/O performance, but it depends what you are trying to achieve
- IRQ balancing: issues with IRQ handling on some Xeon chipset/kernel combinations (all IRQs handled on CPU zero); hyperthreading
- I/O schedulers – can tune for latency by minimising seeks:
  - /sbin/elvtune (number of sectors of writes before reads are allowed)
  - Choice of scheduler: elevator=xx orders I/O to minimise seeks and merges adjacent requests
- VM tuning on 2.4 – bdflush and readahead (a single thread helped):
  - sysctl -w vm.max-readahead=512
  - sysctl -w vm.min-readahead=512
  - sysctl -w vm.bdflush="10 500 0 0 3000 10 20 0"
- On 2.6 the readahead command is (default 256):
  - blockdev --setra 16384 /dev/sda

Example Effect of VM Tuneup
- Red Hat 7.3 read performance, default vs tuned, for 1-32 threads (g.prassas@rl.ac.uk)

IOZONE Tests
- Read performance, ext2 vs ext3: a 40% effect due to journal behaviour
- ext3 write tested with the writeback, ordered and journal modes; journal mode writes the data plus the journal

RAID Controller Performance
- RAID5 (striped with redundancy)
- Controllers tested: 3Ware 7506 (parallel, 66 MHz), 3Ware 7505 (parallel, 33 MHz), 3Ware 8506 (Serial ATA, 66 MHz), ICP (Serial ATA, 33/66 MHz)
- Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
- Disks: Maxtor 160 GB, 7200 rpm, 8 MB cache
- Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512
- Disk-to-memory read speeds and memory-to-disk write speeds measured (Stephen Dallison)
- Rates for the same PC with RAID0 (striped): read 1040 Mbit/s, write 800 Mbit/s
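The seek-time and transfer-rate numbers above translate directly into the gap between sequential and random throughput. A rough model in Python; the 64 kbyte request size is an illustrative assumption:

    # Effective throughput for random accesses: each request pays a seek before
    # transferring its data, so small random I/O is dominated by the ~14 ms seek.
    SEEK_S = 14.1e-3          # measured mean seek (including rotational latency)
    TRANSFER_B_S = 58e6       # outer-zone sequential transfer rate, bytes/s

    def random_io_mb_s(request_bytes):
        time_per_request = SEEK_S + request_bytes / TRANSFER_B_S
        return request_bytes / time_per_request / 1e6

    print(f"sequential: {TRANSFER_B_S / 1e6:.0f} MB/s")
    print(f"random 64 kbyte requests: {random_io_mb_s(64 * 1024):.1f} MB/s")

This order-of-magnitude gap is why the multi-thread (random-like) benchmark results above fall so far below the single-stream sequential rates.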
SC2004 RAID Controller Performance
- Supermicro X5DPE-G2 motherboards loaned from Boston Ltd.
- Dual 2.8 GHz Xeon CPUs with 512 kbyte cache, 1 Gbyte memory
- 3Ware 8506-8 controller on the 133 MHz PCI-X bus, configured as RAID0 with a 64 kbyte stripe size
- Six 74.3 GByte Western Digital Raptor WD740 SATA disks: 75 Mbyte/s disk-to-buffer, 150 Mbyte/s buffer-to-memory
- Scientific Linux with the 2.6.6 kernel + altAIMD patch (Yee) + packet-loss patch
- Read-ahead kernel tuning: /sbin/blockdev --setra 16384 /dev/sda
- Memory-to-disk write and disk-to-memory read speeds measured for 0.5, 1 and 2 GByte transfers as a function of buffer size
- RAID0 (striped), 2 GByte file: read 1500 Mbit/s, write 1725 Mbit/s

Applications: Throughput for Real Users

Topology of the MB-NG Network
- Three domains – Manchester (man01, man02; edge router man03 on a Cisco 7609), UCL (lon01, lon02, lon03) and RAL (ral01, ral02) – each with a Cisco 7609 boundary router, interconnected across the UKERNA development network; HW RAID systems at the UCL and RAL ends
- Key: Gigabit Ethernet access, 2.5 Gbit POS core, MPLS, administrative domains

Gridftp Throughput + Web100
- RAID0 disks: 960 Mbit/s read, 800 Mbit/s write
- Throughput alternates between 600-800 Mbit/s and zero; mean data rate 520 Mbit/s
- Cwnd is smooth; no duplicate ACKs, send stalls or timeouts

http data transfers
- HighSpeed TCP, same hardware
- Bulk data moved by web servers: Apache web server out of the box, prototype client using the curl http library
- 1 Mbyte TCP buffers, 2 Gbyte file
- Throughput ~720 Mbit/s
- Cwnd shows some variation; no duplicate ACKs, send stalls or timeouts

bbftp: What else is going on?
- Scalable TCP; BaBar + SuperJANET and SuperMicro + SuperJANET
- The congestion window shows duplicate-ACK events
- Is the variation not TCP related – disk speed / bus transfer, or the application itself?
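A quick sanity check on the RAID0 figures above: the array's sequential rate is bounded by the per-disk media rate times the number of disks, and by the controller/PCI-X path. A sketch in Python using the numbers quoted for the SC2004 system; the PCI-X efficiency factor is an illustrative assumption:

    # Upper bounds on RAID0 sequential throughput.
    N_DISKS = 6
    DISK_MEDIA = 75e6            # bytes/s, disk-to-buffer rate quoted above
    PCI_X = 8 * 133e6            # 64-bit PCI-X at 133 MHz, bytes/s
    PCI_X_EFFICIENCY = 0.7       # assumed usable fraction after protocol overheads

    raid_limit = N_DISKS * DISK_MEDIA
    bus_limit = PCI_X * PCI_X_EFFICIENCY

    print(f"disk-limited bound: {raid_limit * 8 / 1e6:.0f} Mbit/s")
    print(f"bus-limited bound:  {bus_limit * 8 / 1e6:.0f} Mbit/s")
    # Measured: ~1500 Mbit/s read and ~1725 Mbit/s write for a 2 GByte file,
    # i.e. well below the raw bounds.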
SC2004 UKLIGHT Topology
- Network diagram: the SLAC booth (Cisco 6509) and Caltech booth (UltraLight IP, Caltech 7600) at SC2004 connect via the NLR lambda (NLR-PITT-STAR-10GE-16) and Chicago Starlight to the UKLight 10G circuits at ULCC, with Amsterdam reached over SURFnet/EuroLink 10G
- Manchester (MB-NG 7600 OSR, four 1 GE channels) and the UCL network / UCL HEP (two 1 GE channels) connect into UKLight

SC2004 Disk-Disk bbftp
- The bbftp file-transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
- Move a 2 Gbyte file; Web100 plots of instantaneous bandwidth, average bandwidth and CurCwnd
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-disk at 1 Gbit/s

Network & Disk Interactions (work in progress)
- Hosts: Supermicro X5DPE-G2 motherboards, dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Gbyte memory
- 3Ware 8506-8 controller on the 133 MHz PCI-X bus, configured as RAID0 with six 74.3 GByte Western Digital Raptor WD740 SATA disks and a 64 kbyte stripe size
- Measure memory-to-RAID0 transfer rates (8 kbyte and 64 kbyte writes) with & without UDP traffic on the NIC:
  - Disk write alone: 1735 Mbit/s
  - Disk write + 1500-byte MTU UDP: 1218 Mbit/s – a drop of 30%
  - Disk write + 9000-byte MTU UDP: 1400 Mbit/s – a drop of 19%
- Throughput and the % CPU in system (kernel) mode are related approximately linearly (fits of the form y ≈ 178 - 1.05x)

Remote Processing Farms: Manc-CERN
- Round-trip time 20 ms; 64-byte request, 1 Mbyte response
- TCP is in slow start: the first event takes 19 rtt, or ~380 ms
- The TCP congestion window gets reset on each request – a TCP stack implementation detail that reduces cwnd after inactivity
- Even after 10 s, each response takes 13 rtt, or ~260 ms
- Transfer achievable throughput: 120 Mbit/s
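The "19 rtt" and "13 rtt" figures above follow from counting slow-start rounds. A hedged sketch in Python: the initial window, the 1.5x per-RTT growth (slow start with delayed ACKs) and the segment size are illustrative assumptions, not measured values:

    # Count the round trips needed to deliver total_bytes while the congestion
    # window grows by a factor of 1.5 per RTT (slow start with delayed ACKs).
    MSS = 1448          # bytes of payload per segment (assumed)
    RTT = 0.020         # 20 ms Manchester-CERN round trip

    def rtts_to_send(total_bytes, initial_cwnd_segments=2, growth=1.5):
        segments = total_bytes / MSS
        cwnd, sent, rtts = initial_cwnd_segments, 0.0, 0
        while sent < segments:
            sent += cwnd
            cwnd *= growth
            rtts += 1
        return rtts

    n = rtts_to_send(1_000_000)
    print(f"1 Mbyte response: ~{n} rtt = {n * RTT * 1e3:.0f} ms")

With these assumptions the count comes out at 13 round trips, about 260 ms; starting instead from one segment and adding the connection set-up round trips gives a figure close to the 19 rtt seen for the first event.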
Summary, Conclusions & Thanks
- The host is critical: motherboards, NICs, RAID controllers and disks matter
- The NICs should be well designed:
  - The NIC should use 64-bit 133 MHz PCI-X (66 MHz PCI can be OK)
  - NIC/drivers: CSR access, clean buffer management, good interrupt handling
- Worry about the CPU-memory bandwidth as well as the PCI bandwidth – data crosses the memory bus at least 3 times
- Separate the data transfers – use motherboards with multiple 64-bit PCI-X buses
- Test disk systems with a representative load
- Choose a modern high-throughput RAID controller; consider SW RAID0 over RAID5 HW controllers
- Plenty of CPU power is needed for sustained 1 Gbit/s transfers and disk access
- Packet loss is a killer – check campus links & equipment, and the access links to the backbones
- The new stacks are stable and give better response & performance; you still need to set the TCP buffer sizes & other kernel settings, e.g. window scaling
- Application architecture & implementation is important
- The interaction between hardware, protocol processing and the disk sub-system is complex
(MB-NG)

More Information: Some URLs
- UKLight web site: http://www.uklight.ac.uk
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
- Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/; "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
- Real-time remote farm site: http://csr.phys.ualberta.ca/real-time
- Disk information: http://www.storagereview.com/

Any Questions?
Backup Slides

TCP (Reno) – Details
- Time for TCP to recover its throughput after a single lost packet:
    rho = C * RTT^2 / (2 * MSS)
  where C is the line speed, RTT the round-trip time and MSS the segment size
- Plotted for line speeds of 10 Mbit/s to 10 Gbit/s and rtt up to 200 ms; typical rtts: UK ~6 ms, Europe ~20 ms, USA ~150 ms

Packet Loss and new TCP Stacks
- TCP response function measured on UKLight London-Chicago-London, rtt 180 ms, 2.6.6 kernel (sculcc1-chi-2, iperf, 13 Jan 05)
- Achievable throughput vs packet drop rate (1 in n) for standard TCP (MTU 1500), HSTCP, Scalable, HTCP, BIC-TCP, Westwood and Vegas, compared with the standard and Scalable theory curves
- Agreement with theory is good

Test of TCP Sharing: Methodology (1 Gbit/s)  [Les Cottrell, PFLDnet 2005]
- Chose 3 paths from SLAC (California): Caltech (10 ms), Univ. Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic through the bottleneck, with ICMP/ping traffic at 1 packet/s
- Each run was 16 minutes, divided into 7 regions of 2 and 4 minutes

TCP Reno single stream  [Les Cottrell, PFLDnet 2005]
- Low performance on fast long-distance paths
- AIMD: add a = 1 packet to cwnd per RTT, decrease cwnd by a factor b = 0.5 on congestion
- Net effect: recovers slowly and does not effectively use the available bandwidth, so poor throughput
- Unequal sharing, and the remaining flows do not take up the slack when a flow is removed
- The RTT increases when the flow achieves its best throughput
- SLAC to CERN: congestion has a dramatic effect and recovery is slow

Average Transfer Rates Mbit/s
  App      TCP Stack   SuperMicro   SuperMicro       BaBar on      SC2004 on
                       on MB-NG     on SuperJANET4   SuperJANET4   UKLight
  Iperf    Standard    940          350-370          425           940
           HighSpeed   940          510              570           940
           Scalable    940          580-650          605           940
  bbcp     Standard    434          290-310          290
           HighSpeed   435          385              360
           Scalable    432          400-430          380
  bbftp    Standard    400-410      325              320           825
           HighSpeed                370-390          380
           Scalable    430          345-532          380           875
  apache   Standard    425          260              300-360
           HighSpeed   430          370              315
           Scalable    428          400              317
  Gridftp  Standard    405          240
           HighSpeed                320
           Scalable                 335
- Rates decrease going from memory-to-memory iperf down to the disk-based applications; the new stacks give more throughput

iperf Throughput + Web100
- SuperMicro on the MB-NG network: HighSpeed TCP at line speed, 940 Mbit/s; DupACKs < 10 (expect ~400)
- BaBar on the production network: standard TCP, 425 Mbit/s; DupACKs 350-400 – re-transmits

Applications: Throughput Mbit/s
- HighSpeed TCP, 2 GByte file, RAID5, SuperMicro + SuperJANET
- bbcp, bbftp, Apache and Gridftp compared
- Previous work used RAID0 (not disk limited)

bbcp & GridFTP Throughput
- 2 Gbyte file transferred, RAID5 with 4 disks, Manchester - RAL
- bbcp with the DataTAG altAIMD kernel (as used in BaBar & ATLAS): mean ~710 Mbit/s
- GridFTP: throughput shows many zeros; mean ~620 Mbit/s

tcpdump of the T/DAQ dataflow at SFI (1)
- CERN-Manchester, 1.0 Mbyte event
- The remote EFD requests an event from the SFI: incoming event request, followed by an ACK
- The SFI sends the event, limited by the TCP receive buffer; time 115 ms (~4 events/s)
- When TCP ACKs arrive, more data is sent, in runs of N 1448-byte packets
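The recovery-time formula above is easy to evaluate; a small sketch in Python (the 1460-byte MSS is an assumption), reproducing the ~1.6 s recovery seen on MB-NG and the tens of minutes quoted for standard TCP on long paths:

    # Time for standard TCP to recover its throughput after one lost packet:
    # rho = C * RTT^2 / (2 * MSS), with C in bit/s and MSS in bits.
    MSS_BITS = 1460 * 8      # assumed segment size

    def recovery_time_s(line_rate_bps, rtt_s):
        return line_rate_bps * rtt_s ** 2 / (2 * MSS_BITS)

    for label, rtt in (("UK 6.2 ms", 0.0062), ("Europe 20 ms", 0.020),
                       ("USA 150 ms", 0.150)):
        t = recovery_time_s(1e9, rtt)          # 1 Gbit/s path
        print(f"1 Gbit/s, {label}: {t:.1f} s")

Because the recovery time scales with C and with RTT squared, the same calculation at 10 Gbit/s or at transatlantic round-trip times shows why the standard AIMD behaviour alone cannot sustain high throughput on fast long-distance paths.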