The Performance of High Throughput Data Flows for e-VLBI in Europe
Multiple vlbi_udp Flows, Constant Bit-Rate over TCP & Multi-Gigabit over GÉANT2
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then "Talks"
TERENA Networking Conference, Lyngby, 21-24 May 2007

What is VLBI?
The VLBI signal wave front is sampled at each telescope and the data wave front is sent over the network to the correlator.
Resolution ∝ 1/B, where B is the baseline.
Sensitivity: the recorded bandwidth is as important as the integration time τ, so we can use as many gigabits as we can get!

European e-VLBI Test Topology
Metsähovi, Finland (Gbit link); Onsala, Sweden - Chalmers University of Technology, Gothenburg (Gbit link); Jodrell Bank, UK (dedicated DWDM link); Torun, Poland (2 × 1 Gbit links); Medicina, Italy; Dwingeloo, Netherlands (JIVE correlator).

vlbi_udp: UDP on the WAN
The monolithic iGrid2002 code was converted to use pthreads: control, data input and data output threads.
Work done on vlbi_recv:
- The output thread originally polled for data in the ring buffer - this burned CPU.
- The input thread was then made to signal the output thread when there was work to do, with the output thread otherwise waiting on a semaphore - this had packet loss at high rate and variable throughput.
- The output thread now uses sched_yield() when there is no work to do.
Multi-flow network performance - set up in Dec 2006:
- Three sites send to JIVE: Manchester over UKLight, Manchester over the production network, and Bologna from the GÉANT PoP.
- Measure: throughput, packet loss, re-ordering and 1-way delay.

vlbi_udp: Some of the Problems
- JIVE made Huygens, mark524 (.54) and mark620 (.59) available. Within minutes of Arpad leaving, the Alteon NIC of mark524 lost the data network! OK - used mark623 (.62) instead, which has a faster CPU.
- Firewalls needed to allow the vlbi_udp ports. Aarrgg!
- Huygens runs SUSE Linux.
- Routing - well, this ALWAYS needs to be fixed!
- The AMD Opteron did not like sched_getaffinity() / sched_setaffinity(), so this bit was commented out.
- udpmon flows from Onsala to JIVE look good, but udpmon flows from JIVE mark623 to Onsala and to Manchester UKLight don't work: with the firewall down the test stops after 77 udpmon loops; with the firewall up udpmon can't communicate with Onsala.
- CPU load issues on the Mark5 systems: they don't seem to be able to keep up with receiving the UDP flow AND emptying the ring buffer.
- The Torun PC / link was lost as the test started.

Multiple vlbi_udp Flows (vlbi_udp_3flows_6Dec06)
[Plots: wire rate (Mbit/s) and % packet loss against time during the transfer (0-14,000 s) for each of the three flows.]
- Gig7 → Huygens over UKLight, 15 µs spacing: 816 Mbit/s, sigma <1 Mbit/s, steps of 1 Mbit/s; zero packet loss, zero re-ordering.
- Gig8 → mark623 over the academic Internet, 20 µs spacing: 612 Mbit/s; packet loss 0.6% falling to 0.05%; 0.02% re-ordering.
- Bologna → mark620 over the academic Internet, 30 µs spacing: 396 Mbit/s; 0.02% packet loss; 0% re-ordering.
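As a brief aside on the "What is VLBI?" slide above, the standard interferometry relations behind the statement that bandwidth is as important as integration time can be written out explicitly. These are textbook expressions rather than formulas taken from the talk; λ is the observing wavelength, B the maximum baseline, Δν the recorded bandwidth and τ the integration time:

\[ \theta \approx \frac{\lambda}{B}, \qquad \sigma_{\mathrm{noise}} \propto \frac{1}{\sqrt{\Delta\nu\,\tau}} \]

So halving the image noise can be achieved either by observing four times longer or by recording four times the bandwidth, which is why every extra gigabit of network capacity is directly useful.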
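To make the vlbi_recv threading changes above more concrete, here is a minimal sketch of one way the semaphore signalling and sched_yield() ideas can be combined for a ring buffer shared between an input and an output thread. This is not the real vlbi_udp source: the names (RING_SLOTS, recv_packet, write_packet), the buffer sizes and the use of sem_trywait() plus sched_yield() are illustrative assumptions only.

/* Minimal sketch of the ring-buffer hand-off between an input and an output
 * thread, in the spirit of the vlbi_recv changes described above.
 * NOT the real vlbi_udp code; placeholder names and sizes are assumptions.
 * Build with: gcc -O2 -o ring_sketch ring_sketch.c -lpthread                 */
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <string.h>

#define RING_SLOTS 4096          /* number of packet slots in the ring buffer */
#define SLOT_BYTES 9000          /* one jumbo-frame sized slot                */

static char  ring[RING_SLOTS][SLOT_BYTES];
static int   ring_len[RING_SLOTS];
static sem_t slots_filled;       /* counts packets waiting to be written out  */

/* Placeholders standing in for the real UDP receive and data output calls.   */
static int recv_packet(char *buf, int maxlen) { memset(buf, 0, (size_t)maxlen); return maxlen; }
static void write_packet(const char *buf, int len) { (void)buf; (void)len; }

static void *input_thread(void *arg)
{
    unsigned head = 0;
    (void)arg;
    /* NOTE: real code must also guard against the writer lapping the reader. */
    for (;;) {
        int n = recv_packet(ring[head % RING_SLOTS], SLOT_BYTES);
        if (n <= 0)
            continue;
        ring_len[head % RING_SLOTS] = n;
        head++;
        sem_post(&slots_filled);                 /* signal: there is work to do */
    }
    return NULL;
}

static void *output_thread(void *arg)
{
    unsigned tail = 0;
    (void)arg;
    for (;;) {
        if (sem_trywait(&slots_filled) != 0) {   /* nothing queued ...            */
            sched_yield();                       /* ... yield instead of spinning */
            continue;
        }
        write_packet(ring[tail % RING_SLOTS], ring_len[tail % RING_SLOTS]);
        tail++;
    }
    return NULL;
}

int main(void)
{
    pthread_t in_tid, out_tid;
    sem_init(&slots_filled, 0, 0);
    pthread_create(&in_tid, NULL, input_thread, NULL);
    pthread_create(&out_tid, NULL, output_thread, NULL);
    pthread_join(in_tid, NULL);                  /* threads run until the process is killed */
    pthread_join(out_tid, NULL);
    return 0;
}

The point of the sketch is the hand-off: the input thread only posts the semaphore, so it never blocks on the consumer, while the output thread gives the CPU away with sched_yield() whenever the ring is empty rather than busy-polling it.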
The Impact of Multiple vlbi_udp Flows
[Diagram: the three concurrent flows and the access links they share - Gig7 → Huygens over UKLight at ~800 Mbit/s (15 µs spacing), Gig8 → mark623 over the academic Internet at ~600 Mbit/s (20 µs spacing), Bologna → mark620 over the academic Internet at ~400 Mbit/s (30 µs spacing) - crossing the SJ5, SURFnet and GARR access links.]

e-VLBI: Driven by Science
Microquasar GRS 1915+105 (11 kpc), observed on 21 April 2006 at 5 GHz using 6 EVN telescopes during a weak flare (11 mJy); just resolved in the jet direction (PA 140°). (Rushton et al.)
128 Mbit/s from each telescope, 4 TBytes of raw sample data over 12 hours, 2.8 GBytes of correlated data.
Microquasar Cygnus X-3 (10 kpc), observed on 20 April (a) and 18 May 2006 (b). The source was in a semi-quiescent state in (a) and in a flaring state in (b). The core of the source is probably ~20 mas to the N of knot A. (Tudose et al.)

RR001: The First Rapid Response Experiment (Rushton, Spencer)
The experiment was planned as follows:
1. Operate 6 EVN telescopes in real time on 29 Jan 2007.
2. Correlate and analyse the results in double-quick time.
3. Select sources for follow-up observations.
4. Observe the selected sources on 1 Feb 2007.
The experiment worked: we successfully observed and analysed 16 sources (weak microquasars), ready for the follow-up run, but we found that none of the sources were suitably active at that time - a perverse universe!

Constant Bit-Rate Data over TCP/IP

CBR Test Setup
[Diagram of the constant bit-rate test setup.]

Moving CBR over TCP
When there is packet loss, TCP decreases the rate.
Effect of loss rate on message arrival time, for a TCP buffer of 0.9 MB (BDP) with RTT 15.2 ms and a TCP buffer of 1.8 MB (BDP) with RTT 27 ms.
[Plot: arrival time vs message number for drop rates of 1 in 5k, 10k, 20k and 40k packets and for no loss, showing timely arrival of data versus data delayed.]
Can TCP deliver the data on time?

Resynchronisation
[Sketch: arrival time vs message number. The slope gives the throughput; a packet loss introduces a delay in the stream relative to the expected arrival time at the constant bit rate, which must then be recovered.]

CBR over TCP - Large TCP Buffer
Message size 1448 bytes, data rate 525 Mbit/s, route Manchester - JIVE, RTT 15.2 ms, TCP buffer 160 MB, drop 1 in 1.12 million packets.
[Web100 plots vs time (0-120 s): CurCwnd, achieved rate in Mbit/s, PktsRetrans and DupAcksIn.]
Throughput recovers and increases again after the retransmissions: peak throughput ~734 Mbit/s, minimum throughput ~252 Mbit/s.
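As a quick check of the buffer sizes quoted on the "Moving CBR over TCP" slide, the bandwidth-delay products follow from the CBR data rate and the RTTs. This is a back-of-envelope sketch assuming the BDP is taken at the 525 Mbit/s CBR rate rather than the line rate:

\[ \mathrm{BDP} = \text{rate} \times \mathrm{RTT}: \quad 525~\mathrm{Mbit/s} \times 15.2~\mathrm{ms} \approx 1.0~\mathrm{MB}, \qquad 525~\mathrm{Mbit/s} \times 27~\mathrm{ms} \approx 1.8~\mathrm{MB} \]

which matches the ~0.9 MB and 1.8 MB buffers used in the two tests.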
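Similarly, the loss rate in the "Large TCP Buffer" run can be turned into a time between losses. This is a rough estimate assuming one 1448-byte message per TCP segment:

\[ \frac{525\times10^{6}~\mathrm{bit/s}}{1448 \times 8~\mathrm{bit}} \approx 4.5\times10^{4}~\mathrm{messages/s}, \qquad \frac{1.12\times10^{6}}{4.5\times10^{4}~\mathrm{s^{-1}}} \approx 25~\mathrm{s~between~losses} \]

The peak (~734 Mbit/s) and minimum (~252 Mbit/s) throughputs bracketing the 525 Mbit/s CBR rate illustrate that TCP has to run well above the constant bit rate after each loss in order to drain the backlog, and a 160 MB send buffer can absorb roughly 160 MB × 8 / 525 Mbit/s ≈ 2.4 s of stalled delivery, of the same order as the ~2.5 s peak delay shown on the next slide.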
CBR over TCP - Message Delay
Message size 1448 bytes, data rate 525 Mbit/s, route Manchester - JIVE, RTT 15.2 ms, TCP buffer 160 MB, drop 1 in 1.12 million packets.
[Plot: one-way message delay (ms) vs message number, from 2,000,000 to 5,000,000.]
OK, you can recover, BUT the peak delay is ~2.5 s with the 160 MB TCP buffer.

Multi-gigabit tests over GÉANT
But will 10 Gigabit Ethernet work on a PC?

High-end Server PCs for 10 Gigabit
Boston/Supermicro X7DBE:
- Two dual-core Intel Xeon Woodcrest 5130, 2 GHz
- Two independent 1.33 GHz front-side buses
- 530 MHz fully buffered (serial) memory with parallel access to 4 banks
- Chipsets: Intel 5000P MCH (PCIe and memory) and ESB2 (PCI-X, GE, etc.)
- 3 × 8-lane PCIe buses, 3 × 133 MHz PCI-X, 2 × Gigabit Ethernet, SATA

10 GigE Back2Back: UDP Latency
Setup: motherboard Supermicro X7DBE; chipset Intel 5000P MCH; CPU 2 × dual-core Intel Xeon 5130, 2 GHz with 4096 kB L2 cache; memory bus 2 independent 1.33 GHz channels; 8-lane PCIe; Linux kernel 2.6.20-web100_pktd-plus; Myricom NIC 10G-PCIE-8A-R fibre, myri10ge v1.2.0 + firmware v1.4.10; rx-usecs=0 (coalescence OFF), MSI=1, checksums ON, tx_boundary=4096; MTU 9000 bytes.
Latency 22 µs and very well behaved; histogram FWHM ~1-2 µs.
Measured latency slope 0.0028 µs/byte (fit y = 0.0028x + 21.937); back-to-back expect 0.00268 µs/byte from memory 0.0004 + PCIe 0.00054 + 10GigE 0.0008 + PCIe 0.00054 + memory 0.0004.
[Plots (gig6-5_Myri10GE_rxcoal=0): latency vs message length (0-10,000 bytes) and latency histograms N(t) for 64, 3000 and 8900-byte messages.]

10 GigE Back2Back: UDP Throughput
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R fibre; rx-usecs=25 (coalescence ON); MTU 9000 bytes.
Max throughput 9.4 Gbit/s; note the rate for 8972-byte packets. ~0.002% packet loss in 10 M packets, in the receiving host.
Sending host: 3 CPUs idle; for packet spacings <8 µs, one CPU is >90% in kernel mode, including ~10% soft interrupt.
Receiving host: 3 CPUs idle; for packet spacings <8 µs, one CPU is 70-80% in kernel mode, including ~15% soft interrupt.
[Plots (gig6-5_myri10GE): received wire rate and sender/receiver kernel-mode CPU load vs spacing between frames (0-40 µs) for packet sizes from 1000 to 8972 bytes.]
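The expected back-to-back latency slope quoted on the UDP latency slide is simply the sum of the per-byte costs of each element the data crosses: memory and PCIe on each host plus the 10 GigE link. The worked sum below just adds up the figures listed on the slide:

\[ m_{\mathrm{expected}} = 0.0004_{\mathrm{mem}} + 0.00054_{\mathrm{PCIe}} + 0.0008_{\mathrm{10GigE}} + 0.00054_{\mathrm{PCIe}} + 0.0004_{\mathrm{mem}} = 0.00268~\mu\mathrm{s/byte} \]

which compares well with the measured 0.0028 µs/byte from the fit y = 0.0028x + 21.937.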
10 GigE UDP Throughput vs Packet Size
Motherboard Supermicro X7DBE; Linux kernel 2.6.20-web100_pktd-plus; Myricom NIC 10G-PCIE-8A-R fibre, myri10ge v1.2.0 + firmware v1.4.10; rx-usecs=0, coalescence ON, MSI=1, checksums ON, tx_boundary=4096.
[Plot (gig6-5_myri_udpscan): received wire rate vs size of user data in the packet, 0-10,000 bytes.]
Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries.
Model the data transfer time as t = C + m × Bytes, where C includes the time to set up the transfers. The fit is reasonable: C = 1.67 µs and m = 5.4 × 10^-4 µs/byte, and the steps are consistent with C increasing by 0.6 µs.
The Myricom driver segments the transfers, limiting the DMA to 4096 bytes - this is PCIe chipset dependent!

10 GigE X7DBE-X7DBE: TCP iperf
No packet loss; MTU 9000; TCP buffer 256 kB, BDP ~330 kB.
Web100 plots of the TCP parameters: Cwnd shows slow start then slow growth, limited by the sender! Duplicate ACKs: one event of 3 DupACKs; packets re-transmitted.
iperf TCP throughput: 7.77 Gbit/s.

OK, so it works!!!

ESLEA-FABRIC: 4 Gbit Flows over GÉANT2
Set up a 4 Gigabit lightpath between GÉANT2 PoPs, in collaboration with DANTE, on the GÉANT2 testbed: London - Prague - London, with PCs in the DANTE London PoP fitted with 10 Gigabit NICs.
VLBI tests:
- UDP performance: throughput, jitter, packet loss, 1-way delay, stability
- Continuous (days-long) data flows with vlbi_udp and udpmon
- Multi-gigabit TCP performance with current kernels
- Multi-gigabit CBR over TCP/IP
- Experience for FPGA Ethernet packet systems
DANTE interests:
- Multi-gigabit TCP performance
- The effect of the (Alcatel 1678 MCC 10GE port) buffer size on bursty TCP using bandwidth-limited lightpaths

The GÉANT2 Testbed
10 Gigabit SDH backbone; Alcatel 1678 MCCs with GE and 10GE client interfaces.
Node locations: London, Amsterdam, Paris, Prague, Frankfurt.
Lightpath routing is possible, so paths of different RTT can be made; the PCs are located in London.

Provisioning the Lightpath on the Alcatel MCCs
Some jiggery-pokery was needed with the NMS to force a "looped back" lightpath London - Prague - London.
Manual XCs (using the element manager) were possible but hard work: 196 would have been needed, plus other operations!
Instead the Resource Manager was used to create two parallel VC-4-28v (single-ended) Ethernet Private Line (EPL) paths, constrained to transit DE; the paths were then manually joined in CZ.
Only 28 manually created XCs were required.

Provisioning the Lightpath on the Alcatel MCCs (continued)
The paths come up and the (transient) alarms clear.
Result: a provisioned path of 28 virtually concatenated VC-4s, UK - NL - DE - NL - UK.
Optical path ~4150 km, or ~4900 km with dispersion compensation; RTT 46.7 ms. YES!!!

Photos at the PoP
[Photos: production SDH, test-bed SDH, the 10 GE production router and the optical transport equipment.]
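Two quick consistency checks on the provisioned lightpath. These are back-of-envelope figures assuming the standard ~149.76 Mbit/s payload of a VC-4, a propagation speed in fibre of roughly 2 × 10^5 km/s, and that the ~4900 km figure is the one-way optical distance; none of these numbers are stated in the talk:

\[ 28 \times 149.76~\mathrm{Mbit/s} \approx 4.19~\mathrm{Gbit/s}, \qquad \mathrm{RTT} \approx \frac{2 \times 4900~\mathrm{km}}{2\times10^{5}~\mathrm{km/s}} \approx 49~\mathrm{ms} \]

Both are roughly consistent with what was measured: a maximum UDP throughput of 4.199 Gbit/s on the path (next slide) and the quoted RTT of 46.7 ms.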
4 Gig Flows on GÉANT: UDP Throughput
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R fibre; rx-usecs=25, coalescence ON; MTU 9000 bytes.
Max throughput 4.199 Gbit/s.
Sending host: 3 CPUs idle; for packet spacings <8 µs, one CPU is >90% in kernel mode, including ~10% soft interrupt.
Receiving host: 3 CPUs idle; for packet spacings <8 µs, one CPU is ~37% in kernel mode, including ~9% soft interrupt.
[Plots (exp2-1_prag_15May07): received wire rate and sender/receiver kernel-mode CPU load vs spacing between frames (0-40 µs) for packet sizes from 1000 to 8972 bytes.]

4 Gig Flows on GÉANT: 1-way Delay
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R fibre; coalescence OFF.
On the GÉANT lightpath (exp1-2_prag_rxcoal0_16May07) the 1-way delay is stable at 23.435 ms, with a peak separation of 86 µs and ~40 µs of extra delay on some packets.
Lab tests (gig6-g5Cu_myri_MSI_30Mar07) show the same behaviour: peak separation 86 µs, ~40 µs extra delay.
The lightpath adds no unwanted effects.
[Plots: 1-way delay vs packet number and the delay histogram N(t), for the GÉANT path and for the lab tests.]

4 Gig Flows on GÉANT: Jitter Histograms
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R fibre; coalescence OFF; 8900-byte packets.
For packet separations of 100 µs and 300 µs the secondary peaks are separated by ~36 µs and are a factor of ~100 smaller than the main peak.
Lab tests: the lightpath adds no effects.
[Histograms (exp1-2_rxcoal0_16May07 and gig6-5_Lab_30Mar07): N(t) vs latency for packet separations w=100 and w=300 µs.]

4 Gig Flows on GÉANT: UDP Flow Stability
Kernel 2.6.20-web100_pktd-plus; Myricom 10G-PCIE-8A-R fibre; coalescence OFF.
MTU 9000 bytes, packet spacing 18 µs, trials sending 10 M packets, run for 26 hours.
[Plot (exp2-1_w18_i500_udpmon_21May): wire rate vs time during the transfer, 0-100,000 s.]
The throughput is very stable at 3.9795 Gbit/s. Occasional trials have packet loss of ~40 packets in 10 M - this is being investigated.
Our thanks go to all our collaborators. DANTE really provided "Bandwidth on Demand": a record 6 hours, including driving to the PoP, installing the PCs and provisioning the lightpath!

Any Questions?
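For the flow-stability test, the stable ~3.98 Gbit/s follows directly from the chosen packet spacing. This is a rough calculation assuming 8972 bytes of user data per 9000-byte MTU frame and counting the full Ethernet overhead (header, FCS, preamble and inter-frame gap, about 38 bytes in total):

\[ \frac{(9000 + 38)~\mathrm{bytes} \times 8}{18~\mu\mathrm{s}} \approx 4.02~\mathrm{Gbit/s} \]

slightly above the measured 3.9795 Gbit/s, suggesting the achieved inter-frame spacing was a fraction of a microsecond longer than the nominal 18 µs.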
Introduction: What is EXPReS?
EXPReS = Express Production Real-time e-VLBI Service.
A three-year project, started in March 2006, funded by the European Commission (DG-INFSO), Sixth Framework Programme, Contract #026642.
Objective: to create a distributed, large-scale astronomical instrument of continental and inter-continental dimensions.
Means: high-speed communication networks operating in real time and connecting some of the largest and most sensitive radio telescopes on the planet.
Additional information: http://expres-eu.org/ [note: only one "s"] and http://www.jive.nl

Introduction: EXPReS Partners
Radio astronomy institutes:
- Joint Institute for VLBI in Europe (Coordinator), The Netherlands
- Arecibo Observatory, National Astronomy and Ionosphere Center, Cornell University, USA
- Australia Telescope National Facility, a Division of CSIRO, Australia
- Institute of Radioastronomy, National Institute for Astrophysics (INAF), Italy
- Jodrell Bank Observatory, University of Manchester, United Kingdom
- Max Planck Institute for Radio Astronomy (MPIfR), Germany
- Metsähovi Radio Observatory, Helsinki University of Technology (TKK), Finland
- National Center of Geographical Information, National Geographic Institute (CNIG-IGN), Spain
- Hartebeesthoek Radio Astronomy Observatory, National Research Foundation, South Africa
- Netherlands Foundation for Research in Astronomy (ASTRON), NWO, The Netherlands
- Onsala Space Observatory, Chalmers University of Technology, Sweden
- Shanghai Astronomical Observatory, Chinese Academy of Sciences, China
- Torun Centre for Astronomy, Nicolaus Copernicus University, Poland
- Transportable Integrated Geodetic Observatory (TIGO), University of Concepción, Chile
- Ventspils International Radio Astronomy Center, Ventspils University College, Latvia
National research networks:
- AARNet, Australia
- DANTE, United Kingdom
- Poznan Supercomputing and Networking Center, Poland
- SURFnet, The Netherlands

Introduction: Participating EXPReS Telescopes
[Map of the participating telescopes.]

Provisioning the Lightpath on the Alcatel MCCs
- Create a virtual network element (VNE2) for the planned, not yet existing, port in Prague.
- Define the end points: out port 3 in the UK and VNE2 in CZ; in port 4 in the UK and VNE2 in CZ.
- Add the constraint to go via DE - or does OSPF decide the route?
- Set the capacity (28 VC-4s); the Alcatel Resource Manager allocates the routing of the EXPReS_out VC-4 trails.
- Repeat for EXPReS_ret. The same time slots are used in CZ for the EXPReS_out and EXPReS_ret paths.