The Performance of High Throughput Data
Flows for e-VLBI in Europe
Multiple vlbi_udp Flows,
Constant Bit-Rate over TCP
&
Multi-Gigabit over GÉANT2
Richard Hughes-Jones
The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
TERENA Networking Conference, Lyngby, 21-24 May 2007, R. Hughes-Jones Manchester
What is VLBI?
 VLBI signal wave front
 Resolution ∝ 1 / Baseline
 Sensitivity (noise level) ∝ 1 / √(B τ)
 Bandwidth B is as important as time τ: can use as many Gigabits as we can get!
 Data wave front sent over the network to the Correlator
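A hedged restatement of the scalings behind this point (the standard interferometer-resolution and radiometer-equation relations; the exact form on the original slide is not fully recoverable):

```latex
\theta_{\mathrm{res}} \;\propto\; \frac{1}{\mathrm{Baseline}},
\qquad
\sigma_{\mathrm{noise}} \;\propto\; \frac{1}{\sqrt{B\,\tau}}
```

So doubling the recorded bandwidth B lowers the noise as much as doubling the observing time τ, which is why more network capacity translates directly into more science.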
European e-VLBI Test Topology
[Map of the European e-VLBI test topology]
Sites: Metsähovi (Finland), Onsala / Chalmers University of Technology, Gothenburg (Sweden), Jodrell Bank (UK), Torun (Poland), Medicina (Italy), Dwingeloo (Netherlands)
Links: Gbit links, 2× 1 Gbit links, dedicated DWDM link
vlbi_udp: UDP on the WAN
 iGrid2002 monolithic code
 Converted to use pthreads:
 Control
 Data input
 Data output
 Work done on vlbi_recv (thread handshake sketched after this list):
 Output thread polled for data in the ring buffer – burned CPU
 Input thread signals the output thread when there is work to do – otherwise the output thread waits on a semaphore – this gave packet loss at high rate and variable throughput
 Output thread uses sched_yield() when there is no work to do
 Multi-flow network performance – set up in Dec 2006
 3 sites to JIVE: Manchester UKLight; Manchester production; Bologna GÉANT PoP
 Measure: throughput, packet loss, re-ordering, 1-way delay
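A minimal pthreads sketch of the input/output thread handshake described above (semaphore wake-up plus sched_yield()). This is not the vlbi_recv source; the ring-buffer layout and the sink function are hypothetical, shown only to illustrate the idea:

```c
/* Sketch only: semaphore wake-up + sched_yield() instead of busy polling. */
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>

#define RING_SLOTS 2048

extern void write_to_network_or_disk(const char *pkt);  /* hypothetical sink */

struct ring {
    char        *slot[RING_SLOTS];   /* packet buffers filled by the input thread */
    volatile int head, tail;         /* head: next write, tail: next read          */
    sem_t        work;               /* posted by the input thread per packet      */
};

/* Output thread: sleep until the input thread posts work, drain the ring,
 * then yield the CPU so the receive thread keeps up. */
void *output_thread(void *arg)
{
    struct ring *r = arg;
    for (;;) {
        sem_wait(&r->work);                       /* no CPU burned while idle */
        while (r->tail != r->head) {
            write_to_network_or_disk(r->slot[r->tail]);
            r->tail = (r->tail + 1) % RING_SLOTS;
        }
        sched_yield();                            /* give the input thread the CPU */
    }
    return NULL;
}
```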
vlbi_udp: Some of the Problems
 JIVE made Huygens, mark524 (.54) and mark620 (.59) available
 Within minutes of Arpad leaving, the Alteon NIC of mark524 lost the
data network!
 OK used mark623 (.62) – faster CPU
 Firewalls needed to allow vlbi_udp ports
 Aarrgg (!!!) Huygens is SUSE Linux
 Routing – well this ALWAYS needs to be fixed !!!
 AMD Opteron did not like sched_getaffinity() / sched_setaffinity()
 Commented out this bit (a guarded fallback is sketched after this list)
 udpmon flows Onsala to JIVE look good
 udpmon flows JIVE mark623 to Onsala & Manc UKL don’t work
 Firewall down: stops after 77 udpmon loops
 Firewall up: udpmon can't communicate with Onsala
 CPU load issues on the MarkV systems
 Don’t seem to be able to keep up with receiving UDP flow AND
emptying the ring buffer
 Torun PC / Link lost as the test started
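On the affinity problem above, a guarded call is usually enough. A minimal sketch (hypothetical helper, not the vlbi_udp code) that logs the failure and carries on unpinned instead of needing the call commented out:

```c
/* Sketch only: pin the calling thread to a CPU, but degrade gracefully
 * if sched_setaffinity() fails (as it did on the Opteron hosts). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");   /* log it and run unpinned */
        return -1;
    }
    return 0;
}
```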
Multiple vlbi_udp Flows
[Plots vlbi_udp_3flows_6Dec06: wire rate (Mbit/s) and % packet loss vs time during the transfer (s), one panel per flow]
 Gig7  Huygens, UKLight, 15 µs spacing: 816 Mbit/s (sigma <1 Mbit/s, step 1 Mbit/s), zero packet loss, zero re-ordering
 Gig8  mark623, Academic Internet, 20 µs spacing: 612 Mbit/s, 0.6 falling to 0.05 % packet loss, 0.02 % re-ordering
 Bologna  mark620, Academic Internet, 30 µs spacing: 396 Mbit/s, 0.02 % packet loss, 0 % re-ordering
The Impact of Multiple vlbi_udp Flows
 Gig7  Huygens, UKLight, 15 µs spacing: ~800 Mbit/s
 Gig8  mark623, Academic Internet, 20 µs spacing: ~600 Mbit/s
 Bologna  mark620, Academic Internet, 30 µs spacing: ~400 Mbit/s
 Access links involved: SJ5, SURFnet, GARR
e-VLBI: Driven by Science
Microquasar GRS1915+105 (11 kpc) on 21 April 2006 at 5 GHz using 6 EVN telescopes, during a weak flare (11 mJy), just resolved in the jet direction (PA 140 deg). (Rushton et al.)
 128 Mbit/s from each telescope
 4 TBytes raw samples data over 12 hours
 2.8 GBytes of correlated data
Microquasar Cygnus X-3 (10 kpc) on 20 April (a) and 18 May 2006 (b). The source was in a semi-quiescent state in (a) and in a flaring state in (b). The core of the source is probably ~20 mas to the N of knot A. (Tudose et al.)
RR001 The First Rapid Response Experiment
(Rushton Spencer)
The experiment was planned as follows:
1. Operate 6 EVN telescopes in real time on 29 Jan 2007
2. Correlate and Analyse results in double quick time
3. Select sources for follow up observations
4. Observe selected sources 1 Feb 2007
The experiment worked – we successfully observed and analysed 16 sources (weak microquasars), ready for the follow-up run, but we found that none of the sources were suitably active at that time – a perverse universe!
Constant Bit-Rate Data over TCP/IP
CBR Test Setup
Moving CBR over TCP
When there is packet loss TCP decreases the rate.
Effect of loss rate on message arrival time:
 TCP buffer 0.9 MB (BDP), RTT 15.2 ms
 TCP buffer 1.8 MB (BDP), RTT 27 ms
[Plots: message arrival time (s) vs message number (×10^4) for drop rates of 1 in 5k, 10k, 20k, 40k and no loss; loss delays the data relative to timely arrival]
Can TCP deliver the data on time?
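A minimal sketch, in the spirit of these tests only, of a constant bit-rate sender over a connected TCP socket (1448-byte messages paced to ~525 Mbit/s); the function and buffer names are hypothetical, and the receiver simply timestamps each message on arrival:

```c
/* Sketch only: pace fixed-size messages at a constant bit rate over TCP.
 * If TCP backs off after loss, send() blocks and the data falls behind. */
#include <string.h>
#include <time.h>
#include <sys/socket.h>

#define MSG_BYTES        1448
#define RATE_BITS_PER_S  525000000.0

void send_cbr(int sock, long n_messages)
{
    char msg[MSG_BYTES];
    memset(msg, 0, sizeof msg);
    double interval = MSG_BYTES * 8.0 / RATE_BITS_PER_S;   /* seconds between messages */

    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (long i = 0; i < n_messages; i++) {
        memcpy(msg, &i, sizeof i);              /* sequence number in the payload */
        send(sock, msg, MSG_BYTES, 0);          /* blocks when TCP reduces its rate */

        /* advance the absolute deadline and sleep until it */
        next.tv_nsec += (long)(interval * 1e9);
        while (next.tv_nsec >= 1000000000L) { next.tv_nsec -= 1000000000L; next.tv_sec++; }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}
```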
Resynchronisation
[Schematic: arrival time vs message number / time. The expected arrival line at the CBR rate has slope 1/throughput; a packet loss produces a delay in the stream before the flow resynchronises.]
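The same schematic as a hedged pair of relations: with message size M and nominal CBR rate R, the expected arrival curve is a straight line of slope M/R, and a loss appears as an offset from that line until the flow catches up.

```latex
t_{\mathrm{expected}}(n) \;=\; t_0 + \frac{n\,M}{R},
\qquad
\Delta t(n) \;=\; t_{\mathrm{arrival}}(n) - t_{\mathrm{expected}}(n) \;\ge\; 0
```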
CBR over TCP – Large TCP Buffer
 Message size: 1448 Bytes
 Data Rate: 525 Mbit/s
 Route: Manchester - JIVE
 RTT 15.2 ms
 TCP buffer 160 MB
 Drop 1 in 1.12 million packets
 Throughput increases
 Peak throughput ~734 Mbit/s
 Min. throughput ~252 Mbit/s
[Web100 plots vs time (0–120 s): CurCwnd, achieved rate (Mbit/s, up to 1000), PktsRetrans, DupAcksIn]
CBR over TCP – Message Delay
 TCP buffer 160 MB
 Drop 1 in 1.12 million packets
 Message size: 1448 Bytes
 Data Rate: 525 Mbit/s
 Route: Manchester - JIVE
 RTT 15.2 ms
[Plot: one-way delay (ms, 0–3000) vs message number (2,000,000–5,000,000)]
 OK you can recover BUT:
 Peak Delay ~2.5 s
 TCP buffer  RTT4
Multi-gigabit tests over GÉANT
But will 10 Gigabit Ethernet work on a PC?
High-end Server PCs for 10 Gigabit
 Boston/Supermicro X7DBE
 Two Dual-Core Intel Xeon Woodcrest 5130, 2 GHz
 Independent 1.33 GHz front-side buses
 530 MHz FB memory (serial), parallel access to 4 banks
 Chipsets: Intel 5000P MCH – PCIe & Memory; ESB2 – PCI-X, GE etc.
 PCI:
 3 × 8-lane PCIe buses
 3 × 133 MHz PCI-X
 2 Gigabit Ethernet
 SATA
10 GigE Back2Back: UDP Latency
 Motherboard: Supermicro X7DBE
 Chipset: Intel 5000P MCH
 CPU: 2 Dual-Core Intel Xeon 5130, 2 GHz with 4096k L2 cache
 Mem bus: 2 independent 1.33 GHz
 PCI-e 8 lane
 Linux Kernel 2.6.20-web100_pktd-plus
 Myricom NIC 10G-PCIE-8A-R Fibre
 myri10ge v1.2.0 + firmware v1.4.10: rx-usecs=0 (Coalescence OFF), MSI=1, Checksums ON, tx_boundary=4096
 MTU 9000 bytes
[Plots gig6-5_Myri10GE_rxcoal=0: latency (µs) vs message length (bytes), fit y = 0.0028x + 21.937; latency histograms N(t) for 64, 3000 and 8900 byte messages, FWHM ~1-2 µs]
 Latency 22 µs & very well behaved
 Latency Slope 0.0028 µs/byte
 B2B Expect: 0.00268 µs/byte = Mem 0.0004 + PCI-e 0.00054 + 10GigE 0.0008 + PCI-e 0.00054 + Mem 0.0004
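As a quick check of the expected back-to-back slope quoted above, the per-byte contributions simply add along the data path:

```latex
m_{\mathrm{expect}}
= \underbrace{0.0004}_{\mathrm{Mem}}
+ \underbrace{0.00054}_{\mathrm{PCIe}}
+ \underbrace{0.0008}_{\mathrm{10GigE}}
+ \underbrace{0.00054}_{\mathrm{PCIe}}
+ \underbrace{0.0004}_{\mathrm{Mem}}
= 0.00268\ \mu\mathrm{s/byte}
```

This is close to the measured 0.0028 µs/byte.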
10 GigE Back2Back: UDP Throughput
 Kernel 2.6.20-web100_pktd-plus
 Myricom 10G-PCIE-8A-R Fibre
 rx-usecs=25, Coalescence ON
 MTU 9000 bytes
 Max throughput 9.4 Gbit/s
 Notice the rate for 8972 byte packets
 ~0.002% packet loss in 10M packets in receiving host
 Sending host, 3 CPUs idle; for packet spacings <8 µs, 1 CPU is >90% in kernel mode inc ~10% soft int
 Receiving host, 3 CPUs idle; for packet spacings <8 µs, 1 CPU is 70-80% in kernel mode inc ~15% soft int
[Plots gig6-5_myri10GE: recv wire rate (Mbit/s) and %cpu1 kernel (send, receive) vs spacing between frames (0–40 µs), packet sizes 1000–8972 bytes]
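A hedged, udpmon-style sketch of how curves like these are produced: send fixed-size UDP packets with a chosen inter-packet spacing and compute the achieved user-data rate. Illustrative only; this is not the real udpmon code, and the connected UDP socket and the receive/statistics side are assumed:

```c
/* Sketch only: paced UDP sender for throughput-vs-spacing scans. */
#include <string.h>
#include <time.h>
#include <sys/socket.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Returns the achieved user-data rate in Mbit/s. */
double paced_send(int sock, int pkt_bytes, double spacing_us, long n_pkts)
{
    char buf[9000];                       /* jumbo-frame sized payload */
    memset(buf, 0, sizeof buf);

    double start = now_us();
    double next  = start;
    for (long i = 0; i < n_pkts; i++) {
        memcpy(buf, &i, sizeof i);        /* sequence number for loss / re-ordering checks */
        send(sock, buf, pkt_bytes, 0);    /* sock: connected UDP socket */
        next += spacing_us;
        while (now_us() < next)           /* busy-wait to hold the requested spacing */
            ;
    }
    double elapsed_s = (now_us() - start) / 1e6;
    return n_pkts * pkt_bytes * 8.0 / elapsed_s / 1e6;
}
```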
10 GigE UDP Throughput vs packet size
 Motherboard: Supermicro X7DBE
 Linux Kernel 2.6.20-web100_pktd-plus
 Myricom NIC 10G-PCIE-8A-R Fibre
 myri10ge v1.2.0 + firmware v1.4.10: rx-usecs=0, Coalescence ON, MSI=1, Checksums ON, tx_boundary=4096
[Plot gig6-5_myri_udpscan: recv wire rate (Mbit/s) vs size of user data in packet (0–10000 bytes)]
 Steps at 4060 and 8160 bytes, within 36 bytes of 2^n boundaries
 Model data transfer time as t = C + m*Bytes
 C includes the time to set up transfers
 Fit reasonable: C = 1.67 µs, m = 5.4e-4 µs/byte
 Steps consistent with C increasing by 0.6 µs
 The Myricom driver segments the transfers, limiting the DMA to 4096 bytes – PCI-e chipset dependent!
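A small sketch that simply evaluates the transfer-time model quoted above, using the slide's interpretation that C grows by ~0.6 µs for each extra ≤4096-byte DMA segment (the exact 4060/8160-byte step positions are approximated by the 4096-byte boundary here):

```c
/* Sketch only: evaluate t = C + m*Bytes with a 0.6 us step per extra DMA segment. */
#include <stdio.h>

static double transfer_time_us(int bytes)
{
    const double C0   = 1.67;     /* us, setup time for the first segment   */
    const double m    = 5.4e-4;   /* us per byte (fitted value)             */
    const double step = 0.6;      /* us extra setup per additional segment  */
    int extra_segments = (bytes - 1) / 4096;   /* driver DMA limit: 4096 B  */
    return C0 + step * extra_segments + m * bytes;
}

int main(void)
{
    int sizes[] = { 1472, 4060, 4100, 8160, 8200, 8972 };
    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%5d bytes -> %.2f us\n", sizes[i], transfer_time_us(sizes[i]));
    return 0;
}
```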
10 GigE X7DBEX7DBE: TCP iperf
 No packet loss
 MTU 9000
 TCP buffer 256k, BDP ~330k
 Cwnd: SlowStart then slow growth – limited by sender!
 Web100 plots of TCP parameters
 Duplicate ACKs: one event of 3 DupACKs
 Packets Re-Transmitted
 Iperf TCP throughput: 7.77 Gbit/s
OK so it works !!!
ESLEA-FABRIC: 4 Gbit flows over GÉANT2
 Set up a 4 Gigabit Lightpath between GÉANT2 PoPs
 Collaboration with DANTE
 GÉANT2 Testbed London – Prague – London
 PCs in the DANTE London PoP with 10 Gigabit NICs
 VLBI Tests:
 UDP Performance
 Throughput, jitter, packet loss, 1-way delay, stability
 Continuous (days) Data Flows – VLBI_UDP and udpmon
 Multi-Gigabit TCP performance with current kernels
 Multi-Gigabit CBR over TCP/IP
 Experience for FPGA Ethernet packet systems
 DANTE Interests:
 Multi-Gigabit TCP performance
 The effect of (Alcatel 1678 MCC 10GE port) buffer size on bursty TCP using
BW limited Lightpaths
The GÉANT2 Testbed
 10 Gigabit SDH backbone
 Alcatel 1678 MCCs
 GE and 10GE client interfaces
 Node locations: London, Amsterdam, Paris, Prague, Frankfurt
 Can do lightpath routing, so can make paths of different RTT
 Locate the PCs in London
Provisioning the lightpath on ALCATEL MCCs
 Some jiggery-pokery needed with the NMS to force a “looped back” lightpath London-Prague-London
 Manual XCs (using the element manager) possible but hard work – 196 needed + other operations!
 Instead used RM to create two parallel VC-4-28v (single-ended) Ethernet private line (EPL) paths
 Constrained to transit DE
 Then manually joined the paths in CZ
 Only 28 manually created XCs required
Provisioning the lightpath on ALCATEL MCCs
 Paths come up
 (Transient) alarms clear
 Result: provisioned a path of 28 virtually concatenated VC-4s, UK-NL-DE-NL-UK
 Optical path ~4150 km
 With dispersion compensation ~4900 km
 RTT 46.7 ms
 YES!!!
Photos at The PoP
[Photos at the PoP: production SDH, test-bed SDH, 10 GE, production router, optical transport]
4 Gig Flows on GÉANT: UDP Throughput
 Kernel 2.6.20-web100_pktd-plus
 Myricom 10G-PCIE-8A-R Fibre
 rx-usecs=25, Coalescence ON
 MTU 9000 bytes
 Max throughput 4.199 Gbit/s
 Sending host, 3 CPUs idle; for packet spacings <8 µs, 1 CPU is >90% in kernel mode inc ~10% soft int
 Receiving host, 3 CPUs idle; for packet spacings <8 µs, 1 CPU is ~37% in kernel mode inc ~9% soft int
[Plots exp2-1_prag_15May07: recv wire rate (Mbit/s) and %cpu1 kernel (send, receive) vs spacing between frames (0–40 µs), packet sizes 1000–8972 bytes]
4 Gig Flows on GÉANT: 1-way delay
 Kernel 2.6.20-web100_pktd-plus
 Myricom 10G-PCIE-8A-R Fibre
 Coalescence OFF
 GÉANT lightpath (W18 exp1-2_prag_rxcoal0_16May07):
 1-way delay stable at 23,435 µs
 Peak separation 86 µs
 ~40 µs extra delay
 Lab Tests (W20 gig6-g5Cu_myri_MSI_30Mar07):
 Peak separation 86 µs
 ~40 µs extra delay
 Lightpath adds no unwanted effects
[Plots: 1-way delay (µs) vs packet no. for the GÉANT path and the lab back-to-back test]
4 Gig Flows on GÉANT: Jitter hist
 Kernel 2.6.20-web100_pktd-plus
 Myricom 10G-PCIE-8A-R Fibre
 Coalescence OFF
 Peak separation ~36 µs
 Factor 100 smaller
 Lab Tests: Lightpath adds no effects
[Jitter histograms N(t) vs latency (µs), 8900 byte packets, packet separations 100 µs and 300 µs: GÉANT path (exp1-2_rxcoal0_16May07) and lab (gig6-5_Lab_30Mar07)]
4 Gig Flows on GÉANT: UDP Flow Stability
 Kernel 2.6.20-web100_pktd-plus
 Myricom 10G-PCIE-8A-R Fibre
 Coalescence OFF
 MTU 9000 bytes
 Packet spacing 18 µs
 Trials send 10 M packets
 Ran for 26 hours
 Throughput very stable: 3.9795 Gbit/s
 Occasional trials have packet loss ~40 in 10M – investigating
[Plot exp2-1_w18_i500_udpmon_21May: wire rate (Mbit/s, 3979.5–3980.5) vs time during the transfer (0–100,000 s)]
 Our thanks go to all our collaborators
 DANTE really provided “Bandwidth on Demand”
 A record 6 hours! including:
 Driving to the PoP
 Installing the PCs
 Provisioning the Light-path
Any Questions?
Introduction
What is EXPReS?
 EXPReS = Express Production Real-time e-VLBI Service
 Three year project, started March 2006, funded by the
European Commission (DG-INFSO), Sixth Framework
Programme, Contract #026642
 Objective: to create a distributed, large-scale astronomical
instrument of continental and inter-continental dimensions
 Means: high-speed communication networks operating in
real-time and connecting some of the largest and most
sensitive radio telescopes on the planet
 Additional Information
http://expres-eu.org/
http://www.jive.nl
[note: only one “s”]
Introduction
EXPReS Partners
Radio Astronomy Institutes
 Joint Institute for VLBI in Europe (Coordinator), The Netherlands
 Arecibo Observatory, National Astronomy and Ionosphere Center, Cornell University, USA
 Australia Telescope National Facility, a Division of CSIRO, Australia
 Institute of Radioastronomy, National Institute for Astrophysics (INAF), Italy
 Jodrell Bank Observatory, University of Manchester, United Kingdom
 Max Planck Institute for Radio Astronomy (MPIfR), Germany
 Metsähovi Radio Observatory, Helsinki University of Technology (TKK), Finland
 National Center of Geographical Information, National Geographic Institute (CNIG-IGN), Spain
 Hartebeesthoek Radio Astronomy Observatory, National Research Foundation, South Africa
 Netherlands Foundation for Research in Astronomy (ASTRON), NWO, The Netherlands
 Onsala Space Observatory, Chalmers University of Technology, Sweden
 Shanghai Astronomical Observatory, Chinese Academy of Sciences, China
 Torun Centre for Astronomy, Nicolaus Copernicus University, Poland
 Transportable Integrated Geodetic Observatory (TIGO), University of Concepción, Chile
 Ventspils International Radio Astronomy Center, Ventspils University College, Latvia
National Research Networks
 AARNet, Australia
 DANTE, United Kingdom
 Poznan Supercomputing and Networking Center, Poland
 SURFnet, The Netherlands
Introduction
Participating EXPReS Telescopes
Provisioning the lightpath on ALCATEL MCCs
 Create a virtual network element to a planned (non-existing) port in Prague: VNE2
 Define end points:
 Out port 3 in UK & VNE2 CZ
 In port 4 in UK & VNE2 CZ
 Add constraint: to go via DE – otherwise it does OSPF routing
 Set capacity (28 VC-4s)
 Alcatel Resource Manager allocates routing of EXPReS_out VC-4 trails
 Repeat for EXPReS_ret
 Same time slots used in CZ for EXPReS_out & EXPReS_ret paths