TCP/IP and Other Transports for
High Bandwidth Applications
Real Applications on Real Networks
Richard Hughes-Jones
University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Brasov”
What we might cover!
This is what researchers find when they try to use high-performance networks.
 Real Applications on Real Networks
 Disk-2-disk applications on real networks
 Memory-2-memory tests
 Comparison of different data moving applications
 The effect (improvement) of different TCP Stacks
 Transatlantic disk-2-disk at Gigabit speeds
 Remote Computing Farms
 The effect of distance
 Protocol vs implementation
 Radio Astronomy e-VLBI
 Users with data that is random noise!
Thanks to the following for allowing me to use their slides:
Sylvain Ravot (CERN), Les Cottrell (SLAC), Brian Tierney (LBL), Robin Tasker (DL),
Ralph Spencer (Jodrell Bank)
“Server Quality” Motherboards
 SuperMicro P4DP8-2G (P4DP6)
 Dual Xeon
 400/533 MHz front-side bus
 6 PCI / PCI-X slots
 4 independent PCI buses
 64 bit 66 MHz PCI
 100 MHz PCI-X
 133 MHz PCI-X
 Dual Gigabit Ethernet
 Adaptec AIC-7899W dual-channel SCSI
 UDMA/100 bus master EIDE channels
 data transfer rates of 100 MB/sec burst
“Server Quality” Motherboards
 Boston/Supermicro H8DAR
 Two Dual Core Opterons
 200 MHz DDR Memory
 Theoretical BW: 6.4 Gbit/s
 HyperTransport
 2 independent PCI buses
 133 MHz PCI-X
 2 Gigabit Ethernet
 SATA
 (PCI-e)
UK Transfers MB-NG and SuperJANET4
Throughput for real users
Topology of the MB-NG Network
[Diagram: Manchester, UCL and RAL domains, each with Cisco 7609 boundary/edge routers and test hosts (man01-man03, lon01-lon03, ral01-ral02, with HW RAID on the end hosts), interconnected over the UKERNA Development Network with MPLS admin domains. Key: Gigabit Ethernet; 2.5 Gbit PoS access.]
Topology of the Production Network
[Diagram: the production SuperJANET4 path - the Manchester domain (man01, HW RAID, 3 routers, 2 switches) and the RAL domain (ral01, HW RAID, routers and switches). Key: Gigabit Ethernet; 2.5 Gbit PoS access; 10 Gbit PoS.]
iperf Throughput + Web100
 SuperMicro on the MB-NG network:
 HighSpeed TCP
 Line speed: 940 Mbit/s
 DupACKs <10 (expect ~400)
 BaBar on the production network:
 Standard TCP
 425 Mbit/s
 DupACKs 350-400 - re-transmits
(A minimal sketch of this kind of memory-to-memory TCP test follows below.)
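A minimal sketch of the kind of memory-to-memory TCP test that iperf performs is shown below (this is not the iperf source; the receiver address, port and 64 kbyte write size are illustrative placeholders). Pointed at a simple accept-and-discard server, it reports the achieved sending rate; on a Web100-instrumented host the DupACK and retransmit counters discussed above can be examined alongside it.

    /* Minimal memory-to-memory TCP throughput sender, in the spirit of iperf.
     * Host, port and write size are placeholders.  Build: cc -O2 tcpsend.c  */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(int argc, char **argv)
    {
        const char *host = (argc > 1) ? argv[1] : "192.168.0.1"; /* placeholder */
        int port = 5001, secs = 10;
        char buf[64 * 1024];                  /* 64 kbyte application writes  */
        memset(buf, 0xA5, sizeof(buf));

        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_port = htons(port);
        inet_pton(AF_INET, host, &sa.sin_addr);
        if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            perror("connect");
            return 1;
        }

        struct timeval t0, t1;
        long long bytes = 0;
        gettimeofday(&t0, NULL);
        t1 = t0;
        do {                                  /* data comes from memory only  */
            ssize_t n = write(s, buf, sizeof(buf));
            if (n <= 0) break;
            bytes += n;
            gettimeofday(&t1, NULL);
        } while (t1.tv_sec - t0.tv_sec < secs);

        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.1f Mbit/s achieved over %.1f s\n", bytes * 8.0 / dt / 1e6, dt);
        close(s);
        return 0;
    }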
Applications: Throughput Mbit/s
 HighSpeed TCP
 2 GByte file RAID5
 SuperMicro + SuperJANET
 bbcp
 bbftp
 apache
 Gridftp
 Previous work used RAID0
(not disk limited)
bbftp: What else is going on?
 Scalable TCP
 BaBar + SuperJANET
 SuperMicro + SuperJANET
 Congestion window – duplicate ACK
 Variation not TCP related?
 Disk speed / bus transfer
 Application
bbftp: Host & Network Effects
 2 Gbyte file RAID5 Disks:
 1200 Mbit/s read
 600 Mbit/s write
 Scalable TCP
 BaBar + SuperJANET
 Instantaneous 220 - 625 Mbit/s
 SuperMicro + SuperJANET
 Instantaneous 400 - 665 Mbit/s for 6 sec, then 0 - 480 Mbit/s
 SuperMicro + MB-NG
 Instantaneous 880 - 950 Mbit/s for 1.3 sec, then 215 - 625 Mbit/s
Average Transfer Rates Mbit/s
App       TCP Stack   SuperMicro     SuperMicro        BaBar on        SC2004 on
                      on MB-NG       on SuperJANET4    SuperJANET4     UKLight
Iperf     Standard    940            350-370           425             940
          HighSpeed   940            510               570             940
          Scalable    940            580-650           605             940
bbcp      Standard    434            290-310           290
          HighSpeed   435            385               360
          Scalable    432            400-430           380
bbftp     Standard    400-410        325               320             825
          HighSpeed                  370-390           380
          Scalable    430            345-532           380             875
apache    Standard    425            260               300-360
          HighSpeed   430            370               315
          Scalable    428            400               317
Gridftp   Standard    405            240
          HighSpeed   320
          Scalable    335

 Gridftp: the rate decreases with the newer stacks
 Overall, the new stacks give more throughput
Transatlantic Disk to Disk Transfers
With UKLight
SuperComputing 2004
SC2004 UKLIGHT Overview
[Diagram: SC2004 UKLight overview, showing the SLAC and Caltech booths at SC2004 (SCINet, Cisco 6509, Caltech 7600, UltraLight IP), the NLR lambda NLR-PITT-STAR-10GE-16 to Chicago StarLight, UKLight 10G and SURFnet/EuroLink 10G circuits via Amsterdam to ULCC UKLight in London, and onward connections (four 1GE channels and two 1GE channels) to the MB-NG 7600 OSR in Manchester, the UCL network and UCL HEP.]
Collaboration at SC2004
 Setting up the BW Bunker
 Working with S2io, Sun, Chelsio
 SCINet
 The BW Challenge at the SLAC Booth
Transatlantic Ethernet: TCP Throughput Tests
 Supermicro X5DPE-G2 PCs, dual 2.9 GHz Xeon CPU, FSB 533 MHz
 1500 byte MTU, 2.6.6 Linux kernel
 Memory-memory TCP throughput, standard TCP
 Wire rate throughput of 940 Mbit/s
 First 10 sec shown separately
 Work in progress to study:
 Implementation detail
 Advanced stacks
 Effect of packet loss
 Sharing
[Web100 plots: instantaneous BW, average BW and CurCwnd vs time (ms) for the full run and for the first 10 s]
SC2004 Disk-Disk bbftp
 Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
 Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
 Disk-TCP-Disk at 1 Gbit/s
 bbftp file transfer program uses TCP/IP
 UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
 MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
 Move a 2 Gbyte file
[Web100 plots: instantaneous BW, average BW and CurCwnd vs time (ms) for the standard and Scalable TCP transfers]
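For reference, the 22 Mbyte socket size matches the bandwidth-delay product of this path: 1 Gbit/s × 177 ms ÷ 8 bits/byte ≈ 22 Mbytes.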
Network & Disk Interactions (work in progress)
 Hosts:
 Supermicro X5DPE-G2 motherboards
 dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0
 six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
 Measure memory to RAID0 transfer rates with & without UDP traffic
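A minimal sketch of how the memory-to-disk rate can be timed is shown below, assuming a hypothetical RAID0 mount point; the real measurements used the same 64 kbyte writes and 1 Gbyte files, and repeated the trial with UDP traffic arriving on the NIC at the same time.

    /* Time a 1 Gbyte sequential write in 64 kbyte blocks to a file on the
     * RAID0 array.  The mount point is a placeholder.  Build: cc -O2 dwrite.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        const char *path = "/raid0/testfile";   /* placeholder mount point   */
        const size_t blk = 64 * 1024;           /* 64 kbyte writes           */
        const long long total = 1LL << 30;      /* 1 Gbyte                   */
        char *buf = malloc(blk);
        memset(buf, 0x5A, blk);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (long long done = 0; done < total; done += blk)
            if (write(fd, buf, blk) != (ssize_t)blk) { perror("write"); return 1; }
        fsync(fd);                              /* include the flush to disk */
        gettimeofday(&t1, NULL);

        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("wrote %lld bytes in %.2f s = %.0f Mbit/s\n",
               total, dt, total * 8.0 / dt / 1e6);
        close(fd);
        free(buf);
        return 0;
    }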
 Disk write alone: 1735 Mbit/s
 Disk write + 1500-byte MTU UDP: 1218 Mbit/s - a drop of 30%
 Disk write + 9000-byte MTU UDP: 1400 Mbit/s - a drop of 19%
[Plots: RAID0 (6 disks, 1 Gbyte, 64 kbyte writes, 3w8506-8) throughput (Mbit/s) vs trial number for the three cases, and % CPU system mode on CPUs 3+4 vs CPUs 1+2 for 8k and 64k writes with the line y = 178 - 1.05x]
Remote Computing Farms in the
ATLAS TDAQ Experiment
Remote Computing Concepts
[Diagram: remote computing concept for ATLAS - the detectors and Level 1 trigger deliver ~PByte/sec to the ROBs; the Data Collection Network feeds the Level 2 trigger (L2PUs) and the SFI event builders; a Back End Network connects the SFIs to local event processing farms (PFs) in the experimental area and, over GÉANT lightpaths, to remote event processing farms at Copenhagen, Edmonton, Krakow and Manchester; accepted events go via the SFOs to mass storage at CERN B513 at 320 MByte/sec.]
ATLAS Remote Farms – Network Connectivity
ATLAS Application Protocol
Message sequence between the Event Filter Daemon (EFD) and the SFI / SFO (the request-response time is histogrammed):
 EFD requests an event from the SFI
 SFI replies with the event (~2 Mbytes)
 Processing of the event
 Return of the computation:
 EF asks the SFO for buffer space
 SFO sends OK
 EF transfers the results of the computation (the processed event)
 tcpmon, an instrumented TCP request-response program, emulates the EFD-to-SFI communication (a minimal sketch of such a loop follows).
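A minimal sketch of the request-response loop that tcpmon emulates is given below; the SFI address, port and ~2 Mbyte event size are placeholders, and the real EFD/SFI message formats are not reproduced.

    /* Sketch of a tcpmon-style request-response loop: send a 64-byte request,
     * read a ~2 Mbyte event, and time each exchange.                         */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    static double now_ms(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e3 + tv.tv_usec / 1e3;
    }

    int main(void)
    {
        const char *host = "192.168.0.1";           /* placeholder SFI address */
        const size_t event_size = 2 * 1024 * 1024;  /* ~2 Mbyte event          */
        char request[64], buf[64 * 1024];
        memset(request, 0, sizeof(request));

        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(10000) };
        inet_pton(AF_INET, host, &sa.sin_addr);
        if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            perror("connect");
            return 1;
        }

        for (int ev = 0; ev < 100; ev++) {
            double t0 = now_ms();
            write(s, request, sizeof(request));     /* "request event"         */
            size_t got = 0;
            while (got < event_size) {              /* read the whole event    */
                ssize_t n = read(s, buf, sizeof(buf));
                if (n <= 0) return 1;
                got += n;
            }
            printf("event %d: request-response time %.1f ms\n", ev, now_ms() - t0);
            /* ...process the event, then ask the SFO for buffer space...      */
        }
        close(s);
        return 0;
    }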
Using Web100 TCP Stack Instrumentation
to analyse application protocol - tcpmon
tcpmon: TCP Activity Manc-CERN Req-Resp
 Round trip time 20 ms
 64-byte Request (green), 1 Mbyte Response (blue)
 TCP in slow start: the 1st event takes 19 rtt or ~380 ms
 TCP congestion window gets re-set on each request - a TCP stack implementation detail that reduces Cwnd after inactivity
 Even after 10 s, each response takes 13 rtt or ~260 ms
 Transfer achievable throughput: 120 Mbit/s
[Web100 plots: DataBytesOut / DataBytesIn (delta), CurCwnd and TCP achieved Mbit/s vs time (ms)]
tcpmon: TCP Activity Manc-CERN Req-Resp
TCP stack tuned
 Round trip time 20 ms
 64-byte Request (green), 1 Mbyte Response (blue)
 TCP starts in slow start: the 1st event takes 19 rtt or ~380 ms
 TCP congestion window grows nicely
 Response takes 2 rtt after ~1.5 s
 Rate ~10/s (with 50 ms wait)
 Transfer achievable throughput grows to 800 Mbit/s
[Web100 plots: DataBytesOut / DataBytesIn (delta), PktsOut / PktsIn (delta), CurCwnd and TCP achieved Mbit/s vs time (ms)]
tcpmon: TCP Activity Alberta-CERN Req-Resp
TCP stack tuned
 Round trip time 150 ms
 64-byte Request (green), 1 Mbyte Response (blue)
 TCP starts in slow start: the 1st event takes 11 rtt or ~1.67 s
 TCP congestion window in slow start to ~1.8 s, then congestion avoidance
 Response in 2 rtt after ~2.5 s
 Rate 2.2/s (with 50 ms wait)
 Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
[Web100 plots: DataBytesOut / DataBytesIn (delta), PktsOut / PktsIn (delta), CurCwnd and TCP achieved Mbit/s vs time (ms)]
Time Series of Request-Response Latency
 Manchester - CERN: round trip time 20 ms, 1 Mbyte of data returned
 Stable for ~18 s at ~42.5 ms, then alternate points at 29 and 42.5 ms
 Alberta - CERN: round trip time 150 ms, 1 Mbyte of data returned
 Stable for ~150 s at 300 ms, then falls to 160 ms with ~80 µs variation
[Plots: round-trip latency (ms) vs request time (s) for both paths, with a zoom on the 160 ms region]
Using the Trigger DAQ Application
Time Series of T/DAQ event rate
 Manchester - CERN: round trip time 20 ms, 1 Mbyte of data returned
 Remote nodes: 3 nodes = one Gigabit Ethernet node + two 100 Mbit nodes; 2 nodes = two 100 Mbit nodes; 1 node = one 100 Mbit node
 Event rate: using the tcpmon transfer time of ~42.5 ms and adding the time to return the data gives ~95 ms per event, so the expected rate is ~10.5/s (1/0.095 s)
 Observe ~6/s for the gigabit node
 Reason: the TCP buffers could not be set large enough in the T/DAQ application
[Plots: event rate (events/s) vs time (s), and frequency histogram of events/s vs number of remote nodes]
Tcpdump of the Trigger DAQ Application
tcpdump of the T/DAQ dataflow at SFI (1)
 CERN-Manchester, 1.0 Mbyte event
 Remote EFD requests the event from the SFI: an incoming event request, followed by an ACK
 SFI sends the event as N 1448-byte packets, limited by the TCP receive buffer; as TCP ACKs arrive, more data is sent
 Time 115 ms (~4 ev/s)
Tcpdump of TCP Slowstart at SFI (2)
 CERN-Manchester, 1.0 Mbyte event
 Remote EFD requests the event from the SFI (the first event request)
 SFI sends the event as N 1448-byte packets, limited by TCP slow start; as ACKs arrive, more data is sent
 Time 320 ms
tcpdump of the T/DAQ dataflow for SFI &SFO
 CERN-Manchester - another test run
 1.0 Mbyte event
 Remote EFD requests events from SFI
 Remote EFD sending computation back to SFO
 Links closed by Application
Link setup &
TCP slowstart
Some Conclusions
 The TCP protocol dynamics strongly influence the behaviour of the application.
 Care is required with the application design, e.g. the use of timeouts.
 With the correct TCP buffer sizes:
 It is not throughput but the round-trip nature of the application protocol that determines performance.
 Requesting the 1-2 Mbytes of data takes 1 or 2 round trips.
 TCP slow start (the opening of Cwnd) considerably lengthens the time for the first block of data.
 Implementation "improvements" (Cwnd reduction after inactivity) kill performance!
 When the TCP buffer sizes are too small (the default):
 The amount of data sent is limited on each rtt.
 Data is sent, and arrives, in bursts.
 It takes many round trips to send 1 or 2 Mbytes.
 (A sketch of setting the buffer sizes follows these conclusions.)
 The end hosts themselves:
 CPU power is required for the TCP/IP stack as well as for the application.
 Packets can be lost in the IP stack due to lack of processing power.
 Although the application is ATLAS-specific, the network interactions are applicable to other areas, including:
 Remote iSCSI
 Remote database accesses
 Real-time Grid computing - e.g. real-time interactive medical image processing
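A sketch of what "the correct TCP buffer sizes" means in practice: set the socket buffers to at least the bandwidth-delay product before connecting. The 1 Gbit/s rate and 20 ms rtt below are the Manchester-CERN figures used above; note that the kernel also caps per-socket requests via net.core.rmem_max / net.core.wmem_max.

    /* Size the TCP socket buffers to the bandwidth-delay product before
     * connecting.  Rate and rtt here are illustrative (1 Gbit/s, 20 ms).    */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        double rate_bps = 1e9;       /* expected path rate: 1 Gbit/s         */
        double rtt_s    = 0.020;     /* round trip time: 20 ms               */
        int bdp = (int)(rate_bps * rtt_s / 8.0);   /* ~2.5 Mbyte             */

        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0 ||
            setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp)) < 0)
            perror("setsockopt");

        int granted;
        socklen_t len = sizeof(granted);
        getsockopt(s, SOL_SOCKET, SO_SNDBUF, &granted, &len);
        printf("requested %d bytes, kernel granted %d bytes\n", bdp, granted);
        /* ...connect() and run the request-response protocol as usual...    */
        return 0;
    }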
Radio Astronomy
e-VLBI
Radio Astronomy
with help from Ralph Spencer Jodrell Bank
 The study of celestial objects at <1 mm to >1m wavelength.
 Sensitivity for continuum sources ∝ 1/√(B·t), where B = bandwidth and t = integration time.
 High resolution achieved by interferometers.
Some radio-emitting X-ray binary stars in our own galaxy:
GRS 1915+105 (MERLIN), SS433 (MERLIN and European VLBI), Cygnus X-1 (VLBA)
Earth-Rotation Synthesis
and Fringes
Telescope data are correlated in pairs: N(N-1)/2 baselines (e.g. 6 telescopes give 15 baselines).
Fringes are obtained with the correct signal phase.
Merlin u-v coverage
Need ~12 hours for full synthesis, though not necessarily collecting data for all that time.
NB There is a trade-off between B and t for sensitivity.
The European VLBI Network: EVN
 Detailed radio imaging uses antenna networks over 100s-1000s of km
 At the faintest levels, the sky teems with galaxies being formed
 Radio penetrates cosmic dust - we see the process clearly
 Telescopes in place …
 Disk recording at 512Mb/s
 Real-time connection allows
greater:
 response
 reliability
 sensitivity
EVN-NREN
[Map: EVN-NREN connectivity - Gbit link to Onsala, Sweden via Chalmers University of Technology, Gothenburg; Gbit link to Torun, Poland; dedicated Gbit link to Jodrell Bank, UK (with MERLIN and Cambridge, UK); Westerbork, Netherlands; DWDM link to Medicina, Italy; all converging on the correlator at Dwingeloo.]
UDP Throughput Manchester-Dwingeloo
(Nov 2003)
 Throughput vs packet spacing
 Manchester: 2.0 GHz Xeon; Dwingeloo: 1.2 GHz PIII
 Near wire rate: 950 Mbit/s
 NB the record stands at 6.6 Gbit/s (SLAC-CERN)
 Packet loss
 CPU kernel load on sender and receiver
 4th-year project: Adam Mathews, Steve O'Toole
[Plots (Gnt5-DwMk5 11 Nov 03 / DwMk5-Gnt5 13 Nov 03, 1472-byte packets): received wire rate (Mbit/s), % packet loss, and % kernel CPU load for sender and receiver, each vs spacing between frames (µs)]
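For reference, a 1472-byte UDP payload occupies 1538 bytes on the Gigabit Ethernet wire once the UDP/IP/Ethernet headers, preamble and inter-frame gap are added, i.e. ~12.3 µs per frame at 1 Gbit/s; spacings below ~12 µs therefore correspond to full wire rate.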
Packet loss distribution:
 Cumulative distribution P(>t) = ∫_t^∞ p(t′) dt′
 Long-range effects in the data? Compared with a Poisson expectation.
Cumulative distribution of packet loss; each bin is 12 ms wide.
26th January 2005 UDP Tests
Simon Casey (PhD project)
Between JBO and JIVE in Dwingeloo, using production network
Period of high packet loss (3%):
The GÉANT2 Launch June 2005
e-VLBI at the GÉANT2 Launch Jun 2005
[Map: e-VLBI telescopes used for the GÉANT2 launch - Jodrell Bank (UK), Medicina (Italy) via a DWDM link, and Torun (Poland), connected to the correlator at Dwingeloo.]
e-VLBI UDP Data Streams
UDP Performance: 3 Flows on GÉANT
 Throughput, 5-hour run:
 Jodrell → JIVE: 2.0 GHz dual Xeon → 2.4 GHz dual Xeon, 670-840 Mbit/s
 Medicina (Bologna) → JIVE: 800 MHz PIII → mark623 1.2 GHz PIII, 330 Mbit/s, limited by the sending PC
 Torun → JIVE: 2.4 GHz dual Xeon → mark575 1.2 GHz PIII, 245-325 Mbit/s, limited by security policing (>400 Mbit/s → 20 Mbit/s)?
 Throughput over a 50-minute window: the period of the variation is ~17 min
[Plots (BW 14Jun05): received wire rate (Mbit/s) vs time (10 s steps) for the Jodrell, Medicina and Torun flows, full run and 50-minute zoom]
UDP Performance: 3 Flows on GÉANT
 Packet loss & re-ordering:
 Jodrell (2.0 GHz Xeon): loss 0-12%, re-ordering significant
 Medicina (800 MHz PIII): loss ~6%, re-ordering insignificant
 Torun (2.4 GHz Xeon): loss 6-12%, re-ordering insignificant
[Plots (jbgig1-jivegig1, Medicina, Torun, 14Jun05): number of packets lost and number re-ordered vs time (10 s bins)]
18 Hour Flows on UKLight
Jodrell – JIVE, 26 June 2005
 Traffic routed through SURFnet
 Throughput - Jodrell → JIVE: 2.4 GHz dual Xeon → 2.4 GHz dual Xeon, 960-980 Mbit/s
 Packet loss: only 3 groups with 10-150 lost packets each; no packets lost the rest of the time
 Packet re-ordering: none
[Plots (man03-jivegig1_26Jun05, stream w10): received wire rate (Mbit/s) vs time (10 s steps), a zoom on the 950-1000 Mbit/s band, and packet loss (log scale) vs time]
Summary, Conclusions
 The host is critical: motherboards, NICs, RAID controllers and disks all matter
 The NICs should be well designed:
 NIC should use 64 bit 133 MHz PCI-X (66 MHz PCI can be OK)
 NIC/drivers: CSR access / clean buffer management / good interrupt handling
 Worry about the CPU-memory bandwidth as well as the PCI bandwidth
 Data crosses the memory bus at least 3 times
 Separate the data transfers - use motherboards with multiple 64 bit PCI-X buses
 32 bit 33 MHz is too slow for Gigabit rates
 64 bit 33 MHz is already >80% used
 Choose a modern high-throughput RAID controller
 Consider SW RAID0 of RAID5 HW controllers
 Need plenty of CPU power for sustained 1 Gbit/s transfers
 Packet loss is a killer
 Check on campus links & equipment, and access links to backbones
 New stacks are stable and give better response & performance
 Still need to set the TCP buffer sizes! (see the note below)
 Check other kernel settings, e.g. window scaling
 Application architecture & implementation are also important
 The interaction between HW, protocol processing and the disk sub-system is complex
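On Linux the kernel settings referred to here are typically net.ipv4.tcp_rmem / net.ipv4.tcp_wmem and net.core.rmem_max / net.core.wmem_max for the buffer sizes, and net.ipv4.tcp_window_scaling for window scaling; a per-socket alternative using setsockopt(SO_SNDBUF / SO_RCVBUF) is sketched after the ATLAS "Some Conclusions" section above.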
More Information - Some URLs
 Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
 UKLight web site: http://www.uklight.ac.uk
 DataTAG project web site: http://www.datatag.org/
 UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
 Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
 "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
 TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
 TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004: http://www.hep.man.ac.uk/~rich/ (Publications)
 PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
 Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Any Questions?
Backup Slides
Latency Measurements
 UDP/IP packets sent between back-to-back systems
 Processed in a similar manner to TCP/IP
 Not subject to flow control & congestion avoidance algorithms
 Used UDPmon test program
 Latency
 Round trip times measured using Request-Response UDP frames
 Latency as a function of frame size
 Slope is given by: s = Σ_(data paths) 1/(db/dt), the sum over the data paths of the reciprocal of each path's data rate
 The data paths are: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
 Intercept indicates: processing times + HW latencies
 Histograms of 'singleton' measurements
 Tell us about:
 Behaviour of the IP stack
 The way the HW operates
 Interrupt coalescence
(A minimal sketch of such a UDP request-response latency measurement follows.)
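A minimal sketch of such a round-trip latency measurement (not the UDPmon code itself) is shown below; the responder address, port and frame size are placeholders, and the remote end is assumed simply to echo each frame back.

    /* UDP request-response latency: send an n-byte frame to an echo
     * responder and time the round trip with gettimeofday().               */
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        const char *host = "192.168.0.2";      /* placeholder responder      */
        int n = 64;                            /* frame size under test      */
        char frame[1472];
        memset(frame, 0, sizeof(frame));

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(14196) };
        inet_pton(AF_INET, host, &sa.sin_addr);

        for (int i = 0; i < 1000; i++) {       /* 1000 'singleton' measurements */
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            sendto(s, frame, n, 0, (struct sockaddr *)&sa, sizeof(sa));
            recvfrom(s, frame, sizeof(frame), 0, NULL, NULL);  /* echoed frame */
            gettimeofday(&t1, NULL);
            double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
            printf("%d %.1f\n", n, us);        /* histogram these offline     */
        }
        return 0;
    }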
Throughput Measurements
 UDP Throughput
 Send a controlled stream of UDP frames spaced at regular intervals
Sender-receiver protocol:
 Zero stats → OK done
 Send n-byte data frames at regular intervals (a fixed wait time between packets); the sender records the time to send, the receiver records the time to receive and histograms the inter-packet time
 Signal the end of the test, then get the remote statistics
 Receiver sends back: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. of interrupts, 1-way delay → OK done
(A sketch of the paced sender follows.)
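A minimal sketch of the paced sender side (not the UDPmon code itself): receiver address, port, frame size and spacing are placeholders, and a real test would also carry the zero-stats / get-stats control exchange described above. Pacing here is a simple busy-wait on gettimeofday().

    /* Paced UDP sender: a stream of n-byte frames with a fixed wait time
     * between sends, plus a sequence number for loss/re-order checks.       */
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    static double now_us(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
    }

    int main(void)
    {
        const char *host = "192.168.0.2";  /* placeholder receiver           */
        int n = 1472, npkts = 100000;      /* frame size and packet count    */
        double spacing_us = 15.0;          /* wait time between frames       */
        char frame[1472];
        memset(frame, 0, sizeof(frame));

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(14196) };
        inet_pton(AF_INET, host, &sa.sin_addr);

        double start = now_us(), next = start;
        for (int i = 0; i < npkts; i++) {
            memcpy(frame, &i, sizeof(i));  /* sequence number in the payload */
            sendto(s, frame, n, 0, (struct sockaddr *)&sa, sizeof(sa));
            next += spacing_us;
            while (now_us() < next)        /* busy-wait to hold the spacing  */
                ;
        }
        double dt = now_us() - start;
        printf("sent %d frames in %.0f us: %.1f Mbit/s of UDP payload\n",
               npkts, dt, npkts * n * 8.0 / dt);
        return 0;
    }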
PCI Bus & Gigabit Ethernet Activity
 PCI activity measured with a logic analyser:
 PCI probe cards in the sending PC
 Gigabit Ethernet fibre probe card
 PCI probe cards in the receiving PC
[Diagram: CPU - chipset - memory and the PCI bus with the NIC in each host, linked by Gigabit Ethernet; the probes feed the logic analyser display, and the possible bottlenecks are marked]
End Hosts & NICs CERN-nat-Manc.
 Use UDP packets to characterise the host, NIC & network: throughput, packet loss, re-ordering, request-response latency
 Test host: SuperMicro P4DP8 motherboard, dual Xeon 2.2 GHz CPU, 400 MHz system bus, 64 bit 66 MHz PCI / 133 MHz PCI-X bus
 The network can sustain 1 Gbit/s of UDP traffic
 The average server can lose smaller packets
 Packet loss is caused by lack of power in the PC receiving the traffic
 Out-of-order packets are due to WAN routers; lightpaths look like extended LANs and show no re-ordering
[Plots (pcatb121-nat-gig6, 13 Aug 04): request-response latency histograms for 256-, 512- and 1400-byte packets (~21 ms); received wire rate (Mbit/s), % packet loss and number re-ordered vs spacing between frames (µs) for 50-1472 byte packets]
TCP (Reno) – Details
 Time for TCP (Reno) to recover its throughput from 1 lost packet is given by:
      τ = C · RTT² / (2 · MSS)
 where C is the link capacity; for an rtt of ~200 ms this is of order 2 min
 Typical rtts: UK 6 ms, Europe 20 ms, USA 150 ms
[Plot: time to recover (s, log scale) vs rtt (ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links]
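To get a feel for these numbers, the short program below simply evaluates the formula for the five link speeds in the plot and a few representative rtts, assuming a 1460-byte MSS.

    /* Evaluate t = C * RTT^2 / (2 * MSS) for several link speeds and rtts. */
    #include <stdio.h>

    int main(void)
    {
        double mss_bits = 1460 * 8.0;
        double rates[] = { 10e6, 100e6, 1e9, 2.5e9, 10e9 };   /* C in bit/s  */
        double rtts[]  = { 0.006, 0.020, 0.150, 0.200 };      /* UK, Europe, USA, ~200 ms */

        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 4; j++) {
                double t = rates[i] * rtts[j] * rtts[j] / (2.0 * mss_bits);
                printf("C = %7.1f Mbit/s  rtt = %3.0f ms  recovery = %9.2f s\n",
                       rates[i] / 1e6, rtts[j] * 1e3, t);
            }
        return 0;
    }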
Network & Disk Interactions
 Disk write (mem → disk): 1735 Mbit/s; the kernel-mode load tends to sit in 1 die (straight-line fits y ≈ -1.02x + 215.6 and y ≈ -1.05x + 206.5)
 Disk write + UDP (1500-byte MTU), mem → disk: 1218 Mbit/s; both dies at ~80%
 Disk write + CPU load, mem → disk: 1334 Mbit/s; 1 CPU at ~60%, the other at ~20%; all CPUs saturated in user mode
 Disk write + CPU ↔ mem transfers, mem → disk: 1341 Mbit/s; 1 CPU at ~60%, the other at ~20%; large user-mode usage; below the cut = high BW, and high BW = die 1 used
[Plots: RAID0 (6 disks, 1 Gbyte writes, 3w8506-8) throughput vs trial number, and % CPU system mode on CPUs 3+4 vs CPUs 1+2 for 8k and 64k writes in each case, with the cut line y = 178 - 1.05x]