High Performance Networking for ALL

Members of GridPP are in many network collaborations, including:
Close links with:
MB-NG
SLAC
UKERNA, SURFnet and other NRNs
Dante
Internet2
StarLight, NetherLight
GGF
RIPE
Industry
…
Network Monitoring [1]
Architecture
DataGrid WP7 code extended by Gareth Manc
Technology transfer to UK e-Science
Developed by Mark Lees DL
Fed back into DataGrid by Gareth
Links to: GGF NM-WG, Dante, Internet2 Characteristics, Schema & web services
Success
Network Monitoring [2]
[Plots: TCP throughput measured with iperf, 24 Jan to 4 Feb 04: RAL to HEP sites (only 2 sites >80 Mbit/s) and DL to HEP sites]
HELP!
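A minimal sketch of the kind of point-to-point test behind these plots, assuming the standard iperf client/server pair is available; the host name, window size and duration are illustrative values, not the ones used in the GridPP monitoring.

```python
# Drive a single iperf TCP test from Python and print its report.
# Assumes `iperf -s` is already running on the remote host.
import subprocess

REMOTE = "gw.hep.example.ac.uk"   # hypothetical target site

result = subprocess.run(
    ["iperf", "-c", REMOTE,       # client mode towards REMOTE
     "-w", "1M",                  # request a 1 MByte TCP window
     "-t", "20",                  # 20 second test
     "-f", "m"],                  # report in Mbit/s
    capture_output=True, text=True, check=True)
print(result.stdout)              # last line carries the Mbit/s figure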
High bandwidth, long distance… Where is my throughput?
Robin Tasker
CCLRC, Daresbury Laboratory, UK
[r.tasker@dl.ac.uk]
DataTAG is a project sponsored by the European Commission - EU Grant IST-2001-32459
RIPE-47, Amsterdam, 29 January 2004
Throughput… What’s the problem?
One Terabyte of data transferred in less than an hour
On February 27-28 2003, the transatlantic DataTAG network was extended, i.e. CERN - Chicago - Sunnyvale (>10,000 km). For the first time, a terabyte of data was transferred across the Atlantic in less than one hour using a single TCP (Reno) stream. The transfer was accomplished from Sunnyvale to Geneva at a rate of 2.38 Gbits/s.
Internet2 Land Speed Record
On October 1 2003, DataTAG set a new Internet2 Land Speed Record by transferring 1.1 Terabytes of data in less than 30 minutes from Geneva to Chicago across the DataTAG provision, corresponding to an average rate of 5.44 Gbits/s using a single TCP (Reno) stream.
So how did we do that?
Management of the End-to-End Connection
Memory-to-Memory transfer; no disk system involved
Processor speed and system bus characteristics
TCP Configuration – window size and frame size (MTU)
Network Interface Card and associated driver and their configuration
End-to-End “no loss” environment from CERN to Sunnyvale!
At least a 2.5 Gbits/s capacity pipe on the end-to-end path
A single TCP connection on the end-to-end path
No real user application
That’s to say - not the usual User experience!
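For scale, here is a back-of-envelope sketch of the window and socket-buffer sizing implied by the 2.5 Gbits/s pipe above; the ~180 ms round-trip time is an assumed figure for the CERN - Sunnyvale path, used purely for illustration.

```python
# Bandwidth-delay product: the amount of data that must be "in flight"
# to keep a long, fat pipe full with a single TCP stream.
capacity_bps = 2.5e9      # 2.5 Gbit/s capacity quoted above
rtt_s        = 0.18       # ~180 ms round trip (assumed value)

bdp_bytes = capacity_bps * rtt_s / 8
print(f"bandwidth-delay product ~ {bdp_bytes / 1e6:.0f} MBytes")
# ~56 MBytes: the TCP window and socket buffers must be of this order,
# far larger than typical default settings of a few tens of kBytes.
```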
Realistically – what’s the problem & why do network research?
End System Issues
Network Interface Card and Driver and their configuration
TCP and its configuration
Operating System and its configuration
Disk System
Processor speed
Bus speed and capability
Network Infrastructure Issues
Obsolete network equipment
Configured bandwidth restrictions
Topology
Security restrictions (e.g., firewalls)
Sub-optimal routing
Transport Protocols
Network Capacity and the influence of Others!
Many, many TCP connections
Mice and Elephants on the path
Congestion
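As one concrete illustration of the "TCP and its configuration" point above, a minimal sketch of requesting large socket buffers from an application. The 8 MByte figure is an arbitrary example, and on Linux the kernel silently caps the request at the net.core.rmem_max / wmem_max sysctl limits unless those are raised as well.

```python
# Request large TCP socket buffers before connecting; the value actually
# granted (Linux doubles the request for bookkeeping) can be read back.
import socket

BUF = 8 * 1024 * 1024   # 8 MBytes, chosen to exceed the path's BDP (example value)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF)
print("granted send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```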
End Hosts: Buses, NICs and Drivers
Use UDP packets to characterise the Intel PRO/10GbE Server Adaptor
 SuperMicro P4DP8-G2 motherboard
 Dual Xeon 2.2 GHz CPUs
 400 MHz System bus
 133 MHz PCI-X bus
[Plots: UDP throughput, latency and PCI bus activity measurements]
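A rough stand-alone sketch of the idea behind this kind of test (the real measurements use purpose-built tools such as UDPmon): blast fixed-size UDP datagrams at a receiver and time them at the far end. Port, packet count and payload size are illustrative.

```python
# Very small UDP throughput probe: run "recv" on one host, then
# "send <receiver>" on the other. 1472-byte payloads fill a 1500-byte MTU frame.
import socket, sys, time

PORT, NPKTS, SIZE = 14000, 10000, 1472

if sys.argv[1] == "recv":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    s.recvfrom(SIZE)                 # first packet starts the clock
    t0, got = time.time(), 1
    s.settimeout(2.0)
    try:
        while got < NPKTS:
            s.recvfrom(SIZE)
            got += 1
    except socket.timeout:
        pass
    dt = time.time() - t0
    print(f"received {got}/{NPKTS} packets,"
          f" user-data rate {got * SIZE * 8 / dt / 1e6:.1f} Mbit/s")
else:                                # "send <receiver-host>"
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * SIZE
    for _ in range(NPKTS):
        s.sendto(payload, (sys.argv[2], PORT))
```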
End Hosts: Understanding NIC Drivers
Linux driver basics – TX
Application system call
Encapsulation in UDP/TCP and IP headers
Enqueue on device send queue
Driver places information in DMA descriptor ring
NIC reads data from main memory via DMA and sends it on the wire
NIC signals to the processor that the TX descriptor has been sent
Linux driver basics – RX
NIC places data in main memory via DMA to a free RX descriptor
NIC signals that the RX descriptor has data
Driver passes the frame to the IP layer and cleans the RX descriptor
IP layer passes data to the application
Linux NAPI driver model
 On receiving a packet, the NIC raises an interrupt
 Driver switches off RX interrupts and schedules an RX DMA ring poll
 Frames are pulled off the DMA ring and processed up to the application
 When all frames are processed, RX interrupts are re-enabled
 Dramatic reduction in RX interrupts under load
 Improving the performance of a Gigabit Ethernet driver under Linux
http://datatag.web.cern.ch/datatag/papers/drafts/linux_kernel_map/
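A toy model (plain Python, not kernel code) of why the NAPI scheme cuts the interrupt count: while the poll loop is draining the ring, newly arrived packets are picked up without further interrupts. The arrival pattern and the 50 µs "still polling" window are assumptions made only for illustration.

```python
# Compare interrupt counts: one interrupt per packet vs a NAPI-style scheme
# where the interrupt is only re-armed once the poll loop finds the ring empty.
import random

random.seed(1)
arrivals = sorted(random.uniform(0.0, 1.0) for _ in range(100_000))  # ~100k pkts/s

irq_per_packet = len(arrivals)

POLL_WINDOW = 50e-6          # assume the poll loop stays active ~50 us per packet
irq_napi, i = 0, 0
while i < len(arrivals):
    irq_napi += 1                                  # interrupt fires for the first packet
    busy_until = arrivals[i] + POLL_WINDOW
    while i < len(arrivals) and arrivals[i] <= busy_until:
        busy_until = arrivals[i] + POLL_WINDOW     # keep polling while packets keep coming
        i += 1

print(f"per-packet interrupts: {irq_per_packet}")
print(f"NAPI-style interrupts: {irq_napi}")
```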
Protocols: TCP (Reno) – Performance
 AIMD and High Bandwidth – Long Distance networks
Poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion control algorithm.
 For each ACK in an RTT without loss:
   cwnd -> cwnd + a/cwnd    (Additive Increase, a = 1)
 For each window experiencing loss:
   cwnd -> cwnd - b·cwnd    (Multiplicative Decrease, b = 1/2)
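To see why these rules struggle on a high bandwidth, long distance path, the recovery time after a single loss follows directly from them (a standard observation; the 2.5 Gbits/s and ~180 ms figures are simply the record-run path used as an example):

$$
W = \frac{C \cdot RTT}{MSS} \approx \frac{2.5\,\mathrm{Gbit/s} \times 0.18\,\mathrm{s}}{1500\,\mathrm{bytes}} \approx 3.8\times10^{4}\ \mathrm{segments},
\qquad
T_{\mathrm{recover}} \approx \frac{W}{2}\,RTT \approx 3.4\times10^{3}\,\mathrm{s} \approx 1\ \mathrm{hour}
$$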
Protocols: HighSpeed TCP & Scalable TCP
 Adjusting the AIMD Algorithm – TCP Reno
 For each ACK in an RTT without loss:
   cwnd -> cwnd + a/cwnd    (Additive Increase, a = 1)
 For each window experiencing loss:
   cwnd -> cwnd - b·cwnd    (Multiplicative Decrease, b = 1/2)
 HighSpeed TCP
a and b vary depending on the current cwnd, where
 a increases more rapidly with larger cwnd and, as a consequence, returns to the ‘optimal’ cwnd size sooner for the network path; and
 b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
 Scalable TCP
a and b are fixed adjustments for the increase and decrease of cwnd, such that the increase is greater than for TCP Reno and the decrease on loss is less than for TCP Reno.
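A toy per-RTT sketch of the difference this makes after a single loss, comparing Reno with Scalable TCP. The Scalable constants used here (roughly a = 0.01 per ACK, b = 1/8) are the published values but should be treated as assumptions of this sketch; HighSpeed TCP is left out because its a and b come from a cwnd-dependent table rather than fixed constants.

```python
# Per-RTT congestion-window model: how many RTTs to regain the pre-loss
# window after one loss event, for a window sized to a 2.5 Gbit/s, 180 ms path.

def reno(cwnd, loss):
    return cwnd / 2 if loss else cwnd + 1          # a = 1 segment per RTT, b = 1/2

def scalable(cwnd, loss):
    return cwnd * 0.875 if loss else cwnd * 1.01   # b = 1/8; a = 0.01/ACK => x1.01 per RTT

def rtts_to_recover(rule, target):
    cwnd, rtts = rule(target, True), 0             # apply the loss, then climb back
    while cwnd < target:
        cwnd, rtts = rule(cwnd, False), rtts + 1
    return rtts

TARGET = 37_500   # segments: ~2.5 Gbit/s * 180 ms / 1500 bytes
for name, rule in (("Reno", reno), ("Scalable", scalable)):
    print(f"{name:8s}: ~{rtts_to_recover(rule, TARGET)} RTTs to recover after one loss")
```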
Protocols: HighSpeed TCP & Scalable TCP
 HighSpeed TCP
 Scalable TCP
HighSpeed TCP implemented by Gareth Manc
Success
Scalable TCP implemented by Tom Kelly Camb
Integration of stacks into DataTAG Kernel Yee UCL + Gareth
Some Measurements of Throughput: CERN - SARA
Using the GÉANT Backup Link
 1 GByte file transfers
 Blue: data, Red: TCP ACKs
[Plots: achievable throughput (Mbit/s) vs time for three TCP stacks]
 Standard TCP (txlen 100, 25 Jan 03): average throughput 167 Mbit/s
 HighSpeed TCP (txlen 2000, 26 Jan 03): average throughput 345 Mbit/s
 Scalable TCP (txlen 2000, 27 Jan 03): average throughput 340 Mbit/s
 Users see 5 - 50 Mbit/s!
Users, The Campus & the MAN [1]
Pete White, Pat Meyrs
 NNW to SJ4 access: 2.5 Gbit PoS – hits 1 Gbit (50%)
 Man to NNW access: 2 × 1 Gbit Ethernet
Users, The Campus & the MAN [2]
[Plots: ULCC-JANET traffic on 30/1/2004, and daily in/out traffic (Mbit/s) to SJ4 and to sites 1 and 2 over 24/01/2004 - 31/01/2004]
 LMN to site 1 access: 1 Gbit Ethernet
 LMN to site 2 access: 1 Gbit Ethernet
Message:
 Continue to work with your network group
 Understand the traffic levels
 Understand the site network topology
10 GigEthernet: Tuning PCI-X
[Plots: logic-analyser traces of a 16114-byte transfer across the PCI-X bus for mmrbc = 512, 1024, 2048 and 4096 bytes, showing the CSR accesses and 2048-byte PCI-X segments]
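A back-of-envelope model of why a larger mmrbc (maximum memory read byte count) helps: the 16114-byte transfer is split into at most mmrbc-sized bursts, and each burst pays a fixed per-transaction cost on the 133 MHz, 64-bit PCI-X bus. The 40-cycle overhead figure is purely an assumption for illustration; only the trend matters.

```python
# Effective bus rate vs mmrbc for a 16114-byte transfer on 64-bit, 133 MHz PCI-X.
import math

BUS_HZ, BYTES_PER_CYCLE = 133e6, 8
TRANSFER_BYTES = 16114
OVERHEAD_CYCLES = 40          # assumed per-burst cost (arbitration, address, completion)

for mmrbc in (512, 1024, 2048, 4096):
    bursts = math.ceil(TRANSFER_BYTES / mmrbc)
    cycles = math.ceil(TRANSFER_BYTES / BYTES_PER_CYCLE) + bursts * OVERHEAD_CYCLES
    rate_gbps = TRANSFER_BYTES * 8 * BUS_HZ / cycles / 1e9
    print(f"mmrbc {mmrbc:4d} bytes: {bursts:2d} bursts, ~{rate_gbps:.1f} Gbit/s on the bus")
```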
10 GigEthernet at SC2003 BW Challenge (Phoenix)
 Three server systems with 10 GigEthernet NICs
 Used the DataTAG altAIMD stack, 9000 byte MTU
 Streams from the SLAC/FNAL booth in Phoenix to:
  Palo Alto PAIX, 17 ms rtt
  Chicago Starlight, 65 ms rtt
  Amsterdam SARA, 175 ms rtt
[Plots: throughput (Gbits/s) vs time, 19 Nov 03 15:59 - 17:25: Phoenix-PAIX (HS-TCP, Scalable-TCP, Scalable-TCP #2) with router traffic to Abilene; Phoenix-Chicago and Phoenix-Amsterdam]
Helping Real Users [1]
Radio Astronomy VLBI
PoC with NRNs & GEANT
1024 Mbit/s, 24 x 7, NOW
VLBI Project: Throughput, Jitter & 1-way Delay
 1472 byte packets, Manchester -> Dwingeloo (JIVE)
 Jitter FWHM 22 µs (back-to-back 3 µs)
[Plots: received wire rate (Mbit/s) vs spacing between frames (µs), Gnt5-DwMk5 11 Nov 03 and DwMk5-Gnt5 13 Nov 03, 1472 bytes; jitter histogram N(t) vs jitter (µs), 1472 bytes, w=50, Gnt5-DwMk5 28 Oct 03]
 1-way delay – note the packet loss (points with zero 1-way delay)
[Plots: 1-way delay (µs) vs packet number, 1472 bytes, w12, Gnt5-DwMk5 21 Oct 03, full trace plus a zoom on packets 2100 - 3000]
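A rough stand-alone sketch of the receive-side jitter measurement (the production tests use dedicated tools such as UDPmon, not this script): timestamp each arriving UDP frame and histogram the spacing between consecutive frames. Port, packet count and bin width are illustrative.

```python
# Histogram of inter-arrival spacing for a burst of UDP frames.
import socket, time
from collections import Counter

PORT, NPKTS, BIN_US = 14001, 10000, 1

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("", PORT))

stamps = []
for _ in range(NPKTS):
    s.recvfrom(2048)
    stamps.append(time.perf_counter())

gaps_us = [(b - a) * 1e6 for a, b in zip(stamps, stamps[1:])]
hist = Counter(int(g // BIN_US) * BIN_US for g in gaps_us)
for bin_start in sorted(hist)[:40]:            # print the head of the distribution
    print(f"{bin_start:4d} us  {'#' * min(hist[bin_start], 60)}")
```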
VLBI Project: Packet Loss Distribution
 Measure the time between lost packets in the time series of packets sent
 Lost 1410 packets in 0.6 s
 Is it a Poisson process?
 Assume the Poisson process is stationary: λ(t) = λ
 Use the probability density function: P(t) = λ e^(-λt)
 Mean λ = 2360 /s [mean time between losses 426 µs]
 Plot the log: slope -0.0028, expect -0.0024
 Could be an additional process involved
[Plots: packet-loss distribution, 12 µs bins, measured vs Poisson, number in bin vs time between lost frames (µs); log plot with fits y = 41.832 e^(-0.0028x) and y = 39.762 e^(-0.0024x)]
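A sketch of the Poisson check described above: histogram the gaps between consecutive lost packets and compare the log-slope with the exponential P(t) = λ e^(-λt) expected for a stationary Poisson process (λ ≈ 2360 /s, i.e. a 426 µs mean gap). Synthetic gaps are generated here only so the sketch runs stand-alone.

```python
# Fit the decay of the inter-loss-time histogram and compare with Poisson.
import numpy as np

rng = np.random.default_rng(0)
gaps_us = rng.exponential(scale=426.0, size=1410)          # stand-in for measured gaps

counts, edges = np.histogram(gaps_us, bins=np.arange(0, 1000, 12))   # 12 us bins
centres = 0.5 * (edges[:-1] + edges[1:])
mask = counts > 0
slope, _ = np.polyfit(centres[mask], np.log(counts[mask]), 1)

print(f"fitted log-slope {slope:.4f} /us   (Poisson expectation {-1/426:.4f} /us)")
```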
VLBI Traffic Flows – Only testing!
 Manchester – NetNorthWest – SuperJANET access links: two 1 Gbit/s
 Access links: SJ4 to GÉANT, GÉANT to SurfNet
Throughput & PCI Transactions on the Mark5 PC
 Mark5 uses a Supermicro P3TDLE motherboard
  1.2 GHz PIII
  Memory bus 133/100 MHz
  2 × 64-bit 66 MHz PCI
  4 × 32-bit 33 MHz PCI
[Diagram: Gigabit Ethernet NIC, IDE disc pack and SuperStor input card on the PCI buses, observed with a logic analyser; the test pattern is a read/write of n bytes followed by a wait time]
PCI Activity: Read Multiple Data Blocks, 0 Wait
 Read 999424 bytes
 Each data block (131,072 bytes): set up CSRs, data movement, update CSRs
 For 0 wait between reads:
  Data blocks ~600 µs long, take ~6 ms in total
  Then a 744 µs gap
  PCI transfer rate 1188 Mbit/s (148.5 Mbytes/s)
  Read_sstor rate 778 Mbit/s (97 Mbyte/s)
  PCI bus occupancy: 68.44%
 Concern about Ethernet traffic: 64-bit 33 MHz PCI needs ~82% occupancy for 930 Mbit/s; expect ~360 Mbit/s
[Plot: logic-analyser trace showing CSR accesses and PCI data transfers in 4096-byte bursts]
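The quoted transfer rate can be roughly reproduced from these timings; a minimal sketch follows. The 6.73 ms data-movement time is inferred from the 1188 Mbit/s figure (the slide itself quotes ~6 ms), and the further drop to the 778 Mbit/s read_sstor rate is the per-block CSR overhead that this two-number model leaves out.

```python
# Reconstruct the PCI transfer rate from the approximate logic-analyser timings.
read_bytes  = 999_424
data_time_s = 6.73e-3        # ~6 ms of data movement (inferred, see note above)
gap_s       = 744e-6         # idle gap between reads

pci_rate_mbps      = read_bytes * 8 / data_time_s / 1e6
with_gap_rate_mbps = read_bytes * 8 / (data_time_s + gap_s) / 1e6
print(f"rate while moving data : ~{pci_rate_mbps:.0f} Mbit/s")    # ~1188 Mbit/s
print(f"rate including the gap : ~{with_gap_rate_mbps:.0f} Mbit/s")
```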
PCI Activity: Read Throughput
 Flat, then 1/t dependence
 ~860 Mbit/s for read blocks >= 262144 bytes
 CPU load ~20%
 Concern about the CPU load needed to drive a Gigabit link
Helping Real Users [2]
HEP
BaBar & CMS
Application Throughput
BaBar Case Study: Disk Performance
 BaBar Disk Server
  Tyan Tiger S2466N motherboard
  1 × 64-bit 66 MHz PCI bus
  Athlon MP2000+ CPU
  AMD-760 MPX chipset
  3Ware 7500-8 RAID5
  8 × 200 GB Maxtor IDE 7200 rpm disks
 Note the VM parameter readahead max
 Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
 Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]
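A minimal disk-to-memory read test in the spirit of these measurements; the file path is a placeholder, and for meaningful numbers the file should be larger than RAM (or the page cache flushed first) so that the RAID and the readahead setting, rather than cached data, are being measured.

```python
# Sequential read throughput of a large file, reported in MBytes/s.
import time

PATH  = "/raid/testfile"     # hypothetical large test file on the RAID volume
CHUNK = 1 << 20              # 1 MByte reads

total, t0 = 0, time.time()
with open(PATH, "rb") as f:
    while True:
        buf = f.read(CHUNK)
        if not buf:
            break
        total += len(buf)
dt = time.time() - t0
print(f"read {total / 1e6:.0f} MBytes in {dt:.1f} s -> {total / 1e6 / dt:.0f} MBytes/s")
```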
BaBar: Serial ATA RAID Controllers
 3Ware 66 MHz PCI
 ICP 66 MHz PCI
[Plots: read and write throughput (Mbit/s) vs file size (0 - 2000 MBytes) for RAID5 over 4 SATA disks on the 3Ware and ICP 66 MHz controllers, for readahead max = 31, 63, 127, 256, 512 and 1200]
BaBar Case Study: RAID Throughput & PCI Activity
 3Ware 7500-8 RAID5, parallel EIDE
 3Ware forces the PCI bus to 33 MHz
 BaBar Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
 Disk-to-disk throughput with bbcp: 40-45 Mbytes/s (320 - 360 Mbit/s)
 PCI bus effectively full!
[Plots: PCI activity for reads from and writes to the RAID5 disks]
MB-NG SuperJANET4 Development Network – BaBar Case Study
[Diagram: BaBar/MB-NG test topology – PCs with 3ware RAID5 at Manchester (MAN/MCC) and RAL, OSM-1OC48-POS-SS router interfaces, Gigabit Ethernet, 2.5 Gbit PoS access, 2.5 Gbit PoS core, MPLS admin domains, SJ4 Dev network]
Status / Tests:
 Manc host has the DataTAG TCP stack
 RAL host now available
 BaBar-BaBar mem-mem
 BaBar-BaBar real data, MB-NG
 BaBar-BaBar real data, SJ4
 Mbng-mbng real data, MB-NG
 Mbng-mbng real data, SJ4
 Different TCP stacks already installed
MB-NG SuperJANET4 Development Network – Study of Applications
[Diagram: application-test topology – PCs with 3ware RAID0 at Manchester (MAN/MCC) and UCL, OSM-1OC48-POS-SS router interfaces, Gigabit Ethernet, 2.5 Gbit PoS access, 2.5 Gbit PoS core, MPLS admin domains, SJ4 Dev network]
MB-NG: 24 Hours HighSpeed TCP mem-mem
 TCP mem-mem lon2-man1
 Tx 64, Tx-abs 64; Rx 64, Rx-abs 128
 941.5 Mbit/s ± 0.5 Mbit/s
MB-NG: Gridftp Throughput, HighSpeed TCP
 Int Coal 64 128
 Txqueuelen 2000
 TCP buffer 1 MByte (rtt × BW = 750 kBytes)
[Plots: interface throughput, ACKs received, data moved]
 520 Mbit/s
 Same for B2B tests
 So it’s not that simple!
MB-NG: Gridftp Throughput + Web100
 Throughput Mbit/s: see alternating 600/800 Mbit/s and zero
 Cwnd smooth
 No dup ACKs / send stalls / timeouts
MB-NG: HTTP Data Transfers with HighSpeed TCP
 Apache web server out of the box!
 Prototype client – curl http library
 1 MByte TCP buffers
 2 GByte file
 Throughput 72 MBytes/s
 Cwnd – some variation
 No dup ACKs / send stalls / timeouts
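A sketch of the client side of this kind of test; it uses Python's urllib rather than the curl library named above, purely to stay self-contained, and the URL is a placeholder for the 2 GByte test file.

```python
# Pull a large file over HTTP and report the achieved rate in MBytes/s.
import time
import urllib.request

URL = "http://server.example.org/2GB.dat"   # hypothetical test file

total, t0 = 0, time.time()
with urllib.request.urlopen(URL) as resp:
    while True:
        buf = resp.read(1 << 20)            # 1 MByte reads
        if not buf:
            break
        total += len(buf)
dt = time.time() - t0
print(f"{total / 1e6:.0f} MBytes in {dt:.0f} s -> {total / 1e6 / dt:.0f} MBytes/s")
```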
More Information – Some URLs
 MB-NG project web site: http://www.mb-ng.net/
 DataTAG project web site: http://www.datatag.org/
 UDPmon / TCPmon kit + writeup:
http://www.hep.man.ac.uk/~rich/net
 Motherboard and NIC Tests:
www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt
& http://datatag.web.cern.ch/datatag/pfldnet2003/
 TCP tuning information may be found at:
http://www.ncne.nlanr.net/documentation/faq/performance.html
& http://www.psc.edu/networking/perf_tune.html