End-user systems:
NICs, MotherBoards, Disks,
TCP Stacks & Applications
Richard Hughes-Jones
20th-21st June 2005
NeSC Edinburgh
Work reported is from many Networking Collaborations
MB - NG
http://gridmon.dl.ac.uk/nfnn/
Network Performance Issues
‹ End System Issues
„ Network Interface Card and Driver and their configuration
„ Processor speed
„ MotherBoard configuration, Bus speed and capability
„ Disk System
„ TCP and its configuration
„ Operating System and its configuration
‹ Network Infrastructure Issues
„ Obsolete network equipment
„ Configured bandwidth restrictions
„ Topology
„ Security restrictions (e.g., firewalls)
„ Sub-optimal routing
‹ Transport Protocols
‹ Network Capacity and the influence of Others!
„ Congestion – Group, Campus, Access links
„ Many, many TCP connections
„ Mice and Elephants on the path
Methodology used in testing
NICs & Motherboards
Latency Measurements
‹ UDP/IP packets sent between back-to-back systems
„ Processed in a similar manner to TCP/IP
„ Not subject to flow control & congestion avoidance algorithms
„ Used UDPmon test program
‹ Latency
‹ Round trip times measured using Request-Response UDP frames
‹ Latency as a function of frame size
„ Slope is given by:  $s = \sum_{\text{data paths}} \frac{1}{db/dt}$
z Data paths: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
„ Intercept indicates: processing times + HW latencies
‹ Histograms of ‘singleton’ measurements
‹ Tells us about:
„ Behavior of the IP stack
„ The way the HW operates
„ Interrupt coalescence
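Note (added illustration): a minimal Python sketch of the Request-Response timing method, not UDPmon itself. It assumes a UDP echo responder is already running on the remote host; the address and port are hypothetical.

```python
import socket
import time

def udp_round_trips(host, port, frame_bytes, n=1000):
    """Time 'singleton' request-response round trips for one frame size (microseconds)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    payload = b"\x00" * frame_bytes
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        sock.sendto(payload, (host, port))
        sock.recvfrom(65536)                      # the responder echoes the frame back
        samples.append((time.perf_counter() - t0) * 1e6)
    return samples

# Latency vs frame size: fit a straight line to the means to get the slope and intercept
# for size in (64, 512, 1024, 1472): print(size, sum(udp_round_trips("10.0.0.2", 5001, size)) / 1000)
```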
Throughput Measurements
‹ UDP Throughput
‹ Send a controlled stream of UDP frames spaced at regular intervals
‹ Sender / Receiver control exchange (UDPmon):
„ Sender: "Zero stats" → Receiver: "OK done"
„ Sender transmits a set number of data frames of n bytes at regular intervals (a fixed wait time between frames); the receiver records the time to receive, the inter-packet time histogram and the 1-way delay
„ Sender: "Signal end of test", then "Get remote statistics" → Receiver returns: No. received, No. lost + loss pattern, No. out-of-order, CPU load & no. of interrupts → "OK done"
[Diagram: time sequence of the sender-receiver exchange]
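Note (added illustration): a minimal Python sketch of the sender side of such a paced UDP stream, my own simplification of what a tool like UDPmon does; the receiver-side bookkeeping and the target address are not shown and are assumptions.

```python
import socket
import time

def send_paced_frames(host, port, n_frames, frame_bytes, spacing_us):
    """Send a controlled stream of UDP frames with a fixed inter-frame wait time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * frame_bytes
    spacing = spacing_us * 1e-6
    next_send = time.perf_counter()
    for _ in range(n_frames):
        sock.sendto(payload, (host, port))
        next_send += spacing
        while time.perf_counter() < next_send:    # spin-wait: sleep() is too coarse at a few us
            pass

# e.g. 10000 full-size frames at 15 us spacing; the receiver counts arrivals, losses and ordering
# send_paced_frames("10.0.0.2", 5001, 10000, 1472, 15)
```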
PCI Bus & Gigabit Ethernet Activity
‹ PCI Activity
‹ Logic Analyzer with:
„ PCI Probe cards in the sending PC
„ Gigabit Ethernet Fiber Probe Card
„ PCI Probe cards in the receiving PC
‹ Possible bottlenecks on each host: CPU – chipset – memory – PCI bus – NIC
[Diagram: sender and receiver PCs instrumented with PCI probes and a fibre probe feeding the logic analyser display]
“Server Quality” Motherboards
‹ SuperMicro P4DP8-2G (P4DP6)
‹ Dual Xeon
‹ 400/533 MHz Front side bus
‹ 6 PCI / PCI-X slots
‹ 4 independent PCI buses
„ 64 bit 66 MHz PCI
„ 100 MHz PCI-X
„ 133 MHz PCI-X
‹ Dual Gigabit Ethernet
‹ Adaptec AIC-7899W
dual channel SCSI
‹ UDMA/100 bus master/EIDE channels
„ data transfer rates of 100 MB/sec burst
“Server Quality” Motherboards
‹ Boston/Supermicro H8DAR
‹ Two Dual Core Opterons
‹ 200 MHz DDR Memory
„ Theory BW: 6.4Gbit
‹ HyperTransport
‹ 2 independent PCI buses
„ 133 MHz PCI-X
‹ 2 Gigabit Ethernet
‹ SATA
‹ ( PCI-e )
NIC & Motherboard Evaluations
SuperMicro 370DLE: Latency: SysKonnect
„ Motherboard: SuperMicro 370DLE  Chipset: ServerWorks III LE
„ CPU: PIII 800 MHz  RedHat 7.1 Kernel 2.4.14
‹ PCI: 32 bit 33 MHz
„ Latency small (62 µs) & well behaved
„ Latency slope 0.0286 µs/byte (fit: y = 0.0286x + 61.79)
„ Expect 0.0232 µs/byte: PCI 0.00758 + GigE 0.008 + PCI 0.00758
‹ PCI: 64 bit 66 MHz
„ Latency small (56 µs) & well behaved
„ Latency slope 0.0231 µs/byte (fit: y = 0.0231x + 56.088)
„ Expect 0.0118 µs/byte: PCI 0.00188 + GigE 0.008 + PCI 0.00188
„ Possible extra data moves?
[Plots: latency (µs) vs message length (bytes) for the SysKonnect NIC at 32 bit 33 MHz and 64 bit 66 MHz]
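Note (added worked check): the "expect" figures above are just the sum of the inverse byte-rates along the data path; for the 64 bit / 66 MHz case the arithmetic is:

```latex
% Expected slope = sum of inverse byte-rates along the path (two PCI crossings + the GigE wire)
% 64 bit x 66 MHz PCI: ~528 Mbyte/s -> ~0.0019 us/byte;  Gigabit Ethernet: 125 Mbyte/s -> 0.008 us/byte
s_{\mathrm{expect}} = \underbrace{0.00188}_{\mathrm{PCI}} + \underbrace{0.008}_{\mathrm{GigE}}
                    + \underbrace{0.00188}_{\mathrm{PCI}} \approx 0.0118~\mu\mathrm{s/byte}
```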
SuperMicro 370DLE: Throughput: SysKonnect
„ Motherboard: SuperMicro 370DLE  Chipset: ServerWorks III LE
„ CPU: PIII 800 MHz  RedHat 7.1 Kernel 2.4.14
‹ PCI: 32 bit 33 MHz
„ Max throughput 584 Mbit/s
„ No packet loss for frame spacing > 18 µs
‹ PCI: 64 bit 66 MHz
„ Max throughput 720 Mbit/s
„ No packet loss for frame spacing > 17 µs
„ Packet loss during the throughput drop
„ Receiver CPU 95-100% in kernel mode
[Plots: received wire rate (Mbit/s) and receiver % CPU in kernel mode vs transmit time per frame (µs) for packet sizes 50-1472 bytes, at 32 bit 33 MHz and 64 bit 66 MHz]
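Note (added illustration): the "Recv Wire rate" in these plots counts the whole frame on the wire; a small sketch, with standard Ethernet/IP/UDP overheads assumed, of how the offered wire rate follows from packet size and spacing:

```python
ETH_OVERHEAD = 12 + 8 + 14 + 4      # inter-frame gap + preamble + MAC header + CRC, in bytes
IP_UDP_HEADERS = 20 + 8

def wire_rate_mbit(udp_payload_bytes, spacing_us):
    """Wire rate implied by one UDP frame of udp_payload_bytes every spacing_us microseconds."""
    frame = udp_payload_bytes + IP_UDP_HEADERS + ETH_OVERHEAD
    return frame * 8 / spacing_us    # bytes * 8 bits per microsecond == Mbit/s

print(wire_rate_mbit(1472, 12.3))   # ~1000 Mbit/s: 1472 byte packets at ~12 us fill the GigE wire
```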
SuperMicro 370DLE: PCI: SysKonnect
„ Motherboard: SuperMicro 370DLE Chipset: ServerWorks III LE Chipset
„ CPU: PIII 800 MHz PCI:64 bit 66 MHz
„ RedHat 7.1 Kernel 2.4.14
„ 1400 bytes sent, wait 100 µs
„ ~8 µs on PCI for the send or receive data transfer
„ Stack & application overhead ~10 µs / node
[Logic analyser traces: send setup, send PCI transfer, packet on the Ethernet fibre, receive PCI transfer; ~36 µs from send to receive]
Signals on the PCI bus
‹ 1472 byte packets every 15 µs, Intel Pro/1000
‹ PCI: 64 bit 33 MHz
„ Data transfers occupy ~82% of the bus
‹ PCI: 64 bit 66 MHz
„ ~65% usage
„ Data transfers take half as long
[Logic analyser traces: send setup, send PCI, packet on the fibre, receive PCI, receive transfers at 33 MHz and 66 MHz]
SuperMicro 370DLE: PCI: SysKonnect
„ Motherboard: SuperMicro 370DLE Chipset: ServerWorks III LE Chipset
„ CPU: PIII 800 MHz PCI:64 bit 66 MHz
„ RedHat 7.1 Kernel 2.4.14
„ 1400 bytes sent, wait 20 µs: frames on the Ethernet fibre are 20 µs apart
„ 1400 bytes sent, wait 10 µs: frames are back-to-back
„ The 800 MHz CPU can drive the link at line speed; it cannot go any faster!
SuperMicro 370DLE: Throughput: Intel Pro/1000
„ Motherboard: SuperMicro 370DLE  Chipset: ServerWorks III LE
„ CPU: PIII 800 MHz  PCI: 64 bit 66 MHz
„ RedHat 7.1 Kernel 2.4.14
„ Max throughput 910 Mbit/s
„ No packet loss for frame spacing > 12 µs
„ Packet loss during the throughput drop
„ CPU load 65-90% for spacing < 13 µs
[Plots: received wire rate (Mbit/s) and % packet loss vs transmit time per frame (µs) for packet sizes 50-1472 bytes]
SuperMicro 370DLE: PCI: Intel Pro/1000
„ Motherboard: SuperMicro 370DLE Chipset: ServerWorks III LE Chipset
„ CPU: PIII 800 MHz PCI:64 bit 66 MHz
„ RedHat 7.1 Kernel 2.4.14
„ Request-Response exchange: 64 byte request sent, 1400 byte response returned
„ Demonstrates interrupt coalescence: the interrupt is delayed and there is no processing directly after each transfer
[Logic analyser traces: send PCI, request on the fibre, receive PCI, send and receive interrupt processing]
SuperMicro P4DP6: Latency Intel Pro/1000
„ Motherboard: SuperMicro P4DP6 Chipset: Intel E7500 (Plumas)
„ CPU: Dual Xeon Prestonia 2.2 GHz PCI, 64 bit, 66 MHz
„ RedHat 7.2 Kernel 2.4.19
„ Latency vs message length shows some steps
„ Overall slope 0.009 µs/byte (fit: y = 0.0093x + 194.67)
„ Slope of the flat sections: 0.0146 µs/byte (fit: y = 0.0149x + 201.75)
„ Expect 0.0118 µs/byte
„ Latency histograms for 64, 512, 1024 and 1400 byte packets: no variation with packet size
„ FWHM 1.5 µs, which confirms the timing is reliable
[Plots: latency (µs) vs message length (bytes), and latency histograms N(t) for the selected packet sizes]
SuperMicro P4DP6: Throughput Intel Pro/1000
„ Motherboard: SuperMicro P4DP6 Chipset: Intel E7500 (Plumas)
„ CPU: Dual Xeon Prestonia 2.2 GHz PCI, 64 bit, 66 MHz
„ RedHat 7.2 Kernel 2.4.19
„ Max throughput 950 Mbit/s
„ No packet loss
„ Averages are misleading: CPU utilisation on the receiving PC was ~25% for packets > 1000 bytes and 30-40% for smaller packets
[Plots: received wire rate (Mbit/s) and receiver % CPU idle vs transmit time per frame (µs) for packet sizes 50-1472 bytes]
SuperMicro P4DP6: PCI Intel Pro/1000
„ Motherboard: SuperMicro P4DP6 Chipset: Intel E7500 (Plumas)
„ CPU: Dual Xeon Prestonia 2.2 GHz PCI, 64 bit, 66 MHz
„ RedHat 7.2 Kernel 2.4.19
„ 1400 bytes sent, wait 12 µs
„ ~5.14 µs on the send PCI bus
„ PCI bus ~68% occupancy
„ ~3 µs on PCI for the data receive
„ CSR access inserts PCI STOPs
„ NIC takes ~1 µs per CSR access
„ CPU faster than the NIC!
„ Similar effect seen with the SysKonnect NIC
[Logic analyser traces of the send and receive PCI buses]
SuperMicro P4DP8-G2: Throughput Intel onboard
„ Motherboard: SuperMicro P4DP8-G2 Chipset: Intel E7500 (Plumas)
„ CPU: Dual Xeon Prestonia 2.4 GHz PCI-X:64 bit
„ RedHat 7.3 Kernel 2.4.19
„ Max throughput 995 Mbit/s
„ No packet loss
„ Averages are misleading: ~20% CPU utilisation on the receiver for packets > 1000 bytes, ~30% for smaller packets
[Plots: received wire rate (Mbit/s) and receiver % CPU idle vs transmit time per frame (µs) for packet sizes 50-1472 bytes, on-board Intel NIC]
Interrupt Coalescence: Throughput
„ Intel Pro/1000 on the 370DLE
„ Interrupt coalescence settings coa5, coa10, coa20, coa40, coa64 and coa100 compared
[Plots: throughput vs delay between transmit packets (µs) for 1000 byte and 1472 byte packets at each coalescence setting]
Interrupt Coalescence Investigations
‹ Set kernel parameters for socket buffer size = rtt × BW (see the sketch below)
‹ TCP mem-mem lon2-man1
‹ Tx 64, Tx-abs 64; Rx 0, Rx-abs 128:  820-980 Mbit/s ± 50 Mbit/s
‹ Tx 64, Tx-abs 64; Rx 20, Rx-abs 128:  937-940 Mbit/s ± 1.5 Mbit/s
‹ Tx 64, Tx-abs 64; Rx 80, Rx-abs 128:  937-939 Mbit/s ± 1 Mbit/s
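Note (added illustration): a minimal sketch of sizing socket buffers to the bandwidth-delay product; in practice the kernel also caps these via net.core.rmem_max / wmem_max, which must be raised separately.

```python
import socket

def set_bdp_buffers(sock, rtt_seconds, bandwidth_bps):
    """Size the socket buffers to the bandwidth-delay product (rtt * BW)."""
    bdp_bytes = int(rtt_seconds * bandwidth_bps / 8)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    return bdp_bytes

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(set_bdp_buffers(s, 0.0062, 1e9))   # e.g. 6.2 ms rtt at 1 Gbit/s -> ~775 kbyte buffers
```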
Supermarket Motherboards
and other
“Challenges”
Tyan Tiger S2466N
‹ Motherboard: Tyan Tiger S2466N
‹ PCI: one 64 bit 66 MHz slot
‹ CPU: Athlon MP2000+
‹ Chipset: AMD-760 MPX
‹ 3Ware controller forces the PCI bus to 33 MHz
‹ Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
IBM das: Throughput: Intel Pro/1000
„ Motherboard: IBM das  Chipset: ServerWorks CNB20LE
„ CPU: Dual PIII 1 GHz  PCI: 64 bit 33 MHz
„ RedHat 7.1 Kernel 2.4.14
„ Max throughput 930 Mbit/s
„ No packet loss for frame spacing > 12 µs
„ Clean behaviour; packet loss during the throughput drop
„ PCI: 1400 bytes sent, 11 µs spacing; signals clean
„ ~9.3 µs on the send PCI bus; PCI bus ~82% occupancy
„ ~5.9 µs on PCI for the data receive
[Plot: received wire rate (Mbit/s) vs transmit time per frame (µs) for packet sizes 50-1472 bytes]
Network switch limits behaviour
‹End2end UDP packets from udpmon
„Only 700 Mbit/s throughput
„ Lots of packet loss
„ The packet loss distribution shows the throughput is limited
[Plots: received wire rate (Mbit/s) and % packet loss vs spacing between frames (µs) for packet sizes 50-1472 bytes; 1-way delay (µs) vs packet number at 12 µs spacing]
10 Gigabit Ethernet
10 Gigabit Ethernet: UDP Throughput
‹ 1500 byte MTU gives ~2 Gbit/s
‹ Used 16144 byte MTU, max user length 16080 bytes
‹ DataTAG Supermicro PCs
„ Dual 2.2 GHz Xeon CPU, FSB 400 MHz
„ PCI-X mmrbc 512 bytes
„ Wire rate throughput of 2.9 Gbit/s
‹ CERN OpenLab HP Itanium PCs
„ Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz
„ PCI-X mmrbc 4096 bytes
„ Wire rate of 5.7 Gbit/s
‹ SLAC Dell PCs
„ Dual 3.0 GHz Xeon CPU, FSB 533 MHz
„ PCI-X mmrbc 4096 bytes
„ Wire rate of 5.4 Gbit/s
[Plot: received wire rate (Mbit/s) vs spacing between frames (µs) for packet sizes 1472-16080 bytes]
10 Gigabit Ethernet: Tuning PCI-X
‹ 16080 byte packets every 200 µs, Intel PRO/10GbE LR adapter
‹ PCI-X bus occupancy vs mmrbc (512, 1024, 2048, 4096 bytes)
„ Measured times compared with times based on PCI-X timing from the logic analyser
„ Each PCI-X sequence: data transfer, interrupt & CSR update, CSR access
„ Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
[Plots: PCI-X transfer time (µs) and rate (measured, from the expected time, and maximum PCI-X throughput) vs Max Memory Read Byte Count, for the DataTAG Xeon 2.2 GHz and the HP Itanium (kernel 2.6.1, Feb 04) hosts]
Different TCP Stacks
Investigation of new TCP Stacks
‹ The AIMD Algorithm - Standard TCP (Reno)
„ For each ack in an RTT without loss:
cwnd -> cwnd + a / cwnd    (Additive Increase, a = 1)
„ For each window experiencing loss:
cwnd -> cwnd - b * cwnd    (Multiplicative Decrease, b = 1/2)
‹ High Speed TCP
„ a and b vary depending on the current cwnd, using a table
„ a increases more rapidly with larger cwnd - returns to the 'optimal' cwnd size sooner for the network path
„ b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
‹ Scalable TCP
„ a and b are fixed adjustments for the increase and decrease of cwnd
„ a = 1/100 - the increase is greater than for TCP Reno
„ b = 1/8 - the decrease on loss is less than for TCP Reno
„ Scalable over any link speed
‹ Fast TCP
„ Uses round trip time as well as packet loss to indicate congestion, with rapid convergence to fair equilibrium for throughput
(A toy sketch of these window-update rules follows.)
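Note (added illustration): a minimal Python sketch of the Reno and Scalable window-update rules described above; it is a toy model, not any kernel's implementation.

```python
def reno_on_ack(cwnd, a=1.0):
    """Standard TCP (Reno): additive increase of a / cwnd per ack (~ +a per round trip)."""
    return cwnd + a / cwnd

def reno_on_loss(cwnd, b=0.5):
    """Standard TCP (Reno): multiplicative decrease by b = 1/2 on a loss."""
    return cwnd - b * cwnd

def scalable_on_ack(cwnd, a=0.01):
    """Scalable TCP: fixed per-ack increase a = 1/100, so cwnd grows ~1% per ack."""
    return cwnd + a

def scalable_on_loss(cwnd, b=0.125):
    """Scalable TCP: smaller decrease on loss, b = 1/8."""
    return cwnd - b * cwnd
```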
Comparison of TCP Stacks
‹TCP Response Function
„ Throughput vs Loss Rate – further to right: faster recovery
„ Drop packets in kernel
[Plots: TCP response functions measured on MB-NG (rtt 6 ms) and DataTAG (rtt 120 ms)]
High Throughput Demonstrations
‹ End hosts: lon01 in London (Chicago) and man03 in Manchester (Geneva), each a Dual Xeon 2.2 GHz PC on 1 GEth into a Cisco 7609
‹ Path crosses two Cisco GSRs over 2.5 Gbit SDH - the MB-NG core
‹ Packets are dropped (in the kernel) at the receiving end
‹ Send data with TCP; monitor TCP with Web100
[Diagram: MB-NG high-throughput demonstration topology]
High Performance TCP – MB-NG
‹Drop 1 in 25,000
‹rtt 6.2 ms
‹Recover in 1.6 s
[Web100 plots of cwnd and achieved rate for Standard, HighSpeed and Scalable TCP]
High Performance TCP – DataTAG
‹Different TCP stacks tested on the DataTAG Network
‹ rtt 128 ms
‹ Drop 1 in 10^6
‹ High-Speed: rapid recovery
‹ Scalable: very fast recovery
‹ Standard: recovery would take ~20 mins
Disk and RAID Sub-Systems
Based on talk at NFNN 2004
by
Andrew Sansum
Tier 1 Manager at RAL
Reliability and Operation
‹ The performance of a system that is broken is 0.0 MB/S!
‹ When staff are fixing broken servers they cannot spend their
time optimising performance.
‹ With a lot of systems, anything that can break - will break
„ Typical ATA failure rate 2-3% per annum at RAL, CERN and CASPUR.
That's 30-45 drives per annum on the RAL Tier1, i.e. one per week.
One failure for every 10^15 bits read. MTBF 400K hours.
„ RAID arrays may not handle block re-maps satisfactorily, or take so long
that the system has given up.
RAID5 is not perfect protection - there is a finite chance of 2 disks failing!
„ SCSI interconnects corrupt data or give CRC errors
„ Filesystems (ext2) corrupt for no apparent cause (EXT2 errors)
„ Silent data corruption happens (checksums are vital).
‹ Fixing data corruptions can take a huge amount of time (> 1
week). Data loss possible.
You need to Benchmark – but how?
‹ Andrew’s golden rule: Benchmark Benchmark Benchmark
‹ Choose a benchmark that matches your application (we
use Bonnie++ and IOZONE).
„ Single stream of sequential I/O.
„ Random disk accesses.
„ Multi-stream – 30 threads more typical.
‹ Watch out for caching effects (use large files and small
memory).
‹ Use a standard protocol that gives reproducible results.
‹ Document what you did for future reference
‹ Stick with the same benchmark suite/protocol as long as
possible to gain familiarity. Much easier to spot significant
changes
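Note (added illustration): a crude multi-stream sequential-read sketch in Python, only to show the shape of such a test; the real measurements in this talk used Bonnie++ and IOZONE, and the file paths here are placeholders. Use files much larger than RAM to defeat caching.

```python
import os
import threading
import time

def read_stream(path, block=1024 * 1024):
    """One sequential reader; buffering=0 avoids Python-level buffering of the stream."""
    with open(path, "rb", buffering=0) as f:
        while f.read(block):
            pass

def multi_stream_read(paths):
    """Aggregate MB/s for several concurrent sequential readers."""
    start = time.time()
    threads = [threading.Thread(target=read_stream, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return total_bytes / (time.time() - start) / 1e6
```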
Hard Disk Performance
‹ Factors affecting performance:
„ Seek time (rotation speed is a big component)
„ Transfer Rate
„ Cache size
„ Retry count (vibration)
‹ Watch out for unusual disk behaviours:
„ write/verify (drive verifies writes for first <n> writes)
„ drive spin down etc etc.
‹ Storage review is often worth reading:
http://www.storagereview.com/
Impact of Drive Transfer Rate
A slower drive may not cause significant sequential performance degradation in a RAID array.
This 5400 rpm drive was 25% slower than the 7200 rpm drive.
[Plot: read rate (MB/s) vs number of threads (1-32) for the 7200 and 5400 rpm drives]
Disk Parameters
Distribution of Seek Time
‹ Mean 14.1 ms
‹ Published seek time 9 ms (rotational latency has been taken off)
Head Position & Transfer Rate
‹ The outside of the disk has more blocks per track, so it is faster
‹ 58 MB/s falling to 35 MB/s across the disk - a 40% effect
From: www.storagereview.com
[Plots: seek-time histogram (ms) and transfer rate vs position on the disk]
Drive Interfaces
http://www.maxtor.com/
RAID levels
‹Choose Appropriate RAID level
„ RAID 0 (stripe) is fastest but no redundancy
„ RAID 1 (mirror) single disk performance
„ RAID 2, 3 and 4 not very common/supported
sometimes worth testing with your application
„ RAID 5 – Read is almost as fast as RAID-0, write
substantially slower.
„ RAID 6 – extra parity info – good for unreliable drives
but could have considerable impact on performance.
Rare but potentially very useful for large IDE disk
arrays.
„ RAID 50 & RAID 10 - RAID 0 across 2 (or more) controllers
Kernel Versions
‹ Kernel versions can make an enormous difference. It is not
unusual for I/O to be badly broken in Linux.
„ In the 2.4 series - Virtual Memory subsystem problems
z Only 2.4.5, some 2.4.9, and >2.4.17 are OK
„ Red Hat Enterprise 3 seems badly broken
(although the most recent RHEL kernel may be better)
„ The 2.6 kernel contains new functionality and may be better
(although first tests show little difference)
‹ Always check after an upgrade that performance remains
good.
Kernel Tuning
‹ Tunable kernel parameters can improve I/O performance,
but depends what you are trying to achieve.
„ IRQ balancing. Issues with IRQ handling on some Xeon
chipsets/kernel combinations (all IRQs handled on CPU zero).
„ Hyperthreading
„ I/O schedulers - can tune for latency by minimising the seeks
z /sbin/elvtune (number of sectors of writes before reads are allowed)
z Choice of scheduler: elevator=xx orders I/O to minimise seeks and
merges adjacent requests.
„ VM tuning on 2.4: bdflush and readahead (a single thread helped)
z sysctl -w vm.max-readahead=512
z sysctl -w vm.min-readahead=512
z sysctl -w vm.bdflush="10 500 0 0 3000 10 20 0"
„ On 2.6 use the readahead command [default 256]:
z blockdev --setra 16384 /dev/sda
Example Effect of VM Tuneup
Redhat 7.3 Read Performance
[Plot: read rate (MB/s) vs number of threads (1-32), default vs tuned VM settings]
(measurements: g.prassas@rl.ac.uk)
IOZONE Tests
Read Performance
‹ File system ext2 vs ext3
‹ Due to journal behaviour
‹ ~40% effect
ext3 Write
‹ Journal mode: write data + write journal
‹ Modes compared: writeback, ordered, journal
ext3 Read
[Plots: IOZONE read and write rates (MB/s) vs number of threads (1-32) for ext2/ext3 and for the three ext3 journalling modes]
RAID Controller Performance
‹ RAID5 (striped with redundancy)
‹ Controllers tested (measurements: Stephen Dallison):
„ 3Ware 7506 Parallel 66 MHz
„ 3Ware 7505 Parallel 33 MHz
„ 3Ware 8506 Serial ATA 66 MHz
„ ICP Serial ATA 33/66 MHz
‹ Tested on a Dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
‹ Disk: Maxtor 160GB 7200rpm 8MB Cache
‹ Read ahead kernel tuning: /proc/sys/vm/max-readahead = 512
[Plots: disk-memory read speeds and memory-disk write speeds for each controller]
‹ Rates for the same PC with RAID0 (striped): Read 1040 Mbit/s, Write 800 Mbit/s
SC2004 RAID Controller Performance
‹ Supermicro X5DPE-G2 motherboards loaned from Boston Ltd.
‹ Dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
‹ 3Ware 8506-8 controller on a 133 MHz PCI-X bus
‹ Configured as RAID0 with 64 kbyte stripe size
‹ Six 74.3 GByte Western Digital Raptor WD740 SATA disks
„ 75 Mbyte/s disk-buffer, 150 Mbyte/s buffer-memory
‹ Scientific Linux with 2.6.6 Kernel + altAIMD patch (Yee) + packet loss patch
‹ Read ahead kernel tuning: /sbin/blockdev --setra 16384 /dev/sda
[Plots: disk-memory read speeds and memory-disk write speeds (Mbit/s) vs buffer size (kbytes) for 0.5, 1 and 2 GByte transfers]
‹ RAID0 (striped) 2 GByte file: Read 1500 Mbit/s, Write 1725 Mbit/s
Applications
Throughput for Real Users
Topology of the MB – NG Network
‹ Manchester Domain: man01, man02; Edge Router man03 (Cisco 7609); Boundary Router Cisco 7609
‹ UCL Domain: lon01, lon02, lon03; HW RAID; Boundary Router Cisco 7609
‹ RAL Domain: ral01, ral02; HW RAID; Boundary Router Cisco 7609
‹ Domains joined across the UKERNA Development Network (MPLS admin. domains)
‹ Key: Gigabit Ethernet; 2.5 Gbit POS access
[Diagram: topology of the MB-NG network]
Gridftp Throughput + Web100
‹ RAID0 Disks:
„ 960 Mbit/s read
„ 800 Mbit/s write
‹ Throughput Mbit/s:
‹ See alternate 600/800 Mbit and zero
‹ Data Rate: 520 Mbit/s
‹ Cwnd smooth
‹ No dup Ack / send stall /
timeouts
http data transfers HighSpeed TCP
‹ Same Hardware
‹ Bulk data moved by web servers
‹ Apache web server out of the box!
‹ prototype client - curl http library
‹ 1Mbyte TCP buffers
‹ 2Gbyte file
‹ Throughput ~720 Mbit/s
‹ Cwnd - some variation
‹ No dup Ack / send stall / timeouts
bbftp: What else is going on?
‹ Scalable TCP
‹ BaBar + SuperJANET
‹ SuperMicro + SuperJANET
‹ Congestion window – duplicate
ACK
‹ Variation not TCP related?
„ Disk speed / bus transfer
„ Application
SC2004 UKLIGHT Topology
‹ SC2004 show floor: SLAC booth (Cisco 6509) and Caltech booth (UltraLight IP, Caltech 7600)
‹ NLR Lambda NLR-PITT-STAR-10GE-16 via Chicago Starlight
‹ UKLight 10G to ULCC UKLight; onward 1GE channels (four and two) to Manchester (MB-NG 7600 OSR) and to the UCL network / UCL HEP
‹ SURFnet / EuroLink 10G via Amsterdam
[Diagram: SC2004 UKLight topology]
SC2004 Disk-Disk bbftp
‹ bbftp file transfer program uses TCP/IP
‹ UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
‹ MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
‹ Move a 2 GByte file
‹ Web100 plots of InstantaneousBW, AveBW and CurCwnd:
„ Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
„ Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
‹ Disk-TCP-Disk at 1 Gbit/s
[Plots: achieved TCP rate (Mbit/s) and Cwnd vs time (ms) for Standard and Scalable TCP]
Network & Disk Interactions
(work in progress)
‹ Hosts:
„ Supermicro X5DPE-G2 motherboards
„ Dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
„ 3Ware 8506-8 controller on a 133 MHz PCI-X bus configured as RAID0
„ Six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
‹ Measure memory to RAID0 transfer rates with & without UDP traffic
„ Disk write alone: 1735 Mbit/s
„ Disk write + 1500 MTU UDP: 1218 Mbit/s - a drop of 30%
„ Disk write + 9000 MTU UDP: 1400 Mbit/s - a drop of 19%
[Plots: write throughput (Mbit/s) vs trial number for 8k and 64k writes in each case, and % CPU in system mode for the logical CPU pairs with linear fits of the form y = 178 - 1.05x]
Remote Processing Farms: Manc-CERN
„ Round trip time 20 ms
„ 64 byte request (green), 1 Mbyte response (blue)
„ TCP in slow start: the 1st event takes 19 rtt or ~380 ms
„ TCP congestion window gets re-set on each request
„ TCP stack implementation detail to reduce Cwnd after inactivity
„ Even after 10 s, each response takes 13 rtt or ~260 ms
„ Transfer achievable throughput 120 Mbit/s
[Web100 plots: DataBytesOut / DataBytesIn (deltas) and CurCwnd vs time (ms); achieved TCP rate vs time]
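Note (added toy model): a sketch of why a 1 Mbyte response costs so many round trips when the window has collapsed. The parameters are assumptions for illustration: 1448 byte segments (as in the tcpdump slide later), an initial window of 2 segments, and growth of ~1.5× per rtt with delayed ACKs; the measured 19 rtt for the first event includes additional start-up costs.

```python
import math

def rtts_to_send(total_bytes, mss=1448, init_cwnd=2, growth=1.5):
    """Round trips needed to stream total_bytes while the window grows by `growth` per rtt."""
    segments_left = math.ceil(total_bytes / mss)
    cwnd, rtts = float(init_cwnd), 0
    while segments_left > 0:
        segments_left -= cwnd      # one window's worth of segments per rtt
        cwnd *= growth             # slow-start growth
        rtts += 1
    return rtts

print(rtts_to_send(1_000_000))     # ~13 rtt -> ~260 ms at rtt = 20 ms, as quoted above
```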
Summary, Conclusions & Thanks
‹ Host is critical: Motherboards NICs, RAID controllers and Disks matter
„ The NICs should be well designed:
z NIC should use 64 bit 133 MHz PCI-X (66 MHz PCI can be OK)
z NIC/drivers: CSR access / Clean buffer management / Good interrupt
handling
„ Worry about the CPU-Memory bandwidth as well as the PCI bandwidth
z Data crosses the memory bus at least 3 times
„ Separate the data transfers – use motherboards with multiple 64 bit PCI-X buses
‹ Test Disk Systems with a representative load
„ Choose a modern high throughput RAID controller
z Consider SW RAID0 of RAID5 HW controllers
‹ Need plenty of CPU power for sustained 1 Gbit/s transfers and disk access
‹ Packet loss is a killer
„ Check on campus links & equipment, and access links to backbones
‹ New stacks are stable and give better response & performance
„ Still need to set the TCP buffer sizes & other kernel settings e.g. window-scale
‹ Application architecture & implementation is important
‹ Interaction between HW, protocol processing, and the disk sub-system is complex
More Information Some URLs
‹ UKLight web site: http://www.uklight.ac.uk
‹ DataTAG project web site: http://www.datatag.org/
‹ UDPmon / TCPmon kit + writeup:
http://www.hep.man.ac.uk/~rich/ (Software & Tools)
‹ Motherboard and NIC Tests:
http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt
& http://datatag.web.cern.ch/datatag/pfldnet2003/
“Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality
Motherboards” FGCS Special issue 2004
http://www.hep.man.ac.uk/~rich/ (Publications)
‹ TCP tuning information may be found at:
http://www.ncne.nlanr.net/documentation/faq/performance.html
& http://www.psc.edu/networking/perf_tune.html
‹ TCP stack comparisons:
“Evaluation of Advanced TCP Stacks on Fast Long-Distance
Production Networks” Journal of Grid Computing 2004
http://www.hep.man.ac.uk/~rich/ (Publications)
‹ PFLDnet http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
‹ Dante PERT http://www.geant2.net/server/show/nav.00d00h002
‹ Real-Time Remote Farm site http://csr.phys.ualberta.ca/real-time
‹ Disk information http://www.storagereview.com/
Any Questions?
Backup Slides
TCP (Reno) – Details
‹ Time for TCP to recover its throughput from 1 lost packet is given by:
$\rho = \frac{C \cdot RTT^2}{2 \cdot MSS}$
‹ for an rtt of ~200 ms:
[Plot: time to recover (s, log scale) vs rtt (0-200 ms) for 10 Mbit, 100 Mbit, 1 Gbit, 2.5 Gbit and 10 Gbit links; typical rtts: UK 6 ms, Europe 20 ms, USA 150 ms]
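Note (added worked example): evaluating the recovery-time formula above in Python, with an assumed MSS of 1460 bytes:

```python
def reno_recovery_time(capacity_bps, rtt_seconds, mss_bytes=1460):
    """Time for standard TCP to regain full rate after one loss: C * rtt^2 / (2 * MSS)."""
    return capacity_bps * rtt_seconds ** 2 / (2 * mss_bytes * 8)

for rtt in (0.006, 0.020, 0.150):                 # UK, Europe, USA
    print(rtt, reno_recovery_time(1e9, rtt))      # 1 Gbit/s: ~1.5 s, ~17 s, ~16 minutes
```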
Packet Loss and new TCP Stacks
‹ TCP Response Function
„ UKLight London-Chicago-London, rtt 180 ms
„ 2.6.6 Kernel
„ Agreement with theory is good
„ Stacks compared: A0 Standard (1500), A1 HSTCP, A2 Scalable, A3 HTCP, A5 BICTCP, A8 Westwood, A7 Vegas, plus the Standard and Scalable theory curves
[Plots (sculcc1-chi-2 iperf, 13 Jan 05): TCP achievable throughput (Mbit/s) vs packet drop rate 1 in n, on log and linear throughput scales]
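Note (added for reference): the "theory" curve for standard TCP is usually the Reno response function, an approximation I am quoting here rather than one written on this slide:

```latex
% Reno response function (Mathis et al. approximation): achievable rate at loss probability p
\mathrm{rate} \;\approx\; \frac{MSS}{RTT}\,\sqrt{\frac{3}{2p}}
```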
Test of TCP Sharing: Methodology
(1 Gbit/s)  Les Cottrell, PFLDnet 2005
‹ Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
‹ Used iperf/TCP and UDT/UDP to generate traffic through the bottleneck; ICMP/ping traffic at 1/s monitors the rtt
‹ Each run was 16 minutes, in 7 regions (2 min and 4 min segments)
[Diagram: SLAC sending iperf/UDT and ping traffic over the bottleneck to Caltech/UFL/CERN]
TCP Reno single stream
Les Cottrell, PFLDnet 2005
‹ Low performance on fast long distance paths
„ AIMD (add a=1 packet to cwnd per RTT; decrease cwnd by factor b=0.5 on congestion)
„ Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput
„ Unequal sharing
[Plot: single-stream TCP Reno, SLAC to CERN. Annotations: remaining flows do not take up slack when a flow is removed; RTT increases when it achieves best throughput; congestion has a dramatic effect; recovery is slow; increase recovery rate]
Average Transfer Rates Mbit/s
App       TCP Stack   SuperMicro on  SuperMicro on   BaBar on      SC2004 on
                      MB-NG          SuperJANET4     SuperJANET4   UKLight
Iperf     Standard    940            350-370         425           940
          HighSpeed   940            510             570           940
          Scalable    940            580-650         605           940
bbcp      Standard    434            290-310         290
          HighSpeed   435            385             360
          Scalable    432            400-430         380
bbftp     Standard    400-410        325             320           825
          HighSpeed   370-390        380
          Scalable    430            345-532         380           875
apache    Standard    425            260             300-360
          HighSpeed   430            370             315
          Scalable    428            400             317
Gridftp   Standard    405            240
          HighSpeed   320
          Scalable    335
Notes on the slide: "Rate decreases"; "New stacks give more throughput".
iperf Throughput + Web100
‹ SuperMicro on the MB-NG network, HighSpeed TCP:
„ Line speed 940 Mbit/s
„ DupACK ? <10 (expect ~400)
‹ BaBar host on the production network, Standard TCP:
„ 425 Mbit/s
„ DupACKs 350-400 - re-transmits
[Web100 plots for the two cases]
Applications: Throughput Mbit/s
‹ HighSpeed TCP
‹ 2 GByte file RAID5
‹ SuperMicro + SuperJANET
‹ bbcp
‹ bbftp
‹ Apache
‹ Gridftp
‹ Previous work used RAID0
(not disk limited)
bbcp & GridFTP Throughput
‹ 2Gbyte file transferred RAID5 - 4disks Manc – RAL
‹ bbcp
‹ DataTAG altAIMD kernel in BaBar & ATLAS
‹ Mean 710 Mbit/s
‹ GridFTP
‹ See many zeros
‹ Mean ~620 Mbit/s
[Plots: instantaneous transfer rate vs time for bbcp (mean ~710) and GridFTP (mean ~620)]
tcpdump of the T/DAQ dataflow at SFI (1)
Cern-Manchester 1.0 Mbyte event
Remote EFD requests an event from the SFI
Incoming event request, followed by its ACK
SFI sends the event as N 1448 byte packets
Limited by the TCP receive buffer; when TCP ACKs arrive, more data is sent
Time 115 ms (~4 events/s)
[tcpdump time-sequence plot of the request and the event transfer]