10GbE WAN Data Transfers
for Science
High Energy/Nuclear Physics (HENP) SIG
Fall 2004 Internet2 Member Meeting
Yang Xia, HEP, Caltech
yxia@caltech.edu
September 28, 2004
8:00 AM – 10:00 AM
Agenda

- Introduction
- 10GbE NIC comparisons & contrasts
- Overview of LHCnet
- High TCP performance over wide area networks
  - Problem statement
  - Benchmarks
  - Network architecture and tuning
  - Networking enhancements in Linux 2.6 kernels
- Light paths: UltraLight
- FAST TCP protocol development
Introduction

- The High Energy Physics LHC computing model shows data at the experiments
  being stored at rates of 100-1500 Mbytes/sec throughout the year.
- Many Petabytes per year of stored and processed binary data will be accessed
  and processed repeatedly by the worldwide collaborations.
- Network backbone capacities are advancing rapidly into the 10 Gbps range,
  with seamless integration into SONET infrastructure.
- Proliferating GbE adapters on commodity desktops create bottlenecks on GbE
  switch I/O ports.
- More commercial 10GbE adapter products are entering the market, e.g. from
  Intel, S2io, IBM, Chelsio, etc.
IEEE 802.3ae Port Types

Port Type      Wavelength and Fiber Type       WAN/LAN         Maximum Reach
10GBase-SR     850nm / MMF                     LAN             300m
10GBase-LR     1310nm / SMF                    LAN (LAN-PHY)   10km
10GBase-ER     1550nm / SMF                    LAN             40km
10GBase-SW     850nm / MMF                     WAN             300m
10GBase-LW     1310nm / SMF                    WAN (WAN-PHY)   10km
10GBase-EW     1550nm / SMF                    WAN             40km
10GBase-CX4    InfiniBand 4x Twinax cables     LAN             15m
10GBase-T      Twisted-pair                    LAN             100m
10GbE NICs Comparison (Intel vs S2io)

Standard support:
- 802.3ae standard, full duplex only
- 64-bit/133MHz PCI-X bus
- 1310nm SMF / 850nm MMF
- Jumbo frame support

Major differences in performance features:

Feature                                    S2io Adapter          Intel Adapter
PCI-X bus DMA split transaction capacity   32                    2
Rx frame buffer capacity                   64MB                  256KB
MTU                                        9600 Byte             16114 Byte
IPv4 TCP Large Send Offload                Max offload size 80k  Partial; max offload size 32k
LHCnet Network Setup

- 10 Gbps transatlantic link, extended to Caltech via Abilene and CENIC.
  An NLR wave for the local loop is a work in progress.
- High-performance end stations (Intel Xeon & Itanium, AMD Opteron) running
  both Linux and Windows.
- We have added a 64x64 non-SONET all-optical switch from Calient to provision
  dynamic paths via MonALISA, in the context of UltraLight.
LHCnet Topology: August 2004

[Topology diagram: Caltech/DoE PoP (Chicago) and CERN (Geneva), connected over
OC-192 (production and R&D). The LHCnet test beds at the two sites include
Juniper T320 and M10, Alcatel 7770, Procket 8801 and Cisco 7609 routers, a
Glimmerglass photonic switch, a Juniper 10GE internal network, and Linux farms
(20 P4 CPUs, 6 TBytes each), with 10GE connections to American and European
partners.]
- Services: IPv4 & IPv6; Layer 2 VPN; QoS; scavenger; large MTU (9k); MPLS;
  link aggregation; monitoring (MonALISA)
- Clean separation of production and R&D traffic based on CCC
- Unique multi-platform / multi-technology optical transatlantic test bed
- Powerful Linux farms equipped with 10GE adapters (Intel; S2io)
- Equipment loans and donations; exceptional discounts
- NEW: Photonic switch (Glimmerglass T300) evaluation
- Circuit ("pure" light path) provisioning between StarLight and CERN
LHCnet Topology: August 2004 (cont'd)

[Diagram: optical switch matrix - Calient photonic cross-connect switch]

GMPLS-controlled PXCs and IP/MPLS routers can provide dynamic shortest-path
setup as well as path setup based on link priority.
Problem Statement

1. To get the most bang for the buck on a 10GbE WAN, packet loss is the #1
   enemy, because TCP responds slowly due to the AIMD algorithm (illustrated
   below):
   - No loss: cwnd := cwnd + 1/cwnd
   - Loss:    cwnd := cwnd/2

2. Fairness: TCP Reno has an MTU & RTT bias.
   Different MTUs and delays lead to very poor sharing of the bandwidth.
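To see concretely why a single loss is so costly at 10 Gbps, here is a
back-of-the-envelope check using the Geneva-LA figures from the responsiveness
table later in this talk (10 Gbps, RTT = 0.18 s, 9000-byte MTU):

    Full-rate window = BW*RTT/MSS = (10e9 b/s * 0.18 s) / (9000 B * 8 b/B)
                     ~ 25,000 segments
    After one loss, cwnd drops to ~12,500 segments and grows back by roughly
    one segment per RTT, so recovery takes ~12,500 * 0.18 s ~ 2,250 s ~ 38 min.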
Internet2 Land Speed Record (LSR)

- IPv6 record: 4.0 Gbps between Geneva and Phoenix (SC2003)
- IPv4 multi-stream record with Windows: 7.09 Gbps between Caltech and CERN (11k km)
- Single-stream 6.6 Gbps x 16.5k km with Linux
- We have exceeded 100 Petabit-m/sec with both Linux & Windows
- Testing over different WAN distances doesn't seem to change the TCP rate:
  - 7k km (Geneva - Chicago)
  - 11k km (normal Abilene path)
  - 12.5k km (petit Abilene tour)
  - 16.5k km (grande Abilene tour)
- Monitoring of the Abilene traffic in LA
Internet2 Land Speed Record (cont'd)
Single-Stream IPv4 Category: Primary Workstation Summary

Sending station:
- Newisys 4300, 4 x AMD Opteron 248 2.2GHz, 4GB PC3200 per processor. Up to
  5 x 1GB/s 133MHz/64-bit PCI-X slots. No FSB bottleneck: HyperTransport
  connects the CPUs (up to 19.2GB/s peak BW per processor). 24-disk SATA RAID
  system at 1.2GB/s read/write.
- Opteron white box with Tyan S2882 motherboard, 2 x Opteron 2.4GHz, 2GB DDR.
- AMD8131 chipset PCI-X bus speed: ~940MB/s

Receiving station:
- HP rx4640, 4 x 1.5GHz Itanium-2, zx1 chipset, 8GB memory.
- SATA disk RAID system
Linux Tuning Parameters

1. PCI-X bus parameters (via the setpci command):
   - Maximum Memory Read Byte Count (MMRBC) controls PCI-X transmit burst
     lengths on the bus. Available values are 512 (default), 1024, 2048 and
     4096 bytes.
   - "max_split_trans" controls the number of outstanding split transactions.
     Available values are 1, 2, 3, 4.
   - Set latency_timer to 248.
2. Interrupt coalescence:
   - The NIC raises one interrupt for a batch of packets instead of one per
     packet, reducing per-packet CPU overhead; pinning the CPU affinity of the
     NIC's interrupts also helps.
3. Large window size = BW*Delay (BDP):
   - A window that is too large will negatively impact throughput.

4. 9000-byte MTU and 64KB TSO
Linux Tuning Parameters (cont'd)

5. Use the sysctl command to modify /proc parameters and increase the TCP
   memory limits (see the sketch below).
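A minimal sketch of these tuning steps on a Linux sender; the PCI device
address, interface name and buffer sizes are illustrative placeholders, not the
exact values used in the record runs:

    # PCI-X tuning via setpci -- the device address (02:01.0) is a placeholder;
    # MMRBC and the outstanding-split count live in the device-specific PCI-X
    # command register, so consult the NIC documentation for those offsets.
    setpci -s 02:01.0 latency_timer=f8      # 0xf8 = 248

    # Jumbo frames on the 10GbE interface (interface name is a placeholder)
    ifconfig eth2 mtu 9000

    # TCP memory limits sized around the bandwidth-delay product
    # (10 Gbps x 0.18 s ~ 225 MB; the values below are examples only)
    sysctl -w net.core.rmem_max=268435456
    sysctl -w net.core.wmem_max=268435456
    sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"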
10GbE Network Testing Tools

In Linux:
- Iperf (see the example after this list):
  - Version 1.7.0 doesn't work by default on the Itanium2 machine. Workarounds:
    1) compile with RedHat's gcc 2.96, or 2) make it single-threaded.
  - The UDP send rate is limited to 2Gbps because of a 32-bit data type.
- Nttcp: measures the time required to send a preset chunk of data.
- Netperf (v2.1): sends as much data as it can in an interval and collects the
  results at the end of the test. Great for end-to-end latency tests.
- Tcpdump: a challenging task on a 10GbE link.

In Windows:
- NTttcp: uses Windows APIs
- Microsoft Network Monitoring Tool
- Ethereal
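A typical memory-to-memory iperf test between the end hosts looks roughly like
the following; the host name, window size and duration are placeholders chosen
to match the BDP discussion above:

    # Receiver, with a large socket buffer
    iperf -s -w 64M

    # Sender: single TCP stream for 60 seconds, reporting every 5 seconds
    iperf -c receiver.example.org -w 64M -t 60 -i 5

    # UDP mode is capped near 2 Gbps in iperf 1.7.0 (32-bit rate field), e.g.:
    iperf -c receiver.example.org -u -b 1900M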
Networking Enhancements in Linux 2.6

- The 2.6.x Linux kernels include many general improvements to system
  performance, scalability and hardware drivers.
- Improved POSIX threading support (NGPT and NPTL)
- Support for AMD 64-bit (x86-64) and improved NUMA support
- TCP Segmentation Offload (TSO)
- Network interrupt mitigation: improved handling of high network loads
- Zero-copy networking and NFS: one system call,
  sendfile(sd, fd, &offset, nbytes)
- NFS version 4
TCP Segmentation Offload

- Requires hardware support in the NIC.
- It is a sender-only option. It allows the TCP layer to hand a larger-than-
  normal segment of data, e.g. 64KB, to the driver and then to the NIC; the NIC
  then fragments the large packet into smaller (<= MTU) packets.
- TSO is disabled in multiple places in the TCP code: when SACKs are received
  (in tcp_sacktag_write_queue) and when a packet is retransmitted (in
  tcp_retransmit_skb). However, TSO is never re-enabled in the current 2.6.8
  kernel when the TCP state returns to normal (TCP_CA_Open). The kernel needs
  to be patched to re-enable TSO.
- Benefits:
  - TSO can reduce CPU overhead by 10%~15%.
  - TSO improves TCP responsiveness, where responsiveness is:
    p = (C*RTT*RTT)/(2*MSS)
    p:   time to recover to full rate
    C:   capacity of the link
    RTT: round-trip time
    MSS: maximum segment size
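As a sanity check, plugging the numbers from the table below into this formula
reproduces its entries (the arithmetic assumes MSS ~ MTU, and an effective MSS
of 64KB when TSO is used):

    Geneva-LA, 10Gbps, RTT = 0.18s, MSS = 9000B:
        p = 10e9 * 0.18^2  / (2 * 9000 * 8)   ~ 2,250 s ~ 38 min
    Geneva-LA long path w/ 64KB TSO, 10Gbps, RTT = 0.252s, MSS ~ 64KB:
        p = 10e9 * 0.252^2 / (2 * 65536 * 8)  ~ 606 s   ~ 10 min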
Responsiveness with and w/o TSO

Path                                BW      RTT (s)  MTU   Responsiveness (min)  With Delayed ACK (min)
Geneva-LA (Normal Path)             10Gbps  0.18     9000  38                    75
Geneva-LA (Long Path)               10Gbps  0.252    9000  74                    148
Geneva-LA (Long Path w/ 64KB TSO)   10Gbps  0.252    9000  10                    20
LAN                                 10Gbps  0.001    1500  428ms                 856ms
Geneva-Chicago                      10Gbps  0.12     1500  103                   205
Geneva-LA (Normal Path)             1Gbps   0.18     1500  23                    46
Geneva-LA (Long Path)               1Gbps   0.252    1500  45                    91
Geneva-LA (Long Path w/ 64KB TSO)   1Gbps   0.252    1500  1                     2
The Transfer over 10GbE WAN

- With a 9000-byte MTU and a stock Linux 2.6.7 kernel:
  - LAN: 7.5Gb/s
  - WAN: 7.4Gb/s (receiver is CPU-bound)
- We have reached the PCI-X bus limit with a single NIC.
- Using bonding (802.3ad) of multiple interfaces we can bypass the PCI-X bus
  limitation, but only in the multiple-stream case (see the sketch below):
  - LAN: 11.1Gb/s
  - WAN: ??? (a.k.a. doomsday for Abilene)
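A rough sketch of how such an 802.3ad bond could be brought up on a 2.6 kernel
of that era; the interface names, IP address and use of ifenslave are
illustrative assumptions, not the exact configuration used here:

    # Load the bonding driver in 802.3ad (LACP) mode; miimon polls link state
    modprobe bonding mode=802.3ad miimon=100

    # Bring up the bond (placeholder address) and enslave two 10GbE NICs
    ifconfig bond0 192.0.2.10 netmask 255.255.255.0 up
    ifenslave bond0 eth2 eth3

    # Note: 802.3ad hashes each flow onto one slave, so the aggregate rate only
    # exceeds a single NIC's limit when multiple streams are used.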
UltraLight: Developing Advanced Network Services for Data-Intensive HEP Applications

- UltraLight (funded by NSF ITR): a next-generation hybrid packet- and
  circuit-switched network infrastructure.
  - Packet-switched: cost-effective solution; requires ultrascale protocols to
    share 10G links efficiently and fairly.
  - Circuit-switched: scheduled or sudden "overflow" demands are handled by
    provisioning additional wavelengths; uses path diversity, e.g. across the
    US, the Atlantic, Canada, ...
- Extend and augment existing grid computing infrastructures (currently focused
  on CPU/storage) to include the network as an integral component.
- Using MonALISA to monitor and manage global systems.
- Partners: Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack; CERN, Internet2,
  NLR, CENIC; Translight, UKLight, Netherlight; UvA, UCL, KEK, Taiwan
- Strong support from Cisco and Level(3)
"Ultrascale" Protocol Development: FAST TCP

- FAST TCP
  - Based on TCP Vegas
  - Uses end-to-end delay and loss to dynamically adjust the congestion window
    (a sketch of the update rule is given at the end of this slide)
  - Achieves any desired fairness, expressed by a utility function
  - Very high utilization (99% in theory)
- Compared to other TCP variants, e.g. BIC, Westwood+ (results below)
Comparison at capacity = OC-192 (9.5Gbps), 264 ms round-trip latency, 1 flow:
- Linux TCP: 30% bandwidth utilization
- Linux Westwood+: 40%
- Linux BIC TCP: 50%
- FAST: 79%
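For reference, the FAST window update takes roughly the following form
(paraphrased from the FAST TCP literature, not shown in this talk), where gamma
is a smoothing gain and alpha is the number of packets each flow aims to keep
queued along the path:

    w  <-  min{ 2*w,  (1 - gamma)*w + gamma*( (baseRTT/RTT)*w + alpha ) }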
Summary and Future Approaches

- Full TCP offload engines will be available for 10GbE in the near future.
  There is a trade-off between maximizing CPU utilization and ensuring data
  integrity.
- Develop and provide the cost-effective transatlantic network infrastructure
  and services required to meet the HEP community's needs:
  - a highly reliable, high-performance production network with rapidly
    increasing capacity and a diverse workload;
  - an advanced research backbone for network and Grid developments, including
    operations and management assisted by agent-based software (MonALISA).
- Concentrate on reliable Terabyte-scale file transfers, to drive development
  of an effective Grid-based Computing Model for LHC data analysis.