10GbE WAN Data Transfers for Science
High Energy/Nuclear Physics (HENP) SIG, Fall 2004 Internet2 Member Meeting
Yang Xia, HEP, Caltech (yxia@caltech.edu)
September 28, 2004, 8:00 AM – 10:00 AM

Agenda
- Introduction
- 10GE NIC comparisons & contrasts
- Overview of LHCnet
- High TCP performance over wide area networks
  - Problem statement
  - Benchmarks
  - Network architecture and tuning
- Networking enhancements in Linux 2.6 kernels
- Light paths: UltraLight
- FAST TCP protocol development

Introduction
- The High Energy Physics LHC model shows that data at the experiment will be stored at a rate of 100-1500 Mbytes/sec throughout the year.
- Many Petabytes per year of stored and processed binary data will be accessed and processed repeatedly by the worldwide collaborations.
- Network backbone capacities are advancing rapidly into the 10 Gbps range, with seamless integration into SONETs.
- Proliferating GbE adapters on commodity desktops create a bottleneck on GbE switch I/O ports.
- More commercial 10GbE adapter products are entering the market, e.g. Intel, S2io, IBM, Chelsio.

IEEE 802.3ae Port Types

  Port Type     Wavelength / Fiber Type         WAN/LAN   Maximum Reach
  10GBase-SR    850nm / MMF                     LAN       300m
  10GBase-LR    1310nm / SMF (LAN-PHY)          LAN       10km
  10GBase-ER    1550nm / SMF                    LAN       40km
  10GBase-SW    850nm / MMF                     WAN       300m
  10GBase-LW    1310nm / SMF (WAN-PHY)          WAN       10km
  10GBase-EW    1550nm / SMF                    WAN       40km
  10GBase-CX4   InfiniBand 4x / twinax cables   LAN       15m
  10GBase-T     Twisted-pair                    LAN       100m

10GbE NICs Comparison (Intel vs S2io)
- Common standard support: 802.3ae, full duplex only; 64bit/133MHz PCI-X bus; 1310nm SMF and 850nm MMF; jumbo frame support.
- Major differences in performance features:

  Feature                                    S2io Adapter            Intel Adapter
  PCI-X bus DMA split transaction capacity   32                      2
  Rx frame buffer capacity                   64MB                    256KB
  MTU                                        9600 bytes              16114 bytes
  IPv4 TCP Large Send Offload                Max offload size 80KB   Partial; max offload size 32KB

LHCnet Network Setup
- 10 Gbps transatlantic link extended to Caltech via Abilene and CENIC. The NLR wave local loop is a work in progress.
- High-performance end stations (Intel Xeon & Itanium, AMD Opteron) running both Linux and Windows.
- We have added a 64x64 non-SONET all-optical switch from Calient to provision dynamic paths via MonALISA, in the context of UltraLight.

LHCnet Topology: August 2004
[Topology diagram: the LHCnet testbed linking the Caltech/DoE PoP in Chicago and CERN in Geneva over an OC-192 transatlantic circuit (production and R&D). Each site shows Cisco 7609, Juniper T320 and M10, Alcatel 7770 and Procket 8801 routers and a Linux farm of 20 P4 CPUs with 6 TBytes of storage; a Glimmerglass switch and 10GE peerings to American and European partners are also shown.]
- Services: IPv4 & IPv6; Layer 2 VPN; QoS; scavenger; large MTU (9k); MPLS; link aggregation; monitoring (MonALISA).
- Clean separation of production and R&D traffic based on CCC.
- Unique multi-platform / multi-technology optical transatlantic test-bed.
- Powerful Linux farms equipped with 10 GE adapters (Intel; S2io).
- Equipment loans and donations; exceptional discounts.
- NEW: photonic switch (Glimmerglass T300) evaluation.
- Circuit ("pure" light path) provisioning between StarLight and CERN.

LHCnet Topology: August 2004 (cont'd)
- Optical switch matrix: Calient photonic cross-connect switch.
- GMPLS-controlled PXCs and IP/MPLS routers can provide dynamic shortest-path setup, as well as path setup based on link priority.

Problem Statement
1. To get the most bang for the buck on a 10GbE WAN, packet loss is the #1 enemy, because TCP responds slowly under the AIMD algorithm (see the recovery-time sketch after this list):
   - No loss: cwnd := cwnd + 1/cwnd (per ACK)
   - Loss:    cwnd := cwnd / 2
2. Fairness: TCP Reno has an MTU and RTT bias; different MTUs and delays lead to very poor sharing of the bandwidth.
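As a rough illustration of how punishing a single loss is on a long fat pipe, the minimal C sketch below (not from the talk; the link capacity, RTT and MSS are illustrative assumptions) halves cwnd as Reno does after one loss and counts the RTTs of additive increase needed to refill a 10 Gbps, 180 ms path. It roughly reproduces the ~38-minute recovery time quoted for the Geneva-LA path in the responsiveness table later in these notes.

```c
/* Minimal sketch (not from the talk): after one loss, TCP Reno halves cwnd
 * and then adds roughly one segment per RTT. Count the RTTs needed to
 * refill a 10 Gbps, 180 ms pipe with 9000-byte frames.
 * All parameters are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double capacity_bps = 10e9;       /* assumed link capacity        */
    const double rtt_s        = 0.18;       /* assumed round-trip time      */
    const double mss_bits     = 8960 * 8;   /* assumed MSS for a 9000B MTU  */

    /* Window (in segments) that fills the pipe: bandwidth-delay product / MSS. */
    double full_cwnd = capacity_bps * rtt_s / mss_bits;
    double cwnd      = full_cwnd / 2.0;     /* loss: cwnd := cwnd / 2       */
    long   rtts      = 0;

    /* No loss: cwnd += 1/cwnd per ACK, i.e. about one segment per RTT. */
    while (cwnd < full_cwnd) {
        cwnd += 1.0;
        rtts++;
    }

    printf("full cwnd ~ %.0f segments\n", full_cwnd);
    printf("recovery  ~ %ld RTTs ~ %.1f minutes\n", rtts, rtts * rtt_s / 60.0);
    return 0;
}
```

With delayed ACKs the window grows only about once every other RTT, which is why the "With Delayed ACK" column in the responsiveness table roughly doubles these times.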
Internet2 Land Speed Record (LSR)
- IPv6 record: 4.0 Gbps between Geneva and Phoenix (SC2003).
- IPv4 multi-stream record with Windows: 7.09 Gbps between Caltech and CERN (11k km).
- Single stream: 6.6 Gbps x 16.5k km with Linux.
- We have exceeded 100 Petabit-m/sec with both Linux & Windows.
- Testing over different WAN distances does not seem to change the TCP rate:
  - 7k km (Geneva - Chicago)
  - 11k km (normal Abilene path)
  - 12.5k km ("Petit Abilene" tour)
  - 16.5k km ("Grande Abilene" tour)
[Figure: monitoring of the Abilene traffic in LA]

Internet2 Land Speed Record (cont'd)
Single-stream IPv4 category, primary workstation summary:
- Sending station: Newisys 4300, 4 x AMD Opteron 248 2.2GHz, 4GB PC3200 per processor. Up to 5 x 1GB/s 133MHz/64bit PCI-X slots; no FSB bottleneck. HyperTransport connects the CPUs (up to 19.2GB/s peak bandwidth per processor). 24-disk SATA RAID system at 1.2GB/s read/write.
- Also used: Opteron white box with Tyan S2882 motherboard, 2 x Opteron 2.4 GHz, 2 GB DDR, AMD8131 chipset; PCI-X bus speed ~940MB/s.
- Receiving station: HP rx4640, 4 x 1.5GHz Itanium-2, zx1 chipset, 8GB memory. SATA disk RAID system.

Linux Tuning Parameters
1. PCI-X bus parameters (via the setpci command):
   - Maximum Memory Read Byte Count (MMRBC) controls PCI-X transmit burst lengths on the bus; available values are 512 (default), 1024, 2048 and 4096 bytes.
   - "max_split_trans" controls outstanding split transactions; available values are 1, 2, 3, 4.
   - Set latency_timer to 248.
2. Interrupt coalescence; the CPU affinity of the interrupts in the system can also be changed.
3. Large window size = bandwidth * delay (BDP). Too large a window will negatively impact throughput.
4. 9000-byte MTU and 64KB TSO.

Linux Tuning Parameters (cont'd)
5. Use the sysctl command to modify /proc parameters and increase the TCP memory values.

10GbE Network Testing Tools
In Linux:
- Iperf: version 1.7.0 doesn't work by default on the Itanium2 machine. Workarounds: 1) compile using RedHat's gcc 2.96, or 2) make it single-threaded. The UDP send rate is limited to 2 Gbps because of a 32-bit data type.
- Nttcp: measures the time required to send a preset chunk of data.
- Netperf (v2.1): sends as much data as it can in an interval and collects the results at the end of the test. Great for end-to-end latency tests.
- Tcpdump: a challenging task on a 10GbE link.
In Windows:
- NTttcp (uses Windows APIs)
- Microsoft Network Monitoring Tool
- Ethereal

Networking Enhancements in Linux 2.6
The 2.6.x Linux kernel has made many improvements to overall system performance, scalability and hardware drivers:
- Improved POSIX threading support (NGPT and NPTL).
- AMD 64-bit (x86-64) support and improved NUMA support.
- TCP Segmentation Offload (TSO).
- Network interrupt mitigation: improved handling of high network loads.
- Zero-copy networking and NFS: one system call, sendfile(sd, fd, &offset, nbytes) (see the sketch after this list).
- NFS version 4.
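The "zero-copy networking" bullet above refers to the Linux sendfile() system call. The sketch below (not from the talk) shows a minimal sender that streams a file to a TCP peer with sendfile(), so the data never passes through a user-space buffer; the file name, address and port are hypothetical.

```c
/* Minimal zero-copy sender sketch: one sendfile() call per chunk moves file
 * data from the page cache to the socket without a user-space copy.
 * The file path, peer address and port are illustrative assumptions. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = open("/data/testfile", O_RDONLY);          /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5001);                       /* hypothetical port */
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);    /* hypothetical host */
    if (connect(sd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect"); return 1;
    }

    /* Zero-copy transfer: the kernel feeds the file straight to the NIC. */
    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sd, fd, &offset, st.st_size - offset);
        if (sent <= 0) { perror("sendfile"); return 1; }
    }

    close(sd);
    close(fd);
    return 0;
}
```

Compared with a read()/write() loop, this avoids copying the data into and out of user space, which reduces CPU load when a single stream has to fill a 10 Gb/s link.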
TCP Segmentation Offload
- Must have hardware support in the NIC; it is a sender-only option.
- It allows the TCP layer to hand a larger-than-normal segment of data, e.g. 64KB, to the driver and then to the NIC. The NIC then fragments the large packet into smaller (<= MTU) packets.
- TSO is disabled in multiple places in the TCP functions: when SACKs are received (in tcp_sacktag_write_queue) and when a packet is retransmitted (in tcp_retransmit_skb). However, TSO is never re-enabled in the current 2.6.8 kernel when the TCP state changes back to normal (TCP_CA_Open), so the kernel needs to be patched to re-enable TSO.
- Benefits:
  - TSO can reduce CPU overhead by 10%-15%.
  - TSO increases TCP responsiveness: p = (C * RTT^2) / (2 * MSS), where p is the time to recover to full rate, C is the capacity of the link, RTT is the round-trip time and MSS is the maximum segment size. A larger effective segment (64KB TSO) therefore shrinks p.

Responsiveness With and Without TSO

  Path                              BW      RTT (s)  MTU   Responsiveness  With Delayed ACK
  Geneva-LA (normal path)           10Gbps  0.18     9000  38 min          75 min
  Geneva-LA (long path)             10Gbps  0.252    9000  74 min          148 min
  Geneva-LA (long path, 64KB TSO)   10Gbps  0.252    9000  10 min          20 min
  LAN                               10Gbps  0.001    1500  428 ms          856 ms
  Geneva-Chicago                    10Gbps  0.12     1500  103 min         205 min
  Geneva-LA (normal path)           1Gbps   0.18     1500  23 min          46 min
  Geneva-LA (long path)             1Gbps   0.252    1500  45 min          91 min
  Geneva-LA (long path, 64KB TSO)   1Gbps   0.252    1500  1 min           2 min

The Transfer over 10GbE WAN
- With a 9000-byte MTU and a stock Linux 2.6.7 kernel:
  - LAN: 7.5 Gb/s
  - WAN: 7.4 Gb/s (the receiver is CPU-bound)
- We have reached the PCI-X bus limit with a single NIC. Using bonding (802.3ad) of multiple interfaces we can bypass the PCI-X bus limitation, in the multiple-streams case only:
  - LAN: 11.1 Gb/s
  - WAN: ??? (a.k.a. doomsday for Abilene)

UltraLight: Developing Advanced Network Services for Data-Intensive HEP Applications
- UltraLight (funded by NSF ITR): a next-generation hybrid packet- and circuit-switched network infrastructure.
  - Packet-switched: a cost-effective solution; requires ultrascale protocols to share 10G efficiently and fairly.
  - Circuit-switched: scheduled or sudden "overflow" demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, the Atlantic, Canada, ...
- Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component.
- Use MonALISA to monitor and manage global systems.
- Partners: Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack; CERN, Internet2, NLR, CENIC; Translight, UKLight, Netherlight; UvA, UCL, KEK, Taiwan. Strong support from Cisco and Level(3).
- "Ultrascale" protocol development: FAST TCP.

FAST TCP
- Based on TCP Vegas.
- Uses end-to-end delay and loss to dynamically adjust the congestion window (a toy sketch of a delay-based window update appears at the end of these notes).
- Achieves any desired fairness, expressed by a utility function.
- Very high utilization (99% in theory).
- Comparison with other TCP variants (e.g. BIC, Westwood+) on an OC-192 path (9.5 Gbps capacity), 264 ms round-trip latency, 1 flow: Linux TCP used about 30% of the bandwidth, Linux Westwood+ and Linux BIC TCP about 40-50%, and FAST 79%.

Summary and Future Approaches
- A full TCP offload engine will be available for 10GbE in the near future. There is a trade-off between maximizing CPU utilization and ensuring data integrity.
- Develop and provide the cost-effective transatlantic network infrastructure and services required to meet the HEP community's needs:
  - a highly reliable, high-performance production network, with rapidly increasing capacity and a diverse workload;
  - an advanced research backbone for network and Grid developments, including operations and management assisted by agent-based software (MonALISA).
- Concentrate on reliable Terabyte-scale file transfers, to drive development of an effective Grid-based Computing Model for LHC data analysis.
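As an appendix-style illustration of the delay-based window adjustment mentioned on the FAST TCP slide, the toy C sketch below applies the published FAST-style rule w <- min(2w, (1-gamma)*w + gamma*((baseRTT/RTT)*w + alpha)). The single-bottleneck queue model and the values of gamma, alpha, capacity and the update interval are illustrative assumptions, not the authors' implementation; the point is only that the window settles so that about alpha packets sit queued at the bottleneck, without inducing loss.

```c
/* Toy sketch of a FAST-style, delay-based window update (not the authors'
 * code). Model: a single bottleneck where packets beyond the bandwidth-delay
 * product queue up and add delay. All parameters are illustrative. */
#include <stdio.h>

int main(void)
{
    const double capacity_pps = 10e9 / (8.0 * 8960.0); /* assumed: 10 Gbps, 9000B MTU */
    const double base_rtt     = 0.264;  /* 264 ms, as on the FAST comparison slide */
    const double alpha        = 200.0;  /* target packets queued at the bottleneck (assumed) */
    const double gamma        = 0.5;    /* update gain (assumed) */
    const double bdp          = capacity_pps * base_rtt;

    double w = 10.0;                    /* congestion window, in packets */
    for (int step = 0; step < 500; step++) {
        /* Packets beyond the BDP sit in the bottleneck queue and add delay. */
        double queued = (w > bdp) ? (w - bdp) : 0.0;
        double rtt    = base_rtt + queued / capacity_pps;

        /* FAST-style update: w <- min(2w, (1-g)w + g((baseRTT/RTT)w + alpha)). */
        double next = (1.0 - gamma) * w + gamma * ((base_rtt / rtt) * w + alpha);
        w = (next < 2.0 * w) ? next : 2.0 * w;

        if (step % 100 == 0)
            printf("step %3d: w = %8.0f packets, rtt = %6.4f s\n", step, w, rtt);
    }
    printf("BDP = %.0f packets; equilibrium window ~ BDP + alpha\n", bdp);
    return 0;
}
```

Unlike the AIMD loss response sketched after the Problem Statement, the window here stabilizes near the bandwidth-delay product instead of oscillating around it, which is what allows a delay-based protocol like FAST to hold a high share of an OC-192 with a single flow.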