Design and Performance Evaluation of NUMA-Aware RDMA-Based End-to-End Data Transfer System
Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, and Thomas Robertazzi
SC '13

Massive Data Output
• Big data at the peta-/exa-byte scale: climate simulation, high-energy physics, systems biology
• Data transfer and data synchronization across LAN and WAN

End-to-End Data Transfer
• 1~10 Gbps Ethernet between gateways, with a SAN behind each gateway
• TCP-based data transfer software: GridFTP, scp, BBCP
• SAN storage: iSCSI

RDMA is a Game-Changer
• 100 Gbps Ethernet between gateways; RoCE, iWARP, and InfiniBand connect the SANs to the gateways
• Q1: Are TCP-based tools still scalable and efficient at these speeds?
• Q2: How can advanced RDMA technology be used to transfer data at high bandwidth and low cost?

A Preliminary Experiment (3 x 40 Gbps RoCE)
[Bar chart: aggregate throughput in Gbps; TCP reaches 83.5~91.8 Gbps, RDMA reaches 117.6 Gbps]
• TCP was unable to saturate this fat link.
• RDMA achieved 98% of bare-metal throughput.
• Under TCP, 35% of CPU time went to memory copies (copy_user_generic_string()).

Goals
• Better practice in high-speed data transfer
• High throughput: achieve line speed
• Low cost: CPU utilization and memory footprint
• Scalability: 100 Gbps and beyond, and wide-area networks

Hardware Development and Zero-Copy

Hardware Development
• 1990s, single core: memory wall, power wall, frequency wall; the processor-memory performance gap grows about 50% per year (© John D. McCalpin)
• 2000s, multi-core: several cores sharing one memory controller
• 2010s, NUMA: multiple sockets, each with its own local memory
• Network interfaces over the same period: Gigabit Ethernet, then 10 Gigabit Ethernet/InfiniBand, now 40/56/100 Gigabit Ethernet/InfiniBand

Therefore ...
High-speed data transfer should pay close attention to data copies, and use the more efficient zero-copy path to remove this performance bottleneck (see the sketch below).
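The zero-copy path the previous slide calls for is exactly what RDMA verbs expose: the application registers (pins) a buffer with the NIC once, and the NIC then DMAs payloads directly between that buffer and the wire, with no kernel-buffer copy on the way. Below is a minimal libibverbs sketch, assuming the protection domain, queue pair, and the peer's remote_addr/rkey were already exchanged during connection setup (all omitted here); rdma_write_block is an illustrative helper, not RFTP code.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: pin a buffer and push it to the peer with a
 * one-sided RDMA WRITE.  pd, qp, remote_addr, and rkey are assumed to
 * come from connection establishment, which is omitted. */
static int rdma_write_block(struct ibv_pd *pd, struct ibv_qp *qp,
                            uint64_t remote_addr, uint32_t rkey,
                            void *buf, size_t len)
{
    /* Register (pin) the buffer once; afterwards the NIC reads it
     * directly, so no copy_user_generic_string()-style copy occurs. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)buf;   /* identifies the completion */
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;      /* peer's registered buffer */
    wr.wr.rdma.rkey        = rkey;

    /* The CPU's work ends here; the NIC moves the bytes. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

In a long-running transfer the registration would be done once per buffer pool rather than per block, since ibv_reg_mr pins pages and is itself expensive.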
Cost: Data Copy vs. Zero Copy
• Setup: 40 Gbps RoCE; RFTP reading /dev/zero and writing /dev/null (RDMA) vs. iperf (TCP)
• Each transfer has three stages: loading (storage to memory), transmission, and offloading (memory to storage)

Cost Breakdown
[Bar chart: CPU utilization per endpoint, broken down into loading, protocol processing, data copy, and offloading; the RDMA data source and sink cost roughly 111% and 101% of one core, versus roughly 152% and 158% for the TCP data source and sink]

RDMA-Based End-to-End Data Transfer

RDMA-based End-to-End Solution
• Back end (SAN to gateway): iSER, iSCSI Extensions for RDMA
• Front end (gateway to gateway, 100 Gbps Ethernet): RFTP, an RDMA-enabled FTP service

End-to-End: TCP vs. RDMA
• TCP path (GridFTP, SCP over iSCSI): data crosses a kernel buffer between the user buffer and the NIC during loading, transmission, and offloading
• RDMA path (RFTP over iSER): the NIC moves data directly to and from the user buffer over InfiniBand, RoCE, or iWARP

SAN: NUMA-Agnostic vs. NUMA-Aware
• By default the iSER target daemon (tgtd) is NUMA-agnostic: its threads and buffers may land on any socket
• Binding tgtd with numactl to the socket local to its memory and HCA makes it NUMA-aware (see the libnuma sketch before the evaluation section below)

RFTP: RDMA-based FTP Service
• RDMA pros: saves CPU and memory resources; low latency and high throughput
• RDMA cons: explicit memory management; asynchronous, event-driven programming interfaces; the application has to pipeline RDMA operations itself and manage the status of in-flight memory blocks

Front-end: RFTP Software
More in: Protocols for Wide-Area Data-Intensive Applications: Design and Performance Issues, SC '12
• One dedicated reliable-connection queue pair for exchanging control messages, and one or more for the actual data transfer
• Multiple memory blocks in flight
• Multiple reliable queue pairs for bulk data transfer
• Proactive feedback
[Diagram: the data source loads data through get_free_blk/put_ready_blk, the data sink offloads through get_ready_blk/put_free_blk; block notifications travel over the control-message QP, payloads over the bulk-data-transfer QPs; a sketch follows]
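The block pipeline named on the slide above can be made concrete with a short sketch. Everything here other than ibv_poll_cq and the ibv_* types is a hypothetical stand-in for RFTP internals, whose source the slides do not show: blk, the four queue helpers from the diagram, load_from_storage, notify_sink, and post_rdma_write are assumed names. The point is the shape of the pipeline: a loader thread keeps filling pre-registered blocks while a sender thread keeps the wire busy and recycles each block on completion.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical RFTP-style block descriptor and the queue helpers from
 * the slide's diagram; names and layout are assumptions. */
struct blk { void *addr; uint32_t len; struct ibv_mr *mr; };

extern struct blk *get_free_blk(void);         /* waits when pool is empty */
extern void        put_ready_blk(struct blk *b);
extern struct blk *get_ready_blk(void);
extern void        put_free_blk(struct blk *b);
extern void load_from_storage(struct blk *b);  /* the loading stage */
extern void notify_sink(struct blk *b);        /* descriptor over control QP */
extern int  post_rdma_write(struct ibv_qp *qp, struct blk *b); /* wr_id = b */

/* Loader thread: storage -> pre-registered memory blocks. */
void loader_loop(void)
{
    for (;;) {
        struct blk *b = get_free_blk(); /* back-pressure: waits if all in flight */
        load_from_storage(b);
        put_ready_blk(b);
    }
}

/* Sender thread: keeps multiple blocks in flight over the data QPs and
 * recycles each block as soon as its RDMA write completes. */
void sender_loop(struct ibv_qp *data_qp, struct ibv_cq *cq)
{
    for (;;) {
        struct blk *b = get_ready_blk();
        post_rdma_write(data_qp, b);    /* zero-copy: b->mr registered once */
        notify_sink(b);                 /* announce the block over the control QP */

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0)    /* reap completions */
            put_free_blk((struct blk *)(uintptr_t)wc.wr_id);
    }
}
```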
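One more sketch before the evaluation: the NUMA-aware tgtd placement from the "SAN: NUMA-Aware" slide is done externally with numactl (e.g. numactl --cpunodebind=N --membind=N tgtd ...), but a daemon can also bind itself with libnuma. A minimal sketch follows, assuming node 0 is the socket local to the HCA (on Linux the HCA's home node can be read from /sys/class/infiniband/<device>/device/numa_node); link with -lnuma. The storage evaluation in the next section measures exactly this placement against the OS default.

```c
#include <numa.h>     /* libnuma; link with -lnuma */
#include <stdio.h>

/* Pin this process and its allocations to one NUMA node, mimicking
 * `numactl --cpunodebind=0 --membind=0 tgtd`.  Node 0 is a placeholder;
 * use the node reported as local to the HCA. */
int main(void)
{
    int node = 0;
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    if (numa_run_on_node(node) < 0) {   /* bind CPUs to the chosen socket */
        perror("numa_run_on_node");
        return 1;
    }
    numa_set_preferred(node);           /* bias future allocations to it */

    /* I/O buffers can also be placed explicitly on the node: */
    size_t len = 1 << 20;
    void *buf = numa_alloc_onnode(len, node);
    if (!buf)
        return 1;
    /* ... register buf with the HCA and serve I/O ... */
    numa_free(buf, len);
    return 0;
}
```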
End-to-End Performance Evaluation

Testbed Setup
• LAN: 3 x 40 Gbps RoCE, 2 x 56 Gbps InfiniBand; WAN: 40 Gbps RoCE
• 384 GB of memory as the storage media, to simulate a real high-performance storage system
• GridFTP vs. RFTP, measuring bandwidth and CPU utilization while loading data from one storage server and dumping it to another
• TCP tuning applied throughout: jumbo frames, IRQ affinity, TCP buffer sizes, etc.

Storage Performance
• Setup: tgtd on an HP DL380, fio on an IBM x3650, connected through a Mellanox SX6018 FDR InfiniBand switch
• NUMA-aware tgtd vs. NUMA-agnostic tgtd

Storage Performance - Read
[Charts: read throughput (GB/s) and tgtd CPU utilization (%) vs. I/O size from 64 KB to 8192 KB, OS default vs. NUMA-aware tuning; annotated difference of 7.6%]

Storage Performance - Write
[Charts: write throughput (GB/s) and tgtd CPU utilization (%) vs. I/O size from 64 KB to 8192 KB, OS default vs. NUMA-aware tuning; annotated differences of 19% (throughput) and roughly 300% (CPU)]
• The gap comes from cache-coherence traffic in the NUMA architecture: read data can remain in the cached/shared state, while writes force cache lines into the modified state

LAN Testbed
• Hosts: HP DL380 and IBM x3650 servers
• NICs: Mellanox ConnectX-3 VPI 56 Gbps FDR and Mellanox ConnectX-3 40 Gbps Ethernet
• Switches: Mellanox SX1036 QDR Ethernet and Mellanox SX6018 FDR InfiniBand
• Software under test: iSER on the back end; RFTP vs. GridFTP on the front end

LAN: End-to-End Performance
[Charts: bandwidth over a 25-minute run and per-process CPU utilization (user/sys/wait) for the RFTP and GridFTP sources and sinks; RFTP delivers about 3x the bandwidth of GridFTP and runs close to the storage threshold]

End-to-End Performance: Bi-directional
[Chart: bi-directional bandwidth over a 30-minute run for RFTP and GridFTP]
• RFTP: 83% improvement over its unidirectional bandwidth
• GridFTP: 33% improvement over its unidirectional bandwidth

40 Gbps WAN Testbed
• 40 Gbps RoCE WAN between NERSC and ANL
• 4,000 miles; RTT: 95 milliseconds; bandwidth-delay product: 500 MB
• Will RFTP be scalable in the WAN?

RFTP Bandwidth in 40 Gbps WAN
[Chart: bandwidth between 35 and 40 Gbps as the block size varies from 1 MB to 16 MB and the number of streams from 1 to 16]

Scale RDMA to WAN
• RoCE requires a complicated layer-2 configuration for lossless operation.
• iWARP runs on a TCP offload engine (TOE) and operates with standard switches.
[Chart: end-to-end performance over 40 Gbps iWARP in the LAN]

Conclusion
• HPC data transfer: hardware advances need equally advanced software
• Efficient memory usage in HPC: RDMA-based design plus NUMA-aware tuning
• Testbeds in the LAN and WAN validated our design

Q&A
• RFTP software: http://ftp100.cewit.stonybrook.edu/rftp
• RFTP runs live at the Caltech booth; monitoring: http://ftp100.cewit.stonybrook.edu/ganglia
Stony Brook University