Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang, zhangxiang@ncic.ac.cn
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Introduction
• Virtualization is now one of the enabling technologies of cloud computing
• Many HPC providers now use their systems as platforms for cloud/utility computing; these HPC-on-Demand offerings include:
  – Penguin's POD
  – IBM's Computing on Demand service
  – R Systems' dedicated hosting service
  – Amazon's EC2

Introduction: Virtualizing HPC Clouds?
• Pros:
  – Good manageability
  – Proactive fault tolerance
  – Performance isolation
  – Online system maintenance
• Cons:
  – Performance gap: virtualized clusters usually lack the low-latency interconnects that tightly coupled MPI applications depend on
• VMM-bypass I/O has been proposed to relieve this concern

Introduction: VMM-Bypass I/O Virtualization
• The Xen split device driver model is used only to set up the necessary user access points
• Data communication on the critical path bypasses both the guest OS and the VMM
• [Figure: VMM-bypass I/O (courtesy [7]): application and guest module in the VM, backend and privileged modules in the IDD, on top of an OS-bypass I/O device; privileged access vs. VMM-bypass access]

Introduction: InfiniBand Overview
• InfiniBand is a popular high-speed interconnect
  – OS-bypass/RDMA
  – Latency: ~1 us
  – Bandwidth: 3300 MB/s
• ~41.4% of the Top500 systems now use InfiniBand as the primary interconnect
• [Figure: interconnect family share of Top500 systems, June 2010; source: http://www.top500.org]

Introduction: InfiniBand Scalability Problem
• Reliable Connection (RC)
  – Each Queue Pair (QP) consists of a send queue (SQ) and a receive queue (RQ)
  – QPs require memory
  – Connections per process: (N-1)×C, where N is the node count and C the cores per node
• Shared Receive Queue (SRQ)
• eXtensible Reliable Connection (XRC)
  – XRC domain and SRQ-based addressing
  – Connections per process: (N-1)
• [Figure: RC vs. XRC in InfiniBand; with XRC, processes P1–P4 on node1 reach P5–P8 on node2 through per-process SRQs (SRQ5–SRQ8) inside a shared XRC domain]

Problem Statement
• Does a scalability gap exist between native and virtualized environments? (CV: cores per VM)

  Transport    | QPs per Process | QPs per Node
  Native RC    | (N-1)×C         | (N-1)×C²
  Native XRC   | (N-1)           | (N-1)×C
  VM RC        | (N-1)×C         | (N-1)×C²
  VM XRC       | (N-1)×(C/CV)    | (N-1)×(C²/CV)

• [Figure: XRC in VMs with CV=1 and CV=2; each VM gets its own XRC domain (XRCD)]
• A scalability gap exists!
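To make these counts concrete, below is a minimal sketch that evaluates the formulas from the table above for one illustrative cluster shape; the values of N, C, and CV are assumptions for the example, not numbers from the slides.

```c
/* Sketch: QP (connection) counts behind the table above, for a fully
 * connected MPI job. N = nodes, C = cores per node, CV = cores per VM.
 * The cluster shape below is illustrative only.
 */
#include <stdio.h>

int main(void)
{
    const int N = 8, C = 16, CV = 4;           /* illustrative values */

    int rc_per_proc     = (N - 1) * C;          /* RC: one QP per remote process     */
    int xrc_per_proc    = (N - 1);              /* native XRC: one QP per remote node */
    int vm_xrc_per_proc = (N - 1) * (C / CV);   /* XRC in VMs: one QP per remote VM   */

    printf("QPs per process: RC=%d  XRC=%d  VM-XRC(CV=%d)=%d\n",
           rc_per_proc, xrc_per_proc, CV, vm_xrc_per_proc);
    printf("QPs per node:    RC=%d  XRC=%d  VM-XRC(CV=%d)=%d\n",
           rc_per_proc * C, xrc_per_proc * C, CV, vm_xrc_per_proc * C);
    return 0;
}
```

Only when CV = C (a single VM spanning the whole node) does the virtualized XRC count match the native one; that remaining gap is what the VM-proof design in the following slides removes.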
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work

Proposed Design: VM-proof XRC Design
• The design goal is to eliminate the scalability gap
  – Connections per process: (N-1)×(C/CV) → (N-1)
• [Figure: processes P1–P8, spread across VMs on two nodes, all using one shared XRC domain per node]

Proposed Design: Design Challenges
• VM-proof sharing of the XRC domain
  – A single XRC domain must be shared among the different VMs within a physical node
• VM-proof connection management
  – With a single XRC connection, P1 must be able to send data to all processes on another physical node (P5–P8), no matter which VMs those processes reside in
• [Figure: internal architecture: the MPI application and MPI library (ADI and channel interface) with VM-proof connection management in the guest domain; front-end driver, back-end driver, core InfiniBand modules, and resource management with VM-proof XRCD sharing spanning the guest and the IDD; native HCA driver, device/event channels, and the Xen hypervisor above the high-speed interconnection network]

Proposed Design: Implementation
• VM-proof sharing of the XRCD (a verbs sketch illustrating this appears after the evaluation configuration slide below)
  – The XRCD is shared by opening the same XRCD file
  – Guest domains and the IDD have dedicated, non-shared filesystems
  – Hence a pseudo XRCD file (in the guest) and a real XRCD file (in the IDD)
• VM-proof connection management
  – Traditionally, an IP address/hostname is used to identify a node
  – The LID of the HCA is used instead

Proposed Design: Discussions
• Safe XRCD sharing
  – Unauthorized applications from other VMs might otherwise share the XRCD
  – Isolation of XRCD sharing can be guaranteed by the IDD
• Isolation between VMs running different MPI jobs
  – By using different XRCD files, different jobs (or VMs) share different XRCDs and run without interfering with each other
• XRC migration
  – Main challenge: an XRC connection is a process-to-node communication channel
  – Left as future work

Evaluation: Platform
• Cluster configuration:
  – 128-core InfiniBand cluster
  – Quad-socket, quad-core Barcelona 1.9 GHz nodes
  – Mellanox DDR ConnectX HCAs, 24-port MT47396 InfiniScale-III switch
• Software:
  – Xen 3.4 with Linux 2.6.18.8
  – OpenFabrics Enterprise Distribution (OFED) 1.4.2
  – MVAPICH 1.1.0

Evaluation: Microbenchmark
• The bandwidth results are nearly the same for native and virtualized InfiniBand
• Virtualized InfiniBand performs ~0.1 us worse when using the blueframe mechanism
  – Blueframe sends copy the outgoing data to the HCA's blueframe page
  – Explanation: under virtualization these memory copy operations involve interactions between the guest domain and the IDD
• [Figures: IB verbs latency using blueframe, IB verbs latency using doorbell, and MPI latency using blueframe, for 2 B to 1 KB messages, native vs. VM]

Evaluation: VM-proof XRC Evaluation
• Configurations:
  – Native-XRC: native environment running XRC-based MVAPICH
  – VM-XRC (CV=n): VM-based environment running unmodified XRC-based MVAPICH; CV denotes the number of cores per VM
  – VM-proof XRC: VM-based environment running MVAPICH with our VM-proof XRC design
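The XRCD sharing referenced on the Implementation slide above builds on the fact that an XRC domain is bound to a file descriptor, so processes that open the same file obtain the same domain. The sketch below shows that mechanism using the current rdma-core API (ibv_open_xrcd); the evaluation itself used the older OFED 1.4.2 XRC extension, the file path is illustrative, and the mapping of the guest's pseudo XRCD file to the real XRCD file in the IDD (done by the split front-end/back-end driver in the VM-proof design) is not shown.

```c
/* Sketch: share one XRC domain among processes by opening a common XRCD file.
 * Uses the current rdma-core XRCD API; the slides' stack (OFED 1.4.2) exposed
 * the same file-descriptor-based sharing through an older verbs extension.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }

    /* Every process on the node (regardless of which VM it runs in, in the
     * VM-proof case) opens the same XRCD file, so all of them are handed the
     * same XRC domain. The path is illustrative. */
    int fd = open("/tmp/mpi_job.xrcd", O_RDONLY | O_CREAT, 0666);
    if (fd < 0) {
        perror("open xrcd file");
        return 1;
    }

    struct ibv_xrcd_init_attr attr = {
        .comp_mask = IBV_XRCD_INIT_ATTR_FD | IBV_XRCD_INIT_ATTR_OFLAGS,
        .fd        = fd,
        .oflags    = O_CREAT,
    };
    struct ibv_xrcd *xrcd = ibv_open_xrcd(ctx, &attr);
    if (!xrcd) {
        fprintf(stderr, "ibv_open_xrcd failed\n");
        return 1;
    }

    /* ... create XRC SRQs / QPs against this domain, exchange LIDs and
     *     SRQ numbers out of band, then communicate ... */

    ibv_close_xrcd(xrcd);
    close(fd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

Running this in every MPI process on a node yields a single shared XRC domain per node, which is the property the VM-proof design needs so that one XRC connection from a remote process can reach all local processes.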
Evaluation: Memory Usage
• 16 cores/node cluster, fully connected; the x-axis of the figure is the process count; ~12 KB of memory per QP
• At 64K processes, the VM-XRC (CV=1) configuration consumes 13 GB/node
• The VM-proof XRC design reduces the memory usage to only 800 MB/node, about 16x less
• [Figure: memory usage (GB/node) vs. number of processes (128 to 64K) for VM-XRC with CV=1, 2, 4, 8 and for Native/VM-proof XRC; lower is better]

Evaluation: MPI Alltoall
• 32 processes in total
• VM-proof XRC shows a 10%–25% improvement over VM-XRC (CV=1) for messages smaller than 256 B
• [Figure: MPI_Alltoall latency (us) vs. message size (1 B to 256 B) for VM-XRC (CV=1), Native-XRC, and VM-proof XRC; lower is better]

Evaluation: Application Benchmarks
• NAS Parallel Benchmarks: BT, CG, EP, FT, IS, LU, MG, SP
• VM-proof XRC performs nearly the same as Native-XRC, except for BT and EP
• Both are better than VM-XRC (CV=1)
• Across the VM-XRC configurations there is little variation for different CV values
  – CV=8 is an exception: memory allocation is not guaranteed to be NUMA-aware
• [Figures: normalized time per benchmark for Native-XRC, VM-XRC (CV=1), and VM-proof XRC, and for VM-XRC with CV=1, 2, 4, 8; lower is better]

Evaluation: Application Benchmarks (Cont'd)
• QP usage for the FT and IS benchmarks:

  FT | Configuration | Comm. Peers | Avg. QPs/Process | Max QPs/Process | Avg. QPs/Node
     | VM-XRC (CV=1) | 127         | 127              | 127             | 2032
     | VM-XRC (CV=2) | 127         | 63.4             | 65              | 1014
     | VM-XRC (CV=4) | 127         | 31.1             | 32              | 498
     | VM-XRC (CV=8) | 127         | 15.1             | 16              | 242
     | VM-proof XRC  | 127         | 8                | 8               | 128
     | Native-XRC    | 127         | 7                | 7               | 112

  IS | Configuration | Comm. Peers | Avg. QPs/Process | Max QPs/Process | Avg. QPs/Node
     | VM-XRC (CV=1) | 127         | 127              | 127             | 2032
     | VM-XRC (CV=2) | 127         | 63.7             | 65              | 1019
     | VM-XRC (CV=4) | 127         | 31.7             | 33              | 507
     | VM-XRC (CV=8) | 127         | 15.8             | 18              | 253
     | VM-proof XRC  | 127         | 8.6              | 12              | 138
     | Native-XRC    | 127         | 7.6              | 11              | 122

• VM-proof XRC uses ~15.9x fewer connections per node than VM-XRC (CV=1) on FT, and ~14.7x fewer on IS

Conclusion and Future Work
• The VM-proof XRC design converges two technologies:
  – VMM-bypass I/O virtualization
  – eXtensible Reliable Connection (XRC) in modern high-speed interconnects (InfiniBand)
• With the VM-proof XRC design, the virtualized environment achieves the same raw performance and scalability as the native, non-virtualized environment
  – A ~16x scalability improvement is seen on 16-core/node clusters
• Future work:
  – Evaluations on different platforms at larger scale
  – Add VM migration support to the VM-proof XRC design
  – Extend the work to the newly SR-IOV-enabled ConnectX-2 HCAs

Questions?
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn

Backup Slides
• OS-bypass of InfiniBand
• OpenIB Gen2 stack
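As a supplementary worked estimate for the memory-usage slide, the 13 GB/node and 800 MB/node figures follow directly from the numbers stated there (64K processes, 16 cores/node, fully connected, ~12 KB per QP):

```latex
% Worked estimate behind the memory-usage slide
\begin{align*}
N &= \frac{65536\ \text{processes}}{16\ \text{cores/node}} = 4096\ \text{nodes}, \qquad C = 16 \\
\text{VM-XRC } (C_V=1):\ (N-1)\,C^2 &= 4095 \times 256 \approx 1.05 \times 10^{6}\ \text{QPs/node} \\
  &\Rightarrow 1.05 \times 10^{6} \times 12\ \text{KB} \approx 13\ \text{GB/node} \\
\text{Native / VM-proof XRC: } (N-1)\,C &= 4095 \times 16 \approx 6.6 \times 10^{4}\ \text{QPs/node} \\
  &\Rightarrow 6.6 \times 10^{4} \times 12\ \text{KB} \approx 0.8\ \text{GB/node}
\end{align*}
```

The ratio of the two totals is the ~16x reduction quoted on the slide.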