HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters

Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra
Dec. 2, 2011 @ ICL Lunch Talk

Agenda
• Introduction
• Related work
• Kernel-assisted approach
• HierKNEM
• Experiments
• Conclusion

Introduction
• Multi-core clusters introduce hardware hierarchies.
• Message passing is still the dominant programming model.
• Programming libraries want to handle hierarchies internally.
• Collective communication is critical to application performance.

Problem: Tuned Collective
• It cannot see the boundaries introduced by the hierarchies of multi-core clusters.
• It builds a logical topology without runtime hardware topology information.

Topology-Unaware: Mismatch problem*
[Figure: Open MPI Tuned Allgather Ring algorithm under different process-core binding cases. Panels: --bycore and --bynode. Labels: Node 0, Node 1, Core0-Core3, P0-P3, # of nodes, # of cores.]
* T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, Cluster 2011

Related work
• Cheetah
  R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011
• Distance-aware framework
  T. Ma et al., Process Distance-Aware Adaptive MPI Collective Communications, CLUSTER 2011
[Figure labels: IB links, NUMA links, Intra-socket links; SBGP and BCOL (three pairs).]

Status of Kernel-assisted One-sided Single-copy Inter-Process Communication
• KNEM (0.9.7) and LIMIC (0.5.5)
• XPMEM (Cross-Process Memory Mapping)
• CMA (Cross Memory Attach)

Development of kernel-assisted approach in MPI stacks
• Intra-node p2p comm.
  MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC)
• Intra-node collective comm.
  KNEM Coll
  T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. ICPP 2011
• Inter- and intra-node collective comm.
  HierKNEM Coll
  T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012

Framework of HierKNEM
[Figure labels: Inter-node Comm.; Subgroup: Intra-node Comm.]

Broadcast
[Figure labels: Inter-node forward; KNEM read; Leader processes; Non-Leader processes; Send; Recv; KNEM Copy.]
Bcast with 64 processes on Dancer's 8 nodes (8 cores/node), 256 KB message size.

Reduce
[Figure labels: Inter-node Comm.; Intra-node Comm.; Inter-node forward; KNEM read/write; New_Comm.]

Allgather: Topology-aware Ring

Hardware Environment
• Stremi Cluster
• 32 nodes
• Node: AMD 24-core
• Gigabit Ethernet
• Parapluie Cluster
• 32 nodes
• Node: AMD 24-core
• 20G InfiniBand

Software Environment
• Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7
• KNEM version 0.9.6, LIMIC 0.5.5
• IMB-3.2 (cache on)
• Unless stated otherwise, the same mapping between cores and processes is always used (--bycore).

Broadcast Performance
[Figure annotations: "More than 30 times!!"; "More than twice"]
Figure: Aggregate Broadcast bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).

Reduce Performance
Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).

Allgather Performance
Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).
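
Speaker note: the mismatch problem and the topology-aware ring from the earlier slides can be illustrated with a small counting exercise. The sketch below is not HierKNEM code; it only counts how many neighbor-to-neighbor edges of an Allgather ring cross a node boundary. It assumes that --bycore fills the cores of one node before moving to the next and that --bynode deals ranks round-robin across nodes. The function names (`node_of`, `rank_order_ring`, `topology_aware_ring`, `inter_node_edges`) are hypothetical, and sorting ranks by node is just one plausible way to build a topology-aware ring.

```python
def node_of(rank, nodes, cores_per_node, binding):
    """Return the node hosting `rank` under an assumed binding policy."""
    if binding == "bycore":   # fill one node's cores before moving on
        return rank // cores_per_node
    if binding == "bynode":   # deal ranks round-robin across nodes
        return rank % nodes
    raise ValueError("unknown binding: " + binding)


def rank_order_ring(nodes, cores_per_node):
    """Topology-unaware ring: neighbors are simply consecutive ranks."""
    return list(range(nodes * cores_per_node))


def topology_aware_ring(nodes, cores_per_node, binding):
    """Hypothetical topology-aware ring: group ranks by hosting node."""
    ranks = range(nodes * cores_per_node)
    return sorted(ranks, key=lambda r: (node_of(r, nodes, cores_per_node, binding), r))


def inter_node_edges(ring, nodes, cores_per_node, binding):
    """Count ring edges whose two endpoints live on different nodes."""
    crossings = 0
    for i, src in enumerate(ring):
        dst = ring[(i + 1) % len(ring)]  # the ring wraps around
        if node_of(src, nodes, cores_per_node, binding) != node_of(
            dst, nodes, cores_per_node, binding
        ):
            crossings += 1
    return crossings


if __name__ == "__main__":
    nodes, cores = 32, 24  # 768 processes, as in the experiments
    for binding in ("bycore", "bynode"):
        unaware = inter_node_edges(rank_order_ring(nodes, cores), nodes, cores, binding)
        aware = inter_node_edges(
            topology_aware_ring(nodes, cores, binding), nodes, cores, binding
        )
        print(binding, "rank-order ring:", unaware, "topology-aware ring:", aware)
```

Under these assumptions the rank-order ring crosses the network on every edge with --bynode, while the node-grouped ring crosses it once per node under either binding.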

Topology-aware Operations
Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).

Core per Node Scalability
Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2 MB messages on multicore clusters (32 nodes).

Conclusion
• HierKNEM achieves large speedups by overlapping inter- and intra-node communication.
• HierKNEM is immune to changes in the underlying process-core binding (topology-aware).
• HierKNEM provides a linear speedup as the number of cores per node increases.
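
Speaker note: the benefit of overlapping inter- and intra-node communication can be illustrated with a toy two-stage pipeline model. This is a sketch under stated assumptions, not a description of HierKNEM's implementation: it assumes the message is cut into equal segments, that a leader receives segment k over the network while the previous segment is being copied inside the node, and that each stage has a fixed per-segment cost. The function names and the costs used are hypothetical.

```python
def sequential_time(segments, t_inter, t_intra):
    """No overlap: each segment is forwarded, then copied, one after another."""
    return segments * (t_inter + t_intra)


def overlapped_time(segments, t_inter, t_intra):
    """Two-stage pipeline: the intra-node copy of a segment starts as soon as
    that segment has arrived and the previous copy has finished."""
    inter_done = 0  # time at which the current segment has arrived at the leader
    intra_done = 0  # time at which the current segment has been copied in-node
    for _ in range(segments):
        inter_done += t_inter
        intra_done = max(inter_done, intra_done) + t_intra
    return intra_done


if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units per segment.
    segments, t_inter, t_intra = 8, 2, 1
    print("sequential:", sequential_time(segments, t_inter, t_intra))
    print("overlapped:", overlapped_time(segments, t_inter, t_intra))
```

In this model the overlapped time approaches the cost of the slower stage alone as the number of segments grows, which is the intuition behind hiding intra-node copies behind inter-node transfers.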