HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware
Collective Communications on Many-core Clusters
Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra
Dec 2, 2011
@ICL Lunch Talk
Agenda
• Introduction
• Related work
• Kernel-assisted approach
• HierKNEM
• Experiments
• Conclusion
Introduction
• Multi-core clusters introduce hardware hierarchies.
• Message passing is still the dominant programming model.
• Programming libraries want to handle hierarchies internally.
• Collective communication is critical to application performance.
Problem: Tuned Collective
• It cannot see the boundaries introduced by the hierarchies of multi-core clusters.
• It builds its logical topology without runtime hardware topology information.
Topology-Unaware: Mismatch problem*
[Figure: processes P0-P3 placed on Core0-Core3 of Node 0 and Node 1 under --bycore and --bynode binding. Labels: # of nodes, # of cores.]
Figure: Open MPI Tuned Allgather Ring algorithm under different process-core binding cases.
* T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware
Adaptive MPI Collective Communications, Cluster 2011
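The mismatch can be illustrated with a small counting sketch. It assumes a ring in which rank r sends to rank r+1 in rank order, and it models the two bindings as "fill one node first" (--bycore) and "round-robin over nodes" (--bynode). The function names are hypothetical and are not part of Open MPI.

```python
def node_of(rank, n_nodes, cores_per_node, binding):
    """Return the node hosting `rank` under a simplified binding model."""
    if binding == "bycore":
        # Fill every core of a node before moving to the next node.
        return rank // cores_per_node
    if binding == "bynode":
        # Place consecutive ranks on consecutive nodes, round-robin.
        return rank % n_nodes
    raise ValueError(f"unknown binding: {binding}")


def inter_node_hops(n_nodes, cores_per_node, binding):
    """Count edges of a rank-ordered ring that cross a node boundary."""
    n_procs = n_nodes * cores_per_node
    hops = 0
    for rank in range(n_procs):
        nxt = (rank + 1) % n_procs
        here = node_of(rank, n_nodes, cores_per_node, binding)
        there = node_of(nxt, n_nodes, cores_per_node, binding)
        if here != there:
            hops += 1
    return hops


if __name__ == "__main__":
    for binding in ("bycore", "bynode"):
        print(binding, inter_node_hops(2, 2, binding))
```

With 2 nodes and 2 cores per node, the rank-ordered ring crosses the network on 2 of its 4 edges under the by-core model and on all 4 edges under the by-node model, although the hardware is identical in both cases.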
Related work
• Cheetah
R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011
• Distance-aware framework
T. Ma et al., Process Distance-Aware Adaptive MPI Collective Communications, CLUSTER 2011
[Figure labels: IB links, NUMA links, Intra-socket links; SBGP and BCOL (three pairs)]
Status of Kernel-assisted One-sided Single-copy Inter-Process Communication
• KNEM (0.9.7) and LIMIC (0.5.5)
• XPMEM (Cross-Process Memory Mapping)
• CMA (Cross Memory Attach)
Development of kernel-assisted approach in MPI stacks
• Intra-node p2p comm.
MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC)
• Intra-node collective comm.
KNEM Coll
T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. ICPP 2011
• Inter- and intra-node collective comm.
HierKNEM Coll
T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012
Framework of HierKNEM
Inter-node Comm.
Subgroup: Intra-node Comm.
Broadcast
[Figure labels: Leader processes, Non-Leader processes; Inter-node forward, KNEM read, KNEM Copy, Send, Recv]
Figure: Bcast with 64 processes on Dancer’s 8 nodes (8 cores/node), 256 KB message size.
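The two-level structure named on this slide (leaders forward between nodes, non-leaders read from their leader) can be sketched as a planning function. This is only an illustration under stated assumptions: the leader choice (the root on its own node, the lowest rank elsewhere) and the chain order between leaders are assumptions, the function name is hypothetical, and the message segmentation that lets inter-node and intra-node steps overlap is not modeled.

```python
from collections import defaultdict


def hierarchical_bcast_plan(rank_to_node, root):
    """Plan a two-level broadcast.

    Returns (leaders, inter_node_sends, intra_node_reads), where the two
    lists hold (source_rank, destination_rank) pairs.
    """
    by_node = defaultdict(list)
    for rank, node in enumerate(rank_to_node):
        by_node[node].append(rank)

    # Assumption: the root leads its own node, the lowest rank leads the others.
    leaders = {}
    for node, ranks in by_node.items():
        leaders[node] = root if root in ranks else min(ranks)

    # Assumption: leaders forward the message along a chain that starts
    # at the root's node and visits the other nodes in sorted order.
    root_node = rank_to_node[root]
    order = [root_node] + sorted(n for n in by_node if n != root_node)
    inter_node_sends = [(leaders[a], leaders[b]) for a, b in zip(order, order[1:])]

    # Every non-leader fetches the data from its node leader's buffer,
    # standing in for the kernel-assisted read.
    intra_node_reads = [
        (leaders[node], rank)
        for node, ranks in by_node.items()
        for rank in ranks
        if rank != leaders[node]
    ]
    return leaders, inter_node_sends, intra_node_reads


if __name__ == "__main__":
    mapping = [r // 8 for r in range(64)]  # 8 nodes, 8 cores per node
    leaders, sends, reads = hierarchical_bcast_plan(mapping, root=0)
    print(len(sends), "inter-node sends,", len(reads), "intra-node reads")
```

For the 64-process case in the caption above, this plan yields 7 inter-node sends and 56 intra-node reads, and the counts do not change when the rank-to-node mapping is permuted.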
Reduce
[Figure labels: Inter-node Comm., Intra-node Comm., Inter-node forward, KNEM read/write, New_Comm.]
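A two-level reduction can be sketched in the same spirit: contributions are first combined inside each node, then the per-node partial results are combined across nodes. The slide does not spell out the actual algorithm, so the combination order, the absence of an explicit root, and the function name below are all assumptions. The sketch also requires the operator to be associative and commutative, because the grouping depends on the process placement.

```python
from collections import defaultdict


def hierarchical_reduce(values, rank_to_node, op=lambda a, b: a + b):
    """Reduce `values` (one per rank) in two levels: per node, then across nodes."""
    by_node = defaultdict(list)
    for rank, node in enumerate(rank_to_node):
        by_node[node].append(rank)

    # Intra-node step: combine all local contributions on each node.
    partial = {}
    for node, ranks in by_node.items():
        acc = values[ranks[0]]
        for rank in ranks[1:]:
            acc = op(acc, values[rank])
        partial[node] = acc

    # Inter-node step: combine the per-node partial results.
    nodes = sorted(partial)
    result = partial[nodes[0]]
    for node in nodes[1:]:
        result = op(result, partial[node])
    return result


if __name__ == "__main__":
    values = list(range(8))
    print(hierarchical_reduce(values, [r // 4 for r in range(8)]))
    print(hierarchical_reduce(values, [r % 2 for r in range(8)]))
```

Both placements give the same result as a flat reduction; only the grouping of the intermediate steps differs.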
Allgather: Topology-aware Ring
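One way to picture a topology-aware ring is to order the ranks so that processes on the same node sit next to each other, whatever the process-core binding is. The sketch below is an assumption about the general idea only, not the HierKNEM implementation, and the function names are hypothetical.

```python
def topology_aware_ring(rank_to_node):
    """Order ranks so that processes on the same node are ring neighbors."""
    return sorted(range(len(rank_to_node)), key=lambda r: (rank_to_node[r], r))


def inter_node_edges(ring, rank_to_node):
    """Count ring edges whose two endpoints live on different nodes."""
    n = len(ring)
    return sum(
        rank_to_node[ring[i]] != rank_to_node[ring[(i + 1) % n]]
        for i in range(n)
    )


if __name__ == "__main__":
    bynode = [0, 1, 0, 1]           # ranks placed round-robin on 2 nodes
    naive = list(range(len(bynode)))
    aware = topology_aware_ring(bynode)
    print("rank-ordered ring:", inter_node_edges(naive, bynode))
    print("topology-aware ring:", inter_node_edges(aware, bynode))
```

With this ordering the ring crosses the network once per node, so the by-node placement costs no more inter-node edges than the by-core placement.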
Hardware Environment
• Stremi Cluster
• 32 nodes
• Node: AMD’s 24-core
• Gigabit Ethernet
• Parapluie Cluster
• 32 nodes
• Node: AMD’s 24-core
• 20G InfiniBand
Software Environment
• Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7
• KNEM version 0.9.6, LIMIC 0.5.5
• IMB-3.2 (cache on)
• Unless otherwise noted, the same mapping between cores and processes is always used (--bycore).
Broadcast Performance
More than 30 times!!
More than twice
Figure: Aggregate Broadcast bandwidth of collective modules on
multicore clusters (768 processes, 24 cores/node, 32 nodes).
Reduce Performance
Figure: Aggregate Reduce bandwidth of collective modules on
multicore clusters (768 processes, 24 cores/node, 32 nodes).
Allgather Performance
Figure: Aggregate Allgather bandwidth of collective modules
on multicore clusters (768 processes, 24 cores/node).
Topology-aware Operations
Figure: Impact of process mapping: aggregate Broadcast and
Allgather bandwidth of the collective modules for two different
process-core bindings: by core and by node (Parapluie cluster,
IB20G, 768 processes, 24 cores/node).
Core per Node Scalability
Figure: Core per node scalability: aggregate bandwidth of
Broadcast for 2MB messages on multicore clusters (32 nodes).
Conclusion
• HierKNEM achieves large speedups by overlapping inter- and intra-node communication.
• HierKNEM is immune to changes in the underlying process-core binding (topology-aware).
• HierKNEM provides linear speedup as the number of cores per node increases.