Optimizing Threaded MPI Execution on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science, University of California, Santa Barbara
June 20th, 2001

Parallel Computation on SMP Clusters
Massively parallel machines -> SMP clusters.
Commodity components: off-the-shelf processors + fast network (Myrinet, Fast/Gigabit Ethernet).
Parallel programming models for SMP clusters:
- MPI: portability, performance, legacy programs.
- MPI + variations: MPI + multithreading, MPI + OpenMP.

Threaded MPI Execution
MPI paradigm: separate address spaces for different MPI nodes.
Natural solution: MPI nodes -> processes.
What if we map MPI nodes to threads? Synchronization among MPI nodes running on the same machine becomes faster. This was demonstrated in previous work [PPoPP '99] for a single shared-memory machine, which developed techniques to safely execute MPI programs using threads.
Threaded MPI execution on SMP clusters:
- Intra-machine communication through shared memory.
- Inter-machine communication through the network.

Threaded MPI Execution Benefits Inter-Machine Communication
Common intuition: inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
Our findings: using threads can significantly reduce the buffering and orchestration overhead of inter-machine communication.

Related Work
MPI on network clusters:
- MPICH - a portable MPI implementation.
- LAM/MPI - communication through a standalone RPI server.
Collective communication optimization:
- SUN-MPI and MPI-StarT - modify the MPICH ADI layer; target SMP clusters.
- MagPIe - targets SMP clusters connected through a WAN.
Lower communication layer optimization:
- MPI-FM and MPI-AM.
Threaded execution of message passing programs:
- MPI-Lite, LPVM, TPVM.

Background: MPICH Design
[Layered design: MPI Collective on top of MPI Point-to-Point, on top of the Abstract Device Interface (ADI), on top of devices such as the Chameleon interface, T3D, SGI, and others, with the P4 device split into TCP and shmem.]

MPICH Communication Structure
[Diagram: cluster nodes (WS) hosting MPI nodes (processes) and MPICH daemon processes, connected by inter-process pipes, shared memory, and TCP connections; shown both for MPICH with shared memory and MPICH without shared memory.]

TMPI Communication Structure
[Diagram: cluster nodes (WS) hosting MPI nodes (threads) and a TMPI daemon thread, connected by TCP across machines and by direct memory access and thread synchronization within a machine.]

Comparison of TMPI and MPICH
Drawbacks of MPICH with shared memory:
- Intra-node communication is limited by the shared memory size.
- Busy polling to check messages from either the daemon or a local peer.
- Cannot do automatic resource clean-up.
Drawbacks of MPICH without shared memory:
- Big overhead for intra-node communication.
- Too many daemon processes and open connections.
Drawbacks of both MPICH systems:
- Extra data copying for inter-machine communication.

TMPI Communication Design
Layered design: MPI communication, inter- and intra-machine communication, an abstract network and thread synchronization interface, and OS facilities.
[Diagram: the MPI layer sits on an INTER path (NETD over TCP or other transports) and an INTRA path (THREAD over pthreads or other thread implementations).]
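The INTRA/THREAD path above is where the thread mapping pays off: two MPI nodes on the same machine can hand a message over with a single copy and a condition-variable wakeup instead of a daemon hop or busy polling. The sketch below is not TMPI's internal interface; the chan_send/chan_recv helpers are hypothetical and only illustrate the direct-memory-access-plus-thread-synchronization idea.

/* Minimal sketch of intra-machine message passing when MPI nodes are
 * threads.  NOT TMPI's actual API: the sender enqueues a descriptor
 * pointing at its buffer, and the receiver copies directly from that
 * buffer, blocking on a condition variable instead of polling. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct msg {
    const void *buf;          /* sender's buffer, read directly by receiver */
    size_t      len;
    int         tag;
    struct msg *next;
} msg_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  avail;    /* signaled when a message is enqueued */
    msg_t          *head, *tail;
} chan_t;

void chan_init(chan_t *c) {
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->avail, NULL);
    c->head = c->tail = NULL;
}

/* Sender side: publish a descriptor for the outgoing data. */
void chan_send(chan_t *c, const void *buf, size_t len, int tag) {
    msg_t *m = malloc(sizeof *m);
    m->buf = buf; m->len = len; m->tag = tag; m->next = NULL;
    pthread_mutex_lock(&c->lock);
    if (c->tail) c->tail->next = m; else c->head = m;
    c->tail = m;
    pthread_cond_signal(&c->avail);      /* wake a blocked receiver */
    pthread_mutex_unlock(&c->lock);
}

/* Receiver side: block (no polling) until a message arrives, then copy
 * once, straight from the sender's buffer into the user buffer. */
size_t chan_recv(chan_t *c, void *out, size_t cap, int *tag) {
    pthread_mutex_lock(&c->lock);
    while (c->head == NULL)
        pthread_cond_wait(&c->avail, &c->lock);
    msg_t *m = c->head;
    c->head = m->next;
    if (c->head == NULL) c->tail = NULL;
    pthread_mutex_unlock(&c->lock);

    size_t n = m->len < cap ? m->len : cap;
    memcpy(out, m->buf, n);              /* single copy between MPI nodes */
    if (tag) *tag = m->tag;
    free(m);
    return n;
}

A real runtime must additionally keep the sender's buffer alive, or buffer the data, until the receiver has copied it; that is exactly the question the adaptive buffer management scheme below addresses.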
Separation of Point-to-Point and Collective Communication Channels
Observation: MPI point-to-point and collective communication semantics are different.
- Point-to-point: unknown source (MPI_ANY_SOURCE), out-of-order delivery (message tags), asynchronous (non-blocking receive).
- Collective: determined source (the ancestor in the spanning tree), in-order delivery, synchronous.
TMPI therefore uses separate channels for point-to-point and collective communication, which eliminates daemon intervention for collective communication.
This is less effective for MPICH - there is no sharing of ports among processes.
[Diagram: cluster nodes with MPI nodes (threads) and TMPI daemon threads, linked by TCP connections across machines and by direct memory access and thread synchronization within a machine.]

Hierarchy-Aware Collective Communication
Observation: there is a two-level communication hierarchy.
- Inside an SMP node: shared memory (on the order of 10^-8 sec).
- Between SMP nodes: network (on the order of 10^-6 sec).
Idea: build the communication spanning tree in two steps.
- First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
- Second, all other MPI nodes connect to their local root node.
[Figure: spanning trees for an MPI program with 9 MPI nodes on three cluster nodes (MPI nodes 0-2, 3-5, and 6-8), comparing the trees used by MPICH (hypercube, balanced binary tree) with TMPI's two-step tree; thick edges are network edges.]

Adaptive Buffer Management
Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?
Choices:
- Send the data with the request - eager push.
- Send only the request, and send the data when the receiver is ready - three-phase protocol.
- TMPI adapts between both methods.
[Diagram: the one-step eager-push protocol is used when the remote node can buffer the message; when it cannot, the exchange degrades gracefully into the three-phase protocol (request, "got request"/"receiver ready", then data).]

Experimental Study
Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
Hardware setting: a cluster of 6 quad-Xeon 500 MHz SMPs, each with 1 GB of main memory and 2 Fast Ethernet cards.
Software setting:
- OS: RedHat Linux 6.0, kernel version 2.2.15 with channel bonding enabled.
- Process-based MPI system: MPICH 1.2.
- Thread-based MPI system: TMPI (45 functions of the MPI 1.1 standard).

Inter-Cluster-Node Point-to-Point
Ping-pong, TMPI vs. MPICH with shared memory.
[Figure: (a) short-message round-trip time (us) vs. message size (bytes); (b) long-message transfer rate (MB/s) vs. message size (KB); TMPI vs. MPICH.]

Intra-Cluster-Node Point-to-Point
Ping-pong, TMPI vs. MPICH1 (MPICH with shared memory) and MPICH2 (MPICH without shared memory).
[Figure: (a) short-message round-trip time (us) vs. message size (bytes); (b) long-message transfer rate (MB/s) vs. message size (KB); TMPI vs. MPICH1 and MPICH2.]
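For context, the curves above come from the usual ping-pong pattern: one rank sends a message, the peer echoes it back, and the averaged round-trip time yields latency and transfer rate. A minimal sketch of such a benchmark follows; it is not the authors' benchmark code, and the message size and iteration count are arbitrary.

/* Ping-pong sketch: rank 0 sends `len` bytes to rank 1, which echoes
 * them back.  Half the averaged round-trip time gives latency; the
 * bytes moved per round trip give the transfer rate.  Run with at
 * least two MPI nodes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int len = 1024, iters = 1000;    /* message size, repetitions */
    char *buf = malloc(len);
    MPI_Status st;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;   /* average round trip (s) */

    if (rank == 0)
        printf("avg round trip: %.1f us, rate: %.2f MB/s\n",
               rtt * 1e6, (2.0 * len) / rtt / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}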
Collective Communication
MPI_Bcast, MPI_Reduce, MPI_Allreduce; three node distributions, three root node settings. Times are in microseconds, reported as TMPI / MPICH_SHM / MPICH_NOSHM.

Distribution  Root    Reduce          Bcast           Allreduce
4x1           same    9/121/4384      10/137/7913     160/175/627
4x1           rotate  33/81/3699      129/91/4238
4x1           combo   25/102/3436     17/32/966
1x4           same    28/1999/1844    21/1610/1551    571/675/775
1x4           rotate  146/1944/1878   164/1774/1834
1x4           combo   167/1977/1854   43/409/392
4x4           same    39/2532/4809    56/2792/10246   736/1412/19914
4x4           rotate  161/1718/8566   216/2204/8036
4x4           combo   141/2242/8515   62/489/2054

Observations:
1) MPICH without shared memory performs the worst.
2) TMPI is 70+ times faster than MPICH with shared memory for MPI_Bcast and MPI_Reduce.
3) For TMPI, the performance of the 4x4 cases is roughly the summation of that of the 4x1 cases and that of the 1x4 cases.

Macro-Benchmark Performance
[Figure: MFLOP rate vs. number of MPI nodes for (a) matrix multiplication and (b) Gaussian elimination, TMPI vs. MPICH.]

Conclusions
Threaded MPI execution shows a great advantage on SMP clusters:
- Micro-benchmarks: 70+ times faster than MPICH.
- Macro-benchmarks: 100% faster than MPICH.
Optimization techniques:
- Separated collective and point-to-point communication channels.
- Adaptive buffer management.
- Hierarchy-aware communications.
http://www.cs.ucsb.edu/projects/tmpi/

Background: Safe Execution of MPI Programs Using Threads
Program transformation: eliminate global and static variables (called permanent variables).
Thread-specific data (TSD): each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copy of the data variable.
TSD-based transformation:
- Each permanent variable declaration is replaced with a key declaration.
- Each node associates its private copy of the permanent variable with the corresponding key.
- Wherever a permanent variable is referenced, the global key is used to retrieve the per-thread copy of the variable.

Program Transformation - An Example
Source program:

int X = 1;
int f() { return X++; }

Program after transformation:

int kX = 0;
void main_init() { if (kX == 0) kX = key_create(); }
void user_init() { int *pX = malloc(sizeof(int)); *pX = 1; setval(kX, pX); }
int f() { int *pX = getval(kX); return (*pX)++; }
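The key_create/setval/getval calls above presumably wrap the platform's thread-specific-data facility. Assuming POSIX threads, the transformed program can be written directly with pthread_key_create, pthread_setspecific, and pthread_getspecific; the sketch below shows that mapping (the actual TMPI runtime wrappers may differ).

/* Sketch of the transformed program using POSIX thread-specific data
 * directly, under the assumption that key_create/setval/getval wrap
 * these calls. */
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  kX;                      /* replaces `int X = 1;` */
static pthread_once_t kX_once = PTHREAD_ONCE_INIT;

static void kX_create(void) { pthread_key_create(&kX, free); }

/* Run once per process, before any MPI-node thread touches X. */
void main_init(void) { pthread_once(&kX_once, kX_create); }

/* Run once per MPI-node thread: allocate and initialize its own X. */
void user_init(void) {
    int *pX = malloc(sizeof(int));
    *pX = 1;
    pthread_setspecific(kX, pX);
}

/* Each reference to X becomes a lookup of the per-thread copy. */
int f(void) {
    int *pX = pthread_getspecific(kX);
    return (*pX)++;
}

The destructor passed to pthread_key_create (free here) also releases each thread's copy when that MPI-node thread exits, giving per-node clean-up without extra bookkeeping.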