Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines

Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/research/tmpi


MPI-Based Parallel Computation on Shared Memory Machines
• Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
• MPI is a portable, high-performance parallel programming model.
• MPI on SMMs: threads are easy to program, but MPI is still used on SMMs because of
  - better portability for running on other platforms (e.g., SMM clusters);
  - good data locality due to data partitioning.


Scheduling for Parallel Jobs in Multiprogrammed SMMs
• Gang scheduling
  - Good for parallel programs that synchronize frequently;
  - Hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated processors).
• Space/time sharing
  - Time sharing combined with dynamic partitioning;
  - High throughput;
  - Popular in current operating systems (e.g., IRIX 6.5).
• Impact on MPI program execution
  - Not all MPI nodes are scheduled simultaneously;
  - The number of processors available to each application may change dynamically.
• Optimization is needed for fast MPI execution on SMMs.


Techniques Studied
• Thread-based MPI execution [PPoPP'99]
  - Compile-time transformation for thread-safe MPI execution;
  - Fast context switch and synchronization;
  - Fast communication through address sharing.
• Two-level thread management for multiprogrammed environments
  - Even faster context switch/synchronization;
  - Use of scheduling information to guide synchronization.
• Our prototype system: TMPI.


Impact of Synchronization on Coarse-grain Parallel Programs
• Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3, synchronization costs 43%-84% of the total time.
• Execution time breakdown for TMPI and SGI MPI on SWEEP3D:

                            TMPI                 SGI MPI
  Cost                  Time      Pct        Time        Pct
  Kernel computation    47.8 sec  54.3%      48.3 sec    5.65%
  Synchronization       38.1 sec  43.3%     722.8 sec   84.45%
  Queue management       1.0 sec   1.1%      83.4 sec    9.74%
  Memory copy            1.1 sec   1.3%       1.4 sec    0.16%
  Other cost             0 sec     0%         0 sec      0%
  Total                 88.0 sec  100%      855.9 sec   100%


Related Work
• MPI-related work
  - MPICH, a portable MPI implementation [Gropp/Lusk et al.];
  - SGI MPI, highly optimized on SGI platforms;
  - MPI-2, multithreading within a single MPI node.
• Scheduling and synchronization
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
  - Scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks;
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.


Outline
• Motivation and Related Work
• Adaptive Two-level Thread Management
• Scheduler-conscious Event Waiting
• Experimental Studies


Context Switch/Synchronization in Multiprogrammed Environments
• In multiprogrammed environments, synchronization leads to more context switches, which has a large performance impact.
• Conventional MPI implementations map each MPI node to an OS process.
• Our earlier work maps each MPI node to a kernel thread.
• Two-level thread management maps each MPI node to a user-level thread:
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.
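To make this mapping concrete, here is a minimal sketch, not TMPI's actual code, of running several "MPI nodes" as user-level contexts multiplexed on a small pool of kernel threads, using POSIX ucontext and pthreads. All names (node_body, NUM_NODES, NUM_KERNEL_THREADS, and so on) are illustrative assumptions, and the adaptive part, polling the resource manager to suspend or activate kernel threads, is omitted here; it is described on the following slides.

    /* Sketch (not TMPI's code): multiplex "MPI nodes" (user-level contexts)
     * on a small pool of kernel threads, using POSIX ucontext + pthreads.
     * Compile with: cc -pthread sketch.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <ucontext.h>

    #define NUM_NODES          8        /* MPI nodes = user-level threads   */
    #define NUM_KERNEL_THREADS 2        /* ~ number of allocated processors */
    #define STACK_SIZE         (64 * 1024)

    static ucontext_t node_ctx[NUM_NODES];           /* one context per node */
    static ucontext_t sched_ctx[NUM_KERNEL_THREADS]; /* per kernel thread    */
    static char       stacks[NUM_NODES][STACK_SIZE];
    static int        next_node = 0;                 /* naive run-queue cursor */
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static __thread int my_vp;                       /* id of this kernel thread */

    static void node_body(int rank) {
        /* Stands in for one (compile-time transformed) MPI node. */
        printf("MPI node %d runs on kernel thread %d\n", rank, my_vp);
    }

    static void node_trampoline(int rank) {
        node_body(rank);
        /* Node finished: user-level switch back to this kernel thread's loop. */
        swapcontext(&node_ctx[rank], &sched_ctx[my_vp]);
    }

    static void *kernel_thread(void *arg) {          /* one entry of the pool */
        my_vp = (int)(long)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            int n = (next_node < NUM_NODES) ? next_node++ : -1;
            pthread_mutex_unlock(&q_lock);
            if (n < 0) return NULL;                  /* no runnable node left */
            /* User-level context switch into the chosen MPI node. */
            swapcontext(&sched_ctx[my_vp], &node_ctx[n]);
        }
    }

    int main(void) {
        pthread_t pool[NUM_KERNEL_THREADS];
        for (int i = 0; i < NUM_NODES; i++) {        /* build node contexts */
            getcontext(&node_ctx[i]);
            node_ctx[i].uc_stack.ss_sp   = stacks[i];
            node_ctx[i].uc_stack.ss_size = STACK_SIZE;
            makecontext(&node_ctx[i], (void (*)(void))node_trampoline, 1, i);
        }
        for (long i = 0; i < NUM_KERNEL_THREADS; i++)
            pthread_create(&pool[i], NULL, kernel_thread, (void *)i);
        for (int i = 0; i < NUM_KERNEL_THREADS; i++)
            pthread_join(pool[i], NULL);
        return 0;
    }

In this toy version each node simply runs to completion; in TMPI a node instead switches back to its kernel thread's scheduling loop whenever it must wait for a message, and that user-level context switch is where the resource-manager polling described on the following slides takes place.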
System Architecture
[Figure: several MPI applications run side by side; each has its own TMPI runtime managing its user-level threads, and all of them sit on top of a system-wide resource management layer.]
• Targeted at multiprogrammed environments.
• Two-level thread management.


Adaptive Two-level Thread Management
• A system-wide resource manager (the OS kernel or a user-level central monitor)
  - collects information about active MPI applications;
  - partitions processors among them.
• Application-wide user-level thread management
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
• Big picture (across the whole system): #active kernel threads ≈ #processors, which minimizes kernel-level context switches.


User-level Thread Scheduling
• Every kernel thread can be:
  - active: executing an MPI node (user-level thread);
  - suspended.
• Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
  - #kernel threads = #MPI nodes (avoid dynamic thread creation).
• Every active kernel thread polls the system resource manager, which leads to one of:
  - deactivation: suspending itself;
  - activation: waking up some suspended kernel threads;
  - no action.
• When to poll?


Polling in User-Level Context Switch
• A context switch is the result of synchronization (e.g., an MPI node waits for a message).
• The underlying kernel thread polls the system resource manager during the context switch:
  - two stack switches if it deactivates (it suspends on a dummy stack);
  - one stack switch otherwise.
• After optimization, about 2 µs on average on the SGI Power Challenge.


Outline
• Motivation and Related Work
• Adaptive Two-level Thread Management
• Scheduler-conscious Event Waiting
• Experimental Studies


Event Waiting Synchronization
• All MPI synchronization is based on waitEvent.
[Figure: the waiter executes waitEvent(*pflag == value) and waits; the caller executes *pflag = value and wakes the waiter up.]
• Waiting could be:
  - spinning;
  - yielding/blocking.


Tradeoff between Spin and Block
• Basic rules for waiting using spin-then-block:
  - Spinning wastes CPU cycles;
  - Blocking introduces context-switch overhead, and always blocking is not good for dedicated environments.
• Previous work focuses on choosing the best spin time.
• Our optimization focus and findings:
  - Fast context switch has a substantial performance impact;
  - Scheduling information can guide the spin/block decision:
    · spinning is futile when the caller is not currently scheduled;
    · most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).


Scheduler-conscious Event Waiting
[Flowchart: waitEvent(condition)]
• Check the condition; if it already holds, return.
• Otherwise, ask the user-level scheduler whether the caller is currently scheduled:
  - if not, block (yield the processor);
  - if so, check cache affinity: if the cache is still warm, spin for time T, rechecking the condition periodically; if affinity has already been lost, block.
• The user-level scheduler provides the scheduling information and the affinity information.
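The waiting policy in this flowchart can be sketched in code. The helpers caller_is_scheduled, cache_affinity_lost, and block_on_event below are hypothetical stand-ins for the scheduling/affinity information supplied by the user-level scheduler and for the user-level block; the trivial stubs and the small demo only make the sketch self-contained. This illustrates the policy, not TMPI's implementation, and memory-ordering details are glossed over.

    /* Sketch of the scheduler-conscious waitEvent policy (illustration only,
     * not TMPI's code; memory ordering is glossed over).
     * Compile with: cc -pthread waitevent.c */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define SPIN_BUDGET 2000   /* iteration count standing in for "time T" */

    /* Placeholder stubs for what the user-level scheduler would provide. */
    static int  caller_is_scheduled(int event_id) { (void)event_id; return 1; }
    static int  cache_affinity_lost(void)         { return 0; }
    static void block_on_event(int event_id)      { (void)event_id; sched_yield(); }

    /* Wait until *pflag == value, guided by scheduling and affinity info. */
    static void waitEvent(volatile int *pflag, int value, int event_id) {
        while (*pflag != value) {
            if (!caller_is_scheduled(event_id))
                block_on_event(event_id);   /* caller has no CPU: spinning is futile */
            else if (cache_affinity_lost())
                block_on_event(event_id);   /* cache already cold: blocking is cheap */
            else                            /* warm cache: spin for "time T", recheck */
                for (int i = 0; i < SPIN_BUDGET && *pflag != value; i++)
                    ;                       /* busy-wait */
        }
    }

    /* Tiny demo: a second thread plays the "caller", main() is the waiter. */
    static volatile int flag = 0;

    static void *caller_thread(void *arg) { (void)arg; flag = 1; return NULL; }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, caller_thread, NULL);
        waitEvent(&flag, 1, /* event_id = */ 0);
        pthread_join(t, NULL);
        printf("event observed\n");
        return 0;
    }

In a real run the first two helpers would consult the user-level scheduler's state, and block_on_event would perform the user-level context switch (and resource-manager polling) described earlier.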
Experimental Settings
• Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB of memory;
  - SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory.
• Systems compared:
  - TMPI-2: TMPI with two-level thread management;
  - SGI MPI: SGI's native MPI implementation;
  - TMPI: the original TMPI without two-level thread management.


Testing Benchmarks

  Benchmark  Description              Sync frequency    MPI operations
  MM         Matrix multiplication       2 times/sec    Mostly MPI_Bsend
  GE         Gaussian elimination       59 times/sec    Mostly MPI_Bcast
  SWEEP3D    3D neutron transport       32 times/sec    Mostly send/recv
  GOODWIN    Sparse LU factorization  2392 times/sec    Mixed
  TID        Sparse LU factorization  2067 times/sec    Mixed
  E40R0100   Sparse LU factorization  1635 times/sec    Mixed

• Sync frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
• The higher the multiprogramming degree, the more spin-blocks (context switches) occur during each synchronization.
• The sparse LU benchmarks synchronize much more frequently than the others.


Performance Evaluation on a Multiprogrammed Workload
• Workload: a sequence of six jobs launched at a fixed interval.
• Compare job turnaround times (in seconds) on the Power Challenge.

               Interval = 20 sec         Interval = 10 sec
  Job          TMPI-2   TMPI    SGI      TMPI-2   TMPI    SGI
  GOODWIN        16.0   20.4   19.0        20.8   30.5    29.2
  MM2            14.9   16.2   25.9        29.3   43.7    60.6
  GE1            19.9   21.2   27.3        37.5   65.3   162.0
  GE2            11.0   11.8   16.4        22.4   40.5    61.7
  MM1            33.8   35.1   63.8        66.5   63.3   160.4
  SWEEP3D        47.5   47.4   67.4        67.0   72.9   162.1
  Average        23.8   25.4   36.6        40.6   52.7   106.0
  Normalized     1.00   1.07   1.54        1.00   1.30    2.61


Workload with Certain Multiprogramming Degrees
• Goal: identify the performance impact of the multiprogramming degree.
• Experimental setting:
  - each workload contains one benchmark program;
  - run n MPI nodes on p processors (n ≥ p); the multiprogramming degree is n/p.
• Compare megaflop rates or speedups of the kernel part of each application.
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
[Charts omitted.]


Performance Impact of Multiprogramming Degree (SGI Origin 2000)
• Performance ratio = (speedup or MFLOP rate of TMPI-2) / (speedup or MFLOP rate of TMPI or SGI MPI).

Performance ratios of TMPI-2 over TMPI:

  Benchmark   #MPI nodes/    2 procs   4 procs   6 procs   8 procs
              #processors
  GE               1           1.04      1.01      0.99      0.99
                   2           1.11      1.12      1.22      1.32
                   3           1.23      1.45      1.54      1.67
  SWEEP3D          1           1.04      1.03      1.01      1.01
                   2           1.08      1.19      1.44      1.63
                   3           1.17      1.36      1.51      1.88

Performance ratios of TMPI-2 over SGI MPI:

  Benchmark   #MPI nodes/    2 procs   4 procs   6 procs   8 procs
              #processors
  GE               1           1.01      1.02      1.03      1.03
                   2           3.05      5.10      6.61      8.07
                   3           7.22     13.66     22.72     32.17
  SWEEP3D          1           1.01      1.00      1.00      1.00
                   2           2.01      3.71      4.44      6.50
                   3           2.97      7.07     11.94     15.69


Benefits of Scheduler-conscious Event Waiting

Improvement over simple spin-block on the Power Challenge:

  #MPI nodes   #MPI nodes/     GE     SWEEP3D   GOODWIN    TID     E40R0100
               #processors
       4            1          0.3%     0%        0.6%      0%       1.5%
       6            1.5        2.4%     0.9%      0.1%      2.7%     1.1%
       8            2          2.4%     3.0%      4.6%      2.1%     1.7%
      10            2.5        6.1%     8.3%     11.7%     14.4%     5.2%
      12            3          6.7%    12.6%      6.1%      8.1%    11.4%

Improvement over simple spin-block on the Origin 2000:

  #MPI nodes   #MPI nodes/     GE     SWEEP3D   GOODWIN    TID     E40R0100
               #processors
       4            1          0%      -1.0%      0%        0.2%     0.7%
       6            1.5        3.1%     3.8%      0.1%      0.7%     0.4%
       8            2          2.2%     1.1%      7.5%      5.3%     2.2%
      10            2.5        1.0%     3.4%      0.7%      3.7%     9.3%
      12            3         14.2%     4.5%      8.2%      4.8%    12.0%


Conclusions
• Contributions for optimizing MPI execution:
  - adaptive two-level thread management;
  - scheduler-conscious event waiting;
  - large performance improvement, up to an order of magnitude, depending on the applications and the load;
  - in multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
• Current and future work: support threaded MPI on SMP clusters.
• http://www.cs.ucsb.edu/research/tmpi