Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
http://www.cs.ucsb.edu/research/tmpi
Department of Computer Science, University of California, Santa Barbara

MPI-Based Parallel Computation on Shared Memory Machines
- Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- MPI on SMMs: threads are easy to program, but people still use MPI on SMMs because of:
  - better portability for running on other platforms (e.g. SMM clusters);
  - good data locality due to data partitioning.

Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling:
  - good for parallel programs that synchronize frequently;
  - low resource utilization (processor fragmentation; not enough parallelism).
- Space/time sharing:
  - time sharing on dynamically partitioned machines;
  - short response time and high throughput.
- Impact on MPI program execution:
  - not all MPI nodes are scheduled simultaneously;
  - the number of available processors for each application may change dynamically.
- Optimization is needed for fast MPI execution on SMMs.

Techniques Studied
- Thread-based MPI execution [PPoPP'99]:
  - compile-time transformation for thread-safe MPI execution;
  - fast context switch and synchronization;
  - fast communication through address sharing.
- Two-level thread management for multiprogrammed environments:
  - even faster context switch/synchronization;
  - use of scheduling information to guide synchronization.
- Our prototype system: TMPI.

Related Work
- MPI-related work:
  - MPICH, a portable MPI implementation [Gropp/Lusk et al.];
  - SGI MPI, highly optimized on SGI platforms;
  - MPI-2, multithreading within a single MPI node.
- Scheduling and synchronization:
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
  - Scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks;
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.

Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies

Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, more synchronization leads to context switches, so context switch/synchronization has a large performance impact.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.

System Architecture
[Diagram: several MPI applications, each running on its own TMPI runtime with its own user-level threads, on top of system-wide resource management.]
- Targeted at multiprogrammed environments
- Two-level thread management

Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor):
  - collects information about active MPI applications;
  - partitions processors among them.
- Application-wide user-level thread management:
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
- Big picture (in the whole system): #active kernel threads ≈ #processors, minimizing kernel-level context switches.
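The adaptation step just described can be made concrete with a small sketch. The code below is illustrative only, not TMPI's implementation: POSIX threads stand in for the kernel threads, and app_sched_t, adapt(), and query_allocated_procs() are hypothetical names for the per-application state, the polling routine, and the system-wide resource manager interface.

    #include <pthread.h>

    /* Hypothetical per-application scheduling state. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  wakeup;     /* suspended kernel threads sleep here */
        int             active;     /* kernel threads currently executing MPI nodes */
        int             allocated;  /* processors granted by the system-wide manager */
    } app_sched_t;

    /* Stub for illustration; a real system would query the OS kernel or a
     * user-level central monitor. */
    static int query_allocated_procs(void) { return 4; }

    /* Called by an active kernel thread whenever it polls the manager
     * (in TMPI-2, during a user-level context switch). */
    void adapt(app_sched_t *s)
    {
        pthread_mutex_lock(&s->lock);
        s->allocated = query_allocated_procs();

        if (s->active > s->allocated) {
            /* Deactivation: too many kernel threads are running, so this
             * one suspends itself until a peer reactivates it. */
            s->active--;
            while (s->active >= s->allocated)
                pthread_cond_wait(&s->wakeup, &s->lock);
            s->active++;
        } else if (s->active < s->allocated) {
            /* Activation: wake one suspended kernel thread. */
            pthread_cond_signal(&s->wakeup);
        }
        /* Otherwise: no action. */
        pthread_mutex_unlock(&s->lock);
    }

In this sketch a kernel thread only suspends when its application holds more active kernel threads than allocated processors, and it is only woken by a peer that observes the opposite imbalance; that is what keeps kernel-level context switches rare while the processor allocation changes underneath the application.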
User-level Thread Scheduling
- Every kernel thread can be:
  - active: executing an MPI node (user-level thread);
  - suspended.
- Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
  - #kernel threads = #MPI nodes (avoid dynamic thread creation).
- Every active kernel thread polls the system resource manager, which leads to one of:
  - deactivation: suspending itself;
  - activation: waking up some suspended kernel threads;
  - no action.
- When to poll?

Polling in User-Level Context Switch
- A context switch is the result of synchronization (e.g. an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during the context switch:
  - two stack switches if deactivation (suspend on a dummy stack);
  - one stack switch otherwise.
- After optimization, about 2 µs on average on the SGI Power Challenge.

Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies

Event Waiting Synchronization
- All MPI synchronization is based on waitEvent.
- [Diagram] The waiter calls waitEvent(*pflag == value) and waits; the caller sets *pflag = value, which wakes the waiter up.
- Waiting could be:
  - spinning;
  - yielding/blocking.

Tradeoff between Spin and Block
- Basic rules for waiting using spin-then-block:
  - spinning wastes CPU cycles;
  - blocking introduces context switch overhead, and always-blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - fast context switch has a substantial performance impact;
  - use scheduling information to guide the spin/block decision:
    - spinning is futile when the caller is not currently scheduled;
    - most blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).

Scheduler-conscious Event Waiting
[Flowchart] waitEvent(condition), using the scheduling and affinity information provided by the user-level scheduler:
- Check the condition; if it holds, return.
- Otherwise, is the caller currently scheduled?
  - No: block (yield the processor).
  - Yes: check cache affinity.
    - Little cache content left: block.
    - Cache still warm: spin for time T, checking the condition periodically, before falling back to blocking.
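The decision procedure in the flowchart can be sketched in C as follows. This is not TMPI-2's actual primitive: wait_event, caller_is_scheduled, have_cache_affinity, block_and_switch, and SPIN_LIMIT are hypothetical names, the stubs only mark where the user-level scheduler supplies its scheduling and affinity information, and sched_yield stands in for TMPI-2's user-level stack switch.

    #include <sched.h>
    #include <stdatomic.h>

    /* Placeholder hooks; the real runtime's user-level scheduler provides
     * this information directly. */
    static int  caller_is_scheduled(int caller) { (void)caller; return 1; }
    static int  have_cache_affinity(void)       { return 1; }
    static void block_and_switch(void)          { sched_yield(); /* user-level context switch in TMPI-2 */ }

    #define SPIN_LIMIT 1000   /* illustrative spin budget, the "time T" above */

    /* Wait until *pflag == value; 'caller' identifies the MPI node expected
     * to set the flag. */
    void wait_event(atomic_int *pflag, int value, int caller)
    {
        while (atomic_load(pflag) != value) {
            if (caller_is_scheduled(caller) && have_cache_affinity()) {
                /* The caller is running and our cache is still warm:
                 * spin for a bounded time, re-checking the condition. */
                for (int i = 0; i < SPIN_LIMIT; i++)
                    if (atomic_load(pflag) == value)
                        return;
            }
            /* Spinning is futile (caller descheduled), blocking is cheap
             * (cold cache), or the spin budget ran out: yield the processor
             * and retry after being rescheduled. */
            block_and_switch();
        }
    }

    /* Producer side, conceptually: atomic_store(pflag, value), then wake
     * any waiter that blocked. */

Checking the condition first keeps the common fast path to a single load; the scheduling check comes next because spinning is futile while the caller is not scheduled.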
Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195MHz MIPS R10000 processors and 2GB memory;
  - SGI Power Challenge with 4 200MHz MIPS R4400 processors and 256MB memory.
- Compare among:
  - TMPI-2: TMPI with two-level thread management;
  - SGI MPI: SGI's native MPI implementation;
  - TMPI: the original TMPI without two-level thread management.

Testing Benchmarks

  Benchmark   Description               Sync Frequency    MPI Operations
  MM          Matrix multiplication        2 times/sec    Mostly MPI_Bsend
  GE          Gaussian elimination        59 times/sec    Mostly MPI_Bcast
  SWEEP3D     3D neutron transport        32 times/sec    Mostly send/recv
  GOODWIN     Sparse LU factorization   2392 times/sec    Mixed
  TID         Sparse LU factorization   2067 times/sec    Mixed
  E40R0100    Sparse LU factorization   1635 times/sec    Mixed

- Sync frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more synchronization leads to context switches.
- The sparse LU benchmarks synchronize much more frequently than the others.

Performance Evaluation on a Multiprogrammed Workload
- Workload: a sequence of six jobs launched at a fixed interval.
- Compare job turnaround times (in seconds) on the Power Challenge.

                  Interval = 20 secs          Interval = 10 secs
  Jobs          TMPI-2   TMPI    SGI        TMPI-2   TMPI    SGI
  GOODWIN         16.0   20.4   19.0          20.8   30.5    29.2
  MM2             14.9   16.2   25.9          29.3   43.7    60.6
  GE1             19.9   21.2   27.3          37.5   65.3   162.0
  GE2             11.0   11.8   16.4          22.4   40.5    61.7
  MM1             33.8   35.1   63.8          66.5   63.3   160.4
  SWEEP3D         47.5   47.4   67.4          67.0   72.9   162.1
  Average         23.8   25.4   36.6          40.6   52.7   106.0
  Normalized      1.00   1.07   1.54          1.00   1.30    2.61

Workload with Certain Multiprogramming Degrees
- Goal: identify the performance impact of multiprogramming degrees.
- Experimental setting:
  - each workload has one benchmark program;
  - run n MPI nodes on p processors (n ≥ p);
  - the multiprogramming degree is n/p.
- Compare megaflop rates or speedups of the kernel part of each application.
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
[Performance charts]

Performance Impact of Multiprogramming Degree (SGI Origin 2000)
- Performance ratio = (speedup or MFLOP rate of TMPI-2) / (speedup or MFLOP rate of TMPI or SGI MPI)

Performance ratios of TMPI-2 over TMPI:

  Benchmark   #MPI nodes / #Processors   2 Procs   4 Procs   6 Procs   8 Procs
  GE                     1                 1.04      1.01      0.99      0.99
  GE                     2                 1.11      1.12      1.22      1.32
  GE                     3                 1.23      1.45      1.54      1.67
  SWEEP3D                1                 1.04      1.03      1.01      1.01
  SWEEP3D                2                 1.08      1.19      1.44      1.63
  SWEEP3D                3                 1.17      1.36      1.51      1.88

Performance ratios of TMPI-2 over SGI MPI:

  Benchmark   #MPI nodes / #Processors   2 Procs   4 Procs   6 Procs   8 Procs
  GE                     1                 1.01      1.02      1.03      1.03
  GE                     2                 3.05      5.10      6.61      8.07
  GE                     3                 7.22     13.66     22.72     32.17
  SWEEP3D                1                 1.01      1.00      1.00      1.00
  SWEEP3D                2                 2.01      3.71      4.44      6.50
  SWEEP3D                3                 2.97      7.07     11.94     15.69

Benefits of Scheduler-conscious Event Waiting

Improvement over simple spin-block on the Power Challenge:

  #MPI nodes   #MPI nodes / #Processors     GE     SWEEP3D   GOODWIN    TID     E40R0100
      4                   1                0.3%       0%       0.6%      0%       1.5%
      6                   1.5              2.4%      0.9%      0.1%     2.7%      1.1%
      8                   2                2.4%      3.0%      4.6%     2.1%      1.7%
     10                   2.5              6.1%      8.3%     11.7%    14.4%      5.2%
     12                   3                6.7%     12.6%      6.1%     8.1%     11.4%

Improvement over simple spin-block on the Origin 2000:

  #MPI nodes   #MPI nodes / #Processors     GE     SWEEP3D   GOODWIN    TID     E40R0100
      4                   1                  0%     -1.0%       0%      0.2%      0.7%
      6                   1.5              3.1%      3.8%      0.1%     0.7%      0.4%
      8                   2                2.2%      1.1%      7.5%     5.3%      2.2%
     10                   2.5              1.0%      3.4%      0.7%     3.7%      9.3%
     12                   3               14.2%      4.5%      8.2%     4.8%     12.0%

Conclusions
- Contributions for optimizing MPI execution:
  - adaptive two-level thread management;
  - scheduler-conscious event waiting;
  - great performance improvement: up to an order of magnitude, depending on the applications and the load;
  - in multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
- Current and future work:
  - support threaded MPI on SMP clusters.
- http://www.cs.ucsb.edu/research/tmpi