Adaptive Two-level Thread Management
for MPI Execution on Multiprogrammed
Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
http://www.cs.ucsb.edu/research/tmpi
Department of Computer Science
University of California, Santa Barbara
MPI-Based Parallel Computation on
Shared Memory Machines
• Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
• MPI is a portable, high-performance parallel programming model.
• MPI on SMMs: threads are easy to program, but MPI is still used on SMMs because of:
  - better portability for running on other platforms (e.g., SMM clusters);
  - good data locality due to data partitioning.
Scheduling for Parallel Jobs in
Multiprogrammed SMMs
• Gang scheduling
  - Good for parallel programs that synchronize frequently;
  - Hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated resources).
• Space/time sharing
  - Time sharing combined with dynamic partitioning;
  - High throughput; popular in current OSes (e.g., IRIX 6.5).
• Impact on MPI program execution
  - Not all MPI nodes are scheduled simultaneously;
  - The number of available processors for each application may change dynamically.
• Optimization is needed for fast MPI execution on SMMs.
Techniques Studied
• Thread-based MPI execution [PPoPP'99]
  - Compile-time transformation for thread-safe MPI execution (sketched below)
  - Fast context switch and synchronization
  - Fast communication through address sharing
• Two-level thread management for multiprogrammed environments
  - Even faster context switch/synchronization
  - Uses scheduling information to guide synchronization
• Our prototype system: TMPI
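The sketch below illustrates the idea behind the compile-time transformation for thread-safe MPI execution: a global variable in the original MPI source is privatized so that several MPI nodes can run as threads in one address space. It is an illustration only, not the TMPI transformation itself; MAX_NODES, tmpi_rank_key, and tmpi_self() are hypothetical names.

    /* Original MPI code (one process per MPI node):
     *     int iter_count = 0;
     *     void step(void) { iter_count++; }
     *
     * Transformed code: every MPI node is a thread, so the global must be
     * replicated per node.  The helper names below are hypothetical.      */
    #include <pthread.h>

    #define MAX_NODES 64                       /* illustrative upper bound */

    static int iter_count[MAX_NODES];          /* one copy per MPI node    */
    static pthread_key_t tmpi_rank_key;        /* holds this thread's rank,
                                                  installed at start-up    */

    static int tmpi_self(void)                 /* which MPI node am I?     */
    {
        return (int)(long)pthread_getspecific(tmpi_rank_key);
    }

    void step(void)
    {
        iter_count[tmpi_self()]++;             /* transformed reference    */
    }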
Impact of synchronization on
coarse-grain parallel programs
• Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3.
• Synchronization costs 43%-84% of the total time.
• Execution time breakdown for TMPI and SGI MPI:
SWEEP3D               TMPI                   SGI MPI
                      Time        Percent    Time         Percent
Kernel Computation    47.8 sec    54.3%      48.3 sec     5.65%
Synchronization       38.1 sec    43.3%      722.8 sec    84.45%
Queue Management      1.0 sec     1.1%       83.4 sec     9.74%
Memory Copy           1.1 sec     1.3%       1.4 sec      0.16%
Other Cost            0 sec       0%         0 sec        0%
Total                 88.0 sec    100%       855.9 sec    100%
Related Work
MPI-related work
• MPICH, a portable MPI implementation [Gropp/Lusk et al.].
• SGI MPI, highly optimized on SGI platforms.
• MPI-2, multithreading within a single MPI node.
Scheduling and synchronization
• Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research.
• Scheduler-conscious Synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks.
• Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
Outline
• Motivations & Related Work
• Adaptive Two-level Thread Management
• Scheduler-conscious Event Waiting
• Experimental Studies
Context Switch/Synchronization in
Multiprogrammed Environments
• In multiprogrammed environments, synchronization leads to more context switches, which has a large performance impact.
• Conventional MPI implementations map each MPI node to an OS process.
• Our earlier work maps each MPI node to a kernel thread.
• Two-level thread management maps each MPI node to a user-level thread:
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.
System Architecture
[Architecture diagram: several MPI applications, each running on its own TMPI runtime with its own set of user-level threads, all on top of a system-wide resource management layer.]
• Targeted at multiprogrammed environments
• Two-level thread management
Adaptive Two-level Thread
Management
• System-wide resource manager (OS kernel or user-level central monitor)
  - collects information about active MPI applications;
  - partitions processors among them (see the sketch below).
• Application-wide user-level thread management
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
• Big picture (across the whole system):
  - #active kernel threads ≈ #processors
  - minimize kernel-level context switches
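As a rough illustration of the system-wide manager's role, the sketch below divides the machine's processors among the active MPI applications in proportion to their node counts. This is a simplified stand-in, not the TMPI policy; app_info and partition_processors are assumed names.

    /* Simplified sketch (not the TMPI policy): split nprocs processors
     * among the active MPI applications in proportion to their node counts. */
    typedef struct {
        int num_nodes;      /* MPI nodes in this application       */
        int allocated;      /* processors currently assigned to it */
    } app_info;

    static void partition_processors(app_info *apps, int napps, int nprocs)
    {
        int total = 0, given = 0;

        for (int i = 0; i < napps; i++)
            total += apps[i].num_nodes;

        for (int i = 0; i < napps; i++) {
            apps[i].allocated = total ? (nprocs * apps[i].num_nodes) / total : 0;
            given += apps[i].allocated;
        }
        /* Hand out processors lost to rounding, one per application. */
        for (int i = 0; given < nprocs && i < napps; i++, given++)
            apps[i].allocated++;
    }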
User-level Thread Scheduling
• Every kernel thread is either:
  - active: executing an MPI node (user-level thread); or
  - suspended.
• Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimizes kernel-level context switches)
  - #kernel threads = #MPI nodes (avoids dynamic thread creation)
• Every active kernel thread polls the system resource manager, which leads to one of:
  - deactivation: suspending itself;
  - activation: waking up some suspended kernel threads;
  - no action.
• When to poll? (answered on the next slide; a sketch of the adjustment logic follows below)
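A sketch of the activation/deactivation step is given below, assuming a hypothetical per-application structure (vp_pool) protected by a mutex, with a semaphore on which surplus kernel threads park. In TMPI the suspending thread first switches to a dummy stack before sleeping; the sketch omits that detail.

    #include <pthread.h>
    #include <semaphore.h>

    typedef struct {
        pthread_mutex_t lock;
        int   active;       /* kernel threads currently running MPI nodes */
        int   allocated;    /* processors the resource manager gave us    */
        sem_t parked;       /* surplus kernel threads sleep here          */
    } vp_pool;

    /* Called by an active kernel thread when it polls the resource
     * manager during a user-level context switch. */
    static void poll_and_adjust(vp_pool *p)
    {
        pthread_mutex_lock(&p->lock);
        if (p->active > p->allocated) {
            /* Deactivation: this kernel thread suspends itself. */
            p->active--;
            pthread_mutex_unlock(&p->lock);
            sem_wait(&p->parked);        /* sleeps until a poller wakes it */
        } else if (p->active < p->allocated) {
            /* Activation: wake one parked thread per missing processor. */
            int wake = p->allocated - p->active;
            p->active += wake;           /* woken threads count as active  */
            pthread_mutex_unlock(&p->lock);
            while (wake-- > 0)
                sem_post(&p->parked);
        } else {
            pthread_mutex_unlock(&p->lock);   /* No action. */
        }
    }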
Polling in User-Level Context Switch
• A context switch is the result of synchronization (e.g., an MPI node waits for a message).
• The underlying kernel thread polls the system resource manager during each user-level context switch:
  - two stack switches if it deactivates (it suspends on a dummy stack);
  - one stack switch otherwise.
• After optimization, a user-level context switch takes about 2 µs on average on the SGI Power Challenge.
Outline
• Motivations & Related Work
• Adaptive Two-level Thread Management
• Scheduler-conscious Event Waiting
• Experimental Studies
Event Waiting Synchronization
• All MPI synchronization is based on waitEvent.
[Diagram: the waiter calls waitEvent(*pflag == value) and waits; the caller later executes *pflag = value, which wakes the waiter up. Waiting could be done by spinning or by yielding/blocking.]
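For concreteness, a minimal spin-then-yield version of the primitive might look like the sketch below, written with an explicit flag/value pair rather than a general condition; SPIN_LIMIT and signalEvent are illustrative names, not the TMPI interface. The scheduler-conscious variant on a later slide replaces this fixed policy.

    #include <sched.h>

    #define SPIN_LIMIT 1000          /* illustrative spin budget */

    void waitEvent(volatile int *pflag, int value)
    {
        int spins = 0;
        while (*pflag != value) {
            if (++spins < SPIN_LIMIT)
                ;                    /* spin: re-check the flag      */
            else {
                sched_yield();       /* block: yield the processor   */
                spins = 0;
            }
        }
    }

    /* Caller side: publish the value and thereby wake the waiter. */
    void signalEvent(volatile int *pflag, int value)
    {
        *pflag = value;
    }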
Tradeoff between spin and block
• Basic rules for spin-then-block waiting:
  - Spinning wastes CPU cycles.
  - Blocking introduces context-switch overhead; always blocking is not good for dedicated environments.
  - Previous work focuses on choosing the best spin time.
• Our optimization focus and findings:
  - Fast context switch has a substantial performance impact.
  - Use scheduling information to guide the spin/block decision:
    - Spinning is futile when the caller is not currently scheduled.
    - Most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).
Scheduler-conscious Event Waiting
[Flowchart for waitEvent(condition), driven by the scheduling and affinity information the user-level scheduler provides: check the condition; if it holds, return. Otherwise, if the caller is not currently scheduled, block (yield the processor); if the caller is scheduled, check cache affinity to choose between blocking and spinning for a time T while re-checking the condition periodically.]
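A sketch of one plausible reading of this decision logic is shown below. The sched_info struct and its two flags stand in for the scheduling and cache-affinity information the user-level scheduler exports, and SPIN_NSEC plays the role of the spin window T; none of these names are the TMPI interface.

    #include <sched.h>
    #include <time.h>

    typedef struct {
        volatile int caller_scheduled;  /* is the signalling MPI node running?  */
        volatile int have_affinity;     /* does the waiter still own its cache? */
    } sched_info;

    #define SPIN_NSEC 50000L            /* illustrative spin window T (50 us) */

    void waitEvent_sc(volatile int *pflag, int value, const sched_info *si)
    {
        while (*pflag != value) {
            if (!si->caller_scheduled || !si->have_affinity) {
                /* Spinning is futile (caller descheduled) or blocking is
                 * cheap (little cached state left): yield the processor.  */
                sched_yield();
            } else {
                /* Spin for time T, re-checking the condition periodically. */
                struct timespec t0, now;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                do {
                    if (*pflag == value)
                        return;
                    clock_gettime(CLOCK_MONOTONIC, &now);
                } while ((now.tv_sec - t0.tv_sec) * 1000000000L
                         + (now.tv_nsec - t0.tv_nsec) < SPIN_NSEC);
            }
        }
    }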
Experimental Settings
• Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB of memory
  - SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory
• Systems compared:
  - TMPI-2: TMPI with two-level thread management
  - SGI MPI: SGI's native MPI implementation
  - TMPI: the original TMPI without two-level thread management
Testing Benchmarks
Benchmark    Description               Sync Frequency     MPI Operations
MM           Matrix multiplication     2 times/sec        Mostly MPI_Bsend
GE           Gaussian elimination      59 times/sec       Mostly MPI_Bcast
SWEEP3D      3D neutron transport      32 times/sec       Mostly send/recv
GOODWIN      Sparse LU factorization   2392 times/sec     Mixed
TID          Sparse LU factorization   2067 times/sec     Mixed
E40R0100     Sparse LU factorization   1635 times/sec     Mixed

• Sync frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
• The higher the multiprogramming degree, the more spin-blocks (context switches) occur during each synchronization.
• The sparse LU benchmarks synchronize much more frequently than the others.
Performance evaluation on a
Multiprogrammed Workload
• Workload: a sequence of six jobs launched at a fixed interval.
• Compare job turnaround times (in seconds) on the Power Challenge.

Job          Interval = 20 sec            Interval = 10 sec
             TMPI-2   TMPI   SGI MPI      TMPI-2   TMPI   SGI MPI
GOODWIN      16.0     20.4   19.0         20.8     30.5   29.2
MM2          14.9     16.2   25.9         29.3     43.7   60.6
GE1          19.9     21.2   27.3         37.5     65.3   162.0
GE2          11.0     11.8   16.4         22.4     40.5   61.7
MM1          33.8     35.1   63.8         66.5     63.3   160.4
SWEEP3D      47.5     47.4   67.4         67.0     72.9   162.1
Average      23.8     25.4   36.6         40.6     52.7   106.0
Normalized   1.00     1.07   1.54         1.00     1.30   2.61
Workload with Certain
Multiprogramming Degrees
• Goal: identify the performance impact of the multiprogramming degree.
• Experimental setting:
  - Each workload contains one benchmark program.
  - Run n MPI nodes on p processors (n ≥ p).
  - The multiprogramming degree is n/p.
• Compare megaflop rates or speedups of the kernel part of each application.
Performance Impact of Multiprogramming
Degree (SGI Power Challenge)
Performance Impact of Multiprogramming
Degree (SGI Origin 2000)
Performance ratio = (speedup or MFLOP rate of TMPI-2) / (speedup or MFLOP rate of TMPI or SGI MPI)

                GE (#MPI nodes / #processors)    SWEEP3D (#MPI nodes / #processors)
                1       2       3                1       2       3
2 processors    1.04    1.11    1.23             1.04    1.08    1.17
4 processors    1.01    1.12    1.45             1.03    1.19    1.36
6 processors    0.99    1.22    1.54             1.01    1.44    1.51
8 processors    0.99    1.32    1.67             1.01    1.63    1.88
Performance ratios of TMPI-2 over TMPI

                GE (#MPI nodes / #processors)    SWEEP3D (#MPI nodes / #processors)
                1       2       3                1       2       3
2 processors    1.01    3.05    7.22             1.01    2.01    2.97
4 processors    1.02    5.10    13.66            1.00    3.71    7.07
6 processors    1.03    6.61    22.72            1.00    4.44    11.94
8 processors    1.03    8.07    32.17            1.00    6.50    15.69
Performance ratios of TMPI-2 over SGI MPI
Benefits of Scheduler-conscious Event
Waiting
#MPI nodes                  4       6       8       10      12
#MPI nodes / #processors    1       1.5     2       2.5     3
GE                          0.3%    2.4%    2.4%    3.0%    4.6%
SWEEP3D                     0%      0.9%    2.1%    1.7%    6.1%
GOODWIN                     0.6%    0.1%    8.3%    11.7%   14.4%
TID                         0%      2.7%    5.2%    6.7%    12.6%
E40R0100                    1.5%    1.1%    6.1%    8.1%    11.4%
Improvement over simple spin-block on Power Challenge
#MPI nodes                  4       6       8       10      12
#MPI nodes / #processors    1       1.5     2       2.5     3
GE                          0%      3.1%    2.2%    1.0%    3.4%
SWEEP3D                     -1.0%   3.8%    1.1%    0.7%    3.7%
GOODWIN                     0%      0.1%    7.5%    9.3%    14.2%
TID                         0.2%    0.7%    5.3%    4.5%    8.2%
E40R0100                    0.7%    0.4%    2.2%    4.8%    12.0%
Improvement over simple spin-block on Origin 2000
Conclusions
• Contributions for optimizing MPI execution:
  - Adaptive two-level thread management; scheduler-conscious event waiting.
  - Large performance improvement: up to an order of magnitude, depending on the application and load.
  - In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
• Current and future work:
  - Support threaded MPI on SMP clusters.
• Download: http://www.cs.ucsb.edu/research/tmpi