Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines

Kai Shen, Hong Tang, and Tao Yang
http://www.cs.ucsb.edu/research/tmpi
Department of Computer Science
University of California, Santa Barbara
MPI-Based Parallel Computation on Shared Memory Machines
- Shared Memory Machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- MPI on SMMs: threads are easy to program, but people still use MPI on SMMs for:
  - better portability to other platforms (e.g., SMM clusters);
  - good data locality due to data partitioning.
Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling:
  - good for parallel programs that synchronize frequently;
  - low resource utilization (processor fragmentation; not enough parallelism).
- Space/time sharing:
  - time sharing on dynamically partitioned machines;
  - short response time and high throughput.
- Impact on MPI program execution:
  - not all MPI nodes are scheduled simultaneously;
  - the number of processors available to each application may change dynamically.
- Optimization is needed for fast MPI execution on multiprogrammed SMMs.
Techniques Studied
- Thread-based MPI execution [PPoPP'99]:
  - compile-time transformation for thread-safe MPI execution;
  - fast context switch and synchronization;
  - fast communication through address sharing.
- Two-level thread management for multiprogrammed environments:
  - even faster context switch/synchronization;
  - use of scheduling information to guide synchronization.
Our prototype system: TMPI
Related Work
MPI-related work:
- MPICH: a portable MPI implementation [Gropp/Lusk et al.].
- SGI MPI: highly optimized on SGI platforms.
- MPI-2: multithreading within a single MPI node.
Scheduling and synchronization:
- Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research.
- Scheduler-conscious Synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks.
- Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, more synchronization leads to context switches, so context switch/synchronization has a large performance impact.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.
System Architecture
[Architecture diagram: each MPI application runs on its own TMPI runtime, which schedules that application's user-level threads; a system-wide resource management layer sits underneath all applications.]
- Targeted at multiprogrammed environments.
- Two-level thread management.
Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor):
  - collects information about active MPI applications;
  - partitions processors among them.
- Application-wide user-level thread management:
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
- Big picture (for the whole system):
  - #active kernel threads ≈ #processors;
  - kernel-level context switches are minimized.
User-level Thread Scheduling
- Each kernel thread can be:
  - active: executing an MPI node (a user-level thread);
  - suspended.
- Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
  - #kernel threads = #MPI nodes (avoid dynamic thread creation).
- Every active kernel thread polls the system resource manager, which leads to one of the following (sketched in code below):
  - deactivation: the thread suspends itself;
  - activation: it wakes up some suspended kernel threads;
  - no action.
When to poll?
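To make the deactivation/activation choice concrete, here is a minimal C sketch of what an active kernel thread might do at each poll. The names (rm_allocated_procs, suspended_pool, active_kthreads) are illustrative assumptions, not the actual TMPI interface.

#include <semaphore.h>

/* Illustrative per-application state; not the actual TMPI data structures. */
static int   active_kthreads;   /* kernel threads currently running MPI nodes */
static sem_t suspended_pool;    /* suspended kernel threads sleep on this     */

/* Assumed query of the system-wide resource manager: how many processors
 * are currently allocated to this application. */
extern int rm_allocated_procs(void);

/* Called by an active kernel thread when it polls the resource manager. */
static void poll_resource_manager(void)
{
    int procs  = rm_allocated_procs();
    int active = __atomic_load_n(&active_kthreads, __ATOMIC_RELAXED);

    if (active > procs) {
        /* Deactivation: this kernel thread suspends itself. */
        __atomic_fetch_sub(&active_kthreads, 1, __ATOMIC_RELAXED);
        sem_wait(&suspended_pool);   /* sleep until some thread re-activates us */
        __atomic_fetch_add(&active_kthreads, 1, __ATOMIC_RELAXED);
    } else if (active < procs) {
        /* Activation: wake up one suspended kernel thread, if any is waiting. */
        sem_post(&suspended_pool);
    }
    /* No-action: the active count already matches the allocation. */
}

Because the check runs on the synchronization path rather than on a timer, the active count can track the allocation without extra kernel-level context switches beyond the deactivations themselves.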
Polling in User-Level Context Switch
- A context switch is the result of synchronization (e.g., an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during each context switch (see the sketch below):
  - two stack switches if it deactivates (it suspends on a dummy stack);
  - one stack switch otherwise.
- After optimization, a context switch takes about 2 μs on average on the SGI Power Challenge.
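The one-versus-two stack switches can be illustrated with POSIX ucontext calls. This is only a sketch under assumed helpers (must_deactivate, pick_next_node, suspended_pool); it is not TMPI's actual context-switch code.

#include <semaphore.h>
#include <ucontext.h>

extern sem_t suspended_pool;              /* suspended kernel threads sleep here        */
extern int   must_deactivate(void);       /* poll result from the resource manager      */
extern ucontext_t *pick_next_node(void);  /* next runnable MPI node (user-level thread) */

static __thread char       dummy_stack[16 * 1024];  /* per-kernel-thread dummy stack */
static __thread ucontext_t dummy_ctx;

static void park_on_dummy_stack(void)
{
    /* Running on the dummy stack: the MPI node we just left is free to be
     * resumed by another kernel thread while this one sleeps. */
    sem_wait(&suspended_pool);            /* deactivation */
    setcontext(pick_next_node());         /* second stack switch, on re-activation */
}

/* Called when the current MPI node blocks, e.g. while waiting for a message. */
void tmpi_context_switch(ucontext_t *current)
{
    if (!must_deactivate()) {
        /* Common case: one stack switch, directly to another MPI node. */
        swapcontext(current, pick_next_node());
    } else {
        /* Deactivation: two stack switches, the first onto the dummy stack. */
        getcontext(&dummy_ctx);
        dummy_ctx.uc_stack.ss_sp   = dummy_stack;
        dummy_ctx.uc_stack.ss_size = sizeof dummy_stack;
        dummy_ctx.uc_link          = NULL;
        makecontext(&dummy_ctx, park_on_dummy_stack, 0);
        swapcontext(current, &dummy_ctx); /* first stack switch */
    }
    /* Execution resumes here when some kernel thread later resumes this node. */
}

Parking on a per-kernel-thread dummy stack is what lets the blocked MPI node's own stack be picked up by another kernel thread while this one sleeps.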
Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
Event Waiting Synchronization
- All MPI synchronization is based on waitEvent.
- [Diagram] The waiter calls waitEvent(*pflag == value) and waits; the caller later executes *pflag = value, which wakes the waiter up.
- Waiting could be: spinning, or yielding/blocking.
A small usage sketch follows.
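As a minimal illustration of the primitive in the diagram, the waiter/caller pair could look like the following. The spin-only macro is purely a placeholder for the waiting policy (spinning versus yielding/blocking) that the next slides discuss; it is not the real TMPI implementation.

/* Hypothetical macro form of the primitive shown on the slide. */
#define waitEvent(cond)                                         \
    do {                                                        \
        while (!(cond)) {                                       \
            /* spin, yield, or block: see the next slides */    \
        }                                                       \
    } while (0)

/* Waiter side: an MPI node waiting for an incoming message. */
void recv_wait(volatile int *pflag, int value)
{
    waitEvent(*pflag == value);   /* returns once the caller has set the flag */
}

/* Caller side: the other MPI node sets the flag, which wakes the waiter up. */
void send_complete(volatile int *pflag, int value)
{
    *pflag = value;
}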
Tradeoff between spin and block
- Basic rules for spin-then-block waiting:
  - spinning wastes CPU cycles;
  - blocking introduces context-switch overhead, and always blocking is not good for dedicated environments;
  - previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - fast context switch has a substantial performance impact;
  - scheduling information should guide the spin/block decision:
    - spinning is futile when the caller is not currently scheduled;
    - most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).
Scheduler-conscious Event Waiting
[Flowchart] waitEvent(condition): check the condition; if it holds, return. Otherwise, using the scheduling and affinity information provided by the user-level scheduler: if the caller (the node that will set the event) is not currently scheduled, block (yield the processor); if it is scheduled, check cache affinity: with affinity, spin for a time T, checking the condition periodically; without affinity, block. A code sketch follows.
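Below is a minimal C sketch of the decision logic above. The scheduler queries (caller_currently_scheduled, waiter_has_cache_affinity) and block_current_node stand in for the scheduling and affinity information the TMPI user-level scheduler provides; their names are assumptions, not the real interface.

#define SPIN_TIME_T 1000   /* spin budget, standing in for the spin time T (tunable) */

extern int  caller_currently_scheduled(int caller); /* scheduling info from the user-level scheduler  */
extern int  waiter_has_cache_affinity(void);        /* affinity info from the user-level scheduler    */
extern void block_current_node(void);               /* yield the processor: user-level context switch */

/* Wait until the MPI node 'caller' sets *pflag to value. */
void waitEvent_sc(volatile int *pflag, int value, int caller)
{
    for (;;) {
        if (*pflag == value)
            return;                                 /* condition already holds */

        /* Spinning is futile if the node that will set the flag is not
         * currently scheduled; and with little cached state to lose,
         * blocking is cheap anyway.  Block in either case. */
        if (!caller_currently_scheduled(caller) || !waiter_has_cache_affinity()) {
            block_current_node();
            continue;
        }

        /* Otherwise spin for a bounded time T, checking periodically. */
        for (int i = 0; i < SPIN_TIME_T; i++) {
            if (*pflag == value)
                return;
        }
        block_current_node();                       /* spun long enough: block */
    }
}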
Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB of memory;
  - SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory.
- Systems compared:
  - TMPI-2: TMPI with two-level thread management;
  - SGI MPI: SGI's native MPI implementation;
  - TMPI: the original TMPI without two-level thread management.
Testing Benchmarks
Benchmark   Description               Sync Frequency    MPI Operations
MM          Matrix multiplication     2 times/sec       Mostly MPI_Bsend
GE          Gaussian elimination      59 times/sec      Mostly MPI_Bcast
SWEEP3D     3D neutron transport      32 times/sec      Mostly send/recv
GOODWIN     Sparse LU factorization   2392 times/sec    Mixed
TID         Sparse LU factorization   2067 times/sec    Mixed
E40R0100    Sparse LU factorization   1635 times/sec    Mixed

- Sync frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more synchronization operations lead to context switches.
- The sparse LU benchmarks synchronize much more frequently than the others.
Performance Evaluation on a Multiprogrammed Workload
- Workload: a sequence of six jobs launched at a fixed interval.
- Compare job turnaround time (in secs) on the Power Challenge.

Jobs          Interval = 20 secs          Interval = 10 secs
              TMPI-2   TMPI    SGI        TMPI-2   TMPI    SGI
GOODWIN       16.0     20.4    19.0       20.8     30.5    29.2
MM2           14.9     16.2    25.9       29.3     43.7    60.6
GE1           19.9     21.2    27.3       37.5     65.3    162.0
GE2           11.0     11.8    16.4       22.4     40.5    61.7
MM1           33.8     35.1    63.8       66.5     63.3    160.4
SWEEP3D       47.5     47.4    67.4       67.0     72.9    162.1
Average       23.8     25.4    36.6       40.6     52.7    106.0
Normalized    1.00     1.07    1.54       1.00     1.30    2.61
Workload with Certain Multiprogramming Degrees
- Goal: identify the performance impact of the multiprogramming degree.
- Experimental setting:
  - each workload contains one benchmark program;
  - run n MPI nodes on p processors (n ≥ p);
  - the multiprogramming degree is n/p.
- Compare megaflop rates or speedups of the kernel part of each application.
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
Performance Impact of Multiprogramming Degree (SGI Origin 2000)

Performance ratio = (speedup or MFLOP rate of TMPI-2) / (speedup or MFLOP rate of TMPI or SGI MPI)

Performance ratios of TMPI-2 over TMPI:

                           GE                      SWEEP3D
#MPI nodes / #Processors   1     2     3           1     2     3
2 Processors               1.04  1.11  1.23        1.04  1.08  1.17
4 Processors               1.01  1.12  1.45        1.03  1.19  1.36
6 Processors               0.99  1.22  1.54        1.01  1.44  1.51
8 Processors               0.99  1.32  1.67        1.01  1.63  1.88

Performance ratios of TMPI-2 over SGI MPI:

                           GE                      SWEEP3D
#MPI nodes / #Processors   1     2     3           1     2      3
2 Processors               1.01  3.05  7.22        1.01  2.01   2.97
4 Processors               1.02  5.10  13.66       1.00  3.71   7.07
6 Processors               1.03  6.61  22.72       1.00  4.44   11.94
8 Processors               1.03  8.07  32.17       1.00  6.50   15.69
Benefits of Scheduler-conscious Event Waiting

Improvement over simple spin-block on the Power Challenge:

#MPI nodes                 4       6       8       10      12
#MPI nodes / #Processors   1       1.5     2       2.5     3
GE                         0.3%    2.4%    2.4%    3.0%    4.6%
SWEEP3D                    0%      0.9%    2.1%    1.7%    6.1%
GOODWIN                    0.6%    0.1%    8.3%    11.7%   14.4%
TID                        0%      2.7%    5.2%    6.7%    12.6%
E40R0100                   1.5%    1.1%    6.1%    8.1%    11.4%

Improvement over simple spin-block on the Origin 2000:

#MPI nodes                 4       6       8       10      12
#MPI nodes / #Processors   1       1.5     2       2.5     3
GE                         0%      3.1%    2.2%    1.0%    3.4%
SWEEP3D                    -1.0%   3.8%    1.1%    0.7%    3.7%
GOODWIN                    0%      0.1%    7.5%    9.3%    14.2%
TID                        0.2%    0.7%    5.3%    4.5%    8.2%
E40R0100                   0.7%    0.4%    2.2%    4.8%    12.0%
Conclusions
- Contributions to optimizing MPI execution:
  - adaptive two-level thread management and scheduler-conscious event waiting;
  - large performance improvements: up to an order of magnitude, depending on the application and the load;
  - in multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
- Current and future work:
  - support threaded MPI on SMP clusters.
http://www.cs.ucsb.edu/research/tmpi