Optimizing Threaded MPI Execution
on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science
University of California, Santa Barbara
Parallel Computation on SMP Clusters

- Massively Parallel Machines → SMP Clusters
  - Commodity Components: Off-the-shelf Processors + Fast Network (Myrinet, Fast/Gigabit Ethernet)
- Parallel Programming Model for SMP Clusters
  - MPI: Portability, Performance, Legacy Programs
  - MPI + Variations: MPI + Multithreading, MPI + OpenMP
Threaded MPI Execution
- MPI Paradigm: Separate Address Spaces for Different MPI Nodes
  - Natural Solution: MPI Nodes → Processes
- What if we map MPI nodes to threads?
  - Faster synchronization among MPI nodes running on the same machine.
  - Demonstrated in previous work [PPoPP '99] for a single shared-memory machine. (Developed techniques to safely execute MPI programs using threads.)
- Threaded MPI Execution on SMP Clusters
  - Intra-Machine Communication through Shared Memory
  - Inter-Machine Communication through Network
Threaded MPI Execution Benefits Inter-Machine Communication

- Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communication.
Related Work
- MPI on Network Clusters
  - MPICH – a portable MPI implementation.
  - LAM/MPI – communication through a stand-alone RPI server.
- Collective Communication Optimization
  - SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
  - MagPIe – targets SMP clusters connected through a WAN.
- Lower Communication Layer Optimization
  - MPI-FM and MPI-AM.
- Threaded Execution of Message Passing Programs
  - MPI-Lite, LPVM, TPVM.
Background: MPICH Design
[Figure: MPICH layered design – MPI Collective over MPI Point-to-Point over the Abstract Device Interface (ADI), which maps through the Chameleon interface to devices such as T3D, SGI, P4 (TCP, shmem), and others.]
MPICH Communication Structure
[Figure: MPICH communication structure – two panels, MPICH with shared memory and MPICH without shared memory, each showing several cluster nodes (WS). Legend: WS – a cluster node; MPI node (process); MPICH daemon process; inter-process pipe; shared memory; TCP connection.]
TMPI Communication Structure
[Figure: TMPI communication structure – cluster nodes (WS) connected by TCP. Legend: WS – a cluster node; MPI node (thread); TMPI daemon thread; TCP connection; direct memory access and thread synchronization.]
Comparison of TMPI and MPICH
- Drawbacks of MPICH w/ Shared Memory
  - Intra-node communication limited by shared memory size.
  - Busy polling to check messages from either the daemon or the local peer.
  - Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o Shared Memory
  - Big overhead for intra-node communication.
  - Too many daemon processes and open connections.
- Drawbacks of Both MPICH Systems
  - Extra data copying for inter-machine communication.
TMPI Communication Design
[Figure: TMPI's layered communication design – MPI communication on top of separate inter-machine (INTER) and intra-machine (INTRA) modules, built on an abstract network and thread-synchronization interface (NETD, THREAD), which in turn maps to OS facilities: TCP and other networks, pthreads and other thread implementations.]
Separation of Point-to-Point and
Collective Communication Channels
- Observation: MPI point-to-point and collective communication semantics are different.

  Point-to-point                        Collective
  Unknown source (MPI_ANY_SOURCE)       Determined source (ancestor in the spanning tree)
  Out-of-order (message tag)            In-order delivery
  Asynchronous (non-blocking receive)   Synchronous

- Design: separate channels for point-to-point and collective communication (a sketch follows below).
  - Eliminates daemon intervention for collective communication.
  - Less effective for MPICH – no sharing of ports among processes.

[Figure: point-to-point and collective channels among cluster nodes (WS). Legend: WS – a cluster node; MPI node (thread); TMPI daemon thread; TCP connection; direct memory access and thread synchronization.]
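A minimal sketch of the separate-channel idea, assuming each pair of cluster nodes keeps two TCP sockets so collective traffic never queues behind point-to-point traffic. The peer_channels_t struct, msg_class_t enum, and channel_for helper are hypothetical illustrations, not TMPI's actual code:

    #include <stdio.h>

    typedef struct {
        int pt2pt_fd;       /* socket carrying point-to-point messages */
        int collective_fd;  /* socket carrying collective (spanning-tree) messages */
    } peer_channels_t;

    typedef enum { MSG_PT2PT, MSG_COLLECTIVE } msg_class_t;

    /* Pick the outgoing socket purely by message class. */
    static int channel_for(const peer_channels_t *peer, msg_class_t cls)
    {
        return cls == MSG_COLLECTIVE ? peer->collective_fd : peer->pt2pt_fd;
    }

    int main(void)
    {
        peer_channels_t peer = { .pt2pt_fd = 3, .collective_fd = 4 };  /* dummy fds */
        printf("collective traffic uses fd %d\n", channel_for(&peer, MSG_COLLECTIVE));
        printf("point-to-point traffic uses fd %d\n", channel_for(&peer, MSG_PT2PT));
        return 0;
    }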
Hierarchy-Aware Collective
Communication
- Observation: two-level communication hierarchy.
  - Inside an SMP node: shared memory (~10^-8 sec).
  - Between SMP nodes: network (~10^-6 sec).
- Idea: build the communication spanning tree in two steps (a sketch follows below).
  - First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  - Second, all other MPI nodes connect to the local root node.

[Figure: spanning trees for an MPI program with 9 nodes on three cluster nodes (MPI nodes 0-2, 3-5, and 6-8 respectively) – MPICH (hypercube), MPICH (balanced binary tree), and TMPI. Thick edges are network edges.]
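A minimal sketch of the two-step tree construction for the 9-node example above. This is my own illustration rather than TMPI's code; the parent-array representation, the choice of the lowest-ranked node as local root, and the binary tree among local roots are assumptions:

    #include <stdio.h>

    #define NODES 9
    /* cluster_of[i] = which cluster node hosts MPI node i (0-2, 3-5, 6-8). */
    static const int cluster_of[NODES] = {0,0,0, 1,1,1, 2,2,2};

    int main(void)
    {
        int parent[NODES];
        int local_root[3];

        /* Step 1a: the lowest-ranked MPI node on each cluster node becomes the local root. */
        for (int c = 0; c < 3; c++) local_root[c] = -1;
        for (int i = 0; i < NODES; i++)
            if (local_root[cluster_of[i]] < 0) local_root[cluster_of[i]] = i;

        /* Step 1b: connect the local roots with a small binary tree (network edges). */
        for (int c = 0; c < 3; c++)
            parent[local_root[c]] = (c == 0) ? -1 : local_root[(c - 1) / 2];

        /* Step 2: every other MPI node attaches to its local root (shared-memory edges). */
        for (int i = 0; i < NODES; i++)
            if (i != local_root[cluster_of[i]]) parent[i] = local_root[cluster_of[i]];

        for (int i = 0; i < NODES; i++)
            printf("MPI node %d -> parent %d\n", i, parent[i]);
        return 0;
    }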
Adaptive Buffer Management
- Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?
- Choices (a sender-side sketch follows below):
  - Send the data with the request – eager push.
  - Send the request only, and send the data when the receiver is ready – three-phase protocol.
  - TMPI – adapt between both methods.

[Figure: message exchange diagrams for the three-phase protocol, the one-step eager-push protocol (remote node can buffer the message), and graceful degradation from eager-push to the three-phase protocol (remote node cannot buffer the message).]
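A sender-side sketch of the adaptive choice, assuming the receiver reports whether it could buffer an eagerly pushed payload. The helpers send_eager, wait_for_ready_then_resend, and adaptive_send are hypothetical stand-ins for real socket code, not TMPI's wire protocol:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical transport helpers (stubs here; real code would do socket I/O). */
    static bool send_eager(const void *buf, int len)   /* request + data in one step */
    {
        (void)buf;
        printf("eager push of %d bytes\n", len);
        return len <= 4096;   /* pretend the receiver can only buffer small messages */
    }

    static void wait_for_ready_then_resend(const void *buf, int len)  /* three-phase completion */
    {
        (void)buf;
        printf("waiting for receiver-ready, then resending %d bytes\n", len);
    }

    static void adaptive_send(const void *buf, int len)
    {
        /* Try the one-step eager push first: the data travels with the request. */
        if (send_eager(buf, len))
            return;            /* remote node buffered the message - done */

        /* Graceful degradation: the remote node could not buffer the payload,
         * so finish this message with the three-phase protocol. */
        wait_for_ready_then_resend(buf, len);
    }

    int main(void)
    {
        char small[64], large[65536];
        adaptive_send(small, (int)sizeof small);   /* stays eager */
        adaptive_send(large, (int)sizeof large);   /* degrades to three-phase */
        return 0;
    }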
Experimental Study
- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware Setting
  - A cluster of 6 Quad-Xeon 500MHz SMPs, with 1GB main memory and 2 Fast Ethernet cards per machine.
- Software Setting
  - OS: RedHat Linux 6.0, kernel version 2.2.15 w/ channel bonding enabled.
  - Process-based MPI system: MPICH 1.2.
  - Thread-based MPI system: TMPI (45 functions in the MPI 1.1 standard).
Inter-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs MPICH w/ shared memory.

[Figure: (a) Ping-pong short messages – round-trip time (µs) vs message size (0-1000 bytes); (b) Ping-pong long messages – transfer rate (MB/s) vs message size (0-1000 KB).]
Intra-Cluster-Node Point-to-Point
- Ping-pong, TMPI vs MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).

[Figure: (a) Ping-pong short messages – round-trip time (µs) vs message size (0-1000 bytes); (b) Ping-pong long messages – transfer rate (MB/s) vs message size (0-1000 KB).]
Collective Communication
- Reduce, Bcast, Allreduce.
- Three node distributions, three root node settings for MPI_Bcast and MPI_Reduce.
- Table entries are TMPI / MPICH_SHM / MPICH_NOSHM times (µs); one Allreduce entry per node distribution:

  Nodes  Root    Reduce           Bcast            Allreduce
  4x1    same    9/121/4384       10/137/7913      160/175/627
  4x1    rotate  33/81/3699       129/91/4238
  4x1    combo   25/102/3436      17/32/966
  1x4    same    28/1999/1844     21/1610/1551     571/675/775
  1x4    rotate  146/1944/1878    164/1774/1834
  1x4    combo   167/1977/1854    43/409/392
  4x4    same    39/2532/4809     56/2792/10246    736/1412/19914
  4x4    rotate  161/1718/8566    216/2204/8036
  4x4    combo   141/2242/8515    62/489/2054

- Findings:
  1) MPICH w/o shared memory performs the worst.
  2) TMPI is 70+ times faster than MPICH w/ shared memory.
  3) For TMPI, the performance of the 4x4 cases is roughly the summation of that of the 4x1 cases and that of the 1x4 cases.
Macro-Benchmark Performance
[Figure: (a) Matrix Multiplication and (b) Gaussian Elimination – MFLOP rate vs number of MPI nodes, TMPI vs MPICH.]
Conclusions
- Great Advantage of Threaded MPI Execution on SMP Clusters
  - Micro-benchmark: 70+ times faster than MPICH.
  - Macro-benchmark: 100% faster than MPICH.
- Optimization Techniques
  - Separated Collective and Point-to-Point Communication Channels
  - Adaptive Buffer Management
  - Hierarchy-Aware Communications
- http://www.cs.ucsb.edu/projects/tmpi/
Background: Safe Execution of MPI
Programs using Threads
- Program Transformation: eliminate global and static variables (called permanent variables).
- Thread-Specific Data (TSD) – see the pthread sketch below.
  - Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copy of the data variable.
- TSD-based Transformation
  - Each permanent variable declaration is replaced with a KEY declaration. Each node associates its private copy of the permanent variable with the corresponding key. In places where global variables are referenced, use the global keys to retrieve the per-thread copies of the variables.
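A minimal sketch of the TSD mechanism using the POSIX threads API. It is an assumption, for illustration only, that TMPI's key_create/setval/getval calls on the next slide abstract over calls like these:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_key_t kX;               /* one key shared by all threads */

    static void *worker(void *arg)
    {
        int *pX = malloc(sizeof(int));     /* this thread's private copy of "X" */
        *pX = *(int *)arg;
        pthread_setspecific(kX, pX);       /* bind the copy to this thread under key kX */

        int *mine = pthread_getspecific(kX);
        printf("this thread sees X = %d\n", (*mine)++);
        free(mine);
        return NULL;
    }

    int main(void)
    {
        pthread_key_create(&kX, NULL);     /* done once, before the threads start */

        pthread_t t1, t2;
        int a = 1, b = 100;
        pthread_create(&t1, NULL, worker, &a);
        pthread_create(&t2, NULL, worker, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }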
Program Transformation –
An Example
Source Program:

    int X=1;
    int f()
    {
        return X++;
    }

Program After Transformation:

    int kX=0;
    void main_init()
    {
        if (kX==0)
            kX=key_create();
    }
    void user_init()
    {
        int *pX=malloc(sizeof(int));
        *pX=1;
        setval(kX, pX);
    }
    int f()
    {
        int *pX=getval(kX);
        return (*pX)++;
    }