Programming Distributed Memory Systems Using OpenMP

Rudolf Eigenmann,
Ayon Basumallik, Seung-Jai Min,
School of Electrical and Computer Engineering,
Purdue University,
http://www.ece.purdue.edu/ParaMount
Is OpenMP a useful programming
model for distributed systems?

OpenMP is a parallel programming model that assumes a shared
address space
#pragma omp parallel for
for (i=1; i<n; i++) { a[i] = b[i]; }

Why is it difficult to implement OpenMP for distributed processors?
The compiler or runtime system will need to
- partition and place data onto the distributed memories, and
- send/receive messages to orchestrate remote data accesses.
HPF (High Performance Fortran) was a large-scale effort to do so, without success.
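As a concrete illustration of these two tasks, here is a hand-written sketch of what the compiler or runtime would have to generate for a simple loop with neighbor accesses (a hypothetical a[i] = b[i-1] + b[i+1], chosen so that remote data is actually needed): block partitioning of the arrays plus a message exchange for the boundary elements. The halo-cell layout and MPI_Sendrecv exchange are illustrative choices, not the tools discussed in these slides.

    /* Hypothetical illustration: block-distribute a and b, then exchange
     * the boundary elements of b before computing a[i] = b[i-1] + b[i+1]. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024

    int main(int argc, char **argv) {
        int pid, nproc;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        int chunk = N / nproc;                  /* block partitioning */
        /* local block with one halo cell on each side */
        double *b = calloc(chunk + 2, sizeof(double));
        double *a = calloc(chunk + 2, sizeof(double));
        int left  = (pid > 0)         ? pid - 1 : MPI_PROC_NULL;
        int right = (pid < nproc - 1) ? pid + 1 : MPI_PROC_NULL;

        /* orchestrate remote accesses: exchange boundary elements of b */
        MPI_Sendrecv(&b[1], 1, MPI_DOUBLE, left, 0,
                     &b[chunk + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&b[chunk], 1, MPI_DOUBLE, right, 1,
                     &b[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= chunk; i++)        /* each process computes its block */
            a[i] = b[i - 1] + b[i + 1];

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }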

So, why should we try (again)?
OpenMP is an easier (higher-productivity?) programming model. It
- allows programs to be incrementally parallelized, starting from the serial versions, and
- relieves the programmer of the task of managing the movement of logically shared data.
Two Translation Approaches

- Use a Software Distributed Shared Memory System
- Translate OpenMP directly to MPI
Approach 1:
Compiling OpenMP for Software Distributed Shared Memory
Inter-procedural Shared Data Analysis
SUBROUTINE SUB0
INTEGER DELTAT
CALL DCDTZ(DELTAT,…)
CALL DUDTZ(DELTAT,…)
END
SUBROUTINE DUDTZ(X, Y, Z)
INTEGER X,Y,Z
C$OMP PARALLEL
C$OMP+REDUCTION(+:X)
X=X+…
C$OMP END PARALLEL
END
SUBROUTINE DCDTZ(A, B, C)
INTEGER A,B,C
C$OMP PARALLEL
C$OMP+PRIVATE (B, C)
A=…
CALL CCRANK
…
C$OMP END PARALLEL
END
SUBROUTINE CCRANK()
…
beta = 1 – alpha
…
END
Access Pattern Analysis
DO istep = 1, itmax, 1
!$OMP PARALLEL DO
rsd (i, j, k) = …
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
rsd (i, j, k) = …
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
u (i, j, k) = rsd (i, j, k)
!$OMP END PARALLEL DO
CALL RHS()
ENDDO
SUBROUTINE RHS()
!$OMP PARALLEL DO
u (i, j, k) = …
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
…
= u (i, j, k)..
rsd (i, j, k) = rsd (i, j, k)..
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
…
= u (i, j, k)..
rsd (i, j, k) = rsd (i, j, k)..
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
…
= u (i, j, k)..
rsd (i, j, k) = ...
!$OMP END PARALLEL DO
=> Data Distribution-Aware Optimization
(applied to the same loop nest and SUBROUTINE RHS shown above)
Adding Redundant Computation
to Eliminate Communication
OpenMP program:

      DO k = 1, z
!$OMP PARALLEL DO
        DO j = 1, N, 1
          flux(m, j) = u(3, i, j, k) + …
        ENDDO
!$OMP PARALLEL DO
        DO j = 1, N, 1
          DO m = 1, 5, 1
            rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1))
          ENDDO
        ENDDO
      ENDDO

S-DSM program:

      init00 = (N/proc_num)*(pid-1)…
      limit00 = (N/proc_num)*pid…
      DO k = 1, z
        DO j = init00, limit00, 1
          flux(m, j) = u(3, i, j, k) + …
        ENDDO
        CALL TMK_BARRIER(0)
        DO j = init00, limit00, 1
          DO m = 1, 5, 1
            rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1))
          ENDDO
        ENDDO
      ENDDO

Optimized S-DSM code (redundantly computing the boundary flux elements, j = init00-1 and j = limit00+1, removes the need to communicate them):

      init00 = (N/proc_num)*(pid-1)…
      limit00 = (N/proc_num)*pid…
      new_init = init00 - 1
      new_limit = limit00 + 1
      DO k = 1, z
        DO j = new_init, new_limit, 1
          flux(m, j) = u(3, i, j, k) + …
        ENDDO
        CALL TMK_BARRIER(0)
        DO j = init00, limit00, 1
          DO m = 1, 5, 1
            rsd(m, i, j, k) = … + flux(m, j+1) - flux(m, j-1))
          ENDDO
        ENDDO
      ENDDO
Access Privatization
Example from equake (SPEC OMPM2001)
Read-only shared variables (original S-DSM code):

    if (master) {
      shared->ARCHnodes = …
      shared->ARCHduration = …
      ...
    }

    /* Parallel Region */
    N = shared->ARCHnodes;
    iter = shared->ARCHduration;
    …

Privatized variables (done by all nodes):

    {
      ARCHnodes = …
      ARCHduration = …
      ...
    }

    /* Parallel Region */
    N = ARCHnodes;
    iter = ARCHduration;
    …
Optimized Performance of OMPM2001 Benchmarks
[Figure: SPEC OMPM2001 performance on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, applu, equake, and art, comparing baseline performance and optimized performance.]
A Key Question: How Close Are We to MPI Performance?
[Figure: SPEC OMP2001 performance on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, and applu, comparing baseline performance, optimized performance, and MPI performance.]
Towards Adaptive Optimization
A Combined Compiler-Runtime Scheme

- The compiler identifies repetitive access patterns.
- The runtime system learns the actual remote addresses and sends data early.
Ideal program characteristics: an outer, serial loop enclosing inner, parallel loops; data addresses that are invariant or form a linear sequence with respect to the outer loop; and communication points at barriers.
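A minimal sketch of the idea, under our own assumptions (the helpers access_site, barrier_hook, and request_remote are hypothetical, and a real S-DSM runtime would work on pages or cache lines rather than element indices): the runtime records the addresses touched at a compiler-marked access site during the first outer iteration and, because they repeat, re-issues the requests early, at the barrier, in every later iteration.

    #include <stdio.h>
    #include <stddef.h>

    #define MAX_ADDR 1024
    static size_t learned[MAX_ADDR];     /* addresses seen in outer iteration 0 */
    static int    nlearned = 0;

    static void request_remote(size_t addr) {      /* stand-in for an S-DSM fetch */
        printf("fetch element %zu\n", addr);
    }

    static void access_site(int outer_iter, size_t addr) {
        if (outer_iter == 0) {                     /* learning phase */
            learned[nlearned++] = addr;
            request_remote(addr);                  /* demand fetch */
        }
        /* later iterations: data was already prefetched at the barrier */
    }

    static void barrier_hook(int outer_iter) {
        if (outer_iter > 0)                        /* prefetch phase */
            for (int i = 0; i < nlearned; i++)
                request_remote(learned[i]);
    }

    int main(void) {
        for (int t = 0; t < 3; t++) {              /* outer, serial loop */
            barrier_hook(t);                       /* communication point */
            for (size_t i = 0; i < 4; i++)         /* inner, parallel loop */
                access_site(t, 100 + i);           /* invariant addresses */
        }
        return 0;
    }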
Current Best Performance of OpenMP for S-DSM
[Figure: performance on 1, 2, 4, and 8 processors for wupwise, swim, applu, SpMul, and CG, comparing Baseline (No Opt.), Locality Opt, and Locality Opt + Comp/Run Opt.]
Approach 2:
Translating OpenMP directly to MPI

- Baseline translation
- Overlapping computation and communication for irregular accesses
Baseline Translation of OpenMP to MPI

Execution model: SPMD
- Serial regions are replicated on all processes
- Iterations of parallel for loops are distributed (using static block scheduling)

Shared data is allocated on all nodes
- There is no concept of "owner": only producers and consumers of shared data
- At the end of a parallel loop, producers communicate shared data to "potential" future consumers (see the sketch below)
- Array section analysis is used for summarizing array accesses
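A sketch of this execution model for one simple parallel loop, under stated assumptions (N divisible by the process count; the "potential consumers" conservatively taken to be all processes, realized here with MPI_Allgather); the real translator computes bounds and message sets from the analyses described on the following slides.

    #include <mpi.h>

    #define N 1024                                /* assume N % nproc == 0 */
    static double a[N];                           /* shared data, allocated on all nodes */
    static double f(int i) { return 2.0 * i; }

    void translated_parallel_loop(int pid, int nproc) {
        int chunk = N / nproc;                    /* static block scheduling */
        int lb = pid * chunk, ub = lb + chunk;

        for (int i = lb; i < ub; i++)             /* this process's iterations */
            a[i] = f(i);

        /* End of parallel loop: each producer communicates its written block
           to the potential future consumers, here conservatively everyone. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }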
Baseline Translation
Translation Steps:
1. Identify all shared data
2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses)
3. Use interprocedural data flow analysis to identify potential consumers; incorporate OpenMP relaxed consistency specifications
4. Create message sets to communicate data between producers and consumers
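As an illustration of step 2, a regular section descriptor annotation might be represented roughly as follows (types and field names are our own, not the compiler's actual data structures):

    /* A regular section descriptor (RSD) summarizing an array access as
     * [lower : upper : stride] per dimension, attached to each shared-array
     * read or write in the program representation. */
    typedef struct {
        long lower, upper, stride;      /* one [l:u:s] triplet per dimension */
    } dim_section;

    typedef struct {
        const char  *array_name;        /* which shared array is accessed */
        int          is_write;          /* producer (write) or consumer (read) */
        int          ndims;
        dim_section  dims[4];           /* section accessed in each dimension */
    } rsd_annotation;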
Message Set Generation
V1: For every write, determine all future reads.

RSD flow graph (write at vertex V1, followed by later accesses to array A):

    V1: <write, A, 1, l1(p), u1(p)>
         …
        <read,  A, 1, l2(p), u2(p)>        <read, A, 1, l4(p), u4(p)>
        <write, A, 1, l3(p), u3(p)>
        <read,  A, 1, l5(p), u5(p)>
         …

Message set at RSD vertex V1, for array A, from process p to process q:

    SApq = elements of A with subscripts in the set
           {[l1(p),u1(p)] ∩ [l2(q),u2(q)]}
         ∪ {[l1(p),u1(p)] ∩ [l4(q),u4(q)]}
         …
         ∪ ([l1(p),u1(p)] ∩ {[l5(q),u5(q)] − [l3(p),u3(p)]})
Baseline Translation of Irregular Accesses

Irregular accesses: A[B[i]], A[f(i)]
- Reads: assume the whole array is accessed
- Writes: inspect at runtime, communicate at the end of the parallel loop

We can often do better than "conservative":
- Monotonic array values => sharpen access regions
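For example, a monotonicity-based refinement can replace the whole-array estimate with a single interval, as in this sketch (assuming the subscript array B is monotonically non-decreasing; names are illustrative):

    /* Region of A touched by A[B[i]] over the iteration block [lb, ub):
     * sharpened from "all of A" to the interval [B[lb], B[ub-1]]. */
    typedef struct { int lo, hi; } region_t;

    region_t sharpened_region(const int *B, int lb, int ub) {
        region_t r = { B[lb], B[ub - 1] };   /* valid only if B is monotonic */
        return r;
    }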
Optimizations based on Collective Communication

- Recognition of reduction idioms
  - Translate to MPI_Reduce / MPI_Allreduce functions.
- Casting sends/receives in terms of all-to-all calls
  - Beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers.
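A sketch of the reduction-idiom translation for a sum reduction (the block bounds and function name are illustrative; the use of MPI_Allreduce in place of point-to-point messages is the point):

    #include <mpi.h>

    /* OpenMP: #pragma omp parallel for reduction(+:sum) over x[0..N-1] */
    double translated_reduction(const double *x, int N, int pid, int nproc) {
        int chunk = N / nproc, lb = pid * chunk;
        int ub = (pid == nproc - 1) ? N : lb + chunk;

        double local = 0.0, global = 0.0;
        for (int i = lb; i < ub; i++)     /* local partial sum */
            local += x[i];

        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;                    /* every process gets the reduced value */
    }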
Performance of the Baseline
OpenMP to MPI Translation
Platform II – Sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.
Can We Do More for Irregular Applications?
    L1 : #pragma omp parallel for
         for(i=0;i<10;i++)
            A[i] = ...

    L2 : #pragma omp parallel for
         for(j=0;j<20;j++)
            B[j] = A[C[j]] + ...

[Figure: array A is produced partly by process 1 and partly by process 2; the subscript array C gives accesses 1, 3, 5, 0, 2, … on process 1 and 2, 4, 8, 1, 2, … on process 2.]

- Subscripts of accesses to shared arrays are not always analyzable at compile time.
- Baseline OpenMP to MPI translation: conservatively estimate that each process accesses the entire array.
- Try to deduce properties such as monotonicity for the irregular subscript to refine the estimate.
- Still, there may be redundant communication.
- Runtime tests (inspection) are needed to resolve accesses.
Inspection

- Inspection allows accesses to be differentiated (at runtime) into local and non-local accesses.
- Inspection can also map iterations to accesses. This mapping can then be used to reorder iterations so that iterations with the same data source are grouped together.
  - Communication of remote data can thus be overlapped with the computation of iterations that access local data (or data already received).
[Figure: iterations i = 0, 1, 2, 3, 4, … on process 1 and i = 10, 11, 12, 13, 14, … on process 2 access A[C[i]] with C[i] = 1, 3, 5, 0, 2, … and 2, 5, 8, 1, 2, …; after reordering, the iterations run as 3, 0, 4, 1, 2, … and 11, 12, 13, 10, 14, …, so that accesses (0, 1, 2, 3, 5, … and 5, 8, 1, 2, 2, …) with the same data source are grouped together.]
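A sketch of such an inspector (our illustration, not the authors' implementation): it assumes array A was produced with a block distribution, classifies each iteration of the irregular loop by the rank that owns A[C[j]], and emits locally satisfiable iterations first, followed by iterations grouped by remote source.

    #include <stdlib.h>

    /* Owner of element A[idx] under a block distribution of A[0..N-1]. */
    static int owner(int idx, int N, int nproc) {
        int chunk = N / nproc;                   /* assumes N % nproc == 0 */
        int p = idx / chunk;
        return (p < nproc) ? p : nproc - 1;
    }

    /* Returns a malloc'ed permutation of the M iterations of the irregular
     * loop  B[j] = A[C[j]] + ... : local iterations first, then iterations
     * grouped by the remote process that produced the A element they read. */
    int *inspect_and_reorder(const int *C, int M, int N, int pid, int nproc) {
        int *order = malloc(M * sizeof(int));
        int pos = 0;

        for (int j = 0; j < M; j++)              /* locally satisfiable */
            if (owner(C[j], N, nproc) == pid)
                order[pos++] = j;

        for (int src = 0; src < nproc; src++) {  /* one group per remote source */
            if (src == pid) continue;
            for (int j = 0; j < M; j++)
                if (owner(C[j], N, nproc) == src)
                    order[pos++] = j;
        }
        return order;
    }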
Loop Restructuring

- Simple iteration reordering may not be sufficient to expose the full set of possibilities for computation-communication overlap.
- Reordering loop L2 below may still not group together accesses from different sources.

Original loops:

    L1 : #pragma omp parallel for
         for(i=0;i<N;i++)
            p[i] = x[i] + alpha*r[i] ;

    L2 : #pragma omp parallel for
         for(j=0;j<N;j++) {
            w[j] = 0 ;
            for(k=rowstr[j];k<rowstr[j+1];k++)
    S2:       w[j] = w[j] + a[k]*p[col[k]] ;
         }

Distribute loop L2 to form loops L2-1 and L2-2:

    L1 : #pragma omp parallel for
         for(i=0;i<N;i++)
            p[i] = x[i] + alpha*r[i] ;

    L2-1 : #pragma omp parallel for
         for(j=0;j<N;j++) {
            w[j] = 0 ;
         }

    L2-2 : #pragma omp parallel for
         for(j=0;j<N;j++) {
            for(k=rowstr[j];k<rowstr[j+1];k++)
    S2:       w[j] = w[j] + a[k]*p[col[k]] ;
         }
Loop Restructuring contd.
Coalesce nested loop L2-2 to form loop L3. The iterations of L3 can then be reordered to achieve computation-communication overlap; the T[i] data structure is created and filled in by the inspector.

Final restructured and reordered loop:

    L1 : #pragma omp parallel for
         for(i=0;i<N;i++)
            p[i] = x[i] + alpha*r[i] ;

    L2-1 : #pragma omp parallel for
         for(j=0;j<N;j++) {
            w[j] = 0 ;
         }

    L3 : for(i=0;i<num_iter;i++)
            w[T[i].j] = w[T[i].j] + a[T[i].k]*p[T[i].col] ;
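A sketch of how the inspector might build T[] for loop L3 (the {j, k, col} record layout follows the slide; the CSR-style rowstr/col inputs and the final reordering step are our assumptions):

    #include <stdlib.h>

    typedef struct { int j, k, col; } work_item;     /* one instance of S2 */

    /* Builds the T[] records for the coalesced loop
     *   L3: for(i=0;i<num_iter;i++)
     *          w[T[i].j] = w[T[i].j] + a[T[i].k]*p[T[i].col];        */
    work_item *build_T(const int *rowstr, const int *col, int N, int *num_iter) {
        *num_iter = rowstr[N] - rowstr[0];
        work_item *T = malloc(*num_iter * sizeof(work_item));
        int i = 0;
        for (int j = 0; j < N; j++)
            for (int k = rowstr[j]; k < rowstr[j + 1]; k++, i++) {
                T[i].j   = j;
                T[i].k   = k;
                T[i].col = col[k];                   /* element of p read here */
            }
        /* The inspector would then reorder T[] by the producer of p[T[i].col],
           as in the iteration-reordering step above. */
        return T;
    }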
Achieving Actual Overlap of Computation and Communication

Non-blocking send/recv calls may not actually progress concurrently with computation.
- Use a multi-threaded runtime system with separate computation and communication threads; on dual-CPU machines these threads can progress concurrently.
- The compiler extracts the send/recvs, along with the packing/unpacking of message buffers, into a communication thread.
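A minimal sketch of the two-thread organization (our illustration; it assumes MPI was initialized with MPI_THREAD_MULTIPLE, and the buffer names and stub computations are placeholders):

    #include <mpi.h>
    #include <pthread.h>

    #define NREMOTE 1000
    static double send_buf[NREMOTE], recv_buf[NREMOTE];
    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int data_arrived = 0;

    static void compute_local_iterations(void)             { /* local-data iterations */ }
    static void compute_remote_iterations(const double *d) { (void)d; /* needs received data */ }

    /* Communication thread: pack/send own data, then receive the peer's. */
    static void *comm_thread(void *arg) {
        int peer = *(int *)arg;
        MPI_Sendrecv(send_buf, NREMOTE, MPI_DOUBLE, peer, 0,
                     recv_buf, NREMOTE, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        pthread_mutex_lock(&lock);
        data_arrived = 1;                          /* notify the computation thread */
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Computation thread: overlap local work with the communication above. */
    void overlapped_parallel_loop(int peer) {
        pthread_t t;
        data_arrived = 0;
        pthread_create(&t, NULL, comm_thread, &peer);

        compute_local_iterations();                /* overlap happens here */

        pthread_mutex_lock(&lock);                 /* wait for remote data */
        while (!data_arrived)
            pthread_cond_wait(&ready, &lock);
        pthread_mutex_unlock(&lock);

        compute_remote_iterations(recv_buf);       /* uses the received data */
        pthread_join(t, NULL);
    }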
[Figure: program timeline on process p. The computation thread wakes the communication thread and initiates the sends to processes q and r (t_send), then executes the iterations that access local data (t_comp-p). Meanwhile, the communication thread packs data and sends it to processes q and r, then receives data from process q (t_recv-q) and from process r (t_recv-r). The computation thread next waits for the receives from q to complete (t_wait-q), executes the iterations that access data received from q (t_comp-q), waits for the receives from r to complete (t_wait-r), and finally executes the iterations that access data received from r (t_comp-r).]
Performance of Equake

[Figure: execution time (seconds) of equake on 1, 2, 4, 8, and 16 nodes for hand-coded MPI, baseline (no inspection), inspection (no reordering), and inspection and reordering.]

Computation-communication overlap in Equake

[Figure: time (seconds) on 1, 2, 4, 8, and 16 processors, broken down into actual time spent in send/recv, computation available for overlapping, and actual wait time.]
Performance of Moldyn

[Figure: execution time (seconds) of moldyn on 1, 2, 4, 8, and 16 nodes for hand-coded MPI, baseline, inspector without reordering, and inspection and reordering.]

Computation-communication overlap in Moldyn

[Figure: time (seconds) on 1, 2, 4, 8, and 16 nodes, broken down into time spent in send/recv, computation available for overlapping, and actual wait time.]
Performance of CG

[Figure: execution time (seconds) of CG on 1, 2, 4, 8, and 16 nodes for NPB-2.3-MPI, baseline translation, inspector without reordering, and inspector with iteration reordering.]

Computation-communication overlap in CG

[Figure: time (seconds) on 1, 2, 4, 8, and 16 nodes, broken down into time spent in send/recv, computation available for overlap, and actual wait time; two bars exceed the axis and are labeled 339 and 169 seconds.]
Conclusions

- There is hope for easier programming models on distributed systems.
- OpenMP can be translated effectively onto distributed-memory systems; we have used benchmarks from
  - SPEC OMP
  - NAS
  - additional irregular codes
- Direct translation of OpenMP to MPI outperforms translation via S-DSM.
  - The S-DSM "fall back" for irregular accesses incurs significant overhead.
- Caveats:
  - Data scalability is an issue.
  - Black-belt programmers will always be able to do better.
  - Advanced compiler technology is involved; there will be performance surprises.