Developing Parallel Programs with Multiprocessor Tasks

Thomas Rauber
Department of Computer Science
University of Bayreuth
Contributors: Gudula Rünger, Sascha Hunold, Robert
Reilein, Matthias Kühnemann, Mattias Korch
June 13, 2006
Outline
1 Introduction
  Motivation
  Multiprocessor Task Programming
2 Example
  Solving ODEs
  Parallel Adams Methods
3 TwoL Compiler Framework
  Two-Level Parallelism
  Implementation
  Extrapolation methods
4 Tlib library
  Tlib API
  Hierarchical Matrix Multiplication
  DRD-Lib – a library for dynamic data redistribution
5 Conclusions
Motivation
Introduction
Many large applications from scientific computing have a modular structure of cooperating subtasks.
Numerical algorithms can often be restructured such that a task structure is generated:
* if the inherent degree of parallelism is large enough, a program-based restructuring is sufficient; the numerical properties are not changed;
* if the inherent degree of parallelism is too small, an algorithmic restructuring has to be performed; the numerical properties may be changed.
Task parallelism can be combined with other types of parallelism; example: mixed task and data parallel execution.
Motivation
Advantages of M-task programming
Realization of task parallelism: multiprocessor tasks (M-tasks);
An M-task program is a collection of M-tasks that can be
executed on an arbitrary number of processors;
Using M-tasks that are concurrently executed and
perform mainly internal communication can significantly
reduce the communication overhead;
main reason: collective communication operations are
executed on subsets of processors
The M-task execution can be adapted to the execution
environment; the number of executing processors, the
execution order (sequentially or concurrently) and the
processor assignment can be adapted;
Motivation
Group-based Broadcast Operation
global communication operations: linear or logarithmic cost functions
[Figure: runtime (in sec) of MPI_Broadcast with 24 processors on the Beowulf cluster CLiC vs. message size (in kbyte); compared are the processor/group configurations 24, 2*12, 3*8, 4*6, 6*4, 8*3, 12*2 (number of groups * group size).]
Multiprocessor Task Programming
Multiprocessor-task programming
An M-task can be executed by an arbitrary number of
processors (malleable task)
Independent tasks can be executed concurrently by
disjoint processor groups
Each M-task can be internally computed in a data-parallel
or SPMD way (or threads)
specific mixed parallel execution
each M-task can be activated at different program points
with a varying number of processors
M-tasks can contain point-to-point-communication and
collective communication operations
Multiprocessor Task Programming
Scheduling and data distribution
static scheduling of M-tasks (compiler framework TwoL):
Advantage: fixed data distribution and redistribution → data distribution costs are known for scheduling
Limitation: restricted class of applications; the M-task structure has to be known in advance
dynamic scheduling of M-tasks (runtime library Tlib):
Advantage: recursive, hierarchical M-task structures; divide&conquer algorithms
But: adaptive redistributions for different data layouts and group structures are required
Solving ODEs
Ordinary Differential Equations
Initial value problem for an ODE system:
y'(t) = f(t, y(t)),   f : ℝ × ℝ^d → ℝ^d,   y(t0) = y0,   y0 ∈ ℝ^d
One-step methods compute approximations y1, y2, ... at the time points t1, t2, ..., starting from y0 at t0.
[Figure: time axis with y0 at t0 and the approximations y1, y2 at t1, t2.]
Sources of parallelism within one time step:
parallelism across the method
parallelism across the system
Solving ODEs
General linear method
Computation of k stage values in time step n at time tn:
Yn = (yn1, ..., ynk)   (stage values; vector of size d·k)
Stage value yni is the approximation of y(tn + ai·h).
Computation in each time step:
Yn+1 = (R ⊗ I)·Yn + h·(S ⊗ I)·F(Yn) + h·(T ⊗ I)·F(Yn+1)
with k×k matrices R, S, T, the d×d identity matrix I, the Kronecker product ⊗, and
F(Yn) = (f(yn1), ..., f(ynk))   (vector of size d·k)
Solving ODEs
Kronecker product
R ⊗ I replaces each entry rij of the k×k matrix R by the d×d block rij · I:

          ( r11 · I  . . .  r1k · I )
R ⊗ I  =  (    .       .        .   )
          ( rk1 · I  . . .  rkk · I )

i.e. a (d·k)×(d·k) matrix built from k×k blocks, each of size d×d.
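The block structure of R ⊗ I is what makes the stage-vector update cheap and parallelizable: block i of (R ⊗ I)·Y is simply the linear combination Σj rij·Yj of the stage vectors, so the (d·k)×(d·k) matrix never has to be formed. A minimal C sketch (the function name and the memory layout — Y stored as k consecutive blocks of length d, R row-major — are our own choices):

```c
/* Blockwise evaluation of Z = (R ⊗ I) Y: since R ⊗ I consists of the
   d x d blocks r_ij * I, block i of the result is the combination
   Z_i = sum_j r[i][j] * Y_j of the stage vectors. */
void kron_apply(int k, int d, const double *R, const double *Y, double *Z) {
    for (int i = 0; i < k; i++)
        for (int c = 0; c < d; c++) {        /* component c of result block i */
            double acc = 0.0;
            for (int j = 0; j < k; j++)
                acc += R[i * k + j] * Y[j * d + c];
            Z[i * d + c] = acc;
        }
}
```

This costs O(k²·d) operations instead of O(k²·d²) for a dense matrix-vector product with the expanded matrix.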
Parallel Adams Methods
Parallel Adams Bashforth (PAB)
v.d.Houwen, Messina 99
T = 0 (explicit method)
R = e · ek^T,  e = (1, ..., 1),  ek = (0, ..., 0, 1)
S = Va·Wb^(−1),  S = (sij), i, j = 1, ..., k
abscissa vector a = (a1, ..., ak): Lobatto points
explicit method (DOPRI8) to compute Y0 = (y01, ..., y0k)
Computation in each time step:

yn+1,i = yn,k + h · Σ_{l=1}^{k} sil · f(ynl),   i = 1, ..., k

(yn,k is the k-th stage value; the f(ynl) are the function values of the previous stage values)
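For the scalar case d = 1 the PAB update can be written in a few lines of C (a sketch; the function name, the row-major layout of S, and the restriction to d = 1 are our own choices — for d > 1 each stage value becomes a vector of length d):

```c
/* One PAB step for a scalar ODE (d = 1): given the k stage values
   yn[0..k-1] of time step n, compute the stage values of step n+1 via
   y_{n+1,i} = y_{n,k} + h * sum_{l=1..k} s_{il} * f(y_{n,l}).
   S is passed row-major as a k*k array. */
void pab_step(int k, double h, const double *S,
              const double *yn, double (*f)(double), double *ynext) {
    double fy[k];                    /* function values of the stage values */
    for (int l = 0; l < k; l++)
        fy[l] = f(yn[l]);            /* each f(y_{n,l}) is needed for all i */
    for (int i = 0; i < k; i++) {
        double acc = yn[k - 1];      /* k-th stage value of the previous step */
        for (int l = 0; l < k; l++)
            acc += h * S[i * k + l] * fy[l];
        ynext[i] = acc;
    }
}

/* sample right-hand side f(y) = y, used only for illustration */
double f_id(double y) { return y; }
```

The k updates are independent once the fy[l] are available, which is what allows an M-task organization with one processor group per stage value.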
Parallel Adams Methods
[Figure sequence: PAB with k = 3 stage values. DOPRI8 provides the initial approximation vectors y01, y02, y03 at t0; in each time step of size h the function values f(yn1), f(yn2), f(yn3) are evaluated and combined into the approximation vectors of the next time step.]
Parallel Adams Methods
Parallel Adams Moulton (PAM)
T diagonal matrix (implicit method) with specific values δi, i = 1, ..., k
R = e · ek^T
S = (Va + R·Vb − T·Wa)·Wb^(−1)   (see v.d.Houwen, Messina)
a = (a1, ..., ak): Lobatto points
Implicit equation (vni contains yni):

yn+1,i − h·δi·f(yn+1,i) = vni,   i = 1, ..., k

solved by fixed point iteration:

yn+1,i^(j) − h·δi·f(yn+1,i^(j−1)) = vni,   i = 1, ..., k
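For a scalar component (d = 1) the fixed point iteration can be sketched in C as follows (the names, the tolerance-based termination, and the scalar restriction are our own choices; in the combined PABM method the PAB result would serve as starting value y0):

```c
#include <math.h>

/* Solve  y - h*delta*f(y) = v  for one stage value by the fixed point
   iteration  y^(j) = v + h*delta*f(y^(j-1)); converges when
   h*delta*|f'| < 1. Returns the last iterate. */
double pam_fixpoint(double v, double h, double delta,
                    double (*f)(double), double y0,
                    double tol, int maxit) {
    double y = y0;
    for (int j = 0; j < maxit; j++) {
        double ynew = v + h * delta * f(y);    /* one fixed point step */
        if (fabs(ynew - y) < tol)
            return ynew;                       /* converged */
        y = ynew;
    }
    return y;
}

/* sample right-hand side f(y) = y, for illustration only */
static double f_lin(double y) { return y; }
```

For f(y) = y the exact solution of y − h·δ·y = v is v/(1 − h·δ), which the iteration approaches geometrically with factor h·δ.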
Parallel Adams Methods
Parallel Adams Bashforth Moulton (PABM), k = 3 stage values
The stage values yni, i = 1, ..., k, computed by the PAB method are used as starting vectors for the fixed point iteration of the PABM method.
An additional vector vni holds the constant part of the right-hand side of the fixed point iteration.
vni and yni are computed similarly, but with different matrices S.
Parallel Adams Methods
[Figure sequence: PAM with k = 3 stage values. DOPRI8 provides the initial approximation vectors y01, y02, y03; in each time step the function values f(yni) and the constant vectors v are computed, and the stage values of the next time step are obtained by fixed point iterations.]
Parallel Adams Methods
Experiments k = 8 [Europar 04]
[Figure: time per step in seconds of the PABM method with the Brusselator function for k = 8 on a Cray T3E-1200; data parallel, task parallel, and task parallel orthogonal versions for 8, 16, 32, 48, 64, 96, and 128 processors.]
Two-Level Parallelism
Compiler framework TwoL (Two-Level Parallelism)
Mixed task and data parallelism:
application algorithm → specification program → (transformation) → coordination program → (translation) → parallel target program
specification program: maximum degree of parallelism
  composed modules → medium grain task parallelism (coordination language)
  basic modules → fine grain data parallelism
coordination program: exploited degree of parallelism
Two-Level Parallelism
Interaction between task and data parallel level
[Figure: design steps. The non-executable specification program (calls of basic modules with input-output relations) is transformed on the task parallel level into a coordination program with group information and data distribution; together with the parallel basic modules of the data parallel level (task parallel implementations called as functions) it is translated into an executable program.]
Two-Level Parallelism
Task graph
dependencies between module activations:
nodes: activations of basic modules (SPMD computations)
edges: data dependencies between data parallel modules (communication)
independent modules can be executed concurrently
difference to normal flow graphs: equivalent representation as module tree
Two-Level Parallelism
Design decisions
load balance – sizes of processor subgroups
communication – internal collective communication operations
data redistribution between data parallel modules
→ multiprocessor task scheduling with dependencies between data parallel modules
cost model: functions of application and machine specific parameters
TIME = Fparallel(n, p, machine parameters)
for the execution time, communication time, and data redistribution time
Two-Level Parallelism
Overview of the transformation steps
[Figure: the module specification with its module dependence graph (modules 1–5) is scheduled (SDS), yielding a module specification with scheduling and a module execution graph; group sizes are determined (GDS), yielding a module specification with scheduling and group sizes and, finally, a module execution plan (processors vs. time).]
Extrapolation methods
Extrapolation methods for ODEs
one-step method for solving ODEs: in each time step, r different approximations with different stepsizes h0, ..., h(r−1) are computed and combined to an approximation of higher order (the generating method builds up the extrapolation table after an initialization)
possible task organizations:
• data parallel execution
• linear group partitioning: use r processor groups and adapt the size of the groups to the computational effort; group i contains gi = p · (i+1) / (1/2 · r · (r+1)) processors
• extended group partitioning: use ⌈r/2⌉ groups; group i executes the computations for stepsizes hi and h(r−i)
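Under our reading of the partly garbled formula above — group i gets a share of the p processors proportional to the weight i+1, normalized by 1 + 2 + ... + r = r(r+1)/2 — the linear group sizes can be computed as follows (a sketch; the function name and the rounding strategy, with the last group absorbing the integer remainder, are our own choices):

```c
/* Linear group partitioning for the extrapolation method: group i
   (i = 0,...,r-1) gets roughly g_i = p*(i+1)/(r*(r+1)/2) processors,
   so that the shares sum to p. */
void linear_partition(int p, int r, int g[]) {
    int total = r * (r + 1) / 2;       /* sum of the weights 1 + 2 + ... + r */
    int assigned = 0;
    for (int i = 0; i < r - 1; i++) {
        g[i] = p * (i + 1) / total;    /* integer share for group i */
        if (g[i] < 1) g[i] = 1;        /* every group needs one processor */
        assigned += g[i];
    }
    g[r - 1] = p - assigned;           /* last group takes the remainder */
}
```

For p = 12 and r = 3 this yields groups of 2, 4, and 6 processors, matching the 1:2:3 work ratio of the stepsizes.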
Extrapolation methods
Extrapolation method on the Cray T3E
runtime per time step for a sparse non-stiff ODE system; p = 4 and p = 8 processors of a Cray T3E
[Figure: runtimes for the extrapolation method, r = 4; runtime in seconds vs. ODE system size (242–13122) for the consecutive, linear, and extended group partitionings with p = 4 and p = 8.]
Extrapolation methods
Extrapolation method on the Cray T3E — cont.
runtime per time step for a sparse non-stiff ODE system; p = 16 and p = 32 processors of a Cray T3E
[Figure: runtimes for the extrapolation method, r = 4; runtime in seconds vs. ODE system size (242–13122) for the consecutive, linear, and extended group partitionings with p = 16 and p = 32.]
Tlib API
Tlib library – Basic steps
The programmer has to identify the M-tasks; an automatic identification by a compiler tool is difficult.
The runtime library Tlib allows the realization of hierarchically structured M-task programs [SC02, JPDC05].
The execution order of the M-tasks and the group splittings are specified statically by the programmer; the number of executing processors and the recursion control can be adapted automatically.
Outlook: dynamic adaptation of the execution order during M-task execution.
Tlib API
Tlib API
Tlib is based on MPI and allows the use of arbitrary MPI
functions.
The Tlib API provides support for
* the creation and administration of a dynamic hierarchy
of processor groups;
* the coordination and mapping of nested M-tasks;
* the handling and termination of recursive calls and group
splittings;
* the organization of communication between M-tasks;
The splitting and mapping can be adapted to the execution
platform.
Tlib API
Interface of the Tlib library
Type of all Tlib tasks (basic and coordination):
void *F(void *arg, MPI_Comm comm, T_Descr *pdescr)
• void *arg: packed arguments for F
• comm: MPI communicator for internal communication
• pdescr: pointer to a descriptor of the executing processor group
The function F may contain calls of Tlib functions to create sub-groups and to map sub-computations to sub-groups.
Initialization of the Tlib library:
int T_Init(int argc, char *argv[], MPI_Comm comm, T_Descr *pdescr)
• argc, argv: arguments of main()
• comm: initial MPI communicator
• pdescr: returns the initial group descriptor
Tlib API
Splitting operations for processor groups
Group subdivision into two subgroups:
int T_SplitGrp(T_Descr *pdescr, T_Descr *pdescr1, float per1, float per2)
• pdescr: current group descriptor
• pdescr1: new group descriptor for the two new subgroups
• per1: relative size of the first subgroup
• per2: relative size of the second subgroup
Group subdivision into n subgroups with relative sizes p[0], ..., p[n−1]:
int T_SplitGrpParfor(int n, T_Descr *pdescr, T_Descr *pdescr1, float p[])
Explicit group subdivision by color and key:
int T_SplitGrpExpl(T_Descr *pdescr, T_Descr *pdescr1, int color, int key)
Tlib API
Concurrent execution of M-tasks
Mapping of two concurrent M-tasks to processor groups:

int T_Par(void *(*f1)(void *, MPI_Comm, T_Descr *),
          void *parg1, void *pres1,
          void *(*f2)(void *, MPI_Comm, T_Descr *),
          void *parg2, void *pres2,
          T_Descr *pdescr)

• f1: function for the first M-task
• parg1, pres1: packed arguments and results for f1
• f2: function for the second M-task
• parg2, pres2: packed arguments and results for f2
• pdescr: pointer to the current group descriptor describing two groups
Tlib API
Concurrent execution of M-tasks
Mapping of more than two independent M-tasks to processor groups:

int T_Parfor(void *(*f[])(void *, MPI_Comm, T_Descr *),
             void *parg[], void *pres[],
             T_Descr *pdescr)

• f: array of n pointers to functions for M-tasks
• pdescr: descriptor of a group partition into n disjoint groups
This corresponds to the execution of a parallel loop with n independent iterates.
Tlib API
Programming example using Tlib
M-task hierarchy: after the SPMD task F1, the M-task FH (which internally runs F2, then F4 and F5 concurrently, then F6) is executed concurrently with F3; finally F7 runs on the whole group.

main(int argc, char *argv[]) {
  T_Descr descr, descr1;
  void *arg_f1, *arg_fh, *arg_f3, *arg_f7;
  void *res_fh, *res_f3;
  T_Init(argc, argv, MPI_COMM_WORLD, &descr);
  /* pack argument arg_f1 */
  F1(arg_f1, MPI_COMM_WORLD, &descr);
  T_SplitGrp(&descr, &descr1, 0.7, 0.3);
  /* pack argument arg_fh in Ga */
  /* pack argument arg_f3 in Gb */
  T_Par(FH, &arg_fh, &res_fh,
        F3, &arg_f3, &res_f3, &descr1);
  /* possible redistribution of data from Ga and Gb */
  /* to G = Ga ∪ Gb */
  F7(arg_f7, MPI_COMM_WORLD, &descr);
}
Tlib API
Programming example using Tlib -2-

void *FH(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  T_Descr descr2;
  void *arg_f2, *arg_f4, *arg_f5, *arg_f6;
  void *res_f2, *res_f4, *res_f5, *res_f6, *res_fh;
  res_f2 = F2(arg_f2, comm, pdescr);
  T_SplitGrp(pdescr, &descr2, 0.5, 0.5);
  /* pack argument arg_f4 in Gc */
  /* pack argument arg_f5 in Gd */
  T_Par(F4, &arg_f4, &res_f4,
        F5, &arg_f5, &res_f5, &descr2);
  /* possible redistribution of data from Gc and Gd */
  /* to Ga = Gc ∪ Gd */
  res_f6 = F6(arg_f6, comm, pdescr);
  /* build result of FH in res_fh */
  return res_fh;
}
Hierarchical Matrix Multiplication
Fast Multilevel Hierarchical Matrix Multiplication
A combination of different algorithms for matrix multiplication can lead to very competitive implementations: Strassen method, tpMM, Atlas [ICCS'04, ICS'04]
Illustration of the algorithm hierarchy:
top level: Strassen
intermediate level: pdgemm, tpMM, ring
low level (dgemm): BLAS, ESSL, ATLAS
Hierarchical Matrix Multiplication
Recursive splitting – Strassen algorithm
matrix multiplication C = AB with A, B, C ∈ ℝ^(n×n):
decompose A, B, C into square blocks of size n/2:

( C11 C12 )   ( A11 A12 )   ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) · ( B21 B22 )

compute the submatrices C11, C12, C21, C22 separately according to
C11 = Q1 + Q4 − Q5 + Q7
C21 = Q2 + Q4
C12 = Q3 + Q5
C22 = Q1 + Q3 − Q2 + Q6
compute Q1, ..., Q7 by recursive calls:
Q1 = strassen(A11 + A22, B11 + B22)
Q2 = strassen(A21 + A22, B11)
Q3 = strassen(A11, B12 − B22)
Q4 = strassen(A22, B21 − B11)
Q5 = strassen(A11 + A12, B22)
Q6 = strassen(A21 − A11, B11 + B12)
Q7 = strassen(A12 − A22, B21 + B22)
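For n = 2 the blocks are scalars and the recursion bottoms out in seven scalar multiplications; writing this base case out in C makes it easy to check the combination formulas above:

```c
/* Strassen's scheme for the base case n = 2: the blocks A11,...,B22
   are scalars and the seven recursive calls reduce to seven scalar
   multiplications q1,...,q7. */
void strassen2x2(const double A[2][2], const double B[2][2], double C[2][2]) {
    double q1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double q2 = (A[1][0] + A[1][1]) * B[0][0];
    double q3 = A[0][0] * (B[0][1] - B[1][1]);
    double q4 = A[1][1] * (B[1][0] - B[0][0]);
    double q5 = (A[0][0] + A[0][1]) * B[1][1];
    double q6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double q7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] = q1 + q4 - q5 + q7;   /* C11 */
    C[0][1] = q3 + q5;             /* C12 */
    C[1][0] = q2 + q4;             /* C21 */
    C[1][1] = q1 + q3 - q2 + q6;   /* C22 */
}
```

Seven multiplications instead of eight per level is what yields the O(n^log2(7)) complexity when the scheme is applied recursively.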
Hierarchical Matrix Multiplication
Recursive splitting – Strassen algorithm
organization into four tasks Task_C11(), Task_C12(), Task_C21(), Task_C22():

Task_C11 (group 0): compute Q1; compute Q7; receive Q5; receive Q4; compute C11
Task_C12 (group 1): compute Q3; compute Q5; send Q5; send Q3; compute C12
Task_C21 (group 2): compute Q2; compute Q4; send Q2; send Q4; compute C21
Task_C22 (group 3): compute Q1; compute Q6; receive Q2; receive Q3; compute C22

realization with Tlib: Task_C11() and Task_q1()
Hierarchical Matrix Multiplication
Strassen algorithm – main coordination function

void *Strassen(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  T_Descr descr1;
  void *(*f[4])(), *pres[4], *parg[4];
  float per[4];
  per[0] = per[1] = per[2] = per[3] = 0.25;
  T_SplitGrpParfor(4, pdescr, &descr1, per);
  f[0] = Task_C11; f[1] = Task_C12;
  f[2] = Task_C21; f[3] = Task_C22;
  parg[0] = parg[1] = parg[2] = parg[3] = arg;
  T_Parfor(f, parg, pres, &descr1);
  assemble_matrix(pres[0], pres[1], pres[2], pres[3]);
}
Hierarchical Matrix Multiplication
Strassen algorithm – Task_C11()

void *Task_C11(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  double **q1, **q4, **q5, **q7, **res;
  int i, j, n2;
  /* extract arguments from arg including matrix size n */
  n2 = n/2;
  q1 = Task_q1(arg, comm, pdescr);
  q7 = Task_q7(arg, comm, pdescr);
  /* allocate res, q4 and q5 as matrices of size n/2 times n/2 */
  /* receive q4 from group 2 and q5 from group 1 using parent comm. */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++)
      res[i][j] = q1[i][j] + q4[i][j] - q5[i][j] + q7[i][j];
  return res;
}
Hierarchical Matrix Multiplication
Strassen algorithm – Task_q1()

void *Task_q1(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  double **q11, **q12, **q1;
  struct MM_mul *mm, *arg1;
  int i, j, n2;
  mm = (struct MM_mul *) arg;
  n2 = (*mm).n / 2;
  /* allocate q1, q11, q12 as matrices of size n2 times n2 */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++) {
      q11[i][j] = (*mm).a[i][j] + (*mm).a[i+n2][j+n2]; /* A11+A22 */
      q12[i][j] = (*mm).b[i][j] + (*mm).b[i+n2][j+n2]; /* B11+B22 */
    }
  arg1 = (struct MM_mul *) malloc(sizeof(struct MM_mul));
  (*arg1).a = q11; (*arg1).b = q12; (*arg1).n = n2;
  q1 = Strassen(arg1, comm, pdescr);
  return q1;
}
Hierarchical Matrix Multiplication
New hierarchical algorithm
intermediate level: new algorithm tpMM
[Figure: computation order of tpMM for 8 processors P0–P7; the processors exchange their operands in a hierarchically organized sequence of pairwise steps.]
bottom level: Atlas for one-processor matrix multiplication
Hierarchical Matrix Multiplication
Experimental evaluation: IBM Regatta p690
MFLOPS for different algorithmic combinations [ICS'04]:
[Figure: MFLOPS per processor vs. matrix size (2048–12288) for 32 and 64 processors; compared are pdgemm_tester_64, strassen_pdgemm_64 with recursion depths r1–r3, strassen_ring_64, and strassen_tpmm_plug_64.]
Hierarchical Matrix Multiplication
Experimental evaluation: Dual Xeon Cluster 3 GHz
MFLOPS for different algorithmic combinations:
[Figure: MFLOPS per processor vs. matrix size (2048–10240) for 16 and 32 processors; compared are pdgemm_tester, strassen_pdgemm with recursion depths r1 and r2, strassen_ring, and strassen_tpmm_plug.]
DRD-Lib – a library for dynamic data redistribution
DRD-Lib – a library for data redistribution between M-tasks
distributed array-based data structures in an M-task parallel context
data distribution format
set of data redistribution operations for M-task programming (especially Tlib)
→ dynamic character of the redistribution: M-task coordination and processor group layout are not known in advance
information about data location is required:
distributed bookkeeping about the location of distributed data, or
a mechanism to determine data distribution information only when required
DRD-Lib – a library for dynamic data redistribution
DRD-Lib Format
distributed, opaque data distribution object
information contained:
data distribution type (e.g. block-cyclic)
parameters for the data distribution
specifics about storage (e.g. ghost cells)
Tlib group descriptor – information about the specific processor group
size of the processor group
Interface for generating a distributed structure – runtime routines called collectively by the processor group executing an M-task
DRD-Lib – a library for dynamic data redistribution
DRD-Lib Re-Distributions
Cases handled for source/target data distribution types T1, T2 and processor groups G1, G2:

data distribution type    processor groups
T1 = T2                   G1 = G2
T1 = T2                   G1 ⊂ G2
T1 = T2                   G1 ⊃ G2
T1 = T2                   G1 ∩ G2 = ∅
T1 ≠ T2                   G1 = G2
T1 ≠ T2                   G1 ⊂ G2
T1 ≠ T2                   G1 ⊃ G2
T1 ≠ T2                   G1 ∩ G2 = ∅

automatic detection of the group situation and adaptive redistribution (with a distributed algorithm executed in SPMD)
[IPDPS workshop APDPM 2005]
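The detection step can be illustrated without MPI: given the two groups as rank arrays, the size of their intersection already distinguishes the cases of the table (a sketch; the enum and function names are our own, and a real implementation would work on the Tlib group descriptors):

```c
/* Classify the relation between two processor groups, given as rank
   arrays, into the cases of the redistribution table: equal, G1 subset
   of G2, G1 superset of G2, disjoint, or other overlap. */
enum grp_rel { GRP_EQUAL, GRP_SUBSET, GRP_SUPERSET, GRP_DISJOINT, GRP_OVERLAP };

static int contains(const int *g, int n, int r) {
    for (int i = 0; i < n; i++)
        if (g[i] == r) return 1;
    return 0;
}

enum grp_rel group_relation(const int *g1, int n1, const int *g2, int n2) {
    int common = 0;
    for (int i = 0; i < n1; i++)
        common += contains(g2, n2, g1[i]);          /* |G1 ∩ G2| */
    if (common == 0) return GRP_DISJOINT;
    if (common == n1 && common == n2) return GRP_EQUAL;
    if (common == n1) return GRP_SUBSET;            /* G1 ⊂ G2 */
    if (common == n2) return GRP_SUPERSET;          /* G1 ⊃ G2 */
    return GRP_OVERLAP;
}
```

Which of the eight table cases applies then follows from this group relation together with a comparison of the two distribution types T1 and T2.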
DRD-Lib – a library for dynamic data redistribution
Percentage of the communication overhead
[Figure: percentage of communication (communication plus administration vs. pure communication) for vector sizes from 10000 to 40000000; Opteron cluster with Gigabit Ethernet for the PVA example and p = 4 processors.]
DRD-Lib – a library for dynamic data redistribution
Communication and administration overhead
[Figure: execution time in seconds of the DLIB operations (pure communication vs. communication + administration) for 4, 8, 16, and 32 processors; Opteron cluster with Gigabit Ethernet for the PVA example, 10000000 elements.]
Conclusion and Future Work
Scalability improvement through multiprocessor task programming
scheduling of M-tasks (or malleable tasks) with dependencies: static scheduling with a scheduling algorithm or dynamic scheduling
Future work: a dynamic scheduling module for the automatic re-organization of the task assignment to processor groups → application to grid environments
further information: ai2.uni-bayreuth.de, rauber@uni-bayreuth.de