Developing Parallel Programs with Multiprocessor Tasks

Thomas Rauber
Department of Computer Science, University of Bayreuth
Contributors: Gudula Rünger, Sascha Hunold, Robert Reilein, Matthias Kühnemann, Matthias Korch
June 13, 2006

Outline
1 Introduction: Motivation; Multiprocessor Task Programming
2 Example: Solving ODEs; Parallel Adams Methods
3 TwoL Compiler Framework: Two-Level Parallelism; Implementation; Extrapolation methods
4 Tlib library: Tlib API; Hierarchical Matrix Multiplication; DRD-Lib – a library for dynamic data redistribution
5 Conclusions

Motivation

Many large applications from scientific computing have a modular structure of cooperating subtasks. Numerical algorithms can often be restructured so that a task structure is generated:
* if the inherent degree of parallelism is large enough, a program-based restructuring is sufficient; the numerical properties are not changed;
* if the inherent degree of parallelism is too small, an algorithmic restructuring has to be performed; the numerical properties may be changed.
Task parallelism can be combined with other types of parallelism, for example in a mixed task and data parallel execution.

Advantages of M-task programming

* Task parallelism is realized by multiprocessor tasks (M-tasks); an M-task program is a collection of M-tasks, each of which can be executed on an arbitrary number of processors.
* Using M-tasks that are executed concurrently and perform mainly internal communication can significantly reduce the communication overhead; the main reason is that collective communication operations are executed on subsets of the processors.
* The M-task execution can be adapted to the execution environment: the number of executing processors, the execution order (sequential or concurrent), and the processor assignment can all be adapted.

Group-based broadcast operation

Global communication operations have linear or logarithmic cost functions. [Figure: runtime (in sec, up to about 0.18) of MPI broadcast with 24 processors on the Beowulf cluster CLiC for message sizes of 20–100 KByte, comparing the group configurations 24, 2*12, 3*8, 4*6, 6*4, 8*3, and 12*2 (group number * group size).]

Multiprocessor-task programming

* An M-task can be executed by an arbitrary number of processors (malleable task).
* Independent M-tasks can be executed concurrently by disjoint processor groups.
* Each M-task can internally be computed in a data-parallel or SPMD way (or with threads), giving a specific mixed parallel execution; each M-task can be activated at different program points with a varying number of processors.
* M-tasks can contain point-to-point and collective communication operations.

Scheduling and data distribution

Static scheduling of
M-tasks (compiler framework TwoL):
* Advantage: fixed data distribution and redistribution, so the data distribution costs are known for scheduling.
* Limited class of applications: the M-task structure has to be known in advance.
Dynamic scheduling of M-tasks (runtime library Tlib):
* Advantage: recursive, hierarchical M-task structures; divide-and-conquer algorithms.
* But: adaptive redistribution for different data layouts and group structures is required.

Solving ODEs: ordinary differential equations

Initial value problem:
  y'(t) = f(t, y(t)),  f : R × R^d → R^d,  y(t_0) = y_0,  y_0 ∈ R^d.
One-step methods compute approximations y_1, y_2, ... at the times t_1, t_2, ... Sources of parallelism within one time step: parallelism across the method and parallelism across the system.

General linear methods

Computation of k stage values in time step n at time t_n:
  Y_n = (y_{n1}, ..., y_{nk})   (stage value vector of size d·k),
where the stage value y_{ni} is the approximation of y(t_n + a_i h). Computation in each time step:
  Y_{n+1} = (R ⊗ I) Y_n + h (S ⊗ I) F(Y_n) + h (T ⊗ I) F(Y_{n+1})
with k×k matrices R, S, T, the d×d identity matrix I, the Kronecker product ⊗, and F(Y_n) = (f(y_{n1}), ..., f(y_{nk})) of size d·k.

Kronecker product: for R = (r_{ij}) of size k×k,

  R ⊗ I = ( r_{11} I  ...  r_{1k} I )
          (   ...           ...    )
          ( r_{k1} I  ...  r_{kk} I )

i.e., a (d·k)×(d·k) matrix in which every entry r_{ij} is replaced by the d×d block r_{ij}·I.

Parallel Adams-Bashforth (PAB) [v.d.Houwen, Messina 99]

* T = 0 (explicit method);
* R = e · e_k^T with e = (1, ..., 1) and e_k = (0, ..., 0, 1);
* S = V_a W_b^{-1}, S = (s_{il}), i, l = 1, ..., k;
* abscissa vector a = (a_1, ..., a_k): Lobatto points;
* an explicit method (DOPRI8) computes the initial stage values Y_0 = (y_{01}, ..., y_{0k}).

Computation of the new stage values:
  y_{n+1,i} = y_{nk} + h Σ_{l=1}^{k} s_{il} f(y_{nl}),   i = 1, ..., k,
where y_{nk} is the k-th stage value and the f(y_{nl}) are evaluated at the previous stage values.

[Figure (animation): PAB time stepping with k = 3 stage values. DOPRI8 provides the initial approximation vectors y_{01}, y_{02}, y_{03}; in each time step of size h, the function values f(y_{n1}), f(y_{n2}), f(y_{n3}) are computed and combined into the approximation vectors y_{n+1,1}, y_{n+1,2}, y_{n+1,3} of the next time step along the simulated time axis t_0, t_1, t_2.]
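The PAB computation scheme above (each new stage value built from the k-th previous stage value and the function values of all previous stage values) can be sketched for a scalar ODE (d = 1). This is an illustrative sketch only: the function name pab_step, the right-hand side f, and the identity coefficient matrix used in the comment are placeholders, not the actual coefficients S = V_a W_b^{-1} of the method.

```c
#include <assert.h>
#include <math.h>

#define K 3  /* number of stage values */

/* Right-hand side of the ODE y'(t) = f(y); d = 1 for simplicity. */
static double f(double y) { return -y; }

/* One PAB time step for a scalar ODE:
   y_new[i] = y[K-1] + h * sum_l S[i][l] * f(y[l]).
   S stands for the coefficient matrix of the method; any k x k
   matrix (e.g. the identity) can be passed for illustration. */
static void pab_step(double S[K][K], const double y[K],
                     double h, double y_new[K]) {
  double fy[K];
  for (int l = 0; l < K; l++)
    fy[l] = f(y[l]);              /* parallelism across the system: each
                                     f(y[l]) can be computed by a group */
  for (int i = 0; i < K; i++) {   /* parallelism across the method: the K
                                     new stage values are independent */
    double s = 0.0;
    for (int l = 0; l < K; l++)
      s += S[i][l] * fy[l];
    y_new[i] = y[K - 1] + h * s;  /* R = e * e_k^T selects the k-th stage */
  }
}
```

In the M-task view, both loops are sources of concurrency: the k function evaluations can run on disjoint processor groups (task parallel), and each evaluation can itself be data parallel over the d components of the ODE system.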
Parallel Adams-Moulton (PAM)

* T is a diagonal matrix with specific values δ_i, i = 1, ..., k (implicit method);
* R = e · e_k^T;
* S = (V_a + R V_b − T W_a) W_b^{-1} (see v.d.Houwen, Messina);
* abscissa vector a = (a_1, ..., a_k): Lobatto points.

In each time step the implicit equations
  y_{n+1,i} − h δ_i f(y_{n+1,i}) = v_{ni},   i = 1, ..., k,
have to be solved, where v_{ni} contains y_{ni}. They are solved by the fixed point iteration
  y_{n+1,i}^{(j)} = v_{ni} + h δ_i f(y_{n+1,i}^{(j−1)}),   i = 1, ..., k.

Parallel Adams-Bashforth-Moulton (PABM), k = 3 stage values

* The stage values y_{ni}, i = 1, ..., k, computed by the PAB method are used as the starting vector for the fixed point iteration of the PABM method.
* An additional vector v_{ni} holds the constant part of the right-hand side of the fixed point iteration.
* v_{ni} and y_{ni} are computed similarly, but with different matrices S.

[Figure (animation): PAM time stepping with k = 3 stage values. DOPRI8 provides the initial approximation vectors y_{01}, y_{02}, y_{03} and function values f(y_{0i}); in each time step of size h, the vectors v are computed and the new stage values y_{11}, y_{12}, y_{13} of the next time step are obtained by fixed point iterations.]

Experiments, k = 8 [Europar 04]

[Figure: time per step in seconds (up to about 3) of the PABM method with the Brusselator function for k = 8 on a Cray T3E-1200, comparing data parallel, task parallel, and task parallel orthogonal versions for 8 to 128 processors.]

Compiler framework TwoL (Two-Level Parallelism)

Mixed task and data parallelism: an application is developed from a specification program via a coordination program to a parallel target program.
* The specification program expresses the maximum degree of parallelism: composed modules → medium-grained task parallelism (coordination language); basic modules → fine-grained data parallelism.
* The coordination program expresses the exploited degree of parallelism.
* Design steps: program transformation (specification → coordination) and translation (coordination → parallel target program).

Interaction between task and data parallel level

[Figure: the task parallel level consists of the coordination program with calls of basic modules, group information, and data distribution; the data parallel level provides the parallel basic modules. The non-executable specification is translated into an executable data parallel program with a task parallel implementation of the function calls.]

Task graph

Dependencies between module activations are represented by a task graph:
* nodes: activations of basic modules (data parallel modules, SPMD computations);
* edges: data dependencies (communication); independent modules can be executed concurrently;
* difference to normal flow graphs: an equivalent representation as a module tree exists.

Design decisions

* load balance – sizes of the processor subgroups;
* communication – internal collective communication operations;
* data redistribution between data parallel modules;
→ multiprocessor task scheduling with dependencies of data parallel modules.
Cost model: the execution time is a function of application- and machine-specific parameters,
  TIME = F_parallel(n, p, machine parameters),
covering the computation, communication, and data redistribution times.

Overview of the transformation steps

[Figure: the module specification is transformed via the module dependence graph into a module specification with scheduling (module execution graph) and finally into a module specification with scheduling and group sizes (module execution plan), which assigns the modules 1–5 to processors over time.]

Extrapolation methods for ODEs

One-step method for solving ODEs: in each time step, r different approximations with different stepsizes h_0, ..., h_{r−1} are computed and combined to an approximation of higher order (generating method, extrapolation table). Possible task organizations:
* data parallel execution;
* linear group partitioning: use r processor groups and adapt the size of the groups to the computational effort; group i contains g_i = p · (i+1) / (r(r+1)/2) processors;
* extended group partitioning: use ⌈r/2⌉ groups; group i executes the computations for the stepsizes h_i and h_{r−i}.

Extrapolation method on the Cray T3E

[Figure: runtime per time step (in seconds, up to about 0.45) for a sparse non-stiff ODE system with p = 4 and p = 8 processors of a Cray T3E, r = 4, comparing consecutive, linear, and extended group partitioning for ODE system sizes from 242 to 13122.]

Extrapolation method on the Cray T3E — cont.
[Figure: runtime per time step (in seconds, up to about 0.15) for a sparse non-stiff ODE system with p = 16 and p = 32 processors of a Cray T3E, r = 4, comparing consecutive, linear, and extended group partitioning for ODE system sizes from 242 to 13122.]

Tlib library – basic steps

* The programmer has to identify the M-tasks; an automatic identification by a compiler tool is difficult.
* The runtime library Tlib allows the realization of hierarchically structured M-task programs [SC02, JPDC05].
* The execution order of the M-tasks and the group splittings are specified statically by the programmer; the number of executing processors and the recursion control can be adapted automatically.
* Outlook: dynamic adaptation of the execution order during M-task execution.

Tlib API

Tlib is based on MPI and allows the use of arbitrary MPI functions. The Tlib API provides support for
* the creation and administration of a dynamic hierarchy of processor groups;
* the coordination and mapping of nested M-tasks;
* the handling and termination of recursive calls and group splittings;
* the organization of communication between M-tasks.
The splitting and mapping can be adapted to the execution platform.
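Whether independent M-tasks should be executed consecutively on the full processor set or concurrently on subgroups can be estimated with a simple runtime model. The following sketch is an illustrative assumption, not the actual TwoL/Tlib cost model: it uses a generic per-task cost T(n, p) = comp·n/p + comm·log2(p) (computation scales with the group size, collective communication grows logarithmically, cf. the group-based broadcast measurements) plus a redistribution term for the concurrent case; all function names are ours.

```c
#include <assert.h>

/* Integer base-2 logarithm (p assumed to be a power of two). */
static int ilog2(int p) {
  int l = 0;
  while (p > 1) { p /= 2; l++; }
  return l;
}

/* Hypothetical cost of one M-task of size n on p processors. */
static double mtask_time(double n, int p, double comp, double comm) {
  return comp * n / p + comm * ilog2(p);
}

/* Compare the two execution schemes for two independent M-tasks:
   consecutive: both tasks one after another on all p processors;
   concurrent:  each task on p/2 processors at the same time, with an
                additional data redistribution cost of redist per element.
   Returns 1 if the concurrent (group-based) execution is predicted
   to be faster. */
static int concurrent_is_faster(double n, int p,
                                double comp, double comm, double redist) {
  double consecutive = 2.0 * mtask_time(n, p, comp, comm);
  double concurrent  = mtask_time(n, p / 2, comp, comm) + redist * n;
  return concurrent < consecutive;
}
```

In this model the concurrent scheme wins for communication-dominated tasks (the collective operations run on smaller groups), while a high redistribution cost can make the consecutive, purely data parallel execution preferable — exactly the trade-off the scheduling decisions above have to evaluate.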
Interface of the Tlib library

Type of all Tlib tasks (basic and coordination):

```c
void *F(void *arg, MPI_Comm comm, T_Descr *pdescr);
```

* arg: packed arguments for F;
* comm: MPI communicator for internal communication;
* pdescr: pointer to a descriptor of the executing processor group.

The function F may contain calls of Tlib functions to create subgroups and to map sub-computations to subgroups.

Initialization of the Tlib library:

```c
int T_Init(int argc, char *argv[], MPI_Comm comm, T_Descr *pdescr);
```

* argc, argv: arguments of main();
* comm: initial MPI communicator;
* pdescr: returns the initial group descriptor.

Splitting operations for processor groups

```c
int T_SplitGrp(T_Descr *pdescr, T_Descr *pdescr1, float per1, float per2);
```

* pdescr: current group descriptor;
* pdescr1: new group descriptor for the two new subgroups;
* per1, per2: relative sizes of the first and second subgroup.

Further group subdivision operations:

```c
int T_SplitGrpParfor(int n, T_Descr *pdescr, T_Descr *pdescr1, float p[]);
int T_SplitGrpExpl(T_Descr *pdescr, T_Descr *pdescr1, int color, int key);
```

Concurrent execution of M-tasks

Mapping of two concurrent M-tasks to processor groups:

```c
int T_Par(void *(*f1)(void *, MPI_Comm, T_Descr *), void *parg1, void *pres1,
          void *(*f2)(void *, MPI_Comm, T_Descr *), void *parg2, void *pres2,
          T_Descr *pdescr);
```

* f1, f2: functions for the first and second M-task;
* parg1, pres1 / parg2, pres2: packed arguments and results for f1 / f2;
* pdescr: pointer to the current group descriptor describing two groups.

Concurrent execution of M-tasks: mapping of more than two
independent M-tasks to processor groups:

```c
int T_Parfor(void *(*f[])(void *, MPI_Comm, T_Descr *),
             void *parg[], void *pres[], T_Descr *pdescr);
```

* f: array of n pointers to the M-task functions;
* pdescr: descriptor of a group partition into n disjoint groups.

This corresponds to the execution of a parallel loop with n independent iterates.

Programming example using Tlib

M-task hierarchy: F1 is followed by the concurrent M-tasks FH and F3, and finally by the SPMD task F7; FH internally executes F2, then the concurrent M-tasks F4 and F5, then F6.

```c
int main(int argc, char *argv[]) {
  T_Descr descr, descr1;
  void *arg_f1, *arg_fh, *arg_f3, *arg_f7;
  void *res_fh, *res_f3;

  T_Init(argc, argv, MPI_COMM_WORLD, &descr);
  /* pack argument arg_f1 */
  F1(arg_f1, MPI_COMM_WORLD, &descr);
  T_SplitGrp(&descr, &descr1, 0.7, 0.3);
  /* pack argument arg_fh in Ga and argument arg_f3 in Gb */
  T_Par(FH, &arg_fh, &res_fh, F3, &arg_f3, &res_f3, &descr1);
  /* possible redistribution of data from Ga and Gb to G = Ga ∪ Gb */
  F7(arg_f7, MPI_COMM_WORLD, &descr);
}
```

Programming example using Tlib – 2

```c
void *FH(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  T_Descr descr2;
  void *arg_f2, *arg_f4, *arg_f5, *arg_f6;
  void *res_f2, *res_f4, *res_f5, *res_f6, *res_fh;

  res_f2 = F2(arg_f2, comm, pdescr);
  T_SplitGrp(pdescr, &descr2, 0.5, 0.5);
  /* pack argument arg_f4 in Gc and argument arg_f5 in Gd */
  T_Par(F4, &arg_f4, &res_f4, F5, &arg_f5, &res_f5, &descr2);
  /* possible redistribution of data from Gc and Gd to Ga = Gc ∪ Gd */
  res_f6 = F6(arg_f6, comm, pdescr);
  /* build the result of FH in res_fh */
  return res_fh;
}
```

Fast multilevel hierarchical matrix multiplication

A combination of different algorithms for matrix multiplication can lead to very competitive implementations: Strassen method, tpMM,
Atlas; [ICCS'04, ICS'04]. Illustration of the hierarchy: the Strassen method at the top level can delegate to pdgemm, tpMM, or a ring method at the intermediate level, which in turn use a one-processor dgemm from BLAS, ESSL, or ATLAS at the low level.

Recursive splitting – Strassen algorithm

Matrix multiplication C = AB with A, B, C ∈ R^{n×n}: decompose A, B into square blocks of size n/2:

  ( C11 C12 )   ( A11 A12 ) ( B11 B12 )
  ( C21 C22 ) = ( A21 A22 ) ( B21 B22 )

The submatrices C11, C12, C21, C22 are computed separately according to

  C11 = Q1 + Q4 − Q5 + Q7        C12 = Q3 + Q5
  C21 = Q2 + Q4                  C22 = Q1 + Q3 − Q2 + Q6

where Q1, ..., Q7 are computed by recursive calls:

  Q1 = strassen(A11 + A22, B11 + B22);   Q5 = strassen(A11 + A12, B22);
  Q2 = strassen(A21 + A22, B11);         Q6 = strassen(A21 − A11, B11 + B12);
  Q3 = strassen(A11, B12 − B22);         Q7 = strassen(A12 − A22, B21 + B22);
  Q4 = strassen(A22, B21 − B11);

Organization into the four tasks Task_C11(), Task_C12(), Task_C21(), Task_C22():

  Task_C11 on group 0: compute Q1, compute Q7, receive Q5, receive Q4, compute C11
  Task_C12 on group 1: compute Q3, compute Q5, send Q5, send Q3, compute C12
  Task_C21 on group 2: compute Q2, compute Q4, send Q2, send Q4, compute C21
  Task_C22 on group 3: compute Q1, compute Q6, receive Q2, receive Q3, compute C22

Realization with Tlib: Task_C11() and Task_q1().

Strassen algorithm – main coordination function

```c
void *Strassen(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  T_Descr descr1;
  void *(*f[4])(void *, MPI_Comm, T_Descr *);
  void *pres[4], *parg[4];
  float per[4] = {0.25, 0.25, 0.25, 0.25};

  T_SplitGrpParfor(4, pdescr, &descr1, per);
  f[0] = Task_C11; f[1] = Task_C12; f[2] = Task_C21; f[3] = Task_C22;
  parg[0] = parg[1] = parg[2] = parg[3] = arg;
  T_Parfor(f, parg, pres, &descr1);
  assemble_matrix(pres[0], pres[1], pres[2], pres[3]);
}
```
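The seven products and the combination formulas above can be checked on 2×2 matrices with scalar blocks, where each recursive call strassen(X, Y) degenerates to an ordinary product. A minimal self-contained check (the function name strassen_2x2 and the flat array layout are ours, for illustration only):

```c
#include <assert.h>

/* Strassen combination formulas for 1x1 blocks (scalars).
   a = {A11, A12, A21, A22}, b and c likewise, row-major. */
static void strassen_2x2(const double a[4], const double b[4], double c[4]) {
  double q1 = (a[0] + a[3]) * (b[0] + b[3]);
  double q2 = (a[2] + a[3]) * b[0];
  double q3 = a[0] * (b[1] - b[3]);
  double q4 = a[3] * (b[2] - b[0]);
  double q5 = (a[0] + a[1]) * b[3];
  double q6 = (a[2] - a[0]) * (b[0] + b[1]);
  double q7 = (a[1] - a[3]) * (b[2] + b[3]);
  c[0] = q1 + q4 - q5 + q7;  /* C11 */
  c[1] = q3 + q5;            /* C12 */
  c[2] = q2 + q4;            /* C21 */
  c[3] = q1 + q3 - q2 + q6;  /* C22 */
}
```

Seven multiplications instead of eight at each level is what makes the recursion attractive; in the M-task version each q_i becomes a recursive call executed on a processor subgroup.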
Strassen algorithm – Task_C11()

```c
void *Task_C11(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  double **q1, **q4, **q5, **q7, **res;
  int i, j, n2;

  /* extract arguments from arg, including the matrix size n */
  n2 = n / 2;
  q1 = Task_q1(arg, comm, pdescr);
  q7 = Task_q7(arg, comm, pdescr);
  /* allocate res, q4 and q5 as matrices of size n/2 x n/2 */
  /* receive q4 from group 2 and q5 from group 1 using the parent communicator */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++)
      res[i][j] = q1[i][j] + q4[i][j] - q5[i][j] + q7[i][j];
  return res;
}
```

Strassen algorithm – Task_q1()

```c
void *Task_q1(void *arg, MPI_Comm comm, T_Descr *pdescr) {
  double **q1, **q11, **q12;
  struct MM_mul *mm, *arg1;
  int i, j, n2;

  mm = (struct MM_mul *) arg;
  n2 = mm->n / 2;
  /* allocate q1, q11 and q12 as matrices of size n2 x n2 */
  for (i = 0; i < n2; i++)
    for (j = 0; j < n2; j++) {
      q11[i][j] = mm->a[i][j] + mm->a[i + n2][j + n2];  /* A11 + A22 */
      q12[i][j] = mm->b[i][j] + mm->b[i + n2][j + n2];  /* B11 + B22 */
    }
  arg1 = (struct MM_mul *) malloc(sizeof(struct MM_mul));
  arg1->a = q11; arg1->b = q12; arg1->n = n2;
  q1 = Strassen(arg1, comm, pdescr);
  return q1;
}
```

New hierarchical algorithm

* Intermediate level: the new algorithm tpMM. [Figure: computation order of tpMM for 8 processors P0–P7, organized as a hierarchy of pairwise partner exchanges.]
* Bottom level: Atlas for the one-processor matrix multiplication.

Experimental evaluation: IBM Regatta p690
MFLOPS for different algorithmic combinations [ICS'04]: [Figure: MFLOPS per processor (up to about 3000) for 32 and 64 processors and matrix sizes 2048–12288, comparing pdgemm, strassen_pdgemm (recursion depths r1–r3), strassen_ring, and strassen_tpmm_plug.]

Experimental evaluation: Dual Xeon Cluster, 3 GHz

MFLOPS for different algorithmic combinations: [Figure: MFLOPS per processor (up to about 2000) for 16 and 32 processors and matrix sizes 2048–10240, comparing pdgemm, strassen_pdgemm (r1, r2), strassen_ring, and strassen_tpmm_plug.]

DRD-Lib – a library for data redistribution between M-tasks

* distributed array-based data structures in an M-task parallel context;
* a data distribution format;
* a set of data redistribution operations for M-task programming (especially with Tlib);
→ dynamic character of the redistribution: the M-task coordination and the processor group layout are not known in advance;
* information about the data location is required: either distributed bookkeeping about the location of the distributed data, or a mechanism to determine the data distribution information only when required.
DRD-Lib format

* A distributed, opaque data distribution object contains:
  * the data distribution type (e.g. block-cyclic);
  * the parameters for the data distribution;
  * specifics about the storage (e.g. ghost cells);
  * the Tlib group descriptor – information about the specific processor group and its size.
* Interface for generating a distributed structure: runtime routines called collectively by the processor group executing an M-task.

DRD-Lib redistributions

A redistribution from data distribution type T1 on processor group G1 to type T2 on group G2 has to handle the following cases:

  distribution types | processor groups
  T1 = T2            | G1 = G2
  T1 = T2            | G1 ⊂ G2
  T1 = T2            | G1 ⊃ G2
  T1 = T2            | G1 ∩ G2 = ∅
  T1 ≠ T2            | G1 = G2
  T1 ≠ T2            | G1 ⊂ G2
  T1 ≠ T2            | G1 ⊃ G2
  T1 ≠ T2            | G1 ∩ G2 = ∅

The group situation is detected automatically and the redistribution is performed adaptively (with a distributed algorithm executed in SPMD style) [IPDPS workshop APDPM 2005].

Percentage of the communication overhead

[Figure: total communication (communication plus administration) vs. pure communication, as a percentage, for vector sizes from 10000 to 40000000 on an Opteron cluster with Gigabit Ethernet for the PVA example with p = 4 processors.]

Communication and administration overhead

[Figure: execution times (in seconds, up to about 10) of the DRD-Lib operations, pure communication vs. communication plus administration, for 4, 8, 16, and 32 processors on the Opteron cluster with Gigabit Ethernet for the PVA example with 10000000 elements.]
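The case distinction in the table above can be illustrated by a small helper that classifies the relation between source and target group and decides whether data has to move. This is an illustrative sketch with hypothetical names, not the DRD-Lib interface: groups are modeled as bitmasks of processor ranks, whereas DRD-Lib detects the situation with a distributed SPMD algorithm.

```c
#include <assert.h>

typedef enum {
  GRP_EQUAL, GRP_SUBSET, GRP_SUPERSET, GRP_DISJOINT, GRP_OVERLAP
} group_rel_t;

/* Classify the relation between two processor groups given as rank
   bitmasks, mirroring the group column of the redistribution table. */
static group_rel_t group_relation(unsigned g1, unsigned g2) {
  if (g1 == g2)        return GRP_EQUAL;
  if ((g1 & g2) == 0)  return GRP_DISJOINT;  /* G1 ∩ G2 = ∅ */
  if ((g1 & g2) == g1) return GRP_SUBSET;    /* G1 ⊂ G2 */
  if ((g1 & g2) == g2) return GRP_SUPERSET;  /* G1 ⊃ G2 */
  return GRP_OVERLAP;                        /* partial overlap */
}

/* Data has to be redistributed whenever the distribution type changes
   or the data must move to a different processor group. */
static int needs_redistribution(int type1, int type2,
                                unsigned g1, unsigned g2) {
  return type1 != type2 || group_relation(g1, g2) != GRP_EQUAL;
}
```

Only the case T1 = T2 with G1 = G2 requires no data movement; all other rows of the table trigger an adaptive redistribution.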
Conclusion and future work

* Scalability improvement through multiprocessor-task programming;
* scheduling of M-tasks (or malleable tasks) with dependencies;
* static scheduling with a scheduling algorithm, or dynamic scheduling.

Future work: a dynamic scheduling module for the automatic re-organization of the task assignment to processor groups → application in grid environments.

Further information: ai2.uni-bayreuth.de, rauber@uni-bayreuth.de