Programming Distributed Memory Systems Using OpenMP

Rudolf Eigenmann, Ayon Basumallik, Seung-Jai Min
School of Electrical and Computer Engineering, Purdue University
http://www.ece.purdue.edu/ParaMount


Is OpenMP a useful programming model for distributed systems?

OpenMP is a parallel programming model that assumes a shared address space:

    #pragma omp parallel for
    for (i=1; i<n; i++) { a[i] = b[i]; }

Why is it difficult to implement OpenMP for distributed processors?
The compiler or runtime system will need to
- partition and place data onto the distributed memories, and
- send/receive messages to orchestrate remote data accesses.
HPF (High Performance Fortran) was a large-scale effort to do so, without success.

So, why should we try (again)?
OpenMP is an easier (higher-productivity?) programming model. It
- allows programs to be parallelized incrementally, starting from the serial version, and
- relieves the programmer of the task of managing the movement of logically shared data.


Two Translation Approaches
- Use a software distributed shared memory (S-DSM) system.
- Translate OpenMP directly to MPI.


Approach 1: Compiling OpenMP for Software Distributed Shared Memory


Inter-procedural Shared Data Analysis

    SUBROUTINE SUB0
    INTEGER DELTAT
    CALL DCDTZ(DELTAT, ...)
    CALL DUDTZ(DELTAT, ...)
    END

    SUBROUTINE DCDTZ(A, B, C)
    INTEGER A, B, C
    C$OMP PARALLEL
    C$OMP+PRIVATE (B, C)
    A = ...
    CALL CCRANK
    ...
    C$OMP END PARALLEL
    END

    SUBROUTINE DUDTZ(X, Y, Z)
    INTEGER X, Y, Z
    C$OMP PARALLEL
    C$OMP+REDUCTION(+:X)
    X = X + ...
    C$OMP END PARALLEL
    END

    SUBROUTINE CCRANK()
    ...
    beta = 1 - alpha
    ...
    END


Access Pattern Analysis

    DO istep = 1, itmax, 1
    !$OMP PARALLEL DO
      rsd(i, j, k) = ...
    !$OMP END PARALLEL DO
    !$OMP PARALLEL DO
      rsd(i, j, k) = ...
    !$OMP END PARALLEL DO
    !$OMP PARALLEL DO
      u(i, j, k) = rsd(i, j, k)
    !$OMP END PARALLEL DO
      CALL RHS()
    ENDDO

    SUBROUTINE RHS()
    !$OMP PARALLEL DO
      u(i, j, k) = ...
    !$OMP END PARALLEL DO
    !$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = rsd(i, j, k) ...
    !$OMP END PARALLEL DO
    !$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = rsd(i, j, k) ...
    !$OMP END PARALLEL DO
    !$OMP PARALLEL DO
      ... = u(i, j, k) ...
      rsd(i, j, k) = ...
    !$OMP END PARALLEL DO
    END


=> Data Distribution-Aware Optimization

(Same code as on the previous slide: the access patterns of u and rsd, collected across the istep loop and into SUBROUTINE RHS(), are the input to the data distribution-aware optimization.)
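As an illustration of what the data distribution-aware translation does (a minimal sketch, not the translator's actual output), the following C fragment block-partitions a parallel loop so that each process executes only the iterations whose data the distribution places in its local memory; the names N, nprocs, and pid are assumed here, and the S-DSM Fortran code on the next slide computes analogous bounds (init00, limit00).

    /* Illustrative sketch only: block-partitioning the iterations of
     *   #pragma omp parallel for
     *   for (i = 0; i < N; i++) a[i] = b[i];
     * so that each process works on the block of a[] and b[] that the data
     * distribution assigns to it.  N and nprocs are assumed values. */
    #include <stdio.h>

    #define N 1024
    static double a[N], b[N];

    /* iteration range [*lb, *ub) owned by process pid under a block distribution */
    static void block_bounds(int pid, int nprocs, int *lb, int *ub)
    {
        int chunk = (N + nprocs - 1) / nprocs;
        *lb = pid * chunk;
        *ub = (*lb + chunk < N) ? *lb + chunk : N;
    }

    int main(void)
    {
        int nprocs = 4;                          /* illustrative process count        */
        for (int pid = 0; pid < nprocs; pid++) { /* each pid runs this on its own node */
            int lb, ub;
            block_bounds(pid, nprocs, &lb, &ub);
            for (int i = lb; i < ub; i++)        /* only the locally owned block       */
                a[i] = b[i];
            printf("process %d executes iterations [%d, %d)\n", pid, lb, ub);
        }
        return 0;
    }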
Adding Redundant Computation to Eliminate Communication

OpenMP program:

    DO k = 1, z
    !$OMP PARALLEL DO
      DO j = 1, N, 1
        flux(m, j) = u(3, i, j, k) + ...
      ENDDO
    !$OMP PARALLEL DO
      DO j = 1, N, 1
        DO m = 1, 5, 1
          rsd(m, i, j, k) = ... + (flux(m, j+1) - flux(m, j-1))
        ENDDO
      ENDDO
    ENDDO

S-DSM program:

    init00  = (N/proc_num)*(pid-1) ...
    limit00 = (N/proc_num)*pid ...
    DO k = 1, z
      DO j = init00, limit00, 1
        flux(m, j) = u(3, i, j, k) + ...
      ENDDO
      CALL TMK_BARRIER(0)
      DO j = init00, limit00, 1
        DO m = 1, 5, 1
          rsd(m, i, j, k) = ... + (flux(m, j+1) - flux(m, j-1))
        ENDDO
      ENDDO
    ENDDO

Optimized S-DSM code (each process redundantly computes the flux values just outside its block, so the boundary elements needed by the second loop no longer have to be communicated):

    init00  = (N/proc_num)*(pid-1) ...
    limit00 = (N/proc_num)*pid ...
    new_init  = init00 - 1
    new_limit = limit00 + 1
    DO k = 1, z
      DO j = new_init, new_limit, 1
        flux(m, j) = u(3, i, j, k) + ...
      ENDDO
      CALL TMK_BARRIER(0)
      DO j = init00, limit00, 1
        DO m = 1, 5, 1
          rsd(m, i, j, k) = ... + (flux(m, j+1) - flux(m, j-1))
        ENDDO
      ENDDO
    ENDDO


Access Privatization

Example from equake (SPEC OMPM2001). Read-only shared variables, before privatization:

    if (master) {
      shared->ARCHnodes = ...
      shared->ARCHduration = ...
      ...
    }

    /* Parallel Region */
    N = shared->ARCHnodes;
    iter = shared->ARCHduration;
    ...

After privatization, the variables become private and are initialized by all nodes:

    /* Done by all nodes */
    {
      ARCHnodes = ...
      ARCHduration = ...
      ...
    }

    /* Parallel Region */
    N = ARCHnodes;
    iter = ARCHduration;
    ...


Optimized Performance of OMPM2001 Benchmarks

[Figure: SPEC OMP2001M performance of wupwise, swim, mgrid, applu, equake, and art on 1, 2, 4, and 8 processors; baseline performance vs. optimized performance.]


A Key Question: How Close Are We to MPI Performance?

[Figure: SPEC OMP2001 performance of wupwise, swim, mgrid, and applu on 1, 2, 4, and 8 processors; baseline performance, optimized performance, and MPI performance.]


Towards Adaptive Optimization

A combined compiler-runtime scheme:
- The compiler identifies repetitive access patterns.
- The runtime system learns the actual remote addresses and sends data early.

Ideal program characteristics:
- an outer, serial loop,
- data addresses that are invariant or form a linear sequence with respect to the outer loop,
- inner, parallel loops,
- communication points at barriers.


Current Best Performance of OpenMP for S-DSM

[Figure: performance of wupwise, swim, applu, SpMul, and CG on 1, 2, 4, and 8 processors; baseline (no opt.), locality opt., and locality opt. + compiler/runtime opt.]


Approach 2: Translating OpenMP Directly to MPI
- Baseline translation
- Overlapping computation and communication for irregular accesses


Baseline Translation of OpenMP to MPI: Execution Model
- SPMD model.
- Serial regions are replicated on all processes.
- Iterations of parallel for loops are distributed using static block scheduling.
- Shared data is allocated on all nodes; there is no concept of an "owner", only producers and consumers of shared data.
- At the end of a parallel loop, producers communicate shared data to "potential" future consumers.
- Array section analysis is used to summarize array accesses.


Baseline Translation: Translation Steps
1. Identify all shared data.
2. Create annotations for accesses to shared data (use regular section descriptors to summarize array accesses).
3. Use interprocedural data-flow analysis to identify potential consumers; incorporate OpenMP's relaxed-consistency specifications.
4. Create message sets to communicate data between producers and consumers.
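As a concrete illustration of this execution model (a hand-written sketch under simplifying assumptions, not the translator's actual output), the loop from the opening slide can be turned into SPMD MPI code as follows; the per-pair producer/consumer message sets described on the next slide are approximated here by a single MPI_Allgatherv.

    /* Sketch of the baseline OpenMP-to-MPI execution model for
     *   #pragma omp parallel for
     *   for (i = 1; i < n; i++) a[i] = b[i];
     * Serial code is replicated, iterations are block-scheduled, shared arrays
     * are allocated on every process, and after the loop each producer makes
     * its written section available to all potential consumers. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1000
    static double a[N], b[N];          /* "shared" data, allocated on all processes */

    int main(int argc, char **argv)
    {
        int pid, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* serial region: replicated on every process */
        for (int i = 0; i < N; i++) b[i] = (double)i;

        /* static block scheduling of iterations 1 .. N-1 */
        int chunk = (N - 1 + nprocs - 1) / nprocs;
        int lb = 1 + pid * chunk;
        int ub = lb + chunk;
        if (lb > N) lb = N;
        if (ub > N) ub = N;

        for (int i = lb; i < ub; i++)  /* the former parallel loop */
            a[i] = b[i];

        /* producers -> "potential" consumers: every process contributes the block
         * of a[] it wrote; an all-gather stands in for the compiler's message sets */
        int *counts = malloc(nprocs * sizeof(int));
        int *displs = malloc(nprocs * sizeof(int));
        for (int p = 0; p < nprocs; p++) {
            int plb = 1 + p * chunk, pub = plb + chunk;
            if (plb > N) plb = N;
            if (pub > N) pub = N;
            displs[p] = plb;
            counts[p] = pub - plb;
        }
        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       a, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        free(counts);
        free(displs);
        MPI_Finalize();
        return 0;
    }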
Message Set Generation

V1: For every write, determine all future reads. Example chain of regular section descriptors (RSDs) for array A:

    V1: <write, A, 1, l1(p), u1(p)>
          ...
        <read,  A, 1, l2(p), u2(p)>
        <write, A, 1, l3(p), u3(p)>
        <read,  A, 1, l5(p), u5(p)>
          ...
        <read,  A, 1, l4(p), u4(p)>

The message set at RSD vertex V1, for array A, from process p to process q, is computed as

    SApq = elements of A with subscripts in the set
             {[l1(p), u1(p)] ∩ [l2(q), u2(q)]}
           ∪ {[l1(p), u1(p)] ∩ [l4(q), u4(q)]}
           ∪ ([l1(p), u1(p)] ∩ {[l5(q), u5(q)] − [l3(p), u3(p)]})
           ∪ ...


Baseline Translation of Irregular Accesses

Irregular accesses: A[B[i]], A[f(i)]
- Reads: assume the whole array is accessed.
- Writes: inspect at runtime, communicate at the end of the parallel loop.

We can often do better than this conservative scheme: monotonic array values let us sharpen the access regions.


Optimizations Based on Collective Communication
- Recognition of reduction idioms: translate them to MPI_Reduce / MPI_Allreduce calls.
- Casting sends/receives as all-to-all calls: beneficial where the producer-consumer relationship is many-to-many and there is insufficient distance between producers and consumers.


Performance of the Baseline OpenMP-to-MPI Translation

Platform II: sixteen IBM SP-2 WinterHawk-II nodes connected by a high-performance switch.


We Can Do More for Irregular Applications

    L1 : #pragma omp parallel for
         for (i=0; i<10; i++)
            A[i] = ...

    L2 : #pragma omp parallel for
         for (j=0; j<20; j++)
            B[j] = A[C[j]] + ...

Elements of A are produced partly by process 1 and partly by process 2, while the subscript array C holds values such as 1, 3, 5, 0, 2, ... for the accesses on process 1 and 2, 4, 8, 1, 2, ... for the accesses on process 2.

- Subscripts of accesses to shared arrays are not always analyzable at compile time.
- The baseline OpenMP-to-MPI translation conservatively estimates that each process accesses the entire array.
- Properties such as monotonicity of the irregular subscript can be deduced to refine the estimate.
- Still, there may be redundant communication; runtime tests (inspection) are needed to resolve the accesses.


Inspection
- Inspection differentiates accesses at runtime into local and non-local accesses.
- Inspection can also map iterations to accesses. This mapping can then be used to reorder iterations so that iterations with the same data source are grouped together.
- Communication of remote data can then be overlapped with the computation of iterations that access local data (or data that has already been received).

Example (values of the subscript array C and the iteration index i, before and after reordering):

    Before reordering:
      accesses on process 1:  i = 0, 1, 2, 3, 4, ...        C[i] = 1, 3, 5, 0, 2, ...
      accesses on process 2:  i = 10, 11, 12, 13, 14, ...   C[i] = 2, 5, 8, 1, 2, ...

    After reordering (iterations with the same data source grouped together):
      accesses on process 1:  i = 3, 0, 4, 1, 2, ...        C[i] = 0, 1, 2, 3, 5, ...
      accesses on process 2:  i = 11, 12, 13, 10, 14, ...   C[i] = 5, 8, 1, 2, 2, ...
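The following is a minimal sketch of such an inspector (the names and the block distribution of A are assumptions, not the actual runtime interface): it classifies the iterations of the loop B[j] = A[C[j]] + ... by the process that owns A[C[j]] and returns a reordered iteration list with local iterations first, so that they can be computed while the remote elements of A are in flight.

    /* Hypothetical inspector sketch for the loop  B[j] = A[C[j]] + ... :
     * classify iterations by the process that produced A[C[j]] under a block
     * distribution, and emit a reordered iteration list with local iterations
     * first, then iterations grouped by remote source process. */
    #include <stdlib.h>

    typedef struct {
        int *order;    /* iteration numbers, local ones first, then grouped by source */
        int n_local;   /* number of iterations whose A[C[j]] is locally owned         */
    } InspectorPlan;

    static int owner(int index, int lenA, int nprocs)
    {
        int block = (lenA + nprocs - 1) / nprocs;   /* block distribution of A */
        return index / block;
    }

    InspectorPlan inspect(const int *C, int n_iter, int lenA, int pid, int nprocs)
    {
        InspectorPlan plan;
        plan.order = malloc(n_iter * sizeof(int));
        plan.n_local = 0;

        /* first pass: iterations that only touch locally owned data */
        for (int j = 0; j < n_iter; j++)
            if (owner(C[j], lenA, nprocs) == pid)
                plan.order[plan.n_local++] = j;

        /* second pass: remote iterations, grouped by source process */
        int pos = plan.n_local;
        for (int src = 0; src < nprocs; src++) {
            if (src == pid) continue;
            for (int j = 0; j < n_iter; j++)
                if (owner(C[j], lenA, nprocs) == src)
                    plan.order[pos++] = j;
        }
        return plan;
    }

An executor would then walk plan.order, computing the first n_local iterations immediately and the remaining groups as the data from each source process arrives.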
Loop Restructuring

Simple iteration reordering may not be sufficient to expose the full set of opportunities for computation-communication overlap. Reordering loop L2 below may still not group together accesses from different sources:

    L1 : #pragma omp parallel for
         for (i=0; i<N; i++)
            p[i] = x[i] + alpha*r[i] ;

    L2 : #pragma omp parallel for
         for (j=0; j<N; j++) {
            w[j] = 0 ;
            for (k=rowstr[j]; k<rowstr[j+1]; k++)
    S2:        w[j] = w[j] + a[k]*p[col[k]] ;
         }

Distribute loop L2 to form loops L2-1 and L2-2:

    L1  : #pragma omp parallel for
          for (i=0; i<N; i++)
             p[i] = x[i] + alpha*r[i] ;

    L2-1: #pragma omp parallel for
          for (j=0; j<N; j++) {
             w[j] = 0 ;
          }

    L2-2: #pragma omp parallel for
          for (j=0; j<N; j++) {
             for (k=rowstr[j]; k<rowstr[j+1]; k++)
    S2:         w[j] = w[j] + a[k]*p[col[k]] ;
          }


Loop Restructuring (contd.)

Coalesce the nested loop L2-2 to form loop L3, and reorder the iterations of L3 to achieve computation-communication overlap. The T[i] data structure is created and filled in by the inspector. The final restructured and reordered code is:

    L1  : #pragma omp parallel for
          for (i=0; i<N; i++)
             p[i] = x[i] + alpha*r[i] ;

    L2-1: #pragma omp parallel for
          for (j=0; j<N; j++) {
             w[j] = 0 ;
          }

    L3  : for (i=0; i<num_iter; i++)
             w[T[i].j] = w[T[i].j] + a[T[i].k]*p[T[i].col] ;


Achieving Actual Overlap of Computation and Communication
- Non-blocking send/recv calls may not actually progress concurrently with computation.
- Instead, use a multi-threaded runtime system with separate computation and communication threads; on dual-CPU machines these threads can progress concurrently.
- The compiler extracts the sends/receives, along with the packing/unpacking of message buffers, into a communication thread.


Program Timeline (on process p)

    Computation thread:
      - wake up the communication thread and initiate sends to processes q and r
      - t_comp-p: execute iterations that access local data
      - t_wait-q: wait for the receives from process q to complete
      - t_comp-q: execute iterations that access data received from process q
      - t_wait-r: wait for the receives from process r to complete
      - t_comp-r: execute iterations that access data received from process r

    Communication thread (running concurrently):
      - t_send:   pack data and send it to processes q and r
      - t_recv-q: receive data from process q
      - t_recv-r: receive data from process r
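The skeleton below sketches this two-thread scheme (a simplified illustration, not the actual compiler-generated runtime; the pack/send/receive and compute helpers are empty stand-ins): a communication thread is woken to exchange the remote data while the computation thread runs the local iterations, and the join plays the role of t_wait before the iterations that need received data are executed.

    /* Simplified sketch of the computation/communication thread overlap on one
     * process; the helper bodies are placeholders for the compiler-extracted
     * packing, sending, receiving, and loop code. */
    #include <mpi.h>
    #include <pthread.h>

    static void pack_and_send(void)       { /* t_send: pack buffers, send to q and r      */ }
    static void receive_remote_data(void) { /* t_recv-q, t_recv-r: blocking receives      */ }
    static void compute_local(void)       { /* t_comp-p: iterations on local data         */ }
    static void compute_remote(void)      { /* t_comp-q, t_comp-r: iterations on received data */ }

    static void *communication_thread(void *arg)
    {
        (void)arg;
        pack_and_send();          /* progresses concurrently with compute_local() */
        receive_remote_data();
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        /* MPI calls are made from the communication thread, so request full thread support */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        pthread_t comm;
        pthread_create(&comm, NULL, communication_thread, NULL);  /* wake up the communication thread */

        compute_local();            /* overlapped with the sends and receives */

        pthread_join(comm, NULL);   /* t_wait: remote data has now arrived    */
        compute_remote();           /* iterations that use the received data  */

        MPI_Finalize();
        return 0;
    }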
Performance of Equake

[Figure: Equake execution time in seconds on 1, 2, 4, 8, and 16 nodes; hand-coded MPI, baseline (no inspection), inspection (no reordering), and inspection and reordering.]

[Figure: computation-communication overlap in Equake on 1 to 16 processors; actual time spent in send/recv, computation available for overlapping, and actual wait time.]


Performance of Moldyn

[Figure: Moldyn execution time in seconds on 1, 2, 4, 8, and 16 nodes; hand-coded MPI, baseline, inspector without reordering, and inspection and reordering.]

[Figure: computation-communication overlap in Moldyn on 1 to 16 nodes; time spent in send/recv, computation available for overlapping, and actual wait time.]


Performance of CG

[Figure: CG execution time in seconds on 1, 2, 4, 8, and 16 nodes; NPB-2.3-MPI, baseline translation, inspector without reordering, and inspector with iteration reordering.]

[Figure: computation-communication overlap in CG on 1 to 16 nodes; time spent in send/recv, computation available for overlap, and actual wait time.]


Conclusions
- There is hope for easier programming models on distributed systems.
- OpenMP can be translated effectively onto distributed-memory systems; we have used benchmarks from SPEC OMP, NAS, and additional irregular codes.
- Direct translation of OpenMP to MPI outperforms translation via S-DSM; the S-DSM "fall back" for irregular accesses incurs significant overhead.
- Caveats:
  - Data scalability is an issue.
  - Black-belt programmers will always be able to do better.
  - Advanced compiler technology is involved; there will be performance surprises.