PGAS Languages and Halo Updates
Will Sawyer, CSCS

Important concepts and acronyms
• PGAS: Partitioned Global Address Space
• UPC: Unified Parallel C
• CAF: Co-Array Fortran
• Titanium: PGAS Java dialect
• MPI: Message-Passing Interface
• SHMEM: Shared Memory API (SGI)

Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by any other
• Partitioned: data is designated as local (with 'affinity') or global (possibly far); the programmer controls the layout
[Figure: global address space spanning processes p0, p1, ..., pn, each with private variables (x, y, l) and shared/global variables (g)]
• Current languages: UPC, CAF, and Titanium
• By default: object heaps are shared, program stacks are private

Potential strengths of a PGAS language
• Interprocess communication is intrinsic to the language
• Explicit support for distributed data structures (private and shared data)
• Conceptually the parallel formulation can be more elegant
• One-sided shared-memory communication: values are either 'put' to or 'got' from remote images
• Support for bulk messages and synchronization
• Can be implemented with a message-passing library or through RDMA (remote direct memory access)
• PGAS hardware support available: the Cray Gemini (XE6) interconnect supports RDMA
• Potential interoperability with existing C/Fortran/Java code

POP Halo Exchange with Co-Array Fortran
• Worley, Levesque, The Performance Evolution of the Parallel Ocean Program on the Cray X1, Cray User Group Meeting, 2004
• The Cray X1 had a single vector processor per node and hardware support for internode communication
• Co-Array Fortran (CAF) was driven by Numrich et al., also the authors of SHMEM
• The halo exchange was programmed in MPI, CAF, and SHMEM

Halo Exchange "Stencil 2D" Benchmark
Halo exchange and stencil operation over a square domain distributed over a 2-D virtual process topology; arbitrary halo 'radius' (number of halo cells in a given dimension, e.g. 3)
MPI implementations:
• Trivial: post all 8 MPI_Isend and MPI_Irecv
• Sendrecv: MPI_Sendrecv between PE pairs
• Halo: MPI_Isend/MPI_Irecv between PE pairs
CAF implementations:
• Trivial: simple copies to remote images
• Put: reciprocal puts between image pairs
• Get: reciprocal gets between image pairs
• GetA: all images do the inner region first, then all do the block region (fine grain, no sync.)
• GetH: half of the images do the inner region first, half do the block region first (fine grain, no sync.)
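As an illustration of the Sendrecv variant listed above, the following is a minimal C sketch of an east/west edge exchange with MPI_Sendrecv; the routine name, its arguments, and the restriction to one dimension are illustrative assumptions, not taken from the benchmark source.

/* Hedged sketch: exchange east/west edge strips of width 'halo' using
   paired MPI_Sendrecv calls.  Buffers are assumed to be packed
   contiguously by the caller; 'west' and 'east' are the neighbour ranks. */
#include <mpi.h>

void exchange_east_west(double *send_east, double *recv_west,
                        double *send_west, double *recv_east,
                        int count, int west, int east, MPI_Comm comm)
{
    /* send my east edge to the east neighbour while receiving my west halo */
    MPI_Sendrecv(send_east, count, MPI_DOUBLE, east, 0,
                 recv_west, count, MPI_DOUBLE, west, 0,
                 comm, MPI_STATUS_IGNORE);

    /* send my west edge to the west neighbour while receiving my east halo */
    MPI_Sendrecv(send_west, count, MPI_DOUBLE, west, 1,
                 recv_east, count, MPI_DOUBLE, east, 1,
                 comm, MPI_STATUS_IGNORE);
}

The four corner regions can be covered either with additional exchanges to the diagonal neighbours (as the Trivial variant does with its 8 messages) or by exchanging in one direction first and including the just-received halo strips in the second direction's exchange.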
Example code: Trivial CAF

real, allocatable, save :: V(:,:)[:,:]
:
allocate( V(1-halo:m+halo,1-halo:n+halo)[p,*] )
:
WW = myP-1 ; if (WW<1) WW = p
EE = myP+1 ; if (EE>p) EE = 1
SS = myQ-1 ; if (SS<1) SS = q
NN = myQ+1 ; if (NN>q) NN = 1
:
V(1:m,1:n) = dom(1:m,1:n)                                     ! internal region
V(1-halo:0, 1:n)[EE,myQ]        = dom(m-halo+1:m,1:n)         ! to East
V(m+1:m+halo, 1:n)[WW,myQ]      = dom(1:halo,1:n)             ! to West
V(1:m,1-halo:0)[myP,NN]         = dom(1:m,n-halo+1:n)         ! to North
V(1:m,n+1:n+halo)[myP,SS]       = dom(1:m,1:halo)             ! to South
V(1-halo:0,1-halo:0)[EE,NN]     = dom(m-halo+1:m,n-halo+1:n)  ! to North-East
V(m+1:m+halo,1-halo:0)[WW,NN]   = dom(1:halo,n-halo+1:n)      ! to North-West
V(1-halo:0,n+1:n+halo)[EE,SS]   = dom(m-halo+1:m,1:halo)      ! to South-East
V(m+1:m+halo,n+1:n+halo)[WW,SS] = dom(1:halo,1:halo)          ! to South-West

sync all
!
! Now run a stencil filter over the internal region (the region unaffected by halo values)
!
do j=1,n
  do i=1,m
    sum = 0.
    do l=-halo,halo
      do k=-halo,halo
        sum = sum + stencil(k,l)*V(i+k,j+l)
      enddo
    enddo
    dom(i,j) = sum
  enddo
enddo

Stencil 2D Results on XT5, XE6, X2; Halo = 1
• Using a fixed-size virtual PE topology, vary the size of the local square
• XT5: CAF puts/gets implemented through a message-passing library
• XE6, X2: RMA-enabled hardware support for PGAS, but still must pass through a software stack
[Figure: log-log plots of time (s.) vs. square local domain edge size (50-5000) for an XT5 12x6 PE grid, an XE6 48x24 PE grid, and an X2 4x2 PE grid, all with halo=1, comparing Trivial/SendRecv/Halo (MPI) against Trivial/Put/Get (CAF)]

Stencil 2D Weak Scaling on XE6
• Fixed local dimension, vary the PE virtual topology (take the optimal configuration)
[Figure: weak-scaling plots on the XE6 with 2048^2 points per PE, for halo=1 and halo=3, time (s.) vs. number of processes (1-10000), comparing Trivial/SendRecv (MPI) against Trivial/Put/Get0/Get1 (CAF)]

SPIN: Transverse field Ising model
• No symmetries
• Any lattice with n sites — 2^n states
• Need n bits to encode the state; split this into two parts of m and n-m bits
• The first part is a core index — 2^m cores
• The second part is a state index within the core — 2^(n-m) states
• Sparse matrix times dense vector
• Each process communicates (large vectors) only with m 'neighbors'
• Similar to a halo update, but with higher dimensional state space
• Implementation in C with MPI_Irecv/MPI_Isend, MPI_Allreduce
(Sergei Isakov)
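The m-neighbour exchange can be pictured with the following minimal C sketch of the off-diagonal sweep; the assumption that the partner rank differs from the local rank in exactly one core bit (rank ^ (1 << k)), and all routine and variable names, are illustrative and not taken from the SPIN source.

/* Hedged sketch: for each of the m core bits, exchange the full local
   vector v1 with the partner whose core index differs in that bit,
   then apply the off-diagonal couplings that reach into that partner. */
#include <mpi.h>

void offdiag_exchange(double *v1, double *v1_remote, int nlstates,
                      int m, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (int k = 0; k < m; ++k) {
        int partner = rank ^ (1 << k);   /* assumed neighbour rule */
        MPI_Request reqs[2];

        MPI_Irecv(v1_remote, nlstates, MPI_DOUBLE, partner, k, comm, &reqs[0]);
        MPI_Isend(v1,        nlstates, MPI_DOUBLE, partner, k, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* ... accumulate gamma * v1_remote[...] into the local v2 ... */
    }
}

Each exchange moves a full local vector of 2^(n-m) doubles, which is why the slides describe it as a halo update with large messages rather than thin boundary strips.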
UPC Version "Elegant"

shared double *dotprod;          /* on thread 0 */
shared double shared_a[THREADS];
shared double shared_b[THREADS];

struct ed_s {
  ...
  shared double *v0, *v1, *v2;   /* vectors */
  shared double *swap;           /* for swapping vectors */
};
:
for (iter = 0; iter < ed->max_iter; ++iter) {
  shared_b[MYTHREAD] = b;        /* calculate beta */
  upc_all_reduceD( dotprod, shared_b, UPC_ADD, THREADS, 1, NULL,
                   UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
  ed->beta[iter] = sqrt(fabs(dotprod[0]));
  ib = 1.0 / ed->beta[iter];
  /* normalize v1 */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) )
    ed->v1[i] *= ib;
  upc_barrier(0);
  /* matrix vector multiplication */
  upc_forall (s = 0; s < ed->nlstates; ++s; &(ed->v1[s]) ) {
    /* v2 = A * v1, over all threads */
    ed->v2[s] = diag(s, ed->n, ed->j) * ed->v1[s];   /* diagonal part */
    for (k = 0; k < ed->n; ++k) {                    /* offdiagonal part */
      s1 = flip_state(s, k);
      ed->v2[s] += ed->gamma * ed->v1[s1];
    }
  }
  a = 0.0;
  /* Calculate local conjugate term */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) ) {
    a += ed->v1[i] * ed->v2[i];
  }
  shared_a[MYTHREAD] = a;
  upc_all_reduceD( dotprod, shared_a, UPC_ADD, THREADS, 1, NULL,
                   UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
  ed->alpha[iter] = dotprod[0];
  b = 0.0;
  /* v2 = v2 - v0 * beta1 - v1 * alpha1 */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v2[i]) ) {
    ed->v2[i] -= ed->v0[i] * ed->beta[iter] + ed->v1[i] * ed->alpha[iter];
    b += ed->v2[i] * ed->v2[i];
  }
  swap01(ed); swap12(ed);        /* "shift" vectors */
}

UPC "Inelegant1": reproduce existing messaging

MPI:
MPI_Isend(ed->v1,  ed->nlstates, MPI_DOUBLE, ed->to_nbs[0],   k,        MPI_COMM_WORLD, &req_send2);
MPI_Irecv(ed->vv1, ed->nlstates, MPI_DOUBLE, ed->from_nbs[0], ed->nm-1, MPI_COMM_WORLD, &req_recv);
:
MPI_Isend(ed->v1,  ed->nlstates, MPI_DOUBLE, ed->to_nbs[neighb],   k, MPI_COMM_WORLD, &req_send2);
MPI_Irecv(ed->vv2, ed->nlstates, MPI_DOUBLE, ed->from_nbs[neighb], k, MPI_COMM_WORLD, &req_recv2);
:
UPC:
shared[NBLOCK] double vtmp[THREADS*NBLOCK];
:
for (i = 0; i < NBLOCK; ++i) vtmp[i+MYTHREAD*NBLOCK] = ed->v1[i];
upc_barrier(1);
for (i = 0; i < NBLOCK; ++i) ed->vv1[i] = vtmp[i+(ed->from_nbs[0]*NBLOCK)];
:
for (i = 0; i < NBLOCK; ++i) ed->vv2[i] = vtmp[i+(ed->from_nbs[neighb]*NBLOCK)];
upc_barrier(2);
:

UPC "Inelegant3": use only PUT operations

shared[NBLOCK] double vtmp1[THREADS*NBLOCK];
shared[NBLOCK] double vtmp2[THREADS*NBLOCK];
:
upc_memput( &vtmp1[ed->to_nbs[0]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
upc_barrier(1);
:
if ( mode == 0 ) {
  upc_memput( &vtmp2[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
} else {
  upc_memput( &vtmp1[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
}
:
if ( mode == 0 ) {
  for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp1[i+MYTHREAD*NBLOCK]; }
  mode = 1;
} else {
  for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp2[i+MYTHREAD*NBLOCK]; }
  mode = 0;
}
upc_barrier(2);

But then: why not use the light weight SHMEM protocol?

#include <shmem.h>
:
double *vtmp1, *vtmp2;
:
vtmp1 = (double *) shmalloc(ed->nlstates*sizeof(double));
vtmp2 = (double *) shmalloc(ed->nlstates*sizeof(double));
:
shmem_double_put(vtmp1, ed->v1, ed->nlstates, ed->from_nbs[0]);
/* Do local work */
shmem_barrier_all();
:
shmem_double_put(vtmp2, ed->v1, ed->nlstates, ed->from_nbs[0]);
:
for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp1[i]; }
shmem_barrier_all();
swap(&vtmp1, &vtmp2);
:
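For reference, the put/barrier/swap pattern above fits in a complete program of only a few lines; the following is a minimal, self-contained sketch using classic SHMEM calls (start_pes, shmalloc, shmem_double_put, shmem_barrier_all), with a simple ring neighbour and a buffer size that are illustrative assumptions rather than values from the SPIN code.

/* Hedged sketch: double-buffered one-sided exchange around a ring. */
#include <shmem.h>

int main(void)
{
    start_pes(0);
    int  me    = shmem_my_pe();
    int  npes  = shmem_n_pes();
    int  right = (me + 1) % npes;        /* illustrative neighbour choice */
    long nloc  = 1024;                   /* illustrative local vector length */

    /* symmetric allocations: same address on every PE, as puts require */
    double *v     = (double *) shmalloc(nloc * sizeof(double));
    double *vtmp1 = (double *) shmalloc(nloc * sizeof(double));
    double *vtmp2 = (double *) shmalloc(nloc * sizeof(double));

    for (long i = 0; i < nloc; ++i) v[i] = (double) me;

    /* push my vector into the neighbour's first receive buffer,
       overlap with local work, then make the puts globally visible */
    shmem_double_put(vtmp1, v, nloc, right);
    /* ... local work ... */
    shmem_barrier_all();

    /* start the next transfer into the second buffer while the first
       one is consumed, then swap the buffer pointers */
    shmem_double_put(vtmp2, v, nloc, right);
    /* ... use vtmp1 ... */
    shmem_barrier_all();
    double *t = vtmp1; vtmp1 = vtmp2; vtmp2 = t;

    shfree(vtmp2); shfree(vtmp1); shfree(v);
    return 0;
}

The point of the double buffering is the same as in the slide: a barrier separates production and consumption of each buffer, so a put can already be in flight for the next step while the previous buffer is being used.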
Strong scaling: Cray XE6/Gemini, n=22, 24; 10 iterations
[Figure: strong-scaling plots of time (s.) vs. number of processes (1-1000) for SPIN with n=22 and n=24 on the XE6, comparing MPI, SHMEM, SHMEM fast, Elegant, and Inelegant 1-3]

Weak scaling: Cray XE6/Gemini, 10 iterations
[Figure: weak-scaling plots of time (s.) vs. number of processes (1-10000) for SPIN with n=m+20 and n=m+24 on the XE6, comparing MPI, SHMEM, SHMEM fast, Elegant, Inelegant 1-3, and BGQ MPI]

Conclusions
• One-sided communication has conceptual benefits and can have real performance benefits (e.g., on the Cray T3E, X1, perhaps X2)
• On the XE6, the CAF/UPC formulations can reach SHMEM performance, but only by using explicit puts and gets; the 'elegant' implementations perform poorly
• If the domain decomposition is already properly formulated… why not use a simple, light-weight protocol like SHMEM?
• For the XE6 Gemini interconnect, a study of one-sided communication primitives (Tineo, et al.) indicates that two-sided MPI communication is still the most effective. To do: test MPI-2 one-sided primitives (a sketch follows below)
• Still, the PGAS path should be kept open; possible task: a PGAS (CAF or SHMEM) implementation of the COSMO halo update?
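As a starting point for the MPI-2 to-do item above, a fence-synchronized MPI_Put halo update could look like the following minimal sketch; the one-dimensional decomposition, periodic neighbours, equal local sizes on all ranks, and all names are illustrative assumptions, not an existing COSMO or POMPA implementation.

/* Hedged sketch: 1-D halo update (halo width 1) with MPI-2 one-sided
   primitives.  u holds nloc interior points plus one halo cell on each
   side: u[0] and u[nloc+1]. */
#include <mpi.h>

void halo_update_rma(double *u, int nloc, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* expose the whole local array (interior + halo) as an RMA window */
    MPI_Win win;
    MPI_Win_create(u, (MPI_Aint)(nloc + 2) * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    /* my last interior value becomes the right neighbour's left halo (disp 0) */
    MPI_Put(&u[nloc], 1, MPI_DOUBLE, right, 0,        1, MPI_DOUBLE, win);
    /* my first interior value becomes the left neighbour's right halo (disp nloc+1) */
    MPI_Put(&u[1],    1, MPI_DOUBLE, left,  nloc + 1, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
}

In a real test the window would be created once outside the time loop, and passive-target synchronization (MPI_Win_lock/unlock) could be compared against the fence version and against the CAF, UPC, and SHMEM variants above.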