PGAS Languages and Halo Updates
Will Sawyer, CSCS
Important concepts and acronyms
PGAS: Partitioned Global Address Space
UPC: Unified Parallel C
CAF: Co-Array Fortran
Titanium: PGAS Java dialect
MPI: Message-Passing Interface
SHMEM: Shared Memory API (SGI)
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by any other
• Partitioned: data is designated as local (with ‘affinity’) or global (possibly far); programmer controls layout
[Figure: a global address space spanning processes p0, p1, …, pn; each process holds private data (l:) alongside globally visible data (x, y, g:) that any other process can address]
Current languages: UPC, CAF, and Titanium
By default, object heaps are shared and program stacks are private (illustrated in the sketch below).
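To make the model concrete, here is a minimal sketch in C using the SHMEM API that appears later in these slides; it assumes an OpenSHMEM-style environment, and the variable names are illustrative only. Data allocated on the symmetric heap is globally addressable, while stack data stays private to each PE.

#include <shmem.h>
#include <stdio.h>

int main(void)
{
  shmem_init();
  int me   = shmem_my_pe();                 /* this PE's index      */
  int npes = shmem_n_pes();                 /* total number of PEs  */

  /* Symmetric heap allocation: every PE can put/get into every
     other PE's copy of 'remote' (globally addressable).           */
  double *remote = (double *) shmem_malloc(sizeof(double));

  double local = (double) me;               /* private stack data   */

  /* One-sided write: deposit our value into the next PE's copy of
     'remote'; the target PE does not post a receive.              */
  shmem_double_put(remote, &local, 1, (me + 1) % npes);
  shmem_barrier_all();                      /* make all puts visible */

  printf("PE %d received %g\n", me, *remote);
  shmem_finalize();
  return 0;
}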
Potential strengths of a PGAS language
• Interprocess communication is intrinsic to the language
• Explicit support for distributed data structures (private and shared data)
• Conceptually, the parallel formulation can be more elegant
• One-sided shared-memory communication
  • Values are either ‘put’ to or ‘got’ from remote images
  • Support for bulk messages and synchronization
  • Could be implemented with a message-passing library or through RDMA (remote direct memory access)
• PGAS hardware support available
  • Cray Gemini (XE6) interconnect supports RDMA
• Potential interoperability with existing C/Fortran/Java code
POP Halo Exchange with Co-Array Fortran
Worley and Levesque, “The Performance Evolution of the Parallel Ocean Program on the Cray X1”, Cray User Group Meeting, 2004
• The Cray X1 had a single vector processor per node and hardware support for internode communication
• Co-Array Fortran (CAF) was driven by Numrich et al., also the authors of SHMEM
• The halo exchange was programmed in MPI, CAF, and SHMEM
Halo Exchange “Stencil 2D” Benchmark
Halo exchange and stencil operation over a square domain distributed over a 2-D virtual process topology
• Arbitrary halo ‘radius’ (number of halo cells in a given dimension, e.g. 3)
• MPI implementations:
  • Trivial: post all 8 MPI_Isend and MPI_Irecv
  • Sendrecv: MPI_Sendrecv between PE pairs
  • Halo: MPI_Isend/MPI_Irecv between PE pairs (see the sketch after this list)
• CAF implementations:
  • Trivial: simple copies to remote images
  • Put: reciprocal puts between image pairs
  • Get: reciprocal gets between image pairs
  • GetA: all images do the inner region first, then all do the block region (fine grain, no sync.)
  • GetH: half of the images do the inner region first, half do the block region first (fine grain, no sync.)
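As an illustration of the ‘Halo’ MPI variant, the following is a minimal sketch in C of the east-west exchange only; the column-major storage, the neighbour ranks west/east, and the function name are assumptions for this example, not code from the benchmark.

#include <mpi.h>

/* East-west halo exchange of 'halo' ghost columns, assuming the local
   array v is (m+2*halo) x (n+2*halo), stored column-major in one buffer.
   'west' and 'east' are the neighbour ranks in the virtual topology.   */
void exchange_east_west(double *v, int m, int n, int halo,
                        int west, int east, MPI_Comm comm)
{
  int ld    = m + 2*halo;       /* leading dimension (column length) */
  int count = ld * halo;        /* halo columns exchanged per side   */
  MPI_Request req[4];

  /* receive into the ghost columns on each side */
  MPI_Irecv(&v[0],           count, MPI_DOUBLE, west, 0, comm, &req[0]);
  MPI_Irecv(&v[(n+halo)*ld], count, MPI_DOUBLE, east, 1, comm, &req[1]);

  /* send the owned boundary columns to the corresponding neighbours */
  MPI_Isend(&v[halo*ld],     count, MPI_DOUBLE, west, 1, comm, &req[2]);
  MPI_Isend(&v[n*ld],        count, MPI_DOUBLE, east, 0, comm, &req[3]);

  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

The north-south and corner exchanges follow the same pattern; the CAF ‘Trivial’ version on the next slide expresses the same data movement as direct co-array assignments.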
Example code: Trivial CAF
real, allocatable, save :: V(:,:)[:,:]
:
allocate( V(1-halo:m+halo,1-halo:n+halo)[p,*] )
:
WW = myP-1 ; if (WW<1) WW = p
EE = myP+1 ; if (EE>p) EE = 1
SS = myQ-1 ; if (SS<1) SS = q
NN = myQ+1 ; if (NN>q) NN = 1
:
V(1:m,1:n) = dom(1:m,1:n)                                     ! internal region
V(1-halo:0,1:n)[EE,myQ]         = dom(m-halo+1:m,1:n)         ! to East
V(m+1:m+halo,1:n)[WW,myQ]       = dom(1:halo,1:n)             ! to West
V(1:m,1-halo:0)[myP,NN]         = dom(1:m,n-halo+1:n)         ! to North
V(1:m,n+1:n+halo)[myP,SS]       = dom(1:m,1:halo)             ! to South
V(1-halo:0,1-halo:0)[EE,NN]     = dom(m-halo+1:m,n-halo+1:n)  ! to North-East
V(m+1:m+halo,1-halo:0)[WW,NN]   = dom(1:halo,n-halo+1:n)      ! to North-West
V(1-halo:0,n+1:n+halo)[EE,SS]   = dom(m-halo+1:m,1:halo)      ! to South-East
V(m+1:m+halo,n+1:n+halo)[WW,SS] = dom(1:halo,1:halo)          ! to South-West
sync all
!
! Now run a stencil filter over the internal region (the region unaffected by halo values)
!
do j=1,n
  do i=1,m
    sum = 0.
    do l=-halo,halo
      do k=-halo,halo
        sum = sum + stencil(k,l)*V(i+k,j+l)
      enddo
    enddo
    dom(i,j) = sum
  enddo
enddo
Stencil 2D Results on XT5, XE6, X2; Halo = 1
Using a fixed-size virtual PE topology, vary the size of the local square domain
• XT5: CAF puts/gets are implemented through a message-passing library
• XE6, X2: RMA-enabled hardware support for PGAS, but traffic still must pass through a software stack
[Figure: three panels of time (s) vs. square local domain edge size (50 to 5000), for an XT5 12x6 PE grid, an XE6 48x24 PE grid, and an X2 4x2 PE grid, all with halo=1; curves compare Trivial (MPI), SendRecv (MPI), Halo (MPI), Trivial (CAF), Put (CAF), and Get (CAF)]
Stencil 2D Weak Scaling on XE6
Fixed local dimension, vary the PE virtual topology (take the optimal configuration)
[Figure: two panels of time (s) vs. number of processes (1 to 10000) for XE6 weak scaling with 2048^2 points per PE, at halo=1 and halo=3; curves compare Trivial (MPI), SendRecv (MPI), Trivial (CAF), Put (CAF), Get0 (CAF), and Get1 (CAF)]
SPIN: Transverse field Ising model
• No symmetries
• Any lattice with n sites — 2^n states
• Need n bits to encode the state
  • Split this into two parts of m and n-m bits (see the sketch below)
  • The first part is a core index — 2^m cores
  • The second part is a state index within the core — 2^(n-m) states
• Sparse matrix times dense vector
  • Each process communicates (large vectors) only with m ‘neighbors’
  • Similar to a halo update, but with a higher-dimensional state space
• Implementation in C with MPI_Irecv/Isend, MPI_Allreduce
Sergei Isakov
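To make the index split concrete, here is a small sketch in C; the choice of the high m bits as the core index and the exact helper names are assumptions for illustration (the UPC code on the following slides uses a flip_state function of this kind), not the project's actual encoding.

#include <stdint.h>

/* A state of an n-site lattice is an n-bit integer: the top m bits
   select the owning core (2^m cores), the low n-m bits index the
   state within that core (2^(n-m) local states).                   */
static inline int owner_core(uint64_t s, int n, int m) {
  return (int)(s >> (n - m));                  /* high m bits  */
}

static inline uint64_t local_index(uint64_t s, int n, int m) {
  return s & ((UINT64_C(1) << (n - m)) - 1);   /* low n-m bits */
}

/* Flipping spin k toggles one bit; if that bit lies among the high m
   bits, the owning core changes, which is why each process exchanges
   its vector with at most m 'neighbour' cores during the sparse
   matrix-vector product.                                            */
static inline uint64_t flip_state(uint64_t s, int k) {
  return s ^ (UINT64_C(1) << k);
}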
UPC Version “Elegant”
shared double *dotprod;            /* on thread 0 */
shared double shared_a[THREADS];
shared double shared_b[THREADS];

struct ed_s { ...
  shared double *v0, *v1, *v2;     /* vectors */
  shared double *swap;             /* for swapping vectors */
};
:
for (iter = 0; iter < ed->max_iter; ++iter) {
  shared_b[MYTHREAD] = b;
  /* calculate beta */
  upc_all_reduceD( dotprod, shared_b, UPC_ADD, THREADS, 1, NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
  ed->beta[iter] = sqrt(fabs(dotprod[0]));
  ib = 1.0 / ed->beta[iter];
  /* normalize v1 */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) )
    ed->v1[i] *= ib;
  upc_barrier(0);
  /* matrix vector multiplication: v2 = A * v1, over all threads */
  upc_forall (s = 0; s < ed->nlstates; ++s; &(ed->v1[s]) ) {
    ed->v2[s] = diag(s, ed->n, ed->j) * ed->v1[s];     /* diagonal part */
    for (k = 0; k < ed->n; ++k) {                      /* offdiagonal part */
      s1 = flip_state(s, k);
      ed->v2[s] += ed->gamma * ed->v1[s1];
    }
  }
  a = 0.0;
  /* calculate local conjugate term */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) ) { a += ed->v1[i] * ed->v2[i]; }
  shared_a[MYTHREAD] = a;
  upc_all_reduceD( dotprod, shared_a, UPC_ADD, THREADS, 1, NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
  ed->alpha[iter] = dotprod[0];
  b = 0.0;
  /* v2 = v2 - v0 * beta1 - v1 * alpha1 */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v2[i]) ) {
    ed->v2[i] -= ed->v0[i] * ed->beta[iter] + ed->v1[i] * ed->alpha[iter];
    b += ed->v2[i] * ed->v2[i];
  }
  swap01(ed); swap12(ed);   /* "shift" vectors */
}
UPC “Inelegant1”: reproduce existing messaging
• MPI
MPI_Isend(ed->v1, ed->nlstates, MPI_DOUBLE, ed->to_nbs[0], k, MPI_COMM_WORLD, &req_send2);
MPI_Irecv(ed->vv1, ed->nlstates, MPI_DOUBLE, ed->from_nbs[0], ed->nm-1, MPI_COMM_WORLD, &req_recv);
:
MPI_Isend(ed->v1, ed->nlstates, MPI_DOUBLE, ed->to_nbs[neighb], k, MPI_COMM_WORLD, &req_send2);
MPI_Irecv(ed->vv2, ed->nlstates, MPI_DOUBLE, ed->from_nbs[neighb], k, MPI_COMM_WORLD, &req_recv2);
:
• UPC
shared[NBLOCK] double vtmp[THREADS*NBLOCK];
:
for (i = 0; i < NBLOCK; ++i) vtmp[i+MYTHREAD*NBLOCK] = ed->v1[i];
upc_barrier(1);
for (i = 0; i < NBLOCK; ++i) ed->vv1[i] = vtmp[i+(ed->from_nbs[0]*NBLOCK)];
:
for (i = 0; i < NBLOCK; ++i) ed->vv2[i] = vtmp[i+(ed->from_nbs[neighb]*NBLOCK)];
upc_barrier(2);
:
UPC “Inelegant3”: use only PUT operations
shared[NBLOCK] double vtmp1[THREADS*NBLOCK];
shared[NBLOCK] double vtmp2[THREADS*NBLOCK];
:
upc_memput( &vtmp1[ed->to_nbs[0]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
upc_barrier(1);
:
if ( mode == 0 ) {
  upc_memput( &vtmp2[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
} else {
  upc_memput( &vtmp1[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
}
:
if ( mode == 0 ) {
  for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp1[i+MYTHREAD*NBLOCK]; }
  mode = 1;
} else {
  for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp2[i+MYTHREAD*NBLOCK]; }
  mode = 0;
}
upc_barrier(2);
But then: why not use the lightweight SHMEM protocol?
#include <shmem.h>
:
double *vtmp1,*vtmp2;
:
vtmp1 = (double *) shmalloc(ed->nlstates*sizeof(double));
vtmp2 = (double *) shmalloc(ed->nlstates*sizeof(double));
:
shmem_double_put(vtmp1,ed->v1,ed->nlstates,ed->from_nbs[0]);
/* Do local work */
shmem_barrier_all();
:
shmem_double_put(vtmp2,ed->v1,ed->nlstates,ed->from_nbs[0]);
:
for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp1[i]; }
shmem_barrier_all();
swap(&vtmp1, &vtmp2);
:
Strong scaling: Cray XE6/Gemini, n=22,24; 10 iter.
[Figure: two panels of time (s) vs. number of processes (1 to 1000) for XE6 SPIN strong scaling at n=22 and n=24; curves compare MPI, SHMEM, SHMEM fast, Elegant, Inelegant 1, Inelegant 2, and Inelegant 3]
Weak scaling: Cray XE6/Gemini, 10 iterations
[Figure: two panels of time (s) vs. number of processes (1 to 10000) for XE6 SPIN weak scaling at n=m+20 and n=m+24; curves compare MPI, SHMEM, SHMEM fast, Elegant, Inelegant 1, Inelegant 2, Inelegant 3, and BGQ MPI]
Conclusions
• One-sided communication has conceptual benefits and can have real ones (e.g., Cray T3E, X1, perhaps X2)
• On the XE6, the CAF/UPC formulation can achieve SHMEM performance, but only by using explicit puts and gets; the ‘elegant’ implementations perform poorly
• If the domain decomposition is already properly formulated… why not use a simple, lightweight protocol like SHMEM?
• For the XE6 Gemini interconnect: a study of one-sided communication primitives (Tineo et al.) indicates that two-sided MPI communication is still the most effective. To do: test MPI-2 one-sided primitives (see the sketch after this list)
• Still, the PGAS path should be kept open; a possible task: a PGAS (CAF or SHMEM) implementation of the COSMO halo update?
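As a possible starting point for that to-do item, here is a minimal hedged sketch in C of the MPI-2 put/fence pattern for a single neighbour; the function and parameter names are illustrative and make no claim about how the COSMO halo update is actually structured.

#include <mpi.h>

/* Deposit 'count' doubles into a neighbour's halo buffer with MPI-2
   one-sided communication.  'halo_buf' (of 'bufsize' elements) is the
   window base on every rank; 'neighbour' and 'offset' are illustrative. */
void one_sided_halo(double *halo_buf, int bufsize,
                    double *send_buf, int count,
                    int neighbour, MPI_Aint offset, MPI_Comm comm)
{
  MPI_Win win;

  /* In production code the window would be created once and reused. */
  MPI_Win_create(halo_buf, (MPI_Aint)bufsize * sizeof(double),
                 sizeof(double), MPI_INFO_NULL, comm, &win);

  MPI_Win_fence(0, win);                     /* open the access epoch */
  MPI_Put(send_buf, count, MPI_DOUBLE,
          neighbour, offset, count, MPI_DOUBLE, win);
  MPI_Win_fence(0, win);                     /* complete all puts     */

  MPI_Win_free(&win);
}

Whether this pattern beats two-sided MPI on the Gemini interconnect is exactly the open question raised above.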