Introduction to the Partitioned Global Address Space (PGAS) Programming Model
David E. Hudak, Ph.D.
Program Director for HPC Engineering
dhudak@osc.edu

Overview
• Module 1: PGAS Fundamentals
• Module 2: UPC
• Module 3: pMATLAB
• Module 4: Asynchronous PGAS and X10

Introduction to PGAS – The Basics

PGAS Model Overview
• Concepts
– Memories and structures
– Threads and affinity
– Local and non-local accesses
• Examples
– MPI
– OpenMP
– UPC
– X10

Software Memory Examples
• An executable image uses several memories:
– Static memory: the data segment
– Heap memory: holds allocated structures; explicitly managed by the programmer (malloc, free)
– Stack memory: holds function call records; implicitly managed by the runtime during execution

Memories and Structures
• Software memory
– A distinct logical storage area in a computer program (e.g., heap or stack)
– For parallel software, we use multiple memories
• Structure
– A collection of data created by program execution (arrays, trees, graphs, etc.)
• Partition
– Division of a structure into parts
• Mapping
– Assignment of structure parts to memories

Threads
• Units of execution
• Structured threading
– Dynamic threads: the program creates threads during execution (e.g., an OpenMP parallel loop)
– Static threads: the same number of threads runs for the duration of the program
• Single program, multiple data (SPMD)
• We will defer unstructured threading

Affinity and Nonlocal Access
• Affinity is the association of a thread to a memory
– If a thread has affinity with a memory, it can access its structures
– Such a memory is called a local memory
• Nonlocal access example
– Thread 0 wants part B
– Part B is in memory 1
– Thread 0 does not have affinity to memory 1

Comparisons

  Model                  | Thread count  | Memory count    | Nonlocal access
  Traditional            | 1             | 1               | N/A
  OpenMP                 | Either 1 or p | 1               | N/A
  MPI                    | p             | p               | No. Message required.
  C+CUDA                 | 1+p           | 2 (host/device) | No. DMA required.
  UPC, CAF, pMatlab      | p             | p               | Supported.
  X10, Asynchronous PGAS | p             | q               | Supported.

Introduction to PGAS - UPC
Slides adapted from some by Tarek El-Ghazawi (GWU), Kathy Yelick (UC Berkeley), and Adam Leko (U of Florida)

UPC Overview
1. Background
2. UPC memory/execution model
3. Data and pointers
4. Dynamic memory management
5. Work distribution/synchronization

What is UPC?
• UPC - Unified Parallel C
– An explicitly parallel extension of ANSI C
– A distributed shared memory parallel programming language
• Similar to the C language philosophy
– Programmers are clever and careful, and may need to get close to the hardware
• to get performance, but
• can get in trouble
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C

Players in the UPC field
• UPC consortium of government, academia, and HPC vendors, including:
– ARSC, Compaq, CSC, Cray Inc., Etnus, GWU, HP, IBM, IDA CSC, Intrepid Technologies, LBNL, LLNL, MTU, NSA, UCB, UMCP, UF, US DoD, US DoE, OSU
– See http://upc.gwu.edu for more details

Hardware support
• Many UPC implementations are available
– Cray: X1, X1E
– HP: AlphaServer SC and Linux Itanium (Superdome) systems
– IBM: BlueGene and AIX
– Intrepid GCC: SGI IRIX, Cray T3D/E, Linux Itanium and x86/x86-64 SMPs
– Michigan MuPC: "reference" implementation
– Berkeley UPC Compiler: just about everything else

General view
• A collection of threads operates in a partitioned global address space that is logically distributed among the threads. Each thread has affinity with a portion of the globally shared address space. Each thread also has a private space.
• Elements in the partitioned global space belonging to a thread are said to have affinity to that thread.

First example: sequential vector addition
  //vect_add.c
  #define N 1000
  int v1[N], v2[N], v1plusv2[N];
  void main() {
    int i;
    for (i=0; i<N; i++)
      v1plusv2[i] = v1[i] + v2[i];
  }

First example: parallel vector addition
  //vect_add.c
  #include <upc.h>
  #define N 1000
  shared int v1[N], v2[N], v1plusv2[N];
  void main() {
    int i;
    upc_forall (i=0; i<N; i++; &v1plusv2[i])
      v1plusv2[i] = v1[i] + v2[i];
  }

UPC memory model
• A pointer-to-shared can reference all locations in the shared space
• A pointer-to-local ("plain old C pointer") may only reference addresses in its private space or addresses in its own portion of the shared space
• Static and dynamic memory allocation is supported for both shared and private memory

UPC execution model
• A number of threads working independently in SPMD fashion
– Similar to MPI
– MYTHREAD specifies the thread index (0..THREADS-1)
– The number of threads is specified at compile time or run time
• Synchronization only when needed
– Barriers
– Locks
– Memory consistency control

Shared scalar and array data
• Shared array elements and blocks can be spread across the threads
– shared int x[THREADS]     /* One element per thread */
– shared int y[10][THREADS] /* 10 elements per thread */
• Scalar data declarations
– shared int a; /* One item in the global space (affinity to thread 0) */
– int b;        /* One private b on each thread */

Shared and private data
• Example (assume THREADS = 3):
  shared int x;          /* x will have affinity to thread 0 */
  shared int y[THREADS];
  int z;
• The resulting layout: x and y[0] reside on thread 0, y[1] on thread 1, y[2] on thread 2, and each thread holds its own private copy of z.
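To make the affinity rules above concrete, here is a minimal UPC sketch (my illustration, not from the original slides). It declares shared and private data as in the example, fills a default-blocked shared array with the owner-computes rule, and has thread 0 report each element's affinity. It uses only constructs covered in this module (MYTHREAD, THREADS, upc_forall, upc_barrier, upc_threadof); the array size N is an arbitrary choice.

  /* affinity_demo.upc -- illustrative sketch, not part of the original slides */
  #include <upc.h>
  #include <stdio.h>

  #define N 8

  shared int a[N];   /* default block size 1: a[i] has affinity to thread i % THREADS */
  shared int s;      /* single shared scalar, affinity to thread 0 */
  int priv;          /* one private copy per thread */

  int main() {
      int i;
      /* Owner-computes initialization: each iteration runs on the thread
         that has affinity to a[i], so every write below is a local access. */
      upc_forall (i = 0; i < N; i++; &a[i])
          a[i] = MYTHREAD;

      upc_barrier;   /* make all initializations visible before reading */

      if (MYTHREAD == 0)
          for (i = 0; i < N; i++)
              printf("a[%d] written by thread %d (affinity: thread %d)\n",
                     i, a[i], (int) upc_threadof(&a[i]));
      return 0;
  }

Note that thread 0's reads of a[i] in the final loop may be nonlocal accesses; the language allows them through the shared array reference, which is exactly the PGAS convenience (and cost) discussed above.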
Blocking of shared arrays
• The default block size is 1
• Shared arrays can be distributed on a block-per-thread basis, round robin, with arbitrary block sizes
• A block size is specified in the declaration as follows:
– shared [block-size] array[N];
– e.g.: shared [4] int a[16];

UPC pointers
• Pointer declaration:
– shared int *p;
• p is a pointer to an integer residing in the shared memory space
• p is called a pointer-to-shared
• Other pointers are declared the same as in C
– int *ptr;
– A "pointer-to-local" or "plain old C pointer" can be used to access private data and shared data with affinity to MYTHREAD

Pointers in UPC
• The C pointer library functions (malloc, free, memcpy, memset) are extended to shared memory in UPC (upc_alloc, upc_free, upc_memcpy, upc_memset)

Work sharing with upc_forall()
• Distributes independent iterations
• Each thread gets a bunch of iterations
• The affinity (expression) field determines how the work is distributed
• Simple C-like syntax and semantics:
  upc_forall (init; test; loop; affinity-expression) statement;
• Function of note: upc_threadof(shared void *ptr) returns the number of the thread that has affinity to the pointer-to-shared

Synchronization
• No implicit synchronization among the threads
• UPC provides the following synchronization mechanisms:
– Barriers
– Locks
– Fence
– Spinlocks (using the memory consistency model)

Introduction to PGAS - pMatlab
Credit: Slides based on some from Jeremy Kepner, http://www.ll.mit.edu/mission/isr/pmatlab/pmatlab.html

pMatlab Overview
• pMatlab and PGAS
• pMatlab execution (SPMD)
– Replicated arrays
• Distributed arrays
– Maps
– Local components

Not real PGAS
• PGAS – Partitioned Global Address Space
• MATLAB doesn't expose an address space
– Implicit memory management
– The user declares arrays and the MATLAB interpreter allocates/frees the memory
• So, when I say PGAS in MATLAB, I mean
– Running multiple copies of the interpreter
– Distributed arrays: allocating a single (logical) array as a collection of local (physical) array components
• Multiple implementations
– Open source: MIT Lincoln Labs' pMatlab + OSC bcMPI
– Commercial: Mathworks' Parallel Computing Toolbox, Interactive Supercomputing (now Microsoft) Star-P
– http://www.osc.edu/bluecollarcomputing/applications/bcMPI/index.shtml

Serial Program (Matlab)
  X = zeros(N,N);
  Y = zeros(N,N);
  Y(:,:) = X + 1;
• Matlab is a high-level language
• Allows mathematical expressions to be written concisely
• Multi-dimensional arrays are fundamental to Matlab

Parallel Execution (pMatlab, Pid = 0 … Np-1)
  X = zeros(N,N);
  Y = zeros(N,N);
  Y(:,:) = X + 1;
• Run NP (or Np) copies of the same program
– Single Program Multiple Data (SPMD)
• Each copy has a unique PID (or Pid)
• Every array is replicated on each copy of the program

Distributed Array Program (pMatlab, Pid = 0 … Np-1)
  XYmap = map([Np 1],{},0:Np-1);
  X = zeros(N,N,XYmap);
  Y = zeros(N,N,XYmap);
  Y(:,:) = X + 1;
• Use map to make a distributed array
• Tells the program which dimension to distribute the data along
• Each program implicitly operates only on its own data (owner-computes rule)

Explicitly Local Program (pMatlab)
  XYmap = map([Np 1],{},0:Np-1);
  Xloc = local(zeros(N,N,XYmap));
  Yloc = local(zeros(N,N,XYmap));
  Yloc(:,:) = Xloc + 1;
• Use the local function to explicitly retrieve the local part of a distributed array
• The operation is the same as in the serial program, but with different data in each process (recommended approach)

Parallel Data Maps
  Xmap = map([Np 1],{},0:Np-1)    % split the first dimension (rows) among processes
  Xmap = map([1 Np],{},0:Np-1)    % split the second dimension (columns) among processes
  Xmap = map([Np/2 2],{},0:Np-1)  % split both dimensions
• A map is a mapping of array indices to processes
• Can be block, cyclic, block-cyclic, or block with overlap
• Use map to set which dimension to split among processes
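The map variants above differ only in how array indices are assigned to processes. As a language-neutral illustration (a C sketch of my own, not part of pMatlab), the owner of a 1-D index under a block or cyclic distribution can be computed as follows; the function names and the ceiling-division chunking are assumptions for illustration.

  /* owner.c -- hypothetical helper showing block vs. cyclic index-to-process maps */
  #include <stdio.h>

  /* Block distribution: the first ceil(n/np) indices go to process 0, the next chunk to process 1, ... */
  int block_owner(int i, int n, int np) {
      int chunk = (n + np - 1) / np;   /* ceiling division */
      return i / chunk;
  }

  /* Cyclic distribution: indices are dealt out round robin. */
  int cyclic_owner(int i, int np) {
      return i % np;
  }

  int main(void) {
      int n = 8, np = 4;
      for (int i = 0; i < n; i++)
          printf("index %d: block -> P%d, cyclic -> P%d\n",
                 i, block_owner(i, n, np), cyclic_owner(i, np));
      return 0;
  }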
Maps and Distributed Arrays
A process map for a numerical array is an assignment of blocks of data to processes.
  Amap = map([Np 1], {}, 0:Np-1);
• [Np 1] is the process grid, {} is the distribution ({} = default = block), and 0:Np-1 is the list of processes
pMatlab constructors are overloaded to take a map as an argument and return a distributed array:
  A = zeros(4,6,Amap);
• With Np = 4, the 4x6 array of zeros is distributed one row per process: P0 owns row 1, P1 row 2, P2 row 3, P3 row 4

Parallelizing Loops
• The set of loop index values is known as the iteration space
• In parallel programming, a set of processes cooperates in order to complete a single task
• To parallelize a loop, we must split its iteration space among the processes

loopSplit Construct
• parfor is a neat construct that is supported by Mathworks' PCT
• pMatlab's equivalent is called loopSplit
• Why loopSplit and not parfor? That is a subtle question…

Global View vs. Local View
• In parallel programming, a set of processes cooperates in order to complete a single task
• The global view of the program refers to actions and data from the task perspective
– OpenMP programming is an example of global view
• parfor is a global-view construct

Global View vs. Local View (cont'd)
• The local view of the program refers to actions and data within an individual process
• Single Program, Multiple Data (SPMD) programs provide a local view
– Each process is an independent execution of the same program
– MPI programming is an example of SPMD
• ParaM uses SPMD
• loopSplit is the SPMD equivalent of parfor

loopSplit Example
• Monte Carlo approximation of π
• Algorithm
– Consider a circle of radius 1
– Let N = some large number (say 10000) and count = 0
– Repeat the following procedure N times
• Generate two random numbers x and y between 0 and 1 (use the rand function)
• Check whether (x,y) lies inside the circle
• Increment count if it does
– pi_value = 4 * count / N

Monte Carlo Example: Serial Code
  t0 = clock;
  N = 1000; count = 0; radius = 1;
  fprintf('Number of iterations : %.0f\n', N);
  for k = 1:N
      % Generate two numbers between 0 and 1
      p = rand(1,2);
      % i.e. test for the condition : x^2 + y^2 < 1
      if sum(p.^2) < radius
          % Point is inside circle : increment count
          count = count + 1;
      end
  end
  pival = 4*count/N;
  t1 = clock;
  fprintf('Calculated PI = %f\nError = %f\n', pival, abs(pi-pival));
  fprintf('Total time : %f seconds\n', etime(t1, t0));
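The parallel version below calls loopSplit(1, N, Pid, Np) to carve the iteration space into per-process ranges; the sample output further down shows 1..1000000 split into four equal chunks. The following C sketch shows one way such a split can be computed. It is an illustration of the idea only, not pMatlab's actual implementation, and the handling of remainders is my assumption.

  /* loop_split.c -- hypothetical sketch of loopSplit-style range partitioning */
  #include <stdio.h>

  /* Split the inclusive range [low, high] into np nearly equal chunks;
     chunk pid (0-based) gets [*lo, *hi]. Earlier chunks absorb any remainder. */
  void loop_split(int low, int high, int pid, int np, int *lo, int *hi) {
      int n = high - low + 1;
      int base = n / np, rem = n % np;
      *lo = low + pid * base + (pid < rem ? pid : rem);
      *hi = *lo + base - 1 + (pid < rem ? 1 : 0);
  }

  int main(void) {
      int lo, hi;
      for (int pid = 0; pid < 4; pid++) {
          loop_split(1, 1000000, pid, 4, &lo, &hi);
          printf("Process %d begins %d ends %d\n", pid, lo, hi);  /* matches the slide output */
      }
      return 0;
  }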
Monte Carlo Example: Parallel Code
  if (PARALLEL)
      rand('state', Pid+1);
  end
  N = 1000000; count = 0; radius = 1;
  fprintf('Number of iterations : %.0f\n', N);
  [local_low, local_hi] = loopSplit(1, N, Pid, Np);
  fprintf('Process \t%i\tbegins %i\tends %i\n', Pid, local_low, ...
          local_hi);
  for k = local_low:local_hi
      % Here, p(x,y) represents a point in the x-y space
      p = rand(1,2);
      % i.e. test for the condition : x^2 + y^2 < 1
      if sum(p.^2) < radius
          count = count + 1;
      end
  end

Monte Carlo Example: Parallel Output
  Number of iterations : 1000000
  Process 0   begins 1        ends 250000
  Process 1   begins 250001   ends 500000
  Process 2   begins 500001   ends 750000
  Process 3   begins 750001   ends 1000000
  Calculated PI = 3.139616
  Error = 0.001977

Monte Carlo Example: Total Count
  if (PARALLEL)
      map1 = map([Np 1], {}, 0:Np-1);
  else
      map1 = 1;
  end
  answers = zeros(Np, 1, map1);
  my_answer = local(answers);
  my_answer(1,1) = count;
  answers = put_local(answers, my_answer);
  global_answers = agg(answers);
  if (Pid == 0)
      global_count = 0;
      for i = 1:Np
          global_count = global_count + global_answers(i,1);
      end
      pival = 4*global_count/N;
      fprintf('PI = %f\nError = %f\n', pival, abs(pi-pival));
  end

Introduction to PGAS - APGAS and the X10 Language
Credit: Slides based on some from David Grove, et al., http://x10.codehaus.org/Tutorials

Outline
• MASC architectures and APGAS
• X10 fundamentals
• Data distributions (points and regions)
• Concurrency constructs
• Synchronization constructs
• Examples

What is X10?
• X10 is a new language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS)
• X10 is an instance of the APGAS framework in the Java family
• X10's goals:
– more productive than current models
– support high levels of abstraction
– exploit multiple levels of parallelism and non-uniform data access
– suitable for multiple architectures and multiple workloads
X10 Project Status
• X10 is an open source project (Eclipse Public License)
– Documentation, releases, mailing lists, code, etc. are all publicly available via http://x10-lang.org
• X10RT: X10 Runtime in X10 (14 kloc and growing)
• X10 2.0.4 released June 14, 2010
– Java: single process (all places in one JVM)
• any platform with Java 5
– C++: multi-process (one place per process)
• AIX, Linux, Cygwin, MacOS X
• x86, x86_64, PowerPC, Sparc
• x10rt: APGAS runtime (binary only) or MPI (open source)

Multicore/Accelerator multiSpace Computing (MASC)
• Cluster of nodes
• Each node has
– Multicore processors
• 2 to 4 sockets/board now
• 2, 4, 8 cores/socket now
– A manycore accelerator
• Discrete device (GPU)
• Integrated with the CPU (Intel "Knights Corner")
• Multiple memory spaces
– Per-node memory (accessible by the local cores)
– Per-accelerator memory

Multicore/Accelerator multiSpace Computing (MASC)
• Achieving high performance requires detailed, system-dependent specification of data placement and movement
• Programmability challenges: programs must
– exhibit multiple levels of parallelism
– synchronize data motion across multiple memories
– regularly overlap computation with communication

Every parallel architecture has a dominant programming model

  Parallel Architecture    | Programming Model
  Vector machine (Cray 1)  | Loop vectorization (IVDEP)
  SIMD machine (CM-2)      | Data parallel (C*)
  SMP machine (SGI Origin) | Threads (OpenMP)
  Clusters (IBM 1350)      | Message passing (MPI)
  GPGPU (nVidia Tesla)     | Data parallel (CUDA)
  MASC                     | Asynchronous PGAS?

• MASC options
– Pick a single model (MPI, OpenMP)
– Hybrid code
• MPI at the node level
• OpenMP at the core level
• CUDA at the accelerator
– Find a higher-level abstraction and map it to the hardware

X10 Concepts
• Asynchronous PGAS
– A PGAS model in which threads can be dynamically created under programmer control
– p distinct memories, q distinct threads (p <> q)
• PGAS memories are called places in X10
• PGAS threads are called activities in X10
• Asynchronous extensions for other PGAS languages (UPC, CAF) are entirely possible…
• Two basic ideas: Places (memories) and Activities (threads)

X10 Constructs
• Fine-grained concurrency: async S
• Atomicity: atomic S, when (c) S
• Place-shifting operations: at (P) S
• Ordering: finish S, clock
• Global data structures: points, regions, distributions, arrays

Overview of Features
• Many sequential features of Java inherited unchanged
– Classes (w/ single inheritance)
– Interfaces (w/ multiple inheritance)
– Instance and static fields
– Constructors, (static) initializers
– Overloaded, overridable methods
– Garbage collection
• Structs
• Closures
• Points, Regions, Distributions, Arrays
• Substantial extensions to the type system
– Dependent types
– Generic types
– Function types
– Type definitions, inference
• Concurrency
– Fine-grained concurrency: async (p,l) S
– Atomicity: atomic (s)
– Ordering: L: finish S
– Data-dependent synchronization: when (c) S

Points and Regions
• A point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates, e.g., [5], [1, 2], …
• A point variable can hold values of different ranks, e.g.,
– var p: Point = [1]; p = [2,3]; …
• Operations
– p1.rank
• returns the rank of point p1
– p1(i)
• returns element (i mod p1.rank) if i < 0 or i >= p1.rank
– p1 < p2, p1 <= p2, p1 > p2, p1 >= p2
• returns true iff p1 is lexicographically <, <=, >, or >= p2
• only defined when p1.rank and p2.rank are equal
• Regions are collections of points of the same dimension
• Rectangular regions have a simple representation, e.g. [1..10, 3..40]
• A rich algebra over regions is provided

Distributions and Arrays
• Distributions specify the mapping of points in a region to places
– E.g. Dist.makeBlock(R)
– E.g. Dist.makeUnique()
• Arrays are defined over a distribution and a base type
– A:Array[T]
– A:Array[T](d)
• Arrays are created through initializers
– Array.make[T](d, init)
• Arrays are mutable (immutable arrays are being considered)
• Array operations
– A.rank ::= number of dimensions in the array
– A.region ::= index region (domain) of the array
– A.dist ::= distribution of array A
– A(p) ::= element at point p, where p belongs to A.region
– A(R) ::= restriction of the array onto region R
• Useful for extracting subarrays

async
• async S
– Creates a new child activity that executes statement S
– Returns immediately
– S may reference final variables in enclosing blocks
– Activities cannot be named
– An activity cannot be aborted or cancelled
• Stmt ::= async(p,l) Stmt
• cf. Cilk's spawn

  // Compute the Fibonacci
  // sequence in parallel.
  def run() {
    if (r < 2) return;
    val f1 = new Fib(r-1),
        f2 = new Fib(r-2);
    finish {
      async f1.run();
      f2.run();
    }
    r = f1.r + f2.r;
  }

finish
• L: finish S
– Execute S, but wait until all (transitively) spawned asyncs have terminated
• Rooted exception model
– Trap all exceptions thrown by spawned activities
– Throw an (aggregate) exception if any spawned async terminates abruptly
– Implicit finish at the main activity
• finish is useful for expressing "synchronous" operations on (local or) remote data
• Stmt ::= finish Stmt
• cf. Cilk's sync
• In the Fibonacci example above, the finish block guarantees that f1.run() has completed before r = f1.r + f2.r executes

at
• Stmt ::= at(p) Stmt
• at(p) S
– Creates an activity to execute statement S at place p
– The calling activity is blocked until S completes

  // Copy field f from a to b
  def copyRemoteFields(a, b) {
    at (b.loc) b.f =
      at (a.loc) a.f;
  }
  // Increment field f of obj
  def incField(obj, inc) {
    at (obj.loc) obj.f += inc;
  }
  // Invoke method m on obj
  def invoke(obj, arg) {
    at (obj.loc) obj.m(arg);
  }

atomic
• atomic S
– Execute statement S atomically within that place
– Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity
• An atomic block body (S) ...
– must be nonblocking
– must not create concurrent activities (sequential)
– must not access remote data (local)
• Stmt ::= atomic Statement
• MethodModifier ::= atomic

  // target defined in lexically
  // enclosing scope.
  atomic def CAS(old:Object, n:Object) {
    if (target.equals(old)) {
      target = n;
      return true;
    }
    return false;
  }

  // push data onto concurrent
  // list-stack
  val node = new Node(data);
  atomic {
    node.next = head;
    head = node;
  }

Clocks: Motivation
• Activity coordination using finish is accomplished by checking for activity termination
• But in many cases activities have a producer-consumer relationship, and "barrier"-like coordination is needed without waiting for activity termination
– The activities involved may be in the same place or in different places
• Clocks are designed to offer determinate and deadlock-free coordination between a dynamically varying number of activities
• (Diagram: activities 0, 1, 2, … advance together through clock phases 0, 1, …)
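Before the clock operations themselves, a familiar reference point: the fixed-membership special case of this phased coordination is an ordinary barrier. The C sketch below uses a POSIX barrier to step a fixed set of workers through phases together; this comparison is mine, not from the slides, and X10 clocks generalize it by letting the set of registered activities grow and shrink and by allowing an activity to be registered on several clocks.

  /* phases.c -- pthread sketch of barrier-style phase coordination (fixed worker count) */
  #include <pthread.h>
  #include <stdio.h>

  #define WORKERS 4
  #define PHASES  3

  static pthread_barrier_t bar;

  static void *worker(void *arg) {
      long id = (long) arg;
      for (int phase = 0; phase < PHASES; phase++) {
          printf("worker %ld doing phase %d\n", id, phase);  /* this phase's work */
          pthread_barrier_wait(&bar);  /* roughly plays the role of X10's `next` */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t[WORKERS];
      pthread_barrier_init(&bar, NULL, WORKERS);
      for (long i = 0; i < WORKERS; i++)
          pthread_create(&t[i], NULL, worker, (void *) i);
      for (int i = 0; i < WORKERS; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&bar);
      return 0;
  }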
Clocks: Main operations
• var c = Clock.make();
– Allocate a clock and register the current activity with it. Phase 0 of c starts.
• async(…) clocked (c1,c2,…) S
• ateach(…) clocked (c1,c2,…) S
• foreach(…) clocked (c1,c2,…) S
– Create async activities registered on clocks c1, c2, …
• c.resume();
– Nonblocking operation that signals completion of the current activity's work for this phase of clock c
• next;
– Barrier: suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.
– next can be viewed as a "finish" of all computations under way in the current phase of the clock

Fundamental X10 Property
• Programs written using async, finish, at, atomic, and clock cannot deadlock
• Intuition: there cannot be a cycle in the waits-for graph

2D Heat Conduction Problem
• Based on the 2D partial differential equation (1), the 2D heat conduction problem is similar to a 4-point stencil operation, as seen in (2):
  (1)  ∂T/∂t = ∂²T/∂x² + ∂²T/∂y²
  (2)  T^{t+1}_{i,j} = ( T^t_{i+1,j} + T^t_{i-1,j} + T^t_{i,j+1} + T^t_{i,j-1} ) / 4
• Because of the time steps, typically two grids are used (step t+1 is computed into a second grid from the step-t values)

Heat Transfer in Pictures
• An n x n grid A with one boundary row held at 1.0; each interior point is repeatedly replaced by the average of its 4 neighbors, until the maximum change is less than ε

Heat transfer in X10
• X10 permits smooth variation between multiple concurrency styles
– "High-level" ZPL-style (operations on global arrays)
• Chapel "global view" style
• Expressible, but relies on "compiler magic" for performance
– OpenMP style
• Chunking within a single place
– MPI style
• SPMD computation with explicit all-to-all reduction
• Uses clocks
– "OpenMP within MPI" style
• For hierarchical parallelism
• Fairly easy to derive from the ZPL-style program

Heat Transfer in X10 – ZPL style
  class Stencil2D {
    static type Real=Double;
    const n = 6, epsilon = 1.0e-5;
    const BigD = Dist.makeBlock([0..n+1, 0..n+1], 0),
          D = BigD | [1..n, 1..n],
          LastRow = [0..0, 1..n] as Region;
    const A = Array.make[Real](BigD, (p:Point)=>(LastRow.contains(p)?1:0));
    const Temp = Array.make[Real](BigD);
    def run() {
      var delta:Real;
      do {
        finish ateach (p in D)
          Temp(p) = A(p.stencil(1)).reduce(Double.+, 0.0)/4;
        delta = (A(D) - Temp(D)).lift(Math.abs).reduce(Math.max, 0.0);
        A(D) = Temp(D);
      } while (delta > epsilon);
    }
  }

Heat Transfer in X10 – ZPL style
• Cast in fork-join style rather than SPMD style
– The compiler needs to transform it into SPMD style
• The compiler needs to chunk iterations per place
– Fine-grained iteration has too much overhead
• The compiler needs to generate code for distributed array operations
– Create temporary global arrays, hoist them out of the loop, etc.
• Uses implicit syntax to access remote locations
• Simple to write, tough to implement efficiently

Heat Transfer in X10 – II
  def run() {
    val D_Base = Dist.makeUnique(D.places());
    var delta:Real;
    do {
      finish ateach (z in D_Base)
        for (p in D | here)
          Temp(p) = A(p.stencil(1)).reduce(Double.+, 0.0)/4;
      delta = (A(D) - Temp(D)).lift(Math.abs).reduce(Math.max, 0.0);
      A(D) = Temp(D);
    } while (delta > epsilon);
  }
• Flat parallelism: assume one activity per place is desired
• D.places() returns the ValRail of places in D
– Dist.makeUnique(D.places()) returns a unique distribution (one point per place) over the given ValRail of places
• D | x returns the sub-region of D at place x
• Explicit loop chunking
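All of the X10 variants above and below implement the same underlying Jacobi-style update of equation (2). As a point of reference, here is a compact sequential sketch in C (my own illustration, not from the slides); the grid size, epsilon, and the one boundary row held at 1.0 follow the ZPL-style listing above.

  /* jacobi2d.c -- sequential reference for the 4-point stencil of equation (2) */
  #include <math.h>
  #include <stdio.h>

  #define N 6              /* interior is N x N; indices 0..N+1 include the boundary */
  #define EPSILON 1.0e-5

  int main(void) {
      static double A[N+2][N+2], Temp[N+2][N+2];
      /* Boundary condition as in the ZPL-style listing: one boundary row held at 1.0 */
      for (int j = 1; j <= N; j++) A[0][j] = 1.0;

      double delta;
      do {
          delta = 0.0;
          /* Equation (2): each interior point becomes the average of its 4 neighbors */
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++) {
                  Temp[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4.0;
                  double d = fabs(Temp[i][j] - A[i][j]);
                  if (d > delta) delta = d;
              }
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  A[i][j] = Temp[i][j];
      } while (delta > EPSILON);   /* repeat until max change < epsilon */

      printf("converged: A[1][1] = %f\n", A[1][1]);
      return 0;
  }

The X10 versions that follow differ only in how the two loop nests and the delta reduction are divided across places and activities.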
Heat Transfer in X10 – III
  def run() {
    val D_Base = Dist.makeUnique(D.places());
    val blocks = DistUtil.block(D, P);
    var delta:Real;
    do {
      finish ateach (z in D_Base)
        foreach (q in 1..P)
          for (p in blocks(here,q))
            Temp(p) = A(p.stencil(1)).reduce(Double.+, 0.0)/4;
      delta = (A(D) - Temp(D)).lift(Math.abs).reduce(Math.max, 0.0);
      A(D) = Temp(D);
    } while (delta > epsilon);
  }
• Hierarchical parallelism: P activities at place x
– It is easy to change the above code so that P can vary with x
• DistUtil.block(D,P)(x,q) is the region allocated to the q'th activity in place x (block-block division)
• Explicit loop chunking with hierarchical parallelism

Heat Transfer in X10 – IV
  def run() {
    finish async {
      val c = clock.make();
      val D_Base = Dist.makeUnique(D.places());
      val diff = Array.make[Real](D_Base),
          scratch = Array.make[Real](D_Base);
      ateach (z in D_Base) clocked(c)   // one activity per place == MPI task
        do {
          diff(z) = 0.0;
          for (p in D | here) {
            Temp(p) = A(p.stencil(1)).reduce(Double.+, 0.0)/4;
            diff(z) = Math.max(diff(z), Math.abs(A(p) - Temp(p)));
          }
          next;                         // akin to a UPC barrier
          A(D | here) = Temp(D | here);
          reduceMax(z, diff, scratch);
        } while (diff(z) > epsilon);
    }
  }
• reduceMax() performs an all-to-all max reduction
• SPMD with all-to-all reduction == MPI style

Heat Transfer in X10 – V
  def run() {
    finish async {
      val c = clock.make();
      val D_Base = Dist.makeUnique(D.places());
      val diff = Array.make[Real](D_Base),
          scratch = Array.make[Real](D_Base);
      ateach (z in D_Base) clocked(c)
        foreach (q in 1..P) clocked(c) {
          var myDiff:Real = 0;
          do {
            if (q==1) { diff(z) = 0.0; };
            myDiff = 0;
            for (p in blocks(here,q)) {
              Temp(p) = A(p.stencil(1)).reduce(Double.+, 0.0)/4;
              myDiff = Math.max(myDiff, Math.abs(A(p) - Temp(p)));
            }
            atomic diff(z) = Math.max(myDiff, diff(z));
            next;
            A(blocks(here,q)) = Temp(blocks(here,q));
            if (q==1) reduceMax(z, diff, scratch);
            next;
            myDiff = diff(z);
            next;
          } while (myDiff > epsilon);
        }
    }
  }
• "OpenMP within MPI" style

Heat Transfer in X10 – VI
• All previous versions permit fine-grained remote access
– Used to access boundary elements
• It is much more efficient to transfer boundary elements in bulk between clock phases
• This may be done by allocating an extra "ghost" boundary at each place
– API extension: Dist.makeBlock(D, P, f)
• D: distribution, P: processor grid, f: region-to-region transformer
• The reduceMax() phase is overlapped with the ghost distribution phase
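To make the ghost-boundary idea in version VI concrete, here is a small C sketch (my own illustration, with hypothetical names and a shared-memory stand-in for places) of a row-blocked grid in which each place keeps one extra ghost row above and below its block. Before each update phase the real boundary rows are copied into the neighbors' ghost rows in bulk, instead of being fetched element by element during the stencil.

  /* ghost_rows.c -- hypothetical sketch of bulk ghost-row exchange for a row-blocked grid */
  #include <string.h>
  #include <stdio.h>

  #define N      8                 /* global interior is N x N                          */
  #define PLACES 4                 /* number of row blocks ("places")                   */
  #define ROWS   (N / PLACES)      /* interior rows per place (assumes N % PLACES == 0) */

  /* Each place stores its rows plus one ghost row above and one below. */
  static double block[PLACES][ROWS + 2][N + 2];
  static double fresh[PLACES][ROWS + 2][N + 2];

  /* Bulk exchange: copy real boundary rows into the neighbors' ghost rows.
     In a distributed run this would be one message per neighbor per phase. */
  static void exchange_ghosts(void) {
      for (int p = 0; p < PLACES; p++) {
          if (p > 0)              /* my first real row -> lower ghost row of place p-1 */
              memcpy(block[p-1][ROWS+1], block[p][1], sizeof block[p][1]);
          if (p < PLACES - 1)     /* my last real row -> upper ghost row of place p+1  */
              memcpy(block[p+1][0], block[p][ROWS], sizeof block[p][ROWS]);
      }
  }

  int main(void) {
      /* Fixed boundary: the top ghost row of place 0 plays the role of the 1.0 row. */
      for (int j = 1; j <= N; j++) block[0][0][j] = 1.0;

      for (int step = 0; step < 10; step++) {          /* a few fixed sweeps for illustration */
          exchange_ghosts();
          for (int p = 0; p < PLACES; p++)             /* each place updates only its own rows */
              for (int i = 1; i <= ROWS; i++)
                  for (int j = 1; j <= N; j++)
                      fresh[p][i][j] = (block[p][i-1][j] + block[p][i+1][j] +
                                        block[p][i][j-1] + block[p][i][j+1]) / 4.0;
          for (int p = 0; p < PLACES; p++)
              for (int i = 1; i <= ROWS; i++)
                  for (int j = 1; j <= N; j++)
                      block[p][i][j] = fresh[p][i][j];
      }
      printf("block[0][1][1] = %f\n", block[0][1][1]);
      return 0;
  }

In the X10 version, this bulk copy corresponds to the ghost distribution phase that slide VI says is overlapped with the reduceMax() phase.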