X10 Tutorial IBM Research X10 Tutorial http://x10.sf.net 1 Vijay Saraswat (based on tutorial co-authored with Vivek Sarkar, Christoph von Praun, Nate Nystrom, Igor Peshansky) August 2008 This material is based upon work supported by the Defense Advanced IBM Research Research Projects Agency under its Agreement No. HR0011-07-9-0002. © 2007 IBM Corporation X10 Tutorial Acknowledgements IBM Research 2 X10 Core Team (IBM) – Ganesh Bikshandi, Sreedhar Kodali, Nathaniel Nystrom, Igor Peshansky, Vijay Saraswat, Pradeep Varma, Sayantan Sur, Olivier Tardieu, Krishna Venkat, Tong Wen, Jose Castanos, Ankur Narang, Komondoor Raghavan X10 Tools Recent Publications 1. 2. 3. – Philippe Charles, Robert Fuhrer 4. Emeritus 5. 6. 7. – Kemal Ebcioglu, Christian Grothoff, Vincent Cave, Lex Spoon, Christoph von Praun, Rajkishore Barik, Chris Donawa, Allan Kielstra 8. Research colleagues – Vivek Sarkar, Rice U – Satish Chandra,Guojing Cong – Ras Bodik, Guang Gao, Radha Jagadeesan, Jens Palsberg, Rodric Rabbah, Jan Vitek – Vinod Tipparaju, Jarek Nieplocha (PNNL) – Kathy Yelick, Dan Bonachea (Berkeley) – Several others at IBM 9. 10. 11. 12. “Solving large, irregular graph problems using adaptive work-stealing”, to appear in ICPP 2008. “Constrained types for OO languages”, to appear in OOPSLA 2008. “Type Inference for Locality Analysis of Distributed Data Structures”, PPoPP 2008. “Deadlock-free scheduling of X10 Computations with bounded resources”, SPAA 2007 “A Theory of Memory Models”, PPoPP 2007. “May-Happen-in-Parallel Analysis of X10 Programs”, PPoPP 2007. “An annotation and compiler plug-in system for X10”, IBM Technical Report, Feb 2007. “Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities” Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006. "An Experiment in Measuring the Productivity of Three Parallel Programming Languages”, P-PHEC workshop, February 2006. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", OOPSLA conference, October 2005. "Concurrent Clustered Programming", CONCUR conference, August 2005. "X10: an Experimental Language for High Productivity Programming of Scalable Systems", P-PHEC workshop, February 2005. Tutorials TiC 2006, PACT 2006, OOPSLA 2006, PPoPP 2007, SC 2007 Graduate course on X10 at U Pisa (07/07) Graduate course at Waseda U (Tokyo, 04/08) © 2007 IBM Corporation X10 Tutorial Acknowledgements (contd.) X10 is an open source project (Eclipse Public License). Reference implementation in Java, runs on any Java 5 VM. – Windows/Intel, Linux/Intel – AIX/PPC, Linux/PPC – Runs on multiprocessors IBM Research X10Flash project --- cluster implementation of X10 under development at IBM 3 – Translation of X10 to C + LAPI – Will not be discussed in today’s tutorial – Contact: Vijay Saraswat, vsaraswa@us.ibm.com Website: http://x10.sf.net Website contains – – – – – – Language specification Tutorial material Presentations Download instructions Copies of some papers Pointers to mailing list This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-90002. © 2007 IBM Corporation X10 Tutorial Outline 1. 2. 3. IBM Research 4. 4 5. Clocks What is X10? • creation, registration, next, • background, status resume, drop, Basic X10 (single place) ClockUseException • async, finish, atomic 6. Basic serial constructs that • future, force differ from Java Basic X10 (arrays & loops) • const, nullable, extern • points, rectangular regions, 7. Advanced topics arrays • Value types, conditional • for, foreach atomic sections (when), Scalable X10 (multiple places) general regions & • places, distributions, distributed distributions arrays, ateach, • Refer to language spec for BadPlaceException details © 2007 IBM Corporation X10 Tutorial IBM Research What is X10? 5 X10 is a new experimental language developed in the IBM PERCS project as part of the DARPA program on High Productivity Computing Systems (HPCS) X10’s goal is to provide a new parallel programming model and its embodiment in a high level language that: 1. is more productive than current models, 2. can support higher levels of abstraction better than current models, and 3. can exploit the multiple levels of parallelism and non-uniform data access that are critical for obtaining scalable performance in current and future HPC systems © 2007 IBM Corporation X10 Tutorial The X10 RoadMap DARPA milestones Internal milestone MS5 MS6 MS7 MS8 MS9 MS10 Libraries, APIs, Tools, user trials Trial at Rice U X10DT enhancements Concur. refactorings SSCA#2, UTS Other Apps (TBD) More refactorings Language Definition v1.7 spec v2.0 spec IBM Research JVM Implementation 6 v1.7 impl Initial debugger release v2.0 impl X10 Flash C/C++ Implementation v1.7 impl Rel 1 v1.7 impl Rel 2 v2.0 impl Port to PERCS h/w Design of X10 Implementation for PERCS Multi-process debugger Advanced debugger features P u b li c R e l e a s e o f X 1 0 2 . 0 s y s t e m © 2007 IBM Corporation X10 Tutorial Current X10 Environment: Unoptimized Single-process Implementation X10 source program --- must contain a class named Foo with a public static void main(String[] args) method Foo.x10 x10c Foo.x10 x10c Foo.class X10 compiler --- translates Foo.x10 to Foo.java, uses javac to generate Foo.class from Foo.java Foo.java IBM Research x10 Foo.x10 7 X10 Virtual Machine (JVM + J2SE libraries + X10 libraries + X10 Multithreaded Runtime) X10 Program Output X10 program translated into Java --// #line pseudocomment in Foo.java specifies source line mapping in Foo.x10 X10 extern interface External DLL’s X10 Abstract Performance Metrics (event counts, distribution efficiency) Caveat: this is a prototype implementation with many limitations. Please be patient! © 2007 IBM Corporation X10 Tutorial IBM Research Eclipse environment for X10 – X10DT 8 © 2007 IBM Corporation X10 Tutorial Future X10 Environment Parallelism through Domain Specific Unordered Languages (Streaming, Tasks …) X10 application programs IBM Research X10 Libraries 9 Implicit parallelism, Implicit data distributions X10 places --- abstraction of explicit control & data distribution X10 Deployment Mapping of places to nodes in HPC Parallel Environment HPC Runtime Environment Primitive constructs for parallelism, communication, and synchronization (Parallel Environment, MPI, LAPI, …) HPC Parallel System Target system for parallel application © 2007 IBM Corporation X10 Tutorial The current architectural landscape SMP Node SMP Node PEs, PEs, ... PEs, PEs, ... Memory ... Memory Interconnect Power6 Clusters Blue Gene IBM Research I/O gateway nodes 10 (100’s of such cluster nodes) “Scalable Unit” Cluster Interconnect Switch/Fabric Multi-core w/ accelerators (IXP 2850) Road Runner: Cell-accelerated Opteron © 2007 IBM Corporation X10 Tutorial X10 vs. Java X10 is an extended subset of Java – Base language = Java 1.4 • Java 5 features (generics, metadata, etc.) are currently not supported in X10 – Notable features removed from Java • Concurrency --- threads, synchronized, etc. • Java arrays – replaced by X10 arrays IBM Research – Notable features added to Java 11 • Concurrency – async, finish, atomic, future, force, foreach, ateach, clocks • Distribution --- points, distributions • X10 arrays --- multidimensional distributed arrays, array reductions, array initializers, • Serial constructs --- nullable, const, extern, value types X10 supports both OO and non-OO programming paradigms © 2007 IBM Corporation X10 Tutorial x10.lang standard library Java package with “built in” classes that provide support for selected X10 constructs IBM Research 12 Standard types – boolean, byte, char, double, float, int, long, short, String x10.lang.Object -- root class for all instances of X10 objects x10.lang.clock --- clock instances & clock operations x10.lang.dist --- distribution instances & distribution operations x10.lang.place --- place instances & place operations x10.lang.point --- point instances & point operations x10.lang.region --- region instances & region operations All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), … Similarly, all X10 programs also implicitly import the java.lang.* package e.g., X10 programs can use Math.min() and Math.max() from java.lang In case of conflict (e.g. Integer), user must import the desired one explicitly, e.g. import java.lang.Integer; © 2007 IBM Corporation X10 Tutorial Calling foreign functions from X10 programs Java methods – Can be called directly from X10 programs – Java class will be loaded automatically as part of X10 program execution – Basic rule: don’t call any method that can perform wait/notify or related thread operations • Calling synchronized methods is okay IBM Research C functions 13 – Need to use extern declaration in X10, and perform a System.loadLibrary() call © 2007 IBM Corporation X10 Tutorial Outline 1. 2. 3. IBM Research 4. 14 5. Clocks What is X10? • creation, registration, next, • background, status resume, drop, Basic X10 (single place) ClockUseException • async, finish, atomic 6. Basic serial constructs that • future, force differ from Java Basic X10 (arrays & loops) • const, nullable, extern • points, rectangular regions, 7. Advanced topics arrays • Value types, conditional • for, foreach atomic sections (when), Scalable X10 (multiple places) general regions & • places, distributions, distributed distributions arrays, ateach, • Refer to language spec for BadPlaceException details © 2007 IBM Corporation X10 Tutorial IBM Research Basic X10 (Single Place) 15 async = construct used to execute a statement in parallel as a new activity atomic = construct used to coordinate accesses to shared heap by multiple activities finish = construct used to check for global termination of statement and all the activities that it has created future = construct used to evaluate an expression in parallel as a new activity force = construct used to check for termination of future Core constructs used for intra-place (shared memory) parallel programming © 2007 IBM Corporation X10 Tutorial async IBM Research Stmt ::= async Stmt 16 async S Creates a new child activity that executes statement S Returns immediately S may reference final variables in enclosing blocks Activities cannot be named Activity cannot be aborted or cancelled cf Cilk’s spawn void run() { if (r < 2) return; final Fib f1 = new Fib(r-1), f2 = new Fib(r-2); finish { async f1.run(); f2.run(); } result = f1.r + f2.r; } See also TutAsync.x10 © 2007 IBM Corporation X10 Tutorial finish finish S Execute S, but wait until all (transitively) spawned asyncs have terminated. Stmt ::= finish Stmt cf Cilk’s sync void run() { IBM Research if (r < 2) return; 17 Rooted exception model Trap all exceptions thrown by spawned activities. Throw an (aggregate) exception if any spawned async terminates abruptly. implicit finish at main activity final Fib f1 = new Fib(r-1), f2 = new Fib(r-2); finish { async f1.run(); f2.run(); } result = f1.r + f2.r; finish is useful for expressing “synchronous” operations on (local or) remote data. } © 2007 IBM Corporation X10 Tutorial IBM Research Granularity of concurrency 18 Too many tasks can reduce performance. – Overhead for task creation, scheduling, interference, cache effects. Too few tasks may result in under-utilization. – Idle workers In some cases, workload can be divided equally among all tasks statically – E.g. array-based computations Some problems may need dynamic load-balancing – Work-stealing in X10 1.7. Identify tasks that can be performed in parallel – Use async to specify them Identify point in code by which tasks must be finished – Use finish If tasks will read/write shared variables during execution, ensure atomicity of updates – Use atomic In principle, create k*P tasks, where P is the number of processors available. © 2007 IBM Corporation X10 Tutorial Example programs Fibonacci NQueens Tree IBM Research How would you parallelize these computations? 19 © 2007 IBM Corporation X10 Tutorial Termination Local termination: Statement s terminates locally when activity has completed all its computation with respect to s. IBM Research Global termination: Local termination + activities that have been spawned by s terminated globally (recursive definition) 20 main function is root activity program terminates iff root activity terminates. (implicit finish at root activity) ‘daemon threads’ (child outlives root activity) not allowed in X10 © 2007 IBM Corporation X10 Tutorial Termination (Example) IBM Research termination start local global 21 public void main (String[] args) { ... finish { async { for () { async {... } } finish async {... } ... } } // finish } © 2007 IBM Corporation X10 Tutorial IBM Research Rooted computation X10 22 public void main (String[] args) { ... finish { async { Root-of hierarchy for () { async {... root activity } } finish async {... } ... } ... } // finish } ancestor relation © 2007 IBM Corporation X10 Tutorial IBM Research Rooted exception model 23 public void main (String[] args) { ... finish { root-of relation async { for () { async {... } } finish async {... } ... ... } } // finish exception flow along } root-of relation Propagation along the lexical scoping: Exceptions that are not caught inside an activity are propagated to the nearest suspended ancestor in the root-of relation. © 2007 IBM Corporation X10 Tutorial Example: rooted exception model (async) IBM Research int result = 0; try { finish { ateach (point [i]:dist.factory.unique()) { throw new Exception (“Exception from “+here.id) } throw new Exception(); result = 42; } // finish } catch (x10.lang.MultipleExceptions me) { System.out.print(me); } assert (result == 42); // always true 24 no exceptions are ‘thrown on the floor’ exceptions are propagated across activity and place boundaries © 2007 IBM Corporation X10 Tutorial Behavioral annotations nonblocking On any input store, a nonblocking method can continue execution or terminate. (dual: blocking, default: nonblocking) recursively nonblocking Nonblocking, and every spawned activity is recursively nonblocking. IBM Research local A local method guarantees that its execution will only access variables that are local to the place of the current activity. (dual: remote, default: local) 25 sequential Method does not create concurrent activities. In other words, method does not use async, foreach, ateach. (dual: parallel, default: parallel) Sequential and nonblocking imply recursively nonblocking. © 2007 IBM Corporation X10 Tutorial atomic Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity. IBM Research An atomic block ... 26 – must be nonblocking – must not create concurrent activities (sequential) – must not access remote data (local) Stmt ::= atomic Statement MethodModifier ::= atomic // target defined in lexically // enclosing scope. atomic boolean CAS(Object old, Object new) { if (target.equals(old)) { target = new; return true; } return false; } // push data onto concurrent // list-stack Node node = new Node(data); atomic { node.next = head; head = node; } © 2007 IBM Corporation X10 Tutorial Statically checked conditions on atomic blocks An atomic block must...be local, sequential, nonblocking: IBM Research ...not include blocking operations – no await, no when, no calls to blocking methods ... not include access to data at remote places – no ateach, no future, only calls to local methods ... not spawn other activities – no async, no foreach, only calls to sequential methods 27 © 2007 IBM Corporation X10 Tutorial Exceptions in atomic blocks IBM Research Atomicity guarantee only for successful execution. – Exceptions should be caught inside atomic block – Explicit undo in the catch handler 28 boolean move(Collection s, Collection d, Object o) { atomic { if (!s.remove(o)) { return false; // object not found } else { try { d.add(o); } catch (RuntimeException e) { s.add(o); // explicit undo throw e; // exception } return true; // move succeeded } } } (Uncaught) exceptions propagate across the atomic block boundary; atomic terminates on normal or abrupt termination of its block. © 2007 IBM Corporation X10 Tutorial Data races with async / foreach final double arr[R] = …; // global array IBM Research class ReduceOp { double accu = 0.0; double sum ( double[.] arr ) { finish foreach (point p: arr) { atomic accu += arr[p]; } concurrent conflicting return accu; access to shared variable: } data race 29 X10 guideline for avoiding data races: access shared variables inside an atomic block combine ateach and foreach with finish declare data to be read-only where possible (final or value type) © 2007 IBM Corporation X10 Tutorial NQueens: Searching in parallel NQueens.x10 IBM Research How should this be parallelized? 30 © 2007 IBM Corporation X10 Tutorial future Expr ::= future PlaceExpSingleListopt {Expr } IBM Research future (P) S Creates a new child activity at place P, that executes statement S; Returns immediately. S may reference final variables in enclosing blocks. 31 future vs. async Return result from asynchronous computation Tolerate latency of remote access. Considering addition of a delayed future: needs run() to be called before it is activated // global dist. array final double a[D] = …; final int idx = …; future<double> fd = future (a.distribution[idx]) { // executed at a[idx]’s // place a[idx]; }; future type no subtype relation between T and future<T> © 2007 IBM Corporation X10 Tutorial future example public class TutFuture1 { static int fib (final int n) { if ( n <= 0 ) return 0; if ( n == 1 ) return 1; future<int> x = future { fib(n-1) }; int y = fib(n-2); return x.force() + y; } IBM Research public static void main(String[] args) { System.out.println("fib(10) = " + fib(10)); } 32 } Divide and conquer: recursive calls execute concurrently. © 2007 IBM Corporation X10 Tutorial Example: rooted exception model (future) IBM Research double div (final double divisor) future<double> f = future { return 42.0 / divisor; } double result; try { result = f.force(); } catch (ArithmeticException e) { result = 0.0; } return result; } 33 Exception is propagated when the future is forced. © 2007 IBM Corporation X10 Tutorial Futures can deadlock nullable<future<int>> f1=null; nullable<future<int>> f2=null; void main(String[] args) { f1 = future(here){a1()}; f2 = future(here){a2()}; f1.force(); } IBM Research cyclic wait condition 34 int a1() { nullable<future<int>> tmp=null; do { tmp=f2; } while (tmp == null); return tmp.force(); } int a2() { nullable<future<int>> tmp=null; do { tmp=f1; } while (tmp == null); return tmp.force(); } X10 guidelines to avoid deadlock: avoid futures as shared variables force called by same activity that created body of future, or a descendent. © 2007 IBM Corporation X10 Tutorial IBM Research Outline 35 5. Clocks 1. What is X10? • creation, registration, next, • background, status resume, drop, 2. Basic X10 (single place) ClockUseException • async, finish, atomic 6. Basic serial constructs that • future, force differ from Java • const, nullable, extern 3. Basic X10 (arrays & loops) • points, rectangular regions, 7. Advanced topics • Value types, conditional arrays atomic sections (when), • for, foreach general regions & 4. Scalable X10 (multiple places) distributions • places, distributions, • Refer to language spec for distributed arrays, ateach, details BadPlaceException © 2007 IBM Corporation X10 Tutorial point A point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates e.g., [5], [1, 2], … – Dimensions are numbered from 0 to n-1 – n is also referred to as the rank of the point A point variable can hold values of different ranks e.g., – point p; p = [1]; … p = [2,3]; … Operations – p1.rank IBM Research • returns rank of point p1 36 – p1[i] • returns element (i mod p1.rank) if i < 0 or i >= p1.rank – p1.lt(p2), p1.le(p2), p1.gt(p2), p1.ge(p2) • returns true iff p1 is lexicographically <, <=, >, or >= p2 • only defined when p1.rank and p1.rank are equal © 2007 IBM Corporation X10 Tutorial Syntax extensions for points Implicit syntax for points: point p = [1,2] point p = point.factory(1,2) IBM Research Exploded variable declarations for points: point p [i,j] // final int i,j 37 Typical uses : – region R = [0:M-1,0:N-1]; – for (point p [i, j] : R) { ... } – for (point [i, j] : R) { ... } – point sum (point [i,j], point [k, l]) { return [i+k, j+l]; } – int [.] iarr = new int [R] (point [i,j]) { return i; } © 2007 IBM Corporation X10 Tutorial IBM Research Example: point (TutPoint1) 38 public class TutPoint { public static void main(String[] args) { point p1 = [1,2,3,4,5]; point p2 = [1,2]; point p3 = [2,1]; System.out.println("p1 = " + p1 + " ; p1.rank = " + p1.rank + " ; p1[2] = " + p1[2]); System.out.println("p2 = " + p2 + " ; p3 = " + p3 + " ; p2.lt(p3) = " + p2.lt(p3)); } } Console output: p1 = [1,2,3,4,5] ; p1.rank = 5 ; p1[2] = 3 p2 = [1,2] ; p3 = [2,1] ; p2.lt(p3) = true © 2007 IBM Corporation X10 Tutorial Rectangular regions A rectangular region is the set of points contained in a rectangular subspace A region variable can hold values of different ranks e.g., – region R; R = [0:10]; … R = [-100:100, -100:100]; … R = [0:-1]; … IBM Research Operations 39 – – – – – – – – – – – – – R.rank ::= # dimensions in region; R.size() ::= # points in region R.contains(P) ::= predicate if region R contains point P R.contains(S) ::= predicate if region R contains region S R.equal(S) ::= true if region R equals region S R.rank(i) ::= projection of region R on dimension i (a one-dimensional region) R.rank(i).low() ::= lower bound of ith dimension of region R R.rank(i).high() ::= upper bound of ith dimension of region R R.ordinal(P) ::= ordinal value of point P in region R R.coord(N) ::= point in region R with ordinal value = N R1 && R2 ::= region intersection (will be rectangular if R1 and R2 are rectangular) R1 || R2 ::= union of regions R1 and R2 (may not be rectangular) R1 – R2 ::= region difference (may not be rectangular) © 2007 IBM Corporation X10 Tutorial Example: region (TutRegion1) IBM Research public class TutRegion { public static void main(String[] args) { region R1 = [1:10, -100:100]; System.out.println("R1 = " + R1 + " ; R1.rank = " + R1.rank + " ; R1.size() = " + R1.size() + " ; R1.ordinal([10,100]) = " + R1.ordinal([10,100])); region R2 = [1:10,90:100]; System.out.println("R2 = " + R2 + " ; R1.contains(R2) = " + R1.contains(R2) + " ; R2.rank(1).low() = " + R2.rank(1).low() + " ; R2.coord(0) = " + R2.coord(0)); } } 40 Console output: R1 = {1:10,-100:100} ; R1.rank = 2 ; R1.size() = 2010 ; R1.ordinal([10,100]) = 2009 R2 = {1:10,90:100} ; R1.contains(R2) = true ; R2.rank(1).low() = 90 ; R2.coord(0) = [1,90] © 2007 IBM Corporation X10 Tutorial Syntax extensions for regions Region constructors int hi, lo; region r = hi; region r = region.factory.region(0, hi) region r = [low:hi]; IBM Research region r = region.factory.region(lo, hi) 41 region r1, r2; // 1-dim regions region r = [r1, r2]; region r = region.factory.region(r1, r2); // 2-dim region © 2007 IBM Corporation X10 Tutorial X10 arrays Java arrays are one-dimensional and local – e.g., array args in main(String[] args) – Multi-dimensional arrays are represented as “arrays of arrays” in Java X10 has true multi-dimensional arrays (as Fortran) that can be distributed (as in UPC, Co-Array Fortran, ZPL, Chapel, etc.) IBM Research Array – – Array – 42 declaration T [.] A declares an X10 array with element type T An array variable can refer to arrays with different rank allocation new T [ R ] creates a local rectangular X10 array with rectangular region R as the index domain and T as the element (range) type – e.g., int[.] A = new int[ [0:N+1, 0:N+1] ]; Array initialization – elaborate on a slide that follows... © 2007 IBM Corporation X10 Tutorial Array declaration syntax: [] vs [.] IBM Research General arrays: <Type>[.] – one or multidimensional arrays – can be distributed – arbitrary region 43 Special case (“rail”): <Type>[] – 1 dimensional – 0-based, rectangular array – not distributed – can be used in place of general arrays – supports compile-time optimization © 2007 IBM Corporation X10 Tutorial Simple array operations A.rank ::= # dimensions in array A.region ::= index region (domain) of array A.distribution ::= distribution of array A A[P] ::= element at point P, where P belongs to A.region A | R ::= restriction of array onto region R – Useful for extracting subarrays IBM Research 44 © 2007 IBM Corporation X10 Tutorial IBM Research Aggregate array operations 45 A.sum(), A.max() ::= sum/max of elements in array A1 <op> A2 – returns result of applying a pointwise op on array elements, when A1.region = A2. region – <op> can include +, -, *, and / A1 || A2 ::= disjoint union of arrays A1 and A2 (A1.region and A2.region must be disjoint) A1.overlay(A2) – returns an array with region, A1.region || A2.region, with element value A2[P] for all points P in A2.region and A1[P] otherwise. © 2007 IBM Corporation X10 Tutorial Example: arrays (TutArray1) IBM Research public class TutArray1 { public static void main(String[] args) { int[.] A = new int[ [1:10,1:10] ] (point [i,j]) { return i+j;} ; System.out.println("A.rank = " + A.rank + " ; A.region = " + A.region); int[.] B = A | [1:5,1:5]; System.out.println("B.max() = " + B.max()); } array copy } 46 Console output: A.rank = 2 ; A.region = {1:10,1:10} B.max() = 10 © 2007 IBM Corporation X10 Tutorial Initialization of mutable arrays Mutable array with nullable references to mutable objects: nullable<RefType> [.] farr = new RefType[R]; // init with null value Mutable array with references to mutable objects: RefType [.] farr = new RefType [R]; // compile-time error, init required RefType [.] farr = new RefType [R] (point[i]) { return RefType(here, i); } Execution of initializer is implicitly parallel / distributed (pointwise operation): IBM Research That hold ‘reference to value objects’ (value object can be inlined by imp.) 47 int [.] iarr = new int[N] ; // init with default value, 0 int [.] iarr = new int[.] {1, 2, 3, 4}; // Java style ValType [.] V = new ValType[N] (point[i]) { return ValType(i);}; // explicit init © 2007 IBM Corporation X10 Tutorial Initialization of value arrays Initialization of value arrays requires an initializer. Value array of reference to mutable objects: RefType value [.] farr = new value RefType [N]; // compile-time error, init required RefType value [.] farr = new value RefType [N] (point[i]) { return new Foo(); } Value array of ‘reference to value objects’ (value object can be inlined) IBM Research int value [.] iarr = new value int[.] {1, 2, 3, 4}; // Java style init 48 ValType value [.] iarr = new value ValType[N] (point[i]) { return ValType(i); }; // explicit init © 2007 IBM Corporation X10 Tutorial foreach foreach ( FormalParam: Expr ) Stmt foreach (point p: R) S Creates |R| async statements in parallel at current place. foreach (point p:R) S for (point p: R) async { S } IBM Research Termination of all (recursively created) activities can be ensured with finish. 49 finish foreach is a convenient way to achieve master-slave fork/join parallelism (OpenMP programming model) © 2007 IBM Corporation X10 Tutorial Summing up elements of an array in parallel See ArraySum.x10 IBM Research When P is small, local sums can be summed up serially. 50 © 2007 IBM Corporation X10 Tutorial Iterative Averaging with a 1-D stencil IBM Research Problem Statement – Initialize a 1-D array, A[0:n+1] with A[0:n] = 0 & A[n+1] = n+1 – A[0] = 0 and A[n+1] = n+1 are fixed boundary conditions – Iteratively compute new values for A[1:n] by averaging the values of the two neighboring elements – Terminate when the sum of element changes is less than a given threshold Acknowledgment – Example and codes for C + MPI, UPC, CAF versions were provided by Steve Deitz and Brad Chamberlain from Cray, {deitz,bradc}@cray.com 51 51 © 2007 IBM Corporation X10 Tutorial 1-D stencil: parallelism within a place See Stencil1D. Uses x10.util.dist.Distribution.block(R, P) to blockdivide a region into P regions. IBM Research Exercise: Parallelize Pascal’s triangle. 52 © 2007 IBM Corporation X10 Tutorial IBM Research Outline 53 1. What is X10? • background, status 2. Basic X10 (single place) • async, finish, atomic • future, force 3. Basic X10 (arrays & loops) • points, rectangular regions, arrays • for, foreach 4. Scalable X10 (multiple places) • places, distributions, distributed arrays, ateach, BadPlaceException 5. 6. 7. Clocks • creation, registration, next, resume, drop, ClockUseException Basic serial constructs that differ from Java • const, nullable, extern Advanced topics • Value types, conditional atomic sections (when), general regions & distributions • Refer to language spec for details © 2007 IBM Corporation X10 Tutorial Limitations of using a Single Place Immutable Data (I) -- final variables, value type instances Activities IBM Research Locally Synchronous (coherent access to intra-place shared heap) 54 ... Activity Stacks (S) Storage classes: Immutable Data (I) Shared Heap (H) Activity Stacks (S) Place 0 Shared Heap (H) Largest deployment granularity for a single place is a single SMP – Smallest granularity can be a single CPU or even a single hardware thread Single SMP is inadequate for solving problems with large memory and compute requirements X10 solution: incorporate multiple places as a core foundation of the X10 programming model Enable deployment on large-scale clustered machines, with integrated support for intra-place parallelism © 2007 IBM Corporation X10 Tutorial What is Partitioned Global Address Space (PGAS)? IBM Research Process/Thread 55 Address Space Message passing Shared Memory PGAS MPI OpenMP UPC, CAF, X10 Computation is performed in multiple places. A place contains data that can be operated on remotely. Data lives in the place it was created, for its lifetime. A datum in one place may reference a datum in another place. Data-structures (e.g. arrays) may be distributed across many places. Places may have different computational properties (e.g. PPE, SPE, …). A place expresses locality. © 2007 IBM Corporation X10 Tutorial IBM Research What is Asynchronous PGAS? 56 Asynchrony – Simple explicitly concurrent model for the user: async (p) S runs statement S “in parallel” at place p – Controlled through finish, and local (conditional) atomic Used for active messaging (remote asyncs), DMAs, finegrained concurrency, fork/join concurrency, doall/do-across parallelism – SPMD is a special case Concurrency is made explicit and programmable. © 2007 IBM Corporation X10 Tutorial Places in X10 place.MAX_PLACES = total number of places (runtime constant) place.places = value array of all places in an X10 place.factory.place(i) = place corresponding to index i here = place in which current activity is executing <place-expr>.toString() returns a string of the form “place(id=99)” <place-expr>.id returns the id of the place IBM Research X10 programs defines mapping from X10 objects to X10 places, and abstract performance metrics on places 57 X10 Data Structures X10 Places Future X10 deployment system will define mapping from X10 places to processes, and processes to nodes. Processes Nodes © 2007 IBM Corporation X10 Tutorial IBM Research Specifying number of places in X10DT 58 © 2007 IBM Corporation X10 Tutorial IBM Research Locality rule 59 An activity executing an Activity may access remote atomic block may access data synchronously (outside only local data. an atomic block). Locality of data determined – Write (update) access is as through types. if performed in a finish/async. – Data of type Foo! is potentially remote. – Read access is as if performed in future/force. – Data of type Foo!p is at place p. Programmer may always use explicit asyncs to improve – Data of type Foo is local. locality of computation. Data may be cast to local type. Failure throws Current implementation limitation BadPlaceException. Place types not yet supported. Atomic block does not check array accesses are place local. © 2007 IBM Corporation X10 Tutorial Distributions in X10 A distribution maps every point in a region to a place. IBM Research Creating distributions (x10.lang.dist): – dist D1 = dist.factory.constant(R, here); // local distribution – maps region R to here – dist D2 = dist.factory.block(R); // blocked distribution – dist D3 = dist.factory.cyclic(R); // cyclic distribution – dist D4 = dist.factory.unique(); // identity map on [0:MAX_PLACES-1] 60 © 2007 IBM Corporation X10 Tutorial async and future with explicit place specifier async (P) S Creates new activity to execute statement S at place P async S is equivalent to async (here) S future (P) { E } Create new activity to evaluate expression E at place P future { E } is equivalent to future (here) { E } IBM Research Note that here in a child activity for an async/future computation will refer to the place P at which the child activity is executing, not the place where the parent activity is executing 61 Specify the destination place for async/future activities so as to obey the Locality rule e.g., async (O.location) O.x = 1; future<int> F = future (A.distribution[i]) { A[i] } ; © 2007 IBM Corporation X10 Tutorial Inter-place communication using async and future Question: how to assign A[i] = B[j], when A[i] and B[j] may be in different places? Answer #1: Use nested async: IBM Research finish async ( B.distribution[j] ) { final int bb = B[j]; async ( A.distribution[i] ) A[i] = bb; } 62 Answer #2: Use future-force and an async: final int b = future (B.distribution[j]) { B[j] }.force(); finish async ( A.distribution[i] ) A[i] = b; © 2007 IBM Corporation X10 Tutorial ateach (distributed parallel iteration) ateach ( FormalParam: Expr ) Stmt ateach (point p:D) S Creates |D| async statements in parallel at place specified by distribution. ateach (point p:D) S for (point p:D.region) async (D[p]) { S } IBM Research Termination of all (recursively created) activities with finish. ateach is a convenient construct for writing parallel matrix code that is independent of the underlying distribution, e.g., 63 ateach ( point p : A.distribution ) A[p] = f(B[p], C[p], D[p]) ; SPMD computation: finish ateach( point[i] : dist.factory.unique() ) S © 2007 IBM Corporation X10 Tutorial Example: ateach (TutAteach1) public class TutAteach1 { public static void main(String args[]) { finish ateach (point p: dist.factory.unique()) { System.out.println("Hello from " + here.id); } } // main() } IBM Research Console output: 64 Hello Hello Hello Hello from from from from unique distribution: maps point i in region [0 : place.MAX_PLACES-1] to place place.factory.place(i). 1 0 3 4 © 2007 IBM Corporation X10 Tutorial AllDistribution See AllDistributionP2P.x10 IBM Research See AllDistributionP2PRemoteAccess.x10 65 © 2007 IBM Corporation X10 Tutorial IBM Research Butterfly communication 66 © 2007 IBM Corporation X10 Tutorial IBM Research Outline 67 1. What is X10? • background, status 2. Basic X10 (single place) • async, finish, atomic • future, force 3. Basic X10 (arrays & loops) • points, rectangular regions, arrays • for, foreach 4. Scalable X10 (multiple places) • places, distributions, distributed arrays, ateach, BadPlaceException 5. 6. 7. Clocks • creation, registration, next, resume, drop, ClockUseException Basic serial constructs that differ from Java • const, nullable, extern Advanced topics • Value types, conditional atomic sections (when), general regions & distributions • Refer to language spec for details © 2007 IBM Corporation X10 Tutorial Clocks: Motivation Activity coordination using finish and force() is accomplished by checking for activity termination But in many cases activities have a producer-consumer relationship and a “barrier”-like coordination is needed without waiting for activity termination – The activities involved may be in the same place or in different places Design clocks to offer determinate and deadlock-free coordination between a dynamically varying number of activities. Phase 0 IBM Research Phase 1 68 ... Activity 0 Activity 1 Activity 2 ... © 2007 IBM Corporation X10 Tutorial Clocks (1/2) clock c = clock.factory.clock(); Allocate a clock, register current activity with it. Phase 0 of c starts. async(…) clocked (c1,c2,…) S ateach(…) clocked (c1,c2,…) S foreach(…) clocked (c1,c2,…) S Create async activities registered on clocks c1, c2, … IBM Research c.resume(); Nonblocking operation that signals completion of work by current activity for this phase of clock c 69 next; Barrier --- suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed. Next can be viewed like a “finish” of all computations under way in the current phase of the clock © 2007 IBM Corporation X10 Tutorial Clocks (2/2) c.drop(); Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on. c.registered() Return true iff current activity is registered on clock c c.dropped() returns the opposite of c.registered() IBM Research ClockUseException Thrown if an activity attempts to transmit or operate on a clock that it is not registered on Or if an activity attempts to transmit a clock in the scope of a finish 70 © 2007 IBM Corporation X10 Tutorial Semantics Static semantics – An activity may operate only on those clocks it is registered with. – In finish S,S may not contain any (top-level) clocked asyncs. IBM Research Dynamic semantics – A clock c can advance only when all its registered activities have executed c.resume(). – An activity may not pass-on clocks on which it is not live to subactivities. – An activity is deregistered from a clock when it terminates 71 © 2007 IBM Corporation X10 Tutorial IBM Research Example (TutClock1.x10) 72 finish async { final clock c = clock.factory.clock(); foreach (point[i]: [1:N]) clocked (c) { parent transmits clock while ( true ) { to child int old_A_i = A[i]; int new_A_i = Math.min(A[i],B[i]); if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]); if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]); A[i] = new_A_i; next; int old_B_i = B[i]; int new_B_i = Math.min(B[i],A[i]); if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]); if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]); B[i] = new_B_i; next; if ( old_A_i == new_A_i && old_B_i == new_B_i ) break; exiting from while loop } // while terminates activity for } // foreach } // finish async iteration i, and automatically deregisters activity from clock © 2007 IBM Corporation X10 Tutorial Example (TutClock1.x10) hierarchical static dynamic concurrency concurrency model model ... ... IBM Research ... 73 activities (foreach, finish) clock phases ... © 2007 IBM Corporation X10 Tutorial Clock safety An activity may be registered on one or more clocks Clock c can advance only when all activities registered with the clock have executed c.resume(). IBM Research Runtime invariant: Clock operations are guaranteed to be deadlock-free and determinate. 74 © 2007 IBM Corporation X10 Tutorial Deadlock freedom IBM Research Central theorem of X10: – Arbitrary programs with async, atomic, finish (and clocks) are deadlock-free. 75 Key intuition: – atomic is deadlock-free. – finish has a tree-like structure. – clocks are made to satisfy conditions which ensure treelike structure. – Hence no cycles in wait-for graph. Where is this useful? – Whenever synchronization pattern of a program is independent of the data read by the program – True for a large majority of HPC codes. – (Usually not true of reactive programs.) © 2007 IBM Corporation X10 Tutorial Examples using clocks See AllReductionBarrier. See Streams, specifically Sieve of Eratosthenes, Fib. IBM Research Exercise: Try Pascal’s triangle. 76 © 2007 IBM Corporation X10 Tutorial IBM Research Outline 77 1. What is X10? • background, status 2. Basic X10 (single place) • async, finish, atomic • future, force 3. Basic X10 (arrays & loops) • points, rectangular regions, arrays • for, foreach 4. Scalable X10 (multiple places) • places, distributions, distributed arrays, ateach, BadPlaceException 5. 6. 7. Clocks • creation, registration, next, resume, drop, ClockUseException Basic serial constructs that differ from Java • const, nullable, extern Advanced topics • Value types, conditional atomic sections (when), general regions & distributions • Refer to language spec for details © 2007 IBM Corporation X10 Tutorial when Stmt ::= WhenStmt WhenStmt ::= when ( Expr ) Stmt | WhenStmt or (Expr) Stmt when (E) S – Activity suspends until a state in which the guard E is true. – In that state, S is executed atomically and in isolation. class OneBuffer { nullable<Object> datum = null; boolean filled = false; void send(Object v) { when ( ! filled ) { datum = v; filled = true; } } Guard E – boolean expression IBM Research – must be nonblocking – must not create concurrent activities (sequential) – must not access remote data (local) – must not have side-effects (const) 78 await (E) – syntactic shortcut for when (E) ; } Object receive() { when ( filled ) { Object v = datum; datum = null; filled = false; return v; } } © 2007 IBM Corporation X10 Tutorial Static semantics of guard for when / await IBM Research boolean field boolean expression with field access or constant values 79 class BufferBuffer { .. void send(Object v) { when (size() < MAX_SIZE) { datum = v; filled = true; } } ... } compile-time error © 2007 IBM Corporation X10 Tutorial Semaphores class Semaphore { private boolean taken; void p() { when (!taken) taken = true; } atomic void v() { taken = false; } IBM Research } 80 © 2007 IBM Corporation X10 Tutorial Value types : immutable instances IBM Research value class – Can only extend value class or x10.lang.Object. – All fields are implicitly final – Can only be extended by value classes. – May contain fields with reference type. – May be implemented by reference or copy. 81 Values are equal (==) if their fields are equal, recursively. public value complex { double im, re; public complex(double im, double re) { this.im = im; this.re = re; } public complex add(complex a) { return new complex(im+a.im, re+a.re); } … } © 2007 IBM Corporation X10 Tutorial Nullable Types Unlike Java, reference types in X10 do not contain the value null by default. – A method invocation on a variable of a reference type cannot throw an NPE. Nullable may be applied to value types as well. – nullable<int> : value is null or an int. IBM Research The nullable type constructor can be used to add null to a type: 82 – nullable<Foo> : values of this type are references to Foo or null. – nullable<nullable<Foo>> is the same as nullable<Foo> © 2007 IBM Corporation X10 Tutorial Dependent types IBM Research Classes and interfaces may define properties – final instance fields. 83 Types may contain a where clause: Foo(:c) – c is a condition, a conjunction of equalities. – c may reference final variables in environment or properties of Foo. – c may reference special variable self, of type Foo. Dependent types may be used everywhere where types are used. – Values may be cast to dependent types. – Code is generated to check values of properties at runtime. Examples – foo(:place==p) – region(:rank==2) – dist(:region==r) © 2007 IBM Corporation IBM Research: Software Technology Programming Technologies X10 Summary 84 © 2005 IBM Corporation X10 Tutorial IBM Research X10 v1.5 Language Summary 85 async [(Place)] [clocked(c…)] Stm – Run Stm asynchronously at Place finish Stm – Execute Stm, wait for all asyncs to terminate Region – Collection of index points, e.g. region r = [1:N,1:M]; foreach ( point P : Reg) Stm – Run Stm asynchronously for each point in region Distribution – Mapping from region to places, e.g. • dist d = dist.factory.block(r); ateach ( point P : Dist) Stm – Run Stm asynchronously for each point in dist, in its place. new T – Allocate object at this place (here) atomic Stm – Execute Stm atomically next – suspend till all clocks that the current activity is registered with can advance – Clocks are a generalization of barriers and MPI communicators future [(Place)] [clocked(c…)] Expr – Compute Expr asynchronously at Place F. force() – Block until future F has been computed atomic Stm – Execute Stm atomically extern – Lightweight interface to native code Deadlock safety: any X10 program written with async, atomic, finish, foreach, ateach, and clocks can never deadlock © 2007 IBM Corporation X10 Tutorial x10.lang standard library Java package with “built in” classes that provide support for selected X10 constructs IBM Research 86 Standard types – boolean, byte, char, double, float, int, long, short, String x10.lang.Object -- root class for all instances of X10 objects x10.lang.clock --- clock instances & clock operations x10.lang.dist --- distribution instances & distribution operations x10.lang.place --- place instances & place operations x10.lang.point --- point instances & point operations x10.lang.region --- region instances & region operations All X10 programs implicitly import the x10.lang.* package, so the x10.lang prefix can be omitted when referring to members of x10.lang.* classes e.g., place.MAX_PLACES, dist.factory.block([0:100,0:100]), … Similarly, all X10 programs also implicitly import the java.lang.* package e.g., X10 programs can use Math.min() and Math.max() from java.lang In case of conflict (e.g. Integer), user must import the desired one explicitly, e.g. import java.lang.Integer; © 2007 IBM Corporation X10 Tutorial Rectangular regions A rectangular region is the set of points contained in a rectangular subspace A region variable can hold values of different ranks e.g., – region R; R = [0:10]; … R = [-100:100, -100:100]; … R = [0:-1]; … IBM Research Operations 87 – – – – – – – – – – – – – R.rank ::= # dimensions in region; R.size() ::= # points in region R.contains(P) ::= predicate if region R contains point P R.contains(S) ::= predicate if region R contains region S R.equal(S) ::= true if region R equals region S R.rank(i) ::= projection of region R on dimension i (a one-dimensional region) R.rank(i).low() ::= lower bound of ith dimension of region R R.rank(i).high() ::= upper bound of ith dimension of region R R.ordinal(P) ::= ordinal value of point P in region R R.coord(N) ::= point in region R with ordinal value = N R1 && R2 ::= region intersection (will be rectangular if R1 and R2 are rectangular) R1 || R2 ::= union of regions R1 and R2 (may not be rectangular) R1 – R2 ::= region difference (may not be rectangular) © 2007 IBM Corporation X10 Tutorial X10 Cheat Sheet: Regions & Distributions IBM Research Region: Expr : Expr [ Range, …, Range ] Multidimensional Region Region && Region Region || Region Region – Region difference BuiltinRegion Dist: Region -> Place -- Constant -- 1-D region distribution -Distribution | Place -- Restriction Distribution | Region -- Restriction -- Intersection Distribution || Distribution -- Union -- Union Distribution – Distribution -- Set -- Set difference Distribution.overlay ( Distribution ) BuiltinDistribution 88 88 © 2007 IBM Corporation X10 Tutorial X10 runtime parameters -NUMBER_OF_LOCAL_PLACES=int The number of places (default = 1) -INIT_THREADS_PER_PLACE=int Initial number of Java threads per single place(default =2) -ABSTRACT_EXECUTION_STATS=false Dump out parallel execution statistics -ABSTRACT_EXECUTION_TIMES=false If dumping statistics also dump out unblocked exec times -BIND_THREADS=false Use platform-specific calls to bind Java threads to CPUs. -BIND_THREADS_DIAGNOSTICS=false Print diagnostics related to platform-specific calls to bind Java threads to CPUs. -BAD_PLACE_RUNTIME_CHECK=false Perform runtime place checks The name of the main class -OPTIMIZE_FOREACH=false Experimental: Enable runtime loop optimizations. -PRELOAD_CLASSES=false Pre-load all classes on recursively startup -PRELOAD_STRINGS=false If pre-loading classes, also pre-load all strings -LOAD=null Load specified shared library. IBM Research -MAIN_CLASS_NAME=null 89 © 2007 IBM Corporation X10 Tutorial IBM Research X10 preferences 90 © 2007 IBM Corporation X10 Tutorial IBM Research Specifying X10 runtime parameters 91 © 2007 IBM Corporation