IBM Research: Software Technology, Programming Technologies

APGAS: Programming for concurrency and distribution
Vijay Saraswat
May 12, 2008
IBM TJ Watson Research Center
5/31/2016 © 2005 IBM Corporation X10 Programming Language 6/2008


The current architectural landscape

[Figure: system diagrams. Panel titles: Power5 Clusters; Multi-core processors, with accelerators (e.g. Sun Niagara; e.g. Intel multicore, IXP; e.g. IBM Cell; e.g. GPGPUs); P7 supernode; Blue Gene; Road Runner: Cell-accelerated Opteron. Diagram labels: SMP Node, PEs, Memory, Interconnect, I/O gateway nodes, (100’s of such cluster nodes), “Scalable Unit” Cluster Interconnect Switch/Fabric.]


The current architectural landscape

Substantial architectural innovation is anticipated over the next ten years.
  – Hardware situation remains murky, but programmers need stable interfaces to develop applications.
Heterogeneous accelerator-based systems will exist, raising serious programmability challenges.
  – Programmers must choreograph interactions between heterogeneous processors and memory subsystems.
Multicore systems will dramatically raise the number of cores available to applications.
  – Programmers must understand the concurrent structure of their applications.
Applications seeking to leverage these architectures will need to go beyond the data-parallel, globally synchronizing MPI model.
These changes, while most profound for HPC now, will change the face of commercial computing over time.


What is Partitioned Global Address Space (PGAS)?

[Figure: processes/threads and address spaces under three models.]

    Model              Examples
    Message passing    MPI
    Shared memory      OpenMP
    PGAS               UPC, CAF, X10

Computation is performed in multiple places.
A place contains data that can be operated on remotely.
Data lives in the place it was created, for its lifetime.
A datum in one place may reference a datum in another place.
Data-structures (e.g. arrays) may be distributed across many places.
Places may have different computational properties (e.g. PPE, SPE, …).
A place expresses locality.


What is Asynchronous PGAS?

Asynchrony
  – Simple explicitly concurrent model for the user: async (p) S runs statement S “in parallel” at place p
  – Controlled through finish, and local (conditional) atomic
Used for active messaging (remote asyncs), DMAs, fine-grained concurrency, fork/join concurrency, doall/do-across parallelism
  – SPMD is a special case
Concurrency is made explicit and programmable.


What is X10?

Based on sequential Java/Scala
Extensions for concurrency, distribution and arrays
Clean foundations
  – Advanced type system
  – Determinate and deadlock-free subsets
Designed for multicore, clusters, for commercial applications and HPC.
Supports simple constructs for atomicity and ordering
Supports rich multidimensional distributed arrays
A realization of APGAS as a modern OO language.


X10 v1.5 = Sequential Java + Concurrency

    <stmt> ::= async (p) <stmt>
    <stmt> ::= finish <stmt>
    <stmt> ::= atomic <stmt>
    <stmt> ::= join <stmt>
  – New for X10 2.0
    <expr> ::= future (p) <expr>
    <expr> ::= f.force()

Clocks
  – <stmt> ::= clock a = new clock();
  – c.next;
  – c.resume();
Streams planned as “data carrying” clocks for X10 2.0
Array language: multidimensional, distributed
  – Points
  – Regions
  – Distributions
  – Arrays


How does X10 help productivity?
Expressive and efficient core language
  – Rich, flexible, widely-used OO language base
  – With multidimensional arrays
  – Dependent types permit compiler to check data-dependent assertions (catch bugs early) and generate better code
  – Supports reusable libraries
Expressive concurrent language
  – Language supports global address space, one-sided remote operations, atomic operations, fine-grained asynchrony, termination detection, multi-level parallelism
Faster time to correct code
  – Large classes of errors ruled out by design (e.g. language guarantees memory safety, pointer safety, deadlock-freedom for programs in a rich sublanguage, determinacy, advanced type-checking).
      • Cf. May 2005 PSC Study
  – While maintaining performance
      • Cf. SC 07 HPCC Award
Tooling
  – Eclipse-based programming environment enables programmer to quickly develop, browse, understand and refactor code.
“Looks good, runs fast”


X10 1.7 language design

In essence, support generics, cf. Java 1.5

    val d: Array[double] = …;
    val ranks: List[int] = …;
    class Array[T] { … }

Why Generics?
  – Same class can be reused at different types.
  – Fosters reusable code.
  – Standard technique for commercial, type-safe languages; modern programmers expect generics.
  – Permits (large portions of) X10 runtime to be written in X10
Actual language design influenced more by Scala than Java 1.5
  – Java has made many “wrong” decisions, driven by backward compatibility.
  – Scala gives a much cleaner OO + functional base than Java, while being compatible with JVMs.
  – X10 1.7 is not Scala
      • Diverges on many dimensions including types, pattern matching, multiple inheritance.
      • Retains run-time types (rtti)


X10 1.7: Design of generics

X10 v 1.1 permits class parametrization with values
  – Constrained types, cf. OOPSLA 08
  – List(:length==3)
X10 v 1.7 permits classes/methods/constructors to be parametrized with types as well.
  – Basic idea is to permit type valued properties and parameters.
  – Same dependent type syntax is used for value- and type-parametrization.

    value Array[T] {
      val dist: dist;
      val pieces: ValRail[Rail[T]{current}];
      …
    }

Generic classes are implemented by type expansion
Type parameters are not erased at runtime
  – Provides significant run-time flexibility.


Design of type definitions

Types with many dependent clauses can become cumbersome to read
  – => Need abstractions over types.
Type definitions permit new (type-, value-) parametrized names for types.
Type definitions expand out into real types.
  – Not generative; do not introduce new subtypes.
  – Scoped in classes, just as other members

    type nat = int{self >= 0};
    type Rail[T](n:nat) = Array[T]{val p:place; dist==0..n-1->p};
    type booleanVar = int;
    …
    val a: Rail[double](N) = …;


Introduce functions/closures

Permit succinct specifications of pieces of code (as lambda expressions)
A function may be applied as many times as necessary, returns once each time.
Functions may capture lexical variables.
Functions implemented as special objects.
Many library methods parametrized to accept such functions.

    val d = dist.blockCyclic(r);
    val a = new Array[double](d, (p:point)=>rand(p));
    val max = (x:double,y:double)=> (x <= y ? y : x);
    val m = a.reduce(0,max);

X10 remains an OO language (but with functions)


Example: RandomAccess

    finish ateach(val (p) in UNIQUE.region) {
      var ran = HPCC_starts(p*NumUpdates);
      for (val i in 0..NumUpdates-1) {
        val placeID = ((ran>>LogTableSize) & PLACEIDMASK) as int;
        val arg = ran;
        async {
          val local = Table(placeID);
          atomic local.array((arg & local.mask) as int) ^= arg;
        }
        ran = (ran << 1) ^ (ran < 0 ? POLY : 0L);
      }
    }


Pseudo Depth-first search / Breadth-first search

    class V[T] {
      var parent: V[T];
      val neighbors: Array[V[T]];
      val data: T;
      def this(d: T) { data = d; … }
      def compute(): void {
        for (val v in neighbors) {
          atomic v.parent = (v.parent==null ? this : v.parent);
          if (v.parent==this)
            async clocked (c) { next; v.compute(); }
        }
      }
      def computeTree(): void { parent = this; finish compute(); }
      …
    }


December 2008

Public availability of X10 Flash compiler and X10lib runtime for clusters of SMPs.
  – On track.
Demonstration of initial functionality of X10 source-level debugger for Eclipse (operating on Java code produced by X10 compiler).
  – On track.
  – Note that X10DT currently has source-level breakpoint and stepping functionality.


X10 papers

1. “Constrained types for OO Languages”, to appear in OOPSLA 08.
2. “Solving irregular graph algorithms using adaptive work-stealing”, to appear in ICPP 08.
3. “Optimizing array accesses in high productivity languages”, to appear in HPCC 2007.
4. “Deadlock-free scheduling of X10 Computations with bounded resources”, SPAA 2007, June 2007.
5. “A Theory of Memory Models”, PPoPP, March 2007.
6. “May-Happen-in-Parallel Analysis of X10 Programs”, PPoPP, March 2007.
7. “An annotation and compiler plug-in system for X10”, IBM Technical Report, Feb 2007.
8. “Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities”, Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
9. “An Experiment in Measuring the Productivity of Three Parallel Programming Languages”, P-PHEC workshop, February 2006.
10. “X10: An Object-Oriented Approach to Non-Uniform Cluster Computing”, OOPSLA conference, October 2005.
11. “Concurrent Clustered Programming”, CONCUR conference, August 2005.
12. “X10: an Experimental Language for High Productivity Programming of Scalable Systems”, P-PHEC workshop, February 2005.
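

Aside: finish, async and atomic in plain Java

The RandomAccess example relies on three of the constructs listed in the X10 v1.5 grammar: finish waits for every async spawned in its scope, async starts a statement concurrently, and atomic makes an update indivisible. The sketch below is a rough single-JVM analogy in plain Java. It is not X10 and it does not claim to show how the X10 runtime is implemented. The names FinishAsyncSketch, Finish and xorUpdates are hypothetical; a java.util.concurrent.Phaser stands in for termination detection and an AtomicLongArray for atomic. Places and distribution are not modelled.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicLongArray;

/** Plain-Java analogy for finish/async/atomic; every name here is hypothetical. */
final class FinishAsyncSketch {

    /** Stand-in for a finish scope: tracks asyncs that have not yet terminated. */
    static final class Finish {
        private final ExecutorService pool;
        // One party is pre-registered for the thread that runs the finish body.
        private final Phaser pending = new Phaser(1);

        Finish(ExecutorService pool) {
            this.pool = pool;
        }

        /** Analogue of `async S`: run S concurrently and track it (nested asyncs included). */
        void async(Runnable body) {
            pending.register();
            pool.execute(() -> {
                try {
                    body.run();
                } finally {
                    pending.arriveAndDeregister();
                }
            });
        }

        /** Analogue of the end of `finish S`: block until every tracked async is done. */
        void awaitAll() {
            pending.arriveAndAwaitAdvance();
        }
    }

    /** XORs each update into the table slot selected by its low bits, one async per update. */
    static long[] xorUpdates(int logTableSize, long[] updates) {
        int size = 1 << logTableSize;
        long mask = size - 1;
        AtomicLongArray table = new AtomicLongArray(size);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Finish finish = new Finish(pool);
            for (long arg : updates) {
                // Analogue of: async { atomic table((arg & mask) as int) ^= arg; }
                finish.async(() ->
                    table.accumulateAndGet((int) (arg & mask), arg, (a, b) -> a ^ b));
            }
            finish.awaitAll();
        } finally {
            pool.shutdown();
        }
        long[] result = new long[size];
        for (int i = 0; i < size; i++) {
            result[i] = table.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] table = xorUpdates(2, new long[] {1L, 5L, 6L, 9L});
        System.out.println(Arrays.toString(table)); // prints [0, 13, 6, 0]
    }
}
```

Because XOR is commutative and associative, the final table does not depend on the order in which the asyncs run, so the result is deterministic once each update is atomic. A Phaser tracks at most 65535 parties, so this sketch suits small update counts only.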