IBM Research: Software Technology
Programming Technologies

Determinate Imperative Programming

Vijay Saraswat, Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun
November 2006
IBM Research

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

© 2006 IBM Corporation


Outline

Problem:
– Many concurrent imperative programs are determinate.
– Determinacy is not apparent from the syntax.
– Design a language in which all programs are guaranteed determinate.
Basic idea:
– A variable is the stream of values written to it by a thread.
Many examples
Semantics
Implementation
Future work


Acknowledgments

X10 Core Team
– Rajkishore Barik
– Vincent Cave
– Chris Donawa
– Allan Kielstra
– Igor Peshansky
– Christoph von Praun
– Vijay Saraswat
– Vivek Sarkar
– Tong Wen
X10 Tools
– Philippe Charles
– Julian Dolby
– Robert Fuhrer
– Frank Tip
– Mandana Vaziri
Emeritus
– Kemal Ebcioglu
– Christian Grothoff
Research colleagues
– R. Bodik, G. Gao, R. Jagadeesan, J. Palsberg, R. Rabbah, J. Vitek
– Several others at IBM

Recent Publications
1. "Concurrent Clustered Programming", V. Saraswat, R. Jagadeesan. CONCUR conference, August 2005.
2. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, V. Sarkar. OOPSLA Onwards! conference, October 2005.
3. "A Theory of Memory Models", V. Saraswat, R. Jagadeesan, M. Michael, C. von Praun. To appear, PPoPP 2007.
4. "Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities", R. Barik, V. Cave, C. Donawa, A. Kielstra, I. Peshansky, V. Sarkar. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
5. "X10: an Experimental Language for High Productivity Programming of Scalable Systems", K. Ebcioglu, V. Sarkar, V. Saraswat. P-PHEC workshop, February 2005.

Tutorials
– TiC 2006, PACT 2006, OOPSLA06


A new era of mainstream parallel processing

The Challenge
– Parallelism scaling replaces frequency scaling as foundation for increased performance
– Profound impact on future software

[Figure: three block diagrams. Multi-core chips: PEs with L1 caches sharing an L2 cache. Heterogeneous parallelism: a PPE (PPU, PXU, L1, L2; 64-bit Power Architecture with VMX) and SPEs (SPU, SXU, LS, SMF) on the EIB (up to 96B/cycle), with MIC to Dual XDR and BIC to FlexIO. Cluster parallelism: SMP nodes (PEs, memory) joined by an interconnect.]

Our response: Use X10 as a new language for parallel hardware that builds on existing tools, compilers, runtimes, virtual machines and libraries


Server Trends: Concurrency, Distribution, Heterogeneity at all levels

[Figure: workloads and apps running on servers and a network within a shared administrative domain, with three form factors: HPC scale out, commercial scale out (blade), and appliance (multi-core chip). The HPC scale-out packaging hierarchy is:
– Chip: 2 processors
– Compute card: 2 chips (1x2x1), 4 processors
– Node card: 16 compute cards (16 compute, 0-2 IO cards), 64 processors
– Rack: 32 node cards, 2048 processors
– System: 64 racks (64x32x32), 131,072 processors
Performance and capacity labels in the figure: 11.2 GF/s, 1.0 GB; 180 GF/s, 16 GB; 5.6 TF/s, 512 GB, 20 kW, 1 m2 footprint; 360 TF/s, 32 TB, 1.3 MW.]


The X10 Programming Model

Place = collection of resident activities & objects

Storage classes
– Immutable Data
– PGAS
    • Local Heap
    • Remote Heap
– Activity Local

Locality Rule
– Any access to a mutable datum must be performed by a local activity
– Remote data accesses can be performed by creating remote activities

Ordering Constraints (Memory Model)
– Locally Synchronous: guaranteed coherence for local heap (sequential consistency)
– Globally Asynchronous: no ordering of inter-place activities (use explicit synchronization for coherence)

Few concepts, done right.


X10 v0.41 Cheat sheet

DataType:
    ClassName | InterfaceName | ArrayType
    nullable DataType
    future DataType
Kind:
    value | reference
Stm:
    async [ ( Place ) ] [clocked ClockList ] Stm
    when ( SimpleExpr ) Stm
    finish Stm
    next;
    c.resume()
    c.drop()
    for ( i : Region ) Stm
    foreach ( i : Region ) Stm
    ateach ( I : Distribution ) Stm
Expr:
    ArrayExpr
ClassModifier:
    Kind
MethodModifier:
    atomic

x10.lang has the following classes (among others): point, range, region, distribution, clock, array. Some of these are supported by special syntax.

Forthcoming support: closures, generics, dependent types, place types, implicit syntax, array literals.


X10 v0.41 Cheat sheet: Array support

Region:
    Expr : Expr                        -- 1-D region
    [ Range, …, Range ]                -- Multidimensional Region
    Region && Region                   -- Intersection
    Region || Region                   -- Union
    Region – Region                    -- Set difference
    BuiltinRegion
Dist:
    Region -> Place                    -- Constant distribution
    Distribution | Place               -- Restriction
    Distribution | Region              -- Restriction
    Distribution || Distribution       -- Union
    Distribution – Distribution        -- Set difference
    Distribution.overlay ( Distribution )
    BuiltinDistribution
ArrayExpr:
    new ArrayType ( Formal ) { Stm }
    Distribution Expr                  -- Lifting
    ArrayExpr [ Region ]               -- Section
    ArrayExpr | Distribution           -- Restriction
    ArrayExpr || ArrayExpr             -- Union
    ArrayExpr.overlay(ArrayExpr)       -- Update
    ArrayExpr.scan( [fun [, ArgList]] )
    ArrayExpr.reduce( [fun [, ArgList]] )
    ArrayExpr.lift( [fun [, ArgList]] )
ArrayType:
    Type [Kind] [ ]
    Type [Kind] [ region(N) ]
    Type [Kind] [ Region ]
    Type [Kind] [ Distribution ]

Language supports type safety, memory safety, place safety, clock safety.


Memory Model

Please see: http://www.saraswat.org/rao.html

X10 v0.41 specifies sequential consistency per place.
– Not workable.
We are considering a weaker memory model.
Built on the notion of atomic: identify a step as the basic building block.
– A step is a partial write function.
– Use links for reads that are not ordered by happens-before (non-hb reads).
A process is a pomset of steps closed under certain transformations:
– Composition
– Decomposition
– Augmentation
– Linking
– Propagation
There may be opportunity for a weak notion of atomic: decouple atomicity from ordering.
Correctly synchronized programs behave as sequentially consistent (SC).
Correctly synchronized programs = programs whose SC executions have no races.


async

Stmt ::= async PlaceExpSingleListopt Stmt

async (P) S
– Creates a new child activity at place P that executes statement S
– Returns immediately
– S may reference final variables in enclosing blocks
– Activities cannot be named
– An activity cannot be aborted or cancelled
cf. Cilk's spawn

    // global dist. array
    final double a[D] = …;
    final int k = …;
    async ( a.distribution[99] ) {
      // executed at a[99]'s place
      atomic a[99] = k;
    }

Memory model: happens-before (hb) edge between the statement before the async and the start of the async.
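
As a rough analogy only, the spawn-and-return behaviour of async can be mimicked with a Python thread. This is a sketch, not X10: places are not modeled, a lock stands in for the atomic block, and the names `run_async_example`, `body` and the value 7.0 are made up for illustration.

```python
import threading

def run_async_example():
    a = [0.0] * 100            # stands in for the distributed array a[D]
    k = 7.0                    # stands in for the final variable k (arbitrary value)
    atomic = threading.Lock()  # crude stand-in for an atomic block

    def body():
        # The child may read k because it is never reassigned after the spawn.
        with atomic:
            a[99] = k

    child = threading.Thread(target=body)
    child.start()              # like async: the parent continues immediately
    child.join()               # NOT part of async; added only to inspect the result
    return a[99]

print(run_async_example())     # 7.0
```

The join at the end is what the deck's next construct, finish, provides in X10; async by itself gives the parent no way to wait for the child.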

finish

Stmt ::= finish Stmt

finish S
– Execute S, but wait until all (transitively) spawned asyncs have terminated.
cf. Cilk's sync

Rooted exception model
– Trap all exceptions thrown by spawned activities.
– Throw an (aggregate) exception if any spawned async terminates abruptly.
– Implicit finish at main activity

finish is useful for expressing "synchronous" operations on (local or) remote data.

    finish ateach(point [i]:A)
      A[i] = i;
    finish async (A.distribution [j])
      A[j] = 2;
    // all A[i]=i will complete
    // before A[j]=2;

Memory model: hb edge between the last statement of each async and the statement after finish S.


foreach

foreach ( FormalParam: Expr ) Stmt

foreach (point p: R) S
– Creates |R| async statements in parallel at current place.
– foreach (point p: R) S is equivalent to for (point p: R) async { S }
– Termination of all (recursively created) activities can be ensured with finish.
– finish foreach is a convenient way to achieve master-slave fork/join parallelism (OpenMP programming model)


atomic

Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity.

An atomic block ...
– must be nonblocking
– must not create concurrent activities (sequential)
– must not access remote data (local)

Stmt ::= atomic Statement
MethodModifier ::= atomic

    // target defined in lexically
    // enclosing scope.
    atomic boolean CAS(Object oldVal, Object newVal) {
      if (target.equals(oldVal)) {
        target = newVal;
        return true;
      }
      return false;
    }

    // push data onto concurrent
    // list-stack
    Node node = new Node(data);
    atomic {
      node.next = head;
      head = node;
    }

Memory model: the end of one atomic block (transaction) happens before the start of the next one in the same place.


Clocks: Motivation

– Activity coordination using finish and force() is accomplished by checking for activity termination
– However, there are many cases in which a producer-consumer relationship exists among the activities, and a "barrier"-like coordination is needed without waiting for activity termination
    • The activities involved may be in the same place or in different places

[Figure: Activity 0, Activity 1, Activity 2, ... advancing together through Phase 0, Phase 1, ...]


Clocks (1/2)

clock c = clock.factory.clock();
– Allocate a clock, register current activity with it. Phase 0 of c starts.

async(…) clocked (c1,c2,…) S
ateach(…) clocked (c1,c2,…) S
foreach(…) clocked (c1,c2,…) S
– Create async activities registered on clocks c1, c2, …

c.resume();
– Nonblocking operation that signals completion of work by current activity for this phase of clock c

next;
– Barrier: suspend until all clocks that the current activity is registered with can advance. c.resume() is first performed for each such clock, if needed.

next can be viewed like a "finish" of all computations under way in the current phase of the clock.


Clocks (2/2)

c.drop();
– Unregister with c. A terminating activity will implicitly drop all clocks that it is registered on.

c.registered()
– Return true iff current activity is registered on clock c
– c.dropped() returns the opposite of c.registered()

Activity is deregistered from a clock when it terminates.

Memory model: hb edge between the next statement of all registered activities on c, and the subsequent statement in each activity.
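
The clock operations above can be approximated in Python as a barrier whose set of registered activities may shrink. This is a minimal sketch under stated assumptions: the `Clock` class, `register`, and `run_phases` are hypothetical names, c.resume() is not modeled, and the creating activity is not registered (in X10 it would be).

```python
import threading

class Clock:
    """Toy clock: a barrier whose set of registered activities can shrink."""
    def __init__(self):
        self._cond = threading.Condition()
        self._registered = 0
        self._arrived = 0
        self.phase = 0

    def register(self):
        with self._cond:
            self._registered += 1

    def _advance_if_ready(self):
        # Caller holds the lock. Advance once every registered activity arrived.
        if self._registered > 0 and self._arrived == self._registered:
            self._arrived = 0
            self.phase += 1
            self._cond.notify_all()

    def next(self):
        """Barrier: block until the clock moves to the next phase."""
        with self._cond:
            my_phase = self.phase
            self._arrived += 1
            self._advance_if_ready()
            while self.phase == my_phase:
                self._cond.wait()

    def drop(self):
        """Unregister; may release activities already waiting in next()."""
        with self._cond:
            self._registered -= 1
            self._advance_if_ready()

def run_phases(n=4):
    c = Clock()
    slots = [0] * n
    sums = [0] * n

    def worker(i):
        slots[i] = i + 1       # phase 0: write own slot
        c.next()               # all phase-0 writes are complete after this
        sums[i] = sum(slots)   # phase 1: read every slot
        c.drop()               # what termination would do implicitly

    threads = []
    for i in range(n):
        c.register()           # parent registers the child before it starts
        threads.append(threading.Thread(target=worker, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums

print(run_phases())            # [10, 10, 10, 10]
```

Registering each child before it starts mirrors the `clocked (c)` clause on async: if children registered themselves after starting, an early arrival could advance the clock before its siblings were counted.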
Static semantics
– An activity may operate only on those clocks it is registered with.
– In finish S, S may not contain any (top-level) clocked asyncs.

Dynamic semantics
– A clock c can advance only when all its registered activities have executed c.resume().
– An activity may not pass on clocks on which it is not live to subactivities (ClockUseException).
– An activity may not transmit a clock into the scope of a finish (ClockUseException).

No explicit operation to register a clock.


Example (TutClock1.x10)

    finish async {
      final clock c = clock.factory.clock();
      foreach (point[i]: [1:N]) clocked (c) {   // parent transmits clock to child
        while ( true ) {
          int old_A_i = A[i];
          int new_A_i = Math.min(A[i],B[i]);
          if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]);
          if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]);
          A[i] = new_A_i;
          next;
          int old_B_i = B[i];
          int new_B_i = Math.min(B[i],A[i]);
          if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]);
          if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);
          B[i] = new_B_i;
          next;
          // exiting from the while loop terminates the activity for
          // iteration i, and automatically deregisters it from the clock
          if ( old_A_i == new_A_i && old_B_i == new_B_i ) break;
        } // while
      } // foreach
    } // finish async


Clocked final variables

– Permit variables to be marked as clocked final, e.g. clocked(c) final double[.] a = …
– In each phase of the clock, the variable is immutable.
– Writes to such a variable are performed on a shadow copy.
– The main copy of the variable is updated with the value in the shadow copy when the clock moves to the next phase.
– Clocked final variables cannot introduce non-determinism.
    • Assuming multiple writers write the same value in each phase.


Clocked final example: Array relaxation

    // G elements are assigned to at most once in each phase of clock c.
    clocked (c) final int [0:M-1,0:N-1] G = …;
    // Each activity is registered on c.
    finish foreach (int i,j in [1:M-1,1:N-1]) clocked (c) {
      for (int p in [0:TimeStep-1]) {
        // Read current value of cell.
        // Value written into ghost copy of G[i,j].
        G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];
        // Wait for clock to advance.
        // Write visible (only) when clock advances.
        next;
      }
    }


Imperative Programming Revisited

Variables
– Variable = Value in a Box
– Read: fetch current value
– Write: change value
– Stability condition: Value does not change unless a write is performed

Very powerful
– Permit repeated many-writer, many-reader communication through arbitrary reference graphs
– Mutability in the presence of sharing
– Permits different variables to change at different rates.

Asynchrony introduces indeterminacy

    int x = 0;
    async x=1;
    print(x);

– May write out either 0 or 1.
– Bugs due to races are very difficult to debug.
– Reader-reader, reader-writer, writer-writer conflicts.


Determinate programming design patterns

– NAS parallel benchmarks
    • Conjugate gradient
    • Multigrid
    • LU factorization
– Single producer, multiple copying consumers
    • Kahn networks, StreamIt
    • Pipelining
– Stencil computations
    • Jacobi, SOR
– Molecular dynamics
– Graph algorithms
    • Connected components
– Detecting stable properties
    • Clocks!
    • Short circuit technique

Parallelism for performance/scaling, not control.


Determinate programming anti-patterns

– Reactive computing: arrival-order indeterminism
    • "Races in the world"
    • E.g. bank accounts
– Resource contention: any of several possible outcomes is acceptable
    • Mutual exclusion
    • Load balancing (shared work list)
– Algorithm may permit any one of many possible solutions
    • One solution for N-queens
    • Some minimal spanning tree
    • Some Delaunay triangulation

But the program may still contain determinate concurrent components.


Determinate Concurrent Imperative Frameworks

Asynchronous Kahn networks
– Nodes can be thought of as (continuous) functions over streams.
– Pop/peek
– Push
– Node-local state may mutate arbitrarily

Concurrent Constraint Programming
– Tell constraints
– Ask if a constraint is true
– Subsumes Kahn networks (dataflow).
– Subsumes (det) concurrent logic programming, lazy functional programming

Neither supports arbitrary mutable variables.


Determinate Concurrent Imperative Frameworks

Safe Asynchrony (Steele 1991)
– Parent may communicate with children.
– Children may communicate with parent.
– Siblings may communicate with each other only through commutative, associative writes ("commuting writes").

Good:

    int x=0;
    finish foreach (int i in 1:N) {
      x += i;
    }
    print(x); // N*(N+1)/2

Bad:

    int x=0;
    finish foreach (int i in 1:N) {
      x += i;
      async print(x);
    }

Useful but limited. Does not permit dataflow synchronization.


Determinate X10

DataType:
    ClassName | InterfaceName | ArrayType
    nullable DataType
    future DataType
    local DataType
    det DataType
    indet DataType
Kind:
    value | reference
Stm:
    async [ ( Place ) ] [clocked ClockList ] Stm
    when ( SimpleExpr ) Stm
    finish Stm
    next;
    c.resume()
    c.drop()
    for ( i : Region ) Stm
    foreach ( i : Region ) Stm
    ateach ( I : Distribution ) Stm
Expr:
    ArrayExpr
ClassModifier:
    Kind
MethodModifier:
    atomic

Legend (color-coded on the slide): constructs not available; constructs added.


local variables

– Instances of value classes are always considered local.
– A mutable object is local only if it is marked as local when created:
    • new local T(…)
– A value of a local type can be assigned only into a variable of local type, or a field of a local object.
– Variables of a local type are not visible to contained asyncs.
– An async spawned in a finish may assign a value of a local type to a local variable of the parent activity.
– A value may be cast to local T; the cast may fail.
    • E.g. local T x = (local T) this;

Invariant: Each activity owns the local objects and local variables it creates. Only local objects can reference local objects.

Local objects of a terminated activity become local objects of its parent.

Ownership type system used to maintain locality.


det locations

– A det location is represented in memory as a stream (indexed sequence) of immutable values.
– Each activity maintains an index i and a clean/dirty bit for every det location.
    • Initially i=1, v[0] contains initial value.
    • Read: If clean, block until v[i] is written and return v[i++]; else return v[i-1]. Mark as clean.
    • Write: Write into v[i++]. Mark as dirty.
– Note: the index is updated only as a result of the activity's own operations.
– World Map = collection of indices for an activity.
– Index transmission rules.
    • async: Activity initialized with current world map of parent activity.
    • finish: world map of activity is lubbed (joined by least upper bound) with world map of finished activities.
    • (clean lub dirty = dirty)
– The clock of clocked final is made implicit.


Indet locations

– Can be recovered as det locations + a mutable shared index ("current").
– An activity's world map does not need to contain an index for indet locations.
– All activities read and update the location through current.
– Therefore the stream representation is not necessary; only the "current" value need be kept.
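
The read, write and world-map rules for det locations can be modeled in a few lines of Python. This is a sketch under assumptions, not the authors' implementation: `DetLocation`, `Activity`, `declare` and `finish` are hypothetical names; the declaring activity is assumed to start dirty (its initialization counts as a write); equal-valued writes to one index are accepted and unequal ones rejected; and the lub of two entries is taken as the larger index, with dirty winning ties.

```python
import threading

class DetLocation:
    """A det location: a growing stream of immutable values."""
    def __init__(self, initial):
        self._cond = threading.Condition()
        self._values = [initial]              # v[0] holds the initial value

    def get(self, i):
        with self._cond:
            while i >= len(self._values):     # block until v[i] is written
                self._cond.wait()
            return self._values[i]

    def put(self, i, value):
        with self._cond:
            if i < len(self._values):         # a sibling already wrote v[i]
                if self._values[i] != value:
                    raise ValueError("conflicting writes to one index")
                return
            self._values.append(value)
            self._cond.notify_all()

class Activity:
    def __init__(self, world=None):
        self.world = dict(world or {})        # location -> (index i, dirty bit)

    def declare(self, initial):
        loc = DetLocation(initial)
        self.world[loc] = (1, True)           # assumption: starts dirty
        return loc

    def read(self, loc):
        i, dirty = self.world[loc]
        if dirty:                             # dirty: return v[i-1], mark clean
            self.world[loc] = (i, False)
            return loc.get(i - 1)
        value = loc.get(i)                    # clean: wait for v[i], then i++
        self.world[loc] = (i + 1, False)
        return value

    def write(self, loc, value):
        i, _ = self.world[loc]
        loc.put(i, value)                     # write v[i], then i++, mark dirty
        self.world[loc] = (i + 1, True)

    def finish(self, *bodies):
        children = [Activity(self.world) for _ in bodies]   # async: copy the map
        threads = [threading.Thread(target=body, args=(child,))
                   for body, child in zip(bodies, children)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        for child in children:                # finish: lub the world maps
            for loc, state in child.world.items():
                self.world[loc] = max(self.world.get(loc, state), state)

def example():
    root = Activity()
    x = root.declare(0)
    seen = []

    def reader(me):
        seen.append(me.read(x))
        seen.append(me.read(x))

    def writer(me):
        me.write(x, 1)
        me.write(x, 2)

    root.finish(reader, writer)
    return seen, root.read(x)

print(example())                              # ([0, 1], 2)
```

Under these assumptions the reader's first read returns the initial value without blocking and its second read waits for the writer's first write, so the result does not depend on how the two threads are scheduled.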

det example: Array relaxation

    det int [0:M-1,0:N-1] G = …;
    finish foreach (local int i,j in [1:M-1,1:N-1]) {
      for (local int p in [0:TimeStep-1]) {
        G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];
      }
    }

All clock manipulations are implicit.


Some simple examples

    det int x=0;
    finish {
      async {                  // A1
        int r1 = x;
        int r2 = x;
        println(r1);
        println(r2);
      }
      async {x=1;x=2;}         // A2
    }

    i   x   A1        A2
    0   0   read r1
    1   1   read r2   write 1
    2   2             write 2

Convention: A type not marked det is assumed marked local.

Only one result, independent of the scheduler!


Some simple examples

    det int x=0;
    finish {
      async {int r1 = x; int r2 = x; println(r1); println(r2);}   // A1
      async {x=1;}                                                // A2
      async {x=1; int r3 = x; async {x=2;}}                       // A3, which spawns A4
    }
    println(x);

    i   x   A1 (0)    A2 (0)    A3 (0)             A4 (2)
    0   0   read r1
    1   1   read r2   write 1   write 1; read r3
    2   2                                          write 2

All programs are determinate.


Some StreamIt examples

StreamIt:

    void->void pipeline Minimal {
      add IntSource;
      add IntPrinter;
    }
    void->int filter IntSource {
      int x;
      init {x=0;}
      work push 1 { push(x++);}
    }
    int->void filter IntPrinter {
      work pop 1 { print(pop());}
    }

Det X10:

    det int x=0;
    async while (true) x++;
    async while (true) println(x);
    …

The communication is through assignment to x, so the same result is obtained with:

    det int x=0;
    async while (true) ++x;
    async while (true) println(x);
    …

Each shared variable is a multi-reader, multi-writer stream.
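
The source/printer pipeline can be imitated in Python by treating the shared variable as an append-only stream that each reader walks with its own cursor. A minimal sketch, assuming a bounded run so that it terminates; `Stream`, `run_pipeline`, `count` and `readers` are illustrative names, and only a single writer is modeled.

```python
import threading

class Stream:
    """Append-only stream; readers block until the value they want exists."""
    def __init__(self):
        self._cond = threading.Condition()
        self._items = []

    def push(self, value):
        with self._cond:
            self._items.append(value)
            self._cond.notify_all()

    def get(self, i):
        with self._cond:
            while i >= len(self._items):
                self._cond.wait()
            return self._items[i]

def run_pipeline(count=5, readers=2):
    x = Stream()
    outputs = [[] for _ in range(readers)]

    def source():                      # plays the role of IntSource
        for value in range(1, count + 1):
            x.push(value)

    def printer(out):                  # plays the role of IntPrinter
        for i in range(count):         # private cursor: nothing is consumed
            out.append(x.get(i))

    threads = [threading.Thread(target=source)]
    threads += [threading.Thread(target=printer, args=(out,)) for out in outputs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs

print(run_pipeline())                  # [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
```

Because reading never removes a value, adding a second printer does not change what the first one sees, which is the sense in which the stream is multi-reader.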

Some StreamIt examples: Fibonacci

    det int x=1, y=1;
    async while (true) y=x;     // Activity 1
    async while (true) x+=y;    // Activity 2

    i   y   x
    0   1   1
    1   1   2
    2   2   3
    3   3   5
    …   …   …

Can express any recursive, asynchronous Kahn network.


StreamIt examples: Moving Average

StreamIt:

    void->void pipeline MovingAverage {
      add IntSource();
      add Averager(10);
      add IntPrinter();
    }
    int->int filter Averager(int n) {
      work pop 1 push 1 peek n {
        int sum=0;
        for (int i=0; i < n; i++)
          sum += peek(i);
        push(sum/n);
        pop();
      }
    }

Det X10:

    det int y=0;
    det int x=0;
    async while (true) x++;
    async while (true) {
      int sum=x;
      for (int i in 1:N-1) sum += peek(x, i);
      y = sum/N;
    }

• peek(x, i) reads the i'th future value, without popping it. Blocks if necessary.


Cannon matrix multiplication

    void canon (det double[N,N] c, det double[N,N] a, det double[N,N] b) {
      // initial alignment: skew row i of a by i and column j of b by j
      finish foreach (int i,j in [0:N-1,0:N-1]) {
        a[i,j] = a[i,(j+i)%N];
        b[i,j] = b[(i+j)%N, j];
      }
      for (int k in [0:N-1])
        finish foreach (int i,j in [0:N-1,0:N-1]) {
          c[i,j] = c[i,j] + a[i,j]*b[i,j];
          a[i,j] = a[i,(j+1)%N];
          b[i,j] = b[(i+1)%N, j];
        }
    }

The natural sequential program works (for finish foreach).


Implementation

– Each activity's world map increases monotonically with time.
– Use garbage collection to erase past unreachable values.
– Programs with no sibling communication may be executed in buffers with unit windows.
– Considering permitting the user to specify bounds on variables (cf. push/pop specifications in StreamIt).
    • This will force writes to become blocking as well.
– Scheduling strategy affects size of buffers, not result.
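
To illustrate the claim that scheduling affects buffer sizes but not results, the Fibonacci network above can be run as two Python threads over blocking streams. This is a sketch assuming a bounded number of steps and one writer per stream; `Stream`, `snapshot` and `fibonacci` are hypothetical names, and values are never garbage collected here.

```python
import threading

class Stream:
    """Append-only stream seeded with an initial value at index 0."""
    def __init__(self, initial):
        self._cond = threading.Condition()
        self._items = [initial]

    def push(self, value):
        with self._cond:
            self._items.append(value)
            self._cond.notify_all()

    def get(self, i):
        with self._cond:
            while i >= len(self._items):   # block until index i is written
                self._cond.wait()
            return self._items[i]

    def snapshot(self):
        with self._cond:
            return list(self._items)

def fibonacci(steps):
    x, y = Stream(1), Stream(1)

    def copy_x_to_y():                     # async while (true) y = x;
        for k in range(steps):
            y.push(x.get(k))               # y[k+1] = x[k]

    def add_y_to_x():                      # async while (true) x += y;
        for k in range(steps):
            x.push(x.get(k) + y.get(k))    # x[k+1] = x[k] + y[k]

    threads = [threading.Thread(target=copy_x_to_y),
               threading.Thread(target=add_y_to_x)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x.snapshot(), y.snapshot()

print(fibonacci(4))                        # ([1, 2, 3, 5, 8], [1, 1, 2, 3, 5])
```

Whichever thread runs ahead only changes how many unread values sit in the streams at a given moment; the contents of both streams are fixed by the dataflow.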

Future work

Formalization
– MJ/CF
– Very straightforward additions to field read/write.
– Paper contains details.

Implementation
– Leverage connection with StreamIt, and static scheduling.

Paper contains ideas on detecting deadlock (stabilities) at runtime and recovering from them.
– Programmability being investigated.
– Devise static type system to establish deadlock-freedom.

Coarser granularity for indices
– Use same clock for many variables.
– Permits "coordinated" changes to multiple variables.

Introduce fusion operation (x -> y) to support CCP.


Backup


StreamIt examples: Bandpass filter

StreamIt:

    float->float pipeline BandPassFilter(float rate, float low,
                                         float high, int taps) {
      add BPFCore(rate, low, high, taps);
      add Subtracter();
    }
    float->float splitjoin BPFCore(float rate, float low,
                                   float high, int taps) {
      split duplicate;
      add LowPass(rate, low, taps, 0);
      add LowPass(rate, high, taps, 0);
      join roundrobin;
    }
    float->float filter Subtracter {
      work pop 2 push 1 {
        push(peek(1)-peek(0));
        pop(); pop();
      }
    }

Det X10:

    float bandPassFilter(float rate, float low,
                         float high, int taps, int in) {
      int tmp=in;
      det int in1=tmp, in2=tmp;
      async while (true) in1=in;
      async while (true) in2=in;
      det int o1 = lowPass(rate, low, taps, 0, in1),
              o2 = lowPass(rate, high, taps, 0, in2);
      det int o = o1-o2;
      async while(true) o = o1-o2;
      return o;
    }

Functions return streams.


Histogram

– Permit "commuting" writes to be performed simultaneously in the same phase.
– A phase is completed when all activities that can write have written.

    <int N> [1:N][] histogram([1:N][] A) {
      final int[] B = new int [1:N];
      finish foreach(int i in A) B[A[i]]++;
      // B's phase is not yet complete.
      // A subsequent read will complete it.
      return B;
    }


Cilk programs with races

    int x;
    cilk void foo() {
      x = x +1;
    }
    cilk int main() {
      x=0;
      spawn foo();
      spawn foo();
      sync;
      printf("x is %d\n", x);
      return 0;
    }

Determinate: Will always print 1 in CF.

CF smoothly combines Cilk and StreamIt.
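
One way to see why the racy Cilk program prints 1 under a stream semantics is the following Python model. It is a sketch built on assumptions rather than a description of CF: both spawned activities are assumed to inherit the same index from main, so both read the same earlier value and write the same slot; `run_cf_model` and its single-list representation are made up for illustration.

```python
import threading

def run_cf_model():
    stream = [0]                    # v[0] = 0: the value main wrote before the spawns
    lock = threading.Lock()

    def foo(index):                 # index inherited from main at spawn time
        with lock:
            seen = stream[index - 1]            # x as seen in the inherited view
            if len(stream) == index:
                stream.append(seen + 1)         # first writer fills v[index]
            elif stream[index] != seen + 1:     # later writers must agree
                raise ValueError("conflicting writes to one index")

    children = [threading.Thread(target=foo, args=(1,)) for _ in range(2)]
    for t in children:
        t.start()
    for t in children:
        t.join()                    # plays the role of sync
    return stream[-1]               # what main reads after the sync

print(run_cf_model())               # 1
```

In ordinary Cilk the two increments race on a single mutable cell, so the program may print 1 or 2; in this model neither child can observe the other's write, so the only possible result is 1.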