Determinate Imperative Programming: The CF Model

Vijay Saraswat
IBM TJ Watson Research Center
Joint work with Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun
http://www.saraswat.org/cf.html

Outline
• Problem: Many concurrent imperative programs are determinate. Determinacy is not apparent from the syntax.
• Basic idea: A variable is the stream of values written to it by a thread.
• Many examples
• Semantics
• Implementation
• Future work

Background: X10
• Five basic themes:
  - Partitioned address space
  - Pervasive explicit asynchrony (Cilk-style recursive parallelism)
  - Java base
  - Guaranteed VM invariants
  - Explicit, distributed VM
• Few language extensions:
    <s> = async <s>
    <s> = finish <s>
    <s> = foreach ( <v>, …, <v> in <e> ) <s>
• Multidimensional arrays over distributions
• Subsumes MPI, OpenMP, SPMD languages, Cilk …

X10: clocks, clocked final data structures
• Clocks
  - Clocks can be created dynamically.
  - Activities are registered with clocks.
  - An activity may register a newly created activity with one of its clocks.
  - “next;” resumes each clock; blocks until each clock advances. This is sufficient for deadlock-freedom.
  - Clock advances when all activities registered on it resume the clock.
  - Adequate for parallel operations on arrays, but not dataflow.
• Operations: c.resume(); next; c.drop();
• Clocked final datum
  - In each phase of the clock the datum is immutable.
  - Read gets current value; write updates in next phase.
• Clocks do not introduce deadlock; clocked finals are determinate.

Clocked final example: Array relaxation

    // G elements are assigned to at most once in each phase of clock c.
    int clocked (c) final [0:M-1,0:N-1] G = …;
    // Each activity is registered on c.
    finish foreach (int i,j in [1:M-1,1:N-1]) clocked (c) {
      for (int p in [0:TimeStep-1]) {
        // Read current value of cell. Write visible (only) when clock advances.
        G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];
        next;   // Wait for clock to advance.
      }
    }

Takeaway: Each cell is assigned a clocked stream of immutable values.
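The clocked-final discipline of the array relaxation example (reads see the current phase, writes become visible only when the clock advances) can be imitated sequentially with two buffers. The following is a minimal Python sketch, not X10: the names relax_phase and relax, the random shuffle standing in for an arbitrary schedule of activities, and the sample grid are assumptions made purely for illustration.

```python
import random

def relax_phase(grid, omega, order):
    """One clock phase: every read sees `grid`, every write lands in `nxt`."""
    nxt = [row[:] for row in grid]            # next-phase buffer
    for i, j in order:                        # visit cells in the given order
        nxt[i][j] = (omega / 4 * (grid[i - 1][j] + grid[i + 1][j]
                                  + grid[i][j - 1] + grid[i][j + 1])
                     + (1 - omega) * grid[i][j])
    return nxt                                # "clock advances": writes become visible

def relax(grid, omega, steps, seed):
    """Run `steps` phases, visiting interior cells in a seed-dependent order."""
    rng = random.Random(seed)
    interior = [(i, j) for i in range(1, len(grid) - 1)
                for j in range(1, len(grid[0]) - 1)]
    for _ in range(steps):
        rng.shuffle(interior)                 # stands in for an arbitrary scheduler
        grid = relax_phase(grid, omega, interior)
    return grid

# Hypothetical 5x5 starting grid.
start = [[float(i * i + j) for j in range(5)] for i in range(5)]
a = relax(start, 0.5, 3, seed=1)
b = relax(start, 0.5, 3, seed=2)
print(a == b)   # the visiting order does not affect the result
```

Because no cell ever reads a value written in the same phase, the order in which cells are visited cannot influence the outcome; this is the sense in which each cell holds a clocked stream of immutable values.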

Imperative Programming Revisited
• Variables: value in a box
  - Read: fetch current value
  - Write: change value
  - Stability condition: value does not change unless a write is performed
• Very powerful
  - Permit repeated many-writer, many-reader communication through arbitrary reference graphs
• Asynchrony introduces indeterminacy

    int x = 0;
    async x=1;
    print(x);    // May write out either 0 or 1.

• Reader-reader, reader-writer, writer-writer conflicts.

Determinate Concurrent Imperative Frameworks
• Asynchronous Kahn networks
  - Nodes can be thought of as (continuous) functions over streams.
  - Pop/peek
  - Push
  - Node-local state may mutate arbitrarily
• Concurrent Constraint Programming
  - Tell constraints
  - Ask if a constraint is true
  - Subsumes Kahn networks (dataflow).
  - Subsumes (det) concurrent logic programming, lazy functional programming
• Do not support arbitrary mutable variables.

Determinate Concurrent Imperative Frameworks
• Safe Asynchrony (Steele 1991)
  - Parent may communicate with children.
  - Children may communicate with parent.
  - Siblings may communicate with each other only through commutative, associative writes (“commuting writes”).
• Good:

    int x=0;
    finish foreach (int i in 1:N) {
      x += i;
    }
    print(x);    // N*(N+1)/2

• Bad:

    int x=0;
    finish foreach (int i in 1:N) {
      x += i;
      async print(x);
    }

• Useful but limited. Does not permit dataflow synch.

The CF Basic model
• A shared variable is a stream of immutable values.
• Each activity maintains an index i + clean/dirty bit for every shared variable.
• World Map = collection of indices for an activity.
• Index transmission rules:
  - Initially i=1, v[0] contains initial value.
  - Read: If clean, block until v[i] is written and return v[i++]; else return v[i-1]. Mark as clean.
  - Write: Write into v[i++]. Mark as dirty.
  - A read stutters (returns value in last phase) if no activity can write in this phase, e.g. for local variables.
  - Activity initialized with current world map of parent activity.
  - On finish, world map of activity is lubbed with world map of finished activities.
    (clean lub dirty = clean)
• All programs are determinate and scheduler independent.
• May deadlock … nexts are not conjunctive.
• The clock of clocked final is made implicit.

CF example: Array relaxation

    shared int [0:M-1,0:N-1] G = …;
    finish foreach (int i,j in [1:M-1,1:N-1]) {
      for (int p in [0:TimeStep-1]) {
        G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];
      }
    }

All clock manipulations are implicit.

Some simple examples

    shared int x=0;
    finish {
      async {int r1 = x; int r2 = x; println(r1); println(r2);}
      async {x=1;x=2;}
    }

Output:
    0
    1

    i   x   A1        A2
    0   0   read r1
    1   1   read r2   write 1
    2   2             write 2

Only one result, independent of the scheduler!

Some simple examples

    shared int x=0;
    finish {
      async {int r1 = x; int r2 = x; println(r1); println(r2);}
      async {x=1;}
      async {x=1; int r3 = x; async {x=2;}}
    }
    println(x);

Output:
    0
    1
    2

    i   x   A1 (0)    A2 (0)    A3 (0)             A4 (2)
    0   0   read r1
    1   1   read r2   write 1   write 1; read r3
    2   2                                          write 2

All programs are determinate.

Some StreamIt examples

StreamIt:

    void->void pipeline Minimal {
      add IntSource;
      add IntPrinter;
    }
    void->int filter IntSource {
      int x;
      init {x=0;}
      work push 1 { push(x++);}
    }
    int->void filter IntPrinter {
      work pop 1 { print(pop());}
    }

X10/CF:

    shared int x=0;
    async while (true) x++;
    async while (true) println(x);

Output:
    0
    1
    …

The communication is through assignment to x, so the same result is obtained with:

    shared int x=0;
    async while (true) ++x;
    async while (true) println(x);

Output:
    0
    1
    …

Each shared variable is a multi-reader, multi-writer stream.

Some StreamIt examples: fibonacci

    shared int x=1, y=1;
    async while (true) y=x;     // Activity 1
    async while (true) x+=y;    // Activity 2

    i   0   1   2   3   …
    y   1   1   2   3   …
    x   1   2   3   5   …

Can express any recursive, asynchronous Kahn network.
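The read and write rules of the CF basic model can be exercised on the first simple example (one activity reading x twice, another writing 1 and then 2). What follows is a rough Python sketch under stated assumptions, not the authors' implementation. The names SharedVar, View, Blocked, reader, writer and run are invented here. It is assumed that the declaring activity starts out dirty, as if it had just written v[0], and that a child inherits that bit together with the index. Blocking is modeled by an exception plus retry, two writes to the same cell are accepted only if they are equal, and the stutter rule and the lub on finish are left out.

```python
import random

class Blocked(Exception):
    """Raised when a read would have to wait for a value not yet written."""

class SharedVar:
    """A shared variable: a growing stream of immutable values; v[0] is the initial value."""
    def __init__(self, initial):
        self.v = [initial]

class View:
    """One activity's world-map entry for one variable: index i plus a dirty bit."""
    def __init__(self, var, i=1, dirty=True):
        # Assumption: the declaring activity starts dirty, as if it had just written v[0].
        self.var, self.i, self.dirty = var, i, dirty

    def fork(self):
        # A new activity is initialized with the parent's current entry.
        return View(self.var, self.i, self.dirty)

    def read(self):
        if self.dirty:                       # dirty: return v[i-1], mark clean
            self.dirty = False
            return self.var.v[self.i - 1]
        if self.i >= len(self.var.v):        # clean, but v[i] is not written yet
            raise Blocked
        value = self.var.v[self.i]           # clean: return v[i++]
        self.i += 1
        return value

    def write(self, value):
        v = self.var.v
        if self.i == len(v):
            v.append(value)
        elif v[self.i] != value:             # simplification: only equal writes share a cell
            raise ValueError("conflicting writes in one phase")
        self.i += 1
        self.dirty = True

def reader(view, out):
    for _ in range(2):                       # int r1 = x; int r2 = x;
        while True:
            try:
                out.append(view.read())
                break
            except Blocked:
                yield                        # give other activities a turn
        yield

def writer(view):
    for value in (1, 2):                     # x=1; x=2;
        view.write(value)
        yield

def run(seed):
    rng = random.Random(seed)
    parent = View(SharedVar(0))              # shared int x=0;
    out = []
    live = [reader(parent.fork(), out), writer(parent.fork())]
    while live:                              # random interleaving of the two activities
        act = rng.choice(live)
        try:
            next(act)
        except StopIteration:
            live.remove(act)
    return out

results = {tuple(run(seed)) for seed in range(50)}
print(results)   # prints {(0, 1)}
```

Under these assumptions every interleaving yields the same pair of reads, 0 then 1: the reader's second read waits for v[1] however late the writer is scheduled.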

StreamIt examples: Moving Average

StreamIt:

    void->void pipeline MovingAverage {
      add IntSource();
      add Averager(10);
      add IntPrinter();
    }
    int->int filter Averager(int n) {
      work pop 1 push 1 peek n {
        int sum=0;
        for (int i=0; i < n; i++)
          sum += peek(i);
        push(sum/n);
        pop();
      }
    }

X10/CF:

    shared int y=0;
    shared int x=0;
    async while (true) x++;
    async while (true) {
      int sum=x;
      for (int i in 1:N-1) sum += peek(x, i);
      y = sum/N;
    }

• peek(x, i) reads the i’th future value, without popping it. Blocks if necessary.

StreamIt examples: Bandpass filter

StreamIt:

    float->float pipeline BandPassFilter(float rate, float low, float high, int taps) {
      add BPFCore(rate, low, high, taps);
      add Subtracter();
    }
    float->float splitjoin BPFCore(float rate, float low, float high, int taps) {
      split duplicate;
      add LowPass(rate, low, taps, 0);
      add LowPass(rate, high, taps, 0);
      join roundrobin;
    }
    float->float filter Subtracter {
      work pop 2 push 1 {
        push(peek(1)-peek(0));
        pop(); pop();
      }
    }

X10/CF:

    float bandPassFilter(float rate, float low, float high, int taps, int in) {
      int tmp=in;
      shared int in1=tmp, in2=tmp;
      async while (true) in1=in;
      async while (true) in2=in;
      shared int o1 = lowPass(rate, low, taps, 0, in1),
                 o2 = lowPass(rate, high, taps, 0, in2);
      shared int o = o1-o2;
      async while(true) o = o1-o2;
      return o;
    }

Functions return streams.

Cannon matrix multiplication

    // Parameters whose values are finalized.
    <final int N> void canon (double[N,N] c, double[N,N] a, double[N,N] b) {
      finish foreach (int i,j in [0:N-1,0:N-1]) {
        a[i,j] = a[i,(i+j)%N];
        b[i,j] = b[(i+j)%N, j];
      }
      for (int k in [0:N-1])
        // Local variables in each activity.
        finish foreach (int i,j in [0:N-1,0:N-1]) {
          c[i,j] = c[i,j] + a[i,j]*b[i,j];
          a[i,j] = a[i,(j+1)%N];
          b[i,j] = b[(i+1)%N, j];
        }
    }

The natural sequential program works (for finish foreach).

Histogram
• Permit “commuting” writes to be performed simultaneously in the same phase.
• Phase is completed when all activities that can write have written.

    <int N> [1:N][] histogram([1:N][] A) {
      final int[] B = new int [1:N];
      finish foreach(int i in A) B[A[i]]++;
      return B;    // B’s phase is not yet complete. A subsequent read will complete it.
    }

Cilk programs with races

    int x;
    cilk void foo() {
      x = x +1;
    }
    cilk int main() {
      x=0;
      spawn foo();
      spawn foo();
      sync;
      printf("x is %d\n", x);
      return 0;
    }

• Determinate: Will always print 1 in CF.
• CF smoothly combines Cilk and StreamIt.

Implementation
• Each activity’s world map increases monotonically with time.
• Use garbage collection to erase past unreachable values.
• Programs with no sibling communication may be executed in buffers with unit windows.
• Considering permitting user to specify bounds on variables (cf. push/pop specifications in StreamIt).
  - This will force writes to become blocking as well.
• Scheduling strategy affects size of buffers, not result.

Formalization
• MJ/CF
• Very straightforward additions to field read/write.
• Surprisingly localized.
• Paper contains details.

Future work
• Paper contains ideas on detecting deadlock (stabilities) at runtime and recovering from them.
• Implementation.
• Programmability being investigated.
• Leverage connection with StreamIt, and static scheduling.
• Coarser granularity for indices.
  - Use same clock for many variables.
  - Permits “coordinated” changes to multiple variables.