Idempotent Work Stealing
Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat
PPoPP'09

Outline:
◦ Memory Operations Reordering
◦ Problem Definition – Idempotent Work-Stealing
◦ The Algorithms
◦ Comparison to Previous Work
◦ Summary


Memory Operations Reordering

Some architectures reorder memory accesses to achieve faster execution.
[Figure: the same four accesses shown in two orders: read(a), read(b), write(a,1), write(b,2) and read(a), write(b,2), write(a,1), read(b)]
This is a good optimization for uniprocessors, but it may be dangerous on multiprocessors.

Memory: a = 0; b = 0

P1:
    L1: if (read(a) = 0) goto L1
        print(read(b))

P2:
    write(b, 7)
    write(a, 1)

What is the expected output of P1?
What happens if P2's memory stores are reordered?

Fences: operations that synchronize memory accesses.
X-Y fence: all previous operations of type X must commit before any following operation of type Y starts.
Example (store-load):
    read1
    write1
    store-load
    write2
    read2
Here write1 must commit before read2 starts.
What about store-store?

Memory: a = 0; b = 0

P1:
    L1: if (read(a) = 0) goto L1
        print(read(b))

P2:
    write(b, 1)
    store-store
    write(a, 7)

Sequential consistency: a model where
◦ All processors see all memory operations in the same order
◦ That order must adhere to the program order of each thread
Memory operations on reordering architectures are not sequentially consistent.
This makes program verification nontrivial.

Linearizability is stronger than sequential consistency.
If operation A is executed before operation B in real time, then A precedes B in the order (across all threads, not only within a single thread).


Problem Definition – Idempotent Work-Stealing

Idempotence: the property of certain operations that they can be applied multiple times without changing the result (Wikipedia).
In other words: f(f(x)) = f(x)
Examples:
1. The absolute value function
2. The number 1 is idempotent under multiplication: 1*1 = 1
3. An SQL query (without updates)

Work stealing: a policy for dividing procedure executions (jobs/tasks) efficiently among multiple processors.
Each processor has a deque (double-ended queue) of jobs.
[Figure: processors P1, P2, …, Pk, each with its own deque of jobs]

Each processor can put a new job in its own queue.
Each processor can take a job from its own queue.

A processor without work can steal jobs from another processor.

Example: Fibonacci numbers, fib(7)
    P1: take() -> fib(7)
    P1: put(fib(6)), put(fib(5))
    P1: take() -> fib(6)
    P2: steal(P1)
    P2: take() -> fib(5)
    P1: put(fib(5)), put(fib(4))
    P2: put(fib(4)), put(fib(3))
    P1: take() -> fib(5)
    P3: steal(P1)
    P3: take() -> fib(4)
    P2: take() -> fib(4)
    …
[Figure: the fib(7) call tree, with subtrees assigned to P1, P2, and P3]

Work stealing seems like a good idea…
But it can be expensive, because previous work-stealing algorithms use strong synchronization primitives:
1. Locks
2. Atomic read-modify-write operations
3. Memory ordering fences
Can work-stealing algorithms for idempotent tasks avoid using synchronization primitives?

Not exactly…
Our goal:
◦ Make work stealing cheap when jobs are idempotent
How?
◦ Make the owner's operations ("put", "take") cheap, while "steal" remains expensive

A snippet of the Chase-Lev algorithm:

    Task take() {
        b := bottom;
        CircularArray a = activeArray;
        b = b - 1;
        bottom = b;
        t = top;
        …
    }

A store-load fence is required between the write to bottom and the read of top.


The Algorithms

We will see 3 algorithms. All of them insert (put) jobs at the tail.
1. Idempotent LIFO: tasks are extracted (take/steal) from the tail
2. Idempotent FIFO: tasks are extracted (take/steal) from the head
3. Idempotent double-ended: the owner takes tasks from the tail, and the others steal from the head

Idempotent LIFO

Each processor has:
◦ A dynamic array of tasks
◦ A capacity variable
◦ An anchor (tail index)
put inserts at the tail; take and steal extract from the tail.
[Figure: P1's empty tasks array, capacity = 7, anchor = 0]

    void put(Task task) {
    1.  t := anchor;
    2.  if (t = capacity) { expand(); goto 1; }
    3.  tasks[t] := task;
    4.  anchor := t + 1;
    }

A store-store fence orders the write of the task (line 3) before the update of anchor (line 4).
[Figure: after put(task1), anchor changes from 0 to 1; capacity = 7]

    Task take() {
        t := anchor;
        if (t = 0) return EMPTY;
        task := tasks[t - 1];
        anchor := t - 1;
        return task;
    }

[Figure: tasks = task1, task2, task3; take() returns task3 and anchor changes from 3 to 2; capacity = 7]

    Task steal() {
    1.  t := anchor;
    2.  if (t = 0) return EMPTY;
    3.  a := tasks;
    4.  task := a[t - 1];
    5.  if !CAS(anchor, t, t - 1) goto 1;
    6.  return task;
    }

Fences in steal(): load-load and load-CAS.
[Figure: tasks = task1, task2, task3; steal() returns task3 and anchor changes from 3 to 2; capacity = 7]
Why must tasks be idempotent?

[Figure: take() and steal() run concurrently on tasks = task1, task2, task3 with anchor = 3. Both read t = 3 and both return task3, so the same task is executed twice.]

How is ABA possible?
A thief in steal() reads t = 3 and task = task3, and is then delayed before its CAS. Meanwhile the owner runs:
    take(); put(taskX); … put(taskY);
After take() and put(taskX) the anchor is 3 again, so the thief's CAS(anchor, 3, 2) succeeds even though slot 2 now holds taskX. The next put overwrites that slot.
taskX is lost!

How can we prevent it? Add a tag to the anchor and increment it on every put:

    anchor: <integer, integer>;   // <tail, tag>

    void put(Task task) {
    1.  <t, tag> := anchor;
    2.  if (t = capacity) { expand(); goto 1; }
    3.  tasks[t] := task;
    4.  anchor := <t + 1, tag + 1>;
    }

    Task steal() {
    1.  <t, tag> := anchor;
    2.  if (t = 0) return EMPTY;
    3.  a := tasks;
    4.  task := a[t - 1];
    5.  if !CAS(anchor, <t, tag>, <t - 1, tag>) goto 1;
    6.  return task;
    }

Idempotent FIFO

Each processor has:
◦ A dynamic cyclic array of tasks
◦ A capacity variable
◦ A head index (always increasing)
◦ A tail index (always increasing)
put inserts at the tail; take and steal extract from the head.
[Figure: P1's tasks array holding task2, task3, task4; capacity = 7, head = 1, tail = 4]

    void put(Task task) {
    1.  h := head;
    2.  t := tail;
    3.  if (t = h + tasks.capacity) { expand(); goto 1; }
    4.  tasks.array[t % tasks.capacity] := task;
    5.  tail := t + 1;
    }

A store-store fence orders the write of the task (line 4) before the update of tail (line 5).
[Figure: after put(task5), tail changes from 4 to 5; head = 1, capacity = 7]

    Task take() {
        h := head;
        t := tail;
        if (h = t) return EMPTY;
        task := tasks.array[h % tasks.capacity];
        head := h + 1;
        return task;
    }

[Figure: take() returns task2; head changes from 1 to 2; capacity = 7]

    Task steal() {
    1.  h := head;
    2.  t := tail;
    3.  if (h = t) return EMPTY;
    4.  a := tasks;
    5.  task := a.array[h % a.capacity];
    6.  if !CAS(head, h, h + 1) goto 1;
    7.  return task;
    }

Fences in steal(): two load-load fences and one load-CAS fence.
[Figure: steal() returns task2; head changes from 1 to 2; capacity = 7]

Idempotent double-ended

Each processor has:
◦ A dynamic cyclic array of tasks
◦ A capacity variable
◦ An anchor <head, size>
put inserts at the tail; take extracts from the tail; steal extracts from the head.
[Figure: P1's tasks array holding task2, task3, task4; capacity = 7, anchor = <1, 3>]

    void put(Task task) {
    1.  <h, s> := anchor;
    2.  if (s = tasks.capacity) { expand(); goto 1; }
    3.  tasks.array[(h + s) % tasks.capacity] := task;
    4.  anchor := <h, s + 1>;
    }

A store-store fence orders the write of the task (line 3) before the update of anchor (line 4).
[Figure: after put(task5), anchor changes from <1, 3> to <1, 4>; capacity = 7]

    Task take() {
        <h, s> := anchor;
        if (s = 0) return EMPTY;
        task := tasks.array[(h + s - 1) % tasks.capacity];
        anchor := <h, s - 1>;
        return task;
    }

[Figure: take() returns task5; anchor changes from <1, 4> to <1, 3>; capacity = 7]

    Task steal() {
    1.  <h, s> := anchor;
    2.  if (s = 0) return EMPTY;
    3.  a := tasks;
    4.  task := a.array[h % a.capacity];
    5.  h2 := (h + 1) % a.capacity;
    6.  if !CAS(anchor, <h, s>, <h2, s - 1>) goto 1;
    7.  return task;
    }

Fences in steal(): load-load and load-CAS.
[Figure: steal() returns task2; anchor changes from <1, 4> to <2, 3>; capacity = 7]


Comparison to Previous Work

Compared against the "Chase-Lev" and "Cilk THE" algorithms (after adding memory fences).
Benchmarks:
◦ Micro: the common case, take() and put()
◦ Irregular graph applications

Micro-benchmarks, 2 scenarios:
◦ Both puts and takes (10^6 ops of each type)
◦ Only takes (10^6 ops), with the work queues pre-populated

Irregular graph applications:
Based on the SIMPLE framework.
2D torus graph:
◦ Vertices lie on the torus
◦ Each vertex is connected to its 4 neighbors
Task: build a spanning tree.

Up to 6% redundant work.


Summary

Memory operation reordering improves execution times.
Use it with care on multiprocessors.
"Idempotent Work-Stealing" is useful for some workloads.
Idempotent LIFO gives good results for all benchmarks.

Thank You!
Questions?
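
Supplementary sketch for the idempotent LIFO slides: the ABA scenario can be replayed deterministically in a single thread by splitting steal() into its read phase and its CAS phase. This is only a model under stated assumptions, not the authors' implementation: the names `IdempotentLifo`, `steal_read`, `steal_cas`, `replay_aba`, and `use_tag` are hypothetical, CAS is modeled as a plain compare-and-assign, and no fences or real concurrency are involved.

```python
EMPTY = None


class IdempotentLifo:
    """Single-threaded model of the idempotent LIFO queue.

    The anchor is a (tail, tag) pair. With use_tag=False the tag is never
    incremented, which models the untagged version of the algorithm.
    """

    def __init__(self, capacity=7, use_tag=True):
        self.tasks = [None] * capacity
        self.anchor = (0, 0)
        self.use_tag = use_tag

    def put(self, task):
        t, tag = self.anchor
        if t == len(self.tasks):
            self.tasks.extend([None] * len(self.tasks))  # stands in for expand()
        self.tasks[t] = task
        self.anchor = (t + 1, tag + 1 if self.use_tag else tag)

    def take(self):
        t, tag = self.anchor
        if t == 0:
            return EMPTY
        task = self.tasks[t - 1]
        self.anchor = (t - 1, tag)
        return task

    def steal_read(self):
        """First half of steal(): snapshot the anchor and read the task."""
        t, tag = self.anchor
        if t == 0:
            return None
        return (t, tag), self.tasks[t - 1]

    def steal_cas(self, snapshot):
        """Second half of steal(): modeled CAS(anchor, <t,tag>, <t-1,tag>)."""
        if self.anchor != snapshot:
            return False  # the real algorithm would retry from line 1
        t, tag = snapshot
        self.anchor = (t - 1, tag)
        return True


def replay_aba(use_tag):
    """Replay the interleaving from the ABA slide and report what survives."""
    q = IdempotentLifo(use_tag=use_tag)
    for name in ("task1", "task2", "task3"):
        q.put(name)
    snapshot, stolen = q.steal_read()  # thief reads tail = 3 and task3
    q.take()                           # owner takes task3
    q.put("taskX")                     # owner puts taskX; tail is 3 again
    cas_ok = q.steal_cas(snapshot)     # thief resumes with its stale snapshot
    q.put("taskY")
    remaining = list(iter(q.take, EMPTY))  # drain the queue as the owner
    return cas_ok, stolen, remaining


print(replay_aba(use_tag=False))
print(replay_aba(use_tag=True))
```

In this model the untagged run lets the stale CAS succeed, and taskX never appears among the drained tasks. In the tagged run the CAS fails because put(taskX) changed the tag, so taskX survives. In both runs task3 is handed to both the owner and the thief, which is the duplication that idempotence is meant to tolerate.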