Idempotent Work Stealing

Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat
PPoPP’09

Outline
  - Memory Operations Reordering
  - Problem Definition: Idempotent Work-Stealing
  - The algorithms
  - Comparison to Previous Work
  - Summary


Memory Operations Reordering

Some architectures reorder memory accesses to achieve faster execution.

  Original order:   read(a)  read(b)     write(a,1)  write(b,2)
  Reordered:        read(a)  write(b,2)  write(a,1)  read(b)

This is a good optimization for uniprocessors, but it may be dangerous for multiprocessors.

Example. Memory: a = 0; b = 0;

  P1:
    L1: if (read(a) = 0) goto L1
        print(read(b))

  P2:
    write(b, 7)
    write(a, 1)

What is the expected output of P1? What happens if P2 changes the order of its memory stores?

Fences are operations that synchronize memory accesses. An X-Y fence guarantees that all previous operations of type X commit before any following operation of type Y starts.

  Example: read1, write1, [store-load fence], write2, read2
  Here write1 must commit before read2 starts. What would a store-store fence order?

The same example with a fence. Memory: a = 0; b = 0;

  P1:
    L1: if (read(a) = 0) goto L1
        print(read(b))

  P2:
    write(b, 1)
    [store-store fence]
    write(a, 7)

Sequential consistency is a model in which:
  - all processors see all memory operations in the same order, and
  - that order adheres to the program order of each thread.
When the hardware reorders memory operations, execution is not sequentially consistent, which makes program verification nontrivial.

Linearizability is stronger than sequential consistency: if operation A is executed before operation B in real time, then A precedes B in the order (not only within a single thread).


Problem Definition: Idempotent Work-Stealing

Idempotence is the property of certain operations that they can be applied multiple times without changing the result (Wikipedia). In other words: f(f(x)) = f(x).

Examples:
  1. The absolute value function
  2. The number 1 is idempotent under multiplication: 1*1 = 1
  3. An SQL query (without updates)

Work stealing is a policy for dividing procedure executions (jobs/tasks) efficiently among multiple processors.
  - Each processor has a deque (double-ended queue) of jobs.
  - Each processor can put a new job in its own queue.
  - Each processor can take a job from its own queue.
  - A processor without work can steal jobs from another processor.

[Figure: processors P1, P2, ..., Pk, each with its own queue of jobs]

Example: Fibonacci numbers, fib(7)
  P1: take() -> fib(7)
  P1: put(fib(6)), put(fib(5))
  P1: take() -> fib(6)
  P2: steal(P1)
  P2: take() -> fib(5)
  P1: put(fib(5)), put(fib(4))
  P2: put(fib(4)), put(fib(3))
  P1: take() -> fib(5)
  P3: steal(P1)
  P3: take() -> fib(4)
  P2: take() -> fib(4)
  …

[Figure: the fib call tree distributed over the queues of P1, P2 and P3]

Work stealing seems like a good idea, but it can be expensive because it relies on:
  1. Locks
  2. Atomic read-modify-write operations
  3. Memory ordering fences
Previous work-stealing algorithms use strong synchronization primitives.

Can work-stealing algorithms for idempotent tasks avoid using synchronization primitives?

Not exactly.
  - Our goal: make work stealing cheap when jobs are idempotent.
  - How: make the owner's operations ("put", "take") cheap, while "steal" remains expensive.

A snippet of the Chase-Lev algorithm, showing the fence on the owner's path:

  Task take() {
  1.  b := bottom;
  2.  CircularArray a := activeArray;
  3.  b := b - 1;
  4.  bottom := b;
      [store-load fence]
  5.  t := top;
      …
  }


The algorithms

We will see 3 algorithms. All of them insert (put) jobs at the tail.
  1. Idempotent LIFO: tasks are extracted (take/steal) from the tail
  2. Idempotent FIFO: tasks are extracted (take/steal) from the head
  3. Idempotent double-ended: the owner takes tasks from the tail, and the others steal from the head

Idempotent LIFO

Each processor has:
  - A dynamic array of tasks
  - A capacity variable
  - An anchor (tail index)
Insert at the tail; take/steal from the tail. Initial state in the figures: capacity = 7, anchor = 0.

  void put(Task task) {
  1.  t := anchor;
  2.  if (t = capacity) { expand(); goto 1; }
  3.  tasks[t] := task;
      [store-store fence]
  4.  anchor := t + 1;
  }

  Figure: putting task1 into the empty queue changes anchor from 0 to 1.

  Task take() {
  1.  t := anchor;
  2.  if (t = 0) return EMPTY;
  3.  task := tasks[t - 1];
  4.  anchor := t - 1;
  5.  return task;
  }

  Figure: with tasks = [task1, task2, task3], take() changes anchor from 3 to 2.

  Task steal() {
  1.  t := anchor;
  2.  if (t = 0) return EMPTY;
  3.  a := tasks;
  4.  task := a[t - 1];
  5.  if !CAS(anchor, t, t - 1) goto 1;
  6.  return task;
  }

  Fences marked on the slide: load-load, load-CAS.
  Figure: with tasks = [task1, task2, task3], steal() changes anchor from 3 to 2.

Why must tasks be idempotent? The owner's take() does not use CAS. If take() and steal() both read t = 3 before either updates anchor, both read task = task3 and both return it, so task3 is executed twice.

How is ABA possible? Start with tasks = [task1, task2, task3] and anchor = 3.
  1. A thief executes steal() lines 1-4: it reads t = 3 and task = task3.
  2. The owner executes take() (anchor becomes 2), then put(taskX) (anchor becomes 3 again), …
  3. The thief's CAS(anchor, 3, 2) succeeds because anchor holds 3 again. It returns task3 and sets anchor to 2.
  4. A later put(taskY) overwrites the slot holding taskX. taskX is lost!

How can we prevent it? Add a tag to the anchor, incremented on every put:

  anchor: <integer, integer>;  // <tail, tag>

  void put(Task task) {
  1.  <t, tag> := anchor;
  2.  if (t = capacity) { expand(); goto 1; }
  3.  tasks[t] := task;
  4.  anchor := <t + 1, tag + 1>;
  }

  Task steal() {
  1.  <t, tag> := anchor;
  2.  if (t = 0) return EMPTY;
  3.  a := tasks;
  4.  task := a[t - 1];
  5.  if !CAS(anchor, <t, tag>, <t - 1, tag>) goto 1;
  6.  return task;
  }

Idempotent FIFO

Each processor has:
  - A dynamic cyclic array of tasks
  - A capacity variable
  - A head index (always increasing)
  - A tail index (always increasing)
Insert at the tail; take/steal from the head. State in the figures: tasks = [task2, task3, task4], capacity = 7, head = 1, tail = 4.

  void put(Task task) {
  1.  h := head;
  2.  t := tail;
  3.  if (t = h + tasks.capacity) { expand(); goto 1; }
  4.  tasks.array[t % tasks.capacity] := task;
      [store-store fence]
  5.  tail := t + 1;
  }

  Figure: putting task5 changes tail from 4 to 5.

  Task take() {
  1.  h := head;
  2.  t := tail;
  3.  if (h = t) return EMPTY;
  4.  task := tasks.array[h % tasks.capacity];
  5.  head := h + 1;
  6.  return task;
  }

  Figure: take() changes head from 1 to 2.

  Task steal() {
  1.  h := head;
  2.  t := tail;
  3.  if (h = t) return EMPTY;
  4.  a := tasks;
  5.  task := a.array[h % a.capacity];
  6.  if !CAS(head, h, h + 1) goto 1;
  7.  return task;
  }

  Fences marked on the slide: load-load, load-load, load-CAS.
  Figure: steal() changes head from 1 to 2.

Idempotent double-ended

Each processor has:
  - A dynamic cyclic array of tasks
  - A capacity variable
  - An anchor <head, size>
Insert at the tail; take from the tail; steal from the head. State in the figures: tasks = [task2, task3, task4], capacity = 7, anchor = <1, 3>.

  void put(Task task) {
  1.  <h, s> := anchor;
  2.  if (s = tasks.capacity) { expand(); goto 1; }
  3.  tasks.array[(h + s) % tasks.capacity] := task;
      [store-store fence]
  4.  anchor := <h, s + 1>;
  }

  Figure: putting task5 changes anchor from <1, 3> to <1, 4>.

  Task take() {
  1.  <h, s> := anchor;
  2.  if (s = 0) return EMPTY;
  3.  task := tasks.array[(h + s - 1) % tasks.capacity];
  4.  anchor := <h, s - 1>;
  5.  return task;
  }

  Figure: take() changes anchor from <1, 4> to <1, 3>.

  Task steal() {
  1.  <h, s> := anchor;
  2.  if (s = 0) return EMPTY;
  3.  a := tasks;
  4.  task := a.array[h % a.capacity];
  5.  h2 := (h + 1) % a.capacity;
  6.  if !CAS(anchor, <h, s>, <h2, s - 1>) goto 1;
  7.  return task;
  }

  Fences marked on the slide: load-load, load-CAS.
  Figure: steal() changes anchor from <1, 4> to <2, 3>.


Comparison to Previous Work

The algorithms were compared against the "Chase-Lev" and "Cilk THE" algorithms (after adding memory fences).

Benchmarks:
  - Micro-benchmarks: the common case, take() and put()
  - Irregular graph applications

Micro-benchmarks, 2 scenarios:
  - Both puts and takes (10^6 ops of each type)
  - Only takes (10^6 ops), with the work queues pre-populated

[Figures: micro-benchmark result charts]

Irregular graph application:
  - Based on the SIMPLE framework
  - 2D torus graph: vertices lie on the torus, and each vertex is connected to its 4 neighbors
  - Task: build a spanning tree

[Figure: spanning-tree result chart]

Up to 6% redundant work.


Summary

  - Memory operation reordering improves execution times
  - Use with care on multiprocessors
  - "Idempotent Work-Stealing" is useful for some workloads
  - Idempotent LIFO gives good results for all benchmarks

Thank You! Questions?
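
Worked example: replaying the ABA scenario

The ABA scenario from the Idempotent LIFO slides can be replayed step by step. What follows is a minimal single-threaded sketch in Python, not the authors' code. It fixes one particular interleaving by hand (the thief pauses just before its CAS), so the helper `cas` is an ordinary function standing in for a hardware compare-and-swap. The names `cas`, `run_interleaving` and `use_tag` are made up for this illustration.

```python
EMPTY = None


def cas(cell, expected, new):
    """Stand-in for compare-and-swap on a one-element list.

    Only valid here because the whole script is single-threaded.
    """
    if cell[0] == expected:
        cell[0] = new
        return True
    return False


def run_interleaving(use_tag):
    """Replay: thief reads, owner does take(); put(taskX), thief does CAS."""
    tasks = ["task1", "task2", "task3", None, None, None, None]  # capacity = 7
    anchor = [(3, 0)]  # <tail, tag>

    def put(task):  # owner only; capacity check omitted for brevity
        t, tag = anchor[0]
        tasks[t] = task
        # The fix from the slides: every put bumps the tag.
        anchor[0] = (t + 1, tag + 1 if use_tag else tag)

    def take():  # owner only, no CAS
        t, tag = anchor[0]
        if t == 0:
            return EMPTY
        task = tasks[t - 1]
        anchor[0] = (t - 1, tag)
        return task

    # Thief: steal() lines 1-4, then it is delayed before line 5.
    seen = anchor[0]
    t, tag = seen
    stolen = tasks[t - 1]

    # Owner runs while the thief is delayed.
    taken = take()
    put("taskX")

    # Thief resumes at line 5 with its stale view of the anchor.
    cas_ok = cas(anchor, seen, (t - 1, tag))

    tail = anchor[0][0]
    reachable = tasks[:tail]  # tasks still visible to take/steal
    return cas_ok, taken, stolen, reachable


if __name__ == "__main__":
    print("without tag:", run_interleaving(use_tag=False))
    print("with tag:   ", run_interleaving(use_tag=True))
```

Without the tag, the stale CAS succeeds and taskX falls outside the reachable part of the array. With the tag, the CAS fails, and a real steal() would retry from line 1. In both runs task3 is read by the owner and by the thief, which is the duplication that idempotent tasks are meant to tolerate.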
