Futures, Scheduling, and Work Distribution

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
(Some images in this lecture courtesy of Charles Leiserson)

How to Write Parallel Apps?
• Split a program into parallel parts
• In an effective way
• Thread management

Matrix Multiplication
(figure: C = A · B)

  c_ij = Σ_{k=0}^{n-1} a_ik · b_kj

Matrix Multiplication

class Worker extends Thread {
  int row, col;                  // which matrix entry to compute
  Worker(int row, int col) {
    this.row = row;              // fields must be set through this;
    this.col = col;              // a bare "row = row" is a no-op self-assignment
  }
  public void run() {            // each Worker is a thread
    double dotProduct = 0.0;     // the actual computation
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}

Matrix Multiplication

void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row, col);  // create n x n threads
  for (int row …)
    for (int col …)
      worker[row][col].start();                 // start them
  for (int row …)
    for (int col …)
      worker[row][col].join();                  // wait for them to finish
}

What's wrong with this picture?
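For reference, here is a complete, runnable version of the per-entry-thread program sketched above, before we answer that question. This is a minimal sketch, not the book's code: the matrix size N, the test data in main, and the InterruptedException plumbing are my additions.

class NaiveMatrixMultiply {
  static final int N = 4;  // assumed size for the demo
  static double[][] a = new double[N][N], b = new double[N][N], c = new double[N][N];

  static class Worker extends Thread {
    final int row, col;                         // which entry of c this thread computes
    Worker(int row, int col) { this.row = row; this.col = col; }
    public void run() {
      double dot = 0.0;
      for (int i = 0; i < N; i++)
        dot += a[row][i] * b[i][col];
      c[row][col] = dot;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    for (int i = 0; i < N; i++) { a[i][i] = 1.0; b[i][i] = 2.0; }  // simple test data
    Thread[][] worker = new Thread[N][N];
    for (int r = 0; r < N; r++)
      for (int col = 0; col < N; col++)
        worker[r][col] = new Worker(r, col);    // create N x N threads
    for (Thread[] rowOfWorkers : worker)
      for (Thread t : rowOfWorkers) t.start();  // start them
    for (Thread[] rowOfWorkers : worker)
      for (Thread t : rowOfWorkers) t.join();   // wait for them to finish
    System.out.println(c[0][0] + " " + c[N-1][N-1]);  // expect 2.0 on the diagonal
  }
}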
Thread Overhead
• Threads require resources
  – Memory for stacks
  – Setup, teardown
  – Scheduler overhead
• Short-lived threads
  – Bad ratio of useful work to overhead

Thread Pools
• More sensible to keep a pool of long-lived threads
• Threads are assigned short-lived tasks
  – Run the task
  – Rejoin the pool
  – Wait for the next assignment

Thread Pool = Abstraction
• Insulates the programmer from the platform
  – Big machine, big pool
  – Small machine, small pool
• Portable code
  – Works across platforms
  – Worry about the algorithm, not the platform

ExecutorService Interface
• In java.util.concurrent
  – Task = Runnable object
    • If no result value is expected
    • Calls the run() method
  – Task = Callable<T> object
    • If a result value of type T is expected
    • Calls the T call() method

Future<T>

Callable<T> task = …;
…
Future<T> future = executor.submit(task);
…
T value = future.get();

• Submitting a Callable<T> task returns a Future<T> object
• The Future's get() method blocks until the value is available

Future<?>

Runnable task = …;
…
Future<?> future = executor.submit(task);
…
future.get();

• Submitting a Runnable task returns a Future<?> object
• The Future's get() method blocks until the computation is complete

Note
• ExecutorService submissions
  – Like New England traffic signs
  – Are purely advisory in nature
• The executor
  – Like the Boston driver
  – Is free to ignore any such advice
  – And could execute tasks sequentially …

Matrix Addition
(figure: each quadrant is added independently)

  ( C00 C01 )   ( A00+B00  A01+B01 )
  ( C10 C11 ) = ( A10+B10  A11+B11 )

4 parallel additions

Matrix Addition Task

class AddTask implements Runnable {
  Matrix a, b;   // add this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0];   // base case: add directly
    } else {
      (partition a, b into half-size matrices aij and bij)   // constant-time operation
      Future<?> f00 = exec.submit(new AddTask(a00, b00));    // submit 4 tasks
      …
      Future<?> f11 = exec.submit(new AddTask(a11, b11));
      f00.get(); …; f11.get();                               // let them finish
      …
    }
  }
}
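A fuller sketch of the recursive task, filling in the pieces the slide elides. The Matrix view class, the quadrant partitioning, and the main method are my own minimal stand-ins, not the book's code; a cached thread pool is used so that nested tasks blocking in get() cannot exhaust a fixed pool.

import java.util.concurrent.*;

class MatrixAdd {
  static ExecutorService exec = Executors.newCachedThreadPool();

  // Minimal matrix "view": an offset window into a shared backing array.
  static class Matrix {
    double[][] data; int rowOff, colOff, dim;
    Matrix(double[][] d, int r, int c, int dim) { data = d; rowOff = r; colOff = c; this.dim = dim; }
    double get(int i, int j)         { return data[rowOff + i][colOff + j]; }
    void set(int i, int j, double v) { data[rowOff + i][colOff + j] = v; }
    Matrix quad(int i, int j)        { int h = dim / 2; return new Matrix(data, rowOff + i*h, colOff + j*h, h); }
  }

  static class AddTask implements Runnable {
    Matrix a, b, c;
    AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
    public void run() {
      try {
        if (a.dim == 1) {
          c.set(0, 0, a.get(0, 0) + b.get(0, 0));   // base case: add directly
        } else {
          Future<?>[] f = new Future<?>[4];
          for (int i = 0; i < 2; i++)               // partition into quadrants, submit 4 tasks
            for (int j = 0; j < 2; j++)
              f[2*i + j] = exec.submit(new AddTask(a.quad(i, j), b.quad(i, j), c.quad(i, j)));
          for (Future<?> g : f) g.get();            // let them finish
        }
      } catch (Exception e) { throw new RuntimeException(e); }
    }
  }

  public static void main(String[] args) throws Exception {
    int n = 4;  // must be a power of 2 in this sketch
    double[][] x = new double[n][n], y = new double[n][n], z = new double[n][n];
    x[1][2] = 3; y[1][2] = 4;
    exec.submit(new AddTask(new Matrix(x, 0, 0, n), new Matrix(y, 0, 0, n), new Matrix(z, 0, 0, n))).get();
    System.out.println(z[1][2]);  // 7.0
    exec.shutdown();
  }
}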
Dependencies
• The matrix example is not typical
• Its tasks are independent
  – You don't need the results of one task …
  – To complete another
• Often tasks are not independent

Fibonacci

  F(n) = 1                  if n = 0 or 1
  F(n) = F(n-1) + F(n-2)    otherwise

• Note
  – Potential parallelism
  – Dependencies

Disclaimer
• This Fibonacci implementation is egregiously inefficient
  – So don't try this at home or at your job!
  – But it illustrates our point: how to deal with dependencies
• Exercise: make this implementation efficient!

Multithreaded Fibonacci

class FibTask implements Callable<Integer> {
  static ExecutorService exec = Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) {
    arg = n;
  }
  public Integer call() {   // exception handling elided, as on the slide
    if (arg > 2) {
      // parallel calls
      Future<Integer> left  = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      // pick up & combine the results
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}

The Blumofe-Leiserson DAG Model
• A multithreaded program is
  – A directed acyclic graph (DAG)
  – That unfolds dynamically
• Each node is
  – A single unit of work

Fibonacci DAG
(figure: fib(4) calls fib(3) and fib(2); fib(3) calls fib(2) and fib(1); each fib(2) calls two fib(1)s. "call" edges run down the tree, and "get" edges return the results back up.)
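A runnable wrapper around the slide's FibTask. The main method, the checked-exception declaration on call(), and the shutdown call are my additions; the cached thread pool matters, since every task blocked in get() holds a thread, and a cached pool simply creates more.

import java.util.concurrent.*;

class FibDemo {
  static ExecutorService exec = Executors.newCachedThreadPool();

  static class FibTask implements Callable<Integer> {
    int arg;
    FibTask(int n) { arg = n; }
    public Integer call() throws Exception {
      if (arg > 2) {
        Future<Integer> left  = exec.submit(new FibTask(arg - 1));  // parallel calls
        Future<Integer> right = exec.submit(new FibTask(arg - 2));
        return left.get() + right.get();                            // combine results
      }
      return 1;                                                     // F(1) = F(2) = 1
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(exec.submit(new FibTask(10)).get());  // prints 55
    exec.shutdown();
  }
}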
How Parallel is That?
• Define work:
  – Total time on one processor
• Define critical-path length:
  – Longest dependency path
  – Can't beat that!

Unfolded DAG
(figure: the fully unfolded DAG of an example computation)

Parallelism?
• Serial fraction = 3/18 = 1/6
• … so Amdahl's Law says the speedup cannot exceed 6

Work?
• T1: time needed on one processor
• Just count the nodes …
(figure: the DAG's 18 nodes, numbered 1 through 18)
• T1 = 18

Critical Path?
• T∞: time needed on as many processors as you like
• Longest path …
(figure: a longest path of 9 nodes)
• T∞ = 9

Notation Watch
• TP = time on P processors
• T1 = work (time on 1 processor)
• T∞ = critical-path length (time on ∞ processors)

Simple Laws
• Work Law: TP ≥ T1/P
  – In one step, can't do more than P work
• Critical Path Law: TP ≥ T∞
  – Can't beat infinite resources

Performance Measures
• Speedup on P processors
  – The ratio T1/TP
  – How much faster we run with P processors
• Linear speedup
  – T1/TP = Θ(P)
• Max speedup (average parallelism)
  – T1/T∞

Sequential Composition (A, then B)
• Work: T1(A) + T1(B)
• Critical path: T∞(A) + T∞(B)

Parallel Composition (A alongside B)
• Work: T1(A) + T1(B)
• Critical path: max{T∞(A), T∞(B)}

Matrix Addition
(figure, as before: the four quadrant sums C_ij = A_ij + B_ij; 4 parallel additions)

Addition
• Let AP(n) be the running time
  – For an n x n matrix
  – On P processors
• For example
  – A1(n) is the work
  – A∞(n) is the critical-path length

Addition
• The work is
  A1(n) = 4 A1(n/2) + Θ(1)    (4 spawned additions; partition, synch, etc. is Θ(1))
        = Θ(n²)
  Same as double-loop summation

Addition
• The critical-path length is
  A∞(n) = A∞(n/2) + Θ(1)      (the spawned additions run in parallel)
        = Θ(log n)

Matrix Multiplication Redux
(figure: C = A · B, with each matrix partitioned into quadrants, e.g. C11, C12, C21, C22)

First Phase …
• 8 multiplications: A11B11, A12B21, A11B12, A12B22, A21B11, A22B21, A21B12, A22B22

Second Phase …
• 4 additions: C11 = A11B11 + A12B21, C12 = A11B12 + A12B22,
  C21 = A21B11 + A22B21, C22 = A21B12 + A22B22

Multiplication
• The work is
  M1(n) = 8 M1(n/2) + A1(n)   (8 parallel multiplications, then the final addition)
        = 8 M1(n/2) + Θ(n²)
        = Θ(n³)
  Same as the serial triple-nested loop

Multiplication
• The critical-path length is
  M∞(n) = M∞(n/2) + A∞(n)     (half-size multiplications in parallel, then the final addition)
        = M∞(n/2) + Θ(log n)
        = Θ(log² n)
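To spell out the recursion-tree arithmetic behind those four bounds (standard analysis, not shown on the slides):

\begin{align*}
A_1(n) &= 4\,A_1(n/2) + \Theta(1)
        = \sum_{i=0}^{\lg n} 4^i\,\Theta(1)
        = \Theta\!\left(4^{\lg n}\right) = \Theta(n^2)\\
A_\infty(n) &= A_\infty(n/2) + \Theta(1)
        = \underbrace{\Theta(1) + \cdots + \Theta(1)}_{\lg n \ \text{halvings}}
        = \Theta(\log n)\\
M_1(n) &= 8\,M_1(n/2) + \Theta(n^2)
        = \sum_{i=0}^{\lg n} 8^i\,\Theta\!\left((n/2^i)^2\right)
        = \Theta(n^2)\sum_{i=0}^{\lg n} 2^i = \Theta(n^3)\\
M_\infty(n) &= M_\infty(n/2) + \Theta(\log n)
        = \sum_{i=0}^{\lg n} \Theta\!\left(\log(n/2^i)\right) = \Theta(\log^2 n)
\end{align*}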
Parallelism
• M1(n)/M∞(n) = Θ(n³/log² n)
• To multiply two 1000 x 1000 matrices:
  – 1000³/10² = 10⁷    (since log² 1000 ≈ 10²)
• Much more than the number of processors on any real machine

Shared-Memory Multiprocessors
• Parallel applications
  – Do not have direct access to HW processors
• Mix of other jobs
  – All run together
  – Come & go dynamically

Ideal Scheduling Hierarchy
(figure): Tasks → User-level scheduler → Processors

Realistic Scheduling Hierarchy
(figure): Tasks → User-level scheduler → Threads → Kernel-level scheduler → Processors

For Example
• Initially,
  – All P processors are available to the application
• A serial computation
  – Takes over one processor
  – Leaving P-1 for us
  – Waits for I/O
  – We get that processor back …

Speedup
• Map threads onto P processors
• Cannot get a P-fold speedup
  – What if the kernel doesn't cooperate?
• Can try for a speedup proportional to P

Scheduling Hierarchy
• User-level scheduler
  – Tells the kernel which threads are ready
• Kernel-level scheduler
  – Synchronous (for analysis, not correctness!)
  – Picks p_i threads to schedule at step i

Greedy Scheduling
• A node is ready if its predecessors are done
• Greedy: schedule as many ready nodes as possible
• Some optimal scheduler is greedy (why?)
• But not every greedy scheduler is optimal

Greedy Scheduling
There are P processors.
• Complete step:
  – ≥ P nodes ready
  – run any P of them
• Incomplete step:
  – < P nodes ready
  – run them all

Theorem
For any greedy scheduler,
  TP ≤ T1/P + T∞
• TP is the actual time
• T1/P: no better than the work divided among the processors
• T∞: no better than the critical-path length

Proof:
• The number of complete steps is ≤ T1/P …
  … because each one performs P work.
• The number of incomplete steps is ≤ T∞ …
  … because each one shortens the unexecuted critical path by 1.

Near-Optimality
• Theorem: any greedy scheduler is within a factor of 2 of optimal.
• Remark: computing an optimal schedule is NP-hard!

Proof of Near-Optimality
Let TP* be the optimal time.
From the work and critical-path laws,
  TP* ≥ max{T1/P, T∞}
By the theorem,
  TP ≤ T1/P + T∞
     ≤ 2 max{T1/P, T∞}
     ≤ 2 TP*

Work Distribution
(figure: one thread buried in work while the others sit idle: zzz …)

Work Dealing
(figure: the overloaded thread deals a task to an idle one: "Yes!")

The Problem with Work Dealing
(figure: "D'oh! D'oh! D'oh!")

Work Stealing
(figure: a thread with no work takes a task from a busy one: "No work … Yes!")

Lock-Free Work Stealing
• Each thread has a pool of ready work
• Remove work without synchronizing
• If you run out of work, steal someone else's
• Choose a victim at random

Local Work Pools
(figure): each work pool is a double-ended queue

Work DEQueue¹
(figure: a column of tasks; pushBottom and popBottom operate at the bottom)
¹ DEQueue = Double-Ended Queue
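To see how these pieces fit together, here is a hedged sketch of the loop each pool thread runs. The method names match the slides (pushBottom, popBottom, popTop), but the loop itself, the TaskDEQueue interface stub, and the yield call are my reconstruction; the concrete BDEQueue behind the interface is developed over the next slides.

import java.util.Random;

interface TaskDEQueue {            // stands in for the BDEQueue defined below
  void pushBottom(Runnable r);
  Runnable popBottom();
  Runnable popTop();
}

class WorkStealingWorker implements Runnable {
  private final int me;            // this thread's index
  private final TaskDEQueue[] queue;  // one deque per thread
  private final Random random = new Random();

  WorkStealingWorker(int me, TaskDEQueue[] queue) { this.me = me; this.queue = queue; }

  public void run() {
    Runnable task = queue[me].popBottom();   // drain my own pool from the bottom
    while (true) {
      while (task != null) {
        task.run();
        task = queue[me].popBottom();
      }
      while (task == null) {                 // out of work: become a thief
        Thread.yield();                      // be polite while scavenging
        int victim = random.nextInt(queue.length);  // choose a victim at random
        if (victim != me)
          task = queue[victim].popTop();     // steal from the top of its deque
      }
    }
  }
}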
Obtain Work
• Obtain work by calling popBottom
• Run the task until it blocks or terminates

New Work
• When a node is unblocked or spawned, add it with pushBottom

Whatcha Gonna Do When the Well Runs Dry?
(figure: the deque is empty: "@&%$!!")

Steal Work from Others
• Pick a random thread's DEQueue

Steal this Task!
• Call popTop

Task DEQueue
• Methods
  – pushBottom
  – popBottom
  – popTop
• pushBottom and popBottom never happen concurrently: only the owner calls them
• They are also the most common calls, so make them fast (minimize the use of CAS)

Ideal
• Wait-free
• Linearizable
• Constant time
• Fortune cookie: "It is better to be young, rich and beautiful, than old, poor, and ugly"

Compromise
• Method popTop may fail if
  – A concurrent popTop succeeds, or
  – A concurrent popBottom takes the last task
• Blame the victim!

Dreaded ABA Problem
(figure sequence: a thief reads top and prepares a CAS; meanwhile the owner pops every task and pushes new ones, so the top index returns to the value the thief read; the thief's CAS then succeeds, "Yes!", on a task it was never meant to take: "Uh-Oh …")

Fix for Dreaded ABA
(figure): pair the top index with a stamp that is incremented on every change

Bounded DEQueue

public class BDEQueue {
  AtomicStampedReference<Integer> top;  // index & stamp, updated together (synchronized)
  volatile int bottom;                  // index of the bottom task; no need to synchronize,
                                        // but a memory barrier is needed (hence volatile)
  Runnable[] tasks;                     // array holding the tasks
  …
}

pushBottom()

public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;   // bottom is the index at which to store the new task
    bottom++;            // adjust the bottom index
  }
  …
}

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;    // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1;    // compute the new value & stamp
  if (bottom <= oldTop)                                // quit if the queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))  // try to steal the task
    return r;
  return null;                                         // give up if a conflict occurs
}
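The slides abbreviate the stamped compare-and-swap as top.CAS(…); the real java.util.concurrent.atomic API, spelled out above, is AtomicStampedReference.compareAndSet(expectedRef, newRef, expectedStamp, newStamp). A minimal, self-contained demonstration (the values here are arbitrary):

import java.util.concurrent.atomic.AtomicStampedReference;

class StampDemo {
  public static void main(String[] args) {
    AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0); // index 0, stamp 0
    int[] stampHolder = new int[1];
    Integer oldTop = top.get(stampHolder);   // read index and stamp atomically
    int oldStamp = stampHolder[0];
    // Succeeds only if BOTH the index and the stamp are unchanged:
    boolean ok = top.compareAndSet(oldTop, oldTop + 1, oldStamp, oldStamp + 1);
    System.out.println(ok + " " + top.getReference() + " " + top.getStamp());  // true 1 1
    // A retry with the stale stamp fails, which is exactly what defeats ABA:
    ok = top.compareAndSet(oldTop + 1, oldTop + 2, oldStamp, oldStamp + 1);
    System.out.println(ok);                  // false
    // Note: compareAndSet compares Integer *references*; small values work here
    // because of the autobox cache, but production code should avoid boxed indices.
  }
}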
Take Work

Runnable popBottom() {
  if (bottom == 0)                       // make sure the queue is non-empty
    return null;
  bottom--;                              // prepare to grab the bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;            // read top & prepare the new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)                   // top & bottom one or more apart: no conflict
    return r;
  if (bottom == oldTop) {                // at most one item left
    bottom = 0;                          // reset bottom: the deque will be empty
                                         // even if the CAS fails (why?)
    if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
      return r;                          // I win the CAS: the last task is mine
  }
  // If I lose the CAS, a thief must have won and taken the last task
  // (bottom could even be less than top). The deque is empty, so we
  // must still reset top and bottom.
  top.set(newTop, newStamp);
  bottom = 0;
  return null;
}
stamp = new int[1]; int oldTop = top.get(stamp), bottom newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; bottom = 0;} Art of Multiprocessor Programming If top & bottom one or more apart, no conflict 129 Take Work Runnable popBottom() { if (bottom == 0) return null; stamp bottom--; top Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), bottom newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; bottom = 0;} Art of Multiprocessor Programming At most one item left 130 Take Work Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; bottom = 0;} Art of Multiprocessor Programming Try to steal last task. Reset bottom because the DEQueue will be empty even if unsuccessful (why?) 131 Take Work Runnable popBottom() { if (bottom == 0) return null; stamp bottom--; top CAS Runnable r = tasks[bottom]; int[] stamp = new int[1]; bottom int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; bottom = 0;} Art of Multiprocessor Programming I win CAS 132 Take Work Runnable popBottom() { if (bottom == 0) return null; stamp bottom--; top CAS Runnable r = tasks[bottom]; int[] stamp = new int[1]; bottom int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; bottom = 0;} Art of Multiprocessor Programming If I lose CAS, thief must have won… 133 Take Work Runnable popBottom() { if (bottom == 0) return null; bottom--; Runnable r = tasks[bottom]; int[] stamp = new int[1]; int oldTop = top.get(stamp), newTop = 0; int oldStamp = stamp[0], newStamp = oldStamp + 1; if (bottom > oldTop) return r; if (bottom == oldTop){ bottom = 0; if (top.CAS(oldTop, newTop, oldStamp, newStamp)) return r; } top.set(newTop,newStamp); return null; bottom = 0;} Art of Multiprocessor Programming Failed to get last task (bottom could be less than top) Must still reset top and bottom since deque is empty 134 Old English Proverb • “May as well be hanged for stealing a sheep as a goat” • From which we conclude: – Stealing was punished severely – Sheep were worth more than goats Art of Multiprocessor Programming 135 Variations • Stealing is expensive – Pay CAS – Only one task taken • What if – Move more than one task – Randomly balance loads? 
Work Balancing
(figure: a thread with 2 tasks and a thread with 5 tasks rebalance, so that one ends up with ⌈(2+5)/2⌉ = 4 tasks and the other with ⌊(2+5)/2⌋ = 3)

Work-Balancing Thread

public void run() {
  int me = ThreadID.get();
  while (true) {                                   // keep running tasks
    Runnable task = queue[me].deq();
    if (task != null) task.run();
    int size = queue[me].size();
    if (random.nextInt(size+1) == size) {          // with probability 1/(size+1):
      int victim = random.nextInt(queue.length);   //   choose a random victim
      int min = …, max = …;                        //   order the two indices canonically
      synchronized (queue[min]) {                  //   lock the queues in canonical order
        synchronized (queue[max]) {                //   (prevents deadlock)
          balance(queue[min], queue[max]);         //   rebalance the two queues
        }
      }
    }
  }
}

Work Stealing & Balancing
• Clean separation between the application & the scheduling layer
• Works well when the number of processors fluctuates
• Works on "black-box" operating systems
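The slides elide the balance helper and the min/max computation. Here is one plausible reading of it as a sketch: the Deque-based queues, the qMin/qMax naming, and the even-split policy are my assumptions, following the 5-and-2 to 4-and-3 rebalancing figure above.

import java.util.ArrayDeque;
import java.util.Deque;

class Balancer {
  // Move tasks from the larger queue to the smaller until their
  // sizes differ by at most one, as in the figure.
  static void balance(Deque<Runnable> q0, Deque<Runnable> q1) {
    Deque<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
    Deque<Runnable> qMax = (qMin == q0) ? q1 : q0;
    while (qMax.size() > qMin.size() + 1)
      qMin.addLast(qMax.pollLast());   // hand one task across
  }

  public static void main(String[] args) {
    Deque<Runnable> a = new ArrayDeque<>(), b = new ArrayDeque<>();
    for (int i = 0; i < 5; i++) a.add(() -> {});
    for (int i = 0; i < 2; i++) b.add(() -> {});
    balance(a, b);
    System.out.println(a.size() + " " + b.size());  // prints 4 3
  }
}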
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:
  – to Share — to copy, distribute and transmit the work
  – to Remix — to adapt the work
• Under the following conditions:
  – Attribution. You must attribute the work to "The Art of Multiprocessor Programming" (but not in any way that suggests that the authors endorse you or your use of the work).
  – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.

TOM MARVOLO RIDDLE