WORK STEALING SCHEDULER
6/16/2010

Announcements
• Textbooks
• Assignment 1
• Assignment 0 results
• Upcoming guest lectures

Recommended Textbooks
• The Art of Multiprocessor Programming, Maurice Herlihy, Nir Shavit
• Patterns for Parallel Programming, Timothy Mattson, et al.
• Parallel Programming with Microsoft .NET, http://parallelpatterns.codeplex.com/

Available on Website
• Assignment 1 (due two weeks from now)
• Paper for Reading Assignment 1 (due next week)

Assignment 0
• Thanks for submitting
• Provided important feedback to us

[Figure: chart, "Percentage Students Answering Yes"]

[Figure: chart, "Number of Yes Answers Per Student"]

Upcoming Guest Lectures
• Apr 12: Todd Mytkowicz
  • How to (and not to) measure performance
• Apr 19: Shaz Qadeer
  • Correctness Specifications and Data Race Detection

Last Lecture Recap
• Parallel computation can be represented as a DAG
  • Nodes represent sequential computation
  • Edges represent dependencies

[Figure: DAG drawn along a Time axis, with nodes Split, Sequential Sort (twice), and Merge]

• $T_1$ = Work = time on a single processor
• $T_\infty$ = Depth = time assuming infinite processors
• $T_P$ = time on $P$ processors using an optimal scheduler
• Work law: $T_1 / P \le T_P$
• Depth law: $T_\infty \le T_P$
• Greedy scheduler is optimal within a factor of 2:
  $T_P \le T_P^{\mathrm{Greedy}} \le T_1 / P + T_\infty \le 2 \cdot T_P$
  (the last step applies the work law and the depth law to the two terms)

This Lecture
• Design of a greedy scheduler
• Task abstraction
• Translating high-level abstractions to tasks

[Figure: .NET Program / .NET Scheduler … TBB Program / Intel TBB]

(Simple) Tasks
• A node in the DAG
• Executing a task generates dependent subtasks
• Note: Task C is generated by A or B, whoever finishes last

[Figure: task DAG with nodes Split(16), 2 × Split(8), 4 × Sort(4), 2 × Merge(8), Merge(16); A and B label two Sort(4) tasks and C labels a Merge(8) task]

Design Constraints of the Scheduler
• The DAG is generated dynamically
  • Based on inputs and program control flow
  • The graph is not known ahead of time
• The amount of work done by a task is dynamic
  • The weight of each node is not known ahead of time
• Number of processors P can change at runtime
  • Hardware processors are shared with other processes and the kernel

Design Requirements of the Scheduler
• Should be greedy
  • A processor cannot be idle when tasks are pending
• Should limit communication between processors
• Should schedule related tasks on the same processor
  • Tasks that are likely to access the same cache lines

Attempt 0: Centralized Scheduler
• “Manager distributes tasks to others”
• Manager: assigns tasks to workers, ensures no worker is idle
• Workers: on task completion, submit generated tasks to the manager

Attempt 1: Centralized Work Queue
• “All processors share a common work queue”
• Every processor dequeues a task from the work queue
• On task completion, enqueue the generated tasks to the work queue
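
The centralized work queue of Attempt 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the course's code: the names `run_centralized` and `make_task` are hypothetical, a task is modeled as a Python callable that returns the list of tasks it generates, and Python's `queue.Queue` stands in for the shared queue. Every dequeue and enqueue goes through that one queue, which is the communication cost the later attempts try to avoid.

```python
import queue
import threading


def run_centralized(initial_tasks, num_workers=4):
    """Run tasks from one shared queue; each task returns the tasks it generates."""
    work = queue.Queue()              # the single shared work queue
    for task in initial_tasks:
        work.put(task)

    def worker():
        while True:
            task = work.get()         # every worker contends on the same queue
            if task is None:          # sentinel: no more work will arrive
                work.task_done()
                return
            try:
                for child in task():  # executing a task generates subtasks
                    work.put(child)   # children are enqueued before task_done
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    work.join()                       # returns once every task has completed
    for _ in threads:
        work.put(None)                # one sentinel per worker
    for th in threads:
        th.join()


def make_task(depth, counter, lock):
    """Hypothetical example task: counts itself, then spawns two children."""
    def task():
        with lock:
            counter[0] += 1
        if depth == 0:
            return []
        return [make_task(depth - 1, counter, lock) for _ in range(2)]
    return task


counter, lock = [0], threading.Lock()
run_centralized([make_task(3, counter, lock)])
print(counter[0])  # a binary tree of depth 3 has 1 + 2 + 4 + 8 = 15 tasks
```

The sketch is greedy in the sense of the requirements above (an idle worker takes any pending task), but it makes no attempt to keep related tasks on the same processor.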

Attempt 2: Work Sharing
• “Loaded workers share”
• Every processor pushes tasks to and pops tasks from a local work queue
• When the work queue gets large, send tasks to other processors

Disadvantages of Work Sharing
• If all processors are busy, each will spend time trying to offload
  • “Perform communication when busy”
• Difficult to know the load on processors
  • A processor with two large tasks might take longer than a processor with five small tasks
• Tasks might get shared multiple times before being executed
• Some processors can be idle while others are loaded
  • Not greedy

Attempt 3: Work Stealing
• “Idle workers steal”
• Each processor maintains a local work queue
• Pushes generated tasks into the local queue
• When local queue is empty, steal a task from another processor

Nice Properties of Work Stealing
• Communication done only when idle
  • No communication when all processors are in full throttle
• Each task is stolen at most once
• This scheduler is greedy, assuming stealers always succeed
• Limited communication
  • $O(P \cdot T_\infty)$ steals on average for some stealing strategies
• Assignment 1 explores different stealing strategies

Work Stealing Queue Data Structure
• A specialized deque (double-ended queue) with three operations:
  • Push: Local processor adds newly created tasks
  • Pop: Local processor removes task to execute
  • Steal: Remote processors remove tasks

[Figure: deque with Push and Pop at one end and Steal at the opposite end]

Advantages
• Stealers don’t interact with local processor when the queue has more than one task
• Popping recently pushed tasks improves locality
• Stealers take the oldest tasks, which are likely to be the largest (in practice)

For Assignment 1
• We provide an implementation of Work Stealing Queue
• This implementation is thread-safe
  • Clients can concurrently call the push, pop, steal operations
  • You don’t need to use additional locks or other synchronization
• Implement the scheduler logic and stealing strategy
• Hint: this implementation is single-threaded
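
The work stealing queue and the “idle workers steal” loop described above can be sketched as follows. This is a minimal illustration, not the implementation provided for Assignment 1: the class and function names are hypothetical, a single lock per queue replaces the specialized synchronization a real work stealing queue would use, victims are chosen uniformly at random (just one possible stealing strategy), tasks are assumed not to raise exceptions, and a shared counter of unfinished tasks is an assumed way to detect termination.

```python
import collections
import random
import threading
import time


class WorkStealingQueue:
    """Lock-based stand-in: push/pop at one end, steal at the other."""

    def __init__(self):
        self._items = collections.deque()
        self._lock = threading.Lock()

    def push(self, task):
        with self._lock:
            self._items.append(task)          # owner adds the newest task

    def pop(self):
        with self._lock:                      # owner takes the newest task
            return self._items.pop() if self._items else None

    def steal(self):
        with self._lock:                      # thief takes the oldest task
            return self._items.popleft() if self._items else None


def run_work_stealing(root_task, num_workers=4):
    """Each task is a callable returning the list of tasks it generates."""
    queues = [WorkStealingQueue() for _ in range(num_workers)]
    pending = [1]                             # tasks created but not finished
    pending_lock = threading.Lock()
    queues[0].push(root_task)

    def worker(me):
        while True:
            task = queues[me].pop()           # local work first
            if task is None:
                with pending_lock:
                    if pending[0] == 0:
                        return                # no work left anywhere
                victim = random.randrange(num_workers)
                if victim != me:
                    task = queues[victim].steal()
                if task is None:
                    time.sleep(0)             # failed steal: yield and retry
                    continue
            children = task()
            with pending_lock:                # count children before this task
                pending[0] += len(children) - 1   # is counted as finished
            for child in children:
                queues[me].push(child)        # generated tasks stay local

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()


def make_tree_task(depth, counter, lock):
    """Hypothetical example task: counts itself, then spawns two children."""
    def task():
        with lock:
            counter[0] += 1
        if depth == 0:
            return []
        return [make_tree_task(depth - 1, counter, lock) for _ in range(2)]
    return task


total, total_lock = [0], threading.Lock()
run_work_stealing(make_tree_task(4, total, total_lock))
print(total[0])  # a binary tree of depth 4 has 31 tasks
```

Note that workers touch another processor's queue only when their own queue is empty, which matches the “communication done only when idle” property; because a steal attempt here can fail, the sketch is greedy only to the extent that stealers succeed.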