Getting the Most out of Parallel Extensions for .NET
Dr. Mike Liddell, Senior Developer, Microsoft (mikelid@microsoft.com)

Agenda
- Why parallelism, why now?
- Parallelism with today's technologies
- Parallel Extensions to the .NET Framework
  - PLINQ
  - Task Parallel Library
  - Coordination Data Structures
- Demos

Hardware Paradigm Shift
[Figure: (1) power density (W/cm2) of Intel processors from the 4004 through the Pentium family on a log scale, plotted against reference points such as a hot plate, a nuclear reactor, a rocket nozzle, and the sun's surface -- heat is becoming an unmanageable problem for today's architectures; (2) projected parallelism opportunity, with core counts growing from 2004 through 2015 (toward 16, 128, and more cores), roughly an 80x opportunity.]
To grow, to keep up, we must embrace parallel computing.

"... we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to ... multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
  -- Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation (Intel Developer Forum, Spring 2004)

It's an Industry Thing
- OpenMP
- Intel TBB
- Java libraries
- OpenCL
- CUDA
- MPI
- Erlang
- Cilk
- (many others)

What's the Problem?
- Multithreaded programming is "hard" today
  - Robust solutions come only from specialists
  - Parallel patterns are not prevalent, well known, or easy to implement
- Many potential correctness and performance issues
  - Races, deadlocks, livelocks, lock convoys, cache-coherency overheads, missed notifications, non-serializable updates, priority inversion, false sharing, sub-linear scaling, and so on
- Features that matter are often skimped on
  - The last delta of performance, ensuring no missed exceptions, composable cancellation, dynamic partitioning, efficient and custom scheduling
- Businesses have little desire to "go deep"
  - Developers should focus on business value, not concurrency hassles and common concerns

Example: Matrix Multiplication

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    }
}

Manual Parallel Solution
(Slide annotations: static work distribution; potential scalability bottleneck; error prone; manual locking; manual synchronization.)

int N = size;
int P = 2 * Environment.ProcessorCount;                // static work distribution
int Chunk = N / P;
ManualResetEvent signal = new ManualResetEvent(false); // manual synchronization
int counter = P;
for (int c = 0; c < P; c++)
{
    ThreadPool.QueueUserWorkItem(o =>                  // potential scalability bottleneck
    {
        int lc = (int)o;
        // error-prone chunk-boundary arithmetic
        for (int i = lc * Chunk; i < (lc + 1 == P ? N : (lc + 1) * Chunk); i++)
        {
            // original loop body
            for (int j = 0; j < size; j++)
            {
                result[i, j] = 0;
                for (int k = 0; k < size; k++)
                {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        if (Interlocked.Decrement(ref counter) == 0)   // manual countdown
        {
            signal.Set();
        }
    }, c);
}
signal.WaitOne();                                      // manual synchronization

Parallel Solution

void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
{
    Parallel.For(0, size, i =>
    {
        for (int j = 0; j < size; j++)
        {
            result[i, j] = 0;
            for (int k = 0; k < size; k++)
            {
                result[i, j] += m1[i, k] * m2[k, j];
            }
        }
    });
}

Demo!
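As a rough sketch of what the demo might look like, the harness below times the Parallel.For version; the matrix size, the random initialization, and the MatrixDemo class name are hypothetical, and the multiply method is the one from the slide above.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class MatrixDemo
{
    // The Parallel.For version from the slide above.
    static void MultiplyMatrices(int size, double[,] m1, double[,] m2, double[,] result)
    {
        Parallel.For(0, size, i =>
        {
            for (int j = 0; j < size; j++)
            {
                result[i, j] = 0;
                for (int k = 0; k < size; k++)
                    result[i, j] += m1[i, k] * m2[k, j];
            }
        });
    }

    static void Main()
    {
        const int size = 500;                  // hypothetical size, large enough to show speedup
        var m1 = new double[size, size];
        var m2 = new double[size, size];
        var result = new double[size, size];
        var rand = new Random(42);
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
            {
                m1[i, j] = rand.NextDouble();
                m2[i, j] = rand.NextDouble();
            }

        var sw = Stopwatch.StartNew();
        MultiplyMatrices(size, m1, m2, result);
        sw.Stop();
        Console.WriteLine("Parallel multiply of {0}x{0}: {1} ms", size, sw.ElapsedMilliseconds);
    }
}

Swapping in the sequential version from the earlier slide makes the speedup on a multicore box directly visible.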
Parallel Extensions to the .NET Framework

What is it?
- Additional APIs shipping in the .NET BCL (mscorlib, System, System.Core), with corresponding enhancements to the CLR and ThreadPool
- Provides primitives, task parallelism, and data parallelism
  - Coordination/synchronization constructs (Coordination Data Structures)
  - Imperative data and task parallelism (Task Parallel Library)
  - Declarative data parallelism (PLINQ)
- Common exception-handling model
- Common and rich cancellation model

Why do we need it?
- Supports parallelism in any .NET language
- Delivers reduced concept count and complexity, and better time to solution
- Begins to move parallelism capabilities from concurrency experts to domain experts

Parallel Extensions Architecture
- User code / applications sit on top of:
- PLINQ execution engine
  - Data partitioning (chunk, range, stripe, custom)
  - Operators (map, filter, sort, search, reduction)
  - Merging (pipelined, synchronous, order-preserving)
- Task Parallel Library: structured task parallelism
- Coordination Data Structures: thread-safe collections, coordination types, cancellation types
- Pre-existing primitives underneath: ThreadPool; Monitor, events, threads

Task Parallel Library (System.Threading.Tasks)
- Task
  - First-class debugger support!
  - Parent-child relationships
  - Structured waiting and cancellation
  - Continuations on success, failure, or cancellation
  - Implements IAsyncResult to compose with the Asynchronous Programming Model (APM)
- Task<T>
  - A task that produces a value on completion
  - Asynchronous execution, with blocking on task.Value
  - Combines the ideas of futures and promises
- TaskScheduler
  - We ship a scheduler that makes full use of the (vastly) improved ThreadPool
  - Custom task schedulers can be written for specific needs
- Parallel convenience APIs: Parallel.For(), Parallel.ForEach()
  - Automatic, scalable, and dynamic partitioning

Task Parallel Library: Loops
Loops are a common source of work:

for (int i = 0; i < n; i++) work(i);
...
foreach (T e in data) work(e);

They can be parallelized when iterations are independent -- that is, when the body doesn't depend on mutable state such as static variables or locals written in one iteration and read in a later one:

Parallel.For(0, n, i => work(i));
...
Parallel.ForEach(data, e => work(e));

Task Parallel Library (cont.)
- Parallel.For and Parallel.ForEach support early exit from loops via a Break API (see the sketch below)
- Parallel.Invoke for easy creation of simple tasks:

Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());

- These are synchronous (blocking) APIs, but with cancellation support:

Parallel.For(..., cancellationToken);
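A minimal sketch of the Break API, assuming the shipped .NET 4 shape of ParallelLoopState; the input array and the break condition are hypothetical.

using System;
using System.Threading.Tasks;

class BreakSketch
{
    static void Main()
    {
        int[] data = { 3, 1, 4, 1, 5, 9, 2, 6 };   // hypothetical input

        // Break() requests that no iterations beyond the current index start,
        // while iterations at lower indices are still allowed to complete.
        ParallelLoopResult result = Parallel.For(0, data.Length, (i, state) =>
        {
            if (data[i] > 4)
            {
                state.Break();                     // early exit, preserving lower iterations
                return;
            }
            Console.WriteLine("processed index {0}", i);
        });

        Console.WriteLine("Completed: {0}, LowestBreakIteration: {1}",
            result.IsCompleted, result.LowestBreakIteration);
    }
}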
Parallel LINQ (PLINQ)
- Enables LINQ developers to leverage parallel hardware
  - Supports all of the .NET Standard Query Operators, plus a few extension methods specific to PLINQ
- Abstracts away the parallelism details
  - Partitions and merges data intelligently ("classic" data parallelism)
- Works for any IEnumerable<T>, e.g.:

data.AsParallel().Select(..).Where(..);
array.AsParallel().WithCancellation(ct)...

Writing a PLINQ Query
There are different ways to write PLINQ queries.

Comprehension syntax (extensions to C# and Visual Basic):

var q = from x in Y.AsParallel()
        where p(x)
        orderby x.f1
        select x.f2;

Normal APIs, used as extension methods on IParallelEnumerable<T>:

var q = Y.AsParallel()
         .Where(x => p(x))
         .OrderBy(x => x.f1)
         .Select(x => x.f2);

Direct use of ParallelEnumerable:

var q = ParallelEnumerable.Select(
            ParallelEnumerable.OrderBy(
                ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
                x => x.f1),
            x => x.f2);

PLINQ Partitioning and Merging
- The input to a single operator is partitioned into p disjoint subsets
- Operators are replicated across the partitions
- Each partition executes in (almost) complete isolation
- A merge marshals data back to the consumer thread

foreach (int i in D.AsParallel()
                   .Where(x => p(x))
                   .Select(x => x * x * x)
                   .OrderBy(x => -x)) { ... }

[Figure: D is partitioned across Task 1 ... Task n; each task runs the replicated pipeline (where p(x), select x^3, local sort) over its own partition, and a merge feeds the results back to the consuming foreach.]

Coordination Data Structures
Used throughout PLINQ and TPL; they assist with key concurrency patterns.
- Thread-safe collections: ConcurrentStack<T>, ConcurrentQueue<T>, ...
- Work exchange: BlockingCollection<T>, ...
- Locks and signaling: ManualResetEventSlim, SemaphoreSlim, SpinLock, ...
- Initialization: LazyInit<T>, ...
- Phased operation: CountdownEvent, ...
- Cancellation: CancellationTokenSource, CancellationToken, OperationCanceledException

Common Cancellation
- A CancellationTokenSource (CTS) is a source of cancellation requests
- A CancellationToken (CT) is a notifier of a cancellation request
- The work coordinator: (1) creates a CTS, (2) starts the work, (3) cancels the CTS if required
- Workers: (1) get, share, and copy tokens, (2) routinely poll a token, which observes the CTS, (3) may attach callbacks to a token
- Linking tokens allows cancellation requesters to be combined
- Slow code should poll roughly every 1 ms; blocking calls should observe a token
[Figure: a coordinator's CTS handing its token (CT) to several workers, and a linked source (CTS12) combining tokens CT1 and CT2.]

Common Cancellation (cont.)
All blocking calls allow a CancellationToken to be supplied:

var results = data
    .AsParallel()
    .WithCancellation(token)
    .Select(x => f(x))
    .ToArray();

User code can observe the cancellation token and cooperatively enact cancellation:

var results = data
    .AsParallel()
    .WithCancellation(token)
    .Select(x =>
    {
        if (token.IsCancellationRequested)
            throw new OperationCanceledException(token);
        return f(x);
    })
    .ToArray();

Extension Points in TPL & PLINQ
- Partitioning strategies for Parallel & PLINQ
  - Extend via Partitioner<T> and OrderablePartitioner<T>, e.g. partitioners for heterogeneous data (see the sketch below)
- Task scheduling
  - Extend via TaskScheduler, e.g. a GUI-thread scheduler or a throttled scheduler
- BlockingCollection
  - Extend via IProducerConsumerCollection, e.g. a blocking priority queue
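As a concrete illustration of the partitioning seam, here is a minimal sketch using the built-in range partitioner (Partitioner.Create, as shipped in .NET 4) with Parallel.ForEach; the array contents and sizes are hypothetical. Custom Partitioner<T> subclasses plug into the same overload.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PartitionerSketch
{
    static void Main()
    {
        double[] data = new double[1000000];        // hypothetical input
        double[] squares = new double[data.Length];

        // Partitioner.Create(from, to) yields contiguous index ranges, so each
        // worker runs a tight sequential loop instead of paying per-element overhead.
        var ranges = Partitioner.Create(0, data.Length);

        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                squares[i] = data[i] * data[i];
            }
        });

        Console.WriteLine("done");
    }
}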
Debugging Parallel Apps in VS2010
Two new debugger tool windows: "Parallel Tasks" and "Parallel Stacks".

Parallel Tasks
[Screenshot: the Parallel Tasks window, showing columns for status, identifier, location (with tooltip), thread assignment, parent ID, and task entry point; the current task; a task whose thread is frozen; column and item context menus; flagging; and a tooltip with information on waiting/deadlocked status.]

Parallel Stacks
[Screenshot: the Parallel Stacks window, showing the active frame of the current thread and of other threads, the current frame, context menus, a zoom control, header and method tooltips, a blue highlight along the path of the current thread, and a bird's-eye view.]

Summary
- The many-core shift is happening
- Parallelism in your code is inevitable
- Invest in a platform that enables parallelism...
- ...like the Parallel Extensions for .NET

Further Info and News
- Getting the bits!
  - June 2008 CTP: http://msdn.microsoft.com/concurrency
  - Microsoft Visual Studio 2010 (beta coming soon): http://www.microsoft.com/visualstudio/en-us/products/2010/default.mspx
- MSDN Concurrency Developer Center: http://msdn.microsoft.com/concurrency
- Blogs
  - Parallel Extensions Team: http://blogs.msdn.com/pfxteam
  - Joe Duffy: http://www.bluebytesoftware.com
  - Daniel Moth: http://www.danielmoth.com/Blog/

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Extra Slides

Parallel Technologies from Microsoft
- Local computing: CDS, TPL, PLINQ; the Concurrency Runtime in Robotics Studio; PPL (native); OpenMP (native)
- Distributed computing: WCF; MPI, MPI.NET

Types
- Key common types: AggregateException, OperationCanceledException, TaskCanceledException; CancellationTokenSource, CancellationToken; Partitioner<T>
- Key TPL types: Task, Task<T>; TaskFactory, TaskFactory<T>; TaskScheduler
- Key PLINQ types: the extension methods IEnumerable.AsParallel() and IEnumerable<T>.AsParallel(); ParallelQuery, ParallelQuery<T>, OrderableParallelQuery<T>
- Key CDS types: Lazy<T>, LazyVariable<T>, LazyInitializer; CountdownEvent, ManualResetEventSlim, SemaphoreSlim; BlockingCollection, ConcurrentDictionary, ConcurrentQueue

Performance Tips
- This is an early community technology preview: keep in mind that performance will improve significantly
- Target compute-intensive work and/or large data sets
  - Work done should be at least thousands of cycles; measure, and combine/optimize as necessary
- Do not be gratuitous in task creation
  - Tasks are lightweight, but still require object allocation, etc.
- Parallelize only outer loops where possible
  - Unless N is too small to offer enough parallelism; consider parallelizing only the inner loop, or both, at that point
- Prefer isolation and immutability over synchronization
  - Synchronization == !Scalable; try to avoid shared data
- Have realistic expectations
  - Amdahl's Law: speedup is fundamentally limited by the amount of sequential computation
  - Gustafson's Law: but what if you add more data, thus increasing the parallelizable percentage of the application? (See the worked example below.)
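To make Amdahl's Law concrete, here is a small worked example; the 90% figure is hypothetical. With parallelizable fraction P running on N cores, speedup <= 1 / ((1 - P) + P/N). For P = 0.9 and N = 8 this gives 1 / (0.1 + 0.9/8) = 1 / 0.2125, or about 4.7x; and even as N grows without bound, the speedup is capped at 1 / (1 - P) = 10x. Gustafson's observation is that growing the data set tends to grow P itself, raising that cap.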
Parallelism Blockers
- Ordering is not guaranteed:

int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 } ??

- Exceptions are aggregated into an AggregateException:

object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString();

- Thread affinity:

controls.AsParallel().ForAll(c => c.Size = ...); // Problem!

- Operations with sub-linear speedup, or even speedup < 1.0:

IEnumerable<int> input = ...;
var doubled = from x in input.AsParallel() select x * 2;

- Side effects and mutability are serious issues
  - Most queries do not use side effects, but the following is a race condition if elements are non-unique:

var q = from x in data.AsParallel() select x.f++;

PLINQ Partitioning (cont.)
Types of partitioning:
- Chunk: works with any IEnumerable<T>; a single enumerator is shared, and chunks are handed out on demand
- Range: works only with IList<T>; the input is divided into contiguous regions, one per partition
- Stride: works only with IList<T>; elements are handed out round-robin to each partition
- Hash: works with any IEnumerable<T>; elements are assigned to partitions based on their hash codes
Repartitioning is sometimes necessary.

PLINQ Merging
- Pipelined: a separate consumer thread
  - The default for GetEnumerator(), and hence for foreach loops
  - Access to data as it becomes available, but more synchronization overhead
- Stop-and-go: the consumer helps
  - Used for sorts, ToArray, ToList, GetEnumerator(false), etc.
  - Minimizes context switches, but higher latency and more memory
- Inverted: no merging needed
  - The ForAll extension method; the most efficient by far (see the sketch below), but not always applicable
[Figure: thread diagrams contrasting the pipelined, stop-and-go, and inverted merge strategies.]
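A minimal sketch of the inverted merge via ForAll. The input and the Process method are hypothetical stand-ins; the per-element action must be safe to run concurrently.

using System;
using System.Linq;

class ForAllSketch
{
    static void Main()
    {
        int[] data = Enumerable.Range(0, 1000).ToArray();   // hypothetical input

        // ForAll runs the action on each partition's own thread, so no results
        // ever need to be merged back onto the consumer thread.
        data.AsParallel()
            .Where(x => x % 2 == 0)
            .ForAll(x => Process(x));
    }

    static void Process(int x)
    {
        // stand-in for real per-element work
        Console.WriteLine(x);
    }
}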
Example: "Baby Names"

IEnumerable<BabyInfo> babyRecords = GetBabyRecords();
var results = new List<BabyInfo>();
foreach (var babyRecord in babyRecords)
{
    if (babyRecord.Name == queryName &&
        babyRecord.State == queryState &&
        babyRecord.Year >= yearStart &&
        babyRecord.Year <= yearEnd)
    {
        results.Add(babyRecord);
    }
}
results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));

Manual Parallel Solution
(Slide annotations: requires synchronization knowledge; inefficient locking; loses foreach simplicity; manual aggregation; tricks; lack of thread reuse; heavy synchronization; non-parallel sort.)

IEnumerable<BabyInfo> babies = ...;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount * 2;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try
{
    using (ManualResetEvent done = new ManualResetEvent(false))    // heavy synchronization
    {
        for (int i = 0; i < partitionsCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate                  // lack of thread reuse
            {
                var partialResults = new List<BabyInfo>();
                while (true)                                       // loses foreach simplicity
                {
                    BabyInfo baby;
                    lock (enumerator)                              // inefficient locking
                    {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd)
                    {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);   // manual aggregation
                if (Interlocked.Decrement(ref remainingCount) == 0)
                    done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));      // non-parallel sort
    }
}
finally
{
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}

PLINQ Solution

var results = from baby in babyRecords.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;

(or, in method syntax...)

var results = babyRecords.AsParallel()
    .Where(b => b.Name == queryName && b.State == queryState &&
                b.Year >= yearStart && b.Year <= yearEnd)
    .OrderBy(b => b.Year)
    .Select(b => b);

ThreadPool Task (Work) Stealing
[Figure: the improved ThreadPool, with a global queue feeding per-worker-thread local task queues; each worker (Worker Thread 1 ... Worker Thread p) pushes and pops tasks from its own local queue, and steals tasks from other workers' queues when its own is empty.]
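A small sketch of the kind of workload that benefits from this work stealing: in .NET 4, tasks spawned from within a running task are queued to that worker's local queue (per the diagram above), and idle workers steal from their peers. The recursive sum, the cutoff threshold, and the class name here are hypothetical.

using System;
using System.Threading.Tasks;

class WorkStealingSketch
{
    static void Main()
    {
        int[] data = new int[1 << 20];
        for (int i = 0; i < data.Length; i++) data[i] = 1;
        Console.WriteLine(Sum(data, 0, data.Length));   // prints 1048576
    }

    // Recursively splits the range; child tasks created inside a task land on
    // that worker's local queue, where idle workers can steal them.
    static long Sum(int[] data, int from, int to)
    {
        const int Threshold = 4096;                     // hypothetical cutoff
        if (to - from <= Threshold)
        {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = from + (to - from) / 2;
        Task<long> left = Task.Factory.StartNew(() => Sum(data, from, mid));
        long right = Sum(data, mid, to);
        return left.Result + right;
    }
}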