Getting the most out of Parallel
Extensions for .NET
Dr. Mike Liddell
Senior Developer
Microsoft
(mikelid@microsoft.com)
1
Agenda
Why parallelism, why now?
Parallelism with today’s technologies
Parallel Extensions to the .NET Framework
PLINQ
Task Parallel Library
Coordination Data Structures
Demos
2
Hardware Paradigm Shift
[Chart (Intel): power density in W/cm² of Intel processors, from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium® line, plotted from the '70s to the '00s on a log scale from 1 to 10,000. The curve passes "hot plate" and heads toward nuclear-reactor, rocket-nozzle, and sun's-surface power densities: with today's architecture, heat is becoming an unmanageable problem. A companion projection shows parallel throughput growing from 16 GOPS in 2004 to 128 by 2008, 2,048 by 2010, and 32,768 by 2015, an 80X parallelism opportunity.]
To Grow, To Keep Up,
We Must Embrace Parallel Computing
Intel Developer Forum, Spring 2004 - Pat Gelsinger
“… we see a very significant shift in what architectures will look like in the future ...
fundamentally the way we've begun to look at doing that is to move from instruction level
concurrency to … multiple cores per die. But we're going to continue to go beyond there.
And that just won't be in our server lines in the future; this will permeate every
architecture that we build. All will have massively multicore implementations.”
Pat Gelsinger
Chief Technology Officer, Senior Vice President, Intel Corporation
3
It's An Industry Thing
OpenMP
Intel TBB
Java libraries
OpenCL
CUDA
MPI
Erlang
Cilk
(many others)
4
5
What's the Problem?
Multithreaded programming is “hard” today
Robust solutions only by specialists
Parallel patterns are not prevalent, well known, nor easy to
implement
Many potential correctness & performance issues
Races, deadlocks, livelocks, lock convoys, cache coherency overheads,
missed notifications, non-serializable updates, priority inversion,
false-sharing, sub-linear scaling and so on…
Features that are hard to get right are often skimped on
The last delta of performance, ensuring no missed exceptions, composable
cancellation, dynamic partitioning, efficient and custom scheduling
Businesses have little desire to “go deep”
Developers should focus on business value,
not concurrency hassles and common concerns
6
Example: Matrix Multiplication
void MultiplyMatrices(int size,
double[,] m1, double[,] m2, double[,] result)
{
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
result[i, j] = 0;
for (int k = 0; k < size; k++) {
result[i, j] += m1[i, k] * m2[k, j];
}
}
}
}
7
Manual Parallel Solution
int N = size;
int P = 2 * Environment.ProcessorCount;            // static work distribution
int Chunk = N / P;
ManualResetEvent signal = new ManualResetEvent(false);  // manual synchronization
int counter = P;                                   // potential scalability bottleneck
for (int c = 0; c < P; c++) {
    ThreadPool.QueueUserWorkItem(o => {
        int lc = (int)o;
        // error-prone manual index arithmetic
        for (int i = lc * Chunk;
             i < (lc + 1 == P ? N : (lc + 1) * Chunk);
             i++) {
            // original loop body
            for (int j = 0; j < size; j++) {
                result[i, j] = 0;
                for (int k = 0; k < size; k++) {
                    result[i, j] += m1[i, k] * m2[k, j];
                }
            }
        }
        // manual locking and completion counting
        if (Interlocked.Decrement(ref counter) == 0) {
            signal.Set();
        }
    }, c);
}
signal.WaitOne();
8
Parallel Solution
void MultiplyMatrices(int size,
double[,] m1, double[,] m2, double[,] result)
{
Parallel.For(0, size, i => {
for (int j = 0; j < size; j++) {
result[i, j] = 0;
for (int k = 0; k < size; k++) {
result[i, j] += m1[i, k] * m2[k, j];
}
}
});
}
Demo!
9
Parallel Extensions to the
.NET Framework
What is it?
Additional APIs shipping in .NET BCL (mscorlib, System,
System.Core)
With corresponding enhancements to the CLR & ThreadPool
Provides primitives, task parallelism and data parallelism
Coordination/synchronization constructs (Coordination Data Structures)
Imperative data and task parallelism (Task Parallel Library)
Declarative data parallelism (PLINQ)
Common exception handling model
Common and rich cancellation model
Why do we need it?
Supports parallelism in any .NET language
Delivers reduced concept count and complexity, better time to
solution
Begins to move parallelism capabilities from concurrency experts
to domain experts
Parallel Extensions Architecture
User Code
Applications
PLINQ Execution Engine
Data Partitioning (Chunk, Range, Stripe, Custom)
Operators (Map, Filter, Sort, Search, Reduction)
Merging (Pipeline, Synchronous, Order preserving)
Task Parallel Library
Coordination Data Structures
Structured Task Parallelism
Thread-safe Collections
Coordination Types
Cancellation Types
Pre-existing Primitives
ThreadPool
Monitor, Events, Threads
11
Task Parallel Library
System.Threading.Tasks
Task
1st-class debugger support!
Parent-child relationships
Structured waiting and cancellation
Continuations on success, failure, cancellation
Implements IAsyncResult to compose with the Asynchronous Programming Model (APM)
Task<T>
A task that has a value on completion
Asynchronous execution with blocking on task.Value
Combines the ideas of futures and promises
TaskScheduler
We ship a scheduler that makes full use of the (vastly) improved ThreadPool
Custom Task Schedulers can be written for specific needs.
Parallel
Convenience APIs: Parallel.For(), Parallel.ForEach()
Automatic, scalable & dynamic partitioning.
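A minimal sketch of the Task and Task&lt;T&gt; pattern described above. It uses the API shape that shipped in .NET 4, where the CTP's task.Value became Task&lt;TResult&gt;.Result; the names SumOfSquares and TaskDemo are illustrative, not part of the library:

```csharp
using System;
using System.Threading.Tasks;

static class TaskDemo
{
    // Start a Task<int> asynchronously; reading Result blocks until completion.
    public static int SumOfSquares(int n)
    {
        Task<int> t = Task.Factory.StartNew(() =>
        {
            int sum = 0;
            for (int i = 1; i <= n; i++) sum += i * i;
            return sum;
        });

        // A continuation runs once the antecedent task completes.
        Task<int> doubled = t.ContinueWith(antecedent => antecedent.Result * 2);

        return doubled.Result;  // blocks until the whole chain finishes
    }

    static void Main()
    {
        Console.WriteLine(SumOfSquares(3)); // 2 * (1 + 4 + 9) = 28
    }
}
```

Because Task implements IAsyncResult, the same object can be handed to APM-style callers without an adapter.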
Task Parallel Library
Loops
Loops are a common source of work
for (int i = 0; i < n; i++) work(i);
…
foreach (T e in data) work(e);
Can be parallelized when iterations are independent
Body doesn’t depend on mutable state
e.g. static vars, writing to local vars to be used in subsequent iterations
Parallel.For(0, n, i => work(i));
…
Parallel.ForEach(data, e => work(e));
13
Task Parallel Library
Parallel.For, Parallel.ForEach for loops
Supports early exit via a Break API
Parallel.Invoke for easy creation of simple tasks
Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());
Synchronous (blocking) APIs, but with
cancellation support
Parallel.For(…, cancellationToken);
14
Parallel LINQ (PLINQ)
Enable LINQ developers to leverage
parallel hardware
Supports all of the .NET Standard Query Operators
Plus a few other extension methods specific to PLINQ
Abstracts away parallelism details
Partitions and merges data intelligently
(“classic” data parallelism)
Works for any IEnumerable<T>
e.g. data.AsParallel().Select(..).Where(..);
e.g. array.AsParallel().WithCancellation(ct)…
Writing a PLINQ Query
Different ways to write PLINQ queries
Comprehensions
Syntax extensions to C# and Visual Basic
var q = from x in Y.AsParallel() where p(x) orderby x.f1 select x.f2;
Normal APIs (two flavours)
Used as extension methods on IParallelEnumerable<T>
var q =
Y.AsParallel()
.Where(x => p(x))
.OrderBy(x => x.f1)
.Select(x => x.f2);
Direct use of ParallelEnumerable
var q = ParallelEnumerable.Select(
ParallelEnumerable.OrderBy(
ParallelEnumerable.Where(Y.AsParallel(), x => p(x)),
x => x.f1),
x => x.f2);
16
PLINQ Partitioning and Merging
• Input to a single operator is partitioned into p disjoint subsets
• Operators are replicated across the partitions
• Each partition executes in (almost) complete isolation
• A merge marshals data back to the consumer thread

foreach (int i in D.AsParallel()
                   .Where(x => p(x))
                   .Select(x => x * x * x)
                   .OrderBy(x => -x)) { … }

[Diagram: the source D is split by a partition step; Tasks 1…n each run the where p(x) / select x³ / LocalSort() pipeline over their own partition, and a merge step feeds the consuming foreach.]
Coordination Data Structures
Used throughout PLINQ and TPL
Assist with key concurrency patterns
Thread-safe collections
ConcurrentStack<T>, ConcurrentQueue<T>, …
Work exchange
BlockingCollection<T>, …
Locks and signaling
ManualResetEventSlim, SemaphoreSlim, SpinLock, …
Initialization
LazyInit<T>, …
Phased operation
CountdownEvent, …
Cancellation
CancellationTokenSource, CancellationToken, OperationCanceledException
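A minimal producer/consumer sketch of the work-exchange pattern using BlockingCollection&lt;T&gt; (assuming the .NET 4 System.Collections.Concurrent API; the bounded capacity of 4 and the names here are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class WorkExchangeDemo
{
    public static int ConsumeAll(int itemCount)
    {
        // Bounded capacity throttles the producer if the consumer falls behind.
        var queue = new BlockingCollection<int>(boundedCapacity: 4);

        // Producer: adds items, then marks the collection complete.
        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 1; i <= itemCount; i++) queue.Add(i);
            queue.CompleteAdding();
        });

        // Consumer: GetConsumingEnumerable blocks until items arrive
        // and ends cleanly once adding is complete.
        int sum = 0;
        foreach (int item in queue.GetConsumingEnumerable()) sum += item;

        producer.Wait();
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(ConsumeAll(10)); // 55
    }
}
```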
Common Cancellation
A CancellationTokenSource (CTS) is a source of cancellation requests.
A CancellationToken (CT) is a notifier of a cancellation request.
The work co-ordinator:
1. Creates a CTS
2. Starts work
3. Cancels the CTS if required
The workers:
1. Get, share, and copy tokens
2. Routinely poll a token, which observes the CTS
3. May attach callbacks to a token
Linking tokens (e.g. CT1 and CT2 combined via a linked CTS12) allows combining of cancellation requesters.
Slow code should poll roughly every 1 ms; blocking calls should observe a token.
19
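The co-ordinator/worker protocol above can be sketched as follows; a minimal example assuming the .NET 4 API, in which a worker task polls its token and surfaces cancellation as an exception:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class CancellationDemo
{
    // Co-ordinator creates a CTS, starts a worker, then cancels.
    // The worker cooperatively observes the token.
    public static bool RunUntilCancelled()
    {
        var cts = new CancellationTokenSource();
        CancellationToken token = cts.Token;

        Task worker = Task.Factory.StartNew(() =>
        {
            while (true)
            {
                // Routine polling; throwing with the task's own token
                // transitions the task to the Canceled state.
                token.ThrowIfCancellationRequested();
                Thread.Sleep(1);  // simulate a unit of work
            }
        }, token);

        cts.Cancel();  // co-ordinator requests cancellation

        try { worker.Wait(); }
        catch (AggregateException ae)
        {
            // Cancellation surfaces as an OperationCanceledException
            // (TaskCanceledException) inside an AggregateException.
            return ae.InnerExceptions[0] is OperationCanceledException;
        }
        return false;
    }

    static void Main()
    {
        Console.WriteLine(RunUntilCancelled()); // True
    }
}
```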
Common Cancellation (cont.)
All blocking calls allow a CancellationToken to be supplied.
var results = data
.AsParallel()
.WithCancellation(token)
.Select(x => f(x))
.ToArray();
User code can observe the cancellation token, and
cooperatively enact cancellation
var results = data
.AsParallel()
.WithCancellation(token)
.Select(x => {
    if (token.IsCancellationRequested)
        throw new OperationCanceledException(token);
    return f(x);
})
.ToArray();
20
Extension Points in TPL & PLINQ
Partitioning strategies for Parallel & PLINQ
Extend via Partitioner<T>, OrderablePartitioner<T>
e.g. partitioners for heterogeneous data
Task scheduling
Extend via TaskScheduler
e.g. GUI-thread scheduler, throttled scheduler
BlockingCollection
Extend via IProducerConsumerCollection<T>
e.g. a blocking priority queue
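As a sketch of partitioning in practice, the built-in Partitioner.Create range overload (available in .NET 4) hands Parallel.ForEach contiguous index ranges instead of single elements; writing a full custom Partitioner&lt;T&gt; subclass is more involved, and the names here are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class PartitionerDemo
{
    // Sum an array by processing contiguous [from, to) index ranges in
    // parallel, amortizing per-element delegate overhead across a range.
    public static long RangeSum(int[] data)
    {
        long total = 0;
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            long local = 0;  // per-partition accumulation, no shared state
            for (int i = range.Item1; i < range.Item2; i++) local += data[i];
            Interlocked.Add(ref total, local);  // one synchronization per partition
        });
        return total;
    }

    static void Main()
    {
        int[] data = new int[1000];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1;
        Console.WriteLine(RangeSum(data)); // 500500
    }
}
```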
21
Debugging Parallel Apps in VS2010
Two new debugger tool windows
“Parallel Tasks”
“Parallel Stacks”
Parallel Tasks
[Screenshot: the Parallel Tasks window, showing each task's status, identifier, and location, with tooltip details for thread assignment, parent ID, and task entry point. The current task is highlighted, a task whose thread is frozen is marked, and column and item context menus support flagging. A tooltip shows info on waiting/deadlocked status.]
23
Parallel Stacks
[Screenshot: the Parallel Stacks window, a bird's-eye view of the threads' call stacks. Blue highlights the path of the current thread; the current frame and the active frames of the current and other threads are marked, with a zoom control, context menus, and header/method tooltips.]
24
Summary
The ManyCore Shift is happening
Parallelism in your code is inevitable
Invest in a platform that enables parallelism
…like the Parallel Extensions for .NET
25
Further Info and News
Getting the bits!
June 2008 CTP - http://msdn.microsoft.com/concurrency
Microsoft Visual Studio 2010 – Beta coming soon.
http://www.microsoft.com/visualstudio/en-us/products/2010/default.mspx
MSDN Concurrency Developer Center
http://msdn.microsoft.com/concurrency
Parallel Extensions Team Blog
http://blogs.msdn.com/pfxteam
Blogs
Parallel Extensions Team: http://blogs.msdn.com/pfxteam
Joe Duffy: http://www.bluebytesoftware.com
Daniel Moth: http://www.danielmoth.com/Blog/
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should
not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
27
Extra Slides …
28
Parallel Technologies from
Microsoft
Local computing
CDS
TPL
Plinq
Concurrency Runtime in Robotics Studio
PPL (Native)
OpenMP (Native)
Distributed computing
WCF
MPI, MPI.NET
29
Types
Key Common Types:
AggregateException, OperationCanceledException, TaskCanceledException
CancellationTokenSource, CancellationToken
Partitioner<T>
Key TPL types:
Task, Task<T>
TaskFactory, TaskFactory<T>
TaskScheduler
Key PLINQ types:
Extension methods IEnumerable.AsParallel(), IEnumerable<T>.AsParallel()
ParallelQuery, ParallelQuery<T>, OrderableParallelQuery<T>
Key CDS types:
Lazy<T>, LazyVariable<T>, LazyInitializer,
CountdownEvent, ManualResetEventSlim, SemaphoreSlim
BlockingCollection, ConcurrentDictionary, ConcurrentQueue
30
Performance Tips
Early community technology preview
Keep in mind that performance will improve significantly
Compute intensive and/or large data sets
Work done should be at least 1,000s of cycles
Measure, and combine/optimize as necessary
Do not be gratuitous in task creation
Lightweight, but still requires object allocation, etc.
Parallelize only outer loops where possible
Unless N is insufficiently large to offer enough parallelism
Consider parallelizing only inner, or both, at that point
Prefer isolation and immutability over synchronization
Synchronization == !Scalable
Try to avoid shared data
Have realistic expectations
Amdahl’s Law
Speedup will be fundamentally limited by the amount of sequential computation
Gustafson’s Law
But what if you add more data, thus increasing the parallelizable
percentage of the application?
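Amdahl's law can be made concrete in a few lines: if fraction s of the work is inherently sequential, speedup on p cores is 1 / (s + (1 - s) / p), so speedup is capped at 1/s no matter how many cores are added. A quick sketch:

```csharp
using System;

static class AmdahlDemo
{
    // Amdahl's law: speedup on p cores with sequential fraction s.
    public static double Speedup(double sequentialFraction, int cores)
    {
        return 1.0 / (sequentialFraction + (1.0 - sequentialFraction) / cores);
    }

    static void Main()
    {
        // With 10% sequential work, 4 cores give ~3.08x,
        // while 1000 cores still give under 10x (the 1/s cap).
        Console.WriteLine(Speedup(0.1, 4));
        Console.WriteLine(Speedup(0.1, 1000));
    }
}
```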
Parallelism Blockers
Ordering not guaranteed
int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 }? Not guaranteed without AsOrdered()
Exceptions
AggregateException
object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString(); // throws, wrapped in an AggregateException
Thread affinity
controls.AsParallel().ForAll(c => c.Size = ...); //Problem
Operations with sub-linear speedup, or even speedup < 1.0
IEnumerable<int> input = …;
var doubled = from x in input.AsParallel() select x*2;
Side effects and mutability are serious issues
Most queries do not use side effects, but…
var q = from x in data.AsParallel() select x.f++;
Race condition if non-unique elements
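When ordering does matter, PLINQ's AsOrdered() operator restores the source order at some merging cost. A sketch of the first blocker above, fixed (names are illustrative):

```csharp
using System;
using System.Linq;

static class OrderingDemo
{
    public static int[] ScaleOrdered(int[] values)
    {
        // AsOrdered() preserves the source ordering through the
        // parallel pipeline, so the output order is deterministic.
        return values.AsParallel()
                     .AsOrdered()
                     .Select(x => x * 2)
                     .ToArray();
    }

    static void Main()
    {
        int[] scaled = ScaleOrdered(new[] { 0, 1, 2 }); // { 0, 2, 4 } guaranteed
        Console.WriteLine(string.Join(",", scaled));
    }
}
```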
PLINQ Partitioning, cont.
Types of partitioning
Chunk
Works with any IEnumerable<T>
Single enumerator shared; chunks handed out on-demand
Range
Works only with IList<T>
Input divided into contiguous regions, one per partition
Stride
Works only with IList<T>
Elements handed out round-robin to each partition
Hash
Works with any IEnumerable<T>
Elements assigned to partition based on hash code
Repartitioning sometimes necessary
PLINQ Merging
Pipelined: separate consumer thread
Default for GetEnumerator(), and hence foreach loops
Access to data as it's available
But more synchronization overhead
Stop-and-go: consumer helps
Used for sorts, ToArray, ToList, GetEnumerator(false), etc.
Minimizes context switches
But higher latency and more memory
Inverted: no merging needed
ForAll extension method
Most efficient by far
But not always applicable
[Diagram: thread usage under each strategy; pipelined merging dedicates a consumer thread while worker threads produce, stop-and-go has the consumer join the workers, and inverted merging uses the worker threads only.]
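The "inverted" merge described above corresponds to the ForAll extension method, which pushes each result to the delegate on the worker thread that produced it, with no merge step. A sketch (the delegate must be thread-safe; the names are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading;

static class ForAllDemo
{
    public static int CountEvens(int[] data)
    {
        int count = 0;
        // ForAll avoids marshalling results back to the consumer thread;
        // any aggregation must be done thread-safely inside the delegate.
        data.AsParallel()
            .Where(x => x % 2 == 0)
            .ForAll(x => Interlocked.Increment(ref count));
        return count;
    }

    static void Main()
    {
        int[] data = Enumerable.Range(0, 100).ToArray();
        Console.WriteLine(CountEvens(data)); // 50
    }
}
```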
Example: “Baby Names”
IEnumerable<BabyInfo> babyRecords = GetBabyRecords();
var results = new List<BabyInfo>();
foreach (var babyRecord in babyRecords)
{
if (babyRecord.Name == queryName &&
babyRecord.State == queryState &&
babyRecord.Year >= yearStart &&
babyRecord.Year <= yearEnd)
{
results.Add(babyRecord);
}
}
results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
35
Manual Parallel Solution
IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount * 2;  // synchronization knowledge required
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (ManualResetEvent done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {          // lack of thread reuse
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();   // manual aggregation
                while (true) {                               // lack of foreach simplicity
                    BabyInfo baby;
                    lock (enumerator) {                      // inefficient locking
                        if (!enumerator.MoveNext()) break;   // tricks
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);  // heavy synchronization
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));  // non-parallel sort
    }
}
finally { if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose(); }
36
LINQ Solution
var results = from baby in babyRecords.AsParallel()
where baby.Name == queryName &&
baby.State == queryState &&
baby.Year >= yearStart &&
baby.Year <= yearEnd
orderby baby.Year ascending
select baby;
(or in different Syntax…)
var results =
babyRecords.AsParallel()
.Where(b => b.Name == queryName &&
b.State == queryState &&
b.Year >= yearStart &&
b.Year <= yearEnd)
.OrderBy(b => b.Year)
.Select(b=>b);
37
ThreadPool Task (Work) Stealing
[Diagram: the ThreadPool maintains a global queue for the program plus per-worker-thread local task queues. Worker threads 1…p push and pop tasks (Task 1…Task 6) on their own local queues and steal work from other threads' queues when their own run dry.]
38