Pre-conference Session

David Callahan, Distinguished Engineer, Microsoft Corporation
Joe Duffy, Lead Software Engineer, Microsoft Corporation
Stephen Toub, Lead Program Manager, Microsoft Corporation
Overview and Architecture
• The Shift to Manycore: parallel computing matters
• Foundations: parallel computing concepts
• Techniques: concerns, top-down
“That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
-- Intel co-founder Gordon Moore, 1965

Quad-core Nehalem, announced at IDF in 2007: 731 million transistors (more than 13 doublings later…)
[Chart, courtesy of Jim Larus of Microsoft Research: log-scale growth of Processor (SPECInt), Memory (MB), and Disk (MB) across Windows releases, from Windows 3.1 and NT 3.51 through Windows 95/98/2000/XP/XP SP2 to Vista Premium. 52% CAGR in SPEC performance! Attack of the Killer Micros! Software is a Gas!]
[Chart: power density (W/cm2) of Intel processors, from the 4004, 8008, 8080, 8085, and 8086 ('70s) through the 286, 386, 486, and Pentium processors, climbing past "Hot Plate" toward "Nuclear Reactor", "Rocket Nozzle", and "Sun's Surface" levels. Source: Dr. Pat Gelsinger, Sr. VP, Intel Corporation and GM, Digital Enterprise Group, February 19, 2004, Intel Developer Forum, Spring 2004]
The Memory Wall. The ILP Wall.
Single-thread software performance will not be improving (much).
Intel Larrabee
• Latent parallelism for future scaling
• Focus on data: the scalable dimension
• Tasks instead of threads
• No silver bullet: many “right” approaches
Identify connected components and map every node to its containing component. (All code in this talk is pseudo-code.)

foreach node do
    node.component = null

foreach node do
    if (node.component == null) then
        node.component = new Component;
        roots.add(node);
        dfsearch(node)
    fi

function dfsearch(n)
    foreach m in adjacent(n) do
        if (m.component == null) then
            m.component = n.component
            dfsearch(m)
        fi
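For concreteness, a minimal C# rendering of the same sequential algorithm; the Node and Component types and the adjacency representation are assumptions, not from the talk:

using System.Collections.Generic;

class Component { }

class Node {
    public Component component;                      // null until claimed by a search
    public List<Node> Adjacent = new List<Node>();
}

static class ConnectedComponents {
    public static List<Node> FindRoots(IList<Node> nodes) {
        var roots = new List<Node>();
        foreach (var node in nodes) node.component = null;
        foreach (var node in nodes) {
            if (node.component == null) {
                node.component = new Component();
                roots.Add(node);
                DfSearch(node);
            }
        }
        return roots;
    }

    static void DfSearch(Node n) {
        foreach (var m in n.Adjacent) {
            if (m.component == null) {
                m.component = n.component;           // claim m for n's component
                DfSearch(m);
            }
        }
    }
}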
Candidates & connections form a reduced graph:
• Recursively find components on the reduced graph
• Update nodes to refer to final components
Concurrent processing: independent requests (most server applications).
Parallel processing: decompose one task to enable concurrent execution.
“Arbitrate ‘ownership’ of the nodes”
“Start concurrent searches …”
Simulating isolation of threads
Scheduling tasks
Multi-threading, Asynchronous, …
• Fairness
• Preemption
• Responsiveness
• Throughput
For parallelism, these are not a goal but a context: an existing architectural concern. Drive overheads down.
parallel foreach node do
    node.component = NULL

Classically data parallel: the same operation applied to a homogeneous collection.

parallel foreach node do
    … start a parallel search …

Data-focused, but built on an underlying “task” model for generality.
• Emphasizes recursive decomposition
• Preserves function interfaces
• “Fork-join”
• Structured control constructs
• Parallel loops, co-begin

function dfsearch(n)
    parallel foreach m in adjacent(n) do
        if (… first to visit m …) then
            dfsearch(m)
        fi

Each iteration is a task; all tasks finish before the function returns.
• Emphasizes processors
• “Fork-join” threads + barrier
• Structured control constructs
• “Shared loops”
• Improving support for recursion

parallel -- acquire workers
shared foreach node do
    node.component = NULL
-- implied barrier, workers wait
shared foreach node do
    … start a parallel search …
-- release workers

OpenMP is the common binding of this model.
Resource Management is too hard
[Figure: data flow graph for the subtasks of Strassen multiplication; intermediate products m1–m7 feed the result blocks c11, c12, c21, c22]
Identify where searches “collide”; arbitrate “ownership” of the nodes.

function dfsearch(n)
    foreach m in adjacent(n) do
        if (m.component == NULL) then
            m.component = n.component
            dfsearch(m)
        fi

Two searches racing for the same node m:

Search 1: if (m.component == null)?      Search 2: if (m.component == null)?
Search 1: m.component = n1.component     Search 2: m.component = n2.component
Search 1: dfsearch(m)                    Search 2: dfsearch(m) !

One action at a time for any specific node.
function dfsearch(n)
    foreach m in adjacent(n) do
        m.lock();
        var old = m.component;
        if (old == NULL) m.component = n.component
        m.unlock();
        if (old == NULL) then
            dfsearch(m)
        else if (old != n.component) then
            -- record the “edge” between searches
        endif

Locks provide exclusion, but the algorithm's correctness depends on careful reasoning that order does not matter.
A common hardware primitive:

word compare_and_swap(word * loc, word oldv, word newv) {
    word current = *loc;
    if (current == oldv) *loc = newv;
    return current;
}
function dfsearch(n)
    foreach m in adjacent(n) do
        var old = compare_and_swap(&m.component, NULL, n.component)
        if (old == NULL) then
            dfsearch(m)
        fi

• Short duration
• Preemption friendly
• Limited scenarios
function dfsearch(n, edges)
    foreach m in adjacent(n) do
        m.lock(); -- arbitrate “ownership” of the nodes
        var old = m.component;
        if (old == NULL) m.component = n.component
        m.unlock();
        if (old == null) then
            dfsearch(m, edges)
        else if (old != n.component) then
            edges.insert(old, n.component) -- edges must be concurrency safe and high-bandwidth
        endif
parallel foreach node do
    node.component = NULL

parallel foreach node do
    node.lock()
    var old = node.component
    if (old == NULL) node.component = new Component
    node.unlock()
    if (old == NULL) then
        roots.add(node)
        dfsearch(node, edges)
    fi

-- (roots, edges) form a derived problem
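A hedged C# sketch of the arbitration step, using Interlocked.CompareExchange in place of per-node locks; the Node and Component types extend the earlier sketch, and ConcurrentEdgeSet is a hypothetical helper defined here only for illustration:

using System.Collections.Generic;
using System.Threading;

class ConcurrentEdgeSet {                            // hypothetical; any thread-safe set works
    private readonly object m_lock = new object();
    private readonly List<KeyValuePair<Component, Component>> m_edges =
        new List<KeyValuePair<Component, Component>>();
    public void Insert(Component a, Component b) {
        lock (m_lock) m_edges.Add(new KeyValuePair<Component, Component>(a, b));
    }
}

static class ParallelSearch {
    // Exactly one search claims each node; collisions become edges in the reduced graph.
    public static void DfSearch(Node n, ConcurrentEdgeSet edges) {
        foreach (var m in n.Adjacent) {
            var old = Interlocked.CompareExchange(ref m.component, n.component, null);
            if (old == null) {
                DfSearch(m, edges);                  // we claimed m; keep searching
            } else if (old != n.component) {
                edges.Insert(old, n.component);      // two searches collided: record the edge
            }
        }
    }
}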
[Chart: time and efficiency vs. number of processors (1–14) for a program that is 95% (respectively 99%) parallel, with a 3% overhead to parallelize; series “Time 95%”, “Efficiency 95%”, and “Efficiency 99%” all fall well short of linear scaling.]
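The arithmetic behind such curves, in Amdahl's-law form with parallel fraction f and a fixed overhead term o (the 3% above); the chart's exact model is an assumption:

Speedup(p) = 1 / ((1 - f) + f/p + o)

With f = 0.95 and o = 0.03, Speedup(8) = 1 / (0.05 + 0.11875 + 0.03) ≈ 5.0, and even as p grows without bound the speedup is capped at 1 / 0.08 = 12.5; for f = 0.99 the cap is 1 / 0.04 = 25.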
• Contention
• Load balance
• Cache effects
• Latencies
• Preemption
Microsoft Visual Studio: Bringing out the Best in Multicore Systems
Parallel Programming for C++ Developers in the Next Version of Microsoft Visual Studio
The Concurrency and Coordination Runtime and Decentralized Software Services Toolkit
Research: Concurrency Analysis Platform and Tools for Finding Concurrency Bugs
Parallel Programming for Managed Developers with the Next Version of Microsoft Visual Studio
Concurrency Runtime Deep Dive: How to Harvest Multicore Computing Resources
Parallel Computing Application Architectures and Opportunities
Addressing the Hard Problems of Concurrency
Future of Parallel Computing (Panel)
Mechanisms for Asynchrony

For coarse-grained work and agents:

Thread t = new Thread(delegate
{
    // concurrent work
});
t.Start();
For fine-grained work:

ThreadPool.QueueUserWorkItem(delegate
{
    // concurrent work
});
[Diagram: the program queues items (Item 1, Item 2, Item 3, …) into the ThreadPool queue; worker threads 1 through p dequeue and execute them.]
Advanced capabilities
Common Async API Pattern in the Framework

Synchronous version:
int Foo(object o, string s);

Asynchronous pair:
IAsyncResult BeginFoo(object o, string s, AsyncCallback callback, object state);
int EndFoo(IAsyncResult result);
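A usage sketch (not from the slides): reading a file via FileStream's Begin/End pair; the callback runs on an I/O completion thread:

using System;
using System.IO;

class ApmExample {
    static void Main() {
        // The final 'true' opens the handle for asynchronous (overlapped) I/O.
        FileStream fs = new FileStream("data.bin", FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, true);
        byte[] buffer = new byte[4096];
        fs.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult ar) {
            int bytesRead = fs.EndRead(ar);   // completes the operation (may rethrow)
            Console.WriteLine("Read {0} bytes", bytesRead);
            fs.Close();
        }, null);
        Console.ReadLine();                   // keep the process alive for the demo
    }
}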
Efficient async I/O on Windows

// on System.Threading.ThreadPool
public static unsafe bool UnsafeQueueNativeOverlapped(NativeOverlapped* overlapped);
UI Marshaling

// on background thread (Windows Forms)
Control c = …;
c.BeginInvoke((Action)delegate
{
    // runs on UI thread
});

// on background thread (WPF)
Control c = …;
c.Dispatcher.BeginInvoke((Action)delegate
{
    // runs on UI thread
});
Synchronization Context
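A minimal sketch of the concept: SynchronizationContext generalizes the “post back to this thread” pattern above (ComputeResult and UpdateUi are hypothetical helpers):

// Capture on the UI thread...
SynchronizationContext ui = SynchronizationContext.Current;

ThreadPool.QueueUserWorkItem(delegate {
    string result = ComputeResult();             // hypothetical long-running background work
    // ...then post the result back to the captured (UI) context.
    ui.Post(delegate(object state) { UpdateUi((string)state); }, result);
});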
BackgroundWorker
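A minimal sketch, assuming the standard BackgroundWorker event pattern: DoWork runs on a ThreadPool thread, and RunWorkerCompleted is raised back on the originating context (ComputeResult and UpdateUi are hypothetical helpers):

BackgroundWorker worker = new BackgroundWorker();
worker.DoWork += delegate(object s, DoWorkEventArgs e) {
    e.Result = ComputeResult();                  // runs on a ThreadPool thread
};
worker.RunWorkerCompleted += delegate(object s, RunWorkerCompletedEventArgs e) {
    UpdateUi((string)e.Result);                  // back on the UI thread
};
worker.RunWorkerAsync();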
ExecutionContext
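A minimal sketch of the core Capture/Run pair, which flows ambient state (security, call context) from one thread to work running on another:

ExecutionContext ctx = ExecutionContext.Capture();
ThreadPool.QueueUserWorkItem(delegate {
    ExecutionContext.Run(ctx, delegate(object state) {
        // runs under the captured thread's ambient context
    }, null);
});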
Lunch (12pm-1:15pm)
Topics in Synchronization
The Pitfalls of Shared Memory

class C
{
    static int s_f;
    int m_f;
public:
    void f(int * py)
    {
        int x = 0;
        x++;      // local variable
        s_f++;    // static class member
        m_f++;    // class member
        (*py)++;  // pointer to something
    }
};
Isolation, Immutability, and Synchronization

Isolation:
+: no overhead, easy to reason about
-: sharing is often needed, leading to message passing

Immutability:
+: no overhead, easy to reason about
-: C# and VB encourage mutability … [lineage]
-: copying means efficiency can be a challenge
+: see F# for promising advances!

Synchronization:
+: flexible, programming techniques remain similar
-: perf overhead, deadlocks, races, …
R/W

static int x = 0;

void t1() {
    int y = x;
    …
    int z = x;
    // y != z
}

void t2() {
    x = 42;
}
W/R

static int x = 0;

void t1() {
    try {
        x = 42;
        …
        … throw e; …
    } catch {
        // whoops;
        // rollback!
        x = 0;
        throw;
    }
}

void t2() {
    int y = x;
    f(y);
}
W/W

static int x = 0;

void t1() {
    x = 42;
    int y = x;
}

void t2() {
    x = 99;
    int z = x;
}
Ensuring A happens-before B
Example of a Serializability Problem

Three threads each increment a shared counter a (initially 0); #n shows the value involved:

T=0  t2: MOV EAX,[a]  #0
T=1  t0: MOV EAX,[a]  #0
T=2  t0: INC EAX      #1
T=3  t0: MOV [a],EAX  #1
T=4  t1: MOV EAX,[a]  #1
T=5  t1: INC EAX      #2
T=6  t1: MOV [a],EAX  #2
T=7  t2: INC EAX      #1
T=8  t2: MOV [a],EAX  #1

Three increments, yet a ends at 1: t2's stale read at T=0 lets its final store clobber t0's and t1's updates.
              Sequential                             Concurrent
Behavior      Deterministic                          Nondeterministic
Memory        Stable                                 In flux (unless private, read-only,
                                                     or protected by a lock)
Locks         Unnecessary                            Essential
Invariants    Must hold only on method entry/exit    Must hold anytime the protecting
              or calls to external code              lock is not held
Deadlock      Impossible                             Possible, but can be mitigated
Testing       Code coverage finds most bugs          Code coverage insufficient; races,
                                                     timing, and environments
                                                     probabilistically change
Debugging     Trace execution leading to failure;    Postulate a race and inspect code;
              finding a fix is generally assured     root causes easily remain unidentified
Hardware Synchronization

System.Threading.Interlocked:

int Add(ref int l, int v);
int CompareExchange(ref int l, int v, int cmp);
int Decrement(ref int l);
int Increment(ref int l);
int Exchange(ref int l, int v);
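A small usage sketch (not from the slides): an atomic counter plus the canonical CAS loop, the building block behind the lock-free code later in this section:

static int s_counter = 0;
static int s_max = 0;

static void Touch() {
    Interlocked.Increment(ref s_counter);        // atomic ++, with a full fence
}

// CAS loop: atomically raise s_max to 'value' if value is larger.
static void AtomicMax(int value) {
    int snapshot;
    do {
        snapshot = s_max;
        if (value <= snapshot) return;           // nothing to do
    } while (Interlocked.CompareExchange(ref s_max, value, snapshot) != snapshot);
}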
The Foundation on top of Which All Else Exists

In .NET:

public class WaitHandle : IDisposable {
    public void Close();
    public virtual bool WaitOne();
    // timeout-variants, and plenty of others…
    public static bool WaitAll(WaitHandle[] hs);
    public static int WaitAny(WaitHandle[] hs);
}
In .NET:

public class Mutex : WaitHandle {
    public Mutex(string name, MutexSecurity acl, …);
    public void ReleaseMutex();
}

public class Semaphore : WaitHandle {
    public Semaphore(
        int initialCount, int maximumCount,
        string name, SemaphoreSecurity acl, …);
    public void Release(int count);
}
In .NET:

public class EventWaitHandle : WaitHandle {
    public EventWaitHandle(
        bool initialState, EventResetMode mode,
        string name, EventWaitHandleSecurity acl, …);
    public void Reset();
    public void Set();
}

public enum EventResetMode {
    AutoReset,
    ManualReset
}

public class AutoResetEvent : EventWaitHandle { … }
public class ManualResetEvent : EventWaitHandle { … }
Locking

[C#]  lock (obj) { … }
[VB]  SyncLock obj … End SyncLock

Both expand to:

Monitor.Enter(obj);
try {
    …
} finally {
    Monitor.Exit(obj);
}
Condition Variables

bool P = false;
…
lock (obj) {
    while (!P) Monitor.Wait(obj);
    …
}

… elsewhere …

lock (obj) {
    P = true;
    Monitor.Pulse(obj);   // or Monitor.PulseAll(obj)
}
When Mutual Exclusion is Unnecessary
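One common case the title suggests is read-mostly data, where a reader/writer lock lets readers proceed in parallel; a minimal sketch using .NET 3.5's ReaderWriterLockSlim (the cache example itself is hypothetical):

static ReaderWriterLockSlim s_rwl = new ReaderWriterLockSlim();
static Dictionary<string, string> s_cache = new Dictionary<string, string>();

static string Lookup(string key) {
    s_rwl.EnterReadLock();                       // many readers may hold this at once
    try {
        string value;
        s_cache.TryGetValue(key, out value);
        return value;
    } finally { s_rwl.ExitReadLock(); }
}

static void Update(string key, string value) {
    s_rwl.EnterWriteLock();                      // writers are exclusive
    try { s_cache[key] = value; }
    finally { s_rwl.ExitWriteLock(); }
}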
Convoy Avoidance
Confined State Within Threads
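A minimal sketch of one confinement technique, per-thread state via [ThreadStatic]; the Formatter example is hypothetical:

using System.Text;

class Formatter {
    [ThreadStatic]
    private static StringBuilder t_buffer;       // each thread sees its own instance

    public static string Format(int value) {
        if (t_buffer == null) t_buffer = new StringBuilder();  // lazy init per thread
        t_buffer.Length = 0;
        t_buffer.Append("value=").Append(value);
        return t_buffer.ToString();              // no locking: the buffer never escapes its thread
    }
}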
An ImmutableStack<T> Type

public class ImmutableStack<T> {
    private readonly T m_value;
    private readonly ImmutableStack<T> m_next;
    private readonly bool m_empty;

    public ImmutableStack() { m_empty = true; }

    internal ImmutableStack(T value, ImmutableStack<T> next) {
        m_value = value;
        m_next = next;
        m_empty = false;
    }

    public ImmutableStack<T> Push(T value) {
        return new ImmutableStack<T>(value, this);
    }

    public ImmutableStack<T> Pop(out T value) {
        if (m_empty) throw new Exception("Empty.");
        value = m_value;
        return m_next;
    }
}
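Usage: every Push or Pop returns a new stack and never mutates an existing one, so snapshots can be shared freely across threads without locks:

var empty = new ImmutableStack<int>();
var one = empty.Push(1);
var two = one.Push(2);        // 'one' is untouched

int top;
var back = two.Pop(out top);  // top == 2; 'back' is the same node as 'one'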
Architecture and Platform Guarantees
Examples

X = Y = 0;
Thread 1: X = 1; Y = 1;
Thread 2: A = Y; B = X;
Can A == 1 && B == 0?
No, except on IA64. (No StoreStore, No LoadLoad)

X = Y = 0;
Thread 1: X = 1; A = Y;
Thread 2: Y = 1; B = X;
Can A == 0 && B == 0?
Yes! (StoreLoad is permitted)

X = Y = 0;
Thread 1: X = 1;
Thread 2: A = X; Y = 1;
Thread 3: B = Y; C = X;
Can A == 1 && B == 1 && C == 0?
No. (Transitivity)
Accessing Nonatomic Locations w/out Proper Synchronization

internal static long s_x;

void t1() {
    int i = 0;
    while (true) {
        s_x = (i & 1) == 0 ? 0x0L : 0xaaaabbbbccccddddL;
        i++;
    }
}

void t2() {
    while (true) {
        long x = s_x;
        // Can fire on 32-bit: 64-bit accesses are not atomic there,
        // so x may be a torn mix of the two values.
        Debug.Assert(x == 0x0L || x == 0xaaaabbbbccccddddL);
    }
}
Double Edged Sword

class Stack<T> {
    Node<T> head;

    void Push(T obj) {
        Node<T> n = new Node<T>(obj);
        Node<T> h;
        do {
            h = head;
            n.next = h;
        } while (Interlocked.CompareExchange(ref head, n, h) != h);
    }

    T Pop() {
        Node<T> n;
        do {
            n = head;   // note: assumes a non-empty stack
        } while (Interlocked.CompareExchange(ref head, n.next, n) != n);
        return n.Value;
    }
    …
}
Efficient Lazy Initialization (Variant 1: Never Create >1)

class Foo {
    private static volatile Foo s_inst;
    private static object s_mutex = new object();

    internal static Foo Instance {   // property name 'Instance' assumed
        get {
            if (s_inst == null)
                lock (s_mutex)
                    if (s_inst == null)
                        s_inst = new Foo(…);
            return s_inst;
        }
    }
}

Efficient Lazy Initialization (Variant 2: >1 OK)

class Foo {
    private static volatile Foo s_inst;

    internal static Foo Instance {   // property name 'Instance' assumed
        get {
            if (s_inst == null) {
                Foo candidate = new Foo();
                Interlocked.CompareExchange(ref s_inst, candidate, null);
            }
            return s_inst;
        }
    }
}
Trickier Than You Think!

class SpinLock {
    private int m_state = 0;

    public void Enter() {
        while (Interlocked.CompareExchange(ref m_state, 1, 0) != 0) ;
    }

    public void Exit() {
        m_state = 0;
    }
}
Brain Melting Details …
Try Numero Dos – Still Imperfect

class SpinLock {
    private volatile int m_state = 0;

    public void Enter() {
        int tid = Thread.CurrentThread.ManagedThreadId;
        while (true) {
            if (Interlocked.CompareExchange(ref m_state, tid, 0) != 0) {
                int iters = 1;
                while (m_state != 0) {
                    if (Environment.ProcessorCount == 1) {
                        if (iters % 5 == 0) Thread.Sleep(1);
                        else Thread.Sleep(0);
                        iters++;
                    } else {
                        Thread.SpinWait(iters);
                        if (iters >= 4096) Thread.Sleep(1);
                        else {
                            if (iters >= 2048) Thread.Sleep(0);
                            iters *= 2;
                        }
                    }
                }
            }
            else break;   // acquired the lock
        }
    }

    public void Exit() {
        m_state = 0;
    }
}
Synchronization Best Practices

Lock consistently

class MyList<T> {
    T[] items;  // lock: items
    int n;      // lock: items

    void Add(T item) {
        lock (items) {
            items[n] = item;
            n++;
        }
    }
    …
}
Lock for the right duration

class MyList<T> {
    T[] items;  // lock: items
    int n;      // lock: items
    // invariant: n is count of valid
    // items in list and items[n] == null

    void Add(T item) {
        lock (items) {
            items[n] = item;
            n++;
        }
    }
    …
}
Make critical regions short and sweet

class MyList<T> {
    ...
    void Add(T t) {
        lock (items) {
            items[n] = t;
            n++;
        }
        Listener.Notify(this);  // notify outside the lock
    }
    …
}
Encapsulate your locks

class MyList<T> {
    T[] items;
    int n;
    static object slk = new object();
    …
    static void ResetStats() {
        lock (slk) {
            …
        }
    }
    …
}
Avoiding deadlocks

Do: Acquire locks in a consistent order

class MyService {
    A a;
    B b;
    …
    void DoAB() {
        lock (a) lock (b) {
            a.Do(); b.Do();
        }
    }
    void DoBA() {
        lock (b) lock (a) {   // inconsistent order: can deadlock against DoAB
            b.Do(); a.Do();
        }
    }
}
Locking Miscellany

Do: Document your locking policy, especially for public APIs
Do: Use a reader/writer lock if readers are common
Do: Prefer lock-based code to lock-free code
Do: Prefer Monitors over kernel synchronization
Avoid: Lock recursion in your designs
Don't: Build your own lock

Break (3:15pm-3:45pm)
Designs and Algorithms
The Impact of Multi-core on Apps
Code and Data
A Taxonomy of Concurrency

Agents/CSPs:
* Message passing
* Loose coupling

Task Parallelism:
* Statements
* Structured
* Futures
* ~O(1) parallelism

Data Parallelism:
* Data operations
* O(N) parallelism

Messaging
…
Metrics Worth Measuring
Parallel For Loops

Static decomposition:
+: simple, predictable, efficient
-: can't tolerate iteration imbalance, blocking

Dynamic decomposition:
+: tolerates imbalance, blocking
-: more difficult, communication overhead
Parallel For Loops – Static Decomposition

void ParallelForS(int lo, int hi, Action<int> body, int p) {
    int chunk = ((hi - lo) + p - 1) / p;  // Iterations/thread
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;
    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int start = lo + (int)procId * chunk;
            for (int j = start; j < start + chunk && j < hi; j++) {
                body(j);
            }
            if (Interlocked.Decrement(ref remaining) == 0)
                mre.Set();
        }, i);
    }
    mre.WaitOne();  // Wait for them to finish
}
Parallel For Loops – Dynamic Decomposition

void ParallelForD(int lo, int hi, Action<int> body, int p) {
    const int chunk = 16;  // Chunk size (constant)
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;
    int current = lo;
    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int j;
            while ((j = (Interlocked.Add(ref current, chunk) - chunk)) < hi) {
                for (int k = 0; k < chunk && j + k < hi; k++) {
                    body(j + k);
                }
            }
            if (Interlocked.Decrement(ref remaining) == 0)
                mre.Set();
        }, i);
    }
    mre.WaitOne();  // Wait for them to finish
}
Parallel Foreach Loops

void ParallelForEach<T>(IEnumerable<T> e, Action<T> body, int p) {
    const int chunk = 16;  // Chunk size (constant)
    ManualResetEvent mre = new ManualResetEvent(false);
    int remaining = p;
    using (IEnumerator<T> en = e.GetEnumerator()) {  // shared
        // Schedule the threads to run in parallel
        for (int i = 0; i < p; i++) {
            ThreadPool.QueueUserWorkItem(delegate(object procId) {
                T[] buffer = new T[chunk];
                int j;
                do {
                    lock (en) {
                        for (j = 0; j < chunk && en.MoveNext(); j++)
                            buffer[j] = en.Current;
                    }
                    for (int k = 0; k < j; k++)
                        body(buffer[k]);
                } while (j == chunk);
                if (Interlocked.Decrement(ref remaining) == 0)
                    mre.Set();
            }, i);
        }
        mre.WaitOne();  // Wait for them to finish
    }
}
Divide and Conquer - Recursion
Mirror(node.Right);
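A minimal divide-and-conquer sketch consistent with the surviving fragment above; the TreeNode type is an assumption:

class TreeNode {
    public TreeNode Left, Right;
}

static void Mirror(TreeNode node) {
    if (node == null) return;
    // Swap the children, then conquer each half; the two recursive
    // calls are independent and could each be forked as a task.
    TreeNode tmp = node.Left;
    node.Left = node.Right;
    node.Right = tmp;
    Mirror(node.Left);
    Mirror(node.Right);
}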
Reductions

int ParallelSum(int[] array, int p) {
    int chunk = (array.Length + p - 1) / p;  // Iterations/thread
    ManualResetEvent mre = new ManualResetEvent(false);
    int sum = 0, remaining = p;
    // Schedule the threads to run in parallel
    for (int i = 0; i < p; i++) {
        ThreadPool.QueueUserWorkItem(delegate(object procId) {
            int mySum = 0;
            int start = (int)procId * chunk;
            for (int j = start; j < start + chunk && j < array.Length; j++)
                mySum += array[j];
            Interlocked.Add(ref sum, mySum);
            if (Interlocked.Decrement(ref remaining) == 0)
                mre.Set();
        }, i);
    }
    mre.WaitOne();  // Wait for them to finish
    return sum;
}
When to “Go Parallel”?

There is a cost; only worthwhile when:
• Work per task/element is large, and/or
• Number of tasks/elements is large

[Chart: speedup vs. work per task / number of tasks. Below a break-even point, parallel runs slower than 1 task (sequential); past a point of diminishing returns, adding tasks stops helping.]
Synchronous I/O vs. Overlapped I/O

[Timeline: with synchronous I/O, two threads alternate running and waiting, completing 6 work items in 4 time units; with overlapped I/O, the waits are filled with other work and the same 6 items complete in 3 time units.]
Synchronization

[Timeline: four threads contending for the same lock; at most one runs while holding it, the rest wait, serializing execution.]
Load Imbalance

[Timeline: a sequential run vs. a parallel run; your API's work lands on threads 1 and 2 while threads 3 and 4 sit idle.]

If the serial fraction S = 50%, then 1/S == 2: more than 2 threads is just wasted resource. No matter how many processors, 2x is it.
Other Miscellaneous Algorithms
Producer/Consumer: Blocking & Bounded Queue

public class BlockingBoundedQueue<T> {
    private Queue<T> m_queue = new Queue<T>();
    private Semaphore m_fullSemaphore = new Semaphore(128, 128);  // free slots
    private Semaphore m_emptySemaphore = new Semaphore(0, 128);   // available items

    public void Enqueue(T item) {
        m_fullSemaphore.WaitOne();
        lock (m_queue) {
            m_queue.Enqueue(item);
        }
        m_emptySemaphore.Release();
    }

    public T Dequeue() {
        T e;
        m_emptySemaphore.WaitOne();
        lock (m_queue) {
            e = m_queue.Dequeue();
        }
        m_fullSemaphore.Release();
        return e;
    }
}
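A usage sketch: one producer and one consumer sharing the queue (Produce and Consume are hypothetical helpers):

var queue = new BlockingBoundedQueue<int>();

new Thread(delegate() {
    for (int i = 0; ; i++) queue.Enqueue(Produce(i));  // blocks once 128 items are pending
}).Start();

new Thread(delegate() {
    while (true) Consume(queue.Dequeue());             // blocks while the queue is empty
}).Start();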
.NET Framework 4.0

Sequential:

IEnumerable<BabyInfo> babies = ...;
var results = new List<BabyInfo>();
foreach (var baby in babies)
{
    if (baby.Name == queryName &&
        baby.State == queryState &&
        baby.Year >= yearStart &&
        baby.Year <= yearEnd)
    {
        results.Add(baby);
    }
}
results.Sort((b1, b2) =>
    b1.Year.CompareTo(b2.Year));
Manually parallelized:

IEnumerable<BabyInfo> babies = …;
var results = new List<BabyInfo>();
int partitionsCount = Environment.ProcessorCount;
int remainingCount = partitionsCount;
var enumerator = babies.GetEnumerator();
try {
    using (var done = new ManualResetEvent(false)) {
        for (int i = 0; i < partitionsCount; i++) {
            ThreadPool.QueueUserWorkItem(delegate {
                var partialResults = new List<BabyInfo>();
                while (true) {
                    BabyInfo baby;
                    lock (enumerator) {
                        if (!enumerator.MoveNext()) break;
                        baby = enumerator.Current;
                    }
                    if (baby.Name == queryName && baby.State == queryState &&
                        baby.Year >= yearStart && baby.Year <= yearEnd) {
                        partialResults.Add(baby);
                    }
                }
                lock (results) results.AddRange(partialResults);
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Year.CompareTo(b2.Year));
    }
}
finally
{
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}
PLINQ:

var results = from baby in babies.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
[Diagram: the parallel computing stack. Tools: Profiler, Concurrency Analysis, Parallel Debugger. Managed libraries: PLINQ and the Task Parallel Library over Data Structures, the ThreadPool, and its Task Scheduler and Resource Manager. Native libraries: the Parallel Pattern Library and Agents Library over the Concurrency Runtime's Task Scheduler and Resource Manager. Everything runs on Windows operating system threads.]
What is it?
Why is it good?
[Diagram: C#, VB, C++, F#, and other .NET compilers emit MSIL targeting TPL or CDS; a .NET program's declarative queries and parallel algorithms flow through the PLINQ execution engine down to threads on processors 1 through p.]

PLINQ Execution Engine:
• Query analysis
• Data partitioning: chunk, range, hash, striped, repartitioning
• Operator types: map, filter, sort, search, reduce, …
• Merging: buffering options, order preservation, inverted

Task Parallel Library:
• Loop replacements
• Imperative task parallelism
• Scheduling

Coordination Data Structures:
• Concurrent collections
• Synchronization types
• Coordination types
Work-Stealing Scheduler

[Diagram: the program submits tasks to a global queue; worker threads 1 through p each run tasks from their own local queues (task 1 … task 6), stealing from other queues when their own runs dry.]
• Thread-safe collections
• Locks
• Work exchange
• Initialization
• Phased operation
Wrap-up
Talk Recap
What the Future Holds
Programming Models

Safety:
• Current offerings have minimal impact (sharp knives)
• Three key themes:
  • Functional: immutable & pure
  • Safe imperative: isolated
  • Safe side-effects: transactions
• Verification tools

Patterns:
• Agents (CSPs) + tasks + data
• 1st-class isolated agents
• Raise the level of abstraction: what, not how
What the Future Holds
Efficiency and Heterogeneity

Efficiency:
• “Do no harm”: O(P) >= O(1)
• More static decision-making vs. dynamic
• Profile-guided optimizations

The future is heterogeneous:
• Chip multiprocessors are “easy”
• Out-of-order vs. in-order
• GPGPU (fusion of x86 with GPU)
• Vector ISAs
• Possibly different memory systems
All Programmers Will Not Be Parallel

Implicit parallelism:
• Use APIs that internally use parallelism
• Structured in terms of agents
• Apps, LINQ queries, etc.

Explicit parallelism, safe:
• Frameworks, DSLs, XSLT, sorting, searching

Explicit parallelism, unsafe:
• (Parallel Extensions, etc.)
In Conclusion

Opportunity and crisis: architects & senior developers, pay heed.
• Time to start thinking and experimenting
• Not yet for ubiquitous consumption [5 year horizon], but…
• Can make a real difference today in select places: embarrassingly parallel workloads
• Begin experimenting today
• Competitive advantage for those who grok it
• Less incentive for the client platform without it

Windows Vista + .NET 3.5: play with Parallel Extensions (.NET 4.0 and C++).
Exciting times!
Thank you.
Just released! Available at the PDC bookstore:
Concurrent Programming on Windows (Addison-Wesley), covering Win32 & the .NET Framework

Book Signing
Where: PDC bookstore
Date/Time: Wednesday, Oct. 29, 2:30PM – 3:00PM

msdn.com/concurrency
And download Parallel Extensions to the .NET Framework!
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.