Rethinking Parallel Execution
Guri Sohi
(along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison
Outline
• From sequential to multicore
• Reminiscing: Instruction Level Parallelism (ILP)
• Canonical parallel processing and execution
• Rethinking canonical parallel execution
• Dynamic Serialization
• Consequences of Dynamic Serialization
• Wrap up
Microprocessor Generations
• Generation 1: Serial
• Generation 2: Pipelined
• Generation 3: Instruction-level Parallel (ILP)
• Generation 4: Multiple processing cores
Microprocessor Generations
Gen 1: Sequential (1970s) → Gen 2: Pipelined (1980s) → Gen 3: ILP (1990s) → Gen 4: Multicore (2000s)
From One Generation to the Next
• Significant debate and research
– New solutions proposed
– Old solutions adapt in interesting ways to become
viable or even better than new solutions
• Solutions that involve changes “under the hood”
end up winning over others
From One Generation to the Next
• From Sequential to Pipelined
– RISC (MIPS, Sun SPARC, Motorola 88k, IBM PowerPC)
vs. CISC (Intel x86)
– CISC architectures learned and employed RISC
innovations
• From Pipelined to Instruction-Level Parallel
– Statically scheduled VLIW/EPIC
– Dynamically scheduled superscalar
From One Generation to the Next
• From ILP to Multicore
– Parallelism based upon canonical parallel execution
model
– Overcome constraints to canonical parallelization
• Thread-level speculation (TLS)
• Transactional memory (TM)
Reminiscing about ILP
• Late 1980s to mid 1990s
• Search for “post RISC” architecture
– More accurately, instruction processing model
• Desire to do more than one instruction per
cycle—exploit ILP
• Majority school of thought: VLIW/EPIC
• Minority: out-of-order (OOO) superscalar
VLIW/EPIC School
• Parallel execution requires a parallel ISA
• Parallel execution determined statically (by
compiler)
• Parallel execution expressed in static program
• Take program/algorithm parallelism and mold it to a given
execution schedule
VLIW/EPIC School
• Creating effective parallel representations
(statically) introduces several problems
– Predication
– Statically scheduling loads
– Exception handling
– Recovery code
• Lots of research addressing these problems
• Intel and HP pushed it as their future (Itanium)
OOO Superscalar
• Create dynamic parallel execution from
sequential static representation
– dynamic dependence information accurate
– execution schedule flexible
• None of the problems associated with trying to
create a parallel representation statically
• Natural growth path with no demands on
software
Lessons from ILP Generation
• Significant consequences of trying to statically
detect and express parallelism
• Techniques that make “under the hood” changes
are the winners
– Even though they may have some
drawbacks/overheads
The Multicore Generation
• How to achieve parallel execution on multiple
processors?
• Solution critical to the long-term health of the
computer and information technology industry
• And thus the economy and society as we know it
The Multicore Generation
• How to achieve parallel execution on multiple
processors?
• Over four decades of conventional wisdom in
parallel processing
– Mostly in the scientific application/HPC arena
– Use this as the basis
Parallel Execution Requires a Parallel Representation
Canonical Parallel Execution Model
A: Analyze program to identify independence in program
– independent portions executed in parallel
B: Create static representation of independence
– synchronization to satisfy independence assumption
C: Dynamic parallel execution unwinds as per static representation (see the sketch below)
– potential consequences due to static assumptions
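To make the A/B/C steps concrete, here is a minimal sketch using portable C++ threads (a hypothetical example, not from the talk): the split into independent halves is decided ahead of time by the programmer (A), the one known dependence, a shared counter, gets explicit synchronization in the static representation (B), and the dynamic execution simply unwinds that fixed partition (C).

#include <functional>
#include <mutex>
#include <thread>
#include <vector>

std::mutex counter_lock;   // B: synchronization for the one known dependence
long positives = 0;        // shared state, guarded by counter_lock

void count_positives(const std::vector<int>& data, size_t begin, size_t end) {
    long local = 0;
    for (size_t i = begin; i < end; ++i)
        if (data[i] > 0) ++local;   // A: iterations assumed independent
    std::lock_guard<std::mutex> g(counter_lock);
    positives += local;
}

int main() {
    std::vector<int> data(1000, 1);
    size_t mid = data.size() / 2;
    // C: execution unwinds exactly as the static partition dictates
    std::thread t0(count_positives, std::cref(data), (size_t)0, mid);
    std::thread t1(count_positives, std::cref(data), mid, data.size());
    t0.join();
    t1.join();
}

Note that the partition and the synchronization are both frozen into the program text before any input is seen, which is exactly the property the following slides question.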
Canonical Parallel Execution Model
• Like VLIW/EPIC, the canonical model creates a variety
of problems that have led to a vast body of research
– identifying independence
– creating static representation
– dynamic unwinding
Identifying Independence
• Static program analysis
– Over four decades of work
• Hard to identify statically
– Inherently dynamic properties
– Must be conservative statically
• Need to identify dependence in order to identify
independence
Creating Static Representation
• Parallel representation for guaranteed
independent work
• Insert synchronization for potential dependences
– Conservative synchronization moves parallel
execution towards sequential execution
Dynamic Unwinding
• Non-determinism
– Changes to program state may not be repeatable
• Race conditions
• Several startup companies exist to deal with this
problem (a minimal example follows)
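A minimal illustration of the problem (hypothetical code, not from the talk): two threads increment an unguarded counter, so the interleaving, and hence the final value, can change from run to run.

#include <iostream>
#include <thread>

int counter = 0;                  // shared and unsynchronized: a data race

void bump() {
    for (int i = 0; i < 100000; ++i)
        ++counter;                // non-atomic read-modify-write
}

int main() {
    std::thread t0(bump), t1(bump);
    t0.join();
    t1.join();
    std::cout << counter << '\n'; // may print anything from 100000 to 200000
}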
Conventional Wisdom
Parallel Execution Requires a Parallel Representation
Consequences:
• Must create parallel representation
• For correct execution, must statically identify:
– Independence for parallel representation
– Dependence for synchronization
• Source of enormous difficulty and complexity
– Generally functions of input to program
– Inherently dynamic properties
Current Approaches
• Stick with canonical model and try to overcome
limitations
• Thread Level Speculation (TLS) and Transactional
Memory (TM)
• Techniques to let the programmer write sequential code
but automatically generate a parallel representation
• Techniques to handle non-determinism and race
conditions.
TLS and TM
• Overcome major constraint to creating static
parallel representation
• Likely in several upcoming microprocessors
– Our work in the mid 1990s will be a key enabler
• Already in Sun MAJC, NEC Merlot, Sun Rock
Static Program Representation

Issue               Sequential   Parallel
Bugs                Yes          Yes (more)
Data races          No           Yes
Locks/Synch         No           Yes
Deadlock            No           Yes
Nondeterminism      No           Yes
Parallel Execution  ?            Yes

• Can we get parallel execution without a parallel representation? Yes
• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
Serialization Sets: What?
• Sequential program representation and dynamic parallel execution
– No static representation of independence
– No locks and no explicit synchronization
• “Under the hood” runtime system dynamically determines and orders dependent computations
– Independence, and thus parallelism, falls out as a side effect
• Comparable or better performance than conventional parallel models
How? Big Picture
• Write program in a well-structured object-oriented style
– Method operates on data of associated object (ver. 1)
• Identify parts of program for potential parallel execution
– Make suitable annotations as needed
• Dynamically determine data object touched by selected code
– Identify dependence
• Program thread assigns selected code to bins
How? Big Picture
• Serialize computations to same object
– Enforce dependence
– Assign them to same bin; delegate thread executes computations in same bin sequentially (sketched below)
• Do not look for/represent independence
– Falls out as an effect of enforcing dependence
– Computations in different bins execute in parallel
• Updates to given state occur in same order as in sequential program
– Determinism
– No races
– If the sequential program is correct, the parallel execution is correct (same input)
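One way to picture the binning just described, as a hypothetical sketch rather than the Prometheus implementation: the target object's identity picks a bin, one worker drains each bin in FIFO order, so computations on the same object serialize while computations in different bins overlap. All names here (BinnedExecutor, delegate) are illustrative.

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class BinnedExecutor {
    struct Bin {
        std::mutex m;
        std::condition_variable cv;
        std::deque<std::function<void()>> q;
        bool done = false;
    };
    std::vector<Bin> bins;
    std::vector<std::thread> workers;

public:
    explicit BinnedExecutor(size_t n) : bins(n) {
        for (size_t i = 0; i < n; ++i)
            workers.emplace_back([this, i] {
                Bin& b = bins[i];
                for (;;) {
                    std::unique_lock<std::mutex> lk(b.m);
                    b.cv.wait(lk, [&] { return !b.q.empty() || b.done; });
                    if (b.q.empty()) return;        // done and drained
                    auto task = std::move(b.q.front());
                    b.q.pop_front();
                    lk.unlock();
                    task();   // same-object tasks run in arrival (program) order
                }
            });
    }

    // Delegate: the object's address picks the bin, enforcing per-object order.
    void delegate(const void* obj, std::function<void()> task) {
        Bin& b = bins[reinterpret_cast<std::uintptr_t>(obj) % bins.size()];
        {
            std::lock_guard<std::mutex> g(b.m);
            b.q.push_back(std::move(task));
        }
        b.cv.notify_one();
    }

    ~BinnedExecutor() {
        for (auto& b : bins) {
            std::lock_guard<std::mutex> g(b.m);
            b.done = true;
            b.cv.notify_one();
        }
        for (auto& w : workers) w.join();
    }
};

Two distinct objects may hash to the same bin; that only over-serializes the execution, it never breaks correctness, which is the same reason the scheme stays race-free.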
Big Picture

[Figure: the program thread dispatches delegated computations to delegate threads 0, 1, and 2.]
Serialization Sets: How?
• Sequential program with annotations
– Identify potentially independent methods
– Associate a serializer with each object to express dependence
• Serializer groups dependent method invocations into a serialization set
– Runtime executes them in order to honor dependences
• Independent method invocations go into different sets
– Runtime opportunistically parallelizes their execution
Example: Debit/Credit Transactions

trans_t* trans;
while ((trans = get_trans ()) != NULL) {    // # of transactions?
    account_t* account = trans->account;    // Points to?
    if (trans->type == DEPOSIT)
        account->deposit (trans->amount);
    else if (trans->type == WITHDRAW)
        account->withdraw (trans->amount);
}                                           // Loop-carried dependence?

Several static unknowns!
Multithreading Strategy

1) Read all transactions into an array
2) Divide array chunks among multiple threads

• Oblivious to what accounts each thread may access!
• Methods must lock account to ensure mutual exclusion (a runnable sketch follows)

// Each thread runs over its own chunk [start, end) of the array:
for (i = start; i < end; i++) {
    account_t* account = trans[i]->account;
    if (trans[i]->type == DEPOSIT)
        account->deposit (trans[i]->amount);
    else if (trans[i]->type == WITHDRAW)
        account->withdraw (trans[i]->amount);
}
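Filled out as runnable C++ (types and names hypothetical, not from the talk), the lock-based strategy looks like this; every method takes the account's own lock precisely because the chunking is oblivious to which accounts a thread will touch.

#include <mutex>
#include <vector>

enum trans_type { DEPOSIT, WITHDRAW };

struct account_t {
    std::mutex m;                 // per-account lock for mutual exclusion
    double balance = 0;
    void deposit(double amt)  { std::lock_guard<std::mutex> g(m); balance += amt; }
    void withdraw(double amt) { std::lock_guard<std::mutex> g(m); balance -= amt; }
};

struct trans_t { trans_type type; account_t* account; double amount; };

// Each thread processes its own chunk [begin, end) of the pre-read array.
void process_chunk(std::vector<trans_t>& trans, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) {
        if (trans[i].type == DEPOSIT) trans[i].account->deposit(trans[i].amount);
        else                          trans[i].account->withdraw(trans[i].amount);
    }
}

Even with the locks, deposits and withdrawals on the same account can interleave in a different order on each run: exactly the non-determinism the next slides eliminate.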
Example with Serialization Sets

private<account_t> private_account_t;        // Declare wrapped account type

trans_t* trans;
begin_nest ();                               // Initiate nesting level
while ((trans = get_trans ()) != NULL) {
    private_account_t* account = trans->account;
    if (trans->type == DEPOSIT)
        account->delegate (deposit, trans->amount);
    else if (trans->type == WITHDRAW)
        account->delegate (withdraw, trans->amount);
}
end_nest ();                                 // End nesting level, implicit barrier

At execution, delegate:
1) Creates method invocation structure
2) Gets serializer pointer from base class
3) Enqueues invocation in serialization set

delegate indicates potentially-independent operations (an emulation sketch follows).
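For intuition only, the same loop written against the hypothetical BinnedExecutor from the earlier sketch (again, not the Prometheus API), building on the trans_t and account_t types defined above: delegating by account pointer preserves per-account order with no locks in the account code.

void run(BinnedExecutor& exec, std::vector<trans_t>& trans) {
    for (trans_t& t : trans) {
        account_t* acct = t.account;
        double amt = t.amount;
        // No lock needed: same-account updates are serialized by the bin.
        if (t.type == DEPOSIT)
            exec.delegate(acct, [acct, amt] { acct->balance += amt; });
        else
            exec.delegate(acct, [acct, amt] { acct->balance -= amt; });
    }
    // Prometheus's end_nest() is an implicit barrier; this sketch only drains
    // its bins when the executor is destroyed.
}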
[Figure: the program context issues eight delegate calls; each invocation lands in the
serialization set for its account: SS #100 (deposit $2000, withdraw $50, withdraw $20,
deposit $300), SS #200 (withdraw $1000, withdraw $1000), SS #300 (withdraw $350,
deposit $5000).]
[Figure: delegate threads 0 and 1 drain the serialization sets; invocations within each
set (SS #100, SS #200, SS #300) execute sequentially in program order, while different
sets proceed in parallel across the delegate threads.]

Race-free, determinate execution without synchronization!
Prometheus: C++ Library for SS
• Template library
– Compile-time instantiation of SS data structures
– Metaprogramming for static type checking
• Runtime orchestrates parallel execution
• Portable
– x86, x86_64, SPARC V9
– Linux, Solaris
Prometheus Runtime
• Version 1.0
– Dynamically extracts parallelism
– Statically scheduled
– No nested parallelism
• Version 2.0
– Dynamically extracts parallelism
– Dynamically scheduled
• Work-stealing scheduler (see the sketch below)
– Supports nested parallelism
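The flavor of the v2.0 scheduler can be suggested with a minimal work-stealing queue (a hypothetical sketch, not Prometheus code): the owning worker pushes and pops at one end for locality, and idle workers steal from the other end.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

class WorkStealingQueue {
    std::deque<std::function<void()>> q;
    mutable std::mutex m;

public:
    void push(std::function<void()> task) {          // owner: push at the front
        std::lock_guard<std::mutex> g(m);
        q.push_front(std::move(task));
    }
    std::optional<std::function<void()>> pop() {     // owner: LIFO for locality
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.front());
        q.pop_front();
        return t;
    }
    std::optional<std::function<void()>> steal() {   // thief: FIFO from the back
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        auto t = std::move(q.back());
        q.pop_back();
        return t;
    }
};

Production schedulers use lock-free Chase-Lev deques for this; a single mutex keeps the sketch short.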
Network Packet Classification

packet_t* packet;
classify_t* classifier;
vector<int> ruleCount(num_rules);
vector<packet_queue_t> packet_queues;
int packetCount = 0;
int ruleID;

for (i = 0; i < packet_queues.size(); i++) {
    while ((packet = packet_queues[i].get_pkt()) != NULL) {
        ruleID = classifier->softClassify (packet);
        ruleCount[ruleID]++;
        packetCount++;
    }
}
Example with Serialization Sets

private<classify_t> private_classify_t;      // Declare wrapped classifier type
vector<private_classify_t> classifiers;
int packetCount = 0;
vector<int> ruleCount(numRules, 0);
int size = packet_queues.size();

begin_nest ();
for (i = 0; i < size; i++) {
    classifiers[i].delegate (&classify_t::softClassify,
                             packet_queues[i]);
}
end_nest ();                                 // implicit barrier

for (i = 0; i < size; i++) {                 // reduce per-classifier counts
    ruleCount += classifiers[i].getRuleCount();   // element-wise accumulate
    packetCount += classifiers[i].getPacketCount();
}
Packet Classification (No Locks!)

[Chart: speedup wrt sequential execution vs. computation granularity (number of
packets per grain, 1 to 2048) on AMD Barcelona (4 and 8 sockets) and Intel Nehalem
(1 and 2 sockets); y-axis 0 to 26.]
Network Intrusion Detection
• Very common networking application
• Most common program used: Snort
– Open source version (like Linux)
– But also commercial versions (Sourcefire)
• Basic structure of computation also found in many
other deep packet inspection applications
– E.g., packet de-duplication (Riverbed)
Snort Speedup

[Chart: speedup wrt sequential execution on a 4-core Intel Nehalem, 4-core AMD
Barcelona, 8-core Intel Nehalem, and 16-core AMD Barcelona; y-axis 0 to 12.]
Other Applications
• Benchmarks
– Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional Parallelization
– pthreads, OpenMP
• Prometheus versions
– Port each program to a sequential C++ program
– Idiomatic C++: OO, inheritance, STL
– Parallelize with serialization sets
Statically Scheduled Results

[Chart: speedup relative to the original sequential program, Conventional Parallel vs.
Serialization Sets, across the benchmarks; labeled speedups range from 3.9 to 18.1,
with the serialization-sets versions generally matching or exceeding the conventional
parallel versions.]

4 Socket AMD Barcelona (4-way multicore) = 16 total cores
Statically Scheduled Results

[Chart: speedup relative to the original sequential program, Conventional Parallel vs.
Serialization Sets, on four platforms: AMD Barcelona Multicore (4), AMD Barcelona
ccNUMA (16), Sun UltraSPARC T1 Multicore (32), and Sun UltraSPARC III+ SMP (8);
labeled speedups: 8.7, 8.2, 7.0, 8.3, 6.4, 6.1, 4.0, 3.8.]
Summary
• Sequential program with annotations
– No explicit synchronization, no locks
• Programmers focus on keeping computation
private to object state
– Consistent with OO programming practices
• Dependence-based model
– Determinate race-free parallel execution
• Do as well or better than incumbents but without
their negatives
• Can do things that are very hard for incumbents