DSWP-CARG - Computer Engineering Research Group

Automatic Thread Extraction
with Decoupled Software Pipelining
Presented by
Jeremy Cutler
with thanks to
Guilherme Ottoni, Ram Rangan, Adam Stoler, David I. August
Liberty Research Group
Department of Computer Science
Princeton University
http://www.liberty-research.org
Automatic Thread Extraction with DSWP
A Fundamental Change…
Transistor trend continues…
Clock rate limited by:
• Power delivery
• Heat dissipation
• Design complexity
Source: Intel, Wikipedia, Sutter/Dr. Dobbs Journal
The Response: CMP
For:
• legacy apps (C/C++)
• single-threaded
• sequential codes
Speedup over single core:
0.0%
Worse:
• Shared resources (e.g. caches)
• Trend toward simpler cores
Must Extract Thread Parallelism!
IBM Power 5 (1.9GHz) Die Photo: Source IBM
Existing Parallelization Approaches (Non-speculative)
Scientific Codes (FORTRAN-like): DOALL
for(i=1; i<=N; i++) // C
    a[i] = a[i] + 1; // X
General-purpose Codes (legacy C/C++): DOACROSS [Cytron, ICPP 86]
while(ptr = ptr->next) // LD
    ptr->val = ptr->val + 1; // X
Pipelined Parallelism for General-Purpose Codes
while(ptr = ptr->next) // LD
    ptr->val = ptr->val + 1; // X
DOACROSS
Decoupled Software Pipelining (DSWP): generalization of DOPIPE [Davies, UIUC 81]
Comparison: DOALL, DOACROSS, DSWP
                  DOALL           DOACROSS          DSWP
lat(comm) = 1:    1 iter/cycle    1 iter/cycle      1 iter/cycle
lat(comm) = 2:    1 iter/cycle    0.5 iter/cycle    1 iter/cycle
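The comparison above can be read as a simple steady-state model. The sketch below is an illustration under assumptions the slide does not state: every stage takes one cycle, and lat(comm) is the number of cycles one inter-thread transfer takes. The function names are hypothetical.

```c
#include <assert.h>

/* Steady-state throughput in iterations per cycle.
 * Assumed model: one-cycle stages, lat_comm >= 1 cycles per transfer. */

/* DOALL: iterations are independent, so nothing is communicated. */
double doall_throughput(int lat_comm) {
    assert(lat_comm >= 1);
    return 1.0;
}

/* DOACROSS: the loop-carried dependence hops between threads on every
 * iteration, so the communication latency is on the critical path. */
double doacross_throughput(int lat_comm) {
    assert(lat_comm >= 1);
    return 1.0 / (double)lat_comm;
}

/* DSWP: the loop-carried dependence stays inside one thread; values only
 * flow forward through the pipeline, so latency delays the first result
 * but not the steady-state rate. */
double dswp_throughput(int lat_comm) {
    assert(lat_comm >= 1);
    return 1.0;
}
```

With lat_comm set to 1 and 2 these functions reproduce the six entries of the comparison; other latencies are an extrapolation of the same model.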
Implementing Decoupled Software Pipelining (DSWP)
while(ptr = ptr->next)
    ptr->val = ptr->val + 1;
[Figure: the loop's dependence graph is condensed into its DAG_SCC and split into Thread 1 and Thread 2, linked by a communication queue. Legend: register, control, and memory dependences; intra-iteration and loop-carried.]
Inter-thread communication latency is a one-time cost
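A minimal software sketch of the two-thread split above, assuming POSIX threads and a mutex-protected bounded queue in place of the hardware communication queue. The names queue, produce, consume, traverse_stage, update_stage, and dswp_increment are hypothetical, and the queue depth of 32 is an arbitrary choice. This is not the code the compiler generates.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node { int val; struct node *next; };

#define QUEUE_SIZE 32 /* assumed depth of the communication queue */

/* Bounded blocking queue standing in for the inter-thread queue. */
struct queue {
    struct node *slot[QUEUE_SIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static void produce(struct queue *q, struct node *p) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_SIZE)          /* wait while the queue is full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slot[q->tail] = p;
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static struct node *consume(struct queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                   /* wait while the queue is empty */
        pthread_cond_wait(&q->not_empty, &q->lock);
    struct node *p = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return p;
}

struct stage_args { struct node *ptr; struct queue *q; };

/* Thread 1: the pointer chase (ptr = ptr->next) and the loop exit test. */
static void *traverse_stage(void *arg) {
    struct stage_args *a = arg;
    struct node *ptr = a->ptr;
    while ((ptr = ptr->next))
        produce(a->q, ptr);
    produce(a->q, NULL);                    /* forward the loop exit */
    return NULL;
}

/* Thread 2: the update ptr->val = ptr->val + 1. */
static void *update_stage(void *arg) {
    struct stage_args *a = arg;
    struct node *ptr;
    while ((ptr = consume(a->q)))
        ptr->val = ptr->val + 1;
    return NULL;
}

/* Runs the loop as a two-stage pipeline. Like the original loop, it
 * requires a non-NULL ptr and does not update the node ptr starts at. */
void dswp_increment(struct node *ptr) {
    struct queue q = { .head = 0, .tail = 0, .count = 0 };
    struct stage_args a = { ptr, &q };
    pthread_t t1, t2;
    assert(ptr != NULL);
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    pthread_create(&t1, NULL, traverse_stage, &a);
    pthread_create(&t2, NULL, update_stage, &a);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
}
```

Because the loop-carried pointer chase stays entirely in thread 1, values flow through the queue in one direction only, and thread 1 waits on thread 2 only when the queue fills up.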
Implementing Inter-Thread Control Dependences
Node Splitting
[Figure: node-splitting diagram with labels L1 and L2. Legend: register, control, and memory dependences; intra-iteration and loop-carried.]
Handling Arbitrary Control Flow:
Control Extensions to Dependence Graph
[Figure: CFG]
• Loop-iteration control dependences
    • Traditional definition of control dependence [Ferrante et al., TOPLAS 87] not appropriate for loops
• Conditional control dependences
    • To implement inter-thread data dependences that may or may not occur
• Multi-threaded code generation from the extended dependence graph
Evaluation
• DSWP implemented in the back-end of IMPACT compiler
• Accurate dual-core Itanium 2 model
• Synchronization Array support for comm./sync.
• ISA extended with produce/consume instructions
• Important application loops selected (16-98% total execution)
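The produce/consume instructions and the Synchronization Array are hardware mechanisms whose exact semantics are not given on this slide. As a rough software analogue, assuming one producer thread and one consumer thread per queue, a lock-free ring buffer built on C11 atomics might look like the sketch below. All names (sa_queue, sa_try_produce, sa_try_consume, SA_QUEUE_SLOTS) and the depth of 32 are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SA_QUEUE_SLOTS 32u /* assumed queue depth */
_Static_assert((SA_QUEUE_SLOTS & (SA_QUEUE_SLOTS - 1u)) == 0,
               "depth must be a power of two so counter wraparound is safe");

/* Single-producer, single-consumer queue. */
struct sa_queue {
    uint64_t slot[SA_QUEUE_SLOTS];
    atomic_uint head; /* count of consumed items; written by the consumer */
    atomic_uint tail; /* count of produced items; written by the producer */
};

void sa_queue_init(struct sa_queue *q) {
    atomic_init(&q->head, 0u);
    atomic_init(&q->tail, 0u);
}

/* Returns false, without storing anything, when the queue is full. */
bool sa_try_produce(struct sa_queue *q, uint64_t value) {
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == SA_QUEUE_SLOTS)
        return false;
    q->slot[tail % SA_QUEUE_SLOTS] = value;
    /* Release: the slot write becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
    return true;
}

/* Returns false, leaving *value untouched, when the queue is empty. */
bool sa_try_consume(struct sa_queue *q, uint64_t *value) {
    assert(value != NULL);
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = q->slot[head % SA_QUEUE_SLOTS];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return true;
}
```

This sketch reports a full or empty queue to the caller, who can retry. A hardware produce or consume would presumably stall instead, which is one reason dedicated instructions can be cheaper than a software queue.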
Evaluation
[Figure: bar chart of % loop speedup (y-axis from -10 to 50) for 129.compress, 179.art, 181.mcf, 183.equake, 188.ammp, 256.bzip2, adpcmdec, epicdec, jpegenc, wc, and GeoMean.]
Partitioning and Parallelism
[Figure: the DAG_SCC of 181.mcf under four partitions, each with a plot of queue occupancy (elements, 0 to 32) against time (cycles). Speedups of the four partitions: +45%, +48%, +43%, -2%.]
Currently use a simple load-balancing heuristic
Evaluation: Varying Processor Width
• Modified, half-width Itanium 2 models
[Figure: full-width and half-width configurations, shown for 1-Core and 2-Core (used by DSWP).]
[Figure: bar chart of % loop speedup (y-axis from -60 to 60) for 129.compress, 179.art, 181.mcf, 183.equake, 188.ammp, 256.bzip2, adpcmdec, epicdec, jpegenc, wc, and GeoMean. Series: Half-width Base, Half-width DSWP, Full-width DSWP.]
• On half-width model, speedup from DSWP is larger
• Better performance compatibility
• More effective on simpler cores
What about more threads?
while(ptr = ptr->next)
    ptr->val = ptr->val + 1;
1. Multiple SCCs
2. DOALL Consumer
[Figure: the loop's dependence graph split into a Producer thread feeding Consumer 1 and Consumer 2. Legend: register, control, and memory dependences; intra-iteration and loop-carried.]
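How many pipeline stages are available depends on how many SCCs the dependence graph contains. One way to count them, sketched here for a small graph stored as an adjacency matrix, is to compute reachability with Warshall's transitive closure and group the nodes that reach each other. find_sccs, MAX_NODES, and the matrix layout are hypothetical; a production compiler would more likely use a linear-time algorithm such as Tarjan's.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 16 /* assumed upper bound on graph size for this sketch */

/* Labels each node with an SCC id in scc_of[] and returns the number of
 * SCCs. dep[i][j] is true when node j depends on node i. */
int find_sccs(int n, bool dep[MAX_NODES][MAX_NODES], int scc_of[MAX_NODES]) {
    bool reach[MAX_NODES][MAX_NODES];
    assert(n >= 0 && n <= MAX_NODES);

    /* Start from the direct dependences; every node reaches itself. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            reach[i][j] = (i == j) || dep[i][j];

    /* Warshall's transitive closure: allow paths through node k. */
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = true;

    /* Nodes that reach each other lie on a dependence cycle, so they
     * belong to the same SCC. */
    int count = 0;
    for (int i = 0; i < n; i++)
        scc_of[i] = -1;
    for (int i = 0; i < n; i++) {
        if (scc_of[i] != -1)
            continue;
        for (int j = i; j < n; j++)
            if (reach[i][j] && reach[j][i])
                scc_of[j] = count;
        count++;
    }
    return count;
}
```

As an illustration, suppose the loop above is broken into five nodes: the pointer load, the loop branch, the load of val, the add, and the store. Suppose also that distinct list nodes never alias. Under those assumptions the pointer load and the branch form one SCC through the loop-carried cycle, and the other three nodes are singleton SCCs, for four in total.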
Breaking SCCs: Speculative DSWP
164.gzip: 38% speedup with 3 threads
Only one SCC!
181.mcf: 2.9x speedup with 4 threads
[Figure: diagram with two items marked "x" and the label "Mis-speculation detected".]
Conclusion
• DSWP extracts pipelined thread-level parallelism from general-purpose, sequential programs
• More applicable than traditional parallelization techniques
• Handles arbitrary control flow
• Future research directions:
    • Additional analyses and optimizations
    • Break dependence cycles: code transformations, speculation
    • Reduce communication
    • Explore DOALL-consumer opportunities