Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Michael Gordon, William Thies, and
Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS
October 2006
San Jose, CA
http://cag.csail.mit.edu/streamit
Multicores Are Here!

[Chart: number of cores (log scale, 1 to 512) versus year (1970 to 20??), plotting the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PExtreme, Power6, Yonah, PA-8800, Opteron, Xeon MP, Opteron 4P, Xbox360, Cell, Tanglewood, Raw, Niagara, Broadcom 1480, Raza XLR, Cavium Octeon, Cisco CSR-1, Intel Tflops, Picochip PC102, and Ambric AM2045; core counts rise sharply after 2000.]
Multicores Are Here!

For uniprocessors, C is the common machine language. C was:
• Portable
• High Performance
• Composable
• Malleable
• Maintainable

[Same cores-versus-year chart as the previous slide.]
Multicores Are Here!

What is the common machine language for multicores?

[Same cores-versus-year chart as the previous slide.]
Common Machine Languages

Uniprocessors:
  Common properties: single flow of control; single memory image
  Differences: register file, functional units, register allocation, instruction selection, instruction scheduling (all hidden behind the ISA)

Multicores:
  Common properties: multiple flows of control; multiple local memories
  Differences: number and capabilities of cores, communication model, synchronization model

von-Neumann languages represent the common properties and abstract away the differences. We need common machine language(s) for multicores.
Streaming as a Common Machine Language

• Regular and repeating computation
• Independent filters with explicit communication
  – Segregated address spaces and multiple program counters
• Natural expression of parallelism:
  – Producer / Consumer dependencies
  – Enables powerful, whole-program transformations

[Figure: FM radio stream graph: AtoD → FMDemod → Scatter → LPF1, LPF2, LPF3, HPF1, HPF2, HPF3 → Gather → Adder → Speaker.]
Types of Parallelism

Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship
  – Traditionally: thread (fork/join) parallelism

Data Parallelism
  – Between iterations of a stateless filter
  – Peel iterations of filter, place within scatter/gather pair (fission; sketched below)
  – Can't parallelize filters with state
  – Traditionally: data parallel loop (forall)

Pipeline Parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized
  – Traditionally: usually exploited in hardware

[Figure: stream graph annotated with task parallelism across splitjoin branches, data parallelism within a scatter/gather pair, and pipeline parallelism along producer/consumer chains.]
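To make fission concrete, here is a minimal StreamIt sketch; the filter Halve, its body, and the width 4 are illustrative, not from the talk:

// Hypothetical fission of a stateless pop-1/push-1 filter: four
// replicas placed between a round-robin scatter and gather.
float->float filter Halve {
  work push 1 pop 1 {
    // Stateless: each output depends only on the current input item
    push(pop() * 0.5);
  }
}

float->float splitjoin FissedHalve {
  // Items are dealt round-robin to the replicas and collected in the
  // same order, so the fissed graph matches a single Halve's output
  split roundrobin;
  for (int i = 0; i < 4; i++) add Halve();
  join roundrobin;
}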
Problem Statement
Given:
– Stream graph with compute and communication
estimate for each filter
– Computation and communication resources of
the target machine
Find:
– Schedule of execution for the filters that best
utilizes the available parallelism to fit the
machine resources
Our 3-Phase Solution

[Figure: three phases in sequence: Coarsen Granularity → Data Parallelize → Software Pipeline.]

1. Coarsen: Fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters

Compile to a 16 core architecture
  – 11.2x mean throughput speedup over single core
Outline
• StreamIt language overview
• Mapping to multicores
– Baseline techniques
– Our 3-phase solution
The StreamIt Project

• Applications
  – DES and Serpent [PLDI 05]
  – MPEG-2 [IPDPS 06]
  – SAR, DSP benchmarks, JPEG, …
• Programmability
  – StreamIt Language (CC 02)
  – Teleport Messaging (PPOPP 05)
  – Programming Environment in Eclipse (P-PHEC 05)
• Domain Specific Optimizations
  – Linear Analysis and Optimization (PLDI 03)
  – Optimizations for bit streaming (PLDI 05)
  – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations
  – Compiling for Communication-Exposed Architectures (ASPLOS 02)
  – Phased Scheduling (LCTES 03)
  – Cache Aware Optimization (LCTES 05)
  – Load-Balanced Rendering (Graphics Hardware 05)

[Figure: compiler flow: StreamIt Program → Front-end → Annotated Java → Stream-Aware Optimizations → backends: Uniprocessor (C/C++), Cluster (MPI-like C/C++), Raw (C per tile + msg code), IBM X10 (Streaming X10 runtime); the Annotated Java can also run in a simulator (Java library).]
Model of Computation

• Synchronous Dataflow [Lee ‘92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
• Static I/O rates
  – Compiler decides on an order of execution (schedule); see the example below
  – Static estimation of computation

[Figure: example graph: A/D → Band Pass → Detect → LED, with a Duplicate splitter fanning out to three Detect → LED chains.]
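To illustrate static I/O rates and compile-time scheduling, here is a hypothetical sketch (filters A and B and their rates are illustrative, not from the talk). A pushes 2 items per firing and B pops 3, so the compiler can derive the steady-state schedule {3 A; 2 B}: 3 x 2 items pushed equals 2 x 3 items popped.

void->void pipeline RateExample {
  add A();
  add B();
}

void->int filter A {
  int x;
  init { x = 0; }
  work push 2 {
    // Static rate: exactly 2 items pushed per firing
    push(x);
    push(x + 1);
    x = x + 2;
  }
}

int->void filter B {
  work pop 3 {
    // Static rate: exactly 3 items popped per firing
    pop();
    pop();
    println(pop());
  }
}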
Example StreamIt Filter

[Figure: FIR consuming a 12-item input tape through a sliding peek window and producing an output tape.]

float->float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateless: the sliding window lives on the input tape (peek N), not in filter fields, so iterations of the work function are independent.
Example StreamIt Filter

[Figure: the same FIR over input and output tapes.]

float->float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateful: weights is now a mutable field updated on every firing, so iterations of the work function are no longer independent.
StreamIt Language Overview

• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable
    – Simple structures composed to create complex graphs (see the sketch below)
  – Malleable
    – Change program behavior with small modifications

[Figure: hierarchical constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter → parallel computation → joiner); feedback loop (joiner → body → splitter).]
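A minimal, self-contained sketch of how these constructs compose; all names and filter bodies here are illustrative, not from the talk:

void->void pipeline Minimal {
  add Source();
  add splitjoin {
    split duplicate;   // splitter: copy each item to both branches
    add Scale(2.0);    // parallel computation, one filter per branch
    add Scale(3.0);
    join roundrobin;   // joiner: interleave the branch outputs
  };
  add Printer();
}

void->float filter Source {
  float x;
  init { x = 0; }
  work push 1 { push(x); x = x + 1; }
}

float->float filter Scale(float k) {
  work push 1 pop 1 { push(k * pop()); }
}

float->void filter Printer {
  work pop 1 { println(pop()); }
}

Small modifications to this graph (adding a branch to the splitjoin, swapping a filter) change program behavior without disturbing the rest, which is what the slide means by malleability.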
Outline
• StreamIt language overview
• Mapping to multicores
– Baseline techniques
– Our 3-phase solution
Baseline 1: Task Parallelism

• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
• Execute this on a 2 core machine: ~2x speedup over single core
• What about 4, 16, 1024, … cores?

[Figure: Splitter → two parallel pipelines (BandPass → Compress → Process → Expand → BandStop) → Joiner → Adder.]
Evaluation: Task Parallelism

Raw microprocessor:
  – 16 in-order, single-issue cores with D$ and I$
  – 16 memory banks, each bank with DMA
  – Cycle-accurate simulator

Parallelism: not matched to target! Synchronization: not matched to target!

[Bar chart: throughput normalized to single-core StreamIt for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Baseline 2: Fine-Grained Data Parallelism

• Each of the filters in the example is stateless
• Fine-grained Data Parallel Model:
  – Fiss each stateless filter N ways (N is number of cores)
  – Remove scatter/gather if possible
• We can introduce data parallelism
  – Example: 4 cores
• Each fission group occupies entire machine

[Figure: the task-parallel graph with every filter (BandPass, Compress, Process, Expand, BandStop, Adder) fissed 4 ways, each between its own splitter/joiner pair.]
Evaluation: Fine-Grained Data Parallelism

Good parallelism! Too much synchronization!

[Bar chart: throughput normalized to single-core StreamIt comparing Task vs. Fine-Grained Data across BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Outline
• StreamIt language overview
• Mapping to multicores
– Baseline techniques
– Our 3-phase solution
Phase 1: Coarsen the Stream Graph

• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream

[Figure: Splitter → two pipelines (BandPass (peek) → Compress → Process → Expand → BandStop (peek)) → Joiner → Adder; peeking filters marked.]
Phase 1: Coarsen the Stream Graph

• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state (see the sketch below)
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
• Benefits:
  – Reduces global communication and synchronization
  – Exposes inter-node optimization opportunities

[Figure: the coarsened graph: Splitter → two fused BandPass+Compress+Process+Expand filters followed by BandStop filters → Joiner → Adder.]
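A hypothetical before/after sketch of what fusion produces; filters F and G and their bodies are illustrative, not from the talk:

// Before coarsening: two stateless filters joined by a FIFO channel
float->float filter F {
  work push 1 pop 1 { push(pop() * 2); }
}

float->float filter G {
  work push 1 pop 1 { push(pop() + 1); }
}

// After coarsening: one fused filter runs both bodies per firing;
// the intermediate value stays in a local variable, so the channel
// between F and G (and its synchronization) disappears
float->float filter FusedFG {
  work push 1 pop 1 {
    float t = pop() * 2;   // F's body
    push(t + 1);           // G's body
  }
}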
Phase 2: Data Parallelize

Data parallelize for 4 cores: fiss the Adder 4 ways, to occupy the entire chip.

[Figure: the coarsened graph with the Adder replaced by Splitter → 4 Adders → Joiner.]
Phase 2: Data Parallelize

Task parallelism! Each fused filter does equal work, so fiss each fused filter 2 times to occupy the entire chip.

[Figure: each fused BandPass+Compress+Process+Expand filter fissed 2 ways inside its own splitter/joiner pair; the Adder remains fissed 4 ways.]
Phase 2: Data Parallelize

• Task-conscious data parallelization (sketched below)
  – Preserve task parallelism
• Benefits:
  – Reduces global communication and synchronization

Task parallelism, each filter does equal work: fiss each filter 2 times to occupy the entire chip.

[Figure: the final data-parallelized graph: fused filters and BandStops each fissed 2 ways within their branches, Adder fissed 4 ways.]
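A hedged StreamIt sketch of task-conscious fission; the names, bodies, and widths are illustrative, not from the talk. Each task-parallel branch keeps a local 2-way scatter/gather for its fused filter, rather than fissing every filter across the whole machine:

float->float filter FusedStage(float k) {
  // Stand-in for one fused BandPass+Compress+Process+Expand stage
  work push 1 pop 1 { push(pop() * k); }
}

float->float splitjoin TaskConsciousFission {
  split duplicate;       // the original task parallelism: two branches
  add splitjoin {        // branch 1: local 2-way fission
    split roundrobin;
    add FusedStage(0.5);
    add FusedStage(0.5);
    join roundrobin;
  };
  add splitjoin {        // branch 2: local 2-way fission
    split roundrobin;
    add FusedStage(0.8);
    add FusedStage(0.8);
    join roundrobin;
  };
  join roundrobin;
}

Keeping the scatter/gather local to each branch is what reduces the global communication and synchronization called out above.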
Evaluation: Coarse-Grained Data Parallelism

Good parallelism! Low synchronization!

[Bar chart: throughput normalized to single-core StreamIt comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data across BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Simplified Vocoder

[Figure: stream graph with per-filter work estimates: Splitter → two AdaptDFT (6 each, data parallel) → Joiner → RectPolar (20, data parallel) → Splitter → two pipelines of Unwrap (2) → Diff (1) → Amplify (1) → Accum (1) (data parallel, but too little work!) → Joiner → PolarRect (20, data parallel).]

Target a 4 core machine.
Data Parallelize

[Figure: the same graph with RectPolar and PolarRect each fissed 4 ways (work 20 → 5 per replica) inside splitter/joiner pairs.]

Target a 4 core machine.
Data + Task Parallel Execution

[Figure: the resulting schedule on 4 cores (cores versus time); one steady state takes 21 time units, with idle slots around the low-work filters.]

Target 4 core machine.
We Can Do Better!

[Figure: the same filters repacked onto 4 cores in only 16 time units; achieving this packing requires relaxing the producer/consumer dependencies within a single iteration. RectPolar highlighted.]

Target 4 core machine.
Phase 3: Coarse-Grained Software Pipelining

[Figure: after a prologue, the loop reaches a new steady state in which filter executions come from different iterations; the RectPolar replicas overlap.]

• New steady-state is free of dependencies
• Schedule new steady-state using a greedy partitioning

Greedy Partitioning

[Figure: the filters to schedule, greedily packed onto 4 cores; the resulting steady state takes 16 time units.]

Target 4 core machine.
Evaluation: Coarse-Grained Task + Data + Software Pipelining

Best parallelism! Lowest synchronization!

[Bar chart: throughput normalized to single-core StreamIt comparing Task, Fine-Grained Data, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline across BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.]
Generalizing to Other Multicores
• Architectural requirements:
– Compiler controlled local memories with DMA
– Efficient implementation of scatter/gather
• To port to other architectures, consider:
– Local memory capacities
– Communication to computation tradeoff
• Did not use processor-to-processor
communication on Raw
Related Work
• Streaming languages:
– Brook [Buck et al. ’04]
– StreamC/KernelC [Kapasi ’03, Das et al. ’06]
– Cg [Mark et al. ‘03]
– SPUR [Zhang et al. ‘05]
• Streaming for Multicores:
– Brook [Liao et al., ’06]
• Ptolemy [Lee ’95]
• Explicit parallelism:
– OpenMP, MPI, & HPF
Conclusions

• Streaming model naturally exposes task, data, and pipeline parallelism
• This parallelism must be exploited at the correct granularity and combined correctly:

                    Task          Fine-Grained   Coarse-Grained   Coarse-Grained Task +
                                  Data           Task + Data      Data + Software Pipeline
  Parallelism       Not matched   Good           Good             Best
  Synchronization   Not matched   High           Low              Lowest

• Good speedups across varied benchmark suite
• Algorithms should be applicable across multicores
Download StreamIt at http://cag.csail.mit.edu/streamit