MIT OpenCourseWare
http://ocw.mit.edu
6.189 Multicore Programming Primer, January (IAP) 2007
Please use the following citation format:
Saman Amarasinghe, 6.189 Multicore Programming Primer, January
(IAP) 2007. (Massachusetts Institute of Technology: MIT
OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY).
License: Creative Commons Attribution-Noncommercial-Share Alike.
Note: Please use the actual date you accessed this material in your citation.
For more information about citing these materials or our Terms of Use, visit:
http://ocw.mit.edu/terms
6.189 IAP 2007
Lecture 12
StreamIt Parallelizing Compiler
Prof. Saman Amarasinghe, MIT.
1
6.189 IAP 2007 MIT
Common Machine Language
● Represent common properties of architectures
  • Necessary for performance
● Abstract away differences in architectures
  • Necessary for portability
● Cannot be too complex
  • Must keep in mind the typical programmer
● C and Fortran were the common machine languages for uniprocessors
  • Imperative languages are not the correct abstraction for parallel architectures
● What is the correct abstraction for parallel multicore machines?
Prof. Saman Amarasinghe, MIT.
2
6.189 IAP 2007 MIT
Common Machine Language for Multicores
● Current offerings:
  • OpenMP
  • MPI
  • High Performance Fortran
● Explicit parallel constructs grafted onto imperative language
● Language features obscured:
  • Composability
  • Malleability
  • Debugging
● Huge additional burden on programmer:
  • Introducing parallelism
  • Correctness of parallelism
  • Optimizing parallelism
Prof. Saman Amarasinghe, MIT.
3
6.189 IAP 2007 MIT
Explicit Parallelism
● Programmer controls details of parallelism!
● Granularity decisions:
  • If too small, lots of synchronization and thread creation
  • If too large, bad locality
● Load balancing decisions
  • Create balanced parallel sections (not data-parallel)
● Locality decisions
  • Sharing and communication structure
● Synchronization decisions
  • Barriers, atomicity, critical sections, order, flushing
● For mass adoption, we need a better paradigm:
  • Where the parallelism is natural
  • Exposes the necessary information to the compiler
Prof. Saman Amarasinghe, MIT.
4
6.189 IAP 2007 MIT
Unburden the Programmer
● Move these decisions to compiler!
  • Granularity
  • Load balancing
  • Locality
  • Synchronization
● Hard to do in traditional languages
• Can a novel language help?
Prof. Saman Amarasinghe, MIT.
5
6.189 IAP 2007 MIT
Properties of Stream Programs
[Stream graph figure: AtoD → FMDemod → Duplicate splitter → LPF1, LPF2, LPF3, HPF1, HPF2, HPF3 → RoundRobin joiner → Adder → Speaker]
● Regular and repeating computation
● Synchronous Data Flow
● Independent actors with explicit communication
● Data items have short lifetimes
Benefits:
● Naturally parallel
● Expose dependencies to compiler
● Enable powerful transformations
Prof. Saman Amarasinghe, MIT.
6
6.189 IAP 2007 MIT
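To make the actor model above concrete, here is a rough illustrative sketch in plain Python (not StreamIt syntax; the filter bodies and names are stand-ins): each filter pops a fixed number of items from its input FIFO and pushes a fixed number to its output FIFO, and all communication is explicit.

    from collections import deque

    class Filter:
        """Illustrative stream actor: pops 'pop' items and pushes 'push' items per firing."""
        def __init__(self, pop, push, fn):
            self.pop, self.push, self.fn = pop, push, fn
            self.inbox = deque()    # input FIFO, filled by the upstream filter
            self.outbox = deque()   # output FIFO, drained by the downstream filter

        def can_fire(self):
            return len(self.inbox) >= self.pop

        def fire(self):
            args = [self.inbox.popleft() for _ in range(self.pop)]
            for item in self.fn(args):          # fn returns exactly 'push' items
                self.outbox.append(item)

    # A two-stage pipeline: scale each sample by 2, then sum adjacent pairs.
    scale = Filter(pop=1, push=1, fn=lambda xs: [2 * xs[0]])
    adder = Filter(pop=2, push=1, fn=lambda xs: [xs[0] + xs[1]])

    scale.inbox.extend(range(8))
    while scale.can_fire():
        scale.fire()
    adder.inbox.extend(scale.outbox)
    while adder.can_fire():
        adder.fire()
    print(list(adder.outbox))   # [2, 10, 18, 26]

Because each firing's input and output rates are fixed and all state flows through the FIFOs, a compiler can see the whole dependence structure, which is exactly what the scheduling and parallelization passes below exploit.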
Outline
● Why We Need New Languages
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Prof. Saman Amarasinghe, MIT.
7
6.189 IAP 2007 MIT
Steady-State Schedule
● All data pop/push rates are constant
● Can find a steady-state schedule
  • The number of items in the buffers is the same before and after executing the schedule
  • There exists a unique minimum steady-state schedule
● Schedule = { }
[Pipeline figure: A →(push=2 / pop=3)→ B →(push=1 / pop=2)→ C]
Prof. Saman Amarasinghe, MIT.
8
6.189 IAP 2007 MIT
Steady-State Schedule
● Firing filters as their inputs become available builds the schedule step by step:
  { A } → { A, A } → { A, A, B } → { A, A, B, A } → { A, A, B, A, B } → { A, A, B, A, B, C }
● The minimum steady-state schedule is { A, A, B, A, B, C }: A fires 3 times (pushing 6 items), B fires twice (popping 6, pushing 2), and C fires once (popping 2), leaving every buffer with exactly as many items as it started with.
[Pipeline figure: A →(push=2 / pop=3)→ B →(push=1 / pop=2)→ C]
Prof. Saman Amarasinghe, MIT.
9-14
6.189 IAP 2007 MIT
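A minimal sketch of how such steady-state firing counts can be derived from the push/pop rates, using the standard synchronous-data-flow balance equations (illustrative code, not the StreamIt compiler's own):

    from fractions import Fraction
    from math import lcm

    def steady_state_multiplicities(rates):
        """rates: list of (push, pop) pairs, one per edge of a straight pipeline.
        Returns how many times each filter fires in one minimum steady-state schedule."""
        mult = [Fraction(1)]                   # relative firing rate of the first filter
        for push, pop in rates:
            # Balance equation: upstream_firings * push == downstream_firings * pop
            mult.append(mult[-1] * push / pop)
        scale = lcm(*(m.denominator for m in mult))
        return [int(m * scale) for m in mult]

    # Pipeline from the slides: A -(push 2 / pop 3)-> B -(push 1 / pop 2)-> C
    print(steady_state_multiplicities([(2, 3), (1, 2)]))   # [3, 2, 1]

The result [3, 2, 1] corresponds to the schedule { A, A, B, A, B, C } shown above: three firings of A, two of B, one of C.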
Initialization Schedule
● When peek > pop, the buffer cannot be empty after firing a filter
● Buffers are not empty at the beginning/end of the steady-state schedule
● Need to fill the buffers before starting steady-state execution
[Filter figure: peek=4, pop=3, push=1]
Prof. Saman Amarasinghe, MIT.
15
6.189 IAP 2007 MIT
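To make the peek/pop distinction concrete, a tiny illustrative sketch (not StreamIt or compiler code): a filter with peek=4 and pop=3 reads a 4-item window each firing but removes only 3 items, so steady-state execution can only begin once an initialization schedule has buffered at least peek - pop = 1 extra item.

    from collections import deque

    def fire_peeking_filter(buf, peek, pop):
        """Fire only if 'peek' items are visible; consume only 'pop' of them."""
        if len(buf) < peek:
            raise RuntimeError("buffer under-run: initialization schedule needed")
        window = list(buf)[:peek]      # items the filter may read this firing
        for _ in range(pop):
            buf.popleft()              # items actually consumed
        return window

    buf = deque([10, 11, 12])          # only pop=3 items buffered: not enough to peek 4
    try:
        fire_peeking_filter(buf, peek=4, pop=3)
    except RuntimeError as err:
        print(err)
    buf.append(13)                     # initialization fires the upstream filter once more
    print(fire_peeking_filter(buf, peek=4, pop=3))   # [10, 11, 12, 13]; one item stays buffered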
Outline
● Why We Need New Languages
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Prof. Saman Amarasinghe, MIT.
17
6.189 IAP 2007 MIT
Types of Parallelism
[Stream graph figure: a splitjoin with scatter and gather, annotated to show where task, data, and pipeline parallelism arise]
● Task Parallelism
  • Parallelism explicit in the algorithm
  • Between filters without a producer/consumer relationship
● Data Parallelism
  • Between iterations of a stateless filter
  • Peel iterations of the filter and place them within a scatter/gather pair (fission)
  • Can't parallelize filters with state
● Pipeline Parallelism
  • Between producers and consumers
  • Stateful filters can be parallelized
Prof. Saman Amarasinghe, MIT.
18-19
6.189 IAP 2007 MIT
Types of Parallelism
Traditionally:
● Task Parallelism
  • Thread (fork/join) parallelism
● Data Parallelism
  • Data-parallel loop (forall)
● Pipeline Parallelism
  • Usually exploited in hardware
[Stream graph figure: the same scatter/gather splitjoin, annotated with the task, data, and pipeline regions]
Prof. Saman Amarasinghe, MIT.
20
6.189 IAP 2007 MIT
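Since fission of a stateless filter comes up repeatedly below, here is a rough illustrative sketch of the idea in plain Python (the thread pool and filter body are stand-ins, not StreamIt): iterations are scattered round-robin across N copies of the filter, executed independently, and the outputs gathered back in their original order.

    from concurrent.futures import ThreadPoolExecutor

    def lowpass(x):
        """A stateless 'filter' body: output depends only on the current input."""
        return 0.5 * x

    def fissed(items, n_copies=4):
        # Scatter: round-robin distribution across n_copies copies of the filter.
        lanes = [items[i::n_copies] for i in range(n_copies)]
        with ThreadPoolExecutor(max_workers=n_copies) as pool:
            results = pool.map(lambda lane: [lowpass(x) for x in lane], lanes)
        # Gather: round-robin reassembly preserves the original item order.
        out = [None] * len(items)
        for i, lane_out in enumerate(results):
            out[i::n_copies] = lane_out
        return out

    print(fissed([1, 2, 3, 4, 5, 6, 7, 8]))   # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

This only works because the filter is stateless: each output depends on a single input item, so the copies never need to coordinate.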
Outline
● Why We Need New Languages
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Prof. Saman Amarasinghe, MIT.
21
6.189 IAP 2007 MIT
Baseline 1: Task Parallelism
[Stream graph figure: Splitter → two parallel pipelines (BandPass → Compress → Process → Expand → BandStop) → Joiner → Adder]
● Inherent task parallelism between two processing pipelines
● Task Parallel Model:
  • Only parallelize explicit task parallelism
  • Fork/join parallelism
● Execute this on a 2-core machine: ~2x speedup over single core
● What about 4, 16, 1024, … cores?
Prof. Saman Amarasinghe, MIT.
22
6.189 IAP 2007 MIT
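A rough analogue of this fork/join task-parallel model in ordinary threaded code (illustrative only; the pipeline body below is a stand-in, not the benchmark's filters): the two pipelines are forked as independent tasks and joined before the final Adder, so on more than two cores the extra cores simply sit idle.

    from concurrent.futures import ThreadPoolExecutor

    def pipeline(samples):
        # Stand-in for BandPass -> Compress -> Process -> Expand -> BandStop.
        return [x * 0.25 for x in samples]

    def run_task_parallel(samples):
        with ThreadPoolExecutor(max_workers=2) as pool:   # fork: one task per branch
            left = pool.submit(pipeline, samples)
            right = pool.submit(pipeline, samples)
            a, b = left.result(), right.result()          # join
        return [x + y for x, y in zip(a, b)]              # Adder

    print(run_task_parallel([1.0, 2.0, 3.0, 4.0]))        # [0.5, 1.0, 1.5, 2.0]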
Evaluation: Task Parallelism
Image by MIT OpenCourseWare.
[Bar chart: throughput normalized to single-core StreamIt for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and their geometric mean. Task parallelism alone yields little speedup.]
● Target: Raw microprocessor, 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA; measured on a cycle-accurate simulator
● Parallelism: not matched to target!
● Synchronization: not matched to target!
Prof. Saman Amarasinghe, MIT.
23
6.189 IAP 2007 MIT
Baseline 2: Fine-Grained Data Parallelism
[Stream graph figure: every stateless filter (BandPass, Compress, Process, Expand, BandStop, Adder) is fissed into four copies, each group wrapped in its own splitter/joiner pair]
● Each of the filters in the example is stateless
● Fine-grained Data Parallel Model:
  • Fiss each stateless filter N ways (N is the number of cores)
  • Remove scatter/gather if possible
● We can introduce data parallelism
  • Example: 4 cores
● Each fission group occupies the entire machine
Prof. Saman Amarasinghe, MIT.
24
6.189 IAP 2007 MIT
Evaluation: Fine-Grained Data Parallelism
Image by MIT OpenCourseWare.
[Bar chart: throughput normalized to single-core StreamIt, comparing Task vs. Fine-Grained Data parallelism across the benchmark suite (BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar) and the geometric mean.]
● Good parallelism!
● Too much synchronization!
Prof. Saman Amarasinghe, MIT.
25
6.189 IAP 2007 MIT
Baseline 3: Hardware Pipeline Parallelism
[Stream graph figure: Splitter → two pipelines (BandPass → Compress → Process → Expand → BandStop) → Joiner → Adder, with adjacent filters fused into load-balanced groups]
● The BandPass and BandStop filters contain all the work
● Hardware Pipelining:
  • Use a greedy algorithm to fuse adjacent filters
  • Want # filters <= # cores
● Example: 8 cores
● Resultant stream graph is mapped to hardware
  • One filter per core
  • Performance dependent on fusing to a load-balanced stream graph
● What about 4, 16, 1024, … cores?
Prof. Saman Amarasinghe, MIT.
26-27
6.189 IAP 2007 MIT
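A rough sketch of greedy fusion for hardware pipelining (the algorithm and the work estimates below are illustrative, not the StreamIt compiler's actual heuristic): repeatedly fuse the pair of adjacent filters with the smallest combined work until the filter count fits the core count.

    def greedy_fuse(work, n_cores):
        """work: per-filter work estimates for a straight pipeline, in order.
        Returns fused groups (lists of original filter indices) and their total work."""
        groups = [[i] for i in range(len(work))]
        totals = list(work)
        while len(groups) > n_cores:
            # Pick the adjacent pair with the smallest combined work and fuse it.
            j = min(range(len(totals) - 1), key=lambda i: totals[i] + totals[i + 1])
            groups[j:j + 2] = [groups[j] + groups[j + 1]]
            totals[j:j + 2] = [totals[j] + totals[j + 1]]
        return groups, totals

    # Hypothetical work estimates for an 8-filter pipeline, mapped onto 4 cores.
    groups, totals = greedy_fuse([10, 2, 3, 2, 10, 2, 3, 2], n_cores=4)
    print(groups)   # [[0], [1, 2, 3], [4], [5, 6, 7]]
    print(totals)   # [10, 7, 10, 7]  (the imbalance limits pipeline throughput)

Because the slowest fused stage sets the pipeline's throughput, the quality of this mapping depends entirely on how load-balanced the fused graph ends up, which is the weakness the evaluation below exposes.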
Evaluation: Hardware Pipeline Parallelism
Image by MIT OpenCourseWare.
[Bar chart: throughput normalized to single-core StreamIt, comparing Task, Fine-Grained Data, and Hardware Pipelining across the benchmark suite and the geometric mean.]
● Parallelism: not matched to target!
● Synchronization: not matched to target!
Prof. Saman Amarasinghe, MIT.
28
6.189 IAP 2007 MIT
The StreamIt Compiler
1. Coarsen Granularity: fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters
● Compile to a 16-core architecture
  • 11.2x mean throughput speedup over single core
Prof. Saman Amarasinghe, MIT.
29
6.189 IAP 2007 MIT
Phase 1: Coarsen the Stream Graph
[Stream graph figure: Splitter → two pipelines (BandPass → Compress → Process → Expand → BandStop) → Joiner → Adder; the BandPass and BandStop filters are marked as peeking]
● Before data-parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
  • Don't fuse stateless with stateful
  • Don't fuse a peeking filter with anything upstream
Prof. Saman Amarasinghe, MIT.
30
6.189 IAP 2007 MIT
Phase 1: Coarsen the Stream Graph
[Stream graph figure, after coarsening: in each pipeline, BandPass, Compress, Process, and Expand are fused into a single filter; the peeking BandStop filters remain separate]
● Before data-parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
  • Don't fuse stateless with stateful
  • Don't fuse a peeking filter with anything upstream
● Benefits:
  • Reduces global communication and synchronization
  • Exposes inter-node optimization opportunities
Prof. Saman Amarasinghe, MIT.
31
6.189 IAP 2007 MIT
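A minimal sketch of the coarsening rule on a straight pipeline (the data model, peek values, and code are illustrative, not the compiler's): fuse maximal runs of stateless filters, starting a new group whenever the next filter is stateful or peeks (peek > pop), since a peeking filter must not be fused with anything upstream.

    def coarsen(filters):
        """filters: list of dicts with 'name', 'stateful' (bool), 'peek', 'pop'.
        Returns groups of filter names that may be fused together."""
        groups = []
        for f in filters:
            peeking = f["peek"] > f["pop"]
            # Start a new group if there is nothing to extend, the previous group is
            # stateful, this filter is stateful, or this filter peeks (no upstream fusion).
            if not groups or groups[-1]["stateful"] or f["stateful"] or peeking:
                groups.append({"names": [f["name"]], "stateful": f["stateful"]})
            else:
                groups[-1]["names"].append(f["name"])
        return [g["names"] for g in groups]

    # Hypothetical peek/pop rates for one pipeline of the example.
    pipeline = [
        {"name": "BandPass", "stateful": False, "peek": 64, "pop": 1},  # peeks
        {"name": "Compress", "stateful": False, "peek": 1,  "pop": 1},
        {"name": "Process",  "stateful": False, "peek": 1,  "pop": 1},
        {"name": "Expand",   "stateful": False, "peek": 1,  "pop": 1},
        {"name": "BandStop", "stateful": False, "peek": 64, "pop": 1},  # peeks
    ]
    print(coarsen(pipeline))
    # [['BandPass', 'Compress', 'Process', 'Expand'], ['BandStop']]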
Phase 2: Data Parallelize
Data Parallelize for 4 cores
[Stream graph figure: the coarsened graph, Splitter → two fused pipelines (BandPass+Compress+Process+Expand, then BandStop) → Joiner → Adder, with the Adder fissed 4 ways inside a splitter/joiner pair]
● Fiss 4 ways, to occupy the entire chip
Prof. Saman Amarasinghe, MIT.
32
6.189 IAP 2007 MIT
Phase 2: Data Parallelize
Data Parallelize for 4 cores
[Stream graph figure: each fused pipeline and each BandStop is fissed 2 ways inside splitter/joiner pairs; the Adder is fissed 4 ways]
● Task parallelism! Each fused filter does equal work
● Fiss each filter 2 times to occupy the entire chip
Prof. Saman Amarasinghe, MIT.
33
6.189 IAP 2007 MIT
Phase 2: Data Parallelize
Data Parallelize for 4 cores
[Stream graph figure: the task-parallel branches are each fissed 2 ways; the Adder is fissed 4 ways]
● Task-conscious data parallelization
  • Preserve task parallelism
● Benefits:
  • Reduces global communication and synchronization
● Task parallelism: each filter does equal work
● Fiss each filter 2 times to occupy the entire chip
Prof. Saman Amarasinghe, MIT.
34
6.189 IAP 2007 MIT
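A small sketch of the task-conscious fission decision (illustrative arithmetic only): instead of fissing every filter by the full core count, each filter is fissed just enough to fill the cores left over after accounting for its equal-work, task-parallel siblings.

    def fission_width(n_cores, n_task_parallel_siblings):
        """How many copies to make of a filter that runs alongside
        (n_task_parallel_siblings - 1) other equal-work, task-parallel filters."""
        return max(n_cores // n_task_parallel_siblings, 1)

    # 4 cores, two task-parallel fused pipelines: fiss each pipeline 2 ways.
    print(fission_width(4, 2))   # 2
    # The final Adder has no task-parallel sibling: fiss it 4 ways.
    print(fission_width(4, 1))   # 4

Keeping the existing task parallelism means fewer fission copies per filter, and therefore fewer scatter/gather points and less global synchronization.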
Evaluation: Coarse-Grained Data Parallelism
Image by MIT OpenCourseWare.
[Bar chart: throughput normalized to single-core StreamIt, comparing Task, Fine-Grained Data, Hardware Pipelining, and Coarse-Grained Task + Data across the benchmark suite and the geometric mean.]
● Good parallelism!
● Low synchronization!
Prof. Saman Amarasinghe, MIT.
35
6.189 IAP 2007 MIT
Simplified Vocoder
[Stream graph figure with per-filter work estimates: Splitter → AdaptDFT (6) and AdaptDFT (6) → Joiner → RectPolar (20, data parallel) → Splitter → two branches of UnWrap (2) → Diff (1) → Amplify (1) → Accum (1) (data parallel, but too little work!) → Joiner → PolarRect (20, data parallel)]
● Target a 4-core machine
Prof. Saman Amarasinghe, MIT.
36
6.189 IAP 2007 MIT
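A back-of-the-envelope sketch of the "too little work" judgment (the synchronization overhead constant is purely hypothetical and machine dependent): fissing a filter only pays off if the work it sheds per firing exceeds the scatter/gather cost the fission adds.

    def fission_pays_off(work, n_ways, sync_overhead=4):
        """True if splitting 'work' across n_ways copies saves more time per firing
        than the extra scatter/gather cost (sync_overhead is a made-up constant)."""
        return work - work / n_ways > sync_overhead

    for name, work in [("RectPolar", 20), ("UnWrap", 2), ("Diff", 1)]:
        print(name, fission_pays_off(work, n_ways=4))
    # RectPolar True, UnWrap False, Diff False

Under these assumed numbers only the heavy RectPolar/PolarRect filters are worth fissing, which is what the next slide does.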
Data Parallelize
[Stream graph figure: the same vocoder graph, with RectPolar and PolarRect each fissed into four copies inside splitter/joiner pairs, so per-copy work drops from 20 to 5]
● Target a 4-core machine
Prof. Saman Amarasinghe, MIT.
37
6.189 IAP 2007 MIT
Data + Task Parallel Execution
[Execution schedule figure: cores vs. time for the data + task parallel mapping; one steady-state iteration takes 21 time units on the 4-core target]
● Target a 4-core machine
Prof. Saman Amarasinghe, MIT.
38
6.189 IAP 2007 MIT
We Can Do Better!
[Execution schedule figure: a better packing of the same work needs only 16 time units per steady-state iteration on the 4-core target]
● Target a 4-core machine
Prof. Saman Amarasinghe, MIT.
39
6.189 IAP 2007 MIT
Phase 3: Coarse-Grained Software Pipelining
[Figure: a prologue fills the pipeline; in the new steady state, the filter executions (e.g., the RectPolar copies) come from different original iterations]
● New steady state is free of dependencies
● Schedule the new steady state using a greedy partitioning
40
6.189 IAP 2007 MIT
Greedy Partitioning
[Figure: the dependence-free filter executions to schedule are packed onto 4 cores; the resulting schedule takes 16 time units]
● Target a 4-core machine
Prof. Saman Amarasinghe, MIT.
41
6.189 IAP 2007 MIT
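A minimal sketch of the greedy partitioning step (the classic longest-processing-time heuristic; the per-execution work numbers are hypothetical, based on the vocoder estimates shown earlier): because software pipelining has removed the dependences, each filter execution can simply be placed on the currently least-loaded core.

    import heapq

    def greedy_partition(work, n_cores):
        """Assign each (name, work) item to the least-loaded core, heaviest first."""
        cores = [(0, i, []) for i in range(n_cores)]   # (load, core id, assigned filters)
        heapq.heapify(cores)
        for name, w in sorted(work, key=lambda x: -x[1]):
            load, i, items = heapq.heappop(cores)
            heapq.heappush(cores, (load + w, i, items + [name]))
        return sorted(cores, key=lambda c: c[1])

    # Assumed per-execution work: AdaptDFT 6, UnWrap 2, Diff/Amplify/Accum 1,
    # RectPolar and PolarRect fissed 4 ways at 5 each.
    work = [("AdaptDFT_a", 6), ("AdaptDFT_b", 6),
            ("RectPolar_0", 5), ("RectPolar_1", 5), ("RectPolar_2", 5), ("RectPolar_3", 5),
            ("UnWrap_a", 2), ("UnWrap_b", 2),
            ("Diff_a", 1), ("Diff_b", 1), ("Amplify_a", 1), ("Amplify_b", 1),
            ("Accum_a", 1), ("Accum_b", 1),
            ("PolarRect_0", 5), ("PolarRect_1", 5), ("PolarRect_2", 5), ("PolarRect_3", 5)]
    for load, core, items in greedy_partition(work, 4):
        print(f"core {core}: load {load}")
    # Loads come out to 16, 16, 15, 15: a makespan of 16 under these assumed numbers,
    # matching the 16-unit schedule shown on the earlier slide.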
Evaluation: Coarse-Grained
Task + Data + Software Pipelining
Image by MIT OpenCourseWare.
[Bar chart: throughput normalized to single-core StreamIt, comparing Task, Fine-Grained Data, Hardware Pipelining, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline across the benchmark suite and the geometric mean.]
● Best parallelism!
● Lowest synchronization!
Prof. Saman Amarasinghe, MIT.
42
6.189 IAP 2007 MIT
Summary
● Streaming model naturally exposes task, data, and pipeline
parallelism
● This parallelism must be exploited at the correct granularity and
combined correctly
● Parallelism: Task = application dependent; Fine-Grained Data = good; Hardware Pipelining = application dependent; Coarse-Grained Task + Data = good; Coarse-Grained Task + Data + Software Pipeline = best
● Synchronization: Task = application dependent; Fine-Grained Data = high; Hardware Pipelining = application dependent; Coarse-Grained Task + Data = low; Coarse-Grained Task + Data + Software Pipeline = lowest
● Robust speedups across varied benchmark suite
Prof. Saman Amarasinghe, MIT.
43
6.189 IAP 2007 MIT