Saman Amarasinghe

Let’s stick with current sequential languages
Parallel Programming is hard!
Billions of LOC written in sequential languages
Let the compiler do all the work
Maintain the current strong machine abstraction
SUIF Parallelizing Compiler
Monica Lam and the Stanford SUIF team 1993 – 1997
Automatically extract parallelism from sequential programs
Heroic Analysis
Interprocedural analysis
Array and scalar data-flow analysis
Reduction and recurrence recognition
C to FORTRAN
Achieved Best SPEC results of the day (MFLOPS)
Vector processor, Cray C90: 540
Uniprocessor, Digital 21164: 508
SUIF on 8 processors, Digital 8400: 1,016
[Figure: results by Number of Processors (1 to 8) for swm256, tomcatv, su2cor, ear, hydro2d, nasa7, alvinn, mdljsp2, wave5, ora, mdljdp2, fpppp, doduc, spice2g6]
But… Techniques were not robust for general use
Composition is key to building large systems
Implemented naturally via time-multiplexing
The framework for parallelizing sequential programs
Sequential parts at outermost
Global barriers
[Figure: Utilization (0 to 1) versus Number of cores (1 to 4096, Expected Year 2004 to 2022), one curve each for 90%, 99%, and 99.90% parallel]
Speedup = 1/(1 – p + p/N)
Utilization = 1/(p + N*(1 – p))
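The two formulas can be evaluated directly to see how quickly utilization falls as cores are added. The sketch below assumes p is the parallel fraction of the program and N the number of cores; the function names and the sample core counts are illustrative choices, not part of the slides.

```python
def speedup(p: float, n: int) -> float:
    """Speedup = 1/(1 - p + p/N) for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


def utilization(p: float, n: int) -> float:
    """Utilization = 1/(p + N*(1 - p)), which equals speedup(p, n) / n."""
    return 1.0 / (p + n * (1.0 - p))


if __name__ == "__main__":
    # Sample points only; the slide plots every power of two up to 4096.
    for p in (0.90, 0.99, 0.999):
        row = ", ".join(
            f"N={n}: {utilization(p, n):.3f}" for n in (1, 8, 64, 512, 4096)
        )
        print(f"p={p}: {row}")
```

Even at 99.9% parallel code, utilization on 4096 cores stays below 0.2 under this model, which is the point the curves make.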
Currently…
Theory, algorithms, languages, tools all centered around the sequential paradigm
A well enforced machine abstraction
Move to multicore is a fundamental shift
Akin to the shift from analog to digital design
Need a new abstraction where parallelism is the primary form of expression
Parallelism is simple
Parallelism is natural
Communication is intuitive
Parallel composition of sequential segments
With possible space-multiplexed execution
Parallel programming still in the dark ages
Elite community of practitioners
Active open research, little stable consensus
Assumption: we don’t know how to teach parallel programming!
Aim for a “Mead and Conway” type revolution
Develop simple, cookbook approaches
If we can’t teach them, they’re too complex!
Make them accessible
Carefully thought-out courseware, tools, texts, courses
Focus on the educational community
Exporting, proselytizing, workshops, conferences, journals, …
1. Move to a truly parallel world (long term)
Natural world is extremely parallel: learn to emulate it
Can we make sequential programs a special case of parallel programming?
2. Rejoice when parallelism is natural (medium term)
Switch to parallel languages if using them is easier than sequential languages
3. Help migrate legacy applications (short term)
Existing large body of code – cannot ignore!
Written in sequential languages – need to work with them
Some domains are inherently parallel
Coding them using a sequential language is…
Harder than using the right parallel abstraction
All information on inherent parallelism is lost
There are win-win situations
Increasing the programmer productivity while
extracting parallel performance
Streaming domain and the StreamIt experience
Structured block level diagram describes computation and flow of data
Conceptually easy to understand
Clean abstraction of functionality
Mapping to C (sequentialization) destroys this simple view
[Figure: MPEG-2 Decoder block diagram. MPEG bit stream -> VLD (emits picture type <PT1, PT2> and quantization coefficients <QC>) -> macroblocks, motion vectors -> splitter -> { frequency encoded macroblocks: ZigZag -> IQuantization <QC> -> IDCT -> Saturation | differentially coded motion vectors: Motion Vector Decode -> Repeat } -> joiner -> spatially encoded macroblocks, motion vectors -> splitter (Y, Cb, Cr) -> Motion Compensation <PT1> on each branch with a reference picture, followed by Channel Upsample on two of the three branches -> joiner -> recovered picture -> Picture Reorder <PT2> -> Color Space Conversion]
add VLD(QC, PT1, PT2);
add splitjoin {
  split roundrobin(NB, V);
  add pipeline {
    add ZigZag(B);
    add IQuantization(B) to QC;
    add IDCT(B);
    add Saturation(B);
  }
  add pipeline {
    add MotionVectorDecode();
    add Repeat(V, N);
  }
  join roundrobin(B, V);
}
add splitjoin {
  split roundrobin(4(B+V), B+V, B+V);
  add MotionCompensation(4(B+V)) to PT1;
  for (int i = 0; i < 2; i++) {
    add pipeline {
      add MotionCompensation(B+V) to PT1;
      add ChannelUpsample(B);
    }
  }
  join roundrobin(1, 1, 1);
}
add PictureReorder(3WH) to PT2;
add ColorSpaceConversion(3WH);
Task Parallelism
Thread (fork/join) parallelism
Parallelism explicit in algorithm
Between filters without producer/consumer relationship
Data Parallelism
Data parallel loop (forall)
Between iterations of a stateless filter
Can’t parallelize filters with state
Pipeline Parallelism
Usually exploited in hardware
Between producers and consumers
Stateful filters can be parallelized
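The difference between stateless and stateful filters can be sketched in Python. A stateless filter can be applied to each item independently (a forall), while a stateful filter must see its inputs in order. The names `saturate`, `RunningSum`, and `data_parallel` and the clamp bounds are assumptions for illustration; threads are used only to show the structure, since CPython threads do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def saturate(x: int, lo: int = -256, hi: int = 255) -> int:
    """Stateless filter: the output depends only on the current input."""
    return max(lo, min(hi, x))


class RunningSum:
    """Stateful filter: each output depends on all earlier inputs."""

    def __init__(self) -> None:
        self.total = 0

    def __call__(self, x: int) -> int:
        self.total += x
        return self.total


def data_parallel(filter_fn: Callable[[int], int], items: Iterable[int], workers: int = 4) -> List[int]:
    """Apply a stateless filter to every item; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(filter_fn, items))


if __name__ == "__main__":
    data = [300, -500, 7, 255, -256, 1000]
    # Safe to run item by item in any order: same result as a sequential loop.
    print(data_parallel(saturate, data))
    # Must run in order on one worker; it can still form one stage of a pipeline.
    running = RunningSum()
    print([running(x) for x in data])
```

Running `RunningSum` through `data_parallel` would make the result depend on thread scheduling, which is why filters with state are left to pipeline parallelism.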
Benchmarks
On a 16 core MIT Raw Processor (http://cag.csail.mit.edu/raw)
[Figure: Throughput Normalized to Single Core StreamIt (axis 0 to 19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and Geometric Mean]
Still in existing sequential languages:
Don’t modify a code segment if…
The performance impact is insignificant and is isolated from the rest
Automatic parallelizer works perfectly
Modify and annotate a segment if…
Automatic parallelizer needs a little help
Use a parallel language:
Otherwise rewrite the segment
Program Reincarnation
A new body with the same old soul
Dynamic analysis
Managed program execution
Program invariant inference
Application knowledge database
Assisted parallelization
GUI tool
Automatic parallelization
Correctness in reincarnated
Test generation
Divergence analysis
Static analysis
Learn about the domain
Block diagram
Refactoring identification
Flag domain specific issues
Generate domain-specific hints
Bring programs to modern age
Automatic parallelization
Info for program understanding
[Figure: Assisted Application Reincarnation Tool. Components shown: Legacy Program (Original Source File .c, Original Compiler, Original Binary .exe), Compiler & Instrumenter, Instrumenter and Binary Interpreter, Managed Program Execution (.exe, .log), Program Invariant Inference Engine, Application Knowledge (program representation & invariants), Domain Knowledge Extraction, Domain Knowledge Database, Static Analysis, Known Idiom Identification & Domain Hint Generation, Refactoring Identification, Block Diagram Representation, Divergence Analysis, Test Generation, Reincarnated .c]
Multicore menace will impact all of us in a big way
Parallelism is needed to keep up with Moore’s curve
Will definitely need new parallel languages where parallelism is the primary form of composition
Low hanging fruit when parallelism is the natural form of expression
However, cannot ignore the past investments
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.