StreamIt-CAW-9-02 - MIT Computer Science and Artificial

StreamIt on Raw
StreamIt Group:
Michael Gordon, William Thies, Michal
Karczmarek, David Maze, Jasper Lin, Jeremy
Wong, Andrew Lamb, Ali S. Meli, Chris Leger,
Sam Larsen, and Saman Amarasinghe
MIT Laboratory for Computer Science
MIT Computer Architecture Workshop
September 19, 2002
Von Neumann Languages
• Why did C (FORTRAN, C++, etc.) become so
successful?
– Abstracted out the differences of von Neumann
machines
• Register set structure
• Functional units and capabilities
• Pipeline depth/width
• Memory/cache organization
– Directly expose the common properties
• Single memory image
• Single control-flow
• A clear notion of time
– Can have a very efficient mapping to a von Neumann
machine
• Today von Neumann languages are a curse!
StreamIt: A Spatially-Aware Language
• A language for streaming applications
– Provides high-level stream abstraction
• A filter is the autonomous unit of
computation.
• Breaks the von Neumann language barrier
– Each filter has its own PC
– Each filter has its own address space
– No global time
– Explicit data movement between filters
The Filter
• A filter communicates using FIFO channels, with
the following operations:
– pop(): dequeue the item at the front of the incoming
channel.
– peek(index): return the value at position index
without dequeuing it.
– push(value): enqueue value on the outgoing channel.
• The pop, peek, and push rate for each firing of a
filter must be statically determined.
• Each filter contains:
– An initialization function
– A steady-state “work” function
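The channel operations above can be sketched in a few lines of Python. This is an illustrative model only, not StreamIt syntax: the names `Channel` and `moving_average_work` are hypothetical, and the filter is assumed to have static rates of peek 3, pop 1, push 1.

```python
from collections import deque

class Channel:
    """FIFO channel; a hypothetical model of a StreamIt channel."""
    def __init__(self, items=()):
        self._q = deque(items)

    def push(self, value):
        self._q.append(value)      # enqueue on the outgoing side

    def pop(self):
        return self._q.popleft()   # dequeue the oldest item

    def peek(self, index):
        return self._q[index]      # read without dequeuing

    def __len__(self):
        return len(self._q)

PEEK, POP, PUSH = 3, 1, 1  # assumed static rates for this example filter

def moving_average_work(inp, out):
    """One firing: average a window of PEEK items, then slide by POP."""
    total = sum(inp.peek(i) for i in range(PEEK))
    out.push(total / PEEK)
    inp.pop()

inp = Channel([3, 6, 9, 12])
out = Channel()
while len(inp) >= PEEK:            # fire only when enough items are available
    moving_average_work(inp, out)
results = [out.pop() for _ in range(len(out))]
```

Because the rates are fixed, the number of firings for a given input length is known before the filter runs, which is the property the compiler relies on.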
StreamIt Language
• A collection of filters connected by channels.
• Structured Streams
– Streaming applications have structure, not a
free-form graph.
– Use a few constructs: pipeline, splitjoin, and
feedback loop
– Hierarchical composition
– Intuitive textual representation
– Greatly simplify compiler analysis
Hierarchical Structures
• pipeline
– Sequential composition of streams
• splitjoin
– Parallel composition of streams
• feedback loop
– Cyclic composition of streams
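A rough way to picture hierarchical composition is to treat each stream as a function from a list of items to a list of items. The sketch below is a Python model, not StreamIt code: `pipeline`, `splitjoin`, and `map_filter` are hypothetical helpers, the splitter and joiner are assumed to be round-robin with one item per branch, and the feedback loop is left out for brevity.

```python
def pipeline(*stages):
    """Sequential composition: the output of each stage feeds the next."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run

def splitjoin(*branches):
    """Parallel composition with a round-robin splitter and joiner.
    Assumes the input length is a multiple of the number of branches."""
    n = len(branches)
    def run(items):
        outputs = [branch(items[i::n]) for i, branch in enumerate(branches)]
        joined = []
        for group in zip(*outputs):   # take one item from each branch in turn
            joined.extend(group)
        return joined
    return run

def map_filter(fn):
    """A stateless filter that pops one item and pushes one item per firing."""
    return lambda items: [fn(x) for x in items]

double = map_filter(lambda x: 2 * x)
negate = map_filter(lambda x: -x)
add_one = map_filter(lambda x: x + 1)

# A splitjoin nested inside a pipeline: streams compose hierarchically.
graph = pipeline(splitjoin(double, negate), add_one)
result = graph([1, 2, 3, 4])
```

Since every construct returns a stream of the same shape, a splitjoin can sit inside a pipeline, which can itself sit inside another splitjoin, and so on.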
Compiler Flow Summary
[Figure: compiler flow diagram. Phases: Kopi Front-End, SIR Conversion, Graph Expansion, Scheduler, Partitioning, Layout, Communication Scheduler, Code Generation. Intermediate forms: StreamIt Code, Parse Tree, SIR (unexpanded), SIR (expanded), Load-balanced Stream Graph, Filters assigned to Raw tiles, Processor Code, Switch Code.]
Partitioning
• Goal: Granularity of the stream graph should match the target
architecture.
• For Raw, we want the number of filters in the stream graph to
equal the number of tiles.
• The final stream graph needs to be load balanced.
• Partitioning is currently driven by a simple greedy algorithm.
• Two primary transformations:
– Fission
– Fusion
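The slides do not spell out the greedy algorithm, so the following Python sketch shows only one plausible greedy strategy, under stated assumptions: the stream graph is a simple pipeline, each filter has a known work estimate, and only fusion of adjacent filters is used. The function name `greedy_fuse` and the work numbers are hypothetical.

```python
def greedy_fuse(work, tiles):
    """Fuse adjacent pipeline filters until at most `tiles` groups remain.

    `work` is a list of estimated costs, one per filter, in pipeline order.
    Assumes tiles >= 1. Each step fuses the adjacent pair whose combined
    work is smallest, which tends to keep the heaviest group small.
    """
    groups = [[w] for w in work]
    while len(groups) > tiles:
        best = min(range(len(groups) - 1),
                   key=lambda i: sum(groups[i]) + sum(groups[i + 1]))
        groups[best:best + 2] = [groups[best] + groups[best + 1]]
    return groups

parts = greedy_fuse([4, 1, 1, 6, 2, 2], tiles=3)
loads = [sum(g) for g in parts]
```

In this example six filters are reduced to three groups whose loads are close to one another. A fuller version would also apply fission when the graph has fewer filters than tiles or when one filter dominates the load.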
Partitioning - Fission
• Fission - splitting streams
– Duplicate a filter, placing the duplicates in a
splitjoin to expose parallelism.
[Figure: a filter duplicated inside a splitjoin (Splitter, Filter, Filter, …, Filter, Joiner)]
– Split a filter into a pipeline for load balancing.
[Figure: a single Filter split into a pipeline (Filter0, Filter1, …, FilterN)]
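The first form of fission can be checked with a small Python model. The sketch assumes a stateless filter that pops one item and pushes one item per firing, and a round-robin splitter and joiner; `run_filter` and `run_fissed` are hypothetical names. Filters that peek or carry state need more care than is shown here.

```python
def run_filter(fn, items):
    """Run a stateless one-in, one-out filter over a list of items."""
    return [fn(x) for x in items]

def run_fissed(fn, items, copies):
    """Round-robin split, run each duplicate independently, round-robin join."""
    lanes = [run_filter(fn, items[i::copies]) for i in range(copies)]
    out = []
    for i in range(len(items)):
        # Item i was sent to lane i % copies and is that lane's (i // copies)-th output.
        out.append(lanes[i % copies][i // copies])
    return out

square = lambda x: x * x
data = list(range(10))
original = run_filter(square, data)
fissed = run_fissed(square, data, copies=4)
```

The two results are identical, while each duplicate handles only a fraction of the items and could run on its own tile.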
Partitioning - Fusion
• Fusion - merging streams
– Reduce the number of filters in a construct for
load balancing and synchronization removal.
[Figure: a splitjoin (Splitter, Filter0, …, FilterN, Joiner) fused into a single Filter, and a pipeline (Filter0, Filter1, …, FilterN) fused into a single Filter]
Partitioning Example (Sort)
[Figure: Sort stream graph, 242 filters, and partitioned stream graph, 16 filters]
Layout
• Goal: To assign each filter to exactly one Raw tile.
• The layout algorithm is implemented using
Simulated Annealing.
• The cost function (energy) tries to measure the
added synchronization imposed by the layout.
• Want to avoid:
– Crossed routes
– Routes passing through tiles assigned to filters
• Because of the static properties of StreamIt, exact
communication properties of the stream graph are
known at compile time.
– Cost function is quite accurate
– Leads to excellent layouts
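A minimal simulated-annealing placement can be sketched in Python as follows. The cost function here is an assumption: it simply sums Manhattan distances between communicating filters, whereas the slides describe a cost that estimates added synchronization (crossed routes, routes through occupied tiles). The names `cost` and `anneal`, the cooling schedule, and the 16-filter pipeline are all hypothetical.

```python
import math
import random

def cost(placement, edges):
    """Assumed energy: total Manhattan distance over all channels."""
    total = 0
    for a, b in edges:
        (ax, ay), (bx, by) = placement[a], placement[b]
        total += abs(ax - bx) + abs(ay - by)
    return total

def anneal(filters, edges, width, height, steps=5000, seed=0):
    """Place one filter per tile (len(filters) must equal width * height)."""
    rng = random.Random(seed)
    tiles = [(x, y) for y in range(height) for x in range(width)]
    rng.shuffle(tiles)
    placement = dict(zip(filters, tiles))
    current = cost(placement, edges)
    best, best_cost = dict(placement), current
    temp = 10.0
    for _ in range(steps):
        a, b = rng.sample(filters, 2)             # propose swapping two filters
        placement[a], placement[b] = placement[b], placement[a]
        new = cost(placement, edges)
        if new <= current or rng.random() < math.exp((current - new) / temp):
            current = new                          # accept the move
            if new < best_cost:
                best, best_cost = dict(placement), new
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo
        temp = max(temp * 0.999, 1e-3)             # geometric cooling
    return best, best_cost

filters = [f"f{i}" for i in range(16)]
edges = [(filters[i], filters[i + 1]) for i in range(15)]  # a 16-stage pipeline
best, best_cost = anneal(filters, edges, width=4, height=4)
```

For a 16-stage pipeline on a 4x4 grid the lowest possible cost under this metric is 15, since each of the 15 channels spans at least one hop.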
Layout Example (FFT)
Partitioned Stream Graph
Zero-cost layout
Layout Example (Radio)
Partitioned Stream Graph
Best layout
Routing
• At this time, data items are routed using a
simple dimension-ordered router.
• The router traces the path from source to
destination, routing first along the Y dimension
and then along the X dimension.
• All items are sent over the first static
network.
• The second static network and the dynamic
network are unused.
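Dimension-ordered routing as described above can be sketched in Python. Tiles are assumed to be addressed by `(x, y)` coordinates, and `route` is a hypothetical helper name.

```python
def route(src, dst):
    """Return the tiles visited from src to dst, moving along Y first, then X."""
    x, y = src
    path = [(x, y)]
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:              # first resolve the Y dimension
        y += step_y
        path.append((x, y))
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:              # then resolve the X dimension
        x += step_x
        path.append((x, y))
    return path

hops = route((0, 0), (2, 3))
```

The route between any two tiles is fully determined by their coordinates, so the compiler can compute every path ahead of time.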
Communication Scheduling
• The communication scheduler maps
StreamIt’s channel abstraction to Raw’s
static network.
• The communication scheduler simulates the
execution of a given schedule, recording the
communication as it simulates.
– Assume that each filter fires instantaneously.
– Record the routing instruction for the source,
destination, and intermediate hops.
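The simulation idea can be illustrated with a small Python sketch. It walks a firing order, treats each firing as instantaneous, and appends a routing step to the per-tile list for the source, every intermediate hop, and the destination. Everything here is an assumption made for illustration: the function name, the `send`/`forward`/`receive` tuples (which are not Raw switch instructions), and the precomputed `paths` table, in which every path must contain at least two tiles.

```python
from collections import defaultdict

def schedule_communication(firings, push_count, paths):
    """Simulate a firing order and record ordered routing steps per tile.

    firings:    filter names in execution order
    push_count: {source: {destination: items pushed per firing}}
    paths:      {(source, destination): [tile, tile, ...]} from a router
    """
    switch_code = defaultdict(list)
    for src in firings:
        for dst, count in push_count.get(src, {}).items():
            path = paths[(src, dst)]
            for _ in range(count):
                switch_code[path[0]].append(("send", path[1]))
                for prev, here, nxt in zip(path, path[1:], path[2:]):
                    switch_code[here].append(("forward", prev, nxt))
                switch_code[path[-1]].append(("receive", path[-2]))
    return dict(switch_code)

paths = {("A", "B"): [(0, 0), (0, 1), (0, 2)]}
code = schedule_communication(["A", "A", "B"], {"A": {"B": 1}}, paths)
```

Each tile ends up with an ordered list of steps, which is the shape a statically scheduled network needs.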
Code Generation
• For the compute-processor, we generate C code
that is compiled using Raw's GCC port.
• We introduce an internal buffer for each filter.
– The buffer is necessary because of the peek operation.
– All items are received into this buffer.
• The “work” function loops forever in the steady state:
– Each filter buffers its input until it has peek items in its
buffer, then it fires.
– pop() and peek(index) are reads from the buffer.
– A push(value) is a static network send.
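The buffering scheme above can be modeled in Python as a rough sketch. The real compiler emits C for the Raw compute processor; here `incoming` stands in for static network receives, the returned list stands in for static network sends, and the names `run_filter` and `difference_work` are hypothetical. The sketch stops when the input runs out, whereas the generated code loops forever, and it assumes a pop rate of at least 1.

```python
def run_filter(incoming, work, peek, pop):
    """Buffer arriving items and fire `work` whenever `peek` items are available."""
    buffer = []
    sent = []
    for item in incoming:                 # stands in for a static network receive
        buffer.append(item)
        while len(buffer) >= peek:        # enough items buffered: fire
            sent.extend(work(buffer))     # peek(i) is a read of buffer[i]
            del buffer[:pop]              # pop consumes from the front
    return sent

def difference_work(buffer):
    """Example work function with peek 2, pop 1, push 1."""
    return [buffer[1] - buffer[0]]

sent = run_filter([1, 4, 9, 16], difference_work, peek=2, pop=1)
```

The buffer is what lets the filter look ahead at items it has not yet consumed; without a peek larger than the pop rate, items could be handed directly to the work function.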
Results
• We have detailed performance measurements
over our 9 benchmarks in our upcoming ASPLOS
paper, but we will not give them here.
– This is our initial implementation and we are working on
optimizations.
• But the results show that we are not
communication limited.
– We need to focus on optimizing the generated compute-processor code.
• In the following slides we give a comparison of
StreamIt and C code for our benchmarks.
Speedup Over Single Tile
[Figure: bar chart of the speedup of StreamIt on 16 tiles over sequential C on 1 tile (y-axis 0 to 32) for FIR, Radio, Sort, FFT, Filterbank, and 3GPP]
– For Radio we obtained the C implementation from a third party.
– For FIR, Sort, FFT, Filterbank, and 3GPP we wrote the C implementation following a
reference algorithm.
Intel® Xeon™ Comparison
[Figure: bar chart of throughput per cycle, normalized to a Xeon @ 2.2GHz (y-axis 0 to 16, with one data label of 37). Series: sequential C program on 1 tile; StreamIt program on 16 tiles. Benchmarks: FIR, Radar, Radio, Sort, FFT, Filterbank, GSM, Vocoder, 3GPP]
– For Radio, GSM, and Vocoder we obtained the C implementation from a third party.
– For FIR, Sort, FFT, Filterbank, Radar, and 3GPP we wrote the C implementation following a reference algorithm.
– For Radar, GSM, and Vocoder the C implementation did not fit on a single Raw tile.
Conclusion
• First step toward a portable stream
language for communication-exposed
architectures.
• Future work:
– Optimizing the implementation
– Support more features of StreamIt
• Other cool StreamIt projects:
– New syntax
– DSP domain specific linear dataflow analysis
and transformation.
– Constrained scheduling
For More Information
StreamIt Homepage
http://cag.lcs.mit.edu/streamit
• William Thies, Michal Karczmarek, and Saman Amarasinghe, StreamIt: A Language for
Streaming Applications, 2002 International Conference on Compiler Construction, Grenoble,
France. To appear in the Springer-Verlag Lecture Notes in Computer Science.
• Michael I. Gordon, William Thies, et al., A Stream Compiler for Communication-Exposed
Architectures, Proceedings of the Tenth International Conference on Architectural Support
for Programming Languages and Operating Systems, San Jose, CA, October 2002.
• Michael I. Gordon, A Stream-Aware Compiler for Communication-Exposed Architectures,
S.M. Thesis, Massachusetts Institute of Technology, August 2002.