StreamIt on Raw

StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris Leger, Sam Larsen, and Saman Amarasinghe
MIT Laboratory for Computer Science
MIT Computer Architecture Workshop
September 19, 2002

Von Neumann Languages
• Why did C (FORTRAN, C++, etc.) become so successful?
  – Abstracted out the differences of von Neumann machines
    • Register set structure
    • Functional units and capabilities
    • Pipeline depth/width
    • Memory/cache organization
  – Directly exposed the common properties
    • Single memory image
    • Single control-flow
    • A clear notion of time
  – Mapped very efficiently onto a von Neumann machine
• Today von Neumann languages are a curse!

StreamIt: A Spatially-Aware Language
• A language for streaming applications
  – Provides a high-level stream abstraction
    • A filter is the autonomous unit of computation.
• Breaks the von Neumann language barrier
  – Each filter has its own PC
  – Each filter has its own address space
  – No global time
  – Explicit data movement between filters

The Filter
• A filter communicates using FIFO channels, with the following operations:
  – pop(): dequeue the bottom item from the incoming channel.
  – peek(index): return the value at position index without dequeuing it.
  – push(value): enqueue value on the outgoing channel.
• The pop, peek, and push rates for each firing of a filter must be statically determined.
• Each filter contains:
  – An initialization function
  – A steady-state “work” function

StreamIt Language
• A collection of filters connected by channels.
• Structured Streams
  – Streaming applications have structure, not a free-form graph.
  – Use a few constructs: pipeline, splitjoin, and feedback loop
  – Hierarchical composition
  – Intuitive textual representation
  – Greatly simplifies compiler analysis

Hierarchical Structures
• pipeline
  – Sequential composition of streams
• splitjoin
  – Parallel composition of streams
• feedback loop
  – Cyclic composition of streams

Compiler Flow Summary
[Flow diagram. Stages: Kopi Front-End, SIR Conversion, Graph Expansion, Scheduler, Partitioning, Layout, Communication Scheduler, Code Generation. Intermediate forms: StreamIt Code, Parse Tree, SIR (unexpanded), SIR (expanded), Load-balanced Stream Graph, Filters assigned to Raw tiles, Processor Code, Switch Code.]

Partitioning
• Goal: the granularity of the stream graph should match the target architecture.
• For Raw, we want the number of filters in the stream graph to equal the number of tiles.
• The final stream graph needs to be load balanced.
• Partitioning is currently driven by a simple greedy algorithm.
• Two primary transformations:
  – Fission
  – Fusion

Partitioning - Fission
• Fission - splitting streams
  – Duplicate a filter, placing the duplicates in a splitjoin to expose parallelism.
    [Diagram: Filter becomes Splitter, Filter, Filter, …, Filter, Joiner]
  – Split a filter into a pipeline for load balancing.
    [Diagram: Filter becomes Filter0, Filter1, …, FilterN]

Partitioning - Fusion
• Fusion - merging streams
  – Reduce the number of filters in a construct for load balancing and synchronization removal.
    [Diagram: Splitter, Filter0, …, FilterN, Joiner becomes Filter]
    [Diagram: Filter0, Filter1, …, FilterN becomes Filter]

Partitioning Example (Sort)
[Figure: stream graphs for Sort, labeled “242 Filters” and “16 Filters”]

Layout
• Goal: assign each filter to exactly one Raw tile.
• The layout algorithm is implemented using simulated annealing.
• The cost function (energy) tries to measure the added synchronization imposed by the layout.
• Want to avoid:
  – Crossed routes
  – Routes passing through tiles assigned to filters
• Because of the static properties of StreamIt, the exact communication properties of the stream graph are known at compile time.
  – Cost function is quite accurate
  – Leads to excellent layouts

Layout Example (FFT)
[Figure: Partitioned Stream Graph; Zero-cost layout]

Layout Example (Radio)
[Figure: Partitioned Stream Graph; Best layout]

Routing
• At this time, data items are routed using a simple dimension-ordered router.
• The router traces the path from source to destination by first routing the Y dimension and then the X dimension.
• All items are sent over the first static network.
• The second static network and the dynamic network are unused.

Communication Scheduling
• The communication scheduler maps StreamIt’s channel abstraction to Raw’s static network.
• The communication scheduler simulates the execution of a given schedule, recording the communication as it simulates.
  – Assume that each filter fires instantaneously.
  – Record the routing instruction for the source, destination, and intermediate hops.

Code Generation
• For the compute-processor, we generate C code that is compiled using Raw's GCC port.
• We introduce an internal buffer for each filter.
  – The buffer is necessary because of the peek operation.
  – All items are received into this buffer.
• Loop the “work” function indefinitely in the steady state:
  – Each filter buffers its input until it has peek items in its buffer, then it fires.
  – pop() and peek(index) are reads from the buffer.
  – A push(value) is a static network send.

Results
• We have detailed performance measurements over our 9 benchmarks in our upcoming ASPLOS paper, but we will not give them here.
  – This is our initial implementation and we are working on optimizations.
• But the results show that we are not communication limited.
  – We need to focus on optimizing the generated compute-processor code.
• In the following slides we give a comparison of StreamIt and C code for our benchmarks.
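
Sketch: filter firing with an internal buffer
The Code Generation slide says each filter buffers its input until it holds peek items, then fires, with pop() and peek(index) reading from the buffer and push(value) sending on the static network. The Python model below is only an illustration of that firing rule, not the C code the compiler generates. The names BufferedFilter, receive, sent, and moving_average are hypothetical, and the list `sent` stands in for a static-network send.

```python
from collections import deque


class BufferedFilter:
    """Hypothetical model of one filter: an input buffer plus a work function."""

    def __init__(self, peek_rate, pop_rate, work):
        # Assumption for this sketch: every firing consumes at least one item,
        # and a filter never pops more than it is allowed to peek at.
        if pop_rate < 1 or peek_rate < pop_rate:
            raise ValueError("need 1 <= pop_rate <= peek_rate")
        self.peek_rate = peek_rate   # items required in the buffer before firing
        self.pop_rate = pop_rate     # items consumed by each firing
        self.work = work             # steady-state "work" function
        self.buffer = deque()        # internal buffer that makes peek possible
        self.sent = []               # stand-in for sends on the static network

    def peek(self, index):
        return self.buffer[index]    # read without dequeuing

    def pop(self):
        return self.buffer.popleft() # read and dequeue

    def push(self, value):
        self.sent.append(value)      # modeled as a network send

    def receive(self, item):
        # All incoming items land in the buffer; fire whenever enough are present.
        self.buffer.append(item)
        while len(self.buffer) >= self.peek_rate:
            before = len(self.buffer)
            self.work(self)
            # Static rates: each firing must pop exactly pop_rate items.
            assert before - len(self.buffer) == self.pop_rate


def moving_average(f):
    # Example work function: peek rate 3, pop rate 1, push rate 1.
    f.push(sum(f.peek(i) for i in range(3)) / 3)
    f.pop()


avg = BufferedFilter(peek_rate=3, pop_rate=1, work=moving_average)
for x in [1, 2, 3, 4, 5]:
    avg.receive(x)
print(avg.sent)  # [2.0, 3.0, 4.0]
```

The first two items cause no firing because the buffer holds fewer than three items. Each later arrival triggers one firing, and two items remain buffered at the end, which is why the buffer is needed whenever the peek rate exceeds the pop rate.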

Speedup Over Single Tile
[Bar chart: Speedup of StreamIt on 16 tiles over Sequential C on 1 tile. Benchmarks: FIR, Radio, Sort, FFT, Filterbank, 3GPP.]
  – For Radio we obtained the C implementation from a 3rd party.
  – For FIR, Sort, FFT, Filterbank, and 3GPP we wrote the C implementation following a reference algorithm.

Intel® Xeon™ Comparison
[Bar chart: Throughput / cycle normalized to a Xeon @ 2.2GHz. Series: Sequential C program on 1 tile; StreamIt program on 16 tiles. Benchmarks: FIR, Radar, Radio, Sort, FFT, Filterbank, GSM, Vocoder, 3GPP.]
  – For Radio, GSM, and Vocoder we obtained the C implementation from a 3rd party.
  – For FIR, Sort, FFT, Filterbank, Radar, and 3GPP we wrote the C implementation following a reference algorithm.
  – For Radar, GSM, and Vocoder the C implementation did not fit on a single Raw tile.

Conclusion
• First step toward a portable stream language for communication-exposed architectures.
• Future work:
  – Optimizing the implementation
  – Supporting more features of StreamIt
• Other cool StreamIt projects:
  – New syntax
  – DSP domain-specific linear dataflow analysis and transformation
  – Constrained scheduling

For More Information
StreamIt Homepage: http://cag.lcs.mit.edu/streamit
• William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A Language for Streaming Applications. 2002 International Conference on Compiler Construction, Grenoble, France. To appear in the Springer-Verlag Lecture Notes in Computer Science.
• Michael I. Gordon, William Thies, et al. A Stream Compiler for Communication-Exposed Architectures. Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 2002.
• Michael I. Gordon. A Stream-Aware Compiler for Communication-Exposed Architectures. S.M. Thesis, Massachusetts Institute of Technology, August 2002.