Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts Institute of Technology ASPLOS October 2006 San Jose, CA http://cag.csail.mit.edu/streamit 1 Multicores Are Here! 512 Picochip PC102 256 Ambric AM2045 Cisco CSR-1 Intel Tflops 128 64 32 # of cores 16 Raw Niagara 8 Broadcom 1480 4 2 1 4004 8080 8086 286 386 486 Pentium 8008 1970 2 Raza XLR 1975 1980 1985 1990 Cavium Octeon Cell Opteron 4P Xeon MP Xbox360 PA-8800 Opteron Tanglewood Power4 PExtreme Power6 Yonah P2 P3 Itanium P4 Athlon Itanium 2 1995 2000 2005 20?? Multicores Are Here! 512 256 128 64 32 # of cores 16 8 4 For uniprocessors, Uniprocessors: C was: C •Portable is the common machine language •High Performance •Composable •Malleable •Maintainable Picochip PC102 Cisco CSR-1 Intel Tflops Raw 1 8086 286 386 486 Broadcom 1480 Pentium 8008 1970 3 8080 1975 1980 1985 1990 Raza XLR Niagara 2 4004 Ambric AM2045 Cavium Octeon Cell Opteron 4P Xeon MP Xbox360 PA-8800 Opteron Tanglewood Power4 PExtreme Power6 Yonah P2 P3 Itanium P4 Athlon Itanium 2 1995 2000 2005 20?? Multicores Are Here! What is the common machine language for multicores? 512 256 128 Picochip PC102 Ambric AM2045 Cisco CSR-1 Intel Tflops 64 32 # of cores 16 Raw Niagara 8 Broadcom 1480 4 2 1 4004 8080 8086 286 386 486 Pentium 8008 1970 4 Raza XLR 1975 1980 1985 1990 Cavium Octeon Cell Opteron 4P Xeon MP Xbox360 PA-8800 Opteron Tanglewood Power4 PExtreme Power6 Yonah P2 P3 Itanium P4 Athlon Itanium 2 1995 2000 2005 20?? Common Machine Languages Uniprocessors: Common Properties Multicores: Common Properties Single flow of control Multiple flows of control Single memory image Multiple local memories Differences: Differences: Number and capabilities of cores Register Allocation Communication Model ISA Instruction Selection Synchronization Model Functional Units Instruction Scheduling Register File von-Neumann languages represent the common properties and abstract away the differences 5 Need common machine language(s) for multicores Streaming as a Common Machine Language AtoD • Regular and repeating computation FMDemod • Independent filters with explicit communication – Segregated address spaces and multiple program counters Scatter LPF1 LPF2 LPF3 HPF1 HPF2 HPF3 • Natural expression of Parallelism: – Producer / Consumer dependencies – Enables powerful, whole-program transformations Gather Adder Speaker 6 Types of Parallelism Task Parallelism – Parallelism explicit in algorithm – Between filters without producer/consumer relationship Scatter Data Parallelism – Peel iterations of filter, place within scatter/gather pair (fission) – parallelize filters with state Gather 7 Task Pipeline Parallelism – Between producers and consumers – Stateful filters can be parallelized Types of Parallelism Task Parallelism – Parallelism explicit in algorithm Data Parallel – Between filters without Gather producer/consumer relationship Scatter Pipeline Scatter Data Parallelism – Between iterations of a stateless filter – Place within scatter/gather pair (fission) – Can’t parallelize filters with state Gather Data 8 Task Pipeline Parallelism – Between producers and consumers – Stateful filters can be parallelized Types of Parallelism Traditionally: Scatter Gather Pipeline Scatter Data Parallelism – Data parallel loop (forall) Gather Data 9 Task Parallelism – Thread (fork/join) parallelism Task Pipeline Parallelism – Usually exploited in hardware Problem Statement Given: – Stream graph with compute and communication estimate for each filter – Computation and communication resources of the target machine Find: – Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources 10 Our 3-Phase Solution Coarsen Granularity Data Parallelize Software Pipeline 1. Coarsen: Fuse stateless sections of the graph 2. Data Parallelize: parallelize stateless filters 3. Software Pipeline: parallelize stateful filters Compile to a 16 core architecture – 11 11.2x mean throughput speedup over single core Outline • StreamIt language overview • Mapping to multicores – Baseline techniques – Our 3-phase solution 12 The StreamIt Project • Applications StreamIt Program – DES and Serpent [PLDI 05] – MPEG-2 [IPDPS 06] – SAR, DSP benchmarks, JPEG, … Front-end • Programmability – StreamIt Language (CC 02) – Teleport Messaging (PPOPP 05) – Programming Environment in Eclipse (P-PHEC 05) Annotated Java • Domain Specific Optimizations – Linear Analysis and Optimization (PLDI 03) – Optimizations for bit streaming (PLDI 05) – Linear State Space Analysis (CASES 05) Simulator (Java Library) Stream-Aware Optimizations • Architecture Specific Optimizations – Compiling for Communication-Exposed Architectures (ASPLOS 02) – Phased Scheduling (LCTES 03) – Cache Aware Optimization (LCTES 05) – Load-Balanced Rendering (Graphics Hardware 05) 13 Uniprocessor backend Cluster backend Raw backend IBM X10 backend C/C++ MPI-like C/C++ C per tile + msg code Streaming X10 runtime Model of Computation • Synchronous Dataflow [Lee ‘92] A/D – Graph of autonomous filters – Communicate via FIFO channels Band Pass • Static I/O rates – Compiler decides on an order of execution (schedule) Detect – Static estimation of computation LED 14 Duplicate Detect Detect Detect LED LED LED Example StreamIt Filter 0 1 2 3 4 5 6 7 8 9 10 11 FIR 0 1 output floatfloat filter FIR (int N, float[N] weights) { Stateless work push 1 pop 1 peek N { float result = 0; for (int i = 0; i < N; i++) { result += weights[i] peek(i); } pop(); push(result); } } 15 input Example StreamIt Filter 0 1 2 3 4 5 6 7 8 9 10 11 FIR 0 1 output floatfloat filter FIR (int N, N) float[N] { weights) { ; Stateful work push 1 pop 1 peek N { float result = 0; weights = adaptChannel(weights); for (int i = 0; i < N; i++) { result += weights[i] peek(i); } pop(); push(result); } } 16 input StreamIt Language Overview • StreamIt is a novel language for streaming – Exposes parallelism and communication – Architecture independent – Modular and composable – Simple structures composed to creates complex graphs filter pipeline may be any StreamIt language construct splitjoin splitter parallel computation joiner – Malleable – Change program behavior with small modifications feedback loop joiner 17 splitter Outline • StreamIt language overview • Mapping to multicores – Baseline techniques – Our 3-phase solution 18 Baseline 1: Task Parallelism • Inherent task parallelism between two processing pipelines Splitter BandPass BandPass Compress Compress Process Process Expand Expand BandStop BandStop Joiner Adder 19 • Task Parallel Model: – Only parallelize explicit task parallelism – Fork/join parallelism • Execute this on a 2 core machine ~2x speedup over single core • What about 4, 16, 1024, … cores? Evaluation: Task Parallelism 16 Raw Microprocessor Parallelism: Not matched to target! 16 inorder, single-issue cores with D$ and I$ Synchronization: Not matched towith target! 16 memory banks, each bank DMA 15 Cycle accurate simulator 19 Throughput Normalized to Single Core StreamIt 18 17 14 13 12 11 10 9 8 7 6 5 4 3 2 1 20 M ea n ar ric R ad er oc od G eo m et V de r G 2D ec o TD E P E M t o ad i S er pe n lte FM R rb an k T S D E D C T FF Fi oc lV nn e C ha B i to ni cS od e or t r 0 Baseline 2: Fine-Grained Data Parallelism Splitter Splitter Splitter BandPass BandPass BandPass BandPass BandPass BandPass BandPass BandPass Joiner Joiner Splitter Splitter Compress Compress Compress Compress Joiner Process Process Process Process Compress Compress Compress Compress Joiner Joiner Splitter Splitter Process Process Process Process Splitter Splitter Expand Expand Expand Expand BandStop BandStop BandStop BandStop Expand Expand Expand Expand Joiner Joiner Splitter Splitter Joiner – Example: 4 cores • Each fission group occupies entire machine Joiner Splitter – Fiss each stateless filter N ways (N is number of cores) – Remove scatter/gather if possible • We can introduce data parallelism BandStop BandStop BandStop BandStop Joiner BandStop BandStop BandStop Adder Adder Joiner 21 Joiner • Each of the filters in the example are stateless • Fine-grained Data Parallel Model: 22 G en t eo m et ri c ea n ad ar M R Vo co de r E PE G 2D ec od er TD Se rp ad io 16 R an k Task Fine-Grained Data FM T ES C FF T D D 17 oc od er or t 18 Fi lte rb ha nn el V 19 M C Bi to ni cS Throughput Normalized to Single Core StreamIt Evaluation: Fine-Grained Data Parallelism Good Parallelism! Too Much Synchronization! 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Outline • StreamIt language overview • Mapping to multicores – Baseline techniques – Our 3-phase solution 23 Phase 1: Coarsen the Stream Graph Splitter BandPass Peek Compress Compress Process Process Expand Expand BandStop Peek Joiner Adder 24 BandPass BandStop Peek Peek • Before data-parallelism is exploited • Fuse stateless pipelines as much as possible without introducing state – Don’t fuse stateless with stateful – Don’t fuse a peeking filter with anything upstream Phase 1: Coarsen the Stream Graph Splitter BandPass Compress Process Expand BandPass Compress Process Expand BandStop BandStop • Before data-parallelism is exploited • Fuse stateless pipelines as much as possible without introducing state – Don’t fuse stateless with stateful – Don’t fuse a peeking filter with anything upstream • Benefits: Joiner Adder 25 – Reduces global communication and synchronization – Exposes inter-node optimization opportunities Phase 2: Data Parallelize Data Parallelize for 4 cores Splitter BandPass Compress Process Expand BandPass Compress Process Expand BandStop BandStop Joiner Splitter Adder Adder Adder Adder Joiner 26 Fiss 4 ways, to occupy entire chip Phase 2: Data Parallelize Data Parallelize for 4 cores Splitter Splitter Splitter BandPass BandPass Compress Compress Process Process Expand Expand BandPass BandPass Compress Compress Process Process Expand Expand Joiner Joiner BandStop BandStop Joiner Splitter Adder Adder Adder Adder Joiner 27 Task parallelism! Each fused filter does equal work Fiss each filter 2 times to occupy entire chip Phase 2: Data Parallelize Data Parallelize for 4 cores Splitter Splitter Splitter BandPass BandPass Compress Compress Process Process Expand Expand BandPass BandPass Compress Compress Process Process Expand Expand Joiner Joiner Splitter Splitter BandStop BandStop BandStop BandStop Joiner Joiner Joiner Splitter Adder Adder Adder Adder Joiner 28 • Task-conscious data parallelization – Preserve task parallelism • Benefits: – Reduces global communication and synchronization Task parallelism, each filter does equal work Fiss each filter 2 times to occupy entire chip Evaluation: Coarse-Grained Data Parallelism Task Fine-Grained Data Coarse-Grained Task + Data 19 Throughput Normalized to Single Core StreamIt 18 17 Good Parallelism! Low Synchronization! 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 29 n r M ea ad a G eo m et ri c R r co de Vo r de ec o G 2D nt rp e Se ad io FM R Fi lte rb an k T FF ES D CT D TD E M PE C ha n ne Bi to lV oc ni c od So er rt 0 Simplified Vocoder Splitter 6 AdaptDFT AdaptDFT 6 Joiner RectPolar 20 Data Parallel Splitter Splitter 2 UnWrap Unwrap 2 1 Diff Diff 1 1 Amplify Amplify 1 1 Accum Accum 1 Data Parallel, but too little work! Joiner Joiner PolarRect 30 20 Data Parallel Target a 4 core machine Data Parallelize Splitter 6 AdaptDFT AdaptDFT 6 Joiner Splitter RectPolar RectPolar RectPolar RectPolar 20 5 Joiner Splitter Splitter 2 UnWrap Unwrap 2 1 Diff Diff 1 1 Amplify Amplify 1 1 Accum Accum 1 Joiner Joiner Splitter RectPolar RectPolar RectPolar PolarRect 20 5 Joiner 31 Target a 4 core machine Data + Task Parallel Execution Splitter 6 6 Cores Joiner Splitter 5 Joiner Splitter Splitter 2 2 1 1 1 1 1 1 Time 21 Joiner Joiner Splitter 5 RectPolar Joiner 32 Target 4 core machine We Can Do Better! Splitter 6 6 Cores Joiner Splitter 5 Joiner Splitter Splitter 2 2 1 1 1 1 1 1 Time 16 Joiner Joiner Splitter 5 RectPolar Joiner 33 Target 4 core machine Phase 3: Coarse-Grained Software Pipelining Prologue New Steady State RectPolar RectPolar • New steady-state is free of dependencies • Schedule new steady-state using a greedy partitioning 34 RectPolar RectPolar Greedy Partitioning Cores To Schedule: Time 35 16 Target 4 core machine Task Coarse-Grained Task + Data 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 n r M ea ad a c et ri m r co de Vo R G eo M PE G 2D ec o de r TD E nt rp e Se ad io FM R k rb an FF ES D CT T Fi lte lV oc ne ha n C D er od So ni c Bi to 36 Fine-Grained Data Coarse-Grained Task + Data + Software Pipeline Best Parallelism! Lowest Synchronization! rt Throughput Normalized to Single Core StreamIt Evaluation: Coarse-Grained Task + Data + Software Pipelining Generalizing to Other Multicores • Architectural requirements: – Compiler controlled local memories with DMA – Efficient implementation of scatter/gather • To port to other architectures, consider: – Local memory capacities – Communication to computation tradeoff • Did not use processor-to-processor communication on Raw 37 Related Work • Streaming languages: – Brook [Buck et al. ’04] – StreamC/KernelC [Kapasi ’03, Das et al. ’06] – Cg [Mark et al. ‘03] – SPUR [Zhang et al. ‘05] • Streaming for Multicores: – Brook [Liao et al., ’06] • Ptolemy [Lee ’95] • Explicit parallelism: – OpenMP, MPI, & HPF 38 Conclusions • Streaming model naturally exposes task, data, and pipeline parallelism • This parallelism must be exploited at the correct granularity and combined correctly Task Fine-Grained Coarse-Grained Data Task + Data Coarse-Grained Task + Data + Software Pipeline Parallelism Not matched Good Good Best Synchronization Not matched High Low Lowest • Good speedups across varied benchmark suite • Algorithms should be applicable across multicores 39