MIT OpenCourseWare http://ocw.mit.edu
6.189 Multicore Programming Primer, January (IAP) 2007

Please use the following citation format: Saman Amarasinghe, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike. Note: Please use the actual date you accessed this material in your citation. For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms

6.189 IAP 2007, Lecture 12: The StreamIt Parallelizing Compiler
Prof. Saman Amarasinghe, MIT

Common Machine Language
● Represent common properties of architectures
  • Necessary for performance
● Abstract away differences in architectures
  • Necessary for portability
● Cannot be too complex
  • Must keep in mind the typical programmer
● C and Fortran were the common machine languages for uniprocessors
  • Imperative languages are not the correct abstraction for parallel architectures
● What is the correct abstraction for parallel multicore machines?

Common Machine Language for Multicores
● Current offerings:
  • OpenMP
  • MPI
  • High Performance Fortran
● Explicit parallel constructs grafted onto an imperative language
● Language features obscured:
  • Composability
  • Malleability
  • Debugging
● Huge additional burden on the programmer:
  • Introducing parallelism
  • Correctness of parallelism
  • Optimizing parallelism

Explicit Parallelism
● The programmer controls the details of parallelism!
● Granularity decisions:
  • If too small, lots of synchronization and thread creation
  • If too large, bad locality
● Load-balancing decisions
  • Create balanced parallel sections (not data-parallel)
● Locality decisions
  • Sharing and communication structure
● Synchronization decisions
  • Barriers, atomicity, critical sections, order, flushing
● For mass adoption, we need a better paradigm:
  • Where the parallelism is natural
  • That exposes the necessary information to the compiler

Unburden the Programmer
● Move these decisions to the compiler!
  • Granularity
  • Load balancing
  • Locality
  • Synchronization
● Hard to do in traditional languages
  • Can a novel language help?

Properties of Stream Programs
(Figure: an example stream graph — AtoD → FMDemod → Duplicate splitter → LPF1/LPF2/LPF3 → HPF1/HPF2/HPF3 → RoundRobin joiner → Adder → Speaker.)
● Regular and repeating computation
● Synchronous data flow
● Independent actors with explicit communication
● Data items have short lifetimes
Benefits:
● Naturally parallel
● Exposes dependencies to the compiler
● Enables powerful transformations

Outline
● Why do we need new languages?
● Static schedule
● Three types of parallelism
● Exploiting parallelism

Steady-State Schedule
● All data pop/push rates are constant
● Can find a steady-state schedule
  • The number of items in the buffers is the same before and after executing the schedule
  • There exists a unique minimum steady-state schedule
● Example pipeline: A (push=2) → B (pop=3, push=1) → C (pop=2)
● Firing the filters one at a time builds up the schedule { A, A, B, A, B, C }
(A small sketch of computing these firing counts follows below.)
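The repetition counts behind that schedule follow directly from the push/pop rates by solving the balance equations. Below is a minimal sketch, not from the original slides, for the A → B → C pipeline above; the `pipeline` description and function name are invented for illustration.

```python
from fractions import Fraction
from math import lcm

# Hypothetical description of the pipeline on the slide: (name, pop, push).
# A pushes 2 items per firing, B pops 3 and pushes 1, C pops 2.
pipeline = [("A", 0, 2), ("B", 3, 1), ("C", 2, 0)]

def steady_state_multiplicities(filters):
    """Solve the balance equations rate[i] * push[i] == rate[i+1] * pop[i+1],
    then scale to the smallest integer repetition vector."""
    rates = [Fraction(1)]
    for (_, _, push_up), (_, pop_down, _) in zip(filters, filters[1:]):
        # The downstream filter must consume exactly what upstream produces.
        rates.append(rates[-1] * Fraction(push_up, pop_down))
    scale = lcm(*(r.denominator for r in rates))
    return {name: int(r * scale) for (name, _, _), r in zip(filters, rates)}

print(steady_state_multiplicities(pipeline))  # {'A': 3, 'B': 2, 'C': 1}
```

With these multiplicities (A fires 3 times, B twice, C once), any ordering that respects buffer availability — such as { A, A, B, A, B, C } — leaves every buffer exactly as it started.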
Initialization Schedule
● When peek > pop, the buffer cannot be empty after firing the filter
  (Example: a filter with peek=4, pop=3, push=1 always leaves at least one item in its input buffer.)
● Buffers are therefore not empty at the beginning/end of the steady-state schedule
● Need to fill the buffers before starting the steady-state execution

Outline
● Why do we need new languages?
● Static schedule
● Three types of parallelism
● Exploiting parallelism

Types of Parallelism
(Figure: a stream graph with a scatter/gather pair; task, data, and pipeline parallelism run along its different axes.)
● Task parallelism
  • Parallelism explicit in the algorithm
  • Between filters without a producer/consumer relationship
● Data parallelism
  • Between iterations of a stateless filter
  • Peel iterations of the filter and place them within a scatter/gather pair (fission)
  • Cannot parallelize filters with state
● Pipeline parallelism
  • Between producers and consumers
  • Stateful filters can be parallelized
(A small sketch of fission follows below.)
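Fission is the mechanism behind data parallelism here: a stateless filter is duplicated behind a round-robin scatter and in front of a round-robin gather, so the parallel copies together reproduce the sequential output. The sketch below is illustrative only (plain Python rather than StreamIt); the filter `scale` and the helper names are invented.

```python
# A minimal sketch of fission: duplicate a stateless work function n ways
# behind a round-robin scatter and in front of a round-robin gather.

def scale(x):
    # Hypothetical stateless filter: pop 1, push 1, no state across firings.
    return 2 * x

def fiss(work, n, stream):
    buffers = [[] for _ in range(n)]
    for i, item in enumerate(stream):                       # round-robin scatter
        buffers[i % n].append(item)
    results = [[work(x) for x in buf] for buf in buffers]   # n independent copies
    out = []
    for i in range(len(stream)):                            # round-robin gather
        out.append(results[i % n].pop(0))
    return out

# The fissed pipeline matches the sequential filter exactly.
assert fiss(scale, 4, list(range(8))) == [scale(x) for x in range(8)]
```

A stateful filter cannot be split this way, because each copy would need the state produced by the firings that went to the other copies.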
Types of Parallelism — Traditionally
● Task parallelism
  • Thread (fork/join) parallelism
● Data parallelism
  • Data-parallel loop (forall)
● Pipeline parallelism
  • Usually exploited in hardware

Outline
● Why do we need new languages?
● Static schedule
● Three types of parallelism
● Exploiting parallelism

Baseline 1: Task Parallelism
(Figure: the filterbank benchmark — a Splitter feeding two BandPass → Compress → Process → Expand → BandStop pipelines, rejoined by a Joiner and summed by an Adder.)
● Inherent task parallelism between the two processing pipelines
● Task-parallel model:
  • Only parallelize explicit task parallelism
  • Fork/join parallelism (see the fork/join sketch below)
● Executing this on a 2-core machine gives ~2x speedup over a single core
● What about 4, 16, 1024, … cores?

Evaluation: Task Parallelism
● Target: the Raw microprocessor
  • 16 in-order, single-issue cores with D$ and I$
  • 16 memory banks, each bank with DMA
  • Cycle-accurate simulator
● Parallelism: not matched to the target!
● Synchronization: not matched to the target!
(Figure, image by MIT OpenCourseWare: throughput normalized to a single StreamIt core for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean.)

Baseline 2: Fine-Grained Data Parallelism
(Figure: every filter in both pipelines is fissed four ways and wrapped in its own splitter/joiner pair.)
● Each of the filters in the example is stateless
● Fine-grained data-parallel model:
  • Fiss each stateless filter N ways (N is the number of cores)
  • Remove scatter/gather if possible
● We can introduce data parallelism
  • Example: 4 cores
● Each fission group occupies the entire machine

Evaluation: Fine-Grained Data Parallelism
● Good parallelism! Too much synchronization!
(Figure, image by MIT OpenCourseWare: throughput normalized to a single StreamIt core, task vs. fine-grained data parallelism, across the same benchmark suite.)
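For reference, Baseline 1's fork/join task parallelism amounts to running the two independent band pipelines as separate tasks and joining them before the Adder. A minimal sketch, using Python threads and invented stand-in functions rather than the actual benchmark code:

```python
# Illustrative fork/join sketch of Baseline 1 (not the StreamIt runtime):
# the two band pipelines share no producer/consumer edge, so they can run
# as independent tasks and be joined before the Adder.
from concurrent.futures import ThreadPoolExecutor

def band_pipeline(samples, gain):
    # Stand-in for BandPass -> Compress -> Process -> Expand -> BandStop.
    return [gain * x for x in samples]

def adder(a, b):
    return [x + y for x, y in zip(a, b)]

samples = list(range(1024))
with ThreadPoolExecutor(max_workers=2) as pool:   # fork: one task per branch
    low = pool.submit(band_pipeline, samples, 1)
    high = pool.submit(band_pipeline, samples, 2)
    out = adder(low.result(), high.result())      # join, then the Adder
print(out[:4])
```

With only two explicit branches, this model tops out at roughly 2x no matter how many cores are available, which is exactly the mismatch the evaluation above shows.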
Baseline 3: Hardware Pipeline Parallelism
(Figure: the two pipelines are fused into a few coarse filters, one per core.)
● The BandPass and BandStop filters contain all the work
● Hardware pipelining:
  • Use a greedy algorithm to fuse adjacent filters
  • Want #filters <= #cores
● Example: 8 cores
● The resultant stream graph is mapped to hardware
  • One filter per core
● What about 4, 16, 1024, … cores?
  • Performance depends on fusing to a load-balanced stream graph

Evaluation: Hardware Pipeline Parallelism
● Parallelism: not matched to the target!
● Synchronization: not matched to the target!
(Figure, image by MIT OpenCourseWare: throughput normalized to a single StreamIt core — task, fine-grained data, and hardware pipelining — across the benchmark suite.)

The StreamIt Compiler
1. Coarsen granularity
2. Data parallelize
3. Software pipeline
● Coarsen: fuse stateless sections of the graph
● Data parallelize: parallelize stateless filters
● Software pipeline: parallelize stateful filters
● Compiling to a 16-core architecture gives an 11.2x mean throughput speedup over a single core

Phase 1: Coarsen the Stream Graph
(Figure: in each pipeline, BandPass, Compress, Process, and Expand are fused into one coarse filter; the peeking BandStop filters are not fused with anything upstream.)
● Before data parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
  • Don't fuse stateless with stateful
  • Don't fuse a peeking filter with anything upstream
● Benefits:
  • Reduces global communication and synchronization
  • Exposes inter-node optimization opportunities
(A small sketch of this fusion rule follows below.)
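The following sketch walks a pipeline and fuses runs of adjacent filters under the two rules above. The data structures are invented for illustration, not the compiler's own: each filter is described as (name, stateful, peeks), where peeks means peek > pop.

```python
# Illustrative coarsening sketch: fuse adjacent stateless filters, never
# fusing state in and never fusing a peeking filter with anything upstream.
pipeline = [("BandPass", False, True), ("Compress", False, False),
            ("Process", False, False), ("Expand", False, False),
            ("BandStop", False, True)]

def coarsen(filters):
    groups = []
    for f in filters:
        _, stateful, peeks = f
        can_fuse = (groups                                    # has an upstream group
                    and not stateful                          # don't fuse stateless with stateful
                    and not any(s for _, s, _ in groups[-1])  # upstream group must be stateless
                    and not peeks)                            # a peeking filter starts its own group
        if can_fuse:
            groups[-1].append(f)
        else:
            groups.append([f])
    return [[name for name, _, _ in g] for g in groups]

print(coarsen(pipeline))
# [['BandPass', 'Compress', 'Process', 'Expand'], ['BandStop']]
```

Note that BandPass itself peeks, but that only prevents fusing it with anything upstream; the stateless, non-peeking filters downstream of it can still be pulled into its group, which is the coarse filter shown in the figure.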
Phase 2: Data Parallelize
(Figure: data parallelization for 4 cores, applied to the coarsened graph.)
● Data parallelize for 4 cores
● The Adder is fissed 4 ways to occupy the entire chip
● The two fused pipelines already exhibit task parallelism and each fused filter does equal work, so each is fissed only 2 times to occupy the entire chip
● Task-conscious data parallelization
  • Preserve task parallelism
● Benefits:
  • Reduces global communication and synchronization

Evaluation: Coarse-Grained Data Parallelism
● Good parallelism! Low synchronization!
(Figure, image by MIT OpenCourseWare: throughput normalized to a single StreamIt core — task, fine-grained data, hardware pipelining, and coarse-grained task + data — across the benchmark suite.)

Simplified Vocoder
(Figure, with per-filter work estimates: Splitter → two AdaptDFT filters (6 each) → Joiner → RectPolar (20, data parallel) → two Unwrap (2) → Diff (1) → Amplify (1) → Accum (1) pipelines (data parallel, but too little work!) → Joiners → PolarRect (20, data parallel). Target: a 4-core machine.)

Data Parallelize
(Figure: RectPolar (20) is fissed 4 ways into copies of work 5 each behind a splitter/joiner, and PolarRect (20) is fissed the same way. Target: a 4-core machine.)

Data + Task Parallel Execution
(Figure: the data- and task-parallelized graph mapped onto the 4 cores; the steady state takes 21 time units.)

We Can Do Better!
(Figure: a tighter packing of the same filters onto the 4 cores would take only 16 time units.)

Phase 3: Coarse-Grained Software Pipelining
(Figure: a prologue schedule fills the buffers; in the new steady state, filters from different iterations — for example the RectPolar copies — execute together.)
● The new steady state is free of dependencies
● Schedule the new steady state using a greedy partitioning

Greedy Partitioning
(Figure: the filters to schedule are packed greedily onto the 4 cores, giving a steady state of 16 time units. Target: a 4-core machine. A sketch follows after the evaluation below.)

Evaluation: Coarse-Grained Task + Data + Software Pipelining
● Best parallelism! Lowest synchronization!
(Figure, image by MIT OpenCourseWare: throughput normalized to a single StreamIt core — task, fine-grained data, hardware pipelining, coarse-grained task + data, and coarse-grained task + data + software pipelining — across the benchmark suite.)
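The slides only say "greedy partitioning"; one natural reading is a longest-processing-time heuristic, sketched below under that assumption (the exact compiler algorithm may differ). Once software pipelining removes dependences within the steady state, filters can be packed onto cores purely by estimated work. The work estimates here are invented, loosely following the vocoder example above.

```python
import heapq

def greedy_partition(work, n_cores):
    """work: {filter_name: estimated work}. Returns {core id: [filter names]}."""
    cores = [(0, i, []) for i in range(n_cores)]     # (load, core id, assigned filters)
    heapq.heapify(cores)
    for name, cost in sorted(work.items(), key=lambda kv: -kv[1]):
        load, i, assigned = heapq.heappop(cores)     # least-loaded core so far
        heapq.heappush(cores, (load + cost, i, assigned + [name]))
    return {i: assigned for _, i, assigned in sorted(cores, key=lambda c: c[1])}

# Hypothetical work estimates for the fissed vocoder filters.
work = {"AdaptDFT_L": 6, "AdaptDFT_R": 6,
        "RectPolar_1": 5, "RectPolar_2": 5, "RectPolar_3": 5, "RectPolar_4": 5,
        "Unwrap..Accum_L": 5, "Unwrap..Accum_R": 5,
        "PolarRect_1": 5, "PolarRect_2": 5, "PolarRect_3": 5, "PolarRect_4": 5}
print(greedy_partition(work, 4))
```

Because the partitioner no longer has to respect producer/consumer ordering within one steady-state iteration, it can balance load across cores far more tightly than the data + task mapping above.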
Summary
● The streaming model naturally exposes task, data, and pipeline parallelism
● This parallelism must be exploited at the correct granularity and combined correctly:
  • Task: parallelism and synchronization are application dependent
  • Fine-grained data: good parallelism, high synchronization
  • Hardware pipelining: parallelism and synchronization are application dependent
  • Coarse-grained task + data: good parallelism, low synchronization
  • Coarse-grained task + data + software pipeline: best parallelism, lowest synchronization
● Robust speedups across a varied benchmark suite