Compiling with multicore
Jeehyung Lee
15-745 Spring 2009

Papers
• Automatic Thread Extraction with Decoupled Software Pipelining
    • Fully automatic
    • Fine-grained pipelining
• A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
    • Semi-automatic
    • Coarse-grained pipelining

First paper
Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler and David August
From Princeton University

What is the paper about?
• Despite the increasing use of multiprocessors, many single-threaded applications do not benefit
• Let the compiler automatically extract threads and exploit lurking pipeline parallelism
• Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)

Why decoupled pipelining?
• Example: linked list traversal

Why decoupled pipelining?
• DOACROSS
• Iterations * (LD latency + communication latency)

Why decoupled pipelining?
• DSWP
• One-way pipelining
• Iterations * LD latency

DSWP
• Flow of data (dependency) is acyclic among cores
• With the use of an inter-core queue, threads can be decoupled
• Efficiency + high tolerance for latency

DSWP Algorithm
1. Build dependence graph
2. Find strongly connected components (SCC)
3. Create DAG of SCCs
4. Partition DAG
5. Split code into partitions
6. Add flows to partitions

Build dependence graph
• Include every traditional dependence (data, control, and memory) & extensions

Find SCC
• SCC: instructions that form a dependency cycle in a loop
• Instructions in an SCC cannot be parallelized (they must stay in the same thread)

Create DAG of SCCs
• Merge the instructions within each SCC into one node and update the dependency arrows

Partition DAG
• Partition the DAG nodes into n partitions (n <= # of processors)
• Use a heuristic to maximize load balance
    • Decide the # of partitions (threads)
    • Start filling in from partition 1 with nodes from the top of the DAG
    • When the partition is full (estimated by # of cycles), move on to the next partition
• Find the best # of threads and its partitioning

Split code and insert flows (done!)
• For each partition, insert the basic blocks relevant to the SCC nodes it contains
• Add code for the dependency flows

Result
• 19.4% speedup on important benchmark loops, 9.2% overall
• When core bandwidth is halved
    • Single-threaded code slows down by 17.1%
    • DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
• Promising enabler for Thread-Level Parallelism (TLP)?

Second paper
A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
William Thies, Vikram Chandrasekhar and Saman Amarasinghe
From MIT

What is the paper about?
• Despite increasing uses of multiprocessors, many single-threaded… (repeated from the first paper)
• Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
• Let people define stages, and learn practical dependencies at runtime
• …for streaming applications

Interface
• Add annotations in the body of the top-level loop

Dynamic analysis
• The system creates a stream graph according to the annotations
• How do they find dependencies?

Dynamic analysis
• Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages

Dynamic analysis
• Run the application on training examples, and record every relevant store-load pair across pipeline boundaries
• This gives us practical dependencies

Interface
• The program shows a complete stream graph
• The user decides whether they like this pipelining or not
    • If yes, done!
    • Else, redo the annotations
• Iterate until satisfied

Actual pipelining
• When compiled, the annotation macros emit code that forks the original program for each pipeline stage

Result
• Average 2.78x speedup, max 3.89x on 4 cores
• Seems unsound but practical (?)
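
Looking back at the first paper's motivating example (linked list traversal), the idea of decoupling two threads with a one-way queue can be sketched roughly as below. This is only an illustration in Python, not the paper's compiler output: `Node`, `build_list`, `traverse_stage`, `work_stage` and `run_pipeline` are hypothetical names, and a software `queue.Queue` stands in for the inter-core queue.

```python
import queue
import threading


class Node:
    """Singly linked list node (hypothetical example type)."""
    def __init__(self, val, next=None):
        self.val = val
        self.next = next


def build_list(values):
    """Build a linked list from a Python list and return its head."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def traverse_stage(head, q):
    """Stage 1: the pointer-chasing recurrence (node = node.next)."""
    node = head
    while node is not None:
        q.put(node)        # one-way flow: stage 1 never waits on stage 2's results
        node = node.next
    q.put(None)            # sentinel marks the end of the loop


def work_stage(q, result):
    """Stage 2: the per-node work, fed only through the queue."""
    total = 0
    while True:
        node = q.get()
        if node is None:
            break
        total += node.val
    result.append(total)


def run_pipeline(head, queue_size=32):
    """Run both stages as threads connected by a bounded queue."""
    q = queue.Queue(maxsize=queue_size)
    result = []
    stages = [
        threading.Thread(target=traverse_stage, args=(head, q)),
        threading.Thread(target=work_stage, args=(q, result)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]


total = run_pipeline(build_list([1, 2, 3, 4]))
print(total)
```

Because data only flows from the traversal stage to the work stage, the queue can buffer several nodes and absorb communication latency, which is the point of the DOACROSS vs. DSWP comparison.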
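
The DSWP algorithm steps from the first paper (find SCCs, build the DAG, partition it) can be sketched as follows. This is a minimal sketch under assumptions, not the paper's implementation: the dependence graph is given as a plain dictionary, the per-instruction cycle estimates in `cycles` are made up, and the greedy fill is a simplified stand-in for the load-balancing heuristic. The instruction names `A` to `D` are a hypothetical encoding of a list-traversal loop (`A`: loop branch, `B`: load value, `C`: accumulate, `D`: load next pointer).

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps node -> list of successors.
    SCCs come out in reverse topological order of the SCC DAG."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def partition_pipeline(graph, cycles, num_threads):
    """Fill partitions greedily from the top of the SCC DAG."""
    topo = list(reversed(strongly_connected_components(graph)))
    target = sum(cycles.values()) / num_threads   # ideal load per partition
    stages, load = [[]], 0
    for scc in topo:
        cost = sum(cycles[i] for i in scc)
        if stages[-1] and load + cost > target and len(stages) < num_threads:
            stages.append([])           # current partition is full, open the next
            load = 0
        stages[-1].extend(scc)          # an SCC is never split across partitions
        load += cost
    return stages


def cross_stage_flows(graph, stages):
    """Dependences that cross partitions and therefore need queue flows."""
    stage_of = {i: s for s, instrs in enumerate(stages) for i in instrs}
    return sorted({(stage_of[u], stage_of[v], u, v)
                   for u, succs in graph.items() for v in succs
                   if stage_of[u] != stage_of[v]})


# Hypothetical dependence graph and cycle estimates for the traversal loop.
graph = {"A": ["B", "C", "D"], "B": ["C"], "C": ["C"], "D": ["A", "B", "D"]}
cycles = {"A": 1, "B": 3, "C": 1, "D": 3}

stages = partition_pipeline(graph, cycles, num_threads=2)
flows = cross_stage_flows(graph, stages)
print(stages)
print(flows)
```

Since SCCs are assigned in topological order, every cross-partition dependence points from an earlier partition to a later one, so the flow among threads stays acyclic.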
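
The dynamic analysis in the second paper (recording store-load pairs across pipeline boundaries during training runs) can be illustrated with a small sketch. It assumes a memory trace is already available as `(stage, op, address)` tuples; how the real tool collects that trace is not shown here, and the stage names, addresses and the function `record_cross_stage_dependences` are hypothetical.

```python
from collections import defaultdict


def record_cross_stage_dependences(trace):
    """Return {(producer_stage, consumer_stage): set of addresses}.

    trace is an iterable of (stage, op, address) with op "store" or "load".
    Only loads that read a value last stored by a different stage count.
    """
    last_writer = {}                 # address -> stage that last stored to it
    deps = defaultdict(set)
    for stage, op, addr in trace:
        if op == "store":
            last_writer[addr] = stage
        elif op == "load":
            producer = last_writer.get(addr)
            if producer is not None and producer != stage:
                deps[(producer, stage)].add(addr)
    return dict(deps)


# One loop iteration of a made-up three-stage streaming program.
iteration = [
    ("read", "store", 0x100),        # read stage fills an input buffer
    ("transform", "load", 0x100),    # transform stage consumes it
    ("transform", "store", 0x200),   # and writes an intermediate result
    ("write", "load", 0x200),        # write stage consumes the result
    ("write", "load", 0x100),        # and also peeks at the raw input
]

# "Training run": replay the same pattern for two iterations.
deps = record_cross_stage_dependences(iteration * 2)
for (src, dst), addrs in sorted(deps.items()):
    print(src, "->", dst, sorted(hex(a) for a in addrs))
```

The recorded pairs are only the dependences that showed up on the training inputs, which is why the approach is practical but not sound.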