Compiling with multicore
Jeehyung Lee
15-745 Spring 2009

Papers
• Automatic Thread Extraction with Decoupled Software Pipelining
    • Fully automatic
    • Fine-grained pipelining
• A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
    • Semi-automatic
    • Coarse-grained pipelining

First paper
• Automatic Thread Extraction with Decoupled Software Pipelining
• Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August
• From Princeton University

What is the paper about?
• Despite the increasing use of multiprocessors, many single-threaded applications do not benefit
• Let the compiler automatically extract threads and exploit lurking pipeline parallelism
• Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)

Why decoupled pipelining?
• Example: linked list traversal

Why decoupled pipelining?
• DOACROSS
• Iteration * (LD latency + communication latency)

Why decoupled pipelining?
• DSWP
• One-way pipelining
• Iteration * LD latency

DSWP
• Flow of data (dependency) is acyclic among cores
• With the use of an inter-core queue, threads can be decoupled
• Efficiency + high tolerance for latency

DSWP Algorithm
• Build dependence graph
• Find strongly connected components (SCCs)
• Create DAG of SCCs
• Partition DAG
• Split code into partitions
• Add flows to partitions

Build dependence graph
• Include every traditional dependence (data, control, and memory) & extensions

Find SCC
• SCC: instructions that form a dependency cycle in a loop
• Instructions in an SCC cannot be parallelized

Create DAG of SCCs
• Merge the instructions within each SCC and update the dependency arrows

Partition DAG
• Partition DAG nodes into n partitions (n <= # of processors)
• Use a heuristic to maximize load balance
    • Decide # of partitions (threads)
    • Start filling in from partition 1 with nodes from the top of the DAG
    • When the partition is full (estimated by # of cycles), move on to the next partition
    • Find the best # of threads and its partition

Split code and insert flows (done!)
• For each partition, insert the basic blocks relevant to the SCC nodes it contains
• Add code for the dependency flows

Result
• 19.4% speedup on important benchmark loops, 9.2% overall
• When core bandwidth is halved
    • Single-threaded code slows down by 17.1%
    • DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
• Promising enabler for Thread-Level Parallelism (TLP)?

Second paper
• A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
• William Thies, Vikram Chandrasekhar, and Saman Amarasinghe
• From MIT

What is the paper about?
• Despite increasing uses of multiprocessors, many single threaded… (Repeated)
• Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
• Let people define the pipeline stages, and learn practical dependencies at run time
• …for streaming applications

Interface
• Add annotations in the body of the top-level loop

Dynamic analysis
• The system creates a stream graph according to the annotations
• How do they find dependencies?

Dynamic analysis
• Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages

Dynamic analysis
• Run the application on training examples, and record every relevant store-load pair across pipeline boundaries
• This gives us the practical dependencies

Interface
• The program shows a complete stream graph
• The user decides whether he/she likes this pipelining or not
    • If yes, done!
    • Else, redo the annotations
• Iterate until satisfied

Actual pipelining
• When compiled, the annotation macros emit code that forks the original program for each pipeline stage

Result
• Average 2.78x speedup, max 3.89x on 4 cores
• Seems unsound but practical (?)
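
Appendix: a sketch of the DSWP partitioning steps

The DSWP steps from the first paper (find SCCs, build the DAG of SCCs, fill partitions from the top of the DAG, add flows) can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the four instruction names for a linked list traversal loop, the dependence edges, the per-instruction cycle costs, the use of Tarjan's algorithm to find SCCs, and the simple "average load" budget. It also takes the number of threads as given rather than searching for the best one.

```python
from collections import defaultdict

# Hypothetical loop body: while (node) { val = node->val; sum += val; node = node->next; }
NODES = ["test", "next", "load", "add"]
EDGES = [  # (source, destination) dependence arrows, assumed for illustration
    ("next", "test"), ("test", "next"), ("next", "next"),
    ("test", "load"), ("next", "load"),
    ("test", "add"), ("load", "add"), ("add", "add"),
]
COST = {"test": 1, "next": 3, "load": 3, "add": 1}  # assumed cycle estimates


def find_sccs(nodes, edges):
    """Tarjan's algorithm (recursive, fine for small graphs).

    Returns SCCs in reverse topological order of the SCC DAG (sinks first).
    """
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC: pop it off the stack
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in nodes:
        if v not in index:
            visit(v)
    return sccs


def partition(sccs, cost, n_threads):
    """Greedily fill partitions with SCCs taken from the top of the DAG."""
    topo = list(reversed(sccs))  # sources first
    budget = sum(cost[v] for comp in topo for v in comp) / n_threads
    parts = [[] for _ in range(n_threads)]
    load = [0] * n_threads
    p = 0
    for comp in topo:
        c = sum(cost[v] for v in comp)
        # Move on once the current partition would exceed its cycle budget.
        if parts[p] and load[p] + c > budget and p < n_threads - 1:
            p += 1
        parts[p].append(comp)
        load[p] += c
    return parts, load


def cross_partition_flows(parts, edges):
    """Dependence arrows that cross partitions and so need a queue."""
    where = {v: p for p, comps in enumerate(parts) for comp in comps for v in comp}
    return sorted((s, d) for s, d in edges if where[s] != where[d])


sccs = find_sccs(NODES, EDGES)
parts, load = partition(sccs, COST, n_threads=2)
flows = cross_partition_flows(parts, EDGES)
print(parts, load, flows)
```

Because SCCs are assigned in topological order and partitions are filled one after another, every flow in this sketch goes from a lower-numbered partition to a higher-numbered one, which is the acyclic, one-way communication that DSWP relies on. In this example the pointer-chasing cycle (test, next) ends up alone in the first partition and the load and add land in the second.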
