Automatic Thread Extraction with Decoupled Software Pipelining

Presented by Jeremy Cutler, with thanks to Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August
Liberty Research Group
Department of Computer Science, Princeton University
http://www.liberty-research.org

A Fundamental Change…
• Transistor trend continues…
• Clock rate limited by:
  • Power delivery
  • Heat dissipation
  • Design complexity
Source: Intel, Wikipedia, Sutter/Dr. Dobb's Journal

The Response: CMP
For:
• legacy apps (C/C++)
• single-threaded
• sequential codes
Speedup over single core: 0.0%
Worse:
• Shared resources (e.g. caches)
• Simple cores trend
Must Extract Thread Parallelism!
[Die photo: IBM Power 5 (1.9GHz). Source: IBM]

Existing Parallelization Approaches (Non-speculative)
Scientific Codes (FORTRAN-like)
    for(i=1; i<=N; i++)             // C
        a[i] = a[i] + 1;            // X
General-purpose Codes (legacy C/C++)
    while(ptr = ptr->next)          // LD
        ptr->val = ptr->val + 1;    // X
[Figure labels: DOALL; DOACROSS [Cytron, ICPP 86]]

Pipelined Parallelism for General-Purpose Codes
    while(ptr = ptr->next)          // LD
        ptr->val = ptr->val + 1;    // X
[Figure labels: DOACROSS; Decoupled Software Pipelining (DSWP)]
• DSWP: generalization of DOPIPE [Davies, UIUC 81]

Comparison: DOALL, DOACROSS, DSWP
                  DOALL           DOACROSS          DSWP
lat(comm) = 1:    1 iter/cycle    1 iter/cycle      1 iter/cycle
lat(comm) = 2:    1 iter/cycle    0.5 iter/cycle    1 iter/cycle

Implementing Decoupled Software Pipelining (DSWP)
    while(ptr = ptr->next)
        ptr->val = ptr->val + 1;
[Figure labels: Thread 1; Thread 2; Loop. Edge legend: register, control]
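The two-thread split of the linked-list loop can be sketched in C as follows. This is a minimal illustration, not the compiler's generated code: it assumes POSIX threads, and it stands in a mutex-protected software ring buffer for the hardware communication queue. The names Node, Queue, QUEUE_CAP, produce, consume, traverse_stage, update_stage, and dswp_increment are hypothetical. Thread 1 keeps the loop-carried pointer chase (LD) to itself; Thread 2 performs the update (X) on each pointer it receives, and a NULL pointer signals loop exit.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct Node { int val; struct Node *next; } Node;

#define QUEUE_CAP 32  /* hypothetical capacity of the software queue */

/* Bounded single-producer/single-consumer queue (software stand-in). */
typedef struct {
    Node *slots[QUEUE_CAP];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} Queue;

static void produce(Queue *q, Node *p) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)            /* block while the queue is full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[q->tail] = p;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static Node *consume(Queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                    /* block while the queue is empty */
        pthread_cond_wait(&q->not_empty, &q->lock);
    Node *p = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return p;
}

typedef struct { Queue *q; Node *head; } StageArgs;

/* Thread 1: the loop-carried pointer chase (LD) never leaves this thread. */
static void *traverse_stage(void *arg) {
    StageArgs *a = arg;
    Node *ptr = a->head;
    while ((ptr = ptr->next) != NULL)
        produce(a->q, ptr);
    produce(a->q, NULL);                     /* forward the loop-exit condition */
    return NULL;
}

/* Thread 2: the update (X), driven only by values read from the queue. */
static void *update_stage(void *arg) {
    StageArgs *a = arg;
    Node *ptr;
    while ((ptr = consume(a->q)) != NULL)
        ptr->val = ptr->val + 1;
    return NULL;
}

/* Same effect as: while (ptr = ptr->next) ptr->val = ptr->val + 1;
   Like that loop, it requires a non-NULL head and skips the head node. */
void dswp_increment(Node *head) {
    assert(head != NULL);
    Queue q = { .head = 0, .tail = 0, .count = 0 };
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    StageArgs args = { &q, head };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, traverse_stage, &args);
    pthread_create(&t2, NULL, update_stage, &args);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
}
```

To try it, link a list of Node values, call dswp_increment on its head, and build with a pthread-enabled compiler (for example cc -pthread). Data flows one way only, from Thread 1 to Thread 2, so the traversal can run ahead by up to QUEUE_CAP nodes without waiting for the updates.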
[Figure labels: Dependence Graph; DAG_SCC; communication queue. Edge legend: memory, intra-iteration, loop-carried]
• Inter-thread communication latency is a one-time cost

Implementing Inter-Thread Control Dependences
[Figure labels: Node Splitting; L1; L2. Edge legend: register, control, memory, intra-iteration, loop-carried]

Handling Arbitrary Control Flow: Control Extensions to Dependence Graph
[Figure: CFG]
• Loop-iteration control dependences
  • Traditional definition of control dependence [Ferrante et al., TOPLAS 87] not appropriate for loops
• Conditional control dependences
  • To implement inter-thread data dependences that may or may not occur
• Multi-threaded code generation from the extended dependence graph

Evaluation
• DSWP implemented in the back-end of IMPACT compiler
• Accurate dual-core Itanium 2 model
• Synchronization Array support for comm./sync.
• ISA extended with produce/consume instructions
• Important application loops selected (16-98% of total execution)

Evaluation
[Chart: % Loop Speedup (y-axis from -10 to 50) per benchmark: 129.compress, 179.art, 181.mcf, 183.equake, 188.ammp, 256.bzip2, adpcmdec, epicdec, jpegenc, wc, GeoMean]

Partitioning and Parallelism
[Figure: 181.mcf DAG_SCC. Four panels plot Queue Occupancy (elements, 0 to 32) against Time (cycles); speedups shown: +45%, +48%, +43%, -2%]
• Currently use a simple load-balancing heuristic

Evaluation: Varying Processor Width
• Modified, half-width Itanium 2 models
[Figure labels: 1-Core; 2-Core (used by DSWP); Full-width; Half-width]
[Chart: % Loop Speedup (y-axis from -60 to 60) per benchmark: 129.compress, 179.art, 181.mcf, 183.equake, 188.ammp, 256.bzip2, adpcmdec, epicdec, jpegenc, wc, GeoMean. Series: Half-width Base, Half-width DSWP, Full-width DSWP]
• On half-width model, speedup from DSWP is larger
• Better performance compatibility
• More effective on simpler cores

What about more threads?
    while(ptr = ptr->next)
        ptr->val = ptr->val + 1;
1. Multiple SCCs
2. DOALL Consumer
[Figure labels: Dep. Graph; Producer; Consumer 1; Consumer 2. Edge legend: register, control, memory, intra-iteration, loop-carried]

Breaking SCCs: Speculative DSWP
• 164.gzip: 38% speedup with 3 threads
• Only one SCC!
• 181.mcf: 2.9x speedup with 4 threads
[Figure annotations: x, x; Mis-speculation detected]

Conclusion
• DSWP extracts pipelined thread-level parallelism from general-purpose, sequential programs
• More applicable than traditional parallelization techniques
• Handles arbitrary control flow
• Future research directions:
  • Additional analyses and optimizations
  • Break dependence cycles: code transformations, speculation
  • Reduce communication
  • Explore DOALL-consumer opportunities
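The DOACROSS and DSWP throughput figures in the comparison (1 vs. 0.5 iter/cycle at lat(comm) = 2) can be reproduced with a small cycle-count model. The sketch below is an illustrative model under stated assumptions, not the evaluation methodology of the talk. It assumes two cores, one cycle each for LD and X, and lat_comm measured from the cycle a producer issues to the earliest cycle a consumer on the other core may issue (so lat_comm = 1 behaves like a local one-cycle dependence). The function names doacross_cycles and dswp_cycles are hypothetical.

```c
#include <assert.h>

static long max_long(long a, long b) { return a > b ? a : b; }

/* DOACROSS on two cores: iteration i runs LD then X on core i % 2.
   The loop-carried LD -> LD dependence crosses cores every iteration,
   so lat_comm sits on the critical path of every iteration. */
long doacross_cycles(long n, long lat_comm) {
    assert(lat_comm >= 1);
    long ld_prev1 = 0, ld_prev2 = 0, finish = 0;
    for (long i = 0; i < n; i++) {
        long ld = 0;
        if (i >= 1)                       /* wait for the pointer from iteration i-1 */
            ld = ld_prev1 + lat_comm;
        if (i >= 2)                       /* wait for this core to finish iteration i-2 */
            ld = max_long(ld, ld_prev2 + 2);
        finish = ld + 2;                  /* LD at cycle ld, X at cycle ld + 1 */
        ld_prev2 = ld_prev1;
        ld_prev1 = ld;
    }
    return finish;
}

/* DSWP on two cores: core 0 runs every LD, core 1 runs every X.
   The LD -> LD dependence stays local; only LD -> X crosses cores,
   so lat_comm delays the pipeline fill and nothing else. */
long dswp_cycles(long n, long lat_comm) {
    assert(lat_comm >= 1);
    long x_prev = -1, finish = 0;
    for (long i = 0; i < n; i++) {
        long ld = i;                      /* one LD per cycle on core 0 */
        long x = max_long(ld + lat_comm, x_prev + 1);
        finish = x + 1;
        x_prev = x;
    }
    return finish;
}
```

For 1000 iterations the model gives 1001 cycles for both schemes at lat_comm = 1. At lat_comm = 2 it gives 2000 cycles for DOACROSS and 1002 for DSWP, which matches the 0.5 and 1 iter/cycle steady-state rates and shows the communication latency as a one-time cost in the DSWP case.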