Fine-grain Task Aggregation and Coordination on GPUs
Marc S. Orr†§, Bradford M. Beckmann§, Steven K. Reinhardt§, David A. Wood†§
ISCA, June 16, 2014

Executive Summary
SIMT languages (e.g., CUDA and OpenCL) restrict GPU programmers to regular parallelism
‒Compare to Pthreads, Cilk, MapReduce, TBB, etc.
Goal: enable irregular parallelism on GPUs
‒Why? More GPU applications
‒How? Fine-grain task aggregation
‒What? Cilk on GPUs

Outline
Background
‒GPUs
‒Cilk
‒Channel Abstraction
Our Work
‒Cilk on Channels
‒Channel Design
Results/Conclusion

GPUs Today
GPU tasks are scheduled by the control processor (CP), a small, in-order programmable core
Today’s GPU abstractions are coarse-grain
+ Map well to SIMD hardware
- Limit fine-grain scheduling

Cilk Background
Cilk extends C for divide-and-conquer parallelism
Adds keywords:
‒spawn: schedule a thread to execute a function
‒sync: wait for prior spawns to complete

    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
        sync;
        return (x + y);
    }

Prior Work on Channels
The CP, or aggregator (agg), manages channels in system memory
Channels are finite task queues, except:
1. User-defined scheduling
2. Dynamic aggregation
3.
One consumption function

Dynamic aggregation enables “CPU-like” scheduling abstractions on GPUs

Enable Cilk on GPUs via Channels: Step 1
Cilk routines are split at each sync into a “pre-sync” sub-routine and a “continuation”:

    /* original: */
    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);   /* “pre-sync” */
        int y = spawn fib(n - 2);
        sync;
        return (x + y);             /* “continuation” */
    }

    /* after splitting: */
    int fib(int n) {
        if (n <= 2) return 1;
        int x = spawn fib(n - 1);
        int y = spawn fib(n - 2);
    }
    int fib_cont(int x, int y) {
        return (x + y);
    }

Enable Cilk on GPUs via Channels: Step 2
Channels track the task graph: task A spawns task B; task B (a “continuation”) depends on task A; a “pre-sync” task becomes ready, runs, and is marked done
Channels are instantiated for breadth-first traversal
‒Quickly populates the GPU’s tens of thousands of lanes
‒Facilitates coarse-grain dependency management
(Figure: fib and fib_cont channels holding task arguments, compared to a stack)

Bound Cilk’s Memory Footprint
‒The amount of work generated dynamically is not known a priori
Bound memory to the depth of the Cilk tree by draining the channels closest to the base case
We propose that GPUs allow SIMT threads to yield
‒Facilitates resolving conflicts on shared resources like memory

Channel Implementation
See paper; our design accommodates SIMT access patterns:
+ array-based
+ lock-free
+ non-blocking
Results/Conclusion

Methodology
Implemented Cilk on channels on a simulated APU
‒Caches are sequentially consistent
‒Aggregator schedules Cilk tasks

Cilk Scales with the GPU Architecture
(Figure: normalized execution time for Fibonacci, Queens, Sort, and Strassen at 1, 2, 4, and 8 CUs; more compute units yield faster execution)

Conclusion
We observed that dynamic aggregation enables new GPU programming languages and abstractions
We enabled dynamic aggregation by extending the GPU’s control processor to manage channels
We found that breadth-first scheduling works well for Cilk on GPUs
We proposed that GPUs allow SIMT threads to yield, to support breadth-first scheduling
Future work should focus on how the control processor can enable more GPU applications

Backup

Divergence and Channels
Branch divergence
(Figure: percent of wavefronts with 1-16, 17-32, 33-48, and 49-64 lanes active for Fibonacci, Queens, Sort, and Strassen)
Memory divergence
+ Data in channels: good
‒Pointers to data in channels: bad

GPU NOT Blocked on Aggregator
(Figure: percent of time the GPU is not blocked on the aggregator for fib, queens, sort, and strassen, with simple 2-way, light OoO 2-way, and OoO 4-way aggregator cores)

GPU Cilk vs.
standard GPU workloads

                   Strassen   Queens
    LOC reduction  42%        36%
    Dispatch rate  13x        12.5x
    Speedup        1.06       0.98

Cilk is more succinct than SIMT languages
Channels trigger more GPU dispatches
Same performance, easier to program

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).
Other names are for informational purposes only and may be trademarks of their respective owners.