A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu


Executive Summary

- Observation: Imbalances in execution leave GPU resources underutilized.
- Our Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
- Challenge: How do we efficiently manage and use helper threads in a throughput-oriented architecture?
- Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
  - A new framework to enable helper threading in GPUs
  - Enables flexible data compression to alleviate the memory bandwidth bottleneck
  - Supports a wide set of use cases (e.g., prefetching, memoization)
- Key Result: Using CABA to implement data compression in memory improves performance by 41.7%.


GPUs Today

GPUs are used for a wide range of applications: computer vision, data analytics, scientific simulation, medical imaging, and more.


Challenges in GPU Efficiency

- The memory bandwidth bottleneck leads to idle cores.
- Thread limits lead to an underutilized register file.

(Figure: a GPU streaming multiprocessor with idle cores and a full register file next to the memory hierarchy.)


Motivation: Unutilized On-chip Memory

- 24% of the register file is unallocated on average.
- Similar trends hold for on-chip scratchpad memory.

(Figure: percentage of unallocated registers per workload.)


Motivation: Idle Pipelines

- Memory-bound workloads (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs): 67% of cycles are idle on average.
- Compute-bound workloads (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc): 35% of cycles are idle on average.

(Figure: active vs. stall cycles per workload.)


Motivation: Summary

Heterogeneous application requirements lead to:
- Bottlenecks in execution
- Idle resources


Our Goal

Use idle resources to do something useful: accelerate bottlenecks using helper threads. We propose a flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA).


Helper Threads in GPUs

There is a large body of helper-threading work in CPUs: [Chappell+ ISCA '99, MICRO '02], [Yang+ USC TR '98], [Dubois+ CF '04], [Zilles+ ISCA '01], [Collins+ ISCA '01, MICRO '01], [Aamodt+ HPCA '04], [Lu+ MICRO '05], [Luk+ ISCA '01], [Moshovos+ ICS '01], [Kamruzzaman+ ASPLOS '11], etc. However, GPUs pose new challenges.


Challenge

How do we efficiently manage and use helper threads in a throughput-oriented architecture?


Managing Helper Threads in GPUs

Where do we add helper threads: in software (at thread, warp, or block granularity) or in hardware?


Approach #1: Software-only

- No hardware changes: helper threads are regular threads.
- Coarse-grained: synchronization with regular threads is difficult.
- Not aware of runtime program behavior.

A minimal sketch of this approach appears below.
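As a rough illustration of the software-only approach, the CUDA sketch below (not from the paper; all names, including scaleWithHelperWarp and TILE, are hypothetical) dedicates the last warp of each thread block as a helper that stages data into shared memory for the other warps.

```cuda
// Hypothetical software-only helper warp: the last warp of each block
// stages a tile of `in` into shared memory; the remaining warps compute
// on the staged tile.
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define TILE 256  // elements staged per block

__global__ void scaleWithHelperWarp(const float *in, float *out,
                                    int n, float a) {
    __shared__ float buf[TILE];
    const int helperBase = blockDim.x - WARP_SIZE;   // first helper lane
    const bool isHelper = (threadIdx.x >= helperBase);
    const int tileBase = blockIdx.x * TILE;

    if (isHelper) {
        // Helper warp: stage this block's tile into shared memory.
        for (int i = threadIdx.x - helperBase; i < TILE; i += WARP_SIZE) {
            int g = tileBase + i;
            buf[i] = (g < n) ? in[g] : 0.0f;
        }
    }
    // Coarse-grained: helpers and workers must all reach the same barrier.
    __syncthreads();

    if (!isHelper) {
        // Worker warps: compute on the staged tile.
        for (int i = threadIdx.x; i < TILE; i += helperBase) {
            int g = tileBase + i;
            if (g < n) out[g] = a * buf[i];
        }
    }
}

// Example launch: 128 threads = 96 workers + one 32-thread helper warp.
//   scaleWithHelperWarp<<<(n + TILE - 1) / TILE, 128>>>(d_in, d_out, n, 2.0f);
```

Even in this toy form, the slide's critique is visible: the helper permanently occupies threads and registers that the block could otherwise use, the only synchronization available is a barrier across all warps, and the helper cannot react to runtime events such as cache misses.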
Where Do We Add Helper Threads?

The same spectrum again: in software (thread, warp, or block granularity) or in hardware.


Approach #2: Hardware-only

- Fine-grained control: synchronization and enforcing priorities.
- However, providing contexts efficiently is difficult: unlike a CPU, where each core has its own small register file, a GPU core shares one large register file among many warps.


CABA: An Overview

- In software: "tight coupling" of helper threads and regular threads
  - Efficient context management
  - Simpler data communication
- In hardware: "decoupled management" of helper threads and regular threads
  - Dynamic management of threads
  - Fine-grained synchronization


CABA: 1. In Software

Helper threads are tightly coupled to regular threads:
- They are simply instructions injected into the GPU pipelines.
- They share the same context as the regular threads.

This enables efficient context management and simpler data communication.


CABA: 2. In Hardware

Helper threads are decoupled from regular threads:
- They are tracked at the granularity of a warp: an assist warp.
- Each regular (parent) warp can have different assist warps (e.g., parent warp X with assist warps A and B).

This enables dynamic management of threads and fine-grained synchronization.


Key Functionalities

- Triggering and squashing assist warps
- Associating events with assist warps
- Deploying active assist warps
- Scheduling instructions for execution
- Enforcing priorities:
  - Between assist warps and parent warps
  - Between different assist warps


CABA: Mechanism

CABA adds three structures alongside the SM pipeline (fetch, I-cache, decode, issue, ALUs, memory, scoreboard, writeback):
- Assist Warp Store: holds the instructions for the different assist warp routines.
- Assist Warp Controller: the central point of control; it triggers assist warps, squashes them, and tracks the progress of active assist warps.
- Assist Warp Buffer: stages instructions from triggered assist warps for execution and helps enforce priorities.


Other Functionality

In the paper:
- More details on the hardware structures
- Data communication and synchronization
- Enforcing priorities


CABA: Applications

- Data compression
- Memoization
- Prefetching
- …


A Case for CABA: Data Compression

Data compression can help alleviate the memory bandwidth bottleneck: it transmits data in a more condensed form. CABA employs otherwise idle compute pipelines to perform the compression.


Data Compression with CABA

Use assist warps to:
- Compress cache blocks before writing them to memory
- Decompress cache blocks before placing them into the cache

CABA flexibly enables various compression algorithms:
- Example: BDI compression [Pekhimenko+ PACT '12], which is parallelizable across the SIMT width and has low latency (a sketch of its core check follows).
- Others: FPC [Alameldeen+ TR '04], C-Pack [Chen+ VLSI '10]
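To make the BDI example concrete, here is a hedged CUDA sketch of one encoder check (4-byte base, 1-byte deltas) for a 128-byte block viewed as 32 words, one word per SIMT lane. It is a simplification for illustration only: real BDI also uses an implicit zero base for immediate values and tries several base/delta sizes, and in CABA this logic would run as an assist warp routine rather than as user code. The function name is hypothetical.

```cuda
// Simplified BDI check (base: 4 bytes, deltas: 1 byte each), one 32-bit
// word per lane of a full warp. Returns true if the 128B block fits in
// a 4B base plus 32 one-byte deltas (36B total).
#include <cstdint>

__device__ bool bdiBase4Delta1(const uint32_t *block,   // 32 words
                               uint32_t *baseOut,       // 4B base
                               uint8_t *deltas) {       // 32 x 1B deltas
    const unsigned lane = threadIdx.x & 31;
    const uint32_t base = block[0];            // first word as the base
    const int64_t delta =
        (int64_t)block[lane] - (int64_t)base;  // each lane: one delta

    // Lane-local test: does this delta fit in a signed byte?
    const bool fits = (delta >= -128 && delta <= 127);

    // Warp-wide vote: compressible only if every lane's delta fits.
    const bool compressible = __all_sync(0xFFFFFFFFu, fits);
    if (compressible) {
        deltas[lane] = (uint8_t)(int8_t)delta; // all 32 deltas in parallel
        if (lane == 0) *baseOut = base;
    }
    return compressible;
}
```

The sketch shows why the slide calls BDI a good fit: every step is either lane-parallel arithmetic or a single warp-wide vote, so a 32-wide assist warp can test and encode a whole cache block in a handful of instructions.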
Walkthrough of Decompression

On an L1 miss, the compressed cache line is fetched from L2/memory. The trigger prompts the Assist Warp Controller to fetch the decompression routine from the Assist Warp Store; the scheduler deploys the assist warp on the cores, and the decompressed line is placed into the L1, where subsequent accesses hit.


Walkthrough of Compression

On a writeback from the L1, the Assist Warp Controller triggers a compression assist warp; the scheduler deploys it on the cores, and the compressed cache line is written back to L2/memory.


Evaluation Methodology

- Simulators: GPGPU-Sim, GPUWattch
- Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK
- System parameters:
  - 15 SMs, 32 threads/warp
  - 48 warps/SM, 32,768 registers, 32KB shared memory
  - Core: 1.4GHz, GTO scheduler, 2 schedulers/SM
  - Memory: 177.4GB/s bandwidth, 6 GDDR5 memory controllers, FR-FCFS scheduling
  - Caches: L1 16KB, 4-way associative; L2 768KB, 16-way associative
- Metrics:
  - Performance: instructions per cycle (IPC)
  - Bandwidth consumption: fraction of cycles the DRAM data bus is busy


Effect on Performance

- CABA-BDI improves performance by 41.7% over the baseline.
- CABA achieves performance close to that of designs with no overhead for compression (No-Overhead-BDI).

(Figure: normalized performance, CABA-BDI vs. No-Overhead-BDI.)


Effect on Bandwidth Consumption

Data compression with CABA alleviates the memory bandwidth bottleneck.

(Figure: memory bandwidth consumption, baseline vs. CABA-BDI.)


Different Compression Algorithms

CABA is flexible: it improves performance with different compression algorithms (CABA-FPC, CABA-BDI, CABA-CPack, CABA-BestOfAll).

(Figure: normalized performance for each algorithm.)


Other Results

- CABA's performance is similar to that of pure hardware-based BDI compression.
- CABA reduces overall system energy by 22% by decreasing off-chip memory traffic.
- Other evaluations in the paper: compression ratios, sensitivity to memory bandwidth, capacity compression, and compression at different levels of the hierarchy.


Conclusion

- Observation: Imbalances in execution leave GPU resources underutilized.
- Our Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.
- Challenge: How do we efficiently manage and use helper threads in a throughput-oriented architecture?
- Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
  - A new framework to enable helper threading in GPUs
  - Enables flexible data compression to alleviate the memory bandwidth bottleneck
  - Supports a wide set of use cases (e.g., prefetching, memoization)
- Key Result: Using CABA to implement data compression in memory improves performance by 41.7%.


Backup Slides


Effect on Energy

CABA reduces overall system energy by decreasing off-chip memory traffic.

(Figure: normalized energy for CABA-BDI, Ideal-BDI, HW-BDI-Mem, and HW-BDI.)


Effect on Compression Ratio

(Figure: compression ratios across workloads.)


Other Uses of CABA

- Hardware memoization
  - Goal: avoid redundant computation by reusing previous results over the same or similar inputs.
  - Idea: hash the inputs at predefined points; use the load/store pipelines to save inputs in shared memory; eliminate redundant computation by loading stored results. (A software sketch of this idea follows.)
- Prefetching
  - Similar to helper-thread prefetching in CPUs.
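For flavor, here is a software analogue of the memoization idea: each thread hashes its input, probes a small direct-mapped table in shared memory, and reuses a stored result on a hit. This is a sketch of the concept only, not CABA's mechanism (which would drive the table through assist warps); expensiveFunction, the table size, and the hash are all made up, and the sketch assumes aligned 64-bit shared-memory accesses are not torn.

```cuda
// Hypothetical software memoization: hash inputs, cache (input, result)
// pairs in shared memory, and skip recomputation on a hit.
#include <cstdint>
#include <math.h>

#define MEMO_ENTRIES 128           // power of two, so masking works
#define EMPTY_KEY    0xFFFFFFFFu   // sentinel: slot unused

__device__ float expensiveFunction(uint32_t x) {
    float v = (float)x;            // stand-in for real work
    for (int i = 0; i < 64; ++i) v = sqrtf(v + 1.0f);
    return v;
}

__global__ void memoizedKernel(const uint32_t *in, float *out, int n) {
    // Each 64-bit entry packs key (high bits) and result bits (low bits),
    // so one aligned load/store always sees a consistent pair (assumed).
    __shared__ unsigned long long table[MEMO_ENTRIES];
    for (int i = threadIdx.x; i < MEMO_ENTRIES; i += blockDim.x)
        table[i] = (unsigned long long)EMPTY_KEY << 32;
    __syncthreads();

    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= n) return;

    uint32_t x = in[g];
    uint32_t slot = (x * 2654435761u) & (MEMO_ENTRIES - 1); // hash input

    unsigned long long e = table[slot];       // probe the memo table
    float r;
    if ((uint32_t)(e >> 32) == x && x != EMPTY_KEY) {
        r = __uint_as_float((uint32_t)e);     // hit: reuse stored result
    } else {
        r = expensiveFunction(x);             // miss: compute...
        if (x != EMPTY_KEY)                   // ...never store the sentinel
            table[slot] = ((unsigned long long)x << 32)
                        | __float_as_uint(r); // ...and cache the pair
    }
    out[g] = r;
}
```

A racing warp can overwrite a slot and force a later recomputation, but never a wrong answer, since each entry is read and written as one packed word.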