Thread Block Compaction for Efficient SIMT Control Flow
Wilson W. L. Fung and Tor M. Aamodt
University of British Columbia
HPCA-17, Feb 14, 2011

Slide 2: Graphics Processor Unit (GPU)
- Commodity many-core accelerator
- SIMD HW: compute BW + efficiency
- Non-graphics APIs: CUDA, OpenCL, DirectCompute
- Programming model: hierarchy of scalar threads
  - SIMD-ness not exposed at the ISA
  - Scalar threads grouped into warps, run in lockstep
  - Single-Instruction, Multiple-Thread (SIMT)
[Figure: CUDA thread hierarchy - a grid of thread blocks; the scalar threads of each block (threads 1-12 shown) are grouped into warps.]

Slide 3: SIMT Execution Model
    A: K = A[tid.x];     // A[] = {4,8,12,16}
    B: if (K > 10)
    C:     K = 10;
       else
    D:     K = 0;
    E: B = C[tid.x] + K;
- Branch divergence: threads of a warp take different paths at B
- Per-warp reconvergence stack after the branch:
    PC  RPC  Active Mask
    E   --   1111
    D   E    0011
    C   E    1100
- The warp executes C with threads 1,2 and D with threads 3,4, then reconverges at E: 50% SIMD efficiency!
- In some cases SIMD efficiency drops to 20%

Slide 4: Dynamic Warp Formation (W. Fung et al., MICRO 2007)
[Figure: warps 0-2 (threads 1-4, 5-8, 9-12) execute A and B; after the divergent branch, DWF packs threads from different warps that reach the same PC into new dynamic warps, e.g. C: (1 2 7 8) and (5 -- 11 12); threads that arrive later (after reissue / memory latency) miss the pack; all warps eventually execute E.]
- SIMD efficiency raised to 88% in this example

Slide 5: This Paper
- Identified DWF pathologies:
  - Greedy warp scheduling -> starvation -> lower SIMD efficiency
  - Breaks up coalesced memory accesses -> lower memory efficiency; extreme case: 5X slowdown
  - Some CUDA apps require lockstep execution of static warps -> DWF breaks them!
- Additional HW to fix these issues?
- Simpler solution: Thread Block Compaction

Slide 6: Thread Block Compaction
- Block-wide reconvergence stack: regroup threads within a block
    Per-warp stacks (PC RPC AMask)               Block-wide stack (PC RPC AMask per warp)
    Warp 0: E --/1111  D E/0011  C E/1100        E  --  1111 1111 1111
    Warp 1: E --/1111  D E/0100  C E/1011        D  E   0011 0100 1100
    Warp 2: E --/1111  D E/1100  C E/0011        C  E   1100 1011 0011
  (each block-wide entry is issued as compacted dynamic warps)
- Better reconvergence stack: likely convergence - converge before the immediate post-dominator
- Robust: avg. 22% speedup on divergent CUDA apps, no penalty on the others

Slide 7: Outline
- Introduction
- GPU Microarchitecture
- DWF Pathologies
- Thread Block Compaction
- Likely-Convergence
- Experimental Results
- Conclusion

Slide 8: GPU Microarchitecture
[Figure: SIMT cores connected through an interconnection network to memory partitions, each with a last-level cache bank and an off-chip DRAM channel. Inside a SIMT core: SIMT front end (Fetch, Decode, Schedule, Branch), SIMD datapath, and memory subsystem (SMem, L1 D$, Tex$, Const$) attached to the interconnect; a "Done (Warp ID)" signal returns to the front end.]
- More details in the paper
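For reference before the pathology discussion, here is the running example from slides 3, 9, and 14 written out as a complete CUDA kernel. This is a minimal sketch: the kernel signature, array types, and the use of a one-dimensional block are illustrative assumptions, since the slides only show the labeled statements A-E.

    // Running example (slides 3, 9, 14) as a standalone CUDA kernel.
    // Names and types are illustrative; the slides only give statements A-E.
    __global__ void running_example(const int *A, const int *C, int *B)
    {
        int K = A[threadIdx.x];                // A: coalesced load
        if (K > 10)                            // B: divergent branch
            K = 10;                            //    C: one side of the warp
        else
            K = 0;                             //    D: the other side
        B[threadIdx.x] = C[threadIdx.x] + K;   // E: reconvergence point (immediate post-dominator)
    }

The slide writes E as "B = C[tid.x] + K"; storing to B[threadIdx.x] here only serves to make the kernel self-contained.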
Slide 9: DWF Pathologies: Starvation
- Majority scheduling (best performing in prev. work): prioritize the largest group of threads with the same PC
    B: if (K > 10)
    C:     K = 10;
       else
    D:     K = 0;
    E: B = C[tid.x] + K;
[Figure: with majority scheduling, the packed C warps (1 2 7 8) and (5 -- 11 12) and their successors at E keep getting priority, while the minority D warps (9 6 3 4) and (-- 10 -- --) are starved; memory latencies of 1000s of cycles keep the groups apart, so the D threads execute D and E much later, in small groups.]
- Starvation -> LOWER SIMD efficiency!
- Other warp scheduler? Tricky: variable memory latency

Slide 10: DWF Pathologies: Extra Uncoalesced Accesses
- Coalesced memory access = memory SIMD; a first-order CUDA programmer optimization
- Not preserved by DWF
    E: B = C[tid.x] + K;
    No DWF (#Acc = 3)        With DWF (#Acc = 9)
    0x100: 1 2 3 4           0x100: 1 2 7 12
    0x140: 5 6 7 8           0x140: 9 6 3 8
    0x180: 9 10 11 12        0x180: 5 10 11 4
- The L1 cache absorbs the redundant memory traffic, at the cost of L1$ port conflicts

Slide 11: DWF Pathologies: Implicit Warp Sync.
- Some CUDA applications depend on the lockstep execution of "static warps"
  - Warp 0: threads 0...31, Warp 1: threads 32...63, Warp 2: threads 64...95
- E.g. task queue in ray tracing (implicit warp sync.):
    int wid = tid.x / 32;
    if (tid.x % 32 == 0) {
        sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
    }
    my_TaskID = sharedTaskID[wid] + tid.x % 32;
    ProcessTask(my_TaskID);

Slide 12: Observation
- Compute kernels usually contain divergent and non-divergent (coherent) code segments
- Coalesced memory accesses usually occur in the coherent code segments -> DWF gives no benefit there
[Figure: coherent code runs as static warps with coalesced loads/stores; at a divergence, dynamic warps are formed; at the reconvergence point, warps are reset back to static (coherent) warps.]

Slide 13: Thread Block Compaction
- Run a thread block like a warp
  - Whole block moves between coherent and divergent code
  - Block-wide stack tracks execution paths and reconvergence
- Barrier @ branch / reconvergence point (implicit warp sync.)
  - All available threads arrive at the branch
  - Insensitive to warp scheduling (no starvation)
- Warp compaction
  - Regrouping with all available threads
  - If no divergence, gives the static warp arrangement (no extra uncoalesced accesses)

Slide 14: Thread Block Compaction (example)
- Block-wide stack for the running example:
    PC  RPC  Active Threads
    E   --   1 2 3 4 | 5 6 7 8 | 9 10 11 12
    D   E    -- -- 3 4 | -- 6 -- -- | 9 10 -- --
    C   E    1 2 -- -- | 5 -- 7 8 | -- -- 11 12
    A: K = A[tid.x];
    B: if (K > 10)
    C:     K = 10;
       else
    D:     K = 0;
    E: B = C[tid.x] + K;
[Figure: timeline - warps A (1 2 3 4), (5 6 7 8), (9 10 11 12) reach the branch; the block diverges into per-warp C groups (1 2 -- --), (5 -- 7 8), (-- -- 11 12) and D groups (-- -- 3 4), (-- 6 -- --), (9 10 -- --); the compactor issues them as compacted warps C (1 2 7 8), C (5 -- 11 12), D (9 6 3 4), D (-- 10 -- --); at E the block reconverges and runs again as the static warps (1 2 3 4), (5 6 7 8), (9 10 11 12).]

Slide 15: Thread Block Compaction
- Barrier at every basic block?! (idle pipeline)
- Switch to warps from other thread blocks
  - Multiple thread blocks run on a core
  - Already done in most CUDA applications
[Figure: timeline - while Block 0 waits at a branch for warp compaction, execution switches to warps of Block 1 and Block 2, keeping the core busy.]

Slide 16: Microarchitecture Modification
- Per-warp stack -> block-wide stack
- I-buffer + TIDs -> warp buffer: stores the dynamic warps
- New unit: thread compactor - translates the active mask into compacted dynamic warps
- More detail in the paper
[Figure: modified SIMT core - Fetch / I-Cache -> Decode -> Warp Buffer (with per-thread valid bits) -> Issue / Scoreboard -> RegFile -> ALUs / MEM; the block-wide stack supplies the branch target PC and active mask to the thread compactor, which fills the warp buffer; "Done (WID)" feeds back to the stack.]
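Before moving on to likely-convergence, here is a small software sketch of what the thread compactor on this slide (detailed further on backup slide 23) does: per-lane priority encoders pick thread IDs out of a block-wide active mask and pack them into dynamic warps for the warp buffer. The function name, data layout, and warp width are assumptions for illustration, not the hardware's actual interface; compaction is restricted to a thread's home lane, matching the per-lane columns shown on slide 23.

    #include <vector>

    // Software sketch of the thread compactor: turn a block-wide active mask
    // into compacted dynamic warps. Thread t may only occupy lane t % WARP_SIZE,
    // so each thread stays in its home SIMD lane (one priority encoder per lane).
    constexpr int WARP_SIZE = 32;

    std::vector<std::vector<int>> compactWarps(const std::vector<bool> &activeMask)
    {
        const int blockSize = static_cast<int>(activeMask.size());
        std::vector<std::vector<int>> warps;   // each warp: WARP_SIZE thread IDs, -1 = idle lane
        std::vector<int> next(WARP_SIZE, 0);   // per-lane priority-encoder cursor

        bool packedAny = true;
        while (packedAny) {
            packedAny = false;
            std::vector<int> warp(WARP_SIZE, -1);
            for (int lane = 0; lane < WARP_SIZE; ++lane) {
                // Candidates for this lane: t = lane, lane + WARP_SIZE, lane + 2*WARP_SIZE, ...
                int &c = next[lane];
                while (c * WARP_SIZE + lane < blockSize && !activeMask[c * WARP_SIZE + lane])
                    ++c;
                int t = c * WARP_SIZE + lane;
                if (t < blockSize) {           // found an active thread for this lane
                    warp[lane] = t;
                    ++c;
                    packedAny = true;
                }
            }
            if (packedAny)
                warps.push_back(warp);
        }
        return warps;
    }

For the 12-thread example with WARP_SIZE changed to 4, the C-entry mask (threads 1, 2, 5, 7, 8, 11, 12 active in the slides' 1-based numbering) compacts into exactly the two warps shown on backup slide 23: (1 2 7 8) and (5 -- 11 12).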
Slide 17: Likely-Convergence
- Immediate post-dominator: all paths from a divergent branch must merge there -> conservative
- Convergence can happen earlier: when any two of the paths merge
    while (i < K) {
        X = data[i];
        if ( X == 0 )
            result[i] = Y;
        else if ( X == 1 )
            break;
        i++;
    }
    return result[i];
[Figure: control-flow graph with blocks A-F for this loop; F (the return) is the immediate post-dominator of the divergent branch in A, but the break path is rarely taken, so the remaining paths already merge earlier, at a likely-convergence point inside the loop.]
- Extended the reconvergence stack to exploit this
- TBC: 30% speedup for ray tracing
- Details in the paper

Slide 18: Outline (repeat of slide 7; next: Experimental Results)

Slide 19: Evaluation
- Simulation: GPGPU-Sim (2.2.1b), configured like a Quadro FX5800, plus L1 & L2 caches
- 21 benchmarks:
  - All of the original GPGPU-Sim benchmarks
  - Rodinia benchmarks
  - Other important applications: Face Detection from Visbench (UIUC), DNA sequencing (MUMmerGPU++), molecular dynamics simulation (NAMD), ray tracing from NVIDIA Research

Slide 20: Experimental Results
- 2 benchmark groups: COHE = non-divergent CUDA applications, DIVG = divergent CUDA applications
[Chart: IPC relative to the baseline per-warp stack (0.6-1.3) for DWF and TBC on the COHE and DIVG groups.]
- DWF: serious slowdown from the pathologies
- TBC: no penalty on COHE, 22% speedup on DIVG

Slide 21: Conclusion
- Thread Block Compaction
  - Addressed some key challenges of DWF
  - One significant step closer to reality
- Benefits from advancements on the reconvergence stack
  - Likely-convergence points
  - Extensible: integrate other stack-based proposals

Slide 22: Thank You!

Slide 23 (backup): Thread Compactor
- Converts the active mask from the block-wide stack into thread IDs in the warp buffer
- Array of priority encoders, one per lane:
    C-entry active threads:  1 2 -- -- | 5 -- 7 8 | -- -- 11 12
    Per-lane candidates:     lane 0: 1 5 --   lane 1: 2 -- --   lane 2: -- 7 11   lane 3: -- 8 12
    P-Enc outputs (warp buffer): C (1 2 7 8), C (5 -- 11 12)

Slide 24 (backup): Effect on Memory Traffic
- TBC still generates some extra uncoalesced memory accesses: in the earlier example the compacted warps C (1 2 7 8) and C (5 -- 11 12) touch lines 0x100, 0x140, 0x180 with #Acc = 4 instead of 3
- The 2nd access hits in the L1 cache -> no change to the overall memory traffic into/out of a core
[Charts: normalized memory stalls (up to ~300%, series include TBC-AGE and TBC-RRB, with one case at 2.67x) and memory traffic normalized to the baseline (0.6-1.2; series TBC, DWF, Baseline), across the COHE and DIVG benchmarks: AES, BACKP, BFS2, CP, DG, FCDT, HOTSP, HRTWL, LIB, LKYT, LPS, MGST, MUM, MUMpp, NAMD, NNC, NVRT, RAY, STMCL, STO, WP.]

Slide 25 (backup): Likely-Convergence (2)
- NVIDIA uses a break instruction for loop exits; that handles the last example
- Our solution: likely-convergence points
  - Each stack entry gains LPC (likely-convergence PC) and LPos (stack position of the likely-convergence entry) fields
[Table: stack contents (PC, RPC, LPC, LPos, Active Threads) for the loop example, showing threads from different paths merging into the likely-convergence entry ("Convergence!") before the immediate post-dominator F.]
- This paper: only used to capture loop breaks

Slide 26 (backup): Likely-Convergence (3)
- Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
- Enables more thread grouping for TBC
- Side effect: reduces stack usage in some cases
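To make the backup slides on likely-convergence concrete, here is a rough sketch of how a likely-convergence point could be represented on the reconvergence stack. The struct layout and the merge helper are assumptions for illustration only; the paper describes the mechanism (extra LPC/LPos fields, an entry created for the likely-convergence PC when the branch diverges, and threads merging into it early instead of waiting for the immediate post-dominator), not this exact encoding.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Reconvergence stack entry extended with a likely-convergence point
    // (PC, RPC, LPC, LPos, active mask), per backup slide 25. Layout is illustrative.
    struct StackEntry {
        uint32_t pc;             // next PC for the threads in this entry
        uint32_t rpc;            // reconvergence PC (immediate post-dominator)
        uint32_t lpc;            // likely-convergence PC (e.g. target of a loop break)
        int      lpos;           // stack index of the entry that collects early arrivals
        std::vector<bool> mask;  // active mask (per warp, or block-wide under TBC)
    };

    // Illustrative merge step: when the threads at the top of the stack reach the
    // likely-convergence PC, move them into the entry at lpos so they can regroup
    // there, instead of running on alone until the immediate post-dominator.
    void mergeAtLikelyConvergence(std::vector<StackEntry> &stack, uint32_t nextPC)
    {
        StackEntry &top = stack.back();
        if (nextPC != top.lpc)
            return;                               // not at the likely-convergence point
        StackEntry &lcEntry = stack[top.lpos];
        for (std::size_t t = 0; t < top.mask.size(); ++t) {
            if (top.mask[t]) {
                lcEntry.mask[t] = true;           // thread joins the likely-convergence entry
                top.mask[t] = false;
            }
        }
        stack.pop_back();                         // the now-empty top entry is removed
    }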