Introduction to Control Divergence
Lecture slides and figures contributed from sources as noted (1)

Objective
• Understand the occurrence of control divergence and the concept of thread reconvergence (also described as branch divergence and thread divergence)
• Cover a basic thread reconvergence mechanism – PDOM
– Sets up discussion of further optimizations and advanced techniques
• Explore one approach for mitigating the performance degradation due to control divergence – dynamic warp formation (2)

Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
• M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
• Figures from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (3)

Handling Branches
• CUDA code:
  if(…) {…}   // true for some threads: taken
  else {…}    // true for the others: not taken
• What if threads in the same warp take different branch directions?
• Branch divergence! (4)

Branch Divergence
• Occurs within a warp
• Branches lead to serialization of the branch-dependent code
• Performance issue: low warp utilization – while one path of if(…) {…} else {…} executes, threads on the other path sit idle until reconvergence (5)

Branch Divergence
[Figure: a warp of four threads sharing a common PC traverses a control flow graph with basic blocks A–G]
• Different threads follow different control flow paths through the kernel code
• Thread execution is (partially) serialized – the subset of threads that follow the same path execute in parallel
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

Basic Idea
• Split: partition a warp
– Two mutually exclusive thread subsets, each branching to a different target
– Identify the subsets with two activity masks – effectively two warps
• Join: merge two subsets of a previously split warp
– Reconverge the mutually exclusive sets of threads
• Orchestrate correct execution for nested branches
• Note the long history of such techniques in SIMD processors (see background in Fung et al.) (7)

Thread Reconvergence
• Fundamental problem: merge threads with the same PC
– How do we sequence the execution of threads? The order can affect the ability to reconverge
• Question: when can threads productively reconverge?
• Question: when is the best time to reconverge? (8)

Dominator
• Node d dominates node n if every path from the entry node to n must go through d (9)

Immediate Dominator
• Node d immediately dominates node n if d dominates n and no other node that dominates n lies between d and n (10)

Post-Dominator
• Node d post-dominates node n if every path from node n to the exit node must go through d (11)

Immediate Post-Dominator
• Node d immediately post-dominates node n if d post-dominates n and no other node that post-dominates n lies between d and n (12)

Baseline: PDOM
[Figure: a warp of four threads executes A (mask 1111) → B (1111); the branch at B diverges to C (1001) and D (0110); C and D reconverge at E (1111), then G (1111). Execution order: A, B, C, D, E, G]
Per-warp reconvergence stack immediately after the divergent branch at B (the reconvergence PC is E, the immediate post-dominator of B):

  Reconv. PC | Next PC | Active Mask
  --         | E       | 1111
  E          | D       | 0110
  E          | C       | 1001   ← TOS

Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

Stack Entry
• A stack entry is a specification of a group of active threads that will execute the same basic block
• The naturally nested structure of control flow is what makes stack-based serialization work (14)
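To make the stack mechanics concrete, the following is a minimal C++ sketch of one warp's PDOM stack for the example above. The block labels, masks, and pop rule (pop when the path reaches its reconvergence PC) follow the slides; the Entry layout and the fallthrough helper are illustrative assumptions, not hardware from the papers.

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // One warp's PDOM reconvergence stack; 4-thread warp, blocks A..G.
  // Assumed CFG from the running example: A->B; B diverges to C (1001)
  // and D (0110); C and D reconverge at E; E->G; G exits.
  struct Entry { char rpc, pc; uint8_t mask; };

  static char fallthrough(char pc) {             // successor when no branch
      switch (pc) { case 'A': return 'B'; case 'C': return 'E';
                    case 'D': return 'E'; case 'E': return 'G'; }
      return 0;                                  // G: kernel exit
  }

  int main() {
      std::vector<Entry> s{{0, 'A', 0b1111}};    // TOS is s.back()
      while (!s.empty()) {
          Entry &t = s.back();
          printf("exec %c mask=%x\n", t.pc, (unsigned)t.mask);
          if (t.pc == 'B') {                     // divergent branch: RPC = IPDOM(B) = E
              t.pc = 'E';                        // this entry resumes at the RPC
              s.push_back({'E', 'D', 0b0110});   // one new entry per branch target
              s.push_back({'E', 'C', 0b1001});
          } else {
              t.pc = fallthrough(t.pc);
              if (t.pc == t.rpc || t.pc == 0)    // reached RPC (or exit): pop
                  s.pop_back();
          }
      }
      return 0;
  }

Running this prints the blocks in the order A, B, C, D, E, G with masks 1111, 1111, 1001, 0110, 1111, 1111 – exactly the serialization the figure shows.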
More Complex Example
• Stack-based implementation for nested control flow
– Each stack entry's RPC is set to the IPDOM of its branch
• Reconvergence at the immediate post-dominator of the branch
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow in SIMD Graphics Hardware,” ACM TACO, June 2009 (15)

Implementation
[Figure: SIMT core pipeline – I-Fetch, Decode, I-Buffer, Issue, register file, scalar pipelines, D-Cache, Writeback – with a per-warp reconvergence stack at the issue stage (GPGPU-Sim model)]
• Implement a per-warp stack at the issue stage
– Acquire the active mask and PC from the TOS
– Scoreboard check prior to issue
– Register writeback updates the scoreboard and the ready bit in the instruction buffer
– When RPC = Next PC, pop the stack
• Implications for instruction fetch?
From the GPGPU-Sim documentation: http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores (16)

Implementation (2)
• warpPC (the next instruction) is compared to the reconvergence PC
• On a branch:
– Can store the reconvergence PC as part of the branch instruction
– The branch unit has the NextPC, TargetPC, and reconvergence PC with which to update the stack
• On reaching a reconvergence point:
– Pop the stack
– Continue fetching from the NextPC of the next entry on the stack (17)

Can We Do Better?
• Warps are formed statically
• Key idea of dynamic warp formation: keep a pool of warps and merge warps that are at the same PC
• At a high level, what are the requirements? (18)

Compaction Techniques
• Can we re-form warps so as to increase utilization?
• Basic idea: compaction
– Re-form warps with threads that follow the same control flow path
– Increase the utilization of warps
• Two basic types of compaction techniques:
– Inter-warp compaction: group threads from different warps
– Intra-warp compaction: group threads within a warp, changing the effective warp size (19)

Inter-Warp Thread Compaction Techniques
Lecture slides and figures contributed from sources as noted (20)

Goal
[Figure: warps 0–5 reach if(…) {…} else {…}; each splits into taken and not-taken subsets]
• Can we merge threads from different warps that take the same branch direction? (21)

Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011 (22)

DWF: Example
[Figure: warps x and y traverse the CFG with per-block active masks A x/1111 y/1111; B x/1110 y/0011; C x/1000 y/0010; D x/0110 y/0001; F x/0001 y/1100; E x/1110 y/0011; G x/1111 y/1111. A new warp is created from the scalar threads of both warp x and warp y executing at basic block D]
• Baseline (per-warp stacks): A A B B C C D D E E F F G G A A
• Dynamic warp formation: A A B B C D E E F G G A A – merging x and y at common blocks saves issue slots
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

How Does This Work?
• Criteria for merging:
– Same PC
– Complementary sets of occupied lanes in the two warps
– Recall: many warps per thread block, all executing the same code
• What information do we need to merge two warps?
– Thread IDs and PCs
• Ideally, how would you find and merge warps? (A minimal merge test is sketched below.) (24)
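A minimal sketch of that merge test, assuming each warp fragment carries its PC and a bitmask of occupied SIMD lanes: two fragments can merge exactly when their PCs match and their lane masks are disjoint (lane-aware DWF, so each thread keeps its home lane and register bank). The struct and field names are illustrative, not the paper's hardware signals.

  #include <cstdint>

  struct WarpFrag {
      uint32_t pc;          // basic block / instruction address
      uint32_t lane_mask;   // bit i set => a thread occupies lane i
  };

  // Mergeable: same PC and complementary lane occupancy.
  bool can_merge(const WarpFrag &a, const WarpFrag &b) {
      return a.pc == b.pc && (a.lane_mask & b.lane_mask) == 0;
  }

  // The merged dynamic warp occupies the union of the lanes.
  WarpFrag merge(const WarpFrag &a, const WarpFrag &b) {
      return {a.pc, a.lane_mask | b.lane_mask};
  }

In the hardware described next, the PC-Warp LUT plays the role of the "find" step: it maps a PC to the warp currently being formed at that PC, so committing threads can join it.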
DWF: Microarchitecture Implementation
[Figure: DWF front end. Warp update registers for the taken (T) and not-taken (NT) sides hold the target PC, the thread IDs (TID × N lanes), and a request mask; the PC-Warp LUT (PC, OCC, IDX) points to the warp being formed at each PC; the warp pool holds the forming warps (PC, TID × N, priority); a warp allocator, issue logic, and thread scheduler feed per-lane register files (TID, Reg#) and ALUs 1–4, with decode and commit/writeback stages]
• Warps are formed dynamically in the warp pool
• After commit, the PC-Warp LUT is checked and threads are either merged into a forming warp or a newly forming warp is allocated
• The warp update registers identify available lanes; the occupancy (OCC) field tracks occupied lanes
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

DWF: Microarchitecture Implementation (2)
[Figure: worked example for the branch A: BEQ R2, B. Committing warps deposit their taken threads (PC B) and not-taken threads (PC C) into the T and NT warp update registers; the PC-Warp LUT steers each group into the warp forming at its PC in the warp pool; because the threads fall into complementary lanes, they merge with no lane conflict]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

Resource Usage
• Ideally there are only a few unique PCs in progress at a time – minimizes overhead
• Warp divergence increases the number of unique PCs
– Mitigate via warp scheduling
• Scheduling policies:
– FIFO
– Program counter: address variation as a measure of divergence
– Majority/Minority: serve the most common PC vs. helping stragglers
– Post-dominator (catch up) (27)

Hardware Consequences
• Exposes the assumptions that warps embed in the base design
• Implications for register file access – motivates lane-aware DWF
• Register bank conflicts
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow in SIMD Graphics Hardware,” ACM TACO, June 2009 (28)

Relaxing Implications of Warps
• Thread swizzling
– Remap work to threads so as to create more opportunities for DWF
– Requires a deep understanding of algorithm behavior and data sets
• Lane swizzling in hardware
– Provide limited connectivity between register banks and lanes, avoiding full crossbars (29)

Summary
• Control flow divergence is a fundamental performance limiter for SIMT execution
• Dynamic warp formation is one way to mitigate its effects
– We will look at several others
• Must balance a complex set of effects:
– Memory behaviors
– Synchronization behaviors
– Scheduler behaviors (30)

Thread Block Compaction
W. Fung and T. Aamodt, HPCA 2011

Goal
• Overcome some of the disadvantages of dynamic warp formation:
– Impact of scheduling
– Breaking of implicit synchronization
– Reduction of memory coalescing opportunities (32)

DWF Pathologies: Starvation
• Majority scheduling
– Best performing
– Prioritizes the largest group of threads with the same PC
• Starvation of the minority path – LOWER SIMD efficiency!
– Tricky: variable memory latency keeps refilling the majority path
[Figure: the C/E-path warps C: 1 2 7 8 and C: 5 – 11 12 keep issuing while the D-path warps D: 9 6 3 4 and D: – 10 – – wait 1000s of cycles, finally straggling through D and E at low SIMD efficiency]
• Other warp schedulers? (See the scheduling sketch below.)
The kernel for this example:
  B: if (K > 10)
  C:   K = 10;
     else
  D:   K = 0;
  E: B = C[tid.x] + K;
(33)
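A sketch of the Majority policy, assuming the scheduler simply issues from whichever PC has the most pending threads; the real scheduler hardware differs, but this is enough to see why minority paths (the warps at D above) can starve as long as memory keeps replenishing the majority PC.

  #include <cstdint>
  #include <map>
  #include <vector>

  struct Frag { uint32_t pc; int nthreads; };     // a warp fragment in the pool

  // Majority scheduling: pick the PC with the most pending threads.
  // The pool must be non-empty; fragments at other PCs simply wait.
  uint32_t majority_pc(const std::vector<Frag> &pool) {
      std::map<uint32_t, int> count;              // pending threads per PC
      for (const Frag &f : pool) count[f.pc] += f.nthreads;
      uint32_t best = pool.front().pc;
      for (const auto &kv : count)
          if (kv.second > count[best]) best = kv.first;
      return best;                                // minority PCs can starve
  }

With variable memory latency, fragments at the majority PC return from memory in a steady stream, so its count rarely drains and the minority fragments may wait thousands of cycles – the starvation pathology above.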
DWF Pathologies: Extra Uncoalesced Accesses
• Coalesced memory access = memory SIMD
– A 1st-order CUDA programmer optimization
• Not preserved by DWF
E: B = C[tid.x] + K;
– No DWF: warps [1 2 3 4], [5 6 7 8], [9 10 11 12] each touch one line (0x100, 0x140, 0x180) → #Acc = 3
– With DWF: warps [1 2 7 12], [9 6 3 8], [5 10 11 4] each touch all three lines → #Acc = 9
• The L1 cache absorbs the redundant memory traffic, at the cost of L1 port conflicts (34)

DWF Pathologies: Implicit Warp Sync.
• Some CUDA applications depend on the lockstep execution of “static warps”
– Warp 0 = threads 0…31, Warp 1 = threads 32…63, Warp 2 = threads 64…95
From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009 (35)

Performance Impact
[Figure: performance impact of the DWF pathologies]
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011 (36)

Thread Block Compaction
• Block-wide reconvergence stack shared by the warps of a thread block (warps 0–2 below):

  PC | RPC | Warp 0 AMask | Warp 1 AMask | Warp 2 AMask
  E  | --  | 1111         | 1111         | 1111
  D  | E   | 0011         | 0100         | 1100
  C  | E   | 1100         | 1011         | 0011

• Better reconvergence stack: likely convergence
• Regroup threads within a block; converge before the immediate post-dominator
• Robust: avg. 22% speedup on divergent CUDA apps, no penalty on the others (37)

GPU Microarchitecture
[Figure: SIMT cores connected by an interconnection network to memory partitions (last-level cache banks and off-chip DRAM channels). Each SIMT core: a SIMT front end (Fetch, Decode, Schedule, Branch), a SIMD datapath, and a memory subsystem (SMem, L1 D$, Tex $, Const $, interconnect network port); “done (warp ID)” returns to the front end]
More details in the paper

Observation
• Compute kernels usually contain divergent and non-divergent (coherent) code segments
• Coalesced memory accesses usually occur in coherent code segments
– DWF provides no benefit there (39)

Thread Block Compaction
• Run a thread block like a warp
– The whole block moves between coherent and divergent code
– A block-wide stack tracks execution paths and reconvergence
• Barrier at branch/reconvergence points → addresses starvation
– All available threads arrive at the branch
– Insensitive to warp scheduling
• Warp compaction → addresses implicit warp sync. and extra uncoalesced accesses
– Regroup with all available threads
– If there is no divergence, this gives the static warp arrangement (40)

Thread Block Compaction
Block-wide stack for the example kernel (12 threads, 3 warps of 4):

  PC | RPC | Active Threads
  E  | --  | 1 2 3 4 5 6 7 8 9 10 11 12
  D  | E   | – – 3 4 – 6 – – 9 10 – –
  C  | E   | 1 2 – – 5 – 7 8 – – 11 12

  A: K = A[tid.x];
  B: if (K > 10)
  C:   K = 10;
     else
  D:   K = 0;
  E: B = C[tid.x] + K;

• Without compaction: A A A (1 2 3 4 | 5 6 7 8 | 9 10 11 12), then C C C (1 2 – – | 5 – 7 8 | – – 11 12), D D D (– – 3 4 | – 6 – – | 9 10 – –), E E E (all)
• With compaction: C C (1 2 7 8 | 5 – 11 12), D D (9 6 3 4 | – 10 – –), E E E (1 2 3 4 | 5 6 7 8 | 9 10 11 12) – see the compaction sketch after this slide (41)
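A software rendering of that lane-aware compaction step, assuming each thread must stay in its home lane ((tid − 1) mod warp size for the 1-based thread IDs above) so register banks still line up. For the C-side threads {1, 2, 5, 7, 8, 11, 12} it reproduces the compacted warps C: 1 2 7 8 and C: 5 – 11 12; the per-lane lists behave like the thread compactor's per-lane priority encoders shown a few slides later.

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  const int WARP = 4;   // warp size in the example

  // Compact the threads active at one PC into the fewest dynamic warps,
  // keeping every thread in its home lane.
  std::vector<std::vector<int>> compact(const std::vector<int> &tids) {
      std::vector<std::vector<int>> lanes(WARP);          // candidates per lane
      for (int t : tids) lanes[(t - 1) % WARP].push_back(t);
      size_t nwarps = 0;                                  // = max lane occupancy
      for (const auto &l : lanes) nwarps = std::max(nwarps, l.size());
      std::vector<std::vector<int>> warps(nwarps, std::vector<int>(WARP, 0));
      for (int lane = 0; lane < WARP; ++lane)             // per-lane "priority encoder"
          for (size_t w = 0; w < lanes[lane].size(); ++w)
              warps[w][lane] = lanes[lane][w];            // 0 marks an idle lane
      return warps;
  }

  int main() {
      for (const auto &w : compact({1, 2, 5, 7, 8, 11, 12})) {
          for (int t : w) t ? printf(" %2d", t) : printf("  -");
          printf("\n");            // prints:  1  2  7  8  and  5  -  11 12
      }
      return 0;
  }

Note how the minimum number of compacted warps equals the maximum occupancy of any single lane – a fact CAPRI's predictor exploits later in the lecture.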
Thread Block Compaction
• A barrier at every basic block?! (Idle pipeline?)
• Switch to warps from other thread blocks
– Multiple thread blocks run on a core
– Already done in most CUDA applications
[Figure: blocks 0, 1, and 2 interleave branch → warp compaction → execution, hiding one another's compaction barriers] (42)

High Level View
• DWF: warps are broken down every cycle, and the threads in a warp are shepherded into new warps (PC-Warp LUT and warp pool)
• TBC: warps are broken down only at potentially divergent points, and threads are compacted across the thread block (43)

Microarchitecture Modification
• Per-warp stack → block-wide stack
• I-buffer + TIDs → warp buffer
– Stores the dynamic warps
• New unit: the thread compactor
– Translates active masks into compacted dynamic warps
[Figure: Fetch and Valid[1:N] → I-Cache → Decode → warp buffer (with scoreboard) → Issue → RegFile → ALUs and MEM; the block-wide stack supplies the branch target PC and active masks to the thread compactor, which refills the warp buffer; “done (WID)” returns to the stack]
• More detail in the paper (44)

Microarchitecture Modification (2)
• The thread compactor delivers the TIDs of the compacted warps and the number of compacted warps
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011 (45)

Operation
• All threads with the same home lane are mapped to the same compactor input
• Warp 2 arrives first, creating the two target entries
• Subsequent warps update the active masks
– When the count of warps yet to arrive reaches zero, compact (priority encoders)
– Each encoder picks one thread mapped to its lane per cycle
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011 (46)

Thread Compactor
• Converts the active masks from the block-wide stack into thread IDs in the warp buffer
• Array of priority encoders, one per lane
– Example: C-entry 1 2 – – | 5 – 7 8 | – – 11 12 → per-lane candidates {1, 5, –}, {2, –, –}, {–, 7, 11}, {–, 8, 12} → P-Enc per lane → warp buffer holds C: 1 2 7 8 and C: 5 – 11 12 (47)

Likely-Convergence
• The immediate post-dominator is conservative: it is where ALL paths from the divergent branch must merge
• Convergence can happen earlier – when any two of the paths merge
• Example (the branch paths usually merge at E, long before the iPDOM F):

  A: while (i < K) {
         X = data[i];
  B:     if (X == 0)
  C:         result[i] = Y;
         else if (X == 1)
  D:         break;            // rarely taken
  E:     i++;
     }
  F: return result[i];         // iPDOM of A

• Extended the reconvergence stack to exploit this
– TBC: 30% speedup for ray tracing
• Details in the paper (48)

Likely-Convergence (2)
• NVIDIA uses a break instruction for loop exits
– That handles the last example
• Our solution: likely-convergence points
– Each stack entry gains an LPC (likely-convergence PC) and an LPos (the stack position of the entry to merge into); in the example, threads merge back at E instead of waiting until F – convergence! (A sketch of such an entry follows.)
• This paper: only used to capture loop breaks (49)
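A sketch of a reconvergence stack entry extended with the LPC/LPos fields named above. The merge rule is a simplified reading of the mechanism (merge into the indicated entry when a fragment reaches its likely-convergence PC; otherwise reconverge normally at the RPC); the paper's actual bookkeeping is more involved.

  #include <cstdint>
  #include <vector>

  struct Entry {
      uint32_t pc, rpc;     // current PC, reconvergence PC (iPDOM)
      uint32_t lpc;         // likely (earlier) convergence point
      int      lpos;        // stack index of the entry to merge into, -1 if none
      uint64_t threads;     // active-thread bitmask
  };

  // Called when the fragment at the TOS arrives at a new PC.
  void on_reach(std::vector<Entry> &stack, uint32_t pc) {
      Entry tos = stack.back();
      if (tos.lpos >= 0 && pc == tos.lpc) {         // likely convergence hit:
          stack[tos.lpos].threads |= tos.threads;   // merge inside the stack
          stack.pop_back();
      } else if (pc == tos.rpc) {                   // normal iPDOM reconvergence
          stack.pop_back();
      }
  }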
Likely-Convergence (3)
• On reaching a likely-convergence point, check the stack and merge the arriving threads into the matching entry inside the stack
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011 (50)

Likely-Convergence (4)
• Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
• Enables more thread grouping for TBC
• Side effect: reduces stack usage in some cases (51)

Evaluation
• Simulation: GPGPU-Sim (2.2.1b), ~Quadro FX5800 + L1 & L2 caches
• 21 benchmarks:
– All of the original GPGPU-Sim benchmarks
– Rodinia benchmarks
– Other important applications: Face Detection from Visbench (UIUC), DNA sequencing (MUMmerGPU++), molecular dynamics simulation (NAMD), ray tracing from NVIDIA Research (52)

Experimental Results
• 2 benchmark groups:
– COHE = non-divergent CUDA applications
– DIVG = divergent CUDA applications
[Figure: IPC relative to the baseline per-warp stack (0.6–1.3). DWF: serious slowdown from the pathologies. TBC: no penalty on COHE, 22% speedup on DIVG] (53)

Effect on Memory Traffic
• TBC does still generate some extra uncoalesced memory accesses
– e.g., compacted warps C: 1 2 7 8 and C: 5 – 11 12 touch lines 0x100, 0x140, 0x180 → #Acc = 4 (vs. 3 for static warps)
– But the 2nd access hits in the L1 cache: no change to the overall memory traffic into/out of a core
[Figure: memory traffic and memory stalls normalized to baseline (TBC-AGE, TBC-RRB, DWF) across BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT, AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP; DWF reaches 2.67× on DIVG, TBC stays near baseline] (54)

Conclusion
• Thread block compaction
– Addresses some key challenges of DWF
– One significant step closer to reality
• Benefits from advancements to the reconvergence stack
– Likely-convergence points
– Extensible: can integrate other stack-based proposals (55)

CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
M. Rhu and M. Erez, ISCA 2012

Goals
• Improve the performance of inter-warp compaction techniques
• Predict when branches diverge
– Borrow philosophy from branch prediction
• Use the prediction to apply compaction only when it is beneficial (57)

Issues with Thread Block Compaction
• An implicit barrier across warps is needed to collect compaction candidates
– TBC: warps are broken down at potentially divergent points and threads are compacted across the thread block
• The barrier synchronization overhead cannot always be hidden
• When it works, it works well (58)

Divergence Behavior: A Closer Look
• Compaction-ineffective branches exist, e.g., a loop that executes a fixed number of times: all threads diverge and return together, so stalling to compact gains nothing
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (59)

Compaction-Resistant Branches
[Figure: control flow graph of a compaction-resistant branch]
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (60)

Compaction-Resistant Branches (2)
[Figure: control flow graph, continued]
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (61)

Impact of Ineffective Compaction
• Threads are shuffled around with no performance improvement
• Worse, this can lead to increased memory divergence!
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (62)

Basic Idea
• Only stall and compact when there is a high probability of compaction success
• Otherwise allow warps to bypass the (implicit) barrier
• A compaction-adequacy predictor – think branch prediction! (63)
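A sketch of a compaction-adequacy predictor in the spirit of CAPRI's CAPT, under the assumption of a per-branch table of 2-bit saturating counters trained on whether the last compaction at that branch actually reduced the warp count. The table organization, initialization, and update rule here are illustrative; the paper's exact design may differ.

  #include <cstdint>
  #include <unordered_map>

  struct CAPT {
      std::unordered_map<uint32_t, uint8_t> ctr;   // branch PC -> 2-bit counter

      // Unseen branches stall and compact (this also initializes the history).
      bool predict_adequate(uint32_t branch_pc) const {
          auto it = ctr.find(branch_pc);
          return it == ctr.end() || it->second >= 2;
      }

      // warps_before/after: warp count at this branch without/with compaction.
      void update(uint32_t branch_pc, int warps_before, int warps_after) {
          uint8_t &c = ctr.try_emplace(branch_pc, 2).first->second;
          if (warps_after < warps_before) { if (c < 3) ++c; }  // compaction helped
          else if (c > 0) --c;                                  // bypass next time
      }
  };

When predict_adequate() returns false, warps skip the implicit barrier at that branch and issue immediately, which is where the memory-overlap gains in the next slide come from.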
Example: TBC vs. CAPRI
• Bypassing the barrier enables increased overlap of memory references
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (64)

CAPRI: Example
• No divergence → no stalling
• First divergence → stall and initialize the history; all other warps at this branch now stall
• Divergence with history available → predict, then update the prediction
• All other warps follow (one prediction per branch)
• The CAPT (compaction-adequacy prediction table) is updated
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (65)

The Predictor
• The prediction uses the active masks of all warps
– Need to understand what could have happened
• The actual compaction only uses the warps that actually stalled
• The maximum lane occupancy gives the best achievable compaction, i.e., the minimum number of compacted warps equals the largest number of available threads in any one lane
• Update the history (predictor) accordingly
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (66)

Behavior
[Figure: prediction behavior across benchmarks]
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (67)

Impact of Implicit Barriers
• The idle cycle count helps us understand the negative effects of implicit barriers in TBC
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012 (68)

Summary
• The synchronization overhead of thread block compaction can introduce performance degradation
• Some branches diverge more than others
• Apply TBC judiciously – predict when it is beneficial
• Effectively predict when inter-warp compaction will be effective (69)

M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013

Goals
• Understand the limitations of compaction techniques and their proximity to ideal compaction
• Provide mechanisms to overcome these limitations and approach ideal compaction rates (71)

Limitations of Compaction
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (72)

Mapping Threads to Lanes: Today
• Linearize the thread IDs of a block (across the thread dimensions)
• Modulo assignment to lanes: lane = linear tid mod SIMD width (see the sketch below)
[Figure: a grid of blocks; the threads of warps 0 and 1 are assigned in order to the per-lane register files RF0…RFn-1 that feed the scalar pipelines] (73)
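The baseline mapping in a few lines. WARP_SIZE = 8 matches the figure's SIMD width; the tid % 2 predicate is a stand-in for a programmatic branch. Note how the taken threads land in the even lanes of every warp, so inter-warp compaction has nothing with which to fill the odd lanes.

  #include <cstdio>

  int main() {
      const int WARP_SIZE = 8;                  // SIMD width as in the figure
      for (int tid = 0; tid < 32; ++tid) {
          int warp = tid / WARP_SIZE;           // linearized tid -> warp
          int lane = tid % WARP_SIZE;           // modulo lane assignment
          bool taken = (tid % 2 == 0);          // a "programmatic" predicate
          printf("tid %2d -> warp %d lane %d %s\n",
                 tid, warp, lane, taken ? "taken" : "not-taken");
      }
      return 0;
  }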
Data Dependent Branches
• Data-dependent control flow is less likely to produce lane conflicts – the divergence pattern is effectively random across lanes
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (74)

Programmatic Branches
• Programmatic branches can be correlated with lane assignments (under modulo assignment)
• Program variables that operate like constants across threads can produce correlated branching behaviors
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (75)

P-Branches vs. D-Branches
• P-branches (programmatic) are the problem!
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (76)

Compaction Opportunities
• Lane reassignment can improve compaction opportunities
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (77)

Aligned Divergence
• Threads mapped to the same lane tend to evaluate (programmatic) predicates the same way
– Empirically, this is rarely exhibited by input-data-dependent control flow
• Compaction cannot help in the presence of lane conflicts
• The performance of compaction mechanisms therefore depends on both the divergence patterns and the lane conflicts
• We need to understand the impact of lane assignment (78)

Impact of Lane Reassignment
• Goal: improve “compactability”
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (79)

Random Permutations
• Do not always work well – but work well on average
• A better understanding of programs can lead to better permutation choices
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (80)

Mapping Threads to Lanes: New
[Figure: warps 0 through N−1 assigned to the per-lane register files under a permuted mapping, SIMD width = 8]
• What criteria do we use for lane assignment? (81)

Balanced Permutation
• Even warps: permute lanes within each half-warp
• Odd warps: additionally swap the upper and lower half-warps
• Result: each lane has a single instance of a logical thread ID from each warp
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (82)

Balanced Permutation (2)
• Logical TID 0 of each warp is now assigned to a different lane
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (83)

Characteristics
• Vertical balance: within a warp, each lane holds a distinct logical TID (the permutation is a bijection)
• Horizontal balance: logical TID x of the different warps is bound to different lanes
• This works when a CTA has no more warps than the SIMD width – why?
• Note that random permutations achieve this only on average
• (A balance check is sketched below.) (84)
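A check of the two balance properties, using a simple per-warp rotation as the candidate mapping. This is not the paper's balanced permutation – just an assumed stand-in – but for NW ≤ W warps it satisfies both properties, which is exactly what the CTA-size restriction on the previous slide leaves room for.

  #include <cstdio>

  const int W  = 8;    // SIMD width
  const int NW = 8;    // warps per CTA (must be <= W for horizontal balance)

  // Candidate mapping: rotate each warp's threads by its warp ID.
  int lane(int wid, int tid) { return (tid + wid) % W; }

  int main() {
      bool ok = true;
      for (int x = 0; x < W; ++x) {              // horizontal balance:
          unsigned used = 0;                     // TID x hits a distinct lane
          for (int w = 0; w < NW; ++w)           // in every warp
              used |= 1u << lane(w, x);
          ok &= (__builtin_popcount(used) == NW);   // GCC/Clang builtin
      }
      for (int w = 0; w < NW; ++w) {             // vertical balance:
          unsigned used = 0;                     // bijection within each warp
          for (int x = 0; x < W; ++x)
              used |= 1u << lane(w, x);
          ok &= (used == (1u << W) - 1);
      }
      printf("balanced: %s\n", ok ? "yes" : "no");  // prints "balanced: yes"
      return 0;
  }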
Impact on Memory Coalescing
• Modern GPUs do not require ordered requests
• Coalescing can occur across a set of requests – a specific lane assignment does not affect coalescing behavior
• The increase in the L1 miss rate is offset by the benefits of compaction
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (85)

Speedup of Compaction
• Can improve the compaction rate for the divergence due to the majority of programmatic branches
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (86)

Compaction Rate vs. Utilization
• Distinguish between compaction rate and utilization!
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013 (87)

Application of Balanced Permutation
[Figure: SIMT core pipeline – I-Fetch, Decode, I-Buffer, Issue, register file, scalar pipelines, D-Cache, Writeback]
• The permutation is applied when the warp is launched
• It is maintained for the life of the warp
• It does not affect the baseline compaction mechanism
• Enable/disable SLP to preserve target-specific, programmer-implemented optimizations (88)

Summary
• Structural hazards limit the performance improvements from inter-warp compaction
• Today, program behaviors produce correlated lane assignments
• Remapping threads to lanes extends the opportunities for compaction (89)

Summary: Inter-Warp Compaction
[Figure: thread blocks mapped onto the control flow graph A–G]
• Co-design of applications, resource management, software, and microarchitecture exploits program properties to recover SIMD utilization (90)