Outline
• GPU Computing
• GPGPU-Sim / Manycore Accelerators
• (Micro)Architecture Challenges:
  – Branch Divergence (DWF, TBC)
  – On-Chip Interconnect

What do these have in common?
[Image source: AMD Hotchips 19]

GPU Computing
• Technology trends => want "simpler" cores (less power).
• GPUs represent an extreme in terms of computation per unit area.
• Current GPUs tend to work well for applications with regular parallelism (e.g., dense matrix multiply).
• Research questions: Can we make GPUs better for a wider class of parallel applications? Can we make them even more efficient?

Heterogeneous Computing
• Split the problem between CPU and GPU:
  – The CPU runs the sequential code; the GPU acts as an "accelerator" and does most of the computation.
[Timeline: the CPU spawns work on the GPU, continues executing, waits until the GPU is done, then spawns the GPU again.]

CUDA Thread Hierarchy
• Kernel = grid of blocks of warps of scalar threads.

CUDA Example [Luebke]
Standard C code:

  void saxpy_serial(int n, float a, float *x, float *y)
  {
      for (int i = 0; i < n; ++i)
          y[i] = a*x[i] + y[i];
  }

  // Invoke serial SAXPY kernel
  int main() {
      ...
      saxpy_serial(n, 2.0, x, y);
  }

CUDA Example [Luebke]
CUDA code:

  __global__ void saxpy_parallel(int n, float a, float *x, float *y)
  {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a*x[i] + y[i];
  }

  int main() {
      // omitted: allocate and initialize memory
      // Invoke parallel SAXPY kernel with 256 threads/block
      int nblocks = (n + 255) / 256;
      saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
      // omitted: transfer results from GPU to CPU
  }

GPU Microarchitecture Overview (10,000')
[Block diagram: multiple shader cores connected through an interconnection network to several memory controllers, each attached to off-chip GDDR DRAM.]

Single Instruction, Multiple Thread (SIMT)
• All threads in a kernel grid run the same "code".
• A given block in the kernel grid runs on a single "shader core".
• A warp is a set of threads within a block grouped to execute in SIMD lockstep.
• Stack hardware and/or predication can support different branch outcomes per thread in a warp.
[Figure: scalar threads W, X, Y, Z sharing a common PC form a warp; multiple warps (e.g., warps 3, 7, 8) feed a SIMD pipeline.]

"Shader Core" Microarchitecture
• Heavily multithreaded: 32 "warps", each representing 32 scalar threads.
• Designed to tolerate long-latency operations rather than avoid them.

"GPGPU-Sim" (ISPASS 2009)
• GPGPU simulator developed by my group at UBC.
• Goal: a platform for architecture research on manycore accelerators running massively parallel applications.
• Supports CUDA's "virtual instruction set" (PTX).
• Provides a timing model with "good enough" accuracy for architecture research.

GPGPU-Sim Usage
• Input: an unmodified CUDA or OpenCL application.
• Output: clock cycles required to execute, plus statistics that can be used to determine where cycles were lost due to "microarchitecture level" inefficiency.

Accuracy vs. Hardware (GPGPU-Sim 2.1.1b)
[Scatter plot: GPGPU-Sim IPC vs. Quadro FX 5800 IPC; correlation ~0.90.]
(Architecture simulators give up accuracy to enable flexibility: they can explore more of the design space.)

GPGPU-Sim Visualizer (ISPASS 2010)

GPGPU-Sim w/ SASS (decuda) + uArch Tuning (under development)
• ~0.976 correlation on the subset of the CUDA SDK that currently runs.
• Currently adding support for Fermi (don't ask when it will be available).
[Scatter plot: GPGPU-Sim IPC vs. Quadro FX 5800 IPC; correlation with microarchitecture tuning ~0.95.]
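A note on the correlation numbers above: they come from comparing per-benchmark IPC measured on the Quadro FX 5800 against the IPC reported by GPGPU-Sim for the same benchmarks. The sketch below shows one way such a comparison could be computed (an ordinary Pearson correlation over IPC pairs); the data points are made up for illustration, and the exact statistic behind the slide numbers may differ.

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Pearson correlation coefficient between two equally sized samples.
  double pearson(const std::vector<double> &x, const std::vector<double> &y) {
      const double n = static_cast<double>(x.size());
      double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
      for (size_t i = 0; i < x.size(); ++i) {
          sx  += x[i];        sy  += y[i];
          sxx += x[i] * x[i]; syy += y[i] * y[i];
          sxy += x[i] * y[i];
      }
      double cov = sxy - sx * sy / n;
      double vx  = sxx - sx * sx / n;
      double vy  = syy - sy * sy / n;
      return cov / std::sqrt(vx * vy);
  }

  int main() {
      // Hypothetical per-benchmark IPC pairs (hardware, simulator), illustration only.
      std::vector<double> hw_ipc  = { 35, 60, 110, 150, 190, 225 };
      std::vector<double> sim_ipc = { 40, 55, 120, 140, 200, 210 };
      std::printf("correlation = %.3f\n", pearson(hw_ipc, sim_ipc));
      return 0;
  }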
First Problem: Control Flow
• Group scalar threads into warps.
• Branch divergence occurs when threads inside a warp want to follow different execution paths.
[Figure: a warp reaches a branch; some threads follow Path A, the others Path B.]

Current GPUs: Stack-Based Reconvergence
(Building upon Levinthal & Porter, SIGGRAPH'84)
• Our version: immediate post-dominator reconvergence.
[Figure: a four-thread warp executes a control-flow graph with blocks A/1111, B/1111, C/1001, D/0110, E/1111, G/1111. A reconvergence stack of (reconvergence PC, next PC, active mask) entries serializes paths C and D and merges them at E, the immediate post-dominator; execution order over time is A, B, C, D, E, G, A.]

Dynamic Warp Formation (MICRO'07 / TACO'09)
• Consider multiple warps arriving at the same divergent branch. Opportunity?
[Figure: two warps diverge at a branch into Path A and Path B.]

Dynamic Warp Formation
• Idea: form new warps at a divergence.
• With several diverged warps, enough threads branch to each path to create full new warps.

Dynamic Warp Formation: Example
[Figure: warps x and y execute basic blocks A through G with per-block active masks; a new warp is created from scalar threads of both warp x and warp y executing at basic block D. Baseline execution: A A B B C C D D E E F F G G A A. With dynamic warp formation: A A B B C D E E F G G A A.]

Dynamic Warp Formation: Implementation
• New logic plus a modified register file.

Thread Block Compaction (HPCA 2011)

DWF Pathologies: Starvation
• Majority scheduling:
  – Best performing in previous work.
  – Prioritizes the largest group of threads with the same PC.
• Causes starvation and poor reconvergence: LOWER SIMD efficiency!
• Key obstacle: variable memory latency.

  B: if (K > 10)
  C:     K = 10;
     else
  D:     K = 0;
  E: B = C[tid.x] + K;

[Figure: majority scheduling keeps issuing the dynamic warps at C and E while the minority threads at D sit behind memory accesses taking 1000s of cycles, delaying reconvergence at E.]

DWF Pathologies: Extra Uncoalesced Accesses
• Coalesced memory access = memory SIMD
  – A first-order CUDA programmer optimization.
• Not preserved by DWF.
[Example for E: B = C[tid.x] + K; with static warps, threads 1-4, 5-8 and 9-12 touch blocks 0x100, 0x140 and 0x180, generating 3 memory accesses; after DWF regroups the threads, the same load generates 9 accesses. The L1 cache absorbs some of the redundant memory traffic but suffers L1 port conflicts.]

DWF Pathologies: Implicit Warp Sync.
• Some CUDA applications depend on the lockstep execution of "static warps" (threads 0...31, 32...63, 64...95, ...).
  – E.g., a task queue in ray tracing:

  int wid = tid.x / 32;
  if (tid.x % 32 == 0) {
      sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
  }
  my_TaskID = sharedTaskID[wid] + tid.x % 32;
  ProcessTask(my_TaskID);

Observation
• Compute kernels usually contain divergent and non-divergent (coherent) code segments.
• Coalesced memory accesses usually occur in the coherent code segments, where DWF provides no benefit.
[Figure: static warps execute the coherent segment (coalesced loads/stores); dynamic warps are formed after the divergence; warps are reset to the static arrangement at the reconvergence point.]
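Before moving on: the per-warp reconvergence stack referred to above (entries holding a next PC, a reconvergence PC, and an active mask, pushed at a divergent branch and popped at the immediate post-dominator) is the baseline mechanism that Thread Block Compaction generalizes to a whole block. Below is a minimal sketch of that per-warp behavior; the class and member names are mine, and this is not the actual GPGPU-Sim or hardware implementation.

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // One entry of a per-warp SIMT reconvergence stack (names are illustrative).
  struct StackEntry {
      uint32_t next_pc;    // PC this entry executes next
      uint32_t reconv_pc;  // reconvergence point (immediate post-dominator)
      uint32_t active;     // bit i set => scalar thread i of the warp is active
  };

  class SimtStack {
      std::vector<StackEntry> stack_;
  public:
      SimtStack(uint32_t start_pc, uint32_t full_mask) {
          stack_.push_back({start_pc, 0u, full_mask});
      }
      // Divergent branch: turn the current entry into the reconvergence entry
      // and push one entry per path, both reconverging at the iPDom.
      void diverge(uint32_t taken_pc, uint32_t taken_mask,
                   uint32_t fallthru_pc, uint32_t ipdom_pc) {
          uint32_t all = stack_.back().active;
          stack_.back().next_pc = ipdom_pc;
          uint32_t fallthru_mask = all & ~taken_mask;
          if (fallthru_mask) stack_.push_back({fallthru_pc, ipdom_pc, fallthru_mask});
          if (taken_mask)    stack_.push_back({taken_pc,    ipdom_pc, taken_mask});
      }
      // Pop when the executing path reaches its reconvergence PC.
      void at_pc(uint32_t pc) {
          if (stack_.size() > 1 && pc == stack_.back().reconv_pc) stack_.pop_back();
      }
      uint32_t next_pc() const { return stack_.back().next_pc; }
      uint32_t active()  const { return stack_.back().active; }
  };

  int main() {
      // Four-thread warp: the branch at B sends threads {0,3} to C and {1,2} to D,
      // reconverging at E (cf. the C/1001 and D/0110 masks in the slides).
      enum { A = 0, B, C, D, E };
      SimtStack stk(A, 0xF);
      stk.diverge(/*taken_pc=*/C, /*taken_mask=*/0x9, /*fallthru_pc=*/D, /*ipdom_pc=*/E);
      std::printf("run PC %u, mask 0x%X\n", stk.next_pc(), stk.active()); // C, 0x9
      stk.at_pc(E);                                                       // C path done
      std::printf("run PC %u, mask 0x%X\n", stk.next_pc(), stk.active()); // D, 0x6
      stk.at_pc(E);                                                       // D path done
      std::printf("run PC %u, mask 0x%X\n", stk.next_pc(), stk.active()); // E, 0xF
      return 0;
  }

Thread Block Compaction, described next, keeps this stack discipline but shares a single stack across all warps of a thread block.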
Thread Block Compaction
• Block-wide reconvergence stack:
  – Regroups threads within a block.
[Figure: the per-warp stacks of warps 0, 1 and 2 (each holding PC / RPC / active-mask entries for blocks E, D and C) are merged into a single block-wide stack whose entries carry an active mask for the whole block; the block-wide masks at C and D are compacted into new dynamic warps.]
• Better reconvergence stack: likely convergence
  – Converge before the immediate post-dominator.
• Robust:
  – Average 22% speedup on divergent CUDA applications.
  – No penalty on others.

Thread Block Compaction
• Run a thread block like a warp:
  – The whole block moves between coherent and divergent code.
  – A block-wide stack tracks execution paths and reconvergence.
  – (Addresses implicit warp sync.)
• Barrier at branch/reconvergence points:
  – All available threads arrive at the branch.
  – Insensitive to warp scheduling.
  – (Addresses starvation.)
• Warp compaction:
  – Regrouping with all available threads.
  – If there is no divergence, this gives the static warp arrangement.
  – (Addresses extra uncoalesced memory accesses.)

Thread Block Compaction: Example

  A: K = A[tid.x];
  B: if (K > 10)
  C:     K = 10;
     else
  D:     K = 0;
  E: B = C[tid.x] + K;

[Figure: all twelve threads of the block execute B in three static warps; the block-wide stack then holds (PC, RPC, active threads) entries for D and C, whose threads are compacted into fewer dynamic warps; the full block reconverges at E and resumes the static warp arrangement.]

Thread Block Compaction
• Barrier at every basic block?! (Idle pipeline.)
• Switch to warps from other thread blocks:
  – Multiple thread blocks run on a core.
  – Already done in most CUDA applications.
[Figure: while block 0 waits at a branch for warp compaction, blocks 1 and 2 keep executing and fill the pipeline.]

Microarchitecture Modifications
• Per-warp stack → block-wide stack.
• I-buffer + TIDs → warp buffer:
  – Stores the dynamic warps.
• New unit: thread compactor
  – Translates active masks into compacted dynamic warps.
[Pipeline diagram: fetch (branch target PC, valid bits), I-cache, decode, warp buffer, scoreboard, issue, register file, ALUs and MEM, with the block-wide stack feeding active masks to the thread compactor, which fills the warp buffer.]

Likely-Convergence
• Immediate post-dominator reconvergence is conservative:
  – All paths from a divergent branch must merge there.
• Convergence can happen earlier:
  – When any two of the paths merge.

  A: while (i < K) {
  B:     X = data[i];
  C:     if (X == 0)
             result[i] = Y;
  D:     else if (X == 1)
             break;          // rarely taken
  E:     i++;
     }
  F: return result[i];       // iPDom of A

• Extended the reconvergence stack to exploit this:
  – TBC: 30% speedup for ray tracing.

Experimental Results
• Two benchmark groups:
  – COHE = non-divergent CUDA applications.
  – DIVG = divergent CUDA applications.
[Bar chart: IPC relative to the per-warp-stack baseline (0.6 to 1.3). DWF suffers serious slowdowns from the pathologies; TBC shows no penalty for COHE and a 22% speedup on DIVG.]

Next: How should the on-chip interconnect be designed? (MICRO 2010)

Throughput-Effective Design
• Two approaches:
  – Reduce area.
  – Increase performance.
• Look at the properties of bulk-synchronous parallel (aka "CUDA") workloads.

Throughput vs. Inverse of Area
[Plot: (total chip area)^-1 [1/mm^2] (less area is higher) vs. average throughput [IPC] (higher throughput is to the right), with lines of constant IPC/mm^2. Design points shown: baseline mesh (balanced bisection bandwidth), 2x-bandwidth mesh, checkerboard, throughput-effective design, and an ideal NoC.]
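The plot above is read in terms of a single figure of merit: throughput-effectiveness, i.e. performance per unit area (IPC/mm^2), which a design can improve either by raising IPC or by shrinking area. The sketch below merely illustrates how the metric trades the two off; the design names and numbers are hypothetical, not data from the MICRO 2010 paper.

  #include <cstdio>

  // Throughput-effectiveness = IPC / chip area (IPC per mm^2). A design is more
  // throughput-effective if this ratio is higher, whether it gets there by
  // raising IPC or by shrinking area.
  struct DesignPoint {
      const char *name;
      double ipc;       // average throughput (instructions per cycle)
      double area_mm2;  // total chip area in mm^2
  };

  double throughput_effectiveness(const DesignPoint &d) {
      return d.ipc / d.area_mm2;
  }

  int main() {
      // Hypothetical design points, for illustration only.
      DesignPoint designs[] = {
          {"baseline mesh",     220.0, 550.0},
          {"2x-bandwidth mesh", 240.0, 640.0},  // more IPC, but much more area
          {"reduced-area NoC",  215.0, 500.0},  // slightly less IPC, less area
      };
      for (const DesignPoint &d : designs)
          std::printf("%-18s IPC=%6.1f  area=%6.1f mm^2  IPC/mm^2=%.4f\n",
                      d.name, d.ipc, d.area_mm2, throughput_effectiveness(d));
      return 0;
  }

With numbers like these, doubling network bandwidth can raise IPC yet lower IPC/mm^2, while trimming router area can win despite a small IPC loss; the half-router, checkerboard routing, and multi-port memory-controller routers described next pursue both sides of that tradeoff.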
Many-to-Few-to-Many Traffic Pattern
[Figure: many cores C0...Cn inject requests over the request network to a few memory controllers MC0...MCm, which respond over the reply network back to the cores; core injection bandwidth, MC input bandwidth, and MC output bandwidth are the key capacities.]

Exploit the Traffic Pattern Somehow?
• Keep bisection bandwidth the same, reduce router area...
• Half-router:
  – Limited connectivity: no turns allowed.
  – Might save ~50% of router crossbar area.
[Figure: half-router connectivity.]

Checkerboard Routing, Example
• Routing from a half-router to a half-router that is:
  – an even number of columns away, and
  – not in the same row.
• Solution: two turns are needed:
  – (1) Route to an intermediate full-router using YX routing.
  – (2) Then route to the destination using XY routing.

Multi-Port Routers at MCs
• Increase the injection ports of the memory-controller routers:
  – Only increases the terminal bandwidth of the few nodes.
  – No change in bisection bandwidth.
  – Minimal area overhead (~1% in NoC area).
  – Speedups of up to 25%.
• Reduces the bottleneck at the few nodes.
[Figure: a memory-controller router with extra injection and ejection ports between the router and the memory controller, alongside the usual north/south/east/west ports.]

Results
• Harmonic-mean speedup of 13% across 24 benchmarks.
• Total router area reduction of 14.2%.
[Bar chart: per-benchmark speedups from roughly -10% to 70% for AES, BIN, HSP, NE, NDL, HW, LE, HIS, LU, SLA, BP, CON, NNC, BLK, MM, LPS, RAY, DG, SS, TRA, SR, WP, MUM, LIB, FWT, SCP, STC, KM, CFD, BFS, RD, plus the harmonic mean (HM).]

Next: GPU Off-Chip Memory Bandwidth Problem (MICRO'09)

Background: DRAM
• Row access: activate a row of a DRAM bank and load it into the row buffer (slow).
• Column access: read and write data in the row buffer (fast).
• Precharge: write the row-buffer data back into the row (slow).
[Figure: DRAM bank with row decoder, memory array, row buffer, and column decoder, connected to the memory controller.]

Background: DRAM Row Access Locality
• Definition: the number of accesses to a row between row switches.
• Row access locality → achievable DRAM bandwidth → performance.
[Timing diagram: a "row switch" costs a precharge of row A (tRP) followed by an activate of row B (tRCD) before accesses to row B can proceed; tRC is the row cycle time. GDDR uses multiple banks to hide this latency.]
  – tRC = row cycle time
  – tRP = row precharge time
  – tRCD = row activate time

Interconnect Arbitration Policy: Round-Robin
[Figure: request streams from different cores (accesses to rows A, B, C and rows X, Y) meet at a router; with round-robin arbitration, memory controllers 0 and 1 receive the streams with their row accesses interleaved.]

The Trend: DRAM Access Locality in Many-Core
[Plot: pre-interconnect vs. post-interconnect DRAM access locality for 8, 16, 32 and 64 cores; pre-interconnect locality stays good while post-interconnect locality degrades as the core count grows.]
• Inside the interconnect, interleaving of memory request streams reduces the DRAM access locality seen by the memory controller.

Today's Solution: Out-of-Order Scheduling
• The memory controller reorders its request queue so that requests to the currently opened row are serviced before switching rows.
• The queue size needs to increase as the number of cores increases.
• Requires fully associative logic.
• Circuit issues:
  – Cycle time
  – Area
  – Power
[Figure: a request queue holding interleaved accesses to rows A and B; the scheduler drains all row-A requests against the opened row before switching to row B.]

Interconnect Arbitration Policy: HG
[Figure: the same request streams as in the round-robin example, but HG arbitration holds the grant for a stream, so memory controllers 0 and 1 see back-to-back accesses to the same row.]

Results – IPC Normalized to FR-FCFS
[Bar chart: IPC of FIFO, BFIFO, BFIFO+HG, BFIFO+HMHG4, and FR-FCFS, normalized to FR-FCFS, for fwt, lib, mum, neu, nn, ray, red, sp, wp, and the harmonic mean (HM).]
• Crossbar network, 28 shader cores, 8 DRAM controllers, 8-entry DRAM queues:
  – BFIFO: 14% speedup over a regular FIFO.
  – BFIFO+HG: 18% speedup over BFIFO, within 91% of FR-FCFS.

Thank you. Questions?
aamodt@ece.ubc.ca
http://www.gpgpu-sim.org