The Architecture and Evolution of
CPU-GPU Systems for General
Purpose Computing
Manish Arora
Computer Science and Engineering
University of California, San Diego
From GPU to GPGPU

[Figure: the fixed-function GPU pipeline (Input Assembly, Vertex Processing, Geometry Processing, Frame Buffer Operations, off-chip memory) alongside a GPGPU built from SMs with per-SM shared memory, a shared L2, memory controllers, and off-chip memory.]

- First with NVIDIA Tesla in 2006-2007
- Widespread adoption (300M devices)
Previous Generation Consumer Hardware (2006 – 2010)

[Figure: a discrete CPU (cores, cache hierarchy, last level cache, memory controller, off-chip memory) connected over a PCI bridge to a GPGPU (SMs with shared memory, L2, memory controller, and its own off-chip memory).]
Current Consumer Hardware (2011 - 2012)

[Figure: CPU cores and GPGPU SMs integrated on one die, sharing an on-chip last level cache, a memory controller, and off-chip memory.]

- Examples: Intel Sandy Bridge, AMD Fusion APUs
Our Goals Today

- Examine the current state of the art
- Trace the next steps of this evolution (major part)
- Lay out research opportunities
Outline

[Roadmap figure: how throughput applications, energy-efficient GPUs, lower costs and overheads, and CPU-only workloads drive next generation CPU-GPU architectures.]

- Part 1: GPGPU and chip-integrated CPU-GPU systems
- Part 2: GPGPU evolution
- Part 3: Holistic optimizations (CPU core optimization, redundancy elimination)
- Part 4: Shared components
- Part 5: Opportunistic optimizations
- Part 6: Future work (tools, emerging technologies, power, temperature, reliability)
Part 1: Progression of GPGPU Architectures
GPGPUs - 1

- The fixed function graphics era (pre 2006)
  - Programmable vertex processors
  - Programmable pixel processors
  - Lots of fixed hardware blocks (assembly, geometry, z-culling…)
- Non-graphics processing was possible
  - Represent user work as graphics tasks
  - Trick the graphics pipeline
  - Programming via graphics APIs
  - No hardware for bit-wise operations, no explicit branching…
- Imbalance in modern workloads motivated unification
- General purpose opportunity sensed by vendors
GPGPUs - 2

- The unified graphics and computing era (2006 - 2010)
  - Single programmable processor design
  - Explicit support for both graphics and computing
  - Computing specific modifications (IEEE FP compliance and ECC)
- Non-graphics processing became easy
  - High level programming (C, C++, Python, etc.)
  - Separate GPU and CPU memory spaces
  - Explicit GPU memory management required
- High overhead to process on the GPU
  - Memory transfers over PCI (see the sketch below)
- Significant customer market penetration
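To make the overhead concrete, here is a minimal host-side CUDA sketch of the discrete-GPU workflow described above; the kernel, sizes, and launch configuration are illustrative, not taken from the talk:

#include <cuda_runtime.h>
#include <cstdlib>

// Illustrative kernel: doubles each element in place.
__global__ void scaleKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h_A = (float *)malloc(N * sizeof(float));            // host memory
    for (int i = 0; i < N; i++) h_A[i] = 1.0f;
    float *d_A;
    cudaMalloc((void **)&d_A, N * sizeof(float));                // separate device memory
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);  // copy in over PCI
    scaleKernel<<<(N + 255) / 256, 256>>>(d_A, N);               // compute on the GPU
    cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);  // copy results back
    cudaFree(d_A);
    free(h_A);
    return 0;
}

Every kernel invocation in this era pays for the two explicit copies, which is why the slide calls out the high overhead of processing on the GPU.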
GPGPUs - 3

- Chip Integrated CPU-GPU era (2011 onwards)
  - Multicore CPU + GPGPU on the same die
  - Shared last level caches and memory controller
  - Shared main memory system
- Chip integration advantages
  - Lower total system costs
  - Shared hardware blocks improve utilization
  - Lower latency
  - Higher bandwidth
- Continued improvements in programmability
  - Standardization efforts (OpenCL and DirectCompute)
Contemporary GPU Architecture
(Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)

[Figure: a discrete GPGPU attached to the CPU over a PCI bridge. The GPU contains many SMs connected through an interconnect to distributed last-level (L2) cache slices and memory controllers, each driving its own DRAM channel; the CPU keeps its own cache hierarchy and off-chip memory.]
SM Architecture
(Lindholm et al. IEEE Micro 2007 / Wittenbrink et al. IEEE Micro 2011)

[Figure: one SM, containing a warp scheduler, a banked register file, operand buffering, the SIMT lanes (ALUs, SFUs, memory and texture units), and the shared memory / L1 cache.]
Multi-threading and Warp Scheduling

- Warp processing
  - 32 threads are grouped and processed as a warp (see the indexing sketch below)
  - A single instruction is fetched and issued per warp
  - Lots of active threads per SM (Fermi: 1536 threads in 48 warps)
- Hardware multithreading for latency hiding
  - Threads have dedicated registers (Fermi: 21 registers per thread)
  - Register state need not be copied or restored
  - Enables fast switching (potentially a new warp each cycle)
  - Threads are processed in-order; warps are scheduled out-of-order

Example of Warp Scheduling
SM Multithreaded Instruction Scheduler (Lindholm et al. IEEE Micro 2007)

[Figure: issue order over time, e.g. Warp 1 Instruction 1, Warp 2 Instruction 1, Warp 3 Instruction 1, Warp 3 Instruction 2, ..., Warp 2 Instruction 2, Warp 1 Instruction 2: the scheduler interleaves ready warps rather than running one warp to completion.]
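The warp grouping described above is visible to software through the thread index; a minimal sketch (warp size of 32 assumed, as on Tesla/Fermi; the kernel name and outputs are illustrative):

__global__ void warpInfo(int *warpIdOut, int *laneIdOut, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    if (tid >= n) return;
    int warp = threadIdx.x / 32;    // threads of a block are issued in groups of 32 (a warp)
    int lane = threadIdx.x % 32;    // position of this thread within its warp
    warpIdOut[tid] = warp;
    laneIdOut[tid] = lane;
}

All 32 threads of one warp execute the same fetched instruction, which is what the scheduler above is interleaving.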
Design for Efficiency and Scalability
(Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011)

- Amortized costs of instruction supply
  - Single instruction multiple thread model
  - No out-of-order processing
- Efficient data supply
  - Large register files
  - Managed locality via shared memories (see the sketch below)
- Lack of global structures
  - High utilization with hardware multithreading
- Biggest tradeoff: programmability
  - Exposed microarchitecture, frequent changes
  - Programmer has to manage data
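The "managed locality" bullet is the programmer's job in this model: data is staged explicitly in the per-SM shared memory. A minimal, illustrative sketch (launch with TILE threads per block; the reduction is deliberately unoptimized):

#define TILE 256

__global__ void tiledSum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                  // software-managed per-SM memory
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // each thread stages one element
    __syncthreads();                              // wait until the whole tile is loaded
    if (threadIdx.x == 0) {                       // one thread sums the staged tile
        float s = 0.0f;
        for (int k = 0; k < TILE; k++) s += tile[k];
        out[blockIdx.x] = s;
    }
}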
Scalability
(Lee et al. ISCA 2010 / Nickolls et al. IEEE Micro 2010 / Keckler et al. IEEE Micro 2011 and other public sources)

- Double precision performance grew 10x in 3 generations
- Memory structures are growing slower than ALUs (22.5x)
- Memory bandwidth growing slower still (2.2x in 4 generations)
- Clearly favors workloads with high arithmetic intensity (defined below)
- CPU performance gap increasing rapidly
  - Double precision performance gap grew from 2x to 9x
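Arithmetic intensity here is the usual FLOP-per-byte ratio; a standard roofline-style bound (this formulation is not from the slides) makes the bandwidth argument explicit:

AI = \frac{\text{floating-point operations}}{\text{bytes moved to and from memory}}, \qquad
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; AI \times BW_{\text{mem}}\bigr)

With ALU throughput growing far faster than memory bandwidth, only kernels with high AI continue to scale across GPU generations.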
Part 2: GPGPU Evolution (Towards Better GPGPUs)
Control-flow Divergence Losses
(Fung et al. Micro 2007)

[Figure: a warp with active mask 1111 reaches a divergent branch; the lanes that take Path A execute its instructions while the Path B lanes are masked off, then the roles swap for Path B, so utilization is low on both paths until the warp reconverges at the merge point and the mask returns to 1111.]
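In source terms, divergence occurs whenever threads of one warp disagree on a branch condition; a minimal illustrative CUDA kernel:

__global__ void divergentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] > 0.0f) {            // lanes of the same warp may disagree here
            data[i] = data[i] * 2.0f;    // Path A: runs with the Path-B lanes masked off
        } else {
            data[i] = -data[i];          // Path B: runs with the Path-A lanes masked off
        }                                // warp reconverges here (merge point)
    }
}

Both paths are executed serially by the warp, so SIMT lane utilization drops in proportion to how evenly the lanes split.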
Dynamic Warp Formation
(Fung et al. Micro 2007)

- Key insight: several warps are at the same diverge point at the same time
- Combine threads from the same execution path dynamically (sketched below)
- Generate warps on the fly

[Figure: the original scheme issues Warp 0: Path A, Warp 1: Path A, Warp 0: Path B, Warp 1: Path B; with DWF the scheduler dynamically forms two new warps from the four original ones and issues Warp 0+1: Path A, then Warp 0+1: Path B.]

- 20.7% performance improvement @ 4.7% area overhead
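A rough host-side C++ model of the regrouping idea (this is a hardware mechanism; the sketch only illustrates packing threads that sit at the same post-branch path into new warps, and it ignores the lane-placement constraints discussed on the next slide):

#include <vector>
#include <algorithm>

struct Thread { int id; char path; };   // path is 'A' or 'B' after the divergent branch

// Pack threads from different warps that follow the same path into full warps.
std::vector<std::vector<Thread>> formDynamicWarps(const std::vector<Thread> &threads,
                                                  size_t warpSize) {
    std::vector<Thread> pathA, pathB;
    for (const Thread &t : threads)
        (t.path == 'A' ? pathA : pathB).push_back(t);

    std::vector<std::vector<Thread>> warps;
    for (const std::vector<Thread> *p : {&pathA, &pathB})
        for (size_t i = 0; i < p->size(); i += warpSize)
            warps.emplace_back(p->begin() + i,
                               p->begin() + std::min(i + warpSize, p->size()));
    return warps;
}

With two partially-filled warps on each path, this yields one fuller warp per path, matching the example above where two new warps are formed from four.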
Dynamic Warp Formation Intricacies
(Fung et al. Micro 2007)

- Needs several warps at the same execution point
  - "Majority" warp scheduling policy
- Need for lane-awareness
  - Register files are banked
  - Spread the threads of a dynamic warp across lanes
  - Simplifies the register file design

[Figure: register file accesses for static warps, for dynamic warps formed without lane awareness, and during lane-aware dynamic warp formation; shaded entries denote the registers accessed in each bank.]
Large Warp Microarchitecture
(Narasiman et al. Micro 2011)

- Similar idea: generate dynamic warps
- Differs in the creation method
- Machine organized as warps bigger than the SIMT width - the large warp
- Dynamically create SIMT-width warps from within the large warp

[Figure: the activity mask of one original large warp; at T = 0, 1, 2, 3 the hardware repeatedly selects active threads from the mask to form each issued sub-warp, clearing the corresponding bits as it goes.]
Two Level Scheduling
(Narasiman et al. Micro 2011)

- Typical warp scheduling scheme: round robin
  - Beneficial because it exploits data locality across warps
  - But all warps tend to reach long latency operations at the same time
  - Then latency cannot be hidden because everyone is waiting
- Solution: group warps into several sets (sketched below)
  - Schedule warps within a single set round robin
  - Still exploits data locality
  - Switch to another set when all warps of a set hit long latency operations
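A simplified software model of the policy (the real scheduler is hardware; group membership, the ready signal, and the stall signal are abstracted, and groups are assumed non-empty):

#include <vector>

struct Warp { bool ready; bool stalledOnLongLatency; };

// Two-level round robin: rotate within the active fetch group; when every warp in
// that group has stalled on a long-latency operation, move on to the next group.
int pickNextWarp(std::vector<std::vector<Warp>> &groups, size_t &activeGroup, size_t &rrPos) {
    for (size_t g = 0; g < groups.size(); g++) {
        size_t gi = (activeGroup + g) % groups.size();
        std::vector<Warp> &grp = groups[gi];
        for (size_t k = 0; k < grp.size(); k++) {
            size_t idx = (rrPos + k) % grp.size();
            if (grp[idx].ready && !grp[idx].stalledOnLongLatency) {
                activeGroup = gi;                  // stay in (or switch to) this group
                rrPos = (idx + 1) % grp.size();    // round robin within the group
                return (int)idx;                   // warp selected for issue
            }
        }
    }
    return -1;                                      // everything is stalled
}

Because only one group reaches its long-latency loads at a time, the other groups still have compute to overlap with the memory wait.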
Dynamic Warps vs Large Warp + 2-Level Scheduling
(Fung et al. Micro 2007 vs Narasiman et al. Micro 2011)

- Dynamic warp formation gives better performance than the large warp alone
  - More opportunities to form warps: all warps participate rather than a single large warp
- Large warp + 2-level scheduling performs better than dynamic warp formation
- 2-level scheduling can be applied together with dynamic warp formation
Part 3: Holistic Optimizations (Holistically Optimized CPU Designs)
Motivation to Rethink CPU Design
(Arora et al. In Submission to IEEE Micro 2012)

- Heterogeneity works best when each composing core runs subsets of codes well (Kumar et al. PACT 2006)
  - GPGPU already an example of this
- The CPU need not be fully general-purpose
  - Sufficient to optimize it for non-GPU code
  - The CPU undergoes a "holistic optimization"
- Code expected to run on the CPU is very different
  - We start by investigating properties of this code
Benchmarks

- Took important computing applications and partitioned them over the CPU and GPU
- Partitioning mostly based on expert knowledge
  - Either used publicly available source code
  - Or details from publications
  - Performed our own CUDA implementations for 3 benchmarks
- Also used serial and parallel programs with no known GPU implementations as CPU-only workloads
- Total of 11 CPU-heavy, 11 mixed, and 11 GPU-heavy benchmarks
Methodology

- Used a combination of two techniques
  - Inserted start-end functions based on partitioning information
  - Real machine measurements
  - PIN based simulators
- Branches categorized into 4 categories (see the classification sketch below)
  - Biased (same direction), patterned (95% accuracy on a local predictor), correlated (95% accuracy on gshare), hard (remaining)
- Loads and stores characterized into 4 categories
  - Static (same address), strided (95% accuracy on a stride prefetcher), patterned (95% accuracy on a Markov predictor), hard (remaining)
- Thread level parallelism is the speedup on a 32 core machine
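A sketch of how per-branch statistics from such a PIN-based tool could be folded into the four categories; the 95% predictor thresholds are from the slide, while the struct layout and the 95% bias cutoff are assumptions:

#include <string>

struct BranchStats {
    long long taken, total;     // dynamic outcomes of one static branch
    double localAccuracy;       // accuracy of a local (per-branch history) predictor
    double gshareAccuracy;      // accuracy of a gshare (global history) predictor
};

std::string classifyBranch(const BranchStats &b) {
    double bias = (double)b.taken / (double)b.total;
    if (bias >= 0.95 || bias <= 0.05) return "biased";      // almost always one direction
    if (b.localAccuracy  >= 0.95)     return "patterned";   // local history is enough
    if (b.gshareAccuracy >= 0.95)     return "correlated";  // needs global correlation
    return "hard";                                          // everything that remains
}

Loads and stores are binned the same way, with the stride and Markov prefetcher accuracies taking the place of the two predictors.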
Results – CPU Time

- To be conservative, speedups are capped at 10x
- More time is spent on the CPU than on the GPU
Results – Instruction Level Parallelism

- ILP drops in 17 of 22 apps (11% drop for the larger window size)
- Short independent loops go to the GPU; dependence-heavy code stays on the CPU
Results – Branch Characterization

- Frequency of hard branches rises from 11.3% to 18.6%
- Occasional effects of data dependent branches
Results – Loads

- Reduction in strided loads, increase in hard loads
- Occasional GPU mapping of irregular access kernels
Results – Vector Instructions

- SSE usage drops to almost half
- GPUs and SSE extensions target the same regions of code
Results – Thread Level Parallelism

- GPU-heavy benchmarks are worst hit (14x down to 2.1x); overall 40-60% drops
- Majority of benchmarks have almost no post-GPU TLP
- Going from 8 cores to 32 cores gives only a 10% benefit
Impact: CPU Core Directions

- Larger instruction windows will have muted gains
- Considerably increased pressure on the branch predictor
  - Need to adopt better performing techniques (e.g. L-TAGE, Seznec et al.)
- Memory accesses will continue to be a major bottleneck
  - Stride or next-line prefetching almost irrelevant
  - Need to apply techniques that capture complex patterns
  - Lots of literature, but never adopted on real machines (e.g. Markov prediction, helper thread prefetching)
Impact: Redundancy Elimination

- SSE rendered significantly less important
  - Every core need not have it
  - Cores could share SSE hardware
- Extra CPU cores not of much use because of the lack of TLP
- A few bigger cores focused on addressing highly irregular code will improve performance
Part 4: Shared Component Designs
Optimization of Shared Structures

[Figure: CPU cores and GPGPU SMs sharing the on-chip last level cache and memory controller. The CPU side is latency sensitive; the GPGPU side is potentially latency-insensitive but bandwidth hungry.]
TAP: TLP Aware Shared LLC Management
(Lee et al. HPCA 2012)

- Insight 1: GPU cache misses / hits may or may not impact performance
  - Misses only matter if there is not enough latency hiding
  - Allocated capacity is useless if there is abundant parallelism
  - So, measure cache sensitivity to performance: core sampling controller
- Insight 2: the GPU causes a lot more cache traffic than the CPU
  - Allocation schemes typically allocate based on the number of accesses
  - Normalization needed for the larger number of GPU accesses: cache block lifetime normalization
TAP Design - 1

- Core sampling controller
  - GPUs usually run the same workload on all cores
  - Use different cache policies on 2 of the cores and measure the performance difference
  - E.g. LRU for one core / MRU for the other
- Cache block lifetime normalization
  - Count the number of cache accesses for all CPU and GPU workloads
  - Calculate ratios of access counts across workloads
TAP Design - 2

- Utility based Cache Partitioning (UCP)
  - Dynamic cache way allocation scheme (sketched below)
  - Allocates ways based on an application's expected gain from additional space (utility)
  - Uses cache hit rates to calculate utility
- TLP Aware Utility based Cache Partitioning (TAP-UCP)
  - Uses the core sampling controller information: allocate ways based on performance sensitivity, not hit rate
  - Uses cache access rates to calculate cache block lifetime: normalizes access rates to reduce the GPU workload's weight
- 5% better performance than UCP, 11% over LRU
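For reference, the utility computation behind UCP-style way allocation looks roughly like this: a greedy simplification (UCP itself uses a lookahead variant), with per-way hit counts assumed to come from set-sampled shadow tags:

#include <vector>

// hits[app][w] = hits the application would see with w ways (w = 0 .. totalWays).
// Greedily hand each way to whichever application gains the most from one more way.
std::vector<int> allocateWays(const std::vector<std::vector<long long>> &hits, int totalWays) {
    std::vector<int> ways(hits.size(), 0);
    for (int w = 0; w < totalWays; w++) {
        size_t best = 0;
        long long bestGain = -1;
        for (size_t a = 0; a < hits.size(); a++) {
            long long gain = hits[a][ways[a] + 1] - hits[a][ways[a]];  // marginal utility
            if (gain > bestGain) { bestGain = gain; best = a; }
        }
        ways[best]++;
    }
    return ways;
}

Per the slide, TAP-UCP keeps this structure but scales the GPU's counts down by the measured access-rate ratio and lets core sampling, rather than raw hit rate, decide how much the GPU should receive.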
QoS Aware Memory Bandwidth Partitioning
(Jeong et al. DAC 2012)

- Typical memory controller policy: always prioritize the CPU
  - The CPU is latency sensitive, the GPU is not
  - However, this can slow down GPU traffic
  - A problem for real-time applications (graphics)
QoS Aware Memory Bandwidth Partitioning (continued)
(Jeong et al. DAC 2012)

- Static management policies are problematic
- The authors propose a dynamic management scheme (summarized in the sketch below)
  - Default: prioritize the CPU over the GPU
  - Periodically measure the current rate of progress on the frame
  - Work is decomposed into smaller tiles, so measurement is simple
  - Compare with the target frame rate
  - If the current frame rate is behind the target, set CPU and GPU priorities equal
  - If close to the deadline and still behind, boost GPU request priority even further
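The dynamic policy can be condensed into a few lines (a sketch; the tile-progress counters and deadline inputs are assumed interfaces, not the paper's code):

enum Priority { CPU_FIRST, EQUAL, GPU_FIRST };

// Re-evaluated periodically: compare tiles finished so far against the number needed
// to stay on the target frame rate, then pick the memory-controller priority.
Priority choosePriority(int tilesDone, int tilesTotal, double timeElapsed, double frameBudget) {
    double expectedDone = tilesTotal * (timeElapsed / frameBudget);
    bool nearDeadline   = timeElapsed > 0.9 * frameBudget;   // assumed threshold
    if (tilesDone >= expectedDone) return CPU_FIRST;  // on track: keep the CPU-first default
    if (nearDeadline)              return GPU_FIRST;  // behind and near the deadline: boost GPU
    return EQUAL;                                     // behind: give CPU and GPU equal priority
}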
Part 5: Opportunistic Optimizations
Opportunistic Optimizations

- Chip integration advantages
  - Lower latency
  - New communication paths, e.g. a shared L2
- Opportunity for non-envisioned usage
- Use idle resources to help active execution
  - Idle GPU helps the CPU
  - Idle CPU helps the GPU
Idle GPU Shader based Prefetching
(Woo et al. ASPLOS 2010)

- Realization: advanced prefetching schemes are not adopted because of high storage costs
- The GPU system can have exploitable idle resources
- Use idle GPU shader resources
  - Register files as prefetcher storage
  - Execution threads as logic structures
  - Parallel prefetcher execution threads to improve latency
- Propose an OS based enabling and control interface
  - Miss Address Provider
  - Library of prefetchers and application specific selection
- Prefetching performance benefit of 68%
Miss Address Provider

[Figure: the Miss Address Provider (MAP) sits next to the shared on-chip last level cache. The OS allocates an idle GPU core via a command buffer and shader pointer; miss information (miss PC, miss address) is forwarded to that core, which stores and processes the miss stream and prefetches data into the shared LLC.]
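To make the idea concrete, a hypothetical kernel that could play the role of the prefetcher logic on the allocated core: it walks the per-PC miss information forwarded by the MAP and emits stride-predicted prefetch candidates. The interface, buffer layout, and prediction scheme are assumptions for illustration, not the paper's implementation (which builds a library of prefetchers, including Markov-style ones):

// Hypothetical: one GPU thread tracks one miss PC; state lives in registers/shared memory.
__global__ void stridePrefetcher(const unsigned long long *latestMissAddr,  // from the MAP
                                 unsigned long long *lastAddr,
                                 long long *lastStride,
                                 unsigned long long *prefetchCandidate, int numPCs) {
    int pc = blockIdx.x * blockDim.x + threadIdx.x;
    if (pc >= numPCs) return;
    unsigned long long addr = latestMissAddr[pc];
    long long stride = (long long)(addr - lastAddr[pc]);
    // Emit a candidate only when the stride repeats; 0 means "no prefetch".
    prefetchCandidate[pc] = (stride != 0 && stride == lastStride[pc]) ? addr + stride : 0;
    lastAddr[pc]   = addr;
    lastStride[pc] = stride;
}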
CPU assisted GPGPU processing
(Yang et al. HPCA 2012)

- Use idle CPU resources to prefetch for GPGPU applications
  - Targets bandwidth sensitive GPGPU applications
- Compiler based framework converts GPU kernels into a CPU prefetching program
- The CPU must run appropriately ahead of the GPU
  - If it falls too far behind, the CPU cache hit rate becomes very high (the data is already on chip, so the prefetch is wasted)
  - If it runs too far ahead, the GPU cache hit rate becomes very low (prefetched data is evicted before use)
  - Very few CPU cycles are required since the LLC line is large
- Prefetching performance benefit of 21%
Example GPU Kernel and CPU Program

// GPU kernel: the memory requests made by a single thread
__global__ void VecAdd (float *A, float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

// Generated CPU prefetching program (schematic, as presented on the slide)
float mem_fetch (float *A, float *B, float *C, int N) {
    return A[N] + B[N] + C[N];
}

void cpu_prefetching (…) {
    unroll_factor = 8;                                         // artificially boosts CPU requests
    // traverse all thread blocks (TB)
    for (j = 0; j < N_TB; j += Concurrent_TB)
        // loop over all concurrent thread blocks
        for (i = 0; i < Concurrent_TB * TB_Size;
             i += skip_factor * batch_size * unroll_factor) {  // skip_factor controls CPU timing
            for (k = 0; k < batch_size; k++) {                 // batch_size controls how often
                                                               // skip_factor is updated
                id = i + skip_factor * k * unroll_factor + j * TB_Size;
                // unrolled loop
                float a0 = mem_fetch (id + skip_factor * 0);
                float a1 = mem_fetch (id + skip_factor * 1);
                …
                sum += a0 + a1 + …;
            }
            update skip_factor
        }
}
Drawbacks: CPU assisted GPGPU processing

- Does not consider the effects of thread block scheduling
- The CPU program is stripped of actual computations
  - Memory requests from data or computation dependent paths are not considered
Part 6: Future Work (Tools, Emerging Technologies, Power / Temperature / Reliability)
Continued System Optimizations

- Continued holistic optimizations
  - Understand the impact of GPU workloads on CPU requests at the memory controller
- Continued opportunistic optimizations
  - The latest GPUs allow different kernels to run on the same GPU
  - Can GPU threads prefetch for other GPU kernels?
Research Tools

- Severe lack of GPU research tools
  - No GPU power model
  - No GPU temperature model
- Immediate and impactful opportunities
Power, Temperature and Reliability

- Bounded by the lack of power tools
- No work yet on effective power management
- No work yet on effective temperature management
Emerging Technologies

- Impact of non-volatile memories on GPUs
- 3D die stacked GPUs
- Stacked CPU-GPU-main memory systems
Conclusions

- In this work we looked at the CPU-GPU research landscape
- GPGPU systems are quickly scaling in performance
- The CPU needs to be refocused to handle extremely irregular code
- The design of shared components needs to be rethought
- Abundant optimization and research opportunities!

Questions?
Backup Slides

Results – Stores
- Similar trends as loads, but slightly less pronounced

Results – Branch Prediction Rates
- Hard branches translate to higher misprediction rates
- Strong influence of CPU only benchmarks