A Variable Warp-Size Architecture
Timothy G. Rogers, Daniel R. Johnson, Mike O'Connor, Stephen W. Keckler

Contemporary GPU: Massively Multithreaded
• 10,000's of threads concurrently executing on 10's of Streaming Multiprocessors (SMs)
• [Diagram: SMs connected through an interconnect to memory controllers and L2$ slices]

Contemporary Streaming Multiprocessor (SM)
• 1,000's of schedulable threads
• Amortizes front end and memory overheads by grouping threads into warps
• The size of the warp is fixed by the architecture
• [Diagram: SM frontend (L1 I-Cache, decode, warp control logic), 32-wide warp datapath, memory unit]

Contemporary GPU Software
• Regular, structured computation
• Predictable access and control flow patterns
• Can take advantage of HW amortization for increased performance and energy efficiency
• Executes efficiently on a GPU today: graphics shaders, matrix multiply, ...

Forward-Looking GPU Software
• Still massively parallel, but less structured
• Memory access and control flow patterns are less predictable
• Less efficient on today's GPUs: molecular dynamics, raytracing, object classification, ...

Divergence: Source of Inefficiency
• Regular hardware that amortizes front end and memory overhead meets irregular software with many different control flow paths and less predictable memory accesses
• Branch divergence: an if (...) { ... } on a 32-wide warp can cut function unit utilization to 1/32
• Memory divergence: a single Load R1, 0(R2) may wait for 32 cache lines from main memory

Irregular "Divergent" Applications: Perform Better with a Smaller Warp Size
• Branch divergence: increased function unit utilization; more threads can proceed concurrently
• Memory divergence: each instruction waits on fewer accesses

Negative Effects of a Smaller Warp Size
• Less front end amortization; increase in fetch/decode energy
• Negative performance effects: scheduling skew increases pressure on the memory system

Regular "Convergent" Applications: Perform Better with a Wider Warp
• GPU memory coalescing: with a 32-wide warp, one memory system request can service all 32 threads
• Smaller warps mean less coalescing: redundant memory accesses (8 in the example) no longer occur together

Performance vs. Warp Size
• 165 applications, IPC normalized to warp size 32
• [Chart: warp size 4 vs. warp size 32 across convergent, warp-size insensitive, and divergent applications]

Goals
• Convergent applications: maintain wide-warp performance and front end efficiency
• Warp-size insensitive applications: maintain front end efficiency
• Divergent applications: gain small-warp performance
• Set the warp size based on the executing application

Sliced Datapath + Ganged Scheduling
• Split the SM datapath into narrow slices; 4-thread slices studied extensively
• Gang slice execution to gain the efficiencies of a wider warp
• Slices share an L1 I-Cache and memory unit; each slice has its own front end and 4-wide datapath
• Slices can execute independently
• [Diagram: frontend with L1 I-Cache and ganging unit driving per-slice front ends and 4-wide slice datapaths over a shared memory unit]

Initial Operation
• Slices begin execution in ganged mode, mirroring the baseline 32-wide warp system
• Instructions are fetched and decoded once; the ganging unit drives the slices
• Question: when to break the gang?

Breaking Gangs on Control Flow Divergence
• The ganging unit observes the different PCs reported by each slice
• PCs common to more than one slice form a new gang
• Slices that follow a unique PC are transferred to independent control, no longer controlled by the ganging unit

Breaking Gangs on Memory Divergence
• The latency of accesses from each slice can differ (e.g., one slice hits in the L1 while another goes to memory)
• Evaluated several heuristics for breaking the gang when this occurs

Gang Reformation
• Performed opportunistically
• The ganging unit checks for gangs or independent slices at the same PC and re-gangs them
• More details in the paper

Methodology
• In-house, cycle-level streaming multiprocessor model
• 1 in-order core, 64KB L1 data cache, 128KB L2 data cache (one SM's worth), 48KB shared memory, texture memory unit
• Limited-bandwidth memory system, Greedy-Then-Oldest (GTO) issue scheduler

Configurations
• Warp Size 32 (WS 32)
• Warp Size 4 (WS 4)
• Inelastic Variable Warp Sizing (I-VWS): gangs break on control flow divergence and are not reformed
• Elastic Variable Warp Sizing (E-VWS): like I-VWS, except gangs are opportunistically reformed
• Studied 5 applications from each category; the paper explores many more configurations in detail

Divergent Application Performance
• [Chart: IPC normalized to warp size 32 for WS 32, WS 4, I-VWS (break on control flow only), and E-VWS (break + reform) on CoMD, GamePhysics, ObjClassifier, Raytracing, Lighting, and HMEAN-DIV]
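To make the gang-breaking rule above concrete, here is a minimal C++ sketch, not the paper's hardware design: it assumes each slice reports the PC it wants to fetch next, groups slices that share a PC into new gangs, and releases any slice with a unique PC to independent control. The Slice type, the 8-slice example, and the PC values are invented for illustration.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// One 4-thread slice of the 32-wide datapath (illustrative model).
struct Slice {
    int      id;
    uint64_t next_pc;   // PC the slice wants to fetch next
};

// Regroup the slices of one former gang by their next PC.
// Slices sharing a PC form a new gang (fetched/decoded once by the
// ganging unit); a slice with a unique PC executes independently.
void split_gang(const std::vector<Slice>& gang,
                std::vector<std::vector<Slice>>& new_gangs,
                std::vector<Slice>& independent) {
    std::map<uint64_t, std::vector<Slice>> by_pc;
    for (const Slice& s : gang) by_pc[s.next_pc].push_back(s);

    for (auto& kv : by_pc) {
        if (kv.second.size() > 1) new_gangs.push_back(kv.second);   // still ganged
        else                      independent.push_back(kv.second[0]); // unique PC
    }
}

int main() {
    // 8 slices of one 32-wide warp; slices 2 and 5 branched to 0x140,
    // slice 7 branched to 0x180, the rest continue at 0x100.
    std::vector<Slice> gang;
    for (int i = 0; i < 8; ++i) {
        uint64_t pc = (i == 2 || i == 5) ? 0x140 : (i == 7 ? 0x180 : 0x100);
        gang.push_back({i, pc});
    }

    std::vector<std::vector<Slice>> gangs;
    std::vector<Slice> solo;
    split_gang(gang, gangs, solo);
    // Expect: the 0x100 and 0x140 groups re-gang; slice 7 goes independent.
    std::cout << gangs.size() << " gang(s), " << solo.size() << " independent slice(s)\n";
}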
Divergent Application Fetch Overhead
• Average fetches per cycle, used as a proxy for energy consumption
• [Chart: WS 32, WS 4, I-VWS, and E-VWS on CoMD, GamePhysics, ObjClassifier, Raytracing, Lighting, and AVG-DIV]

Convergent Application Performance
• [Chart: IPC normalized to warp size 32 for WS 32, WS 4, I-VWS, and E-VWS on Game 1, MatrixMultiply, Game 2, FeatureDetect, Radix Sort, and HMEAN-CON]
• Convergent application performance is maintained; warp-size insensitive applications are unaffected

Convergent/Insensitive Application Fetch Overhead
• [Chart: average fetches per cycle for WS 32, WS 4, I-VWS, and E-VWS on Image Proc., Convolution, Game 3, Game 4, FFT, AVG-WSI, Game 1, MatrixMultiply, Game 2, FeatureDetect, Radix Sort, and AVG-CON]

165 Application Performance
• [Chart: IPC normalized to warp size 32 for Warp Size 4 and I-VWS across convergent, warp-size insensitive, and divergent applications]

Related Work
• Compaction/Formation: improves function unit utilization; decreases thread-level parallelism; area cost
• Subdivision/Multipath: increases thread-level parallelism; decreases function unit utilization; area cost
• Variable Warp Sizing: increases thread-level parallelism; improves function unit utilization; area cost (VWS estimate: 5% for 4-wide slices, 2.5% for 8-wide slices)

Conclusion
• Explored the space surrounding warp size and performance
• Vary the size of the warp to meet the demands of the workload: 35% performance improvement on divergent apps, no performance degradation on convergent apps
• Narrow slices with ganged execution improve both SIMD efficiency and thread-level parallelism
• Questions?

Managing DRAM Latency Divergence in Irregular GPGPU Applications
Niladrish Chatterjee, Mike O'Connor, Gabriel H. Loh, Nuwan Jayasena, Rajeev Balasubramonian (SC 2014)

Irregular GPGPU Applications
• Conventional GPGPU workloads access vector- or matrix-based data structures: predictable strides, large data parallelism
• Emerging irregular workloads: pointer-based data structures and data-dependent memory accesses cause memory latency divergence on SIMT platforms
• This work: warp-aware memory scheduling to reduce DRAM latency divergence

SIMT Execution Overview
• Warps execute in lockstep on the SIMD lanes; a warp stalls while it waits on a memory access
• [Diagram: SIMT cores (warp scheduler, SIMD lanes, L1 memory port) connected through an interconnect to memory partitions, each with an L2 slice and a memory controller driving GDDR5 channels]

Memory Latency Divergence
• The coalescer has limited efficacy in irregular workloads
• Partial hits in the L1 and L2 are the first source of latency divergence
• DRAM requests can have varied latencies; the warp is stalled until its last request returns (DRAM latency divergence)
• [Diagram: a load from the 32 SIMD lanes passes through the access coalescing unit, L1, L2, and GDDR5]
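To illustrate why latency divergence stalls a warp, here is a small sketch under the assumption that a warp resumes only when its slowest outstanding request returns. The latency numbers (L1, L2, and GDDR5 costs) are placeholders chosen for illustration, not measured values.

#include <algorithm>
#include <iostream>
#include <vector>

// Per-lane latencies (in cycles) for one divergent load instruction.
// A SIMT warp is stalled until the *last* of its requests returns.
long warp_stall_cycles(const std::vector<long>& lane_latency) {
    return *std::max_element(lane_latency.begin(), lane_latency.end());
}

int main() {
    // Hypothetical example: 30 lanes hit in L1 (~20 cycles), one hits in L2
    // (~100 cycles), one goes all the way to GDDR5 (~400 cycles).
    std::vector<long> lanes(32, 20);
    lanes[7]  = 100;
    lanes[19] = 400;

    double avg = 0;
    for (long c : lanes) avg += c;
    avg /= lanes.size();

    std::cout << "average request latency: " << avg << " cycles\n"
              << "warp stall (max latency): " << warp_stall_cycles(lanes) << " cycles\n";
}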
GPU Memory Controller (GMC)
• Optimized for high throughput
• Harvests channel and bank parallelism: address mapping spreads cache lines across channels and banks
• Achieves a high row-buffer hit rate: deep queuing and aggressive reordering of requests for row-hit batching
• Not cognizant of the need to service requests from a warp together: requests from different warps are interleaved, leading to latency divergence

Warp-Aware Scheduling
• Two warps, A (on SM 1) and B (on SM 2), each issue a load and then stall until their requests return
• Baseline GMC scheduling interleaves their requests: A B A B A B A B
• Warp-aware scheduling services each warp's requests together: A A A A B B B B, reducing the average memory stall time

Impact of DRAM Latency Divergence
• If there were only 1 request per warp: ~5X improvement
• If all requests from a warp were returned in perfect sequence from the DRAM: ~40% improvement

Key Idea
• Form batches of requests from each warp: a warp-group
• Schedule all requests from a warp-group together
• The scheduling algorithm arbitrates between warp-groups to minimize the average stall time of warps

Controller Design
• [Diagrams: proposed warp-aware memory controller organization]

Warp-Group Scheduling: Single Channel
• Each warp-group is assigned a priority that reflects the completion time of its last request
• Priority inputs: queuing delay in the command queues, row hit/miss status of the requests, and the number of requests in the warp-group
• Higher priority goes to warp-groups with few requests, high spatial locality, and lightly loaded banks
• Priorities are updated dynamically
• The transaction scheduler picks the warp-group with the lowest run-time: shortest-job-first based on actual service time

WG Scheduling
• [Chart: latency divergence and bandwidth utilization for the GMC baseline, WG, and an ideal scheme]

Multiple Memory Controllers
• Channel-level parallelism: a warp's requests are sent to multiple memory channels and scheduled independently at each controller
• A subset of a warp's requests can therefore be delayed at one or a few memory controllers
• Coordinate scheduling between controllers: prioritize a warp-group that has already been serviced at other controllers
• A coordination message is broadcast to the other controllers on completion of a warp-group
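As a rough illustration of the single-channel warp-group arbitration described above, the sketch below implements shortest-job-first over warp-groups. The cost model (fixed row-hit and row-miss service times plus a per-bank queuing delay) and all constants are assumptions made for the example; the real scheduler estimates actual service time from the controller's queue state.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

struct Request { int bank; bool row_hit; };

struct WarpGroup {
    int warp_id;
    std::vector<Request> reqs;   // all pending DRAM requests of this warp
};

// Estimated service time of a warp-group: every request pays the queuing
// delay of its bank plus a row-hit or row-miss cost (illustrative numbers,
// not real GDDR5 timings). The group completes with its last request.
long estimate_service_time(const WarpGroup& wg, const std::vector<long>& bank_queue_delay) {
    const long kRowHit = 12, kRowMiss = 40;   // assumed costs, in cycles
    long worst = 0;
    for (const Request& r : wg.reqs) {
        long t = bank_queue_delay[r.bank] + (r.row_hit ? kRowHit : kRowMiss);
        worst = std::max(worst, t);
    }
    return worst;
}

// Shortest-job-first over warp-groups: favors few requests, high spatial
// locality (row hits), and lightly loaded banks, as in WG scheduling.
std::size_t pick_warp_group(const std::vector<WarpGroup>& groups,
                            const std::vector<long>& bank_queue_delay) {
    std::size_t best = 0;
    long best_t = std::numeric_limits<long>::max();
    for (std::size_t i = 0; i < groups.size(); ++i) {
        long t = estimate_service_time(groups[i], bank_queue_delay);
        if (t < best_t) { best_t = t; best = i; }
    }
    return best;
}

int main() {
    std::vector<long> bank_delay = {10, 0, 50, 5};
    std::vector<WarpGroup> pending = {
        {1, {{0, true}, {1, true}}},               // two row-hits on light banks
        {2, {{2, false}, {2, false}, {3, true}}},  // row-misses on a busy bank
    };
    std::cout << "schedule warp-group of warp "
              << pending[pick_warp_group(pending, bank_delay)].warp_id << "\n";
}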
Warp-Group Scheduling: Multi-Channel
• The priority table also tracks the status of each warp-group in the other channels
• Periodic messages inform the other channels about completed warp-groups
• The transaction scheduler still picks the warp-group with the lowest run-time

WG-M Scheduling
• [Chart: latency divergence and bandwidth utilization for the GMC baseline, WG, WG-M, and an ideal scheme]

Bandwidth-Aware Warp-Group Scheduling
• Warp-group scheduling negatively affects bandwidth utilization: reduced row-hit rate
• Conflicting objectives: issue a row-miss request from the current warp-group vs. issue row-hit requests to maintain bus utilization
• Activate and precharge idle cycles can be hidden by row-hits in other banks; delay the row-miss request to find the right slot
• The minimum number of row-hits needed in other banks to overlap (tRTP + tRP + tRCD) is determined by the GDDR timing parameters: the minimum efficient row burst (MERB)
• MERB is stored in a ROM looked up by the transaction scheduler; more banks with pending row-hits means a smaller MERB
• Schedule the row-miss after MERB row-hits have been issued to the bank (a sketch of this rule appears after this deck's conclusions)

WG-Bw Scheduling
• [Chart: latency divergence and bandwidth utilization for the GMC baseline, WG, WG-M, WG-Bw, and an ideal scheme]

Warp-Aware Write Draining
• Writes are drained in batches, starting at the High_Watermark
• Draining can stall small warp-groups
• When the write queue reaches a threshold lower than the High_Watermark, drain singleton warp-groups only, reducing write-induced latency

WG-W Scheduling
• [Chart: latency divergence and bandwidth utilization for the GMC baseline, WG, WG-M, WG-Bw, WG-W, and an ideal scheme]

Methodology
• GPGPU-Sim v3.1: cycle-accurate GPGPU simulator
  SM cores: 30; max threads/core: 1024; warp size: 32 threads/warp; L1 / L2: 32KB / 128KB; DRAM: 6Gbps GDDR5; DRAM channels/banks: 6 channels, 16 banks/channel
• USIMM v1.3: cycle-accurate DRAM simulator, modified to model the GMC baseline and GDDR5 timings
• Irregular and regular workloads from Parboil, Rodinia, Lonestar, and MARS

Performance Improvement
• [Chart: IPC normalized to the baseline for WG, WG-M, WG-Bw, and WG-W (y-axis 0-12%); callouts note reduced latency divergence and restored bandwidth utilization]

Impact on Regular Workloads
• Effective coalescing and high spatial locality within a warp-group
• WG scheduling behaves similarly to the GMC baseline: no performance loss
• WG-Bw and WG-W provide minor benefits

Energy Impact of Reduced Row Hit-Rate
• Scheduling row-misses over row-hits reduces the row-buffer hit rate by 16%
• In GDDR5, power consumption is dominated by I/O, so the increase in DRAM power is negligible compared to the execution speed-up
• Net improvement in system energy
• [Chart: GDDR5 energy per bit (pJ/bit) for the baseline vs. WG-Bw, broken down into I/O, row, column, control, DLL, and background]

Conclusions
• Irregular applications place new demands on the GPU's memory system
• Memory scheduling can alleviate the issues caused by latency divergence
• Carefully orchestrating the scheduling of commands can help regain the bandwidth lost by warp-aware scheduling
• Future techniques must also include the cache hierarchy in reducing latency divergence

Thanks!
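Following up on the WG-Bw slides, here is a minimal sketch of the MERB rule: a row-miss from the current warp-group is issued only after enough row-hits have been sent to other banks to hide its precharge/activate gap. The ROM contents and the function names (merb_lookup, may_issue_row_miss) are invented placeholders, not values derived from real GDDR5 timing parameters.

#include <iostream>

// Minimum Efficient Row Burst: how many row-hits must be in flight in *other*
// banks to hide a row-miss's tRTP + tRP + tRCD gap. Indexed by the number of
// banks that currently have pending row-hits (placeholder values: more busy
// banks means a smaller MERB, as on the slides).
int merb_lookup(int banks_with_pending_row_hits) {
    static const int kMerbRom[] = {8, 8, 6, 4, 3, 2, 2, 1, 1};
    const int last = sizeof(kMerbRom) / sizeof(kMerbRom[0]) - 1;
    int idx = banks_with_pending_row_hits;
    if (idx < 0) idx = 0;
    if (idx > last) idx = last;
    return kMerbRom[idx];
}

// WG-Bw decision for one bank: a row-miss from the current warp-group may be
// issued only after MERB row-hits have been issued since the last row-miss;
// otherwise keep streaming row-hits to preserve bus utilization.
bool may_issue_row_miss(int row_hits_issued_since_last_miss,
                        int banks_with_pending_row_hits) {
    return row_hits_issued_since_last_miss >= merb_lookup(banks_with_pending_row_hits);
}

int main() {
    std::cout << std::boolalpha
              << may_issue_row_miss(/*hits issued=*/3, /*busy banks=*/2) << "\n"   // false
              << may_issue_row_miss(/*hits issued=*/6, /*busy banks=*/2) << "\n";  // true
}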
Backup Slides
• Performance improvement: IPC
• Average warp stall latency
• DRAM latency divergence
• Bandwidth utilization
• Memory controller microarchitecture

Warp-Group Scheduling (details)
• Every batch is assigned a priority score: the completion time of its longest request
• Higher priority goes to warp-groups with few requests, high spatial locality, and lightly loaded banks
• Priorities are updated after each warp-group is scheduled
• The warp-group with the lowest service time is selected: shortest-job-first based on actual service time, not the number of requests

Priority-Based Cache Allocation in Throughput Processors
Dong Li*†, Minsoo Rhu*§†, Daniel R. Johnson§, Mike O'Connor§†, Mattan Erez†, Doug Burger‡, Donald S. Fussell†, and Stephen W. Keckler§†
*First authors Li and Rhu have made equal contributions to this work and are listed alphabetically

Introduction
• Graphics Processing Units (GPUs): general-purpose throughput processors with a massive thread hierarchy (warp = 32 threads)
• Challenge: small caches + many threads lead to contention and cache thrashing
• Prior work: throttling thread-level parallelism (TLP)
• Problem 1: under-utilized system resources
• Problem 2: unexplored cache efficiency

Motivation: Case Study (CoMD)
• Throttling improves performance over running maximum TLP (the default), up to a best throttled point
• Observation 1: resources are under-utilized
• Observation 2: cache efficiency is left unexplored

Motivation
• Throttling trades off cache efficiency against parallelism: the number of active warps equals the number of warps allocating cache
• Idea: decouple cache efficiency from parallelism and explore the unexplored part of that space

Our Proposal: Priority-Based Cache Allocation (PCAL)

Baseline: Standard Operation
• Regular threads have all capabilities and operate as usual (full access to the L1 cache)

TLP Throttling
• TLP is reduced to improve performance
• Regular threads operate as usual (full access to the L1 cache); throttled threads are not runnable

Proposed Solution: PCAL
• Three subsets of threads:
• Regular threads: all capabilities, operate as usual (full access to the L1 cache)
• Non-polluting threads: runnable, but prevented from polluting the L1 cache (fills bypass the cache)
• Throttled threads: not runnable

PCAL: Tokens for Privileged Cache Access
• Tokens grant capabilities or privileges to some threads (a per-warp priority token bit)
• Threads with tokens are allowed to allocate and evict L1 cache lines
• Threads without tokens can read and write, but not allocate or evict, L1 cache lines

PCAL: Enhanced Scheduler (Static Operation)
• The HW implementation is simple: the warp scheduler manages per-warp token bits (scheduler status bits), and token bits are sent with memory requests
• Token assignment policy: assign at warp launch, release at termination; tokens are re-assigned to the oldest token-less warp
• Parameters are supplied by software prior to kernel launch
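A minimal sketch of the token mechanism just described, assuming a simple static assignment: warps holding the priority token may allocate and evict L1 lines, token-less (non-polluting) warps run but their fills bypass the cache, and throttled warps are not scheduled. The WarpState, launch_warps, and l1_allocate_on_miss names and the warp counts are invented for illustration.

#include <iostream>
#include <vector>

enum class WarpClass { Regular, NonPolluting, Throttled };

// Per-warp state managed by the warp scheduler: a priority token bit.
// Regular warps hold a token; non-polluting warps run without one;
// throttled warps are not scheduled at all.
struct WarpState { WarpClass cls; bool token; };

// Give the first `num_tokens` of `num_active` warps a token at launch.
// (In the real scheme tokens are released at termination and re-assigned
// to the oldest token-less warp; that bookkeeping is omitted here.)
std::vector<WarpState> launch_warps(int total, int num_active, int num_tokens) {
    std::vector<WarpState> w(total);
    for (int i = 0; i < total; ++i) {
        if (i >= num_active)     w[i] = {WarpClass::Throttled,    false};
        else if (i < num_tokens) w[i] = {WarpClass::Regular,      true};
        else                     w[i] = {WarpClass::NonPolluting, false};
    }
    return w;
}

// L1 fill policy on a miss: the token bit travels with the memory request.
// Token holders may allocate (and evict) L1 lines; token-less warps still
// get their data, but the fill bypasses the cache so they cannot pollute it.
bool l1_allocate_on_miss(const WarpState& w) { return w.token; }

int main() {
    // Example point in the (#Warps, #Tokens) space: 6 active warps, 2 tokens.
    auto warps = launch_warps(/*total=*/8, /*num_active=*/6, /*num_tokens=*/2);
    for (std::size_t i = 0; i < warps.size(); ++i)
        std::cout << std::boolalpha << "warp " << i
                  << ": runnable=" << (warps[i].cls != WarpClass::Throttled)
                  << " allocates L1=" << l1_allocate_on_miss(warps[i]) << "\n";
}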
Two Optimization Strategies
• Exploit the 2-D (#Warps, #Tokens) search space, where #Tokens is the number of warps allocating cache and #Warps is the number of active warps
• 1. Increasing TLP (ITLP): add non-polluting warps without hurting the regular warps
• 2. Maintaining TLP (MTLP): reduce #Tokens to increase cache efficiency without decreasing TLP

Dynamic PCAL
• Decide the parameters at run time
• Choose MTLP or ITLP based on resource-usage performance counters
• For MTLP: search #Tokens in parallel; for ITLP: search #Warps sequentially
• Please refer to the paper for more details

Evaluation
• Simulation: GPGPU-Sim, an open-source academic simulator, configured similar to Fermi (full chip)
• Benchmarks: open-source GPGPU suites (Parboil, Rodinia, LonestarGPU, etc.); the selected benchmark subset shows sensitivity to cache size

MTLP Example (CoMD)
• MTLP: reducing #Tokens while keeping #Warps = 6
• [Chart: PCAL with MTLP vs. the best throttled configuration]

ITLP Example (Similarity-Score)
• ITLP: increasing #Warps while keeping #Tokens = 2
• [Chart: PCAL with ITLP vs. the best throttled configuration]

PCAL Results Summary
• Baseline: best throttled results
• Performance improvement: static PCAL 17%, dynamic PCAL 11%

Conclusion
• Existing throttling approaches can lead to two problems: resource under-utilization and unexplored cache efficiency
• We propose PCAL, a simple mechanism to rebalance cache efficiency and parallelism
• ITLP increases parallelism without hurting the regular warps; MTLP alleviates cache thrashing while maintaining parallelism
• Throughput improvement over the best throttled results: static PCAL 17%, dynamic PCAL 11%

Questions

Unlocking Bandwidth for GPUs in CC-NUMA Systems
Neha Agarwal*, David Nellans, Mike O'Connor, Stephen W. Keckler, Thomas F. Wenisch*
NVIDIA and University of Michigan*. (A major part of this work was done while Neha Agarwal was an intern at NVIDIA.) HPCA-2015

Evolving GPU Memory System
• GPU attached to 200 GB/s GDDR5; CPU attached to 80 GB/s DDR4
• CUDA 1.0-5.x: programmer-controlled copying to GPU memory (cudaMemcpy) over a 15.8 GB/s PCI-E link
• CUDA 6.0+ (current): unified virtual memory with run-time controlled copying, for better productivity
• Future (roadmap): NVLink, a cache-coherent CPU-GPU interconnect at 80 GB/s
• How to best exploit the full bandwidth while maintaining programmability?
Design Goals & Opportunities
• Simple programming model: no need for explicit data copying
• Exploit the full DDR + GDDR bandwidth: an additional 30% BW via NVLink, crucial for BW-sensitive GPU apps
• [Chart: total bandwidth (GB/s) of 15.8 (PCI-E), 80 (NVLink/DDR), 200 (GDDR), and 280 (target CC-NUMA) compared across legacy CUDA, UVM, and the CC-NUMA target]
• Design intelligent dynamic page migration policies to achieve both of these goals

Bandwidth Utilization
• [Chart: total bandwidth (GB/s) vs. % of pages migrated to GPU memory; total system memory BW 280, GPU memory BW 200, additional BW from cache coherence 80]
• Coherence-based accesses with no page migration waste GPU memory BW
• Static oracle: place data in the ratio of the memory bandwidths [ASPLOS'15]; dynamic migration can exploit the full system memory BW
• Excessive migration leads to under-utilization of DDR BW; migrate pages for optimal BW utilization

Contributions
• Intelligent dynamic page migration: aggressively migrate pages to GDDR memory upon first touch; pre-fetch neighbors of touched pages to reduce TLB shootdowns; throttle page migrations when nearing peak BW
• Dynamic page migration performs 1.95x better than no migration, comes within 28% of the static oracle performance, and is 6% better than legacy CUDA

Outline
• Page migration techniques: first-touch page migration; range expansion to save TLB shootdowns; BW balancing to stop excessive migrations
• Results & conclusions

First-Touch Page Migration
• Naive policy: migrate pages from CPU memory to GPU memory when they are touched
• First-touch migration approaches legacy CUDA performance
• First-touch migration is cheap: no hardware counters required
• [Chart: throughput relative to no migration]

Problems with First-Touch Migration
• Migrating a page remaps its virtual address to a new physical page, so the stale CPU TLB entry must be invalidated (a TLB shootdown) and acknowledged
• TLB shootdowns may negate the benefits of page migration
• How to migrate pages without incurring the shootdown cost?
• [Chart: throughput relative to no migration, with and without shootdown overhead]
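A toy model of first-touch migration and its cost, as described on the preceding slides: the first GPU touch of a DDR-resident page migrates it to GDDR and requires one TLB shootdown. The MigrationState bookkeeping and the page numbers in the example are invented for illustration.

#include <cstdint>
#include <iostream>
#include <unordered_set>

// Minimal model of first-touch page migration: the first GPU access to a
// CPU-resident (DDR) page migrates it to GPU (GDDR) memory. The migration
// requires invalidating the stale CPU TLB entry, i.e. a TLB shootdown.
struct MigrationState {
    std::unordered_set<uint64_t> in_gddr;   // pages currently in GPU memory
    long shootdowns = 0;
};

void gpu_touch(MigrationState& st, uint64_t vpage) {
    if (st.in_gddr.count(vpage)) return;   // already local: fast GDDR access
    st.in_gddr.insert(vpage);              // first touch: migrate the page
    ++st.shootdowns;                       // shootdown cost can negate the benefit
}

int main() {
    MigrationState st;
    for (uint64_t p : {1ull, 2ull, 1ull, 3ull, 2ull}) gpu_touch(st, p);
    std::cout << st.in_gddr.size() << " pages migrated, "
              << st.shootdowns << " TLB shootdowns\n";
}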
Pre-fetch to Avoid TLB Shootdowns
• Intuition: hot virtual addresses are clustered; hot pages consume 80% of the BW
• Pre-fetch pages before they are accessed by the GPU: no shootdown cost for pre-fetched pages

Migration Using Range Expansion
• Pre-fetch pages in a spatially contiguous range around the accessed page (e.g., range 16)
• Accessed pages require a shootdown; pre-fetched pages do not
• Range expansion hides the TLB shootdown overhead
• [Chart: throughput relative to no migration, with and without range expansion, with and without shootdown overhead]

Revisiting Bandwidth Utilization
• First-touch + range expansion aggressively unlocks GDDR BW, but can over-migrate
• How to avoid excessive page migrations?
• [Chart: total bandwidth vs. % of pages migrated to GPU memory, marking the first-touch operating point and the excessive-migration region]

Bandwidth Balancing
• Excessive migrations leave DDR under-subscribed; first-touch alone leaves GDDR under-subscribed
• Throttle migrations when nearing peak BW: target roughly 70% of accesses served from GPU memory
• [Chart: % of accesses from GPU memory over time for first-touch and first-touch + range expansion, relative to the target]

Simulation Environment
• Simulator: GPGPU-Sim 3.x with a heterogeneous 2-level memory
• GPU GDDR5: 200 GB/s, 8 channels; CPU DDR4: 80 GB/s, 4 channels
• GPU-CPU interconnect: 80 GB/s; latency: 100 GPU core cycles (100 clks additional latency for CPU memory accesses)
• Workloads: Rodinia applications [Che'IISWC2009] and DoE mini apps [Villa'SC2014]

Results
• [Chart: throughput relative to no migration for legacy CUDA, first-touch + range expansion + BW balancing, and the static oracle]
• Benchmarks with streaming accesses show no reuse after migration
• Dynamic page migration performs 1.95x better than no migration, comes within 28% of the static oracle performance, and is 6% better than legacy CUDA

Conclusions
• Developed migration policies that require no programmer involvement
• First-touch migration is cheap but incurs many TLB shootdowns
• First-touch + range expansion unlocks GDDR memory BW
• BW balancing maximizes BW utilization by throttling excessive migrations
• These three complementary techniques effectively unlock the full system BW

Thank you
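A sketch of the bandwidth-balancing throttle from the slides above: migrations are allowed only while the fraction of accesses served from GPU memory is below the GDDR share of total bandwidth (200/280, roughly the 70% target on the slides). The access-sampling mechanism and the BwBalancer type are assumptions made for illustration.

#include <iostream>

// Bandwidth balancing: migrate pages only while GPU (GDDR) memory serves less
// than its share of total system bandwidth. With 200 GB/s GDDR5 + 80 GB/s
// DDR4, the target is 200/280 of accesses from GDDR (~71%); beyond that,
// further migration under-subscribes DDR and wastes aggregate bandwidth.
struct BwBalancer {
    double gddr_bw, ddr_bw;
    long   gddr_accesses = 0, total_accesses = 0;

    double target() const { return gddr_bw / (gddr_bw + ddr_bw); }

    void record_access(bool served_from_gddr) {
        ++total_accesses;
        if (served_from_gddr) ++gddr_accesses;
    }

    // First-touch (and range-expansion) migrations are allowed only below target.
    bool allow_migration() const {
        if (total_accesses == 0) return true;
        return double(gddr_accesses) / total_accesses < target();
    }
};

int main() {
    BwBalancer b{200.0, 80.0};
    for (int i = 0; i < 100; ++i) b.record_access(i % 4 != 3);   // 75% from GDDR
    std::cout << "target=" << b.target() << " migrate now? "
              << std::boolalpha << b.allow_migration() << "\n";
}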
In the case of backprop, we can see that higher thresholds perform poorly compared to threshold 1. Thresholds above 64 are simply too high; most pages are not accessed this frequently and thus few pages are migrated, resulting in poor GDDR bandwidth utilization. Range expansion prefetches these low-touch pages to GDDR as well, recouping the performance losses of the higher threshold policies and making them perform similar to a first touch migration policy.

In several benchmarks, neighboring pages are touched close together in time, because the algorithms are designed to use blocked access to the key data structures to enhance locality. Thus, the prefetching effect of range expansion is only visible when a page is migrated upon first touch to a neighboring page; by the second access to a page, all its neighbors have already been accessed at least once and there will be no savings from avoiding TLB shootdowns. On the other hand, in benchmarks such as needle, there is low temporal correlation among touches to neighboring pages. Even if a migration candidate is touched 64 or 128 times, some of its neighboring pages may not have been touched, and thus the prefetching effect of range expansion provides up to 42% performance improvement even at higher thresholds.

For minife, previously discussed in subsection III-B, the effect of prefetching via range expansion is to recoup some of the performance loss due to needless migrations. However, performance still falls short of the legacy memcpy approach, which, in effect, achieves perfect prefetching.

Overuse of range expansion hurts performance in some cases. Under the first touch migration policy (threshold 1), using range expansion of 16, 64, and 128, the worst-case performance degradations are 2%, 3%, and 2.5% respectively. While not visible in the graph due to the stacked nature of Figure 7, they are included in the geometric mean calculations.

Effectiveness of Range-Expansion (range expansion can save up to 45% of TLB shootdowns):

Benchmark  | Execution Overhead of TLB Shootdowns | % Migrations Without Shootdown | Execution Runtime Saved
backprop   | 29.1%  | 26%   | 7.6%
bfs        | 6.7%   | 12%   | 0.8%
cns        | 2.4%   | 20%   | 0.5%
comd       | 2.02%  | 89%   | 1.8%
kmeans     | 4.01%  | 79%   | 3.17%
minife     | 3.6%   | 36%   | 1.3%
mummer     | 21.15% | 13%   | 2.75%
needle     | 24.9%  | 55%   | 13.7%
pathfinder | 25.9%  | 10%   | 2.6%
srad v1    | 0.5%   | 27%   | 0.14%
xsbench    | 2.1%   | 1%    | 0.02%
Average    | 11.13% | 33.5% | 3.72%

TABLE II: Effectiveness of range prefetching at avoiding TLB shootdowns and runtime savings under a 100-cycle TLB shootdown overhead.

Overall, we observe that even with range expansion, higher-threshold policies do not significantly outperform the much simpler first-touch policy. With threshold 1, the average performance gain with range expansion of 128 is 1.85×. The best absolute performance is observed when using a threshold of 64 combined with a range expansion value of 64, providing 1.95× speedup.
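To make the threshold-plus-range-expansion policy discussed here concrete, a toy model follows: a page migrates once it has been touched a threshold number of times, and a spatially contiguous range of neighbors is prefetched with it without incurring shootdowns, per the description above. The RangeExpansionMigrator bookkeeping and the numbers in the example are invented for illustration and are not the paper's implementation.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>

// Threshold-based migration with range expansion, as a toy model.
// A page is migrated once it has been touched `threshold` times by the GPU;
// when it migrates, `range` spatially contiguous neighbors on each side are
// prefetched (migrated) with it. Per the description above, only the accessed
// page costs a TLB shootdown; the prefetched neighbors do not.
struct RangeExpansionMigrator {
    int threshold, range;
    std::unordered_map<uint64_t, int> touches;
    std::unordered_set<uint64_t> in_gddr;
    long shootdowns = 0, migrated = 0;

    void gpu_touch(uint64_t vpage) {
        if (in_gddr.count(vpage)) return;
        if (++touches[vpage] < threshold) return;          // not hot enough yet

        in_gddr.insert(vpage); ++migrated; ++shootdowns;   // accessed page: shootdown
        for (int d = 1; d <= range; ++d) {                 // neighbors: no shootdown
            for (uint64_t n : {vpage + d, vpage - d})
                if (!in_gddr.count(n)) { in_gddr.insert(n); ++migrated; }
        }
    }
};

int main() {
    RangeExpansionMigrator m{/*threshold=*/1, /*range=*/16};
    for (uint64_t p = 1000; p < 1064; ++p) m.gpu_touch(p);  // streaming touches
    std::cout << m.migrated << " pages migrated with only "
              << m.shootdowns << " shootdowns\n";
}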
We believe that this additional ≈5% speedup over first touch migration with aggressive range expansion is not worth the implementation complexity of tracking and differentiating all pages in the system. In the next section, we discuss how to recoup some of this performance for benchmarks such as bfs and xsbench, which benefit most from using a higher threshold.

V. BANDWIDTH BALANCING

In Section III, we showed that using a static threshold-based page migration policy alone could not ideally balance migrating enough pages to maximize GDDR bandwidth utilization while selectively moving only the hottest data. In Section IV, we showed that informed page prefetching using a low threshold and range expansion to exploit locality within an application's virtual address space matches or exceeds the performance of a simple threshold-based policy. Combining low-threshold migration with aggressive prefetching drastically ...

Data Transfer Ratio
• Performance is low when the GDDR/DDR transfer ratio is away from the optimal point

GPU Workloads: BW Sensitivity
• [Charts: relative throughput vs. aggregate DRAM bandwidth (1x = 200 GB/s, from 0.12x to 3x) and vs. additional DRAM delay in GPU core cycles, for backprop, bfs, hotspot, kmeans, needle, pathfinder, comd, minife, btree, lava_MD, sc_gpu, xsbench, gaussian, lud_cuda, srad_v1, heartwall, mummergpu, and cns]
• GPU workloads are highly BW sensitive

Results
• [Chart: throughput relative to no migration for legacy CUDA, first-touch, and the oracle]
• Dynamic page migration performs 1.95x better than no migration, comes within 28% of the static oracle performance, and is 6% better than legacy CUDA