Cache Coherence for GPU Architectures

Inderpreet Singh¹, Arrvindh Shriraman², Wilson Fung¹, Mike O'Connor³, Tor Aamodt¹
¹ University of British Columbia
² Simon Fraser University
³ AMD Research

Image source: www.forces.gc.ca

What is a GPU?
[Diagram: the CPU spawns work on the GPU and resumes when the GPU is done. Workgroups are made up of wavefronts and run on GPU cores. Each GPU core has a private L1D; the cores reach the L2 banks through an interconnect.]

Evolution of GPUs
• Graphics pipeline (OpenGL/DirectX): vertex shader, pixel shader
• Compute (OpenCL, CUDA)
  • e.g. matrix multiplication

Evolution of GPUs (continued)
• Future: coherent memory space
  • Efficient critical sections
  • Load balancing
[Diagram: stencil computation across workgroups; critical section: lock shared structure … computation … unlock]

GPU Coherence Challenges
• Challenge 1: Coherence traffic
  • Recalls
[Diagram: cores C1 to C4 each stream loads (Load C, Load D, Load E, …) through an L1D holding A and B. A "gets C" request at the L2/directory sends recalls (rcl A) to the L1Ds, which reply with acks.]
[Chart: interconnect traffic for applications that do not require coherence; series: No coherence, MESI, GPU-VI]

GPU Coherence Challenges
• Challenge 2: Tracking in-flight requests
  • Significant % of L2
[Diagram: L2/directory state transitions S (Shared), S_M, M (Modified), tracked in MSHRs]

GPU Coherence Challenges
• Challenge 3: Complexity
[Figure: state/event transition tables for non-coherent L1, non-coherent L2, MESI L1 and MESI L2]

GPU Coherence Challenges
All three challenges result from introducing coherence messages on a GPU:
1. Traffic: transferring
2. Storage: tracking
3. Complexity: managing

GPU cache coherence without coherence messages?
• YES, using global time

Temporal Coherence (TC)
• Global time
• Local timestamp > global time: VALID
• Global timestamp < global time: NO L1 COPIES
[Diagram: Core 1 and Core 2, each with an L1D, connected through the interconnect to the L2 banks. L1 lines carry a local timestamp and L2 lines a global timestamp; block A=0 is shown with timestamp 0.]

Temporal Coherence (TC) (continued)
• No coherence messages on the interconnect
[Diagram (animated, T=0, T=11, T=15): an L1D holds A=0 with local timestamp 10, the L2 bank holds A with global timestamp 10, and the value of A at the L2 changes from 0 to 1.]

Temporal Coherence (TC) (continued)
What lifetime values should be requested on loads?
• Use a predictor to predict lifetime values
What about stores to unexpired blocks?
• Stall them at the L2?

TC Stalling Issues
Stall?
• Problem #1: Sensitive to mispredictions
• Problem #2: Impedes other accesses
• Problem #3: Hurts existing GPU applications
Solution: TC-Weak

TC-Weak
• Stores return Global Write Completion Time (GWCT)
• No stalling at L2
Code sequence:
1 data=NEW
2 FENCE
3 flag=SET
[Diagram (animated, T=0, T=1, T=31): GPU Core 1 and GPU Core 2, each with an L1D and a GWCT table with one entry per wavefront (W0, W1). An L1D holds data=OLD with timestamp 30. At the L2 bank (timestamps 30 and 47), data changes from OLD to NEW and flag from NULL to SET.]

TC-Weak
[Table comparing Stalling and TC-Weak on the following criteria:]
• Misprediction sensitivity
• Doesn't impede other accesses
• Good for existing GPU applications

Methodology
• GPGPU-Sim v3.1.2 for GPU core model
• GEMS Ruby v2.1.1 for memory system
• All protocols written in SLICC
• Model a generic NVIDIA Fermi-based GPU (see paper for details)
• Applications:
  • 6 do not require coherence
  • 6 require coherence
    • Barnes Hut
    • Cloth Physics
    • Versatile Place and Route
    • Max-Flow Min-Cut
    • 3D Wave Equation Solver
    • Octree Partitioning
[Labels grouping the applications that require coherence: Locks, Stencil communication, Load balancing]

Interconnect Traffic
• Reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications
• Lower traffic than 16x-sized 32-way directory
[Chart: interconnect traffic for applications that do not require coherence; series: NO-COH, MESI, GPU-VI, TC-Weak]

Performance
• TC-Weak with simple predictor performs 85% better than disabling L1 caches
• Performs 28% better than TC with stalling
• Larger directory sizes do not improve performance
[Chart: speedup for applications that require coherence; series: NO-L1, MESI, GPU-VI, TC-Weak]

Complexity
[Figure: state tables for Non-Coherent L1, Non-Coherent L2, MESI L1, MESI L2, TC-Weak L1 and TC-Weak L2]

Summary
• First work to characterize GPU coherence challenges
• Save traffic and energy by using global time
• Reduce protocol complexity
• 85% performance improvement over no coherence

Questions?

Backup Slides

Lifetime Predictor
• One prediction value per L2 bank
• Events local to L2 bank update prediction value
Events at the L2 bank:
1. Expired load: ↑ (prediction++)
2. Unexpired store: ↓ (prediction--)
3. Unexpired eviction: ↓ (prediction--)
[Diagram (animated, T=0 and T=20): block A in the L2 bank with timestamps 10 and 30, next to the bank's prediction value]

TC-Strong vs TC-Weak
[Charts: two panels, "Fixed lifetime for all applications" and "Best lifetime for each application"; y-axis: speedup; x-axis: all applications; series labels: TCSUO, TCSOO, TCS, TCW, TCW w/ predictor]

Interconnect Power and Energy
[Charts: normalized power and normalized energy, each shown for inter-workgroup and intra-workgroup applications; stacked components: Link (Dynamic), Link (Static), Router (Dynamic), Router (Static); series: NO-L1, NO-COH, MESI, GPU-VIni, GPU-VI, TCW]
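
Sketch: TC-Weak timestamps and the lifetime predictor

The timestamp rules on the Temporal Coherence slides, the GWCT fence on the TC-Weak slide, and the three predictor events on the Lifetime Predictor slide can be put together in a small Python model. This is a minimal sketch, not the SLICC protocol from the paper. The names `L2Bank`, `L1Cache`, `fence_done` and `on_evict` are hypothetical, and so are the starting lifetime, the step size of 1, the floor of 1 on the prediction, and the reading of "expired load" as a load that reaches the L2 after the block's global timestamp has passed.

```python
from dataclasses import dataclass, field


@dataclass
class L2Bank:
    """Toy L2 bank: values, one global timestamp per block, one prediction."""
    prediction: int = 10                            # assumed starting lifetime
    data: dict = field(default_factory=dict)        # addr -> value
    global_ts: dict = field(default_factory=dict)   # addr -> global timestamp

    def _unexpired(self, addr, now):
        # Slide rule: global timestamp < global time means no L1 copies remain.
        ts = self.global_ts.get(addr)
        return ts is not None and ts >= now

    def load(self, addr, now):
        """Serve an L1 miss; return (value, local timestamp for the L1 copy)."""
        old_ts = self.global_ts.get(addr)
        if old_ts is not None and old_ts < now:
            self.prediction += 1                    # expired load: raise lifetime
        new_ts = now + self.prediction
        if old_ts is not None:
            new_ts = max(new_ts, old_ts)            # never shorten an older lease
        self.global_ts[addr] = new_ts
        return self.data.get(addr), new_ts

    def store(self, addr, value, now):
        """Write without stalling; return the GWCT for the writer's fence."""
        unexpired = self._unexpired(addr, now)
        if unexpired:
            self.prediction = max(1, self.prediction - 1)   # unexpired store
        self.data[addr] = value
        return self.global_ts[addr] if unexpired else now

    def on_evict(self, addr, now):
        """Predictor update only; the eviction itself is not modeled."""
        if self._unexpired(addr, now):
            self.prediction = max(1, self.prediction - 1)   # unexpired eviction


@dataclass
class L1Cache:
    lines: dict = field(default_factory=dict)       # addr -> (value, local ts)

    def load(self, addr, now, l2):
        hit = self.lines.get(addr)
        if hit is not None and hit[1] > now:        # local timestamp > global time
            return hit[0]
        self.lines[addr] = l2.load(addr, now)       # expired or absent: refetch
        return self.lines[addr][0]


def fence_done(gwct, now):
    """A fence completes once global time has passed the wavefront's GWCT."""
    return now > gwct


# Replay of the TC-Weak slide with an assumed starting lifetime of 30.
l2 = L2Bank(prediction=30, data={"data": "OLD", "flag": "NULL"})
core2 = L1Cache()
core2.load("data", 0, l2)                     # Core 2 caches data=OLD until T=30
gwct = l2.store("data", "NEW", 1)             # Core 1: data=NEW, GWCT is 30
assert not fence_done(gwct, 30) and fence_done(gwct, 31)
l2.store("flag", "SET", 31)                   # Core 1: flag=SET after the fence
assert core2.load("flag", 31, l2) == "SET"
assert core2.load("data", 31, l2) == "NEW"    # stale copy has self-expired
```

In this model the store to `data` never waits at the L2. The writer's fence absorbs the delay instead, so Core 2 can only observe `flag=SET` once its stale copy of `data` has expired on its own.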