Exploring Memory Consistency for Massively Threaded Throughput-Oriented Processors
Blake Hechtman and Daniel J. Sorin

Slide 1: Executive Summary
• Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and used for general-purpose programming
• Conventional wisdom favors weak consistency on MTTOPs
• We implement a range of memory consistency models (SC, TSO, and RMO) on MTTOPs
• We show that strong consistency is viable for MTTOPs

Slide 2: What is an MTTOP?
• Massively Threaded:
  – 4-16 core clusters
  – 8-64-wide SIMD
  – 64-128-deep SMT
  – Thousands of concurrent threads
• Throughput-Oriented:
  – Sacrifices latency for throughput
  – Heavily banked caches and memories
  – Many cores, each of which is simple

Slide 3: Example MTTOP Core Cluster
[Figure: each core cluster has a shared fetch/decode front end, multiple SIMD execution lanes, and a shared L1 cache; many core clusters connect through a cache-coherent, banked L2 and shared memory via a memory controller.]

Slide 4: What is Memory Consistency?
Initially A = B = 0

  Thread 0      Thread 1
  ST B = 1      ST A = 1
  LD r1, A      LD r2, B

• Sequential Consistency (SC): {r1,r2} can be {0,1}, {1,0}, or {1,1}
• Weak consistency: {r1,r2} can be {0,1}, {1,0}, {1,1}, or {0,0}
  – the extra {0,0} outcome is what store buffering enables
• In this work, we explore hardware consistency models
• MTTOP hardware concurrency seems likely to be constrained by Sequential Consistency (SC)

Slide 5: The (CPU) Memory Consistency Debate

                   Strong consistency   Weak consistency
  Performance      Slower               Faster
  Programmability  Easier               Harder

• Conclusion for CPUs: trade off ~10-40% performance for programmability
  – "Is SC + ILP = RC?" (Gniady et al., ISCA 1999)
• But does this conclusion apply to MTTOPs?

Slide 6: Memory Consistency on MTTOPs
• GPUs have undocumented hardware consistency models
• Intel MIC uses x86-TSO for the full chip, with a directory cache coherence protocol
• MTTOP programming languages provide weak ordering guarantees
  – OpenCL does not guarantee store visibility without a barrier or kernel completion
  – CUDA includes a memory fence that can enable global store visibility

Slide 7: MTTOP Conventional Wisdom
• Highly parallel systems benefit from less ordering
  – Graphics doesn't need ordering
• Strong consistency seems likely to limit memory-level parallelism (MLP)
• Strong consistency is likely to suffer extra latencies
• Weak ordering helps CPUs; does it help MTTOPs?
  – It depends on how MTTOPs differ from CPUs (see the sketch below, then Diffs 1-5)
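To make the slide 4 example concrete, here is a minimal C++11 sketch of the store-buffering litmus test. This is our illustration, not code from the paper (whose ISA is Alpha-like). With memory_order_seq_cst, the outcome r1 = r2 = 0 is forbidden; substituting memory_order_relaxed stands in for a weak hardware model in which a store buffer may delay each store past the other thread's load.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};
int r1, r2;

void thread0() {
    B.store(1, std::memory_order_seq_cst);   // ST B = 1
    r1 = A.load(std::memory_order_seq_cst);  // LD r1, A
}

void thread1() {
    A.store(1, std::memory_order_seq_cst);   // ST A = 1
    r2 = B.load(std::memory_order_seq_cst);  // LD r2, B
}

int main() {
    std::thread t0(thread0), t1(thread1);
    t0.join();
    t1.join();
    // Under SC, at least one store is globally visible before the other
    // thread's load executes, so {r1,r2} = {0,0} cannot occur. With
    // memory_order_relaxed, buffered stores may be delayed past both
    // loads, making {0,0} observable.
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```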
Slide 8: Diff 1: Ratio of Loads to Stores
Weak consistency reduces the impact of store latency on performance.
• CPUs: prior work shows CPUs perform 2-4 loads per store
• MTTOPs: perform far more loads per store
[Bar chart: loads per store on a log scale from 1 to 10000, CPU workloads vs. MTTOP benchmarks]
⇒ Store latency optimizations will not be as critical to MTTOP performance

Slide 9: Diff 2: Outstanding L1 Cache Misses
Weak consistency enables more outstanding L1 misses per thread.

                        CPU core         MTTOP core (CU/SM)
  Threads per core      4                64
  SIMD width            4                64
  L1 miss rate          0.1              0.5
  SC misses per core    1.6 (too few)    2048 (enough)
  …                     …                …
  RMO misses per core   6.4              8192

(The SC rows follow from threads per core x SIMD width x miss rate: 4 x 4 x 0.1 = 1.6 and 64 x 64 x 0.5 = 2048.)
⇒ MTTOPs already sustain many outstanding L1 misses under SC, so the per-thread reordering that weak consistency enables is less important for hiding memory latency

Slide 10: Diff 3: Memory System Latencies
Weak consistency enables reductions of store latency.

            CPU core          MTTOP core cluster
  L1 hit    1-2 cycles        10-70 cycles
  L2 hit    5-20 cycles       100-300 cycles
  Memory    100-500 cycles    300-1000 cycles

⇒ MTTOPs have longer memory latencies, so small latency savings will not significantly improve performance

Slide 11: Diff 4: Frequency of Synchronization
Weak consistency only reorders memory ops between synchronizations.
• Common pattern on both CPUs and MTTOPs:
  – split the problem into regions
  – assign regions to threads
  – loop: work on the local region, then synchronize
• CPU local region: ~private cache size
• MTTOP local region: ~private cache size / threads per cache
⇒ MTTOPs use more threads to compute a problem, so each thread has fewer independent memory ops between syncs

Slide 12: Diff 5: RAW Dependences Through Memory
Weak consistency enables store-to-load forwarding.
• CPUs: blocking for cache performance, frequent function calls, few architected registers
  ⇒ many RAW dependences through memory
• MTTOPs: coalescing for cache performance, inlined function calls, many architected registers
  ⇒ few RAW dependences through memory
⇒ MTTOP algorithms have fewer RAW memory dependences, so there is little benefit to being able to read from a write buffer

Slide 13: MTTOP Differences and Their Impact
• Other differences are discussed in the paper
• How much do these differences affect the performance of memory consistency implementations on MTTOPs?

Slide 14: Memory Consistency Implementations (strongest to weakest)
• SCsimple: no write buffer
• SCwb: per-lane FIFO write buffer, drained on loads
• TSO: per-lane FIFO write buffer, drained on fences
• RMO: per-lane CAM for outstanding write addresses
[Figure: four pipeline variants, each with a fetch/decode front end and SIMD lanes feeding the L1 through the structure listed above; a behavioral sketch follows below]
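A minimal C++ sketch of the per-lane write buffer policies on slide 14. The class and method names are ours, and this is a behavioral model under our own simplifying assumptions, not the paper's hardware. SCwb must drain buffered stores before any load issues, while TSO drains only at fences; real TSO hardware would also forward a buffered store to a load with a matching address, which is elided here for brevity.

```cpp
#include <cstdint>
#include <queue>

// A store retired by one SIMD lane, waiting to be written to the L1.
struct BufferedStore {
    uint64_t addr;
    uint64_t data;
};

// One lane's FIFO write buffer. Stores drain to the L1 in order;
// the consistency model decides when the lane must wait for a drain.
class LaneWriteBuffer {
    std::queue<BufferedStore> fifo;

    void drainAllToL1() {
        while (!fifo.empty()) {
            // issueToL1(fifo.front()) would go here in a full model
            fifo.pop();
        }
    }

public:
    // Stores always retire immediately into the buffer.
    void onStore(uint64_t addr, uint64_t data) { fifo.push({addr, data}); }

    // SCwb: loads may not bypass earlier stores, so every load
    // first drains the buffer (preserving store->load order).
    void beforeLoad_SCwb() { drainAllToL1(); }

    // TSO: loads may bypass buffered stores to other addresses;
    // nothing to do here (store->load order is relaxed).
    void beforeLoad_TSO() {}

    // TSO: a FENCE makes all buffered stores globally visible.
    void onFence_TSO() { drainAllToL1(); }
};
```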
Slide 15: Methodology
• Modified gem5 to support SIMT cores running a modified version of the Alpha ISA
• Evaluated typical MTTOP workloads, ported to run in the system model
• Ported Rodinia benchmarks: bfs, hotspot, kmeans, and nn
• Handwritten benchmarks: dijkstra, 2dconv, and matrix_mul

Slide 16: Target MTTOP System

  Parameter                           Value
  Core clusters                       16 clusters, each with 8-wide SIMD
  Core                                in-order, Alpha-like ISA, 64-deep SMT
  Interconnection network             2D torus
  Coherence protocol                  writeback MOESI
  L1I cache (shared by cluster)       perfect, 1-cycle hit
  L1D cache (shared by cluster)       16KB, 4-way, 20-cycle hit, no local memory
  L2 cache (shared by all clusters)   256KB, 8 banks, 8-way, 50-cycle hit

Consistency-model-specific features (configured to give the benefit to weaker models):

  Write buffer (SCwb and TSO)             perfect, instant access
  CAM for store address matching (RMO)    perfect, instant access

Slide 17: Results
[Bar chart: MTTOP consistency model performance comparison; speedup of SC, SCwb, TSO, and RMO on 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, and nn; y-axis 0 to 1.6]

Slide 18: Results
[Same chart, annotated: the benchmarks where RMO pulls ahead exhibit significant load reordering]

Slide 19: Conclusions
• Strong consistency should not be ruled out for MTTOPs on the basis of performance
• Improving store performance with write buffers appears unnecessary
• Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)
• Conventional wisdom may be wrong about MTTOPs

Slide 20: Caveats and Limitations
• Results do not necessarily apply to all possible MTTOPs or MTTOP software
• Evaluation uses writeback caches, whereas current MTTOPs use write-through caches
• Future workloads may become more CPU-like

Slide 21: Thank you! Questions?
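Backup: to ground Diff 1 and Diff 5, a sketch (ours, not from the paper) of a 2D-convolution inner loop in the style of the 2dconv benchmark. Each output element performs 2 x K x K loads for a single store, and no load ever reads a location the kernel has written, so a write buffer has nothing to forward.

```cpp
#include <vector>

// 2D convolution of an H x W image with a K x K filter (K odd).
// Per output element: K*K image loads plus K*K filter loads feed a
// single store. For K = 5 that is 50 loads per store, far from the
// 2-4 loads per store typical of CPU code, and no RAW dependence
// goes through memory: the accumulator lives in a register.
void conv2d(const std::vector<float>& in, const std::vector<float>& filt,
            std::vector<float>& out, int H, int W, int K) {
    const int R = K / 2;
    for (int y = R; y < H - R; ++y) {
        for (int x = R; x < W - R; ++x) {
            float acc = 0.0f;  // register accumulator, not memory
            for (int fy = 0; fy < K; ++fy)
                for (int fx = 0; fx < K; ++fx)
                    acc += in[(y + fy - R) * W + (x + fx - R)]
                         * filt[fy * K + fx];   // loads only
            out[y * W + x] = acc;               // the single store
        }
    }
}
```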