The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian (MIT), Omer Khan (University of Connecticut, Storrs), Srini Devadas (MIT)

Cache Hierarchy Organization: Directory-Based Coherence
• Private caches: 1 or 2 levels
• Shared cache: last level, with an integrated directory
• Concurrent reads lead to replication in private caches
• The directory maintains coherence for the replicated lines
[Diagram: a write miss (1) travels to the shared cache + directory (2), which invalidates the sharers (3, 4)]

Private Caching: Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient, low-latency local access to private and shared data (cache-line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality
  – Working set > private cache size
    ☹ Inefficient cache utilization (cache thrashing)
    ☹ Unnecessary fetch of entire cache lines
    ☹ Shared-data replication increases the working set
  – Shared data with frequent writes
    ☹ Wasteful invalidations, synchronous writebacks, cache-line ping-ponging
    ☹ Increased on-chip communication and time spent waiting for expensive events

On-Chip Communication Problem
• Bill Dally, Stanford: "Wires relative to gates are getting worse every generation"
• Shekhar Borkar, Intel: "Bit movement is much more expensive than computation"
• Must architect efficient coherence protocols

Locality of Benchmarks: Reuse before Evictions
• Utilization: number of private L1 cache accesses before a cache line is evicted
• 40% of evicted lines have a utilization < 4
[Histogram: distribution of utilization at eviction]

Locality of Benchmarks: Reuse before Invalidations
• Utilization: number of private L1 cache accesses before a cache line is invalidated (by an intervening write)
[Histogram: distribution of utilization at invalidation]

Remote-Word Access (RA)
A NUCA-based protocol [Fensch et al., HPCA'08] [Hoffmann et al., HiPEAC'10]
• Assign each memory address to a unique "home" core
  – Cache line present only in the shared
cache at the "home" core (single location)
• For an access to a non-locally-cached word, ask the "remote" shared cache at the "home" core to perform the read/write
[Diagram: a write-word request (1) travels to the home core, which performs the access and replies (2)]

Remote-Word Access: Advantages & Drawbacks
☺ Energy-efficient for low-locality data
  – A word access (~200 bits) is cheaper than a cache-line fetch (~640 bits)
☺ NO data replication
  – Efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for every remote word access
☹ Expensive for high-locality data
☹ Data placement dictates the distance and frequency of remote accesses

Locality-Aware Cache Coherence
• Combine the advantages of private caching and remote access
• Privately cache high-locality lines
  – Optimize hit latency and energy
• Remotely cache low-locality lines
  – Prevent data replication and costly data movement
• Private Caching Threshold (PCT)
  – Utilization >= PCT: mark as private
  – Utilization < PCT: mark as remote

Locality-Aware Cache Coherence: PCT = 4
[Chart: breakdown of invalidations by utilization (1, 2-3, 4-5, 6-7, >=8); lines with utilization >= 4 are classified private, the rest remote]

Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion

Baseline System
[Diagram: tiled multicore; each tile has a compute pipeline, L1 I/D caches, an L2 slice with directory, and a router]
• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas et al., ISCA'09]
• ACKwise limited-directory protocol [Kurian et al., PACT'10]

Locality-Aware Coherence: Important Features
• Intelligent allocation of cache lines
  – In the private L1 cache
  – Allocation decision made per core, at cache-line granularity
• Efficient locality-tracking hardware
  – Decoupled from traditional coherence-tracking structures
• Low protocol complexity
  – NO additional
networks for deadlock avoidance

Implementation Details: Private Cache Line
Tag | State | LRU | Private Utilization
• Private Utilization bits track cache-line usage in the L1 cache
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%

Implementation Details: Directory Entry
Tag | State | ACKwise Pointers 1…p | (P/R_1, Remote Utilization_1) … (P/R_n, Remote Utilization_n)
• P/R_i: private/remote mode of core i
• Remote Utilization_i: line usage by core i at the shared L2 cache
• Complete locality classifier: tracks mode and remote utilization for ALL cores
• Storage overhead reduced later

Mode Transitions: Summary
• Classification based on previous behavior
• Initial: start in Private mode
• Private, Private Utilization >= PCT: stay Private
• Private, Private Utilization < PCT: demote to Remote
• Remote, Remote Utilization >= PCT: promote to Private
• Remote, Remote Utilization < PCT: stay Remote

Walk-Through Example (PCT = 2)
Cores A, B, and C access line X; core D's tile holds the shared L2 slice and directory entry for X. All cores start in private mode, and X is initially uncached.
1. Core A reads X. The directory ships A the cache line; A caches it (Shared) with private utilization 1, and the directory records A as a private sharer of a Clean line.
2. Core C reads X, caching it (Shared, utilization 1), then reads it again, raising its utilization to 2.
3. Core B writes X. The directory invalidates A and C. A's Inv-Reply reports utilization 1 (< PCT), so the directory demotes A to remote mode; C's Inv-Reply reports utilization 2 (>= PCT), so C stays private. B receives the line (Modified, private utilization 1).
4. Core A reads X again. A is now a remote sharer, so the request goes to the home directory, which fetches the dirty data from B (WB request / WB-Reply, downgrading B's copy to Shared and marking the L2 copy Dirty), returns only the requested word to A, and sets A's remote utilization to 1.
5. Core B writes X. B upgrades its copy back to Modified (Upgrade-Reply), raising its private utilization to 2; the intervening write resets A's remote utilization to 0.
6. Core A reads X twice more. Each read is a remote word access at the home core (the first triggers another writeback from B). A's remote utilization reaches 2 (>= PCT), so the directory promotes A back to private mode and ships it the entire cache line; A caches it (Shared, utilization 2).
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion

Complete Locality Classifier: High Directory Storage
Tag | State | ACKwise Pointers 1…p | (P/R_1, Remote Utilization_1) … (P/R_n, Remote Utilization_n)
• The Complete locality classifier tracks locality information for all cores
• Bit overhead per core: 192 KB (60%, with a 256 KB L2)

Limited Locality Classifier: Reduces Directory Storage
Tag | State | ACKwise Pointers 1…p | (Core ID_1, P/R_1, Remote Utilization_1) … (Core ID_k, P/R_k, Remote Utilization_k)
• Utilization and mode tracked for only k sharers
• Modes of the other sharers obtained by taking a majority vote

Limited-3 Locality Classifier
• Utilization and mode tracked for 3 sharers
• Achieves the performance and energy of the Complete locality classifier
  – Completion time: 3% lower; energy: 1.5% lower
  – Both are lower because remote-mode classification is learned faster with the Limited-3 classifier
• Bit overhead per core: 18 KB (5.7%) vs. 192 KB (60%) for Complete (256 KB L2)

Private <-> Remote Transitions Result in Private Cache Thrashing
• It is difficult to measure the private-cache locality of a line while it resides in the shared L2 cache
• A core reverts to private mode after #PCT accesses to the line at the shared L2 cache
• The refetched line evicts other lines in the private L1 cache
• The result is low spatio-temporal locality for all lines

Ideal Classifier: NO Private Cache Thrashing
• An ideal classifier maintains part of the working set in the private cache
• The other lines are placed in remote mode at the shared cache

Remote Access Threshold: Reduces Private Cache Thrashing
• If a core is classified as a remote sharer due to capacity, increase the cost of promotion to private mode
• If a core is classified as a private sharer, reset that cost back to its starting value
• Transitions: Initial starts Private; Private Utilization < PCT demotes to Remote; Private Utilization >= PCT stays Private; Remote Utilization >= RAT promotes to Private; Remote Utilization < RAT stays Remote
• The Remote Access Threshold (RAT) is varied based on PCT and application behavior [details in paper]
• Reduces private cache thrashing to a negligible level

Outline
• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion

Reducing Capacity Misses: Private L1 Cache Miss Rate vs. PCT (Blackscholes)
[Chart: miss-rate breakdown (cold, capacity, upgrade, sharing, word) for PCT = 1 to 8]
• The miss rate falls as PCT increases (better utilization)
• Multiple capacity misses (expensive) are replaced with single word accesses (cheap)
• The miss rate rises again at high PCT (one capacity miss turns into multiple word misses)

Energy vs. PCT (Blackscholes)
[Chart: normalized energy, broken down into network link, network router, directory, L2, L1-D, and L1-I, for PCT = 1 to 8]
• Fewer L1 cache misses (capacity and word) lead to less network traffic and fewer L2 accesses
• Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)

Completion Time vs. PCT (Blackscholes)
[Chart: normalized completion time, broken down into synchronization, L2-offchip, L2-sharers, L2-waiting, L1-to-L2, and compute]
• Lower L1 cache miss rate and miss penalty
• Less time spent waiting on L1 cache misses

Reducing Sharing Misses: Private L1 Cache Miss Rate vs. PCT (Streamcluster)
[Chart: miss-rate breakdown (cold, capacity, upgrade, sharing, word) for PCT = 1 to 8]
• Sharing misses (expensive) are turned into word misses (cheap) as PCT increases

Energy vs. PCT (Streamcluster)
[Chart: normalized energy breakdown for PCT = 1 to 8]
• Fewer invalidations and asynchronous writebacks, and less cache-line ping-ponging

Completion Time vs. PCT (Streamcluster)
[Chart: normalized completion-time breakdown for PCT = 1 to 8]
• Less time spent waiting for invalidations, and less time spent by loads waiting for previous stores
• Reducing critical-section time also reduces synchronization time

Variation with PCT: Results Summary
• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel-MI-Bench, and UHPC suites, plus 3 hand-written benchmarks
• A PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
• Evaluations use the Graphite simulator with 64 cores, McPAT/CACTI cache energy models, and DSENT network energy models at 11 nm

Conclusion
• Three potential advantages of the locality-aware adaptive cache coherence protocol:
  – Better private cache utilization
  – Reduced on-chip communication (invalidations, asynchronous write-backs, and cache-line transfers)
  – Reduced memory access latency and energy
• Efficient locality-tracking hardware
  – Decoupled from traditional coherence-tracking structures
  – The Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
• Simple to implement
  – NO additional networks for deadlock avoidance
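The classification rules above (per-core private/remote mode, utilization counters reported on eviction or invalidation, and RAT-gated promotion) can be summarized in a small software model. This is an illustrative sketch, not the authors' hardware: the class, method names, and the string return values are invented for clarity, and it models a single cache line's directory state only.

```python
class LineClassifier:
    """Sketch of the per-line, per-core locality classifier kept at the directory."""

    def __init__(self, num_cores, pct=2, rat=2):
        self.pct = pct                        # Private Caching Threshold
        self.rat = rat                        # Remote Access Threshold (>= PCT)
        self.mode = ["private"] * num_cores   # all cores start in private mode
        self.remote_util = [0] * num_cores    # accesses served at the shared L2

    def on_eviction_or_invalidation(self, core, private_util):
        # The private cache reports its utilization counter back to the
        # directory; low reuse (< PCT) demotes the core to remote mode.
        if private_util < self.pct:
            self.mode[core] = "remote"
            self.remote_util[core] = 0

    def on_remote_access(self, core):
        # Each word access served at the home L2 bumps the counter; enough
        # reuse (>= RAT) promotes the core back to private mode, and the
        # reply carries the whole line instead of a single word.
        self.remote_util[core] += 1
        if self.remote_util[core] >= self.rat:
            self.mode[core] = "private"
            return "cache-line"
        return "word"

    def on_intervening_write(self, writer):
        # A write by another core resets the other cores' remote-utilization
        # counters, as in the walk-through (B's upgrade resets A's count).
        for c in range(len(self.remote_util)):
            if c != writer:
                self.remote_util[c] = 0


# Replaying the deck's walk-through (PCT = 2; cores A=0, B=1, C=2):
cls = LineClassifier(num_cores=4)
cls.on_eviction_or_invalidation(0, private_util=1)  # A invalidated, util 1 < PCT
cls.on_eviction_or_invalidation(2, private_util=2)  # C invalidated, util 2 >= PCT
assert cls.mode[0] == "remote" and cls.mode[2] == "private"
cls.on_remote_access(0)              # A's first remote read: word reply
cls.on_intervening_write(1)          # B's upgrade resets A's counter
cls.on_remote_access(0)              # A reads again: counter back to 1
reply = cls.on_remote_access(0)      # counter reaches RAT: promoted
assert cls.mode[0] == "private" and reply == "cache-line"
```

With `rat` left equal to `pct` this reproduces the basic mode-transition diagram; the adaptive scheme in the deck instead raises RAT for capacity-driven remote sharers so that promotion (and the attendant L1 refill) becomes harder for lines that were thrashing.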