The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian (MIT), Omer Khan (University of Connecticut, Storrs), Srini Devadas (MIT)

Cache Hierarchy Organization: Directory-Based Coherence
• Private caches: 1 or 2 levels
• Shared cache: last level, with an integrated directory
• Concurrent reads lead to replication in private caches
• The directory maintains coherence for the replicated lines
[Diagram: a write miss (1) travels to the shared cache + directory (2), which invalidates the sharers (3, 4)]

Private Caching: Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient, low-latency local access to private and shared data (cache-line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality
  – Working set > private cache size
    ☹ Inefficient cache utilization (cache thrashing)
    ☹ Unnecessary fetch of entire cache lines
    ☹ Shared-data replication increases the working set
  – Shared data with frequent writes
    ☹ Wasteful invalidations, synchronous writebacks, cache-line ping-ponging
    ☹ Increased on-chip communication and time spent waiting for expensive events

On-Chip Communication Problem
• Bill Dally, Stanford: "Wires relative to gates are getting worse every generation"
• Shekhar Borkar, Intel: "Bit movement is much more expensive than computation"
• Must architect efficient coherence protocols

Locality of Benchmarks: Reuse before Evictions
• Utilization: number of private L1 cache accesses before a cache line is evicted
• 40% of evicted lines have a utilization < 4
[Histogram: distribution of utilization at eviction]

Locality of Benchmarks: Reuse before Invalidations
• Utilization: number of private L1 cache accesses before a cache line is invalidated (by an intervening write)
[Histogram: distribution of utilization at invalidation]

Remote-Word Access (RA)
A NUCA-based protocol [Fensch et al., HPCA'08] [Hoffmann et al., HiPEAC'10]
• Assign each memory address to a unique "home" core
  – Cache line present only in the shared
cache at the "home" core (single location)
• For an access to a non-locally-cached word, ask the "remote" shared cache at the "home" core to perform the read/write
[Diagram: a write-word request (1) travels to the home core, which performs the access and replies (2)]

Remote-Word Access: Advantages & Drawbacks
☺ Energy-efficient for low-locality data
  – A word access (~200 bits) is cheaper than a cache-line fetch (~640 bits)
☺ NO data replication
  – Efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for every remote word access
☹ Expensive for high-locality data
☹ Data placement dictates the distance and frequency of remote accesses

Locality-Aware Cache Coherence
• Combine the advantages of private caching and remote access
• Privately cache high-locality lines
  – Optimize hit latency and energy
• Remotely cache low-locality lines
  – Prevent data replication and costly data movement
• Private Caching Threshold (PCT)
  – Utilization >= PCT: mark as private
  – Utilization < PCT: mark as remote

Locality-Aware Cache Coherence: PCT = 4
[Chart: breakdown of invalidations by utilization (1, 2-3, 4-5, 6-7, >=8); lines with utilization >= 4 are classified private, the rest remote]

Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion

Baseline System
[Diagram: tiled multicore; each tile has a compute pipeline, L1 I/D caches, an L2 slice with directory, and a router]
• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas et al., ISCA'09]
• ACKwise limited-directory protocol [Kurian et al., PACT'10]

Locality-Aware Coherence: Important Features
• Intelligent allocation of cache lines
  – In the private L1 cache
  – Allocation decision made per core, at cache-line granularity
• Efficient locality-tracking hardware
  – Decoupled from traditional coherence-tracking structures
• Low protocol complexity
  – NO additional
networks for deadlock avoidance

Implementation Details: Private Cache Line
Tag | State | LRU | Private Utilization
• Private Utilization bits track cache-line usage in the L1 cache
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%

Implementation Details: Directory Entry
Tag | State | ACKwise Pointers 1…p | (P/R_1, Remote Utilization_1) … (P/R_n, Remote Utilization_n)
• P/R_i: private/remote mode of core i
• Remote Utilization_i: line usage by core i at the shared L2 cache
• Complete locality classifier: tracks mode and remote utilization for ALL cores
• Storage overhead reduced later

Mode Transitions: Summary
• Classification based on previous behavior
• Initial: start in Private mode
• Private, Private Utilization >= PCT: stay Private
• Private, Private Utilization < PCT: demote to Remote
• Remote, Remote Utilization >= PCT: promote to Private
• Remote, Remote Utilization < PCT: stay Remote

Walk-Through Example (PCT = 2)
Cores A, B, and C access line X; core D's tile holds the shared L2 slice and directory entry for X. All cores start in private mode, and X is initially uncached.
1. Core A reads X. The directory ships A the cache line; A caches it (Shared) with private utilization 1, and the directory records A as a private sharer of a Clean line.
2. Core C reads X, caching it (Shared, utilization 1), then reads it again, raising its utilization to 2.
3. Core B writes X. The directory invalidates A and C. A's Inv-Reply reports utilization 1 (< PCT), so the directory demotes A to remote mode; C's Inv-Reply reports utilization 2 (>= PCT), so C stays private. B receives the line (Modified, private utilization 1).
4. Core A reads X again. A is now a remote sharer, so the request goes to the home directory, which fetches the dirty data from B (WB request / WB-Reply, downgrading B's copy to Shared and marking the L2 copy Dirty), returns only the requested word to A, and sets A's remote utilization to 1.
5. Core B writes X. B upgrades its copy back to Modified (Upgrade-Reply), raising its private utilization to 2; the intervening write resets A's remote utilization to 0.
6. Core A reads X twice more. Each read is a remote word access at the home core (the first triggers another writeback from B). A's remote utilization reaches 2 (>= PCT), so the directory promotes A back to private mode and ships it the entire cache line; A caches it (Shared, utilization 2).
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion

Complete Locality Classifier: High Directory Storage
Tag | State | ACKwise Pointers 1…p | (P/R_1, Remote Utilization_1) … (P/R_n, Remote Utilization_n)
• The Complete locality classifier tracks locality information for all cores
• Bit overhead per core: 192 KB (60%, with a 256 KB L2)

Limited Locality Classifier: Reduces Directory Storage
Tag | State | ACKwise Pointers 1…p | (Core ID_1, P/R_1, Remote Utilization_1) … (Core ID_k, P/R_k, Remote Utilization_k)
• Utilization and mode tracked for only k sharers
• Modes of the other sharers obtained by taking a majority vote

Limited-3 Locality Classifier
• Utilization and mode tracked for 3 sharers
• Achieves the performance and energy of the Complete locality classifier
  – Completion time: 3% lower; energy: 1.5% lower
  – Both are lower because remote-mode classification is learned faster with the Limited-3 classifier
• Bit overhead per core: 18 KB (5.7%) vs. 192 KB (60%) for Complete (256 KB L2)

Private <-> Remote Transitions Result in Private Cache Thrashing
• It is difficult to measure the private-cache locality of a line while it resides in the shared L2 cache
• A core reverts to private mode after #PCT accesses to the line at the shared L2 cache
• The refetched line evicts other lines in the private L1 cache
• The result is low spatio-temporal locality for all lines

Ideal Classifier: NO Private Cache Thrashing
• An ideal classifier maintains part of the working set in the private cache
• The other lines are placed in remote mode at the shared cache

Remote Access Threshold: Reduces Private Cache Thrashing
• If a core is classified as a remote sharer due to capacity, increase the cost of promotion to private mode
• If a core is classified as a private sharer, reset that cost back to its starting value
• Transitions: Initial starts Private; Private Utilization < PCT demotes to Remote; Private Utilization >= PCT stays Private; Remote Utilization >= RAT promotes to Private; Remote Utilization < RAT stays Remote
• The Remote Access Threshold (RAT) is varied based on PCT and application behavior [details in paper]
• Reduces private cache thrashing to a negligible level

Outline
• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion

Reducing Capacity Misses: Private L1 Cache Miss Rate vs. PCT (Blackscholes)
[Chart: miss-rate breakdown (cold, capacity, upgrade, sharing, word) for PCT = 1 to 8]
• The miss rate falls as PCT increases (better utilization)
• Multiple capacity misses (expensive) are replaced with single word accesses (cheap)
• The miss rate rises again at high PCT (one capacity miss turns into multiple word misses)

Energy vs. PCT (Blackscholes)
[Chart: normalized energy, broken down into network link, network router, directory, L2, L1-D, and L1-I, for PCT = 1 to 8]
• Fewer L1 cache misses (capacity and word) lead to less network traffic and fewer L2 accesses
• Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)

Completion Time vs. PCT (Blackscholes)
[Chart: normalized completion time, broken down into synchronization, L2-offchip, L2-sharers, L2-waiting, L1-to-L2, and compute]
• Lower L1 cache miss rate and miss penalty
• Less time spent waiting on L1 cache misses

Reducing Sharing Misses: Private L1 Cache Miss Rate vs. PCT (Streamcluster)
[Chart: miss-rate breakdown (cold, capacity, upgrade, sharing, word) for PCT = 1 to 8]
• Sharing misses (expensive) are turned into word misses (cheap) as PCT increases

Energy vs. PCT (Streamcluster)
[Chart: normalized energy breakdown for PCT = 1 to 8]
• Fewer invalidations and asynchronous writebacks, and less cache-line ping-ponging

Completion Time vs. PCT (Streamcluster)
[Chart: normalized completion-time breakdown for PCT = 1 to 8]
• Less time spent waiting for invalidations, and less time spent by loads waiting for previous stores
• Reducing critical-section time also reduces synchronization time

Variation with PCT: Results Summary
• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel-MI-Bench, and UHPC suites, plus 3 hand-written benchmarks
• A PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
• Evaluations use the Graphite simulator with 64 cores, McPAT/CACTI cache energy models, and DSENT network energy models at 11 nm

Conclusion
• Three potential advantages of the locality-aware adaptive cache coherence protocol:
  – Better private cache utilization
  – Reduced on-chip communication (invalidations, asynchronous write-backs, and cache-line transfers)
  – Reduced memory access latency and energy
• Efficient locality-tracking hardware
  – Decoupled from traditional coherence-tracking structures
  – The Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
• Simple to implement
  – NO additional networks for deadlock avoidance
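The classification rules above (per-core private/remote mode, utilization counters reported on eviction or invalidation, and RAT-gated promotion) can be summarized in a small software model. This is an illustrative sketch, not the authors' hardware: the class, method names, and the string return values are invented for clarity, and it models a single cache line's directory state only.

```python
class LineClassifier:
    """Sketch of the per-line, per-core locality classifier kept at the directory."""

    def __init__(self, num_cores, pct=2, rat=2):
        self.pct = pct                        # Private Caching Threshold
        self.rat = rat                        # Remote Access Threshold (>= PCT)
        self.mode = ["private"] * num_cores   # all cores start in private mode
        self.remote_util = [0] * num_cores    # accesses served at the shared L2

    def on_eviction_or_invalidation(self, core, private_util):
        # The private cache reports its utilization counter back to the
        # directory; low reuse (< PCT) demotes the core to remote mode.
        if private_util < self.pct:
            self.mode[core] = "remote"
            self.remote_util[core] = 0

    def on_remote_access(self, core):
        # Each word access served at the home L2 bumps the counter; enough
        # reuse (>= RAT) promotes the core back to private mode, and the
        # reply carries the whole line instead of a single word.
        self.remote_util[core] += 1
        if self.remote_util[core] >= self.rat:
            self.mode[core] = "private"
            return "cache-line"
        return "word"

    def on_intervening_write(self, writer):
        # A write by another core resets the other cores' remote-utilization
        # counters, as in the walk-through (B's upgrade resets A's count).
        for c in range(len(self.remote_util)):
            if c != writer:
                self.remote_util[c] = 0


# Replaying the deck's walk-through (PCT = 2; cores A=0, B=1, C=2):
cls = LineClassifier(num_cores=4)
cls.on_eviction_or_invalidation(0, private_util=1)  # A invalidated, util 1 < PCT
cls.on_eviction_or_invalidation(2, private_util=2)  # C invalidated, util 2 >= PCT
assert cls.mode[0] == "remote" and cls.mode[2] == "private"
cls.on_remote_access(0)              # A's first remote read: word reply
cls.on_intervening_write(1)          # B's upgrade resets A's counter
cls.on_remote_access(0)              # A reads again: counter back to 1
reply = cls.on_remote_access(0)      # counter reaches RAT: promoted
assert cls.mode[0] == "private" and reply == "cache-line"
```

With `rat` left equal to `pct` this reproduces the basic mode-transition diagram; the adaptive scheme in the deck instead raises RAT for capacity-driven remote sharers so that promotion (and the attendant L1 refill) becomes harder for lines that were thrashing.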