SAMPLING DEAD BLOCK PREDICTION FOR LAST-LEVEL CACHES Samira Khan Yingying Tian Daniel A. Jiménez Dead Blocks 2 The last-level cache (LLC) contains useless blocks A dead block will not be used again until replaced Dead blocks waste cache space and power Each pixel presents average live time of a block brighter is higher live time 400.perlbench, On average 86% of blocks are dead in a 2MB last-level cache! Origin of Dead Blocks 3 fill hit hit hit last hit eviction Least-recently-used replacement (LRU) live dead MRU LRU Cache set After last hit blocks remain in the cache for a long time Dead Block Predictors 4 Identify dead blocks Problems with current predictors: Consume significant state Update predictor at every access Depend on the LRU replacement policy Do not work well in last-level cache L1 and L2 filter the temporal locality Goal: A Dead block predictor that uses far less state and works well for LLC Sampling Dead Block Predictor 5 No need to update predictor at every cache access Predictor can learn from a few sampler sets of partial tags Sampling set for set 3 Sampling set for set 7 Predictor learns only from some sample sets Set 0 Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Contribution 6 Prediction using sampling Predictor learns from only a few sample sets Sampled sets do not need to reflect real sets Decoupled replacement policy Cache can have different replacement policy than sampler Skewed predictor Results Speedup of 5.9% for single-thread workloads Weighted speedup of 12.5% for multi-core workloads Sampling predictor consumes low power Outline 7 Introduction Background Sampling Predictor Methodology Results Conclusion Dead Block Predictors 8 Trace Based [Lai & Falsafi 2001] Time Based [Hu et al. 2002] Predicts the last touch based on PC sequence Predicts dead after certain number of cycles Counting Based [Kharbutli & Solihin 2008] Predicts dead after certain number of accesses Dead Block Optimizations 9 Dead Blocks can be used for optimization Prefetching [Lai & Falsafi 2001, Hu et al. 2002, Liu et al. 2008] Reducing cache leakage power [Abella et al. 2005] Dead block replacement and bypass [Kharbutli & Solihin 2008] On a cache miss: Replaced a dead block rather than the LRU block If the new block is predicted dead, do not place it We use dead block replacement and bypass with Sampling Dead Block Predictor Reference Trace predictor [ Lai & Falsafi 2001 ] 10 Predicts last touch based on sequence of instructions Encoding: truncated addition of instruction PCs Called signature Predictor table is indexed by hashed signature 2 bit saturating counters Reference Trace predictor [ Lai & Falsafi 2001 ] 11 Predictor Table Update live PC sequence PC i: ld a Miss in set 2, replace PC j: ld b Hit in set 4 PC k: st c Update evicted dead Set 0 Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Update live PC l: ld a Hit in set 7 Cache Predictor learns from every access to the cache Outline 12 Introduction Background Sampling Predictor Methodology Results Conclusion Sampling Dead Block Prediction 13 Cache behavior is more-or-less uniform across all sets [Moin et al. 2007] Keep a few sampler sets of partial tags Update the predictor when a sampler set is accessed Update the predictor Only when these sets Are accessed Sampling set for set 3 Sampling set for set 7 Set 0 Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Predictor learns only from accesses to the sampler sets Sampling Dead Block Prediction 14 Predictor table PC i: ld a No update Hit in set 4 Set 0 Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 PC j: ld b PC k: st c PC l: ld a Update live Miss in set 2, replace Hit in set 7 Hit in set 4 Sampling set for set 3 Sampling set for set 7 cache Sampler sets Only 32 sampler sets in 2MB LLC Sampler can have reduced associativity 15 Sampler can have associativity different from cache Blocks closer to LRU are dead most of the time Reduced associativity evicts them early Accelerates discovery of dead blocks Sampling set for set 3 Sampling set for set 7 In a 16 way 2MB set associative last level cache the sampler can be 12 way set associative Set 0 Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Sampler Decouples Replacement Policy 16 Predictor learns from the LRU policy in sampler Cache can deploy a cheap replacement policy e.g. random replacement LRU Replacement R Random Replacement Sampling set for set 3 Sampling set for set 7 Can save state and power needed for LRU policy Set 0 Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Skewed Predictor 17 Index1 = hash1(pc) Index2 = hash2(pc) Index3 = hash3(pc) Index = hash(signature) conf3 conf1 confidence conf2 dead if confidence >= threshold dead if conf1+conf2+conf3 >= threshold Reference trace predictor table Skewed predictor table Reduces conflict in the predictor table Outline 18 Introduction Background Sampling Predictor Methodology Results Conclusion Methodology 19 CMP$im cycle accurate simulator [Jaleel et al. 2008] 2MB /core 16-way set-associative shared LLC 32KB I+D L1, 256KB L2 200-cycle DRAM access time 19 memory-intensive SPEC CPU 2006 benchmarks 10 mixes of SPEC CPU 2006 for 4-cores Power numbers are from CACTI 5.3 [Shyamkumar et al 2008] Fewer Dead Blocks from Sampling Based Dead Block Replacement and Bypass 20 400.perlbench Each pixel = average live time of a cache block brighter means higher live time Space Overhead 21 120 Storage Overhead in KB sampler 100 predictor storage 80 perline metadata 60 40 20 0 Ref Trace DBP Couting based DBP Sampling DBP Sampling Dead Block Prediction uses 3KB predictor table, one bit per cache line and 6.75KB sampler tag array Power Usage 22 0.35 Total Power in Watt 0.3 dynamic leakage 0.25 0.2 0.15 0.1 0.05 0 Ref Trace DBP Counting based DBP Sampling DBP sampling prediction consumes 3.1% of the total dynamic power and 1.2% of the total leakage power of LLC Component Contribution to Speedup 23 Predictor table Speedup ove r Baseline LRU 6 Sampler sets 5 4 3 2 1 Cache sets 0 Achieves 5.9% average geometric mean speedup for single threaded workloads Speedup for single-thread workloads Geometric mean Speedup 24 1.06 1.05 1.04 1.03 1.02 1.01 1 0.99 0.98 On average sampler provides 5.9% speedup over LRU Sampler with default random policy yields 3.5% speedup Normalized weighted speedup for multi-core workloads Geometric mean speedup 25 1.14 1.12 1.1 1.08 1.06 1.04 1.02 1 0.98 On average sampler provides 12.5% benefit over LRU Sampler with default random policy yields 7% benefit Conclusion 26 Sampling Consumes less power Reduces storage overhead Decouples replacement policy Dead block replacement and bypass with sampling achieves 5.9% geometric mean speedup of for single threaded workloads 12.5% for multi threaded workloads 27 Thank you