18th International Symposium on High Performance Computer Architecture, 2012 Decoupled Dynamic Cache Segmentation Samira M. Khan, Zhe Wang and Daniel A. Jiménez The University of Texas at San Antonio Introduction • Least Recently Used (LRU) replacement policy – Does not work well in last-level cache(LLC) [Lai et al. ISCA’00, Qureshi et al. ISCA’07, Chaudhuri et al. MICRO’09, Jaleel et al. ISCA’10] – Temporal locality is filtered by L1 and L2 accesses – Does not address workload behavior • Working set is larger than the cache size (thrashing) • Bursts of non-temporal references (scanning) Need a replacement policy in LLC that adapts to workload behavior 2 Introduction • Non-referenced and referenced blocks exhibit different behavior in LLC [Karedla et al. 94, Megiddo et al. FAST’03, Qureshi et al. ISCA’07, Jaleel et al. ISCA’10] Non-referenced blocks, will not be referenced before eviction Non-referenced blocks, will be referenced before eviction Referenced lines 400.perlbench Each pixel represents a block in a 1MB LLC We propose cache segmentation technique that dynamically adjusts the number of non-referenced and referenced blocks and adapts to workload behavior 3 Contribution • We propose Dynamic Cache Segmentation(DCS) – Each set divided into two segments • Non-referenced and referenced blocks in each segment – Predicts the optimal number of blocks in each segment • Sampling-based scalable low-overhead predictor – Decoupled from the replacement policy • Works with random policy Improves performance of an 8MB LLC four-core system on average by 12%, with a random replacement policy requiring only half the space of the LRU policy 4 Outline • • • • • • • Introduction Motivation Dynamic Cache Segmentation Optimal Segment Predictor Methodology Results Conclusion 5 Motivation 400.perlbench 434.zeusmp 435.gromacs Each pixel represents a block in a 1MB LLC Non-referenced blocks, will not be referenced before eviction Non-referenced blocks, will be referenced before eviction Referenced blocks Non-referenced and referenced blocks are different in different workloads 6 Motivation Workloads have different sensitivity to 3.1segmentation 2.3 1.6 2.2 3 IPC IPC IPC 1.4 2.1 1.2 2.9 Static LRU 2 Static 3 5 7 9 11 13 15 Static Segmentation Size LRU 2.8 1 1 Static LRU 1 3 5 7 9 11 13 15 Static Segmentation Size 1 3 5 7 9 11 13 15 Static Segmentation Size 400.perlbench 434.zeusmp 435.gromacs Sensitive Insensitive LRU Friendly Need a dynamic segmentation policy to address workload behavior 7 Dynamic Cache Segmentation Optimal Segment Size Predictor predicted segment size <= non-referenced list size yes Evict from nonreferenced list Cache sets non-referenced list no Evict from referenced list referenced list Only 1 bit is needed to differentiate between non-referenced and referenced blocks 8 Dynamic Cache Segmentation Optimal Segment Size Predictor Access x, miss Access d, hit Access d, hit Access c, hit Access y, miss Non-referenced blocks: 5 Greater than optimal size Evict from non-referenced list ax b c d e Non-referenced list f g h i Optimal Segment size 4 j k l m n o p Referenced list 9 Dynamic Cache Segmentation Optimal Segment Size Predictor Access x, miss Access d, hit Access d, hit Access c, hit Access y, miss d is now a referenced block Nothing happens, d is already in referenced list x a b c d Non-referenced list f g h i Optimal Segment size 4 j k l m n o p Referenced list 9 Dynamic Cache Segmentation Optimal Segment Size Predictor Access x, miss Access d, hit Access d, hit Access c, hit Access y, miss c is now a referenced block x a b c Non-referenced list d f g h i Optimal Segment size 4 j k l m n o p Referenced list 9 Dynamic Cache Segmentation Optimal Segment Size Predictor Access x, miss Access d, hit Access d, hit Access c, hit Access y, miss non-referenced blocks: 3 Less than optimal size Evict from referenced list xy a b Non-referenced list c d f g h i Optimal Segment size 4 j k l m n o p Referenced list 9 Decoupled DCS • Our policy only decides which list to use in eviction • Each list is free to use any replacement policy • Decouples replacement policy from segmentation • We use 1 bit Not-Recently-Used(NRU) and random a b c d d f g h i Non-referenced list j k l m n o p Referenced list Use random/NRU policy within the list 10 Optimal Segment Predictor • Sampling based set-dueling [Qureshi et al. ISCA’07] – Samples a small number of sets – A counter keeps track of misses in specific type of set – Policy that has less miss is the winner • Sampler sets have specific segment size – assoc assoc assoc , , assoc , assoc Size 1, 4 2 4 • Sets in cache follow the dynamic segment size – Decided by the predictor • Decision tree analysis minimizes space overhead Our predictor uses only 16 sets for each type of sets 11 Optimal Segment Predictor Set-dueling between segment size 16 and 8 Segment 8 winner Segment 16 winner Set-dueling between segment size 1 and 8 Segment 1 winner Optimal segment size 1 Set-dueling between segment size 12 and 16 Segment 8 winner Set-dueling between segment size 4 and 8 Segment 4 winner Optimal segment size 4 Segment 8 winner Segment 12 winner Optimal segment size 12 Segment 16 winner Optimal segment size 16 Optimal segment size 8 Winner of each level decides the segment size of the next level 12 Optimal Segment Predictor Set-dueling between segment size 16 and 8 Segment 8 winner Set-dueling between segment size 1 and 8 Segment 1 winner Optimal segment size 1 Set-dueling between segment size 12 and 16 Segment 8 winner Set-dueling between size 1 and ignore segment Segment 1 winner Segment 16 winner Set-dueling between segment size 4 and 8 Ignore segment winner Ignore segment Segment 8 winner Segment 4 winner Set-dueling between size 4 and ignore segment Segment 4 winner Optimal segment size 4 Set-dueling between size 8 and ignore segment Ignore Segment 8 segment winner winner Ignore segment Segment 16 winner Segment 12 winner Optimal segment size 8 Ignore segment winner Set-dueling between size 12 and ignore segment Segment 12 winner Optimal segment size 12 Set-dueling between size 16 and ignore segment Ignore Segment 16 segment winner winner Ignore segment Optimal segment size 16 Ignore segment winner Ignore segment Ignore segment We also incorporate ignoring segmentation for LRU-friendly workloads 13 DCS in Shared Cache Partitioning • Complements shared cache partitioning techniques [Qureshi et al. ISCA’06, Suh et al. HPCA’02] • Utility-based Cache Partitioning(UCP)[Qureshi et al. ISCA’06] – Partitions cache ways to each core based on utility • DCS segments each partition – Into non-referenced and referenced blocks core 0 core 1 core 2 core 3 UCP decides the way partitions among the cores 14 DCS in Shared Cache Partitioning • Complements shared cache partitioning techniques [Qureshi et al. ISCA’06, Suh et al. HPCA’02] • Utility-based Cache Partitioning(UCP)[Qureshi et al. ISCA’06] – Partitions cache ways to each core based on utility • DCS segments each partition – Into non-referenced and referenced blocks core 0 core 1 core 2 core 3 DCS decides the segmentation in each partition 14 Methodology • CMP$im cycle accurate simulator [Jaleel et al. 2008] – 2MB per core 16-way set-associative LLC – 32KB I+D L1, 256KB L2 – 200-cycle DRAM access time • 19 memory-intensive SPEC CPU 2006 benchmarks • 10 mixes of SPEC CPU 2006 for 4-cores 15 433.milc 434.zeusmp 436.cactusADM 462.libquantum 470.lbm 435.gromacs 450.soplex 459.GemsFDTD 471.omnetpp 473.astar 400.perlbench 401.bzip2 403.gcc 429.mcf 437.leslie3d 456.hmmer 481.wrf 482.sphinx3 483.xalancbmk Percentage of Total Cycle Segment Size in SPEC CPU 2006 100 80 60 40 20 0 ignore size 16 size 12 size 8 size 4 size 1 Insensitive LRU Friendly Sensitive Segment size chosen by DCS is different depending on the workload 16 Speedup Compared to LRU in Singlethreaded Workloads 1.14 DCS with NRU DCS with Random DIP[Qureshi et al. ISCA'07] RRIP[Jaleel et al. ISCA'10] Speedup 1.12 1.1 1.08 1.06 1.04 1.02 1 Insensitive LRU Friendly mean 483.xalancbmk 482.sphinx3 481.wrf 456.hmmer 437.leslie3d 429.mcf 403.gcc 401.bzip2 400.perlbench 473.astar 471.omnetpp 459.GemsFDTD 450.soplex 435.gromacs 470.lbm 462.libquantum 436.cactusADM 434.zeusmp 433.milc 0.98 Sensitive Outperforms LRU replacement on average by 5.2% with NRU replacement 17 Speedup Compared to LRU in Multiprogrammed Workloads 1.3 1.25 Speedup 1.2 DCS with NRU UCP with LRU[Qureshi et al. MICRO'06] DCS + UCP with Random DCS + UCP with NRU 1.15 1.1 1.05 1 0.95 mix1 ssif mix2 sfii mix3 isis mix4 iifi mix5 siss mix6 siis mix7 sisi mix8 sssi mix9 mix10 Mean isss ifss Random performs similar to NRU with DCS + UCP 18 Space Overhead in 8MB LLC Multi-core 80 DCS Overhead(Predictor) DCS Overhead(Reference bit) UCP Overhead Replacement State Overhead in KB 70 60 50 40 30 20 10 0 LRU UCP (LRU) DCS with UCP (NRU) DCS with UCP (Rand) Space overhead is less than half of LRU in random This is 0.35% of the 8MB LLC capacity 19 Conclusion • Dynamic Segmentation – Treats non-referenced and referenced blocks differently – Addresses different workload behavior – Decoupled from replacement policy • Results – On average 12% improvement with random policy – Requires half of the space of LRU 20 Thank you