IMPROVING CACHE MANAGEMENT POLICIES USING DYNAMIC REUSE DISTANCES
Nam Duong(1), Dali Zhao(1), Taesu Kim(1), Rosario Cammarota(1), Mateo Valero(2), Alexander V. Veidenbaum(1)
(1) University of California, Irvine
(2) Universitat Politecnica de Catalunya and Barcelona Supercomputing Center

CACHE MANAGEMENT
Cache management has been a hot research topic
[Diagram: a taxonomy of cache management policies]
- Single-core
  - Replacement: LRU, NRU, EELRU, DIP, RRIP, ..., PDP
  - Bypass: SDP, ..., PDP
- Shared cache
  - Partitioning: UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, ..., PDP
- Prefetch: PDP

OVERVIEW
- Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution
- Introduced a new concept, the Protecting Distance (PD), which is shown to achieve such a balance
- Developed single- and multi-core hit rate models as a function of the PD, the cache configuration, and program behavior
  - The models are used to dynamically compute the best PD
- Showed that PD-based cache management policies improve performance for both single- and multi-core systems

OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

DEFINITIONS
- The (line) reuse distance (RD): the number of accesses to the same cache set between two accesses to the same line
  - This metric is directly related to hit rate
- The reuse distance distribution (RDD): a distribution of the observed reuse distances
  - A program signature for a given cache configuration
[Figure: RDDs of representative benchmarks 403.gcc, 436.cactusADM, and 464.h264ref; x-axis: reuse distance (< 256)]

FUTURE BEHAVIOR PREDICTION
- Cache management policies use past reference behavior to predict future accesses
  - Prediction accuracy is critical
- Prediction in some of the prior policies:
  - LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity)
  - Early Eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict
  - RRIP: predicts whether a line will be
reused in the near, intermediate, or distant future

BALANCING REUSE AND CACHE POLLUTION
- Key to good performance (a high hit rate):
  - Cache lines must be reused as much as possible before eviction, AND
  - they must be evicted soon after the "last" reuse to give space to new lines
- The former can be achieved by using the reuse distance and actively preventing eviction
  - "Protecting" a line from eviction
- The latter can be achieved by evicting a line that is not reused within this distance
- There is an optimal reuse distance balancing the two
  - It is called the Protecting Distance (PD)

EXAMPLE: 436.CACTUSADM
- A majority of lines are reused within 64 accesses
  - There are multiple peaks at different reuse distances
- Reuse is maximized if lines are kept in the cache for 64 accesses
  - Lines may not be reused if evicted before that
  - Lines kept beyond that are likely to pollute the cache
- Assume that no lines are kept longer than a given RD
[Figure: RDD of 436.cactusADM (x-axis: reuse distance, 0-256) and the reduction in miss rate over LRU (0-60%) as a function of that RD]

THE PROTECTING DISTANCE (PD)
- A distance at which a majority of lines are covered
  - A single value for all sets
  - Predicted based on the current RDD
- Questions to answer:
  - Why does using the PD achieve the balance?
  - How to dynamically find the PD for an application and a cache configuration?
  - How to build the PD-based management policies?

OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

THE SINGLE-CORE PDP
- A cache tag contains the line's remaining PD (RPD)
  - A line can be evicted when its RPD = 0
- The RPD of an inserted or promoted line is set to the predicted PD
- The RPDs of the other lines in the set are decremented on each access
- Example: a 4-way cache, predicted PD = 7; a line is promoted on a hit
  - RPDs before the hit: [1 4 6 3]; after a hit on the second line: [0 6 5 2] (the hit line is reset to the PD, and the access decrements every RPD)

THE SINGLE-CORE PDP (CONT.)
- Selecting a victim on a miss
  - A line with an RPD = 0 can be replaced
    - Example: [0 4 6 3] → the new line replaces the RPD-0 line → [6 3 5 2]
  - Two cases when all RPDs > 0 (no unprotected lines):
    - Caches without bypass (inclusive): unused lines are less likely to be reused than reused lines
      - Replace an unused line with the highest RPD first
      - If there is no unused line, replace the line with the highest RPD: [1 4 6 3] → [0 3 6 2]
    - Caches with bypass (non-inclusive): bypass the new line
      - Example: [1 4 6 3] → [0 3 5 2] (all RPDs are decremented; the new line is not inserted)

EVALUATION OF THE STATIC PDP
- Static PDP: use the best static PD (< 256) for each benchmark
  - SPDP-NB: static PDP with replacement only
  - SPDP-B: static PDP with replacement and bypass
[Figure: miss reduction over DRRIP (-5% to 30%) for SPDP-NB and SPDP-B]
- Performance: in general, DRRIP < SPDP-NB < SPDP-B
  - 436.cactusADM: an additional 10% miss reduction
  - The two static PDP policies have similar performance
  - 483.xalancbmk: three different execution windows behave differently under SPDP-B

436.CACTUSADM: EXPLAINING THE PERFORMANCE DIFFERENCE
- How do the evicted lines occupy the cache?
  - Accesses and cache occupancy are broken down into: hit, bypass, evicted before 16 accesses (early), and evicted after 16 accesses (late)
[Figure: access and occupancy breakdowns for DRRIP, SPDP-NB, and SPDP-B]
- PDP suffers less pollution caused by long-RD lines than DRRIP
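The insertion, promotion, and victim-selection rules of the single-core PDP described above can be sketched in Python. This is a minimal illustrative model, not the hardware: the `PDPSet` class, its method names, and the per-line `[tag, rpd, reused]` layout are assumptions; the RPD update behavior (reset to the PD, then every access to the set decrements all RPDs) follows the slides' worked examples.

```python
# Minimal sketch of single-core PDP set management (replacement + bypass).
# Illustrative only: the real policy stores a 2- or 3-bit RPD in the tag.
class PDPSet:
    def __init__(self, ways, pd):
        self.pd = pd                     # predicted Protecting Distance
        self.lines = [None] * ways       # each entry: [tag, rpd, reused]

    def _decrement_all(self):
        # Every access to the set ages all resident lines.
        for line in self.lines:
            if line is not None and line[1] > 0:
                line[1] -= 1

    def access(self, tag, bypass=True):
        """Simulate one access; returns 'hit', 'insert', or 'bypass'."""
        for line in self.lines:
            if line is not None and line[0] == tag:
                line[1] = self.pd        # promote: reset RPD to the PD
                line[2] = True           # mark the line as reused
                self._decrement_all()
                return 'hit'
        # Miss: prefer an empty way, then an unprotected line (RPD == 0).
        for i, line in enumerate(self.lines):
            if line is None or line[1] == 0:
                self.lines[i] = [tag, self.pd, False]
                self._decrement_all()
                return 'insert'
        if bypass:                       # non-inclusive cache: bypass
            self._decrement_all()
            return 'bypass'
        # Inclusive cache: evict an unused line with the highest RPD first,
        # otherwise the line with the highest RPD.
        unused = [i for i, l in enumerate(self.lines) if not l[2]]
        victim = max(unused or range(len(self.lines)),
                     key=lambda i: self.lines[i][1])
        self.lines[victim] = [tag, self.pd, False]
        self._decrement_all()
        return 'insert'
```

On the slides' 4-way example with PD = 7, a hit on the second line of a set with RPDs [1 4 6 3] produces [0 6 5 2], and a bypassed miss produces [0 3 5 2], matching the examples above.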
- DRRIP:
  - Early evicted lines: 75% of accesses, but occupy only 4% of the cache
  - Late evicted lines: 2% of accesses, but occupy 8% of the cache → pollution
- SPDP-NB: early and late evicted lines: 42% of accesses, but occupy only 4%
- SPDP-B: late evicted lines: 1% of accesses, occupy 3% of the cache → yielding cache space to useful lines

CASE STUDY: 483.XALANCBMK
[Figure: RDDs and SPDP-B hit rates (0-80%) of three execution windows, 483.xalancbmk.1/.2/.3; x-axis: reuse distance, 0-128]
- There is a close relationship between the hit rate, the PD, and the RDD
- The best PD differs across windows, and across programs
- Need a dynamic policy that finds the best PD
- Need a model to drive the search

A HIT RATE MODEL FOR A NON-INCLUSIVE CACHE
- The model estimates the hit rate as a function of dp and the RDD:

  E(dp) = (Hits / Accesses) * (1 / W)
        = sum(i=1..dp) Ni / [ sum(i=1..dp) Ni*i + (Nt - sum(i=1..dp) Ni) * (dp + de) ]

- {Ni}, Nt: the RDD (Ni accesses at reuse distance i; Nt the total)
- dp: the protecting distance
- de: experimentally set to W (W: cache associativity)
- The model is used to find the PD that maximizes the hit rate
[Figure: RDD, E, and measured hit rate for 403.gcc, 436.cactusADM, and 464.h264ref; x-axis: reuse distance, 0-256]

PDP CACHE ORGANIZATION
[Diagram: higher-level cache → LLC → main memory, with an RD Sampler feeding an RD Counter Array and the PD Compute Logic]
- The RD Sampler tracks accesses to several cache sets
  - Sits in the L2 miss/WB stream; the sampling rate can be reduced
  - Measures the reuse distance of each new access
- The RD Counter Array collects the number of accesses at each RD = i, and the total Nt
  - To reduce overhead, each counter covers a range of RDs
- The PD Compute Logic finds the PD that maximizes E
  - The computed PD is used in the next interval (0.5M LLC accesses)
- Reasonable hardware overhead
  - 2 or 3 bits per tag to store the RPD
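The hit rate model above can be sketched in Python. The equation is reconstructed from the slide: E(dp) counts the sampled accesses that would hit within the protecting distance dp and divides by their total set occupancy, where a protected but unreused line costs dp + de accesses of occupancy. Function names and the brute-force maximization are illustrative; the hardware uses coarse-grained RD counters and dedicated compute logic instead.

```python
# Sketch of the single-core hit-rate model E(dp) and the PD search.
# Ni[i] = number of sampled accesses with reuse distance i (the RDD),
# Nt = total sampled accesses, W = associativity, de ~ W (per the slide).
def hit_rate_model(Ni, Nt, dp, W, de=None):
    if de is None:
        de = W
    protected = sum(Ni[1:dp + 1])        # accesses that hit within dp
    # Occupancy cost: a hit at distance i holds its line for i accesses;
    # every other line is protected for dp accesses plus ~de to be evicted.
    cost = sum(i * Ni[i] for i in range(1, dp + 1)) \
           + (Nt - protected) * (dp + de)
    return protected / cost if cost else 0.0   # E = (Hits/Accesses) * (1/W)

def best_pd(Ni, Nt, W, max_pd=256):
    # Exhaustive scan over candidate dp values (illustrative).
    return max(range(1, max_pd), key=lambda dp: hit_rate_model(Ni, Nt, dp, W))
```

For an RDD with a single peak, the search settles at the peak: protecting lines just long enough to capture the reuse, and no longer, maximizes E.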
PDP VS. EXISTING POLICIES

Policy    | Replacement(*) | Bypass(*) | Reuse | Pollution | Distance measurement | Balance model
LRU       | Yes            | No        | No    | Yes       | Stack-based          | No
EELRU [1] | Yes            | No        | No    | Yes       | Stack-based          | Probabilistic
DIP [2]   | Yes            | No        | Yes   | No        | N/A                  | No
RRIP [3]  | Yes            | No        | Yes   | No        | N/A                  | No
SDP [4]   | No             | Yes       | Yes   | No        | N/A                  | No
PDP       | Yes            | Yes       | Yes   | Yes       | Access-based         | Hit rate
(*) As originally proposed

- EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance
  - However, lines are not always guaranteed to be protected

[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: Simple and effective adaptive page replacement. In SIGMETRICS'99
[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA'07
[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA'10
[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO'10

OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

PD-BASED SHARED CACHE PARTITIONING
- Each thread has its own PD (thread-aware)
  - The counter array is replicated per thread; the sampler and compute logic are shared
- A thread's PD determines its cache partition
  - Its lines occupy the cache longer if its PD is large
  - The cache is implicitly partitioned per the needs of each thread using the thread PDs
- The problem is to find a set of thread PDs that together maximize the hit rate

SHARED-CACHE HIT RATE MODEL
- Extends the single-core approach
- Compute a vector <PD> of per-thread protecting distances (T: the number of threads) maximizing

  E(<PD>) = (sum over threads t of Hits_t) / (sum over threads t of Accesses_t) * (1 / W)

- An exhaustive search for <PD> is not practical
  - A heuristic search algorithm finds a combination of the threads' RDD peaks that maximizes the hit rate
  - The single-core model generates the top 3 peaks per thread
  - The complexity is O(T^2)
  - See the paper for more detail

OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation

EVALUATION METHODOLOGY
- CMP$im simulator; target cache: the LLC

Cache          | Parameters
DCache         | 32KB, 8-way, 64B lines, 2 cycles
ICache         | 32KB, 4-way, 64B lines, 2 cycles
L2 Cache       | 256KB, 8-way, 64B lines, 10 cycles
L3 Cache (LLC) | 2MB, 16-way, 64B lines, 30 cycles
Memory         | 200 cycles

EVALUATION METHODOLOGY (CONT.)
- Benchmarks: SPEC CPU 2006
  - Excluded those that do not stress the LLC
- Single-core: compared to EELRU, SDP, DIP, DRRIP
- Multi-core:
  - 4- and 16-core configurations, 80 workloads each
  - The workloads are generated by randomly combining benchmarks
  - Compared to UCP, PIPP, TA-DRRIP
- Our policy: PDP-x, where x is the number of bits per cache line

SINGLE-CORE PDP
[Figure: IPC improvement over DIP (-30% to 30%) for SDP, DRRIP, EELRU, PDP-2, PDP-3, PDP-8, and SPDP-B]
- Each benchmark is executed for 1B instructions
- PDP is best when it can use 3 bits per line, but still better than prior work at 2 bits

ADAPTATION TO PROGRAM PHASES
- 5 benchmarks that demonstrate significant phase changes
- Each benchmark is run for 5B instructions
[Figure: change of the PD over time (x-axis: 1M LLC accesses) for 403.gcc, 429.mcf, 450.soplex, 482.sphinx3, and 483.xalancbmk]

ADAPTATION TO PROGRAM PHASES (CONT.)
[Figure: IPC improvement over DIP (-5% to 15%) for RRIP, PDP-2, PDP-3, and PDP-8 on the phase-changing benchmarks]

PD-BASED CACHE PARTITIONING FOR 16 CORES
[Figure: W, T, and H metrics per workload (80 workloads), normalized to TA-DRRIP (-20% to 40%), for UCP, PIPP, PDP-2, and PDP-3, with the average of each metric]

HARDWARE OVERHEAD

Policy | Per-line bits | Overhead (%)
DIP    | 4             | 0.8%
RRIP   | 2             | 0.4%
SDP    | 4             | 1.4%
PDP-2  | 2             | 0.6%
PDP-3  | 3             | 0.8%

OTHER RESULTS
- Exploration of the PDP cache parameters
- Cache bypass fraction
- Prefetch-aware PDP
- PD-based cache management policy for 4 cores

CONCLUSIONS
- Proposed the concept of the Protecting Distance (PD)
  - Showed that it can be used to better balance reuse and cache pollution
- Developed a hit rate model as a function of the PD, program behavior, and the cache configuration
- Proposed PD-based management policies for both single- and multi-core systems
- PD-based policies outperform existing policies

THANK YOU!

BACKUP SLIDES
- RDD, E, and hit rate of all benchmarks

RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS
[Figures: RDD, modeled hit rate (E), and real hit rate for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, and the three windows of 483.xalancbmk; x-axis: reuse distance, 0-256]
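As a final backup-style sketch (an illustration, not the algorithm evaluated in the talk): the multi-core policy searches combinations of each thread's top RDD peaks rather than all possible <PD> vectors. The sketch below approximates the shared-cache objective by summing independent per-thread hit estimates, which ignores thread interaction; the actual policy uses the shared-cache model and a heuristic with O(T^2) complexity, so both the objective and the enumeration here are simplifying assumptions.

```python
# Illustrative multi-core PD search over per-thread RDD peaks.
# peaks_per_thread: a short candidate-PD list (e.g., top 3 peaks) per thread.
# model_per_thread: per-thread functions estimating hits for a given PD
# (stand-ins for the shared-cache hit-rate model, which also accounts for
# how threads' protected lines compete for the shared capacity).
from itertools import product

def best_pd_vector(peaks_per_thread, model_per_thread):
    best, best_hits = None, -1.0
    # Enumerating all peak combinations is fine only for small T; the real
    # policy prunes this search heuristically.
    for pds in product(*peaks_per_thread):
        hits = sum(m(pd) for m, pd in zip(model_per_thread, pds))
        if hits > best_hits:
            best, best_hits = tuple(pds), hits
    return best
```

With two threads whose estimated hits peak at PD = 64 and PD = 32 respectively, the search returns the vector (64, 32).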