2012 11th International Symposium on Distributed Computing and Applications to Business, Engineering & Science

An Improved Multi-core Shared Cache Replacement Algorithm

Fang Juan, Li Chengyan
College of Computer Science, Beijing University of Technology, Beijing, China
e-mail: fangjuan@bjut.edu.cn, lisa890608@126.com

Abstract—Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that traditional LRU and its approximations can lead to poor performance and unfairness when multiple cores compete for the limited LLC capacity, and are susceptible to thrashing for memory-intensive workloads whose working sets exceed the available cache size. As the LLC grows in capacity and associativity, the performance gap between LRU and the theoretically optimal replacement algorithm has widened. In this paper we propose FLRU (Frequency based LRU), a replacement algorithm for the multi-core shared L2 cache that takes recency, partition and frequency information into consideration. FLRU filters less-reused blocks through a dynamic insertion/promotion policy and a victim selection strategy, ensuring that some fraction of the working set is retained in the cache so that at least that fraction contributes to cache hits and thrashing is avoided; at the same time, we augment traditional cache partitioning with victim selection, insertion and promotion policies to manage shared L2 caches.
Keywords—multi-core; replacement; shared cache

I. INTRODUCTION

Chip Multi-Processors (CMPs) have become a mainstream design choice for modern high-performance microprocessors. One of the key issues facing multi-processors is ever-growing power consumption, which may in turn degrade performance. To lower power consumption, many researchers have focused on the management of the on-chip shared last-level cache (LLC), which accounts for a dominant share of the processor's power consumption, and have proposed a variety of techniques to manage the LLC for better performance and fairness [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Most of these schemes include three steps: information collection, policy selection, and enforcement. Information collection gathers the memory reference behavior; policy selection decides how to improve performance; and the enforcement mechanism implements the chosen policy. A cache miss can stall the processor for hundreds of cycles, and memory accesses consume more power than cache accesses, so reducing cache misses contributes to both performance and power.

The LRU replacement policy and its approximations have remained the de-facto standard for replacement decisions in the LLC. Recent studies have shown that in multi-processor systems with highly associative, large-capacity caches, the performance gap between Least Recently Used (LRU) and the theoretically optimal replacement algorithm is large, up to 197% in the number of misses [13, 14]. One reason for this gap is that LRU inserts a new block at the MRU position and gradually demotes its priority until it reaches the LRU position, at which point it can be evicted. A line that is no longer needed, or was used only once or twice, must therefore occupy the cache until it becomes the LRU line of its set before it can be evicted. We call this unnecessary interval dead time: the block has become a dead block and will never be reused. This dead time grows with cache associativity (a minimal sketch of this baseline behavior is given at the end of this section). Another reason is that LRU behaves badly for memory-intensive workloads whose working set is larger than the available cache size: locality is weak, and the large working set can lead to thrashing. A further reason is that LRU does not consider access frequency or core-to-core interference, both of which play an important role in multi-core cache accesses. Researchers have proposed corresponding solutions to these problems, including cache partitioning, insertion/promotion policies, and frequency-based replacement algorithms. This work considers multi-core interference, access frequency, and insertion/promotion policy together, with the goal of improving performance and lowering power consumption by reducing the cache miss rate.

The rest of the paper is organized as follows: Section II describes the current situation of multi-core processors; Section III presents the FLRU algorithm; Section IV gives the methodology and results of our experiments; and Section V concludes.
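For reference, the following is a minimal sketch (ours, not taken from the paper) of a single 16-way set under conventional LRU: a hit promotes the block to the MRU position, and a miss evicts the LRU line and installs the new block at MRU. A block that is never reused still has to travel the whole priority stack before it can leave the set, which is exactly the dead time discussed above.

/* Minimal model of one LRU-managed set (illustration only).
 * tags[0] is the MRU line and tags[WAYS-1] is the LRU line. */
#define WAYS 16

typedef struct {
    int tags[WAYS];
    int valid[WAYS];
} lru_set_t;

/* Move the line at position pos up to the MRU slot, sliding the
 * lines above it one step toward LRU. */
static void promote_to_mru(lru_set_t *s, int pos) {
    int tag = s->tags[pos], v = s->valid[pos];
    for (int i = pos; i > 0; i--) {
        s->tags[i]  = s->tags[i - 1];
        s->valid[i] = s->valid[i - 1];
    }
    s->tags[0] = tag;
    s->valid[0] = v;
}

/* One access: returns 1 on a hit, 0 on a miss. On a miss the LRU line
 * is evicted and the new block is inserted at the MRU position. */
int lru_access(lru_set_t *s, int tag) {
    for (int i = 0; i < WAYS; i++) {
        if (s->valid[i] && s->tags[i] == tag) {
            promote_to_mru(s, i);
            return 1;
        }
    }
    s->tags[WAYS - 1]  = tag;       /* overwrite the LRU victim */
    s->valid[WAYS - 1] = 1;
    promote_to_mru(s, WAYS - 1);    /* new block enters at MRU */
    return 0;
}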
II. BACKGROUND

Most mainstream CMPs use a large, highly associative shared last-level L2 cache to provide fast access to data. Compared with single-core processors, however, several new issues arise. First, thread interference remains a serious and even more complicated problem in CMPs: conflicting accesses to the shared L2 cache degrade the performance of concurrently executing threads, and the effective cache capacity of a thread is influenced by the other threads, making its execution time uncertain. Second, for memory-intensive workloads whose working set is larger than the available cache size, thrashing may appear and hurt performance. We observe that for some applications many cache lines are brought into the cache and then never used again, or reused only a few times, before they are evicted. We should therefore try to retain some fraction of the working set in the cache so that at least that fraction can contribute to cache hits. Third, LRU does not take access frequency into consideration, even though frequency information could improve replacement decisions. All these features motivate a comprehensive treatment of multi-core processors with a new scheme rather than a single mechanism.

There have been numerous proposals to improve the performance of LRU. These works can be classified into three groups: way-granularity cache partitioning [1, 5, 6, 8, 11, 12], dynamic insertion/promotion policies [4, 20, 21, 22, 23, 25], and frequency-based replacement policies [15, 16, 17, 18, 19, 24]. Cache partitioning mitigates inter-application contention by dividing the cache according to how much each application is likely to benefit from it. Dynamic insertion/promotion policies improve cache performance by retaining some fraction of the working set in the cache so that at least that fraction contributes to cache hits. Frequency-based replacement takes access frequency, which plays an important role in cache behavior, into consideration when choosing a victim block. Although previous work has also adopted counters for monitoring access frequency, it did not consider shared caches in CMPs, where inter-core contention is a main source of performance loss. Since cache partitioning has become a trend in multi-core designs, it is natural to combine it with other schemes.

In this paper we discuss how to combine the benefits of cache partitioning, insertion/promotion and frequency information, and finally augment traditional cache partitioning with insertion/promotion and frequency-based replacement policies. First, we partition the cache ways into N parts so that every core owns its own ways; the replacement victim is chosen based on this partition, and we allow cache way stealing among cores. Next, we maintain the M LRU ways of each set as eviction candidates; when a victim must be selected, we choose one of the M according to the partition, the stealing information and the access frequency. Finally, we propose an insertion/promotion policy that implements both dynamic insertion and dynamic promotion in a new way, taking the cache partition and the frequency information into account.

III. ALGORITHM

A. Cache Partition Policy

We adopt an even cache partitioning method that divides the cache ways into N parts according to the number of cores, so that every core owns its own ways. The replacement victim is chosen based on this partition, and we allow cache way stealing among cores to a certain extent; a table records the stealing information. We use partitioning because it is an effective way to reduce cache conflicts, so it must be taken into account when the replacement policy is modified. A more efficient partitioning method could be adopted, and this is left as future work.
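To make the bookkeeping concrete, the following is a rough sketch (ours; the sizes and field layout are assumptions, and a real design would store the table far more compactly) of the even way partition and of a per-core-pair stealing record, in the spirit of the core(i, j) notation used later in Section III.D.

/* Sketch of the even way partition and the stealing table
 * (illustrative layout; not the paper's hardware structure). */
#define NUM_CORES 4
#define NUM_WAYS  16
#define NUM_SETS  2048                  /* 2 MB / (64 B x 16 ways) */
#define WAYS_PER_CORE (NUM_WAYS / NUM_CORES)

/* Under the even partition, way w belongs to core w / WAYS_PER_CORE. */
static int way_owner(int way) {
    return way / WAYS_PER_CORE;
}

/* steal[i][j] records whether core j's ways currently hold blocks that
 * belong to core i, and where; a dense bitmap is used here only for
 * clarity. */
typedef struct {
    int core;                                   /* 1 if core j holds a block of core i */
    unsigned char index[NUM_SETS][NUM_WAYS];    /* index[x][y] = 1: stolen block in set x, way y */
} steal_entry_t;

static steal_entry_t steal[NUM_CORES][NUM_CORES];

/* Record that a block of core i has been placed in core j's way y of set x. */
static void record_steal(int i, int j, int x, int y) {
    steal[i][j].core = 1;
    steal[i][j].index[x][y] = 1;
}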
B. Insertion/Promotion Policy

We divide cache replacement into three steps: victim selection, block insertion and priority promotion. Victim selection provides the block to evict, block insertion decides the insertion position, and priority promotion raises the priority of a hit block based on the principle of locality.

Fig. 1 illustrates an example cache set with sixteen lines, logically organized left-to-right from highest priority (A: keep in the cache) to lowest priority (P: to be evicted). For LRU replacement, the priority ordering is the access order: the least recently accessed line has the lowest priority for retention. On a miss, traditional victim selection chooses the LRU line to evict and inserts the new block at the MRU position (a); on a hit, a non-MRU block is promoted to the MRU position. It has been observed that some cache lines are accessed only once and never accessed again [20, 25]. By installing these lines in the highest-priority position, LRU maximizes their occupation of the cache without any performance benefit. Paper [8] proposes the LRU Insertion Policy (LIP), which puts the new block into the LRU position (b); for non-reused lines this is much better, as it minimizes the time the line spends in the cache. Paper [9] proposes a single incremental promotion policy (SIP), which moves a hit block forward one step at a time instead of promoting it directly to MRU. LIP and SIP both reduce the time that non-reused blocks occupy the cache, but they hurt reused blocks, and both still choose the LRU block as the victim.

Figure 1. Different insertion policies: (a) conventional LRU insertion, (b) LIP insertion, (c) FLRU insertion.

FLRU implements both a dynamic insertion policy and a dynamic promotion policy in a new way, taking the cache partition and the frequency information into account. We insert a new block at the M-th position, where M is the number of eviction candidates, instead of the LRU or MRU position (c); this keeps dead blocks from occupying the cache too long while preventing reused blocks from being evicted too soon. On a cache hit, the block is promoted to the MRU position or to the M position, depending on where the hit block resides. The victim block is chosen under three conditions based on the frequency, partition and stealing information, as elaborated in the FLRU algorithm below.

C. Frequency Policy

We attach a counter to every cache line to record its access frequency, and maintain the M LRU ways of each set as eviction candidates. When a victim must be selected, we choose one of the M according to the access frequencies of the M candidates, the partition, and the stealing information of the requesting core. The block with the lowest frequency is the most likely, but not certain, victim, because the partition and stealing information are also taken into account. Table 1 shows the logical cache structure with the counter: each line carries a counter recording its access frequency, and victim selection consults the counter instead of the LRU information only.

TABLE 1. LOGICAL CACHE STRUCTURE WITH COUNTER
Tag    Data   LRU    Counter
1011   ...    0001   11000110
1100   ...    0111   11000011
0110   ...    0010   11000010
0011   ...    0001   11000001

D. FLRU Algorithm

Assume N cores and L cache ways, with the L ways partitioned into N parts so that each core i (i = 1, 2, ..., N) owns L/N ways. A table S maintains the stealing information: core(i, j).core = 1 means that core j's ways hold a block belonging to core i, and core(i, j).index[x][y] = 1 records that this block is in set x, way y. A victim is selected among the M LRU lines (w1, w2, ..., wM). Assume the requesting core is i and the set number is x; on a cache access the algorithm proceeds as follows:

1) Victim Selection
a) If one of the M LRU lines belongs to core i, that line is the victim; else
b) if core i's ways hold a block of another core j, i.e., core(j, i).core = 1 and core(j, i).index[x][y] = 1, evict block y and clear the corresponding record of core(j, i); else
c) evict the lowest-frequency block among the M and record the stealing information.

2) Insertion Policy
a) Insert the new block into the highest-priority position of the M.
b) Initialize the access frequency counter of the inserted block.

3) Promotion Policy
a) On a cache hit, if the hit block is one of the M candidates, promote it to the M position; else
b) promote it to the MRU position.
c) Update the access frequency counter of the hit block.
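The following C sketch is our reconstruction of these per-set decisions under stated assumptions: M = 4 eviction candidates, a recency rank per line standing in for the LRU stack, an 8-bit counter per line as in Table 1, and a hypothetical helper reclaim_stolen_way() standing in for the stealing-table lookup of step 1b. It illustrates the policy as described above, not the authors' actual implementation.

/* Sketch of FLRU victim selection, insertion and promotion for one set.
 * rank 0 is the MRU position; ranks WAYS-M .. WAYS-1 are the M
 * eviction candidates at the LRU end. */
#define WAYS 16
#define M    4                       /* candidate count (assumed) */

typedef struct {
    int tag, owner, valid;
    int rank;                        /* 0 = MRU ... WAYS-1 = LRU */
    unsigned char counter;           /* per-line access-frequency counter */
} flru_line_t;

typedef struct { flru_line_t w[WAYS]; } flru_set_t;

/* Hypothetical helper over the stealing table of Section III.A: if another
 * core has a block sitting in one of core_i's ways of set_x, return that
 * way and clear the record, otherwise return -1. */
int reclaim_stolen_way(int core_i, int set_x);

/* Give way `way` the recency rank `new_rank`, shifting the ranks of the
 * lines in between by one position. */
static void move_to_rank(flru_set_t *s, int way, int new_rank) {
    int old = s->w[way].rank;
    for (int i = 0; i < WAYS; i++) {
        if (i == way) continue;
        if (new_rank < old && s->w[i].rank >= new_rank && s->w[i].rank < old)
            s->w[i].rank++;          /* promotion: others slide toward LRU */
        else if (new_rank > old && s->w[i].rank <= new_rank && s->w[i].rank > old)
            s->w[i].rank--;          /* demotion: others slide toward MRU */
    }
    s->w[way].rank = new_rank;
}

/* 1) Victim selection for requesting core core_i in set set_x. */
int flru_victim(flru_set_t *s, int core_i, int set_x) {
    int lowest = -1;
    for (int i = 0; i < WAYS; i++) {
        if (s->w[i].rank < WAYS - M)
            continue;                               /* only the M LRU-end candidates */
        if (!s->w[i].valid || s->w[i].owner == core_i)
            return i;                               /* 1a: free way or own block */
        if (lowest < 0 || s->w[i].counter < s->w[lowest].counter)
            lowest = i;
    }
    int y = reclaim_stolen_way(core_i, set_x);      /* 1b: reclaim a stolen way */
    if (y >= 0)
        return y;
    return lowest;                                  /* 1c: lowest-frequency candidate */
}

/* 2) Insertion: the new block enters at the highest-priority candidate
 *    position (rank WAYS-M) and its counter is initialized. */
void flru_insert(flru_set_t *s, int way, int tag, int core_i) {
    s->w[way].tag = tag;  s->w[way].owner = core_i;
    s->w[way].valid = 1;  s->w[way].counter = 1;
    move_to_rank(s, way, WAYS - M);
}

/* 3) Promotion on a hit: a candidate only rises to rank WAYS-M, any other
 *    block goes to MRU; the frequency counter is updated in both cases. */
void flru_promote(flru_set_t *s, int way) {
    move_to_rank(s, way, s->w[way].rank >= WAYS - M ? WAYS - M : 0);
    if (s->w[way].counter < 255)
        s->w[way].counter++;
}

In this sketch a block that is hit while sitting among the M candidates is promoted only to the boundary position, mirroring the gradual promotion described above, while blocks already established in the set are promoted to MRU.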
Partition information means that the requesting core should select a victim from its own ways. Cache stealing means that a core may place its blocks in another core's ways when its own ways are full and the other core's are available; stealing accommodates the fact that different cores have different access patterns. When a victim must be selected, we take all of this information into consideration, instead of the LRU information only, in order to make the best choice. Traditional LRU inserts a new block at the MRU position, which extends the occupation time of dead blocks. DIP [9] inserts a new block at the LRU position, which minimizes that time but can cause thrashing when the LRU block is reused soon after it has been evicted. Our policy inserts a new block at the highest-priority position of the M candidates, avoiding both long occupation and premature eviction. When selecting a victim line we consider frequency, LRU and partition information rather than LRU alone, which makes replacement more accurate and lowers the miss rate. Since the algorithm acts only on cache misses, it does not affect other access paths, and because the reference table is small it can be located on chip. Experimental results show that the FLRU replacement algorithm lowers the miss rate and the runtime considerably, by an average of 22% and 28.12% respectively.

IV. EXPERIMENTAL EVALUATION

A. Experimental Environment

We use the full-system simulator SIMICS together with the General Execution-driven Multiprocessor Simulator (GEMS), which provides detailed timing simulation. We use the SPEC CPU2000 suite as benchmarks, choose three memory-intensive programs (gzip, vpr, bzip2), and collect their IPC, misses and runtime. Our baseline CMP contains 4 cores, each with private 8-way, 32KB L1 instruction and data caches, and a shared 16-way, 2MB L2 cache; the block size is 64B. Table 2 gives the detailed simulator configuration.

TABLE 2. SIMULATION CONFIGURATIONS
Processor configuration
  Chip num: 1
  Cores per chip: 4
  Frequency: 2 GHz
  Microarchitecture: out-of-order, 4-issue width, one cycle per instruction
Memory configuration
  Private I-cache and D-cache: I-cache 32KB, 64B blocks, 8-way, 3-cycle hit latency; D-cache 32KB, 64B blocks, 8-way, 3-cycle hit latency
  L2 cache: shared, 2MB, 16-way, 64B blocks, 12-cycle hit latency
  Memory: 4GB, 400-cycle access delay

B. Experimental Result

We use an evenly partitioned shared L2 cache with LRU replacement as the baseline and evaluate the performance with 4 cores. Fig. 2 shows the runtime of the two schemes: the runtime is shortened by 1.96%, 38.4% and 44.4%, an average of 28.12%. This is because our scheme reduces core interference and makes better use of the different access patterns of different cores.

Figure 2. Runtime of benchmarks

Fig. 3 illustrates the total misses of the two schemes: misses are reduced by 6.62%, 34.2% and 25.2%, an average reduction of 22%. With fewer misses the runtime is also shortened, showing that our replacement algorithm improves the miss rate compared with LRU.

Figure 3. L2 cache misses

Fig. 4 shows the total IPC of the two schemes: IPC improves by 0.06%, 2.4% and 3.6%, an average of 2.02%. With shorter runtime and a lower miss rate, the IPC improves accordingly, which further confirms the performance gain.

Figure 4. IPC improvement

In general, our scheme delivers a clear multi-core performance improvement with low overhead: 32KB of counters, accounting for about 1.5% of the L2 cache, plus a core-to-core stealing table of little capacity.
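As a rough check of this figure (our arithmetic, assuming one 8-bit counter per line as suggested by Table 1):

  lines    = 2 MB / 64 B per line   = 32,768
  counters = 32,768 x 1 B           = 32 KB
  overhead = 32 KB / 2 MB           ≈ 1.6% of the L2 capacity,

which is consistent with the roughly 1.5% quoted above.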
V. CONCLUSIONS

Many previous studies have proposed mechanisms to improve the performance of the shared last-level cache by exploiting common memory access behaviors. In this work we have introduced a technique that combines the benefits of cache partitioning, adaptive insertion/promotion and inter-core capacity stealing. By taking the multiple cores and their partition into consideration, FLRU augments cache partitioning with an insertion/promotion policy and frequency information, and delivers higher performance than previously proposed techniques.

There is room to extend this work further. First, a more efficient partitioning mechanism could be adopted. Second, we plan to take way prediction into consideration, combining way prediction, way partitioning and the insertion/promotion policy to obtain better results.

ACKNOWLEDGMENT

This work was supported by the General Program of the Science and Technology Development Project of the Beijing Municipal Education Commission (KM201210005022) and other government sponsors. We thank our reviewers for their helpful suggestions, which have led to several important improvements, and all the teachers and students in our lab for helpful discussion.
REFERENCES
[1] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In Proc. of the 11th Int. Symp. on High Performance Computer Architecture, pages 340-351, San Francisco, CA, USA, Feb. 2005.
[2] J. Chang and G. Sohi. Cooperative Cache Partitioning for Chip Multiprocessors. In Proc. of the 21st Int. Conference on Supercomputing, pages 242-252, Seattle, WA, June 2007.
[3] L. R. Hsu, S. K. Reinhardt, R. R. Iyer, and S. Makineni. Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource. In Proc. of the 15th Int. Conference on Parallel Architectures and Compilation Techniques, pages 13-22, Seattle, WA, USA, Sep. 2006.
[4] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely Jr., and J. Emer. Adaptive Insertion Policies for Managing Shared Caches. In Proc. of the 17th Int. Conference on Parallel Architectures and Compilation Techniques, 2007.
[5] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In Proc. of the 13th Int. Conference on Parallel Architectures and Compilation Techniques, pages 111-122, Antibes Juan-les-Pins, France, Sep. 2004.
[6] J. Lin, Q. Lu, X. Ding, Z. Zhang, and P. Sadayappan. Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems. In Proc. of the 14th Int. Symp. on High Performance Computer Architecture, pages 367-378, Salt Lake City, UT, USA, Feb. 2008.
[7] M. K. Qureshi. Dynamic Spill-Accept for Scalable High-Performance Caching in CMPs. In Proc. of the 15th Int. Symp. on High Performance Computer Architecture, Raleigh, NC, USA, Feb. 2009.
[8] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In Proc. of the 39th Int. Symp. on Microarchitecture, pages 423-432, Orlando, FL, Dec. 2006.
[9] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural Support for Operating System-Driven CMP Cache Management. In Proc. of the 15th Int. Conference on Parallel Architectures and Compilation Techniques, pages 2-12, Seattle, WA, USA, Sep. 2006.
[10] S. Srikantaiah, M. Kandemir, and M. J. Irwin. Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors. In Proc. of the 13th Int. Symp. on Architectural Support for Programming Languages and Operating Systems, Seattle, WA, USA, Mar. 2009.
[11] H. S. Stone, J. Turek, and J. L. Wolf. Optimal Partitioning of Cache Memory. IEEE Trans. on Computers, 41(9):1054-1068, Sep. 1992.
[12] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic Partitioning of Shared Cache Memory. Journal of Supercomputing, 28(1):7-26, 2004.
[13] W.-F. Lin and S. K. Reinhardt. Predicting Last-Touch References under Optimal Replacement. University of Michigan Tech. Rep. CSE-TR-447-02, 2002.
[14] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation Techniques for Storage Hierarchies. IBM Systems Journal, 9(2), 1970.
[15] J. T. Robinson and M. V. Devarakonda. Data Cache Management Using Frequency-Based Replacement. In SIGMETRICS, vol. 18, pages 134-142, May 1990.
[16] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, and S. L. Min. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies. IEEE Transactions on Computers, 50:1352-1361, Dec. 2001, doi:10.1109/TC.2001.970573.
[17] J. Alghazo, A. Akaaboune, and N. Botros. SF-LRU Cache Replacement Algorithm. In Proc. of the 2004 Int. Workshop on Memory Technology, Design and Testing (MTDT), pages 19-24, 2004.
[18] M. Kharbutli and Y. Solihin. Counter-Based Cache Replacement Algorithms. In Proc. of the IEEE Int. Conference on Computer Design, page 61, Oct. 2005, doi:10.1109/ICCD.2005.41.
[19] H. Dybdahl, P. Stenström, and L. Natvig. An LRU-based Replacement Algorithm Augmented with Frequency of Access in Shared Chip-Multiprocessor Caches. In Proc. of the 2006 Workshop on Memory Performance (MEDEA), pages 45-52, 2006.
[20] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer. Adaptive Insertion Policies for High Performance Caching. In Proc. of the 34th Int. Symp. on Computer Architecture, pages 381-391, May 2007.
[21] Y. Xie and G. H. Loh. PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-Core Shared Caches. In Proc. of the 36th Int. Symp. on Computer Architecture, June 2009.
[22] J. D. Kron, B. Prumo, and G. H. Loh. Double-DIP: Augmenting DIP with Adaptive Promotion Policies to Manage Shared L2 Caches.
[23] X. Sui, J. Wu, G. Chen, Y. Tang, and X. Zhu. Augmenting Cache Partitioning with Thread-Aware Insertion/Promotion Policies to Manage Shared Caches. In Proc. of the 7th ACM Int. Conference on Computing Frontiers, pages 79-80, 2010.
[24] A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). In Proc. of the 37th Int. Symp. on Computer Architecture, pages 60-71, June 2010.
[25] X. Zhang, C. Li, H. Wang, and D. Wang. A Cache Replacement Policy Using Adaptive Insertion and Re-Reference Prediction. In Proc. of the 22nd Int. Symp. on Computer Architecture and High Performance Computing, pages 95-102, Oct. 2010.