A Hybrid Cache Replacement Policy for Heterogeneous Multi-Cores

K.M.AnandKumar*, Akash S*, Divyalakshmi Ganesh*, Monica Snehapriya Christy*
*Department of CSE, Easwari Engineering College, Chennai

Abstract-Future generation computer architectures endeavor to achieve high performance without compromising energy efficiency. In a multiprocessor system, a cache miss degrades performance, as the miss penalty scales by an exponential factor across a shared memory system when compared to general purpose processors. This motivates the need for an efficient cache replacement scheme to cater to the data needs of the underlying functional units on a cache miss. Minimizing cache misses improves resource utilization and reduces data movement across the cores, which in turn contributes to higher performance and lower power dissipation. Existing replacement policies have several issues when implemented in a heterogeneous multi-core system. The commonly used LRU replacement policy does not offer optimal performance for applications with high dependencies. Motivated by the limitations of the existing algorithms, we propose a hybrid cache replacement policy which combines the Least Recently Used (LRU) and Least Frequently Used (LFU) replacement policies. Each cache block carries two weighing values, one for LRU and one for LFU, and a cumulative weight is calculated from these two values. Through simulations over a wide range of cache sizes and associativities, we show that our proposed approach achieves an increased cache hit-to-miss ratio when compared with LRU and other conventional cache replacement policies.

Keywords-Cache replacement, cache miss, resource utilization, multi-core

I. INTRODUCTION

Caching is the main technique used to bridge the gap between the processor and main memory [1, 2]. There are three conventional cache mapping schemes. The first is direct mapping, where any block has a unique place in the cache and needs no replacement decision. This implementation has a fast hit time but a worse miss rate. The second scheme is fully associative mapping, which allows a memory block to be mapped to any empty cache block. If there are no empty blocks, a replacement policy is used to evict a victim block from the cache. This organization has a high hardware implementation cost and a worse hit time, since all addresses must be compared to find the victim block. The third scheme is set-associative mapping, in which the cache is divided into sets and a memory block may be mapped to any empty block within a cache set. If there are no empty blocks, a replacement policy evicts a victim block from the set. This scheme is a trade-off between the previous two.

In a multiprocessor computing environment, a cache miss degrades performance, as the cache miss penalty scales by several folds across a shared memory system when compared to general purpose processors. This necessitates an effective cache replacement policy, which is a current research thrust in computer architecture. All cache replacement policies aim to reduce the miss rate [2, 3, 4]. The cost of misses takes into account the miss penalty, power consumption, and bandwidth consumption. The cache "hit rate" accounts for how often a searched-for block is found in the cache.
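For reference, these quantities relate in the standard way (following [2, 3]; the average memory access time, AMAT, is the usual combined figure of merit):

\[ \text{hit rate} = \frac{\text{hits}}{\text{hits} + \text{misses}}, \qquad \text{miss rate} = 1 - \text{hit rate} \]

\[ \text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty} \]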
The choice of a cache replacement algorithm in a set-associative cache system has a direct and considerable effect on overall system performance. Some replacement policies are trivial while others are complex. Trivial cache replacement policies include Random replacement, FIFO replacement, Least Recently Used (LRU), and Least Frequently Used (LFU). The Random replacement policy evicts a randomly chosen victim block from among all the blocks in the set. Another trivial replacement policy is FIFO replacement, where the set is treated as a queue and blocks are evicted in First In First Out order. The trivial replacement policies lead to higher miss rates as associativity and application complexity increase.

The recency of a cache block is the time span from the current access time to the previous access time of that block. Recency is exploited by the LRU replacement policy and its different implementations. Frequency is exploited by its basic implementation, the LFU replacement policy, where each block maintains a counter that is incremented every time the block is accessed; the victim block is the block with the minimum counter value. Recency and frequency have been used extensively in cache replacement policy research. The LRU replacement policy ignores the usability of a cache block, so the most accessed block may become the victim. The LFU replacement policy may ignore the last accessed block, so the latest block may become the victim before it has had the chance to increment its counter. The major challenge is to combine both the frequency and recency of a block to obtain better performance by increasing the hit ratio.
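As a toy illustration of these complementary failure modes, consider a single 4-way set in which one block is referenced heavily and then three new blocks arrive once each (a minimal C++ sketch; the block names, trace, and tie-breaking are ours, purely for illustration):

#include <cstdint>
#include <cstdio>

int main() {
    // Four resident blocks A..D in one set: track last access time
    // (for LRU) and access count (for LFU).
    const char name[4] = {'A', 'B', 'C', 'D'};
    uint64_t lastUse[4] = {0, 0, 0, 0}, count[4] = {0, 0, 0, 0}, now = 0;

    // Toy trace: A is referenced four times early, then B, C, D once each.
    const int trace[] = {0, 0, 0, 0, 1, 2, 3};
    for (int b : trace) { lastUse[b] = ++now; ++count[b]; }

    // Victim selection on the next miss, per policy.
    int lru = 0, lfu = 0;
    for (int i = 1; i < 4; ++i) {
        if (lastUse[i] < lastUse[lru]) lru = i;   // least recently used
        if (count[i]   < count[lfu])   lfu = i;   // least frequently used
    }

    // Prints "LRU victim: A, LFU victim: B": LRU throws out the most
    // heavily used block, while LFU would keep A resident indefinitely
    // and evict freshly inserted blocks (count = 1) on every later miss.
    std::printf("LRU victim: %c, LFU victim: %c\n", name[lru], name[lfu]);
    return 0;
}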
The system architecture of the proposed system is shown in Figure 1. It is split into three different stages. The proposed system is implemented on CUBEMACH, a custom-built heterogeneous multi-core simulator. An application is given as input to the simulator in the form of Libraries: any application or algorithm is developed into a Library in the Library Generation Phase. The generated Library is given as input to the CUBEMACH simulator. One of the key criteria in CUBEMACH execution is setting up the initial architecture for simulation. Various architectural parameters, such as the number of functional units, the type of functional units, the cache organization, and the interconnect architecture, are customizable within the CUBEMACH design space. The CUBEMACH simulation space mimics the dynamics of heterogeneous multi-core systems. After all the Libraries are executed, the simulator parameters are dumped for evaluation.

Figure 1. System Architecture for Proposed System

In this paper we propose RFR, a cache replacement policy that combines the recency and frequency of a cache block. Section 2 deals with related work, while Section 3 describes our proposed replacement policy. Section 4 presents the simulation results of our proposed replacement policy against other existing policies, followed by the Conclusion section.

II. RELATED WORKS

The study of cache replacement policies is, in essence, the study of correlating past access patterns with future access patterns. Based on this analysis of past behaviour, the replacement policy identifies the cache block that will be used furthest in the future [2, 8]. Random-LRU [5] proposes a cache replacement policy that combines the Random and LRU policies. The entire cache is segmented into a Random Partition (RP) and an LRU Partition (LP). On a block replacement, the newly arriving block is placed in the RP in place of a victim block chosen randomly within the RP.

The LRFU policy [6] associates with each cache block a value called the Combined Recency and Frequency (CRF) value, which quantifies the probability that the block will be accessed in the near future. Each past reference to a cache block contributes to its CRF value, and a particular reference's contribution is governed by a weighing function denoted F(x), where x is the time difference between that past reference and the current time (Figure 2).

Figure 2. LRU-LFU Analysis Spectrum

Zhansheng et al. [7] proposed a cache replacement policy that switches between LRU and LFU at runtime using a limited-size queue called Qout within the cache; block replacement within the queue is managed with LRU. A block removed from the cache is placed in Qout. Two additional counters, the H (hit) counter and the O (out) counter, are initialized to zero. LRU manages block replacement initially; on each miss, the victim block is pushed into Qout and the new block is checked for its presence in Qout. If it is present, the H counter is incremented by one; if not, the O counter is incremented by one. When the value of the H counter exceeds that of the O counter, the replacement policy is switched to LFU. Other replacement policies also combine LRU and LFU to optimize overall system performance. In the Dynamic Insertion Policy (DIP) [8], the selection of the replacement policy depends on which one incurs fewer cache misses.
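For reference, [6] realizes the spectrum sketched in Figure 2 with an exponentially decaying weighing function (the notation below is ours); the CRF value of a block b at time t sums the decayed contributions of its k past references at times t_{b,i}:

\[ \mathrm{CRF}_t(b) = \sum_{i=1}^{k} F(t - t_{b,i}), \qquad F(x) = \left(\frac{1}{2}\right)^{\lambda x}, \quad 0 \le \lambda \le 1 \]

Setting λ = 0 makes F(x) = 1, so CRF reduces to a pure reference count (the LFU extreme marked in Figure 2), while λ = 1 recovers LRU behavior; intermediate values trade frequency against recency.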
III. RECENCY FREQUENCY REPLACEMENT (RFR) POLICY

All data stored in dynamic memory are in the form of data packets, which are clubbed together into a cache block whose size equals a single cache line; a group of lines is known as a cache set. Cache mapping is predominantly decided by cache size and associativity, and also depends on the level of cache: either a single mapping strategy is employed, or a combination of direct, fully associative, and set-associative mappings is used.

The Recency Frequency Replacement (RFR) policy combines the LRU and LFU cache replacement policies. The proposed replacement policy involves three main steps: (i) weighing LRU and LFU, (ii) fusing the LRU and LFU weights, and (iii) predicting the line to be replaced. To maintain a balance between the two replacement policies, we associate a weighing value with each of them.

A. Weighing Recency and Frequency

In the algorithm given in Figure 3, Weight_LRU[i] is the weight value under the LRU replacement policy. When a block in the cache is referenced, the Weight_LRU algorithm (Algorithm 1) gives this block the MRU value: any cache block whose weight value is greater than that of the accessed block has its weight reduced by 1, and the weight value of the referenced block becomes the largest, equal to assoc - 1. Similarly, the weight associated with LFU is calculated for each cache block using the Weight_LFU algorithm (Algorithm 2). The RFR policy depends on the Weight_RF value, which is determined by equation (1):

    Weight_RF = Weight_LRU × C_LRU + Weight_LFU × C_LFU    (1)

where C_LRU and C_LFU are priority constants for recency and frequency, respectively. In the simulation process, we fine-tune the values of these priority constants to improve overall system performance. The line with the minimum Weight_RF value is replaced.

The implementation of the proposed replacement policy is simple and needs minimal additional hardware. It requires two counters: the first for LRU weighing and the second for LFU weighing. For instance, if the cache organization is four-way set associative then two-bit counters are implemented, and if it is eight-way set associative then three-bit counters are implemented. In addition to the above, we need one more counter to store the usage count of each block. It is interesting to note that if the existing hardware architecture already uses the LFU replacement policy, then the additional hardware counters are easy to add and implement.

    Weight_Recency (Algorithm 1):
        for each block i in the cache set
            if (Weight_LRU[i] > Weight_LRU[current_block])
                Weight_LRU[i] = Weight_LRU[i] - 1
        Weight_LRU[current_block] = associativity - 1

    Weight_LFU (Algorithm 2):
        // val_LFU[i]    = counter value of line i
        // Weight_LFU[i] = weight of line i
        // N             = number of lines in a cache set
        Weight_LFU[i] = 0 for all i
        for (i = 0; i < N - 1; i++)
            if (val_LFU[i] > val_LFU[i+1])
                Weight_LFU[i]++
            else
                Weight_LFU[i+1]++

Figure 3. RFR Policy Algorithm
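To make the two weighting passes and equation (1) concrete, the following is a minimal C++ sketch of one RFR-managed set based on our reading of Figure 3 (the class and method names are ours; the single pairwise Weight_LFU pass and the lowest-index tie-breaking follow our interpretation of the pseudocode):

#include <algorithm>
#include <climits>
#include <cstdint>
#include <vector>

// One set of an RFR-managed cache (a sketch, not the paper's hardware).
class RfrSet {
public:
    explicit RfrSet(int assoc)
        : assoc_(assoc), weightLru_(assoc), valLfu_(assoc, 0),
          weightLfu_(assoc, 0) {
        for (int i = 0; i < assoc_; ++i) weightLru_[i] = i; // arbitrary start
    }

    // Algorithm 1 (Weight_LRU): demote blocks ranked above the accessed
    // one, then give the accessed block the MRU weight (associativity - 1).
    void touch(int line) {
        for (int i = 0; i < assoc_; ++i)
            if (weightLru_[i] > weightLru_[line]) --weightLru_[i];
        weightLru_[line] = assoc_ - 1;
        ++valLfu_[line]; // per-line LFU reference counter
    }

    // Equation (1): evict the line with the minimum
    // Weight_RF = Weight_LRU*C_LRU + Weight_LFU*C_LFU.
    // Defaults C_LRU = 1, C_LFU = 5 are the best setting found in Section IV.
    int victim(int cLru = 1, int cLfu = 5) {
        recomputeLfuWeights();
        int best = 0, bestW = INT_MAX;
        for (int i = 0; i < assoc_; ++i) {
            int w = weightLru_[i] * cLru + weightLfu_[i] * cLfu;
            if (w < bestW) { bestW = w; best = i; } // lowest index on ties
        }
        return best;
    }

private:
    // Algorithm 2 (Weight_LFU): one pairwise pass over neighbouring lines,
    // crediting whichever line of each pair has the higher counter.
    void recomputeLfuWeights() {
        std::fill(weightLfu_.begin(), weightLfu_.end(), 0);
        for (int i = 0; i + 1 < assoc_; ++i) {
            if (valLfu_[i] > valLfu_[i + 1]) ++weightLfu_[i];
            else                             ++weightLfu_[i + 1];
        }
    }

    int assoc_;
    std::vector<int> weightLru_;
    std::vector<uint64_t> valLfu_;
    std::vector<int> weightLfu_;
};

On an insertion, the victim's reference counter would be reset (valLfu_[v] = 0) before the incoming block is touched; the surrounding cache plumbing is omitted here.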
B. Functional Architecture

The functional architecture of the proposed system is depicted in Figure 4. The major modules involved in our system are Library Generation, Data Retrieval, and Replacement Policy. The Library Generation module generates the Library for an application/algorithm, which is given as input to the CUBEMACH simulator. The Data Retrieval module fetches data from the cache and maps it to the underlying functional units. Whenever there is a miss at the L1 cache, the Replacement Policy module is invoked; it replaces a block in the cache with the block required by the functional unit at that instant.

Figure 4. Functional Architecture of Proposed System

C. Cache Controller

The cache controller (Figure 5) controls and coordinates all operations of the cache at each level of the hierarchy, and it works similarly at all three levels of the cache system. When a miss occurs at a lower level of cache, the corresponding cache controller transmits a control signal to the controller at the next higher level of the hierarchy to trigger a search operation at that level. The execution trace of the controller depends on the replacement policy in use; hence, a Finite State Machine (FSM) needs to be designed so that the various state transitions during block replacement can be mimicked. Different cache replacement policies consider different statistics and parameters, but they involve almost similar operations; therefore, the replacement policies can be modeled as a single state. The design of the cache controller thus revolves around designing an FSM, which alleviates the hardware implementation of the different replacement policies. Since most of the operations involved in cache replacement are almost similar across the various levels of cache, this paves the way for a simpler cache controller design. However, to support parallel access to different cache levels, it is necessary to implement either an independent cache controller per level or a single controller that is shared across all levels.

Figure 5. Cache Controller FSM
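A minimal sketch of how such a per-level controller could be organized follows (the state names and transitions below are hypothetical illustrations; Figure 5 defines the actual machine). The key point from the discussion above is that every replacement policy shares the single Replace state:

// Hypothetical controller states, for illustration only.
enum class CtrlState { Idle, TagCompare, Hit, Replace, FetchNextLevel, Refill };

// One transition of a per-level cache controller. On a miss, the single
// Replace state invokes whichever policy is configured (RFR, LRU, LFU,
// FIFO, ...), after which the controller signals the next-higher level
// to search for the requested block.
CtrlState step(CtrlState s, bool tagMatch) {
    switch (s) {
        case CtrlState::Idle:           return CtrlState::TagCompare;
        case CtrlState::TagCompare:     return tagMatch ? CtrlState::Hit
                                                        : CtrlState::Replace;
        case CtrlState::Hit:            return CtrlState::Idle;   // serve data
        case CtrlState::Replace:        return CtrlState::FetchNextLevel;
        case CtrlState::FetchNextLevel: return CtrlState::Refill; // search L(n+1)
        case CtrlState::Refill:         return CtrlState::Idle;
    }
    return CtrlState::Idle; // unreachable
}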
IV. RESULTS AND ANALYSIS

To illustrate the effectiveness of the proposed RFR policy, we use the CUBEMACH simulator, which decouples timing and simulation functionality. We use it to capture and comparatively analyze the cache dynamics of various replacement policies. It uses a fully functional simulator to speed up simulator development and a memory sub-simulator to model multi-processor memory systems. It is implemented in C++ and has a queue-driven event model to mimic timing with great precision. It has a controller that communicates with other memory controllers by sending messages, and it also provides user customizability in specifying different cache coherence protocols.

We used SPECCPU2000 benchmarks for evaluation. The SPECCPU2000 benchmark suite is a collection of twenty-six compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, compilers, and memory system. The benchmarks in this suite are selected to represent real-world applications, and they exhibit a wide range of runtime behaviors [9]. We compared the RFR policy with other replacement policies across different cache sizes, block sizes, and associativities, using the following benchmarks:

• Gcc: Integer component of SPECCPU2000, a C language optimizing compiler; 176.gcc is based on gcc version 2.7.2.2 and generates code for a Motorola 88100 processor. The benchmark runs as a compiler with many of its optimization flags enabled [9].
• Vpr: Integer component of SPECCPU2000, an Integrated Circuit Computer-Aided Design program (more specifically, it performs placement and routing in Field-Programmable Gate Arrays) [9].
• Parser: Integer component of SPECCPU2000, word processing. The Link Grammar Parser is a syntactic parser of English based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words [9].
• Equake: Floating point component of SPECCPU2000. The program simulates the propagation of elastic waves in large, highly heterogeneous valleys, such as California's San Fernando Valley or the Greater Los Angeles Basin. The goal is to recover the time history of the ground motion everywhere within the valley due to a specific seismic event. Computations are performed on an unstructured mesh that locally resolves wavelengths, using a finite element method [9].

As shown in Figure 6, equation (1) gives the highest performance results for RFR when C_LRU = 1 and C_LFU = 5. Through several experimental evaluations, we inferred that giving C_LFU a value larger than 5 does not yield a significant rise in performance. The inception of this equation is very similar to reducing the λ value in [6] to improve performance, wherein reducing λ means the behavior of the LRFU policy becomes closer to LFU than to LRU.

Figure 6. Effect of C_LRU and C_LFU on Miss Rate

Our simulation focuses on the L2 cache, so we used the same L1 cache organization for all policies: a separate i-cache and d-cache with 4-way associativity, 128 sets, and a 32-byte block size. Figures 7 and 8 show the miss rate of the RFR policy when tested using the four benchmarks and compared with LRU, LFU, and FIFO. It is evident from the plots that the proposed RFR policy gives better results in terms of miss ratio than the LRU, LFU, and FIFO replacement policies.

Figure 7. Comparison of Miss Rate for 4-way Set Associative Cache

Figure 8. Comparison of Miss Rate for 8-way Set Associative Cache

V. CONCLUSIONS

We have proposed a new cache replacement policy (RFR) which combines the recency and frequency measures associated with cache blocks in an efficient manner. To analyze the effectiveness of our proposed policy, we used the Gcc, Vpr, Parser, and Equake benchmarks from the SPECCPU2000 suite. The priority constants C_LRU and C_LFU can be easily optimized to harness the maximum performance. By testing against different SPECCPU2000 benchmarks, we have shown that the RFR policy gives around 9% better performance in terms of miss ratio when compared with LRU, LFU, and FIFO.

We plan to extend the proposed RFR cache replacement policy with additional parameters such as dependencies, application complexity, and Library execution status. By including additional parameters in weighing cache blocks, we can achieve a higher degree of accuracy in determining the victim block; by assigning priority constants to the parameters associated with each block, we calculate the weight of each cache block. This would help us scale our replacement policy to support grand challenge applications and further improve overall system performance. However, we must ensure that the trade-off between performance and the delay associated with replacement is balanced.

REFERENCES

[1] N. Megiddo and D. Modha, "ARC: A Self-Tuning, Low Overhead Replacement Cache," in Proceedings of the 2nd USENIX Symposium on File and Storage Technologies, USA, pp. 115-130, 2003.
[2] W. Stallings, Computer Organization and Architecture, Prentice Hall, 2006.
[3] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 2007.
[4] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance," SIGARCH Comput. Archit. News, vol. 32, no. 2, March 2004.
[5] S. Das et al., "Random-LRU: A Replacement Policy for Chip Multiprocessors," in VLSI Design and Test, Springer Berlin Heidelberg, pp. 204-213, 2013.
[6] D. Lee, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho, and C. Kim, "LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies," IEEE Transactions on Computers, vol. 50, no. 12, pp. 1352-1361, 2001.
[7] L. Zhansheng, L. Dawei, and B. Huijuan, "CRFP: A Novel Adaptive Replacement Policy Combined the LRU and LFU Policies," in Proceedings of the IEEE 8th International Conference on Computer and Information Technology Workshops, Sydney, pp. 72-79, 2008.
[8] M. K. Qureshi et al., "Adaptive Insertion Policies for High Performance Caching," ACM SIGARCH Computer Architecture News, vol. 35, no. 2, 2007.
[9] Standard Performance Evaluation Corporation, available at www.spec.org