A Hybrid Cache Replacement Policy for Heterogeneous Multi-Cores
K.M.AnandKumar*, Akash S*, Divyalakshmi Ganesh*, Monica Snehapriya Christy*
*Department of CSE
Easwari Engineering College, Chennai
Abstract-Future generation computer architectures endeavor to achieve high performance without compromising energy efficiency. In a multiprocessor system, a cache miss degrades performance, as the miss penalty scales by an exponential factor across a shared memory system when compared to general purpose processors. This motivates the need for an efficient cache replacement scheme to cater to the data needs of the underlying functional units in case of a cache miss. Minimizing cache misses improves resource utilization and reduces data movement across the cores, which in turn contributes to higher performance and lower power dissipation. Existing replacement policies have several issues when implemented in a heterogeneous multi-core system. The commonly used LRU replacement policy does not offer optimal performance for applications with high dependencies. Motivated by the limitations of the existing algorithms, we propose a hybrid cache replacement policy that combines the Least Recently Used (LRU) and Least Frequently Used (LFU) replacement policies. Each cache block has two weighing values corresponding to the LRU and LFU policies, and a cumulative weight is calculated from these two values. Conducting simulations over a wide range of cache sizes and associativities, we show that our proposed approach achieves an increased cache hit-to-miss ratio when compared with LRU and other conventional cache replacement policies.
Keywords-Cache replacement, cache miss, resource utilization, multi-core
I. INTRODUCTION
Caching is the main technique used to bridge the gap between the processor and main memory [1, 2]. There are three conventional cache mapping schemes. The first is direct mapping, where every block has a unique place in the cache and no replacement decision is needed. This implementation has a good hit time but a worse miss rate. The second scheme is fully associative mapping, which allows a memory block to be mapped to any empty cache block. If there are no empty blocks, a replacement policy is used to evict a victim block from the cache. This organization has a high hardware implementation cost and a worse hit time, since all addresses must be compared to find the victim block. The third scheme is set-associative mapping, in which the cache is divided into sets and a memory block may be mapped into any empty block within a cache set. If there are no empty blocks, a replacement policy is used to evict a victim block from the set. This mapping scheme is a trade-off between direct and fully associative mapping.
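As a brief illustration of set-associative mapping, a byte address is split into a tag, a set index and a block offset. The cache parameters below are chosen by us for the example, not taken from the paper:

```python
# Illustrative set-associative cache geometry (example values, not the
# paper's configuration): 32 KiB cache, 32-byte blocks, 4-way associative.
CACHE_SIZE = 32 * 1024
BLOCK_SIZE = 32
ASSOC = 4
NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOC)  # 256 sets

def map_address(addr):
    """Split a byte address into (tag, set index, block offset)."""
    offset = addr % BLOCK_SIZE
    block_number = addr // BLOCK_SIZE
    set_index = block_number % NUM_SETS
    tag = block_number // NUM_SETS
    return tag, set_index, offset
```

A replacement policy only ever has to choose among the ASSOC blocks of the one set an address maps to, which is why the weighing counters discussed later need only a few bits.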
978-1-4799-3080-7/14/$31.00 ©2014 IEEE

In a multiprocessor computing environment, a cache miss degrades performance, as the cache miss penalty scales several fold across a shared memory system when compared to general purpose processors. This necessitates an effective cache replacement policy. An effective
cache replacement policy is the current research thrust in
computer architecture. All cache replacement policies aim
to reduce the miss rate [2, 3, 4]. The cost of misses takes
into account the miss penalty, power consumption, and
bandwidth consumption. The cache "hit rate" accounts for
how often a searched-for block is found in the cache. The
choice of a cache replacement algorithm in a set-associative cache system has a direct and considerable effect on the
overall system performance.
Some replacement policies are trivial while others are complex. Trivial cache replacement policies include Random replacement, FIFO replacement, Least Recently Used (LRU) and Least Frequently Used (LFU). The Random replacement policy evicts a randomly chosen victim block from among all the blocks in the set. Another trivial policy is FIFO replacement, where the set is treated as a queue and blocks are evicted in First In First Out order. The trivial replacement policies lead to higher miss rates as associativity and application complexity increase. The recency of a cache block is the time span from the current access time back to the previous access time of that block. Recency is exploited by the LRU replacement policy and its different implementations. Frequency is exploited by its basic implementation, the LFU replacement policy, where each block maintains a counter that is incremented every time the block is accessed; the victim block is the block with the minimum counter value. Recency and frequency have been used extensively in cache replacement research.
The LRU replacement policy ignores how heavily a cache block is used, so the most frequently accessed block may become the victim. The LFU replacement policy may ignore the most recently accessed block, so the newest block may be evicted before it has a chance to build up its counter value. The major challenge is to combine both the frequency and the recency of a block to obtain better performance by increasing the hit ratio.
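Both failure modes can be reproduced with a toy single-set simulator. The sketch below is our illustration, not part of the paper's evaluation: a hot block H is mixed with a cyclic scan, and LRU lets the scan push H out while LFU keeps it resident thanks to its high access count.

```python
from collections import OrderedDict

def simulate_lru(trace, capacity):
    """Count hits under LRU for a trace of block identifiers."""
    cache, hits = OrderedDict(), 0
    for b in trace:
        if b in cache:
            cache.move_to_end(b)           # mark as most recently used
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[b] = True
    return hits

def simulate_lfu(trace, capacity):
    """Count hits under LFU (ties broken by insertion order via min())."""
    cache, hits = {}, 0
    for b in trace:
        if b in cache:
            cache[b] += 1
            hits += 1
        else:
            if len(cache) >= capacity:
                victim = min(cache, key=cache.get)  # smallest counter
                del cache[victim]
            cache[b] = 1
    return hits

# Hot block H plus a scan that exceeds a 3-entry set.
trace = ["H", "H", "H"] + ["A", "B", "C", "D", "H"] * 3
```

On this trace with capacity 3, LRU evicts H on every scan pass, while LFU retains H and scores every subsequent reference to it as a hit, which is exactly the recency/frequency tension the proposed policy targets.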
The system architecture of the proposed system is shown in Figure 1. It is split into three different stages. The proposed system is implemented on CUBEMACH, a custom-built heterogeneous multi-core simulator. An application is given as input to the simulator in the form of Libraries. Any application/algorithm is developed into a Library in the
Figure 1. System Architecture for Proposed System
Library Generation Phase. The generated Library is given as input to the CUBEMACH simulator. One of the key criteria in CUBEMACH execution is setting up the initial architecture for simulation. Various architectural parameters, such as the number of functional units, the type of functional units, the cache organization and the interconnect architecture, are customizable within the CUBEMACH design space. The CUBEMACH simulation space mimics the dynamics of heterogeneous multi-core systems. After all the Libraries are executed, the simulator parameters are dumped for evaluation.
In this paper we propose RFR, a cache replacement policy that combines the recency and frequency of a cache block. Section 2 deals with related work, while Section 3 describes our proposed replacement policy. Section 4 presents simulation results comparing our proposed policy with other existing policies, followed by the Conclusion section.
II. RELATED WORKS
The study of cache replacement policies is, in essence, the study of correlating past access patterns with future access patterns. Based on its analysis of past behaviour, the replacement policy identifies the cache block that will be used furthest in the future [2, 8].
Random-LRU [5] proposes a cache replacement policy that combines the Random and LRU policies. The cache is segmented into a Random Partition (RP) and an LRU Partition (LP). On a block replacement, the newly arriving block is placed in the RP in place of a victim block chosen at random from the RP. The LRFU policy [6] associates with each cache block a value called the Combined Recency and Frequency (CRF) value, which quantifies the probability that the block will be accessed in the near future. Each past reference to a cache block contributes to its CRF value, and a particular reference's contribution is determined by a weighing function F(x), where x is the time elapsed from that reference to the current time (Figure 2).
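For concreteness, the CRF computation of [6] can be sketched as follows, using the weighing function F(x) = (1/2)^(λx) defined in that paper; the λ value below is illustrative:

```python
def crf(access_times, now, lam=0.5):
    """LRFU Combined Recency and Frequency value: each past reference at
    time t contributes F(now - t) = (1/2) ** (lam * (now - t)).
    lam = 0 degenerates to a pure access count (LFU); as lam grows,
    only the most recent reference matters (the LRU extreme)."""
    return sum(0.5 ** (lam * (now - t)) for t in access_times)
```

With lam = 0 every reference contributes exactly 1, so crf() reduces to the LFU counter; increasing lam slides the policy along the recency/frequency spectrum of Figure 2.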
Zhansheng et al. [7] proposed a cache replacement policy that switches between LRU and LFU at runtime using a queue called Qout within the cache. The queue has limited size, and block replacement within it is done using LRU. A block removed from the cache is placed in Qout. Two additional counters, the H (hit) counter and the O (out) counter, are initialized to zero. The LRU policy manages block replacement initially; on each miss, the victim block is pushed into Qout and the new block is checked for its presence in Qout. If it is present, the H counter is incremented by one; otherwise, the O counter is incremented by one. When the value of the H counter exceeds the O counter, the replacement policy switches to LFU. Other replacement policies combine LRU and LFU to optimize overall system performance. In the Dynamic Insertion Policy (DIP) [8], the selection of the replacement policy depends on which one incurs fewer cache misses.
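The switching rule of [7] described above can be sketched as follows. The class and attribute names are ours, and the victim bookkeeping of the real cache is abstracted away:

```python
from collections import deque

class CRFPSwitchSketch:
    """Sketch of the CRFP-style switch: evicted blocks go into a bounded
    Qout queue; on each miss the incoming block is checked against Qout,
    bumping an H (hit) or O (out) counter, and the active policy flips
    to LFU once H exceeds O."""
    def __init__(self, qout_size):
        self.qout = deque(maxlen=qout_size)  # oldest entries fall off
        self.h = 0
        self.o = 0
        self.policy = "LRU"

    def on_miss(self, victim, incoming):
        """Record one miss: 'victim' leaves the cache, 'incoming' enters."""
        if victim is not None:
            self.qout.append(victim)
        if incoming in self.qout:
            self.h += 1   # a recently evicted block came back: LRU failed it
        else:
            self.o += 1
        if self.h > self.o:
            self.policy = "LFU"
        return self.policy
```

The intuition is that a re-reference to a freshly evicted block is evidence LRU chose the wrong victim; enough such events (H > O) trigger the switch to LFU.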
International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Figure 2. LRU-LFU Analysis Spectrum (F(x) plotted against x = current time - reference time; F(x) = 1 is the LFU extreme)

III. RECENCY FREQUENCY REPLACEMENT (RFR) POLICY
All data stored in dynamic memory are in the form of data packets, which are grouped together into a cache block whose size equals a single cache line; a group of lines is known as a cache set. Cache mapping is predominantly determined by cache size and associativity. The mapping technique may also depend on the level of cache: either a single mapping strategy is employed, or a combination of direct, fully associative and set-associative mapping is used.
The Recency Frequency Replacement (RFR) policy combines the LRU and LFU cache replacement policies. The proposed policy involves three main steps: (i) weighing LRU and LFU, (ii) fusing the LRU and LFU weights, and (iii) predicting the line to be replaced. To maintain a balance between the two replacement policies, we associate a weighing value with each of them.
A. Weighing Recency and Frequency
In the algorithm given in Figure 3, Weight_LRU[i] is the weight value of the LRU replacement policy. When a block in the cache is referenced, the Weight_LRU algorithm (Algorithm 1) gives this block the MRU value. Any cache block with a weight value greater than that of the accessed block has its weight reduced by 1. The weight value of the referenced block is then the largest, equal to associativity - 1. Similarly, the weight
associated with LFU is calculated for each cache block using the Weight_LFU algorithm (Algorithm 2).
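The two weighing algorithms of Figure 3 can be sketched in Python as follows. Function names are ours, and the adjacent-pair comparison in the frequency ranking follows the figure as printed:

```python
def weight_lru_update(weights, accessed, assoc):
    """Algorithm 1 sketch: on a reference, every block with a higher
    recency weight than the accessed block is decremented, and the
    accessed block receives the MRU weight (associativity - 1)."""
    for b in weights:
        if weights[b] > weights[accessed]:
            weights[b] -= 1
    weights[accessed] = assoc - 1
    return weights

def weight_lfu(counters):
    """Algorithm 2 sketch: given the per-line access counters of a set
    of m lines, start all weights at 0 and award a point to the more
    frequently used line of each adjacent pair."""
    m = len(counters)
    w = [0] * m
    for i in range(m - 1):
        if counters[i] > counters[i + 1]:
            w[i] += 1
        else:
            w[i + 1] += 1
    return w
```

With a 4-way set, weight_lru_update keeps the recency weights as a permutation of 0..3, so the least recently used line always holds weight 0 and is a natural eviction candidate.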
The RFR policy depends on the Weight_RF value of each block, determined by equation (1):

Weight_RF = Weight_LRU * C_LRU + Weight_LFU * C_LFU    (1)

where C_LRU and C_LFU are priority constants for recency and frequency, respectively. In the simulation process, we fine-tune the values of these priority constants to improve overall system performance. The line with the minimum Weight_RF value is replaced.
The implementation of the proposed replacement policy is simple and needs minimal additional hardware. It requires two counters: the first for the LRU weighing and the second for the LFU weighing. For instance, if the cache organization is four-way set associative then two-bit counters are implemented, and if it is eight-way set associative then three-bit counters are implemented. In addition to the above hardware, we need one more counter per block to store its usage count. It is interesting to note that if the existing hardware architecture already uses the LFU replacement policy, then the additional hardware counters are easy to add and implement.
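The counter widths quoted above follow directly from the weight range: since the recency weights span 0 to associativity - 1, each counter needs ceil(log2(associativity)) bits, as this small check illustrates:

```python
import math

def counter_bits(assoc):
    """Bits needed for a per-block weight in 0..assoc-1:
    ceil(log2(assoc)) -> 2 bits for 4-way, 3 bits for 8-way."""
    return math.ceil(math.log2(assoc))
```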
B. Functional Architecture
The functional architecture of the proposed system is depicted in Figure 4. The major modules involved in
Algorithm 1: Weight_Recency
  for each block in the cache set
    if (weight_LRU[block] > weight_LRU[current_block])
      weight_LRU[block] = weight_LRU[block] - 1
  weight_LRU[current_block] = associativity - 1

Algorithm 2: Weight_LFU
  // val_LFU[i] is the counter value of line i
  // weight_LFU[i] is the weight of line i
  // m is the number of lines in a cache set
  weight_LFU[i] = 0 for all i
  for (i = 0; i < m - 1; i++)
    if (val_LFU[i] > val_LFU[i+1])
      weight_LFU[i]++
    else
      weight_LFU[i+1]++

Figure 3. RFR Policy Algorithm
Figure 4. Functional Architecture of Proposed System

our system are Library Generation, Data Retrieval and Replacement Policy. The Library Generation module generates the Library for an application/algorithm, which is given as input to the CUBEMACH simulator. The Data Retrieval module fetches data from the cache and maps it to the underlying functional units. Whenever there is a miss at the L1 cache, the Replacement Policy module is invoked; it replaces a block in the cache with the block required by the functional unit at that instant.

C. Cache Controller

The cache controller (Figure 5) controls and coordinates all operations of the cache at each level of the hierarchy. The working of the cache controller is similar at all three levels of the cache system. When a miss occurs at a lower level of cache, the corresponding cache controller transmits a control signal to the controller at the next higher level of the hierarchy to trigger a search operation at that level. The execution trace of the controller depends on the replacement policy used. Hence, a Finite State Machine (FSM) needs to be designed so that the various state transitions during block replacement can be mimicked. Different cache replacement policies consider different statistics and parameters, but they involve almost similar operations; therefore, these replacement policies can be modeled within a single state machine. The design of the cache controller revolves around designing an FSM, which alleviates the hardware implementation of different replacement policies. From the above discussion it is evident that most of the operations involved in cache replacement are similar across the various levels of cache, which paves the way for a simpler cache controller design. However, to support parallel access to different cache levels, it is necessary to implement either an independent cache controller per level or a single controller that can be shared across all levels.

IV.
RESULTS AND ANALYSIS
To illustrate the effectiveness of the proposed RFR policy we use the CUBEMACH simulator, which decouples timing from functional simulation. We use it to capture and comparatively analyze the cache dynamics of the various replacement policies. It uses a fully functional simulator to speed up simulator development and a memory sub-simulator to model multi-processor memory systems. It is implemented in C++ and uses a queue-driven event model to mimic timing with great precision. It has a controller that communicates with other memory controllers by sending messages, and it also lets the user specify different cache coherence protocols.
We used the SPECCPU2000 benchmarks for evaluation. The SPECCPU2000 benchmark suite is a collection of twenty-six compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, compilers and memory system. The benchmarks in this suite were selected to represent real-world applications, and they exhibit a wide range of runtime behaviors [9].
We compared the RFR policy with other replacement policies across cache size, block size and associativity using the following benchmarks:

• Gcc: Integer component of SPECCPU2000; a C language optimizing compiler. 176.gcc is based on gcc version 2.7.2.2 and generates code for a Motorola 88100 processor. The benchmark runs as a compiler with many of its optimization flags enabled [9].

• Vpr: Integer component of SPECCPU2000; an integrated circuit computer-aided design program (more specifically, it performs placement and routing in Field-Programmable Gate Arrays) [9].

• Parser: Integer component of SPECCPU2000; word processing. The Link Grammar Parser is a syntactic parser of English based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words [9].

• Equake: Floating point component of SPECCPU2000. The program simulates the propagation of elastic waves in large, highly heterogeneous valleys, such as California's San Fernando Valley or the Greater Los Angeles Basin. The goal is to recover the time history of the ground motion everywhere within the valley due to a specific seismic event. Computations are performed on an unstructured mesh that locally resolves wavelengths, using a finite element method [9].

Figure 5. Cache Controller FSM

As shown in Figure 6, equation (1) gives the highest performance for RFR when C_LRU = 1 and C_LFU = 5. Through several experimental evaluations, we inferred that giving C_LFU a value larger than 5 does not yield a significant rise in performance. The inception of this equation is very similar to reducing the λ value in [6] to improve performance, wherein a smaller λ means the behavior of the LRFU policy moves closer to LFU than to LRU.

Figure 6. Effect of C_LRU and C_LFU on Miss Rate

Our simulation focuses on the L2 cache, so we used the same L1 cache organization for all policies: a separate i-cache and d-cache, each 4-way set associative with 128 sets and a 32-byte block size. Figures 7 and 8 show the miss rate of the RFR policy when tested using the four benchmarks and compared with LRU, LFU and FIFO. It is evident from the plots that the proposed RFR policy gives better results in terms of miss ratio than the LRU, LFU and FIFO replacement policies.

Figure 7. Comparison of Miss Rate for 4-way Set Associative Cache

Figure 8. Comparison of Miss Rate for 8-way Set Associative Cache
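The qualitative comparison in Figures 7 and 8 can be mimicked at toy scale with a single-set simulator. This is a sketch of ours, not the CUBEMACH setup: the trace is synthetic rather than a SPECCPU2000 run, and the 'rfr' branch scores blocks by recency rank times C_LRU plus access count times C_LFU, in the spirit of equation (1):

```python
from collections import OrderedDict

def run(policy, trace, assoc=4, c_lru=1, c_lfu=5):
    """Toy single-set miss counter for 'fifo', 'lru', 'lfu' or 'rfr'."""
    entries = OrderedDict()              # block -> access count
    misses = 0
    for b in trace:
        if b in entries:
            entries[b] += 1
            if policy in ("lru", "rfr"):
                entries.move_to_end(b)   # keep the MRU block at the back
            continue
        misses += 1
        if len(entries) >= assoc:
            if policy in ("fifo", "lru"):
                entries.popitem(last=False)       # front = oldest/least recent
            elif policy == "lfu":
                victim = min(entries, key=entries.get)
                del entries[victim]
            else:  # rfr: recency rank * C_LRU + access count * C_LFU
                ranked = list(entries)            # index 0 = least recent
                victim = min(ranked,
                             key=lambda k: ranked.index(k) * c_lru
                                           + entries[k] * c_lfu)
                del entries[victim]
        entries[b] = 1
    return misses

# Hot block H plus a cyclic scan that overflows a 3-entry set.
trace = ["H"] * 3 + ["A", "B", "C", "D", "H"] * 3
```

On this trace the frequency-aware policies ('lfu' and 'rfr') keep H resident and incur fewer misses than 'lru' and 'fifo', mirroring the trend the figures report at benchmark scale.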
V. CONCLUSIONS
We have proposed a new cache replacement policy (RFR) that combines the recency and frequency measures associated with cache blocks in an efficient manner. To analyze the effectiveness of our proposed policy, we used the Gcc, Vpr, Parser and Equake benchmarks from the SPECCPU2000 suite. The priority constants C_LRU and C_LFU can be easily optimized to harness the maximum performance. Testing against different SPECCPU2000 benchmarks, we have shown that the RFR policy gives around 9% better performance in terms of miss ratio when compared with LRU, LFU and FIFO.
We plan to extend the proposed RFR cache replacement policy by including additional parameters such as dependencies, application complexity and Library execution status. By including additional parameters when weighing cache blocks, we can achieve a higher degree of accuracy in determining the victim block. By assigning priority constants to the parameters associated with each block, we calculate the weight of each cache block. This would help us scale our
replacement policy to support grand challenge applications and further improve overall system performance. However, we must ensure that the trade-off between performance and the delay associated with replacement remains balanced.
REFERENCES
[1] Megiddo N. and Modha D., "ARC: A Self-Tuning, Low Overhead Replacement Cache," in Proceedings of the 2nd USENIX Symposium on File and Storage Technologies, USA, pp. 115-130, 2003.
[2] Stallings W., Computer Organization and Architecture, Prentice Hall, 2006.
[3] Hennessy J. and Patterson D., Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 2007.
[4] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas, "Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance," SIGARCH Comput. Archit. News, vol. 32, no. 2, March 2004.
[5] Das, Shirshendu, et al., "Random-LRU: A Replacement Policy for Chip Multiprocessors," VLSI Design and Test, Springer Berlin Heidelberg, pp. 204-213, 2013.
[6] Lee D., Choi J., Kim J., Noh S., Min S., Cho Y., and Kim C., "LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies," IEEE Transactions on Computers, vol. 50, no. 12, pp. 1352-1361, 2001.
[7] Zhansheng L., Dawei L., and Huijuan B., "CRFP: A Novel Adaptive Replacement Policy Combined the LRU and LFU Policies," in Proceedings of the IEEE 8th International Conference on Computer and Information Technology Workshops, Sydney, pp. 72-79, 2008.
[8] Qureshi, Moinuddin K., et al., "Adaptive Insertion Policies for High Performance Caching," ACM SIGARCH Computer Architecture News, vol. 35, no. 2, ACM, 2007.
[9] Standard Performance Evaluation Corporation, available at www.spec.org