2012 11th International Symposium on Distributed Computing and Applications to Business, Engineering & Science
An Improved Multi-core Shared Cache Replacement Algorithm
Fang Juan
Li Chengyan
College of Computer Science
Beijing University of Technology
Beijing, China
e-mail: fangjuan@bjut.edu.cn
College of Computer Science
Beijing University of Technology
Beijing, China
e-mail: lisa890608@126.com
Abstract—Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that traditional LRU and its approximations can lead to poor performance and unfairness when the multiple cores compete for the limited LLC capacity, and are susceptible to thrashing for memory-intensive workloads whose working sets are larger than the available cache size. As the LLC grows in capacity and associativity, the performance gap between LRU and the theoretical optimal replacement algorithm has widened. In this paper, we propose the FLRU (Frequency-based LRU) replacement algorithm for the multi-core shared L2 cache, which takes recent access information, the cache partition, and access frequency information into consideration. FLRU filters out blocks with little reuse through a dynamic insertion/promotion policy and a victim selection strategy, ensuring that some fraction of the working set is retained in the cache so that at least that fraction contributes to cache hits and thrashing is avoided. Meanwhile, we augment traditional cache partitioning with victim selection, insertion, and promotion policies to manage the shared L2 cache.
Keywords—multi-core; replacement; shared cache
I. INTRODUCTION
Chip Multi-Processors (CMPs) have become a mainstream design choice for modern high-performance microprocessors. One of the key issues facing multi-core processors is ever-increasing power consumption, which may in turn lead to performance reduction. To lower power consumption, many researchers have focused on managing the on-chip shared last-level cache (LLC), which accounts for a large portion of the processor's overall power consumption, and have proposed a variety of techniques to manage the LLC for better performance and fairness [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Most of these schemes include three steps: information collection, policy selection, and enforcement. Information collection gathers the memory reference behavior, policy selection decides how to improve performance, and the enforcement mechanism implements the chosen policy.
As is well known, a cache miss can stall the processor for hundreds of cycles, and an access to memory consumes more power than an access to the cache, so reducing cache misses contributes directly to overall performance. The LRU replacement policy and its approximations have remained the de-facto standard for replacement decisions in the LLC. Recent studies have shown that in multi-core systems with highly associative, large-capacity caches, the performance gap between Least Recently Used (LRU) replacement and the theoretical optimal replacement algorithm is large, up to 197% in the number of misses [13, 14]. One reason for this gap is that LRU inserts a new block at the MRU position and gradually demotes it until it reaches the LRU position, where it can finally be evicted. A line that is no longer needed, or that was used only once or twice, therefore occupies the cache until it becomes the LRU line of its set. We call this unnecessary time dead time: the block has become a dead block that will never be reused, and the dead time grows with the cache associativity. Another reason is that LRU behaves badly for memory-intensive workloads whose working sets are larger than the available cache size: locality is weak, and the large working set may lead to thrashing. A further reason is that LRU does not consider access frequency or core-to-core interference, both of which play an important role in multi-core cache accesses. For these problems researchers have proposed corresponding solutions, including cache partitioning, insertion/promotion policies, and frequency-based replacement algorithms. This work considers multi-core interference, access frequency, and the insertion/promotion policy together, with the purpose of improving performance and lowering power consumption through a lower cache miss rate.
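To put a number on the dead-time effect described above (our illustration, not a result from the paper): under LRU, a dead block inserted at the MRU position must be passed by at least $A-1$ newer or re-referenced blocks before it reaches the LRU position of its set and becomes evictable, i.e. $t_{\mathrm{dead}} \geq (A-1)$ fills or promotions from below within the same set, so for a 16-way LLC such as the one considered later a dead block occupies its set across at least 15 such events.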
The rest of this paper is organized as follows: Section II describes the background on multi-core shared caches; Section III presents the FLRU algorithm; Section IV describes the methodology and results of our experiments; Section V concludes.
II. BACKGROUND
Most mainstream CMPs use a large-capacity, highly associative shared last-level L2 cache to provide fast access to data. Compared with single-core processors, however, several specific issues arise. First, thread interference is still a big problem and is even more complicated in CMPs: conflicting accesses to the shared L2 cache cause a large performance reduction for concurrently executing threads, and the cache capacity actually available to a thread is influenced by the other threads, which makes its execution time uncertain. Second, for memory-intensive workloads whose working sets are larger than the available cache size, thrashing may appear, which degrades performance. We observe that for some applications there are many cache lines that are brought into the cache and then never used again, or reused only a few times, before they are evicted. We should therefore try to retain some fraction of the working set in the cache so that at least that fraction can contribute to cache hits. Third, LRU does not take access frequency into consideration, although doing so could improve performance. All these features motivate us to treat multi-core processors comprehensively with a new scheme rather than addressing a single aspect.
There have been numerous proposals to improve on LRU. These works can be classified into three main categories: way-granularity cache partitioning [1, 5, 6, 8, 11, 12], dynamic insertion/promotion policies [4, 20, 21, 22, 23, 25], and frequency-based replacement policies [15, 16, 17, 18, 19, 24]. Cache partitioning mitigates inter-application cache contention by partitioning the cache based on how much each application is likely to benefit from it. Dynamic insertion/promotion policies improve cache performance by retaining some fraction of the working set in the cache so that at least that fraction contributes to cache hits. Frequency-based replacement policies take access frequency into consideration when choosing a victim block, which plays an important role in cache accesses. While previous work has also adopted counters for monitoring access frequency, it did not consider shared caches in CMPs, which are a main source of performance loss. Since cache partitioning has become a trend in multi-core processors, it is necessary to combine it with other schemes.
In this paper, we discuss how to combine the benefits of cache partitioning, insertion/promotion policies, and frequency information, and we augment traditional cache partitioning with insertion/promotion and frequency-based replacement policies. First, we partition the cache ways into N parts so that every core has its own ways; the replacement victim is chosen based on this partition, and we allow cache-way stealing among cores. Next, we maintain the M LRU ways of a set as eviction candidates. On a victim selection, we choose one of the M ways according to the partition, the cache-stealing state, and the access frequency information. Finally, we propose an insertion/promotion policy that implements both dynamic insertion and dynamic promotion in a new way, taking the cache partition and the frequency information into account.
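To make the bookkeeping implied by this scheme concrete, the sketch below declares the per-line state and partition parameters it requires. This is a minimal illustration: all identifiers are our own, the value of M is a placeholder, and only the 4-core, 16-way, 8-bit-counter figures mirror the configuration and Table 1 given later in the paper.

#include <array>
#include <cstdint>

// Illustrative parameters: 4 cores and 16 ways follow the configuration in
// Section IV; M, the number of LRU-end eviction candidates, is not fixed by
// the paper, so the value here is only a placeholder.
constexpr int kNumCores    = 4;                    // N
constexpr int kWays        = 16;                   // L
constexpr int kWaysPerCore = kWays / kNumCores;    // even partition: L/N ways per core
constexpr int kCandidates  = 4;                    // M (placeholder)

// Under an even way partition, the owner of a way follows directly from its index.
constexpr int ownerOfWay(int way) { return way / kWaysPerCore; }

// Per-line metadata: a recency rank and the 8-bit access counter of Table 1,
// in addition to the usual tag/valid state.
struct LineMeta {
  bool     valid   = false;
  uint64_t tag     = 0;
  int      lruRank = 0;   // 0 = MRU ... kWays - 1 = LRU
  uint8_t  counter = 0;   // access-frequency counter
};

// One set of the shared L2 is then kWays of such metadata; the M ways with the
// largest lruRank form the eviction-candidate window used by FLRU.
using CacheSet = std::array<LineMeta, kWays>;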
III. ALGORITHM
A. Cache Partition Policy
We adopt an even cache partitioning method: the cache ways are divided into N parts according to the number of cores, so that every core has its own ways. The replacement victim is chosen based on this partition, and we allow cache-way stealing among cores to a certain extent; a table records the stealing information. Cache partitioning is an effective way to reduce cache conflicts, so it should be taken into account when the replacement policy is modified. A more efficient partitioning method could be adopted, which we leave for future work.
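One possible realization of the stealing table just described is sketched below. The paper only specifies the logical fields core(i, j).core and core(i, j).index[x][y]; the bitmask layout, the class interface, and all names are our assumptions.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Entry (owner, host, set) is a bitmask over the ways of 'set', where bit y
// means "way y of core 'host''s partition currently holds a block that
// belongs to core 'owner'".
class StealTable {
 public:
  StealTable(int cores, int sets)
      : cores_(cores), sets_(sets),
        bits_(static_cast<std::size_t>(cores) * cores * sets, 0) {}

  // Core 'owner' has just placed a block into way 'way' of core 'host''s partition.
  void record(int owner, int host, int set, int way) {
    bits_[idx(owner, host, set)] |= (1u << way);
  }
  // The stolen way has been reclaimed; forget it.
  void clear(int owner, int host, int set, int way) {
    bits_[idx(owner, host, set)] &= ~(1u << way);
  }
  // A way of core 'host''s partition in 'set' that is occupied by another
  // core's block, together with that core; {-1, -1} if there is none.
  std::pair<int, int> findStolen(int host, int set) const {
    for (int owner = 0; owner < cores_; ++owner) {
      if (owner == host) continue;
      uint32_t mask = bits_[idx(owner, host, set)];
      for (int way = 0; mask != 0; ++way, mask >>= 1)
        if (mask & 1u) return {way, owner};
    }
    return {-1, -1};
  }

 private:
  std::size_t idx(int i, int j, int set) const {
    return (static_cast<std::size_t>(i) * cores_ + j) * sets_ + set;
  }
  int cores_, sets_;
  std::vector<uint32_t> bits_;
};

In terms of the FLRU algorithm of Section III-D, record() would be called at step 1c, when a core places a block outside its own ways, and findStolen()/clear() at step 1b, when a core reclaims one of its own ways.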
B. Insertion/Promotion Policy
We divide cache replacement into three steps: victim selection, block insertion, and priority promotion. Victim selection provides the block to evict, block insertion decides the insertion position of a new block, and priority promotion raises the priority of a hit block based on the principle of locality. Fig. 1 illustrates an example cache set with sixteen lines, logically organized left to right from highest priority (A: keep in the cache) to lowest priority (P: to be evicted). For LRU replacement, the priority ordering is the access order: the least recently accessed line has the lowest priority for retention. On a miss, traditional victim selection chooses the LRU line to evict and inserts the new block at the MRU position (a); on a hit, a non-MRU block is promoted to the MRU position. It has been observed that many cache lines are accessed only once and then never accessed again [20, 25]. By installing these lines in the highest-priority position, LRU actually maximizes their occupancy of the cache without any performance benefit. The LRU Insertion Policy (LIP) [8] puts a new block into the LRU position instead (b); for non-reused lines this is much better, as it minimizes the time the line spends in the cache. The single incremental promotion policy (SIP) [9] moves the target block forward one step at a time rather than promoting it directly to MRU. LIP and SIP both reduce the time non-reused blocks linger in the cache, but they penalize reused blocks, and both still choose the LRU block as the victim.
Figure 1. Different insertion policies: (a) conventional LRU insertion; (b) LIP insertion; (c) FLRU insertion.
FLRU implements both a dynamic insertion policy and a dynamic promotion policy in a new way that takes the cache partition and the frequency information into account. We insert a new block at the Mth position (M being the number of eviction candidates) instead of at the LRU or MRU position (c), which prevents dead blocks from occupying the cache too long and reused blocks from being evicted too soon. On a cache hit, the block is promoted to the MRU position or to the M position, depending on where the hit block currently resides. Our algorithm chooses a victim block under three conditions based on the frequency, partition, and stealing information, which are elaborated in the FLRU algorithm of Section III-D.
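Concretely, the insertion and promotion rules just described might be implemented over a per-set recency ranking as in the sketch below. The reading that a hit inside the M candidate positions promotes the block only to the top of that window, while a hit above it promotes to MRU, is our interpretation of the promotion steps in Section III-D; M = 4 and all names are illustrative.

#include <vector>

constexpr int kWays = 16;        // L
constexpr int kCandidates = 4;   // M, the LRU-end eviction candidates (placeholder value)

// ranks[way] is the recency rank of each way: 0 = MRU ... kWays - 1 = LRU.
using Ranks = std::vector<int>;

// Move 'way' to rank 'to', shifting the ranks in between by one position.
static void moveToRank(Ranks& ranks, int way, int to) {
  int from = ranks[way];
  for (int w = 0; w < kWays; ++w) {
    if (w == way) ranks[w] = to;
    else if (from < to && ranks[w] > from && ranks[w] <= to) --ranks[w];  // others slide toward MRU
    else if (from > to && ranks[w] >= to && ranks[w] < from) ++ranks[w];  // others slide toward LRU
  }
}

// FLRU insertion: a newly filled block goes to the highest-priority candidate
// position (rank kWays - kCandidates), neither MRU nor LRU.
void onFill(Ranks& ranks, int way) { moveToRank(ranks, way, kWays - kCandidates); }

// FLRU promotion: a hit inside the candidate window moves the block to the top
// of that window; a hit above the window is promoted to MRU.
void onHit(Ranks& ranks, int way) {
  bool inCandidates = ranks[way] >= kWays - kCandidates;
  moveToRank(ranks, way, inCandidates ? kWays - kCandidates : 0);
}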
C. Frequency Policy
We use a counter to record the access frequency of every line and maintain the M LRU ways of a set as eviction candidates. On a victim selection, we choose one of the M ways according to the access frequency of the candidates, the partition, and the stealing information of the requesting core. The block with the lowest frequency is the most likely, but not the guaranteed, victim, because the partition and cache-stealing information must also be taken into account. Table 1 shows the logical cache structure with the counter: each cache line is extended with a counter that records its access frequency, and victim selection considers this counter rather than the LRU information alone. The partition information means that the current core should select a victim within its own ways. The cache-stealing information reflects that a core may place its blocks into another core's ways when its own ways are full and the other core's are available; cache stealing accommodates the fact that different cores have different access patterns. On a victim selection, we take all of this information into consideration, instead of the LRU information only, in order to make the best choice.
TABLE 1. LOGICAL CACHE STRUCTURE WITH COUNTER
Tag    Data   LRU    Counter
1011   ...    0001   11000110
1100   ...    0111   11000011
0110   ...    0010   11000010
0011   ...    0001   11000001
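The counter maintenance and the candidate scoring described in this subsection might look as follows; the 8-bit saturating counter mirrors Table 1, while the tie-break toward the LRU end and all identifiers are our assumptions.

#include <cstdint>
#include <vector>

constexpr int kWays = 16;        // L
constexpr int kCandidates = 4;   // M (placeholder value)

struct Line {
  int     lruRank = 0;   // 0 = MRU ... kWays - 1 = LRU
  uint8_t counter = 0;   // 8-bit access-frequency counter, as in Table 1
};

// Saturating increment on every access to the line.
inline void touch(Line& l) { if (l.counter != UINT8_MAX) ++l.counter; }

// Among the M least-recently-used ways of a set, return the way with the
// lowest counter; ties are broken toward the LRU end.
int lowestFrequencyCandidate(const std::vector<Line>& set) {
  int best = -1;
  for (int way = 0; way < static_cast<int>(set.size()); ++way) {
    if (set[way].lruRank < kWays - kCandidates) continue;      // not a candidate
    if (best < 0 ||
        set[way].counter < set[best].counter ||
        (set[way].counter == set[best].counter && set[way].lruRank > set[best].lruRank))
      best = way;
  }
  return best;   // -1 only if the set holds fewer than M lines
}

As the text notes, the lowest-frequency candidate is only the most likely victim; the partition and stealing checks of Section III-D can override this choice.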
D. FLRU Algorithm
Assume N cores and L ways, with the ways partitioned into N parts so that every core owns L/N ways. A table S maintains the stealing information: core(i, j).core = 1 means that core j's ways contain a block belonging to core i, and core(i, j).index[x][y] = 1 identifies set x and way y of such a block. The victim is selected from among the M LRU lines (w1, w2, ..., wm) of the set. Assume the requesting core is core i and the set index is x; on a cache access the algorithm proceeds as follows:
1) Victim Selection
a) If one of the M LRU lines belongs to core i (some wi lies in core i's own ways), that line is the victim; otherwise
b) if core i's ways contain a block of another core j, i.e., core(j, i).core = 1 and core(j, i).index[x][y] = 1, evict block y and clear the corresponding record of core(j, i); otherwise
c) evict the lowest-frequency block among the M candidates and record the stealing information.
2) Insertion Policy
a) Insert the new block at the highest-priority position of the M candidates.
b) Update the access counter of the inserted block.
3) Promotion Policy
a) On a cache hit, if the hit block is one of the M candidates, promote it to the M position; otherwise
b) promote it to the MRU position.
c) In either case, update the access counter of the hit block.
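Putting step 1 together, victim selection for a miss by core i in set x might be coded as in the following self-contained sketch. The interpretation of a "belonging block" as a candidate way inside the requesting core's own partition is ours, M = 4 is a placeholder, and the helper types condense the structures of the earlier sketches.

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kCores = 4, kWays = 16, kCandidates = 4;   // N, L and a placeholder M

struct Line {
  int     lruRank = 0;   // 0 = MRU ... kWays - 1 = LRU
  uint8_t counter = 0;   // access-frequency counter
  int     owner   = -1;  // core whose partition slice this way belongs to
};

// Flat stealing table: entry (i, j, set) is a way bitmask meaning "core j's
// partition in 'set' holds block(s) belonging to core i".
struct Steal {
  std::vector<uint32_t> bits;
  int sets;
  explicit Steal(int s) : bits(static_cast<std::size_t>(kCores) * kCores * s, 0), sets(s) {}
  uint32_t& at(int i, int j, int set) {
    return bits[(static_cast<std::size_t>(i) * kCores + j) * sets + set];
  }
};

// Victim selection for a miss by 'core' in set 'setIdx', following step 1.
// Assumes set.size() == kWays and that lruRank holds a permutation of 0..kWays-1.
int selectVictim(std::vector<Line>& set, Steal& steal, int core, int setIdx) {
  auto isCandidate = [&](int w) { return set[w].lruRank >= kWays - kCandidates; };

  // 1a) One of the M LRU candidates lies in our own partition: evict it.
  for (int w = 0; w < kWays; ++w)
    if (isCandidate(w) && set[w].owner == core) return w;

  // 1b) Some core j has stolen a way inside our partition: reclaim it and
  //     clear the corresponding record.
  for (int j = 0; j < kCores; ++j) {
    if (j == core) continue;
    uint32_t mask = steal.at(j, core, setIdx);
    for (int w = 0; mask != 0; ++w, mask >>= 1)
      if (mask & 1u) { steal.at(j, core, setIdx) &= ~(1u << w); return w; }
  }

  // 1c) Otherwise evict the lowest-frequency candidate, which necessarily sits
  //     in another core's partition, and record that we are stealing that way.
  int victim = -1;
  for (int w = 0; w < kWays; ++w)
    if (isCandidate(w) && (victim < 0 || set[w].counter < set[victim].counter))
      victim = w;
  steal.at(core, set[victim].owner, setIdx) |= (1u << victim);
  return victim;
}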
Traditional LRU inserts a new block at the MRU position, which extends the occupation time of dead blocks. DIP [9] inserts a new block at the LRU position, which minimizes that time but may cause thrashing, where the LRU block is needed again after it has already been evicted. Our policy instead inserts a new block at the highest-priority position of the M candidates, which avoids both a long occupation and a premature eviction. When selecting a victim line we take the frequency, LRU, and partition information into consideration rather than the LRU information alone, which makes replacement more accurate and yields a lower miss rate. Since our algorithm acts only on a cache miss, it does not affect the other access paths, and since the stealing table is small, it can be kept on chip. Experimental results show that the FLRU replacement algorithm lowers the miss rate and the runtime considerably, by an average of 22% and 28.12%, respectively.
IV. EXPERIMENTAL EVALUATION
A. Experimental Environment
We use the full-system simulator SIMICS together with the General Execution-driven Multiprocessor Simulator (GEMS), which provides detailed timing simulation. We use the SPEC CPU2000 suite as benchmarks, choose three memory-intensive benchmarks, gzip, vpr, and bzip2, and collect their IPC, misses, and runtime. Our baseline CMP contains 4 cores, each with independent 8-way, 32 KB L1 instruction and data caches, and a shared 16-way, 2 MB L2 cache with 64 B blocks. Table 2 gives the detailed simulator configuration.
TABLE 2. SIMULATION CONFIGURATIONS
Processor configuration
  Chip number: 1
  Cores per chip: 4
  Frequency: 2 GHz
  Microarchitecture: out-of-order, 4-wide issue, one cycle per instruction
Memory configuration
  L1 cache: private I-cache and D-cache; each 32 KB, 8-way, 64 B blocks, 3-cycle hit latency
  L2 cache: shared, 2 MB, 16-way, 64 B blocks, 12-cycle hit latency
  Memory: 4 GB, 400-cycle access latency
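For reference, the cache geometry and the counter overhead implied by Tables 1 and 2 can be checked with a few lines of arithmetic; this is our derivation from the listed parameters, assuming one 8-bit counter per L2 line as Table 1 suggests.

#include <cstdio>

int main() {
  // Parameters from Table 2.
  constexpr long kL1Bytes = 32 * 1024, kL2Bytes = 2 * 1024 * 1024, kBlock = 64;
  constexpr int  kL1Ways = 8, kL2Ways = 16;

  constexpr long kL1Sets = kL1Bytes / (kL1Ways * kBlock);   //   64 sets per L1 cache
  constexpr long kL2Sets = kL2Bytes / (kL2Ways * kBlock);   // 2048 sets in the shared L2

  // One 8-bit counter per L2 line (Table 1).
  constexpr long kCounterBytes = kL2Sets * kL2Ways * 1;     // 32 KB of counter storage
  std::printf("L1 sets=%ld  L2 sets=%ld  counter storage=%ld KB (%.2f%% of L2)\n",
              kL1Sets, kL2Sets, kCounterBytes / 1024,
              100.0 * kCounterBytes / kL2Bytes);
  return 0;
}

The resulting 32 KB of counter storage, roughly 1.5% of the 2 MB L2, matches the overhead quoted with the results below.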
B. Experimental Result
We use an evenly partitioned shared L2 cache with LRU replacement as the baseline and evaluate the performance with 4 cores. Fig. 2 shows the runtime of the two schemes. The runtime is shortened by 1.96%, 38.4%, and 44.4%, with an average of 28.12%, because our scheme reduces core interference and makes better use of the different access patterns of the different cores.
Figure 2. Runtime of benchmarks
Fig. 3 illustrates the total misses of the two schemes. Misses are reduced by 6.62%, 34.2%, and 25.2%, an average reduction of 22%. With fewer misses the runtime is also shortened, which shows that our replacement algorithm improves the miss rate compared with LRU.
Figure 3. L2 cache misses
Fig. 4 shows the total IPC of the two schemes. IPC improves by 0.06%, 2.4%, and 3.6%, with an average of 2.02%. With a shorter runtime and a lower miss rate, the IPC improves accordingly, which further confirms the performance improvement.
Figure 4. IPC improvement
In general, our scheme achieves a clear improvement in multi-core performance with low overhead: a 32 KB counter array, accounting for about 1.5% of the L2 cache, and a core-to-core stealing table of small capacity.
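For completeness, the averages quoted above are consistent with a plain arithmetic mean over the three benchmarks (our check): $\frac{6.62 + 34.2 + 25.2}{3} \approx 22.0\%$ for the miss reduction and $\frac{0.06 + 2.4 + 3.6}{3} = 2.02\%$ for the IPC improvement.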
V. CONCLUSIONS
Many previous studies have proposed a variety of mechanisms to improve the performance of the shared last-level cache by exploiting common memory access behaviors. In this work, we have introduced a technique that combines the benefits of cache partitioning, adaptive insertion/promotion, and inter-core capacity stealing. By taking the multiple cores and their partition into consideration, FLRU augments cache partitioning with an insertion/promotion policy and frequency information, and our scheme delivers higher performance than previously proposed techniques.
This work can be extended in several directions. First, we can adopt a more efficient partitioning mechanism. Second, we plan to take way prediction into consideration and combine way prediction, way partitioning, and the insertion/promotion policy to obtain a better result.
ACKNOWLEDGMENT
This work was supported by the General Program of the Science and Technology Development Project of the Beijing Municipal Education Commission (KM201210005022) and other government sponsors. We thank our reviewers for their helpful suggestions, which have led to several important improvements of this work. We also thank all the teachers and students in our lab for helpful discussions.
REFERENCES
[1] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In Proc. of the 11th Int. Symp. on High Performance Computer Architecture, pages 340-351, San Francisco, CA, USA, Feb. 2005.
[2] J. Chang and G. Sohi. Cooperative Cache Partitioning for Chip Multiprocessors. In Proc. of the 21st Int. Conference on Supercomputing, pages 242-252, Seattle, WA, USA, June 2007.
[3] L. R. Hsu, S. K. Reinhardt, R. R. Iyer, and S. Makineni. Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource. In Proc. of the 15th Int. Conference on Parallel Architectures and Compilation Techniques, pages 13-22, Seattle, WA, USA, Sep. 2006.
[4] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. C. Steely Jr., and J. Emer. Adaptive Insertion Policies for Managing Shared Caches. In Proc. of the 17th Int. Conference on Parallel Architectures and Compilation Techniques, 2007.
[5] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In Proc. of the 13th Int. Conference on Parallel Architectures and Compilation Techniques, pages 111-122, Antibes Juan-les-Pins, France, Sep. 2004.
[6] J. Lin, Q. Lu, X. Ding, Z. Zhang, and P. Sadayappan. Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems. In Proc. of the 14th Int. Symp. on High Performance Computer Architecture, pages 367-378, Salt Lake City, UT, USA, Feb. 2008.
[7] M. K. Qureshi. Dynamic Spill-Accept for Scalable High-Performance Caching in CMPs. In Proc. of the 15th Int. Symp. on High Performance Computer Architecture, Raleigh, NC, USA, Feb. 2009.
[8] M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In Proc. of the 39th Int. Symp. on Microarchitecture, pages 423-432, Orlando, FL, USA, Dec. 2006.
[9] N. Rafique, W.-T. Lin, and M. Thottethodi. Architectural Support for Operating System-Driven CMP Cache Management. In Proc. of the 15th Int. Conference on Parallel Architectures and Compilation Techniques, pages 2-12, Seattle, WA, USA, Sep. 2006.
[10] S. Srikantaiah, M. Kandemir, and M. J. Irwin. Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors. In Proc. of the 13th Symp. on Architectural Support for Programming Languages and Operating Systems, Seattle, WA, USA, Mar. 2009.
[11] H. S. Stone, J. Turek, and J. L. Wolf. Optimal Partitioning of Cache Memory. IEEE Trans. on Computers, 41(9):1054-1068, Sep. 1992.
[12] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic Partitioning of Shared Cache Memory. Journal of Supercomputing, 28(1):7-26, 2004.
[13] W.-F. Lin and S. K. Reinhardt. Predicting Last-Touch References under Optimal Replacement. University of Michigan Tech. Rep. CSE-TR-447-02, 2002.
[14] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation Techniques for Storage Hierarchies. IBM Systems Journal, 9(2), 1970.
[15] J. T. Robinson and M. V. Devarakonda. Data Cache Management Using Frequency-Based Replacement. In SIGMETRICS, vol. 18, pages 134-142, May 1990.
[16] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, and S. L. Min. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies. IEEE Transactions on Computers, vol. 50, pages 1352-1361, Dec. 2001. doi:10.1109/TC.2001.970573.
[17] J. Alghazo, A. Akaaboune, and N. Botros. SF-LRU Cache Replacement Algorithm. In Proc. of the 2004 Int. Workshop on Memory Technology, Design and Testing (MTDT '04), pages 19-24, 2004.
[18] M. Kharbutli and Y. Solihin. Counter-Based Cache Replacement Algorithms. In Proc. of the IEEE Int. Conference on Computer Design, page 61, Oct. 2005. doi:10.1109/ICCD.2005.41.
[19] H. Dybdahl, P. Stenström, and L. Natvig. An LRU-based Replacement Algorithm Augmented with Frequency of Access in Shared Chip-Multiprocessor Caches. In Proc. of the 2006 Workshop on Memory Performance (MEDEA '06), pages 45-52, 2006.
[20] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer. Adaptive Insertion Policies for High Performance Caching. In Proc. of the 34th Int. Symp. on Computer Architecture (ISCA '07), pages 381-391, May 2007.
[21] Y. Xie and G. H. Loh. PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-Core Shared Caches. In Proc. of the 36th Int. Symp. on Computer Architecture, June 2009.
[22] J. D. Kron, B. Prumo, and G. H. Loh. Double-DIP: Augmenting DIP with Adaptive Promotion Policies to Manage Shared L2 Caches.
[23] X. Sui, J. Wu, G. Chen, Y. Tang, and X. Zhu. Augmenting Cache Partitioning with Thread-Aware Insertion/Promotion Policies to Manage Shared Caches. In Proc. of the 7th ACM Int. Conference on Computing Frontiers (CF '10), pages 79-80, 2010.
[24] A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer. High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). In Proc. of the 37th Int. Symp. on Computer Architecture (ISCA '10), pages 60-71, June 2010.
[25] X. Zhang, C. Li, H. Wang, and D. Wang. A Cache Replacement Policy Using Adaptive Insertion and Re-Reference Prediction. In Proc. of the 22nd Int. Symp. on Computer Architecture and High Performance Computing, pages 95-102, Oct. 2010.