Achieving Non-Inclusive Cache Performance with Inclusive Caches

Temporal Locality Aware (TLA) Cache Management Policies
Aamer Jaleel, Eric Borch, Malini Bhandaru,
Simon Steely Jr., Joel Emer
In International Symposium on Microarchitecture (MICRO), December 2010
Presented by: Yingying Tian
High Performing Cache Hierarchy in CMPs
• Cache hierarchy: multiple interacting caches on chip
• Tradeoff between cache latency and hit rate
• Chip Multiprocessors (CMPs) widen the gap between processor and memory speeds
Goal: an efficient and high performing cache hierarchy
Key issue: Inclusion or Not?
Size of the cache hierarchy vs. simplicity of cache coherence
*Some materials are taken from the original presentation slides
Inclusive Caches
• Simplify cache coherence
• Waste cache capacity (effective hierarchy capacity = size of the LLC)
• The inclusion property forces invalidation of blocks that still have high temporal locality in the core caches – the back-invalidate problem
→ hundreds of cycles of memory access penalty
Back-Invalidate Problem
• Inclusion property: all the core caches (those above the LLC) must be subsets of the last-level cache (LLC).
• Back-invalidation: when a block is evicted from the LLC, inclusion is enforced by invalidating that block from all caches in the hierarchy – the invalidated block is an inclusion victim.
• Small core caches filter temporal locality from the LLC → some inclusion victims still have high temporal locality in the core caches – hot inclusion victims.
Back-Invalidate Problem (Cont.)
• Consider the following access pattern in a 2-level inclusive cache hierarchy: … a, b, a, c, a, d, a, e, a, f …
[Figure: LRU-stack contents of L1 and L2 as the pattern is processed]
• Every reference to 'a' hits in L1, so 'a' stays at the MRU position and keeps high temporal locality in L1.
• L2 sees only the L1 misses (b, c, d, e, …), so 'a' drifts toward the LRU position in L2.
• The reference to 'e' misses in L2, evicts 'a', and back-invalidates it from L1, so 'a' is evicted from the whole hierarchy.
• The next reference to 'a' therefore misses, even though 'a' had high temporal locality in L1 (reproduced by the sketch below).
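The same effect can be reproduced with a small simulation. The sketch below is illustrative only (not the paper's simulator): a 2-entry L1 and a 4-entry L2 with LRU replacement and inclusion enforced by back-invalidation, with the sizes and reference string chosen so that 'a' becomes an inclusion victim.

# Minimal sketch of an inclusive 2-level hierarchy with LRU replacement.
# The 2-entry L1 / 4-entry L2 sizes are illustrative, chosen so the hot
# block 'a' becomes an inclusion victim on the reference to 'e'.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()              # oldest entry = LRU, newest = MRU

    def access(self, block):
        """Touch a block; return (hit, evicted victim or None)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)       # promote to MRU
            return True, None
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)   # evict the LRU block
        self.blocks[block] = True
        return False, victim

    def invalidate(self, block):
        self.blocks.pop(block, None)

l1, l2 = LRUCache(2), LRUCache(4)
for ref in "abacadaea":                          # ... a, b, a, c, a, d, a, e, a ...
    l1_hit, _ = l1.access(ref)                   # an L1 eviction needs no L2 action: L2 is inclusive
    if not l1_hit:                               # only L1 misses reach L2
        l2_hit, l2_victim = l2.access(ref)
        if not l2_hit:
            print(f"ref {ref}: misses the entire hierarchy")
        if l2_victim is not None:                # inclusion: back-invalidate everywhere
            l1.invalidate(l2_victim)             # may remove a block that is hot in L1
            print(f"ref {ref}: L2 evicts {l2_victim!r} and back-invalidates it from L1")

Running this prints that the reference to 'e' evicts 'a' from L2 and back-invalidates it from L1, so the final reference to 'a' misses the whole hierarchy even though 'a' was at the MRU position in L1.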
Back-Invalidate Problem (Cont.)
Intel Core i7: 1:8 MLC:LLC size ratio, inclusive LLC.
AMD Phenom II: 1:4 MLC:LLC size ratio, non-inclusive LLC.
Temporal Locality Aware (TLA) Cache Management Policies
• Goal: implement an efficient and high performing cache hierarchy
• Approach: eliminate hot inclusion victims to improve inclusive cache performance
Outline
• Background and motivation
• Problem description
• Temporal Locality Aware (TLA) Cache
Management Policy Suite
• Evaluation
• Conclusion
Three Temporal Locality Aware (TLA) Cache Management Policies:
• Temporal Locality Hints (TLH)
• Early Core Invalidation (ECI)
• Query Based Selection (QBS)
Temporal Locality Hints (TLH)
• Convey the temporal locality of hot blocks in the core caches by sending a hint to the LLC on every core-cache hit; the hint updates the replacement state of that block in the LLC.
• Significantly reduces the number of inclusion victims.
• But the number of hint requests to the LLC is extremely large and does not scale well with increasing core counts (even with filter optimizations).
• Serves mainly as a limit study (a sketch of the hint path follows below).
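As an illustration only (reusing the LRUCache class from the back-invalidate sketch above, not the paper's implementation), the TLH hook on the core-cache hit path could look like this:

# Sketch of the TLH idea: on every core-cache hit, send a hint so the LLC
# also promotes the block in its own replacement state. Reuses the LRUCache
# class from the back-invalidate sketch; the miss path is unchanged and omitted.
def core_access_with_tlh(core_cache, llc, block):
    hit, victim = core_cache.access(block)
    if hit and block in llc.blocks:
        llc.blocks.move_to_end(block)    # hint: refresh the block's LLC replacement state
    return hit, victim
# Because a hint is generated on *every* core-cache hit, hint traffic grows with
# the hit rate and the core count, which is why TLH is treated as a limit study.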
Early Core Invalidation (ECI)
• Derives the temporal locality of a block before it becomes LRU in the LLC: the LLC chooses the block at the [LRU-1] position and invalidates it in the core caches while keeping it in the LLC.
• By observing the core's subsequent requests, the LLC derives the block's temporal locality.
• Performed on each LLC miss.
Early Core Invalidation (ECI) cont.
• Early-invalidated block – ECI block
• If the ECI block is hot in some core cache → it is re-requested by that core → L1 miss but LLC hit → moved back to MRU in the LLC to preserve its temporal locality.
• If the ECI block is not hot (not re-requested, or re-requested only after a long time) → it is evicted from the LLC on the next LLC miss to the corresponding set.
• Lower-traffic solution (the number of LLC misses is much smaller than the number of core-cache hits).
• But the prediction is less accurate (it assumes the ECI block is hot in some core cache) – what if the ECI block is hot, but not hot enough to be re-requested in time? (See the sketch below.)
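A minimal sketch of the ECI fill path, again reusing the LRUCache class from the back-invalidate example; the [LRU-1] choice follows the slide, everything else is illustrative:

# Sketch of ECI on an LLC miss: fill normally, then early-invalidate the
# next replacement candidate from the core caches while keeping it in the LLC.
def llc_miss_with_eci(llc, core_caches, block):
    _, victim = llc.access(block)                 # normal fill; victim is the old LRU block
    if victim is not None:
        for core in core_caches:
            core.invalidate(victim)               # ordinary back-invalidation
    if llc.blocks:
        eci_block = next(iter(llc.blocks))        # old [LRU-1] block, now the next LRU candidate
        if eci_block != block:
            for core in core_caches:
                core.invalidate(eci_block)        # early core invalidation: the LLC keeps the block
    # If a core re-requests the ECI block soon, it misses in its core cache but hits
    # in the LLC; that hit promotes the block to MRU (via llc.access), which is how
    # the LLC derives that the block still has temporal locality.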
Query Based Selection (QBS)
• Infers the temporal locality of a block in the LLC by querying the core caches on each LLC miss.
• The LLC selects a replacement candidate and queries the core caches to check whether the block is present in any of them.
• Only blocks that are not present in any core cache are replaced.
• If the candidate is present in some core cache, the LLC updates its replacement state to MRU and selects and queries another replacement candidate.
Query Based Selection (QBS) Cont.
• The QBS victim-selection process is hidden by the memory latency.
• The cache controller can limit the number of queries issued on an LLC miss.
• Based on the experiments, sending 2 queries is sufficient to achieve the performance benefits.
• Performs similarly to a non-inclusive cache hierarchy.
• The on-chip communication overhead is extremely large. [presenter's note; not mentioned in the paper]
(A sketch of the selection loop follows below.)
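A minimal sketch of the QBS victim-selection loop, reusing the LRUCache class from the back-invalidate example. The query limit of 2 follows the slide; the fallback when every queried candidate is still cached is an assumption.

# Sketch of QBS: before evicting, ask the core caches whether the candidate
# is still resident; if so, treat that as temporal locality and retry.
# Assumes the LLC set is full (we are selecting a victim on a miss).
def qbs_select_victim(llc, core_caches, max_queries=2):
    for _ in range(max_queries):
        candidate = next(iter(llc.blocks))                        # current LRU candidate
        if not any(candidate in c.blocks for c in core_caches):   # query the core caches
            return candidate                                      # not cached anywhere: safe to evict
        llc.blocks.move_to_end(candidate)                         # cached in a core: promote to MRU, retry
    return next(iter(llc.blocks))   # query budget exhausted: assumed fallback to the plain LRU victim

Because the selection only runs on LLC misses, the queries can be overlapped with the memory access for the missing block, which is how the latency is hidden.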
An example (… a, b, a, c, a, d, a, e, a, f, a …)
Experimental Methodology
• CMP$im: x86 simulator
• Baseline: 2-core CMP, 3-level inclusive cache hierarchy
• L1 I/D: 4-way, 32KB, 64B block size, 1-cycle access latency
• L2: 8-way, 256KB, 64B block size, 10-cycle access latency, non-inclusive
• L3 (LLC): shared, 16-way, 2MB, 24-cycle access latency, enforces inclusion
• Main memory: 150-cycle access latency
• Benchmarks: 15 benchmarks selected from the SPEC CPU2006 suite based on program behavior (core-cache fitting, LLC fitting, LLC thrashing – 5 benchmarks of each)
• Total workloads: 105 2-core workloads (15 choose 2)
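For reference, the baseline parameters above can be summarized as plain data; this is just a restatement of the list, not CMP$im's actual input format.

# Restatement of the baseline configuration listed above; not CMP$im's format.
baseline = {
    "cores": 2,
    "l1":  {"assoc": 4,  "size_kb": 32,  "block_b": 64, "latency_cycles": 1},
    "l2":  {"assoc": 8,  "size_kb": 256, "block_b": 64, "latency_cycles": 10, "inclusion": "non-inclusive"},
    "llc": {"assoc": 16, "size_mb": 2,   "latency_cycles": 24, "shared": True, "inclusion": "inclusive"},
    "memory_latency_cycles": 150,
}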
Performance
[Figure: relative performance of TLH-L1, ECI, and QBS versus a non-inclusive hierarchy, normalized to the inclusive baseline, across the 105 2-core workloads]
• Non-inclusive hierarchy: 6.1% average improvement over the inclusive baseline
• TLH-L1: 5.2% average improvement
• ECI: 3.4% average improvement
• QBS: 6.6% average improvement, matching or slightly exceeding non-inclusive
Performance (Cont.)
[Figure: relative performance of TLH-L1, QBS, non-inclusive, and exclusive hierarchies as the MLC:LLC size ratio varies from 1:2 to 1:16]
QBS performs similarly to non-inclusive caches for all cache ratios
Performance (Cont.)
[Figure: scalability of QBS in 2-core, 4-core, and 8-core CMPs at a 1:4 cache size ratio, compared with non-inclusive and exclusive hierarchies]
Conclusion
• Temporal Locality Aware (TLA) cache management
• Retains the benefits of inclusion while minimizing the back-invalidate problem
• A TLA-managed inclusive cache matches the performance of a non-inclusive cache
Thanks! Questions?