High Performance Cache Replacement Using
Re-Reference Interval Prediction (RRIP)
Aamer Jaleel, Kevin Theobald,
Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD
International Symposium on Computer Architecture (ISCA 2010)
Motivation
• Factors making caching important
• Increasing ratio of CPU speed to memory speed
• Multi-core poses new challenges for shared cache management
• LRU has been the standard replacement policy at the LLC
• However LRU has problems!
Problems with LRU Replacement
[Figure: a working set of size Wsize accessed in an LLC of size LLCsize; every reference misses when the working set exceeds the cache, and hits are lost when scans intervene]
• Working set larger than the cache causes thrashing (every reference misses)
• References to non-temporal data (scans) discard the frequently referenced working set
• Our studies show that scans occur frequently in many commercial workloads
Desired Behavior from Cache Replacement
[Figure: desired hit/miss behavior for the same two access patterns]
• Working set larger than the cache → preserve some of the working set in the cache
• Recurring scans → preserve the frequently referenced working set in the cache
Prior Solutions to Enhance Cache Replacement
• Working set larger than the cache → preserve some of the working set in the cache
  • Dynamic Insertion Policy (DIP) → thrash-resistance with minimal changes to HW
• Recurring scans → preserve the frequently referenced working set in the cache
  • Least Frequently Used (LFU) → addresses scans
  • LFU adds complexity and also performs poorly for recency-friendly workloads
GOAL: Design a High Performing Scan-Resistant Policy that Requires Minimal Changes to HW
Belady’s Optimal (OPT) Replacement Policy
• Replacement decisions made using perfect knowledge of the future reference order
• Victim Selection Policy:
  • Replace the block that will be re-referenced furthest in the future

Physical Way #                               0   1   2   3   4   5   6   7
Cache Tag                                    a   c   b   h   f   d   g   e
"Time" when block will be referenced next    4   13  11  5   3   6   9   1

Victim block: 'c' in way 1 (re-referenced furthest in the future)
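To make the victim rule concrete, here is a minimal Python sketch (my illustration, not from the talk); opt_victim and the trace layout are assumed names, and the trace is constructed so each block's next-use time matches the table above.

```python
def opt_victim(tags, trace, now):
    """Belady's OPT: evict the resident block whose next reference
    lies furthest in the future (never-reused blocks win outright)."""
    def next_use(tag):
        for t in range(now, len(trace)):   # scan the future references
            if trace[t] == tag:
                return t
        return float("inf")                # never referenced again
    return max(range(len(tags)), key=lambda way: next_use(tags[way]))

# The example set above: 'c' (next used at time 13) is re-referenced
# furthest in the future, so way 1 is the victim.
tags  = ["a", "c", "b", "h", "f", "d", "g", "e"]
trace = ["x", "e", "x", "f", "a", "h", "d", "x",
         "x", "g", "x", "b", "x", "c"]          # trace[t] = tag at time t
assert opt_victim(tags, trace, now=0) == 1
```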
Practical Cache Replacement Policies
• Replacement decisions made by predicting the future reference order
• Victim Selection Policy:
  • Replace the block predicted to be re-referenced furthest in the future
• Continually update predictions of the future reference order
  • Natural update opportunities are on cache fills and cache hits

Physical Way #                                        0    1    2    3    4    5    6    7
Cache Tag                                             a    c    b    h    f    d    g    e
"Predicted time" when block will be referenced next   ~4   ~13  ~11  ~5   ~3   ~6   ~9   ~1

Victim block: 'c' in way 1 (predicted to be re-referenced furthest in the future)
LRU Replacement in Prediction Framework
• The "LRU chain" maintains the re-reference prediction
  • Head of chain (i.e. MRU position) predicted to be re-referenced soon
  • Tail of chain (i.e. LRU position) predicted to be re-referenced far in the future
  • LRU predicts that blocks are re-referenced in reverse order of reference
• Rename the "LRU chain" to the "Re-Reference Prediction (RRP) Chain"
  • Rename "MRU position" → RRP Head and "LRU position" → RRP Tail

RRP Head (MRU position)   h  g  f  e  d  c  b  a   RRP Tail (LRU position)
Chain position (stored with each cache block):   0  1  2  3  4  5  6  7
Practicality of Chain Based Replacement
• Problem: Chain-based replacement is too expensive!
  • log2(associativity) bits required per cache block (16-way requires 4 bits/block)
• Solution: LRU chain positions can be quantized into different buckets
  • Each bucket corresponds to a predicted Re-Reference Interval
  • The value of the bucket is called the Re-Reference Prediction Value (RRPV)
  • Hardware cost: 'n' bits per block [ideally you would like n < log2(associativity)]
Representation of Quantized Replacement (n = 2)
RRP chain (RRP Head → RRP Tail):  h  g  f  e  d  c  b  a
RRPV (n=2): 0 / 1 / 2 / 3, predicting 'near-immediate' / 'intermediate' / 'far' / 'distant' re-reference

Physical Way #   0   1   2   3   4   5   6   7
Cache Tag        a   c   b   h   f   d   g   e
RRPV             3   2   3   0   1   1   0   1
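As an aside, one plausible way to quantize exact chain positions into n-bit buckets; the function name and the pos // (assoc / 2^n) mapping are my assumptions, consistent with an 8-way, n=2 chain, and not necessarily the paper's exact grouping.

```python
def quantize(chain_pos, assoc=8, n=2):
    """Map an exact RRP-chain position (0 = RRP Head) to an n-bit bucket."""
    return chain_pos // (assoc // (1 << n))

chain = ["h", "g", "f", "e", "d", "c", "b", "a"]   # head -> tail
print({tag: quantize(pos) for pos, tag in enumerate(chain)})
# -> {'h': 0, 'g': 0, 'f': 1, 'e': 1, 'd': 2, 'c': 2, 'b': 3, 'a': 3}
```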
Emulating LRU with Quantized Buckets (n=2)
• Victim Selection Policy: evict a block with the distant RRPV (i.e. 2^n - 1 = '3')
  • If no distant RRPV (i.e. '3') is found, increment all RRPVs and repeat the search
  • If multiple are found, we need a tie breaker; let us always start the search from physical way '0'
• Insertion Policy: insert the new block with RRPV = '0'
• Update Policy: cache hits update the block's RRPV to '0'

[Table: starting from the state above, the search from way 0 finds 'a' (RRPV = 3) as the victim; new block 's' is inserted in its place with RRPV = 0, and a hit resets the hit block's RRPV to 0]
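The victim search with RRPV aging fits in a few lines. A minimal sketch under my own naming (find_victim, a per-way rrpv list), not the paper's hardware:

```python
def find_victim(rrpv, n=2):
    """Search from way 0 for a 'distant' block (RRPV == 2**n - 1);
    if none exists, age every block by one and search again."""
    distant = (1 << n) - 1
    while True:
        for way, value in enumerate(rrpv):
            if value == distant:
                return way                 # first match breaks RRPV ties
        for way in range(len(rrpv)):       # no distant block found:
            rrpv[way] += 1                 # increment all RRPVs, retry

rrpv = [3, 2, 3, 0, 1, 1, 0, 1]            # the per-way state shown above
assert find_victim(rrpv) == 0              # ways 0 and 2 tie; way 0 wins
# Emulating LRU: insert new blocks with RRPV = 0 and reset a block's
# RRPV to 0 on every hit.
```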
But We Want to do BETTER than LRU!!!
Re-Reference Interval Prediction (RRIP)
• Framework enables re-reference predictions to be tuned at insertion/update
  • Unlike LRU, can use a non-zero RRPV on insertion
  • Unlike LRU, can use a non-zero RRPV on cache hits
• Static Re-Reference Interval Prediction (SRRIP)
  • Determine the best insertion/update prediction using profiling [and apply it to all apps]
• Dynamic Re-Reference Interval Prediction (DRRIP)
  • Dynamically determine the best re-reference prediction at insertion
Static RRIP Insertion Policy – Learn Block’s Re-reference Interval
• Key Idea: do not give new blocks too much (or too little) time in the cache
  • Predict that a new cache block will not be re-referenced soon
  • Insert the new block with some RRPV other than '0'
  • Similar to inserting in the "middle" of the RRP chain
    • However, it is NOT identical to a fixed insertion position on the RRP chain (see paper)

[Table: new block 's' is inserted with RRPV = 2 instead of 0]
Static RRIP Update Policy on Cache Hits
• Hit Priority (HP)
  • Like LRU, always update RRPV = 0 on cache hits
  • Intuition: predicts that blocks receiving hits after insertion will be re-referenced soon

[Table: a hit on a resident block resets its RRPV to 0]
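Combining the SRRIP insertion rule with the HP update gives a compact model of one cache set. This is a sketch under assumed names (RRIPSet), not the paper's implementation: misses insert at RRPV = 2^n - 2, hits promote to RRPV = 0.

```python
class RRIPSet:
    """One cache set under SRRIP with the Hit Priority (HP) update rule."""
    def __init__(self, assoc=8, n=2):
        self.distant = (1 << n) - 1          # eviction-candidate RRPV
        self.insert_rrpv = self.distant - 1  # 2**n - 2: a "long" interval
        self.tags = [None] * assoc
        self.rrpv = [self.distant] * assoc   # empty ways evict first

    def access(self, tag):
        if tag in self.tags:                 # hit: predict near-immediate
            self.rrpv[self.tags.index(tag)] = 0
            return True
        way = self._victim()                 # miss: evict, then insert
        self.tags[way] = tag
        self.rrpv[way] = self.insert_rrpv
        return False

    def _victim(self):
        while True:
            for way, value in enumerate(self.rrpv):
                if value == self.distant:
                    return way
            for way in range(len(self.rrpv)):
                self.rrpv[way] += 1

s = RRIPSet()
assert s.access("a") is False and s.access("a") is True
```

Setting insert_rrpv = 0 recovers the LRU emulation shown earlier, and insert_rrpv = 2^n - 1 gives the "no update on insertion" policy that DRRIP duels against.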
An Alternative Update Scheme Also Described in Paper
SRRIP Hit Priority Sensitivity to Cache Insertion Prediction at LLC
[Figure: % fewer cache misses relative to LRU as a function of the RRPV assigned at insertion (0 through 2^n - 1), for n = 1 through n = 5, averaged across PC games, multimedia, server, and SPEC06 workloads on a 16-way 2MB LLC]
n=1 is in fact the NRU replacement policy commonly used in commercial processors
Regardless of 'n', Static RRIP Performs Best When RRPVinsertion is 2^n - 2
Regardless of 'n', Static RRIP Performs Worst When RRPVinsertion is 2^n - 1
Why Does RRPVinsertion of 2^n - 2 Work Best for SRRIP?
[Figure: a working set of size Wsize receiving hits, interleaved with scans of length Slen]
• Recall, NRU (n=1) is not scan-resistant
• For scan resistance, RRPVinsertion MUST differ from the RRPV of the working-set blocks
  • Before a scan, the re-reference prediction of the active working set is '0'
  • A larger insertion RRPV tolerates larger scans
  • The maximum insertion prediction (i.e. 2^n - 2) works best!
• In general, the working set is re-referenced (hit) after a scan IF:
  Slen < (RRPVinsertion - Starting-RRPVworkingset) * (LLCsize - Wsize)
• With the working set inserted at RRPV 0, SRRIP is scan-resistant for Slen < RRPVinsertion * (LLCsize - Wsize)
  • For example, with n = 2 (RRPVinsertion = 2) and LLCsize - Wsize = 512 blocks, scans of up to 2 * 512 = 1024 distinct blocks pass through without evicting the working set

For n > 1, Static RRIP is Scan-Resistant! What about Thrash Resistance?
DRRIP: Extending Scan-Resistant SRRIP to Be Thrash-Resistant
[Figure: on a thrashing pattern, SRRIP misses on every reference while DRRIP preserves part of the working set and hits]
• Always using the same prediction for all insertions will thrash the cache
• Like DIP, we need to preserve some fraction of the working set in the cache
  • Extend DIP to SRRIP to provide thrash resistance
• Dynamic Re-Reference Interval Prediction:
  • Dynamically select between inserting blocks with 2^n - 1 and 2^n - 2 using Set Dueling
  • Inserting blocks with 2^n - 1 is the same as "no update on insertion"
DRRIP Provides Both Scan-Resistance and Thrash-Resistance
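A sketch of the set-dueling selection in the same spirit (the constants, names, and leader-set layout are my assumptions; the paper's bimodal component also occasionally inserts with 2^n - 2, which is omitted here):

```python
# Assumed constants for illustration: 1024 sets, one SRRIP and one
# "BIP-style" leader out of every 32 sets, 10-bit saturating PSEL counter.
NUM_SETS, PSEL_MAX = 1024, 1023
psel = PSEL_MAX // 2

def leader_type(set_idx):
    if set_idx % 32 == 0:
        return "srrip"      # always inserts with 2**n - 2
    if set_idx % 32 == 1:
        return "brrip"      # inserts with 2**n - 1 ("no update on insertion")
    return "follower"

def insertion_rrpv(set_idx, n=2):
    """Choose the insertion RRPV for a miss in this set."""
    kind = leader_type(set_idx)
    if kind == "follower":  # followers copy whichever leader misses less
        kind = "srrip" if psel < PSEL_MAX // 2 else "brrip"
    return (1 << n) - 2 if kind == "srrip" else (1 << n) - 1

def on_leader_miss(set_idx):
    """A miss in a leader set is a vote against that leader's policy."""
    global psel
    if leader_type(set_idx) == "srrip":
        psel = min(PSEL_MAX, psel + 1)
    elif leader_type(set_idx) == "brrip":
        psel = max(0, psel - 1)
```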
Performance Comparison of Replacement Policies
[Figure: % performance improvement over LRU for NRU, DIP, SRRIP, and DRRIP on a 16-way 2MB LLC, across GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL]
Static RRIP Always Outperforms LRU Replacement
Dynamic RRIP Further Improves Performance of Static RRIP
Cache Replacement Competition (CRC) Results
[Figure: % performance improvement over LRU for NRU, 3-bit SRRIP, 3-bit DRRIP, and Dueling Segmented-LRU (the CRC winner), averaged across PC games, multimedia, enterprise server, and SPEC CPU2006 workloads; private caches: 16-way 1MB, 65 single-threaded workloads; shared caches: 16-way 4MB, 165 4-core workloads]
Un-tuned DRRIP Would Be Ranked 2nd and Is Within 1% of the CRC Winner
Unlike the CRC Winner, DRRIP Does Not Require Any Changes to the Cache Structure
Total Storage Overhead (16-way Set Associative Cache)
• LRU:         4 bits / cache block
• NRU:         1 bit / cache block
• DRRIP-3:     3 bits / cache block
• CRC Winner:  ~8 bits / cache block
DRRIP Outperforms LRU With Less Storage Than LRU
NRU Can Be Easily Extended to Realize DRRIP!
Summary
• Scan-resistance is an important problem in commercial workloads
• State-of-the-art policies do not address scan-resistance
• We propose a simple and practical replacement policy:
  • Static RRIP (SRRIP) for scan-resistance
  • Dynamic RRIP (DRRIP) for thrash-resistance and scan-resistance
• DRRIP requires ONLY 3 bits per block
  • In fact, it incurs less storage than LRU
• Un-tuned DRRIP would take 2nd place in the Cache Replacement Championship (CRC)
  • DRRIP requires significantly less storage than the CRC winner
Q&A
Static RRIP with n=1
• Static RRIP with n = 1 is the commonly used NRU policy (polarity inverted)
  • Victim Selection Policy: evict a block with RRPV = '1'
  • Insertion Policy: insert the new block with RRPV = '0'
  • Update Policy: cache hits update the block's RRPV to '0'

[Table: 1-bit RRPV per way; new block 's' is inserted with RRPV = 0, and a hit also resets the block's RRPV to 0]
But NRU Is Not Scan-Resistant
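For reference, the n = 1 victim search reduces to the familiar NRU bit scan; a minimal sketch with assumed names:

```python
def nru_victim(nru_bits):
    """Evict the first way whose bit is '1'; if none, set all bits to '1'."""
    while True:
        for way, bit in enumerate(nru_bits):
            if bit == 1:
                return way
        nru_bits[:] = [1] * len(nru_bits)

bits = [0, 1, 1, 0, 1, 0, 1, 1]   # 1-bit RRPV per way, as in the table
assert nru_victim(bits) == 1
# Both insertion and a hit set the block's bit to 0 ("re-reference soon").
```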
SRRIP Update Policy on Cache Hits
• Frequency Priority (FP):
  • Improve the re-reference prediction to be shorter than before on each hit (i.e. RRPV--)
  • Intuition: like LFU, predicts that frequently referenced blocks should have higher priority to stay in the cache

[Table: a hit decrements the hit block's RRPV by one instead of resetting it to 0]
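Relative to HP, the FP update is a one-line change; a sketch with assumed names:

```python
def fp_hit_update(rrpv, way):
    """Frequency Priority: each hit shortens the prediction by one step."""
    rrpv[way] = max(0, rrpv[way] - 1)      # saturate at 'near-immediate'

rrpv = [2, 2, 1, 3, 0, 1, 1, 0]
fp_hit_update(rrpv, 0)
assert rrpv[0] == 1                        # a second hit would reach 0
```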
SRRIP-HP and SRRIP-FP Cache Performance
[Figure: % fewer cache misses relative to LRU for SRRIP-FP and SRRIP-HP, n = 1 through n = 5, across GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL]
• SRRIP-HP has 2X better cache performance relative to LRU than SRRIP-FP
• We do not need to precisely detect frequently referenced blocks
• We need to preserve blocks that receive hits
Common Access Patterns in Workloads
(Games, Multimedia, Enterprise Server, Mixed Workloads; see the sketch after this list)
• Stack Access Pattern: (a1, a2, …, ak, …, a2, a1)^A
  • Solution: for any 'k', LRU performs well on such access patterns
• Streaming Access Pattern: (a1, a2, …, ak) for k >> assoc
  • No Solution: cache replacement cannot solve this problem
• Thrashing Access Pattern: (a1, a2, …, ak)^A, for k > assoc
  • LRU receives no cache hits due to cache thrashing
  • Solution: preserve some fraction of the working set in the cache (e.g. use BIP)
    • BIP does NOT update replacement state for the majority of cache insertions
• Mixed Access Pattern: [(a1, a2, …, ak, …, a2, a1)^A (b1, b2, …, bm)]^N, for m > assoc - k
  • LRU always misses on the frequently referenced (a1, a2, …, ak, …, a2, a1)^A
  • (b1, b2, …, bm) is commonly referred to as a scan in the literature
  • In the absence of scans, LRU performs well on such access patterns
  • Solution: preserve the frequently referenced working set in the cache (e.g. use LFU)
    • LFU replaces infrequently referenced blocks in the presence of frequently referenced blocks
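A small sketch that generates these reference streams for experiments (the helper names, and the reps/n_outer parameters standing in for the exponents A and N, are mine):

```python
def stack(k, reps):
    """(a1, a2, ..., ak, ..., a2, a1) repeated `reps` times."""
    updown = [f"a{i}" for i in range(1, k + 1)]
    updown += updown[-2::-1]               # ak-1 ... a1 on the way back
    return updown * reps

def thrash(k, reps):
    """(a1, ..., ak) repeated `reps` times; thrashes LRU when k > assoc."""
    return [f"a{i}" for i in range(1, k + 1)] * reps

def mixed(k, reps, m, n_outer):
    """[(stack of k)^reps followed by a scan (b1, ..., bm)]^n_outer."""
    scan = [f"b{i}" for i in range(1, m + 1)]
    return (stack(k, reps) + scan) * n_outer

# e.g. a mixed pattern whose scans defeat LRU but not a scan-resistant policy:
refs = mixed(k=4, reps=2, m=12, n_outer=3)
```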
Performance of Hybrid Replacement Policies at LLC
[Figure: % fewer misses relative to LRU for DIP and HYB(LRU, LFU) on halflife2, halo, gunmetal2, final-fantasy, photoshop, renderman, sap, tpc-c, app-server, cactusADM_b, sphinx3_a, hmmer_n, mcf_r, bzip2_c, and the category averages GAMES, MULTIMEDIA, SERVER, SPEC06, ALL; 4-way OoO processor, 32KB L1, 256KB L2, 2MB LLC]
• DIP addresses SPEC workloads but NOT PC games & multimedia workloads
• Real-world workloads prefer scan-resistance over thrash-resistance
Understanding LRU Enhancements in the Prediction Framework
• Recent policies, e.g., DIP, say "insert new blocks at the 'LRU position'"
  • What does it mean to insert an MRU line in the LRU position?
  • Prediction that the new block will be re-referenced later than the existing blocks in the cache
  • What DIP really means is "insert new blocks at the RRP Tail"
• Other policies, e.g., PIPP, say "insert new blocks in the 'middle of the LRU chain'"
  • Prediction that the new block will be re-referenced at an intermediate time

The Re-Reference Prediction Framework Helps Describe the Intuitions Behind Existing Replacement Policy Enhancements
Performance Comparison of Replacement Policies
[Figure: % performance improvement over LRU for NRU, DIP, SRRIP, Best RRP Chain Insertion, and DRRIP on a 16-way 2MB LLC, across GAMES, MULTIMEDIA, SERVER, SPEC06, and ALL]
Static RRIP Always Outperforms LRU Replacement
Dynamic RRIP Further Improves Performance of Static RRIP