IMPROVING CACHE MANAGEMENT POLICIES USING DYNAMIC REUSE DISTANCES
Nam Duong1, Dali Zhao1, Taesu Kim1, Rosario Cammarota1, Mateo Valero2, Alexander V. Veidenbaum1
1University of California, Irvine
2Universitat Politecnica de Catalunya and Barcelona Supercomputing Center
CACHE MANAGEMENT

Cache management has been a hot research topic.

[Diagram: a taxonomy of cache management policies]
- Single-core replacement: LRU, NRU, EELRU, DIP, RRIP, ... and PDP (this work)
- Single-core bypass: SDP, ... and PDP (this work)
- Shared-cache partitioning: UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, ... and PDP (this work)
- Prefetch-aware management: PDP (this work)
OVERVIEW

- Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution
- Introduced a new concept, the Protecting Distance (PD), which is shown to achieve such a balance
- Developed single- and multi-core hit rate models as a function of the PD, the cache configuration, and program behavior
  - The models are used to dynamically compute the best PD
- Showed that PD-based cache management policies improve performance for both single- and multi-core systems
OUTLINE

1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
DEFINITIONS

- The (line) reuse distance (RD): the number of accesses to the same cache set between two accesses to the same line
  - This metric is directly related to hit rate
- The reuse distance distribution (RDD)
  - A distribution of observed reuse distances
  - A program signature for a given cache configuration

[Figure: RDDs of representative benchmarks 403.gcc, 436.cactusADM, and 464.h264ref; X-axis: the RD (< 256)]
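The two definitions above can be made concrete with a short sketch. Python is used only for illustration; the function names are ours, and we adopt the convention of counting every intervening access to the set, including the reusing access itself.

```python
from collections import defaultdict

def reuse_distances(set_accesses):
    """Compute line reuse distances within a single cache set.

    `set_accesses` is the ordered list of line tags that accessed one set.
    The reuse distance of an access is the number of accesses to the same
    set since the previous access to the same line.
    """
    last_seen = {}   # tag -> index of its previous access to this set
    distances = []
    for t, tag in enumerate(set_accesses):
        if tag in last_seen:
            distances.append(t - last_seen[tag])
        last_seen[tag] = t
    return distances

def rdd(distances, max_rd=256):
    """Build the reuse distance distribution {N_i} (the RDD).

    Distances beyond max_rd are folded into the last bin, mirroring the
    slides' RD < 256 cutoff.
    """
    counts = defaultdict(int)
    for d in distances:
        counts[min(d, max_rd)] += 1
    return counts
```

For example, in the access stream A B A C B A, the second access to A has reuse distance 2 and the remaining reuses have distance 3.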
FUTURE BEHAVIOR PREDICTION

- Cache management policies use past reference behavior to predict future accesses
  - Prediction accuracy is critical
- Prediction in some of the prior policies:
  - LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity)
  - Early eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict
  - RRIP: predicts whether a line will be reused in the near, long, or distant future
BALANCING REUSE AND CACHE POLLUTION

- Key to good performance (high hit rate):
  - Cache lines must be reused as much as possible before eviction
  - AND must be evicted soon after the "last" reuse to give space to new lines
- The former can be achieved by using the reuse distance and actively preventing eviction
  - "Protecting" a line from eviction
- The latter can be achieved by evicting a line when it is not reused within this distance
- There is an optimal reuse distance balancing the two
  - It is called the Protecting Distance (PD)
EXAMPLE: 436.CACTUSADM

- A majority of lines are reused at 64 or fewer accesses
  - There are multiple peaks at different reuse distances
- Reuse is maximized if lines are kept in the cache for 64 accesses
  - Lines may not be reused if evicted before that
  - Lines kept beyond that are likely to pollute the cache
- Assume that no lines are kept longer than a given RD

[Figure: reduction in miss rate over LRU for 436.cactusADM as a function of that RD (0-256); Y-axis: 0% to 60%]
THE PROTECTING DISTANCE (PD)

- A distance at which a majority of lines are covered
  - A single value for all sets
  - Predicted based on the current RDD
- Questions to answer:
  - Why does using the PD achieve the balance?
  - How to dynamically find the PD for an application and a cache configuration?
  - How to build the PD-based management policies?
THE SINGLE-CORE PDP

- A cache tag contains the line's remaining PD (RPD)
  - A line can be evicted when its RPD = 0
- The RPD of an inserted or promoted line is set to the predicted PD
  - A line is promoted on a hit
  - The RPDs of the other lines in the set are decremented
- Example: a 4-way cache with a predicted PD of 7
  - RPDs before a hit access: [1, 4, 6, 3]
  - After the hit, the hit line's RPD is set to the PD and all RPDs in the set are decremented: [0, 6, 5, 2]
THE SINGLE-CORE PDP (CONT.)

- Selecting a victim on a miss:
  - A line with RPD = 0 can be replaced
- Two cases when all RPDs > 0 (no unprotected lines):
  - Caches without bypass (inclusive):
    - Unused lines are less likely to be reused than reused lines, so replace the unused line with the highest RPD first
    - If there is no unused line, replace the line with the highest RPD
  - Caches with bypass (non-inclusive): bypass the new line
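The victim-selection rules above can be sketched as follows. This is an illustrative software model, not the hardware; the `(rpd, reused)` pair per way is our own representation.

```python
def select_victim(lines, bypass_allowed):
    """Pick a victim way on a miss under the PD-based policy.

    `lines` is a list of (rpd, reused) pairs, one per way of the set;
    `reused` is True if the line has hit since insertion. Returns a way
    index, or None to bypass the incoming line.
    """
    # 1. Any unprotected line (RPD == 0) can be replaced.
    for way, (rpd, reused) in enumerate(lines):
        if rpd == 0:
            return way

    # All lines are still protected (RPD > 0).
    if bypass_allowed:
        # Non-inclusive cache: do not insert the new line at all.
        return None

    # Inclusive cache: prefer the unused line with the highest RPD,
    # since unused lines are less likely to be reused than reused ones.
    unused = [(rpd, way) for way, (rpd, reused) in enumerate(lines) if not reused]
    if unused:
        return max(unused)[1]

    # No unused line: evict the line with the highest RPD.
    return max((rpd, way) for way, (rpd, _) in enumerate(lines))[1]
```

For instance, in a set with RPDs [2, 3, 4] where only the second line is unused, a non-bypassing cache evicts that unused line, while a bypassing cache declines to insert the new line.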
EVALUATION OF THE STATIC PDP

- Static PDP: use the best static PD (PD < 256) for each benchmark
  - SPDP-NB: static PDP with replacement only
  - SPDP-B: static PDP with replacement and bypass
- Performance: in general, DRRIP < SPDP-NB < SPDP-B
  - 436.cactusADM: a 10% additional miss reduction
  - Otherwise, the two static PDP policies have similar performance
  - 483.xalancbmk: 3 different execution windows show different behavior under SPDP-B

[Figure: miss reduction over DRRIP (-5% to 30%) for SPDP-NB and SPDP-B across benchmarks]
436.CACTUSADM: EXPLAINING THE PERFORMANCE DIFFERENCE

- How do the evicted lines occupy the cache?
- PDP has less pollution caused by long-RD lines in the cache than DRRIP:
  - DRRIP:
    - Early evicted lines (before 16 accesses): 75% of accesses, but occupy only 4% of the cache
    - Late evicted lines (after 16 accesses): 2% of accesses, but occupy 8% of the cache → pollution
  - SPDP-NB: early and late evicted lines: 42% of accesses, but occupy only 4%
  - SPDP-B: late evicted lines: 1% of accesses, occupy 3% of the cache → yielding cache space to useful lines

[Figure: access and occupancy breakdowns (hit / bypass / evicted before 16 accesses / evicted after 16 accesses) for DRRIP, SPDP-NB, and SPDP-B]
CASE STUDY: 483.XALANCBMK

- There is a close relationship between the hit rate, the PD, and the RDD
- The best PD differs across execution windows (483.xalancbmk.1, .2, .3)
  - And across programs
- Need a dynamic policy that finds the best PD
- Need a model to drive the search

[Figure: RDDs (X-axis: the RD, 0-128) and SPDP-B hit rates (0%-80%) for the three 483.xalancbmk execution windows]
A HIT RATE MODEL FOR A NON-INCLUSIVE CACHE

- The model estimates the hit rate as a function of d_p and the RDD:

  E(d_p) = (Hits / Accesses) · (1/W)
         = Σ_{i=1..d_p} N_i / [ Σ_{i=1..d_p} N_i · i + (N_t − Σ_{i=1..d_p} N_i) · (d_p + d_e) ]

- {N_i}, N_t: the RDD
- d_p: the protecting distance
- d_e: experimentally set to W (W: the cache associativity)
- The model is used to find the PD maximizing the hit rate

[Figure: RDD, E, and hit rate for 464.h264ref, 403.gcc, and 436.cactusADM; X-axis: the RD (0-256)]
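A software rendering of the model above can look like the following sketch. The constant 1/W factor is dropped, since it does not change which d_p maximizes E; the dict-based RDD representation and function names are ours.

```python
def model_hit_rate(rdd, n_t, d_p, w):
    """Evaluate E(d_p) up to a constant factor.

    rdd[i] = N_i, the number of accesses observed at reuse distance i;
    n_t is the total access count; w is the associativity; d_e = w as
    in the slides. Hits are the accesses with RD <= d_p; the denominator
    is the cache way-time those lines occupy: i accesses for a hit at
    distance i, and d_p + d_e accesses for a line that never hits.
    """
    d_e = w
    hits = sum(n for i, n in rdd.items() if i <= d_p)
    occupancy = sum(n * i for i, n in rdd.items() if i <= d_p) \
                + (n_t - hits) * (d_p + d_e)
    return hits / occupancy if occupancy else 0.0

def best_pd(rdd, n_t, w, max_pd=256):
    """Search d_p in [1, max_pd] for the value maximizing E."""
    return max(range(1, max_pd + 1),
               key=lambda d: model_hit_rate(rdd, n_t, d, w))
```

With a synthetic RDD that has a dominant peak at RD = 64 and a small tail at RD = 200, the search settles on d_p = 64: protecting past the peak only adds occupancy without adding hits.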
PDP CACHE ORGANIZATION

- Blocks added around the LLC: an RD Sampler, an RD Counter Array, and PD Compute Logic; the computed PD is fed back to the LLC
- The RD Sampler tracks accesses to several cache sets
  - It sits in the L2 miss/WB stream and can reduce the sampling rate
  - It measures the reuse distance of each new access
- The RD Counter Array collects the number of accesses at each RD = i (the N_i counters) and the total N_t
  - To reduce overhead, each counter covers a range of RDs
- The PD Compute Logic finds the PD that maximizes E
  - The computed PD is used in the next interval (0.5M LLC accesses)
- Reasonable hardware overhead
  - 2 or 3 bits per tag to store the RPD
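The sampler and counter array can be modeled in a few lines. This is a behavioral sketch under our own naming and sizing assumptions (sampled-set selection, bucket granularity), not the hardware structures themselves.

```python
class RDMonitor:
    """Behavioral sketch of the RD Sampler + RD Counter Array.

    Only a handful of sampled sets are tracked, and each counter covers
    a range of reuse distances (`granularity`) to keep the array small.
    """
    def __init__(self, sampled_sets, max_rd=256, granularity=16):
        self.sampled_sets = set(sampled_sets)
        self.granularity = granularity
        self.counters = [0] * (max_rd // granularity)  # coarse N_i buckets
        self.total = 0                                 # N_t
        self.clock = {}   # per-set access counter
        self.last = {}    # (set, tag) -> access count at last touch
        self.max_rd = max_rd

    def access(self, set_idx, tag):
        """Record one LLC access; non-sampled sets are ignored."""
        if set_idx not in self.sampled_sets:
            return
        now = self.clock.get(set_idx, 0)
        key = (set_idx, tag)
        if key in self.last:
            rd = now - self.last[key]
            if rd < self.max_rd:
                self.counters[rd // self.granularity] += 1
        self.last[key] = now
        self.clock[set_idx] = now + 1
        self.total += 1
```

At the end of an interval, the counters and total would be handed to the PD compute step and then reset for the next interval.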
PDP VS. EXISTING POLICIES

Management policy | Replacement(*) | Bypass(*) | Reuse | Pollution | Distance measurement | Model
LRU               | Yes | No  | No  | Yes | Stack-based  | No
EELRU [1]         | Yes | No  | No  | Yes | Stack-based  | Probabilistic
DIP [2]           | Yes | No  | Yes | No  | N/A          | No
RRIP [3]          | Yes | No  | Yes | No  | N/A          | No
SDP [4]           | No  | Yes | Yes | No  | N/A          | No
PDP               | Yes | Yes | Yes | Yes | Access-based | Hit rate

(*) As originally proposed. The Reuse/Pollution columns indicate which side of the balance each policy addresses.

- EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance
  - However, lines are not always guaranteed to be protected

[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: Simple and effective adaptive page replacement. In SIGMETRICS'99.
[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA'07.
[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA'10.
[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO'10.
PD-BASED SHARED CACHE PARTITIONING

- Each thread has its own PD (thread-aware)
  - The counter array is replicated per thread
  - The sampler and compute logic are shared
- A thread's PD determines its cache partition
  - Its lines occupy the cache longer if its PD is large
  - The cache is implicitly partitioned per the needs of each thread using thread PDs
- The problem is to find a set of thread PDs that together maximize the hit rate
SHARED-CACHE HIT RATE MODEL

- Extending the single-core approach
  - Compute a vector <PD> (T = number of threads):

  E(<PD>) = (Σ_T Hits_T / Σ_T Accesses_T) · (1/W)

- Exhaustive search for <PD> is not practical
  - A heuristic search algorithm finds a combination of threads' RDD peaks that maximizes the hit rate
  - The single-core model generates the top 3 peaks per thread
  - The complexity is O(T^2)
- See the paper for more detail
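Since the slides only summarize the peak-combination search, the following is a hypothetical sketch in the same spirit: each thread contributes a few candidate PDs (its top RDD peaks), and the search repeatedly improves one thread's choice while holding the others fixed. This is our illustration, not the published algorithm.

```python
def search_thread_pds(peaks, score):
    """Heuristic search over per-thread candidate PDs.

    `peaks[t]` holds the candidate PDs for thread t (e.g. its top 3 RDD
    peaks); `score(pd_vector)` is the modeled combined hit rate E(<PD>).
    Each pass tries every single-thread substitution and keeps strict
    improvements, so a pass costs O(T * max_candidates) evaluations.
    """
    pd = [p[0] for p in peaks]   # start from each thread's first peak
    best_score = score(pd)
    improved = True
    while improved:
        improved = False
        for t, candidates in enumerate(peaks):
            for c in candidates:
                trial = pd[:t] + [c] + pd[t + 1:]
                s = score(trial)
                if s > best_score:       # strict improvement: terminates
                    pd, best_score = trial, s
                    improved = True
    return pd
```

With a toy scoring function that prefers PDs of 64 and 32 for two threads, the sweep moves each thread onto its preferred peak and then stops.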
EVALUATION METHODOLOGY

- CMP$im simulator, LLC replacement
- Target cache: LLC

Cache         | Parameters
DCache        | 32KB, 8-way, 64B, 2 cycles
ICache        | 32KB, 4-way, 64B, 2 cycles
L2Cache       | 256KB, 8-way, 64B, 10 cycles
L3Cache (LLC) | 2MB, 16-way, 64B, 30 cycles
Memory        | 200 cycles
EVALUATION METHODOLOGY (CONT.)

- Benchmarks: SPEC CPU 2006
  - Excluded those which did not stress the LLC
- Single-core:
  - Compared to EELRU, SDP, DIP, DRRIP
- Multi-core:
  - 4- and 16-core configurations, 80 workloads each
  - The workloads generated by randomly combining benchmarks
  - Compared to UCP, PIPP, TA-DRRIP
- Our policy: PDP-x, where x is the number of bits per cache line
SINGLE-CORE PDP

- PDP-x, where x is the number of bits per cache line
- Each benchmark is executed for 1B instructions
- Best with 3 bits per line, but still better than prior work at 2 bits

[Figure: IPC improvement over DIP (-30% to 30%) for SDP, DRRIP, EELRU, PDP-2, PDP-3, PDP-8, and SPDP-B]
ADAPTATION TO PROGRAM PHASES

- 5 benchmarks which demonstrate significant phase changes: 403.gcc, 429.mcf, 450.soplex, 482.sphinx3, 483.xalancbmk
- Each benchmark is run for 5B instructions

[Figure: change of the PD over time for the five benchmarks; X-axis: 1M LLC accesses; Y-axis: PD (0-200)]
ADAPTATION TO PROGRAM PHASES (CONT.)

[Figure: IPC improvement over DIP (-5% to 15%) for RRIP, PDP-2, PDP-3, and PDP-8 on the phase-changing benchmarks]
PD-BASED CACHE PARTITIONING FOR 16 CORES

- Compared to UCP, PIPP, PDP-2, and PDP-3, normalized to TA-DRRIP

[Figure: per-workload improvement over TA-DRRIP across the 80 workloads, plus averages, for the W, T, and H metrics; policies: UCP, PIPP, PDP-2, PDP-3; Y-axes roughly -20% to 40%]
HARDWARE OVERHEAD

Policy | Per-line bits | Overhead (%)
DIP    | 4 | 0.8%
RRIP   | 2 | 0.4%
SDP    | 4 | 1.4%
PDP-2  | 2 | 0.6%
PDP-3  | 3 | 0.8%
OTHER RESULTS

- Exploration of PDP cache parameters
- Cache bypass fraction
- Prefetch-aware PDP
- PD-based cache management policy for 4-core
CONCLUSIONS

- Proposed the concept of the Protecting Distance (PD)
- Showed that it can be used to better balance reuse and cache pollution
- Developed a hit rate model as a function of the PD, program behavior, and cache configuration
- Proposed PD-based management policies for both single- and multi-core systems
- PD-based policies outperform existing policies
THANK YOU!
BACKUP SLIDES

- RDD, E, and hit rate of all benchmarks
RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS

[Figures: RDD, modeled E, and real hit rate (X-axis: the RD, 0-256) for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, and the three 483.xalancbmk windows]