Instruction Based Memory Distance Analysis and its Application to Optimization

Changpeng Fang
Steve Carr
Soner Önder
Zhenlin Wang
Motivation

- Widening gap between processor and memory speed
  • the memory wall
- Static compiler analysis has limited capability
  • regular array references only
  • index arrays
  • integer code
- Reuse distance prediction across program inputs
  • number of distinct memory locations accessed between two references to the same memory location
  • applicable to more than just regular scientific code
  • locality as a function of data size
  • predictable on a whole-program and per-instruction basis for scientific codes
Motivation

- Memory distance
  • a dynamic, quantifiable distance in terms of memory references between two accesses to the same memory location
  • reuse distance
  • access distance
  • value distance
- Is memory distance predictable across both integer and floating-point codes?
  • predict miss rates
  • predict critical instructions
  • identify instructions for load speculation
Related Work

- Reuse distance
  • Mattson et al. '70
  • Sugumar and Abraham '94
  • Beyls and D'Hollander '02
  • Ding and Zhong '03
  • Zhong, Dropsho and Ding '03
  • Shen, Zhong and Ding '04
  • Fang, Carr, Önder and Wang '04
  • Marin and Mellor-Crummey '04
- Load speculation
  • Moshovos and Sohi '98
  • Chrysos and Emer '98
  • Önder and Gupta '02
Background

- Memory distance
  • can use any granularity (cache line, address, etc.)
  • either forward or backward
  • represented as a pattern
- Represent memory distance as a pattern
  • divide consecutive distance ranges into intervals
  • we use powers of 2 up to 1K and then 1K intervals (see the sketch after this slide)
- Data size
  • the largest reuse distance for an input set
  • characterize reuse distance as a function of the data size
- Given two sets of patterns for two runs, can we predict a third set of patterns given its data size?
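A minimal sketch of how per-instruction backward reuse distance histograms with the binning above could be collected from an address trace. The (pc, address) trace format, function names, and the per-access scan over live addresses are illustrative assumptions, not the ATOM-based instrumentation used in the talk.

```python
from collections import defaultdict

def bin_index(distance):
    """Bin a reuse distance: power-of-two bins up to 1K, then 1K-wide bins."""
    if distance < 1024:
        return distance.bit_length()
    return 10 + distance // 1024

def collect_histograms(trace):
    """trace: iterable of (pc, address) pairs, e.g. at cache-line granularity.

    Returns {pc: {bin: count}}, where each counted distance is a backward reuse
    distance: the number of distinct addresses touched since the previous
    access to the same address.  The linear scan is for clarity, not speed.
    """
    last_use = {}                                    # address -> time of last access
    histograms = defaultdict(lambda: defaultdict(int))
    for time, (pc, addr) in enumerate(trace):
        if addr in last_use:
            prev = last_use[addr]
            distance = sum(1 for t in last_use.values() if t > prev)
            histograms[pc][bin_index(distance)] += 1
        last_use[addr] = time
    return histograms
```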
Background

- Let d_i^1 be the distance of the ith bin in the first pattern and d_i^2 be that of the second pattern. Given the data sizes s_1 and s_2, we can fit the memory distances using

    d_i^1 = c_i + e_i * f_i(s_1)
    d_i^2 = c_i + e_i * f_i(s_2)

- Given c_i, e_i, and f_i, we can predict the memory distance of another input set from its data size (a fitting sketch follows this slide)
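A sketch of this two-point fit, assuming the functional form f_i is already chosen (the deck mentions constant, linear, and square-root patterns); the function names, tolerance, and example numbers are illustrative. In the deck's scheme the same fit is applied separately to the min, max, and mean of each interval.

```python
import math

# Candidate fitting functions f(s).
FORMS = {
    "constant": lambda s: 0.0,
    "sqrt":     lambda s: math.sqrt(s),
    "linear":   lambda s: float(s),
}

def fit_bin(d1, s1, d2, s2, form="linear", tol=0.05):
    """Fit d = c + e * f(s) to the two training observations of one bin.

    With two runs the system is solved exactly; a bin whose distance barely
    changes between the runs is treated as constant.
    """
    if abs(d2 - d1) <= tol * max(d1, d2, 1.0):
        return ("constant", d1, 0.0)
    f = FORMS[form]
    e = (d2 - d1) / (f(s2) - f(s1))
    c = d1 - e * f(s1)
    return (form, c, e)

def predict_bin(model, s):
    """Predict the memory distance of this bin for a run with data size s."""
    form, c, e = model
    return c + e * FORMS[form](s)

# Example: a bin whose mean distance grows linearly with the data size.
model = fit_bin(d1=2000, s1=10_000, d2=4000, s2=20_000, form="linear")
print(predict_bin(model, 40_000))   # -> 8000.0
```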
Instruction Based Memory Distance Analysis

- How can we represent the memory distance of an instruction?
- For each active interval, we record 4 words of data
  • min, max, mean, frequency
- Some locality patterns cross interval boundaries
  • merge adjacent intervals i and i + 1 if min_{i+1} - max_i <= max_i - min_i (a code sketch follows the merging example)
  • the merging process stops when a minimum frequency is reached
  • needed to make reuse distance predictable
- The set of merged intervals makes up the instruction's memory distance pattern
Merging Example
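In the spirit of the merging example, a minimal sketch of the merging rule from the previous slide. The interval representation is assumed, and the interpretation that merging continues until the merged interval reaches a minimum relative frequency is an assumption about the stopping rule.

```python
def merge_intervals(intervals, min_freq=0.1):
    """intervals: list of dicts with 'min', 'max', 'mean', 'freq', sorted by 'min'.

    Adjacent intervals i and i + 1 are merged while the gap to the next interval
    is no larger than the width of the current one
    (min[i+1] - max[i] <= max[i] - min[i]) and the merged interval has not yet
    reached the minimum frequency.
    """
    if not intervals:
        return []
    merged = []
    cur = dict(intervals[0])
    for nxt in intervals[1:]:
        close = nxt["min"] - cur["max"] <= cur["max"] - cur["min"]
        if close and cur["freq"] < min_freq:
            total = cur["freq"] + nxt["freq"]
            cur["mean"] = (cur["mean"] * cur["freq"] + nxt["mean"] * nxt["freq"]) / total
            cur["max"] = nxt["max"]
            cur["freq"] = total
        else:
            merged.append(cur)
            cur = dict(nxt)
    merged.append(cur)
    return merged
```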
What do we do with patterns?

- Verify that we can predict patterns given two training runs
  • coverage
  • accuracy
- Predict miss rates for instructions
- Predict loads that may be speculated
Prediction Coverage

- Prediction coverage indicates the percentage of instructions whose memory distance can be predicted
- An instruction is covered if it
  • appears in both training runs
  • has an access pattern that appears in both runs whose memory distance does not decrease with an increase in data size (spatial locality); called a regular pattern
  • has the same number of intervals in both runs
- For each instruction, we predict its ith pattern by
  • curve fitting the ith pattern of both training runs
  • applying the fitting function to construct a new min, max and mean for the third run
- Simple, fast prediction
Prediction Accuracy

- An instruction's memory distance is correctly predicted if all of its patterns are predicted correctly
  • predicted and observed patterns fall in the same interval
  • or, given two patterns A and B such that B.min <= A.max <= B.max,

      (A.max - max(A.min, B.min)) / max(B.max - B.min, A.max - A.min) >= 0.9

  (a sketch of this check follows this slide)
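A sketch of the accuracy test above; the dictionary field names and the equality test for "fall in the same interval" are illustrative assumptions.

```python
def patterns_agree(A, B, threshold=0.9):
    """Check whether a predicted pattern A matches an observed pattern B.

    True if both describe the same interval, or if the overlap between them,
    relative to the wider of the two, is at least `threshold`
    (assuming B.min <= A.max <= B.max, as on the slide).
    """
    if (A["min"], A["max"]) == (B["min"], B["max"]):
        return True
    if not (B["min"] <= A["max"] <= B["max"]):
        return False
    overlap = A["max"] - max(A["min"], B["min"])
    wider = max(B["max"] - B["min"], A["max"] - A["min"])
    return wider > 0 and overlap / wider >= threshold
```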
Experimental Methodology

- Use 11 CFP2000 and 11 CINT2000 benchmarks
  • others don't compile correctly
- Use ATOM to collect reuse distance statistics
- Use the test and train data sets for the training runs
- Evaluation based on dynamic weighting
- Report reuse distance prediction accuracy
  • value and access distance results are very similar
Reuse Distance Prediction

Suite    | Patterns %constant | Patterns %linear | Coverage % | Accuracy %
CFP2000  | 85.1               | 7.7              | 93.0       | 97.6
CINT2000 | 81.2               | 5.1              | 91.6       | 93.8
Coverage issues

- Reasons for no coverage
  1. instruction does not appear in at least one training run
  2. reuse distance of test is larger than train
  3. number of patterns does not remain constant in both training runs

Suite    | Reason 1 | Reason 2 | Reason 3
CFP2000  | 4.2%     | 0.3%     | 2.5%
CINT2000 | 2.2%     | 4.4%     | 1.8%
Prediction Details

- Other patterns
  • 183.equake has 13.6% square-root patterns
  • 200.sixtrack, 186.crafty: all constant (no data size change)
- Low coverage
  • 189.lucas: 31% of static memory operations do not appear in the training runs
  • 164.gzip: the test reuse distance is greater than the train reuse distance
    • cache-line alignment
Number of Patterns

Suite    | 1     | 2     | 3    | 4    | 5
CFP2000  | 81.8% | 10.5% | 4.8% | 1.4% | 1.5%
CINT2000 | 72.3% | 10.9% | 7.6% | 4.6% | 5.3%
Miss Rate Prediction

- Predict a miss for a reference if the backward reuse distance is greater than the cache size
  • neglects conflict misses
- Accuracy of the miss rate prediction:

    1 - |actual - predicted| / max(actual, predicted)

  (a sketch follows this slide)
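A minimal sketch of miss-rate prediction from a backward reuse distance histogram and of the accuracy metric above. A fully associative LRU cache is assumed, so conflict misses are ignored as noted; the histogram representation is illustrative.

```python
def predicted_miss_rate(histogram, cache_lines):
    """histogram: {reuse_distance: count} for one instruction, with distances
    measured in cache lines (e.g. the mean of each predicted pattern).

    A reference is predicted to miss when its backward reuse distance exceeds
    the number of lines in the cache.
    """
    total = sum(histogram.values())
    misses = sum(count for dist, count in histogram.items() if dist > cache_lines)
    return misses / total if total else 0.0

def prediction_accuracy(actual, predicted):
    """Accuracy metric from the slide: 1 - |actual - predicted| / max(actual, predicted)."""
    if max(actual, predicted) == 0:
        return 1.0
    return 1.0 - abs(actual - predicted) / max(actual, predicted)
```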
Miss Rate Prediction Methodology

- Three miss-rate prediction schemes
  • TCS (test cache simulation)
    • use the actual miss rates from running the program on the test data as the reference-data miss rates
  • RRD (reference reuse distance)
    • use the actual reuse distance of the reference data set to predict the miss rate for the reference data set
    • an upper bound on using reuse distance
  • PRD (predicted reuse distance)
    • use the predicted reuse distance for the reference data set to predict the miss rate
Cache Configurations

config no. | L1                | L2
1          | 32K, fully assoc. | 1M, fully assoc.
2          | 32K, 2-way        | 1M, 8-way
3          | 32K, 2-way        | 1M, 4-way
4          | 32K, 2-way        | 1M, 2-way
L1 Miss Rate Prediction Accuracy

Suite    | PRD  | RRD  | TCS
CFP2000  | 97.5 | 98.4 | 95.1
CINT2000 | 94.4 | 96.7 | 93.9
L2 Miss Rate Accuracy

         | 2-way L2            | Fully Associative L2
Suite    | PRD | RRD   | TCS   | PRD | RRD   | TCS
CFP2000  | 91% | 93%   | 87%   | 97% | 99.9% | 91%
CINT2000 | 91% | 95%   | 87%   | 94% | 99.9% | 89%
Critical Instructions

- Can we determine which instructions are critical in terms of cache performance, given the reuse distance for each instruction?
- An instruction is critical if it is in the set of instructions that generate the most L2 cache misses
  • the top miss-rate instructions whose cumulative misses account for 95% of the misses in a program
- Use the execution frequency of one training run to determine the relative contribution (number of misses) of each instruction
- Compare the actual critical instructions with the predicted ones (see the sketch after this slide)
  • use cache configuration 2
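A sketch of selecting the critical instructions: rank instructions by estimated L2 misses (for example, predicted miss rate times the execution frequency observed in a training run) and take the smallest set covering 95% of all misses. Names and the input dictionary are illustrative.

```python
def critical_instructions(miss_counts, coverage=0.95):
    """miss_counts: {pc: estimated number of L2 misses for that instruction}.

    Returns the smallest set of instructions whose cumulative misses account
    for `coverage` of all misses in the program.
    """
    total = sum(miss_counts.values())
    critical, accumulated = set(), 0.0
    for pc, misses in sorted(miss_counts.items(), key=lambda kv: kv[1], reverse=True):
        if accumulated >= coverage * total:
            break
        critical.add(pc)
        accumulated += misses
    return critical
```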
Critical Instruction Prediction

Suite    | PRD | RRD | TCS | %pred | %act
CFP2000  | 92% | 98% | 51% | 1.66% | 1.67%
CINT2000 | 89% | 98% | 53% | 0.94% | 0.97%
Critical Instruction Patterns

Suite    | 1    | 2    | 3    | 4    | 5
CFP2000  | 22.1 | 38.4 | 20.0 | 12.8 | 6.7
CINT2000 | 18.7 | 14.5 | 25.5 | 22.5 | 18.0
Miss Rate Discussion

- PRD performs better than TCS when data size is a factor
- TCS performs better when data size doesn't change much and there are conflict misses
- PRD is much better than TCS at identifying the critical instructions
  • these instructions should be the targets of optimization
Memory Disambiguation

- Load speculation
  • Can a load safely be issued prior to a preceding store?
  • Use memory distance to predict the likelihood that a store to the same address has not finished
- Access distance
  • the number of memory operations between a store to and a load from the same address
  • correlated with instruction distance and window size
  • use only two runs
    • if the access distance is not constant, use the access distance of the larger of the two data sets as a lower bound on the access distance
When to Speculate

- Definitely "no"
  • access distance less than the threshold
- Definitely "yes"
  • access distance greater than the threshold
- Threshold lies between intervals
  • compute the predicted mis-speculation frequency (PMSF)
  • speculate if PMSF < 5%
  • when the threshold does not intersect an interval, PMSF is the total of the frequencies that lie below the threshold
  • otherwise, the intersected interval contributes

      (threshold - min) / (max - min) * frequency

  (a PMSF sketch follows this slide)
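A sketch of the PMSF computation described above; the interval representation and the assumption that frequencies sum to 1 over a load's pattern are illustrative.

```python
def pmsf(pattern, threshold):
    """Predicted mis-speculation frequency for a load's access-distance pattern.

    pattern: list of intervals {'min', 'max', 'freq'}, with freq summing to 1.
    Intervals entirely below the threshold contribute their full frequency;
    an interval the threshold falls inside contributes the fraction
    (threshold - min) / (max - min) of its frequency.
    """
    risk = 0.0
    for iv in pattern:
        if iv["max"] < threshold:                     # store likely still outstanding
            risk += iv["freq"]
        elif iv["min"] < threshold <= iv["max"]:      # threshold intersects this interval
            risk += iv["freq"] * (threshold - iv["min"]) / (iv["max"] - iv["min"])
    return risk

def should_speculate(pattern, threshold, limit=0.05):
    """Speculate the load only if the predicted mis-speculation frequency is below 5%."""
    return pmsf(pattern, threshold) < limit
```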
Value-based Prediction

- Memory dependence only if addresses and values match

    store a1, v1
    store a2, v2
    store a3, v3
    load  a4, v4

  • the load can be moved ahead if a1 = a2 = a3 = a4, v2 = v3, and v1 ≠ v2
- The access distance of a load to the first store in a sequence of stores storing the same value is called the value distance (see the sketch after this slide)
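A sketch contrasting access distance and value distance on a store/load trace like the one above. The trace representation is hypothetical, and distances are counted in memory operations between the two accesses' positions.

```python
def access_and_value_distances(trace):
    """trace: list of ('store' | 'load', address, value) tuples, in program order.

    For each load, the access distance reaches back to the most recent store to
    the same address; the value distance reaches back to the first store in the
    latest run of stores to that address that all wrote the value the load reads.
    """
    stores = {}                      # address -> list of (index, value) stores, in order
    results = []
    for i, (op, addr, value) in enumerate(trace):
        if op == "store":
            stores.setdefault(addr, []).append((i, value))
            continue
        history = stores.get(addr)
        if not history:
            continue                 # load with no earlier store to this address
        access_idx = history[-1][0]
        value_idx = access_idx
        for idx, v in reversed(history):
            if v != value:
                break
            value_idx = idx
        results.append((i, i - access_idx, i - value_idx))
    return results                   # (load position, access distance, value distance)

# Example mirroring the slide: v2 == v3 and v1 != v2, so the value distance
# reaches back to the second store while the access distance reaches the third.
trace = [("store", 0x10, 1), ("store", 0x10, 2), ("store", 0x10, 2), ("load", 0x10, 2)]
print(access_and_value_distances(trace))   # -> [(3, 1, 2)]
```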
Experimental Design

- SPEC CPU2000 programs
  • SPEC CFP2000: 171.swim, 172.mgrid, 173.applu, 177.mesa, 179.art, 183.equake, 188.ammp, 301.apsi
  • SPEC CINT2000: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 300.twolf
- Compile with gcc 2.7.2 -O3
- Comparison
  • access distance, value distance
  • store set with a 16KB table, also with values
  • perfect disambiguation
Micro-architecture

Parameter        | Value
issue width      | 8
fetch width      | 8
retire width     | 16
window size      | 128
load/store queue | 128
functional units | 8
fetch            | multiblock, gshare
data cache       | perfect
memory ports     | 2

Operation      | Latency
load           | 2
int division   | 8
int multiply   | 4
other int      | 1
float multiply | 4
float addition | 3
float division | 8
other float    | 2
IPC and Mis-speculation

IPC:
Suite    | Access Distance | Store Set, 16KB Table | Perfect
CFP2000  | 3.21            | 3.37                  | 3.71
CINT2000 | 2.90            | 3.22                  | 3.35

         | Mis-speculation Rate | % Speculated Loads
Suite    | Access | Store Set   | Access | Store Set
CFP2000  | 2.36   | 0.07        | 57.2   | 62.0
CINT2000 | 2.33   | 0.08        | 26.9   | 34.7
Value-based Disambiguation

IPC:
Suite    | Value Distance | Store Set, 16KB + Values
CFP2000  | 3.34           | 3.55
CINT2000 | 3.00           | 3.23

Suite    | Mis-speculation Rate | % Speculated Loads
CFP2000  | 1.22                 | 59.3
CINT2000 | 1.55                 | 27.6
Cache Model

Suite    | Access | Store Set 16K
CFP2000  | 1.55   | 1.61
CINT2000 | 1.53   | 1.60

Suite    | Value | Store Set 16K
CFP2000  | 1.59  | 1.63
CINT2000 | 1.55  | 1.65
Summary

- Over 90% of memory operations can have their reuse distance predicted, with 97% and 93% accuracy for floating-point and integer programs, respectively
- We can accurately predict miss rates for floating-point and integer codes
- We can identify 92% of the instructions that cause 95% of the L2 misses
- Access- and value-distance-based memory disambiguation are competitive with the best hardware techniques, without requiring a hardware table
Future Work

- Develop a prefetching mechanism that uses the identified critical loads
- Develop an MLP system that uses critical loads and access distance
- Path-sensitive memory distance analysis
- Apply memory distance to working-set based cache optimizations
- Apply access distance to EPIC-style architectures for memory disambiguation