Cache hints

Euro-Par’02
30 August 2002
Reuse Distance-Based
Cache Hint Selection
Kristof Beyls and Erik D’Hollander
Ghent University
Introduction
• Cache Hints: why and what
• Reuse distance
• Reuse distance-based cache hint
selection: how
• Experiments and results
Cache Hints: Why
• Anti-law of Moore: the processor-memory
speed gap doubles every 2 years.
• To bridge the gap, efforts must be
combined at the
– algorithm level
– compiler level
– hardware level
→ Cache hints
Cache Hints: What
• Cache control instructions are emerging
in a number of architectures.
• The HP PlayDoh EPIC architecture provides
two kinds of cache hints:
[Figure: two cache hierarchies (CPU with L1, L2, L3, each level drawn as an MRU-to-LRU stack), contrasting the placement of a line loaded by a plain LD with one loaded by LD_C2_C3.]
Cache Hints: What (2)
• 2 kinds of cache hints
LD_C2_C3
– source cache hint (C2): Indicates cache level
where data is expected.
• used by compiler to know real latency of load
– target cache hint (C3): Indicates cache level where
data should be kept.
• used by hardware to adapt replacement policy
• Question: how to select appropriate cache
hints? (→ reuse distance)
Introduction
• Cache Hints: why and what
• Reuse distance
• Reuse distance-based cache hint
selection: how
• Experiments and results
Reuse Distance: Definition
• Reuse pair: two consecutive accesses to the
same memory location
• Reuse distance of a reuse pair: the number of
distinct locations accessed between the two uses
• Forward reuse distance of an access: the reuse
distance to the next use of the same location
• Backward reuse distance of an access: the reuse
distance to the previous use

[Figure: access stream A B C D D A. The second D has
backward reuse distance 0; the second A has backward
reuse distance 3 (B, C, D accessed in between).]
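The definitions above can be made concrete with a small sketch (illustrative only, not the authors' profiling tool) that computes backward reuse distances over an address trace:

```python
def backward_reuse_distances(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (None = no earlier use, i.e.
    infinite backward reuse distance)."""
    distances = []
    for i, addr in enumerate(trace):
        # Find the previous access to the same address.
        prev = None
        for j in range(i - 1, -1, -1):
            if trace[j] == addr:
                prev = j
                break
        if prev is None:
            distances.append(None)  # first use: infinite distance
        else:
            # Distinct addresses accessed strictly in between.
            distances.append(len(set(trace[prev + 1:i])))
    return distances

print(backward_reuse_distances(["A", "B", "C", "D", "D", "A"]))
# -> [None, None, None, None, 0, 3]
```

The result matches the example stream above: the second D reuses after 0 distinct accesses, the second A after 3 (B, C, D).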
Introduction
• Cache Hints: why and what
• Reuse distance
• Reuse distance-based cache hint
selection: how
• Experiments and results
Reuse Distance: Properties
• Backward reuse distance > cache size
⇒ cache miss in a fully-associative LRU cache
• Forward reuse distance > cache size
⇒ data will not be retained in a fully-associative
LRU cache
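A tiny fully-associative LRU simulator (a sketch for illustration) confirms the first property: with distances counted in distinct addresses, an access hits exactly when its backward reuse distance is smaller than the cache size.

```python
from collections import OrderedDict

class LRUCache:
    """Fully-associative cache with LRU replacement, one address per line."""
    def __init__(self, size):
        self.size = size
        self.lines = OrderedDict()

    def access(self, addr):
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)        # promote to MRU
        else:
            self.lines[addr] = True
            if len(self.lines) > self.size:
                self.lines.popitem(last=False)  # evict the LRU line
        return hit

c = LRUCache(2)
print([c.access(a) for a in ["A", "B", "A"]])
# -> [False, False, True]: final A has BRD 1 < 2, so it hits
c = LRUCache(2)
print([c.access(a) for a in ["A", "B", "C", "A"]])
# -> [False, False, False, False]: final A has BRD 2 >= 2, so it misses
```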
Reuse-distance based
cache hint selection...
Reuse distance (RD)    Cache hint
RD < L1                C1
L1 <= RD < L2          C2
L2 <= RD < L3          C3
L3 <= RD               C4

(L1, L2, L3 denote the sizes of the three cache levels)
Differentiating between
source and target hints
• Source cache specifier: based on
backward reuse distance
• Target cache specifier: based on
forward reuse distance
e.g. (L1 = 256 lines, L2 = 8K lines, L3 = 64K lines):
an access with backward reuse distance BRD = 13 (< L1)
and forward reuse distance FRD = 5000 (L1 <= FRD < L2)
→ source hint C1, target hint C2: LD_C1_C2
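Putting both rules together, hint selection for a single access could look like the following sketch (the 256 / 8K / 64K line counts are the example sizes from this slide):

```python
def specifier(rd, sizes=(256, 8192, 65536)):
    """Map a reuse distance (None = infinite) to a cache specifier C1..C4."""
    for level, size in enumerate(sizes, start=1):
        if rd is not None and rd < size:
            return f"C{level}"
    return "C4"  # does not fit in any cache level

def load_hint(brd, frd, sizes=(256, 8192, 65536)):
    # Source specifier from the backward reuse distance,
    # target specifier from the forward reuse distance.
    return f"LD_{specifier(brd, sizes)}_{specifier(frd, sizes)}"

print(load_hint(13, 5000))  # -> LD_C1_C2, as in the example above
```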
Memory access vs.
Memory instruction
• Cache hints are applied to instructions.
• Instructions can generate multiple
accesses (e.g. in a loop), with different
reuse distances.
• Only 1 cache hint can be selected for
different accesses from the same
instruction.
Example of cache hint per
instruction problem
for i := 1 to 100
  A( i ) = ...

[Figure: the accesses A(1), A(2), ... generated by this one
instruction, and the instruction's cumulative reuse distance
distribution: 75% of the accesses have a short reuse distance
(C1 region), 25% a long one (C4 region). Which single hint
should the instruction get: LD_C1_C1 or LD_C4_C4?]
Cumulative reuse distance
distribution to cache hint
[Figure: cumulative reuse distance distribution of one
instruction, with the L1, L2 and L3 cache sizes marked on the
reuse distance axis. The distribution reaches 90% before the
L2 size, so cache hint C2 is selected for the instruction.]
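One possible per-instruction policy can be sketched as follows (the 90% threshold is taken from the example figure; the cache sizes are assumed values): pick the smallest cache level that covers most of the instruction's accesses.

```python
def hint_from_distribution(rds, sizes=(256, 8192, 65536), threshold=0.90):
    """rds: backward reuse distances (None = infinite) of all accesses
    issued by one instruction; returns a single specifier for it."""
    n = len(rds)
    for level, size in enumerate(sizes, start=1):
        covered = sum(1 for d in rds if d is not None and d < size)
        if covered / n >= threshold:   # enough accesses fit at this level
            return f"C{level}"
    return "C4"  # too many accesses exceed even the L3 size

# 90% of the accesses reuse within 10 distinct addresses -> C1.
print(hint_from_distribution([10] * 9 + [100_000]))
# Half the accesses need the L2 size -> C2.
print(hint_from_distribution([10] * 5 + [5_000] * 5))
```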
Introduction
• Cache Hints: why and what
• Reuse distance
• Reuse distance-based cache hint
selection: how
• Implementation, experiments and
results
Strategy
• Instrument the program to obtain
memory access stream.
• Profile to obtain reuse distance
distribution.
• Generate cache hints.
• Execute optimized program.
Implementation
• Open64 compiler for Itanium (IA-64)
– source cache hints: in instruction scheduler
• only visible in internal compiler representation
of instructions.
– target cache hints: in assembly output
• shows up in assembly code
e.g. ld.nta r34 = [r47]
• Programs from Olden and Spec95fp.
Execution times
[Bar chart: normalized execution time after cache hint
selection (60%-100%) for the Spec95fp benchmarks swim,
tomcatv, applu, wave5 and mgrid, the Olden benchmarks tsp,
em3d, health, mst, perimeter, power, treeadd, bisort and bh,
and the average.]
Results
• In Olden (pointer chasing), the speedup comes
mainly from target cache hints (better
replacement policy): 4% on average.
• In Spec95fp (numerical loops), the speedup
results mainly from source cache hints (better
latency hiding through instruction scheduling):
10% on average.
Conclusion
• Reuse distance is independent of cache
parameters such as size or associativity.
• ⇒ good metric for optimizations targeting
multiple cache levels.
• As such, it is an appropriate measure to base
cache hint selection on.
• The implementation in an EPIC compiler
resulted in 7% speedup on average with a
maximum of 36%.
Download