CSC 573 Lecture, Feb. 8, 2005
(DS)² Locality Models: Dependence, Stream, Distance, and Stride

Adapted from "Program Locality Models and Their Use in Memory Performance
Optimization," a tutorial at ASPLOS 2004 by
• Steve Carr, Michigan Tech
• Trishul Chilimbi, Microsoft Research
• Chen Ding, U. of Rochester
• Youfeng Wu, Intel MTL

CSC573, Computer Science, U. of Rochester

Motivation
• Compilers
  – effective for scalars
  – for loop nests with linear index expressions
  – need to approximate branches, recursion, and indirect data access
• Profiling
  – accurate for one input
  – need to predict behavior in other inputs
• Run-time analysis
  – find input-dependent patterns
  – need to be efficient
Program Locality
• Computer speed improvement
  – powered by Moore's law
  – supercomputers [Allen & Kennedy, Optimizing Compilers, page 2]
• "Memory wall" [Wulf & McKee, CAN 95]
• Software solutions
  – improving cache utilization
    • temporal locality
    • spatial locality
    • reduces both latency and bandwidth
  – prefetching
    • reduces latency but not bandwidth
• Need to understand program locality
  – when do data accesses happen
  – to what data

(DS)² Locality Models
• Dependence
  – dependences among data references in a program
  – static measurement of frequency, distance, and stride
• Stream
  – frequent sequences of data accesses in a trace
• Distance
  – patterns of the reuse distances of data accesses in a trace
• Stride
  – patterns of the strides of data accesses in a trace
Distance Model
The Distance Model and Its Uses
Chen Ding
Computer Science Department, University of Rochester
Rochester, New York

The Distance of Data Reuse
• Stack distances
  – "Evaluation techniques for storage hierarchies"
    • by R.L. Mattson, J. Gecsei, D. Slutz, and I.L. Traiger
    • IBM Systems Journal, volume 9 (2), 1970, pp. 78-117
  – one-pass evaluation of hardware configurations
  – basis for memory management and cache design
• LRU stack distance between two accesses
  – the volume of data accessed in between
• The reuse distance of an access
  – the LRU stack distance between the access and the previous access to
    the same data
Reuse Distance
• Definition
  – the number of distinct elements between this and the previous access
    to the same data
  – measures volume, not (Euclidean) time distance
  – example: in the trace b c a a c b, the second access to b has reuse
    distance d = 2
• A good notion of program locality
  – bounded
  – independent of hardware
  – independent of variations in coding and data allocation

Measuring Reuse Distance
(N is the number of memory accesses; M is the number of distinct data
elements)
• Naïve counting
  – O(N) time per access, O(N) space
  – too costly
    • up to 120 billion accesses in our tests
    • up to 25 million data elements
• Precise methods
  – stack algorithm [Mattson+ IBM70]: O(M) time per access, O(M) space
  – search tree [Olken LBL 81, Sugumar & Abraham UM93]: O(log M) time per
    access, O(M) space
  – the space cost remains a major problem
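The definition can be checked with a deliberately naïve counter (O(N) work per access, as in the cost discussion above); the trace b c a a c b is the example from the slide.

```python
def reuse_distances(trace):
    """Naive reuse-distance measurement.

    For each access, the reuse distance is the number of DISTINCT elements
    accessed since the previous access to the same datum (inf on a miss).
    """
    distances = []
    history = []  # the trace seen so far
    for x in trace:
        positions = [i for i, y in enumerate(history) if y == x]
        if positions:
            # count distinct elements strictly between the two accesses
            distances.append(len(set(history[positions[-1] + 1:])))
        else:
            distances.append(float("inf"))  # first access: cold miss
        history.append(x)
    return distances

# The slide's example trace: the second access to b has distance 2 (a, c).
print(reuse_distances("bcaacb"))  # [inf, inf, inf, 0, 1, 2]
```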
Approximation
• Tree node: (time range, weight, capacity, size)
• Basic idea
  – measure only the first few digits of a long distance
  – use non-unit-size tree nodes
    • tree size = M / average node size
    • the node size bounds the measurement error
• Guaranteed relative accuracy
  – a <= measured / actual_distance <= 1
    • e.g. a = 99%
  – logarithmic space cost if a < 1
• Hashtable cost
  – space problem solved by Bennett & Kruskal in 1975
  – not considered in this discussion
• Example: three tree nodes

    (1-7, 4, 6, 4)   (8-10, 7, 2, 2)   (11-11, 1, 1, 1)

  The three tree nodes have capacities 1, 2, and 6. This guarantees 33%
  accuracy.
Example: searching the approximation tree
• Tree node: (time range, weight, capacity, size)
• Search for the last access of b, whose access time is 4.
  – 4 ∈ (1-7), so the search ends at node (1-7, 4, 6, 4)
• Set d to 0 first; along the search path:
  – add the node size when passing node (8-10, 7, 2, 2): d += 2
  – add the node weight of the subtree to the right, (11-11, 1, 1, 1):
    d += 1
• The measured distance is 3, 60% of the actual distance.
• The error in the distance is at most 4 (the size of the final node).

Measurement of Reuse Distance

  Algorithm                                   Time             Space
  ------------------------------------------  ---------------  --------
  Naïve counting                              O(N²)            O(N)
  Trace as a stack (or list) [IBM'70]         O(NM)            O(M)
  Trace as a vector [IBM'75, UIUC'02]         O(N log N)       O(N)
  Trace as a tree [Berkeley'81,
    Michigan'93, UIUC'02]                     O(N log M)       O(N)
  List-based aggregation [Wisconsin'91]       O(NS)            O(M)
  Approx. tree [Rochester'03]                 O(N log log M)   O(log M)
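The "trace as a vector" row of the table can be sketched with a binary indexed tree over access times: keep a mark at each datum's most recent access time and count the marks after the previous access. This is an illustrative O(N log N) sketch, not the cited implementations.

```python
class BIT:
    """Binary indexed tree: point update, prefix sum (1-based indices)."""
    def __init__(self, n):
        self.n, self.t = n, [0] * (n + 1)
    def add(self, i, v):
        while i <= self.n:
            self.t[i] += v
            i += i & -i
    def prefix(self, i):
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

def reuse_distances_fast(trace):
    """Reuse distances in O(N log N): the distinct data touched since x's
    previous access equals the number of last-access marks in between."""
    bit = BIT(len(trace))
    last = {}                        # datum -> last access time (1-based)
    out = []
    for t, x in enumerate(trace, 1):
        if x in last:
            out.append(bit.prefix(t - 1) - bit.prefix(last[x]))
            bit.add(last[x], -1)     # x's last-access mark moves to t
        else:
            out.append(float("inf"))
        bit.add(t, 1)
        last[x] = t
    return out

print(reuse_distances_fast("bcaacb"))  # [inf, inf, inf, 0, 1, 2]
```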
Analysis Accuracy and Speed for FFT
[Figure: reuse-distance histograms for FFT — the 99% approximation closely
tracks the accurate analysis; analysis speed in million references per
second on a 1.7 GHz Pentium 4 vs. data size (reuse distance = 1% of data),
compared with Bennett-Kruskal. Tree sizes: accurate analysis, 65,783 tree
nodes; 99.9% approximation, 5,869 nodes; 99% approximation, 823 nodes.]

Pattern Recognition and Prediction
• Pattern analysis
  – predictable and unpredictable parts
  – constant, linear, sub-linear
  – correlation among training inputs
    • distance is independent of code and data
  – regression analysis
  – distance-based sampling
• Issues
  – granularity of measurement and prediction
  – the separation of patterns
  – high-dimensional data
  – predicting locality but not time
• Example: Spec2K/Lucas
  [Figure: % references vs. reuse distance (log scale) for two training
  runs (train 1, train 2), the prediction, and the test input; data sizes
  55K-66K.]
• 95.1% accuracy and 99.6% coverage for Lucas
• average 94% accuracy for all programs
Miss Rate Prediction
[Figure: SP reuse miss rate for a 1M cache — predicted (Train 1, Train 2)
vs. measured; miss rate 0-8% over data sizes 1.E+4 to 1.E+7 cache blocks.]

Largest Simulation
[Figure: predicted miss rate (Pred) vs. simulated set-associative (SA_1,
SA_2, SA_4, SA_8) and fully associative (FA) caches; miss rate 0-8% over
data sizes 1.E+03 to 1.E+11 cache blocks.]
A Web-based Interactive Tool
• http://www.cs.rochester.edu/research/locality

Cache Hint Selection
• Kristof Beyls and Erik H. D'Hollander from Ghent University
  – "Reuse distance-based cache hint selection"
  – EuroPar 2002
• Methodology
  – profile reuse distance for each memory reference
  – predict L1, L2, and L3 miss rates
  – insert hints based on a threshold
• Evaluation
  – Intel Itanium
  – Spec95fp benchmarks
  – 7% average performance improvement

Per-Instruction Locality
• C. Fang, S. Carr, S. Onder and Z. Wang from Michigan Tech
  – "Reuse-distance-based Miss-rate Prediction on a Per-Instruction Basis"
  – Workshop on Memory System Performance (MSP), 2004
• Methodology
  – reuse distance models for each memory reference
  – predict references that cause most cache misses
• Results
  – Spec2Kfp benchmarks
  – over 90% of references can be predicted with 97% accuracy
  – ~2% of load instructions account for 95% of misses

Cross-Architecture Performance Prediction
• Gabriel Marin and John Mellor-Crummey
  – "Cross-architecture performance prediction for scientific applications
    using parameterized models"
  – a full paper in SIGMetrics 2004
• Methodology
  – reuse distance models
  – compiler support
    • program, routine, and loop scopes
• Results
  – Sweep3D from ASCI; LU, BT, and SP from NASA
  – measured using SPARC binaries
  – predict L1, L2, TLB misses, and running time on an Origin 2000

Other Distance-Based Studies
• Register and cache performance modeling
  – Li et al. from Purdue, Interact 1996
  – Huang and Shen, CMU, Micro 1996
  – Beyls and D'Hollander from Ghent, PDCS 2001
  – Almasi et al. from Illinois, MSP 2002
  – Zhong et al. from Rochester, LCR 2002
• File caching
  – Zhou et al. from Princeton, USENIX 2001
  – Jiang and Zhang from William and Mary, SIGMetrics 2002

Comparison with Other Locality Models
• with compiler dependence analysis
  – distance analysis is generally applicable
  – less structured and incomplete information from regular loop nests
  – cannot reorder computation
• with frequency profiling
  – distance analysis finds repetitions at larger granularity
  – more temporal information
  – no prediction of access order
• with stride analysis
  – distance analysis predicts reuses
  – no prediction of access order
Summary: the Distance Model
• Distance-based locality analysis
  – more general than compiler analysis
  – more structured than traditional profiling
  – best for analyzing long-range program behavior patterns
• Uses
  – whole-program locality prediction
  – affinity-based structure and array reorganization
  – phase prediction for pro-active system adaptation

Dependence Model
Dependence-based Locality Analysis and Optimization
copied from slides by
Steve Carr
Department of Computer Science
Michigan Technological University
Houghton, Michigan
Dependence and Loops
• Iteration vector: a vector of values for the loop control variables
  – one entry per variable
  – indexed from outermost to innermost
• The set of all iteration vectors for a loop is called the iteration
  space

  DO I = 1, 10
    DO J = 3, 5
      DO K = 6, 9
        A(I,J,K) = ...

  When I = 1, J = 4, and K = 7, the iteration vector is ⟨1,4,7⟩.

Dependence Distance Vector
• Suppose that there is a dependence from reference R1 on iteration i of a
  loop nest to reference R2 on iteration j of the loop nest; then the
  dependence distance vector is d(i,j) = j - i.

  DO I = 1, 100
    DO J = 1, 100
      A(I,J) = A(I-3, J-1)

  A(2,2) is accessed by A(I,J) on iteration ⟨2,2⟩.
  A(2,2) is accessed by A(I-3,J-1) on iteration ⟨5,3⟩.
  The distance vector is ⟨3,1⟩.
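The distance-vector arithmetic in the example is easy to check mechanically; this is a minimal sketch of the entry-wise subtraction.

```python
def distance_vector(source_iter, sink_iter):
    """Dependence distance vector d(i, j) = j - i, computed entry-wise
    from the source iteration i to the sink iteration j."""
    return tuple(j - i for i, j in zip(source_iter, sink_iter))

# A(2,2) is written by A(I,J) at iteration (2,2) and later read by
# A(I-3,J-1) at iteration (5,3):
print(distance_vector((2, 2), (5, 3)))  # (3, 1)
```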
Cache Line (Block)
• The unit of memory in a cache is a line (block)
• A line may have one or more words (a word is typically 4 or 8 bytes)
  – example: a 32-byte cache line
• Accessing any member of the line brings the entire line into the cache
  (replacing whatever was there previously)
• Interference - multiple lines that map to the same cache location and
  need to be in the cache at the same time

Temporal Reuse Definition
• temporal reuse - reuse of the same memory location

  DO I = 2, N
    A(I) = A(I-1)
  ENDDO

  The same element is accessed one iteration apart.

• Note that the true dependence from A(I) to A(I-1) tells us that the
  temporal reuse exists.
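The effect of a multi-word line can be illustrated with a small cold-miss count; the 32-byte line and 8-byte elements are the assumed sizes from the example above, and the cache is assumed unbounded so only first-touch misses occur.

```python
def count_misses(addresses, line_bytes=32):
    """Count cold misses for a sequence of byte addresses: a miss occurs
    the first time each cache line is touched (unbounded cache)."""
    seen = set()
    misses = 0
    for a in addresses:
        line = a // line_bytes     # accessing any word fetches the line
        if line not in seen:
            seen.add(line)
            misses += 1
    return misses

# Sequential access to 100 8-byte elements of A: one miss per 32-byte
# line, i.e. one miss every 4 accesses (spatial reuse).
addrs = [8 * i for i in range(100)]
print(count_misses(addrs))  # 25
```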
Spatial Reuse Definition
• spatial reuse - reuse of a cache block by a nearby memory reference

  DO I = 1, N
    A(I) =
  ENDDO

  The same cache line is accessed on successive iterations.

Self Reuse Definition
• self reuse - reuse that arises due to a single static reference
• Self temporal

  DO I = 1, N
    DO J = 1, N
      A(I) =

  The same cache location is accessed on successive J iterations.

• Self spatial

  DO I = 1, N
    A(I) =
  ENDDO

  The same cache line is accessed on successive iterations.

Group Reuse Definition
• group reuse - reuse that arises due to multiple static references
• Group temporal

  DO I = 1, N
    A(I) = A(I-1)
  ENDDO

  The same element is accessed one iteration apart.

• Group spatial

  DO I = 1, N
    A(I) = A(I-1)
  ENDDO

  The same cache line is accessed on the same iteration.

Dependence-based Reuse Analysis
• McKinley, Carr & Tseng 96
  – Compute reuse across the innermost loop only
  – Self reuse
    • examine the subscript
    • temporal - missing the innermost induction variable
    • spatial - innermost induction variable in the 1st subscript position
      only
  – Group reuse
    • look at references connected by a dependence that is carried by the
      innermost loop or is loop independent

RefGroup Definition
• RefGroups represent group reuse; all references in one reference group
  have some kind of group reuse
• Two references R1 and R2 are in the same reference group with respect to
  loop L if:
  – R1 δ R2 (group-temporal reuse), where
    • δ is a loop-independent dependence, or
    • δ's entry for L is a small constant d and all other entries are zero
  – or R1 and R2 differ in the first subscript dimension by a small
    constant d and all other subscripts are identical (group-spatial
    reuse; d is decided by the cache line size and the array element size)
RefGroup Leader
• The leader is the reference that brings in the cache lines accessed by
  the other members of the RefGroup
• The leader is
  – the reference without an incoming dependence,
  – or a reference whose outgoing edges are all loop carried with
    references invariant at source and sink, or whose incoming edges all
    come from references that are not in the RefGroup
Leader Example

  DO K = 1, N
    DO J = 1, N
      DO I = 1, N
        A(I,J,K) = B(I,J+1,K) + B(I,J,K) + B(I+1,J,K)

  Dependence distance vectors: ⟨0,0,1⟩, ⟨0,1,0⟩, ⟨0,1,-1⟩

  Leaders(K) = {A(I,J,K), B(I+1,J,K), B(I,J+1,K)}
  Leaders(J) = {A(I,J,K), B(I+1,J,K)}
  Leaders(I) = {A(I,J,K), B(I+1,J,K), B(I,J+1,K)}

Loop Cost
• Loop cost gives the number of cache lines that are brought into the
  cache if a given loop is innermost. This relates to the locality of the
  loop.
• Given the RefGroups, we can compute the cost of a loop, LC(L), by
  summing the cost of each RefGroup, RC(R), where each R_k is the leader
  of a RefGroup:

    LC(L) = Σ_{1 ≤ k ≤ n} RC(R_k) × Π_{h ≠ L} trip_h

  where trip_h is the number of loop iterations of loop h.

RefGroup Cost

  RC(R_k) = 1                       if R_k is self-temporal
          = trip_L / (cls/stride)   if R_k is self-spatial
          = trip_L                  if R_k has no self-reuse

  where cls is the cache line size and trip_L is the trip count of the
  candidate innermost loop L.
Loop Cost Example

  DO K = 1, N
    DO J = 1, N
      DO I = 1, N
        A(I,J,K) = B(I,J+1,K) + B(I,J,K) + B(I+1,J,K)

  Dependence distance vectors: ⟨0,0,1⟩, ⟨0,1,0⟩, ⟨0,1,-1⟩

  Leaders(K) = {A(I,J,K), B(I+1,J,K), B(I,J+1,K)}   LC(K) = 3N·N² = 3N³
  Leaders(J) = {A(I,J,K), B(I+1,J,K)}               LC(J) = 2N·N² = 2N³
  Leaders(I) = {A(I,J,K), B(I+1,J,K), B(I,J+1,K)}   LC(I) = 3(N/cls)·N²
                                                          = 3N³/cls
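The loop-cost computation can be sketched numerically. The trip count N = 100 and cache line size cls = 4 elements are assumed values; the leader sets and reuse classifications (no self-reuse with K or J innermost, unit-stride self-spatial reuse with I innermost) are taken from the example.

```python
def loop_cost(leaders, kind, trip, cls):
    """LC(L) = sum over RefGroup leaders R_k of RC(R_k) * product of the
    other loops' trip counts. kind maps each leader to its self-reuse
    with the candidate loop innermost: 'temporal', 'spatial' (stride 1
    assumed), or 'none'."""
    rc = {"temporal": 1, "spatial": trip / cls, "none": trip}
    other_trips = trip * trip          # two enclosing loops, trip N each
    return sum(rc[kind[r]] for r in leaders) * other_trips

N, cls = 100, 4
LC_K = loop_cost(["A", "B1", "B2"],
                 dict.fromkeys(["A", "B1", "B2"], "none"), N, cls)
LC_J = loop_cost(["A", "B1"],
                 dict.fromkeys(["A", "B1"], "none"), N, cls)
LC_I = loop_cost(["A", "B1", "B2"],
                 dict.fromkeys(["A", "B1", "B2"], "spatial"), N, cls)
print(LC_K, LC_J, LC_I)  # 3000000 2000000 750000.0
```

Ordering loops by decreasing cost from outer to inner gives K, J, I, so the I loop (lowest cost, best locality) should be innermost.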
How To Use Loop Cost
• The loop cost gives the number of cache lines accessed by a loop as if
  it were innermost
• The number of cache lines is a measure of locality
• Lower loop cost implies better locality
• Order loops by decreasing cost from outermost to innermost

Uniformly Generated Sets
• Gannon, Jalby and Gallivan 1988
• Def: Let n be the depth of a loop nest and d be the number of dimensions
  of an array A. Two references A(f(i)) and A(g(i)), where f and g are
  index functions Zⁿ → Z^d, are called uniformly generated if

    f(i) = H i + c_f   and   g(i) = H i + c_g

  where H is a linear transformation and c_f and c_g are constant vectors.
Localized Vector Space
• Wolf and Lam 1991
• A loop nest of depth n corresponds to a finite convex polyhedron of Zⁿ
  bounded by the loop bounds.
• The loop iterations that exploit reuse and have locality are called the
  localized iteration space.
• If we abstract away the loop bounds, we have a localized vector space, L.

Nearby Permutation Example

  DO I = 1, 100
    DO J = 1, 100
      DO K = 1, 100
        A(I,J,K) = A(I+1,J-1,K) + B(J,I,K)

  The dependence between A(I,J,K) and A(I+1,J-1,K) makes the order K,J,I
  illegal (the permuted distance vector's first non-zero entry would be
  negative); use K,I,J instead, with permuted distance vector ⟨0,1,-1⟩.
Application - Loop Tiling

  do j = 1, n
    do i = 1, n
      do k = 1, n
        c(i,j) += a(i,k) * b(k,j)

  becomes

  do ii = 1, n, is
    do kk = 1, n, ks
      do j = 1, n
        do i = ii, min(ii+is-1,n)
          do k = kk, min(kk+ks-1,n)
            c(i,j) += a(i,k) * b(k,j)

Application - Software Prefetching and Scheduling

  do i = 1, n
    do j = 1, n
      x(i) += m(i,j)

  becomes

  do i = 1, n
    do j = 1, n
      prefetch(m(i,j+c))
      x(i) += m(i,j)
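Tiling only reorders the iteration space, so the tiled nest must compute the same result as the original. A quick sketch of that equivalence (tile size ts = 4 and matrix size n = 6 are assumed values):

```python
import random

def matmul(a, b, n):
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tiled(a, b, n, ts=4):
    """Tiled version of the nest above: same arithmetic, but the i and k
    loops are blocked so a ts-by-ts working set stays in cache."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, ts):
        for kk in range(0, n, ts):
            for j in range(n):
                for i in range(ii, min(ii + ts, n)):
                    for k in range(kk, min(kk + ts, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c

random.seed(0)
n = 6
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]
ref, tiled = matmul(a, b, n), matmul_tiled(a, b, n)
assert all(abs(ref[i][j] - tiled[i][j]) < 1e-9
           for i in range(n) for j in range(n))
```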
Limits of Compile-time Analysis
• Not all information is known at compile time
  – loop bounds
  – cache interference
• What if references are not uniformly generated?
  – index arrays
• Non-scientific code

Stream Model
The Hot Data Stream Model
Reformatted from slides by
Trishul Chilimbi
Runtime Analysis & Design Group
http://research.microsoft.com/~trishulc/Daedalus.htm
Exploitable Locality
• Exploitable Locality = Reference Skew + Regularity
• Reference skew:
    abcacbdbaecfbbbcgaafadcc
• Regularity:
    abchdefabchiklfimdefmklf
• Exploitable locality (skew + regularity):
    abcabcdefabcgabcfabcdabc

SEQUITUR (Nevill-Manning & Witten)
• Infers a hierarchical grammar from the trace in one pass.
  Trace:
    aaabac aaabac aaabac aaabac aaabad aaabad aaabad aaabad aaabad aa
  SEQUITUR grammar:
    S -> BBDDCaa
    A -> aaabac
    B -> AA
    C -> aaabad
    D -> CC
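The grammar can be verified by expanding it back to the trace; this small expander (the rule `D -> CC` appears later in the extracted slides and is included here) asserts that the start symbol regenerates exactly the input.

```python
def expand(symbol, rules):
    """Expand a grammar symbol to its terminal string; terminals are
    lowercase letters, non-terminals are the rule names."""
    return "".join(expand(s, rules) if s in rules else s
                   for s in rules.get(symbol, symbol))

# The SEQUITUR grammar from the slide.
rules = {
    "S": "BBDDCaa",
    "A": "aaabac",
    "B": "AA",
    "C": "aaabad",
    "D": "CC",
}
trace = "aaabac" * 4 + "aaabad" * 5 + "aa"
assert expand("S", rules) == trace
print(expand("S", rules)[:14])  # aaabacaaabacaa
```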
Computing Exploitable Locality
• Data reference trace
  – address abstraction (e.g. 0x32cc00 -> a, 0x53cd1e -> b)
• Abstracted trace fed to SEQUITUR, which acts as a regularity detector
• Whole Program Streams (WPS): a DAG representation of the grammar
• Inter-stream analysis, then hot data stream analyses
• Hot Data Streams

Efficient Offline Implementation
• Inter-stream analysis
  – eliminates noise
  – increases repetitions
• Graph summarization turns whole program streams (WPS) over abstracted
  hot data streams (a', b', d', ...) into a Stream Flow Graph (SFG) with
  edge frequencies

Hot Data Stream Size (Regularity)
[Figure: weighted average number of elements in hot data streams across
benchmarks, including gcc, perlbmk, vortex, eon, parser, and twolf.]
Efficient Online Implementation
• Low-overhead profiling
  – Bursty Tracing
  – Dynamic Optimization Extensions
• Low-overhead analysis

Bursty Tracing
• Duplicate basic blocks: each procedure gets an original copy (proc1) and
  an instrumented copy (prof$proc1)
• Insert dispatch checks (comp1) that use counters (nOrig, nProf) to
  decide which copy executes
• Add instrumentation to the profiling copy

Dynamic Optimization Extensions
• Over time, execution alternates between an awake phase (length
  controlled by nAwake) and a hibernate phase (length controlled by
  nHibernate)
Fast Hot Data Stream Detection
• Build the SEQUITUR grammar (omitting terminals) for the data reference
  sequence; the parse tree is a DAG, numbered in reverse postorder.
• Annotate each node with "uses : word length"; non-terminals whose
  uses × length product is large generate hot data streams, the rest are
  cold.
  [Figure: parse tree for the trace a ba a bca b ca bca bc with node
  annotations such as 1:15, 2:6, 0:3, 1:2, 4:3, and 5:2; the hot data
  stream abcabc is generated by non-terminal B.]

Prefetch Generation
• Match stream headers with a deterministic finite state machine; on a
  match, prefetch the remaining stream elements.
  [Figure: DFA for the hot data streams v = abacadae and w = bbghij, with
  states s0-s6 carrying sets of partial matches such as {}, {[v,1]},
  {[v,2],[w,1]}, {[v,3],[v,1]}, {[w,1]}, {[w,2]}, and {[w,3]}.]
• The matching code is injected at the load PCs. For v, once the
  three-element header a b a is seen, the rest of the stream is
  prefetched:

  a.pc: if (accessing a.addr) {
          if (v.seen == 2) {
            v.seen = 3;
            prefetch c.addr, a.addr, d.addr, e.addr;
          } else {
            v.seen = 1;
          }
        } else {
          v.seen = 0;
        }
  b.pc: if (accessing b.addr)
          if (v.seen == 1)
            v.seen = 2;
          else
            v.seen = 0;
        else v.seen = 0;

Putting It All Together
• Dynamic injection with Vulcan
• Timeline: execution with profiling -> Sequitur grammar analysis ->
  finite state machine for hot data streams -> analysis and optimization
  -> execution with prefetching -> hibernation -> deoptimization ->
  (repeat)

Profiling and Analysis Overhead
[Figure: % overhead (0-8) broken into dispatch checks, profiling, and
analysis, for several benchmarks.]
• 3—7% total overhead

Structure Layout Transformations
• Hot/cold splitting:

    struct node {                struct node {
      int key;                     int key;
      struct node* next;    =>     struct node* next;
      ...;                         struct cnode* cold; };
    };                           struct cnode { ...; };

• Field reordering (hot fields first):

    struct node {                struct node {
      int key;                     int key;
      struct node* next;    =>     ...
      ...;                         struct node* next;
    };                           };

• 8—23% improvements in execution time
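The injected header-matching logic can be simulated offline. This sketch (an assumption-laden simplification: single-letter "addresses", one stream v = abacadae) fires a prefetch of the tail c, a, d, e once the header a b a is seen.

```python
def run_checks(accesses):
    """Simulate the injected checks for hot data stream v = abacadae:
    match the 3-element header 'aba', then prefetch the remainder."""
    seen = 0
    prefetched = []
    for x in accesses:
        if x == "a":
            if seen == 2:
                seen = 3
                prefetched.append("cade")  # c.addr, a.addr, d.addr, e.addr
            else:
                seen = 1
        elif x == "b":
            seen = 2 if seen == 1 else 0
        else:
            seen = 0
    return prefetched

print(run_checks("abacadae"))  # ['cade'] - header matched once
print(run_checks("abba"))      # [] - broken header, no prefetch
```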
Dynamic Prefetching Evaluation
[Figure: % execution-time overhead (-20 to +20) for analysis only, no
prefetching, sequential prefetching, and dynamic prefetching, across
benchmarks including vpr, mcf, twolf, parser, vortex, and boxsim.]
• 5—19% overall execution time improvement

Conclusions
• Hot data streams are a useful locality model
  – can handle irregular, pointer-based codes
  – can be efficiently computed online
  – a small number account for 90% of data references
• The stream flow graph is a compact representation
  – 10 GB trace => 100 KB SFG
  – analyze hot data stream interactions
• Stream-based optimizations appear promising
  – good performance improvements
Stride Model
Efficient Discovery of Regular Stride Patterns in Irregular Programs and
Its Use in Data Prefetching
Reformatted from slides by
Youfeng Wu
Programming Systems Research, Intel Labs

Memory Optimization Opportunity
[Figure: breakdown of execution cycles (0-100%) for the SPECint2000
benchmarks (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser,
252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf) into
data access, instruction access, execution latency, pipeline back-end
flush, taken branch, fetch window, issue limit, and RSE stalls, with a
4MB L3 cache.]
Loads Responsible for the Data Cycles
• Cache simulation to identify delinquent loads
  – the set of load instructions that account for 95% or more of L1, L2,
    L3, DTC, and DTLB misses
  – some references are not simulated
    • library references: ~20%
    • additional references from speculation to more frequent blocks: ~8%

Concentration of Delinquent Loads
• 1.2-3.6% of executed load sites produce 95% of misses
  [Figure: cumulative misses per load site (0.0-1.0) vs. % of load sites
  (0-10%), for L1, L2, and L3 cache misses, DTC misses, and DTLB misses.]
Motivations
• Data prefetching for regular programs is very effective
  – array references often have constant strides
• Automatic prefetching of delinquent loads in irregular programs is hard
  – future data addresses are difficult to determine automatically
  – blindly applying automatic prefetching actually slows down
    SPECint2000 on Itanium
• Opportunities
  – some delinquent loads have stride access patterns

Stride Profiling
• For each static load
  – collect the top N most frequently occurring strides and their
    frequencies [Calder-97]
  – count the number of times the strides change
• Example. Addresses from a static load:
    1, 3, 5, 7, 9, 11, 111, 211, 311, 411, 413, 415, 417
  Strides:
    2, 2, 2, 2, 2, 100, 100, 100, 100, 2, 2, 2
  Stride profile:
    #1 stride = 2,   #1 stride freq = 8
    #2 stride = 100, #2 stride freq = 4
  Number of stride changes = 3
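A stride profile is simple to compute from an address trace; this sketch reproduces the top-stride frequencies from the example. (Counting adjacent stride pairs that differ gives 2 changes for this trace, while the slide reports 3, so the slide's change-counting convention may differ.)

```python
from collections import Counter

def stride_profile(addresses, top_n=2):
    """Per-load stride profile: top-N strides with their frequencies,
    plus the number of adjacent stride changes [after Calder-97]."""
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    changes = sum(1 for a, b in zip(strides, strides[1:]) if a != b)
    return Counter(strides).most_common(top_n), changes

addrs = [1, 3, 5, 7, 9, 11, 111, 211, 311, 411, 413, 415, 417]
top, changes = stride_profile(addrs)
print(top)      # [(2, 8), (100, 4)]
print(changes)  # 2 by this counting
```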
Discover Stride Patterns
• Strong single stride (SSST)
  – one non-zero stride, e.g. 60, occurs very frequently
• Phased multi-stride (PMST)
  – multiple non-zero strides together occur frequently
  – the strides change infrequently

Prefetching for Single-Stride Load
• Prefetch with a constant stride, e.g. 60:

  While (P) {
  L:  prefetch(P+120)
      D = P->data          // delinquent load
      Use(D)
      P = P->next
  }

Prefetching for Multi-Stride Load
• Compute the stride at run time and prefetch:

  prev_P = P
  While (P) {
      stride = (P - prev_P);
      prev_P = P;
  L:  prefetch(P + 2*stride)
      D = P->data          // delinquent load
      Use(D)
      P = P->next
  }

Key Observations for Efficient Profiling
• Stride patterns exist mostly for loads inside loops with high trip
  counts
  – targeting these loads can significantly reduce profiling overhead
• Stride profiles are statistically stable
  – sampling can be used
Sampling Methods
• Profile a small sample of the loads in high-trip-count loops
• Takes place inside the strideProf routine
• Chunk sampling
  – after every N1 references are skipped, profile the next N2 references
    (N1 = 8 million, N2 = 2 million)
• Fine sampling of references from a static load
  – inside each chunk, profile one reference after every F (e.g. 4)
    references
• Sampling ratio:

    N2/(N1+N2) × 1/(1+F) = 2/10 × 1/5 = 1/25 = 4%

Sampling Example
• Addresses seen by strideProf:
    {N1 refs} 2,4,6,8,10,12,14,16,18,20,22,24,26,28,... {N1 refs} ...
  (the N2 profiled references are assumed to come from the same static
  load)
• Addresses with chunk sampling:
    2,4,6,8,10,12,14,16,18,20,22,24,26,28,...
• Addresses with fine sampling, F = 4:
    2, _, _, _, 10, _, _, _, 18, _, ...
• Stride with both kinds of sampling = 8
• Original stride = 8/F = 2
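The sampling arithmetic can be checked directly. Note the two parts of the slide use slightly different conventions (1 in every 1+F references for the ratio, every F-th reference for stride recovery); the sketch reproduces each part as stated.

```python
from fractions import Fraction

# Chunk sampling keeps N2 of every N1+N2 references; fine sampling keeps
# one reference in every 1+F (parameters from the slide).
N1, N2, F = 8_000_000, 2_000_000, 4
ratio = Fraction(N2, N1 + N2) * Fraction(1, 1 + F)
print(ratio)                       # 1/25, i.e. 4%

# Stride recovery: if fine sampling keeps every F-th profiled reference
# of a load with true stride 2, the observed stride is F times larger.
addresses = list(range(2, 30, 2))  # 2, 4, 6, ..., 28
sampled = addresses[::F]           # 2, 10, 18, 26
observed = sampled[1] - sampled[0]
print(observed, observed // F)     # 8 2
```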
Performance Gains
• Compared with no prefetching
• Profiles collected with the train input; execution with the ref input
  [Figure: speedup (0.94-1.60) over the SPECint2000 benchmarks (164.gzip
  through 300.twolf, plus the average) for six configurations:
  edge-check, naïve-loop, naïve-all, sample-edge-check, sample-naïve-loop,
  and sample-naïve-all.]

Other Stride Patterns
• We have discovered "self-stride" patterns: the difference between
  successive addresses of the same load
• "Cross-stride" patterns: the difference between addresses of dependent
  but different loads, e.g.

    x = p -> brandInfo -> value

  where both dependent loads exhibit strides.
Recent Extensions
• Other stride profiling methods
  – PMU-based sampling [Luk-04]
  – partial interpretation [Inagaki-03]
  – PMU + GC analysis [Adl-Tabatabai-04]
    • the PMU identifies delinquent loads
    • the GC identifies strides and maintains them via memory management
• Prefetching more delinquent loads [Inagaki-03] [Adl-Tabatabai-04]
  – prefetch "cross-stride" loads
  – create more stride patterns for delinquent loads

PMU-based Stride Profiling
• Sample load misses in two phases:
  – skipping phases (1 sample per 1000 misses)
  – inspection phases (1 sample per miss)
• A1, A2, A3, A4 are four consecutive miss addresses of a load; the load
  has a stride of 48 bytes:
    A2-A1 = 5*48 = 240
    A3-A2 = 7*48 = 336
    A4-A3 = 3*48 = 144
• Use GCD to figure out strides from the miss addresses:
    GCD(A2-A1, A3-A2) = GCD(240, 336) = 48
    GCD(A3-A2, A4-A3) = GCD(336, 144) = 48
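The GCD trick is easy to verify on the slide's numbers; the base address 0x1000 is a made-up value for illustration, since only the deltas matter.

```python
from functools import reduce
from math import gcd

def infer_stride(miss_addresses):
    """Infer a load's stride as the GCD of the deltas between consecutive
    PMU-sampled miss addresses (samples may be several strides apart)."""
    deltas = [b - a for a, b in zip(miss_addresses, miss_addresses[1:])]
    return reduce(gcd, deltas)

base = 0x1000  # hypothetical base address; only deltas matter
A1 = base
A2 = A1 + 5 * 48   # delta 240
A3 = A2 + 7 * 48   # delta 336
A4 = A3 + 3 * 48   # delta 144
print(infer_stride([A1, A2, A3, A4]))  # 48
```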
Partial Interpretation in JIT
• Before translating a method
  – interpret each loop for a few iterations (e.g. 20)
  – identify both self-strides and cross-strides
• Translate the method with prefetching
  – can achieve better performance than prefetching self-strides alone
• Self-stride prefetch:

    Loop:
      tmp = load a[i]
      prefetch a[i+K]

• Cross-stride prefetch:

    tmp = spec_load a[i+K]
    prefetch(tmp + stride1)   // covers tmp1 = load tmp->field1
    prefetch(tmp + stride2)   // covers tmp2 = load tmp1->field2

Sampling Cache Misses
• The hardware PMU delivers samples
  – the IP of the load causing the miss
  – the effective address of the miss
  – the miss latency
  – low overhead
• Pipeline: PMU -> Metadata -> Paths -> Strides -> Inject
Discovering Delinquent Paths
• Delinquent paths abstract high-latency linked data structures
  – approximate the traversals causing the latency
• Discover edges along the paths
  – piggyback on the GC mark-phase traversal
  – characterize the edges the GC encounters
  – glean global properties about the edges
  – use the characterization to estimate paths
• Combine the edges into paths

Finding Strides Along a Path
• Process the set of delinquent objects
• If an object's type matches the base of a delinquent path
  – traverse the path
  – summarize strides between the base and objects along the path
• Useful strides exist even without proactive placement by the GC
• Proactive placement by the GC enables more loads to be stride-prefetched

Maintaining Strides
• The GC must be stride aware
• Allocation-order placement
  – frontier-pointer allocation creates useful strides
  – a sliding compaction algorithm maintains allocation-order strides
  – various GC algorithms break or alter strides
• Expands the GC's role into a memory hierarchy controller and optimizer

Inserting Prefetches
• IPs associated with a path's start are prefetch point candidates
• Screening a candidate point:
  – it must contribute a significant number of the base type's misses
  – use JIT analysis to locate the earliest availability of the pointer
    used to load the base
• Finally, recompile with prefetches, e.g. for p -> field_of_p -> value:

    tmp = load [p]
    prefetch(p + stride1)
    prefetch(p + stride2)

• A speedup of 11-14% for SPEC JBB2000
Summary: the Stride Model
• The stride profiling technique can discover stride patterns efficiently
  – the instrumentation-based approach is only 17% slower than frequency
    profiling alone
    • transparent to users for a compiler with edge frequency profiling
  – PMU- or interpretation-based techniques can also discover stride
    patterns
• Significant performance improvement
  – average 7% for the SPECint2000 suite for self-stride prefetching
    • 1.59x speedup for 181.mcf, 1.14x for 254.gap, and 1.08x for
      197.parser
  – performance gains are stable across profiling data sets
• Cross-stride prefetching can contribute additional performance gains
  – demonstrated in a Java JIT environment

Summary: (DS)² Locality Models
• Dependence
  – dependence; temporal, spatial, group, and self reuse
  – dependence analysis
  – loop permutation, tiling, prefetching
• Stream
  – hot data streams
  – off-line and run-time measurements
  – structure splitting, prefetching
• Distance
  – reuse distance, approximate measurement
  – reuse behavior prediction across program inputs
  – miss-rate prediction across program inputs
• Stride
  – stride patterns
  – profiling and sampling
  – prefetching