15-740/18-740
Computer Architecture
Lecture 15: Efficient Runahead Execution
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2011, 10/14/2011
Review Set
Due next Wednesday
Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965.
Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990.
Recommended:
Hennessy and Patterson, Appendix C.2 and C.3
Liptay, “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968.
Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
Announcements
Milestone I
Due this Friday (Oct 14)
Format: 2 pages
Include results from your initial evaluations. We need to see good progress.
Of course, the results need to make sense (i.e., you should be able to explain them)
Midterm I
October 24
Milestone II
Will be postponed. Stay tuned.
Course Feedback
I have read them
Fill out the form and return it if you have not done so
Last Lecture
More issues in load-related instruction scheduling
Better utilizing the instruction window
Runahead execution
Today
More on runahead execution
Efficient Scaling of Instruction Window Size
One of the major research issues in out-of-order execution
How to achieve the benefits of a large window with a small one (or in a simpler way)?
Runahead execution?
Upon L2 miss, checkpoint architectural state, speculatively execute only for prefetching, re-execute when data ready
Continual flow pipelines?
Upon L2 miss, deallocate everything belonging to an L2-miss dependent instruction, reallocate/re-rename and re-execute upon data ready
Dual-core execution?
One core runs ahead and does not stall on L2 misses, feeds another core that commits instructions
Runahead Example
[Timeline figure comparing three scenarios. Perfect caches: both loads hit and execution is all compute. Small window: the processor stalls on Load 1 Miss and again on Load 2 Miss, so the two memory latencies are serialized. Runahead: the processor enters runahead mode on Load 1 Miss, discovers Load 2 Miss during runahead so Miss 1 and Miss 2 overlap, and both loads hit after normal mode resumes, saving cycles.]
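To make the saved cycles concrete, here is a toy cycle-count model of the scenario above. It is not from the lecture: the latency, compute, and gap values are made-up parameters, and the cost of re-executing pre-executed instructions is ignored; it only shows how overlapping the two misses shortens the critical path.

    /* Toy model contrasting a stalling small-window machine with runahead for two
     * independent L2 misses. All numbers are illustrative, not from the lecture. */
    #include <stdio.h>

    int main(void)
    {
        const int mem_latency = 500; /* cycles to service one L2 miss         */
        const int compute     = 100; /* compute cycles around each load       */
        const int gap         = 50;  /* cycles between Load 1 and Load 2      */

        /* Small window: the window fills and the processor stalls on each miss,
         * so the two memory latencies are paid back to back.                  */
        int stall_cycles = compute + mem_latency + compute + mem_latency;

        /* Runahead: while Miss 1 is outstanding, pre-execution reaches Load 2
         * after `gap` cycles and issues Miss 2, so the two misses overlap and
         * only (gap + mem_latency) of memory time remains on the critical path. */
        int runahead_cycles = compute + gap + mem_latency + compute;

        printf("stall: %d cycles, runahead: %d cycles, saved: %d cycles\n",
               stall_cycles, runahead_cycles, stall_cycles - runahead_cycles);
        return 0;
    }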
Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
Pre-executed loads and stores independent of L2-miss
instructions generate very accurate data prefetches:
For both regular and irregular access patterns
Instructions on the predicted program path are prefetched
into the instruction/trace cache and L2.
Hardware prefetcher and branch predictor tables are trained
using future access information.
Runahead Execution Pros and Cons
Advantages:
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement, most of the hardware is already built in
+ Versus other pre-execution based prefetching mechanisms:
+ Uses the same thread context as main thread, no waste of context
+ No need to construct a pre-execution thread
Disadvantages/Limitations:
-- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by available “memory-level parallelism” (MLP)
-- Prefetch distance limited by memory latency
Implemented in IBM POWER6, Sun “Rock”
Memory Level Parallelism (MLP)
Idea: Find and service multiple cache misses in parallel
Why generate multiple misses?
[Timeline figure: misses A and B overlap in time (parallel misses), while miss C is serviced alone (isolated miss)]
Enables latency tolerance: overlaps latency of different misses
How to generate multiple misses?
Out-of-order execution, multithreading, runahead, prefetching
Memory Latency Tolerance Techniques
Caching [initially by Wilkes, 1965]
Widely used, simple, effective, but inefficient, passive
Not all applications/phases exhibit temporal or spatial locality
Prefetching [initially in IBM 360/91, 1967]
Works well for regular memory access patterns
Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
Multithreading [initially in CDC 6600, 1964]
Works well if there are multiple threads
Improving single thread performance using multithreading hardware is an
ongoing research effort
Out-of-order execution [initially by Tomasulo, 1967]
Tolerates cache misses that cannot be prefetched
Requires extensive hardware resources for tolerating long latencies
Performance of Runahead Execution
[Bar chart: micro-operations per cycle for four configurations (no prefetcher / no runahead, prefetcher only (baseline), runahead only, prefetcher + runahead) across the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS suites and their average; annotated per-suite improvements range from 12% to 52%.]
Runahead Execution vs. Large Windows
[Bar chart: micro-operations per cycle for a 128-entry window (baseline), a 128-entry window with runahead, and 256-, 384-, and 512-entry windows across the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS suites and their average.]
Runahead vs. A Real Large Window
When is one beneficial, when is the other?
Pros and cons of each
Runahead on In-order vs. Out-of-order
[Bar chart: micro-operations per cycle for in-order and out-of-order baselines, each with and without runahead, across the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS suites and their average; annotated pairs show the improvement from runahead on the in-order and on the out-of-order machine for each suite.]
Runahead vs. Large Windows (Alpha)
[Two line plots: instructions per cycle for the baseline and for runahead as the instruction window size grows from 64 to 8192 entries, one plot with a 500-cycle and one with a 1000-cycle memory latency.]
In-order vs. Out-of-order Execution (Alpha)
[Line plot: instructions per cycle for out-of-order and in-order machines, each with and without runahead (OOO+RA, OOO, IO+RA, IO), as memory latency varies from 100 to 1900 cycles.]
Sun ROCK Cores
Load miss in L1 cache starts parallelization using 2 HW threads
Ahead thread
Checkpoints state and executes speculatively
Instructions independent of load miss are speculatively executed
Load miss(es) and dependent instructions are deferred to behind thread
Behind thread
Executes deferred instructions and re-defers them if necessary
Memory-Level Parallelism (MLP)
Run ahead on load miss and generate additional load misses
Instruction-Level Parallelism (ILP)
Ahead and behind threads execute independent instructions from different points in program in parallel
ROCK Pipeline
Effect of Runahead in Sun ROCK
Chaudhry talk, Aug 2008.
Limitations of the Baseline Runahead Mechanism
Energy Inefficiency
A large number of instructions are speculatively executed
Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
Ineffectiveness for pointer-intensive applications
Runahead cannot parallelize dependent L2 cache misses
Address-Value Delta (AVD) Prediction [MICRO’05, IEEE TC’06]
Irresolvable branch mispredictions in runahead mode
Cannot recover from a mispredicted L2-miss dependent branch
Wrong Path Events [MICRO’04]
The Efficiency Problem [ISCA’05]
A runahead processor pre-executes some instructions
speculatively
Each pre-executed instruction consumes energy
Runahead execution significantly increases the
number of executed instructions, sometimes
without providing performance improvement
The Efficiency Problem
[Bar chart: % increase in executed instructions vs. % increase in IPC with runahead for the SPEC 2000 benchmarks (bzip2 through wupwise) and their average. On average, runahead increases executed instructions by 27% while increasing IPC by 22%; the largest increase in executed instructions is 235%.]
Causes of Inefficiency
Short runahead periods
Overlapping runahead periods
Useless runahead periods
Mutlu et al., “Efficient Runahead Execution: Power-Efficient
Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top
Picks 2006.
Short Runahead Periods
Processor can initiate runahead mode due to an already in-flight L2 miss generated by the prefetcher, the wrong path, or a previous runahead period
[Timeline figure: runahead is entered on a load whose L2 miss is already partially serviced, so the period ends soon after it begins]
Short periods
are less likely to generate useful L2 misses
have high overhead due to the flush penalty at runahead exit
Eliminating Short Periods
Mechanism to eliminate short periods:
Record the number of cycles C an L2-miss has been in flight
If C is greater than a threshold T for an L2 miss, disable entry
into runahead mode due to that miss
T can be determined statically (at design time) or dynamically
T=400 for a minimum main memory latency of 500 cycles
works well
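The in-flight-cycle check above can be expressed as a tiny predicate. This is a minimal sketch; the function and counter names are hypothetical, and only the threshold comparison and the T=400 value come from the slide.

    #include <stdbool.h>
    #include <stdint.h>

    #define RUNAHEAD_THRESHOLD_T 400   /* T = 400 for a 500-cycle minimum memory latency */

    /* C = number of cycles the blocking L2 miss has already been in flight. */
    bool short_period_filter_allows_runahead(uint32_t cycles_in_flight_C)
    {
        /* If the miss is close to completion, a runahead period would be too short
         * to generate useful L2 misses and would mainly pay the exit flush penalty. */
        return cycles_in_flight_C <= RUNAHEAD_THRESHOLD_T;
    }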
Overlapping Runahead Periods
Two runahead periods that execute the same instructions
[Timeline figure: the period triggered by Load 1 Miss and the later period triggered by Load 2 Miss pre-execute an overlapping set of instructions]
Second period is inefficient
Eliminating Overlapping Periods
Overlapping periods are not necessarily useless
The availability of a new data value can result in the
generation of useful L2 misses
But, this does not happen often enough
Mechanism to eliminate overlapping periods:
Keep track of the number of pseudo-retired instructions R
during a runahead period
Keep track of the number of fetched instructions N since the
exit from last runahead period
If N < R, do not enter runahead mode
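The N < R rule can likewise be written as a small predicate; the function and counter names below are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    /* R = instructions pseudo-retired during the last runahead period.
     * N = instructions fetched since exiting that period.               */
    bool overlap_filter_allows_runahead(uint64_t pseudo_retired_R, uint64_t fetched_N)
    {
        /* If normal mode has not yet fetched past what the previous period already
         * pre-executed (N < R), a new period would mostly repeat the same work.    */
        return fetched_N >= pseudo_retired_R;
    }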
Useless Runahead Periods
Periods that do not result in prefetches for normal mode
[Timeline figure: a runahead period triggered by Load 1 Miss that generates no additional L2 misses before Load 1's data returns]
They exist due to the lack of memory-level parallelism
Mechanism to eliminate useless periods:
Predict if a period will generate useful L2 misses
Estimate a period to be useful if it generated an L2 miss that
cannot be captured by the instruction window
Useless period predictors are trained based on this estimation
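One possible shape for such a useless-period predictor is sketched below. The slide only specifies the training signal (did the period generate an L2 miss the instruction window could not capture?); the table size, indexing by load PC, and 2-bit confidence counters are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define USELESS_PRED_ENTRIES 256

    static uint8_t useful_conf[USELESS_PRED_ENTRIES];   /* 2-bit saturating counters */

    static unsigned pred_index(uint64_t load_pc) { return (unsigned)(load_pc % USELESS_PRED_ENTRIES); }

    /* Consulted before entering runahead on a miss of this static load. */
    bool predict_period_useful(uint64_t load_pc)
    {
        return useful_conf[pred_index(load_pc)] >= 2;
    }

    /* Trained at runahead exit with the estimate from the slide: the period was
     * useful if it generated an L2 miss the instruction window could not capture. */
    void train_useless_period_predictor(uint64_t load_pc, bool generated_far_l2_miss)
    {
        uint8_t *c = &useful_conf[pred_index(load_pc)];
        if (generated_far_l2_miss) { if (*c < 3) (*c)++; }
        else                       { if (*c > 0) (*c)--; }
    }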
Overall Impact on Executed Instructions
[Bar chart: increase in executed instructions for the SPEC 2000 benchmarks (bzip2 through wupwise) and their average, comparing baseline runahead with all efficiency techniques applied. The average increase drops from 26.5% to 6.2%; the largest increase under baseline runahead is 235%.]
Overall Impact on IPC
[Bar chart: increase in IPC for the SPEC 2000 benchmarks (bzip2 through wupwise) and their average, comparing baseline runahead with all efficiency techniques applied. The average IPC increase is 22.6% for baseline runahead and 22.1% with all techniques; the largest increase is 116%.]
Limitations of the Baseline Runahead Mechanism
Energy Inefficiency
A large number of instructions are speculatively executed
Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
Ineffectiveness for pointer-intensive applications
Runahead cannot parallelize dependent L2 cache misses
Address-Value Delta (AVD) Prediction [MICRO’05]
Irresolvable branch mispredictions in runahead mode
Cannot recover from a mispredicted L2-miss dependent branch
Wrong Path Events [MICRO’04]
The Problem: Dependent Cache Misses
Runahead: Load 2 is dependent on Load 1
[Timeline figure: during runahead on Load 1 Miss, Load 2's address cannot be computed (Load 2 is INV), so Miss 2 is only generated after normal mode resumes and Load 2 misses again]
Runahead execution cannot parallelize dependent misses
wasted opportunity to improve performance
wasted energy (useless pre-execution)
Runahead performance would improve by 25% if this limitation were ideally overcome
The Goal of AVD Prediction
Enable the parallelization of dependent L2 cache misses in
runahead mode with a low-cost mechanism
How:
Predict the values of L2-miss address (pointer) loads
Address load: loads an address into its destination register,
which is later used to calculate the address of another load
as opposed to a data load
Parallelizing Dependent Cache Misses
[Timeline figures: without a predicted value, Load 2's address cannot be computed during runahead, so Miss 2 is only serviced after runahead ends. With Load 1's value predicted, Load 2's address can be computed, Miss 1 and Miss 2 overlap, both loads hit in normal mode, and both cycles and speculative instructions are saved.]
AVD Prediction [MICRO’05]
Address-value delta (AVD) of a load instruction defined as:
AVD = Effective Address of Load – Data Value of Load
For some address loads, AVD is stable
An AVD predictor keeps track of the AVDs of address loads
When a load is an L2 miss in runahead mode, AVD
predictor is consulted
If the predictor returns a stable (confident) AVD for that
load, the value of the load is predicted
Predicted Value = Effective Address – Predicted AVD
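The two formulas on this slide translate directly into code; the function names below are illustrative.

    #include <stdint.h>

    /* Training (normal mode): AVD = Effective Address of Load - Data Value of Load. */
    int64_t compute_avd(uint64_t effective_addr, uint64_t loaded_value)
    {
        return (int64_t)(effective_addr - loaded_value);
    }

    /* Prediction (runahead mode): Predicted Value = Effective Address - Predicted AVD. */
    uint64_t predict_load_value(uint64_t effective_addr, int64_t predicted_avd)
    {
        return effective_addr - (uint64_t)predicted_avd;
    }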
Why Do Stable AVDs Occur?
Regularity in the way data structures are allocated in memory AND traversed
Two types of loads can have stable AVDs
Traversal address loads
Produce addresses consumed by address loads
Leaf address loads
Produce addresses consumed by data loads
Traversal Address Loads
Regularly-allocated linked list: nodes at addresses A, A+k, A+2k, A+3k, ...
A traversal address load loads the pointer to the next node:
node = node->next
AVD = Effective Addr – Data Value

Effective Addr   Data Value   AVD
A                A+k          -k
A+k              A+2k         -k
A+2k             A+3k         -k

Striding data value, stable AVD
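A small, self-contained C program can show why the traversal load's AVD stays constant when nodes are allocated back to back. The array-as-allocator trick below is only an illustration of regular allocation, not code from any benchmark.

    #include <stdio.h>
    #include <stdint.h>

    struct node { int64_t key; struct node *next; };

    int main(void)
    {
        /* Regular allocation: consecutive nodes sit a fixed distance k apart. */
        struct node pool[4];
        for (int i = 0; i < 3; i++) pool[i].next = &pool[i + 1];
        pool[3].next = NULL;

        /* node = node->next: the effective address is &node->next and the data
         * value is the address of the next node, so EA - value is the same
         * (negative) constant on every iteration of the traversal.             */
        for (struct node *n = pool; n->next != NULL; n = n->next) {
            long avd = (long)((intptr_t)&n->next - (intptr_t)n->next);
            printf("AVD = %ld\n", avd);
        }
        return 0;
    }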
Properties of Traversal-based AVDs
Stable AVDs can be captured with a stride value predictor
Stable AVDs disappear with the re-organization of the data
structure (e.g., sorting)
[Figure: after sorting, the nodes at A, A+k, A+2k, A+3k are re-linked in a different order, so the distance between consecutive nodes in the list is no longer constant]
Stability of AVDs is dependent on the behavior of the
memory allocator
Allocation of contiguous, fixed-size chunks is useful
Leaf Address Loads
Sorted dictionary in parser:
Nodes point to strings (words)
String and node allocated consecutively
Dictionary looked up for an input word

A leaf address load loads the pointer to the string of each node:

lookup (node, input) {
   // ...
   ptr_str = node->string;
   m = check_match(ptr_str, input);
   // ...
}

[Figure: dictionary tree with nodes at addresses A+k through G+k, each pointing to its string at A through G]

AVD = Effective Addr – Data Value

Effective Addr   Data Value   AVD
A+k              A            k
C+k              C            k
F+k              F            k

No stride in the data values, but a stable AVD
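The consecutive node-and-string allocation can also be shown with a short C program. This is only a sketch in the spirit of parser's dictionary (made-up node layout and words), demonstrating that the node-to-string distance, and hence the AVD of the leaf load, is the same for every node.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    struct dict_node { struct dict_node *left, *right; char *string; };

    /* Allocate node and string consecutively, in one chunk. */
    static struct dict_node *make_node(const char *word)
    {
        struct dict_node *n = malloc(sizeof *n + strlen(word) + 1);
        n->left = n->right = NULL;
        n->string = (char *)(n + 1);     /* string sits right after the node */
        strcpy(n->string, word);
        return n;
    }

    int main(void)
    {
        const char *words[] = { "alpha", "beta", "gamma" };
        for (int i = 0; i < 3; i++) {
            struct dict_node *n = make_node(words[i]);
            /* ptr_str = node->string: EA is &n->string, value is the string's
             * address; EA - value is the same constant for every node (its sign
             * depends on whether the string precedes or follows the node).      */
            printf("AVD = %ld\n", (long)((intptr_t)&n->string - (intptr_t)n->string));
            free(n);
        }
        return 0;
    }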
Properties of Leaf-based AVDs
Stable AVDs cannot be captured with a stride value predictor
Stable AVDs do not disappear with the re-organization of
the data structure (e.g., sorting)
[Figure: after sorting, the tree nodes at A+k, B+k, C+k are re-linked, but each node's string remains a constant distance k away, so the node-to-string distance stays constant]
Stability of AVDs is dependent on the behavior of the
memory allocator
Identifying Address Loads in Hardware
Insight:
If the AVD is too large, the value that is loaded is likely not an
address
Only keep track of loads that satisfy:
-MaxAVD ≤ AVD ≤ +MaxAVD
This identification mechanism eliminates many loads from
consideration
Enables the AVD predictor to be small
An Implementable AVD Predictor
Set-associative prediction table
Prediction table entry consists of
Tag (Program Counter of the load)
Last AVD seen for the load
Confidence counter for the recorded AVD
Updated when an address load is retired in normal mode
Accessed when a load misses in L2 cache in runahead mode
Recovery-free: No need to recover the state of the processor
or the predictor on misprediction
Runahead mode is purely speculative
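Putting the last three slides together, a direct-mapped version of the predictor might look like the sketch below. The entry fields (tag, last AVD, confidence), the update/access points, and the MaxAVD filter come from the slides; the direct-mapped organization, table size, confidence thresholds, and MaxAVD value are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define AVD_ENTRIES  16            /* illustrative; the results slides evaluate 16- and 4096-entry tables */
    #define MAX_AVD      (64 * 1024)   /* illustrative MaxAVD bound for "is this an address load?"            */
    #define CONF_MAX     3
    #define CONF_PREDICT 2

    struct avd_entry { uint64_t tag; int64_t avd; uint8_t conf; };
    static struct avd_entry table[AVD_ENTRIES];

    /* Normal mode: called when an address load retires. */
    void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t value)
    {
        int64_t avd = (int64_t)(eff_addr - value);
        if (avd > MAX_AVD || avd < -MAX_AVD)
            return;                                   /* too large: likely a data load, not tracked */

        struct avd_entry *e = &table[pc % AVD_ENTRIES];
        if (e->tag != pc)       { e->tag = pc; e->avd = avd; e->conf = 0; }
        else if (e->avd == avd) { if (e->conf < CONF_MAX) e->conf++; }
        else                    { e->avd = avd; e->conf = 0; }
    }

    /* Runahead mode: called when a load misses in the L2 cache. */
    bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *predicted_value)
    {
        struct avd_entry *e = &table[pc % AVD_ENTRIES];
        if (e->tag == pc && e->conf >= CONF_PREDICT) {
            *predicted_value = eff_addr - (uint64_t)e->avd;   /* value = EA - AVD */
            return true;
        }
        return false;   /* no confident AVD: the load's result stays INV */
    }

No recovery logic is needed on a misprediction, since runahead mode is purely speculative.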
AVD Update Logic
AVD Prediction Logic
Baseline Processor
Execution-driven Alpha simulator
8-wide superscalar processor
128-entry instruction window, 20-stage pipeline
64 KB, 4-way, 2-cycle L1 data and instruction caches
1 MB, 32-way, 10-cycle unified L2 cache
500-cycle minimum main memory latency
32 DRAM banks, 32-byte wide processor-memory bus (4:1
frequency ratio), 128 outstanding misses
Detailed memory model
Pointer-intensive benchmarks from Olden and SPEC INT00
Performance of AVD Prediction
[Bar chart: normalized execution time and normalized executed instructions for runahead with AVD prediction on the pointer-intensive Olden benchmarks (bisort, health, mst, perimeter, treeadd, tsp, voronoi) and SPEC INT 2000 benchmarks (mcf, parser, twolf, vpr), plus the average; the annotated average reductions are 14.3% and 15.5%.]
AVD vs. Stride VP Performance
1.00
AVD
Normalized Execution Time (excluding health)
0.98
stride
2.7%
0.96
0.94
hybrid
5.1%
5.5%
4.7%
6.5%
0.92
8.6%
0.90
0.88
0.86
0.84
0.82
0.80
entries
1616entries
AVD Prediction
4096 entries
4096
entries
49