Efficient Runahead Execution Processors
A Power-Efficient Processing Paradigm for Tolerating Long Main Memory Latencies

Onur Mutlu
PhD Defense
4/28/2006

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Motivation

- Memory latency is very long in today's processors:
  - CDC 6600: 10 cycles [Thornton, 1970]
  - Alpha 21264: 120+ cycles [Wilkes, 2001]
  - Intel Pentium 4: 300+ cycles [Sprangle & Carmean, 2002]
- And it continues to increase (in terms of processor cycles):
  - DRAM latency is not decreasing as fast as processor cycle time.
- Conventional techniques to tolerate memory latency do not work well enough.

Conventional Latency Tolerance Techniques

- Caching [initially by Wilkes, 1965]
  - Widely used, simple, and effective, but inefficient and passive
  - Not all applications/phases exhibit temporal or spatial locality
- Prefetching [initially in IBM 360/91, 1967]
  - Works well for regular memory access patterns
  - Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
- Multithreading [initially in CDC 6600, 1964]
  - Works well if there are multiple threads
  - Improving single-thread performance using multithreading hardware is an ongoing research effort
- Out-of-order execution [initially by Tomasulo, 1967]
  - Tolerates cache misses that cannot be prefetched
  - Requires extensive hardware resources to tolerate long latencies

Out-of-order Execution

- Instructions are executed out of sequential program order to tolerate latency.
- Instructions are retired in program order to support precise exceptions/interrupts.
- Not-yet-retired instructions and their results are buffered in hardware structures, collectively called the instruction window.
- The size of the instruction window determines how much latency the processor can tolerate.

Small Windows: Full-window Stalls

- When a long-latency instruction is not complete, it blocks retirement.
- Incoming instructions fill the instruction window.
- Once the window is full, the processor cannot place new instructions into the window.
  - This is called a full-window stall.
  - A full-window stall prevents the processor from making progress in the execution of the program.

Small Windows: Full-window Stalls

8-entry instruction window:

  Oldest -> LOAD R1 <- mem[R5]      L2 miss! Takes 100s of cycles.
            BEQ  R1, R0, target
            ADD  R2 <- R2, 8
            LOAD R3 <- mem[R2]      Independent of the L2 miss,
            MUL  R4 <- R4, R3       executed out of program order,
            ADD  R4 <- R4, R5       but cannot be retired.
            STOR mem[R2] <- R4
            ADD  R2 <- R2, 64

            LOAD R3 <- mem[R2]      Younger instructions cannot be executed
                                    because there is no space in the window.

The processor stalls until the L2 miss is serviced.

- L2 cache misses are responsible for most full-window stalls.

Impact of L2 Cache Misses

[Figure: normalized execution time (0-100), broken into non-stall (compute) time and full-window stall time, with and without L2 misses, for a 128-entry window. 512 KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

Impact of L2 Cache Misses

[Figure: the same breakdown of non-stall (compute) time and full-window stall time for 128-entry and 2048-entry windows. 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

The Problem

- Out-of-order execution requires large instruction windows to tolerate today's main memory latencies.
- As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
- Building a large instruction window is a challenging task if we would like to achieve:
  - Low power/energy consumption
  - Short cycle time
  - Low design and verification complexity

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Overview of Runahead Execution [HPCA'03]

- A technique to obtain the memory-level parallelism benefits of a large instruction window (without having to build it!)
- When the oldest instruction is an L2 miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - The purpose of pre-execution is to discover other L2 misses
  - The processor does not stall due to L2 misses
- Runahead mode ends when the original L2 miss returns
  - Checkpoint is restored and normal execution resumes (see the sketch below)

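To make the mode transitions concrete, here is a minimal C sketch of the entry/exit control flow, assuming hypothetical simulator hooks (oldest_is_l2_miss, miss_serviced, checkpoint_state, flush_pipeline, restore_state) that stand in for the real pipeline machinery:

    #include <stdbool.h>

    /* Hypothetical hooks into a pipeline model. */
    extern bool oldest_is_l2_miss(void);
    extern bool miss_serviced(void);
    extern void checkpoint_state(void);
    extern void flush_pipeline(void);
    extern void restore_state(void);

    enum mode { NORMAL, RUNAHEAD };
    static enum mode mode = NORMAL;

    void retire_stage_tick(void) {
        if (mode == NORMAL && oldest_is_l2_miss()) {
            checkpoint_state();       /* save architectural registers */
            mode = RUNAHEAD;          /* pre-execute speculatively */
        } else if (mode == RUNAHEAD && miss_serviced()) {
            flush_pipeline();         /* discard all runahead results */
            restore_state();          /* return to the checkpoint */
            mode = NORMAL;            /* re-fetch from the blocked load */
        }
    }

Nothing executed in runahead mode updates architectural state; only the prefetches and predictor training it leaves behind persist.
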
Runahead Example

- Perfect caches: Compute, Load 1 hit, Compute, Load 2 hit, Compute.
- Small window: Compute, Load 1 miss (stall until Miss 1 returns), Compute, Load 2 miss (stall until Miss 2 returns), Compute. The two misses are serviced one after the other.
- Runahead: Compute, Load 1 miss (enter runahead; Load 2 misses during pre-execution, so Miss 1 and Miss 2 are serviced in parallel), then Load 1 hit, Load 2 hit, Compute. Overlapping the two misses yields the saved cycles.

Benefits of Runahead Execution

Instead of stalling during an L2 cache miss:

- Pre-executed loads and stores independent of the L2-miss instructions generate very accurate data prefetches:
  - For both regular and irregular access patterns
- Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
- Hardware prefetcher and branch predictor tables are trained using future access information.

Runahead Execution Mechanism

- Entry into runahead mode
  - Checkpoint architectural register state
- Instruction processing in runahead mode
- Exit from runahead mode
  - Restore architectural register state from checkpoint

Instruction Processing in Runahead Mode

Runahead mode processing is the same as normal instruction processing, EXCEPT:

- It is purely speculative: architectural (software-visible) register/memory state is NOT updated in runahead mode.
- L2-miss dependent instructions are identified and treated specially:
  - They are quickly removed from the instruction window.
  - Their results are not trusted.

L2-Miss Dependent Instructions

- Two types of results are produced: INV and VALID.
- INV = dependent on an L2 miss.
  - INV results are marked using INV bits in the register file and store buffer.
  - INV values are not used for prefetching/branch resolution.

Removal of Instructions from Window

- The oldest instruction is examined for pseudo-retirement (see the sketch below):
  - An INV instruction is removed from the window immediately.
  - A VALID instruction is removed when it completes execution.
- Pseudo-retired instructions free their allocated resources.
  - This allows the processing of later instructions.
- Pseudo-retired stores communicate their data to dependent loads.

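A minimal sketch of the pseudo-retirement rule for the oldest window entry, assuming a hypothetical uop record with INV and completion flags:

    #include <stdbool.h>

    typedef struct {
        bool inv;        /* result depends on an L2 miss */
        bool completed;  /* has finished execution (relevant for VALID uops) */
    } uop_t;

    /* INV uops leave the window immediately; VALID uops wait until they
     * complete, so that later instructions can enter the window. */
    bool can_pseudo_retire(const uop_t *oldest) {
        return oldest->inv || oldest->completed;
    }
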
Store/Load Handling in Runahead Mode

- A pseudo-retired store writes its data and INV status to a dedicated memory, called the runahead cache.
  - Purpose: data communication through memory in runahead mode.
- A dependent load reads its data from the runahead cache (sketched below).
  - It does not need to be always correct -> the runahead cache can be very small.

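A minimal sketch of this forwarding path, assuming a hypothetical direct-mapped organization (the real structure is only 512 bytes; occasional wrong values are tolerable because runahead mode is purely speculative):

    #include <stdbool.h>
    #include <stdint.h>

    #define RA_LINES 64

    typedef struct { uint64_t addr; uint64_t data; bool inv; bool valid; } ra_line_t;
    static ra_line_t ra_cache[RA_LINES];

    /* A pseudo-retired store deposits its data and INV status. */
    void ra_store(uint64_t addr, uint64_t data, bool inv) {
        ra_line_t *l = &ra_cache[(addr >> 3) % RA_LINES];
        l->addr = addr; l->data = data; l->inv = inv; l->valid = true;
    }

    /* A later load checks the runahead cache before the data cache. */
    bool ra_load(uint64_t addr, uint64_t *data, bool *inv) {
        ra_line_t *l = &ra_cache[(addr >> 3) % RA_LINES];
        if (l->valid && l->addr == addr) {
            *data = l->data; *inv = l->inv;
            return true;
        }
        return false;  /* miss: fall through to the normal memory hierarchy */
    }
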
Branch Handling in Runahead Mode

- INV branches cannot be resolved.
  - A mispredicted INV branch causes the processor to stay on the wrong program path until the end of runahead execution.
- VALID branches are resolved and initiate recovery if mispredicted.

Hardware Cost of Runahead Execution

- Checkpoint of the architectural register state
  - Already exists in current processors
- INV bits per register and store buffer entry
- Runahead cache (512 bytes)

Total: <0.05% area overhead

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Baseline Processor

- 3-wide fetch, 29-stage pipeline x86 processor
- 128-entry instruction window
- 512 KB, 8-way, 16-cycle unified L2 cache
- Approximately 500-cycle L2 miss latency
  - Bandwidth, contention, and conflicts modeled in detail
- Aggressive streaming data prefetcher (16 streams)
- Next-two-lines instruction prefetcher

Evaluated Benchmarks

- 147 Intel x86 benchmarks simulated for 30 million instructions
- Benchmark suites:
  - SPEC CPU 95 (S95): mostly scientific FP applications
  - SPEC FP 2000 (FP00)
  - SPEC INT 2000 (INT00)
  - Internet (WEB): SPECjbb, Webmark2001
  - Multimedia (MM): MPEG, speech recognition, games
  - Productivity (PROD): PowerPoint, Excel, Photoshop
  - Server (SERV): transaction processing, e-commerce
  - Workstation (WS): engineering/CAD applications

Performance of Runahead Execution

[Figure: micro-operations per cycle for four configurations (no prefetcher/no runahead, prefetcher only (baseline), runahead only, prefetcher + runahead) across S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG. Adding runahead on top of the prefetcher improves performance by 22% on average over the prefetcher-only baseline, with per-suite gains ranging from 12% to 52%.]

Runahead Execution vs. Large Windows

[Figure: micro-operations per cycle for a 128-entry window (baseline), a 128-entry window with runahead, and 256/384/512-entry windows, across the benchmark suites. On average, the 128-entry window with runahead performs about as well as a 384-entry window.]

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Limitations of the Baseline Runahead Mechanism

- Energy inefficiency
  - A large number of instructions are speculatively executed
  - Addressed by: Efficient Runahead Execution [ISCA'05, IEEE Micro Top Picks'06]
- Ineffectiveness for pointer-intensive applications
  - Runahead cannot parallelize dependent L2 cache misses
  - Addressed by: Address-Value Delta (AVD) Prediction [MICRO'05]
- Irresolvable branch mispredictions in runahead mode
  - Cannot recover from a mispredicted L2-miss dependent branch
  - Addressed by: Wrong Path Events [MICRO'04]

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

The Efficiency Problem [ISCA'05]

- A runahead processor pre-executes some instructions speculatively.
- Each pre-executed instruction consumes energy.
- Runahead execution significantly increases the number of executed instructions, sometimes without providing any performance improvement.

The Efficiency Problem

[Figure: % increase in IPC vs. % increase in executed instructions for each SPEC CPU2000 benchmark (bzip2 through wupwise). On average, baseline runahead increases IPC by 22% while increasing executed instructions by 27%; the worst-case instruction increase reaches 235%.]

Efficiency of Runahead Execution

                 % Increase in IPC
  Efficiency = ------------------------------------
                 % Increase in Executed Instructions

- Goals:
  - Reduce the number of executed instructions without reducing the IPC improvement
  - Increase the IPC improvement without increasing the number of executed instructions

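As a worked example using the averages reported later in this talk: baseline runahead increases IPC by 22.6% at the cost of 26.5% more executed instructions, an efficiency of 22.6 / 26.5 ≈ 0.85. The efficiency techniques raise this ratio to 22.1 / 6.2 ≈ 3.6.
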
Causes of Inefficiency

- Short runahead periods
- Overlapping runahead periods
- Useless runahead periods

Short Runahead Periods

- The processor can initiate runahead mode due to an already in-flight L2 miss, generated by:
  - the prefetcher, wrong-path execution, or a previous runahead period
- Example timeline: Miss 1 is nearly serviced when the processor enters runahead on it, so the period ends almost immediately after Load 2 misses.
- Short periods:
  - are less likely to generate useful L2 misses
  - have high overhead due to the flush penalty at runahead exit

Eliminating Short Runahead Periods

- Mechanism to eliminate short periods (sketched below):
  - Record the number of cycles C an L2 miss has been in flight.
  - If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss.
  - T can be determined statically (at design time) or dynamically.
  - T = 400 works well for a minimum main memory latency of 500 cycles.

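A minimal sketch of this filter; the function and counter names are assumptions:

    #include <stdbool.h>

    #define T_THRESHOLD 400   /* works well for a 500-cycle minimum latency */

    /* C = cycles this L2 miss has already been in flight. If C > T, the
     * remaining latency is too short for a useful runahead period. */
    bool short_period_filter_allows_entry(unsigned c_cycles_in_flight) {
        return c_cycles_in_flight <= T_THRESHOLD;
    }
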
Overlapping Runahead Periods

- Two runahead periods that execute the same instructions.
- Example timeline: the period triggered by Miss 1 pre-executes past Load 2 (INV); after exit, Load 2 misses and triggers a second period that re-executes the overlapping instructions.
- The second period is inefficient.

Overlapping Runahead Periods (cont.)

- Overlapping periods are not necessarily useless:
  - The availability of a new data value can result in the generation of useful L2 misses
  - But this does not happen often enough
- Mechanism to eliminate overlapping periods (sketched below):
  - Keep track of the number of pseudo-retired instructions R during a runahead period
  - Keep track of the number of fetched instructions N since the exit from the last runahead period
  - If N < R, do not enter runahead mode

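A minimal sketch of the N < R check; counter names are assumptions:

    #include <stdbool.h>

    /* R = uops pseudo-retired in the last runahead period;
     * N = uops fetched since that period ended. If N < R, a new period
     * would mostly re-execute the same instructions, so skip it. */
    bool overlap_filter_allows_entry(unsigned n_fetched_since_exit,
                                     unsigned r_pseudo_retired_last_period) {
        return n_fetched_since_exit >= r_pseudo_retired_last_period;
    }
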
Useless Runahead Periods

- Periods that do not result in prefetches for normal mode.
- They exist due to the lack of memory-level parallelism.
- Mechanism to eliminate useless periods:
  - Predict whether a period will generate useful L2 misses.
  - Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window.
  - Useless-period predictors are trained based on this estimation.

Predicting Useless Runahead Periods

- Prediction based on the past usefulness of runahead periods caused by the same static load instruction
  - A 2-bit state machine records the past usefulness of a load (sketched below)
- Prediction based on too many INV loads
  - If the fraction of INV loads in a runahead period is greater than T, exit runahead mode
- Sampling (phase) based prediction
  - If the last N runahead periods generated fewer than T L2 misses, do not enter runahead for the next M runahead opportunities
- Compile-time profile-based prediction
  - If runahead modes caused by a load were not useful in the profiling run, mark it as a non-runahead load

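A minimal sketch of the per-load 2-bit usefulness state machine; the threshold and encoding are assumptions:

    #include <stdbool.h>

    typedef unsigned char sat2_t;   /* saturating counter, 0..3 */

    /* Train after each runahead period caused by this static load; a period
     * counts as useful if it generated an L2 miss the window could not reach. */
    void train_usefulness(sat2_t *ctr, bool period_was_useful) {
        if (period_was_useful) { if (*ctr < 3) (*ctr)++; }
        else                   { if (*ctr > 0) (*ctr)--; }
    }

    /* Enter runahead for this load only if its history looks useful. */
    bool predict_useful(sat2_t ctr) {
        return ctr >= 2;
    }
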
Performance Optimizations for Efficiency

- Both efficiency AND performance can be increased by increasing the usefulness of runahead periods.
- Three major optimizations:
  - Turning off the Floating Point Unit (FPU) in runahead mode
    - FP instructions do not contribute to the generation of load addresses
  - Optimizing the update policy of the hardware prefetcher (HWP) in runahead mode
    - Improves the positive interaction between runahead and the HWP
  - Early wake-up of INV instructions
    - Enables the faster removal of INV instructions

Overall Impact on Executed Instructions

[Figure: % increase in executed instructions per SPEC CPU2000 benchmark, baseline runahead vs. all efficiency techniques. The average increase drops from 26.5% to 6.2%; the worst case for baseline runahead is 235%.]

Overall Impact on IPC

[Figure: % increase in IPC per SPEC CPU2000 benchmark, baseline runahead vs. all efficiency techniques. The average IPC increase is essentially preserved: 22.6% for baseline runahead vs. 22.1% with all techniques; the largest per-benchmark increase is 116%.]

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

The Problem: Dependent Cache Misses

- Runahead: Load 2 is dependent on Load 1, so its address cannot be computed during runahead (Load 2 is INV). Miss 2 is not discovered until Load 1 returns and Load 2 misses in normal mode.
- Runahead execution cannot parallelize dependent misses:
  - wasted opportunity to improve performance
  - wasted energy (useless pre-execution)
- Runahead performance would improve by 25% if this limitation were ideally overcome.

The Goal of AVD Prediction

- Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism.
- How: predict the values of L2-miss address (pointer) loads.
  - Address load: loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load).

Parallelizing Dependent Cache Misses

- Without value prediction: Load 2 cannot compute its address during runahead (INV); Miss 2 is serviced only after runahead ends, when Load 2 misses in normal mode.
- With the value of Load 1 predicted: Load 2 can compute its address and miss during runahead, so Miss 1 and Miss 2 overlap. Both loads hit after runahead, yielding saved cycles and saved speculative instructions.

AVD Prediction [MICRO'05]

- The address-value delta (AVD) of a load instruction is defined as:

  AVD = Effective Address of Load - Data Value of Load

- For some address loads, the AVD is stable.
- An AVD predictor keeps track of the AVDs of address loads.
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted.
- If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted (see the sketch below):

  Predicted Value = Effective Address - Predicted AVD

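A minimal sketch of the prediction step for an L2-miss load in runahead mode, assuming a hypothetical avd_lookup_confident helper that returns a stable AVD recorded for the load's PC:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical predictor lookup: true iff a confident AVD is recorded. */
    extern bool avd_lookup_confident(uint64_t pc, int64_t *avd);

    bool predict_l2_miss_load_value(uint64_t pc, uint64_t effective_addr,
                                    uint64_t *predicted_value) {
        int64_t avd;
        if (!avd_lookup_confident(pc, &avd))
            return false;                    /* no prediction: mark dest INV */
        /* Predicted Value = Effective Address - Predicted AVD */
        *predicted_value = effective_addr - (uint64_t)avd;
        return true;
    }
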
Why Do Stable AVDs Occur?

- Regularity in the way data structures are:
  - allocated in memory, AND
  - traversed
- Two types of loads can have stable AVDs:
  - Traversal address loads: produce addresses consumed by address loads
  - Leaf address loads: produce addresses consumed by data loads

Traversal Address Loads

Regularly-allocated linked list: nodes at addresses A, A+k, A+2k, A+3k, ...

A traversal address load loads the pointer to the next node:

  node = node->next

AVD = Effective Addr - Data Value:

  Effective Addr | Data Value | AVD
  A              | A+k        | -k
  A+k            | A+2k       | -k
  A+2k           | A+3k       | -k

The data value strides, but the AVD is stable. A runnable toy example follows.

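A toy C illustration of the stable AVD; all names are hypothetical. On a typical LP64 machine sizeof(struct node) is 16, so every traversal load prints an AVD of -16 (= -k):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct node { struct node *next; long payload; };

    int main(void) {
        /* One contiguous allocation: node i lives at base + i*k. */
        struct node *chunk = malloc(4 * sizeof *chunk);
        for (int i = 0; i < 3; i++) chunk[i].next = &chunk[i + 1];
        chunk[3].next = NULL;

        for (struct node *n = chunk; n->next != NULL; n = n->next) {
            /* The traversal load is n->next: effective address &n->next,
             * data value n->next. Their difference is the constant -k. */
            printf("AVD = %ld\n",
                   (long)((intptr_t)&n->next - (intptr_t)n->next));
        }
        free(chunk);
        return 0;
    }
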
Leaf Address Loads

Sorted dictionary in parser: nodes point to strings (words); the dictionary is looked up for an input word. Each string and its node are allocated consecutively, so a node at address A+k points to a string at address A.

A leaf address load loads the pointer to the string of each node:

  lookup (node, input) {
      // ...
      ptr_str = node->string;
      m = check_match(ptr_str, input);
      // ...
  }

AVD = Effective Addr - Data Value:

  Effective Addr | Data Value | AVD
  A+k            | A          | k
  C+k            | C          | k
  F+k            | F          | k

The data values do not stride, but the AVD is still stable.

Performance of AVD Prediction

[Figure: normalized execution time and executed instructions of runahead with a 16-entry AVD predictor on pointer-intensive benchmarks (bisort, health, mst, perimeter, treeadd, tsp, voronoi, mcf, parser, twolf, vpr, and AVG). On average, AVD prediction reduces execution time by 14.3% and executed instructions by 15.5% relative to runahead alone.]

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Summary of Contributions

- Runahead execution provides the latency tolerance benefit of a large instruction window by parallelizing independent cache misses
  - With a very modest increase in hardware cost and complexity
  - 128-entry window + runahead ~ 384-entry window
- Efficient runahead execution techniques improve the energy efficiency of baseline runahead execution
  - Only 6% extra instructions executed for 22% performance benefit
- Address-Value Delta (AVD) prediction enables the parallelization of dependent cache misses
  - By exploiting regular memory allocation patterns
  - A 16-entry (102-byte) AVD predictor improves the performance of runahead execution by 14% on pointer-intensive applications

Talk Outline

- Motivation: The Memory Latency Problem
- Runahead Execution
- Evaluation
- Limitations of the Baseline Runahead Mechanism
- Efficient Runahead Execution
- Address-Value Delta (AVD) Prediction
- Summary of Contributions
- Future Work

Future Work in Runahead Execution

- Compilation/programming techniques for runahead processors
- Keeping runahead execution on the correct program path
- Parallelizing dependent cache misses in linked data structure traversals
- Runahead co-processors/accelerators
- Evaluation of runahead execution on multithreaded and multiprocessor systems

Research Summary

- Runahead execution
  - Original runahead proposal [HPCA'03, IEEE Micro Top Picks'03]
  - Efficient runahead execution [ISCA'05, IEEE Micro Top Picks'06]
  - AVD prediction [MICRO'05]
  - Result reuse in runahead execution [Comp. Arch. Letters'05]
- High-performance memory system designs
  - Pollution-aware caching [IJPP'05]
  - Parallelism-aware caching [ISCA'06]
  - Performance analysis of speculative memory references [IEEE Trans. on Computers'05]
  - Latency/bandwidth tradeoffs in memory controllers [Patent'04]
- Branch instruction handling techniques through compiler-microarchitecture cooperation
  - Wish branches [MICRO'05 & IEEE Micro Top Picks'06]
  - Wrong path events [MICRO'04]
  - Compiler-assisted dynamic predication [in progress]
- Efficient compile-time profiling techniques for detecting input-dependent program behavior
  - 2D profiling [CGO'06]
- Fault-tolerant microarchitecture design
  - Microarchitecture-based introspection [DSN'05]

Thank you.
Backup Slides
Thesis Statement

Efficient runahead execution is a cost- and complexity-effective microarchitectural technique that can tolerate long main memory latencies without requiring:

- unreasonably large, slow, complex, and power-hungry hardware structures
- significant increases in processor complexity and power consumption

Impact of L2 Cache Misses

[Figure: normalized execution time broken into non-stall (compute) time and full-window stall time for 128-entry, 2048-entry, and infinite windows. 500-cycle DRAM latency, aggressive stream-based prefetcher. Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model.]

Entry into Runahead Mode

When an L2-miss load instruction is the oldest in the instruction window:

- The processor checkpoints the architectural register state.
- The processor records the address of the L2-miss load.
- The L2-miss load marks its destination register as INV (invalid) and is removed from the instruction window.

Exit from Runahead Mode

When the runahead-causing L2 miss is serviced:

- All instructions in the machine are flushed.
- INV bits are reset; the runahead cache is flushed.
- The processor restores the architectural state as it was before the runahead-causing instruction was fetched.
  - Architecturally, NOTHING happened.
  - But, hopefully, useful prefetch requests were generated (caches warmed up).
- The mode is switched back to normal mode.
  - Instructions executed in runahead mode are re-executed in normal mode (Load 1 is re-fetched and re-executed).

When to Enter Runahead Mode

- Why not at the time an L2 miss happens?
  - Not guaranteed to be a valid correct-path instruction until it becomes the oldest.
  - Limited potential (an L2-miss instruction becomes the oldest instruction 10 cycles later on average).
  - Need to checkpoint state at the oldest instruction (throw away all instructions older than the L2 miss?).
- Why not when the window becomes full?
  - Delays the removal of instructions from the window, which can result in slow progress in runahead mode.
  - No significant gain (the window becomes full 98% of the time after we see an L2 miss).
- Why not on L1 cache misses?
  - >50% of L1 cache misses hit in the L2 cache -> many short runahead periods.

When to Exit Runahead Mode

- Why not exit early to fill the pipeline and the window?
  - How do we determine "how early"? Memory does not have a fixed latency.
  - This reduces the progress made in runahead mode.
  - Exiting early using oracle information has lower performance than exiting when the miss returns.
- Why not exit late to make further progress in runahead?
  - Not necessarily beneficial to stay in runahead longer.
  - On average, this policy hurts performance.
  - But more intelligent runahead period extension schemes improve performance.

Modifications to Pipeline

[Diagram: the baseline x86 pipeline (trace cache, fetch unit, instruction decoder, uop queue, frontend/backend RATs, renamer, FP/INT/MEM schedulers and queues, FP/INT register files, execution and address-generation units, reorder buffer, store buffer, L1 data cache, prefetcher, L2 cache and access queues, front side bus) with the runahead additions highlighted: INV bits in the register files, store buffer, and scheduler; checkpointed architectural state; and the runahead cache alongside the L1 data cache. Total: <0.05% area overhead.]

Effect of a Better Front-end

[Figure: micro-operations per cycle for six configurations (base and runahead, each with the real front-end, a perfect trace cache (TC), and a perfect TC plus perfect branch predictor (BP)), across the benchmark suites. The percentage triples give runahead's gain under each front-end; the gain grows as the front-end improves.]

Why is Runahead Better with a Better Front-end?

- A better front-end provides more correct-path instructions (hence, more and more accurate L2 misses) in runahead periods.
- Average number of instructions during runahead: 711
  - before a mispredicted INV branch: 431
  - with perfect TC/BP, this average increases to 909
- Average number of L2 misses during runahead: 2.6
  - before a mispredicted INV branch: 2.38
  - with perfect TC/BP, this average increases to 3.18
- If all INV branches were resolved correctly during runahead, the performance gain would be 25% instead of 22%.

Importance of Store-Load Communication

[Figure: micro-operations per cycle for the baseline, runahead without communication through memory, and runahead with the runahead cache, across the benchmark suites. The percentages (up to 90% per suite) show how much performance the runahead cache recovers.]

In-order vs. Out-of-order

[Figure: micro-operations per cycle for in-order and out-of-order baselines, each with and without runahead, across the benchmark suites. Runahead improves both machines, with larger relative gains on the in-order machine (e.g., 73% vs. 16% on one suite).]

Sensitivity to L2 Cache Size

[Figure: micro-operations per cycle without and with runahead for 0.5 MB, 1 MB, and 4 MB L2 caches, across the benchmark suites. Runahead provides gains at all cache sizes.]

Instruction vs. Data Prefetching Benefits

[Figure: micro-operations per cycle for no runahead, runahead without its instruction prefetching benefit, and runahead with all benefits, across the benchmark suites. The percentages (55% to 97% per suite) show that most of runahead's benefit comes from data prefetching.]

Why Does Runahead Work?

- ~70% of instructions are VALID in runahead mode.
  - These values show periodic behavior.
- Runahead prefetching reduces L1, L2, and TC misses during normal mode.
- Data miss reduction:
  - 18% decrease in normal-mode L1 misses (base: 13.7 per 1K uops)
  - 33% of normal-mode L2 data misses are fully or partially covered by runahead prefetching (base L2 data miss rate: 4.3 per 1K uops)
  - 15% of normal-mode L2 data misses are fully covered (these misses are never seen in normal mode)
- Instruction miss reduction:
  - 3% decrease in normal-mode TC misses
  - 14% decrease in normal-mode L2 fetch misses (some of these are only partially covered by runahead requests)
- Overall increase in data misses:
  - L2 misses are increased by 5% (due to contention and useless prefetches)

Correlation Between L2 Miss Reduction and Speedup

[Scatter plot: % IPC increase vs. % of normal-mode L2 misses covered (fully or partially) by runahead prefetches, per benchmark. The correlation is positive but weak (R^2 = 0.3893).]

Runahead on a More Aggressive Processor

[Figure: micro-operations per cycle for a 4 MB L2, 256-entry window baseline with and without runahead, across the benchmark suites. Gains range from 1% to 27% per suite.]

Runahead on Future Model

Future Model with Perfect Front-end

Baseline Alpha Processor

- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- Aggressive stream-based prefetcher
- 32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
  - Detailed memory model

Runahead vs. Large Windows (Alpha)

[Figure: two plots of instructions per cycle vs. instruction window size (64 to 8192 entries) for the baseline and runahead Alpha machines, at 500-cycle (left) and 1000-cycle (right) memory latencies.]

In-order vs. Out-of-order Execution (Alpha)

[Figure: instructions per cycle vs. memory latency (100 to 1900 cycles) for out-of-order and in-order machines, each with and without runahead (OOO+RA, OOO, IO+RA, IO).]

Comparison to 1024-entry Window

Runahead vs. HWP (Alpha)

Effect of Memory Latency (Alpha)

1K & 2K Memory Latency (Alpha)

Efficient Runahead

Methods for Efficient Runahead Execution

- Eliminating inefficient runahead periods
- Increasing the usefulness of runahead periods
- Reuse of runahead results
- Value prediction of L2-miss load instructions
- Optimizing the exit policy from runahead mode

Impact on Efficiency

[Figure: increase in executed instructions and in IPC over the baseline out-of-order machine, for baseline runahead and for the short, overlapping, and useless period elimination techniques, individually and combined. Combining all three cuts the extra executed instructions from 26.5% to 6.7%, while the IPC increase changes modestly from 22.6% to 20.1%.]

Extra Instructions with Efficient Runahead

[Figure: increase in executed instructions vs. memory latency (100 to 900 cycles) for baseline runahead and all techniques. The efficiency techniques keep the increase low across all latencies.]

Performance Increase with Efficient Runahead

[Figure: increase in IPC vs. memory latency (100 to 900 cycles) for baseline runahead and all techniques.]

Cache Sizes (Executed Instructions)

[Figure: increase in executed instructions for 512 KB to 4 MB L2 caches, baseline runahead vs. all techniques.]

Cache Sizes (IPC Delta)

[Figure: increase in IPC for 512 KB to 4 MB L2 caches, baseline runahead vs. all techniques.]

Turning Off the FPU in Runahead Mode

- FP instructions do not contribute to the generation of load addresses.
- FP instructions can be dropped after decode:
  - Spares processor resources for more useful instructions
  - Increases performance by enabling faster progress
  - Enables dynamic/static energy savings
  - Results in an unresolvable branch misprediction if a mispredicted branch depends on an FP operation (rare)
- Overall: increases IPC and reduces executed instructions.

HWP Update Policy in Runahead Mode

- Aggressive hardware prefetching in runahead mode may hurt performance if the prefetcher accuracy is low.
- Runahead requests are more accurate than prefetcher requests.
- Three policies:
  - Do not update the prefetcher state
  - Update the prefetcher state just as in normal mode
  - Only train existing streams, but do not create new streams
- Runahead mode improves the timeliness of the prefetcher in many benchmarks.
- Only training the existing streams is the best policy.

Early INV Wake-up

- Keep track of the INV status of an instruction in the scheduler.
- The scheduler wakes up the instruction if any source is INV.
  + Enables faster progress during runahead mode by removing the useless INV instructions faster.
  - Increases the number of executed instructions.
  - Increases the complexity of the scheduling logic.
- Not worth implementing due to the small IPC gain.

Short Runahead Periods

RCST Counter

Sampling for Useless Periods

Efficiency Techniques Extra Instructions

Efficiency Techniques IPC Increase

Effect of Memory Latency on Efficient Runahead

Usefulness of Runahead Periods

L2 Misses Per Useful Runahead Periods

Other Considerations for Efficient Runahead

Performance Potential of Result Reuse

- Ideal reuse study:
  - To determine the upper bound on the performance gain possible by reusing results of runahead instructions
- Valid pseudo-retired runahead instructions magically update architectural state during normal mode.
  - They do not consume any resources (ROB or buffer entries).
- Only invalid pseudo-retired runahead instructions are re-executed.
  - They are fed into the renamer (the fetch/decode pipeline is skipped).

Ideal Reuse of All Valid Runahead Results

[Figure: micro-operations per cycle for the baseline, runahead, and runahead with ideal reuse, across the benchmark suites. Even ideal reuse adds only 2% to 8% on top of runahead.]

Alpha Reuse IPC's

Number of Reused Instructions

Why Does Reuse Not Work?

IPC Increase with Reuse

Extra Instructions with Reuse

Runahead Period Statistics

Mem Latency and BP Accuracy in Reuse

Runahead + VP: Extra Instructions

Runahead + VP: IPC Increase

Late Exit from Runahead (Extra Inst.)

Late Exit from Runahead (IPC Increase)

AVD Prediction and Optimizations

Identifying Address Loads in Hardware

- Observation: if the AVD is too large, the value being loaded is likely NOT an address.
- Only predict loads that have satisfied:

  -MaxAVD < AVD < +MaxAVD

- This identification mechanism eliminates almost all data loads from consideration (sketched below).
  - Enables the AVD predictor to be small.

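A minimal sketch of the filter; the MaxAVD value here is an assumption, and the mechanism only requires that it be small relative to typical data values:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_AVD (64 * 1024)   /* example bound */

    /* Loads whose AVD falls outside (-MaxAVD, +MaxAVD) are treated as
     * data loads and are never allocated a predictor entry. */
    bool is_candidate_address_load(uint64_t effective_addr, uint64_t data_value) {
        int64_t avd = (int64_t)(effective_addr - data_value);
        return avd > -MAX_AVD && avd < MAX_AVD;
    }
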
An Implementable AVD Predictor

- Set-associative prediction table.
- A prediction table entry consists of:
  - Tag (PC of the load)
  - Last AVD seen for the load
  - Confidence counter for the recorded AVD
- Updated when an address load is retired in normal mode.
- Accessed when a load misses in the L2 cache in runahead mode.
- Recovery-free: no need to recover the state of the processor or the predictor on a misprediction.
  - Runahead mode is purely speculative.

AVD Update Logic / AVD Prediction Logic

[Diagrams: the update path computes AVD = Effective Addr - Data Value for each retired load, checks -MaxAVD <= AVD <= MaxAVD for validity, and drives confidence update/reset logic on the (Tag, Conf, AVD) entry indexed by the retired load's PC. The prediction path tag-matches the program counter of an L2-miss load and, on a confident hit, returns Predicted Value = Effective Addr - AVD; otherwise the destination stays INV. A C sketch of both paths follows.]

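A minimal C sketch of both paths, following the slide's entry fields (tag, last AVD, confidence); the table size, direct mapping, and confidence threshold here are assumptions made for brevity:

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES  16
    #define CONF_MAX 3
    #define MAX_AVD  (64 * 1024)

    typedef struct { uint64_t tag; int64_t avd; uint8_t conf; } avd_entry_t;
    static avd_entry_t table[ENTRIES];   /* direct-mapped stand-in */

    /* Update path: called for each load retired in normal mode. */
    void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t value) {
        int64_t avd = (int64_t)(eff_addr - value);
        if (avd <= -MAX_AVD || avd >= MAX_AVD) return;  /* not an address load */
        avd_entry_t *e = &table[pc % ENTRIES];
        if (e->tag == pc && e->avd == avd) {
            if (e->conf < CONF_MAX) e->conf++;          /* same delta again */
        } else {
            e->tag = pc; e->avd = avd; e->conf = 0;     /* new load or new delta */
        }
    }

    /* Prediction path: consulted for an L2-miss load in runahead mode. */
    bool avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *value) {
        avd_entry_t *e = &table[pc % ENTRIES];
        if (e->tag != pc || e->conf < CONF_MAX) return false;  /* stay INV */
        *value = eff_addr - (uint64_t)e->avd;
        return true;
    }

Because runahead mode is purely speculative, a wrong prediction needs no recovery: the checkpoint restore at runahead exit discards it.
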
Properties of Traversal-based AVDs

- Stable AVDs can be captured with a stride value predictor.
- Stable AVDs disappear with the re-organization of the data structure (e.g., sorting): after sorting, the distance between nodes is NOT constant.
- The stability of AVDs is dependent on the behavior of the memory allocator:
  - Allocation of contiguous, fixed-size chunks is useful.

Properties of Leaf-based AVDs

- Stable AVDs cannot be captured with a stride value predictor.
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting): after sorting, the distance between each node and its string is still constant.
- The stability of AVDs is dependent on the behavior of the memory allocator.

AVD Prediction vs. Stride Value Prediction

- Performance:
  - Both can capture traversal address loads with stable AVDs (e.g., treeadd)
  - Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser)
  - The AVD predictor cannot capture data loads with striding data values
    - Predicting these can be useful for the correct resolution of mispredicted L2-miss dependent branches, e.g., parser
- Complexity:
  - The AVD predictor requires far fewer entries (only address loads)
  - AVD prediction logic is simpler (no stride maintenance)

AVD vs. Stride VP Performance

[Figure: normalized execution time with AVD, stride, and hybrid value predictors at 16 and 4096 entries. Execution-time reductions range from 2.5% to 16% across the configurations; the 16-entry AVD predictor achieves 12.1%, well ahead of a 16-entry stride predictor.]

AVD vs. Stride VP Performance (excluding health)

[Figure: the same comparison with health excluded; execution-time reductions range from 2.7% to 8.6% across the predictor configurations.]

AVD vs. Stream Prefetching Performance

[Figure: normalized execution time for AVD prediction, stream prefetching, and AVD + stream, at prefetch distances 32 and 8. AVD prediction alone reduces execution time by 12.1% at both distances; combined with stream prefetching, the reduction reaches up to 22.5%.]

AVD vs. Stream Pref. (L2 bandwidth)

[Figure: normalized number of L2 accesses at prefetch distances 32 and 8. AVD prediction increases L2 accesses by only 5.1%, while the stream-prefetching configurations increase them by 24.5% to 35.3%.]

AVD vs. Stream Pref. (Mem bandwidth)

[Figure: normalized number of memory accesses at prefetch distances 32 and 8. AVD prediction increases memory accesses by only 3.2%, while the stream-prefetching configurations increase them by 12.1% to 19.5%.]

Source Code Optimization for AVD Prediction

(a) Base source code that allocates the nodes of the dictionary (binary tree) and the strings:

  struct Dict_node {
      char *string;
      Dict_node *left, *right;
      // ...
  };

  char *get_a_word(...) {
      // read a word from file
      s = (char *) xalloc(strlen(word) + 1);   // string allocated FIRST
      strcpy(s, word);
      return s;
  }

  Dict_node *read_word_file(...) {
      // ...
      char *s; Dict_node *dn;
      while ((s = get_a_word(...)) != NULL) {
          dn = (Dict_node *) xalloc(sizeof(Dict_node));   // Dict_node allocated NEXT
          dn->string = s;
          // ...
      }
      return dn;
  }

(b) Modified source code (modified lines are marked with comments):

  struct Dict_node {
      char *string;
      Dict_node *left, *right;
      // ...
  };

  char *get_a_word(..., Dict_node **dn) {
      // read a word from file
      *dn = (Dict_node *) xalloc(sizeof(Dict_node));   // Dict_node allocated FIRST
      s = (char *) xalloc(strlen(word) + 1);           // string allocated NEXT
      strcpy(s, word);
      return s;
  }

  Dict_node *read_word_file(...) {
      // ...
      char *s; Dict_node *dn;
      while ((s = get_a_word(..., &dn)) != NULL) {
          dn->string = s;
          // ...
      }
      return dn;
  }

Effect of Code Optimization (parser)

[Figure: normalized execution time of parser for the base and modified binaries, each with runahead and with a 16-entry AVD predictor. The 16-entry AVD predictor's execution-time reduction grows from 6.4% on the base binary to 10.5% on the modified binary.]

Accuracy/Coverage w/ Code Optimization

[Figure: AVD prediction coverage and accuracy (%) for the base and modified binaries.]

AVD and Efficiency Techniques

[Figure: normalized execution time for no runahead, runahead, AVD (16-entry), efficiency techniques, and AVD + efficiency, on the pointer-intensive benchmarks (bisort, health, mst, perimeter, treeadd, tsp, voronoi, mcf, parser, twolf, vpr, AVG).]

AVD and Efficiency Techniques

[Figure: normalized number of executed instructions for the same five configurations on the same pointer-intensive benchmarks.]

Effect of AVD on Runahead Periods

AVD Example from treeadd

AVD Example from parser

AVD Example from health

Motivation for NULL-value Optimization

Effect of NULL-value Optimization

Related Work

Methodology

Future Research Directions