Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
Onur Mutlu
Yale N. Patt
Jared Stark
Chris Wilkerson
Outline
• Motivation
• Overview
• Mechanism
• Experimental Evaluation
• Conclusions
Motivation
• Out-of-order processors require very large instruction windows to tolerate today’s main memory latencies.
– Even in the presence of caches and prefetchers
• As main memory latency (in terms of processor cycles) increases, the instruction window size should also increase to fully tolerate the memory latency.
• Building a very large instruction window is not an easy task.
Small Windows: Full-window Stalls
• Instructions are retired in-order from the instruction window to support precise exceptions.
• When a very long-latency instruction is not yet complete, it blocks retirement, and incoming instructions fill the instruction window if the window is not large enough.
• The processor cannot place new instructions into the window once it is full. This is called a full-window stall.
• L2 misses are responsible for most full-window stalls.
Impact of Full-window Stalls
[Chart: percentage of cycles spent in full-window stalls for three machine models (L2 size, instruction window size). The 512KB L2, 128-entry window baseline achieves an IPC of 0.77; a 512KB L2 with a 2048-entry window reaches an IPC of 1.15; a perfect L2 with a 128-entry window reaches an IPC of 1.69.]
Overview of Runahead Execution
• During a significant percentage of full-window stall cycles, no work is performed in the processor.
• Runahead execution avoids the full-window stall caused by a long-latency L2-miss instruction.
• Enter runahead mode when the oldest instruction is an L2-miss load, and remove that load from the processor.
• While in runahead mode, keep processing instructions without updating architectural state and without blocking the instruction window due to L2 misses.
• When the original load miss returns, resume normal-mode execution starting with the runahead-causing load.
Benefits of Runahead Execution
• Loads and stores independent of L2-miss instructions generate useful prefetch requests:
– From main memory to L2
– From L2 to L1
• Instructions on the predicted program path are prefetched into the trace cache and L2.
• Hardware prefetcher tables are trained using future memory access information. The prefetcher also runs ahead along with the processor.
Mechanism
Entry into Runahead Mode
When an L2-miss load instruction reaches the head of the instruction window:
• Processor checkpoints the architectural register state, branch history register, and return address stack.
• Processor records the address of the L2-miss load.
• Processor enters runahead mode.
• L2-miss load marks its destination register as invalid and is removed from the instruction window (see the sketch below).
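A minimal behavioral sketch (in Python) of this entry sequence. All class, field, and method names (`Checkpoint`, `enter_runahead`, the `cpu` attributes) are illustrative, not taken from the paper's hardware description.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Checkpoint:
    """State saved on entry into runahead mode (illustrative fields)."""
    arch_regs: List[int]           # architectural register file
    branch_history: int            # branch history register
    return_addr_stack: List[int]   # return address stack
    runahead_load_pc: int          # address of the runahead-causing L2-miss load

def enter_runahead(cpu, l2_miss_load) -> None:
    """Invoked when an L2-miss load reaches the head of the instruction window."""
    cpu.checkpoint = Checkpoint(
        arch_regs=list(cpu.arch_regs),
        branch_history=cpu.branch_history,
        return_addr_stack=list(cpu.return_addr_stack),
        runahead_load_pc=l2_miss_load.pc,
    )
    cpu.mode = "runahead"
    cpu.reg_inv[l2_miss_load.dest_reg] = True   # destination register marked INV
    cpu.window.remove(l2_miss_load)             # load leaves the instruction window
```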
Processing in Runahead Mode
• Two types of results are produced: INV (invalid) and VALID.
• The first INV result is produced by the L2-miss load that caused entry into runahead mode.
• An instruction produces an INV result
– if it sources an INV result
– if it misses in the L2 cache (a prefetch request is generated; see the sketch below)
• INV results are marked using INV bits in the register file, store buffer, and runahead cache.
– INV bits prevent introduction of bogus data into the pipeline.
– Bogus values are not used for prefetching or branch resolution.
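A sketch of how a runahead result could be classified as INV or VALID, following the two rules above. The `cpu` model and the trivial `compute` stub are assumptions made for this illustration.

```python
def compute(cpu, inst):
    """Placeholder for normal execution of a VALID instruction."""
    return 0

def execute_runahead(cpu, inst) -> bool:
    """Return True if the instruction produces an INV result."""
    # Rule 1: sourcing any INV register makes the result INV.
    if any(cpu.reg_inv[r] for r in inst.src_regs):
        result_inv = True
    # Rule 2: a load that misses in the L2 is INV; the miss is still
    # sent to memory as a prefetch request.
    elif inst.is_load and not cpu.l2.hit(inst.address):
        cpu.memory.prefetch(inst.address)
        result_inv = True
    else:
        inst.value = compute(cpu, inst)
        result_inv = False
    if inst.dest_reg is not None:
        # The INV bit keeps bogus data out of prefetching and branch resolution.
        cpu.reg_inv[inst.dest_reg] = result_inv
    return result_inv
```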
Pseudo-retirement in Runahead Mode
• An instruction is examined for pseudo-retirement when it reaches the head of the instruction window.
• An INV instruction is removed from the window immediately.
• A VALID instruction is removed when it completes execution; it updates only the microarchitectural state.
• Pseudo-retired instructions free their allocated resources.
• Pseudo-retired runahead stores communicate their data and INV status to dependent runahead loads (see the sketch below).
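A sketch of pseudo-retirement at the head of the window. The `runahead_cache.write` call (the structure is described on the next slide) and all names are illustrative.

```python
def pseudo_retire_head(cpu) -> None:
    """Pseudo-retire the head instruction once it is INV or has completed.

    Nothing here touches architectural state; only microarchitectural
    structures (window entries, runahead cache) are updated.
    """
    inst = cpu.window.head()
    if inst.is_store:
        # Stores pass their data (or their INV status) to dependent runahead loads.
        cpu.runahead_cache.write(inst.address,
                                 value=None if inst.is_inv else inst.value,
                                 inv=inst.is_inv)
    cpu.window.pop_head()   # frees the instruction's allocated resources
```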
Runahead Cache
• An auxiliary structure that holds the data values and INV bits for memory locations modified by pseudo-retired runahead stores.
• Its purpose is memory data communication during runahead mode.
• Runahead loads access the store buffer, runahead cache, and L1 data cache in parallel (see the sketch below).
• The runahead cache is very small (512 bytes).
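A toy model of the runahead cache and of the lookup performed by runahead loads. The probes happen in parallel in hardware but are modelled sequentially here; all structure names are illustrative.

```python
class RunaheadCache:
    """Tiny (~512-byte) structure holding data and INV bits written by
    pseudo-retired runahead stores. A dictionary stands in for the real array."""
    def __init__(self):
        self.lines = {}                       # address -> (value, inv_bit)

    def write(self, address, value, inv):
        self.lines[address] = (value, inv)

    def read(self, address):
        return self.lines.get(address)        # None on miss

def runahead_load(cpu, address):
    """Probe store buffer, runahead cache, and L1 data cache (conceptually
    in parallel) and return (value, inv_bit)."""
    for source in (cpu.store_buffer.lookup,
                   cpu.runahead_cache.read,
                   cpu.l1_data.read):
        hit = source(address)
        if hit is not None:
            return hit                        # (value, inv_bit)
    # Missed all three structures: the access continues to the L2 (not
    # modelled here); per the previous slide, an L2 miss makes the result INV.
    return None, True
```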
Runahead Branches
• Runahead branches use the same predictor as normal branches.
• VALID branches are resolved and trigger recovery if mispredicted.
• INV branches cannot be resolved (see the sketch below).
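A sketch of branch resolution in runahead mode under the rules above. The recovery helpers (`flush_after`, the branch fields) are assumptions of this illustrative model.

```python
def resolve_runahead_branch(cpu, branch, predicted_taken: bool) -> None:
    """VALID branches resolve and recover normally; INV branches cannot be
    resolved, so execution stays on the predicted path."""
    if branch.is_inv:
        return                                 # outcome unknown: trust the prediction
    actual_taken = branch.outcome              # computed from VALID sources
    if actual_taken != predicted_taken:
        cpu.flush_after(branch)                # ordinary misprediction recovery
        cpu.fetch_pc = branch.target if actual_taken else branch.fallthrough_pc
```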
Exit from Runahead Mode
When the data for the instruction that caused entry into runahead mode returns from main memory:
• All instructions in the machine are flushed.
• INV bits are reset. The runahead cache is flushed.
• Processor restores the state as it was before the runahead-inducing instruction was fetched.
• Processor starts fetch beginning with the runahead-inducing L2-miss instruction (see the sketch below).
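A sketch of the exit sequence, restoring the checkpoint taken on entry. Names match the earlier illustrative sketches, not the paper's hardware.

```python
def exit_runahead(cpu) -> None:
    """Invoked when the data of the runahead-causing load returns from memory."""
    cpu.pipeline.flush_all()                        # discard all runahead instructions
    cpu.reg_inv = [False] * len(cpu.reg_inv)        # reset INV bits
    cpu.runahead_cache.lines.clear()                # flush the runahead cache
    ckpt = cpu.checkpoint                           # restore pre-runahead state
    cpu.arch_regs = list(ckpt.arch_regs)
    cpu.branch_history = ckpt.branch_history
    cpu.return_addr_stack = list(ckpt.return_addr_stack)
    cpu.mode = "normal"
    cpu.fetch_pc = ckpt.runahead_load_pc            # refetch starting at the L2-miss load
```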
Experimental Evaluation
Baseline Processor
• 3-wide fetch, 29-stage pipeline
• 128-entry instruction window
• 32 KB, 8-way, 3-cycle L1 data cache, write-back
• 512 KB, 8-way, 16-cycle L2 unified cache, write-back
• Approximately 500-cycle penalty for L2 misses
• Streaming prefetcher (16 streams)
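The same baseline parameters, collected into a small configuration sketch; the dictionary keys are illustrative, and the 500-cycle L2 miss penalty is approximate, as stated above.

```python
# Baseline machine model used in the evaluation (values from the slide above).
BASELINE_CONFIG = {
    "fetch_width": 3,                       # instructions per cycle
    "pipeline_stages": 29,
    "instruction_window_entries": 128,
    "l1_data_cache": {"size_kb": 32, "assoc": 8, "latency_cycles": 3,
                      "write_policy": "write-back"},
    "l2_unified_cache": {"size_kb": 512, "assoc": 8, "latency_cycles": 16,
                         "write_policy": "write-back"},
    "l2_miss_penalty_cycles": 500,          # approximate
    "prefetcher": {"type": "streaming", "streams": 16},
}
```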
Benchmarks
• Selected traces out of a pool of 280 traces
• Evaluated performance on those that gain at least 10% IPC improvement with a perfect L2 cache
• 147 traces simulated for 30 million x86 instructions
• Trace Suites
– SPEC CPU95 (S95): 10 benchmarks, mostly FP
– SPEC FP2k (FP00): 11 benchmarks
– SPECint2k (INT00): 6 benchmarks
– Internet (WEB): 18 benchmarks: SpecJbb, Webmark2001
– Multimedia (MM): 9 benchmarks: mpeg, speech rec., quake
– Productivity (PROD): 17 benchmarks: Sysmark2k, winstone
– Server (SERV): 2 benchmarks: tpcc, timesten
– Workstation (WS): 7 benchmarks: CAD, nastran, verilog
Performance of Runahead Execution
[Chart: IPC for four configurations (no prefetcher/no runahead, prefetcher only, runahead only, runahead and prefetcher) across the trace suites S95, FP00, INT00, WEB, MM, PROD, SERV, WS and their average. Per-suite improvement labels range from 12% to 52%; on average, runahead plus the prefetcher improves IPC by 22% over the prefetcher-only baseline.]
Effect of Frontend on Runahead
• Average number of instructions during runahead: 711
– Before a mispredicted INV branch: 431
• Average number of L2 misses during runahead: 2.6
– Before a mispredicted INV branch: 2.38
• Runahead becomes more effective with a better frontend:
– Real trace cache: 22% IPC improvement
– Perfect trace cache: 27% IPC improvement
– Perfect branch predictor and trace cache: 31% IPC improvement
Runahead vs. Large Windows
[Chart: IPC of a 128-entry window with runahead compared to 256-, 384-, and 512-entry windows without runahead, for each trace suite and the average; bar labels range from -6% to 12%.]
Importance of Store-Load Data Communication
[Chart: IPC of the baseline, runahead without the runahead cache, and runahead with the runahead cache, for each trace suite and the average; bar labels range from 3% to 23%.]
Conclusions
• Runahead execution results in a 22% IPC increase over the baseline processor with a 128-entry window and a streaming prefetcher.
• This is within 1% of the IPC of a 384-entry window machine.
• Runahead and the streaming prefetcher interact positively.
• Store-load data communication through memory in runahead mode is vital for high performance.
Backup Slides
Added Bits & Structures
Runahead-Prefetcher Interaction
[Chart: IPC for the four configurations (no prefetcher/no runahead, prefetcher only, runahead only, runahead and prefetcher) on individual benchmarks (tpcc, verilog, nastran, gcc, vortex, mgrid, apsi, ammp); bar labels range from 3% to 82%.]
Effect of a Better Frontend
[Chart: IPC of the baseline and runahead with a perfect trace cache, and with a perfect trace cache plus a perfect branch predictor, for each trace suite and the average; improvement labels range from 8% to 89%.]
When do we see the L2 misses in Runahead?
[Chart: L2 data miss distances — cycles and instructions after the runahead-causing L2 miss at which the nth out-of-window miss (n = 1 to 16) occurs; y-axis from 0 to 850.]
Where does the first L2 miss occur?
[Chart: distance of the first out-of-window L2 miss in runahead mode, measured in instructions from the runahead-causing L2 miss and binned from 128-256 up to 1536+; y-axis: percentage of first out-of-window L2 misses, 0% to 45%.]
Instruction vs. Data Prefetching Benefit
[Chart: IPC of the baseline, runahead with no instruction-prefetching benefit, and runahead with all benefits, for each trace suite and the average; bar labels range from 55% to 97%.]
Runahead with a Larger L2 Cache
[Chart: IPC with and without runahead for 0.5 MB, 1 MB, and 4 MB L2 caches, for each trace suite and the average; improvement labels range from 6% to 52%.]
Runahead on in-order?
[Chart: IPC of an in-order baseline with and without runahead, compared to the out-of-order baseline with and without runahead, for each trace suite and the average; improvement labels range from 12% to 74%.]
Future Model Results
Future Model with Better Frontend