Value Prediction:
Are(n’t) We Done Yet?
Mikko Lipasti
University of Wisconsin-Madison
Definition

What is value prediction? Broadly, three salient attributes:
1. Generate a speculative value (predict)
2. Consume speculative value (execute)
3. Verify speculative value (compare/recover)

- This subsumes branch prediction
- Focus here on operand values

Keynote, 2nd Value Prediction Workshop, Oct. 10, 2004
Some History

"Classical" value prediction was independently invented by 4 groups in 1995-1996:
1. AMD (NexGen): L. Widigen and E. Sowadsky; patent filed March 1996, invented March 1995
2. Technion: F. Gabbay and A. Mendelson; invented sometime 1995, TR 11/96, US patent Sep 1997
3. CMU: M. Lipasti, C. Wilkerson, J. Shen; invented Oct. 1995, ASPLOS paper submitted March 1996
4. Wisconsin: Y. Sazeides, J. Smith; Summer 1996
Why?

Possible explanations:
1. Natural evolution from branch prediction
2. Natural evolution from memoization
3. Natural evolution from rampant speculation
   - Cache hit speculation
   - Memory independence speculation
   - Speculative address generation
4. Improvements in tracing/simulation technology
   - "There's a lot of zeroes out there." (C. Wilkerson)
   - Values, not just instructions & addresses
   - TRIP6000 [A. Martin-de-Nicolas, IBM]
Publications by Year

[Chart: cumulative publications per year, 1996-2004, broken down by ISCA, MICRO, HPCA, others, and total; the total rises from 0 toward roughly 60-70.]

Excludes journals, workshops, compiler conferences
What Happened?

Tremendous academic interest
- Dozens of research groups, papers, proposals

No industry uptake
- No present or planned CPU with value prediction

Why?
- Meager performance benefit (< 10%)
- Power consumption
  - Dynamic power for extra activity
  - Static power (area) for prediction tables
- Complexity and correctness
  - Subtle memory ordering issues [MICRO '01]
  - Misprediction recovery [HPCA '04]
Performance?

Relationship between timely fetch and value prediction benefit [Gabbay, ISCA]
- Value prediction doesn't help when the result can be computed before the consumer instruction is fetched
- High-bandwidth fetch helps
  - Wide trace caches studied in late 1990s
  - But, these have several negative attributes
  - Recent designs focus on frequency, not ILP
- High-bandwidth fetch is a red herring
  - More important to fetch the right instructions
Future Adoption?

Classical value prediction will only make it in the context of a very different microarchitecture
- One that explicitly and aggressively exposes ILP

Promising trends
- Deep pipelining craze appears to be over
  - Can't manage the design complexity
- High-frequency mania appears to be over
  - Can't afford the power
- Architects are pursuing ILP once again
  - Value prediction has another opportunity
What Value Prediction Begat

Value prediction catalyzed a new focus on values in computation
- This had not been studied before
- A whole new realm of research: Value-Aware Microarchitecture
- Spans numerous subdisciplines
- Significant industrial impact already
- Also, developments in supporting technologies
Value-Aware Microarchitecture

Memory Hierarchy
- Register File Compression [several]
- Cache Compression [Gupta, Alameldeen]
- Memory Compression [e.g. IBM MXT]
- Bandwidth compression
  - Address and data bus encoding [Rudolph]
  - Initialization Traffic [Lewis]

Load/Store Processing
- Load value prediction [numerous]
- Fast address calculation [Austin]
- Value-aware alias prediction [Onder]
- Memory consistency [Cain]

Execution Core
- Value Prediction
- Operand Significance
  - Low Power [Canal]
  - Execution bandwidth [Loh]
  - Bit-slicing [Pentium 4, Mestan]
- Instruction reuse [Sodani]
- Carry prediction [Circuit-level Speculation]

Cache Coherence
- Producer-side
  - Silent stores, temporally silent stores [Lepak]
  - Speculative lock elision [Rajwar; Wisc, UIUC]
- Consumer-side
  - Load value prediction using stale lines [Lepak]
  - "Coherence decoupling" [Burger, Sohi, ASPLOS '04]
Supporting Technologies

Value prediction presented some unique challenges:
- Relatively low correct prediction rate (initially 40-50%)
- Nontrivial misprediction rate with avoidable misprediction cost

These drove study of:
- Confidence prediction/estimation
  - First microarchitectural application of confidence estimation, though not widely credited or cited as such
  - Since studied for numerous applications, e.g. gating control speculation
- Selective recovery [Sazeides Ph.D., Kim HPCA '04]
  - Numerous challenges in extending recovery to entire window

Both have proved to be fruitful research areas

Also stimulated development of software technology:
- Value profiling
- Value-based compiler optimizations
- Run-time code specialization
Outline

- Some History
- Industry Trends
- Value-Aware Microarchitecture
- Case study: Memory Consistency [Trey Cain, ISCA 2004]
  - Conventional load queue microarchitecture
  - Value-based memory ordering
  - Replay-reduction heuristics
  - Performance evaluation
- Conclusions
Value-based Memory Consistency

High ILP => large instruction windows
- Larger physical register file
- Larger scheduler
- Larger load/store queues
- These result in increased access latency

Value-based replay
- If load queue scalability is a problem... who needs one!
- Instead, re-execute load instructions a 2nd time in program order
- Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average
Enforcing RAW dependences

Program order (execution order in parentheses):
1. (1) store A
2. (3) store ?
3. (2) load A

- Load queue contains load addresses
- Memory independence speculation
  - Hoist load above unknown store, assuming it is to a different address
- Check correctness at store retirement
  - One search per store address calculation
  - If address matches, the load is squashed
Enforcing memory consistency

Processor p1:          Processor p2:
1. (3) load A  (raw)   1. (2) store A
2. (1) load A  (war)

Two approaches:
- Snooping: search per incoming invalidate
- Insulated: search per load address calculation
Load queue implementation

[Diagram: an address CAM plus load meta-data RAM holding load address and load age; queue-management inputs, and squash-determination inputs (external request, external address, store address, store age).]

- # of write ports = load address calc width
- # of read ports = load+store address calc width (+ 1)
- Current-generation designs: 32-48 entries, 2 write ports, 2 (3) read ports
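As a rough software analogy (not the hardware CAM itself), the per-store squash search can be sketched in Python; the entry fields and helper names below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class LoadEntry:
    address: int   # load address held in the CAM
    age: int       # program-order age tag

class LoadQueue:
    def __init__(self):
        self.entries = []

    def insert(self, address, age):
        self.entries.append(LoadEntry(address, age))

    def search(self, store_address, store_age):
        """One search per store address calculation: any *younger*
        load to the same address was hoisted past this store under
        memory independence speculation and must be squashed."""
        return [e for e in self.entries
                if e.address == store_address and e.age > store_age]

lq = LoadQueue()
lq.insert(address=0x40, age=3)   # load A, speculatively hoisted above the store
lq.insert(address=0x80, age=4)   # load to a different address: unaffected
conflicts = lq.search(store_address=0x40, store_age=1)
assert [e.age for e in conflicts] == [3]
```

Every write port corresponds to a load address calculation inserting an entry; every read port corresponds to one of these associative searches, which is why the structure scales poorly.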
Load queue scaling

Larger instruction window => larger load queue
- Increases access latency
- Increases energy consumption

Wider issue width => more read/write ports
- Also increases latency and energy
Related work: MICRO 2003

Park et al., Purdue
- Extra structure dedicated to enforcing memory consistency
- Increase capacity through segmentation

Sethumadhavan et al., UT-Austin
- Add set of filters summarizing contents of load queue
Keep it simple…

Throw more hardware at the problem?
- Need to design/implement/verify
- Execution core is already complicated

Load queue checks for rare errors
- Why not move error checking away from exe?
Value-based Consistency

Pipeline stages: IF1, IF2, D, R, Q, S, …, EX, WB, REP, CMP, C

Replay: access the cache a second time, cheaply!
- Almost always a cache hit
- Reuse address calculation and translation
- Share cache port used by stores in commit stage

Compare: compare the new value to the original value
- Squash if the values differ

This is value prediction!
- Predict: access cache prematurely
- Execute: as usual
- Verify: replay load, compare value, recover if necessary
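A minimal software sketch of the verify step, assuming toy `Cache` and `Load` classes invented here for illustration (the real mechanism is a pipeline stage, not code):

```python
class Cache:
    """Trivial stand-in for the L1 D-cache at commit."""
    def __init__(self, mem):
        self.mem = dict(mem)
    def read(self, addr):
        return self.mem[addr]

class Load:
    def __init__(self, address, speculative_value):
        self.address = address
        self.speculative_value = speculative_value   # value obtained out of order

def replay_and_check(load, cache):
    """At commit, in program order: re-access the cache (REP) and
    compare against the speculatively obtained value (CMP). A mismatch
    means the early access was wrong; squash and recover, exactly as
    for any other value misprediction."""
    replay_value = cache.read(load.address)          # cheap second access
    return "commit" if replay_value == load.speculative_value else "squash"

cache = Cache({0x40: 7})
assert replay_and_check(Load(0x40, 7), cache) == "commit"   # value unchanged
assert replay_and_check(Load(0x40, 9), cache) == "squash"   # an intervening store changed it
```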
Rules of replay

1. All prior stores must have written data to the cache
   - No store-to-load forwarding
2. Loads must replay in program order
3. If a load is squashed, it should not be replayed a second time
   - Ensures forward progress
Replay reduction

Replay costs:
- Consumes cache bandwidth (and power)
- Increases reorder buffer occupancy

Can we avoid these penalties?
- Infer correctness of certain operations
- Four replay filters
  - These are used to avoid checking our value prediction when in fact no value prediction occurred (loaded value is known to be correct)
  - Similar to "constant prediction" in initial work
No-Reorder filter

- Avoid replay if load isn't reordered wrt other memory operations
- Can we do better?
Enforcing single-thread RAW dependences

No-Unresolved-Store-Address Filter
- Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
- Works for intra-processor RAW dependences
- Doesn't enforce memory consistency
Enforcing MP consistency

No-Recent-Miss Filter
- Avoid replay if there have been no cache line fills (to any address) while the load is in the instruction window

No-Recent-Snoop Filter
- Avoid replay if there have been no external invalidates (to any address) while the load is in the instruction window
Constraint graph

- Defined for sequential consistency by Landin et al., ISCA-18
- Directed graph represents a multithreaded execution
  - Nodes represent dynamic instruction instances
  - Edges represent their transitive orders (program order, RAW, WAW, WAR)
- If the constraint graph is acyclic, then the execution is correct
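The acyclicity test is ordinary graph cycle detection; a DFS sketch in Python, with an edge set loosely modeled on the two-processor example on the following slides (the exact edge directions here are illustrative: each load is assumed to have observed the pre-store value):

```python
def has_cycle(edges):
    """DFS with three colors: white = unvisited, gray = on the current
    DFS stack, black = finished. A gray->gray edge is a back edge,
    i.e. the constraint graph has a cycle."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# Proc 1: ST A then LD B; Proc 2: ST B then LD A.
# Each load observed the old value, ordering it before the other's store.
edges = [("ST A", "LD B"),   # program order on Proc 1
         ("ST B", "LD A"),   # program order on Proc 2
         ("LD B", "ST B"),   # LD B precedes ST B
         ("LD A", "ST A")]   # LD A precedes ST A
assert has_cycle(edges)      # cycle => execution is incorrect
```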
Constraint graph example - SC

[Diagram: Proc 1 executes ST A then LD B; Proc 2 executes ST B then LD A (execution order 1-4 shown). Program-order edges on each processor, plus a WAR edge and a RAW edge between the processors, close a loop. The cycle indicates that the execution is incorrect.]
Anatomy of a cycle

[Diagram: the same two-processor example, annotated with the events that create the cycle: an incoming invalidate at ST A on Proc 1 (WAR edge) and a cache miss at ST B on Proc 2 (RAW edge).]
Filter Summary

From conservative to aggressive:
1. Replay all committed loads
2. No-Reorder Filter
3. No-Unresolved-Store / No-Recent-Miss Filter
4. No-Unresolved-Store / No-Recent-Snoop Filter
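The most aggressive combination can be sketched as a predicate over per-load bookkeeping flags (the field names below are invented for illustration, not the paper's actual signals):

```python
from dataclasses import dataclass

@dataclass
class CommittedLoad:
    reordered: bool          # issued out of order wrt other memory ops
    unresolved_store: bool   # some prior store address was unknown at issue
    recent_snoop: bool       # an external invalidate arrived while in window

def must_replay(load):
    """No-Unresolved-Store / No-Recent-Snoop filtering: replay only if
    the load might have violated a single-thread RAW dependence
    (unresolved prior store) or multiprocessor consistency (recent
    external invalidate)."""
    if not load.reordered:
        return False             # No-Reorder filter: nothing to check
    return load.unresolved_store or load.recent_snoop

assert must_replay(CommittedLoad(False, False, False)) is False  # in-order: safe
assert must_replay(CommittedLoad(True, True, False)) is True     # possible RAW violation
assert must_replay(CommittedLoad(True, False, True)) is True     # possible consistency violation
assert must_replay(CommittedLoad(True, False, False)) is False   # filtered out
```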
Base machine model

PHARMsim: PowerPC execute-at-execute simulator with OOO cores and aggressive split-transaction snooping coherence protocol

Out-of-order execution core:
- 5 GHz, 15-stage, 8-wide pipeline
- 256-entry reorder buffer, 128-entry load/store queue
- 32-entry issue queue

Functional units (latency):
- 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4/4)
- 4 L1 Dcache load ports in OoO window
- 1 L1 Dcache load/store port at commit

Front-end:
- Combined bimodal (16k entry) / gshare (16k entry) branch predictor with 16k entry selection table, 64-entry RAS, 8k entry 4-way BTB

Memory system (latency):
- 32k DM L1 icache (1), 32k DM L1 dcache (1)
- 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines
- Memory: 400 cycle / 100 ns best-case latency, 10 GB/s BW
- Stride-based prefetcher modeled after Power4
% L1 DCache bandwidth increase

[Chart: % L1 DCache bandwidth increase for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads under (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter.]

On average, 3.4% bandwidth overhead using no-recent-snoop filter
Value-based replay performance (relative to constrained load queue)

[Chart: speedup for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads.]

Value-based replay 8% faster on avg than baseline using 16-entry ld queue
Does value locality help?

Not much…
- Value locality does avoid memory ordering violations
  - 59% of single-thread violations avoided
  - 95% of consistency violations avoided
- But these violations rarely occur
  - ~1 single-thread violation per 100 million instr
  - 4 consistency violations per 10,000 instr
What About Power?

Simple power model:

ΔEnergy = #replays × (E_per cache access + E_per word comparison) + replay overhead − (E_per ldq search × #ldq searches)

- Empirically: 0.02 replay loads per committed instruction
- If load queue CAM energy/insn > 0.02 × energy expenditure of a cache access and comparison: the value-based implementation saves power!
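Plugging hypothetical numbers into the model shows how the break-even test works; the 0.02 replays per committed instruction is from the slide, but every energy value below is made up for illustration:

```python
def delta_energy_per_insn(replays_per_insn, e_cache_access, e_word_compare,
                          e_replay_overhead, e_ldq_search, searches_per_insn):
    """Energy delta per committed instruction for value-based replay,
    relative to a conventional searched load queue.
    Positive => replay costs power; negative => replay saves power."""
    replay_cost = replays_per_insn * (e_cache_access + e_word_compare) \
                  + e_replay_overhead
    ldq_savings = e_ldq_search * searches_per_insn   # CAM searches eliminated
    return replay_cost - ldq_savings

# Hypothetical per-event energies in nJ: few replays, many avoided searches.
d = delta_energy_per_insn(replays_per_insn=0.02, e_cache_access=0.5,
                          e_word_compare=0.05, e_replay_overhead=0.001,
                          e_ldq_search=0.4, searches_per_insn=0.3)
assert d < 0   # 0.02 * 0.55 + 0.001 = 0.012 nJ  <  0.4 * 0.3 = 0.12 nJ
```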
Value-based replay Pros/Cons

+ Eliminates associative lookup hardware
  - Load queue becomes simple FIFO
  - Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction
  - Enforces dependence-order consistency constraint [MICRO '01]
- Requires additional pipeline stages
- Requires additional cache datapath for loads
Conclusions

Value prediction
- Continues to generate lots of academic interest
- Little industry uptake so far
  - Historical trends (narrow, deep pipelines) minimized benefit
  - Sea change underway on this front
- Value prediction will be revisited in quest for ILP
  - Power consumption is key!

Value-Aware Microarchitecture
- Multiple fertile areas of research
- Some has found its way into products

Are we done yet? No!

Questions?
Backups
Caveat: Memory Dependence Prediction

Some predictors train using the conflicting store (e.g. the store-set predictor)
- Replay mechanism is unable to pinpoint the conflicting store
- Fair comparison:
  - Baseline machine: store-set predictor w/ 4k-entry SSIT and 128-entry LFST
  - Experimental machine: simple 21264-style dependence predictor w/ 4k-entry history table
Load queue search energy

[Chart: access energy (nJ, 0-3.5) vs. number of entries (16-512) for 2, 4, and 6 read/write port configurations (rd2wr2, rd4wr4, rd6wr6).]

Based on 0.09 micron process technology using Cacti v. 3.2
Load queue search latency

[Chart: access latency (ns, 0-1.4) vs. number of entries (16-512) for 2, 4, and 6 read/write port configurations (rd2wr2, rd4wr4, rd6wr6).]

Based on 0.09 micron process technology using Cacti v. 3.2
Benchmarks

MP (16-way)
- Commercial workloads (SPECweb, TPC-H)
- SPLASH2 scientific application (ocean)
- Error bars signify 95% statistical confidence

UP
- Selected due to high reorder buffer utilization
- 3 from SPECfp2000: apsi, art, wupwise
- 3 commercial: SPECjbb2000, TPC-B, TPC-H
- A few from SPECint2000
Life cycle of a load

[Diagram: a load enters the OoO execution window and is allocated in the load queue; it may be reordered with prior stores of unknown address (ST ? LD) and with other loads (LD ? LD). When a prior store's address resolves to the load's address A, the load queue search matches and the load is squashed ("Blam!").]
Performance relative to unconstrained load queue

Good news: replay w/ no-recent-snoop filter only 1% slower on average
Reorder-Buffer Utilization
Why focus on load queue?

Load queue has different constraints than the store queue
- More loads than stores (30% vs 14% of dynamic instructions)
- Load queue searched more frequently (consuming more power)
- Store-forwarding logic is performance critical

Many non-scalable structures in OoO processor
- Scheduler
- Physical register file
- Register map
Prior work: formal memory model representations

- Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
- Acyclic graph representation (Landin et al., ISCA-18)
- Modeling memory operations as a series of sub-operations (Collier, RAPA)
- Acyclic graph + sub-operations (Adve, thesis)
- Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
Some History

From: Larry.Widigen@amd.com (Larry Widigen)
Received: by charlie (4.1) id AA00850; Wed, 14 Aug 96 10:33:12 PDT
Date: Wed, 14 Aug 96 10:33:12 PDT
Message-Id: <9608141733.AA00850@charlie>
To: Mikko_H._Lipasti@cmu.edu
Subject: www location of paper

I would like to review your forthcoming paper, "Value Locality and Load
Value Prediction." Could you provide a www address where it resides? I
am curious as to its contents since its title suggests that it may
discuss an area where I have done some work.

Cordially,

Larry Widigen
Manager of Processor Development