Modern Branch Prediction Schemes

Branch Prediction Schemes
ECE404: Advanced Microprocessor System
Raj Parihar
Motivation

Why do we need (dynamic) branch predictors?

Branches are very frequent: more than 20% of instructions are branches

Without prediction, a pipeline stall (bubble) is required until the exact outcome and target of a branch are known
Longer pipelines increase this latency, which is even worse

A naïve predictor (always Taken or always Not Taken) is correct only about 50% of the time

The branch penalty is too high: on a misprediction, all instructions after the branch are squashed

That is why we need really smart branch predictors


3/7/2012
ECE404
Raj Parihar
References

“Alternative Implementations of Two-Level Adaptive Branch Prediction” by Tse-Yu Yeh and Yale N. Patt

“Combining Branch Predictors” by Scott McFarling

“The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference” by Sprangle et al.

“Dynamic History-Length Fitting: A Third Level of Adaptivity for Branch Prediction” by Toni Juan et al.

“Neural Methods for Dynamic Branch Prediction” by Daniel A. Jimenez et al.
Outline







Branch Prediction: Basics
Various Branch Predictors
BP in Commercial Design
Enhancement Techniques
Schemes to Reduce Interference
Some Simulations and Results
Conclusion
Conditional Branches


High performance systems use multi-level branch predictors

Two aspects of conditional branch prediction
Branch outcome: Taken or Not Taken
Branch address: if Taken, then to where?

What about unconditional branches?
Don’t even bother!
Compilers are smart enough to deal with them

State-of-the-art predictors achieve a hit rate of approximately 98.8%
Basic Branch Prediction

Bimodal Branch Prediction

Local Branch Prediction Schemes
Per branch address

Global Branch Prediction Schemes
Combined across all branch addresses
Bimodal Branch Predictors
[Figure: bimodal predictor. 13 bits of the PC index a Pattern History Table with entries 0000 through 8191.]
[Plot: conditional branch prediction accuracy (%) versus predictor size (bytes) for the bimodal predictor; the axes span roughly 87 to 94% and 32 bytes to 64k.]
Reference: Combining Branch Predictors, Scott McFarling. WRL TN-36
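The bimodal scheme can be sketched as a table of 2-bit saturating counters indexed by low PC bits. This is a minimal illustration, not code from the slides: the class name, the initial counter value, and the choice to drop the two low PC bits (4-byte instructions) are all assumptions.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits (illustrative)."""

    def __init__(self, index_bits=13):
        self.mask = (1 << index_bits) - 1
        # Every counter starts at 1 (weakly not taken); an arbitrary choice.
        self.table = [1] * (1 << index_bits)

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the two low bits are dropped.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def run_trace(predictor, trace):
    """Return the number of correct predictions over (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct
```

With 13 index bits the table has 8192 entries, matching the 0000 to 8191 range in the figure. A loop branch that is taken nine times and then falls through is mispredicted once per loop exit after warm-up, because the counter needs two wrong outcomes in a row to flip its prediction.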
Two-Level Branch Predictor
[Figure: two-level branch predictor, showing the PC, the Global History Register (Table), and the Pattern History Table.]
Two Level Branch Predictor (Cont…)
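[Figure: two-level branch predictor. A 13-bit Global History Register (bits 12 to 0) and 13 bits of the PC are used to index a Pattern History Table with entries 0000 through 2^13 - 1 = 8191.]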
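A two-level global predictor of this shape can be sketched as follows. The slide does not say how the PC and the history register are combined, so the XOR (gshare-style) indexing here is an assumption, as are the class name and the initial counter values.

```python
class GlobalTwoLevelPredictor:
    """Global history register plus a pattern history table (illustrative)."""

    def __init__(self, history_bits=13):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                           # level 1: global history register
        self.pht = [1] * (1 << history_bits)   # level 2: 2-bit counters

    def _index(self, pc):
        # Assumption: PC bits are XORed with the history (gshare-style).
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the newest outcome into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

On a branch that strictly alternates between taken and not taken, this sketch mispredicts while the history register fills and is then correct on every later prediction, since each history pattern maps to its own counter.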
Combining Branch Predictors

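Meta Predictor (2-level, Bimodal)

[Figure: combined predictor. A meta predictor selects between a two-level predictor (GHT/GHR and PC indexing a PHT) and a bimodal predictor (PC indexing a PHT). Labels in the figure: 13-bit and 12-bit indices; PHT entries 0000 through 2^13 - 1 and 0000 through 4095.]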
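The combining idea can be sketched with a chooser table that learns, per branch, which component to trust. This is a simplified illustration: the table sizes, the PC-indexed chooser, and the rule of training the chooser only when exactly one component is correct are assumptions made for the sketch.

```python
def bump(counter, up, max_value=3):
    """Saturating increment or decrement of a small counter."""
    return min(max_value, counter + 1) if up else max(0, counter - 1)


class CombinedPredictor:
    """Bimodal and global components plus a chooser table (illustrative)."""

    def __init__(self, bits=12):
        size = 1 << bits
        self.mask = size - 1
        self.bimodal = [1] * size
        self.gshare = [1] * size
        self.chooser = [1] * size   # >= 2 means "trust the global component"
        self.ghr = 0

    def _indices(self, pc):
        b = (pc >> 2) & self.mask
        g = ((pc >> 2) ^ self.ghr) & self.mask
        return b, g

    def predict(self, pc):
        b, g = self._indices(pc)
        if self.chooser[b] >= 2:
            return self.gshare[g] >= 2
        return self.bimodal[b] >= 2

    def update(self, pc, taken):
        b, g = self._indices(pc)
        bimodal_ok = (self.bimodal[b] >= 2) == taken
        gshare_ok = (self.gshare[g] >= 2) == taken
        # Train the chooser only when exactly one component was right.
        if bimodal_ok != gshare_ok:
            self.chooser[b] = bump(self.chooser[b], gshare_ok)
        self.bimodal[b] = bump(self.bimodal[b], taken)
        self.gshare[g] = bump(self.gshare[g], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

On a strictly alternating branch, the bimodal component in this sketch keeps mispredicting while the global component learns the pattern, so the chooser drifts toward the global component and the combined prediction becomes correct.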
Branch Predictors in
Commercial Processors
POWER4, Alpha GS 21264 and Intel
POWER4: Core






Branch Prediction Unit
Instruction Fetch Unit
Decode, Crack, Group
Issue Queues
LD/ ST Queue
Execution units
FP execution units
Fixed-point execution units
BR execution unit
CR execution unit
POWER4: Branch Prediction Unit

Three sets of branch-history tables

Local predictor (traditional BHT)
16K entries, indexed by branch address, 1-bit prediction

Global predictor
11-bit global history vector (similar to a GHR)
The history vector is XORed with the branch address to index the global history table
16K-entry global history table, 1-bit prediction

Selector table
Keeps track of the better predictor (global or local)
16K entries, 1 bit each
POWER 4 Branch Prediction (Cont…)






Fetching is redirected based on the prediction
Eventually, branches are executed in the BR unit
Upon execution, the predictor tables are updated
Dynamic branch prediction can be overridden by software, if needed
A link stack is used to predict the target of branches
The target address of a branch-to-count is often repetitive
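The link stack mentioned above can be sketched as a small bounded stack of return addresses: a call pushes its return address, and a return pops the predicted target. The depth and the drop-oldest overflow policy below are assumptions for illustration, not POWER4 specifics.

```python
class LinkStack:
    """Bounded stack of predicted return addresses (illustrative)."""

    def __init__(self, depth=16):
        self.depth = depth      # hypothetical depth, chosen for the sketch
        self.entries = []

    def on_call(self, return_address):
        # On overflow, drop the oldest entry (an assumed policy).
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def predict_return(self):
        # Returns None when the stack holds no prediction.
        return self.entries.pop() if self.entries else None
```

Nested calls unwind in last-in, first-out order, which is why a stack predicts return targets well as long as the call depth stays within the stack size.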
Alpha 21264: BP

Composed of 3 units
Local predictor
Global predictor
Choice predictor

Local Predictor
2-level, per-branch history table
1K table entries, 3-bit saturating counters
Indexed by VPC[11:2] of the current address
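A two-level local predictor along these lines can be sketched as a per-branch history table feeding a table of 3-bit counters. The 10-bit history length is an assumption added for the sketch (the slide gives only the 1K entries and the 3-bit counters), as are the names and initial values.

```python
class LocalTwoLevelPredictor:
    """Per-branch histories select 3-bit saturating counters (illustrative)."""

    def __init__(self, entries=1024, history_bits=10, counter_bits=3):
        self.entries = entries
        self.hist_mask = (1 << history_bits) - 1
        self.counter_max = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        self.histories = [0] * entries                        # level 1
        self.counters = [self.threshold - 1] * (1 << history_bits)  # level 2

    def _slot(self, pc):
        # With 1024 entries this selects PC bits [11:2].
        return (pc >> 2) % self.entries

    def predict(self, pc):
        history = self.histories[self._slot(pc)]
        return self.counters[history] >= self.threshold

    def update(self, pc, taken):
        slot = self._slot(pc)
        history = self.histories[slot]
        c = self.counters[history]
        self.counters[history] = min(self.counter_max, c + 1) if taken else max(0, c - 1)
        self.histories[slot] = ((history << 1) | int(taken)) & self.hist_mask
```

Because each branch sees only its own history, a short repeating pattern such as taken, taken, taken, not taken becomes fully predictable once the history register has filled.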
Alpha 21264 (Cont…)

Global predictor
Uses the 12 most recent branches
4K-entry global history table
2-bit saturating counters

Choice Predictor
Monitors the history of the local and global predictors
4K-entry table, 2 bits each
Chooses the better of the two
Intel Processors



386/486
All branches are statically predicted Not Taken

Pentium III
2-level, local histories
2-bit saturating counters (Lee-Smith)

Pentium M
Combines 3 predictors: bimodal, global, and loop predictor
The loop predictor analyzes branches to see if they have loop behavior: moving in one direction (taken or NT) a fixed number of times

Note: it’s really tough to find any info about Intel chips
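The loop-predictor idea can be sketched as a per-branch trip counter that becomes confident once the same trip count has been seen twice in a row. This is a hypothetical illustration, not Intel's design; it handles only the "taken n times, then not taken" direction, and the dictionary stands in for a hardware table.

```python
class LoopPredictor:
    """Learns a fixed trip count per branch (illustrative, taken-direction only)."""

    def __init__(self):
        self.entries = {}   # pc -> {"trip", "count", "confident"}

    def predict(self, pc):
        """Return True/False when confident, or None to defer to another predictor."""
        e = self.entries.get(pc)
        if e is None or not e["confident"]:
            return None
        # Predict taken until the learned trip count is reached.
        return e["count"] < e["trip"]

    def update(self, pc, taken):
        e = self.entries.setdefault(
            pc, {"trip": None, "count": 0, "confident": False})
        if taken:
            e["count"] += 1
        else:
            # Loop exit: confident only if the trip count repeated.
            e["confident"] = (e["trip"] == e["count"])
            e["trip"] = e["count"]
            e["count"] = 0
```

Returning None when not confident lets a combined design fall back to the bimodal or global component for branches that do not behave like loops.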
Branch Prediction: Insights
What makes them nearly perfect?
Potential Interferences in Two-Level

Interference is caused by multiple branch instructions being mapped to the same table entry

Types of interference:
Neutral interference
Positive interference
Negative interference

Negative interference is more dominant than positive interference

Reference: The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference, by Sprangle, Chappell, Alsup, and Patt, ISCA 97
Destructive Interference (Aliasing)

Unrelated branches might accidentally use the same counter
If two branches behave differently, the predictor can’t learn the behavior
Leads to decreased accuracy

Almost all known techniques change the microarchitecture
Techniques shown to work well in simulation
But microprocessor manufacturers still use relatively simple predictors

Can we reduce destructive interference without changing the processor?
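Destructive interference is easy to reproduce in a small experiment. The sketch below uses two hypothetical branches with opposite behavior: with a tiny table they share one counter, and with a larger table they do not. The addresses and table sizes are made up for the illustration.

```python
def mispredictions(trace, index_bits):
    """Count mispredictions of a PC-indexed table of 2-bit counters."""
    mask = (1 << index_bits) - 1
    table = [1] * (1 << index_bits)
    misses = 0
    for pc, taken in trace:
        i = (pc >> 2) & mask
        if (table[i] >= 2) != taken:
            misses += 1
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return misses


# Branch A is always taken, branch B is never taken (hypothetical addresses).
BRANCH_A, BRANCH_B = 0x1000, 0x1040
trace = [(BRANCH_A, True), (BRANCH_B, False)] * 50

aliased = mispredictions(trace, index_bits=4)    # both map to entry 0
separate = mispredictions(trace, index_bits=8)   # distinct entries
```

In the aliased case the shared counter is pushed up by one branch and down by the other, so it never settles and every prediction in this trace is wrong; with separate entries only the first prediction of branch A misses.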
Ways to Reduce Interference

Larger prediction table

Efficient hash function
Use different mapping schemes to better distribute branches among different entries

Profiling of branches
Map conflicting branches to different table entries
Separate different classes of branches so that they use different prediction tables

Avoid negative interference
By converting negative interference into neutral or positive interference
Agree Predictor

Idea: most branches are highly biased, either T or NT

Hope: the first time a branch enters the BTB, its outcome exhibits that bias

A bias bit is assigned to each branch in the BTB
The PHT indicates whether to “agree” or “disagree” with the bias bit
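The agree scheme can be sketched by storing a bias bit per branch and letting the PHT counters vote "agree" or "disagree". In this illustration a dictionary stands in for the BTB, the bias bit is set from the first observed outcome, and unseen branches default to not taken; all three are assumptions of the sketch.

```python
class AgreePredictor:
    """PHT counters predict agreement with a per-branch bias bit (illustrative)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.pht = [2] * (1 << index_bits)   # start at weakly "agree"
        self.bias = {}                       # pc -> bias bit (stands in for the BTB)
        self.ghr = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        bias = self.bias.get(pc, False)      # assumed default for unseen branches
        agree = self.pht[self._index(pc)] >= 2
        return bias if agree else not bias

    def update(self, pc, taken):
        bias = self.bias.setdefault(pc, taken)   # first outcome sets the bias bit
        i = self._index(pc)
        if taken == bias:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

Two branches with opposite outcomes that land on the same counter now both push it toward "agree", which turns what would be negative interference into positive interference.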
Pattern History Table: Utilization

We need a very efficient HASH function

We assume the utilization of PHT entries is uniform
Not true though; it depends on the efficiency of the HASH function
XOR is the simplest HASH function

And here are some surprising results
Simplescalar Implementation (Baseline)
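[Figure: SimpleScalar baseline indexing. The branch address (memory address, shifted right by 2) is ANDed with (L1 size - 1) to form the L1 index into the L1 table (GHR). The selected history is XORed with the shifted branch address and ANDed with a mask of ones (the size of the history register) to form the L2 index into the L2 table (PHT, entries 0000 through 8191). Annotations mark the baseline path and the tweaked path.]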
Simplescalar Tweak

Change the history_reg_size in the .cfg file

Branch Lookup
bpred.c: bpred_dir_lookup function
XOR the upper half of l2index with the lower half before indexing the L2 table, which is a kind of PHT

Branch Update
Because the L2 index is a pointer, the correct entry is updated automatically once the bpred_update function is called
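The folding step described under Branch Lookup can be sketched as a small helper. This is a Python illustration of the idea only, not the actual change made to bpred.c; the function name and parameters are hypothetical.

```python
def fold_index(l2index, width, table_size):
    """XOR the upper half of a `width`-bit index into its lower half.

    `table_size` is assumed to be a power of two; the folded value is
    wrapped to it so that it is always a valid table index.
    """
    half = width // 2
    lower = l2index & ((1 << half) - 1)
    upper = (l2index >> half) & ((1 << (width - half)) - 1)
    return (upper ^ lower) & (table_size - 1)


# Example: fold a 10-bit index for an 8192-entry table.
folded = fold_index(0b1111000011, width=10, table_size=8192)
```

Folding lets a history register that is wider than the table index still influence which entry is selected, instead of having its upper bits discarded by the final mask.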
PHT Utilization
[Plot: number of branches mapped to each PHT entry [n]. Series: Baseline (GHR = 13 bit, PHT = 8192 entries), GHR20 (GHR = 20 bit), and BR+GHR (baddr is XORed). Simulation: 10 million INS after a fast forward of 50 million INS.]
PHT Utilization: Art Benchmark
[Plot: number of branches and mispredictions (BR_Baseline, Miss_Baseline) per PHT entry for the art benchmark.]
Impact of GHR Length





In general, the length of the GHR can impact the overall prediction accuracy
A longer GHR will affect more entries in the PHT, but may reduce or increase interference
Determining the appropriate GHR length is nontrivial
Allowing the GHR length to change dynamically is one possible way to improve performance
How to determine the best GHR length is an open research issue
Variable GHR Length: Simulation

GHR length vs. number of mispredictions

[Plot: number of mispredictions per application (ammp, applu, apsi, art, bzip2, swim, twolf, vortex) for Baseline (GHR13), GHR10, GHR16, GHR20, and GHR26.]
Variable GHR Length: Simulation

GHR length vs. % improvement

[Plot: % improvement per application (ammp, applu, apsi, art, bzip2, swim, twolf, vortex) for GHR10, GHR16, GHR20, and GHR26; the axis spans -50 to 30.]
Dynamic History Length

Optimal branch history length
Some branches prefer a short history (less training time)
Some require a longer history (complex behavior)

Vary history length
Choose through profile/compile-time hints
Or learn dynamically

References
Maria-Dana Tarlescu, Kevin B. Theobald, and Guang R. Gao. Elastic History Buffer: A Low-Cost Method to Improve Branch Prediction Accuracy. ICCD, October 1996.
Jared Stark, Marius Evers, and Yale N. Patt. Variable Length Path Branch Prediction. ACM SIGPLAN Notices, 33(11):170-179, 1998.
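Choosing a history length from a profile can be sketched by replaying a trace with each candidate length and keeping the one with the fewest mispredictions. This is a simplified, offline illustration, not the hardware mechanism of the cited papers; the function names, the fixed 10-bit table index, and the XOR indexing are assumptions.

```python
def count_mispredictions(trace, history_bits, index_bits=10):
    """Replay (pc, taken) pairs through a global-history predictor.

    Assumes history_bits <= index_bits; the table size stays fixed while
    only the number of history bits changes.
    """
    mask = (1 << index_bits) - 1
    hist_mask = (1 << history_bits) - 1
    pht = [1] * (1 << index_bits)
    ghr = 0
    misses = 0
    for pc, taken in trace:
        i = ((pc >> 2) ^ ghr) & mask
        if (pht[i] >= 2) != taken:
            misses += 1
        pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)
        ghr = ((ghr << 1) | int(taken)) & hist_mask
    return misses


def best_history_length(trace, candidates):
    """Return the candidate history length with the fewest mispredictions."""
    return min(candidates, key=lambda h: count_mispredictions(trace, h))
```

For a loop branch that is taken five times and then not taken, a 2-bit history cannot tell the last iteration from the earlier ones, while an 8-bit history can, so the longer candidate wins on that trace. Other traces may favor the shorter history because it trains faster.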
Variable GHR and PHT Utilization
[Plot: number of branches and mispredictions per PHT entry (8192 entries in total). Series: BR_Baseline, Miss_Baseline, BR_GHR20, Miss_GHR20.]
Changing the Branch Predictor

Before 2001, most work refined two-level adaptive branch prediction [Yeh & Patt 92]
A 1st-level table records recent global or per-branch pattern histories
A 2nd-level table learns correlations between histories and outcomes
Refinements focus on reducing destructive interference

Some of the better refinements
gshare [McFarling '93], agree [Sprangle et al. '97], hybrid predictors [Evers et al. '96], skewed predictors [Michaud et al. '97]
A Machine Learning Approach

Conditional branch prediction is a machine learning problem
The machine learns to predict conditional branches
So why not apply a machine learning algorithm?

Artificial neural networks
Simple model of neural networks in brain cells
Learn to recognize and classify patterns
Neuron Based Prediction

Some of the well-known techniques
Perceptron-based predictors
Back propagation
Radial basis network
Elman network
Learning Vector Quantization (LVQ) network

All are well-known, complex neural-network-based approaches
Lots of computation and implementation overhead
The idea is to implement a basic, lightweight version of the above
Basics of Neuron Based Predictor



The inputs to a neuron are branch outcome histories

The last n branch outcomes

Can be global or local (per-branch) or both (alloyed)

Conceptually, branch outcomes are represented as

+1, for taken

-1, for not taken
The output of the neuron is

Non-negative, if the branch is predicted taken

Negative, if the branch is predicted not taken
Ideally, each static branch is allocated its own neuron
Perceptron Based Predictors




Inputs (x’s) are from branch history
n + 1 small integer weights (w’s) are learned by on-line training
Output (y) is the dot product of the x’s and w’s; predict taken if y ≥ 0
Training finds correlations between history and outcome

[Figure: perceptron with branch history information (global, local) as input and the branch outcome (taken, not taken) as output]
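A perceptron predictor of this form can be sketched in a few lines. The training rule and the threshold formula follow the published perceptron predictor as commonly described; the table size, the 8-bit weight range, and the PC hashing are assumptions made for the sketch.

```python
class PerceptronPredictor:
    """One weight vector per table row, trained on-line (illustrative)."""

    def __init__(self, history_length=8, num_perceptrons=64):
        self.n = history_length
        # Training threshold; the formula is taken from the literature.
        self.theta = int(1.93 * history_length + 14)
        self.weights = [[0] * (history_length + 1) for _ in range(num_perceptrons)]
        self.history = [-1] * history_length   # +1 taken, -1 not taken

    def _row(self, pc):
        return self.weights[(pc >> 2) % len(self.weights)]

    def output(self, pc):
        w = self._row(pc)
        # w[0] is the bias weight; the rest pair up with history bits.
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self.output(pc) >= 0

    def update(self, pc, taken):
        y = self.output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self._row(pc)
            inputs = [1] + self.history
            for i, xi in enumerate(inputs):
                # Keep weights in a small signed range (assumed 8-bit).
                w[i] = max(-128, min(127, w[i] + t * xi))
        self.history = self.history[1:] + [t]
```

Each weight ends up positive when the branch tends to agree with that history bit and negative when it tends to disagree. On a strictly alternating branch the weight for the most recent outcome becomes strongly negative, which is exactly the correlation that training is meant to find.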
Conclusion

Branch predictors make use of the correlation between a branch’s outcome and its own history, as well as the history of other branches

Neural prediction could be incorporated into future CPUs
Accuracy is very good; complexity is still a bottleneck
Power and energy need to be amortized

Predictor accuracy is more important for deeper pipelines because the misprediction penalty increases with pipeline depth
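The effect of pipeline depth can be made concrete with a back-of-the-envelope CPI estimate. The formula below is a standard first-order model, and the penalty values are made-up numbers for illustration; only the 20% branch frequency comes from the slides.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order model: every misprediction adds a fixed flush penalty."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# 20% branches; compare a shallow and a deep pipeline (hypothetical penalties).
shallow = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=5)
deep = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)
```

With the same 5% misprediction rate, the deeper pipeline loses four times as many cycles to flushes in this model, which is why each extra point of accuracy matters more as pipelines get longer.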
Questions?