CS 7960-4 Lecture 7 Combining Branch Predictors Scott McFarling

WRL Tech. Report TN-36
1993
Bimodal Branch Prediction
• Identifies most popular prediction in recent past
• Updates happen during commit
[Figure: a 10-bit index taken from the PC selects one of 1024 2-bit saturating counters]
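A minimal sketch of the bimodal scheme described on this slide, assuming the index is simply the low-order PC bits and that counters start at 1 (weakly not-taken). The class and method names are illustrative, and the sketch updates the counter in one call rather than modeling commit-time updates.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Initial counter value of 1 is an assumption, not from the slides.
        self.counters = [1] * (1 << index_bits)

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
```

Two PCs that differ only above the index bits share a counter, which is the contention that a larger table reduces.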
Results
• SPEC’89 programs simulated for 10M instructions
(modern studies use hard-to-predict programs)
• A larger predictor reduces contention for counters
• Prediction rates saturate at 93.5% (at 2K bytes)
(Fig.3)
Local Predictors
• Two-Level predictor: The first level has history,
the second level has saturating counters
• History gets updated immediately
[Figure: a 10-bit index taken from the PC selects one of 1024 4-bit local histories; the selected history indexes one of 16 2-bit saturating counters]
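A rough sketch of the two-level local predictor, using the sizes from this slide (1024 histories of 4 bits, 16 counters). The names and the initial counter value are assumptions, and history and counter are updated together here instead of modeling the immediate history update separately.

```python
class LocalPredictor:
    """Level 1: per-PC branch histories. Level 2: counters indexed by history."""

    def __init__(self, pc_bits=10, history_bits=4):
        self.pc_mask = (1 << pc_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.histories = [0] * (1 << pc_bits)        # 1024 x 4-bit histories
        self.counters = [1] * (1 << history_bits)    # 16 x 2-bit counters

    def predict(self, pc):
        history = self.histories[pc & self.pc_mask]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        i = pc & self.pc_mask
        history = self.histories[i]
        c = self.counters[history]
        self.counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the low bit of this branch's history.
        self.histories[i] = ((history << 1) | int(taken)) & self.hist_mask
```

A branch that strictly alternates taken/not-taken defeats a lone 2-bit counter, but this structure learns it once the history has filled.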
Results
• For small predictors, there could be contention
at both levels, resulting in inaccurate predictions
• It also takes longer to warm up, a cost paid after every
context switch
• Does very well for large predictors – saturates at
97.1%
Global Predictors
• A single history register – neighboring branches
have correlated results
• However, the PC is not used
[Figure: a 10-bit global history register indexes 1024 2-bit saturating counters]
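A small sketch of the global scheme, assuming a 10-bit history register shared by all branches. The `pc` argument is accepted only to keep a uniform interface and is deliberately ignored; the names are illustrative.

```python
class GlobalPredictor:
    """One shared history register indexes the counter table; the PC is unused."""

    def __init__(self, history_bits=10):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.counters = [1] * (1 << history_bits)  # initial value is an assumption

    def predict(self, pc=0):
        # pc is ignored: only the outcomes of recent branches matter.
        return self.counters[self.history] >= 2

    def update(self, pc, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```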
Do We Need PC?
• Note that the global history reveals which branch
is being examined
• Hence, it outdoes bimodal predictors when the
transistor budget is large (Fig.7)
• The local predictor still does better than the global one –
it is more important to identify the PC and local history
than the behavior of neighboring branches
Gselect
• Use a combination of PC and global history
• Bimodal and global prediction are special cases
(Fig.9)
[Figure: n bits of the PC are concatenated with m bits of global history to form an (n+m)-bit index into 1024 2-bit saturating counters; the example uses 5 bits of global history]
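The index formation can be sketched as below, assuming the PC bits form the high part of the index and the history bits the low part. The function name and parameter names are hypothetical. Setting `history_bits=0` gives the bimodal index and `pc_bits=0` gives the global index, which is the "special cases" point above.

```python
def gselect_index(pc, history, pc_bits=5, history_bits=5):
    """Concatenate low-order PC bits (high part) with low-order history bits (low part)."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << history_bits) - 1)
    return (pc_part << history_bits) | hist_part


# Special cases: all bits from the PC (bimodal) or all from the history (global).
bimodal_like = gselect_index(0x3A7, 0x155, pc_bits=10, history_bits=0)
global_like = gselect_index(0x3A7, 0x155, pc_bits=0, history_bits=10)
```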
GShare
• XORing 10 history bits with 10 PC bits carries more
information than concatenating 5 bits of each, and
more than either component alone
Branch Address   Global History   Gselect 4/4   Gshare 8/8
00000000         00000001         00000001      00000001
00000000         00000000         00000000      00000000
11111111         00000000         11110000      11111111
11111111         10000000         11110000      01111111
01111110         00000001         11100001      01111111
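As a check on the values listed on this slide, the sketch below recomputes both indexes for the five address/history pairs. The function names are illustrative, and gselect is assumed to place the address bits in the high half of the index.

```python
def gselect_index(pc, history, pc_bits=4, history_bits=4):
    """Low 4 address bits concatenated with low 4 history bits."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << history_bits) - 1)
    return (pc_part << history_bits) | hist_part


def gshare_index(pc, history, bits=8):
    """XOR of the low 8 address bits with the low 8 history bits."""
    return (pc ^ history) & ((1 << bits) - 1)


rows = [
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
    (0b01111110, 0b00000001),
]

for pc, hist in rows:
    print(f"{pc:08b} {hist:08b} "
          f"{gselect_index(pc, hist):08b} {gshare_index(pc, hist):08b}")
```

The third and fourth pairs map to the same gselect index because the differing history bit falls outside the 4 bits kept, while gshare keeps them apart.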
Terminology
• GAG: Global history indexes into global array
of saturating counters
• PAG: Per-address history indexes into global array
of saturating counters
• GAP: Global history indexes into each PC’s private
array of counters (gselect)
• PAP: Per-address history indexes into each PC’s
private array of counters
Trade-Offs
• Some predictors warm-up faster than others
• Some programs benefit from global history, some
from local history
• Some programs have branches that interfere
with each other
• Note that a 64KB local predictor has fewer
saturating counters than a 64KB bimodal predictor
– the former won’t be better for every program
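The last point is simple arithmetic: bits spent on the history table are not available for counters. The sketch below uses the small configuration from the Local Predictors slide (1024 four-bit histories, 16 counters) rather than a 64KB design, whose exact split is not given here. The helper names are hypothetical.

```python
def bimodal_counters(budget_bits, counter_bits=2):
    """A bimodal predictor spends its whole budget on counters."""
    return budget_bits // counter_bits


def local_counters(budget_bits, history_entries, history_bits, counter_bits=2):
    """A local predictor first pays for the history table, then buys counters."""
    history_cost = history_entries * history_bits
    return (budget_bits - history_cost) // counter_bits


# Storage of the example local predictor: 1024 x 4-bit histories + 16 x 2-bit counters.
budget = 1024 * 4 + 16 * 2

print(bimodal_counters(budget))         # counters a bimodal design gets from the same bits
print(local_counters(budget, 1024, 4))  # counters left for the local design
```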
Combining Predictors
• Use an array of saturating counters to pick the
best available predictor for each PC
[Figure: the PC indexes a table of 1024 2-bit saturating counters; the selected counter chooses between the predictions of Predictor A and Predictor B]
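A hedged sketch of the selection mechanism, pairing a bimodal and a gshare component. It assumes the chooser counter moves only when the two components disagree, toward whichever was right. All class names, table sizes, and initial values are illustrative assumptions.

```python
def bump(counter, up):
    """2-bit saturating increment/decrement."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class Bimodal:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        self.table[i] = bump(self.table[i], taken)


class Gshare:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)
        self.history = 0

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = bump(self.table[i], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


class CombinedPredictor:
    """PC-indexed chooser counters: values 0-1 select A, values 2-3 select B."""

    def __init__(self, a, b, bits=10):
        self.a, self.b = a, b
        self.mask = (1 << bits) - 1
        self.chooser = [1] * (1 << bits)

    def predict(self, pc):
        use_b = self.chooser[pc & self.mask] >= 2
        return self.b.predict(pc) if use_b else self.a.predict(pc)

    def update(self, pc, taken):
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        if a_ok != b_ok:  # train the chooser only on disagreement
            i = pc & self.mask
            self.chooser[i] = bump(self.chooser[i], b_ok)
        self.a.update(pc, taken)
        self.b.update(pc, taken)
```

For a branch that alternates taken/not-taken, the bimodal component keeps mispredicting while gshare learns the pattern, so the chooser for that PC drifts toward gshare.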
Results
• The combination of local and gshare increases
the prediction accuracy to 98.1% (Fig.16)
• For smaller transistor budgets, the combination
of bimodal and gshare is better (gshare is twice
the size to make sure the total is a power of two)
• A 1KB combined predictor does as well as a
16KB gselect predictor
Future Work
• Detect conflicts, correlations, and common
predictions through profiling/compiler analysis
• Functions that compress information in history
or PC
• Pipeline predictions – predict two branches ahead
• Hierarchical predictors – get a quick prediction in
a cycle and a more accurate one two cycles later
Next Week’s Paper
• “Design Trade-Offs for the Alpha EV8 Conditional
Branch Predictor”, Seznec et al., ISCA’02