Branch Prediction

HM
(11/30/2011)
If only we could predict the future, computation would be swift and accurate. In such a clairvoyant world, the stalls created by branches in a pipelined architecture could be minimized: as soon as a branch is decoded, the pipeline could be primed again with the right instruction stream from the new location, the destination of the correctly predicted branch.
Unfortunately, we generally do not know whether a conditional branch is taken until the condition is completely evaluated. Nor do we know the destination of a taken branch, conditional or not, until that operand is evaluated. The same holds for all other flow-control instructions, such as calls, returns, and exceptions.
But we can guess the outcome of a conditional branch, we can guess the destination address of any branch, and we can guess wrong; we cannot predict with certainty. To help us guess the right branch destination, we could remember the destination from the last time this branch transferred control, and then predict that this time the destination will be the same. If this helps us guess right most of the time, we gain some advantage. Always guessing right would be nicer, but we are, after all, mere mortals. Branch prediction strategies intend to learn from the past to guess future behavior correctly most of the time. Practically this means one can reach >= 96% accurate prediction, which helps in a pipelined, superscalar architecture. In fact, to take advantage of pipelined or superscalar execution, very high prediction accuracy is mandatory. After all, each time the prediction is wrong, the pipe has to be flushed, as the multiple arithmetic units hold invalid operands that are not needed, and thus all clever HW speed-up methods were in vain. The deeper the pipes of a pipelined architecture, the more stringent the accuracy requirements for a branch prediction scheme.
Synopsis
- Definitions
- Introduction
- What's so Bad About Branches?
- Static Branch Prediction
- Dynamic Branch Prediction
- A Two-Level Dynamic Prediction Scheme
- Yeh and Patt Nomenclature
- Prediction Accuracy Table for SPECint92
- Bibliography
Definitions
BHT, acronym for Branch History Table:
The Branch History Table (BHT) is the collection of branch History Registers (HRs), used in single-level or two-level dynamic branch prediction. There could be a.) one HR per conditional branch, b.) one HR each for the last n conditional branches, or c.) just a single HR for all conditional branches. Choice a.) can be excessively costly, yet is the most accurate. Choice c.), while the least accurate, also costs the least. Usually architects select a compromise. This trade-off of resource cost vs. accuracy is similar to the mapping policy employed in cache design. On real branch prediction HW, just the last few branches executed have their associated HR; otherwise too much HW (silicon space) would be consumed for the BHT. Each HR records, for the last k executions of its associated conditional branch (or of all branches), whether (1) or not (0) that branch was taken.
In a two-level dynamic branch prediction scheme, the HR has an associated Pattern Table (PT), indexed by the HR. The entry in the PT guesses whether the next branch will be taken. The cost in bits can be contained, because not all branches need an associated HR.
Branch Prediction:
Heuristic that guesses the outcome of a conditional branch, the destination of the branch, or both, as soon as a branch instruction is decoded. Perfectly accurate prediction of a branch is, of course, not possible. Heuristics aim at guessing right most of the time. For highly pipelined and superscalar architectures, “most of the time” has to mean roughly 97% or more.
Branch Profiling:
Compile a program with a special compiler directive. Then measure at run-time, for each conditional branch, how many times the branch was taken and how many times not. The next time this same program is compiled, the measured results of the prior run are available to the compiler. The information enables the compiler to bias conditional branches according to past behavior. Underlying this scheme is the assumption that past behavior is a reflection of the future. Branch profiling is one of the static branch prediction schemes. It costs one additional execution, and it costs instruction bits for the compiler to set a branch bias one way or the other. Generally, static prediction schemes, even with the benefit of a profiling run, are not as effective as the dynamic methods.
3
Branch Prediction
HM
BTAC, acronym for Branch Target Address Cache:
For very fast performance, it is not sufficient to know ahead of time whether a conditional branch will be taken. For any branch, the destination address should also be known a priori. For this reason, each branch in a BTAC implementation has an associated target address, used by the instruction fetch unit to continue filling the pipeline. Note that after complete decoding of an instruction this target is computed anyway; knowing the target earlier speeds up execution by filling an otherwise stalled pipeline.
BTB, acronym for Branch Target Buffer:
For very fast performance, it is best to know ahead of time whether or not a conditional branch will be taken, and where such a branch leads. The former can be implemented using a BHT with Pattern Table; the latter can be implemented using a BTAC. The combination of the two is called the BTB. This is the scheme implemented on the Intel Pentium Pro® and newer architectures.
BTFN, acronym for Backward Taken Forward Not:
A static prediction heuristic assuming that program execution time is dominated by loops, especially while loops. While loops are characterized by an unconditional branch at the end of the loop body back to the condition, and a conditional branch-if-false at the start, leading to the instruction after the loop body. The backward branch is always taken, and to the same destination; the forward branch-if-false is taken just once. Since while statements are often executed more than twice, the BTFN heuristic guesses correctly the majority of the time.
Delay of Transfer (Delay Transfer Slot):
Pipelined architectures sometimes execute one more instruction after a branch instruction, before the transfer of control takes effect; that extra instruction is typically the one physically located at the target of the branch. The reason is to recover some of the time lost to the pipeline stall. Thus, compilers or programmers can physically place the target instruction of the branch after the branch, into the so-called delay slot. Since it is supposed to be executed anyway, as soon as the branch reaches its target, and since the HW already executes it before completing the branch, time is saved. Note that at the target of such an unconditional branch the relocated instruction must be omitted. Example: Intel i860 architecture. When a suitable candidate cannot be found, a NOP instruction is placed physically after the branch, i.e. into the delay slot. There are restrictions; for example, branch instructions and other control-transfer instructions cannot be placed there. If that does happen, a phenomenon called code visiting occurs, with often unpredictable side-effects; hence the restriction.
4
Branch Prediction
HM
Dynamic Branch Prediction:
Branch prediction policy that changes dynamically with the execution of
the program. Antonym: Static Branch Prediction.
History Register (HR):
k-bit shift register, associated with a conditional branch. The bits indicate, for each of the last k executions of the associated conditional branch, whether it was taken (1 meaning yes). The newest bit shifts out the oldest, since the HR has only a limited, fixed length.
Interference:
When multiple branches are associated with one HW data structure (such as an HR or PT), the behavior of each branch will influence the data structure's state. However, the data will be used for the next branch, even if that branch is not the one that modified the most recent state.
Mispredicted Branch (AKA Miss):
The condition or destination of a branch was predicted incorrectly. As a
consequence, the control of execution took a different flow than
predicted.
Mispredicted Branch Penalty:
Number of cycles lost due to having incorrectly guessed the change in flow of control caused by a branch instruction.
Pattern Table (PT):
A table of entries, each specifying whether the associated conditional branch will be taken. An entry in the PT is selected by using the history bits of a branch History Register (HR). This can be done by indexing, in which case the number of entries in the PT is 2^k, with k being the number of bits stored in the History Register. Otherwise, if the number of entries is less than 2^k, a hashing scheme, which causes interference, can be applied. Each PT entry holds boolean information about the next execution of the associated conditional branch: will it be taken or not?
Saturating Counter:
n-bit unsigned integer counter, n typically being 2..16 for branch prediction HW. When all bits are on and counting up continues, a saturating counter stays at the maximum value. When all bits are off and counting down continues, the saturating counter stays at 0. This creates a limited hysteresis effect on any decision that depends on such a counter: a single outcome against the trend does not immediately flip the prediction.
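The counter above can be sketched in a few lines of Python; the class name and interface are illustrative, not taken from any particular HW description:

```python
class SaturatingCounter:
    """n-bit saturating counter: counts up on 'taken', down on 'not taken',
    but never wraps around at either end."""
    def __init__(self, bits=2, value=0):
        self.max = (1 << bits) - 1   # e.g. 3 for a 2-bit counter
        self.value = value

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max)   # saturate at max
        else:
            self.value = max(self.value - 1, 0)          # saturate at 0

    def predict_taken(self):
        # the upper half of the counter range predicts 'taken'
        return self.value > self.max // 2

c = SaturatingCounter(bits=2)
for _ in range(4):
    c.update(True)
print(c.value)            # 3: stays at the maximum, does not wrap
c.update(False)
print(c.predict_taken())  # True: one miss does not flip the prediction
```

With 2 bits, two consecutive outcomes against the trend are needed to flip the prediction; this is the hysteresis mentioned above.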
5
Branch Prediction
HM
Shift Register:
Register with small number of bits, tracking a binary event. If the event
did occur, a 1 bit is shifted into the register at one end. This will be the
newest bit. The oldest bit is shifted out at the opposite end. Conversely,
if the event did NOT occur, a 0 bit is shifted in, and the oldest bit is
shifted out. All other bits shift their bit position by one.
Static Branch Prediction:
A branch prediction policy that is embedded in the binary code or implemented in the hardware that executes the branches. The policy does not change during execution of the program, even if it is known to be wrong all the time; in that case, execution would be better off without branch prediction.
The BTFN heuristic is a static branch prediction policy. It requires zero instruction bits. The hardware compares the destination of a branch with the conditional branch's own address. Smaller destinations lead backward and are assumed taken. Destination addresses larger than the branch address are assumed not taken, and the predicted next instruction is the successor of the conditional branch. Typical industry benchmarks (SPECint89) achieve almost 65% correct prediction with this simple scheme.
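As a sketch, the BTFN rule needs nothing more than an address comparison; the function below is a hypothetical helper for illustration, not part of any real ISA:

```python
def btfn_predict_taken(branch_addr, target_addr):
    """Static BTFN heuristic: a branch whose target lies at a lower address
    (a backward branch, typically closing a loop) is predicted taken;
    a forward branch is predicted not taken."""
    return target_addr < branch_addr

# loop-closing branch at 0x104 jumping back to 0x100: predicted taken
print(btfn_predict_taken(0x104, 0x100))   # True
# loop-exit branch at 0x100 jumping forward past the body: predicted not taken
print(btfn_predict_taken(0x100, 0x108))   # False
```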
Two-Level Branch Prediction:
Instead of associating a single prediction directly with a conditional branch, a two-level branch prediction scheme associates prediction bits (a pattern table) with patterns of branch execution history. Thus, each pattern of past branch behavior has its own future prediction, costing more storage but yielding better accuracy.
For example, each conditional branch may have a k-bit Branch History Register, which records for each of the last k executions whether or not the condition was satisfied. And each possible history pattern has an associated prediction of the future. Typically, the latter is implemented as a 2-bit saturating counter.
Wide Issue:
Older architectures issue (i.e. fetch, decode, etc.) one instruction at a time; for example, 1 instruction per clock cycle on a RISC architecture. Newer computers issue more than 1 instruction at a time; this is called wide issue. Synonym: superscalar architecture. Antonym: single issue.
Introduction
Execution on a highly pipelined and wide-issue architecture suffers severe degradation whenever an instruction disrupts the prefetched flow of operations. Typically, branch instructions cause pipeline hazards. The higher the degree of pipelining, and the higher the superscalar degree, the more of the partially executed (fetched, decoded, operand-fetched, etc.) instructions must be discarded. The pipeline must be primed again, i.e. filled again with partially executed instructions.
However, more than one in five operations is a control-flow instruction: branch, call, return, conditional branch, exit, etc. This almost invalidates the architectural advantage of pipelining.
If it were possible to predict whether a condition is true before computing it, and if the machine could also predict the destination of a branch before extracting it from the instruction stream, then as soon as any branch is decoded the pipe could be filled and hazards would be avoided.
Obviously, complete prediction of the future is not possible. However, good guesses about the future can be made based on past behavior; reasonably good guesses can even be made based on the static nature of the branch itself. These are called branch prediction schemes. Prediction can be static or dynamic. The former does not change during program execution; the latter evolves as a function of program behavior during execution. Since about 2000, dynamic branch prediction techniques using perceptrons have further improved accuracy to almost 98% for several benchmarks [11], [12].
The graph below shows prediction accuracies for certain benchmarks on 2 competing processor types, the Intel Core Duo and the AMD K8. We see accuracies of 96% or even more. The 99% accuracy goal is the Holy Grail of branch prediction.
Intel Core Duo two-level dynamic branch prediction vs. AMD K8, measured across multiple games and apps, shows the benefit of Intel's branch prediction investment:
http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5
© Real World Technologies, April 2009
The _H and _L suffixes allude to high vs. low levels of optimization used during the compile step.
What’s so Bad About Branches?
Performance penalties, delay, disturbance:
- Disruption of sequential control flow: the anticipated flow of the pipeline is disturbed. The higher the number of pipeline stages, the greater the relative penalty. Another case in point: deep pipelining is a liability, not pure goodness!
- I-cache disturbance due to a new address range
- Must compute the condition of a branch to determine the future direction: fall-through, or to the new target?
- Must determine the target of an unconditional branch
Determine branch direction:
- Cannot immediately fetch the subsequent instruction, since it is not known
- Remedy: if possible, move the instructions that compute the branch condition away from the branch, so that the resulting wait is minimized
- Or make use of the penalty:
  - Bias the case toward NOT taken, or vice versa; done in some static schemes
  - Fill the delay slot with a useful instruction (Intel i860 processor); this HW trick is used less and less in the 2000s
  - Execute both paths speculatively; once the condition is known, kill the superfluous path. Requires more HW, and can cause an explosion of HW when jumping to further branches; done on the Itanium Processor Family (IPF)
  - Predict the branch direction, discussed here
Determine branch target: must know the target address to fetch next; hence use prediction.
Sample prediction algorithm, with 2 bits, reaching > 80% accuracy: this is an awesome policy; 2 data bits plus logic suffice for an amazing degree of accuracy, even when held globally for all branches. It works so well despite the heavy interference!
[Figure: Two-Bit Saturating Counter, Taken vs. Not Taken]
Static Branch Prediction
Common to all static branch prediction schemes listed below: small cost in extra hardware and cache. Also common: the achievable ~70% accuracy in prediction is cheap, but generally not sufficient for highly pipelined or for multi-way, superscalar architectures.
Typical static prediction schemes are:
- Condition not taken: assumes the conditional branch is not taken; the pipeline continues to be filled with the instructions physically after the conditional branch. Example: early Intel® 486. But this proved correct only a little over 40% of the time; hence it would have been better to abstain from this type of static prediction in the first place.
- Condition taken: assumes conditional branches are taken; the pipeline continues to be filled with instructions at the destination of the conditional branch. Correct about 60% of the time; can be advantageous for low degrees of pipelining.
- BTFN: assumes execution is dominated by while loops; true in some code. Un-optimized while loops have a conditional branch around the loop body, its direction being forward to the first instruction after the loop body, and an unconditional branch back to the beginning of the loop body; hence BTFN prediction. Commonly accurate ~65%.
- Single-bit bias, no profile: provide conditional instructions with a bit which indicates whether the condition is likely true. The compiler can analyze the source code and make reasonable guesses about the condition's outcome; this is encoded in the extra bit; reaches up to about 70% accuracy. For example, exceptions and assertions are almost never taken; the source program can give the compiler clues.
- Single-bit bias, with profiling: run the program, initially compiled without a profile in the bias bit. Then, for all conditional branches, count the number of times the condition was true during execution; use the count to set the bias bit, assuming the next execution, with different data, will have similar behavior. Achieves up to about 75% accuracy; note the similarity to using a profile for Trace Scheduling.
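The profiling step can be mimicked in a few lines; the trace format and function name below are invented for illustration:

```python
from collections import Counter

def profile_bias_bits(trace):
    """trace: iterable of (branch_addr, taken) pairs from a profiling run.
    Returns one bias bit per branch: True if the branch was taken more
    often than not, i.e. the bit the compiler would encode when the
    program is recompiled with the profile."""
    taken = Counter()
    total = Counter()
    for addr, was_taken in trace:
        total[addr] += 1
        if was_taken:
            taken[addr] += 1
    # bias bit is set when the branch was taken in the majority of executions
    return {addr: taken[addr] * 2 > total[addr] for addr in total}

# a loop branch taken 9 of 10 times, an error check never taken
trace = [(0x40, True)] * 9 + [(0x40, False)] + [(0x80, False)] * 5
print(profile_bias_bits(trace))   # {64: True, 128: False}
```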
Dynamic Branch Prediction
- Prediction bit per I-cache line: this scheme encodes no information in the instruction stream, i.e. no information is assembled into the conditional branch instruction. Instead, each cache line holding a sequence of x instructions in the I-cache has an associated prediction bit. This bit, if set, predicts that the next executed conditional branch in the I-cache line will be taken.
  - Problem: there may be no branch in the line at all, thus wasting the bit in the cache.
  - More serious for performance: there may be multiple conditional branches in the line, causing interference between the predictions of their respective conditions.
  - Advantage: low cost, on the order of 1% of cache area; amazingly, reaches up to 80% accuracy.
- 2 prediction bits per I-cache line: similar to the above, but uses a 2-bit saturating counter to predict the next branch; can achieve additional accuracy, since a single wrong guess does not disrupt the scheme; yet suffers similarly from waste and interference.
- Branch History Table (BHT): keep a history bit, a saturating 2-bit counter, or a longer shift register for each represented branch. To contain the cost of history cache area, allot entries only for the last k distinct branch instructions executed. Advantage: increases accuracy to 85%; implemented in the Pentium®. The total cache size is significantly smaller than the possible number of branches of the program; hence evictions will occur, as in a regular data cache.
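A BHT that allots entries only to the most recent branches can be approximated by a small direct-mapped table of 2-bit counters, indexed by the low bits of the branch address; the sizes and names below are illustrative:

```python
class BranchHistoryTable:
    """Direct-mapped table of 2-bit saturating counters, indexed by the low
    bits of the branch address. The table is much smaller than the number of
    possible branches, so distinct branches may alias to the same counter --
    the interference discussed above."""
    def __init__(self, entries=1024):
        self.mask = entries - 1          # entries must be a power of two
        self.counters = [1] * entries    # start weakly not-taken

    def index(self, branch_addr):
        return branch_addr & self.mask

    def predict_taken(self, branch_addr):
        return self.counters[self.index(branch_addr)] >= 2

    def update(self, branch_addr, taken):
        i = self.index(branch_addr)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)

bht = BranchHistoryTable(entries=16)
for _ in range(3):
    bht.update(0x1004, True)           # train a loop branch
print(bht.predict_taken(0x1004))       # True
print(bht.predict_taken(0x1014))       # True: aliases to the same counter
```

The last line shows aliasing: 0x1014 shares the counter of 0x1004 in a 16-entry table, so an unrelated branch inherits its prediction.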
Generic Two-Level Dynamic Branch Prediction
Here is the general two-level branch prediction scheme in a nutshell; the specific scheme used by Yeh and Patt is discussed in detail further below.
1. Remember the direction of the last k conditional branches in a special-purpose cache, implemented as a shift register, named the History Register (HR); it can be global or local. Global means one HR for all branches; local means one HR per branch.
2. Remember the target addresses of the last branches in a special-purpose cache, called the Branch Target Address Cache (BTAC).
3. Use the HR as an index into an array of patterns, called the Pattern Table (PT), each pattern typically implemented as a 2-bit counter predicting the future condition for this history.
4. Once the current branch has been completely computed, update the HR by shifting in the current condition (and shifting out the oldest) and update PT[HR] as indexed by the previous history register state.
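Steps 1, 3, and 4 above can be sketched as a local (per-branch) two-level predictor; the class below is a minimal illustration, not Yeh and Patt's exact design:

```python
class TwoLevelPredictor:
    """Local two-level sketch: a per-branch k-bit History Register (HR)
    indexes a per-branch Pattern Table (PT) of 2-bit saturating counters
    (a PAp-style arrangement)."""
    def __init__(self, k=6):
        self.k = k
        self.hr = {}   # branch address -> k-bit history pattern
        self.pt = {}   # branch address -> list of 2**k two-bit counters

    def _lookup(self, addr):
        if addr not in self.hr:
            self.hr[addr] = 0
            self.pt[addr] = [1] * (1 << self.k)   # start weakly not-taken
        return self.hr[addr], self.pt[addr]

    def predict_taken(self, addr):
        hist, pt = self._lookup(addr)
        return pt[hist] >= 2              # PT[HR] makes the guess

    def update(self, addr, taken):
        hist, pt = self._lookup(addr)
        # step 4: update PT[HR] as indexed by the previous history ...
        pt[hist] = min(pt[hist] + 1, 3) if taken else max(pt[hist] - 1, 0)
        # ... then shift the new outcome into the HR, oldest bit falls out
        self.hr[addr] = ((hist << 1) | int(taken)) & ((1 << self.k) - 1)

p = TwoLevelPredictor(k=4)
hits = 0
for t in [True, False] * 20:      # a strictly alternating branch
    hits += (p.predict_taken(0x200) == t)
    p.update(0x200, t)
print(hits)   # 37 of 40 correct: perfect after a three-miss warm-up
```

On a strictly alternating branch, a 1-bit or 2-bit counter alone mispredicts roughly half the time; here the history pattern selects a dedicated counter per situation, so after a short warm-up every prediction is correct.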
Local Branch Prediction
Local in this context means that each conditional branch has its own, private branch prediction history. For example, each conditional branch may have a two-level, adaptive branch predictor, with a unique history buffer per conditional branch, and either a local pattern history table or a global one, shared between all conditional branches.
For example, the Intel Pentium MMX, Pentium II, and Pentium III used local branch predictors, with a local 4-bit branch history and a local pattern history table of 16 entries (2^4 entries per conditional branch); see [12].
Two-Level Algorithm by Yeh and Patt
This scheme uses a so-called History Register (HR). Whether this is one register per conditional branch or one single global register, we shall specify later. An HR has an associated Pattern Table (PT). The HR is a k-bit shift register that stores the history of the last k outcomes of its associated conditional branch (or possibly of all branches). The PT is accessed (indexed) by this history pattern, and the identified entry predicts the next condition's outcome.
The prediction is performed by a finite state machine that uses the stored bits of the PT to make a guess. The new state of the PT entry is derived from 2 inputs: the previous state and the real outcome of the branch, once the condition has actually been computed. Also the HR is updated, by shifting the new branch bit (1 if taken, else 0) in and the oldest bit out of the HR. Usually each PT entry is a 2-bit saturating counter. The method reaches accuracies of up to 97%. Yeh and Patt argue that for super-pipelined, high-issue architectures even 97% is still poor.
The figure below shows the scheme for conditional branch instruction C0. HR
can exist once, in which case it applies globally to all branch instructions, and
then interferes with the prediction of any other branch. On the other hand,
an architecture may dedicate one local HR per branch, replicating n HRs, one
HR for each of the last n distinct branch instructions. Also, PT may exist once
globally for all HR, or a private PT may exist for each HR if HRs are
replicated per branch.
[Figure: Two-level prediction for conditional branch C0. C0's History Register (HR) holds the 6-bit pattern 100101; this value is used as an index into the Pattern Table (PT), whose 2^6 entries each hold a 2-bit prediction value:]
HR index   PT entry
000000     01
000001     01
000010     11
000011     00
...
100101     01   <- selected by C0's HR
...
111100     01
111101     10
111110     11
111111     11
Yeh and Patt Nomenclature
The prediction scheme by Yeh and Patt (refs. [4]-[7]) can be effective, but consumes ample cache space. Note that for each branch instruction a Branch History Register of k bits, an address tag, and a PT of 2^k entries, each holding a 2-bit prediction pattern, are consumed. Could this same space be utilized better?
Yeh and Patt measured varying accuracies for the same program and the same number of cache bits, varying the scheme as follows. Instead of always using one BHR per branch and one PT per branch, experiments associated a set of multiple PT entries with one BHR.
Unintuitive as this may sound, they observed good prediction accuracy for one global PT and measured this variation as well. Since the number of bits consumed for the cache was to remain constant, a larger number of history bits and/or a larger number of last-executed branches could then be used.
Varying the number of BH registers and the number of PTs leads to the following nomenclature:
First letter (BHT):  P = one BHR per branch;  G = one global BHR for all branches
Middle letter:       A = Adaptive
Last letter (PT):    p = one PT per branch;   g = one global PT for all branches
There are only 3 meaningful choices: GAg, PAg, and PAp. The complete measurements were conducted for a growing number of bits (HW budget), from 8 k to 128 k bits. Interestingly, for sufficiently large cache storage, Yeh and Patt found that the best scheme constrained to 128 k bits is not PAp but a PAg scheme: unintuitively, PAg is the most cost-effective. It delivered the highest accuracy for a fixed HW budget, despite interference.
For other HW budgets, Yeh and Patt found different optimal schemes.
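The storage trade-off behind these choices can be checked with simple arithmetic; the parameter values below are illustrative, not Yeh and Patt's exact configurations, and address tags are ignored:

```python
def storage_bits(scheme, k, n):
    """Approximate state bits for n tracked branches, k history bits,
    and 2-bit PT entries:
      GAg: 1 global k-bit HR,  1 global PT of 2**k entries
      PAg: n k-bit HRs,        1 global PT
      PAp: n k-bit HRs,        n private PTs
    """
    pt_bits = 2 * (1 << k)          # one PT: 2**k entries of 2 bits each
    if scheme == "GAg":
        return k + pt_bits
    if scheme == "PAg":
        return n * k + pt_bits
    if scheme == "PAp":
        return n * k + n * pt_bits
    raise ValueError(scheme)

for s in ("GAg", "PAg", "PAp"):
    print(s, storage_bits(s, k=6, n=512))
# GAg 134, PAg 3200, PAp 68608
```

For a fixed HW budget, PAp's per-branch pattern tables force a much smaller k or far fewer tracked branches, which is why PAg can come out ahead despite its PT interference.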
Prediction Accuracy Table for SPECint92
The figure below shows the approximate percentages of accurately predicted branches in the SPECint92 benchmark, with the prediction schemes on the horizontal axis ordered by improving accuracy from left to right. Recall that BTFN means: Backward Taken, Forward Not.
[Figure: Approximate prediction accuracies in %, in improving order: always taken, never taken, BTFN, 1-bit bias (no profile), 1-bit bias (with profiling), 1-bit dynamic history, 2-bit dynamic history, two-level branch prediction]
Bibliography
1. Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” Microprocessor Report, March 1995, pp. 17-21.
2. Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” MicroDesign Resources, Vol. 9, No. 4, March 27, 1995, on the web at: https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213f00/docs/mpr-branchpredict.pdf
3. Smith, J. [1981]. “A Study of Branch Prediction Strategies,” 8th International Symposium on Computer Architecture, May 1981, pp. 135-148.
4. Yeh, T. and Y. Patt [1991]. “Two-Level Adaptive Branch Prediction,” 24th Annual International Symposium on Microarchitecture (MICRO-24), November 1991, pp. 51-61.
5. Yeh, T. and Y. Patt [1992]. “Alternative Implementations of Two-Level Adaptive Branch Prediction,” 19th International Symposium on Computer Architecture, May 1992, pp. 124-134.
6. Yeh, T. and Y. Patt [1993]. “A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History,” 20th International Symposium on Computer Architecture, May 1993, pp. 257-266.
7. Yeh, Tse-Yu, and Yale N. Patt [1992]. “Alternative Implementations of Two-Level Adaptive Branch Prediction,” 19th Annual International Symposium on Computer Architecture, pp. 124-134. Can be located on the web pages of the University of Michigan.
8. McFarling, Scott [1993]. “Combining Branch Predictors,” WRL Technical Note TN-36, Digital Western Research Lab, June 1993.
9. Hilgendorf, R. B., et al. [1999]. “Evaluation of branch-prediction methods on traces from commercial applications,” IBM Journal of Research & Development, www.research.ibm.com/journal/rd/434/hilgendorf.html
10. Hsien-Hsin Sean Lee: “Branch Prediction,” http://users.ece.gatech.edu/~sudha/academic/class/ece41006100/Lectures/Module3-BranchPrediction/branch.prediction.pdf
11. Jiménez, Daniel A., and Calvin Lin [2001]. “Dynamic Branch Prediction with Perceptrons,” Proceedings of the 7th International Symposium on High Performance Computer Architecture (HPCA-7).
12. Wikipedia, 2011, http://en.wikipedia.org/wiki/Branch_predictor