Branch Prediction
HM (11/30/2011)

If only we could predict the future, computation would be swift and accurate. In such a clairvoyant world, the stalls that branches create in a pipelined architecture could be minimized: as soon as a branch is decoded, the pipeline could be primed again with the right instruction stream from the new location, the destination of the correctly predicted branch. Unfortunately, we generally do not know whether a conditional branch is taken until its condition is completely evaluated, and we do not know the destination of a taken branch, conditional or not, until that operand is evaluated. The same holds for the other flow-control instructions, such as calls, returns, and exceptions. But we can guess the outcome of a conditional branch, we can guess the destination address of any branch, and we can guess wrong. We cannot predict with certainty!

To help us guess the right branch destination, we could remember the branch destination from the last time this branch transferred control, and then predict that this time the destination will be the same. If this helps us guess right most of the time, we gain some advantage; always guessing right would be nicer, but we are, after all, mere mortals. Branch prediction strategies intend to learn from the past to guess future behavior correctly most of the time. Practically, this means one can reach >= 96% prediction accuracy, which helps in a pipelined, superscalar architecture. In fact, to take advantage of pipelined or superscalar execution, very high prediction accuracy is mandatory: each time the prediction is wrong, the pipe has to be flushed, as the multiple arithmetic units hold invalid operands that are not needed, and all the clever HW speed-up methods were in vain. The deeper the pipes of a pipelined architecture, the more stringent the accuracy requirements for a branch prediction scheme.
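The "remember the last destination" idea can be sketched in a few lines of Python. This is a toy illustration of the concept only, not any real processor's hardware; the class name LastTargetPredictor and the use of a dictionary keyed by the branch's address are our own choices.

```python
# Toy model: remember each branch's last destination, keyed by the
# branch instruction's address, and predict that the next execution
# of that branch jumps to the same place.

class LastTargetPredictor:
    def __init__(self):
        self.last_target = {}  # branch address -> last observed destination

    def predict(self, branch_pc):
        # Return the remembered destination, or None if this branch
        # has never been seen before (no prediction possible).
        return self.last_target.get(branch_pc)

    def update(self, branch_pc, actual_target):
        # Once the branch resolves, remember where it actually went.
        self.last_target[branch_pc] = actual_target
```

Unconditional branches and loop back-edges always resolve to the same destination, so such a scheme guesses them right after one execution; indirect branches that change targets defeat it.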
Synopsis

Definitions
Introduction
What's so Bad About Branches?
Static Branch Prediction
Dynamic Branch Prediction
A Two-Level Dynamic Prediction Scheme
Yeh and Patt Nomenclature
Table of Prediction Accuracies for SPECint92
Bibliography

Definitions

BHT, acronym for Branch History Table: The Branch History Table is the collection of branch History Registers (HR) used in single-level or two-level dynamic branch prediction. There could be (a) one HR per conditional branch, (b) one HR each for the last n conditional branches, or (c) just a single HR for all conditional branches. The cost of choice (a) can be excessive, yet it is the most accurate; choice (c), while the least accurate, also costs the least. Usually architects select a compromise. This trade-off of resource cost vs. accuracy is similar to the mapping policy employed in cache design. On real branch prediction HW, just the last few branches executed have their associated HR; otherwise too much HW --silicon space-- would be consumed for the BHT. Each HR records, for the last k executions of its associated conditional branch --or of all branches-- whether (1) or not (0) that branch was taken. In a two-level dynamic branch prediction scheme, the HR has an associated Pattern Table (PT), indexed by the HR; the entry in the PT guesses whether the next branch will be taken. The cost in bits can be contained, because not all branches need an associated HR.

Branch Prediction: Heuristic that guesses the outcome of a conditional branch, the destination of the branch, or both, as soon as a branch instruction is decoded. Perfectly accurate prediction of a branch is, of course, not possible; heuristics aim at guessing right most of the time. For highly pipelined and superscalar architectures, "most of the time" has to mean ~97% or more.

Branch Profiling: Compile a program with a special compiler directive.
Then measure at run-time, for each conditional branch, how many times the branch was taken and how many times not. The next time this same program is compiled, the measured results of the prior run are available to the compiler. This information enables the compiler to bias conditional branches according to past behavior; underlying the scheme is the assumption that past behavior is a reflection of the future. Branch profiling is one of the static branch prediction schemes. It costs one additional execution, and it costs HW instruction bits for the compiler to set the branch bias one way or the other. Generally, static prediction schemes, even with the benefit of a profiling run, are not as effective as the dynamic methods.

BTAC, acronym for Branch Target Address Cache: For very fast performance, it is not sufficient to know ahead of time whether a conditional branch will be taken; for any branch, the destination address should also be known a priori. For this reason, each branch in a BTAC implementation has an associated target address, used in the instruction fetch unit to continue filling the pipeline. Note that after complete decoding of an instruction this target is computed anyway, but knowing the target earlier speeds up execution by filling an otherwise stalled pipeline.

BTB, acronym for Branch Target Buffer: For very fast performance, it is best to know ahead of time whether or not a conditional branch will be taken, and where such a branch leads. The former can be implemented using a BHT with Pattern Table; the latter can be implemented using a BTAC. The combination of the two is called the BTB. This is the scheme implemented on the Intel Pentium Pro® and newer architectures.

BTFN, acronym for Backward Taken Forward Not: A static prediction heuristic assuming that program execution time is dominated by loops, especially while loops.
While loops are characterized by an unconditional branch at the end of the loop body back to the condition, and a conditional branch-if-false at the start, leading to the instruction after the loop body. The backward branch is always taken, and always to the same destination; the forward branch-if-false is taken just once. Since while statements are often executed more than twice, the BTFN heuristic guesses correctly the majority of the time.

Delay of Transfer (Delay Transfer Slot): Pipelined architectures sometimes execute another instruction before completing the current branch instruction: the instruction physically located at the target of the branch. The reason is to recover some of the time lost to the pipeline stall. Thus, compilers or programmers can place the target instruction of the branch physically after the branch. Since it is supposed to be executed anyway as soon as the branch reaches its target, and since the HW already executes it before completing the branch, time is saved. Note that at the target of such an unconditional branch the relocated instruction must be omitted. Example: Intel i860 architecture. When a suitable candidate cannot be found, a NOP instruction is placed physically after the branch, i.e. into the delay slot. There are restrictions; for example, branch instructions and other control-transfer instructions cannot be placed there. If that does happen, a phenomenon called code visiting occurs, with often unpredictable side effects, hence the restriction.

Dynamic Branch Prediction: Branch prediction policy that changes dynamically with the execution of the program. Antonym: Static Branch Prediction.

History Register (HR): k-bit shift register, associated with a conditional branch. The bits indicate, for each of the last k executions of the associated conditional branch, whether it was taken, 1 meaning yes.
Each newly shifted-in bit pushes out the oldest, since the HR has only a limited, fixed length.

Interference: When multiple branches are associated with one HW data structure (such as an HR or PT), the behavior of each branch influences the data structure's state. However, that state will be used for the next branch, even if it is not the branch that modified the most recent state.

Mispredicted Branch (AKA Miss): The condition or destination of a branch was predicted incorrectly. As a consequence, the control of execution took a different flow than predicted.

Mispredicted Branch Penalty: Number of cycles lost due to having incorrectly guessed the change in flow of control caused by a branch instruction.

Pattern Table (PT): A table of entries, each specifying whether the associated conditional branch will be taken. An entry in the PT is selected using the history bits of a branch History Register (HR). This can be done by indexing, in which case the number of entries in the PT is 2^k, with k being the number of bits stored in the History Register. Otherwise, if the number of entries is < 2^k, a hashing scheme can be applied, causing interference. Each PT entry holds boolean information about the next execution of the conditional branch: will it be taken or not.

Saturating Counter: n-bit unsigned integer counter, n typically being 2..16 for branch prediction HW. When all bits are on and counting up continues, a saturating counter stays at the maximum value; when all bits are off and counting down continues, it stays at 0. This creates a limited hysteresis effect on the behavior of the event depending on such a counter.

Shift Register: Register with a small number of bits, tracking a binary event. If the event did occur, a 1 bit is shifted into the register at one end; this is the newest bit. The oldest bit is shifted out at the opposite end.
Conversely, if the event did NOT occur, a 0 bit is shifted in, and the oldest bit is shifted out. All other bits shift their position by one.

Static Branch Prediction: A branch prediction policy that is embedded in the binary code or implemented in the hardware that executes the branches. The policy does not change during execution of the program, even if it is known to be wrong all the time; in that case, execution would be better off without branch prediction. The BTFN heuristic is a static branch prediction policy. It requires zero instruction bits: the hardware compares the destination of a branch with the conditional branch's own address. Smaller destination addresses lead backwards and are assumed taken; destination addresses larger than the branch address are assumed not taken, and the predicted next instruction is the successor of the conditional branch. Typical industry benchmarks (SPECint89) achieve almost 65% correct prediction with this simple scheme.

Two-Level Branch Prediction: Instead of associating only a local branch history register with a conditional branch, a two-level branch prediction scheme also associates prediction bits (a pattern table) with each branch execution history. Thus, each pattern of past branch behavior has its own prediction of the future, costing more storage but yielding better accuracy. For example, each conditional branch may have a k-bit Branch History Register, which records for each of the last k executions whether or not the condition was satisfied, and each possible history pattern has an associated prediction of the future. Typically, the latter is implemented as a 2-bit saturating counter.

Wide Issue: Older architectures issue (i.e. fetch, decode, etc.) one instruction at a time; for example, 1 instruction per clock cycle on a RISC architecture. Computers since about 1980 issue more than 1 instruction at a time; this is called wide issue. Synonym: super-scalar architecture.
Antonym: single-issue.

Introduction

Execution on a highly pipelined and wide-issue architecture suffers severe degradation whenever an instruction disrupts the prefetched flow of operations. Typically, branch instructions cause pipeline hazards. The higher the degree of pipelining, and the higher the superscalar degree, the more of the partially executed (fetched, decoded, operand-fetched, etc.) instructions must be discarded, and the pipeline must be primed again, i.e. filled again with partially executed instructions. However, more than one in five operations is a control-flow instruction --e.g. branch, call, return, conditional branch, exit, etc. This almost invalidates the architectural advantage of pipelining. If it were possible to predict whether a condition is true before computing it, and if the machine could also predict the destination of a branch before generating it from the instruction stream, then as soon as any branch is decoded, the pipe could be filled and hazards would be avoided.

Obviously, complete prediction of the future is not possible. However, good guesses about the future can be made based on past behavior; in fact, reasonably good guesses can be made based on the static nature of the branch itself. These are called branch prediction schemes. Prediction can be static or dynamic: the former does not change during program execution, while the latter evolves as a function of program behavior during execution. Since about 2000, dynamic branch prediction techniques using perceptrons have further improved accuracy to almost 98% for several benchmarks [12].

The graph referenced below shows prediction accuracies for certain benchmarks on two competing processor types, the Intel Core Duo and the AMD K8. We see an accuracy of 96% or even more; the 99% accuracy goal is the Holy Grail of branch prediction.

Intel Core Duo two-level dynamic branch prediction vs.
AMD K8, measured across multiple games and applications, shows the benefit of Intel's branch prediction investment:
http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5
© Real World Technologies, April 2009

The _H and _L suffixes allude to high vs. low levels of optimization used during the compile step.

What's so Bad About Branches?

Performance penalties, delay, disturbance:
- Disruption of sequential control flow; the anticipated flow of the pipeline is disturbed. The higher the number of pipeline stages, the greater the relative penalty. Another case in point: deep pipelining is a liability, not pure goodness!
- I-cache disturbance, due to fetching from some new address range.
- Must compute the condition of a branch to determine the future direction: fall-through, or to the new target?
- Must determine the target of an unconditional branch.

Determine branch direction:
- Cannot immediately fetch the subsequent instruction, since it is not known.
- Remedy: if possible, move the instructions that compute the branch condition away from the branch, so that the resulting wait is minimized.
- Or make use of the penalty: bias the case toward NOT taken, or vice versa; done in some static schemes.
- Fill the delay slot with a useful instruction (Intel i860 processor). This HW trick is used less and less in the 2000s.
- Execute both paths speculatively; once the condition is known, kill the superfluous path. This requires more HW, and can cause an explosion of HW when jumping to further branches; done on the Itanium Processor Family (IPF).
- Predict the branch direction, as discussed here.

Determine branch target:
- Must know the target address to fetch next; hence use prediction.

Sample prediction algorithm, with 2 bits, reaching > 80% accuracy. This is an awesome policy: 2 data bits plus logic suffice for an amazing degree of accuracy, even global for all branches; it works well despite the heavy interference!

Two-Bit Saturating Counter, Taken vs.
Not Taken

Static Branch Prediction

Common to all static branch prediction schemes listed below: small cost in extra hardware and cache. Also common: the achievable ~70% prediction accuracy is cheap, but generally not sufficient for highly pipelined or for multi-way, superscalar architectures. Typical static prediction schemes are:

- Condition not taken: assumes the conditional branch is not taken; the pipeline continues to be filled with the instructions physically after the conditional branch; example: early Intel® 486. But this proved correct only a little over 40% of the time; hence it would have been better to abstain from this type of static prediction in the first place.
- Condition taken: assumes conditional branches are taken; the pipeline continues to be filled with instructions at the destination of the conditional branch; correct about 60% of the time; can be advantageous for low degrees of pipelining.
- BTFN: assumes execution is dominated by while loops, which is true in some code. Un-optimized while loops have a conditional branch around the loop body, its direction being forward to the first instruction after the loop body, and an unconditional branch back to the beginning of the loop body; hence BTFN prediction. Commonly accurate ~65% of the time.
- Single-bit bias, no profile: provide conditional instructions with a bit indicating whether the condition is likely true. The compiler can analyze the source code and make reasonable guesses about the condition's outcome; this is encoded in the extra bit; reaches up to about 70% accuracy. For example, exceptions and assertions are almost never taken, and the source program can give the compiler clues.
- Single-bit bias, with profiling: run the program, initially compiled without a profile in the bias bit.
Then, for all conditional branches, count the number of times the condition was true during execution; use the count to set the bias bit, assuming the next execution, with different data, will have similar behavior. This achieves up to about 75% accuracy; note the similarity to using profiles for Trace Scheduling.

Dynamic Branch Prediction

- One prediction bit per I-cache line: this scheme encodes no information in the instruction stream, i.e. no information is assembled into the conditional branch instruction. Instead, each cache line holding a sequence of x instructions in the I-cache has an associated prediction bit. This bit, if set, predicts that the next executed conditional branch in the I-cache line will be taken. Problem: there may be no branch in the line at all, thus wasting the bit in the cache. More serious for performance, there may be multiple conditional branches, causing interference among the predictions of their respective conditions. Advantage: low cost, on the order of 1% of cache area; reaches up to 80% accuracy, amazingly.
- Two prediction bits per I-cache line: similar to the above, but uses a 2-bit saturating counter to predict the next branch; can achieve additional accuracy, since a single wrong guess does not disrupt the scheme; yet it suffers similarly from waste and interference.
- Branch History Table (BHT): keep a history bit, or a saturating 2-bit counter, or a longer shift register for each represented branch. To contain the cost of history cache area, allot entries only for the last k different branch instructions executed. Advantage: increases accuracy to 85%; implemented in the Pentium®. The total cache size is significantly smaller than the possible number of branches in the program; hence evictions will occur, as in a regular data cache.

Generic Two-Level Dynamic Branch Prediction

The general two-level branch prediction scheme is discussed here in a nutshell; the specific scheme used by Yeh and Patt is discussed in detail afterwards. The steps:

1.
Remember the direction of the last k conditional branches in a special-purpose cache implemented as a shift register, named the History Register (HR); it can be global or local: global means one HR for all branches, local means one HR per branch.
2. Remember the target addresses of the last branches in a special-purpose cache, called the Branch Target Address Cache (BTAC).
3. Use the HR as an index into an array of patterns, called the Pattern Table (PT), each pattern typically implemented as a 2-bit counter predicting the future condition for this history.
4. Once the current branch has been completely computed, update the HR by shifting in the current condition --and shifting out the oldest-- and update PT[HR] as indexed by the last history register state.

Local Branch Prediction

Local in this context means that each conditional branch has its own private branch prediction history cache. For example, each conditional branch may have a two-level adaptive branch predictor, with a unique history buffer per conditional branch, and either a local pattern history table or a global one shared among all conditional branches. The Intel Pentium MMX, Pentium II, and Pentium III used local branch predictors, with a local 4-bit branch history and a local pattern history table of 16 entries (2^4 entries per conditional branch); see [12].

Two-Level Algorithm by Yeh and Patt

This scheme uses a so-called History Register (HR); whether this is one register per conditional branch or one single global register, we shall specify later. An HR has an associated Pattern Table (PT). The HR is a k-bit shift register that stores the history of the last k outcomes of its associated conditional branch --or possibly of all branches. The PT is accessed (indexed) by this history pattern, and the identified entry predicts the next condition's outcome. The prediction is performed by a finite state machine that uses the stored bits of the PT to make a guess.
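The lookup-and-update cycle of such a two-level predictor can be sketched in Python as follows. This is a toy model for illustration only: the history length k = 6 and the weakly-not-taken counter initialization are our choices for the example, not values taken from Yeh and Patt's hardware.

```python
# Sketch of one branch's two-level predictor: a k-bit History Register
# (HR) indexes a Pattern Table (PT) of 2-bit saturating counters.

K = 6                                    # history length (bits in the HR)

class TwoLevelPredictor:
    def __init__(self, k=K):
        self.k = k
        self.hr = 0                      # k-bit history; newest outcome in the LSB
        self.pt = [1] * (2 ** k)         # 2-bit counters, start weakly not-taken

    def predict(self):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not taken.
        return self.pt[self.hr] >= 2

    def update(self, taken):
        # First train the PT entry the prediction came from (saturating at 0 / 3)...
        entry = self.pt[self.hr]
        self.pt[self.hr] = min(entry + 1, 3) if taken else max(entry - 1, 0)
        # ...then shift the actual outcome into the HR, dropping the oldest bit.
        self.hr = ((self.hr << 1) | int(taken)) & ((1 << self.k) - 1)
```

Because every distinct history pattern owns its own counter, the sketch learns not just "usually taken" but repeating patterns: after a warm-up on a strictly alternating taken/not-taken branch, each of the two histories it visits predicts the correct next outcome.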
The new state of the PT entry is derived from two inputs: the previous state and the real outcome of the branch, once the condition has actually been computed. The HR is also updated, by left-shifting the new branch bit (1 if taken, else 0) in and the oldest bit out. Usually each PT entry is a 2-bit saturating counter. The method reaches accuracies of up to 97%; Yeh and Patt argue that for super-pipelined, high-issue architectures even 97% is still poor.

The figure below shows the scheme for conditional branch instruction C0. The HR can exist once, in which case it applies globally to all branch instructions and then interferes with the prediction of every other branch. On the other hand, an architecture may dedicate one local HR per branch, replicating n HRs, one for each of the last n distinct branch instructions. Likewise, the PT may exist once, globally for all HRs, or a private PT may exist for each HR if HRs are replicated per branch.

[Figure: conditional branch C0 with History Register (HR) = 100101; "use HR value as index" selects an entry of the Pattern Table (PT):

  HR index   PT entry
  000000     01
  000001     01
  000010     11
  000011     00
  ...
  100101     01
  ...
  111100     01
  111101     10
  111110     11
  111111     11 ]

Yeh and Patt Nomenclature

The prediction scheme by Yeh and Patt (refs. [4]-[7]) can be effective, but consumes ample cache space. Note that for each branch instruction, a Branch History Register of k bits, an address tag, and a PT of 2^k entries, each holding a 2-bit prediction pattern, are consumed. Could this same space be utilized better? Yeh and Patt measured varying accuracies for the same program and the same number of cache bits while varying the scheme as follows. Instead of always using one BHR per branch and one PT per branch, experiments associated a set of multiple PT entries with one BHR. Unintuitive as this may sound, they observed good prediction accuracy for one global PT, and measured this variation as well.
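The "one global PT shared by per-branch history registers" variant can be sketched as a toy Python model. The class name SharedPatternPredictor and all parameters are our own illustrative choices, not Yeh and Patt's hardware.

```python
# Toy sketch of the "per-branch HR, one global PT" idea: every branch
# keeps its own k-bit history, but all histories index the same shared
# table of 2-bit saturating counters.

K = 6

class SharedPatternPredictor:
    def __init__(self, k=K):
        self.k = k
        self.mask = (1 << k) - 1
        self.hrs = {}                 # branch address -> its private k-bit history
        self.pt = [1] * (1 << k)      # ONE global PT shared by all branches

    def predict(self, pc):
        return self.pt[self.hrs.get(pc, 0)] >= 2

    def update(self, pc, taken):
        hr = self.hrs.get(pc, 0)
        # Train the shared counter selected by this branch's history...
        self.pt[hr] = min(self.pt[hr] + 1, 3) if taken else max(self.pt[hr] - 1, 0)
        # ...then shift the outcome into this branch's private history.
        self.hrs[pc] = ((hr << 1) | int(taken)) & self.mask
```

Two branches whose recent histories happen to match hit the same shared counter; that is exactly the interference discussed above, traded against the bits saved by not replicating the PT per branch.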
Since the number of bits consumed for the cache was to remain constant, a larger number of history bits and/or a larger number of last-executed branches could be used. Varying the number of BH registers and the number of PTs leads to the following nomenclature:

  Varying the BHT:                        Varying the PT:
  P  one BHR per branch                   p  one PT per branch
  G  one global BHR for all branches      g  one global PT for all branches

(The middle letter A denotes the two-level adaptive scheme itself, as in GAg, PAg, PAp.) There are only 3 meaningful choices: GAg, PAg, and PAp. The complete measurements were conducted for a growing HW budget, from 8 k to 128 k bits. Interestingly, for sufficiently large cache storage, Yeh and Patt found that the best scheme constrained to 128 k bits is not PAp but PAg: unintuitively, PAg is the most cost-effective, delivering the highest accuracy for that fixed HW budget despite interference. For other HW budgets, Yeh and Patt found different optimal schemes.

Prediction Accuracy Table for SPECint92

The figure below shows the approximate percentages of accurately predicted branches in the SPECint92 benchmark, in improving order from left to right. Recall that BTFN means Backward Taken, Forward Not.

[Figure: approximate prediction accuracies in %, improving from left to right: always taken, never taken, BTFN, 1-bit bias no profiling, 1-bit bias with profiling, 1-bit dynamic history, 2-bit dynamic history, 2-level branch prediction.]

Bibliography

1. Gwennap, L. [1995]. "New Algorithm Improves Branch Prediction," Microprocessor Report, March 1995, pp. 17-21.
2. Gwennap, L. [1995]. "New Algorithm Improves Branch Prediction," MicroDesign Resources, Vol. 9, No. 4, March 27, 1995, on the web at: https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213f00/docs/mpr-branchpredict.pdf
3. Smith, J. [1981]. "A Study of Branch Prediction Strategies," 8th International Symposium on Computer Architecture, May 1981, pp. 135-148.
4. Yeh, T. and Y. Patt [1991].
"Two-Level Adaptive Branch Prediction," 24th International Symposium on Microarchitecture, November 1991, pp. 51-61.
5. Yeh, T. and Y. Patt [1992]. "Alternative Implementations of Two-Level Adaptive Branch Prediction," 19th International Symposium on Computer Architecture, May 1992, pp. 124-134.
6. Yeh, T. and Y. Patt [1993]. "A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History," 20th International Symposium on Computer Architecture, May 1993, pp. 257-266.
7. Yeh, Tse-Yu, and Yale N. Patt [1992]. "Alternative Implementations of Two-Level Adaptive Branch Prediction," 19th Annual International Symposium on Computer Architecture, pp. 124-134. Can be located on the web pages of the University of Michigan.
8. McFarling, Scott [1993]. "Combining Branch Predictors," WRL Technical Note TN-36, Digital Western Research Lab, June 1993.
9. Hilgendorf, R. B., et al. [1999]. "Evaluation of branch-prediction methods on traces from commercial applications," IBM Journal of Research & Development. www.research.ibm.com/journal/rd/434/hilgendorf.html
10. Hsien-Hsin Sean Lee: "Branch Prediction," http://users.ece.gatech.edu/~sudha/academic/class/ece41006100/Lectures/Module3-BranchPrediction/branch.prediction.pdf
11. Jiménez, Daniel A., and Calvin Lin [2000]. "Dynamic Branch Prediction with Perceptrons," Proceedings of the 7th International Symposium on High-Performance Computer Architecture.
12. Wikipedia, 2011, http://en.wikipedia.org/wiki/Branch_predictor