Dynamic Hardware Prediction
• Importance of control dependences
  – Branches and jumps are frequent
  – Limiting factor as ILP increases (Amdahl’s law)
• Schemes to attack control dependences
  – Static
    • Basic (stall the pipeline)
    • Predict-not-taken and predict-taken
    • Delayed branch and canceling branch
  – Dynamic predictors
• Effectiveness of dynamic prediction schemes
  – Accuracy
  – Cost of a correctly predicted branch
  – Cost of an incorrectly predicted branch

Basic Branch Prediction Buffers
• a.k.a. Branch History Table (BHT)
  – Small direct-mapped cache of T/NT bits
[Figure: BHT lookup. Labels: PC, Branch Instruction (IR), + (adder), Branch Target, BHT, T (predict taken), NT (predict not-taken), PC + 4]

N-bit Branch Prediction Buffers
• Use an n-bit saturating counter
• Only the loop exit causes a misprediction
• 2-bit predictor almost as good as any general n-bit predictor
[Figure: 2-bit predictor state diagram. States 11 and 10: predict taken. States 01 and 00: predict not taken. Edges labeled taken / not taken]

Correlating Predictors
• a.k.a. Two-level Predictors
  – Use recent behavior of other (previous) branches
[Figure: BHT lookup as above, with an added 1-bit global branch history: T/NT (stores behavior of previous branch)]

Example

    if (d == 0) d = 1;
    if (d == 1) whatever;

            BNEZ   R1, L1        ; branch b1 (d!=0)
            ADDI   R1, R0, #1
    L1:     SUBUI  R3, R1, #1
            BNEZ   R3, L2        ; branch b2 (d!=1)
            ...
    L2:

Basic one-bit predictor

    d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
    2     NT        T           T             NT        T           T
    0     T         NT          NT            T         NT          NT
    2     NT        T           T             NT        T           T
    0     T         NT          NT            T         NT          NT

One-bit predictor with one-bit correlation

    d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
    2     NT/NT     T           T/NT          NT/NT     T           NT/T
    0     T/NT      NT          T/NT          NT/T      NT          NT/T
    2     T/NT      T           T/NT          NT/T      T           NT/T
    0     T/NT      NT          T/NT          NT/T      NT          NT/T

(m, n) Predictors
• Use behavior of the last m branches
• 2^m n-bit predictors for each branch
• Simple implementation
  – Use m-bit shift register to record the behavior of the last m branches
[Figure: (m,n) predictor lookup. Labels: m-bit GBH, PC, + (combines the two), (m,n) BPF, n-bit predictor]

Size of the Buffers
• Number of bits in a (m,n) predictor
  – 2^m x n x Number of entries in the table
• Example – assume 8K bits in the BHT
  – (0,1): 8K entries
  – (0,2): 4K entries
  – (2,2): 1K entries
  – (12,2): 1 entry!
    • Does not use the branch address
    • Relies only on the global branch history

Performance of 2-bit Predictors
[Figure: bar chart of frequency of mispredictions (axis 0 to 20) on the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li, for three predictors: (0,2) 4K entries, (0,2) 1M entries, (2,2) 1K entries]

Branch-Target Buffers
• Further reduce control stalls (hopefully to 0)
• Store the predicted address in the buffer
• Access the buffer during IF
[Figure: the PC is looked up (=) in the buffer; each entry holds a predicted address and T/NT. NO: instruction is not a branch. YES: instruction is a branch]

Prediction with BTF
• IF: Send PC to memory and BTF; entry found in BTF?
  – YES: Send out predicted address
• ID
  – Entry not found: Is instr a taken branch?
  – Entry found: Taken branch?
• EX
  – Entry not found, instr is a taken branch: Update BTF
  – Entry found, branch not taken: Kill fetched instr; restart fetch at other target; delete entry from BTF

Target Instruction Buffers
• Store target instructions instead of addresses
• Advantages
  – BTB access can take longer than time between IFs and BTB can be larger
  – Branch folding
    • Zero-cycle unconditional branches
      – Replace branch with target instruction
    • Zero-cycle conditional branches
      – Condition codes preset

Procedure Return Predictors
• Use buffer (stack) of return addresses
[Figure: misprediction rate (axis 0 to 60) versus number of entries in the return stack (1, 2, 4, 8, 16) for gcc, li, and fpppp]

Performance Issues
• Limitations of branch prediction schemes
  – Prediction accuracy (80% - 95%)
    • Type of program
    • Size of buffer
  – Penalty of misprediction
• Fetch from both directions to reduce penalty
  – Memory system should:
    • Be dual-ported
    • Have an interleaved cache
    • Fetch from one path and then from the other

Approaches to Improve Performance
• Goal so far: achieve CPI = 1
  – Eliminate structural, data, and control stalls
• Additional performance improvements
  – Make clock rate faster
    • Improve manufacturing process
  – Increase the number of stages
    • Superpipelining
  – Multiple issue of instructions
    • Superscalar
    • VLIW
    • IPC instead of CPI!

Superscalar Processors
• Issue more than one instruction per cycle
• Duplication of functional units
• Constraints (sound familiar?)
  – Structural
  – Data dependencies
  – Control dependencies
• Scheduling of instructions
  – Static
  – Dynamic
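
Worked sketch: (m, n) predictors on the b1/b2 example

The (m, n) predictor organization, the buffer-size formula, and the b1/b2 example from the Correlating Predictors slides can be tried out in a short Python simulation. This is a minimal sketch under stated assumptions, not a model of any real hardware. The class and function names, the branch addresses B1_PC and B2_PC, the 16-entry table, and the all-not-taken initial state are hypothetical choices made for illustration.

```python
class SaturatingCounter:
    """n-bit saturating counter; the upper half of its range predicts taken."""

    def __init__(self, bits):
        self.max_value = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)
        self.value = 0  # assumed initial state: strongly not taken

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        if taken:
            self.value = min(self.max_value, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


class MNPredictor:
    """(m, n) predictor: 2**m n-bit counters per entry, chosen by global history."""

    def __init__(self, m, n, entries):
        self.m = m
        self.entries = entries
        self.history = 0  # m-bit shift register, most recent branch in bit 0
        self.table = [[SaturatingCounter(n) for _ in range(1 << m)]
                      for _ in range(entries)]

    def _counter(self, pc):
        # Direct-mapped: word-aligned PC selects the entry, history the counter.
        return self.table[(pc >> 2) % self.entries][self.history]

    def predict(self, pc):
        return self._counter(pc).predict()

    def update(self, pc, taken):
        self._counter(pc).update(taken)
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def entries_for_budget(m, n, total_bits):
    """Entries that fit, using bits = 2**m * n * entries."""
    return total_bits // ((1 << m) * n)


B1_PC, B2_PC = 0x10, 0x1C  # hypothetical addresses of the two BNEZ instructions


def branch_trace(d_values):
    """Yield (pc, taken) for b1 and b2 for each incoming value of d."""
    for d in d_values:
        b1_taken = d != 0          # BNEZ R1, L1
        yield B1_PC, b1_taken
        if not b1_taken:
            d = 1                  # ADDI R1, R0, #1
        yield B2_PC, d != 1        # BNEZ R3, L2 with R3 = d - 1


def count_mispredictions(predictor, trace):
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    d_sequence = [2, 0, 2, 0]
    for m, n in [(0, 1), (1, 1)]:
        misses = count_mispredictions(MNPredictor(m, n, 16),
                                      branch_trace(d_sequence))
        print(f"({m},{n}) predictor: {misses} mispredictions out of 8")
    for m, n in [(0, 1), (0, 2), (2, 2), (12, 2)]:
        print(f"({m},{n}) with 8K bits: {entries_for_budget(m, n, 8192)} entries")
```

On the d = 2, 0, 2, 0 sequence, the (0,1) configuration mispredicts all eight branches, while the (1,1) configuration mispredicts only the first two. Both results agree with the two tables in the example, and entries_for_budget reproduces the 8K-bit sizing list.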