Abstraction Question
• General-purpose processors have an abstraction layer fixed at the ISA, and their designers have little control over the compilers or code run on the machine.
• Embedded processors tend to have complete control over the code run, the compiler (if any), the ISA, and the hardware, so breaking the abstraction layer makes sense.
• Bottom line: it comes down to who controls which elements of the design.

Intel side note
We've been doing MIPS. How did Intel do this for a CISC x86 ISA? Micro-ops, from the Pentium Pro on.
Read 5.1, 5.2

Branch Hazards or “Which way did he go?”
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Control Dependence
• Just as an instruction will be dependent on other instructions to provide its operands (data dependence), it will also be dependent on other instructions to determine whether it gets executed or not (control dependence or branch dependence).
• Control dependences are particularly critical with conditional branches.
            add $5, $3, $2
            sub $6, $5, $2
            beq $6, $7, somewhere
            and $9, $6, $1
            ...
somewhere:  or  $10, $5, $2
            add $12, $11, $9
            ...

Dealing With Branch Hazards
• Hardware
– stall until you know which direction
– reduce the hazard through earlier computation of the branch direction
– guess which direction
    – assume not taken (easiest)
    – more educated guess based on history (requires that you know it is a branch before it is even decoded!)
• Hardware/Software
– no-ops, or instructions that get executed either way (delayed branch)

Stalling the pipeline
Given our current pipeline, let's assume we stall until we know the branch outcome. How many cycles will you lose per branch?
Selection   Cycles
A           0
B           1
C           2
D           3
E           4

Stalling for Branch Hazards
beq $4, $0, there
and $12, $2, $5
or  ...
add ...
sw  ...
[Pipeline diagram, CC1 to CC8: three bubbles are inserted after the beq before the following instructions proceed.]

Stalling for Branch Hazards
• Seems wasteful, particularly when the branch isn't taken.
• Why? It makes all branches cost 4 cycles.
• What if we just assume branches aren't taken?

Assume Branch Not Taken
• Works pretty well when you're right.
[Pipeline diagram, CC1 to CC8: beq $4, $0, there; and $12, $2, $5; or ...; add ...; sw ... proceed back to back.]

Assume Branch Not Taken
• Same performance as stalling when you're wrong.
[Pipeline diagram, CC1 to CC8: beq $4, $0, there is followed by the wrong-path instructions (and $12, $2, $5; or ...; add ...), which are flushed; fetch then continues at there: sub $12, $4, $2.]

Stalling the pipeline
Let's improve the pipeline so we move branch resolution to Decode. How many cycles would we lose then on a taken branch?
Selection   Cycles
A           0
B           1
C           2
D           3
E           4
[Add drawing of resolving in Decode.]

The Pipeline with flushing for taken branches
• Notice the IF/ID flush line added.

Branch Hazards – Assume Not Taken
• Great if most of your branches aren't taken.
• What about loops, which are taken 95% of the time?
– We would like the option of assuming not taken for some branches and taken for others, depending on ???

Branch Hazards – Predicting Taken
[Pipeline diagram, CC1 to CC8: beq $2, $1, here followed by here: lw.]
Reading quiz
Required information to predict Taken:
1. An instruction is a branch before decode
2. The target of the branch
3. The outcome of the branch
Selection   Required knowledge
A           2,3
B           1,2,3
C           1,2
D           2
E           None of the above

Branch Target Buffer
• Keeps track of the PCs of recently seen branches and their targets.
• Consult during Fetch (in parallel with the Instruction Memory read) to determine:
– Is this a branch?
– If so, what is the target?

Branch Hazards – Predict Taken
• Static policy:
– Forward branches (if statements): predict not taken
– Backward branches (loops): predict taken
• Dynamic prediction (coming soon)
• First: Branch Delay Slots

Eliminating the Branch Stall
• There's no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching?
• The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches.
• The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!

Branch Delay Slot
[Pipeline diagram, CC1 to CC8: beq $4, $0, there; and $12, $2, $5; there: or ...; add ...; sw ...]
The branch delay slot instruction (the next instruction after a branch) is executed even if the branch is taken.

Filling the branch delay slot
1        add  $5, $3, $7       No-R7 WAR
2        add  $9, $1, $3       Safe, $1 and $3 are fine
3        sub  $6, $1, $4       No-R6
4        and  $7, $8, $2       No-R7
5        beq  $6, $7, there
         nop  /* branch delay slot */
6        add  $9, $1, $2       Not safe ($9 on nt path)
7        sub  $2, $9, $5       Not safe (needs $9 not yet produced)
         ...
there:
8        mult $2, $10, $11     Not safe ($2 is used before overwritten)
         …
Selection   Safe instructions
A           1,2
B           2,6
C           6,8
D           1,2,7,8
E           None of the above
E is the correct answer.
* It is not safe to assume anything about the … code.

Filling the branch delay slot
• The branch delay slot is only useful if you can find something to put there.
• If you can't find anything, you must put a no-op there to ensure correctness.

Branch Delay Slots
• This works great for this implementation of the architecture, but becomes a permanent part of the ISA.
• What about the MIPS R10000, which has a 5-cycle branch penalty and executes 4 instructions per cycle???
• Bottom line: Exposed a detail of the hardware implementation to the ISA.
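
As a rough comparison of the options discussed so far, the sketch below estimates CPI for a pipeline whose only lost cycles come from branches. The function name `branch_cpi`, the 20% branch fraction, the 60% taken rate, and the 50% delay-slot fill rate are illustrative assumptions, not figures from these slides. The 3-cycle penalty matches the stall case in which every branch costs 4 cycles; the 1-cycle penalty assumes the branch is resolved in Decode on a standard 5-stage pipeline.

```python
def branch_cpi(branch_frac, penalty, frac_paying):
    """CPI of an otherwise ideal pipeline (base CPI of 1) in which a
    fraction `frac_paying` of branches each lose `penalty` cycles."""
    return 1.0 + branch_frac * frac_paying * penalty


# Illustrative workload numbers (assumptions, not taken from the slides).
BRANCH_FRAC = 0.20  # fraction of instructions that are branches
TAKEN_FRAC = 0.60   # fraction of branches that are taken
SLOT_FILLED = 0.50  # fraction of delay slots filled with useful work

strategies = {
    # Stall on every branch until its outcome is known.
    "stall": branch_cpi(BRANCH_FRAC, 3, 1.0),
    # Assume not taken: only taken branches pay the 3-cycle flush.
    "assume not taken": branch_cpi(BRANCH_FRAC, 3, TAKEN_FRAC),
    # Resolve in Decode: a taken branch flushes a single instruction.
    "not taken, resolve in ID": branch_cpi(BRANCH_FRAC, 1, TAKEN_FRAC),
    # One delay slot: a cycle is wasted only when the slot holds a no-op.
    "delay slot": branch_cpi(BRANCH_FRAC, 1, 1.0 - SLOT_FILLED),
}

for name, cpi in strategies.items():
    print(f"{name:26s} CPI = {cpi:.2f}")
```

Under these assumed numbers the ordering follows the slides' argument: stalling is the most expensive, and each refinement only pays the penalty on a smaller share of branches.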
Dynamic Branch Prediction
• Can we guess the outcome of branches?
• What should we base that guess on?

Branch Prediction
[Diagram: the program counter indexes a table of 1-bit entries (1, 0, 1, ...).]
for (i=0;i<10;i++) {
    ...
}
        ...
        add $i, $i, #1
        beq $i, #10, loop
Accuracy? Too easily swayed.

Two-bit predictors give better loop prediction
[Diagram: the branch address indexes a PHT of 2-bit counters.]
Strongly Taken       11
Weakly Taken         10
Weakly Not Taken     01
Strongly Not Taken   00
Increment when taken
Decrement when not taken
for (i=0;i<10;i++) {
    ...
}
        ...
        add $i, $i, #1
        beq $i, #10, loop
Better (less sway)
Slower learning

Suppose we have the following branch patterns for 3 branches (A, B, C). What is the accuracy of a 1-bit and a 2-bit Branch History Table? Assume initial values of 1 (1-bit) and 10 (2-bit).
1-bit              2-bit
A. T T T T N       A. T T T T N
B. T T N T N       B. T T N T N
C. N T N T N       C. N T N T N
Strongly Taken       11
Weakly Taken         10
Weakly Not Taken     01
Strongly Not Taken   00
Increment when taken
Decrement when not taken

Modern Branch Prediction - Pentium 4
• Performance dependent on accurate Branch Prediction
• 20 Stage Pipeline
– 3-way issue
– 60 instructions in flight (12 branches)
– 17th stage is branch resolution
– ~17*3=51 instructions lost on mispredict
[Diagram: pipeline stages 1 to 20 with an X marker.]

Branch Prediction
• The latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques.
• They use patterns of branches (local history) and recent other branch history (global history) to make predictions.

Putting it all together.
For a given program on our 5-stage MIPS pipeline processor:
• 20% of instructions are loads, and 50% of instructions following a load are arithmetic instructions depending on the load.
• 20% of instructions are branches. Using dynamic branch prediction, we achieve 80% prediction accuracy.
What is the CPI of your program?
Selection   CPI
A           0.76
B           0.9
C           1.0
D           1.1
E           1.14

Control Hazards -- Key Points
• Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching.
• Control hazards are detected in hardware.
• We can reduce the impact of control hazards through:
– early detection of branch address and condition
– branch prediction
– branch delay slots

Given our 5-stage MIPS pipeline, what is the steady-state CPI for the following code? Assume the branch is taken thousands of times.
Recall: a processor is in steady state when all stages are active.
Steady-State CPI = (#insts + #stalls + #flushed_insts) / #insts
Loop:   lw  r1, 0(r2)
        add r2, r3, r4
        sub r5, r1, r2
        beq r5, $zero, Loop
Selection   CPI
A           1
B           1.25
C           1.5
D           1.75
E           None of the above

Isomorphic
Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor:
IF = 200ps
ID = 100ps
EX = 100ps
M = 200ps
WB = 100ps
Consider splitting IF and M into 2 stages each (so IF1, IF2 and M1, M2). The most important code run by the company is (assume the branch is taken most of the time):
Loop:   lw  r1, 0(r2)
        add r2, r3, r4
        sub r5, r1, r2
        beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Selection   CPI        CT
A           Increase   Increase
B           Increase   Decrease
C           Decrease   Increase
D           Decrease   Decrease
E           Increase   No Change

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor. Consider splitting IF and M into 2 stages each (so IF1, IF2 and M1, M2).
IF = 200ps
ID = 100ps
EX = 200ps
M = 200ps
WB = 100ps
The most important code run by the company is (assume the branch is taken most of the time):
Loop:   lw  r1, 0(r2)
        add r2, r3, r4
        sub r5, r1, r2
        beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Selection   CPI        CT
A           Increase   Increase
B           Increase   Decrease
C           Decrease   Increase
D           Decrease   Decrease
E           Increase   No Change

7-stage Pipeline
Loop:   lw  r1, 0(r2)
        add r2, r3, r4
        sub r5, r1, r2
        beq r5, $zero, Loop
For diagramming the code in the last slide. Drive home: more stages = more hazards.

Pipelining -- Key Points
• Pipelining focuses on improving instruction throughput, not individual instruction latency.
• Data hazards can be handled by hardware or software, but most modern processors have hardware support for stalling and forwarding.
• Control hazards can be handled by hardware or software, but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce the hazard.
• ET = IC*CPI*CT
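
The 1-bit and 2-bit schemes from the earlier branch-pattern exercise (branches A, B, C) can be simulated directly. This is a minimal sketch, assuming each branch has its own table entry with no aliasing, that the counter saturates at its lowest and highest state, and that the 2-bit predictor predicts taken in states 10 and 11. The function name `count_correct` and the pattern strings are illustrative.

```python
def count_correct(pattern, bits, init):
    """Count correct predictions made by one saturating counter over a
    string of 'T' (taken) / 'N' (not taken) branch outcomes."""
    top = (1 << bits) - 1        # 1 for a 1-bit entry, 3 (binary 11) for 2-bit
    threshold = 1 << (bits - 1)  # predict taken when counter >= threshold
    counter = init
    correct = 0
    for outcome in pattern:
        taken = outcome == "T"
        if (counter >= threshold) == taken:
            correct += 1
        # Increment when taken, decrement when not taken, saturating.
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return correct


# Outcome patterns for the three branches in the exercise.
patterns = {"A": "TTTTN", "B": "TTNTN", "C": "NTNTN"}

one_bit = {name: count_correct(p, bits=1, init=1) for name, p in patterns.items()}
two_bit = {name: count_correct(p, bits=2, init=0b10) for name, p in patterns.items()}
total = sum(len(p) for p in patterns.values())

for label, result in (("1-bit", one_bit), ("2-bit", two_bit)):
    print(label, result, f"overall {sum(result.values())}/{total}")
```

With a 1-bit entry the counter simply remembers the last outcome, which is why a single not-taken outcome in pattern B costs two mispredictions; the 2-bit counter absorbs it with one.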