ENGS 116 Lecture 5
Pipelining and Hazards
Vincent H. Berk
October 3, 2008

Reading for today: Chapter A.1 – A.2, article: Patterson & Ditzel
Reading for Monday: A.3 – A.4, article: Yeager
Reading for Wednesday: A.5 – A.6, article: Smith & Pleszkun

Review: Pipelined DLX Datapath
[Figure A.17, Page A-29]

Hazards
Hazards are situations that disrupt the flow of instructions through the pipeline.
• Structural hazards:
  – Resource conflict: the hardware cannot support all possible combinations of instructions executing simultaneously.
• Data hazards:
  – Source operands are not yet available: an instruction depends on the result of a previous instruction that is still in the pipeline.
• Control hazards:
  – Caused by instructions that change the program counter.

Structural Hazards

One Memory Port / Structural Hazards
[Figure: pipeline diagram, time in clock cycles (CC 1 to CC 8) against instruction order (Load, Instr 1, Instr 2, stall, Instr 3). With a single memory port, a bubble is inserted in place of the stalled instruction.]

Structural Hazard: Single Memory

              Clock cycle number
Instruction   1     2     3     4     5     6     7     8     9     10
Load          IF    ID    EX    MEM   WB
Instr. 1            IF    ID    EX    MEM   WB
Instr. 2                  IF    ID    EX    MEM   WB
Instr. 3                        Stall IF    ID    EX    MEM   WB
Instr. 4                                    IF    ID    EX    MEM   WB
Instr. 5                                          IF    ID    EX    MEM
Instr. 6                                                IF    ID    EX

Speed Up Equation for Pipelining

\[
\text{Speedup from pipelining}
= \frac{\text{Avg. instr. time unpipelined}}{\text{Avg. instr. time pipelined}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]

\[
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}
\]

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]

Speed Up Equation for Pipelining (continued)

\[
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}
\]

\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]

With an ideal CPI of 1:

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]

Example: Dual-port vs. Single-port
• Machine A: dual-ported memory
• Machine B: single-ported memory, but its pipelined implementation has a clock rate that is 1.2 times faster
• Ideal CPI = 1 for both
• Loads and stores are 40% of instructions executed; on Machine B each one causes a 1-cycle stall

\[
\text{SpeedUp}_A = \frac{\text{Pipeline depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline depth}
\]

\[
\text{SpeedUp}_B = \frac{\text{Pipeline depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.2}
= \frac{\text{Pipeline depth}}{1.4} \times 1.2 \approx 0.86 \times \text{Pipeline depth}
\]

\[
\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline depth}}{0.86 \times \text{Pipeline depth}} \approx 1.17
\]

• Machine A is 1.17 times faster

Data Hazards

    sub R2, R1, R3      ; R2 written by sub
    and R12, R2, R5     ; first operand (R2) depends on sub
    or  R13, R6, R2     ; second operand (R2) depends on sub
    add R14, R2, R2     ; both operands depend on sub
    sw  100(R2), R15    ; index (R2) depends on sub

Notice that the value written into R2 by the subtract instruction is needed by all of the following instructions.

Classification of Data Hazards
Consider instructions i and j, where i occurs before j.
• RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value
• WAW (write after write): j tries to write an operand before it is written by i, so the writes happen in the wrong order and the value from i is left behind (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled)
• WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline)

Software Solution
The compiler recognizes the data hazard and inserts nops to eliminate it:

    sub R2, R1, R3      ; register R2 written by sub
    nop                 ; no operation
    nop
    nop
    and R12, R2, R5     ; now, result from sub available
    or  R13, R6, R2
    add R14, R2, R2
    sw  100(R2), R15

Data Hazard Control: Stalls
• A hazard occurs when an instruction reads (in the ID stage) a register that will be written by an earlier instruction (in the WB stage)
• Idea: detect the hazard and stall instructions in the pipeline until the hazard is resolved
• Detect the hazard by comparing the read fields in the IF/ID pipeline register with the write fields in the later pipeline registers (ID/EX, EX/MEM, MEM/WB)
• To add a bubble to the pipeline:
  – Preserve the PC register and the IF/ID pipeline register
  – Change the EX, MEM, and WB control fields of the ID/EX pipeline register to do nothing

Data Hazard Reduction: Forwarding
• The needed result is available before it is written into the register file in the WB stage
• Idea: use temporary results instead of waiting for registers to be written
• Cannot solve the problem of a load followed immediately by a read of the loaded register
• All pipelined machines today use some form of forwarding

Data Hazard on R1
[Figure A.6, Page A-17: pipeline diagram (IM, Reg, DM, Reg), time in clock cycles CC 1 to CC 6 against instruction order, for the sequence below.]

    add r1, r2, r3
    and r6, r1, r7
    sub r4, r1, r3
    or  r8, r1, r9
    xor r10, r1, r11

Forwarding to Avoid Data Hazard
[Figure A.7, Page A-18: the same instruction sequence and clock cycles as Figure A.6, with forwarding paths.]

Data Hazard Even with Forwarding
[Figure A.9, Page A-20: pipeline diagram (IM, Reg, DM, Reg), time in clock cycles CC 1 to CC 5 against instruction order, for the sequence below.]

    lw  r1, 0(r2)
    and r6, r1, r7
    sub r4, r1, r5
    or  r8, r1, r9

Data Hazard Even with Forwarding
[Figure A.10, Page A-21: the same sequence as Figure A.9 over CC 1 to CC 6, with a bubble inserted in each instruction that follows the load.]

Load followed by dependent instructions, without a stall:

Instruction       1     2     3     4     5     6     7     8
LW R1, 0(R2)      IF    ID    EX    MEM   WB
SUB R4, R1, R5          IF    ID    EX    MEM   WB
AND R6, R1, R7                IF    ID    EX    MEM   WB
OR R8, R1, R9                       IF    ID    EX    MEM   WB

With the one-cycle load stall:

Instruction       1     2     3     4     5     6     7     8     9
LW R1, 0(R2)      IF    ID    EX    MEM   WB
SUB R4, R1, R5          IF    ID    stall EX    MEM   WB
AND R6, R1, R7                IF    stall ID    EX    MEM   WB
OR R8, R1, R9                       stall IF    ID    EX    MEM   WB

Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram (IM, Reg, DM, Reg), time in clock cycles CC 1 to CC 9 against program execution order, for the instructions below.]

    40  beqz R1, 36
    44  and  R12, R2, R5
    48  or   R13, R6, R2
    52  add  R14, R2, R2
    80  ld   R4, R7, 100

Instruction           1     2     3     4     5     6     7     8     9     10
Branch instruction    IF    ID    EX    MEM   WB
Branch successor            IF    stall stall IF    ID    EX    MEM   WB
Branch successor + 1                                IF    ID    EX    MEM   WB
Branch successor + 2                                      IF    ID    EX    MEM
Branch successor + 3                                            IF    ID    EX
Branch successor + 4                                                  IF    ID
Branch successor + 5                                                        IF

Branch Stall Impact
• If CPI = 1, 30% branches, 3-cycle stall: new CPI = 1 + 0.30 × 3 = 1.9!
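The 1.9 figure is the ideal CPI plus the average number of branch stall cycles per instruction. A minimal Python sketch of that arithmetic, using the slide's numbers; the function name `cpi_with_branch_stalls` is an illustrative assumption, not part of the lecture:

```python
def cpi_with_branch_stalls(ideal_cpi, branch_frequency, stall_cycles):
    """Effective CPI when every branch costs a fixed number of stall cycles.

    Hypothetical helper: assumes branches are the only source of stalls.
    """
    return ideal_cpi + branch_frequency * stall_cycles


# Numbers from the slide: ideal CPI of 1, 30% branches, 3-cycle stall.
new_cpi = cpi_with_branch_stalls(1.0, 0.30, 3)
print(f"new CPI = {new_cpi:.1f}")  # prints "new CPI = 1.9"
```

In other words, with these numbers the machine spends almost as many cycles stalled on branches as it does doing useful work.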
Two simple solutions:
• Predict not taken
  – Continue fetching and decoding the sequential code that is already in the instruction cache
  – The prediction is usually correct less than 50% of the time, but there are no stalls when it is correct
• Branch delay slot
  – The first instruction following the branch is ALWAYS executed
  – The compiler can figure out what to put there

Delayed Branch
• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From the fall-through path: only valuable when the branch is not taken
  – Canceling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – About 50% (60% × 80%) of slots are usefully filled

Evaluating Branch Alternatives

\[
\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stalls}}
= \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
\]
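The speedup formula above can be tried out with a short Python sketch. It assumes an ideal CPI of 1, branches as the only source of stalls, and a five-stage pipeline (IF, ID, EX, MEM, WB); the 30% branch frequency comes from the Branch Stall Impact slide, while the function name and the list of penalties are illustrative assumptions rather than measurements.

```python
def pipeline_speedup(pipeline_depth, branch_frequency, branch_penalty):
    """Speedup over the unpipelined machine.

    Assumes ideal CPI = 1 and that branch stalls are the only stalls,
    so stall cycles per instruction = branch_frequency * branch_penalty.
    """
    return pipeline_depth / (1.0 + branch_frequency * branch_penalty)


DEPTH = 5            # assumed five-stage pipeline: IF, ID, EX, MEM, WB
BRANCH_FREQ = 0.30   # branch frequency used earlier in the lecture

# Hypothetical penalties (in cycles) for comparing branch schemes.
for penalty in (3, 2, 1, 0):
    speedup = pipeline_speedup(DEPTH, BRANCH_FREQ, penalty)
    print(f"branch penalty {penalty}: speedup {speedup:.2f}")
```

A penalty of 0 recovers the full pipeline depth, and each extra stall cycle per branch lowers the speedup, which is why schemes such as predict-not-taken and delayed branches aim to reduce the average branch penalty.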