Pipeline Hazards Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University See P&H Chapter 4.7 Goals for Today Data Hazards • Revisit Pipelined Processors • Data dependencies • Problem, detection, and solutions – (delaying, stalling, forwarding, bypass, etc) • Hazard detection unit • Forwarding unit Next time • Control Hazards What is the next instruction to execute if a branch is taken? Not taken? MIPS Design Principles Simplicity favors regularity • 32 bit instructions Smaller is faster • Small register file Make the common case fast • Include support for constants Good design demands good compromises • Support for different type of interpretations/classes Recall: MIPS instruction formats All MIPS instructions are 32 bits long, has 3 formats R-type op 6 bits I-type op 6 bits J-type rs rt 5 bits 5 bits rs rt rd shamt func 5 bits 5 bits 6 bits immediate 5 bits 5 bits 16 bits op immediate (target address) 6 bits 26 bits Recall: MIPS Instruction Types Arithmetic/Logical • R-type: result and two source registers, shift amount • I-type: 16-bit immediate with sign/zero extension Memory Access • load/store between registers and memory • word, half-word and byte operations Control flow • conditional branches: pc-relative addresses • jumps: fixed offsets, register absolute Recall: MIPS Instruction Types Arithmetic/Logical • ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU • ADDI, ADDIU, ANDI, ORI, XORI, LUI, SLL, SRL, SLLV, SRLV, SRAV, SLTI, SLTIU • MULT, DIV, MFLO, MTLO, MFHI, MTHI Memory Access • LW, LH, LB, LHU, LBU, LWL, LWR • SW, SH, SB, SWL, SWR Control flow • BEQ, BNE, BLEZ, BLTZ, BGEZ, BGTZ • J, JR, JAL, JALR, BEQL, BNEL, BLEZL, BGTZL Special • LL, SC, SYSCALL, BREAK, SYNC, COPROC Pipelined Processor memory register file alu +4 addr PC din control new pc Fetch memory compute jump/branch targets extend Decode dout Execute Memory WB register file B alu D memory D A Pipelined Processor +4 IF/ID M B ID/EX Execute EX/MEM Memory ctrl Instruction Decode Instruction Fetch dout compute jump/branch targets ctrl extend din memory imm new pc control ctrl inst PC addr WriteBack MEM/WB Example: : Sample Code (Simple) add nand lw add sw r3, r6, r4, r5, r7, r1, r2; r4, r5; 20(r2); r2, r5; 12(r3); Example: Sample Code (Simple) Assume eight-register machine Run the following code on a pipelined datapath add nand lw add sw Slides thanks to Sally McKee r3 r1 r2 ; reg 3 = reg 1 + reg 2 r6 r4 r5 ; reg 6 = ~(reg 4 & reg 5) r4 20 (r2) ; reg 4 = Mem[reg2+20] r5 r2 r5 ; reg 5 = reg 2 + reg 5 r7 12(r3) ; Mem[reg3+12] = reg 7 Clock cycle add nand Time Graphs 1 2 IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID lw add sw Latency: Throughput: Concurrency: 3 4 5 6 7 8 9 EX MEM WB CPI = M U X 4 target + PC+4 PC+4 R0 R1 regB R2 R3 Register file instruction PC Inst mem regA 0 Bits 16-20 Bits 26-31 IF/ID valA R4 R5 R6 valB R7 extend Bits 11-15 ALU result M U X A L U ALU result mdata Data mem M U X data imm dest valB Rd Rt op ID/EX M U X dest dest op op EX/MEM MEM/WB At time 1, Fetch add r3 r1 r2 data dest extend 0 0 IF/ID ID/EX M U X EX/MEM MEM/WB add 3 1 2 M U X 4 0 + /0 4 4 R0 R1 R3 Register file add 3 1 2 PC Inst mem R2 R4 R5 R6 R7 0 36 9 12 18 7 41 22 extend Fetch: add 3 1 2 Bits 11-15 Bits 16-20 Bits 26-31 Time: 1/ 2 IF/ID 0 0 /0 36 /0 9 M U X A L U 0 0 Data mem M U X data 0 dest 0 /0 3 /0 2 nop / add ID/EX M U X 0 0 nop nop EX/MEM MEM/WB nand 6 4 5 add 3 1 2 M U X 4 /0 4 + /4 8 8 R0 R2 2 R3 Register file nand 6 4 5 PC Inst mem R1 1 R4 R5 R6 R7 0 36 9 12 18 7 41 22 extend Fetch: nand 6 4 5 Bits 11-15 Bits 16-20 Bits 26-31 Time: 2/ 3 IF/ID 0 36 / 18 /9 7 3 0 36 9 M U X A L U /0 45 0 Data mem data dest /0 9 /3 6 /2 5 M U X add / nand ID/EX 3 /0 3 nop / add EX/MEM M U X 0 nop MEM/WB lw 4 20(2) nand 6 4 5 add 3 1 2 M U X nand (18 ∙ 7) 4 + 12 8 R0 R2 5 R3 Register file lw 4 20(2) PC Inst mem R1 4 R4 R5 R6 R7 0 36 9 12 18 7 41 22 extend Fetch: lw 4 20(2) Bits 11-15 Bits 16-20 Bits 26-31 Time: 3/ 4 IF/ID 18 = 01 0010 7 = 00 0111 ------------------3 = 11 1101 0 /0 45 / 18 36 18 7 /4 8 /9 7 6 M U X A L U 45 / -3 0 M U X Data mem data dest /9 7 6 5 3 2 nand ID/EX M U X 3 /3 6 add / nand EX/MEM /0 3 nop / add MEM/WB add 5 2 5 lw 4 20(2) nand 6 4 5 add 3 1 2 M U X 4 8 + 16 12 R0 R2 4 R3 Register file add 5 2 5 PC Inst mem R1 2 R4 R5 R6 R7 0 36 9 12 18 7 41 22 extend Fetch: add 5 2 5 Bits 11-15 Bits 16-20 Bits 26-31 Time: 4 IF/ID 0 45 18 9 7 18 20 M U X A L U -3 45 0 Data mem M U X data dest 7 0 4 6 5 lw ID/EX M U X 6 6 3 nand EX/MEM 3 add MEM/WB sw 7 12(3) add 5 2 5 lw 4 20 (2) nand 6 4 5 add 3 1 2 M U X 4 12 + 20 16 R0 R2 5 R3 Register file sw 7 12(3) PC Inst mem R1 2 R4 R5 R6 R7 0 36 9 45 18 7 41 22 extend Fetch: sw 7 12(3) Bits 11-15 Bits 16-20 Bits 26-31 Time: 5 IF/ID 0 -3 9 9 7 M U 20 X 5 A L U 29 -3 45 0 Data mem M U X data dest 18 5 5 0 4 add ID/EX M U X 4 4 6 lw EX/MEM 6 3 nand MEM/WB sw 7 12(3) add 5 2 5 lw 4 20(2) nand 6 4 5 M U X 4 16 + 20 R0 R1 3 R2 7 R3 Register file PC Inst mem R4 R5 R6 R7 0 36 9 45 18 7 -3 22 extend No more instructions Bits 11-15 Bits 16-20 Bits 26-31 Time: 6 IF/ID 0 29 9 45 7 22 12 M U X A L U 16 29 -3 99 Data mem M U X data dest 7 0 7 5 5 sw ID/EX M U X 5 5 4 add EX/MEM 4 6 lw MEM/WB nop nop sw 7 12(3) add 5 2 5 lw 4 20(2) M U X 4 20 + R0 R1 R2 Inst mem R3 Register file PC R4 R5 R6 R7 0 36 9 45 99 7 -3 22 0 16 45 M U 12 X A L U 57 16 Data mem Bits 11-15 Bits 16-20 22 0 7 Bits 26-31 Time: 7 IF/ID data dest extend No more instructions 0 M U 99 X M U X 7 7 5 sw ID/EX EX/MEM 5 4 add MEM/WB nop nop nop sw 7 12(3) add 5 2 5 M U X 4 + R0 R1 R2 Inst mem R3 Register file PC R4 R5 R6 R7 0 36 9 45 99 16 -3 22 57 M U X 57 Bits 11-15 M U X Bits 16-20 IF/ID Slides thanks to Sally McKee 0 Data mem M U X data dest 7 Bits 26-31 Time: 8 22 22 extend No more instructions A L U 16 5 sw ID/EX EX/MEM MEM/WB nop nop nop nop sw 7 12(3) M U X 4 + R0 R1 R2 Inst mem R3 Register file PC R4 R5 R6 R7 0 36 9 45 99 16 -3 22 M U X A L U Data mem data dest extend No more instructions M U X Bits 11-15 M U X Bits 16-20 Bits 21-23 Time: 9 IF/ID ID/EX EX/MEM MEM/WB Takeaway Pipelining is a powerful technique to mask latencies and increase throughput • Logically, instructions execute one at a time • Physically, instructions execute in parallel – Instruction level parallelism Abstraction promotes decoupling • Interface (ISA) vs. implementation (Pipeline) Next Goal What about data dependencies (also known as a data hazard in a pipelined processor)? i.e. add r3, r1, r2 sub r5, r3, r4 Data Hazards Data Hazards • register file reads occur in stage 2 (ID) • register file writes occur in stage 5 (WB) • next instructions may read values about to be written Data Hazards time add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3) or r5, r3, r5 sw r6, 12(r3) Clock cycle 1 2 3 IF ID IF 4 5 MEM WB MEM ID IF 7 8 9 WB MEM WB ID IF 6 MEM WB ID IF ID MEM WB Data Hazards Data Hazards • register file reads occur in stage 2 (ID) • register file writes occur in stage 5 (WB) • next instructions may read values about to be written How to detect? IF/ID ID/EX B D EX/MEM Rd OP Rd mem OP imm Rt Rd PC+4 detect hazard OP PC PC+4 +4 addr din dout M A B Ra Rb D A Rd D inst add r3, r1, r2 sub inst r5, r3, r5 or r6, r3, r4 mem add r6, r3, r8 B Detecting Data Hazards MEM/WB Takeaway Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards. Next Goal What to do if data hazard detected? Stalling How to stall an instruction in ID stage • prevent IF/ID pipeline register update – stalls the ID stage instruction • convert ID stage instr into nop for later stages – innocuous “bubble” passes through pipeline • prevent PC update – stalls the next (IF stage) instruction WE=0 IF/ID ID/EX B D Rd Rd EX/MEM OP MemWr=0 RegWr=0 mem OP imm If detect hazard OP detect hazard PC+4 PC Rt Rd PC+4 +4 addr din dout M A B Ra Rb D A Rd D inst add r3, r1, r2 sub inst r5, r3, r5 or r6, r3, r4 mem add r6, r3, r8 B Detecting Data Hazards MEM/WB Stalling time add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 Clock cycle 1 2 3 4 5 6 7 8 Stalling time r3 = 10 add r3, r1, r2 r3 = 20 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 Clock cycle 1 2 3 4 5 6 7 8 Stalling sub r5,r3,r5 or r6,r3,r4 (WE=0) /stall NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA==ID/Ex.Rd IF/ID.rA==Ex/M.Rd IF/ID.rA==M/W.Rd)) Rd WE Rd add r3,r1,r2 M Op nop data mem WE PC B D Op (MemWr=0 RegWr=0) B Rd +4 D WE inst inst mem D rD B rA rB A Op A Stalling sub r5,r3,r5 or r6,r3,r4 (WE=0) /stall NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA==ID/Ex.Rd IF/ID.rA==Ex/M.Rd IF/ID.rA==M/W.Rd)) nop Rd WE Rd (MemWr=0 RegWr=0) M Op nop data mem WE PC B D Op (MemWr=0 RegWr=0) B Rd +4 D WE inst inst mem D rD B rA rB A Op A add r3,r1,r2 Stalling sub r5,r3,r5 or r6,r3,r4 (WE=0) /stall NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA==ID/Ex.Rd IF/ID.rA==Ex/M.Rd IF/ID.rA==M/W.Rd)) D (MemWr=0 RegWr=0) nop nop WE (MemWr=0 RegWr=0) Rd M Rd data mem Op nop WE PC B Op (MemWr=0 RegWr=0) B Rd +4 D WE inst inst mem D rD B rA rB A Op A add r3,r1,r2 Stalling How to stall an instruction in ID stage • prevent IF/ID pipeline register update – stalls the ID stage instruction • convert ID stage instr into nop for later stages – innocuous “bubble” passes through pipeline • prevent PC update – stalls the next (IF stage) instruction Takeaway Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards. Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to IF/ID registers from changing, and (3) preventing writes to memory and register file. Bubbles in pipeline significantly decrease performance. Next Goal: Resolving Data Hazards via Forwarding What to do if data hazard detected? A) Wait/Stall B) Reorder in Software (SW) C) Forward/Bypass Forwarding Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register). Three types of forwarding/bypass • Forwarding from Ex/Mem registers to Ex stage (MEx) • Forwarding from Mem/WB register to Ex stage (WEx) • RegisterFile Bypass Forwarding Datapath B B D D B IF/ID ID/Ex forward unit MC WE Rb Ra detect hazard data mem Rd imm inst mem D Ex/Mem Three types of forwarding/bypass • Forwarding from Ex/Mem registers to Ex stage (MEx) • Forwarding from Mem/WB register to Ex stage (W Ex) • RegisterFile Bypass M Rd A MC WE A Mem/WB Forwarding Datapath B B D D B IF/ID ID/Ex forward unit MC WE Rb Ra detect hazard data mem Rd imm inst mem D Ex/Mem Three types of forwarding/bypass • Forwarding from Ex/Mem registers to Ex stage (MEx) • Forwarding from Mem/WB register to Ex stage (W Ex) • RegisterFile Bypass M Rd A MC WE A Mem/WB Forwarding Datapath 1 Ex/MEM to EX Bypass • EX needs ALU result that is still in MEM stage • Resolve: Add a bypass from EX/MEM.D to start of EX How to detect? Logic in Ex Stage: forward = (Ex/M.WE && EX/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd) || (same for Rb) Forwarding Datapath 1 A D inst mem add r3, r1, r2 sub r5, r3, r1 B IF data mem ID Ex M W IF ID Ex M W Forwarding Datapath 2 Mem/WB to EX Bypass • EX needs value being written by WB • Resolve: Add bypass from WB final value to start of EX How to detect? Logic in Ex Stage: forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd && not (ID/Ex.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd) || (same for Rb) Check pg. 369 Forwarding Datapath 2 A D inst mem add r3, r1, r2 sub r5, r3, r1 or r6, r3, r4 B IF data mem ID Ex M W IF ID IF Ex ID M W Ex M W Register File Bypass Register File Bypass • Reading a value that is currently being written Detect: ((Ra == MEM/WB.Rd) or (Rb == MEM/WB.Rd)) and (WB is writing a register) Resolve: Add a bypass around register file (WB to ID) Better: (Hack) just negate register file clock – writes happen at end of first half of each clock cycle – reads happen during second half of each clock cycle Register File Bypass A D inst mem add r3, r1, r2 sub r5, r3, r1 or r6, r3, r4 add r6, r3, r8 B IF data mem ID Ex M W IF ID IF Ex ID IF M W Ex M ID Ex W M W Forwarding Example time r3 = 10 add r3, r1, r2 r3 = 20 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 Clock cycle 1 2 3 4 5 6 7 8 Forwarding Example 2 time add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3) or r5, r3, r5 sw r6, 12(r3) Clock cycle 1 2 3 4 5 IF ID Ex M W IF ID Ex M 6 W 7 8 Tricky Example A D inst mem add r1, r1, r2 SUB r1, r1, r3 OR r1, r4, r1 B data mem Forwarding Datapath B B D D B IF/ID ID/Ex forward unit MC WE Rb Ra detect hazard data mem Rd imm inst mem D Ex/Mem Three types of forwarding/bypass • Forwarding from Ex/Mem registers to Ex stage (MEx) • Forwarding from Mem/WB register to Ex stage (W Ex) • Register File Bypass M Rd A MC WE A Mem/WB Takeaway Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards. Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to IF/ID registers from changing, and (3) preventing writes to memory and register file. Bubbles (nops) in pipeline significantly decrease performance. Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register). Better performance than stalling. Administrivia Prelim1: next Tuesday, February 26th in evening • • Time: We will start at 7:30pm sharp, so come early Prelim Review: This Thur 6-8pm in Upson B14 and Fri, 5-7pm in Phillips 203 • Closed Book • • • Cannot use electronic device or outside material Practice prelims are online in CMS Material covered everything up to end of this week • • • • • Appendix C (logic, gates, FSMs, memory, ALUs) Chapter 4 (pipelined [and non-pipeline] MIPS processor with hazards) Chapters 2 (Numbers / Arithmetic, simple MIPS instructions) Chapter 1 (Performance) HW1, HW2, Lab0, Lab1, Lab2 Administrivia HW2 is due tomorrow • Fill out Survey online. Receive credit/points on homework for survey: • Should have received email from Kathryn Dimiduk • Survey is anonymous Project1 (PA1) due week after prelim • Continue working diligently. Use design doc momentum Save your work! • Save often. Verify file is non-zero. Periodically save to Dropbox, email. • Beware of MacOSX 10.5 (leopard) and 10.6 (snow-leopard) Use your resources • Lab Section, Piazza.com, Office Hours, Homework Help Session, • Class notes, book, Sections, CSUGLab Administrivia Check online syllabus/schedule • http://www.cs.cornell.edu/Courses/CS3410/2013sp/schedule.html Slides and Reading for lectures Office Hours Homework and Programming Assignments Prelims (in evenings): • Tuesday, February 26th • Thursday, March 28th • Thursday, April 25th Schedule is subject to change Collaboration, Late, Re-grading Policies “Black Board” Collaboration Policy • Can discuss approach together on a “black board” • Leave and write up solution independently • Do not copy solutions Late Policy • Each person has a total of four “slip days” • Max of two slip days for any individual assignment • Slip days deducted first for any late assignment, cannot selectively apply slip days • For projects, slip days are deducted from all partners • 25% deducted per day late after slip days are exhausted Regrade policy • Submit written request to lead TA, and lead TA will pick a different grader • Submit another written request, lead TA will regrade directly • Submit yet another written request for professor to regrade. Quiz Find all hazards, and say how they are resolved: add r3, r1, r2 sub r3, r2, r1 nand r4, r3, r1 or r0, r3, r4 xor r1, r4, r3 sb r4, 1(r0) Memory Load Data Hazard A D inst mem lw r4, 20(r8) sub r6, r4, r1 B data mem Resolving Memory Load Hazard Load Data Hazard • Value not available until WB stage • So: next instruction can’t proceed if hazard detected Resolution: • MIPS 2000/3000: one delay slot – ISA says results of loads are not available until one cycle later – Assembler inserts nop, or reorders to fill delay slot • MIPS 4000 onwards: stall – But really, programmer/compiler reorders to avoid stalling in the load delay slot Quiz 2 add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) Data Hazard Recap Delay Slot(s) • Modify ISA to match implementation Stall • Pause current and all subsequent instructions Forward/Bypass • Try to steal correct value from elsewhere in pipeline • Otherwise, fall back to stalling or require a delay slot Tradeoffs?