Pipeline Hazards

Arvind
Computer Science and Artificial Intelligence Laboratory
M.I.T.

Based on the material prepared by Arvind and Krste Asanovic

6.823 L6, September 28, 2005


Technology Assumptions

• A small amount of very fast memory (caches) backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported register files (slower!)

This makes the following timing assumption valid:

    tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW

A 5-stage pipelined Harvard architecture will be the focus of our detailed design.


5-Stage Pipelined Execution

[Figure: 5-stage datapath. I-Fetch (IF): PC, 0x4 adder, instruction memory, IR. Decode, Reg. Fetch (ID): GPRs, Imm Ext. Execute (EX): ALU. Memory (MA): data memory. Write-Back (WB).]

time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5


5-Stage Pipelined Execution: Resource Usage Diagram

[Figure: same 5-stage datapath as on the previous slide.]

Resources
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   I4   I5
EX              I1   I2   I3   I4   I5
MA                   I1   I2   I3   I4   I5
WB                        I1   I2   I3   I4   I5


Pipelined Execution: ALU Instructions

[Figure: pipelined datapath with pipeline registers A, B, Y, MD1, MD2, R and an IR per stage.]

Not quite correct! We need an Instruction Reg (IR) for each stage.


IR's and Control Points

[Figure: pipelined datapath with an IR in every stage.]

Are control points connected properly?
- ALU instructions
- Load/Store instructions
- Write back


Pipelined MIPS Datapath without jumps

[Figure: datapath with stages F, D, E, M, W, an IR per stage, and control signals RegDst, RegWrite, OpSel, MemWrite, ExtSel, BSrc, WBSrc.]


How Instructions can Interact with each other in a pipeline

• An instruction in the pipeline may need a resource being used by another instruction in the pipeline – structural hazard
• An instruction may produce data that is needed by a later instruction – data hazard
• In the extreme case, an instruction may determine the next instruction to be executed – control hazard (branches, interrupts, ...)


Data Hazards

[Figure: pipelined datapath annotated with r1 ← ... and r4 ← r1 ... in flight at the same time.]

    ...
    r1 ← r0 + 10
    r4 ← r1 + 17
    ...

r1 is stale. Oops!


Resolving Data Hazards

• Freeze earlier pipeline stages until the data becomes available ⇒ interlocks
• If data is available somewhere in the datapath, provide a bypass to get it to the right stage
• Speculate about the hazard resolution and kill the instruction later if the speculation is wrong


Feedback to Resolve Hazards

[Figure: stage 1 through stage 4 in sequence, with feedback paths FB1, FB2, FB3, FB4 to earlier stages.]

• Detect a hazard and provide feedback to previous stages to stall or kill instructions
• Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)


Interlocks to resolve Data Hazards

[Figure: pipelined datapath with a Stall Condition block and a nop input to the IR mux.]

    ...
    r1 ← r0 + 10
    r4 ← r1 + 17
    ...
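The interlock idea for the two-instruction example can be sketched as a small cycle-by-cycle simulation. This is a minimal illustration, not the lecture's hardware: it assumes no bypassing, a stall whenever the instruction in ID reads a register that an instruction in EX, MA, or WB will write, and r0 treated as never written. The names `simulate`, `needs_stall`, and the `(dest, sources)` instruction encoding are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def needs_stall(program, pipe):
    """True if the instruction in ID reads a register that an older,
    uncommitted instruction (in EX, MA, or WB) is going to write."""
    if pipe["ID"] is None:
        return False
    _, sources = program[pipe["ID"]]
    for stage in ("EX", "MA", "WB"):
        idx = pipe[stage]
        if idx is None:
            continue  # bubble or empty stage
        dest, _ = program[idx]
        # Assumption: register 0 is hardwired to zero, so writes to it never stall.
        if dest not in (None, 0) and dest in sources:
            return True
    return False


def simulate(program):
    """program: list of (dest_register, source_registers) tuples.
    Returns one dict per cycle mapping stage name -> instruction index
    (None marks an empty stage or a bubble)."""
    pipe = dict.fromkeys(STAGES)
    pc = 0
    if program:
        pipe["IF"] = 0
        pc = 1
    trace = []
    while any(v is not None for v in pipe.values()):
        trace.append(dict(pipe))
        nxt = dict.fromkeys(STAGES)
        nxt["WB"] = pipe["MA"]
        nxt["MA"] = pipe["EX"]
        if needs_stall(program, pipe):
            # Freeze IF and ID; a bubble (None) enters EX.
            nxt["ID"] = pipe["ID"]
            nxt["IF"] = pipe["IF"]
        else:
            nxt["EX"] = pipe["ID"]
            nxt["ID"] = pipe["IF"]
            if pc < len(program):
                nxt["IF"] = pc
                pc += 1
        pipe = nxt
    return trace


if __name__ == "__main__":
    # r1 <- r0 + 10 ; r4 <- r1 + 17
    example = [(1, (0,)), (4, (1,))]
    for t, cycle in enumerate(simulate(example)):
        cells = " ".join(
            f"{s}={'-' if cycle[s] is None else 'I' + str(cycle[s] + 1)}"
            for s in STAGES
        )
        print(f"t{t}: {cells}")
```

Under these assumptions the second instruction occupies ID for four consecutive cycles (t2 to t5) and enters EX at t6, so three bubbles pass through EX behind the first instruction.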
Stalled Stages and Pipeline Bubbles

time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5

The repeated ID2 and IF3 entries are stalled stages.

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I3   I3   I3   I4   I5
ID         I1   I2   I2   I2   I2   I3   I4   I5
EX              I1   nop  nop  nop  I2   I3   I4   I5
MA                   I1   nop  nop  nop  I2   I3   I4   I5
WB                        I1   nop  nop  nop  I2   I3   I4   I5

nop ⇒ pipeline bubble


Interlock Control Logic

[Figure: pipelined datapath with a Cstall block that compares rs and rt of the decode-stage instruction against ws and drives stall.]

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.


Interlocks Control Logic, ignoring jumps & branches

[Figure: pipelined datapath with a Cdest block per stage producing ws and we, a Cre block producing re1 and re2, and Cstall combining them into stall.]

Should we always stall if the rs field matches some rd?
• not every instruction writes a register ⇒ we
• not every instruction reads a register ⇒ re


Source & Destination Registers

R-type:  op  rs  rt  rd  func
I-type:  op  rs  rt  immediate16
J-type:  op  immediate26

instr  semantics                              source(s)  destination
ALU    rd ← (rs) func (rt)                    rs, rt     rd
ALUi   rt ← (rs) op imm                       rs         rt
LW     rt ← M[(rs) + imm]                     rs         rt
SW     M[(rs) + imm] ← (rt)                   rs, rt
BZ     cond (rs)                              rs
         true:  PC ← (PC) + imm
         false: PC ← (PC) + 4
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                       31
JR     PC ← (rs)                              rs
JALR   r31 ← (PC), PC ← (rs)                  rs         31


Deriving the Stall Signal

Cdest
    ws = Case opcode
           ALU        ⇒ rd
           ALUi, LW   ⇒ rt
           JAL, JALR  ⇒ R31

    we = Case opcode
           ALU, ALUi, LW  ⇒ (ws ≠ 0)
           JAL, JALR      ⇒ on
           ...            ⇒ off

Cre
    re1 = Case opcode
           ALU, ALUi, LW, SW, BZ, JR, JALR  ⇒ on
           J, JAL                           ⇒ off

    re2 = Case opcode
           ALU, SW  ⇒ on
           ...      ⇒ off

Cstall
    stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
          + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D

This is not the full story!


Hazards due to Loads & Stores

[Figure: pipelined datapath with the Stall Condition block.]

    ...
    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]
    ...

Is there any possible data hazard in this instruction sequence?
What if (r1)+7 = (r3)+5?


Load & Store Hazards

    ...
    M[(r1)+7] ← (r2)
    r4 ← M[(r3)+5]
    ...

(r1)+7 = (r3)+5 ⇒ data hazard

However, the hazard is avoided because our memory system completes writes in one cycle!

Load/Store hazards, even when they do exist, are often resolved in the memory system itself. More on this later in the course.


Five-minute break to stretch your legs


Complications due to Jumps

[Figure: fetch stage with a PCSrc mux (pc+4 / jabs / rind / br), stall, and a "Jump?" detector on the decode-stage IR. PC = 104, I2 in decode, I1 in execute.]

I1  096  ADD
I2  100  J 200
I3  104  ADD    (kill)
I4  304  ADD

Note: fetching the next instruction before decode is speculation ⇒ kill.
A jump instruction kills (not stalls) the following instruction. How?


Pipelining Jumps

[Figure: as on the previous slide, with an IRSrcD mux in front of the decode-stage IR that selects between the fetched instruction and a nop.]

To kill a fetched instruction, insert a mux before IR.

I1  096  ADD
I2  100  J 200
I3  104  ADD    (kill)
I4  304  ADD

    IRSrcD = Case opcodeD
               J, JAL  ⇒ nop
               ...     ⇒ IM

Any interaction between stall and jump?


Jump Pipeline Diagrams

time              t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD     IF1  ID1  EX1  MA1  WB1
(I2) 100: J 200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD               IF3  nop  nop  nop  nop
(I4) 304: ADD                    IF4  ID4  EX4  MA4  WB4

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   nop  I4   I5
EX              I1   I2   nop  I4   I5
MA                   I1   I2   nop  I4   I5
WB                        I1   I2   nop  I4   I5

nop ⇒ pipeline bubble


Pipelining Conditional Branches

[Figure: fetch and decode stages with PCSrc mux (pc+4 / jabs / rind / br), IRSrcD mux, "BEQZ?" detector on the execute-stage IR, and ALU zero? output. PC = 104, I2 in decode, I1 in execute.]

I1  096  ADD
I2  100  BEQZ r1 200
I3  104  ADD
I4  304  ADD

Branch condition is not known until the execute stage. What action should be taken in the decode stage?


Pipelining Conditional Branches (continued)

[Figure: as on the previous slide, one cycle later. PC = 108, I3 in decode, I2 in execute, I1 in memory.]

If the branch is taken
- kill the two following instructions
- the instruction at the decode stage is not valid ⇒ stall signal is not valid


Pipelining Conditional Branches (continued)

[Figure: as on the previous slide, with an IRSrcE mux added in front of the execute-stage IR in addition to IRSrcD.]


New Stall Signal

    stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
            + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
            . !((opcodeE = BEQZ).z + (opcodeE = BNEZ).!z)

Don't stall if the branch is taken. Why? The instruction at the decode stage is invalid.


Control Equations for PC and IR Muxes

    PCSrc = Case opcodeE
              BEQZ.z, BNEZ.!z  ⇒ br
              ...              ⇒ Case opcodeD
                                   J, JAL    ⇒ jabs
                                   JR, JALR  ⇒ rind
                                   ...       ⇒ pc+4

    IRSrcD = Case opcodeE
              BEQZ.z, BNEZ.!z  ⇒ nop
              ...              ⇒ Case opcodeD
                                   J, JAL, JR, JALR  ⇒ nop
                                   ...               ⇒ IM

    IRSrcE = Case opcodeE
              BEQZ.z, BNEZ.!z  ⇒ nop
              ...              ⇒ stall.nop + !stall.IRD

Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction.


Branch Pipeline Diagrams (resolved in execute stage)

time                 t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ 200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  nop  nop  nop
(I4) 108:                           IF4  nop  nop  nop  nop
(I5) 304: ADD                            IF5  ID5  EX5  MA5  WB5

Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7   ....
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5

nop ⇒ pipeline bubble


Reducing Branch Penalty (resolve in decode stage)

• One pipeline bubble can be removed if an extra comparator is used in the Decode stage

[Figure: decode stage with zero detect on register file output feeding the PCSrc mux (pc+4 / jabs / rind / br).]

Pipeline diagram now same as for jumps.


Branch Delay Slots (expose control hazard to software)

• Change the ISA semantics so that the instruction that follows a jump or branch is always executed
  – gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted

I1  096  ADD
I2  100  BEQZ r1 200
I3  104  ADD    (delay slot instruction, executed regardless of branch outcome)
I4  304  ADD

• Other techniques include branch prediction, which can dramatically reduce the branch penalty... to come later


Bypassing

time                t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                          IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                              IF4  ID4  EX4
(I5)                                                   IF5  ID5

The repeated ID2 and IF3 entries are stalled stages.

Each stall or kill introduces a bubble in the pipeline ⇒ CPI > 1

A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input:

time                t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17        IF2  ID2  EX2  MA2  WB2
(I3)                          IF3  ID3  EX3  MA3  WB3
(I4)                               IF4  ID4  EX4  MA4  WB4
(I5)                                    IF5  ID5  EX5  MA5  WB5


Adding a Bypass

[Figure: pipelined datapath with an ASrc mux that selects ALU input A from either the register file or the ALU output. r4 ← r1 ... is in decode and r1 ← ... is in execute.]

When does this bypass help?

    (I1) r1 ← r0 + 10
    (I2) r4 ← r1 + 17        yes

    r1 ← M[r0 + 10]
    r4 ← r1 + 17             no

    JAL 500
    r4 ← r31 + 17            no


The Bypass Signal: Deriving it from the Stall Signal

    stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
            + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )

    ws = Case opcode
           ALU        ⇒ rd
           ALUi, LW   ⇒ rt
           JAL, JALR  ⇒ R31

    we = Case opcode
           ALU, ALUi, LW  ⇒ (ws ≠ 0)
           JAL, JALR      ⇒ on
           ...            ⇒ off

    ASrc = (rsD = wsE).weE.re1D

Is this correct? No, because only ALU and ALUi instructions can benefit from this bypass.

Split weE into two components: we-bypass, we-stall


Bypass and Stall Signals

Split weE into two components: we-bypass, we-stall

    we-bypassE = Case opcodeE
                   ALU, ALUi  ⇒ (ws ≠ 0)
                   ...        ⇒ off

    we-stallE = Case opcodeE
                   LW         ⇒ (ws ≠ 0)
                   JAL, JALR  ⇒ on
                   ...        ⇒ off

    ASrc = (rsD = wsE).we-bypassE.re1D

    stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
          + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D


Fully Bypassed Datapath

[Figure: pipelined datapath with ASrc and BSrc bypass muxes on both ALU inputs, and a path carrying the PC for JAL, ...]

Is there still a need for the stall signal?

    stall = (rsD = wsE).(opcodeE = LW).(wsE ≠ 0).re1D
          + (rtD = wsE).(opcodeE = LW).(wsE ≠ 0).re2D


Why an Instruction may not be dispatched every cycle (CPI > 1)

• Full bypassing may be too expensive to implement
  – typically all frequently used paths are provided
  – some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
• Loads have two cycle latency
  – Instruction after load cannot use load result
  – MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II.
• Conditional branches may cause bubbles
  – kill following instruction(s) if no delay slots

Machines with software-visible delay slots may execute a significant number of NOP instructions inserted by the compiler.


Thank you!
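As a recap, the stall equations from the lecture can be written out as executable predicates. This is a sketch under stated assumptions, not the lecture's control logic as implemented in hardware: the `Instr` class, the opcode strings, and the function names are hypothetical, branch-taken suppression of the stall is omitted, and `None` stands for an empty stage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    opcode: str  # "ALU", "ALUi", "LW", "SW", "BZ", "J", "JAL", "JR", "JALR"
    rs: int = 0
    rt: int = 0
    rd: int = 0


def ws(i: Instr) -> int:
    """Cdest: destination register number (0 if the opcode has none)."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return 0


def we(i: Instr) -> bool:
    """Cdest: does the instruction write a register?"""
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")


def re1(i: Instr) -> bool:
    """Cre: does the instruction read rs?"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")


def re2(i: Instr) -> bool:
    """Cre: does the instruction read rt?"""
    return i.opcode in ("ALU", "SW")


def stall_interlock(d: Instr, e: Optional[Instr], m: Optional[Instr],
                    w: Optional[Instr]) -> bool:
    """Stall signal without bypassing: compare the decode-stage sources
    against the destinations of the instructions in E, M, and W."""
    def hit(src: int) -> bool:
        return any(x is not None and we(x) and src == ws(x) for x in (e, m, w))
    return (re1(d) and hit(d.rs)) or (re2(d) and hit(d.rt))


def stall_fully_bypassed(d: Instr, e: Optional[Instr]) -> bool:
    """Stall signal for the fully bypassed datapath: only a load in the
    execute stage that feeds the decode-stage instruction forces a stall."""
    if e is None or e.opcode != "LW" or ws(e) == 0:
        return False
    return (re1(d) and d.rs == ws(e)) or (re2(d) and d.rt == ws(e))


if __name__ == "__main__":
    i1 = Instr("ALUi", rs=0, rt=1)   # r1 <- r0 + 10
    i2 = Instr("ALUi", rs=1, rt=4)   # r4 <- r1 + 17
    lw = Instr("LW", rs=0, rt=1)     # r1 <- M[r0 + 10]
    print(stall_interlock(i2, i1, None, None))   # interlock-only pipeline
    print(stall_fully_bypassed(i2, i1))          # ALU result is bypassed
    print(stall_fully_bypassed(i2, lw))          # load-use case
```

Keeping `ws`, `we`, `re1`, and `re2` as separate functions mirrors the Cdest and Cre blocks of the slides, so the two stall predicates read almost term for term like the equations they are taken from.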