CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152 January 31, 2012 CS152, Spring 2012 Last time in Lecture 3 • Microcoding became less attractive as gap between RAM and ROM speeds reduced • Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew • Load-Store RISC ISAs designed for efficient pipelined implementations – Very similar to vertical microcode – Inspired by earlier Cray machines (more on these later) • Iron Law explains architecture design space – Trade instructions/program, cycles/instruction, and time/cycle January 31, 2012 CS152, Spring 2012 2 An Ideal Pipeline stage 1 stage 2 stage 3 stage 4 • All objects go through the same stages • No sharing of resources between any two stages • Propagation delay through all pipeline stages is equal • The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines, but instructions depend on each other! January 31, 2012 CS152, Spring 2012 3 Pipelined RISC-V To pipeline RISC-V: • First build RISC-V without pipelining with CPI=1 • Next, add pipeline registers to reduce cycle time while maintaining CPI=1 January 31, 2012 CS152, Spring 2012 4 Lecture 3: Unpipelined Datapath for RISC-V PCSel br rind jabs pc+4 RegWriteEn MemWrite WBSel 0x4 Add Add clk PC clk 1 addr inst Inst. Memory Br Logic Bcomp? we rs1 rs2 rd1 wa wd rd2 ALU GPRs clk we addr rdata Data Memory Imm Select wdata ALU Control OpCode WASel January 31, 2012 ImmSel FuncSel Op2Sel CS152, Spring 2012 5 Lecture 3: Hardwired Control Table Op2Sel FuncSel MemWr Func Op + + no no no yes yes yes yes no * no JAL * * * * * * * * no no no JALR * * * no Opcode ImmSel ALU LW * IType12 IType12 SW BsType12 Reg Imm Imm Imm BEQtrue BrType12 * BEQfalse BrType12 ALUi J Op2Sel= Reg / Imm WASel = rd / X1 January 31, 2012 RFWen WBSel WASel PCSel ALU ALU Mem * rd rd rd * pc+4 pc+4 pc+4 pc+4 no * * br no no * * * * pc+4 jabs yes yes PC PC X1 rd jabs rind WBSel = ALU / Mem / PC PCSel = pc+4 / br / rind / jabs CS152, Spring 2012 6 Pipelined Datapath 0x4 Add PC addr rdata Inst. Memory IR we rs1 rs2 rd1 wa wd rd2 GPRs ALU Imm Select we addr rdata Data Memory wdata write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined January 31, 2012 CS152, Spring 2012 7 “Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle – Instructions per program depends on source code, compiler technology, and ISA – Cycles per instructions (CPI) depends upon the ISA and the microarchitecture – Time per cycle depends upon the microarchitecture and the base technology Microarchitecture Lecture 2 Microcoded Lecture 3 Single-cycle unpipelined Lecture 4 Pipelined January 31, 2012 CS152, Spring 2012 CPI >1 1 1 cycle time short long short 8 CPI Examples Microcoded machine 7 cycles 5 cycles Inst 1 Time 10 cycles Inst 2 Inst 3 3 instructions, 22 cycles, CPI=7.33 Unpipelined machine Inst 1 Inst 2 Inst 3 3 instructions, 3 cycles, CPI=1 Pipelined machine Inst 1 Inst 2 Inst 3 January 31, 2012 3 instructions, 3 cycles, CPI=1 5-stage pipeline CPI≠5!!! CS152, Spring 2012 9 Technology Assumptions • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!) Thus, the following timing assumption is reasonable tIM tRF tALU tDM tRW A 5-stage pipeline will be the focus of our detailed design - some commercial designs have over 30 pipeline stages to do an integer add! January 31, 2012 CS152, Spring 2012 10 5-Stage Pipelined Execution 0x4 Add PC addr rdata we rs1 rs2 rd1 wa wd rd2 GPRs IR Inst. Memory I-Fetch (IF) Imm Select Data Memory wdata Write Decode, Reg. Fetch Execute Memory (ID) (EX) (MA) Back t0 t1 t2 t3 t4 t5 t6 t7 . . . . (WB) time instruction1 instruction2 instruction3 instruction4 instruction5 January 31, 2012 ALU we addr rdata IF1 ID1 EX1 MA1 WB1 IF2 ID2 EX2 MA2 WB2 IF3 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5 CS152, Spring 2012 11 5-Stage Pipelined Execution Resource Usage Diagram 0x4 Add PC addr rdata we rs1 rs2 rd1 ws wd rd2 GPRs IR Inst. Memory Resources I-Fetch (IF) January 31, 2012 we addr rdata ALU Data Memory Imm Select wdata t5 Write Memory (MA) Back t6 t7 . . . . (WB) I5 I4 I3 I2 I5 I4 I3 Decode, Reg. Fetch Execute (ID) (EX) time IF ID EX MA WB t0 I1 t1 I2 I1 t2 I3 I2 I1 t3 I4 I3 I2 I1 t4 I5 I4 I3 I2 I1 CS152, Spring 2012 I5 I4 I5 12 Pipelined Execution: ALU Instructions 0x4 IR Add IR IR 1 PC addr inst IR Inst Memory we rs1 rs2 rd1 wa wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Select R wdata wdata MD1 MD2 Not quite correct! We need an Instruction Reg (IR) for each stage January 31, 2012 CS152, Spring 2012 13 Pipelined RISC-V Datapath without jumps F D E M IR W IR IR 1 0x4 Add WASel RegWriteEn PC addr inst IR Inst Memory we rs1 rs2 rd1 wa wd rd2 GPRs FuncSel MemWrite A ALU Y we addr rdata B Data Memory wdata Imm Select R wdata MD1 ImmSel WBSel MD2 Op2Sel Control Points Need to Be Connected January 31, 2012 CS152, Spring 2012 14 Instructions interact with each other in pipeline • An instruction in the pipeline may need a resource being used by another instruction in the pipeline structural hazard • An instruction may depend on something produced by an earlier instruction – Dependence may be for a data value data hazard – Dependence may be for the next instruction’s address control hazard (branches, exceptions) January 31, 2012 CS152, Spring 2012 15 Resolving Structural Hazards • Structural hazard occurs when two instructions need same hardware resource at same time – Can resolve in hardware by stalling newer instruction till older instruction finished with resource • A structural hazard can always be avoided by adding more hardware to design – E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory • Our 5-stage pipe has no structural hazards by design – Thanks to RISC-V ISA, which was designed for pipelining January 31, 2012 CS152, Spring 2012 16 Data Hazards x1 … x4 x1 … 0x4 IR Add PC addr inst IR Inst Memory we rs1 rs2 rd1 wa wd rd2 GPRs A ALU Y B 1 we addr rdata Data Memory Imm Select R wdata wdata MD1 ... x1 x0 + 10 x4 x1 + 17 ... January 31, 2012 IR IR MD2 x1 is stale. Oops! CS152, Spring 2012 17 Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages interlocks January 31, 2012 CS152, Spring 2012 18 Feedback to Resolve Hazards FB1 FB2 stage 1 FB3 stage 2 FB4 stage 3 stage 4 • Later stages provide dependence information to earlier stages which can stall (or kill) instructions • Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) January 31, 2012 CS152, Spring 2012 19 Interlocks to resolve Data Hazards Stall Condition bubble 0x4 Add PC addr IR IR IR 1 inst IR Inst Memory ... x1 x0 + 10 x4 x1 + 17 ... January 31, 2012 we rs1 rs2 rd1 wa wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Select R wdata wdata MD1 CS152, Spring 2012 MD2 20 Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 (I1) x1 (x0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) x4 (x1) + 17 IF2 ID2 ID2 ID2 ID2 (I3) IF3 IF3 IF3 IF3 (I4) stalled stages (I5) Resource Usage IF ID EX MA WB time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 I3 I2 I1 t4 I3 I2 I1 t5 I3 I2 - t6 CS152, Spring 2012 .... EX2 MA2 WB2 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5 t6 I4 I3 I2 - - January 31, 2012 t7 t7 I5 I4 I3 I2 - .... I5 I4 I3 I2 I5 I4 I3 I5 I4 pipeline bubble 21 I5 Interlock Control Logic stall wa Cstall rs2 ? rs1 bubble 0x4 Add PC addr IR IR IR 1 inst IR Inst Memory we rs1 rs2 rd1 wa wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Select R wdata wdata MD1 MD2 Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. January 31, 2012 CS152, Spring 2012 22 Interlock Control Logic ignoring jumps & branches stall Cstall wa we rs1 ? rs2 re1 re2 Cre 0x4 Add PC addr Cdest Cdest bubble wa wa we we IR IR IR 1 inst IR Inst Memory we rs1 rs2 rd1 wa wd rd2 GPRs Cdest A ALU Y B we addr rdata Data Memory Imm Select R wdata wdata MD1 MD2 Should we always stall if an rs field matches some rd? not every instruction writes a register we not every instruction reads a register re January 31, 2012 CS152, Spring 2012 23 Source & Destination Registers rd rs1 rs2 func10 rd rs1 Imm[11:0] Imm[11:7 ] rs1 rs2 Imm[6:0] opcode ALU func3 opcode ALUI/LW/JALR func3 opcode SW/Bcond Jump offset[24:0] ALU ALUI LW SW Bcond J JAL JALR rd rs1 func10 rs2 rd rs1 op imm rd M [rs1 + imm] M [rs1 + imm] rs2 rs1,rs2 true: PC PC + imm false: PC PC + 4 PC PC + imm x1 PC, PC PC + imm rd PC, PC rs1 + imm January 31, 2012 CS152, Spring 2012 opcode source(s) destination rs1, rs2 rd rs1 rd rs1 rd rs1, rs2 rs1, rs2rs1 x1 rd 24 Deriving the Stall Signal Cdest ws = Case opcode JAL X1 else rd Cre we = Case opcode ALU, ALUi, LW,JALR (ws 0) JAL on ... off Cstall January 31, 2012 re1 = Case opcode ALU, ALUi, LW, SW, Bcond, JALR on J, JAL off re2 = Case opcode ALU, SW,Bcond on off ... stall = ((rs1D =wsE).weE + (rs1D =wsM).weM + (rs1D =wsW).weW) . re1D ((rs2D =wsE).weE + (rs2D =wsM).weM + (rs2D =wsW).weW) . re2D CS152, Spring 2012 + 25 Hazards due to Loads & Stores Stall Condition 0x4 bubble Add PC addr What if x1+7 = x3+5 ? IR IR IR 1 inst IR Inst Memory we rs1 rs2 rd1 wa wd rd2 GPRs A ALU Y B rdata Data Memory Imm Select ... M[x1+7] x2 x4 M[x3+5] ... January 31, 2012 we addr R wdata wdata MD1 MD2 Is there any possible data hazard in this instruction sequence? CS152, Spring 2012 26 Load & Store Hazards ... M[x1+7] x2 x4 M[x3+5] ... x1+7 = x3+5 data hazard However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course. January 31, 2012 CS152, Spring 2012 27 CS152 Administrivia • Quiz 1 on Feb 14 will cover PS1, Lab1, lectures 1-5, and associated readings. • Section on Friday will review pipelining. January 31, 2012 CS152, Spring 2012 28 Resolving Data Hazards (2) Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage bypass January 31, 2012 CS152, Spring 2012 29 Bypassing time (I1) x1 x0 + 10 (I2) x4 x1 + 17 (I3) (I4) (I5) t0 IF1 t1 t2 t3 t4 t5 ID1 EX1 MA1 WB1 IF2 ID2 ID2 ID2 ID2 IF3 IF3 IF3 IF3 stalled stages t6 t7 .... EX2 MA2 WB2 ID3 EX3 MA3 IF4 ID4 EX4 IF5 ID5 Each stall or kill introduces a bubble in the pipeline CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time t0 (I1) x1 x0 + 10 (I2) x4 x1 + 17 (I3) (I4) (I5) January 31, 2012 t1 IF1 t2 t3 ID1 EX1 IF2 ID2 IF3 t4 MA1 EX2 ID3 IF4 CS152, Spring 2012 t5 WB1 MA2 EX3 ID4 IF5 t6 t7 .... WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5 30 Adding a Bypass stall x4 x1... 0x4 x1 ... bubble IR Add E IR M IR 1 ASrc PC addr inst IR Inst Memory D we rs1 rs2 rd1 wa wd rd2 GPRs A ALU Y B rdata Data Memory Imm Select R wdata wdata MD1 (I1) (I2) we addr MD2 When does this bypass help? ... x1 x0 + 10 x1 M[x0 + 10] JAL 500 x4 x1 + 17 x4 x1 + 17 x4 x1 + 17 yes no no January 31, 2012 CS152, Spring 2012 31 W The Bypass Signal Deriving it from the Stall Signal stall = ( ((rs1D =wsE).weE + (rs1D =wsM).weM + (rs1D =wsW).weW).re1D +((rs2D =wsE).weE + (rs2D =wsM).weM + (rs2D =wsW).weW).re2D ) we = Case opcode ALU, ALUi, LW, JALR (ws 0) JAL on ... off ws = Case opcode JAL X1 else rd ASrc = (rs1D=wsE).weE.re1D Is this correct? No because only ALU and ALUi instructions can benefit from this bypass Split weE into two components: we-bypass, we-stall January 31, 2012 CS152, Spring 2012 32 Bypass and Stall Signals Split weE into two components: we-bypass, we-stall we-bypassE = Case opcodeE ALU, ALUi (ws 0) ... off ASrc we-stallE = Case opcodeE LW, JALR (ws 0) JAL on ... off = (rs1D =wsE).we-bypassE . re1D stall = ((rs1D =wsE).we-stallE + (rs1D=wsM).weM + (rs1D=wsW).weW). re1D +((rs2D = wsE).weE + (rs2D = wsM).weM + (rs2D = wsW).weW). re2D January 31, 2012 CS152, Spring 2012 33 Fully Bypassed Datapath PC for JAL, ... stall 0x4 bubble Add PC addr ASrc inst IR Inst Memory D Is there still a need for the stall signal ? January 31, 2012 IR M IR W 1 we rs1 rs2 rd1 wa wd rd2 A ALU GPRs Imm Select IR E Y B BSrc we addr rdata Data Memory R wdata wdata MD1 MD2 stall = (rs1D=wsE). (opcodeE=LWE).(wsE0 ).re1D + (rs2D=wsE). (opcodeE=LWE).(wsE0 ).re2D CS152, Spring 2012 34 Pipeline CPI Examples Measure from when first instruction finishes to when last instruction in sequence finishes. 3 instructions finish in 3 cycles Time Inst 1 Inst 2 Inst 3 CPI = 3/3 =1 Inst 1 Inst 2 Bubble Inst 3 Inst 1 Bubble 1 Inst 2 Inst Bubble 3 2 Inst 3 January 31, 2012 3 instructions finish in 4 cycles CPI = 4/3 = 1.33 3 instructions finish in 5cycles CPI = 5/3 = 1.67 CS152, Spring 2012 35 Resolving Data Hazards (3) Strategy 3: Speculate on the dependence. Two cases: Guessed correctly do nothing Guessed incorrectly kill and restart …. We’ll later see examples of this approach in more complex processors. January 31, 2012 CS152, Spring 2012 36 Speculation that load value=zero PC for JAL, ... stall 0x4 bubble Add PC addr IR ASrc inst IR Inst Memory D we rs1 rs2 rd1 wa wd rd2 IR M IR 1 Guess_zero 0 A ALU GPRs Imm Select E Y B BSrc we addr rdata Data Memory R wdata wdata MD1 MD2 Guess_zero= (rs1D=wsE). (opcodeE=LWE).(wsE0 ).re1D Also need to add circuitry to remember that this was a guess and flush pipeline if load not zero! Not worth doing in practice – why? January 31, 2012 CS152, Spring 2012 37 W Control Hazards • What do we need to calculate next PC? – For Jumps » Opcode, PC and offset – For Jump Register » Opcode, Register value, and PC – For Conditional Branches » Opcode, Register (for condition), PC and offset – For all other instructions » Opcode and PC • have to know it’s not one of above January 31, 2012 CS152, Spring 2012 38 PC Calculation Bubbles (I1) x1 x0 + 10 (I2) x3 x2 + 17 (I3) (I4) Resource Usage IF ID EX MA WB time t0 t1 t2 t3 IF1 ID1 EX1 MA1 IF2 IF2 ID2 IF3 time t0 t1 I1 I1 t2 I2 I1 t3 I2 I1 t4 t5 WB1 EX2 MA2 IF3 ID3 IF4 t6 t4 I3 t6 I4 I2 I1 t5 I3 I2 - CS152, Spring 2012 .... WB2 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 I3 I2 - January 31, 2012 t7 t7 .... I4 I3 - I4 I3 I4 - I4 pipeline bubble 39 Speculate next address is PC+4 PCSrc (pc+4 / jabs / rind/ br) stall Add bubble 0x4 Add Jump? PC 104 I1 I2 I3 I4 addr inst January 31, 2012 M IR IR I1 IR Inst Memory 096 100 104 304 E I2 ADD J 304 ADD ADD kill A jump instruction kills (not stalls) the following instruction CS152, Spring 2012 How? 40 Pipelining Jumps PCSrc (pc+4 / jabs / rind/ br) stall To kill a fetched instruction -- Insert a mux before IR Add bubble 0x4 Add Jump? 304 104 I1 I2 I3 I4 addr inst bubble Inst Memory 096 100 104 304 January 31, 2012 IR bubble I2 ADD J 304 ADD ADD kill M IR IR II21 I1 Any interaction between stall and jump? IRSrcD PC E IRSrcD = Case opcodeD J, JAL bubble ... IM CS152, Spring 2012 41 Jump Pipeline Diagrams (I1) (I2) (I3) (I4) 096: 100: 104: 304: ADD J 304 ADD ADD Resource Usage IF ID EX MA WB time t0 t1 t2 IF1 ID1 EX1 IF2 ID2 IF3 time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 MA1 EX2 IF4 t4 WB1 MA2 ID4 t5 t6 t3 I4 I2 I1 t4 I5 I4 I2 I1 t5 t6 t7 .... I5 I4 I2 I5 I4 - I5 I4 I5 CS152, Spring 2012 .... WB2 EX4 MA4 WB4 - January 31, 2012 t7 pipeline bubble 42 Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall Add 0x4 bubble Add E M IR IR I1 BEQ? Taken? IRSrcD PC 104 I1 I2 I3 I4 addr inst Inst Memory 096 100 104 304 January 31, 2012 bubble A IR ALU Y I2 Branch condition is not known until ADD BEQ x1,x2 +200 the execute stage what action should be taken in the ADD decode stage ? ADD CS152, Spring 2012 43 Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall ? Add E bubble 0x4 Add M Bcond? IR IR I2 I1 Taken? IRSrcD PC 108 I1 I2 I3 I4 addr inst Inst Memory 096 100 104 304 January 31, 2012 bubble IR A ALU Y I3 If the branch is taken ADD - kill the two following instructions BEQ x1,x2 +200 - the instruction at the decode stage ADD is not valid ADD stall signal is not valid CS152, Spring 2012 44 Pipelining Conditional Branches stall Add PCSrc (pc+4/jabs/rind/br) E IRSrcE bubble 0x4 Add Jump? M Bcond? IR IR I2 I1 Taken? PC PC 108 I1:096 I2:100 I3:104 I4:304 addr IRSrcD inst bubble Inst Memory ADD BEQZ x1,x2 +200 ADD ADD January 31, 2012 IR A ALU Y I3 If the branch is taken - kill the two following instructions - the instruction at the decode stage is not valid stall signal is not valid CS152, Spring 2012 45 Branch Pipeline Diagrams (resolved in execute stage) (I1) (I2) (I3) (I4) (I5) time t0 t1 t2 096: ADD IF1 ID1 EX1 100: BEQZ +200 IF2 ID2 104: ADD IF3 108: 304: ADD Resource Usage IF ID EX MA WB time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 MA1 EX2 ID3 IF4 t4 WB1 MA2 IF5 t5 t6 t3 I4 I3 I2 I1 t4 I5 I2 I1 t5 t6 t7 .... I5 I2 I5 - I5 - I5 CS152, Spring 2012 .... WB2 ID5 EX5 MA5 WB5 - January 31, 2012 t7 pipeline bubble 46 Reducing Branch Penalty (resolve in decode stage) • One pipeline bubble can be removed if an extra comparator is used in the Decode stage – issues? 0x4 Taken? Add PC addr inst IR Inst Memory D A ALU GPRs Imm Select January 31, 2012 Bcomp we rs1 rs2 rd1 wa wd rd2 B BSrc CS152, Spring 2012 MD1 47 Acknowledgements • These slides contain material developed and copyright by: – – – – – – Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB) • MIT material derived from course 6.823 • UCB material derived from course CS252 January 31, 2012 CS152, Spring 2012 48