CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152 January 28, 2010 CS152, Spring 2010 Last time in Lecture 3 • Microcoding became less attractive as gap between RAM and ROM speeds reduced • Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew • Iron-law explains architecture design space – Trade instructions/program, cycles/instruction, and time/cycle • Load-Store RISC ISAs designed for efficient pipelined implementations – Very similar to vertical microcode – Inspired by earlier Cray machines (more on these later) January 28, 2010 CS152, Spring 2010 2 An Ideal Pipeline stage 1 stage 2 stage 3 stage 4 • All objects go through the same stages • No sharing of resources between any two stages • Propagation delay through all pipeline stages is equal • The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines, but instructions depend on each other! January 28, 2010 CS152, Spring 2010 3 Pipelined MIPS To pipeline MIPS: • First build MIPS without pipelining with CPI=1 • Next, add pipeline registers to reduce cycle time while maintaining CPI=1 January 28, 2010 CS152, Spring 2010 4 Lecture 3: Unpipelined Datapath for MIPS PCSrc br rind jabs pc+4 RegWrite MemWrite WBSrc 0x4 Add Add clk PC clk 31 addr inst Inst. Memory we rs1 rs2 rd1 ws wd rd2 clk we addr ALU GPRs z rdata Data Memory Imm Ext wdata ALU Control OpCode RegDst January 28, 2010 ExtSel OpSel BSrc CS152, Spring 2010 zero? 5 Lecture 3: Hardwired Control Table Opcode ALU ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc SW * sExt16 uExt16 sExt16 sExt16 Reg Imm Imm Imm Imm Func Op Op + + no no no no yes yes yes yes yes no ALU ALU ALU Mem * rd rt rt rt * pc+4 pc+4 pc+4 pc+4 pc+4 BEQZz=0 sExt16 * 0? no no * * br BEQZz=1 sExt16 * * * * * no no no no no * * * * pc+4 jabs * * * * 0? * * * * yes no yes PC * PC R31 * R31 jabs rind rind ALUi ALUiu LW J JAL JR JALR BSrc = Reg / Imm RegDst = rt / rd / R31 January 28, 2010 no no WBSrc = ALU / Mem / PC PCSrc = pc+4 / br / rind / jabs CS152, Spring 2010 6 Pipelined Datapath 0x4 Add PC addr rdata Inst. Memory IR we rs1 rs2 rd1 ws wd rd2 GPRs ALU Imm Ext we addr rdata Data Memory wdata write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined January 28, 2010 CS152, Spring 2010 7 “Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle – Instructions per program depends on source code, compiler technology, and ISA – Cycles per instructions (CPI) depends upon the ISA and the microarchitecture – Time per cycle depends upon the microarchitecture and the base technology Microarchitecture Lecture 2 Microcoded Lecture 3 Single-cycle unpipelined Lecture 4 Pipelined January 28, 2010 CS152, Spring 2010 CPI >1 1 1 cycle time short long short 8 CPI Examples Microcoded machine 7 cycles 5 cycles Inst 1 Time 10 cycles Inst 2 Inst 3 3 instructions, 22 cycles, CPI=7.33 Unpipelined machine Inst 1 Inst 2 Inst 3 3 instructions, 3 cycles, CPI=1 Pipelined machine Inst 1 Inst 2 Inst 3 January 28, 2010 3 instructions, 3 cycles, CPI=1 CS152, Spring 2010 9 Technology Assumptions • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!) Thus, the following timing assumption is reasonable tIM tRF tALU tDM tRW A 5-stage pipeline will be the focus of our detailed design - some commercial designs have over 30 pipeline stages to do an integer add! January 28, 2010 CS152, Spring 2010 10 5-Stage Pipelined Execution 0x4 Add PC addr rdata we rs1 rs2 rd1 ws wd rd2 GPRs IR Inst. Memory I-Fetch (IF) Imm Ext Data Memory wdata Write Decode, Reg. Fetch Execute Memory (ID) (EX) (MA) Back t0 t1 t2 t3 t4 t5 t6 t7 . . . . (WB) time instruction1 instruction2 instruction3 instruction4 instruction5 January 28, 2010 ALU we addr rdata IF1 ID1 EX1 MA1 WB1 IF2 ID2 EX2 MA2 WB2 IF3 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5 CS152, Spring 2010 11 5-Stage Pipelined Execution Resource Usage Diagram 0x4 Add PC addr rdata we rs1 rs2 rd1 ws wd rd2 GPRs IR Inst. Memory Resources I-Fetch (IF) January 28, 2010 we addr rdata ALU Data Memory Imm Ext wdata t5 Write Memory (MA) Back t6 t7 . . . . (WB) I5 I4 I3 I2 I5 I4 I3 Decode, Reg. Fetch Execute (ID) (EX) time IF ID EX MA WB t0 I1 t1 I2 I1 t2 I3 I2 I1 t3 I4 I3 I2 I1 t4 I5 I4 I3 I2 I1 CS152, Spring 2010 I5 I4 I5 12 Pipelined Execution: ALU Instructions 0x4 IR Add PC addr IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 MD2 Not quite correct! We need an Instruction Reg (IR) for each stage January 28, 2010 CS152, Spring 2010 13 Pipelined MIPS Datapath without jumps F D E M IR W IR IR 31 0x4 Add RegDst RegWrite PC addr inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 OpSel MemWrite A ALU GPRs Y we addr rdata B Data Memory wdata Imm Ext R wdata MD1 ExtSel WBSrc MD2 BSrc Control Points Need to Be Connected January 28, 2010 CS152, Spring 2010 14 Instructions interact with each other in pipeline • An instruction in the pipeline may need a resource being used by another instruction in the pipeline structural hazard • An instruction may depend on something produced by an earlier instruction – Dependence may be for a data value data hazard – Dependence may be for the next instruction’s address control hazard (branches, exceptions) January 28, 2010 CS152, Spring 2010 15 Resolving Structural Hazards • Structural hazards occurs when two instruction need same hardware resource at same time – Can resolve in hardware by stalling newer instruction till older instruction finished with resource • A structural hazard can always be avoided by adding more hardware to design – E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory • Our 5-stage pipe has no structural hazards by design – Thanks to MIPS ISA, which was designed for pipelining January 28, 2010 CS152, Spring 2010 16 Data Hazards r1 … r4 r1 … 0x4 IR Add PC addr 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 ... r1 r0 + 10 r4 r1 + 17 ... January 28, 2010 IR IR MD2 r1 is stale. Oops! CS152, Spring 2010 17 CS152 Administrivia • PS 1 out Tuesday • Lab 1 out today • Scott Beamer will run section reviewing lab 1 at 2pm in 320 Soda January 28, 2010 CS152, Spring 2010 18 Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages interlocks January 28, 2010 CS152, Spring 2010 19 Feedback to Resolve Hazards FB1 FB2 stage 1 FB3 stage 2 FB4 stage 3 stage 4 • Later stages provide dependence information to earlier stages which can stall (or kill) instructions • Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) January 28, 2010 CS152, Spring 2010 20 Interlocks to resolve Data Hazards Stall Condition 0x4 nop Add PC addr IR IR IR 31 inst IR Inst Memory ... r1 r0 + 10 r4 r1 + 17 ... January 28, 2010 we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 CS152, Spring 2010 MD2 21 Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 (I3) IF3 IF3 IF3 IF3 (I4) stalled stages (I5) Resource Usage IF ID EX MA WB time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 I3 I2 nop I1 t4 I3 I2 nop nop I1 t5 I3 I2 nop nop nop t6 CS152, Spring 2010 .... EX2 MA2 WB2 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5 t6 I4 I3 I2 nop nop nop January 28, 2010 t7 t7 I5 I4 I3 I2 nop .... I5 I4 I3 I2 I5 I4 I3 I5 I4 pipeline bubble 22 I5 Interlock Control Logic stall ws Cstall rs rt ? 0x4 nop Add PC addr IR IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 MD2 Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. January 28, 2010 CS152, Spring 2010 23 Interlock Control Logic ignoring jumps & branches stall Cstall rs rt ws we ? re1 re2 Cre 0x4 PC addr Cdest Cdest nop Add ws ws we we IR IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs Cdest A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 MD2 Should we always stall if the rs field matches some rd? not every instruction writes a register we not every instruction reads a register re January 28, 2010 CS152, Spring 2010 24 Source & Destination Registers ALU ALUi LW SW BZ J JAL JR JALR R-type: op rs rt I-type: op rs rt J-type: op func immediate16 immediate26 rd (rs) func (rt) rt (rs) op imm rt M [(rs) + imm] M [(rs) + imm] (rt) cond (rs) true: PC (PC) + imm false: PC (PC) + 4 PC (PC) + imm r31 (PC), PC (PC) + imm PC (rs) r31 (PC), PC (rs) January 28, 2010 rd CS152, Spring 2010 source(s) destination rs, rt rd rs rt rs rt rs, rt rs rs rs rs 31 31 25 Deriving the Stall Signal Cdest ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 Cre we = Case opcode ALU, ALUi, LW (ws 0) JAL, JALR on ... off Cstall January 28, 2010 re1 = Case opcode ALU, ALUi, LW, SW, BZ, JR, JALR on J, JAL off re2 = Case opcode on ALU, SW off ... stall = ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW) . re1D ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW) . re2D CS152, Spring 2010 + 26 Hazards due to Loads & Stores Stall Condition What if (r1)+7 = (r3)+5 ? 0x4 nop Add PC addr IR IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B rdata Data Memory Imm Ext ... M[(r1)+7] (r2) r4 M[(r3)+5] ... January 28, 2010 we addr R wdata wdata MD1 MD2 Is there any possible data hazard in this instruction sequence? CS152, Spring 2010 27 Load & Store Hazards ... M[(r1)+7] (r2) r4 M[(r3)+5] ... (r1)+7 = (r3)+5 data hazard However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course. January 28, 2010 CS152, Spring 2010 28 Resolving Data Hazards (2) Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage bypass January 28, 2010 CS152, Spring 2010 29 Bypassing time (I1) r1 r0 + 10 (I2) r4 r1 + 17 (I3) (I4) (I5) t0 IF1 t1 t2 t3 t4 t5 ID1 EX1 MA1 WB1 IF2 ID2 ID2 ID2 ID2 IF3 IF3 IF3 IF3 stalled stages t6 t7 .... EX2 MA2 WB2 ID3 EX3 MA3 IF4 ID4 EX4 IF5 ID5 Each stall or kill introduces a bubble in the pipeline CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time (I1) r1 r0 + 10 (I2) r4 r1 + 17 (I3) (I4) (I5) January 28, 2010 t0 t1 IF1 t2 t3 ID1 EX1 IF2 ID2 IF3 t4 MA1 EX2 ID3 IF4 CS152, Spring 2010 t5 WB1 MA2 EX3 ID4 IF5 t6 t7 .... WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5 30 Adding a Bypass stall r4 r1... 0x4 r1 ... nop Add IR E IR M IR 31 ASrc PC addr inst IR Inst Memory D we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B rdata Data Memory Imm Ext R wdata wdata MD1 (I1) (I2) we addr MD2 When does this bypass help? ... r1 r0 + 10 r1 M[r0 + 10] JAL 500 r4 r1 + 17 r4 r1 + 17 r4 r31 + 17 yes no no January 28, 2010 CS152, Spring 2010 31 W The Bypass Signal Deriving it from the Stall Signal stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D +((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws 0) JAL, JALR on ... off ASrc = (rsD=wsE).weE.re1D Is this correct? No because only ALU and ALUi instructions can benefit from this bypass Split weE into two components: we-bypass, we-stall January 28, 2010 CS152, Spring 2010 32 Bypass and Stall Signals Split weE into two components: we-bypass, we-stall we-bypassE = Case opcodeE ALU, ALUi (ws 0) ... off we-stallE = Case opcodeE LW (ws 0) JAL, JALR on ... off ASrc = (rsD =wsE).we-bypassE . re1D stall = ((rsD =wsE).we-stallE + (rsD=wsM).weM + (rsD=wsW).weW). re1D +((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D January 28, 2010 CS152, Spring 2010 33 Fully Bypassed Datapath PC for JAL, ... stall 0x4 nop Add PC addr ASrc inst IR Inst Memory D Is there still a need for the stall signal ? January 28, 2010 IR M IR 31 we rs1 rs2 rd1 ws wd rd2 A ALU GPRs Imm Ext IR E Y B BSrc we addr rdata Data Memory R wdata wdata MD1 MD2 stall = (rsD=wsE). (opcodeE=LWE).(wsE0 ).re1D + (rtD=wsE). (opcodeE=LWE).(wsE0 ).re2D CS152, Spring 2010 34 W Resolving Data Hazards (3) Strategy 3: Speculate on the dependence. Two cases: Guessed correctly do nothing Guessed incorrectly kill and restart …. We’ll later see examples of this approach in more complex processors. January 28, 2010 CS152, Spring 2010 35 Next Time: Control Hazards • Branches/Jumps • Exceptions/Interrupts January 28, 2010 CS152, Spring 2010 36 Acknowledgements • These slides contain material developed and copyright by: – – – – – – Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB) • MIT material derived from course 6.823 • UCB material derived from course CS252 January 28, 2010 CS152, Spring 2010 37