CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152 Last time in Lecture 3 • Microcoding became less attractive as gap between RAM and ROM speeds reduced • Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew • Iron-law explains architecture design space – Trade instruction/program, cycles/instruction, and time/cycle • Load-Store RISC ISAs designed for efficient pipelined implementations – Very similar to vertical microcode – Inspired by earlier Cray machines 2/5/2008 CS152-Spring’08 2 “Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle – Instructions per program depends on source code, compiler technology, and ISA – Cycles per instructions (CPI) depends upon the ISA and the microarchitecture – Time per cycle depends upon the microarchitecture and the base technology Lecture 3 This lecture 2/5/2008 Microarchitecture Microcoded Single-cycle unpipelined Pipelined CS152-Spring’08 CPI >1 1 1 cycle time short long short 3 An Ideal Pipeline stage 1 stage 2 stage 3 stage 4 • All objects go through the same stages • No sharing of resources between any two stages • Propagation delay through all pipeline stages is equal • The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition? 2/5/2008 CS152-Spring’08 4 Pipelined MIPS To pipeline MIPS: • First build MIPS without pipelining with CPI=1 • Next, add pipeline registers to reduce cycle time while maintaining CPI=1 2/5/2008 CS152-Spring’08 5 Pipelined Datapath 0x4 Add PC addr rdata Inst. Memory IR we rs1 rs2 rd1 ws wd rd2 GPRs Imm Ext ALU we addr rdata Data Memory wdata write fetch decode & Reg-fetch execute memory -back phase phase phase phase phase Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined 2/5/2008 CS152-Spring’08 6 Technology Assumptions • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!) Thus, the following timing assumption is reasonable tIM tRF tALU tDM tRW A 5-stage pipelined Harvard architecture will be the focus of our detailed design 2/5/2008 CS152-Spring’08 7 5-Stage Pipelined Execution 0x4 Add PC addr rdata we rs1 rs2 rd1 ws wd rd2 GPRs IR Inst. Memory I-Fetch (IF) Data Memory Imm Ext wdata Write Decode, Reg. Fetch Execute Memory (ID) (EX) (MA) Back t0 t1 t2 t3 t4 t5 t6 t7 . . . . (WB) time instruction1 instruction2 instruction3 instruction4 instruction5 2/5/2008 ALU we addr rdata IF1 ID1 EX1 MA1 WB1 IF2 ID2 EX2 MA2 WB2 IF3 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5 CS152-Spring’08 8 5-Stage Pipelined Execution Resource Usage Diagram 0x4 Add PC addr rdata we rs1 rs2 rd1 ws wd rd2 GPRs IR Inst. Memory Resources I-Fetch (IF) 2/5/2008 we addr rdata ALU Data Memory Imm Ext wdata t5 Write Memory (MA) Back t6 t7 . . . . (WB) I5 I4 I3 I2 I5 I4 I3 Decode, Reg. Fetch Execute (ID) (EX) time IF ID EX MA WB t0 I1 t1 I2 I1 t2 I3 I2 I1 t3 I4 I3 I2 I1 t4 I5 I4 I3 I2 I1 CS152-Spring’08 I5 I4 I5 9 Pipelined Execution: ALU Instructions 0x4 IR Add PC addr IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 A ALU GPRs Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 MD2 Not quite correct! We need an Instruction Reg (IR) for each stage 2/5/2008 CS152-Spring’08 10 Pipelined MIPS Datapath without jumps F D E M IR W IR IR 31 0x4 Add RegDst RegWrite PC addr inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 OpSel MemWrite A ALU GPRs Y we addr rdata B Data Memory wdata Imm Ext R wdata MD1 ExtSel WBSrc MD2 BSrc Control Points Need to Be Connected 2/5/2008 CS152-Spring’08 11 How Instructions can Interact with each other in a pipeline • An instruction in the pipeline may need a resource being used by another instruction in the pipeline structural hazard • An instruction may depend on something produced by an earlier instruction – Dependence may be for a data value data hazard – Dependence may be for the next instruction’s address control hazard (branches, exceptions) 2/5/2008 CS152-Spring’08 12 Data Hazards r1 … r4 r1 … 0x4 IR Add PC addr 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 A ALU GPRs Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 ... r1 r0 + 10 r4 r1 + 17 ... 2/5/2008 IR IR MD2 r1 is stale. Oops! CS152-Spring’08 13 Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages interlocks 2/5/2008 CS152-Spring’08 14 Feedback to Resolve Hazards FB1 FB2 stage 1 • FB3 stage 2 FB4 stage 3 stage 4 Later stages provide dependence information to earlier stages which can stall (or kill) instructions • Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) 2/5/2008 CS152-Spring’08 15 Interlocks to resolve Data Hazards Stall Condition 0x4 nop Add PC addr IR IR 31 inst IR Inst Memory ... r1 r0 + 10 r4 r1 + 17 ... 2/5/2008 IR we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 CS152-Spring’08 MD2 16 Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 (I3) IF3 IF3 IF3 IF3 (I4) stalled stages (I5) Resource Usage IF ID EX MA WB time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 I3 I2 nop I1 t4 I3 I2 nop nop I1 t5 I3 I2 nop nop nop t6 CS152-Spring’08 .... EX2 MA2 WB2 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5 t6 I4 I3 I2 nop nop nop 2/5/2008 t7 t7 I5 I4 I3 I2 nop .... I5 I4 I3 I2 I5 I4 I3 I5 I4 pipeline bubble 17 I5 Interlock Control Logic stall ws Cstall rs rt ? 0x4 nop Add PC addr IR IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 MD2 Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. 2/5/2008 CS152-Spring’08 18 Interlock Control Logic ignoring jumps & branches stall Cstall rs rt ws we ? re1 re2 nop Add PC addr Cdest Cdest Cre 0x4 ws ws we we IR IR IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs Cdest A ALU Y B we addr rdata Data Memory Imm Ext R wdata wdata MD1 MD2 Should we always stall if the rs field matches some rd? not every instruction writes a register we not every instruction reads a register re 2/5/2008 CS152-Spring’08 19 Source & Destination Registers R-type: op rs rt I-type: op rs rt J-type: op ALU ALUi LW SW BZ J JAL JR JALR 2/5/2008 rd func immediate16 immediate26 rd (rs) func (rt) rt (rs) op imm rt M [(rs) + imm] M [(rs) + imm] (rt) cond (rs) true: PC (PC) + imm false: PC (PC) + 4 PC (PC) + imm r31 (PC), PC (PC) + imm PC (rs) r31 (PC), PC (rs) CS152-Spring’08 source(s) destination rs, rt rd rs rt rs rt rs, rt rs rs rs rs 31 31 20 Deriving the Stall Signal Cdest ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 Cre we = Case opcode ALU, ALUi, LW (ws 0) JAL, JALR on ... off Cstall 2/5/2008 re1 = Case opcode ALU, ALUi, LW, SW, BZ, JR, JALR on J, JAL off re2 = Case opcode on ALU, SW off ... stall = ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW) . re1D ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW) . re2D CS152-Spring’08 + 21 Hazards due to Loads & Stores Stall Condition What if (r1)+7 = (r3)+5 ? 0x4 nop Add PC addr IR 31 inst IR Inst Memory we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B ... M[(r1)+7] (r2) r4 M[(r3)+5] ... we addr rdata Data Memory Imm Ext 2/5/2008 IR IR R wdata wdata MD1 MD2 Is there any possible data hazard in this instruction sequence? CS152-Spring’08 22 Load & Store Hazards ... M[(r1)+7] (r2) r4 M[(r3)+5] ... (r1)+7 = (r3)+5 data hazard However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course. 2/5/2008 CS152-Spring’08 23 CS152 Administrivia • Krste, office hours, Monday 1-3pm, 645 Soda – email for alternate time • Henry office hours, 511 Soda – 9:30-10:30AM Mondays – 2:00-3:00PM Fridays • First lab and practice problems this evening • In-class quiz dates: – Q1: Tuesday February 19 (ISAs, microcode, simple pipelining) – Q2: Tuesday March 4 (memory hierarchies) 2/5/2008 CS152-Spring’08 24 Course Schedule • Check website for schedule • Generally, a lab, practice problem set, plus quiz every two weeks • Lab and practice problems out on Tuesday (this evening for first) • Lab overview in section next day (Wednesday) • Practice problems due following Tuesday in class, solutions available on line • Problem review in section next day (Wednesday) • Lab due in class Thursday • Quiz following Tuesday, with new PS and lab. 2/5/2008 CS152-Spring’08 25 Resolving Data Hazards (2) Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage bypass 2/5/2008 CS152-Spring’08 26 Bypassing time (I1) r1 r0 + 10 (I2) r4 r1 + 17 (I3) (I4) (I5) t0 IF1 t1 t2 t3 t4 t5 ID1 EX1 MA1 WB1 IF2 ID2 ID2 ID2 ID2 IF3 IF3 IF3 IF3 stalled stages t6 t7 .... EX2 MA2 WB2 ID3 EX3 MA3 IF4 ID4 EX4 IF5 ID5 Each stall or kill introduces a bubble in the pipeline CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time (I1) r1 r0 + 10 (I2) r4 r1 + 17 (I3) (I4) (I5) 2/5/2008 t0 t1 IF1 t2 t3 ID1 EX1 IF2 ID2 IF3 CS152-Spring’08 t4 MA1 EX2 ID3 IF4 t5 WB1 MA2 EX3 ID4 IF5 t6 t7 .... WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5 27 Adding a Bypass stall r4 r1... 0x4 r1 ... nop Add IR E IR M IR 31 ASrc PC addr inst IR Inst Memory D we rs1 rs2 rd1 ws wd rd2 GPRs A ALU Y B we addr rdata Data Memory Imm Ext wdata wdata MD1 MD2 When does this bypass help? ... (I1) r1 r0 + 10 r1 M[r0 + 10] JAL 500 (I2) r4 r1 + 17 r4 r1 + 17 r4 r31 + 17 yes no no 28 2/5/2008 CS152-Spring’08 R W The Bypass Signal Deriving it from the Stall Signal stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D +((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws 0) JAL, JALR on ... off ASrc = (rsD=wsE).weE.re1D Is this correct? No because only ALU and ALUi instructions can benefit from this bypass Split weE into two components: we-bypass, we-stall 2/5/2008 CS152-Spring’08 29 Bypass and Stall Signals Split weE into two components: we-bypass, we-stall we-bypassE = Case opcodeE ALU, ALUi (ws 0) ... off we-stallE = Case opcodeE LW (ws 0) JAL, JALR on ... off ASrc = (rsD =wsE).we-bypassE . re1D stall = ((rsD =wsE).we-stallE + (rsD=wsM).weM + (rsD=wsW).weW). re1D +((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D 2/5/2008 CS152-Spring’08 30 Fully Bypassed Datapath PC for JAL, ... stall 0x4 nop Add PC addr ASrc inst IR Inst Memory D Is there still a need for the stall signal ? 2/5/2008 IR M IR 31 we rs1 rs2 rd1 ws wd rd2 A ALU GPRs Imm Ext IR E Y B BSrc we addr rdata Data Memory R wdata wdata MD1 MD2 stall = (rsD=wsE). (opcodeE=LWE).(wsE0 ).re1D + (rtD=wsE). (opcodeE=LWE).(wsE0 ).re2D CS152-Spring’08 31 W Resolving Data Hazards (3) Strategy 3: Speculate on the dependence. Two cases: Guessed correctly do nothing Guessed incorrectly kill and restart 2/5/2008 CS152-Spring’08 32 Instruction to Instruction Dependence • What do we need to calculate next PC: – For Jumps » Opcode, offset and PC – For Jump Register » Opcode and Register value – For Conditional Branches » Opcode, PC, Register (for condition), and offset – For all others » Opcode and PC 2/5/2008 CS152-Spring’08 33 PC Calculation Bubbles time t0 t1 t2 t3 (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 (I2) r3 (r2) + 17 IF2 IF2 ID2 (I3) IF3 (I4) Resource Usage IF ID EX MA WB time t0 t1 t2 I1 nop I2 I1 nop I1 t3 t4 t5 WB1 EX2 MA2 IF3 ID3 IF4 t4 nop I3 I2 nop nop I2 I1 nop I1 t6 t5 t6 nop I4 I3 nop nop I3 I2 nop nop I2 CS152-Spring’08 .... WB2 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 nop 2/5/2008 t7 t7 .... I4 nop I4 I3 nop I4 nop I3 nop I4 pipeline bubble 34 Speculate next address is PC+4 PCSrc (pc+4 / jabs / rind/ br) stall Add nop 0x4 Add Jump? PC 104 I1 I2 I3 I4 addr M IR IR I1 IR Inst Memory 096 100 104 304 2/5/2008 inst E I2 ADD J 304 ADD ADD kill A jump instruction kills (not stalls) the following instruction How? CS152-Spring’08 35 Pipelining Jumps PCSrc (pc+4 / jabs / rind/ br) stall To kill a fetched instruction -- Insert a mux before IR Add nop 0x4 Add Jump? 304 104 I1 I2 I3 I4 addr nop Inst Memory 096 100 104 304 2/5/2008 inst IR nop I2 ADD J 304 ADD ADD kill M IR IR II21 I1 Any interaction between stall and jump? IRSrcD PC E IRSrcD = Case opcodeD J, JAL nop ... IM CS152-Spring’08 36 Jump Pipeline Diagrams (I1) (I2) (I3) (I4) 096: 100: 104: 304: ADD J 304 ADD ADD Resource Usage IF ID EX MA WB time t0 t1 t2 IF1 ID1 EX1 IF2 ID2 IF3 time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 MA1 EX2 nop IF4 t4 WB1 MA2 nop ID4 t5 t3 I4 nop I2 I1 t4 I5 I4 nop I2 I1 t5 t6 CS152-Spring’08 .... WB2 nop nop EX4 MA4 WB4 t6 t7 I5 I4 I5 nop I4 I5 I2 nop I4 nop 2/5/2008 t7 .... I5 pipeline bubble 37 Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall Add nop 0x4 Add E M IR IR I1 BEQZ? zero? IRSrcD PC 104 I1 I2 I3 I4 096 100 104 304 2/5/2008 addr inst nop Inst Memory ADD BEQZ r1 +200 ADD ADD A IR ALU Y I2 Branch condition is not known until the execute stage what action should be taken in the decode stage ? CS152-Spring’08 38 Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall ? Add E nop 0x4 Add M BEQZ? IR IR I2 I1 zero? IRSrcD PC 108 I1 I2 I3 I4 096 100 104 304 2/5/2008 addr inst Inst Memory nop A IR ALU Y I3 If the branch is taken ADD - kill the two following instructions BEQZ r1 +200 - the instruction at the decode stage ADD is not valid ADD stall signal is not valid CS152-Spring’08 39 Pipelining Conditional Branches stall Add PCSrc (pc+4/jabs/rind/br) E IRSrcE nop 0x4 Add Jump? M BEQZ? IR IR I2 I1 zero? PC PC 108 I1 I2 I3 I4 096 100 104 304 2/5/2008 addr IRSrcD inst Inst Memory nop A IR ALU Y I3 If the branch is taken ADD - kill the two following instructions BEQZ r1 200 - the instruction at the decode stage ADD is not valid ADD stall signal is not valid CS152-Spring’08 40 New Stall Signal stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D + ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) . !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z) Don’t stall if the branch is taken. Why? Instruction at the decode stage is invalid 2/5/2008 CS152-Spring’08 41 Control Equations for PC and IR Muxes PCSrc = Case opcodeE BEQZ.z, BNEZ.!z br ... Case opcodeD J, JAL JR, JALR ... jabs rind pc+4 IRSrcD = Case opcodeE BEQZ.z, BNEZ.!z nop ... Case opcodeD J, JAL, JR, JALR nop ... IM Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction IRSrcE = Case opcodeE BEQZ.z, BNEZ.!z nop ... stall.nop + !stall.IRD 2/5/2008 CS152-Spring’08 42 Branch Pipeline Diagrams (resolved in execute stage) (I1) (I2) (I3) (I4) (I5) time t0 t1 t2 096: ADD IF1 ID1 EX1 100: BEQZ 200 IF2 ID2 104: ADD IF3 108: 304: ADD Resource Usage IF ID EX MA WB time t0 t1 I1 I2 I1 t2 I3 I2 I1 t3 MA1 EX2 ID3 IF4 t4 WB1 MA2 nop nop IF5 t5 t3 I4 I3 I2 I1 t4 I5 nop nop I2 I1 t5 t6 CS152-Spring’08 .... WB2 nop nop nop nop nop ID5 EX5 MA5 WB5 t6 t7 .... I5 nop I5 nop nop I5 I2 nop nop I5 nop 2/5/2008 t7 pipeline bubble 43 Reducing Branch Penalty (resolve in decode stage) • One pipeline bubble can be removed if an extra comparator is used in the Decode stage PCSrc (pc+4 / jabs / rind/ br) E Add nop 0x4 IR Add PC addr nop inst Inst Memory IR D we rs1 rs2 rd1 ws wd rd2 Zero detect on register file output GPRs Pipeline diagram now same as for jumps 2/5/2008 CS152-Spring’08 44 Branch Delay Slots (expose control hazard to software) • Change the ISA semantics so that the instruction that follows a jump or branch is always executed – I1 I2 I3 I4 gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted. 096 100 104 304 ADD BEQZ r1 200 ADD ADD Delay slot instruction executed regardless of branch outcome • Other techniques include branch prediction, which can dramatically reduce the branch penalty... to come later 2/5/2008 CS152-Spring’08 45 Why an Instruction may not be dispatched every cycle (CPI>1) • Full bypassing may be too expensive to implement – typically all frequently used paths are provided – some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI • Loads have two cycle latency – Instruction after load cannot use load result – MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II. • Conditional branches may cause bubbles – kill following instruction(s) if no delay slots Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler. 2/5/2008 CS152-Spring’08 46 Acknowledgements • These slides contain material developed and copyright by: – – – – – – Arvind (MIT) Krste Asanovic (MIT/UCB) Joel Emer (Intel/MIT) James Hoe (CMU) John Kubiatowicz (UCB) David Patterson (UCB) • MIT material derived from course 6.823 • UCB material derived from course CS252 2/5/2008 CS152-Spring’08 47