CS152: Computer Architecture and Engineering, Fall 2004
Lecture 10: Basic MIPS Pipelining Review
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
Dave Patterson (www.cs.berkeley.edu/~patterson)
[Adapted from Mary Jane Irwin’s slides, www.cse.psu.edu/~cg431]
CS 152 L10 Pipeline Intro (1) Fall 2004 © UC Regents

Recap last lecture
• Customers: measure to buy
• Architects: measure for design
• Tools: Performance Equation, CPI
  $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
• Amdahl’s Law’s lesson: Balance
  $\text{Speedup}_{\text{whole}} = \frac{1}{(1 - \%\text{affected}) + \%\text{affected} / \text{Speedup}_{\text{part}}}$
• Energy: $E_{0 \to 1} = \frac{1}{2} C V_{dd}^{2}$, $E_{1 \to 0} = \frac{1}{2} C V_{dd}^{2}$

The Five Stages of Load
[Figure: lw occupies IFetch, Dec, Exec, Mem, WB in Cycles 1 through 5]
• IFetch: Instruction Fetch and Update PC
• Dec: Registers Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data Memory
• WB: Write the result data into the register file

Pipelined MIPS Processor
• Start the next instruction while still working on the current one
  - improves throughput or bandwidth: total amount of work done in a given time (average instructions per second or per clock)
  - instruction latency is not reduced (time from the start of an instruction to its completion)

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
  lw      IFetch   Dec      Exec     Mem      WB
  sw               IFetch   Dec      Exec     Mem      WB
  R-type                    IFetch   Dec      Exec     Mem      WB

• pipeline clock cycle (pipeline stage time) is limited by the slowest stage
• for some instructions, some stages are wasted cycles

Single Cycle, Multiple Cycle, vs.
Pipeline
Single Cycle Implementation:
[Figure: one long clock cycle per instruction; Load fills Cycle 1, Store finishes early in Cycle 2 and the rest of that cycle is waste]
Multiple Cycle Implementation:
[Figure: Cycles 1 through 10; lw takes IFetch, Dec, Exec, Mem, WB; then sw takes IFetch, Dec, Exec, Mem; then R-type starts with IFetch]
Pipeline Implementation:
[Figure: lw, sw, and R-type started one cycle apart, each passing through IFetch, Dec, Exec, Mem, WB; some of those stages are “wasted” cycles]

Multiple Cycle v. Pipeline, Bandwidth v. Latency
[Figure: the same Multiple Cycle and Pipeline timing diagrams for lw, sw, and R-type]
• Latency per lw = 5 clock cycles for both
• Bandwidth of lw is 1 per clock (IPC) for pipeline vs. 1/5 IPC for multicycle
• Pipelining improves instruction bandwidth, not instruction latency

Pipelining the MIPS ISA
What makes it easy
• all instructions are the same length (32 bits)
  - easier to fetch in 1st stage and decode in 2nd stage
• few instruction formats (three) with symmetry across formats
  - can begin reading register file in 2nd stage
• memory operations can occur only in loads and stores
  - can use the execute stage to calculate memory addresses
• each MIPS instruction writes at most one result and does so near the end of the pipeline
What makes it hard
• structural hazards: what if we had only one memory?
• control hazards: what about branches?
• data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath?
• registers between pipeline stages to isolate them
[Figure: pipelined datapath in five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack), separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline registers, all driven by the System Clock]

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
• how many cycles does it take to execute this code?
• what is the ALU doing during cycle 4?
• is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Throughput!
[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg one cycle after the previous; the first cycles are the time to fill the pipeline]
• Once the pipeline is full, one instruction is completed every cycle

Administrivia
• Lab 2 demo Friday, due Monday
• Feedback on team effort
  - How did it work? Change before pipeline?
• Reading Chapter 6, sections 6.1 to 6.4 for today, 6.5 to 6.9 for next 2 lectures
• Midterm Tue Oct 12 5:30 - 8:30 in 101 Morgan (you asked for it)
  - Northwest corner of campus, near Arch and Hearst
• Midterm review Sunday Oct 10, 7 PM, 306 Soda
• Bring 1 page, handwritten notes, both sides
• Nothing electronic: no calculators, cell phones, pagers, …
• Meet at LaVal’s Northside afterwards for Pizza

Important Observation
• Each functional unit can only be used once per instruction (since 4 other instructions are executing)
• If different instructions use a functional unit at different stages, hazards result:
  - Load uses Register File’s Write Port during its 5th stage
  - R-type uses Register File’s Write Port during its 4th stage
• 2 ways to solve this pipeline hazard.
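The write-port conflict described above can be checked mechanically. The following is a minimal sketch, assuming one instruction is fetched per cycle and no stalls; the function name `write_port_conflicts` and the two stage tables are hypothetical names introduced for illustration, not part of the lecture.

```python
# Hypothetical stage tables: the 1-based stage in which each instruction
# kind uses the register file write port.
WRITE_STAGE_UNEQUAL = {"load": 5, "rtype": 4}  # as in the observation above
WRITE_STAGE_UNIFORM = {"load": 5, "rtype": 5}  # every kind writes in stage 5


def write_port_conflicts(program, write_stage):
    """Return {cycle: [instruction indices]} for cycles with 2+ writers.

    program: list of instruction kinds; instruction i is fetched in cycle i + 1.
    """
    users = {}
    for index, kind in enumerate(program):
        # Fetched in cycle index + 1, so stage s falls in cycle index + s.
        cycle = index + write_stage[kind]
        users.setdefault(cycle, []).append(index)
    return {cycle: idxs for cycle, idxs in users.items() if len(idxs) > 1}


program = ["load", "rtype", "rtype"]
print(write_port_conflicts(program, WRITE_STAGE_UNEQUAL))
print(write_port_conflicts(program, WRITE_STAGE_UNIFORM))
```

With unequal write stages, the load and the R-type right behind it both reach the write port in cycle 5; with a uniform write stage, no cycle has two writers.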
[Figure: stage numbering for the two instruction types]
  Load:   1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
  R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr

Solution 1: Insert “Bubble” into the Pipeline
[Figure: a Load and several R-type instructions across Cycles 1 through 9, with a pipeline bubble delaying an R-type so that its Wr does not fall in the same cycle as the Load’s Wr]
• Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
  - The control logic can be complex.
  - Lose instruction fetch and issue opportunity.
• No instruction is started in Cycle 6!

Solution 2: Delay R-type’s Write by One Cycle
• Delay R-type’s register write by one cycle:
  - Now R-type instructions also use Reg File’s write port at Stage 5
  - Mem stage is a NOP stage: nothing is being done.
  R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
[Figure: R-type and Load instructions started one cycle apart across Cycles 1 through 9, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr]

Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
• structural hazards: attempt to use the same resource by two different instructions at the same time
• data hazards: attempt to use data before it is ready
  - instruction source operands are produced by a prior instruction still in the pipeline
  - load instruction followed immediately by an ALU instruction that uses the load operand as a source value
• control hazards: attempt to make a decision before condition has been evaluated
  - branch instructions
• Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards

A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 through Inst 4 against time (clock cycles), each using Mem, Reg, ALU, Mem, Reg; in the same cycle one instruction is reading data from memory while another is reading its instruction from memory]

How About Register File Access?
[Figure: add r1, followed by Inst 1, Inst 2, add r2,r1, and Inst 4 against time (clock cycles); potential read before write data hazard]
• Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Register Usage Can Cause Data Hazards
[Figure: add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in instruction order, each passing through IM, Reg, ALU, DM, Reg]
• Dependencies backward in time cause hazards
• Which are read before write data hazards?

Loads Can Cause Data Hazards
[Figure: lw r1,100(r2); sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5 in instruction order, each passing through IM, Reg, ALU, DM, Reg]
• Dependencies backward in time cause hazards
• Load-use data hazard

One Way to “Fix” a Data Hazard
[Figure: add r1,r2,r3; stall; stall; sub r4,r1,r5; and r6,r1,r7]
• Can fix data hazard by waiting (stall), but this affects throughput

Another Way to “Fix” a Data Hazard
[Figure: add r1,r2,r3; sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5]
• Can fix data hazard by forwarding results as soon as they are available to where they are needed.

Forwarding with Load-use Data Hazards
[Figure: lw r1,100(r2); sub r4,r1,r5; and r6,r1,r7; or r8,r1,r9; xor r4,r1,r5]
• Will still need one stall cycle even with forwarding

Branch Instructions Cause Control Hazards
[Figure: beq; lw; Inst 3; Inst 4 in instruction order, each passing through IM, Reg, ALU, DM, Reg]
• Dependencies backward in time cause hazards

One Way to “Fix” a Control Hazard
[Figure: beq followed by stall cycles, then lw and Inst 3]
• Can fix branch hazard by waiting (stall), but this affects throughput

Corrected Datapath to Save RegWrite Addr
• Need to preserve the destination register address in the pipeline state registers (Bug in COD 1st edition!)
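Why the destination register address has to travel with the instruction can be shown with a small model. This is a sketch under stated assumptions: one instruction fetched per cycle, a 5-stage pipeline, and an uncorrected design that takes the write register number from whichever instruction currently sits in the IF/ID register. The function name `writeback_targets` and its parameters are hypothetical.

```python
def writeback_targets(dest_regs, carry_dest):
    """Return the register number each instruction would write in WB.

    dest_regs:  destination register numbers in program order.
    carry_dest: True  -> the destination travels through ID/EX, EX/MEM, MEM/WB.
                False -> WB uses the field of the instruction now in IF/ID
                         (the assumed uncorrected design).
    """
    targets = []
    for i, dest in enumerate(dest_regs):
        if carry_dest:
            targets.append(dest)
        else:
            # Instruction i is in stage 5 (WB) during cycle i + 5. The
            # instruction in stage 2 (Decode) that cycle has index i + 3.
            j = i + 3
            targets.append(dest_regs[j] if j < len(dest_regs) else None)
    return targets


dests = [1, 4, 6, 8, 9]
print(writeback_targets(dests, carry_dest=True))
print(writeback_targets(dests, carry_dest=False))
```

In this model the first instruction, which should write register 1, would write register 8 in the uncorrected design, because that is the destination field of the instruction three slots behind it.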
[Figure: pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between PC/Instruction Memory, Register File, ALU, and Data Memory]

MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
  - and held in the state registers between pipeline stages
[Figure: the same pipelined datapath with a Control block added alongside the IF/ID, ID/EX, EX/MEM, and MEM/WB registers]

Control Settings

        | EX Stage                     | MEM Stage              | WB Stage
        | RegDst ALUOp1 ALUOp0 ALUSrc  | Brch MemRead MemWrite  | RegWrite MemtoReg
  R     | 1      1      0      0       | 0    0       0         | 1        0
  lw    | 0      0      0      1       | 0    1       0         | 1        1
  sw    | X      0      0      1       | 0    0       1         | 0        X
  beq   | X      0      1      0       | 1    0       0         | 0        X

Q: Why not show write enable for pipeline registers?
A: Written every clock cycle (like PC)
Q: Why not show control for IF and ID stages?
A: Control same for all instructions in IF and ID stages: fetch instruction, increment PC

Other Pipeline Structures Are Possible
• What about (slow) multiply operation?
  - let it take two cycles
[Figure: IM, Reg, ALU alongside a MUL unit, DM, Reg]
• What if the data memory access is twice as slow as the instruction memory?
  - make the clock twice as slow or …
  - let data memory access take two cycles (and keep the same clock rate)
[Figure: IM, Reg, ALU, DM1, DM2, Reg]

Sample Pipeline Alternatives (for ARM ISA)
[Figure: three ARM pipelines drawn as stage sequences: ARM7 (3-stage pipeline), StrongARM-1 (5-stage pipeline), and XScale (7-stage pipeline), with per-stage labels such as PC update, BTB access, IM access, decode, reg access, shift/rotate, ALU op, DM access, exception, DM write, reg write, and commit result (write back)]

Peer Instruction
[Figure: 1st, 2nd, and 3rd lw started one cycle apart, each passing through Ifetch, Reg/Dec, Exec, Mem1, Mem2, Wr]
Suppose a big data cache results in a data cache latency of 2 clock cycles and a 6-stage pipeline. (Pipelined, so can do 1 access / clock cycle.) What is the impact?
1. Instruction bandwidth is now 5/6-ths of the 5-stage pipeline
2. Instruction bandwidth is now 1/2 of the 5-stage pipeline
3. The branch delay slot is now 2 instructions
4. The load-use hazard can be with 2 instructions following load
5. Both 3 and 4: branch delay and load-use now 2 instructions
6. None of the above

Peer Instruction
[Figure: 1st lw passing through Ifetch1, Ifetch2, Reg/Dec, Exec, Mem, Wr]
Suppose a big I cache results in an I cache latency of 2 clock cycles and a 6-stage pipeline. (Pipelined, so can do 1 access / clock cycle.) What is the impact?
1. Instruction bandwidth is now 5/6-ths of the 5-stage pipeline
2. Instruction bandwidth is now 1/2 of the 5-stage pipeline
3. The branch delay slot is now 2 instructions
4. The load-use hazard can be with 2 instructions following load
5. Both 3 and 4: branch delay and load-use now 2 instructions
6.
None of the above

Peer Instruction
[Figure: 1st add (Ifetch, Reg/Dec, Exec, Mem/Wr), 2nd lw (Ifetch, Reg/Dec, Exec, Mem, Wr), and 3rd add (Ifetch, Reg/Dec, Exec, Mem/Wr), started one cycle apart]
Suppose we use a 4-stage pipeline that combines memory access and write back stages for all instructions but load, stalling when there are structural hazards. Impact?
1. The branch delay slot is now 0 instructions
2. Most loads cause stall since often a structural hazard on reg. writes
3. Most stores cause stall since they have a structural hazard
4. Both 2 & 3: most loads & stores cause stall due to structural hazards
5. Most loads cause stall, but there is no load-use hazard anymore
6. Both 2 & 3, but there is no load-use hazard anymore
7. None of the above

Designing a Pipelined Processor
• Go back and examine your data path and control diagram
• Associate resources with states
• Add pipeline registers between stages to balance clock cycle
  - Be sure there are no structural hazards: one use / clock cycle
  - Amdahl’s Law suggests splitting longest stage
• Resolve all data and control dependencies
  - If backwards in time in pipeline drawing to registers => data hazard: forward or stall to resolve them
  - If backwards in time in pipeline drawing to PC => control hazard: we’ll see next time
• Assert control in appropriate stage
• Develop test instruction sequences likely to uncover pipeline bugs
  - If you don’t test it, it won’t work

Brainstorm on bugs (if time permits)
• Where are bugs likely to hide in a pipelined processor?
  1.
  2. …
• How can you write tests to uncover these likely bugs?
  1.
  2. …
• Once it passes a test, never need to run it again in the design process?
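The "forward or stall" rule for data hazards can be turned into a small stall counter, which is also a convenient way to predict what a test instruction sequence should do. This is a minimal sketch, assuming an in-order 5-stage pipeline whose register file writes in the first half of a cycle and reads in the second half; `Instr` and `count_stalls` are hypothetical names, and only register dependencies are modeled (no branches, no structural hazards).

```python
from collections import namedtuple

# Hypothetical instruction record: op name, destination register, sources.
Instr = namedtuple("Instr", "op dest srcs")


def count_stalls(program, forwarding):
    """Count stall cycles caused by register data hazards."""
    ready = {}      # register -> earliest cycle a consumer may enter Exec
    exec_cycle = 2  # so the first instruction enters Exec in cycle 3
    stalls = 0
    for ins in program:
        earliest = exec_cycle + 1  # one cycle behind the previous instruction
        needed = max([ready.get(r, 0) for r in ins.srcs] + [earliest])
        stalls += needed - earliest
        exec_cycle = needed
        if ins.dest is not None:
            if forwarding:
                # ALU result is forwarded to the next Exec; load data only
                # exists after Mem, one cycle later.
                ready[ins.dest] = exec_cycle + (2 if ins.op == "lw" else 1)
            else:
                # Written in WB (Exec + 2); a consumer can decode in that
                # same cycle and enter Exec one cycle later.
                ready[ins.dest] = exec_cycle + 3
    return stalls


alu_chain = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r4", ("r1", "r5")),
    Instr("and", "r6", ("r1", "r7")),
]
load_use = [
    Instr("lw", "r1", ("r2",)),
    Instr("sub", "r4", ("r1", "r5")),
]
print(count_stalls(alu_chain, forwarding=False))
print(count_stalls(alu_chain, forwarding=True))
print(count_stalls(load_use, forwarding=True))
```

Under these assumptions the add/sub pair needs two stall cycles without forwarding and none with it, while the load-use pair still needs one stall cycle with forwarding, matching the data hazard slides.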
Summary
• All modern day processors use pipelining
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number of pipe stages
• Pipeline rate limited by slowest pipeline stage
• Must detect and resolve hazards
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduce speedup
• Stalling negatively affects throughput
• Next time: pipeline control, including hazards
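The effect of fill time and stalls on speedup can be worked out with a short calculation. This is a sketch under simplifying assumptions: balanced stages, the same clock as a multicycle design in which every instruction takes all stages, and stalls given as a total count. The function names `pipeline_cycles` and `speedup_over_multicycle` are hypothetical.

```python
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles to run n instructions: fill the pipe, then one per cycle."""
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup_over_multicycle(n_instructions, n_stages, stall_cycles=0):
    """Speedup versus a design that spends n_stages cycles per instruction."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipeline_cycles(n_instructions, n_stages, stall_cycles)


for n in (1, 3, 1000):
    print(n, pipeline_cycles(n, 5), speedup_over_multicycle(n, 5))
print(speedup_over_multicycle(1000, 5, stall_cycles=200))
```

For a single instruction there is no speedup at all; as the instruction count grows the ratio approaches the number of pipe stages, and any stall cycles pull it back down.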