CS 152 Computer Architecture and Engineering
Lecture 4 – Pipelining
2014-1-30
John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152: L4 Pipelining UC Regents Spring 2014 © UCB

Motorola 68000
Next week we will return to the microcode story ...
Today is the anti-microcode story - pipelining!
[Figure labels: RISC CPU; Caches; Data Path and Control]

Today: Pipelining
Pipelining: an idea from assembly line production applied to CPU design.
Why pipelining is hard: data hazards, control hazards, structural hazards.
Visualizing pipelines to evaluate hazard detection and resolution.
Short Break.
A tool kit for hazard resolution.

Starting Point: Performance Equation
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Goal is to optimize execution time, not individual equation terms.
Instructions/Program: Machines are optimized with respect to program workloads.
Cycles/Instruction: The CPI of the program. Reflects the program’s instruction mix.
Seconds/Cycle: Clock period. Optimize jointly with machine CPI.

Pipelining

Recall: Our single-cycle processor
Challenge: Speed up the clock while keeping CPI == 1.
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Cycles/Instruction: CPI == 1. This is good.
Seconds/Cycle: Slow. This is bad.
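
The performance equation can be tried out numerically. The sketch below is only an illustration: the instruction count, CPI values, and clock periods are hypothetical numbers chosen for the example, not figures from the lecture.

```python
def execution_time(instructions, cpi, clock_period_s):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * clock_period_s


# Hypothetical workload size (assumption, not from the lecture).
INSTRUCTIONS = 1_000_000

# Single-cycle design: CPI == 1, but a long clock period (assumed 10 ns).
single_cycle_s = execution_time(INSTRUCTIONS, cpi=1.0, clock_period_s=10e-9)

# Pipelined design: shorter clock (assumed 2.5 ns), CPI a bit above 1 (assumed 1.2).
pipelined_s = execution_time(INSTRUCTIONS, cpi=1.2, clock_period_s=2.5e-9)

# What matters is the product of the terms, not any single term.
speedup = single_cycle_s / pipelined_s

print(f"single-cycle: {single_cycle_s * 1e3:.2f} ms")
print(f"pipelined:    {pipelined_s * 1e3:.2f} ms")
print(f"speedup:      {speedup:.2f}x")
```

With these assumed numbers the pipelined machine wins even though its CPI is worse, which is the sense in which the goal is execution time rather than any individual term.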

Recall: An R-format CPU design
Decode fields to get: ADD $8 $9 $10
Instruction fields: opcode, rs, rt, rd, shamt, funct
[Figure labels: Logic, op; RegFile with ports rs1, rs2, ws, wd, rd1, rd2, WE; 32-bit ALU]

Reminder: How data flows after posedge
[Figure labels: PC (D, Q), + 0x4, Instr Mem (Addr, Data), Logic, op, RegFile, ALU]

Next posedge: Update state and repeat
[Figure labels: PC (D, Q), RegFile]

Observation: Logic idle most of cycle
For most of the cycle, the ALU is either “waiting” for its inputs or “holding” its output.
Ideal: a CPU architecture where each part is always “working”.

Inspiration: Automobile assembly line
Assembly line moves on a steady clock.
Each station does the same task on each car.
[Figure labels: The clock; Car body shell; Car chassis; Merge station; Bolting station]

Inspiration: Automobile assembly line
Simpler station tasks → more cars per hour.
Simple tasks take less time, so the clock is faster.

Inspiration: Automobile assembly line
Line speed is limited by the slowest task.
Most efficient if all tasks take the same time to do.

Inspiration: Automobile assembly line
Simpler tasks, complex car → long line!
These lines go 24 x 7, and rarely shut down.

Lessons from car assembly lines
Faster line movement yields more cars per hour off the line.
Faster line movement requires more stages, each doing simpler tasks.
To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle).
“Filling”, “flushing”, and “stalling” the assembly line are all bad news.
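
The lesson that the slowest stage sets the pace can be sketched in a few lines. The stage delays below are hypothetical (both splits add up to the same 10 ns of total work); the function names are made up for this illustration.

```python
def clock_period(stage_delays_ns):
    """The line moves at the pace of its slowest station."""
    return max(stage_delays_ns)


def idle_fraction(stage_delays_ns):
    """Fraction of each clock period that each stage spends idle."""
    period = clock_period(stage_delays_ns)
    return [(period - d) / period for d in stage_delays_ns]


# Hypothetical stage delays in nanoseconds (assumption, not from the lecture).
unbalanced = [2, 1, 4, 2, 1]
balanced = [2, 2, 2, 2, 2]

for name, delays in (("unbalanced", unbalanced), ("balanced", balanced)):
    print(name, "clock period:", clock_period(delays), "ns",
          "idle fractions:", idle_fraction(delays))
```

In the unbalanced split the 4 ns stage forces a 4 ns clock and the other stages sit idle for part of every cycle; the balanced split does the same total work with a 2 ns clock.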

Key analogy: The instruction is the car
[Figure: pipeline stages #1 through #5; stage #1 is Instruction Fetch, and an IR register sits in front of each later stage]
The IR in front of each stage controls the hardware in that stage (stages 2, 3, 4, and 5).
“Data-stationary control”

Example: Decode & Register Fetch stage
A sample program:
ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8
R’s chosen so that instructions are independent - like cars on the line.
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3 with the three instructions in flight; IR registers between stages; A, B, and M registers]

Performance Equation and Pipelining
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Cycles/Instruction: CPI == 1. Once the pipe is full, one instruction completes per cycle.
Seconds/Cycle: Clock period is shorter. Less work to do in each cycle.
To get the shortest clock period, balance the work to do in each pipeline stage.

Hazards: An instruction is not a car ...
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3 with ADD R4,R3,R2 and OR R5,R4,R2 in flight]
R4 not written yet ...
... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!
New sample program:
ADD R4,R3,R2
OR R5,R4,R2
An example of a “hazard”: we must (1) detect and (2) resolve all hazards to make a CPU that matches the ISA.

Performance Equation and Hazards
$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
Cycles/Instruction: Some ways to cope with hazards make CPI > 1 (“stalling pipeline”).
Seconds/Cycle: Added logic to detect and resolve hazards increases the clock period.
“Software slows the machine down” - Seymour Cray

A (simplified) 5-stage pipelined CPU
Stage 1: “IF” Stage (Instr Fetch)
Stage 2: “ID/RF” Stage (Decode & Reg Fetch)
Stage 3: “EX” Stage (Execution)
Stage 4: “MEM” Stage (Memory)
Stage 5: WB (Write Back)
[Figure labels: IR registers between stages; A, B, Y, M, R registers; Mux, Logic; WE, MemToReg]

Sometimes, “contract” is a challenge
Sample Program:
LW R4,0(R0)
OR R5,R4,R2
[Figure: the 5-stage pipeline with LW R4,0(R0) and OR R5,R4,R2 in flight]
... but we haven’t even started the load yet!
One approach: change the contract!

From Lecture 1: Delayed Loads ...
Instruction Fetch: Fetch the load inst from memory. “I-Format”: opcode, rs, rt, offset.
Instruction Decode: Decode fields to get: LW $1, 32($2)
Operand Fetch: “Retrieve” register value: $2
Execute: Compute memory address: 32 + $2
Result Store: Load memory address contents into: $1
Next Instruction: Prepare to fetch the instr that follows the LW in the program. Depending on load semantics, the new $1 is visible to that instr, or not until the following instr (“delayed loads”).

After we change the contract ...
Sample Program:
LW R4,0(R0)
OR R5,R4,R2
[Figure: the 5-stage pipeline with LW R4,0(R0) and OR R5,R4,R2 in flight]
...
“delayed load” contract does not guarantee new R4 is seen.
Only partially solves the problem ... soon, we finish the story.

Visualizing Pipelines

Pipeline Representation #1: Timeline
Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages.
Good for visualizing pipeline fills.
Sample Program:
I1: ADD R4,R3,R2
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
I4: XOR R3,R2,R1
I5: OR R7,R6,R5

Inst  t1   t2   t3   t4   t5   t6   t7   t8
I1:   IF   ID   EX   MEM  WB
I2:        IF   ID   EX   MEM  WB
I3:             IF   ID   EX   MEM  WB
I4:                  IF   ID   EX   MEM  WB
I5:                       IF   ID   EX   MEM
I6:                            IF   ID   EX
Pipeline is “full” starting at t5.

Representation #2: Resource Usage
Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages.
Good for visualizing pipeline stalls.
Sample Program:
I1: ADD R4,R3,R2
I2: AND R6,R5,R4
I3: SUB R1,R9,R8
I4: XOR R3,R2,R1
I5: OR R7,R6,R5

Stage t1   t2   t3   t4   t5   t6   t7   t8
IF:   I1   I2   I3   I4   I5   I6   I7   I8
ID:        I1   I2   I3   I4   I5   I6   I7
EX:             I1   I2   I3   I4   I5   I6
MEM:                 I1   I2   I3   I4   I5
WB:                       I1   I2   I3   I4
Pipeline is “full” starting at t5.

Hazard Taxonomy

Structural Hazards
Several pipeline stages need to use the same hardware resource at the same time.
Solution #1: Add extra copies of the resource (only works sometimes).
Solution #2: Change the resource so that it can handle concurrent use.
Solution #3: Stages “take turns” by stalling parts of the pipeline.
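
A structural hazard of this kind can be found mechanically from the resource-usage view. The sketch below assumes a 5-stage pipeline with no stalls and a single memory shared by instruction fetch and data access; the function names and the sample program (one load followed by four ALU instructions) are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle):
    """Stage held by instruction instr_index (0-based) in cycle (1-based), or None.

    Assumes one instruction enters the pipeline per cycle and nothing stalls.
    """
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


def memory_conflicts(uses_data_memory):
    """Cycles in which IF and MEM both need the single shared memory.

    uses_data_memory[i] is True if instruction i is a load or store.
    Returns (cycle, index of instr in MEM, index of instr in IF) tuples.
    """
    n = len(uses_data_memory)
    conflicts = []
    for cycle in range(1, n + len(STAGES)):
        fetching = [i for i in range(n) if stage_at(i, cycle) == "IF"]
        accessing = [i for i in range(n)
                     if stage_at(i, cycle) == "MEM" and uses_data_memory[i]]
        if fetching and accessing:
            conflicts.append((cycle, accessing[0], fetching[0]))
    return conflicts


# Hypothetical program: a load followed by four ALU instructions.
program_uses_memory = [True, False, False, False, False]
print(memory_conflicts(program_uses_memory))
```

For this program the load reaches MEM in cycle 4, the same cycle in which the fourth instruction is being fetched, so one of the three listed solutions has to be applied there.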

Structural Hazard Example: One Memory
Used by IF stage and MEM stage.
[Figure labels: “IF” Stage, “ID/RF” Stage, “EX” Stage, “MEM” Stage, WB; PC; IR registers; A, B, Y, M, R registers; Mux, Logic; WE, MemToReg; To branch logic]

A solution: “Extra copies” of memory
[Figure: the 5-stage pipeline (IF, ID/RF, EX, MEM, WB)]
I and D caches are a hybrid solution.

Alternatively: Concurrent use ...
[Figure: the 5-stage pipeline (IF, ID/RF, EX, MEM, WB)]
ID and WB stages use the register file in the same clock cycle.

Data Hazards: 3 Types (RAW, WAR, WAW)
Several pipeline stages read or write the same data location in an incompatible way.
Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction, but I2 executes “too early” and reads the wrong copy of the data.
Note “data value”, not “register”. Data hazards are possible for any architected state (such as main memory). In practice, main memory hazard avoidance is the job of the memory system.

Recall: RAW example
Sample program:
ADD R4,R3,R2
OR R5,R4,R2
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3 with ADD R4,R3,R2 and OR R5,R4,R2 in flight]
R4 not written yet ...
... wrong value of R4 fetched from RegFile, contract with programmer broken! Oops!
This is what we mean when we say Read After Write (RAW) Hazard.

Data Hazards: 3 Types (RAW, WAR, WAW)
Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it.
But instead, I2 writes too early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect.
WAR and WAW are not possible in our 5-stage pipeline, but are possible in other pipeline designs.

Control Hazards: A taken branch/jump
Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages.
Sample Program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8

Inst  t1   t2   t3   t4   t5   t6   t7   t8
I1:   IF   ID   EX   MEM  WB
I2:        IF   ID
I3:             IF
EX stage computes if branch is taken.
If branch is taken, these instructions (I2 and I3) MUST NOT complete!
Note: with branch delay slot, I2 MUST complete, I3 MUST NOT complete.

Hazards Recap
Structural Hazards
Data Hazards (RAW, WAR, WAW)
Control Hazards (taken branches and jumps)
On each clock cycle, we must detect the presence of all of these hazards, and resolve them before they break the “contract with the programmer”.

Break

Hazard Resolution Tools

The Hazard Resolution Toolkit
Stall earlier instructions in pipeline.
Forward results computed in later pipeline stages to earlier stages.
Add new hardware or rearrange hardware design to eliminate hazard.
Change ISA to eliminate hazard.
Kill earlier instructions in pipeline.
Make hardware handle concurrent requests to eliminate hazard.

Resolving a RAW hazard by stalling
Sample program:
ADD R4,R3,R2
OR R5,R4,R2
Keep executing OR instruction until R4 is ready. Until then, send NOPS to IR 2/3.
Freeze PC and IR until stall is over.
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3 with ADD R4,R3,R2 and OR R5,R4,R2 in flight]
Let ADD proceed to WB stage, so that R4 is written to regfile.
New datapath hardware:
(1) Mux into IR 2/3 to feed in NOP.
(2) Write enable on PC and IR 1/2.

Resolving a RAW hazard by forwarding
Sample program:
ADD R4,R3,R2
OR R5,R4,R2
[Figure: “IF” Stage (Instr Fetch), “ID/RF” Stage (Decode & Reg Fetch), and “EX” Stage (Execution) with OR R5,R4,R2 and ADD R4,R3,R2 in flight]
ALU computes R4 in the EX stage, so ... Just forward it back!
Unlike stalling, does not change CPI. May hurt cycle time.

Control Hazards: Fix with more hardware
Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages.
Sample Program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8

Inst  t1   t2   t3   t4   t5   t6   t7   t8
I1:   IF   ID   EX   MEM  WB
I2:        IF   ID
I3:             IF
EX stage computes if branch is taken.
If we add hardware, can we move it here? (The slide's arrow marks an earlier point in the pipeline.)
If branch is taken, these instructions MUST NOT complete!

Resolving control hazard with hardware
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3; an == comparator feeds the branch control logic]

Control Hazards: After more hardware
Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages.
Sample Program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8

Inst  t1   t2   t3   t4   t5   t6   t7   t8
I1:   IF   ID   EX   MEM  WB
I2:        IF
ID stage computes if branch is taken.
If branch is taken, this instruction (I2) MUST NOT complete!
If we change the ISA, can we always let I2 complete (“branch delay slot”) and eliminate the control hazard?

From Lecture 1: BEQ $1,$2,25
Instruction Fetch: Fetch branch inst from memory. “I-Format”: opcode, rs, rt, offset.
Instruction Decode: Decode fields to get: BEQ $1, $2, 25
Operand Fetch: “Retrieve” register values: $1, $2
Execute: Compute if we take branch: $1 == $2 ?
Result Store / Next Instruction: ALWAYS prepare to fetch the instr that follows the BEQ in the program (“delayed branch”). IF we take the branch, the instr we fetch AFTER that instruction is PC + 4 + 100.
PC == “Program

Resolve control hazard by killing instr
Sample program (no delay slot):
J 200
OR R5,R4,R2
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3 with J 200 in flight]
Detect J instruction, mux a NOP into IR 1/2.
Compute new PC using hardware not shown.
This hurts CPI. Can we do better?
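
How much killing instructions hurts CPI can be estimated with the performance equation. The sketch below compares resolving a branch in EX (two younger instructions killed when taken) with resolving it in ID (one killed), as in the timelines above. The 20% control-instruction share and 60% taken rate are hypothetical numbers, and the function name is made up for this illustration.

```python
def cpi_with_kills(base_cpi, control_fraction, taken_fraction, killed_per_taken):
    """Average CPI when each taken branch/jump turns younger instructions into NOPs.

    Each killed instruction wastes one cycle, so the penalty per instruction is
    (fraction that are control instrs) * (fraction taken) * (instrs killed).
    """
    return base_cpi + control_fraction * taken_fraction * killed_per_taken


# Hypothetical instruction mix (assumption, not from the lecture).
CONTROL_FRACTION = 0.20
TAKEN_FRACTION = 0.60

cpi_resolve_in_ex = cpi_with_kills(1.0, CONTROL_FRACTION, TAKEN_FRACTION, 2)
cpi_resolve_in_id = cpi_with_kills(1.0, CONTROL_FRACTION, TAKEN_FRACTION, 1)

print(f"branch resolved in EX: CPI = {cpi_resolve_in_ex:.2f}")
print(f"branch resolved in ID: CPI = {cpi_resolve_in_id:.2f}")
```

Under these assumptions, moving the branch decision from EX to ID halves the CPI penalty; a delay slot that the compiler can always fill would remove the remaining killed instruction.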

Structural hazard solution: concurrent use
[Figure: the 5-stage pipeline (IF, ID/RF, EX, MEM, WB)]
ID and WB stages use the register file in the same clock cycle.
Does not come for free ...

Hazard Diagnosis

Data Hazards: Read After Write
Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction, but I2 executes “too early” and reads the wrong copy of the data.
Classic solution: use forwarding heavily, fall back on stalling when forwarding won’t work or slows down the critical path too much.

Full bypass network ...
[Figure: ID (Decode), EX, MEM, and WB stages with bypass muxes; one bypass input is labeled “From WB”]

Common bug: Multiple forwards ...
ADD R4,R3,R2
OR R2,R3,R1
AND R2,R2,R1
Which do we forward from?

Common bug: Multiple forwards II ...
ADD R4,R0,R2
OR R0,R3,R1
AND R0,R2,R1
Which do we forward from?
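
One common way to settle both questions is to give the youngest matching producer priority and to never forward register 0. The sketch below is an illustration under two assumptions that the slides do not spell out: R0 is hardwired to zero (as in MIPS), and the later-stage instructions are presented to the selection logic youngest first. The function name and the stage labels are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def select_forward(src_reg: int,
                   in_flight: Sequence[Tuple[str, Optional[int]]]) -> str:
    """Pick where an operand should come from.

    src_reg: register number the instruction in ID wants to read.
    in_flight: (stage_name, dest_reg) for instructions in later stages,
        ordered youngest first (e.g. EX, then MEM, then WB). dest_reg is
        None for instructions that write no register.
    Returns a stage name to forward from, or "REGFILE" for no forwarding.
    """
    if src_reg == 0:
        # Assumption: R0 always reads as zero, so writes to it are never forwarded.
        return "REGFILE"
    for stage, dest in in_flight:
        if dest == src_reg:
            # First match is the youngest producer, whose value is the one
            # program order says the reader must see.
            return stage
    return "REGFILE"


# Two in-flight instructions both write R2; the reader needs R2.
print(select_forward(2, [("EX", 2), ("MEM", 2)]))
# Two in-flight instructions both "write" R0; the reader needs R0.
print(select_forward(0, [("EX", 0), ("MEM", 0)]))
```

Forwarding from the older producer in the first case, or forwarding a would-be write to R0 in the second, are the bugs the two slides point at.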
[Figure: ID (Decode), EX, MEM, and WB stages with bypass muxes; one bypass input is labeled “From WB”]

LW and Hazards
No load “delay slot”

Questions about LW and forwarding
ADDIU R1,R1,24
OR R3,R3,R2
LW R1,128(R29)
Do we need to stall?
[Figure: ID (Decode), EX, MEM, and WB stages with bypass muxes]

Questions about LW and forwarding
ADDIU R1,R1,24
LW R1,128(R29)
OR R1,R3,R1
Do we need to stall?
[Figure: ID (Decode), EX, MEM, and WB stages with bypass muxes]

Resolving a RAW hazard by stalling
Sample program:
ADD R4,R3,R2
OR R5,R4,R2
Keep executing OR instruction until R4 is ready. Until then, send NOPS to IR 2/3.
Freeze PC and IR until stall is over.
Let ADD proceed to WB stage, so that R4 is written to regfile.
New datapath hardware:
(1) Mux into IR 2/3 to feed in NOP.
(2) Write enable on PC and IR 1/2.

Branches and Hazards
Single “delay slot”

Recall: Control hazard and hardware
[Figure: Stage #1 (Instr Fetch), Stage #2 (Decode & Reg Fetch), and Stage #3; an == comparator feeds the branch control logic]

Recall: After more hardware, change ISA
Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB, with IR registers between stages.
Sample Program (ISA w/o branch delay slot):
I1: BEQ R4,R3,25
I2: AND R6,R5,R4
I3: SUB R1,R9,R8

Inst  t1   t2   t3   t4   t5   t6   t7   t8
I1:   IF   ID   EX   MEM  WB
I2:        IF
ID stage computes if branch is taken.
If branch is taken, this instruction (I2) MUST NOT complete!
If we change the ISA, can we always let I2 complete (“branch delay slot”) and eliminate the control hazard?
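
For the LW questions earlier in this part of the lecture, the usual stall condition can be written as a small predicate. This is a sketch under assumptions the slides leave as questions: a 5-stage pipeline with full forwarding, load data that only becomes available at the end of MEM, no load delay slot, and R0 hardwired to zero. The function name and argument layout are hypothetical.

```python
def needs_load_use_stall(ex_is_load, ex_dest, id_sources):
    """True if the instruction in ID must wait one cycle for a load in EX.

    ex_is_load: whether the instruction currently in EX is a load.
    ex_dest: register number that instruction will write.
    id_sources: register numbers read by the instruction currently in ID.

    A load's data does not exist until the end of MEM, so an instruction
    directly behind it that reads the loaded register cannot be satisfied
    by forwarding alone. Results of ALU instructions in EX can be forwarded.
    """
    return ex_is_load and ex_dest != 0 and ex_dest in id_sources


# A load of R1 directly followed by an instruction that reads R1.
print(needs_load_use_stall(True, 1, (1,)))
# An ALU instruction in EX writing R1: forwarding covers it, no stall.
print(needs_load_use_stall(False, 1, (1,)))
# A load of R1 followed by an instruction that reads only R3 and R2.
print(needs_load_use_stall(True, 1, (3, 2)))
```

When the load is two or more instructions ahead of its consumer, the value can be forwarded from a later stage and this predicate is not triggered.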

Question about branch and forwards:
BEQ R1,R3,label
OR R3,R3,R1
Will this work as shown?
[Figure: ID (Decode), EX, MEM, and WB stages with bypass muxes; an == comparator feeds the branch control logic]

Lessons learned
Pipelining is hard.
Study every instruction.
Write test code in advance.
Think about interactions ... between forwarding, branch and jump delay slots, R0 issues, LW issues ... a long list!

Control Implementation

Recall: What is single cycle control?
Combinational Logic (Only Gates, No Flip Flops)
Just specify logic functions!
[Figure: Instr Mem (Addr, Data), RegFile (rs1, rs2, ws, wd, rd1, rd2, WE), and Ext around the combinational control logic. Signals shown: Equal, RegDest, PCSrc, RegWr, ExtOp, MemToReg, ALUsrc, ALUctr, MemWr]

In pipelines, all IR registers are used
Stages: ID (Decode), EX, MEM, WB, each with its own IR.
Combinational Logic (Only Gates, No Flip Flops) (add extra state outside!)
Signals shown: Equal, RegDest, PCSrc, RegWr, ExtOp, MemToReg.
A “conceptual” design: for shortest critical path, IR registers may hold decoded info, not the complete 32-bit instruction.

On Tuesday
Quantitative instruction set architecture ...
Also, we will revisit the 68000 CPU design, and the topic of microcode.
Have a good weekend!