Computer Architecture: A Quantitative Approach, Fifth Edition
Appendix C
Pipelining: Basic and Intermediate Concepts

Basic Pipelining
• Pipelining is the organizational implementation technique that has been responsible for the most dramatic increase in computer performance.
• Overview of basic pipelining
  • What is pipelining?
  • Computing pipeline speedup
  • Clocking pipelines
  • Pipelining MIPS
  • Pipeline hazards
  • Handling interrupts

Pipelining
• 3 stages
• Assume a 2 ns flip-flop delay

Pipelining: Computing the speedup
• Time per instruction:
  \[ \mathrm{TPI} = \mathrm{CPI} \times \text{cycle time} \]
• We can think about pipelining as reducing either CPI or cycle time
• Ideal speedup:
  \[ \text{Speedup} = \frac{\mathrm{TPI}_{\text{without pipeline}}}{\mathrm{TPI}_{\text{with pipeline}}} = \text{number of pipeline stages} \]
  • Requires that all stages be perfectly balanced
  • No synchronization (latch, flip-flop) overhead
  • No stall cycles
• The speedup from a pipeline is limited:
  \[ \mathrm{CPI}_{\text{real}} = \mathrm{CPI}_{\text{ideal}} + \mathrm{CPI}_{\text{stall}} \]
  \[ \mathrm{CCT}_{\text{real}} = \mathrm{Time}_{\text{longest pipestage}} + \mathrm{Time}_{\text{latch overhead}} \]

[Figure: MIPS Instruction Formats]

[Figure: Basic MIPS Pipeline]

[Figure: Basic MIPS Pipeline (simplified)]

Pipelining By Adding Registers
• IF: Instruction fetch
• ID: Instruction decode / register file read
• EX: Execute / address calculation
• MEM: Memory access
• WB: Write back
[Figure: MIPS datapath divided into the five stages above (PC, instruction memory, register file, sign extend, shift left 2, adders, ALU, data memory, multiplexers)]

MIPS Pipelined Execution

Instruction   1    2    3    4    5    6    7    8    9
i             IF   ID   EX   MEM  WB
i+1                IF   ID   EX   MEM  WB
i+2                     IF   ID   EX   MEM  WB
i+3                          IF   ID   EX   MEM  WB
i+4                               IF   ID   EX   MEM  WB

Rules for pipeline registers
• Each stage must be independent, so inter-stage registers must hold
  • Data values
  • Control signals, including
    • Decoded instruction fields
    • MUX controls
    • ALU controls
• Think of the register file as two independent units
  • Read file, accessed in ID
  • Write file, accessed in WB
• There is no "final" set of registers after WB (WB/IF), because the
  instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC)

A More Accurate Pipeline Schematic
[Figure: the MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages]

Pipeline Dataflow: the details

Stage  Instruction type            Register transfers
IF     all                         IR2 = IMEM[PC]; PC2 = PC = PC+4
ID     all                         A3 = Regs[IR2[25..21]]; B3 = Regs[IR2[20..16]]; IR3 = IR2; PC3 = PC2;
                                   IM3 = IR2[15]^16 ## IR2[14..0]
EX     Reg-Reg ALU                 ALU4 = A3 op B3; IR4 = IR3; PC4 = PC3
       Reg-immed ALU               ALU4 = A3 op IM3; IR4 = IR3; PC4 = PC3
       Load, Store                 ALU4 = A3 + IM3; IR4 = IR3; PC4 = PC3; MD4 = B3
       Branch                      ALU4 = PC3 + IM3; CO4 = A3 op 0; IR4 = IR3; PC4 = PC3
       Jump                        ALU4 = PC3 + IM3; IR4 = IR3; PC4 = PC3
MEM    Reg-Reg ALU, Reg-immed ALU  IR5 = IR4; PC5 = PC4
       Load                        WB5 = DMEM[ALU4]
       Store                       DMEM[ALU4] = MD4
       Branch                      IR5 = IR4; PC5 = PC4; if (CO4) PC = ALU4
       Jump                        IR5 = IR4; PC5 = PC4; PC = ALU4
WB     Reg-Reg ALU, Reg-immed ALU, Load   Din = WB

Problems with Pipelining (Dependencies and Hazards)
• Dependencies: a property of the program
  • Data dependencies
    • Instruction j uses the result produced by instruction i
  • Control dependencies
    • The execution of instruction j depends upon the result of instruction i

Dependencies and Hazards
• Hazards: a result of dependencies in the pipeline
  • Hazards lead to pipeline stalls or the execution of the wrong instruction
• Data hazard
  • An instruction depends upon the result of an instruction still in the pipeline
• Structural hazard
  • Two instructions try to use the same hardware resource in a single cycle
• Control hazard
  • Caused by the delay in fetching an instruction and in deciding about changes in instruction flow

Structural hazards
• Occur when two instructions need to use the same hardware resource in the same cycle
• Fix #1: Stall the later instruction
  • A resource is not duplicated
    • Register file write ports
  • A resource is not fully pipelined, i.e.
    takes more than one cycle
    • Division, floating point
  • Low cost, but increases average CPI
  • Best used for rare events
  • Examples:
    • MIPS R2000 multi-cycle multiply
    • SPARC V1 single memory port for instruction and data
• Fix #2: Duplicate the resource
  • Increases cost, but preserves CPI
  • Best used for cheap resources and/or frequent events

Structural hazards, continued
• Example resource duplication
  • Separate instruction and data memory
  • Separate ALU and PC adders
  • Register files with multiple ports
• Fix #3: Pipeline the expensive resource
  • Moderate cost compared to duplication, expensive compared to stalling
  • Best used for high performance or specialty machines
    • Fully pipelined floating point units for scientific machines
• How to avoid structural hazards altogether
  • Design the ISA so that each resource needed by an instruction:
    • Is used once
    • Is always used in the same pipeline stage
    • Takes one cycle
  • MIPS is designed with pipelining in mind, x86 is not

Types of Data Hazards
• RAW (Read After Write)
  • Only hazard for "fixed" pipelines
  • Later instruction must read after the earlier instruction writes
• WAW (Write After Write)
  • Variable length pipeline
  • Later instruction must write after the earlier instruction writes
• WAR (Write After Read)
  • Pipeline with late read
  • Later instruction must write after the earlier instruction reads
• We can also have data hazards through memory locations
[Figure: stage timing diagrams (F R A M W) illustrating each hazard type]

Example RAW pipeline hazard

Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Clock cycle            CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2   10    10    10    10    10/-20  -20   -20   -20   -20

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions above]

Stall for RAW hazards
• Relatively cheap: just needs some extra compare and control logic
• Detected in the ID stage by comparing the registers to be read with the registers to be written by the instructions currently in the EX,
  MEM, or WB stages
• Stall if a match is found
• Increases the average CPI
• Would happen much too frequently
[Figure: ADD R1, R2, R3 flows through F R X M W and writes R1 in W ("Write data to R1 here"); the dependent ADD R4, R1, R5 is delayed by a bubble until its R stage ("Read from R1 here")]

Stall type #1: Freeze the whole pipeline

Instruction  1   2   3   4    5    6    7    8    9    10   11
I            IF  ID  EX  MEM  WB   WB
I+1              IF  ID  EX   MEM  MEM  WB
I+2                  IF  ID   EX   EX   MEM  WB
I+3                      IF   ID   ID   EX   MEM  WB
I+4                           IF   IF   ID   EX   MEM  WB
I+5                                     IF   ID   EX   MEM  WB
I+6                                          IF   ID   EX   MEM

• Freeze all pipe stages for one or more cycles, and suppress writeback
• Needs only one global stall signal, which suppresses all latching in all pipeline stages
• Sometimes called a "fixed pipe" or "frozen pipe" stall
• Works for cache misses
• Will not work to remove pipeline hazards

Stall type #2: Delay completion of an instruction

Instruction  1   2   3   4   5      6    7    8    9    10   11
I            IF  ID  EX  MEM WB
I+1              IF  ID  EX  MEM    WB
I+2                  IF  ID  Stall  EX   MEM  WB
I+3                      IF  Stall  ID   EX   MEM  WB
I+4                          Stall  IF   ID   EX   MEM  WB
I+5                                      IF   ID   EX   MEM  WB
I+6                                           IF   ID   EX   MEM
Bubble in:                   EX     MEM  WB

• Instruction progress stops for one cycle
• Earlier instructions continue towards completion
• Later instructions must suspend and make no more progress
• An "elastic pipe" stall
• Good when the need for stalling is only detected after decode, as for pipeline hazards

Bypass (Forwarding)
• If data is available elsewhere in the pipeline, there is no need to stall
• Detect the condition
• Bypass (or forward) data directly to the consuming pipeline stage
• Bypass eliminates stalls for single-cycle operations
  • Reduces the longest stall to N-1 cycles for N-cycle operations

Physical Forwarding Paths

Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Clock cycle            CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2   10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM        X     X     X     -20   X       X     X     X     X
Value of MEM/WB        X     X     X     X     -20     X     X     X     X

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions above, with forwarding paths]

• The third forwarding operation might not be necessary if we can make
  a read-after-write register file

Example forwarding decisions
• If EX has just finished an operation for which ID wants to read the value from either operand, we must forward
    if IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS1_Reg_Num then ALUmuxA = SelectALU4
    if IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS2_Reg_Num then ALUmuxB = SelectALU4
• Need one comparison and one multiplexer control for each forwarding path
• Be careful: if you can forward from more than one instruction, choose the closest one in the pipeline

Physical Forwarding Paths
[Figure: pipelined datapath with a forwarding unit. The unit takes IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd, and MEM/WB.RegisterRd and controls the multiplexers on the ALU inputs]

[Figure: Forwarding Animation (1), clock 3. Instructions in the pipeline, newest first: or $4, $4, $2; and $4, $2, $5; sub $2, $1, $3; before<1>; before<2>]

[Figure: Forwarding Animation (2), clock 4. Instructions in the pipeline, newest first: add $9, $4, $2; or $4, $4, $2; and $4, $2, $5; sub $2, ...; before<1>]

[Figure: Forwarding Animation (3), clock 5. Instructions in the pipeline, newest first: after<1>; add $9, $4, $2; or $4, $4, $2; and $4, ...; sub $2, ...]
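The forwarding decisions described on the "Example forwarding decisions" slide (compare the destination register of older in-flight instructions with the source register entering EX, and prefer the closest producer) can be sketched as below. This is a minimal illustration, not the slide author's logic: the `InFlight` record, the function name, and the returned labels are hypothetical, and the check that register 0 is never forwarded is an added assumption based on MIPS hard-wiring $0 to zero.

```python
from typing import NamedTuple, Optional


class InFlight(NamedTuple):
    """Hypothetical summary of an instruction held in a pipeline register."""
    will_write_reg: bool
    write_reg_num: int


def select_alu_input(src_reg: int,
                     ex_mem: Optional[InFlight],
                     mem_wb: Optional[InFlight]) -> str:
    """Pick the source of one ALU operand: 'EX/MEM', 'MEM/WB', or 'REGFILE'."""
    # Check the closest producer first: it holds the newest value of the register.
    for name, producer in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        if (producer is not None
                and producer.will_write_reg
                and producer.write_reg_num != 0  # assumption: $0 is never forwarded
                and producer.write_reg_num == src_reg):
            return name
    return "REGFILE"


# Sequence from the slides: sub $2, $1, $3 ; and $12, $2, $5 ; or $13, $6, $2
sub_instr = InFlight(will_write_reg=True, write_reg_num=2)
and_instr = InFlight(will_write_reg=True, write_reg_num=12)

# 'and' is in EX while 'sub' sits in EX/MEM: operand $2 comes from EX/MEM.
print(select_alu_input(2, ex_mem=sub_instr, mem_wb=None))
# 'or' is in EX while 'and' is in EX/MEM and 'sub' in MEM/WB: $2 comes from MEM/WB.
print(select_alu_input(2, ex_mem=and_instr, mem_wb=sub_instr))
```

One comparison per forwarding path and per operand is needed, which matches the slide's remark about comparators; the loop order is what implements "choose the closest".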
[Figure: Forwarding Animation (4), clock 6. Instructions in the pipeline, newest first: after<2>; after<1>; add $9, $4, $2; or $4, ...; and $4, ...]

Other Data Hazards
• WAR (Write After Read)
  • Can happen if the instruction pipeline has early writes and/or late reads; something like:
        DIV (R1),
        ADD ..,(R1)+
    • Suppose that DIV does not read its indirect destination until after the divide
    • The incremented value of R1 is written before DIV has read the value of R1
  • Can not happen in DLX because all reads are early (ID) and all writes are late (WB)
• WAW (Write After Write)
  • Can happen when a fast operation follows a slow one, like
        LW  R1, 0(R2)     IF ID EX MEM WB
        ADD R1, R2, R3       IF ID EX WB
  • Can not happen in DLX (integer) because there is only one WB stage and instructions use it in order

One data hazard left

Program execution order (in instructions):
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the five instructions above over CC 1 to CC 9]

• Loaded data is not available until the end of MEM, which is too late for the following instruction
• Forwarding can not help, so we must stall, or just "decree" that you can not write code like this.
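The load-use case above can be detected with a simple comparison between a load's destination and the sources of the next instruction. The sketch below counts the stalls a hardware interlock would insert. It assumes the classic five-stage pipeline with full forwarding, so only a load immediately followed by a consumer costs a cycle, and it treats every source operand as needed in EX. The `Instr` record and function name are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical decoded instruction: opcode, destination, source registers."""
    op: str
    dest: Optional[int]
    srcs: Tuple[int, ...]


def load_use_stalls(program: List[Instr]) -> int:
    """Count one stall for each load whose result is read by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest is not None and prev.dest in cur.srcs:
            stalls += 1
    return stalls


# Sequence from the "One data hazard left" slide.
program = [
    Instr("lw", 2, (1,)),      # lw  $2, 20($1)
    Instr("and", 4, (2, 5)),   # and $4, $2, $5   <- reads $2 right after the load
    Instr("or", 8, (2, 6)),    # or  $8, $2, $6
    Instr("add", 9, (4, 2)),   # add $9, $4, $2
    Instr("slt", 1, (6, 7)),   # slt $1, $6, $7
]
print(load_use_stalls(program))
```

Only the `lw`/`and` pair triggers the interlock here; `or` and `add` also read $2, but by then the loaded value can be forwarded or read from the register file.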
The decree that code must not use a loaded value in the very next instruction is called a "delayed load"; it was used in the original MIPS R2000.

Stalling to interlock

Program execution order (in instructions):
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

[Figure: pipeline diagram over CC 1 to CC 10; a bubble is inserted after the lw, so the and and all later instructions are delayed by one cycle]

The software fix: instruction scheduling to avoid stalls
• Since forwarding can not remove the stall after a load, avoid the stall by rearranging the code ("pipeline scheduling"), if possible
• Replace
      sub r4, r5, r7
      lw  r1, 50(r2)
      add r3, r1, r4
  with
      lw  r1, 50(r2)
      sub r4, r5, r7
      add r3, r1, r4
• This can improve the performance of a simple RISC machine
• But it is limited
  • Usually limited to basic blocks between branches, 5-7 instructions
  • Difficult to interchange accesses to variables referenced indirectly (pointer, array, or parameter) due to the risk of aliases

Branches and jumps
• Control point: the stage at which the target and the condition are known
  • Here the control point is MEM
• Branch penalty: number of pipeline stages to the control point

Instruction  1   2   3      4      5    6   7   8    9
Branch I     IF  ID  EX     MEM    WB
I+1              IF  Stall  Stall  IF   ID  EX  MEM  WB
I+2                                     IF  ID  EX   MEM
I+3                                         IF  ID   EX

This 3-cycle penalty works, but since branches occur every 5-7 instructions, it kills performance.
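To put a rough number on that claim, here is a worked estimate. It assumes an ideal CPI of 1, charges the full 3-cycle penalty to every branch, and reads "every 5-7 instructions" as a branch fraction f between 1/7 and 1/5; these readings are assumptions made for illustration.

```latex
\[
\mathrm{CPI}_{\text{eff}} = 1 + f \times 3,
\qquad
f = \tfrac{1}{7}:\ \mathrm{CPI}_{\text{eff}} = 1 + \tfrac{3}{7} \approx 1.43,
\qquad
f = \tfrac{1}{5}:\ \mathrm{CPI}_{\text{eff}} = 1 + \tfrac{3}{5} = 1.60
\]
```

Under these assumptions the pipeline needs roughly 43% to 60% more cycles than the ideal one.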
What to do:
• Determine the branch condition earlier than EX
• Compute the target address earlier than MEM

Characteristics of MIPS branches and jumps
• The branch condition
  • Only has EQ/NE comparison to zero
  • Fast and cheap, no need for a full ALU
  • Use a 32-bit NOR gate instead
• The branch target
  • Always PC-relative
  • Needs only a 16-bit adder (and carry propagation)
• The jump target
  • Always PC-relative
  • Target = {PC[31:28], offset, 00}
• All can be moved to the ID stage, at the cost of additional hardware (and maybe increased cycle time)
• Still requires one stall

Pipelining and Branch ISA Design
• Simple branches
  • Make an ID control point possible
  • Maybe increase cycle time
  • 1-cycle branch penalty
• Complex branches
  • Require an EX control point
  • Maybe lower cycle time
  • 2-cycle branch penalty

Reducing branch penalties (1)
• Predict that the branch will not be taken
  • Continue fetching from sequential addresses
  • Cancel later if the branch was taken
• Easy to do
  • If the branch is not taken, continue
  • If it is taken, change the following instruction into a NOP and thus take a 1-cycle penalty
• Helps a little, but bets the wrong way for loops

Reducing branch penalties (2)
• Predict that the branch will be taken
  • Only useful if the target address is known before the branch condition, which is not true for MIPS
  • Cancel later if the branch was not taken
  • Always has some delay in fetching the branch target

Reducing branch penalties (3)
• Change the ISA: delay the effect of the branch
  • Always execute the instruction(s) after the branch or jump
  • Depends on the compiler to find something useful to do in the branch delay slot(s)
• An ugly dependence of the ISA on the implementation, which may change
• Interacts with branch prediction and interrupts

Filling the branch delay slot

a. From before
       add $s1, $s2, $s3
       if $s2 = 0 then
           [delay slot]
   becomes
       if $s2 = 0 then
           add $s1, $s2, $s3

b. From target
       sub $t4, $t5, $t6
       …
       add $s1, $s2, $s3
       if $s1 = 0 then
           [delay slot]
   becomes
       add $s1, $s2, $s3
       if $s1 = 0 then
           sub $t4, $t5, $t6

c. From fall through
       add $s1, $s2, $s3
       if $s1 = 0 then
           [delay slot]
       sub $t4, $t5, $t6
   becomes
       add $s1, $s2, $s3
       if $s1 = 0 then
           sub $t4, $t5, $t6

How useful are canceling branches
[Figure: bar chart. Y axis: percentage of conditional branches (0% to 50%); X axis: benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor); series: canceled delay slot, empty slot]
• Integer: 35% of slots wasted
• Floating point: 25% of slots wasted

Performance of Branch schemes?
\[ \text{Effective CPI} = 1 + \%\text{branches} \times \text{average branch penalty} \]
• For integer MIPS: 20% of instructions are branches or jumps, and 70% of them go to the target

Strategy            Branch taken penalty   Branch not taken penalty   Effective CPI
Stall               3                      3                          1.60
Branch in ID        1                      1                          1.20
Predict taken       1                      1                          1.20
Predict not taken   1                      0                          1.14
Delay slot          0.5                    0.5                        1.10
Cancel branch       0.3                    0.3                        1.06

Pipeline example
Consider the following pipeline, which implements a MIPS-like ISA. The only variation on the MIPS ISA is the support of full register compares in branch instructions.

Instruction  1   2   3    4         5         6         7         8         9         10    11
I            IF  ID  EX1  EX2/MEM1  MEM2      WB
I+1              IF  ID   EX1       EX2/MEM1  MEM2      WB
I+2                  IF   ID        EX1       EX2/MEM1  MEM2      WB
I+3                       IF        ID        EX1       EX2/MEM1  MEM2      WB
I+4                                 IF        ID        EX1       EX2/MEM1  MEM2      WB
I+5                                           IF        ID        EX1       EX2/MEM1  MEM2  WB

The Pipeline stages

Stage      Function
IF         Instruction fetch
ID         Instruction decode. Register fetch
EX1        Address generation (data and PC-target)
EX2/MEM1   ALU operation. Branch condition resolution. First cycle of memory access
MEM2       Second cycle of memory access
WB         Register file writeback

Assumptions
• Writes to the register file occur in the first half of the clock cycle, while reads from the register file occur in the second half
• All bypass paths have been implemented to minimize pipeline stalls due to data hazards
• The pipeline implements hardware interlocks

Questions
• How many register file ports does the processor need to minimize structural hazards?
• Indicate all forwarding required to minimize stalls in the given pipeline. Also, specify the minimum number of comparators needed to implement forwarding.
• What is the worst case delay due to RAW data hazards?
• What is the branch delay of this pipeline?

Instruction Dependencies
The frequencies in the table are presented as percentages of all instructions executed.

Type  Instruction sequence                                          Frequency
1     ALUop Rx,-,-    then  ALUop -,-,Rx or ALUop -,Rx,-            10%
2     ALUop Rx,-,-    then  Store Rx,-(-)                           5%
3     ALUop Rx,-,-    then  Load -,-(Rx) or Store -,-(Rx)           5%
4     ALUop Rx,-,-    then  JumpRegister Rx                         1%
5     ALUop Rx,-,-    then  Branch Rx,-,# or Branch -,Rx,#          2%
6     Load Rx,-(-)    then  ALUop -,-,Rx or ALUop -,Rx,-            15%
7     Load Rx,-(-)    then  Load -,-(Rx) or Store -,-(Rx)           3%
8     Load Rx,-(-)    then  Branch Rx,-,# or Branch -,Rx,#          2%
9     Load Rx,-(-)    then  JumpRegister Rx                         1%

More Questions
• List the instruction sequences from the previous table that cause data stalls in the pipeline. Indicate the corresponding number of stall cycles.
• Compute the CPI for the pipeline due to data hazards only. Ignore instruction sequences that are not listed in the table.
• If the frequency of conditional branches is 10%, of which 65% are taken, and the frequency of unconditional branches is 6%, compute the overall CPI assuming a TAKEN branch prediction scheme.
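One way to approach the RAW-delay questions mechanically is to compare the stage that produces a value with the stage that consumes it. The sketch below is a hedged reading of the "Pipeline stages" table, not an official solution: it assumes full forwarding, that ALU results exist at the end of EX2/MEM1, that load data exists at the end of MEM2, and that operands are needed at the start of the stage named in `NEEDED_AT`. The dictionaries and the function name are hypothetical. Store data and JumpRegister operands are left out because the slides do not say in which stage they are consumed.

```python
# Stage order taken from the "Pipeline stages" table.
STAGES = ["IF", "ID", "EX1", "EX2/MEM1", "MEM2", "WB"]

# Assumption: stage at the END of which each kind of result can be bypassed.
PRODUCED_AT = {"ALUop": "EX2/MEM1", "Load": "MEM2"}

# Assumption: stage at the START of which each kind of operand must be present.
NEEDED_AT = {
    "ALU operand": "EX2/MEM1",     # ALU operation happens in EX2/MEM1
    "address base": "EX1",         # address generation happens in EX1
    "branch operand": "EX2/MEM1",  # branch condition is resolved in EX2/MEM1
}


def raw_stalls(producer: str, use: str, distance: int = 1) -> int:
    """Stall cycles when the consumer issues `distance` instructions after the producer."""
    produced = STAGES.index(PRODUCED_AT[producer])
    needed = STAGES.index(NEEDED_AT[use])
    # Each instruction of separation hides one cycle of the producer's latency.
    return max(0, produced - needed - (distance - 1))


for producer in PRODUCED_AT:
    for use in NEEDED_AT:
        print(f"{producer:5s} -> {use:14s}: {raw_stalls(producer, use)} stall(s)")
```

Under these assumptions the largest value, 2 cycles, occurs when a load feeds the address calculation of the next instruction. Multiplying each stall count by the matching frequency from the dependency table and adding 1 then gives the data-hazard CPI.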
Summary
• Pipelining: overlaps execution of instructions
• Problem: structural, data, and control hazards
  • Hazards occur if there are dependences and the pipeline exposes them
  • Common solutions: stall, forwarding, scheduling
• Performance
  • Improves instruction throughput, which reduces the latency of a long program
  \[ \mathrm{CPI}_{\text{real}} = \mathrm{CPI}_{\text{ideal}} + \mathrm{Stalls}_{\text{structural}} + \mathrm{Stalls}_{\text{data}} + \mathrm{Stalls}_{\text{control}} \]
  \[ \text{Cycle time}_{\text{real}} = \mathrm{Time}_{\text{longest pipestage}} + \text{register overhead} \]
• What makes pipelining easier
  • Simple instructions (load-stores, branches)
  • Fixed length, encoding with few formats
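As a check on the control-stall term of the CPI formula, the "Effective CPI" column of the table on the "Performance of Branch schemes?" slide can be recomputed from that slide's numbers (20% branches or jumps, 70% of them taken). This is a minimal sketch: the function name and the `SCHEMES` dictionary are illustrative, and the ideal CPI of 1 is the slide's own assumption.

```python
def effective_cpi(branch_fraction: float,
                  taken_fraction: float,
                  taken_penalty: float,
                  not_taken_penalty: float,
                  ideal_cpi: float = 1.0) -> float:
    """Effective CPI = ideal CPI + branch fraction * average branch penalty."""
    average_penalty = (taken_fraction * taken_penalty
                       + (1.0 - taken_fraction) * not_taken_penalty)
    return ideal_cpi + branch_fraction * average_penalty


# (taken penalty, not-taken penalty) per strategy, copied from the slide's table.
SCHEMES = {
    "Stall": (3, 3),
    "Branch in ID": (1, 1),
    "Predict taken": (1, 1),
    "Predict not taken": (1, 0),
    "Delay slot": (0.5, 0.5),
    "Cancel branch": (0.3, 0.3),
}

for name, (taken, not_taken) in SCHEMES.items():
    cpi = effective_cpi(0.20, 0.70, taken, not_taken)
    print(f"{name:18s} {cpi:.2f}")
```

The same function could be reused for the "More Questions" exercise by supplying that pipeline's own branch frequencies and penalties.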