CS2100 Computer Organisation Pipelining (AY2015/6) Semester 1 Road Map: Part II Performance Assembly Language Processor: Datapath Processor: Control Pipelining Cache Pipelining Pipelining MIPS Pipeline Pipeline datapath & control Pipeline hazards Branch prediction 2 Pipelining: Overview MIPS Pipeline Stages Pipelined Datapath and Control Pipeline Hazards Structural Data Control Forwarding Branch prediction CS2100 Pipelining 3 Inspiration: Assembly Line CS2100 Pipelining 4 The Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold and stash Washer takes 30 minutes Dryer takes 30 minutes “Folder” takes 30 minutes “Stasher” takes 30 minutes to put clothes into drawers CS2100 Pipelining A B C D 5 Sequential Laundry 6 PM T a s k O r d e r 7 8 10 9 11 12 1 2 AM 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 A Time B C D Sequential laundry takes 8 hours for 4 loads Steady state: 1 load every 2 hours If they learned pipelining, how long would laundry take? CS2100 Pipelining 6 Pipelined Laundry 6 PM T a s k O r d e r 7 8 10 9 30 30 30 30 30 30 30 11 12 1 2 AM Time A B C D Pipelined laundry takes 3.5 hours for 4 loads! Steady state: 1 load every 30 min Potential speedup = 2 hr/30 min = 4 (# stages) Time to fill pipeline takes 2 hours speedup CS2100 Pipelining 7 What IF: Slow Dryer 6 PM T a s k 7 8 9 10 11 12 30 30 30 30 30 30 30 30 30 30 30 1 2 AM Time A B O r d e r C D Pipelined laundry now takes 5.5 hours! Steady state: One load every 1 hr (dryer speed) Pipeline rate is limited by the slowest stage CS2100 Pipelining 8 What IF: Dependency 6 PM T a s k 7 8 10 9 30 30 30 30 30 30 30 30 11 12 1 2 AM Time A Dependency B O r d e r Stall C D Brian is using the laundry for the first time; he wants to see the outcome of one wash + dry cycle first before putting in his clothes Pipelined laundry now takes 4 hours CS2100 Pipelining 9 Pipelining Lessons Pipelining doesn’t help latency of single task: It helps the throughput of entire workload Multiple tasks operating simultaneously using different resources Possible delays: Pipeline rate limited by slowest pipeline stage Stall for dependencies ( more on this later ) CS2100 Pipelining 10 Datapath and Control MIPS PIPELINE CS2100 Pipelining 11 MIPS Pipeline Stages (1/2) Five Execution Stages Idea: IF: Instruction Fetch ID: Instruction Decode and Register Read EX: Execute an operation or calculate an address MEM: Access an operand in data memory WB: Write back the result into a register Each execution stage takes 1 clock cycle General flow of data is from one stage to the next Exceptions: CS2100 Update of PC and write back of register file – more about this later… Pipelining 12 MIPS Pipeline Stages (2/2) IF: Instruction fetch ID: Instruction decode/ register file read EX: Execute/ address calculation MEM: Memory access WB: Write back 0 M u x 1 Add 4 Add Add result Shift left 2 PC Read register 1 Address Instruction Instruction memory Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 CS2100 Sign extend Read data 1 M u x 0 32 Pipelining 13 Pipelined Execution: Illustration Time IF ID EX MEM IF ID EX MEM IF ID EX MEM IF ID EX MEM IF ID EX MEM IF ID EX Program Flow (Instructions) WB WB WB WB WB MEM WB Several instructions are in the pipeline simultaneously! CS2100 Pipelining 14 MIPS Pipeline: Datapath(1/3) Single-cycle implementation: Update all state elements (PC, register file, data memory) at the end of a clock cycle Pipelined implementation: CS2100 One cycle per pipeline stage Data required for each stage needs to be stored separately (why?) Pipelining 15 MIPS Pipeline: Datapath(2/3) Data used by subsequent instructions: Data used by same instruction in later pipeline stages: Store in programmer-visible state elements: PC, register file and memory Additional registers in datapath called pipeline registers IF/ID: register between IF and ID ID/EX: register between ID and EX EX/MEM: register between EX and MEM MEM/WB: register between MEM and WB Why no register at the end of WB stage? CS2100 Pipelining 16 MIPS Pipeline: Datapath (3/3) 0 M u x 1 Coloured rectangles are pipeline registers IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 PC Address Instruction memory Instruction Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 CS2100 Sign extend Read data 1 M u x 0 32 Pipelining 17 Pipeline Datapath: IF Stage At the end of a cycle, IF/ID receives (stores): Instruction read from DataMemory[ PC ] PC + 4 PC + 4 CS2100 Also connected to one of the MUX's inputs (another coming later) Pipelining 18 Pipeline Datapath: ID Stage At the beginning of a cycle IF/ID register supplies: At the end of a cycle ID/EX receives: Register numbers for reading two registers 16-bit offset to be signextended to 32-bit CS2100 Pipelining Data values read from register file 32-bit immediate value PC + 4 19 Pipeline Datapath: EX Stage At the beginning of a cycle ID/EX register supplies: At the end of a cycle EX/MEM receives: Data values read from register file 32-bit immediate value PC + 4 CS2100 Pipelining (PC + 4) + (Immediates x 4) ALU result isZero? signal Data Read 2 from register file 20 Pipeline Datapath: MEM Stage At the beginning of a cycle EX/MEM register supplies: CS2100 At the end of a cycle MEM/WB receives: ALU result Memory read data (PC + 4) + (Immediates x 4) ALU result isZero? signal Data Read 2 from register file Pipelining 21 Pipeline Datapath: WB Stage At the beginning of a cycle MEM/WB register supplies: At the end of a cycle ALU result Memory read data CS2100 Result is written back to register file (if applicable) There is a bug here……. Pipelining 22 Corrected Datapath (1/2) Observe the “Write register” number Supplied by the IF/ID pipeline register It is NOT the correct write register for the instruction now in WB stage! Solution: CS2100 Pass “Write register” number from ID/EX through EX/MEM to MEM/WB pipeline register for use in WB stage i.e. let the "Write register" number follows the instruction through the pipeline until it is needed in WB stage Pipelining 23 Corrected Datapath (2/2) 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 PC Address Instruction memory Instruction Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 CS2100 Sign extend Read data 1 M u x 0 32 Pipelining 24 Pipeline Control: Main Idea Same control signals as single-cycle datapath Difference: Each control signal belongs to a particular pipeline stage PCSrc 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add result Add 4 Branch Shift left 2 PC Address Instruction memory Instruction RegWrite Read register 1 MemWrite Read data 1 Read register 2 Registers Read Write data 2 register Write data ALUSrc Zero Zero ALU ALU result 0 M u x 1 MemtoReg Address Data memory Write Read data 1 M u x 0 data Instruction 16 [15– 0] Sign extend 32 6 ALU control MemRead Instruction [20– 16] Instruction [15– 11] 0 M u x 1 ALUOp RegDst CS2100 Pipelining 25 Pipeline Control: Try it! PCSrc 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add result Add 4 Branch Shift left 2 PC Address Instruction memory Instruction RegWrite Read register 1 MemWrite Read data 1 Read register 2 Registers Read Write data 2 register Write data ALUSrc Zero Zero ALU ALU result 0 M u x 1 MemtoReg Address Data memory Write Try associating each control signal with the correct pipeline stage. CS2100 Read data 1 M u x 0 data Instruction 16 [15– 0] Sign extend 32 6 ALU control MemRead Instruction [20– 16] 0 M u x 1 Instruction [15– 11] ALUOp RegDst Pipelining 26 Pipeline Control: Grouping Group control signals according to pipeline stage RegDst ALUSrc MemTo Reg Reg Write Mem Read Mem Write Branch R-type 1 0 0 1 0 0 lw 0 1 1 1 1 sw X 1 X 0 beq X 0 X 0 op1 op0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 EX Stage MEM Stage ALUop ALUop WB Stage op1 op0 Mem Read 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 sw X 1 0 0 0 1 0 X 0 beq X 0 0 1 0 0 1 X 0 RegDst ALUSrc R-type 1 lw CS2100 Pipelining Mem Write Branch MemTo Reg Reg Write 27 Pipeline Control: Implementation Instruction Control WB Signal used in EX stage and onwards M WB EX M Signal propagated along until utilized Signal generated in ID stage IF/ID CS2100 WB ID/EX EX/MEM Pipelining MEM/WB 28 MIPS Pipeline: Datapath and Control PCSrc ID/EX 0 M u x 1 WB Control IF/ID EX/MEM M WB EX M MEM/WB WB Add Add Add result Instruction memory ALUSrc Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data Zero ALU ALU result 0 M u x 1 MemtoReg Address Branch Shift left 2 MemWrite PC Instruction RegWrite 4 Address Data memory Read data Write data Instruction 16 [15– 0] Instruction [20– 16] Instruction [15– 11] Sign extend 32 6 ALU control 0 M u x 1 1 M u x 0 MemRead ALUOp RegDst CS2100 Pipelining 29 Exercise #1: Pipeline Control Signals Indicate the control signals utilized in each cycle for the following sequence of instructions: add, sub, lw, sw Cycle RegDst ALUSrc ALUOp1 ALUOp0 1 2 3 1 0 1 0 1 0 1 0 4 5 6 7 MemRead MemWrite 0 0 MemToReg RegWrite 8 CS2100 Pipelining 30 PIPELINE PERFORMANCE CS2100 Pipelining 31 Single Cycle Processor: Performance Cycle time: 𝐶𝑇𝑠𝑒𝑞 = 𝑁 𝑘=1 𝑇𝑘 𝑇𝑘 = Time for operation in stage k N = Number of stages Total Execution Time for I instructions: 𝑇𝑖𝑚𝑒𝑠𝑒𝑞 = Cycles × 𝐶𝑦𝑐𝑙𝑒Time = I × 𝐶𝑇𝑠𝑒𝑞 = 𝑰 × CS2100 Pipelining 𝑵 𝒌=𝟏 𝑻𝒌 32 Single Cycle Processor: Example Instruction Inst Mem Reg read ALU ALU 2 1 2 lw 2 1 2 2 sw 2 1 2 2 beq 2 1 2 Reg write Total 1 6 1 8 7 5 Cycle Time Data Mem Choose the longest total time = 8ns To execute 100 instructions: CS2100 100 x 8ns = 800ns Pipelining 33 Multi-Cycle Processor: Performance Cycle time: 𝐶𝑇𝑚𝑢𝑙𝑡𝑖 =max 𝑇𝑘 max 𝑇𝑘 = longest stage duration among the N stages Total Execution Time for I instructions: 𝑇𝑖𝑚𝑒𝑚𝑢𝑙𝑡𝑖 = Cycles × 𝐶𝑦𝑐𝑙𝑒Time = I × 𝑨𝒗𝒆𝒓𝒂𝒈𝒆 𝑪𝑷𝑰 × 𝐶𝑇𝑚𝑢𝑙𝑡𝑖 CS2100 Average CPI is needed because each instruction take different number of cycles to finish Pipelining 34 Multi-Cycle Processor: Example Instruction Inst Mem Reg read ALU ALU 2 1 2 lw 2 1 2 2 sw 2 1 2 2 beq 2 1 2 Reg write Total 1 6 1 8 7 5 Cycle Time: Data Mem Choose the longest stage time = 2ns To execute 100 instructions, with a given average CPI of 4.6 CS2100 100 x 4.6 x 2ns = 920ns Pipelining 35 Pipeline Processor: Performance Cycle Time: 𝐶𝑇𝑝𝑖𝑝𝑒 = max 𝑇𝑘 + 𝑇𝑑 max 𝑇𝑘 = longest time among the N stages 𝑇𝑑 = Overhead for pipelining, e.g. pipeline register Cycles needed for I instructions: I+N−1 N − 1 is the cycles wasted in filling up the pipeline Total Time needed for I instructions : CS2100 𝑇𝑖𝑚𝑒𝑝𝑖𝑝𝑒 = 𝐶𝑦𝑐𝑙𝑒 × 𝐶𝑇𝑝𝑖𝑝𝑒 = I + N − 1 × ( max 𝑇𝑘 + 𝑇𝑑 ) Pipelining 36 Pipeline Processor: Example Instruction Inst Mem Reg read Data Mem ALU ALU 2 1 2 lw 2 1 2 2 sw 2 1 2 2 beq 2 1 2 Reg write Total 1 6 1 8 7 5 Cycle Time assume pipeline register delay of 0.5ns longest stage time + overhead = 2 + 0.5 = 2.5ns To execute 100 instructions: CS2100 (100 + 5 – 1) x 2.5ns = 260ns Pipelining 37 Pipeline Processor: Ideal Speedup Assumptions for ideal case: Every stage takes the same amount of time: 𝑁 𝑘=1 𝑇𝑘 = 𝑁 × 𝑇𝑘 No pipeline overhead 𝑇𝑑 = 0 Number of instructions I, is much larger than number of stages, N Note: The above also shows how pipeline processor loses performance CS2100 Pipelining 38 Pipeline Processor: Ideal Speedup 𝑆𝑝𝑒𝑒𝑑𝑢𝑝𝑝𝑖𝑝𝑒 = 𝑇𝑖𝑚𝑒𝑝𝑖𝑝𝑒 I× 𝑁 𝑘=1 𝑇𝑘 I+N−1 ×( max 𝑇𝑘 + 𝑇𝑑 ) = = 𝐼×𝑁×𝑇𝑘 𝐼+𝑁−1 ×𝑇𝑘 ≈ 𝐼×𝑁×𝑇𝑘 𝐼×𝑇𝑘 ≈𝑵 CS2100 𝑇𝑖𝑚𝑒𝑠𝑒𝑞 Conclusion: Pipeline processor can gain N times speedup, where N is the number of pipeline stages Pipelining 39 Summary for different implementations Single-Cycle Cycle 1 Cycle 2 Clk Load Store Waste Multi-Cycle Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Load IF Store ID EX MEM WB EX MEM WB IF R-type ID EX MEM IF Pipeline Load IF Store ID IF ID R-type IF CS2100 EX ID MEM EX WB MEM Pipelining WB 40 Review Question Given this code: add sub sll srl $t0, $t1, $t2, $t3, $s0, $s0, $s0, $s1, $s1 $s1 2 2 a) How many cycles will it take to execute the code on a single-cycle datapath? b) How long will it take to execute the code on a singlecycle datapath, assuming a 100 MHz clock? c) How many cycles will it take to execute the code on a 5-stage MIPS pipeline? d) How long will it take to execute the code on a 5-stage MIPS pipeline, assuming a 500 MHz clock? CS2100 Pipelining 41 Grinding to a halt….. PIPELINE HAZARDS CS2100 Pipelining 42 Pipeline Hazards Speedup from pipeline implementation: Based on the assumption that a new instructions can be “pumped” into pipeline every cycle However, there are pipeline hazards Problems that prevent next instruction from immediately following previous instruction Structural hazards: Data hazards: Data dependencies between instructions Control hazards: CS2100 Simultaneous use of a hardware resource Instruction Dependencies Change in program flow Pipelining 43 Graphical Notation for Pipeline Time (in clock cycles) Program execution order (in instructions) lw $10, 20($1) sub $11, $2, $3 CC 1 CC 2 CC 3 CC 4 CC 5 IM Reg ALU DM Reg IM Reg ALU DM CC 6 Reg Horizontal = the stages of an instruction Vertical = the instructions in different pipeline stages CS2100 Pipelining 44 Structure Hazard: Example If there is only a single memory module: Time (clock cycles) Instruction Order CS2100 Reg Reg Mem Reg Mem Reg Mem Reg Mem Reg ALU Inst 3 Mem Mem ALU Inst 2 Reg ALU Inst 1 Mem ALU Load Load and Inst 3 access memory in the same cycle! Mem Pipelining Reg 45 Solution 1: Stall the Pipeline Time (clock cycles) Instruction Order Stall Mem Mem Reg Reg Mem Reg Mem Reg ALU Inst 2 Reg ALU Inst 1 Mem ALU Load Mem Reg Bubble Bubble Bubble Bubble CS2100 Pipelining Reg ALU Mem Inst 3 Delay (Stall) Inst 3 for 1 cycle Bubble Mem Reg 46 Solution 2: Separate Memory Split memory into: Data and Instruction memory Time (clock cycles) Instruction Order Reg Reg Mem Reg Mem Reg Mem Reg Mem Reg ALU Inst 3 Mem Mem ALU Inst 2 Reg ALU Inst 1 Mem ALU Load Mem Load uses Data Memory Reg Inst 3 uses Instr. Memory CS2100 Pipelining 47 Quiz: Is there another conflict? Time (clock cycles) Instruction Order CS2100 Mem Reg Mem Reg Mem Reg Reg Mem Mem Reg ALU Inst 3 Mem ALU Inst 2 Reg ALU Inst 1 Mem ALU Inst 0 Pipelining Reg Mem Reg 48 Instruction Dependencies Instructions can have relationship that prevent pipeline execution: When different instructions accesses (read/write) the same register Register contention is the cause of dependency Known as data dependency When the execution of an instruction depends on another instruction Although a partial overlap maybe possible in some cases Control flow is the cause of dependency Known as control dependency Failure to handle dependencies can affect program correctness! CS2100 Pipelining 49 Data Dependency: RAW "Read-After-Write" Definition: Occurs when a later instruction reads from the destination register written by an earlier instruction Also known as true data dependency i1: add $1, $2, $3 #writes to $1 i2: sub $4, $1, $5 #reads from $1 Effect of incorrect execution: CS2100 If i2 reads register $1 before i1 can write back the result, i2 will get a stale result (old result) Pipelining 50 Other Data Dependencies Similarly, we have: WAR: Write-after-Read dependency WAW: Write-after-Write dependency Fortunately, these dependencies do not cause any pipeline hazards They affects the processor only when instructions are executed out of program order: CS2100 i.e. in Modern SuperScalar Processor Pipelining 51 RAW Dependency: Hazards? Suppose we are executing the following code fragment: sub and or add sw $2, $1, $3 $12, $2, $5 $13, $6, $2 $14, $2, $2 $15, 100($2) #i1 #i2 #i3 #i4 #i5 Note the multiple uses of register $2 Question: CS2100 Which are the instructions require special handling? Pipelining 52 RAW Data Hazards: Illustration Value from prior instruction is needed before write back T im e (in clock cycle s) CC 1 V alue of re giste r $ 2 : 10 sub $ 2 , $ 1 , $ 3 IM CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 10 10 1 0 /– 2 0 – 20 – 20 – 20 – 20 DM Reg Reg Data dependency a nd $ 1 2 , $ 2 , $ 5 or $ 1 3 , $6 , $ 2 a dd $ 1 4, $ 2 , $ 2 IM DM R eg IM sw $ 1 5, 1 0 0 ($ 2 ) CS2100 DM R eg IM No problem R eg DM Reg IM Pipelining R eg R eg R eg DM Reg 53 Observations Questions: When is the result from sub instruction actually produced? When is the data actually needed by and? Beginning of and’s EX stage or clock cycle 4 When is the data actually needed by or? End of EX stage for sub or clock cycle 3 Beginning of or’s EX stage or clock cycle 5 Solution: Forward the result to any trailing (later) instructions before it is reflected in register file Bypass (replace) the data read from register file CS2100 Pipelining 54 Solution: Forwarding Time (in clock cycles) CC 1 Value of register $2 : 10 Value of EX/MEM : X Value of MEM/WB : X sub $2, $1, $3 IM CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 X X 10 X X 10 – 20 X 10/– 20 X – 20 – 20 X X – 20 X X – 20 X X – 20 X X DM Reg Reg and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) CS2100 IM Reg IM DM Reg IM Reg DM IM Pipelining Reg DM Reg Reg Forward results from one stage to another Bypass data read from register file Reg DM Reg 55 Data Hazard: LOAD Instruction T im e (in cloc k cycle s ) lw $ 2, 20 ($ 1) an d $4 , $ 2, $5 or $8 , $ 2, $6 ad d $9 , $ 4, $2 slt $ 1, $6 , $ 7 CS2100 CC 1 CC 2 IM R eg IM CC 3 CC 4 CC 5 DM R eg R eg IM DM R eg IM CC 6 Reg DM R eg IM Pipelining CC 7 CC 9 Cannot solve with forwarding! Data is needed before it is actually produced! Reg DM Reg CC 8 Reg DM Re g 56 Data Hazard: LOAD Instruction Solution Program Time(inclock cycles) execution CC1 CC2 order (ininstructions) lw$2, 20($1) and$4, $2, $5 or $8, $2, $6 add$9, $4, $2 IM CC3 Reg IM CC4 CC5 DM Reg Reg Bubble IM Bubble Reg IM CC8 CC9 CC10 Reg DM Reg DM Reg IM Pipelining CC7 Stall the pipeline! DM slt $1, $6, $7 CS2100 CC6 Reg Reg DM Reg 57 Exercise #2: Without Forwarding How many cycles will it take to execute the following code on a 5-stage pipeline without forwarding? sub and or add sw CS2100 $2, $12, $13, $14, $15, $1, $3 $2, $5 $6, $2 $2, $2 100($2) Pipelining 58 Exercise #2: Without Forwarding sub 1 2 3 4 5 6 IF ID EX MEM WB 7 8 9 10 11 and or add sw CS2100 Pipelining 59 Exercise #3: Without Forwarding How many cycles will it take to execute the following code on a 5-stage pipeline without forwarding? lw and or add sw CS2100 $2, $12, $13, $14, $15, 20($3) $2, $5 $6, $2 $2, $2 100($2) Pipelining 60 Exercise #3: Without Forwarding lw 1 2 3 4 5 IF ID EX MEM WB 6 7 8 9 10 11 and or add sw CS2100 Pipelining 61 Exercise #4: With Forwarding How many cycles will it take to execute the following code on a 5-stage pipeline with forwarding? sub and or add sw CS2100 $2, $12, $13, $14, $15, $1, $3 $2, $5 $6, $2 $2, $2 100($2) Pipelining 62 Exercise #4: With Forwarding sub 1 2 3 4 5 6 IF ID EX MEM WB 7 8 9 10 11 and or add sw CS2100 Pipelining 63 Exercise #5: With Forwarding How many cycles will it take to execute the following code on a 5-stage pipeline with forwarding? lw and or add sw CS2100 $2, $12, $13, $14, $15, 20($3) $2, $5 $6, $2 $2, $2 100($2) Pipelining 64 Exercise #5: With Forwarding lw 1 2 3 4 5 IF ID EX MEM WB 6 7 8 9 10 11 and or add sw CS2100 Pipelining 65 CONTROL DEPENDENCY CS2100 Pipelining 66 Control Dependency Definition: An instruction j is control dependent on i if i controls whether or not j executes Typically i would be a branch instruction Example: i1: beq $3, $5, label i2: add $1, $2, $4 ... ... ... # Branch # depends on i1 Effect of incorrect execution: CS2100 If i3 is allowed to execute before i2 is determined, register $1 maybe incorrectly changed! Pipelining 67 Control Dependency: Example Let us turn to a code fragment with a conditional branch: r1 r3 40 44 48 52 .. 72 beq $1, $3, 7 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 ......... lw $4, 5($7) r1 = r3 How does the code affect a pipeline processor? CS2100 Pipelining 68 Pipeline Execution: IF Stage Read instruction from memory using the address in PC and put it in IF/ID register PC address is incremented by 4 and then written back to the PC for next instruction CS2100 Pipelining 69 Control Dependency: Why? PCSrc ID/EX 0 M u x 1 WB Control IF/ID EX/MEM M WB EX M MEM/WB WB Add Add Add result Instruction memory ALUSrc Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data Zero ALU ALU result 0 M u x 1 MemtoReg Address Branch Shift left 2 MemWrite PC Instruction RegWrite 4 Address Data memory Read data Write data Instruction 16 [15– 0] Decision is made in MEM stage: Too late! Instruction [20– 16] Instruction [15– 11] Sign extend 32 6 ALU control 0 M u x 1 1 M u x 0 MemRead ALUOp RegDst CS2100 Pipelining 70 Control Dependency: Example Time (in clock cycles) Program execution CC 1 CC 2 order (in instructions) 40 beq $1, $3, 7 44 and $12, $2, $5 48 or $13, $6, $2 52 add $14, $2, $2 IM CC 3 Reg IM CC 4 DM Reg IM Reg IM CC 7 CC 8 CC 9 Wrong Execution! Reg DM Reg IM Pipelining CC 6 Reg DM 72 lw $4, 50($7) CS2100 CC 5 Reg DM Reg Reg DM Reg 71 Control Hazards: Stall Pipeline? Program execution order (in instructions) Time (in clock cycles) 40 beq $1, $3, 7 CC 1 CC 2 IM Reg Bubble CC 3 CC 4 CC 5 DM Reg CC 6 Bubble Bubble Bubble Bubble Bubble Bubble Bubble Bubble Bubble Bubble IM 72 lw $4, 50($7) Bubble Reg CC 7 CC 8 CC 9 Bubble Bubble Bubble DM Reg Wait until the branch outcome is known and then fetch the correct instructions Introduces 3 clock cycles delay CS2100 Pipelining 72 Control Hazards: Reducing the Penalty Branching is very common in code: A 3 cycles stall penalty is too heavy! Many techniques invented to reduce the control hazard penalty: Move branch decision calculation to earlier pipeline stage Guess the outcome before it is produced Branch Prediction Do something useful while waiting for the outcome CS2100 Early Branch Resolution Delayed Branching Pipelining 73 Reduce Stalls: Early Branch(1/3) Make decision in ID stage instead of MEM Move branch target address calculation Move register comparison cannot use ALU for register comparison any more Branch target address calculation Instruction 16 Add Shift left 2 Read Read register 1 data 1 Read register 2 Write Registers register Read Write data 2 data RegWrite CS2100 PC + 4 Sign extend 4 Target address ALUcontrol Zero ALU ALU result To branch Control logic Register Comparison 32 Pipelining 74 Reduce Stalls: Early Branch(2/3) Register comparison moved to ID stage CS2100 Pipelining 75 Reduce Stalls: Early Branch(3/3) Program execution order (in instructions) Time (in clock cycles) 40 beq $1, $3, 7 CC 1 CC 2 IM Reg Bubble 72 lw $4, 50($7) CS2100 Bubble IM CC 4 CC 5 DM Reg Bubble Bubble Reg CC 6 CC 7 CC 8 CC 9 Bubble DM Reg Wait until the branch decision is known: CC 3 Then fetch the correct instructions Reduced to 1 clock cycle delay Pipelining 76 Early Branch: Problems(1/3) However, if the register(s) involved in the comparison is produced by preceding instruction: Further stall is still needed! Time (in clock cycles) Program execution order (in instructions) add $s0, $s1, $s2 beq $s0, $s3, Exit CC 1 CC 2 IM Reg IM CC 3 CC 5 DM Reg Reg Bubble CS2100 CC 4 DM IM Pipelining Reg CC 6 CC 7 CC 8 CC 9 $s0 is needed before it is produced! Reg DM Reg 77 Early Branch: Problems(2/3) Solution: Add forwarding path from ALU to ID stage One clock cycle delay is still needed Time (in clock cycles) Program execution order (in instructions) add $s0, $s1, $s2 beq $s0, $s3, Exit CC 1 CC 2 IM Reg IM CC 3 Bubble CC 4 CC 5 DM Reg Reg Bubble CS2100 Pipelining IM CC 6 CC 7 DM Reg Reg CC 8 CC 9 DM Reg 78 Early Branch: Problems(3/3) Problem is worse with load followed by branch Solution: MEM to ID forwarding and 2 more stall cycles! In this case, we ended up with 3 total stall cycles no improvement! Time (in clock cycles) Program execution order (in instructions) lw $s0, 0($s1) beq $s0, $s3, Exit CC 1 IM CC 2 CC 3 Reg IM CC 4 DM Bubble Bubble CC 5 Pipelining CC 7 CC 8 CC 9 ALU ID forwarding cannot help! Reg Reg Bubble CS2100 CC 6 DM IM Reg Reg DM Reg 79 Reduce Stalls: Branch Prediction There are many branch prediction schemes Simple prediction: We only cover the simplest in this course All branches are assumed to be not taken Fetch the successor instruction and start pumping it through the pipeline stages When the actual branch outcome is known: CS2100 Not taken: Guessed correctly No pipeline stall Taken: Guessed wrongly Wrong instructions in the pipeline Flush successor instruction from the pipeline Pipelining 80 Branch Prediction: Correct Prediction Time (in clock cycles) Program execution order (in instructions) 40 beq $1, $3, 7 44 and $12, $2, $5 48 or $13, $6, $2 CC 1 CC 2 IM Reg IM CC 3 CC 4 CC 5 DM Reg Reg IM DM Reg 52 add $14, $2, $2 CS2100 CC 6 CC 7 CC 8 CC 9 Reg DM Reg Branch is known to be not taken in cycle 3 no stall needed! Pipelining 81 Branch Prediction: Wrong Prediction Time (in clock cycles) Program execution order (in instructions) 40 beq $1, $3, 7 44 and $12, $2, $5 72 lw $4, 50($7) CC 1 CC 2 IM Reg IM CC 3 Bubble IM CC 4 CC 5 DM Reg Bubble Bubble Reg CC 6 CC 7 CC 8 CC 9 Bubble DM Reg Branch is known to be taken in cycle 3 "and" instruction should not be executed Flushed from pipeline CS2100 Pipelining 82 Exercise #6: No Branch Prediction How many cycles will it take to execute the following code on a 5-stage pipeline with forwarding but no branch prediction? Decision making moved to ID stage addi Loop: addi bne sub $s0, $s0, $s0, $t0, $zero, 10 $s0, -1 $zero, Loop $t1, $t2 Total instructions = 1 + 102 + 1 = 22 Ideal pipeline = 4 + 221 = 26 cycles CS2100 Pipelining 83 Exercise #6: No Branch Prediction addi1 addi2 bne addi2 1 2 3 4 5 IF ID EX MEM WB IF ID EX MEM WB ID IF 6 7 8 EX MEM WB IF ID EX 9 MEM 10 11 WB Data dependency between (addi $s0, $s0, -1) and bne incurs 1 cycle of delay. There are 10 iterations, hence 10 cycles of delay. Every bne incurs a cycle of delay to execute the next instruction. There are 10 iterations, hence 10 cycles of delay. Total number of cycles of delay = 20. Total execution cycles = 26 + 20 = 46 cycles. CS2100 Pipelining 84 Exercise #7: Branch Prediction How many cycles will it take to execute the following code on a 5-stage pipeline with forwarding and branch prediction? Loop: addi addi bne sub $s0, $s0, $s0, $t0, $zero, 10 $s0, -1 $zero, Loop $t1, $t2 Total instructions = 1 + 102 + 1 = 22 Ideal pipeline = 4 + 221 = 26 cycles CS2100 Pipelining 85 Exercise #7: Branch Prediction addi1 addi2 bne sub addi2 1 2 3 4 5 IF ID EX MEM WB IF ID EX MEM WB ID IF 6 7 8 EX MEM WB IF ID EX 9 10 11 IF MEM WB Predict not taken. The data dependency remains, hence 10 cycles of delay for 10 iterations. In the first 9 iterations, the branch prediction is wrong, hence 1 cycle of delay. In the last iteration, the branch prediction is correct, hence saving 1 cycle of delay. Total number of cycles of delay = 19. Total execution cycles = 26 + 19 = 45 cycles. CS2100 Pipelining 86 Reduce Stalls: Delayed Branch Observation: Branch outcome takes X number of cycles to be known X cycles stall Idea: Move non-control dependent instructions into the X slots following a branch Known as the branch-delay slot These instructions are executed regardless of the branch outcome In our MIPS processor: CS2100 Branch-Delay slot = 1 (with the early branch) Pipelining 87 Delayed Branch: Example Nondelayed branch or add sub beq xor Exit: Delayed branch $8, $9, $10 $1, $2, $3 $4, $5, $6 $1, $4, Exit $10, $1, $11 add sub beq or xor $1, $2, $3 $4, $5, $6 $1, $4, Exit $8, $9, $10 $10, $1, $11 Exit: The "or" instruction is moved into the delayed slot: Get executed regardless of the branch outcome Same behavior as the original code! CS2100 Pipelining 88 Delayed Branch: Observation Best case scenario There is an instruction preceding the branch which can be moved into delayed slot Program correctness must be preserved! Worst case scenario Such instruction cannot be found Add a no-op (nop) instruction in the branch-delay slot Re-ordering instructions is a common method of program optimization CS2100 Compiler must be smart enough to do this Usually can find such an instruction at least 50% of the time Pipelining 89 For your reference EXPLORATION CS2100 Pipelining 90 Multiple Issue Processors(1/2) Multiple Issue processors Static multiple issue: Multiple instructions in every pipeline stage 4 washer, 4 dryer… EPIC (Explicitly Parallel Instruction Computer) or VLIW (Very Long Instruction Word), e.g. IA64 Compiler specifies the set of instructions that execute together in a given clock cycle Simple hardware, complex compiler Dynamic multiple issue: CS2100 Superscalar processor: Dominant design of modern processors Hardware decides which instructions to execute together Complex hardware, simpler compiler Pipelining 91 Multiple Issue Processors(2/2) A 2-wide superscalar pipeline: CS2100 By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed. Pipelining 92 Summary Pipelining is a fundamental concept in computer systems! Multiple instructions in flight Limited by length of the longest stage Hazards create trouble by stalling pipeline Pentium 4 has 22 pipeline stages! CS2100 Pipelining 93 Reading Assignment 3rd edition Sections 6.1 – 6.3 Sections 6.4 – 6.6 (data hazards and control hazards in details; read for interest; not in syllabus) 4th edition CS2100 Sections 4.5 – 4.6 Sections 4.7 – 4.8 (data hazards and control hazards in details; read for interest; not in syllabus) Pipelining 94 End CS2100 Pipelining 95