COM506 Computer Design
Lecture 2. Pipelining - COMP212 Review -

Prof. Taeweon Suh
Computer Science Education
Korea University

Processor Performance
• Performance of a single-cycle processor is limited by its long critical path delay
  • The critical path limits the operating clock frequency
• Can we do better?
  • New semiconductor technology reduces the critical path delay by manufacturing smaller transistors
    • Core 2 Duo: 65nm technology
    • Core i7 Nehalem: 45nm technology
    • Sandy Bridge: 32nm technology
    • Ivy Bridge: 22nm technology
  • Can we increase processor performance with a different microarchitecture?
    • Yes! Pipelining

Revisiting Performance
• Laundry example
  • Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D are done one after another, each taking 30 + 40 + 20 minutes]
• Response time: 90 mins
• Throughput: 0.67 tasks / hr (= 90 mins/task, 6 hours for 4 loads)

Pipelined Laundry
[Figure: timeline starting at 6 PM; 30 minutes of washing, four back-to-back 40-minute dryer slots, then 20 minutes of folding]
• Pipelining doesn't help the latency (response time) of a single task
• Pipelining helps the throughput of the entire workload
• Multiple tasks operate simultaneously
• Unbalanced lengths of pipeline stages reduce speedup
• Potential speedup = # of pipeline stages
• Response time: 90 mins
• Throughput: 1.14 tasks / hr (= 52.5 mins/task, 3.5 hours for 4 loads)

Pipelining
• Improve performance by increasing instruction throughput

  Stage                         Time
  Instruction Fetch             2ns
  Register File Access (Read)   1ns
  ALU Operation                 2ns
  Data Access                   2ns
  Register Access (Write)       1ns

Sequential Execution
[Figure: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) executed one after another; each takes 8 ns (Instruction fetch, Reg, ALU, Data access, Reg)]

Pipelined Execution
[Figure: the same three lw instructions overlapped; a new instruction starts every 2 ns and every stage occupies a 2 ns slot]

Pipelining (Cont.)
• Multiple instructions are being executed simultaneously
[Figure: successive instructions, each offset by one 2 ns stage, overlapping in the Instruction fetch, Reg, ALU, Data access, and Reg stages]

Pipeline Speedup
• If all stages are balanced (meaning that each stage takes the same amount of time):
  $$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{sequential}}}{\text{Number of stages}}$$
• If not balanced, speedup is less
• Speedup comes from increased throughput (the latency of an instruction does not decrease)

Basic Idea
• IF: Instruction fetch
• ID: Instruction decode / register file read
• EX: Execute / address calculation
• MEM: Memory access
• WB: Write back
[Figure: single-cycle datapath divided into the five stages]
• What do we have to add to actually split the datapath into stages?
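The laundry and lw timing figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, the laundry model assumes a load may wait between machines, and the processor model assumes every stage is given one clock cycle equal to the slowest stage.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task finishes every stage before the next task starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when a stage starts a task as soon as both the task and the stage are free."""
    free_at = [0] * len(stage_times)  # time at which each stage becomes free
    for _ in range(n_tasks):
        ready = 0  # time at which this task leaves the previous stage
        for s, t in enumerate(stage_times):
            start = max(ready, free_at[s])
            ready = free_at[s] = start + t
    return free_at[-1]


def clocked_pipeline_time(stage_times, n_instructions):
    """Total time for a clocked pipeline whose cycle equals the slowest stage."""
    cycle = max(stage_times)
    return (len(stage_times) + n_instructions - 1) * cycle


laundry = [30, 40, 20]        # washer, dryer, folder (minutes)
lw_stages = [2, 1, 2, 2, 1]   # IF, Reg read, ALU, data access, Reg write (ns)

print("laundry:", sequential_time(laundry, 4), "vs", pipelined_time(laundry, 4), "minutes")
print("3 x lw: ", sequential_time(lw_stages, 3), "vs", clocked_pipeline_time(lw_stages, 3), "ns")
```

With four loads the sketch gives 360 minutes sequentially and 210 minutes pipelined, matching the 6 hours and 3.5 hours on the slides. The gain is below 3x because the 40-minute dryer paces the whole pipeline.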
Basic Idea
[Figure: the same five-stage datapath (IF, ID, EX, MEM, WB) with clocked flip-flops (D/Q F/F) inserted between the stages]

Graphically Representing Pipelines
[Figure: lw and add each pass through IF, ID, EX, MEM, WB; add starts one stage later than lw]
• Shading indicates the unit is being used by the instruction
• Shading on the right half of the register file (ID or WB) or memory means the element is being read in that stage
• Shading on the left half means the element is being written in that stage

Pipelined Datapath
[Figure: datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]

lw: Instruction Fetch (IF)
[Figure: pipelined datapath, instruction fetch stage highlighted]

lw: Instruction Decode (ID)
[Figure: pipelined datapath, instruction decode stage highlighted]

lw: Execution (EX)
[Figure: pipelined datapath, execution stage highlighted]

lw: Memory (MEM)
[Figure: pipelined datapath, memory stage highlighted]

lw: Writeback (WB)
[Figure: pipelined datapath, writeback stage highlighted]

sw: Memory (MEM)
[Figure: pipelined datapath, memory stage highlighted]

sw: Writeback (WB): do nothing
[Figure: pipelined datapath, writeback stage highlighted]

Corrected Datapath (for lw)
[Figure: pipelined datapath, corrected for lw]
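The stage-by-stage walk-through above amounts to shifting each instruction one pipeline register to the right on every clock. The sketch below illustrates that idea only; `simulate` and the two-instruction program are hypothetical, and it assumes an ideal pipeline with no hazards or stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def simulate(program):
    """Return one dict per clock cycle mapping stage name -> instruction (or None)."""
    pipeline = [None] * len(STAGES)  # pipeline[i] is the instruction in STAGES[i]
    pending = list(program)
    history = []
    while pending or any(pipeline[:-1]):
        # On each clock edge every instruction moves one stage to the right,
        # and the next instruction (if any) enters IF.
        fetched = pending.pop(0) if pending else None
        pipeline = [fetched] + pipeline[:-1]
        history.append(dict(zip(STAGES, pipeline)))
    return history


for cycle, stages in enumerate(simulate(["lw", "sw"]), start=1):
    occupied = ", ".join(f"{s}: {i}" for s, i in stages.items() if i)
    print(f"cycle {cycle}: {occupied}")
```

An n-instruction program drains after n + 4 cycles in this model, which is why the two-instruction example runs for six cycles.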
Pipelining Example
[Figure: pipelined datapath with five instructions in flight, listed here from the IF stage to the WB stage]
    add $14, $5, $6
    lw  $13, 24($1)
    add $12, $3, $4
    sub $11, $2, $3
    lw  $10, 20($1)

Pipeline Control
• Note that in this implementation, the branch is resolved in the MEM stage
[Figure: pipelined datapath annotated with the control signals PCSrc, Branch, RegWrite, MemWrite, MemRead, ALUSrc, ALUOp, RegDst, and MemtoReg, plus the ALU control unit driven by Instruction [15-0]; Instruction [20-16] and Instruction [15-11] feed the RegDst mux]

Pipeline Control
• What needs to be controlled in each stage (IF, ID, EX, MEM, WB)?
  • IF: Instruction fetch and PC increment
  • ID: Instruction decode and operand fetch from register file and/or immediate
  • EX: Execution stage
    • RegDst
    • ALUOp[1:0]
    • ALUSrc
  • MEM: Memory stage
    • Branch
    • MemRead
    • MemWrite
  • WB: Writeback
    • MemtoReg
    • RegWrite (note that this signal is in ID stage)

Pipeline Control
• Extend pipeline registers to include control information created in ID stage
• Pass control signals along just like the data

               Execution/Address Calculation     Memory access                Write-back
               stage control lines               stage control lines          stage control lines
  Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg
  R-format     1       1       0       0         0       0        0           1         0
  lw           0       0       0       1         0       1        0           1         1
  sw           X       0       0       1         0       0        1           0         X
  beq          X       0       1       0         1       0        0           0         X

[Figure: Control in the ID stage produces the EX, M, and WB fields; ID/EX holds WB, M, EX; EX/MEM holds WB, M; MEM/WB holds WB]

Datapath with Control
[Figure: pipelined datapath with the control unit and the WB, M, and EX fields carried in the ID/EX, EX/MEM, and MEM/WB pipeline registers]

Datapath with Control (cycle-by-cycle walk-through)
[Figures: one slide per clock cycle showing where each instruction is and which control values sit in the pipeline registers. For lw: EX = 0001, M = 010, WB = 11. For the R-format instructions: EX = 1100, M = 000, WB = 10.]

  Instructions: lw $10, 9($1); sub $11, $2, $3; and $12, $4, $5; or $13, $6, $7; add $14, $8, $9

  Cycle  IF    ID    EX    MEM   WB
  1      lw
  2      sub   lw
  3      and   sub   lw
  4      or    and   sub   lw
  5      add   or    and   sub   lw
  6      xxxx  add   or    and   sub
  7      xxxx  xxxx  add   or    and
  8      xxxx  xxxx  xxxx  add   or
  9      xxxx  xxxx  xxxx  xxxx  add

Hazards
• It would be nice if we could split the datapath into stages and the CPU just worked
  • But things are not as simple as you might expect
  • There are hazards!
• A hazard is a situation that prevents starting the next instruction in the next cycle
  • Structural hazard
    • Conflict over the use of a resource at the same time
  • Data hazard
    • Data is not ready for the subsequent dependent instruction
  • Control hazard
    • Fetching the next instruction depends on the previous branch outcome

Structural Hazards
• A structural hazard is a conflict over the use of a resource at the same time
• Suppose the MIPS CPU has a single memory
  • Load/store requires data access in the MEM stage
  • Instruction fetch requires instruction access from the same memory
    • Instruction fetch would have to stall for that cycle
    • This would cause a pipeline "bubble"
• Hence, pipelined datapaths require either separate ports to memory or separate memories for instructions and data
[Figure: a MIPS CPU connected to one memory through a single address bus and data bus, next to a MIPS CPU with two sets of address and data buses]

Structural Hazards (Cont.)
[Figure: lw, add, sub, add each pass through IF, ID, EX, MEM, WB, each starting one stage after the previous instruction]
• Either provide separate ports to access memory, or provide instruction memory and data memory separately

Data Hazards
• Data is not ready for the subsequent dependent instruction
    add $s0,$t0,$t1
    sub $t2,$s0,$t3
[Figure: add passes through IF, ID, EX, MEM, WB; sub is delayed by two bubbles]
• To solve the data hazard problem, the pipeline needs to be stalled (a stall is typically referred to as a "bubble")
• The stall costs performance
• A better solution?
  • Forwarding (or bypassing)

Forwarding
    add $s0,$t0,$t1
    sub $t2,$s0,$t3
[Figure: the same add/sub pair, with the result of add forwarded to sub]

Data Hazard - Load-Use Case
• Can't always avoid stalls by forwarding
  • Can't forward backward in time!
• A hardware interlock is needed for the pipeline stall
    lw  $s0, 8($t1)
    sub $t2,$s0,$t3
[Figure: lw passes through IF, ID, EX, MEM, WB; sub is delayed by one bubble]
• This bubble can be hidden by proper instruction scheduling

Code Scheduling to Avoid Stalls
• Reorder code to avoid using a load result in the next instruction
    A = B + E;  // B is loaded to $t1, E is loaded to $t2
    C = B + F;  // F is loaded to $t4

  Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    add $t3, $t1, $t2    # stall: $t2 is loaded by the previous instruction
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    add $t5, $t1, $t4    # stall: $t4 is loaded by the previous instruction
    sw  $t5, 16($t0)

  Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Data Hazard - Forwarding
• Don't wait for results to be written to the register file
  • Use temporary results

                        CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
  Value of register $2  10    10    10    10    10/-20  -20   -20   -20   -20
  Value of EX/MEM       X     X     X     -20   X       X     X     X     X
  Value of MEM/WB       X     X     X     X     -20     X     X     X     X

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the following instructions in program execution order]
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
• OK. Then, do we have to do this forwarding?
  1. If the write to the register file occurs in the first half of the clock cycle, and the read occurs in the second half, then?
     • Our textbook follows this
  2. If you are asked to design a CPU using only the rising edge of the clock, then?
     • Let's stick to this for our project

Forwarding
[Figure: register file, ID/EX, ALU, EX/MEM, data memory, MEM/WB, and the write-back MUX]

Forwarding (from EX/MEM)
[Figure: the same datapath with MUXes in front of the ALU inputs, fed from EX/MEM]

Forwarding (from MEM/WB)
[Figure: the same datapath with the ALU input MUXes fed from MEM/WB]

Forwarding (operand selection)
[Figure: a forwarding unit drives the select lines of the ALU input MUXes]

Forwarding (operand propagation)
[Figure: Rs, Rt, and Rd are carried through ID/EX; the forwarding unit compares them with EX/MEM Rd and MEM/WB Rd]

Forwarding
[Figure: full pipelined datapath with control. IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are latched into ID/EX as Rs, Rt, and Rd; the forwarding unit also takes EX/MEM.RegisterRd and MEM/WB.RegisterRd]

Can't always forward
• lw (load word) can still cause a hazard
  • An instruction tries to read a register following a load instruction that writes to the same register
[Figure: pipeline diagram over CC 1 to CC 9 for the following instructions in program execution order]
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
• Thus, we need a hazard detection unit to "stall" the pipeline after the load instruction

Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: the same five instructions over CC 1 to CC 10; and repeats its ID stage, or repeats its IF stage, and a bubble is inserted after lw]

Hazard Detection Unit
• Stall the pipeline if the instruction in ID/EX is a load and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)
  • Stall by letting an instruction (that won't write anything) go forward
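The stall condition above, and the operand selection done by the forwarding unit, can be written down as small predicates. This is a minimal sketch with hypothetical names (`Instr`, `load_use_stall`, `forward_select`, `count_cycles`). The forwarding priorities follow the usual textbook rules, which the slides show only as figures, so treat them as an assumption. `count_cycles` assumes full forwarding, so the only stalls are single bubbles between a load and an immediately following consumer.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, or None (e.g. sw)
    srcs: Tuple[str, ...]   # registers read


def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID/EX is a load whose destination (rt)
    is a source register of the instruction in IF/ID."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def forward_select(id_ex_rs, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Pick the source of one ALU operand: 'EX/MEM', 'MEM/WB', or 'REG'.
    Assumed rules: the newer result (EX/MEM) wins, and register 0 is never forwarded."""
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    return "REG"


def count_cycles(program, n_stages=5):
    """Cycles to run `program`: fill time plus one bubble per load-use pair."""
    stalls = sum(
        1 for prev, cur in zip(program, program[1:])
        if prev.op == "lw" and prev.dest in cur.srcs
    )
    return len(program) + (n_stages - 1) + stalls


# The two orderings from the code scheduling slide (A = B + E; C = B + F).
lw1 = Instr("lw", "$t1", ("$t0",))
lw2 = Instr("lw", "$t2", ("$t0",))
lw4 = Instr("lw", "$t4", ("$t0",))
add3 = Instr("add", "$t3", ("$t1", "$t2"))
add5 = Instr("add", "$t5", ("$t1", "$t4"))
sw3 = Instr("sw", None, ("$t3", "$t0"))
sw5 = Instr("sw", None, ("$t5", "$t0"))

original = [lw1, lw2, add3, sw3, lw4, add5, sw5]
reordered = [lw1, lw2, lw4, add3, sw3, add5, sw5]
print(count_cycles(original), count_cycles(reordered))
```

The sketch reproduces the 13 and 11 cycles quoted on the code scheduling slide: seven instructions plus four cycles to drain the pipeline, plus two bubbles in the original order and none after reordering.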
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt, and drives PCWrite, IF/IDWrite, and a MUX that can select 0 in place of the control signals]

Control Hazard
• A branch determines the flow of instructions
• Fetching the next instruction depends on the branch outcome
  • The pipeline can't always fetch the correct instruction
    • The branch instruction is still in the ID stage when the next instruction is fetched
        beq $1,$2,L1
        add $1,$2,$3
        sw  $1, 4($2)
        …
    L1: sub $1,$2,$3
[Figure: beq passes through IF, ID, EX, MEM, WB, annotated with "Taken target address is known here" and "Branch is resolved here"; two bubbles follow before the next instruction is fetched based on the comparison result]

Reducing Control Hazard
• To reduce 2 bubbles to 1 bubble, add hardware in the ID stage to compare registers (and generate the branch condition)
  • But this requires additional forwarding and hazard detection logic. Why?
        beq $1,$2,L1
        add $1,$2,$3
        …
    L1: sub $1,$2,$3
[Figure: the taken target address is known and the branch is resolved earlier in beq; one bubble follows before the next instruction is fetched based on the comparison result]

Delayed Branch
• Many CPUs adopt a technique called the delayed branch to further reduce the stall
  • A delayed branch always executes the next sequential instruction
    • The branch takes place after that one-instruction delay
  • The delay slot is the slot right after a delayed branch instruction
        beq $1,$2,L1
        add $1,$2,$3    (delay slot)
        …
    L1: sub $1,$2,$3
[Figure: beq, the delay-slot add, and the next instruction each pass through IF, ID, EX, MEM, WB with no bubble; the instruction after the delay slot is fetched based on the comparison result]

Delay Slot (Cont.)
• The compiler needs to schedule a useful instruction in the delay slot, or fill it with a nop (no operation)
    // $s1 = a, $s2 = b, $s3 = c
    // $t0 = d, $t1 = f
    a = b + c;
    if (d == 0) {f = f + 1;}
    f = f + 2;

  With a nop in the delay slot:
        add  $s1, $s2, $s3
        bne  $t0, $zero, L1
        nop                     // delay slot
        addi $t1, $t1, 1
    L1: addi $t1, $t1, 2

• Can we do better?
  • Fill the delay slot with a useful and valid instruction
        bne  $t0, $zero, L1
        add  $s1, $s2, $s3      // delay slot
        addi $t1, $t1, 1
    L1: addi $t1, $t1, 2

Branch Prediction
• Longer pipelines (implemented in Core 2 Duo, for example) can't readily determine the branch outcome early
  • The stall penalty becomes unacceptable since branch instructions are used so frequently in programs
• Solution: branch prediction
  • Predict the branch outcome in hardware
  • Flush the instructions (that shouldn't have been executed) from the pipeline if the prediction turns out to be wrong
  • Modern processors use sophisticated branch predictors

MIPS with Predict-Not-Taken
[Figure: two pipeline diagrams. Prediction correct: execution continues without a bubble. Prediction incorrect: flush the instruction that shouldn't be executed]

Control Hazards - Branch
• When the branch condition is resolved, other instructions are in the pipeline
[Figure: pipeline diagram over CC 1 to CC 9 for the following instructions, shown with their addresses]
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)
• Note that in this implementation, the branch is resolved in the MEM stage
• We are predicting "branch not taken"
  • If we are wrong (if the branch is taken), flush the instructions

Alleviate Branch Hazards
• Reduce the penalty to 1 cycle
  • Move the branch compare to the ID stage of the pipeline
  • Add an adder to calculate the branch target in the ID stage
  • Add the IF.Flush signal that zeros (squashes) the instruction in the IF/ID pipeline register
        beq $1,$2,L1
        add $1,$2,$3
        …
    L1: sub $1,$2,$3
[Figure: the taken target address is known and the branch is resolved in the ID stage of beq; one bubble follows]

Flushing Instructions
[Figure: pipelined datapath with the IF.Flush signal, the hazard detection unit, an "=" comparator on the register outputs in the ID stage, and the forwarding unit]

Flushing Instructions (cycle N)
        beq $1, $3, L2
        and $12, $2, $5
        or  $13, $12, $1
        …
    L2: lw  $4, 40($7)
[Figure: beq $1, $3, L2 is in the ID stage and and $12, $2, $5 has just been fetched; the branch target L2 is selected as the next PC]

Flushing Instructions (cycle N+1)
[Figure: lw $4, 40($7) at L2 is being fetched, the flushed and instruction has become a nop, and beq $1, $3, L2 has moved on to the next stage]

Supporting Multiple FP Operations
[Figure: IF and ID feed parallel execution units, followed by MEM and WB: the integer unit (EX), the FP multiplier (M1 to M7, 7 cycles), the FP adder (A1 to A4, 4 cycles), and the FP divider (non-pipelined, 24 cycles)]
• Complicates bypassing
• Potential structural hazards
  • Multiple (FP) instructions can complete at the same time
  • The RF might need to be multi-ported
  • Ordering issue: who gets to update the register?
• Out-of-order completion/retirement: precise exception issue
Modified from Prof Sean Lee's Slide

Bypassing & Forwarding

  Clock cycle     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
  L.D   F4,0(R2)  IF ID EX M  WB
  MUL.D F0,F4,F6     IF ID S  M1 M2 M3 M4 M5 M6 M7 M  WB
  ADD.D F2,F0,F8        IF S  ID S  S  S  S  S  S  A1 A2 A3 A4 M  WB
  S.D   F2,0(R2)              IF S  S  S  S  S  S  ID EX S  S  S  M  WB

  (S = stall)

Structural Hazards

  Clock cycle     1  2  3  4  5  6  7  8  9  10 11
  MUL.D F0,F4,F6  IF ID M1 M2 M3 M4 M5 M6 M7 M  WB
  . . . .            IF ID EX M  WB
  . . . .               IF ID EX M  WB
  ADD.D F2,F4,F6           IF ID A1 A2 A3 A4 M  WB
  . . . .                     IF ID EX M  WB
  . . . .                        IF ID EX M  WB
  L.D   F2,0(R2)                    IF ID EX M  WB

• Write to the register file in the same cycle (cc11)
• Write to the same register (WAW)
• MEM in cc10

Precise Exception Issue
    DIV.D F0,F2,F4       (exception!)
    ADD.D F3,F10,F8      (completed)
    SUB.D F12,F12,F14    (completed)
• Precise exception: if the pipeline can (or must) be stopped
  • All the instructions before the faulty (or intended) instruction must be completed
  • All the instructions after it must not be completed
  • Restart the execution from the faulty (or intended) instruction
• State must be consistent with the original program order
• Not straightforward with out-of-order completion
• Simple solution: stall until it is guaranteed that no prior long-latency instruction will raise an exception
• Other modern solution: ROB (will dedicate a lecture to it)

Scalar Pipeline (Baseline)
[Figure: instruction sequence versus execution cycle (1 to 6); each instruction passes through IF, DE, EX, MEM, WB, starting one cycle after the previous one]
Modified from Prof Sean Lee's Slide

Superpipeline
• Deeper pipelining is called superpipelining
• Cache access is particularly time critical, so the extra pipeline stages come from decomposing the memory access
• A deeper pipeline allows for higher clock rates
[Figure: instruction sequence versus execution cycle (1 to 6); each of IF, DE, EX, MEM, WB is split into sub-stages, and successive instructions start one sub-stage apart]
Modified from Prof Sean Lee's Slide

MIPS R4000 Pipeline
• Deeper pipeline (superpipelining)
• 2-cycle delay for loads
• Predicted-not-taken strategy
  • Not-taken (fall-through) branch: 1 delay slot
  • Taken branch: 1 delay slot + 2 idle cycles
[Figure: eight stages IF, IS, RF, EX, DF, DS, TC, WB; instruction memory spans IF and IS, the register file is read in RF, the ALU operates in EX, data memory spans DF and DS, and the register file is written in WB; the branch target and condition are evaluated in EX]
Prof Sean Lee's Slide

Load delay (2 cycles)
[Figure: CC1 to CC11 through the stages IF, IS, RF, EX, DF, DS, TC, WB for LD R1, Inst 1, Inst 2, and ADD R2, R1, with two bubbles]
Modified from Prof Sean Lee's Slide

Branches (Predicted-not-taken)

  Actual direction: NOT TAKEN
                   CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11
  Branch           IF  IS  RF  EX  DF  DS  TC  WB
  Delay slot           IF  IS  RF  EX  DF  DS  TC  WB
  Branch inst+2            IF  IS  RF  EX  DF  DS  TC  WB
  Branch inst+3                IF  IS  RF  EX  DF  DS  TC   WB

  Actual direction: TAKEN
                   CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 CC11
  Branch           IF  IS  RF  EX  DF  DS  TC  WB
  Delay slot           IF  IS  RF  EX  DF  DS  TC  WB
  Stall                    S   S   S   S   S   S   S   S
  Stall                        S   S   S   S   S   S   S    S
  Branch target                    IF  IS  RF  EX  DF  DS   TC

Prof Sean Lee's Slide
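
The load and branch delays quoted above for the R4000 can be folded into a rough CPI estimate. The sketch below is illustrative only: `effective_cpi` is a hypothetical helper, the instruction mix in the example is made up rather than measured, and it assumes a base CPI of 1, a delay slot that always holds useful work, and the full 2-cycle load delay for every load whose result is needed immediately.

```python
def effective_cpi(load_frac, load_use_frac, branch_frac, taken_frac,
                  load_delay=2, taken_penalty=2):
    """Base CPI of 1 plus the average number of stall cycles per instruction.

    load_frac      fraction of instructions that are loads
    load_use_frac  fraction of loads whose result is needed right away
    branch_frac    fraction of instructions that are branches
    taken_frac     fraction of branches that are taken
    """
    load_stalls = load_frac * load_use_frac * load_delay
    branch_stalls = branch_frac * taken_frac * taken_penalty
    return 1.0 + load_stalls + branch_stalls


# Hypothetical mix: 25% loads (half used immediately), 20% branches (60% taken).
print(f"estimated CPI: {effective_cpi(0.25, 0.5, 0.20, 0.6):.2f}")
```

The point of the estimate is the trend rather than the number: a deeper pipeline raises the clock rate, but each unfilled load delay or taken branch costs more cycles, so the CPI moves away from 1.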