Pipelining review Pipelining review instr1 instr2 Instructions instr3 instr4 Clock Cycle 1 Clock Cycle 2 IM Reg IM Clock Cycle 3 Clock Cycle 4 Clock Cycle 5 Clock Cycle 6 Clock Cycle 7 DM Reg IM DM Reg IM DM Reg DM Time Pipeline Control – big picture Control Unit ID/EX EX/Mem Mem/WB IF/ID IF ID EX Mem WB Pipeline in action ID IF EX MEM Ctrl Instruction memory CLK clk PC 0x1234 Let’s execute a program at address 0x1234: lw $1, 4($3) add $2, $3, $4 or $3, $1, $2 bne $1, $3, L Register file ALU SE Data Memory Suppose: $1 = 1, $2 = 2, $3 = 1000, $4 = 6 WB Pipeline in action 1 ID IF EX MEM lw $1, 4($3) Instruction memory CLK Register file ALU SE PC Data Memory 0x1234 Clock: 1 $1 = 1, $2 = 2, $3 = 1000, $4 = 4 WB Pipeline in action 2 ID IF EX MEM add $2, $3, $4 Instruction memory CLK Register file ALU SE PC lw $1, 4($3) 0x1238 Data Memory Clock: 1 2 $1 = 1, $2 =2, $3 =1000, $4 = 6 WB Pipeline in action 3 ID IF EX MEM or $3, $1, $2 Instruction memory CLK Register file ALU SE PC add $2, $3, $4 0x123C lw $1, 4($3) Data Memory Clock: 1 2 3 $1 = 1, $2 =2, $3 =1000, $4 = 6 WB Pipeline in action 4 ID IF (anything wrong?) EX MEM WB bne $1, $3, L Instruction memory CLK Register file ALU SE PC Data Memory 0x1240 or $3, $1, $2 add $2, $3, $4 lw $1, 4($3) Clock: 1 2 3 4 Let’s assume memory returns 42 for lw. $1 = 1, $2 =2, $3 =1000, $4 = 6 Data Hazards! ID IF EX MEM WB bne $1, $3, L Instruction memory CLK Register file ALU SE PC Data Memory 0x1240 or $3, $1, $2 add $2, $3, $4 lw $1, 4($3) Clock: 1 2 3 4 $1 = 1, $2 = 2, $3 = 1000, $4 = 6 Pipeline in action 5 ID IF EX MEM WB sll $4, $3, 5 Instruction memory CLK Register file ALU SE PC Data Memory 0x1244 bne $1, $3, L Clock: 1 2 3 4 5 or $3, $1, $2 add $2, $3, $4 $1 and $2 both wrong for or because we got old values! lw $1, 4($3) $1 = 42, $2 = 2, $3 = 1000, $4 = 6 Data Hazards Data dependence result of an operation needed before it is stored back in reg. file: lw $1, 4($3) add $2, $3, $4 or $3, $1, $2 The data hazard above is read after write (RAW) Data dependence (RAW) occurs when: An instruction wants to read a register in stage 2, and One instruction in stage 3 or stage 4 is going to write that register Note: if the instruction writing the register is in stage 5, this is fine since we can write a register and read it in the same cycle Hazard detection unit compares register fields across stages. Resolving data dependencies Have the compiler generate no-ops Stall the pipeline, i.e., create bubbles Don’t have to deal with data hazards in hardware. the resulting delays are the same as for no-ops Send the result generated in stage 3 or stage 4 to the appropriate input of the ALU. This is forwarding or bypassing. More performance at the cost of more hardware For one simple pipeline, cost is slightly more control and extra buses For several pipelines, say n, communication grows as O(n2) Forwarding from WB to EX ID IF EX MEM WB sll $4, $3, 5 Instruction memory CLK Register file ALU SE PC Data Memory 0x1244 bne $1, $3, L Clock: 1 2 3 4 5 or $3, $1, $2 $1 fixed, $2 still wrong add $2, $3, $4 lw $1, 4($3) $1 = 42, $2 = 2, $3 = 1000, $4 = 6 Forwarding from MEM to EX ID IF EX MEM WB sll $4, $3, 5 Instruction memory CLK Register file ALU SE PC Data Memory 0x1244 bne $1, $3, L Clock: or $3, $1, $2 add $2, $3, $4 lw $1, 4($3) 1 2 3 4 5 $1 = 42, $2 = 2, $3 = 1000, $4 = 6 Generally, need to forward to both inputs! ID IF EX MEM WB sll $4, $3, 5 Instruction memory CLK Register file ALU SE PC Data Memory 0x1244 beq $1, $3, L Clock: or $3, $1, $2 What if we had: or $3, $1, $1 1 2 3 4 5 add $2, $3, $4 lw $1, 4($3) Need to forward $1 to both ALU inputs $1 = 42, $2 = 2, $3 = 1000, $4 = 6 Forwarding a register twice? What if had: add $1, $2, $2 or $1, $3, $4 and $5, $1, $1 Where do you forward $1 from? Another look at forwarding… Clock Cycle 1 Clock Cycle 2 IM Reg lw $1, 4($3) Clock Cycle 3 Clock Cycle 4 Clock Cycle 5 Clock Cycle 6 Clock Cycle 7 DM Forward $1 add $2, $3, $4 IM Reg DM Forward $2 or $3, $1, $2 IM Reg DM Instructions fwd $3 bne $1, $3, L IM Reg DM Time Data hazards and loads Clock Cycle 1 lw $1, 4($3) IM or $3, $1, $2 Clock Cycle 2 Clock Cycle 3 Reg IM Clock Cycle 4 Clock Cycle 5 DM Reg Clock Cycle 6 Clock Cycle 7 Can’t do forwarding in this case: can’t go back in time. ? Have to insert a bubble between lw and or DM What if we switch these? Instructions add $2, $3, $4 bne $1, $3, L IM Reg IM DM Reg Time DM Inserting a bubble • Have to do it when you load into a register which is used in the next instruction Clock Cycle 1 Clock Cycle 2 IM Reg lw $1, 4($3) Clock Cycle 3 Clock Cycle 4 Clock Cycle 5 Clock Cycle 6 Clock Cycle 7 Time DM Now we can forward $1 or $3, $1, $2 IM Reg DM do nothing Instructions add $2, $3, $4 bne $1, $3, L Insert bubble here IM Reg IM DM Reg DM … Other hazards We’ve seen data hazards when an instruction in the pipeline is dependent on another instruction still in the pipeline. Resolved by: Bypassing/forwarding Stalling Other types of hazards: Structural hazards where two instructions at different stages want to use the same resource. Solved by using more resources (e.g., instruction and data memory; several ALU’s). Won’t happen in our pipeline. Control hazards happens on a taken branch. Evaluation of branch condition and calculation of branch target is not completed before next instruction is fetched. Branch path ID IF EX MEM WB instr at L Instruction memory CLK Register file ALU SE sll $4, $3, 5 PC L: Data Memory This is wrong! (Should be instr at L) result of comparison Determine new PC Clock: =? bne $1, $3, L 1 2 3 4 5 or $3, $1, $2 add $2, $3, $4 6 $1 = 42, $2 = 1006, $3 = 1000, $4 = 6 Resolving control hazards Stall until result of the condition & target are known too slow Reduce penalty by redesigning the pipeline: move branch calculation to ID stage move branch comparison to ID stage use both edges of the clock Delay slots how to do quick compare? specify in ISA that instruction after branch is always executed. Branch prediction in IF stage, guess branch outcome correct it if turns out to be wrong. Lw & sw ID IF EX MEM Ctrl Instruction memory CLK clk PC Consider a memory copy program… loop: lw $1, 0($3) sw $1, 0($4) addi $3, $3, 4 addi $4, $4, 4 bne $4, $5, Loop Register file ALU SE Data Memory Too many stalls! How do we fix this? WB