IC220 Set #20: Laundry, Co-dependency, and other Hazards of Modern (Architecture) Life Chapter 6 1 ADMIN • Reading for Chapter 6: 6.1, 6.9-6.12 2 Midnight Laundry Time 6 PM 7 8 9 10 11 12 1 2 AM 6 PM 7 8 9 10 11 12 1 2 AM Task order A B C D Time Task order A B C D 3 Smarty Laundry Time 6 PM 7 8 9 10 11 12 1 2 AM 6 PM 7 8 9 10 11 12 1 2 AM Task order A B C D Time Task order A PAT06F01.eps B C D 4 Pipelining • Improve performance by increasing instruction throughput Program execution Time order (in instructions) 200 on lw $1, 100($0) Instructi fetch Reg lw $2, 200($0) 400 600 800 Data access ALU 1000 1200 1400 ALU Data access 1600 1800 Reg Instruction Reg fetch 800 ps lw $3, 300($0) Reg Instruction fetch 800 ps 800 ps Program execution Time order (in instructions) lw $1, 100($0) 200 Instruction fetch 400 Reg 600 Data access ALU Instruction lw $2, 200($0) 200 ps fetch 800 Reg 1200 1400 Reg Data access ALU Instruction 200 ps fetch lw $3, 300($0) 1000 Reg Reg Data access ALU Reg 200 ps 200 ps 200 ps 200 ps 200 ps Ideal speedup is number of stages in the pipeline. Do we achieve this? 5 Basic Idea IF: Instruction fetch ID: Instruction decode/ register file read EX: Execute/ address calculation MEM: Memory access WB: Write back Add 4 ADD Shift left 2 0 M u x 1 PC Address Instruction Instruction memory Read register 1 Read data 1 Read register 2 Registers Write Read register data 2 Write data 16 Add result Zero 0 M u x 1 ALU ALU result Address Write data Sign extend Read data Data Memory 1 M u x 0 32 6 Pipelined Datapath IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 Shift left 2 PC Read register 1 Address Read data 1 Read register 2 Registers Read Write data 2 register Instruction 0 M u x 1 Instruction memory Zero ALU ALU result 0 M u x 1 Write data Read data Address 0 M u x 1 Data memory Write data 16 32 Sign extend 7 Pipeline Diagrams Clock cycle: Time $s0, $t0, add $s0,add $s1, $s2$t1 200 1 IF 2 400 EX ID 200 Time sub $a1, $s2, $a3 add $s0, $t0, $t1 600 3 IF 200 add $s0, $t0, $t1 IF 800 7 1000 WB MEM 400 ID 6 PAT06F11.eps 600 EX ID 1000 5 WB MEM 400 Time add $t0, $t1, $t2 800 4 600 EX 800 MEM 1000 WB Assumptions: • Reads to memory or register file in 2nd half of clock cycle • Writes to memory or register file in 1st half of clock cycle What could go wrong? 8 Problem: Dependencies • Problem with starting next instruction before first is finished Clock cycle: Time subadd $s0, $s2 $s0,$s1, $t0, $t1 1 200 IF 2 400 3 EX ID 200 Time and $a1, $s0, $a3 add $s0, $t0, $t1 IF 800 EX IF Time 6 7 800 8 200 IF 1000 WB MEM 400 ID add $s0, $t0, $t1 1000 600 200 add $s0, $t0, $t1 5 WB MEM ID add $t0, $t1, $s0 $t2, $s0, $s0 4 400 Time or 600 600 EX 800 400 MEM 600 EX ID 1000 WB 800 1000 WB MEM Dependencies that “go backward in time” are ____________________ Will the “or” instruction work properly? 9 Solution: Forwarding Use temporary results, don’t wait for them to be written Clock cycle: Time subadd $s0, $s2 $s0,$s1, $t0, $t1 1 200 IF 2 400 200 Time IF add $t0, $t1, $s0 add $s0, $t0, $t1 $t2, $s0, $s0 4 800 EX add $s0, $t0, $t1 1000 600 200 IF Time 5 7 800 8 200 1000 WB MEM 400 ID 6 WB MEM 400 ID Time or 600 EX ID and $a1, $s0, $a3 add $s0, $t0, $t1 3 600 EX 800 400 MEM 600 1000 WB 800 1000 PAT06F04.eps IF ID EX WB MEM PAT06F04.eps Where do we need this? Will this deal with all hazards? 10 PAT06F04.eps PAT06F Problem? Clock cycle: Time lw $t0,add 0($s1) $s0, $t0, $t1 1 200 IF 2 400 200 Time add $s0, $t0, $t1 600 EX ID sub $a1, $t0, $a3 3 IF add $s0, $t0, $t1 600 EX ID 200 IF 1000 5 6 7 WB MEM 400 Time add $a2, $t0, $t2 800 4 800 WB MEM 400 1000 600 EX ID 800 MEM 1000 WB Forwarding not enough… When an instruction tries to ___________ a register following a ____________ to the same register. 11 Solution: “Stall” later instruction until result is ready Clock cycle: 1 2 3 4 5 6 7 lw $t0, 0($s1) sub $a1, $t0, $a3 add $a2, $t0, $t2 PAT06F04.eps PAT06F04.eps Why does the stall start after ID stage? PAT06F04.e 12 Assumptions • For exercises/exams/everything assume… – The MIPS 5-stage pipeline – That we have forwarding …unless told otherwise 13 Exercise #1 – Pipeline diagrams • Draw a pipeline stage diagram for the following sequence of instructions. Start at cycle #1. You don’t need fancy pictures – just text for each stage: ID, MEM, etc. add $s1, $s3, $s4 lw $v0, 0($a0) sub $t0, $t1, $t2 • What is the total number of cycles needed to complete this sequence? • What is the ALU doing during cycle #4? • When does the sub instruction writeback its result? • When does the lw instruction access memory? 14 Exercise #2 – Data hazards • Consider this code: 1. add $s1, $s3, $s4 2. add $v0, $s1, $s3 3. sub $t0, $v0, $t2 4. and $a0, $v0, $s1 1. Draw lines showing all the data dependencies in this code 2. Which of these dependencies do not need forwarding to avoid stalling? 15 Exercise #3 – Data hazards • Draw a pipeline diagram for this code. Show stalls where needed. 1. add $s1, $s3, $s4 2. lw $v0, 0($s1) 3. sub $v0, $v0, $s1 16 Exercise #4 – More Data hazards • Draw a pipeline diagram for this code. Show stalls where needed. 1. 2. 3. 4. lw lw sw sw $s1, $v0, $v0, $t0, 0($t0) 0($s1) 4($s1) 0($t1) 17 Exercise #5 – Stretch • What might be a problem with pipelining the following code? Else: beq lw sw add $a0, $v0, $v0, $a1, $a1, Else 0($s1) 4($s1) $a2, $a3 18 Exercise #6 – Stretch • This diagram (from before) has a serious bug. What is it? IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 Shift left 2 PC Address Instruction memory Instruction 0 M u x 1 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register 0 M u x 1 Write data Zero ALU ALU result Read data Address Data memory 0 M u x 1 Write data 16 32 Sign extend 19 Big Picture • • Remember the single-cycle implementation – Inefficient because low utilization of hardware resources – Each instruction takes one long cycle Two possible ways to improve on this: Multicycle Pipelined PAT06F11.eps Clock cycle time (vs. single cycle) Amount of hardware used (vs. single cycle) Split instruction into multiple stages (1 per cycle)? Each stage has its own set of hardware? How many instructions executing at once? 20 The Pipeline Paradox • Pipelining does not ________________ the execution time of any ______________ instruction • But by _____________________ instruction execution, it can greatly improve performance by ________________ the ________________ 21 Implementing Pipelining • What makes it easy? – all instructions are the same length – just a few instruction formats – memory operands appear only in loads and stores • What makes it hard? – data hazards – structural hazards – control hazards • What make it really hard? – exception handling – Improving performance with out-of-order execution, etc. 22 Structural Hazards • Occur when the hardware can’t support the combination of instructions that we want to execute in the same clock cycle • MIPS instruction set designed to reduce this problem • But could occur if: 23 Control Hazards • What might be a problem with pipelining the following code? Else: • beq lw sw add $a0, $v0, $v0, $a1, $a1, Else 0($s1) 4($s1) $a2, $a3 What other kinds of instructions would cause this problem? 24 Control Hazard Strategy #1: Predict not taken • What if we are wrong? • Assume branch target and decision known at end of ID cycle. Show a pipeline diagram for when branch is taken. beq $a0, $a1, Else lw $v0, 0($s1) sw $v0, 4($s1) Else: add $a1, $a2, $a3 25 Control Hazard Strategies 1. Predict not taken One cycle penalty when we are wrong – not so bad Penalty gets bigger with longer pipelines – bigger problem 2. 3. 26 Branch Prediction Taken Not taken Predict taken Predict taken Taken Not taken Taken Not taken Predict not taken Predict not taken Taken Not taken With more sophistication can get 90-95% accuracy Good prediction key to enabling more advanced pipelining techniques! 27 Pipeline Control • Generate control signal during the ________ stage • _________ control signals along just like the __________ Instruction R-format lw sw beq Execution/Address Calculation Memory access stage stage control lines control lines Reg ALU ALU ALU Mem Mem Dst Op1 Op0 Src Branch Read Write 1 1 0 0 0 0 0 0 0 0 1 0 1 0 X 0 0 1 0 0 1 X 0 1 0 1 0 0 Write-back stage control lines Reg Mem to write Reg 1 0 1 1 0 X 0 X WB Instruction IF/ID Control M WB EX M WB ID/EX EX/MEM MEM/WB 28 Code Scheduling to Improve Performance • Can we avoid stalls by rescheduling? lw add lw add • $t0, $t2, $t3, $t4, 0($t1) $t0, $t2 4($t1) $t3, $t4 Dynamic Pipeline Scheduling – Hardware chooses which instructions to execute next – Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!) – Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect) 29 Dynamic Pipeline Scheduling • • Let hardware choose which instruction to execute next (might execute instructions out of program order) Why might hardware do better job than programmer/compiler? Example #1 lw $t0, 0($t1) add $t2, $t0, $t2 lw $t3, 4($t1) add $t4, $t3, $t4 Example #2 sw $s0, 0($s3) lw $t0, 0($t1) add $t2, $t0, $t2 30 Exercise #1 • Can you rewrite this code to eliminate stalls? 1. 2. 3. 4. lw lw sw add $s1, $v0, $v0, $t0, 0($t0) 0($s1) 4($s1) $t1, $t2 31 Exercise #2 • Show a pipeline diagram for the following code, assuming: – The branch is predicted not taken – The branch actually is taken lw beq sub Label2: add $t1, $s1, $v0, $t0, 0($t0) $s2, Label2 $v1, $v2 $t1, $t2 32 Exercise #3 – True or False? 1. A pipelined implementation will have a faster clock rate than a comparable single cycle implementation 2. Pipelining increases performance by splitting up each instruction into stages, thereby decreasing the time needed to execute each instruction. 3. A structural hazard could occur if an instruction produce two results that needed to be written to the register file. 4. “Backwards” branches are likely to not be taken 33 Exercise #4 – Stretch • • What is problematic about the following code? Show a pipeline diagram – assume branch is predicted not taken, but is taken. lw beq sub Label2: add $s1, $s1, $v0, $t0, 0($t0) $s2, Label2 $v1, $v2 $t1, $t2 34