ELEC 5200-001/6200-001
Computer Architecture and Design
Spring 2016
Pipelining (Chapter 6)

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu

Spring 2016, Feb 29 . . . ELEC 5200-001/6200-001 Lecture 6

ILP: Instruction Level Parallelism

Single-cycle and multi-cycle datapaths execute one instruction at a time.
How can we get better performance?
Answer: Execute multiple instructions at a time:
– Pipelining: Enhance a multi-cycle datapath to fetch one instruction every cycle.
– Parallelism: Fetch multiple instructions every cycle.

Automobile Team Assembly

[Figure: four tasks of 1 hour each, performed on one car at a time.]
1 car assembled every four hours
6 cars per day
180 cars per month
2,160 cars per year

Automobile Assembly Line

Task 1 (1 hour): Mechanical
Task 2 (1 hour): Electrical
Task 3 (1 hour): Painting
Task 4 (1 hour): Testing

First car assembled in 4 hours (pipeline latency); thereafter, 1 car completed per hour
21 cars on first day, thereafter 24 cars per day
717 cars per month
8,637 cars per year
What gives 4X increase?

Throughput: Team Assembly

[Figure: timeline. The red car passes through Mechanical, Electrical, Painting and Testing from "Red car started" to "Red car completed"; the blue car is started only then and passes through the same four subtasks.]
Time of assembling one car = n hours, where n is the number of nearly equal subtasks, each requiring 1 unit of time
Throughput = 1/n cars per unit time

Throughput: Assembly Line

[Figure: timeline. Cars 1 to 4 start one time unit apart and overlap in the Mechanical, Electrical, Painting and Testing stages; car 1 is complete first, car 2 one time unit later.]
Time to complete first car = n time units (latency)
Cars completed in time T = T – n + 1
Throughput = 1 – (n – 1)/T cars per unit time

$$\frac{\text{Throughput (assembly line)}}{\text{Throughput (team assembly)}} = \frac{1 - (n-1)/T}{1/n} = n - \frac{n(n-1)}{T} \rightarrow n \quad \text{as } T \rightarrow \infty$$

Some Features of Assembly Line

[Figure: four-stage line. Task 1 Mechanical, Task 2 Electrical, Task 3 Painting, Task 4 Testing, 1 hour each. Annotations: "Electrical parts delivered (JIT)" and "Defect found".]
Stall assembly line to fix the cause of defect
3 cars in the assembly line are suspects, to be removed (flush pipeline)

Pros and Cons

Advantages:
– Efficient use of labor.
– Specialists can do a better job.
– Just in time (JIT) methodology eliminates warehouse cost.
Disadvantages:
– Penalty of defect latency.
– Lack of flexibility in production.
– Assembly line work is monotonous and boring.
https://www.youtube.com/watch?v=IjarLbD9r30
https://www.youtube.com/watch?v=ANXGJe6i3G8
https://www.youtube.com/watch?v=5lp4EbfPAtI

A Single Cycle Example

[Figure: 32 1-bit full adders in a carry chain; inputs a31 . . . a0 and b31 . . . b0, carry-in 0, outputs s31 . . . s0 and c32; clocked.]
Delay of 1-bit full adder = 1ns
Clock period ≥ 32ns
Time of adding words ~ 32ns
Time of adding bytes ~ 32ns

A Multicycle Implementation

[Figure: a single 1-bit full adder with a carry flip-flop (FF, initialized to 0); a31 . . . a0, b31 . . . b0 and s31 . . . s0 are held in shift registers; clocked.]
Delay of 1-bit full adder = 1ns
Clock period ≥ 1ns
Time of adding words ~ 32ns
Time of adding bytes ~ 8ns

A Pipeline Implementation

[Figure: 32 1-bit full adders in a carry chain separated by pipeline registers P1, P2, P3, . . ., P31; inputs a31 . . . a0 and b31 . . . b0, carry-in 0, outputs s31 . . . s0 and c32; clocked.]
Delay of 1-bit full adder = 1ns
Clock period ≥ 1ns
Time of adding words ~ 1ns (one sum completed per clock cycle once the pipeline is full)
Time of adding bytes ~ 1ns

Pipelining in a Computer

Divide datapath into nearly equal tasks, to be performed serially and requiring non-overlapping resources.
Insert registers at task boundaries in the datapath; registers pass the output data from one task as input data to the next task.
Synchronize tasks with a clock having a cycle time just enough to complete the longest task.
Break each instruction down into a set of tasks so that instructions can be executed in a staggered fashion.

Pipelining of RISC Instructions
(From Lecture 3, Slide 6)

Fetch Instruction    IF    Instruction Fetch
Examine Opcode       ID    Decode instruction and Fetch operands
ALU Operation        EX    Execute
Memory Read/Write    MEM   Memory Operation
Store Result         WB    Write Back to Reg file

Although an instruction takes five clock cycles, one instruction is completed every cycle.

Pipelining a Single-Cycle Datapath

IF = Instr. fetch; ID = Instr. decode (also reg. file read); EX = Execution (ALU operation); MEM = Data access; WB = Write back (reg. file write)

Instruction class                    IF    ID    EX    MEM   WB    Total time
lw                                   2ns   1ns   2ns   2ns   1ns   8ns
sw                                   2ns   1ns   2ns   2ns         8ns
R-format (add, sub, and, or, slt)    2ns   1ns   2ns         1ns   8ns
B-format (beq)                       2ns   1ns   2ns               8ns

No operation on data; idle times equalize instruction lengths.

Execution Time: Single-Cycle

[Figure: timeline, 0 to 16 ns and beyond. lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0) each pass through IF, ID, EX, MEM, WB one after another, one instruction per 8 ns clock cycle.]
Clock cycle time = 8 ns
Total time for executing three lw instructions = 24 ns

Pipelined Datapath
IF = Instr. fetch; ID = Instr. decode (also reg. file read); EX = Execution (ALU operation); MEM = Data access; WB = Write back (reg. file write)

Instruction class                     IF    ID    EX    MEM   WB    Total time
lw                                    2ns   2ns   2ns   2ns   2ns   10ns
sw                                    2ns   2ns   2ns   2ns   2ns   10ns
R-format: add, sub, and, or, slt      2ns   2ns   2ns   2ns   2ns   10ns
B-format: beq                         2ns   2ns   2ns   2ns   2ns   10ns

Every stage is given one 2ns clock cycle: the 1ns stages (ID, WB) are stretched from 1ns to 2ns.
No operation on data; idle time inserted to equalize instruction lengths.

Execution Time: Pipeline

[Figure: timeline, 0 to 16 ns and beyond. lw $1, 100($0) starts at 0 ns, lw $2, 200($0) at 2 ns and lw $3, 300($0) at 4 ns; each passes through IF, ID, EX, MEM, RW, one stage per 2 ns cycle.]
Clock cycle time = 2 ns, four times faster than single-cycle clock
Total time for executing three lw instructions = 14 ns

$$\text{Performance ratio} = \frac{\text{Single-cycle time}}{\text{Pipeline time}} = \frac{24}{14} = 1.7$$

Pipeline Performance

Clock cycle time = 2 ns
1,003 lw instructions:
Total time for executing 1,003 lw instructions = 2,014 ns

$$\text{Performance ratio} = \frac{\text{Single-cycle time}}{\text{Pipeline time}} = \frac{8{,}024}{2{,}014} = 3.98$$

10,003 lw instructions:
Performance ratio = 80,024 / 20,014 = 3.998 → Clock cycle ratio (4)
Pipeline performance approaches clock-cycle ratio for long programs.

Single-Cycle Datapath

[Figure: single-cycle datapath divided into five sections. IF: Instr. fetch; ID: Instr. decode, reg. file read; EX: Execute, address calc.; MEM: mem. access; WB: write back. Blocks: PC, Add (4), Instr. mem., Reg. File, Sign ext., Shift left 2, ALU, ALU Cont., Data mem., CONTROL, multiplexers. Control signals: RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite. Instruction fields: 26-31 (opcode), 21-25, 16-20, 11-15, 0-15, 0-5.]

The pipelined datapath requires a CONTROL not too different from that of the single-cycle datapath.
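The performance ratios quoted on the Execution Time and Pipeline Performance slides can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming an ideal five-stage pipeline with no stalls; the 8 ns and 2 ns clock periods come from the slides, while the function names are illustrative only.

```python
# Clock periods and stage count are taken from the slides; names are illustrative.
STAGES = 5
SINGLE_CYCLE_NS = 8
PIPELINE_CYCLE_NS = 2


def single_cycle_time_ns(n):
    """Time for n instructions when each one takes a full 8 ns clock cycle."""
    return n * SINGLE_CYCLE_NS


def pipeline_time_ns(n):
    """Time for n instructions in an ideal pipeline (no hazards assumed).

    The first instruction needs STAGES cycles; each later one adds one cycle.
    """
    return (STAGES + n - 1) * PIPELINE_CYCLE_NS


def performance_ratio(n):
    return single_cycle_time_ns(n) / pipeline_time_ns(n)


for n in (3, 1_003, 10_003):
    print(n, single_cycle_time_ns(n), pipeline_time_ns(n),
          round(performance_ratio(n), 3))
```

As n grows, the fill time of STAGES - 1 cycles becomes negligible and the ratio tends to the clock-cycle ratio 8/2 = 4.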
Pipeline Registers

[Figure: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between stages. Blocks: PC, Add (4), Instr. mem., Reg. File, Sign ext., Shift left 2, ALU, ALU Cont., Data mem., CONTROL, multiplexers. Control signals: RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite.]

Pipeline Register Functions

Four pipeline registers are added:

Register name   Data held
IF/ID           PC+4, Instruction word (IW)
ID/EX           PC+4, R1, R2, IW(0-15) sign ext., IW(11-15)
EX/MEM          PC+4, zero, ALUResult, R2, IW(11-15) or IW(16-20)
MEM/WB          M[ALUResult], ALUResult, IW(11-15) or IW(16-20)

Pipelined Datapath

[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers. The write-register number is taken from bits 11-15 for R-type and from bits 16-20 for I-type lw.]

Five-Cycle Pipeline

[Figure: CC1: PC, IM; CC2: IF/ID, ID, REG. FILE READ; CC3: ID/EX, ALU; CC4: EX/MEM, DM; CC5: MEM/WB, REG. FILE WRITE.]

Add Instruction

add $t0, $s1, $s2
Machine instruction word:
000000   10001   10010   01000   00000   100000
opcode   $s1     $s2     $t0             function

CC1   IF
CC2   ID    read $s1, read $s2
CC3   EX    add $s1+$s2
CC4   MEM
CC5   WB    write $t0

Pipelined Datapath Executing add

[Figure: pipelined datapath annotated for add. Register numbers s1 and s2 select $s1 and $s2 from the Reg. File; the ALU result is carried through EX/MEM and MEM/WB and written to t0.]

Load Instruction

lw $t0, 1200 ($t1)
100011   01001   01000   0000 0100 1011 0000
opcode   $t1     $t0     1200

CC1   IF
CC2   ID    read $t1, sign ext 1200
CC3   EX    add $t1+1200
CC4   MEM   read M[addr]
CC5   WB    write $t0

Pipelined Datapath Executing lw

[Figure: pipelined datapath annotated for lw. Register number t1 selects $t1 from the Reg. File; 1200 is sign extended; the ALU output is the addr for the Data mem; the data read is written to t0.]

Store Instruction

sw $t0, 1200 ($t1)
101011   01001   01000   0000 0100 1011 0000
opcode   $t1     $t0     1200

CC1   IF
CC2   ID    read $t1, sign ext 1200
CC3   EX    add $t1+1200
CC4   MEM   write M[addr]: (addr) ← $t0
CC5   WB

Pipelined Datapath Executing sw

[Figure: pipelined datapath annotated for sw. Register numbers t1 and t0 select $t1 and $t0 from the Reg. File; 1200 is sign extended; the ALU output is the addr and $t0 is the data written to the Data mem.]

Executing a Program

Consider a five-instruction segment:

lw    $10, 20($1)
sub   $11, $2, $3
add   $12, $3, $4
lw    $13, 24($1)
add   $14, $5, $6

Program Execution

[Figure: timing diagram, time (CC1 to CC5 and beyond) across, program instructions down. lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1); add $14, $5, $6 start one clock cycle apart, each passing through IM, IF/ID, ID and REG. FILE READ, ID/EX, ALU, EX/MEM, DM, MEM/WB, REG. FILE WRITE.]

CC5

[Figure: pipelined datapath during CC5, with the instruction in each stage:]
IF:    add $14, $5, $6
ID:    lw $13, 24($1)
EX:    add $12, $3, $4
MEM:   sub $11, $2, $3
WB:    lw $10, 20($1)
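The CC5 snapshot above follows from a simple rule: instruction i (counting from 0) enters IF in cycle i + 1 and advances one stage per cycle. The sketch below, which assumes no stalls and uses a hypothetical helper named snapshot, prints the stage occupancy for every clock cycle of the five-instruction segment.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# The five-instruction segment from the slides.
PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw $13, 24($1)",
    "add $14, $5, $6",
]


def snapshot(program, cc):
    """Map each stage to the instruction occupying it in clock cycle cc (1-based).

    Assumes an ideal pipeline: instruction i is fetched in cycle i + 1
    and moves forward one stage every cycle (no stalls).
    """
    occupancy = {}
    for i, instr in enumerate(program):
        stage_index = cc - 1 - i
        if 0 <= stage_index < len(STAGES):
            occupancy[STAGES[stage_index]] = instr
    return occupancy


# The last instruction leaves WB in cycle len(PROGRAM) + len(STAGES) - 1.
for cc in range(1, len(PROGRAM) + len(STAGES)):
    print(f"CC{cc}:", snapshot(PROGRAM, cc))
```

Only in CC5 are all five stages busy for this segment; before that the pipeline is filling and afterwards it drains.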
Advantages of Pipeline

After the fifth cycle (CC5), one instruction is completed each cycle; CPI ≈ 1, neglecting the initial pipeline latency of 5 cycles.
– Pipeline latency is defined as the number of stages in the pipeline, or
– The number of clock cycles after which the first instruction is completed.
The clock cycle time is about four times shorter than that of single-cycle datapath and about the same as that of multicycle datapath.
For multicycle datapath, CPI = 3. ….
So, pipelined execution is faster, but . . .

Science is always wrong. It never solves a problem without creating ten more.
George Bernard Shaw

Pipeline Hazards

Definition: Hazard in a pipeline is a situation in which the next instruction cannot complete execution one clock cycle after completion of the present instruction.
Three types of hazards:
– Structural hazard (resource conflict)
– Data hazard
– Control hazard

Structural Hazard

Two instructions cannot execute due to a resource conflict.
Example: Consider a computer with a common data and instruction memory. The fourth cycle of a lw instruction requires memory access (memory read) and at the same time the first cycle of the fourth instruction requires instruction fetch (memory read). This will cause a memory resource conflict.
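The memory conflict described on the Structural Hazard slide can be located mechanically. This is a minimal sketch, assuming a single memory shared by instruction fetch and data access and an otherwise ideal pipeline; memory_conflicts is a hypothetical helper, and the four-instruction program is the one used in the lecture's structural hazard example.

```python
IF_STAGE, MEM_STAGE = 0, 3        # stage indices in IF, ID, EX, MEM, WB
MEMORY_OPS = {"lw", "sw"}         # instructions that access data memory


def memory_conflicts(program):
    """Return (cycle, data_access_instr, fetched_instr) for each clash.

    Assumes one shared instruction/data memory and no stalls:
    instruction i is fetched in cycle i + 1.
    """
    conflicts = []
    for i, instr in enumerate(program):
        if instr.split()[0] not in MEMORY_OPS:
            continue
        mem_cycle = (i + 1) + MEM_STAGE          # cycle of the data access
        fetch_index = mem_cycle - 1 - IF_STAGE   # instruction fetched in that cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, instr, program[fetch_index]))
    return conflicts


program = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "add $12, $3, $4",
    "lw $13, 24($1)",
]
print(memory_conflicts(program))
```

For this program the only clash is in CC4, between the data access of the first lw and the fetch of the fourth instruction, as stated on the slide.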
Example of Structural Hazard

[Figure: timing diagram, time (CC1 to CC5 and beyond) across, program instructions down, with a common data and instr. mem. (IM/DM). lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1) start one clock cycle apart. In CC4 the IM/DM is needed by two instructions.]

Possible Remedies for Structural Hazards

Provide duplicate hardware resources in datapath.
Control unit or compiler can insert delays (no-op cycles) between instructions. This is known as pipeline stall or bubble.

Stall (Bubble) for Structural Hazard

[Figure: the same timing diagram with a stall (bubble) inserted among the program instructions lw $10, 20($1); sub $11, $2, $3; add $12, $3, $4; lw $13, 24($1), so that the IM/DM is used by only one instruction per cycle.]

Data Hazard

Data hazard means that an instruction cannot be completed because the needed data, being generated by another instruction in the pipeline, is not available.
Example: consider two instructions:

add   $s0, $t0, $t1
sub   $t2, $s0, $t3      # needs $s0

Example of Data Hazard

[Figure: timing diagram for add $s0, $t0, $t1 and sub $t2, $s0, $t3. add: write s0 in CC5. sub: read s0 and t3 in CC3.]
We need to read s0 from reg file in cycle 3
But s0 will not be written in reg file until cycle 5
However, s0 will only be used in cycle 4
And it is available at the end of cycle 3
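The four cycle numbers in the data hazard example follow directly from the stage positions. The sketch below recomputes them for add followed by sub; raw_timing and stage_cycle are hypothetical helpers, and the rule that a register read in the same cycle as the write would see the new value is an assumption of this sketch, not something stated on the slide.

```python
STAGE_OFFSET = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def stage_cycle(issue_cycle, stage):
    """Clock cycle in which an instruction fetched in issue_cycle is in stage."""
    return issue_cycle + STAGE_OFFSET[stage]


def raw_timing(producer_issue, consumer_issue):
    """Timing of a read-after-write dependence on an ALU result."""
    return {
        "reg_read": stage_cycle(consumer_issue, "ID"),            # consumer reads reg file
        "reg_write": stage_cycle(producer_issue, "WB"),           # producer writes reg file
        "value_used": stage_cycle(consumer_issue, "EX"),          # consumer's ALU needs it
        "value_ready_end_of": stage_cycle(producer_issue, "EX"),  # producer's ALU done
    }


# add is fetched in CC1, sub in CC2.
timing = raw_timing(producer_issue=1, consumer_issue=2)

# Assumption: a read in the same cycle as the write would see the new value.
hazard = timing["reg_read"] < timing["reg_write"]

# The ALU result exists before the consumer's ALU needs it, so it can be forwarded.
forwardable = timing["value_ready_end_of"] < timing["value_used"]

print(timing, hazard, forwardable)
```

The register file read comes too early (CC3 versus CC5), but the value itself is ready at the end of CC3 and is not used until CC4, which is the opening that forwarding exploits.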
Forwarding or Bypassing

Output of a resource used by an instruction is forwarded to the input of some resource being used by another instruction.
Forwarding can eliminate some, but not all, data hazards.

Forwarding for Data Hazard

[Figure: timing diagram for add $s0, $t0, $t1 and sub $t2, $s0, $t3 with the result of add forwarded to sub. add: write s0 in CC5. sub: read s0 and t3 in CC3.]

Forwarding Unit Hardware

[Figure: Forwarding Unit with inputs "Source reg. IDs from opcode" (from ID/EX) and "Destination registers" (from EX/MEM and MEM/WB); it controls two FORW. MUX blocks at the ALU inputs. Other blocks: ALU, Data Mem., MUX, data to reg. file.]

Forwarding Alone May Not Work

[Figure: timing diagram for lw $s0, 20($s1) and sub $t2, $s0, $t3. lw: write s0 in CC5. sub: read s0 and t3 in CC3.]
data needed by sub (data hazard)
data available from memory only at the end of cycle 4

Use Bubble and Forwarding

[Figure: timing diagram for lw $s0, 20($s1), a stall (bubble), then sub $t2, $s0, $t3. lw: write s0 in CC5.]

Hazard Detection Unit Hardware

[Figure: Hazard Detection Unit added to the forwarding hardware. Labels: Instruction, IF/ID, to reg. file, register IDs from prev. instr., PC, Disable write, Control, NOP MUX, 0, ID/EX, FORW. MUX (two), ALU, EX/MEM, Data Mem., MEM/WB, Forwarding Unit.]
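The decisions made by the two units can be sketched as two predicates on a pair of adjacent instructions. This is an illustrative model only: the Instr record and both function names are hypothetical, and the conditions follow the usual textbook rule (stall when a lw result is needed by the very next instruction, forward otherwise), which is assumed here rather than taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical record: operation, destination register, source registers."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def needs_stall(prev, cur):
    """Hazard detection: the value loaded by prev is needed by the next instruction."""
    return prev.op == "lw" and prev.dest in cur.srcs


def forwards_from_alu(prev, cur):
    """Forwarding: prev's ALU result can be fed directly to an ALU input of cur."""
    return prev.op != "lw" and prev.dest is not None and prev.dest in cur.srcs


add = Instr("add", "$s0", ("$t0", "$t1"))
lw = Instr("lw", "$s0", ("$s1",))
sub = Instr("sub", "$t2", ("$s0", "$t3"))

print("add -> sub: stall", needs_stall(add, sub), "forward", forwards_from_alu(add, sub))
print("lw  -> sub: stall", needs_stall(lw, sub), "forward", forwards_from_alu(lw, sub))
```

For add followed by sub forwarding alone is enough; for lw followed by sub one bubble is needed first, after which the loaded value can be forwarded.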
Resolving Hazards

Hazards are resolved by Hazard detection and forwarding units.
Compiler’s understanding of how these units work can improve performance.

Avoiding Stall by Code Reorder

C code:
A = B + E;
C = B + F;

MIPS code:
lw    $t1, 0($t0)        # $t1 written
lw    $t2, 4($t0)        # $t2 written
add   $t3, $t1, $t2      # $t1, $t2 needed
sw    $t3, 12($t0)
lw    $t4, 8($t0)        # $t4 written
add   $t5, $t1, $t4      # $t4 needed
sw    $t5, 16($t0)

Reordered Code

C code:
A = B + E;
C = B + F;

MIPS code:
lw    $t1, 0($t0)
lw    $t2, 4($t0)
lw    $t4, 8($t0)
add   $t3, $t1, $t2      # no hazard
sw    $t3, 12($t0)
add   $t5, $t1, $t4      # no hazard
sw    $t5, 16($t0)

Control Hazard

Instruction to be fetched is not known!
Example: Instruction being executed is branch-type, which will determine the next instruction:

      add $4, $5, $6
      beq $1, $2, 40
      next instruction
      . . .
40    and $7, $8, $9

Stall on Branch

[Figure: timing diagram, time (CC1 to CC5 and beyond) across, program instructions down. add $4, $5, $6; beq $1, $2, 40; stall (bubble); then either the next instruction or and $7, $8, $9.]

Why Only One Stall?

Extra hardware in ID phase:
– Additional ALU to compute branch address
– Comparator to generate zero signal
– Hazard detection unit writes the branch address in PC
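The single stall follows from resolving the branch in the ID stage. The sketch below counts the lost fetch slots as a function of the stage in which the branch is resolved, assuming the fetch of the following instruction simply waits until the end of that stage; branch_bubbles and total_cycles are hypothetical helpers, and the EX and MEM cases are shown only for comparison.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_bubbles(resolve_stage):
    """Fetch slots lost per branch when the next fetch waits for resolve_stage.

    Assumes the branch target is known at the end of resolve_stage and the
    next instruction is fetched in the following cycle.
    """
    return STAGES.index(resolve_stage) - STAGES.index("IF")


def total_cycles(num_instructions, num_branches, resolve_stage="ID"):
    """Cycles to run a sequence in which every branch stalls the pipeline."""
    fill = len(STAGES) - 1
    return num_instructions + fill + num_branches * branch_bubbles(resolve_stage)


for stage in ("ID", "EX", "MEM"):
    print(stage, branch_bubbles(stage), total_cycles(3, 1, stage))
```

With the extra ID-phase hardware the three executed instructions of the example (add, beq and the instruction that follows the branch) take 3 + 4 + 1 = 8 cycles under these assumptions.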
Ways to Handle Branch

Stall or bubble
Branch prediction:
– Heuristics
    Next instruction
    Prediction based on statistics (dynamic)
    Hardware decision (dynamic)
– Prediction error: pipeline flush
Delayed branch

Delayed Branch Example

Stall on branch:
        add $4, $5, $6
        beq $1, $2, skip
        next instruction
        . . .
skip    or $7, $8, $9

Delayed branch:
        beq $1, $2, skip
        add $4, $5, $6       # Instruction executed irrespective of branch decision
        next instruction
        . . .
skip    or $7, $8, $9

Delayed Branch

[Figure: timing diagram, time (CC1 to CC5 and beyond) across, program instructions down. beq $1, $2, skip; add $4, $5, $6; then either the next instruction or skip: or $7, $8, $9.]

Summary: Hazards

Structural hazards
– Cause: resource conflict
– Remedies: (i) hardware resources, (ii) stall (bubble)
Data hazards
– Cause: data unavailability
– Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering
Control hazards
– Cause: out-of-sequence execution (branch or jump)
– Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush
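As a check on data hazard remedy (iii), the two versions of the code from the Avoiding Stall by Code Reorder example can be compared by counting load-use stalls. This is a minimal sketch, assuming forwarding is available so that only a lw immediately followed by a consumer of its result costs one bubble; parse and load_use_stalls are hypothetical helpers that understand just the three instruction forms used here.

```python
def parse(line):
    """Split a MIPS line into (op, dest, sources) for lw, sw and R-format only."""
    op, rest = line.split(None, 1)
    args = [a.strip() for a in rest.split(",")]
    if op == "lw":   # lw rt, offset(base): writes rt, reads base
        return op, args[0], {args[1][args[1].index("(") + 1:-1]}
    if op == "sw":   # sw rt, offset(base): reads rt and base, writes nothing
        return op, None, {args[0], args[1][args[1].index("(") + 1:-1]}
    return op, args[0], set(args[1:])   # R-format: op rd, rs, rt


def load_use_stalls(program):
    """Count bubbles, assuming only lw followed directly by a consumer stalls."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        p_op, p_dest, _ = parse(prev)
        _, _, c_srcs = parse(cur)
        if p_op == "lw" and p_dest in c_srcs:
            stalls += 1
    return stalls


ORIGINAL = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
REORDERED = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]

for name, prog in (("original", ORIGINAL), ("reordered", REORDERED)):
    stalls = load_use_stalls(prog)
    print(name, "stalls:", stalls, "cycles:", len(prog) + 4 + stalls)
```

Under these assumptions the original order incurs two bubbles (after lw $t2 and after lw $t4) and the reordered code none, so the seven instructions finish in 11 cycles instead of 13.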