55:035 Computer Architecture and Organization Lecture 10 Outline Pipelining Basics Throughput Execution Time Pipeline Registers Program Execution Hazards Structural Data Control 55:035 Computer Architecture and Organization 2 ILP: Instruction Level Parallelism Single-cycle and multi-cycle datapaths execute one instruction at a time. How can we get better performance? Answer: Execute multiple instruction at a time: Pipelining – Enhance a multi-cycle datapath to fetch one instruction every cycle. Parallelism – Fetch multiple instructions every cycle. 55:035 Computer Architecture and Organization 3 Automobile Team Assembly 1 hour 1 hour 1 hour 1 hour 1 car assembled every four hours 6 cars per day 180 cars per month 2,040 cars per year 55:035 Computer Architecture and Organization 4 Automobile Assembly Line Task 1 1 hour Mecahnical Task 2 1 hour Task 3 1 hour Electrical Painting Task 4 1 hour Testing First car assembled in 4 hours (pipeline latency) thereafter 1 car per hour 21 cars on first day, thereafter 24 cars per day 717 cars per month 8,637 cars per year 55:035 Computer Architecture and Organization 5 Throughput: Team Assembly Red car started Red car completed Mechanical Electrical Painting Testing Mechanical Electrical Painting Testing Blue car started Blue car completed Time of assembling one car Time = n hours where n is the number of nearly equal subtasks, each requiring 1 unit of time Throughput = 1/n cars per unit time 55:035 Computer Architecture and Organization 6 Throughput: Assembly Line Car 1 Mechanical Electrical Car 2 Painting Mechanical Electrical Car 3 Testing Painting Mechanical Electrical Testing Painting Mechanical Electrical Car 4 Car 1 complete . Testing Painting Testing Car 2 complete time . Time to complete first car Cars completed in time T Throughput =n time units (latency) =T–n+1 = 1- (n - 1)/ T car per unit time Throughput (assembly line) ─────────────────── Throughput (team assembly) = 1 – (n - 1)/ T ──────── = 1/n 55:035 Computer Architecture and Organization n (n – 1) n – ───── → T n as T→∞ 7 Some Features of Assembly Line Electrical parts delivered (JIT) Task 1 1 hour Mechanical Stall assembly line to fix the cause of defect Task 2 1 hour Task 3 1 hour Task 4 1 hour Electrical Painting Testing 3 cars in the assembly line are suspects, to be removed (flush pipeline) 55:035 Computer Architecture and Organization Defect found 8 Pipelining in a Computer Divide datapath into nearly equal tasks, to be performed serially and requiring non-overlapping resources. Insert registers at task boundaries in the datapath; registers pass the output data from one task as input data to the next task. Synchronize tasks with a clock having a cycle time that just exceeds the time required by the longest task. Break each instruction down into a fixed number of tasks so that instructions can be executed in a staggered fashion. 55:035 Computer Architecture and Organization 9 Single-Cycle Datapath Instruction class Instr. fetch (IF) Instr. Decode (also reg. file read) (ID) Execution (ALU Operation) (EX) Data access (MEM) Write Back (Reg. file write) (WB) Total time lw 2ns 1ns 2ns 2ns 1ns 8ns sw 2ns 1ns 2ns 2ns R-format add, sub, and, or, slt 2ns 1ns 2ns B-format, beq 2ns 1ns 2ns 8ns 1ns 8ns 8ns No operation on data; idle time equalizes instruction length to a fixed clock period. 55:035 Computer Architecture and Organization 10 Execution Time: Single-Cycle 0 lw $1, 100($0) lw $2, 200($0) lw $3, 300($0) 2 IF ID 4 EX 6 8 10 12 14 16 . . Time (ns) MEM WB IF ID EX MEM WB IF ID EX MEM WB Clock cycle time = 8 ns Total time for executing three lw instructions = 24 ns 55:035 Computer Architecture and Organization 11 Pipelined Datapath Instruction class Instr. fetch (IF) Instr. Decode (also reg. file read) (ID) Execution (ALU Operation) (EX) Data access (MEM) lw Write Back (Reg. file write) (WB) Total time 2ns 1ns 2ns 2ns 2ns 1ns 2ns 10ns sw 2ns 1ns 2ns 2ns 2ns 1ns 2ns 10ns R-format: add, sub, and, or, slt 2ns 1ns 2ns 2ns 2ns 1ns 2ns 10ns B-format: beq 2ns 1ns 2ns 2ns 2ns 1ns 2ns 10ns No operation on data; idle time inserted to equalize instruction lengths. 55:035 Computer Architecture and Organization 12 Execution Time: Pipeline 0 lw $1, 100($0) lw $2, 200($0) lw $3, 300($0) 2 IF 4 6 8 ID EX MEM IF ID EX IF ID 10 12 14 16 . . Time (ns) RW MEM RW EX MEM RW Clock cycle time = 2 ns, four times faster than single-cycle clock Total time for executing three lw instructions = 14 ns Performance ratio = Single-cycle time ──────────── Pipeline time 55:035 Computer Architecture and Organization = 24 ── = 1.7 14 13 Pipeline Performance Clock cycle time = 2 ns 1,003 lw instructions: Total time for executing 1,003 lw instructions Performance ratio = = 2,014 ns Single-cycle time ──────────── Pipeline time = 8,024 ──── = 3.98 2,014 80,024 / 20,014 = 3.998 → Clock cycle ratio (4) 10,003 lw instructions: Performance ratio = Pipeline performance approaches clock-cycle ratio for long programs. 55:035 Computer Architecture and Organization 14 Single-Cycle Datapath 16-20 11-15 1 mux 0 PC WB: writeback ALU 1 mux 0 MEM: mem. access MemtoReg zero Data mem. MemWrite MemRead 0 mux 1 RegWrite ALUSrc 21-25 Instr. mem. Branch ALU 26-31 EX: Execute, address calc. 1 mux 0 opcode Reg. File Add 4 CONTROL ID: Instr. decode, reg. file read IF: Instr. fetch RegDst 0-15 Sign ext. Shift left 2 ALUOp ALU Cont. 0-5 55:035 Computer Architecture and Organization 15 Pipelining of RISC Instructions Fetch Instruction Examine Opcode Fetch Operands Perform Operation Store Result IF ID EX MEM WB Instruction Fetch Decode instruction and Fetch operands Execute Memory Operation Write Back to Reg file Although an instruction takes five clock cycles, one instruction is completed every cycle. 55:035 Computer Architecture and Organization 16 11-15 ALU ALU zero Data mem. MemWrite MemRead 0 mux 1 This requires a CONTROL not too different from single-cycle 0-15 1 mux 0 16-20 MEM/WB RegWrite ALUSrc 21-25 Instr. mem. MemtoReg 1 mux 0 26-31 PC EX/MEM Branch Reg. File opcode CONTROL 4 ID/EX Add IF/ID 1 mux 0 Pipeline Registers RegDst Sign ext. Shift left 2 ALUOp ALU Cont. 0-5 55:035 Computer Architecture and Organization 17 Pipeline Register Functions Four pipeline registers are added: Register name Data held IF/ID PC+4, Instruction word (IW) ID/EX PC+4, R1, R2, IW(0-15) sign ext., IW(11-15) EX/MEM PC+4, zero, ALUResult, R2, IW(11-15) or IW(16-20) MEM/WB M[ALUResult], ALUResult, IW(11-15) or IW(16-20) 55:035 Computer Architecture and Organization 18 ID/EX EX/MEM Shift left 2 opcode 1 mux 0 IF/ID ALU 4 Add Pipelined Datapath MEM/WB 26-31 11-15 for R-type 16-20 for I-type lw Data mem 0 mux 1 ALU 16-20 1 mux 0 PC Instr mem zero Reg. File 21-25 Sgn ext 0-15 55:035 Computer Architecture and Organization 19 Five-Cycle Pipeline 55:035 Computer Architecture and Organization REG. FILE WRITE CC5 MEM/WB DM CC4 EX/MEM CC3 ALU ID, REG. FILE READ ID/EX CC2 IF/ID IM CC1 20 Add Instruction add IF ID read $s1 read $s2 EX add $s1+$s2 CC5 MEM 55:035 Computer Architecture and Organization REG. FILE WRITE CC4 MEM/WB CC3 DM CC2 EX/MEM CC1 ALU ID, REG. FILE READ ID/EX Machine instruction word 000000 10001 10010 01000 00000 100000 opcode $s1 $s2 $t0 function IF/ID $t0, $s1, $s2 IM WB write $t0 21 11-15 for R-type 16-20 for I-type lw t0 16-20 s2 zero $s1 $s2 ALU PC Instr mem Reg. File 26-31 s1 MEM/WB Sign ext. addr Data mem data 0 mux 1 Shift left 2 opcode 21-25 EX/MEM 1 mux 0 ID/EX ALU IF/ID 1 mux 0 4 Add Pipelined Datapath Executing add 0-15 55:035 Computer Architecture and Organization 22 Load Instruction lw $t0, 1200 ($t1) 100011 01001 01000 0001 0010 0000 0000 opcode $t1 $t0 1200 IF ID read $t1 sign ext 1200 EX MEM add read $t1+1200 M[addr] 55:035 Computer Architecture and Organization REG. FILE WRITE CC5 MEM/WB DM CC4 EX/MEM CC3 ALU ID, REG. FILE READ ID/EX CC2 IF/ID IM CC1 WB write $t0 23 PC Instr mem 16-20 11-15 for R-type 16-20 for I-type lw t0 0-15 1200 $t1 MEM/WB zero Sign ext. 55:035 Computer Architecture and Organization addr Data mem data 0 mux 1 21-25 Reg. File 26-31 t1 Shift left 2 1 mux 0 opcode EX/MEM ALU ID/EX ALU IF/ID 1 mux 0 4 Add Pipelined Datapath Executing lw 24 Store Instruction sw $t0, 1200 ($t1) 101011 01001 01000 0001 0010 0000 0000 opcode $t1 $t0 1200 IF ID read $t1 sign ext 1200 EX MEM add write $t1+1200 M[addr] (addr) ← $t0 55:035 Computer Architecture and Organization REG. FILE WRITE CC5 MEM/WB DM CC4 EX/MEM CC3 ALU ID, REG. FILE READ ID/EX CC2 IF/ID IM CC1 WB 25 16-20 t0 zero $t1 $t0 ALU PC Instr mem Reg. File 26-31 t1 Sign ext. 11-15 for R-type 16-20 for I-type lw MEM/WB addr Data mem 0 mux 1 Shift left 2 opcode 21-25 EX/MEM 1 mux 0 ID/EX ALU IF/ID 1 mux 0 4 Add Pipelined Datapath Executing sw data 0-15 1200 55:035 Computer Architecture and Organization 26 Executing a Program Consider a five-instruction segment: lw sub add lw add $10, 20($1) $11, $2, $3 $12, $3, $4 $13, 24($1) $14, $5, $6 55:035 Computer Architecture and Organization 27 55:035 Computer Architecture and Organization ALU EX/MEM ID, REG. FILE READ ID/EX REG. FILE WRITE MEM/WB REG. FILE WRITE lw $10, 20($1) sub $11, $2, $3 Program instructions REG. FILE WRITE MEM/WB DM REG. FILE WRITE MEM/WB DM CC5 REG. FILE WRITE MEM/WB DM EX/MEM DM ID/EX IF/ID ALU ALU ID, REG. FILE READ IM CC4 EX/MEM ID/EX IF/ID CC3 MEM/WB DM ALU ID, REG. FILE READ IM CC2 EX/MEM EX/MEM ID/EX IF/ID $14, $5, $6 ALU add ID, REG. FILE READ $13, 24($1) IM lw ID/EX $12, $3, $4 IF/ID add IM CC1 ID, REG. FILE READ IF/ID IM Program Execution time 28 CC5 ID/EX EX/MEM Add IF/ID 4 Shift left 2 opcode ALU IF: add $14, $5, $6 1 mux 0 MEM: ID: lw $13, 24($1) EX: add $12, $3, $4 sub $11, $2, $3 WB: lw $10, 20($1) MEM/WB 26-31 zero Data mem. 0 mux 1 ALU 16-20 1 mux 0 PC Instr mem Reg. File 21-25 Sign ext. 11-15 for R-type 16-20 for I-type lw 0-15 55:035 Computer Architecture and Organization 29 Advantages of Pipeline After the fifth cycle (CC5), one instruction is completed each cycle; CPI ≈ 1, neglecting the initial pipeline latency of 5 cycles. Pipeline latency is defined as the number of stages in the pipeline, or The number of clock cycles after which the first instruction is completed. The clock cycle time is about four times shorter than that of single-cycle datapath and about the same as that of multicycle datapath. For multicycle datapath, CPI = 3. …. So, pipelined execution is faster, but . . . 55:035 Computer Architecture and Organization 30 Pipeline Hazards Definition: Hazard in a pipeline is a situation in which the next instruction cannot complete execution one clock cycle after completion of the present instruction. Three types of hazards: Structural hazard (resource conflict) Data hazard Control hazard 55:035 Computer Architecture and Organization 31 Structural Hazard Two instructions cannot execute due to a resource conflict. Example: Consider a computer with a common data and instruction memory. The fourth cycle of a lw instruction requires memory access (memory read) and at the same time the first cycle of the fourth instruction requires instruction fetch (memory read). This will cause a memory resource conflict. 55:035 Computer Architecture and Organization 32 add $12, $3, $4 lw $13, 24($1) 55:035 Computer Architecture and Organization EX/MEM IM/DM ID/EX ALU REG. FILE WRITE REG. FILE WRITE MEM/WB IM/DM REG. FILE WRITE MEM/WB CC5 lw $10, 20($1) sub $11, $2, $3 Program instructions REG. FILE WRITE MEM/WB IM/DM MEM/WB ALU ID, REG. FILE READ EX/MEM IM/DM ALU CC4 EX/MEM ID/EX ID, REG. FILE READ EX/MEM ID/EX CC3 IF/ID IM/DM Common data and instr. Mem. IF/ID ALU ID, REG. FILE READ CC2 IM/DM ID/EX CC1 IF/ID IM/DM ID, REG. FILE READ IF/ID IM/DM Example of Structural Hazard time Needed by two instructions 33 Possible Remedies for Structural Hazards Provide duplicate hardware resources in datapath. Control unit or compiler can insert delays (no-op cycles) between instructions. This is known as pipeline stall or bubble. 55:035 Computer Architecture and Organization 34 lw $13, 24($1) 55:035 Computer Architecture and Organization Stall (bubble) Program instructions REG. FILE WRITE MEM/WB IM/DM EX/MEM REG. FILE WRITE lw $10, 20($1) sub $11, $2, $3 REG. FILE WRITE IM/DM ALU REG. FILE WRITE MEM/WB CC5 MEM/WB IM/DM EX/MEM ALU ID/EX ID, REG. FILE READ EX/MEM ID/EX CC4 MEM/WB ALU ID, REG. FILE READ CC3 IF/ID EX/MEM ID/EX IF/ID IM/DM ALU ID, REG. FILE READ CC2 IM/DM ID/EX $12, $3, $4 IF/ID IM/DM ID, REG. FILE READ CC1 IM/DM add IF/ID IM/DM Stall (Bubble) for Structural Hazard time 35 Data Hazard Data hazard means that an instruction cannot be completed because the needed data, being generated by another instruction in the pipeline, is not available. Example: consider two instructions: add $s0, $t0, $t1 sub $t2, $s0, $t3 # needs $s0 55:035 Computer Architecture and Organization 36 Example of Data Hazard time CC5 add $s0, $t0, $t1 sub $t2, $s0, $t3 Program instructions MEM/WB DM REG. FILE WRITE Write s0 in CC5 MEM/WB REG. FILE WRITE EX/MEM DM ALU EX/MEM ALU CC4 IF/ID ID, REG. FILE READ ID/EX IM CC3 ID/EX CC2 IF/ID ID, REG. FILE READ IM CC1 Read s0 and t3 in CC3 We need to read s0 from reg file in cycle 3 But s0 will not be written in reg file until cycle 5 However, s0 will only be used in cycle 4 And it is available at the end of cycle 3 55:035 Computer Architecture and Organization 37 Forwarding or Bypassing Output of a resource used by an instruction is forwarded to the input of some resource being used by another instruction. Forwarding can eliminate some, but not all, data hazards. 55:035 Computer Architecture and Organization 38 Forwarding for Data Hazard CC5 time add $s0, $t0, $t1 sub $t2, $s0, $t3 Program instructions DM REG. FILE WRITE REG. FILE WRITE MEM/WB MEM/WB Write s0 in CC5 EX/MEM DM EX/MEM CC4 ALU IF/ID IM ALU CC3 ID, REG. FILE READ ID/EX CC2 IF/ID ID, REG. FILE READ ID/EX IM CC1 Read s0 and t3 in CC3 55:035 Computer Architecture and Organization 39 Forwarding Unit Hardware Source reg. IDs from opcode MEM/WB Data Mem. MUX ALU FORW. MUX Data to reg. file EX/MEM FORW. MUX ID/EX Destination registers Forwarding Unit 55:035 Computer Architecture and Organization 40 Forwarding Alone May Not Work time CC5 lw $s0, 20($s1) sub $t2, $s0, $t3 Program instructions DM REG. FILE WRITE REG. FILE WRITE MEM/WB MEM/WB Write s0 in CC5 EX/MEM DM ALU EX/MEM ALU CC4 IF/ID ID, REG. FILE READ ID/EX IM CC3 ID/EX CC2 IF/ID ID, REG. FILE READ IM CC1 Read s0 and t3 in CC3 data needed by sub (data hazard) data available from memory only at the end of cycle 4 55:035 Computer Architecture and Organization 41 Use Bubble and Forwarding time CC5 MEM/WB REG. FILE WRITE ALU Write s0 in CC5 ID/EX DM CC4 EX/MEM ALU CC3 ID/EX CC2 IF/ID ID, REG. FILE READ IM CC1 lw $s0, 20($s1) 55:035 Computer Architecture and Organization Program instructions REG. FILE WRITE MEM/WB DM EX/MEM $t2, $s0, $t3 IF/ID ID, REG. FILE READ sub IM stall (bubble) 42 Hazard Detection Unit Hardware Source reg. IDs from opcode to reg. file ALU 0 EX/MEM FORW. MUX IF/ID PC Instruction Control ID/EX FORW. MUX Hazard Detection Unit NOP MUX Disable write Forwarding Unit 55:035 Computer Architecture and Organization MEM/WB Data Mem. Control signals 43 Resolving Hazards Hazards are resolved by Hazard detection and forwarding units. Compiler’s understanding of how these units work can improve performance. 55:035 Computer Architecture and Organization 44 Avoiding Stall by Code Reorder C code: A = B + E; C = B + F; MIPS code: lw $t1, lw $t2, add $t3, sw $t3, lw $t4, add $t5, sw $t5, 0($t0) 4($t0) $t1, $t2 12($t0) 8($t0) $t1, $t4 16,($t0) . . . . . . . . . . . . . . . $t1 written $t2 written $t1, $t2 needed . . . . . 55:035 Computer Architecture and Organization . . . . . . . . . . $t4 written $t4 needed . . . . . 45 Reordered Code C code: A = B + E; C = B + F; MIPS code: lw $t1, lw $t2, lw $t4, add $t3, sw $t3, add $t5, sw $t5, 0($t0) 4($t0) 8($t0) $t1, $t2 12($t0) $t1, $t4 16,($t0) no hazard no hazard 55:035 Computer Architecture and Organization 46 Control Hazard Instruction to be fetched is not known! Example: Instruction being executed is branchtype, which will determine the next instruction: add $4, $5, $6 beq $1, $2, 40 next instruction ... 40 and $7, $8, $9 55:035 Computer Architecture and Organization 47 beq $1, $2, 40 Program instructions CC5 MEM/WB REG. FILE WRITE DM add EX/MEM ALU MEM/WB REG. FILE WRITE MEM/WB REG. FILE WRITE CC4 DM EX/MEM Stall (bubble) IF/ID ID, REG. FILE READ ID/EX DM CC3 ALU EX/MEM ALU CC2 IM next instruction or and $7, $8, $9 IF/ID ID, REG. FILE READ ID/EX CC1 IM IF/ID ID, REG. FILE READ ID/EX IM Stall on Branch time $4, $5, $6 Why Only One Stall? Extra hardware in ID phase: Additional ALU to compute branch address Comparator to generate zero signal Hazard detection unit writes the branch address in PC 55:035 Computer Architecture and Organization 49 Ways to Handle Branch Stall or bubble Branch prediction: Heuristics Next instruction Prediction based on statistics (dynamic) Hardware decision (dynamic) Prediction error: pipeline flush Delayed branch 55:035 Computer Architecture and Organization 50 Delayed Branch Example Stall on branch add $4, $5, $6 beq $1, $2, skip next instruction ... skip or $7, $8, $9 Delayed branch beq $1, $2, skip add $4, $5, $6 next instruction ... skip or $7, $8, $9 Instruction executed irrespective of branch decision 55:035 Computer Architecture and Organization 51 next instruction or skip or $7, $8, $9 55:035 Computer Architecture and Organization REG. FILE WRITE REG. FILE WRITE Program instructions REG. FILE WRITE MEM/WB DM MEM/WB DM CC5 MEM/WB DM EX/MEM ALU EX/MEM ALU ID, REG. FILE READ EX/MEM ALU ID/EX CC4 ID/EX ID/EX ID, REG. FILE READ IF/ID ID, REG. FILE READ CC3 IF/ID IM add $4, $5, $6 CC2 IM beq $1, $2, skip IF/ID CC1 IM Delayed Branch time 52 Summary: Hazards Structural hazards Data hazards Cause: resource conflict Remedies: (i) hardware resources, (ii) stall (bubble) Cause: data unavailablity Remedies: (i) forwarding, (ii) stall (bubble), (iii) code reordering Control hazards Cause: out-of-sequence execution (branch or jump) Remedies: (i) stall (bubble), (ii) branch prediction/pipeline flush, (iii) delayed branch/pipeline flush 55:035 Computer Architecture and Organization 53