ELEC 5200-001/6200-001 Computer Architecture and Design
Spring 2016
Pipeline Control and Performance (Chapter 6)

Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu

Spr 2016, Mar 9 . . . ELEC 5200-001/6200-001 Lecture 7

Pipelined Datapath (without Jump)
[Figure: five-stage pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Labeled elements: PC, instruction memory, register file, sign extension, shift left 2, ALU and adders, data memory, and multiplexers. Instruction fields: opcode (bits 26-31), register fields (bits 21-25 and 16-20), immediate (bits 0-15). The destination register is bits 16-20 for I-type lw and bits 11-15 for R-type.]

Mem. and Reg. File Need Controls
[Figure: same datapath with RegWrite added at the register file, and MemRead and MemWrite added at the data memory.]

Multiplexers Need Controls
[Figure: same datapath with the signals PCSrc, Branch, ALUSrc, RegDst and MemtoReg added.]

ALU Needs a Control
[Figure: same datapath with the ALU control block added; its inputs are ALUOp and the funct field (bits 0-5).]

Compare with Single-Cycle Control
Control signals are the same as those needed for a single-cycle datapath.
Control signals are generated from the opcode in the ID (instruction decode) cycle and then distributed to the other cycles.
Let us reexamine the implementation of the single-cycle control (slides 19-21 of Lecture 5).

Hardwired CU: Single-Cycle
Implemented by combinational logic.
[Figure: the 6-bit opcode drives the control logic, which sends control signals to the datapath and a 2-bit ALUOp to the ALU control. The ALU control also takes the 6-bit funct. code and sends a 3-bit code to the ALU.]

Single-cycle datapath
[Figure: single-cycle datapath with a CONTROL block driven by the opcode (bits 26-31). Control outputs: RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite. The ALU control takes ALUOp and funct bits 0-5. The jump address is formed from bits 0-25 shifted left 2.]

Single-Cycle Control Logic
Inputs: instruction type and opcode (instruction bits 31 30 29 28 27 26). Outputs: ten control signals. X = don't care.

Instr.  Opcode   RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
type    (31-26)
R       000000   1       0       0         1         0        0         0       1       0       0
lw      100011   0       1       1         1         1        0         0       0       0       0
sw      101011   X       1       X         0         0        1         0       0       0       0
beq     000100   X       0       X         0         0        0         1       0       1       0
J       000010   X       X       X         0         X        0         X       X       X       1

Single-Cycle Control Circuit
[Figure: logic circuit with inputs Op5, Op4, Op3, Op2, Op1, Op0; decoded instruction lines R, lw, sw, beq, J; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0, Jump.]

ALU Control Logic
Inputs: ALUOp1 and ALUOp0 from the CU, and the funct. code from IR (bits 0-5). Output to ALU: 3-bit code. X = don't care.

Instr.   ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  3-bit code  Operation
type
lw, sw   0       0       X   X   X   X   X   X   010         Add
B        0       1       X   X   X   X   X   X   110         Subtract
R        1       X       X   X   0   0   0   0   010         Add
R        1       X       X   X   0   0   1   0   110         Subtract
R        1       X       X   X   0   1   0   0   000         AND
R        1       X       X   X   0   1   0   1   001         OR
R        1       X       X   X   1   0   1   0   111         slt

ALU Control
[Figure: ALU control block. Inputs ALUOp1 and ALUOp0 from the control circuit and funct bits F3, F2, F1, F0; output is the 3-bit operation select to the ALU, which produces zero, result and overflow.]
Operation select and ALU function:
000  AND
001  OR
010  Add
110  Subtract
111  Set on less than
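
The two truth tables above can be written out as simple lookups. The following Python sketch is illustrative only: the names SIGNALS, CONTROL_TABLE, main_control and alu_control are hypothetical, the table values are copied from the Single-Cycle Control Logic and ALU Control Logic slides, and don't-cares are kept as the character "X" rather than being resolved to 0 or 1.

```python
# Output signal order used on the Single-Cycle Control Logic slide.
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0", "Jump")

# Opcode (instruction bits 31..26) -> control outputs; "X" is a don't-care.
CONTROL_TABLE = {
    "000000": "1001000100",  # R-type
    "100011": "0111100000",  # lw
    "101011": "X1X0010000",  # sw
    "000100": "X0X0001010",  # beq
    "000010": "XXX0X0XXX1",  # j
}

# R-type decode: funct bits F3..F0 -> 3-bit ALU operation code.
FUNCT_TO_ALU = {
    "0000": "010",  # add
    "0010": "110",  # subtract
    "0100": "000",  # AND
    "0101": "001",  # OR
    "1010": "111",  # slt
}


def main_control(opcode):
    """Return a dict of control signals for a 6-bit opcode string."""
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))


def alu_control(alu_op, funct):
    """alu_op is 'ALUOp1 ALUOp0' as a 2-char string; funct is F5..F0."""
    if alu_op == "00":          # lw, sw: address calculation
        return "010"
    if alu_op == "01":          # beq: compare by subtraction
        return "110"
    return FUNCT_TO_ALU[funct[2:]]  # R-type: F5 and F4 are don't-cares


lw = main_control("100011")
print(lw["MemRead"], lw["RegWrite"])   # prints: 1 1
print(alu_control("10", "100010"))     # sub -> prints: 110
```

A hardware implementation would be combinational logic, as the slides state; a dictionary is used here only because it mirrors the truth table row by row.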

Returning to Pipelined Control
Opcode input to control is supplied by the pipeline register IF/ID in the ID (instruction decode) cycle.
Nine control signals are generated in the ID cycle, but none is used there. They are saved in the pipeline register ID/EX.
ALUSrc, RegDst and ALUOp (2 bits) are used in the EX (execute) cycle. The remaining 5 control signals are saved in the pipeline register EX/MEM.
Branch, MemWrite and MemRead are used in the MEM (memory access) cycle. The remaining 2 control signals are saved in the pipeline register MEM/WB.
MemtoReg and RegWrite are used in the WB (write back) cycle.
Pipelined control is shown without Jump.

Placing Control in Pipelined Datapath
[Figure: pipelined datapath with the CONTROL block placed in the ID stage. Control outputs travel through ID/EX, EX/MEM and MEM/WB: ALUSrc, ALUOp and RegDst are used in EX; Branch, MemRead and MemWrite in MEM; MemtoReg and RegWrite in WB.]

Highlighted Pipelined Control
[Figure: the same datapath with the control paths highlighted.]

Single-Cycle Performance
Assume:
  200 ps for memory access
  100 ps for ALU operation
  50 ps for register file read or write
Cycle time is set according to the longest instruction:
  lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite = 200 + 50 + 100 + 200 + 50 = 600 ps
Av. instruction execution time = clock cycle time = 600 ps

Multicycle Performance
Consider the SPECINT2000* instruction mix:

Instruction   Mix    Cycles
lw            25%    5
sw            10%    4
branch        11%    3
jump           2%    3
ALU instr.    52%    4

Av. CPI = 0.25×5 + 0.10×4 + 0.11×3 + 0.02×3 + 0.52×4 = 4.12
Clock cycle time determined from longest operation (memory access) = 200 ps
Av. instruction execution time = 4.12×200 = 824 ps
*Set of benchmark programs used for performance evaluation, to be discussed in a later lecture.

Pipeline Performance
Neglect initial latency (reasonable for long programs).
One instruction is completed every clock cycle unless delayed by a hazard.
Average CPI:

Instruction   Cycles
lw            1.5    (2 cycles in 50% cases due to hazard)
sw            1
ALU           1
branch        1.25   (2 cycles in 25% cases due to hazard)
jump          2

For SPECINT2000, Av. CPI = 0.25×1.5 + 0.10×1 + 0.11×1.25 + 0.02×2.0 + 0.52×1 = 1.17
Clock cycle time (longest operation: memory access) = 200 ps
Av. instruction execution time = 1.17×200 = 234 ps

Comparing Alternatives

Type of datapath   Clock cycle   Average   Av. instruction
and control        time          CPI       execution time
Single-cycle       600 ps        1.00      600 ps
Multicycle         200 ps        4.12      824 ps
Pipelined          200 ps        1.17      234 ps

Other Controls for Pipeline
Forwarding
Stall
Branch hazard and branch prediction
Instruction flush
Exceptions

Forwarding
Consider a data hazard:
  sub $2, $1, $3    # computes result in CC3, writes in $2 in CC5
  and $12, $2, $5   # reads $2 in CC3, executes in CC4
[Figure: pipeline timing diagram over CC1 to CC8. sub passes through IF, ID, EX, MEM and WB in CC1 to CC5; and follows one cycle later.]
CC3: ALU saves new data in EX/MEM, to be written to $2 in CC5.
CC3: "and" reads $2 to ID/EX, but the correct data is in EX/MEM.
CC4: forwarding allows execution of "and" with correct data.

Understanding Forwarding
Let's ask the following questions:
Q: Why is there a hazard?
A: Source register for the present instruction is the same as the destination register of the previous instruction.
Q: When is the source register data needed?
A: In the execute cycle (CC4).
Q: Is source register data available in CC4?
A: If yes, use forwarding. If no, use stall.
Q: Where is the required data in CC4?
A: In the pipeline register EX/MEM as ALU output.

Forwarding Hardware
A forwarding unit is added to the execute (ALU) cycle hardware.
Functions of forwarding unit:
– Hazard detection
– Forward correct data to ALU
Inputs to forwarding unit:
– Source registers of the instruction in EX
– Destination registers of instructions in DM and WB
Outputs of forwarding unit: multiplexer controls to route correct data to the ALU.

Recall Register Definitions
R-type instruction (add, sub, and, or, . . . ):   opcode | Rs | Rt | Rd | shamt | funct
I-type instruction (beq, lw, sw, addi, . . . ):   opcode | Rs | Rt | constant_or_address
J-type instruction (j, jal):                      opcode | address
where
  Rs is the first source register
  Rt is the second source register
  Rd is the destination register

Forwarding Implemented
[Figure: pipelined datapath with a forwarding unit in the EX stage. The unit receives Rs and Rt from ID/EX and Rd from EX/MEM and MEM/WB, and controls two multiplexers at the ALU inputs.]

Stall
Delay the next instruction by sending a nop through the pipeline.
Necessary when a hazard is not resolved by forwarding.
  lw  $2, 20($1)
  and $4, $2, $5
[Figure: pipeline timing diagram over CC1 to CC6. lw passes through IM, register file read, ALU, DM and register file write in CC1 to CC5; and follows one cycle later.]
CC4: new data in MEM/WB, to be written to $2.
CC4: execution of "and" is impossible; correct data unavailable until end of CC4.

Detecting Hazard Requiring Stall
Consider the instruction in IF/ID being decoded:
If
  the previous instruction (lw) activated MemRead, and
  the instruction being decoded has a source register (Rs or Rt) that is the same as the destination register (Rt for lw) of the previous instruction,
then stall the pipeline:
  Force all control outputs to 0
  Prevent PC from changing
  Prevent IF/ID from changing

Stall Implementation
[Figure: forwarding datapath with a hazard detection unit added in the ID stage. Labeled inputs: MemRead and Rt from ID/EX, Rs and Rt from IF/ID. Labeled outputs: PCWrite, IF/IDWrite, and a multiplexer that replaces the control outputs with 0.]

Stall
Execution with stall and forwarding:
  lw  $2, 20($1)
  and $4, $2, $5
  next
[Figure: pipeline timing diagram over CC1 to CC7 with a bubble (nop) inserted after lw.]
State of IF/ID is frozen in CC3.
"next" is fetched twice since PC was frozen.
CC4: new data in MEM/WB, to be written to $2.

Branch Hazard
Consider the heuristic: branch not taken.
Continue fetching instructions in sequence following the branch instruction.
If branch is taken (indicated by zero output of ALU):
– Control generates branch signal in ID cycle.
– Branch activates the PCSrc signal in the MEM cycle to load PC with the new branch address.
– Three instructions in the pipeline must be flushed if branch is taken. Can this penalty be reduced?

Branch Hazard
[Figure: pipelined datapath with control, with beq marked at the instruction memory; the branch decision (Branch, zero, PCSrc) is made in the MEM stage.]

Branch Not Taken
Instruction sequence: branch on condition to Z, then A, B, C, D, . . . , Z.
cycle b:    Branch fetched
cycle b+1:  Branch decoded; A fetched
cycle b+2:  Branch decision; A decoded; B fetched
cycle b+3:  PC keeps D (br. not taken); A executed; B decoded; C fetched
cycle b+4:  A continues; B executed; C decoded; D fetched

Branch Taken
Instruction sequence: branch on condition to Z, then A, B, C, D, . . . , Z.
cycle b:    Branch fetched
cycle b+1:  Branch decoded; A fetched
cycle b+2:  Branch decision; A decoded; B fetched
cycle b+3:  PC gets Z (br. taken); A executed; B decoded; C fetched
cycle b+4:  Nop (A); Nop (B); Nop (C); Z fetched
Three-cycle penalty: three instructions are flushed if branch is taken.

Branch Penalty Reduction
[Figure: pipelined datapath with control, modified to reduce the branch penalty; beq marked at the instruction memory.]

Branch Taken
Instruction sequence: branch to Z, then A, B, C, D, . . . , Z.
cycle b:    Branch fetched
cycle b+1:  Branch decision, PC gets Z; A fetched
cycle b+2:  A flushed; Z fetched
cycle b+3:  Nop; Z decoded
cycle b+4:  Nop; Z executed
One-cycle penalty: one instruction is flushed if branch is taken.
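
The effect of the two flush penalties on the average cost of a branch can be estimated with a short calculation. This is a sketch under stated assumptions: branches are predicted not taken, each flushed instruction costs one extra cycle, and the names FLUSHED_IF_TAKEN and branch_cpi are hypothetical. The 25% taken fraction is an illustrative choice that matches the hazard rate assumed for branches on the Pipeline Performance slide.

```python
# Instructions flushed when a branch is taken, keyed by the stage in which
# the branch decision is made (three for MEM, one for ID, per the slides).
FLUSHED_IF_TAKEN = {"MEM": 3, "ID": 1}


def branch_cpi(decision_stage, taken_fraction):
    """Average cycles per branch: one base cycle plus the flush penalty,
    weighted by how often the branch is taken (prediction: not taken)."""
    return 1 + taken_fraction * FLUSHED_IF_TAKEN[decision_stage]


print(branch_cpi("MEM", 0.25))  # prints: 1.75
print(branch_cpi("ID", 0.25))   # prints: 1.25
```

With the decision moved to the ID stage, the result agrees with the 1.25 cycles per branch used in the pipeline CPI estimate.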

Pipeline Flush
If branch is taken (as indicated by zero), then control does the following:
– Change all control signals to 0, similar to the case of stall for data hazard, i.e., insert a bubble in the pipeline.
– Generate a signal IF.Flush that changes the instruction in the pipeline register IF/ID to 0 (nop).
Penalty of branch hazard is reduced by
– Adding branch detection and address generation hardware in the decode cycle (one bubble needed): next-address generation logic in the decode stage writes PC+4, the branch address, or the jump address into PC.
– Using branch prediction.
– Unrolling loops.

Branch Prediction
Useful for program loops.
A one-bit prediction scheme: a one-bit buffer carries a "history bit" that tells what happened on the last branch instruction.
  History bit = 1, branch was taken
  History bit = 0, branch was not taken
[Figure: two-state diagram. State 1 (predict branch taken) stays in 1 on taken and moves to 0 on not taken. State 0 (predict branch not taken) stays in 0 on not taken and moves to 1 on taken.]

Branch Prediction
[Figure: branch prediction buffer. Low-order bits of the PC are used as an index into a table holding addresses of recent branch instructions, target addresses and history bit(s). A comparator (=) and prediction logic drive a 0/1 multiplexer that selects the next PC from PC+4 or the stored target address.]

Branch Prediction for a Loop
Loop:
  a:  I = 0
  b:  I = I + 1
  c:  X = X + R(I)
  d:  I – 10 = 0?   N: go to b.   Y: go to e.
  e:  Store X in memory

Execution of instruction d:

Execution  Old hist.  Pred. next  I    Act. next  New hist.  Prediction
seq.       bit        instr.           instr.     bit
1          0          e           1    b          1          Bad
2          1          b           2    b          1          Good
3          1          b           3    b          1          Good
4          1          b           4    b          1          Good
5          1          b           5    b          1          Good
6          1          b           6    b          1          Good
7          1          b           7    b          1          Good
8          1          b           8    b          1          Good
9          1          b           9    b          1          Good
10         1          b           10   e          0          Bad

h.bit = 0 branch not taken, h.bit = 1 branch taken.

Prediction Accuracy
One-bit predictor:
  2 errors out of 10 predictions
  Prediction accuracy = 80%
To improve prediction accuracy, use a two-bit predictor:
  A prediction must be wrong twice before it is changed
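
The accuracy figures can be checked by simulating the predictors on the loop's branch outcomes. The Python sketch below is a minimal illustration, not the lecture's implementation: the function names are hypothetical, the one-bit predictor starts with history bit 0 as in the table, and the two-bit predictor is assumed to be a saturating counter (0 to 3, predict taken when the counter is 2 or 3) that starts in state 10.

```python
def one_bit_misses(outcomes, history=0):
    """Count mispredictions of a one-bit predictor.
    outcomes: list of booleans, True means the branch was taken."""
    misses = 0
    for taken in outcomes:
        if (history == 1) != taken:
            misses += 1
        history = 1 if taken else 0  # remember only the last outcome
    return misses


def two_bit_misses(outcomes, counter=2):
    """Count mispredictions of a two-bit saturating counter (assumed
    encoding: 0 and 1 predict not taken, 2 and 3 predict taken)."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Instruction d branches back to b nine times, then falls through to e.
loop = [True] * 9 + [False]
print(one_bit_misses(loop))  # prints: 2  (80% accuracy)
print(two_bit_misses(loop))  # prints: 1  (90% accuracy)
```

If the same loop is entered again, the one-bit predictor mispredicts the first iteration again because its history bit was left at 0, while the two-bit counter, left in state 10, still predicts taken.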

Two-Bit Prediction Buffer
Implemented as a two-bit counter.
Can improve correct prediction statistics.
[Figure: four-state diagram. States 11 and 10 predict branch taken; states 01 and 00 predict branch not taken. Arcs between states are labeled taken and not taken.]

Branch Prediction for a Loop
Loop:
  1:  I = 0
  2:  I = I + 1
  3:  X = X + R(I)
  4:  I – 10 = 0?   N: go to 2.   Y: go to 5.
  5:  Store X in memory

Execution of instruction 4:

Execution  Old pred.  Pred. next  I    Act. next  New pred.  Prediction
seq.       buf        instr.           instr.     buf
1          10         2           1    2          11         Good
2          11         2           2    2          11         Good
3          11         2           3    2          11         Good
4          11         2           4    2          11         Good
5          11         2           5    2          11         Good
6          11         2           6    2          11         Good
7          11         2           7    2          11         Good
8          11         2           8    2          11         Good
9          11         2           9    2          11         Good
10         11         2           10   5          10         Bad

Exceptions
A typical exception occurs when the ALU produces an overflow signal.
Control asserts the following actions on exception:
– Change the PC address to 4000 0040hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.
– Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush:
    Set IF/ID to 0 (nop).
    Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.
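
The exception actions listed above can be summarized as one state update. This Python sketch is only an illustration: the pipeline is modeled as a hypothetical dictionary (keys pc, if_id, id_ex_ctrl, ex_mem_ctrl), and the function name take_overflow_exception is made up for this example. The handler address is the one given on the slide.

```python
EXCEPTION_VECTOR = 0x40000040  # location of the exception routine (from the slide)


def take_overflow_exception(pipe):
    """Return the pipeline state after an ALU overflow is detected in EX.
    pipe is a hypothetical dict holding the PC, the IF/ID instruction word,
    and the control bits stored in ID/EX and EX/MEM."""
    def zeroed(ctrl):
        return {name: 0 for name in ctrl}

    return {
        "pc": EXCEPTION_VECTOR,                       # extra PC multiplexer input
        "if_id": 0,                                   # IF/ID set to 0 (nop)
        "id_ex_ctrl": zeroed(pipe["id_ex_ctrl"]),     # ID.Flush
        "ex_mem_ctrl": zeroed(pipe["ex_mem_ctrl"]),   # EX.Flush
    }


before = {
    "pc": 0x00400020,
    "if_id": 0x00221820,
    "id_ex_ctrl": {"RegWrite": 1, "MemRead": 0, "MemWrite": 0},
    "ex_mem_ctrl": {"RegWrite": 1, "MemtoReg": 0},
}
after = take_overflow_exception(before)
print(hex(after["pc"]), after["ex_mem_ctrl"]["RegWrite"])  # prints: 0x40000040 0
```

Clearing RegWrite in EX/MEM is what keeps the overflowed result from reaching the register file in the WB cycle.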