Enhancing Performance with Pipelining

MIPS Pipelining
Chapter 4, Sections 4.5 – 4.8
Dr. Iyad F. Jafar

Outline
- Introduction
- Why Pipelining?
- MIPS Pipelined Datapath
- MIPS Pipelined Control
- Pipelining Hazards
  - Structural Hazards
  - Data Hazards
  - Control Hazards
- Exceptions and Interrupts
- Fallacies and Pitfalls
- Reading Assignment

Introduction
- Single-cycle datapath
  - Simple!
  - Hardware replication?
  - Cycle time?
- Multi-cycle datapath
  - More involved
  - Less hardware replication of major units
  - Better performance if the delays of the major functional units are balanced!
- Can we do any better?
  - Pipelining!

Introduction: Pipelining
- In multi-cycle, only one major unit is used in each cycle while the other units are idle!
- Why not use them to do something else?
- Basically, start the next instruction before the current one is finished!

[Figure: LW, SW, and an R-type instruction overlapped across cycles 1 to 8, each going through IFetch, Dec, Exec, Mem, WB.]

- The time required to execute one instruction (instruction latency) is not affected!
- However, the number of instructions finished per unit time (throughput) is increased.
- Thus, pipelining improves throughput, not latency!
- Most modern processors are pipelined!
- Notes
  - As in multi-cycle, the cycle time is determined by the slowest unit!
  - However, similar to single-cycle, we can get one instruction done every cycle!
  - It is assumed that all instructions take the same number of cycles!

[Figure: timing of lw, sw, and an R-type instruction under the single-cycle implementation (with wasted time in the cycle), the multiple-cycle implementation (cycles 1 to 10), and the pipeline implementation.]

Why Pipelining?
- For performance!
[Figure: Inst 1 to Inst 5 in program order against time (clock cycles), each using IM, Reg, ALU, DM, Reg; the first cycles are the time to fill the pipeline.]
- Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 (similar to single-cycle).

Why Pipelining? Example 1. Comparing pipelining to single-cycle
Consider a program that consists only of a large number of LOAD instructions, executed on a single-cycle CPU and on a 5-stage pipelined CPU. The operation time of the major units (memory, ALU, and register file) is 200 ps in both cases.
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
2) Determine the time required to finish executing the first 3 LOAD instructions.
3) Repeat (1) and (2) if the delay of the register file is 100 ps instead of 200 ps.

Cycle times for the two implementations:
    CC_SC = 200 + 200 + 200 + 200 + 200 = 1000 ps
    CC_PP = 200 ps

1) Time required to finish executing 1,000,000 LOAD instructions and the speedup of pipelining:
    Single-cycle: Time_SC = 1000 ps x 1,000,000 = 1,000,000,000 ps
    Pipelining:   Time_PP = 1000 ps + 200 ps x 999,999 = 200,000,800 ps
    Speedup = 1,000,000,000 / 200,000,800 = 4.99998 (very close to the number of stages)

2) Time required to finish executing the first 3 LOAD instructions and the speedup of pipelining:
    Single-cycle: Time_SC = 1000 x 3 = 3000 ps
    Pipelining:   Time_PP = 200 x 5 + 200 + 200 = 1400 ps
    Speedup = 3000 / 1400 = 2.14 (less than the number of stages)

3) Repeat (1) and (2) if the delay of the register file is 100 ps.
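The arithmetic in parts (1) and (2) follows one pattern: the single-cycle machine spends one long cycle per instruction, while the pipeline needs as many cycles as it has stages for the first instruction and one more cycle for each later one. A minimal Python sketch of that pattern, assuming ideal conditions with no hazards; the function names are illustrative and do not come from the slides:

```python
def single_cycle_time(stage_delays_ps, num_instructions):
    """Single-cycle: the clock period is the sum of all stage delays."""
    return sum(stage_delays_ps) * num_instructions


def pipelined_time(stage_delays_ps, num_instructions):
    """Ideal pipeline: the clock period is the slowest stage; the first
    instruction takes one cycle per stage and each later instruction
    completes one cycle after the previous one."""
    cycle = max(stage_delays_ps)
    cycles = len(stage_delays_ps) + num_instructions - 1
    return cycle * cycles


stages = [200, 200, 200, 200, 200]  # IF, ID, EX, MEM, WB delays in ps
for m in (3, 1_000_000):
    t_sc = single_cycle_time(stages, m)
    t_pp = pipelined_time(stages, m)
    print(m, t_sc, t_pp, round(t_sc / t_pp, 5))
```

With the 200 ps delays this reproduces the 1400 ps and 200,000,800 ps pipelined times from parts (2) and (1). Changing the entries of `stages` lets the same helper be reused for other unit delays.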
    CC_SC = 200 + 100 + 200 + 200 + 100 = 800 ps
    CC_PP = 200 ps

    For 1,000,000 instructions:
        Time_SC = 800 x 1,000,000 = 800,000,000 ps
        Time_PP = 1000 + 200 x 999,999 = 200,000,800 ps
        Speedup = 800,000,000 / 200,000,800 = 3.99998 (< 5)

    For 3 instructions:
        Time_SC = 800 x 3 = 2400 ps
        Time_PP = 1000 + 200 x 2 = 1400 ps
        Speedup = 2400 / 1400 = 1.71 (< 5)

Why Pipelining? Example 1. Summary
- Ideally, the pipelined implementation is n times faster than the single-cycle one, where n is the number of pipeline stages. In the 5-stage MIPS, the pipelined version would be 5 times faster.
- When the pipeline is full, the throughput is one instruction per cycle.
- Many factors affect pipelining performance:
  - Time to fill and empty the pipeline
  - Number of instructions to execute
  - Unbalanced delay of pipeline stages
  - Instruction mix
  - Pipeline hazards
- Ideally, the number of cycles required to finish M instructions in an N-stage pipeline is N + M – 1.

Pipelined MIPS Datapath
What do we need to implement pipelining? We need to consider the following:
1. The execution of instructions is divided into 5 stages (cycles): Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write Back (WB).
2. Instruction flow is from left to right except in two cases:
   - In the write-back stage, where the result is written into the register file in the middle of the datapath
   - Choosing between the incremented PC and the branch address in the MEM stage
3. In pipelining, all units are operating in every cycle; thus we have to duplicate hardware where needed.
4. Since the execution is over multiple cycles, we need to add state (pipeline) registers between stages to preserve intermediate data and control for each instruction.
   - These registers hold the values to be used in later stages as long as they are needed.

[Figure: pipelined MIPS datapath with stages IF, ID, EX, MEM, WB separated by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB, driven by the system clock.]
- Any problem?
[Figure: pipelined MIPS datapath with the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline registers.]
- Need to preserve the destination register!

Pipelined MIPS Datapath: Example 2. Execution of the LW instruction
(1) Instruction Fetch: put the PC and the fetched instruction in the IF/ID register.
(2) Instruction Decode and Read Registers: store Reg[rs], Reg[rt], the sign-extended offset, rd, rt, and the updated PC (why?) in the ID/EX register.
(3) Execute or Address Calculation: store the branch address, Reg[rt], the ALU result, and the Zero flag in the EX/MEM register.
(4) Memory Access: store the data read from memory in the MEM/WB register.
(5) Write Back: copy the loaded data from the MEM/WB register to the register file.

Pipelined MIPS Datapath: Required data fields in the pipeline registers
- Data fields are moved from one pipeline register to another every clock cycle until they are no longer needed.

Pipeline Register   Register Size   Data Fields
IF/ID               64 bits         Instruction and PC
ID/EX               138 bits        PC, Reg[rs], Reg[rt], sign-extended offset, rt, rd
EX/MEM              102 bits        Branch address, Zero, ALU result, Reg[rt], destination register address (rt or rd)
MEM/WB              69 bits         ALU result, data from memory, destination register address

Pipelined MIPS Control
- All control signals can be determined during the Decode stage, while they are needed in later stages!
- Solution!
  - Expand the pipeline registers to store and move the control signals between stages until they are needed.
- Define the control signals and generate them in the decode stage.
- For the time being, no explicit write signals are required for the pipeline registers since they are updated every cycle.

Control signals needed in each stage:

Pipeline Stage   Control Signals
IF               None
ID               None
EX               RegDst, ALUOp1, ALUOp0, ALUSrc
MEM              Branch, MemRead, MemWrite
WB               MemtoReg, RegWrite

[Table: control signal values based on instruction type.]

MIPS Pipeline: Example 3
Given the code segment and the register contents below, show the contents of the data and control fields in the pipeline registers when the sixth instruction has been fetched (i.e. at the beginning of cycle 7).

Address      Instruction
0x00000000   lw  $10, 20($1)
0x00000004   sub $11, $1, $2
0x00000008   add $12, $3, $4
0x0000000c   lw  $13, 24($1)
0x00000010   add $3, $2, $1
0x00000014   sub $1, $5, $6

Register   Contents
$1         1
$2         5
$3         3
$4         -6
$5         2
$6         7
$11        12
$12        -15
$13        10

[Figure: Example 3, multi-cycle diagram of the six instructions in program order against time, each using IM, Reg, ALU, DM, Reg.]
[Figure: Example 3, single-cycle diagram showing sub $1,$5,$6; add $3,$2,$1; lw $13,24($1); add $12,$3,$4; and sub $11,$1,$2 in the pipeline.]

At the beginning of cycle 7, the sixth instruction is stored in the IF/ID register, while the data and control for earlier instructions have been pushed to the next pipeline registers and the register file. Thus:

IF/ID register
- No control signals are stored
- Stores the instruction sub $1,$5,$6 and PC+4
  - IF/ID.Instruction = 0x00A60822
  - IF/ID.PC = 0x00000018
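The IF/ID.Instruction value can be checked by packing the R-type fields by hand. The sketch below assumes the standard MIPS R-type layout (opcode 0, then rs, rt, rd, shamt, funct) and the standard funct codes 0x22 for sub and 0x20 for add; the helper name is illustrative and not part of the slides.

```python
def encode_r_type(rs, rt, rd, funct, shamt=0):
    """Pack MIPS R-type fields: opcode(6)=0 | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


SUB_FUNCT = 0x22  # funct code for sub
ADD_FUNCT = 0x20  # funct code for add

sub_word = encode_r_type(rs=5, rt=6, rd=1, funct=SUB_FUNCT)  # sub $1,$5,$6
add_word = encode_r_type(rs=2, rt=1, rd=3, funct=ADD_FUNCT)  # add $3,$2,$1

print(f"0x{sub_word:08X}")           # instruction word held in IF/ID
print(f"0x{add_word & 0xFFFF:08X}")  # low 16 bits, the input of the sign-extend unit
```

The first value matches IF/ID.Instruction. The second shows why the sign-extend field of an R-type instruction is not empty: the low 16 bits of add $3,$2,$1 hold rd, shamt, and funct, which is where the ID/EX.SignExtend value listed for that instruction comes from.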
ID/EX register
- Stores the information of add $3,$2,$1 and PC+4
  - ID/EX.PC = 0x00000014
  - ID/EX.RegRsContents = 0x00000005
  - ID/EX.RegRtContents = 0x00000001
  - ID/EX.RegRt = (00001)2
  - ID/EX.RegRd = (00011)2
  - ID/EX.SignExtend = 0x00001820
- Control information
  - ID/EX.MemToReg = 0
  - ID/EX.RegWrite = 1
  - ID/EX.MemRead = 0
  - ID/EX.MemWrite = 0
  - ID/EX.Branch = 0
  - ID/EX.ALUSrc = 0
  - ID/EX.RegDst = 1
  - ID/EX.ALUOp = (10)2

EX/MEM register
- Stores the information of lw $13,24($1), the branch address, and the memory address
  - EX/MEM.BranchAddress = 0x00000070
  - EX/MEM.ALUOut = 0x00000019
  - EX/MEM.Zero = 0
  - EX/MEM.RegDestination = (01101)2
  - EX/MEM.RegRtContents = 0x0000000A
- Control information
  - EX/MEM.MemToReg = 1 (lw writes the memory data, not the ALU result, to the register file)
  - EX/MEM.RegWrite = 1
  - EX/MEM.MemRead = 1
  - EX/MEM.MemWrite = 0
  - EX/MEM.Branch = 0

MEM/WB register
- Stores the information of add $12,$3,$4, the addition result, and the data memory output
  - MEM/WB.RegDestination = (01100)2
  - MEM/WB.ALUOut = 0xFFFFFFFD
  - MEM/WB.MemoryData = XXXX
- Control information
  - MEM/WB.MemToReg = 0
  - MEM/WB.RegWrite = 1
- For sub $11,$1,$2
  - It writes (1 - 5) = -4 to $11

Pipelining Hazards
- In general, pipelining is effective!
- The MIPS ISA makes it even easier:
  - All instructions are of the same length (32 bits)
    - Can fetch the next instruction while the current one is being decoded
  - Few instruction formats with symmetry across them
    - Can read the register file in the 2nd stage
  - Memory access is only through the load and store instructions
    - Can use the execute stage to compute the address
  - Each MIPS instruction writes at most one result, in the MEM or WB stage
- Is it that easy? Any complications?
  - YES! PIPELINING HAZARDS!

Pipelining Hazards
- Hazards: problems that might occur during pipeline operation
- Three basic sources
  - Structural hazards
    - In pipelining, all functional units are used in any cycle
    - What if two instructions use the same functional unit in the same cycle?
  - Data hazards
    - In pipelining, execution of instructions is overlapped
    - What if the operand(s) of some instruction come from an earlier instruction that is still in the pipeline?
  - Control hazards
    - In pipelining, an instruction is fetched every cycle
    - What if an instruction is a jump, or a branch that evaluates to true? The following instruction(s) in the pipeline might not be correct.
- Simple solution?
  - Wait until the issue is resolved!

Structural Hazards
- Single memory!
  [Figure: lw followed by Inst 1 to Inst 4 in program order against time (clock cycles), all sharing one memory (Mem).]
  - Reading from memory twice in the same cycle!
  - Solution: use two memories, data and instruction!
- Single register file!
  [Figure: add $1, ... followed by Inst 1, Inst 2, and add $2,$1, ... against time (clock cycles), marking the clock edge that controls register writing and the clock edge that controls loading of the pipeline state registers.]
  - One instruction is writing and the other is reading the register file?
  - Solution: design the register file to write in the first half of the cycle and read in the second half!

Data Hazards
[Figure: add $1, ... followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5.]
- Dependencies backward in time cause hazards
- This is called a Read-after-Write (RAW) data hazard
- Register-use data hazard
- Solution?

Data Hazards
- Simply, wait for the earlier instruction to finish! This is called stalling the pipeline!
- However, this affects the CPI?
[Figure: add $1, ... followed by two stalls, then sub $4,$1,$5 and and $6,$1,$7.]
- Do we need two stalls all the time?
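On the question of whether two stalls are always needed: without forwarding, the count depends on how far the consumer is behind the producer. A minimal sketch, assuming the 5-stage pipeline, a register file that writes in the first half of the cycle and reads in the second half (the structural-hazard solution), and no forwarding; the function name is illustrative:

```python
def stalls_without_forwarding(distance):
    """Stall cycles for a RAW dependence when the consumer is `distance`
    instructions after the producer (1 = immediately following).

    The producer writes the register file in WB (its 5th cycle); the
    consumer reads in ID (its 2nd cycle) and may share that cycle with
    the write because writes happen in the first half of the cycle.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return max(0, 3 - distance)


for d in (1, 2, 3, 4):
    print(d, stalls_without_forwarding(d))
```

Under these assumptions only an immediately following consumer pays the full two cycles; one independent instruction in between reduces the cost to one stall, and two or more remove it.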
Data Hazards
[Figure: lw $1,5($s1) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5.]
- Dependencies backward in time cause hazards
- It is a Read-after-Write (RAW) data hazard
- Load-use data hazard
- Solution?

Data Hazards
- Again, wait for the LW instruction to finish by stalling the pipeline!
- However, this affects the CPI?
[Figure: lw $1, ... followed by two stalls, then sub $4,$1,$5 and and $6,$1,$7.]

Data Hazards: Example 4
How many cycles are actually required to execute the following code? Assume the pipeline is already full.

    add $1, $2, $5
    add $5, $3, $1
    sub $10, $7, $8
    sub $5, $6, $7
    lw  $3, 45($9)
    add $3, $3, $8

- Ideally, since the pipeline is full, each instruction requires 1 cycle. Thus, we need 6 cycles (CPI = 6/6 = 1). However:
  - Register-use data hazard (the second add reads $1 written by the first add): adds 2 cycles of stalls
  - Load-use data hazard (the last add reads $3 loaded by lw): adds 2 cycles of stalls
- Thus, 10 cycles are needed. CPI = 10/6 = 1.667
- ?? Performance ?? Can we do any better?

Data Hazards: Fixing the Register-use Hazard by Forwarding
- Note that data produced by an instruction and needed by a later instruction is pushed through the pipeline registers until it is saved into the register file!
- Why not read the data from the pipeline registers before it is stored?
- This is called forwarding!
- What is required?
  - Need to detect the hazard
    - Is any of the source registers of the instruction the same as the destination register of an earlier instruction that is still in the pipeline?
  - Need to create a path to pass the data between pipeline stages
    - Instead of reading the source registers of the instruction from the register file, read them from the pipeline registers

[Figure: forwarding for add $1, ... followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5.]
- No stalls!

Data Hazards: Forwarding Hardware Implementation
[Figure: forwarding hardware.]
- Note that forwarding could be from EX/MEM or from MEM/WB! Why?
- Inside the forwarding unit
  (1) Forwarding from EX/MEM (MEM stage)

      if (EX/MEM.RegWrite and (EX/MEM.RegRd != 0) and (EX/MEM.RegRd == ID/EX.RegRs)) then
          ForwardA = From EX/MEM
      if (EX/MEM.RegWrite and (EX/MEM.RegRd != 0) and (EX/MEM.RegRd == ID/EX.RegRt)) then
          ForwardB = From EX/MEM

      - Why check the RegWrite signal?
      - Why check for the zero register?

  (2) Forwarding from MEM/WB (WB stage)

      if (MEM/WB.RegWrite and (MEM/WB.RegRd != 0) and (MEM/WB.RegRd == ID/EX.RegRs)) then
          ForwardA = From MEM/WB
      if (MEM/WB.RegWrite and (MEM/WB.RegRd != 0) and (MEM/WB.RegRd == ID/EX.RegRt)) then
          ForwardB = From MEM/WB

Data Hazards
- Can the forwarding hardware be used with the load-use data hazard?
[Figure: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5.]
- We still need 1 stall for the instruction following the load.

Data Hazards: How to stall the pipeline?
- A stall is required when the instruction in the EX stage is a load and the one in the ID stage depends on the loaded value
- The load instruction moves normally to EX/MEM on the next cycle
- The conflicting instruction (the instruction following the load) should stay in the decode stage. How?
  - Don't write the IF/ID register -> need an IF/IDWrite signal
  - Don't update the PC -> need a PCWrite signal
  - The control signals of the instruction in the decode stage are stored as 0's (why?)
    in the ID/EX register -> need a multiplexer for the control signals
- Controlling the process requires a special unit: the Hazard Detection Unit

Data Hazards: Stall Implementation
[Figure: stall implementation.]
- Inside the hazard detection unit

      if (ID/EX.MemRead and ((ID/EX.RegRt == IF/ID.RegRs) or (ID/EX.RegRt == IF/ID.RegRt))) then
          PCWrite = 0
          IF/IDWrite = 0
          Select 0's as control signals

- Any problem? Do we need to stall in all cases? How about j and jal instructions that come immediately after a load, with their rs and/or rt fields being the same as the rt field of the load?

Data Hazards: Example 5
Consider the following code segment in C:

    A = B + E
    C = B + F

(1) Generate the MIPS code assuming that variables A, B, C, E, and F are in memory and addressable with offsets 0, 4, 8, 12, and 16 from $t0.
(2) Find all the data hazards and determine the number of cycles required to run the code. Assume forwarding is implemented.
(3) Can you reorder the code to reduce the stalls?

    lw  $t1, 4($t0)       # loads B
    lw  $t2, 12($t0)      # loads E
    add $t3, $t1, $t2     # A = B + E
    sw  $t3, 0($t0)       # stores A
    lw  $t4, 16($t0)      # loads F
    add $t5, $t1, $t4     # C = B + F
    sw  $t5, 8($t0)       # stores C

- Ideally, each instruction requires 1 cycle after the pipeline is full. Thus, we need (5 + 7 - 1) = 11 cycles. CPI = 11/7 = 1.57
- Load-use data hazard (lw $t2 followed by add $t3): adds 1 cycle as a stall
- Load-use data hazard (lw $t4 followed by add $t5): adds 1 cycle as a stall
- Thus, 13 cycles are needed. CPI = 13/7 = 1.86
- ?? Performance ??

Data Hazards: Example 5. Reducing stalls by instruction reordering

    lw  $t1, 4($t0)       # loads B
    lw  $t2, 12($t0)      # loads E
    lw  $t4, 16($t0)      # loads F (moved up from after the first sw)
    add $t3, $t1, $t2     # A = B + E
    sw  $t3, 0($t0)       # stores A
    add $t5, $t1, $t4     # C = B + F
    sw  $t5, 8($t0)       # stores C

- Moving this instruction fills the first stall and eliminates the second one!
- Thus, 11 cycles are needed. CPI = 11/7 = 1.57

Data Hazards: Example 6
Assume that the pipelined MIPS processor without forwarding is used to run a program with the following instruction mix: 20% loads, 20% stores, and 60% ALU. Compute the average CPI given that:
- 10% of the ALU instructions result in load-use hazards.
- 15% of the ALU instructions result in read-after-write (register-use) hazards.

Solution
- Ideally, the average CPI is 1 for each instruction
- With no forwarding
  - Load-use hazards add two cycles
  - Register-use hazards add two cycles
- Average CPI = 0.2 x 1 + 0.2 x 1 + 0.75 x 0.60 x 1 + 0.1 x 0.60 x 3 + 0.15 x 0.60 x 3 = 1.30

Control Hazards
- For the pipelined datapath designed so far, the branch address and decision are known by the end of the MEM stage
- Instructions following the branch instruction in the pipeline are not correct if the branch evaluates to true!
  - If the branch is true, then these instructions should be removed from the pipeline and execution should continue from the branch address
  - Otherwise, no action is required!
- This is a dependency backward in time -> control hazard

[Figure: a branch followed by Inst1, Inst2, Inst3 in the pipeline.]

Control Hazards: Solution!
- Once it is known that the instruction is a branch, stall the pipeline for 3 cycles?
- Is it actually a stall?

[Figure: beq followed by three stalls before the next instructions.]
- Fetching from instruction memory is either from PC+4 or from the branch address, depending on the branch result
- Are these actual stalls? Why not start executing the following instructions normally and, if the branch is true, flush them?!

Control Hazards: Reducing the Cost of Branch Hazard
- Note that three cycles are lost if the branch evaluates to true, in order to remove the three instructions following the branch instruction!
- This could affect the performance significantly!
- Can we reduce this cost?
  - Move the branch address computation to the decode stage
  - Add additional hardware to compare the two registers in the ID stage!
  - Whenever there is a branch instruction in the ID/EX register (ID/EX.Branch = 1), flush the instruction in the IF/ID register.
  - The branch penalty in this case will be 1 cycle instead of 3 cycles!

[Figure: datapath for reducing the cost of the branch hazard.]
[Figure: beq followed by one stall and then lw.]

- Modifying the Hazard Detection Unit

      if (ID/EX.Branch) then
          Flush IF/ID register

- Note that we lose one cycle whenever a branch instruction is encountered!
- Can we do any better?

Control Hazards: Reducing the Cost of Branch Hazard
Approach I – Static Branch Prediction
- Always predict the branch as Not Taken and start fetching the instruction following the branch
- If the branch evaluates to Not Taken, then the prediction is correct and no further actions are required!
- If the branch evaluates to Taken, then the prediction is not correct! Remove the fetched instruction and start fetching from the branch address
- In this approach, we only lose one cycle if the prediction is not correct
- Inside the hazard detection unit

      if (ID/EX.Branch) and (ID/EX.Zero) then
          Flush IF/ID register

Approach II – Dynamic Branch Prediction
- Prediction could be Taken or Not Taken
- If the branch is predicted as Not Taken
  - Fetch the next instruction
  - If the prediction is false, flush the instruction. One cycle is lost!
- If the branch is predicted as Taken
  - Fetch the instruction from the branch address
  - If the prediction is false, flush and fetch from PC+4
- How to store the branch prediction?
  - Use a Branch History Table or Branch Prediction Buffer
  - The table is addressable by the lower bits of the branch instruction address
- If the branch is predicted as taken, we need to wait for the branch address to be computed?
  - Use a Branch Target Buffer

Control Hazards: Approach II – Dynamic Branch Prediction
1-bit Branch Predictor
- Basically we have two states (Taken and Not Taken)
- One bit is used to store the prediction
- The prediction state is changed when the prediction is wrong
- Performance issues
  - Consider branching in loops? EXAMPLE?

2-bit Branch Predictor
- Basically we have four states
- Two bits are used to store the prediction
- The prediction state is changed when the prediction is wrong twice

Control Hazards: Example 7
Consider a program that has a conditional branch instruction whose actual outcomes during execution are given below.

    T-T-N-T-T-N-T

List the predictions for the following branch prediction schemes and find the prediction accuracy.
1. Predict always taken
2. Predict always not taken
3. 1-bit predictor, initialized to predict taken
4. 2-bit predictor, initialized to weakly predict taken

Solution
- Actual branch actions: T-T-N-T-T-N-T
- Predict as always taken
  - Predictions: T-T-T-T-T-T-T
  - Accuracy = 5/7 = 71%
- Predict as always not taken
  - Predictions: N-N-N-N-N-N-N
  - Accuracy = 2/7 = 29%
- 1-bit predictor initialized to predict taken
  - Predictions: T-T-T-N-T-T-N
  - Accuracy = 3/7 = 43%
- 2-bit predictor initialized to weakly predict taken
  - Predictions: T-T-T-T-T-T-T
  - Accuracy = 5/7 = 71%

Pipelining Performance: Example 8
Let's compare the performance of the single-cycle, multi-cycle, and pipelined implementations of the MIPS processor given the operation times and instruction mix below. For the pipelined implementation, assume that:
1) The branch decision is made in the MEM stage, and branches are handled by stalling the pipeline.
2) Half of the load instructions incur a load-use hazard.
3) Forwarding is implemented.
4) The jump instruction is completed in the ID stage.

Instruction type   Percentage (%)
ALU                52
Load               25
Store              10
Branch             11
Jump               2

Unit               Time (ps)
Memory             200
ALU and adders     100
Register file      50

Solution
- Clock cycle time
  - Single-cycle = 200 + 50 + 100 + 50 + 200 = 600 ps
  - Multi-cycle = 200 ps
  - Pipeline = 200 ps
- CPI
  - Single-cycle = 1
  - Multi-cycle = 5 x 0.25 + 4 x 0.52 + 4 x 0.10 + 3 x 0.11 + 3 x 0.02 = 4.12
  - Pipeline = 0.125 x 2 + 0.125 x 1 + 0.52 x 1 + 0.1 x 1 + 0.11 x 4 + 0.02 x 2 = 1.475
- Execution time per instruction
  - Single-cycle = 600 ps
  - Multi-cycle = 4.12 x 200 ps = 824 ps
  - Pipeline = 1.475 x 200 ps = 295 ps

Pipelining Performance: Example 9
Redo Example 8 assuming that branch prediction is employed and 1/4 of the branch instructions are mispredicted.

Exceptions & Interrupts
- Exceptions and interrupts are unexpected events that require a change in the flow of control
- The two terms are used interchangeably, depending on the ISA
  - Intel x86 uses the term interrupt only
  - In MIPS
    - Exceptions: any internal unexpected change in the flow (undefined opcode, overflow, system calls)
    - Interrupts: the event is external (I/O controller request)
- Dealing with them
  - Is a challenging part of processor design
  - Affects performance

Exceptions & Interrupts
- In MIPS, when an exception is generated, the following sequence of steps is taken
  - The address of the offending instruction is saved into a special register called the Exception Program Counter (EPC).
  - The cause of the exception is saved in a special register called the Cause Register.
  - Control is transferred to the operating system by loading a special address (0x8000 0180) into the PC. The code loaded starting at this address
    - Determines what actions will be taken by the operating system in response to the exception, based on the value found in the Cause Register.
      The operating system may terminate the program or resume the execution using the value found in the EPC.

Overflow Exception
- Modifications to the datapath
[Figure: datapath modified to handle the overflow exception.]

Fallacies
- Fallacy 1. Pipelining is easy!
  - Not true! Hazards complicate the operation
- Fallacy 2. Pipelining is independent of technology!
  - Why didn't we have pipelined processors before?
  - Advanced technology allowed more transistors and thus more operations!

Reading Assignment
- Read the following from the textbook
  - Section 4.9 – Exceptions
  - Section 4.10 – Parallelism and Advanced Instruction Level Parallelism
