CH 6 - PIPELINING

INTRODUCTION - basics of pipelining and the problems that are generated by this technique.

Basic principle is simple: divide up the instruction execution into discrete parts, much like the multicycle processor:
IF - Instruction fetch
ID - Register read / instruction decode
EXE - Execute operation
MEM - Memory access
WB - Write back

Note that all instructions (normally) write back on the 5th cycle; the data is simply held through the MEM stage if there is no memory access.

Key: we can execute several instructions in parallel - up to five - because we have five "functional units" in our pipeline. If instruction n is being executed, n+1 is doing its register read, n+2 is being fetched, n-1 is in memory access, and n-2 is being written back.

How does this improve performance? Before, each instruction would have taken 5 cycles. Now a new instruction can be fetched every cycle, giving an upper-limit performance increase of:

    MaxSpeedup = CyclesBefore / EffectiveCyclesAfter = number of pipeline stages

It looks as if each instruction takes one cycle! However, the execution time for a given instruction is the same as, or more than (due to the delayed writeback), the multicycle case. Let's look at an example:

    ClockCycle:      1    2    3    4    5    6    7    8
    add $2,$3,$4     if   id   exe  mem  wb
    lw  $4,0($2)          if   id   exe  mem  wb
    beq $4,$0,foo              if   id   exe  mem  wb
    sw  $2,0($6)                    if   id   exe  mem  wb

Much better, right? Well, yes and no. There are a few problems. HAZARDS are situations where the result of an instruction must be determined before later instructions can be executed. There are three basic types of hazards:

1) Structural hazards occur when the hardware is incapable of supporting the operations needed in the same clock cycle. MIPS is so simple that it does not have structural hazards. If, however, program and data memory were unified, then a memory read could not be done at the same time as an IF. Or, if there were a two-cycle divide unit that was not pipelined, we couldn't have adjacent divide instructions. Solution: STALL. Continue the earlier instruction's execution, but hold back all later instructions one cycle so the conflict can be resolved. This inserts a BUBBLE into the pipeline where no useful work is being done.

2) Control hazards occur because until a conditional branch is evaluated, or a jump address is determined, the address of the next instruction is unknown. Question: in the example above, when is the address of the instruction after the branch available? (After the beq's exe stage, so the sw's if could start then.) Solutions:
a) Stall until the decision is known (2 clocks if the branch is calculated early)
b) Assume "branch not taken" and inhibit WB if the branch is taken
c) Assume "branch taken", which causes at least one stall
d) Branch prediction: keep track of each branch address and whether that branch has been taken more often than not over its last few executions. *** Prediction is used in most systems. NO STALLS unless the prediction is wrong.
e) Delayed branch: the instruction after the branch is assumed to always be executed. "Early decision" branch detection can then be used to decide the branch at the end of ID (this allows only simple tests such as branch-on-zero and branch-on-nonzero). The instruction after the branch sits in the "branch delay slot". Often there is no instruction that works there, so a nop is used.

3) Data hazards are the most common. They happen when a register value calculated by a previous instruction has not yet been written back by the time a later instruction needs it in its ID stage. Question: can you find examples in our code? ($2: add->lw, $4: lw->beq; $2: add->sw is NOT a hazard - we can assume that a WB and a register read in the same cycle can be supported by the hardware: the write completes at the beginning of the clock cycle and the read at the end.)
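To double-check that answer, here is a small Python sketch (my own illustration, not from the text) that flags RAW dependences between instructions less than three apart; with the write-in-the-first-half / read-in-the-second-half register file, a producer three or more instructions ahead is safe.

    snippet = [
        ("add $2,$3,$4",  2,    {3, 4}),   # (text, destination, sources)
        ("lw  $4,0($2)",  4,    {2}),
        ("beq $4,$0,foo", None, {4, 0}),
        ("sw  $2,0($6)",  None, {2, 6}),
    ]

    for i, (prod, dest, _) in enumerate(snippet):
        for dist, (cons, _, srcs) in enumerate(snippet[i + 1:], start=1):
            if dest is not None and dest in srcs:
                hazard = dist < 3      # at distance 3, WB and register read share a cycle
                print(f"{prod} -> {cons}: ${dest} "
                      f"{'HAZARD' if hazard else 'no hazard (written back in time)'}")

This prints the two hazards named above ($2 for add->lw, $4 for lw->beq) and reports add->sw as safe.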
Solution: Non-load data hazards can be fixed using data forwarding; load data hazards can't be fully fixed and must invoke stalls. Note that for non-load instructions the result is available after the exe stage, and the next instruction doesn't need that result until ITS exe stage. Thus, the result of the first instruction can be forwarded from the exe output to the exe input, bypassing the register file altogether (this technique is also called data bypassing). This is done with multiplexers.

Let's reexamine our program snippet. If we used just stalls, we'd have:

    ClockCycle:      1    2    3    4    5    6    7    8    9   10   11   12   13   14
    add $2,$3,$4     if   id   exe  mem  wb
    lw  $4,0($2)          if   STA  STA  id   exe  mem  wb
    beq $4,$0,foo              if   STA  STA  STA  STA  id   exe  mem  wb
    sw  $2,0($6)                                             if   id   exe  mem  wb

And the sw WB would occur on clock cycle 14! Note that this is not unusually "unfortunate" code; data hazards are the rule more than the exception. But it is folly to just wait for the branch decision to be made. If we use the schemes we have discussed, and we make the correct branch decision, then we would have:

    ClockCycle:      1    2    3    4    5    6    7    8    9
    add $2,$3,$4     if   id   exe  mem  wb
    lw  $4,0($2)          if   id   exe  mem  wb                  (data forward)
    beq $4,$0,foo              if   STA  id   exe  mem  wb        (data forward)
    sw  $2,0($6)                    if   STA  id   exe  mem  wb   (branch predict)

Only one stall, due to the unavoidable load data hazard! A performance improvement of 14/9, about 56%! That's the background on pipelining; next we'll consider how to extend our simple MIPS to do pipelining.

MIPS PIPELINE IMPLEMENTATION (Sec. 6.2) - Fig. 6.10/6.12

To consider how to implement the pipelined MIPS, we'll have to go back to assuming separate program and data memories, to avoid memory contention problems for load/store instructions. Let's start back at the single-cycle block diagram, reorganized as Fig. 6.10. This shows the datapath elements associated with the five pipeline stages:
IF: fetch instruction, update PC = PC+4
ID: register access
EXE: execute, calculate branch address
MEM: Load/Store only
WB: destination MUX only

Just like in the multicycle implementation, the key factor is that we need registers to save the results of each stage so they are available as inputs to the next stage in the next clock cycle. The four pipeline registers are shown in the next figure. The IF/ID register contains the IR and the updated PC value. A few details are glossed over in this figure. Most egregious is that the write-register address is not carried through the other three pipeline registers (it must be, so that the correct register is written back). Control signals and data have to travel together through the registers. Note that the registers are all updated at once at the clock edge; it is assumed that the clock is slow enough that all signals settle before the edge.

Let's run our little program snippet on this architecture. Note that there is no forwarding available, and ... when is the branch decision made? (During exe; the PC is not updated with the branch address until mem.)

    ClockCycle:      1    2    3    4    5    6    7    8    9   10   11   12   13   14
    add $2,$3,$4     if   id   exe  mem  wb
    lw  $4,0($2)          if   STA  STA  id   exe  mem  wb
    beq $4,$0,foo              if   STA  STA  STA  STA  id   exe  mem  wb
    sw  $2,0($6)                                             if   id   exe  mem  wb

PIPELINE CONTROL (6.3, handout Fig. 6.30)

In this lecture, we'll consider the control signals for a basic pipeline, and then expand to detection and (if possible) circumvention of hazards. The key concept here is that control signals must go through the pipeline registers in step with the data buses, so they can be applied in the correct stage at the correct clock cycle. Look at the handout. We see that once the control signals are decoded, they are carried along through the pipeline registers to later stages. Dividing the control signals among the pipeline stages, we have:
IF – Nothing to assert (registers always clocked)
ID – Nothing to assert here either; RegWrite is controlled in the last stage.
EXE – RegDst, ALUop, ALUSrc applied here
MEM – Branch, MemRead, MemWrite needed here. The branch PC is loaded here!
WB – MemToReg and RegWrite. NOTE that the register write happens here!
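To make the "control travels with the data" idea concrete, here is a minimal Python sketch (my own, not the handout's circuit) of the four pipeline registers being rewritten together at each clock edge; the payloads here are just instruction labels standing in for the real data and control fields noted in the comments.

    PIPE_REGS = ("IF/ID", "ID/EX", "EX/MEM", "MEM/WB")

    def clock_edge(regs, fetched_instr):
        """All four pipeline registers are written together at the clock edge."""
        return {
            "IF/ID":  fetched_instr,   # would hold the IR and updated PC
            "ID/EX":  regs["IF/ID"],   # would hold reg values + EXE/MEM/WB control bits
            "EX/MEM": regs["ID/EX"],   # would hold ALU result/branch target + MEM/WB control
            "MEM/WB": regs["EX/MEM"],  # would hold load data or ALU result + WB control
        }

    program = ["add $2,$3,$4", "lw $4,0($2)", "beq $4,$0,foo", "sw $2,0($6)"]
    regs = dict.fromkeys(PIPE_REGS)
    for cycle, instr in enumerate(program + [None] * 4, start=1):
        regs = clock_edge(regs, instr)
        print(f"after edge {cycle}:",
              ", ".join(f"{r}={regs[r]}" for r in PIPE_REGS))
    # The add reaches MEM/WB after the 4th edge, so its RegWrite bit and result
    # would be applied to the register file during cycle 5 - the WB column above.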
WORK through the example in the text, pp. 471-476. You should be able to identify which instruction is doing what in which portion of the pipeline.

DATA HAZARDS and FORWARDING (6.4)

In order to deal with data hazards, we need to do the following:
1. Detect data hazards
2. Forward data where possible
3. Stall where forwarding is not possible (loads)
Let's consider these steps one at a time.

Detection of data hazards. First, consider how far apart instructions can be and still have data hazard problems:

    Clock            1    2    3    4    5    6    7    8
    add $1,$2,$3     if   id   exe  mem  wb
    add $4,$5,$1          if   id   exe  mem  wb
    add $6,$7,$1               if   id   exe  mem  wb
    add $8,$9,$1                    if   id   exe  mem  wb

Since we can do a register read and a writeback in the same cycle, the last add is independent of the first. Thus only the second and third adds are dependent. So we need to examine whether any source register of an instruction is generated by either of the two previous instructions. This is done in the execution stage by pipelining the rs and rt addresses along and comparing them to the rd (destination) address carried in the following two stages:

    [Diagram: the forward-detect unit compares rs and rt held in the ID/EX register against rd held in the EX/MEM and MEM/WB registers.]

If any address matches (rs = rd_mem, rs = rd_wb, rt = rd_mem, rt = rd_wb), the data must be forwarded UNLESS the producing operation is a load. Forwarding is achieved by adding TWO MORE inputs to the MUXes at the ALU inputs, connected to the data outputs of the MEM and WB stages. See handout.

What happens with a load?

    Clock:           1    2    3    4    5    6    7    8
    lw  $1, 0($2)    if   id   exe  mem  wb
    add $3,$4,$1          if   id   STA  exe  mem  wb        (data haz: one stall, then forward)
    add $5,$6,$1               if   STA  id   exe  mem  wb   (data haz: forwarded)

    Clock:           1    2    3    4    5    6    7
    lw  $1, 0($2)    if   id   exe  mem  wb
    add $3,$4,$10         if   id   exe  mem  wb             (no data haz)
    add $5,$6,$1               if   id   exe  mem  wb        (data haz: forwarded, no stall)

Data forwarding can be used after one cycle; a stall is required only if the instruction immediately after the load has a data hazard with the load. Thus we still need to be able to do a stall. Here's how: we need a HAZARD DETECTION UNIT (in the book's pseudocode):

    if (ID/EX.MemRead and ((ID/EX.rt = IF/ID.rs) or (ID/EX.rt = IF/ID.rt)))
        then stall the pipeline's if and id stages for one cycle

Note that exe continues (to do the memory load), while the following instructions (in id, if) are stalled one cycle. Note that ID/EX.rt is actually the destination address for the load, since the RegDst MUX is in the EX stage! The stall is accomplished by inhibiting the clocking of the PC and the IF/ID pipeline register. Also, the control signals for the bubble are set to 0, which inhibits writeback, memory write, and branching. That's all!!
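Putting the two rules together, here is a small Python rendering (my own sketch, not the book's hardware) of the forwarding comparisons and the load-use stall condition above; register numbers are passed in directly, and the RegWrite/$0 guards reflect the usual forwarding-unit conditions.

    def forward_sources(id_ex_rs, id_ex_rt, ex_mem_rd, ex_mem_regwrite,
                        mem_wb_rd, mem_wb_regwrite):
        """Return where each ALU input should come from. EX/MEM wins over
        MEM/WB because it holds the more recent result."""
        def pick(src):
            if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
                return "EX/MEM"      # forward the previous instruction's ALU result
            if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
                return "MEM/WB"      # forward the result from two instructions back
            return "REGFILE"         # no hazard: use the register file value
        return pick(id_ex_rs), pick(id_ex_rt)

    def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
        """Hazard detection condition: stall IF and ID for one cycle when the
        instruction in EX is a load whose destination is needed next."""
        return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

    # lw $1,0($2) in EX while add $3,$4,$1 is in ID -> stall one cycle
    print(load_use_stall(True, id_ex_rt=1, if_id_rs=4, if_id_rt=1))     # True
    # add $1,$2,$3 two ahead of add $6,$7,$1 -> rt operand comes from MEM/WB
    print(forward_sources(7, 1, ex_mem_rd=4, ex_mem_regwrite=True,
                          mem_wb_rd=1, mem_wb_regwrite=True))           # ('REGFILE', 'MEM/WB')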
BRANCH HAZARDS (6.6)

Reduction of branch hazards can be achieved in a number of ways:

1) Assume branch not taken. This gives no stalls IF the branch shouldn't be taken, but gives 3 stalls if it IS taken and the new address is not calculated until the mem stage:

    Clock:           1    2    3    4    5    6    7    8
    beq $1,$2,foo    if   id   exe  mem  wb
    +1                    if   id   exe  mem  (wb)
    +2                         if   id   exe  mem  (wb)
    +3                              if   id   exe  mem  (wb)
    foo: instruct                        if   id   exe  mem ...

    *** write/writeback inhibited for +1, +2, +3 ***

2) Reducing branch delay. We can improve the above situation by one cycle by pre-calculating the branch address: PC+4 can be calculated in the IF stage, and the offset can be added in the ID stage. Normally we would still have to wait for exe to determine the outcome, so the earliest the new IF could occur would be while the branch is in mem (2 stalls). We can go one better by adding hardware that determines whether rs and rt are equal and putting it in the ID/register-fetch stage (see diagram). Then the decision can be made in the ID stage, and the branched-to IF can happen during EXE, giving only ONE delay.

3) Dynamic branch prediction. Based on the actual history of execution! Uses a branch prediction buffer or branch history table: a small memory indexed by the low address bits, holding the high address bits (as a tag) and the recent history (see the sketch after this list):

    [Diagram: the low PC bits A2-A6 index the table; each entry holds the high PC bits A7-A31 and two history bits H0-H1; a comparator (=?) on the high bits signals that the fetched instruction is a known branch, and the history bits give the predicted branch decision.]

The decision comes from a small state machine that predicts based on the recent branch outcomes at that address (the two history bits). When the branch is resolved, the history bits are updated.

4) Delayed branch – the instruction after the branch is ALWAYS executed. Used to reduce bubbles even if branch prediction is wrong, or for jumps.
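Here is the promised sketch of item 3, in Python (my own illustration; the table size, field widths, and 2-bit counter encoding are assumptions, not the text's exact design):

    class BranchPredictor:
        """Branch history table: indexed by low PC bits, tagged by high PC bits,
        with a 2-bit saturating counter per entry (0-1 predict not taken,
        2-3 predict taken)."""
        def __init__(self, index_bits=5):                 # the A2-A6 idea -> 32 entries
            self.index_bits = index_bits
            size = 1 << index_bits
            self.tags = [None] * size                     # high PC bits (the A7-A31 idea)
            self.counters = [1] * size

        def _slot(self, pc):
            index = (pc >> 2) & ((1 << self.index_bits) - 1)
            tag = pc >> (2 + self.index_bits)
            return index, tag

        def predict(self, pc):
            i, tag = self._slot(pc)
            if self.tags[i] != tag:
                return False                              # no tag match: not a known branch
            return self.counters[i] >= 2

        def update(self, pc, taken):
            i, tag = self._slot(pc)
            self.tags[i] = tag
            c = self.counters[i]
            self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)

    bp = BranchPredictor()
    for actual in [True, True, False, True, True]:        # a mostly-taken loop branch
        print("predict:", bp.predict(0x00400010), "actual:", actual)
        bp.update(0x00400010, actual)

With two history bits, a single anomalous outcome (such as the final iteration of a loop) does not immediately flip the prediction.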
EXCEPTIONS (6.7)

As usual, exceptions mess up the elegance of the architectural model. Exceptions can happen in several pipeline stages (id = bad instruction, exe = overflow, mem = memory fault), and in some of these cases we need to RESTART the program from the instruction that faulted. External interrupts need to be dealt with as well. We need logic that COMPLETES (writes back) the instructions BEFORE the faulting instruction, but disables writeback of the faulting instruction and later ones. Again, the control bits of the faulting instruction and later ones can be zeroed out. We also need to save the address (+4) of the faulting instruction in the EPC, and the reason in the CAUSE register. This means passing the address+4 through the pipeline registers (Fig. 6.55), which allows saving exe exceptions but not mem. For interrupts, there is some flexibility as to where the pipeline should complete; it is somewhat arbitrary.

SUPERSCALAR PROCESSORS (6.8)

In the hunt for greater performance, the step after pipelining was superscalar. A superscalar CPU is capable of issuing (fetching) more than one instruction at a time. Usually the ORDER of a superscalar machine is 2 or 4 (simultaneous instructions). How is this done?
1) The cache memory can send 2 or 4 instructions at a time to the CPU. (The path from DRAM to the cache may be narrower.)
2) The IF and ID stages handle these instructions in tandem, which usually means the register file must be multi-ported.
3) Once the instructions are decoded, they are scheduled for issue to one of a collection of functional units:

    [Diagram: IF and ID feed a set of functional units - two integer units, a float unit, a load/store unit, and a branch unit - whose results go to writeback.]

4) Instructions may be issued OUT OF ORDER if no functional unit is available at the right time (a structural hazard). This is called dynamic scheduling.
5) Some instructions (divides, float ops) take longer, so instructions can also complete OUT OF ORDER.
6) WRITEBACK is done by collecting results and writing them back in order, with appropriate data forwarding.

Control and exceptions are VERY COMPLICATED! A SCOREBOARD is often used to keep track of instructions as they execute, and of the data/control dependencies between them. Nifty techniques can be implemented, such as executing both the non-branch code and the branched-to code at the same time, then tossing out the instructions that were not supposed to be executed.

The DEC Alpha 21264 was the first processor to combine a superscalar design with deep pipelines (nine stages for integer ops alone!): 4th order, a maximum of 6 instructions issued at once, and the first 1 GHz clock CPU.

Problems:

1) (6.2) Show the forwarding paths needed to execute the following three instructions:

    add $2, $3, $4
    add $4, $5, $6
    add $5, $3, $4

Solution: we need a forwarding path from the second instruction to the third, since the third depends on the second ($4).

2) (6.3) How could we modify the following code to make use of a delayed branch slot?

    Loop: lw   $2, 100($3)
          addi $3, $3, 4
          beq  $3, $4, Loop

Solution: move the lw into the delay slot, adjusting its offset for the already-incremented $3:

    Loop: addi $3, $3, 4
          beq  $3, $4, Loop
          lw   $2, 96($3)

3) (6.11) Consider executing the following code on the pipelined datapath of Figure 6.46:

    add $1, $2, $3
    add $4, $5, $6
    add $7, $8, $9
    add $10, $11, $12
    add $13, $14, $15

At the end of the fifth cycle of execution, which registers are being read and which registers are being written?
Solution: In the fifth cycle, register $1 is being written (first add in WB) and registers $11 and $12 are being read (fourth add in ID).

4) (6.12) With regard to the last problem, explain what the forwarding unit is doing during the fifth cycle of execution. If any comparisons are being made, mention them.
Solution: The forwarding unit is looking at the instructions in the fourth and fifth stages and checking whether they intend to write to the register file and whether the register they write is being used as an ALU input by the instruction in the EX stage (add $7,$8,$9). Thus it is comparing: 8 = 4? 8 = 1? 9 = 4? 9 = 1?

5) (6.23) Normally we want to maximize performance on our pipelined datapath with forwarding and stalls on use after a load. Rewrite this code to minimize performance while still obtaining the same result:

    lw  $3, 0($5)
    lw  $4, 4($5)
    add $7, $7, $3
    add $8, $8, $4
    add $10, $7, $8
    sw  $6, 0($5)
    beq $10, $11, Loop

Solution: place a dependent instruction immediately after each load so that every load forces a stall:

    lw  $3, 0($5)
    add $7, $7, $3
    sw  $6, 0($5)
    lw  $4, 4($5)
    add $8, $8, $4
    add $10, $7, $8
    beq $10, $11, Loop

Homework:

1) (6.4) Identify all of the data dependencies in the following code. Which dependencies are data hazards that will be removed via forwarding?

    add $2, $5, $4
    add $4, $2, $5
    sw  $5, 100($2)
    add $3, $2, $4

2) (6.9) Given Figure 6.71, determine as much as you can about the five instructions in the five pipeline stages. If you cannot fill in a field of an instruction, state why.

3) (6.14) Consider a program consisting of 100 lw instructions in which each instruction is dependent upon the instruction before it. What would the actual CPI be if the program were run on the pipelined datapath of Figure 6.45?

4) (6.15) Consider executing the following code on the pipelined datapath of Figure 6.46:

    add $5, $6, $7
    lw  $6, 100($7)
    sub $7, $6, $8

How many cycles will it take to execute this code? State what hazards there are and how they are fixed.
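Looking back at problem 5: a rough Python helper (my own, not from the text) that counts load-use stalls - the only stalls left once forwarding is in place - confirms that the reordered version is slower. Each instruction is written as (is_load, destination, set of sources), with None for no destination.

    def load_use_stalls(instrs):
        """Count one-cycle bubbles: a stall occurs whenever an instruction uses
        the destination of the load immediately before it."""
        stalls = 0
        for (prev_is_load, prev_dest, _), (_, _, srcs) in zip(instrs, instrs[1:]):
            if prev_is_load and prev_dest in srcs:
                stalls += 1
        return stalls

    fast = [  # the original ordering from problem 5
        (True,  3, {5}), (True, 4, {5}), (False, 7, {7, 3}), (False, 8, {8, 4}),
        (False, 10, {7, 8}), (False, None, {6, 5}), (False, None, {10, 11}),
    ]
    slow = [  # the "minimum performance" reordering
        (True,  3, {5}), (False, 7, {7, 3}), (False, None, {6, 5}), (True, 4, {5}),
        (False, 8, {8, 4}), (False, 10, {7, 8}), (False, None, {10, 11}),
    ]
    print(load_use_stalls(fast), load_use_stalls(slow))   # prints: 0 2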