ECE200 – Computer Organization
Chapter 6 – Enhancing Performance with Pipelining

Homework 6
  6.2, 6.3, 6.5, 6.9, 6.11, 6.19, 6.27, 6.30

Outline for Chapter 6 lectures
  Pipeline motivation: increasing instruction throughput
  MIPS 5-stage pipeline
  Hazards
  Handling exceptions
  Superscalar execution
  Dynamic scheduling (out-of-order execution)
  Real pipeline designs

Pipeline motivation
  Need both low CPI and high clock frequency for best performance
  Want a multicycle implementation for high frequency, but need better CPI
  The idea behind pipelining is to have a multicycle implementation that operates like a factory assembly line
  Each “worker” in the pipeline performs a particular task and hands off to the next “worker” while getting new work
  Tasks should take about the same time; if one “worker” is much slower than the rest, the other “workers” will stand idle
  Once the assembly line is full, a new “product” (instruction) comes out of the back end of the line each time period
  In a computer assembly line (pipeline), each task is called a stage and the time period is one clock cycle

MIPS 5-stage pipeline
  Like the single-cycle datapath but with registers separating each stage
  5 stages for each instruction
    IF: instruction fetch
    ID: instruction decode and register file read
    EX: instruction execution or effective address calculation
    MEM: memory access for load and store
    WB: write back results to register file
  Delays of all 5 stages are roughly the same
  Staging registers are used to hold data and control as instructions pass between stages
  All instructions pass through all 5 stages
  As an instruction leaves a stage in a particular clock period, the next instruction enters it

Pipeline operation for lw
  Stage 1: Instruction fetch
  Stage 2: Instruction decode and register file read
    What happens to the instruction info in IF/ID?
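The assembly-line overlap described so far can be sketched as a stage-occupancy table. This is a minimal illustration, assuming an ideal pipeline with no hazards or stalls; the function names (pipeline_diagram, total_cycles, format_diagram) are made up for this sketch and are not part of the course material.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Map each clock cycle to the instruction occupying each stage (ideal pipeline, no stalls)."""
    diagram = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i is fetched in cycle i + 1
            diagram.setdefault(cycle, {})[stage] = instr
    return diagram

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n >= 1 instructions: fill the pipeline once, then one completes per cycle."""
    return n_stages + n_instructions - 1

def format_diagram(diagram):
    """Render one text row per clock cycle; '-' marks an empty stage."""
    lines = []
    for cycle in sorted(diagram):
        cells = [f"{stage}:{diagram[cycle].get(stage, '-'):<4}" for stage in STAGES]
        lines.append(f"cycle {cycle}: " + " ".join(cells))
    return "\n".join(lines)

if __name__ == "__main__":
    program = ["lw", "sub", "and", "or", "add"]
    print(format_diagram(pipeline_diagram(program)))
    print("total cycles:", total_cycles(len(program)))
```

With five instructions the pipeline is full in cycle 5 (every stage holds a different instruction), and from then on one instruction leaves WB each cycle.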
Pipeline operation for lw
  Stage 3: Effective address calculation
  Stage 4: Memory access
  Stage 5: Write back
    Instruction info in IF/ID is gone, so this won’t work

Modified pipeline with write back fix
  Write register bits from the instruction must be carried through the pipeline with the instruction

Pipeline operation for lw
  Pipeline usage in each stage for lw

Pipeline operation for sw
  Stage 3: Effective address calculation
  Stage 4: Memory access
  Stage 5: Write back (nothing)

Pipeline operation for lw, sub sequence
  [figure slides]

Graphical pipeline representation
  Represent overlap of pipelined instructions as multiple pipelines skewed by a cycle
  Another useful shorthand form

Pipeline control
  Basic pipeline control is similar to the single cycle implementation
  Control for an instruction is generated in ID and travels with the instruction and data through the pipeline
  When an instruction enters a stage, its control signals set the operation of that stage

Multiple instruction example
  For the following code fragment
    lw  $10, 20($1)
    sub $11, $2, $3
    and $12, $4, $5
    or  $13, $6, $7
    add $14, $8, $9
  show the datapath and control usage as the instruction sequence travels down the pipeline
  [figure slides]

How the MIPS ISA simplifies pipelining
  Fixed length instruction simplifies
    Fetch: just get the next 32 bits
    Decode: single step; don’t have to decode opcode before figuring out where to get the rest of the fields
  Source register fields always in same location
    Can read source registers during decode
  Load/store architecture
    ALU can be used for both arithmetic and EA calculation
    Memory instructions require about the same amount of work as arithmetic ones, easing pipelining of the two together
  Memory data must be aligned
    Read or write accesses can be done in one cycle

Pipeline hazards
  A hazard is a conflict regarding data, control, or hardware resources
  Data hazards are conflicts for register values
  Control hazards occur due to the delay to execute branch and jump instructions
  Structural hazards are conflicts for hardware resources, such as
    A single memory for instructions and data
    A multi-cycle, non-pipelined functional unit (such as a divider)

Data dependences
  A read after write (RAW) dependence occurs when the register written by an instruction is a source register of a subsequent instruction
    lw  $10, 20($1)
    sub $11, $10, $3
    and $12, $4, $11
    or  $13, $11, $4
    add $14, $13, $9
  Also have write after read (WAR) and write after write (WAW) data dependences (later)

Pipelining and RAW dependences
  RAW dependences that are close by may cause data hazards in the pipeline
  Consider the following code sequence:
    sub $2, $1, $3
    and $12, $2, $6
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
  What are the RAW dependences?
  Data hazards with first three instructions
  [figure labels: hazard, hazard, ok, ok]

Forwarding
  Most RAW hazards can be eliminated by forwarding results between pipe stages
  [figure label: at this point, result of sub is available]

Forwarding datapaths
  Bypass paths feed data from MEM and WB back to MUXes at the EX ALU inputs
  Do we still have to write the register file in WB?

Detecting forwarding
  Rd of the instruction in MEM or WB must match Rs and/or Rt of the instruction in EX
  The instruction in MEM or WB must have RegWrite=1 (why?)
  Rd must not be $0 (why?)
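The three detection conditions can be sketched as a small selection function for one ALU input. This is a rough model under stated assumptions, not the hardware forwarding unit itself: StageReg and forward_select are hypothetical names, and the sketch assumes that a match in MEM takes priority over a match in WB because MEM holds the more recent value.

```python
from dataclasses import dataclass

@dataclass
class StageReg:
    """Fields of a pipeline register that matter for forwarding (illustrative names)."""
    reg_write: bool  # RegWrite control bit of the instruction in that stage
    rd: int          # destination register number of that instruction

def forward_select(src: int, ex_mem: StageReg, mem_wb: StageReg) -> str:
    """Pick the source for one ALU input of the instruction in EX.

    src is the Rs or Rt register number of the instruction in EX.
    Returns "MEM", "WB", or "REG" (no forwarding, use the register file value).
    """
    # Assumption: MEM holds the newer result, so it is checked first.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "MEM"
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "WB"
    return "REG"

if __name__ == "__main__":
    # sub $2,$1,$3 is in MEM while and $12,$2,$6 is in EX: Rs = $2 comes from MEM.
    print(forward_select(2, StageReg(True, 2), StageReg(False, 0)))
    # An instruction that does not write a register (RegWrite=0) never forwards.
    print(forward_select(2, StageReg(False, 2), StageReg(False, 0)))
    # $0 always reads as zero, so a "result" destined for $0 must not be forwarded.
    print(forward_select(0, StageReg(True, 0), StageReg(False, 0)))
```

The second and third calls correspond to the two "(why?)" questions: without the RegWrite check a store or branch could forward a meaningless value, and without the $0 check a consumer of $0 could receive a nonzero result.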
Detecting forwarding from MEM to EX
  To the upper ALU input (ALUupper)
    EX/MEM.RegWrite = 1
    EX/MEM.RegisterRd not equal 0
    EX/MEM.RegisterRd = ID/EX.RegisterRs
  To the lower ALU input (ALUlower)
    EX/MEM.RegWrite = 1
    EX/MEM.RegisterRd not equal 0
    EX/MEM.RegisterRd = ID/EX.RegisterRt

Detecting forwarding from WB to EX
  To the upper ALU input
    MEM/WB.RegWrite = 1
    MEM/WB.RegisterRd not equal 0
    MEM/WB.RegisterRd = ID/EX.RegisterRs
    The value is not being forwarded from MEM (why?)
  To the lower ALU input
    MEM/WB.RegWrite = 1
    MEM/WB.RegisterRd not equal 0
    MEM/WB.RegisterRd = ID/EX.RegisterRt
    The value is not being forwarded from MEM

Forwarding control
  Control is handled by the forwarding unit

Forwarding example
  Show forwarding for the code sequence:
    sub $2, $1, $3
    and $4, $2, $5
    or  $4, $4, $2
    add $9, $4, $2
  sub produces result in EX
  sub forwards result from MEM to ALUupper
  sub forwards result from WB to ALUlower; the and instruction forwards result from MEM to ALUupper
  or forwards result from MEM to ALUupper

RAW hazards involving loads
  Loads produce results in MEM, so they can’t forward to an immediately following R-type instruction
  Called a load-use hazard
  Solution: stall the stages behind the load for one cycle, after which the result can be forwarded

Detecting load-use hazards
  Instruction in EX is a load
    ID/EX.MemRead = 1
  Instruction in ID has a source register that matches the load destination register
    ID/EX.RegisterRt = IF/ID.RegisterRs OR ID/EX.RegisterRt = IF/ID.RegisterRt

Stalling the stages behind the load
  Force nop (“no operation”) instruction into EX stage on next clock cycle
    Force ID/EX.MemWrite input to zero
    Force ID/EX.RegWrite input to zero
  Hold instructions in ID and IF stages for one clock cycle
    Hold the contents of PC
    Hold the contents of IF/ID

Control for load-use hazards
  Control is handled by the hazard detection unit

Load-use stall example
  Code sequence:
    lw  $2, 20($1)
    and $4, $2, $5
    or  $4, $4, $2
    add $9, $4, $2
  lw enters ID
  Load-use hazard detected
  Force nop into EX and hold ID and IF stages
  lw result in WB forwarded to the and instruction in EX; or reads operand $2 from register file
  Pipeline advances normally

Control hazards
  Taken branches and jumps change the PC to the target address from which the next instruction is to be fetched
  In our pipeline, the PC is changed when the taken beq instruction is in the MEM stage
  This creates a control hazard in which sequential instructions in earlier stages must be discarded

beq instruction that is taken
  [figure labels: instr i+3, instr i+2, instr i+1, beq $2,$3,7]
  instr i+1, instr i+2, and instr i+3 must be discarded
  In this example, the branch delay is three
  Why is the branch immediate field a 7?

Reducing the branch delay
  Reducing the branch delay reduces the number of instructions that have to be discarded on a taken branch
  We can reduce the branch delay to one for beq by moving both the equality test and the branch target address calculation into ID
  We need to insert a nop between the beq and the correctly fetched instruction

beq with one branch delay
  Register equality test done in ID by exclusive ORing the register values and NORing the result
  Instruction in ID forced to nop by zeroing the IF/ID register
  Next fetched instruction will be from PC+4 or branch target depending on the beq outcome
  beq in ID; next sequential instruction (and) in IF
  bubble in ID; lw (from taken address) in IF

Forwarding and stalling changes
  Results in MEM and WB must be forwarded to ID for use as possible beq source operand values
  beq may have to stall in ID to wait for source operand values to be produced
  Examples
    addi $2, $2, -1
    beq  $2, $0, 20
      Stall beq one cycle; forward $2 from MEM to upper equality input in ID
    lw   $8, 20($1)
    beq  $4, $8, 6
      Stall beq two cycles; forward $8 from WB to lower equality input in ID

Forwarding from MEM to ID
  [figure labels: beq $2,$0,20; bubble; addi $2,$2,-1]
  How could we eliminate the bubble?

Forwarding from WB to ID
  [figure labels: beq $4,$8,6; bubble; bubble; lw $8,20($1)]

Further reducing the branch delay
  Insert a bubble only if the branch is taken
    Allow the next sequential instruction to proceed if the branch is not taken
    AND the IF.Flush signal with the result of the equality test
  Still have bubble for taken branches (~2/3 of all branches)

Delayed branching
  The ISA states that the instruction following the branch is always executed regardless of the branch outcome
    Hardware must adhere to this rule!
  The compiler finds an appropriate instruction to place after the branch (in the branch delay slot)
    beq $4, $8, 6
    sub $1, $2, $3    # branch delay slot (always executed after the branch)
  Three places compiler may find a delay slot instruction

Prior example without delayed branch
  beq in ID; next sequential instruction (and) in IF
    What do you notice about the sub instruction?
  bubble in ID; lw (from taken address) in IF

Prior example with delayed branch
  beq in ID; delay slot instruction (sub) in IF
  sub in ID; lw (from taken address) in IF
  [figure labels: sub $10 $4,$8]
  What would happen if the branch was not taken?
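The delayed-branch rule can be sketched with a toy instruction-level model that records which instructions execute. This is only an illustration under assumptions: run_delayed and the tuple encoding of the program are invented for this sketch, branch outcomes are fixed in advance instead of being computed, and a branch placed in a delay slot is not modeled.

```python
def run_delayed(program, max_steps=20):
    """Execute a toy program with one branch delay slot and return the executed indices.

    program is a list of tuples:
      ("op", name)            a plain instruction
      ("br", taken, target)   a branch with a preset outcome and a target index
    """
    trace = []
    pc = 0
    pending_target = None  # set by a taken branch, applied after the delay slot runs
    while pc < len(program) and len(trace) < max_steps:
        instr = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending_target is not None:
            next_pc = pending_target  # the delay slot has executed, now redirect
            pending_target = None
        if instr[0] == "br" and instr[1]:
            pending_target = instr[2]
        pc = next_pc
    return trace

if __name__ == "__main__":
    # index 0: beq, 1: sub (delay slot), 2: and (fall-through), 3: lw (branch target)
    taken = [("br", True, 3), ("op", "sub"), ("op", "and"), ("op", "lw")]
    not_taken = [("br", False, 3), ("op", "sub"), ("op", "and"), ("op", "lw")]
    print(run_delayed(taken))      # sub executes, and is skipped
    print(run_delayed(not_taken))  # sub executes, then and, then lw
```

In both cases the delay slot instruction (index 1) executes, which is why the compiler must pick an instruction that is safe on either path; when the branch is not taken, execution simply continues with the next sequential instruction.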
Limitations of delayed branching
  50% of the time the compiler can’t fill the delay slot with useful instructions while maintaining correctness (has to insert nops instead)
  High performance pipelines may have >10 delay slots
    Many cycles for instruction fetch and decode
    Multiple instructions in each pipeline stage
  Example
    Pipeline: IF1-IF2-ID1-ID2
    Branch calculation performed in ID2
    Four instructions in each stage
    12 delay slots (the three stages before ID2 each hold four instructions)
  Solution: branch prediction (later)

Precise exceptions
  Exceptions require a change of control to a special exception handler routine
  The PC of the user program is saved in EPC and restored after the handler completes so that the user program can resume at that instruction
  For the user program to work correctly after resuming,
    All instructions before the excepting one must have written their result
    All subsequent instructions must not have written their result
  Exceptions handled this way are called precise

Pipelining and precise exceptions
  There may be instructions from before the excepting one and from after it in the pipeline when the exception occurs
  Exceptions may be detected out of program order
  [figure labels: exception, exception]
  Which should be handled first?
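One way to reason about the question is to order the in-flight instructions by program order instead of by detection time. The sketch below assumes that precision requires handling the exception of the earliest instruction in program order first; InFlight, first_precise_exception, and split_at_fault are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class InFlight:
    seq: int                  # position in program order (smaller means earlier)
    name: str
    exception: Optional[str]  # None if no exception has been detected

def first_precise_exception(pipeline: List[InFlight]) -> Optional[InFlight]:
    """Return the excepting instruction that is earliest in program order, if any."""
    excepting = [i for i in pipeline if i.exception is not None]
    return min(excepting, key=lambda i: i.seq) if excepting else None

def split_at_fault(pipeline: List[InFlight], fault: InFlight) -> Tuple[list, list]:
    """Earlier instructions must write their results; the faulting one and later ones must not."""
    must_complete = [i.name for i in pipeline if i.seq < fault.seq]
    must_not_write = [i.name for i in pipeline if i.seq >= fault.seq]
    return must_complete, must_not_write

if __name__ == "__main__":
    # The later instruction's exception (on fetch) is detected before the
    # earlier instruction's exception (in a later pipeline stage).
    pipeline = [
        InFlight(2, "add", "fault detected in IF"),
        InFlight(1, "lw", "fault detected in MEM"),
        InFlight(0, "sub", None),
    ]
    fault = first_precise_exception(pipeline)
    print(fault.name)
    print(split_at_fault(pipeline, fault))
```

Here lw is selected even though the add's exception was detected first in time, because sub must still complete and nothing from lw onward may write a result.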
Supporting precise exceptions
  Each instruction in the pipeline has an exception field that travels with it
  When an exception is detected, the type of exception is encoded in the exception field
  The RegWrite and MemWrite control signals for the instruction are set to 0
  At the end of MEM, the exception field is checked to see if an exception occurred
  If so, the instructions in IF, ID, and EX are made into nops, and the address of the exception handler is loaded into the PC
  [figure labels: exception, exception]

Superscalar pipelines
  In a superscalar pipeline, each pipeline stage holds multiple instructions
    4-6 instructions in modern high performance microprocessors
  Performance is increased because every clock period more than one instruction completes (increased parallelism)
  Superscalar pipelines can achieve a CPI less than 1

Simple 2-way superscalar MIPS
  Two instructions fetched and decoded each cycle
  Conditions for executing a pair of instructions
    First instruction an integer or branch, second a load or store
    No RAW dependence from first to second
  Otherwise, second instruction is executed the cycle after the first

Compiler code scheduling
  The compiler can improve performance by changing the order of the instructions in the program (code scheduling)
  Examples
    Fill branch delay slots
    Move instructions between two dependent instructions to eliminate the stall cycles
    Reorder instructions to increase the number executed in parallel

Scheduling example: before
  Loop: lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $zero, Loop
  [figure labels: Load-use stall; Stall after addi]
  First three instructions must execute serially due to dependences
  Last two must also execute serially for same reason
  Have branch delay slot to fill

Scheduling example: after
  Loop: lw   $t0, 0($s1)
        addi $s1, $s1, -4       # moved into load delay slot
        addu $t0, $t0, $s2
        bne  $s1, $zero, Loop
        sw   $t0, 4($s1)        # moved into branch delay slot
  All stall cycles are eliminated
  Last two instructions can now execute in parallel on the 2-way superscalar MIPS
  First two can also, but we would introduce a stall cycle before the addu (loop is too short; not enough instructions to schedule)

Loop unrolling
  Idea is to take multiple iterations of a loop (“unroll” it) and combine them into one bigger loop
  Gives the compiler many instructions to move between dependent instructions and to increase parallel execution
  Reduces the overhead of branching
  Original code:
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
  Example of prior loop unrolled 4 times:
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          lw   $t0, -4($s1)
          addu $t0, $t0, $s2
          sw   $t0, -4($s1)
          lw   $t0, -8($s1)
          addu $t0, $t0, $s2
          sw   $t0, -8($s1)
          lw   $t0, -12($s1)
          addu $t0, $t0, $s2
          sw   $t0, -12($s1)
          addi $s1, $s1, -16
          bne  $s1, $zero, Loop
  Problem: reuse of $t0 in the unrolled loop constrains instruction order
    Write after read (WAR) and write after write (WAW) hazards
  Solution: different registers for each computation
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          lw   $t1, -4($s1)
          addu $t1, $t1, $s2
          sw   $t1, -4($s1)
          lw   $t2, -8($s1)
          addu $t2, $t2, $s2
          sw   $t2, -8($s1)
          lw   $t3, -12($s1)
          addu $t3, $t3, $s2
          sw   $t3, -12($s1)
          addi $s1, $s1, -16
          bne  $s1, $zero, Loop
  Unrolled loop after scheduling:
    Loop: lw   $t0, 0($s1)
          lw   $t1, -4($s1)
          lw   $t2, -8($s1)
          lw   $t3, -12($s1)
          addu $t0, $t0, $s2
          addu $t1, $t1, $s2
          addu $t2, $t2, $s2
          addu $t3, $t3, $s2
          addi $s1, $s1, -16
          sw   $t0, 16($s1)
          sw   $t1, 12($s1)
          sw   $t2, 8($s1)
          bne  $s1, $zero, Loop
          sw   $t3, 4($s1)
    New sw offsets due to moving the addi

Modern superscalar processors
  Today’s superscalar processors attempt to issue (initiate the execution of) 4-6 instructions each clock cycle
  Such processors have multiple integer ALUs, integer multipliers, and floating point units that operate in parallel on different instructions
  Because most of these units are pipelined, there is the potential to have tens of instructions simultaneously executing
  We must remove several barriers to achieve this

Modern processor challenges
  Handling branches in a way that prevents instruction fetch from becoming a bottleneck
  Preventing long latency operations, especially loads in which the data is in main memory, from holding up instruction execution
  Removing register hazards due to the reuse of registers so that instructions can execute in parallel

Instruction fetch challenges
  Branches comprise about 20% of the executed instructions in SPEC integer programs
  The branch delay may be >10 instructions in a highly pipelined, superscalar processor
  Delayed branches are useless with so many delay slots
  Solution: dynamic branch prediction with speculative execution

Dynamic branch prediction
  When fetching the branch, predict what the branch outcome and target will be
  Fetch instructions from the predicted direction
  After executing the branch, verify whether the prediction was correct
    If so, continue without any performance penalty
    If not, undo and fetch from the other direction

Bimodal branch predictor
  Predicts the branch outcome
  Works under the assumption that most branches are either taken most of the time or not taken most of the time
  Prediction accuracy is ~85-95% with 2048 entries
  Consists of a small memory and a state machine
    Each memory location has 2 bits
    The address of the memory is the low-order log2(n) PC bits of a fetched branch instruction
  [figure: PC of fetched branch instruction supplies the branch predictor memory address; n entries, 2 bits/entry]
  When a branch is fetched, the 2-bit memory entry is retrieved
  The prediction is based on the high-order bit
    1 = predict taken
    0 = predict not taken
  Once the branch is executed, the state bits are updated and written back into the memory
  [figure: state diagram with states 11, 01, 10, 00; transitions on the actual branch outcome]
  In the 00 or 11 state, have to be wrong twice in a row to change the prediction

Branch target buffer
  Predicts the branch target address
    Is this as critical as predicting the branch outcome?
  Small memory (typically 256-512 entries) addressed by the low-order branch PC bits
  Each entry holds the last target address of the branch
  When a branch is fetched, the BTB is accessed and the target address is used if the bimodal predictor predicts “taken”

Speculative execution
  The execution of the branch, and verification of the prediction, may take many cycles due to RAW dependences with long-latency instructions
    lw  $2,100($1)      # can take >100 cycles
    beq $2,$0,Label
  We cannot write the register file or data memory until we know the prediction is correct
  Execution will eventually stall
  In speculative execution, results are first written to temporary buffers (NOT the register file or data memory)
  The results are copied from the buffers to the register file or data memory once the branch prediction has been verified and is correct
  If the prediction is incorrect, we discard the results
  Writeback now consists of two stages: instruction completion and instruction commit
    Completion: execution is complete, write results to buffer
    Commit: branch prediction is verified and correct, copy results from buffers to register file or data memory
  [figure: execute, completion, buffers, commit (branch prediction verified as correct), register file]
  Modern processors can speculate through 4-8 branches

Long latency operations
  Long latency operations, especially loads that have to access main memory, may stall subsequent instructions
    or  $5,$6,$7        # completed
    and $8,$6,$7        # completed
    lw  $2,100($1)      # data not found in on-chip memory, have to get from main memory
    add $9,$2,$2        # waiting for lw
    sub $10,$5,$8       # can’t execute even though its operands are available!
  Solution: allow instructions to issue (start executing) out of their original program order but update registers/memory in program order

Out-of-order issue
  Fetched and decoded instructions are placed in a special hardware queue called the issue queue (IQ)
  [figure: IF, ID, issue queue, reg file, EX, completion buffers, commit]
  An instruction waits in the IQ until
    Its source operands are available
    A suitable functional unit is available
  The instruction can then issue
  Every cycle, the destination register numbers (rd or rt) of issuing instructions are broadcast to all instructions in the IQ
  [figure: dest reg numbers fed back from issued instructions to the issue queue]
  A match with a source register number (rs or rt) of an instruction in the IQ indicates the operand will be available
    or  $5,$6,$7
    and $8,$6,$7
    lw  $2,100($1)      # can take >100 cycles!
    add $9,$2,$2
    sub $10,$5,$8       # both operands become available
  Instructions with available source operands can issue ahead of earlier instructions (out of original program order)
  [figure: issue queue, fed from ID, holding sub $10,$5,$8 and add $9,$2,$2 (waiting for lw)]
  The or and and instructions were just issued => issue sub

Out-of-order issue, in-order commit
  Once instructions complete, they write results into the buffers used for speculative execution
  However, results are written to the register file and data memory in original program order
  [figure: execute, completion buffers (may be out-of-order), commit to register file (must be in-order)]
    or  $5,$6,$7
    and $8,$6,$7
    lw  $2,100($1)
    add $9,$2,$2
    sub $10,$5,$8
  [figure labels: completes first; commits first]
  Why do we need to do this?

Register hazards
  The reuse of registers creates WAW and WAR hazards that limit out-of-order issue and parallel execution
  Example
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
  Potential for multiple iterations to be executed in parallel
    The branch could be predicted as taken with high accuracy
  Problem: WAW and WAR hazards involving $t0 and $s1
  Solution: register renaming

Register renaming
  Idea is for the hardware to reassign registers like the compiler does in loop unrolling
    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          lw   $t1, -4($s1)
          addu $t1, $t1, $s2
          sw   $t1, -4($s1)
  Requires implementing more registers than specified in the ISA (e.g., 128 integer registers rather than 32)
  Allows every instruction in the pipeline to be given a unique destination register number to eliminate all WAR and WAW register conflicts
  A register renaming stage is added between decode and the register file access
  The original architectural destination register number is replaced by a unique physical register number that is not used by any other instruction
  A lookup is done for each source register to find the corresponding physical register number
  [figure: decode (architectural register numbers used up to here), rename (physical register numbers used after this point), reg file]
  Example: two iterations of the loop with branch predicted taken
    BEFORE:
      lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      <bne predicted taken>
      lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      <bne predicted taken>
    AFTER:
      lw   $p1, 0($p3)
      addu $p2, $p1, $p10
      sw   $p2, 0($p3)
      addi $p4, $p3, -4
      <bne predicted taken>
      lw   $p7, 0($p4)
      addu $p23, $p7, $p10
      sw   $p23, 0($p4)
      addi $p11, $p4, -4
      <bne predicted taken>
  WAR hazard involving $s1 is removed, allowing the addi to complete before the first iteration is completed
  The WAW and WAR hazards involving $t0 are removed
  Removing both of these restrictions allows the second iteration to proceed in parallel with the first

The MIPS R12000 microprocessor
  4-way superscalar
  Five execution units
    2 integer
    2 floating point
    1 load/store for effective address calculation and data memory access
  Dynamic branch prediction and speculative execution
  Out-of-order (ooo) issue, in-order commit
  Register renaming

R12000 pipeline (ALU operations)
  Fetch stages 1 and 2
    Fetch 4 instructions each cycle
    Predict branches
    Split into two stages to enable higher clock rates (R10K had 1)
  Decode stage
    Decode and rename 4 instructions each cycle
    Put into issue queues
  Issue stage
    Check source operand availability
    Read source operands from register file (or bypass paths) for issued instructions
  Execute stage
    Execute and complete
  Write stage
    Write results to physical registers

R12000 branch prediction
  2048-entry bimodal predictor
  32-entry branch target address cache
  Speculation through four branches

R12000 ooo completion, in-order commit
  Separate 16-entry issue queues for integer, floating point, and memory (load and store) instructions
  Hardware tracks the program order and status (completed, caused exception, etc.) of up to 48 instructions

R12000 register renaming
  64 integer and 64 floating point physical registers
  Hardware lookup table to correlate architectural registers with physical registers
  Hardware maintains list of currently unused registers that can be assigned as destination registers

R10000 die photo
  [figure]

R12000 summary
  R10000 was one of the first microprocessors to implement the “issue queue” approach to ooo superscalar execution
    PowerPC processors use the “reservation station” approach discussed in the book
  Clock rate was slow
    R12000 provided a slight improvement with some redesign
    Pentium and Alpha processors are ooo but with much faster clock rates
  Very hard to get significant improvement beyond 4-6 way issue
    Branch prediction accuracy needs to be extremely high
    Finding parallel operations in many programs is difficult
    Long latency of loads creates an operand supply problem
    Keeping the clock rate high is tough

Questions?