Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 3: Instruction-Level Parallelism and Its Exploitation
Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction

Pipelining became a universal technique in 1985
• Overlaps execution of instructions
• Exploits "instruction-level parallelism" (ILP)

Beyond the potential overlap among instructions in a simple pipeline, there are two main approaches to exploiting ILP:
• Hardware-based dynamic approaches
  – Used in server and desktop processors
  – Not used as extensively in PMD (personal mobile device) processors
• Compiler-based static approaches
  – Not as successful outside of scientific applications

Instruction-Level Parallelism

When exploiting instruction-level parallelism, the goal is to maximize IPC (or minimize CPI):

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Parallelism within a basic block is limited
• A basic block is a code sequence with no branch in except at the entry and no branch out except at the exit
• Typical size of a basic block = 3-6 instructions
• Must optimize across branches

Loop-Level Parallelism

• Unroll loops statically or dynamically
• Use SIMD (vector processors and GPUs)
• Challenge: data dependence

Data Dependence

Instruction j is data dependent on instruction i if either
• instruction i produces a result that may be used by instruction j, or
• instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a dependence chain)

Dependent instructions cannot be executed simultaneously.

Data Dependence

• Dependences are a property of programs; the pipeline organization determines whether a dependence is detected and whether it causes a stall
• A data dependence conveys:
  – the possibility of a hazard
  – the order in which results must be calculated
  – an upper bound on exploitable instruction-level parallelism
• Dependences that flow through memory locations are difficult to detect

Name Dependence

Two instructions use the same name but there is no flow of information
• Not a true data dependence, but a problem when reordering instructions
• Antidependence: instruction j writes a register or memory location that instruction i reads
  – The initial ordering (i before j) must be preserved
• Output dependence: instruction i and instruction j write the same register or memory location
  – Ordering must be preserved
• To resolve, use renaming techniques

Other Factors

Data hazards:
• Read after write (RAW)
• Write after write (WAW)
• Write after read (WAR)

Control Dependence

Ordering of instruction i with respect to a branch instruction:
• An instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch (e.g., cannot move s1 before if (p1))
• An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch (e.g., cannot move p3=0 into {s2})

Example:
if (p1) {s1};
p3 = 0;
if (p2) {s2};
Examples

Example 1: the OR instruction is dependent on DADDU and DSUBU

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R1,R6
L:      ...
        OR    R7,R1,R8

Example 2: assume R4 isn't used after skip; it is possible to move DSUBU before the branch

        DADDU R1,R2,R3
        BEQZ  R12,skip
        DSUBU R4,R5,R6
        DADDU R5,R4,R9
skip:   OR    R7,R8,R9

Compiler Techniques for Exposing ILP

Pipeline scheduling: separate a dependent instruction from the source instruction by the pipeline latency of the source instruction.

Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
FP Loop: Where are the Hazards?

First translate into MIPS code. To simplify, assume 8 is the lowest address, R1 = base address of x, and F2 = s:

Loop: L.D    F0,0(R1)   ;F0=vector element
      ADD.D  F4,F0,F2   ;add scalar from F2
      S.D    0(R1),F4   ;store result
      DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
      BNEZ   R1,Loop    ;branch R1!=zero

7/27/2016, CSCE 430/830, Instruction Level Parallelism
FP Loop Showing Stalls

1 Loop: L.D    F0,0(R1)   ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2   ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4   ;store result
7       DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
8       stall             ;assumes can't forward to branch
9       BNEZ   R1,Loop    ;branch R1!=zero

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

9 clock cycles: rewrite code to minimize stalls?
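The 9-cycle count above can be recomputed mechanically. A minimal sketch, assuming stalls occur only between adjacent producer/consumer pairs as in the slide's latency table (expressed as extra stall cycles inserted between the two instructions):

```python
def count_cycles(instrs, stalls_between):
    """instrs: list of opcode names in program order.
    stalls_between maps (producer, consumer) -> stall cycles
    inserted between them when they are adjacent."""
    cycle = 0
    prev = None
    for ins in instrs:
        # one cycle for the instruction itself, plus any stalls
        cycle += 1 + stalls_between.get((prev, ins), 0)
        prev = ins
    return cycle

loop = ["L.D", "ADD.D", "S.D", "DADDUI", "BNEZ"]
stalls = {("L.D", "ADD.D"): 1,     # load double -> FP ALU op
          ("ADD.D", "S.D"): 2,     # FP ALU op -> store double
          ("DADDUI", "BNEZ"): 1}   # assumes can't forward to branch
```

Running `count_cycles(loop, stalls)` reproduces the slide's 9 clock cycles per iteration.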
Revised FP Loop Minimizing Stalls

1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4   ;altered offset when move DADDUI
7       BNEZ   R1,Loop

Swap DADDUI and S.D by changing the address of S.D.

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; how to make it faster?
Unroll Loop Four Times (straightforward way)

1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2      ;1 cycle stall after L.D
6        S.D    0(R1),F4      ;2 cycles stall after ADD.D; drop DSUBUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12   ;drop DSUBUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32    ;alter to 4*8
26       BNEZ   R1,LOOP

27 clock cycles, or 6.75 per iteration (assumes R1 is a multiple of 4)
Rewrite the loop to minimize stalls?
Unrolled Loop That Minimizes Stalls

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       S.D    8(R1),F16   ;8-32 = -24
14       BNEZ   R1,LOOP

14 clock cycles, or 3.5 per iteration
Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the
loop to make k copies of the body
• Instead of a single unrolled loop, we generate a
pair of consecutive loops:
– 1st executes (n mod k) times and has a body that is the
original loop
– 2nd is the unrolled body surrounded by an outer loop that
iterates (n/k) times
• For large values of n, most of the execution time
will be spent in the unrolled loop
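The pair-of-loops transformation above can be sketched in a high-level language. A minimal sketch, assuming the scalar-add loop from the earlier slides and a fixed unroll factor of 4 (function name and argument names are illustrative):

```python
def saxpy_unrolled(x, s, k=4):
    """Add scalar s to every element of x, unrolled by k = 4.

    Mirrors the slides' transformation: a prologue loop runs
    n mod k iterations of the original body, then the unrolled
    loop iterates n // k times with k copies of the body."""
    assert k == 4, "the unrolled body below is hardwired to 4 copies"
    n = len(x)
    i = 0
    # 1st loop: executes (n mod k) times; body is the original loop
    for _ in range(n % k):
        x[i] = x[i] + s
        i += 1
    # 2nd loop: the unrolled body, iterating n // k times
    for _ in range(n // k):
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += k
    return x
```

For large n, most of the execution time is spent in the second (unrolled) loop, as the slide notes.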
5 Loop Unrolling Decisions

Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that loop unrolling is useful by finding that loop iterations are independent (except for loop-maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
   • This transformation requires analyzing memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code
In-class Exercise

• Identify the data hazards in the code below:

  MULTD F3, F4, F2
  ADDD  F2, F6, F1
  SD    F2, 0(F3)

• For each of the following code fragments, identify each type of dependence that a compiler will find (a fragment may have no dependences) and whether a compiler could schedule the two instructions (i.e., change their order).

1. DADDI R1, R1, #4        2. DADD R3, R1, R2
   LD    R2, 7(R1)            SD   R2, 7(R1)

3. SD R2, 7(R1)            4. BEZ R1, place
   SD F2, 200(R7)             SD  R1, 7(R1)
3 Limits to Loop Unrolling

1. Decrease in the amount of overhead amortized with each extra unrolling
   • Amdahl's Law
2. Growth in code size
   • For larger loops, it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
   • If it is not possible to allocate all live values to registers, the transformation may lose some or all of its advantage

Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
Static Branch Prediction

• Previous lectures showed scheduling code around a delayed branch
• To reorder code around branches, we need to predict the branch statically at compile time
• The simplest scheme is to predict a branch as taken
  – Average misprediction = untaken branch frequency = 34% for SPEC
• A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run

[Figure: misprediction rate of profile-based prediction on SPEC benchmarks — integer (compress, eqntott, espresso, gcc, li) and floating point (doduc, ear, hydro2d, mdljdp, su2cor) — with rates ranging from about 4% to 25%]
Dynamic Branch Prediction
• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of
way that humans/compilers think about problems
• Is dynamic branch prediction better than static
branch prediction?
– Seems to be
– There are a small number of important branches in programs
which have dynamic behavior
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address
index table of 1-bit values
– Says whether or not branch taken last time
– No address check
• Problem: in a loop, 1-bit BHT will cause two
mispredictions (avg is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it
predicts exit instead of looping
Dynamic Branch Prediction

• Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions

[State diagram: two "predict taken" states and two "predict not taken" states; a taken outcome (T) moves toward predict-taken, a not-taken outcome (NT) moves toward predict-not-taken, and the prediction flips only after two outcomes in the wrong direction]

• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
BHT Accuracy

Mispredict because either:
• Wrong guess for that branch
• Got the branch history of the wrong branch when indexing the table

[Figure: misprediction rate of a 4096-entry 2-bit BHT on SPEC89 — integer benchmarks (eqntott, espresso, gcc, li) fall roughly between 5% and 18%, while floating-point benchmarks (nasa7, matrix300, doduc, spice, fpppp) fall roughly between 0% and 10%]
Correlated Branch Prediction

• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor means recording the last m branches to select between 2^m history tables, each with n-bit counters
  – Thus, the old 2-bit BHT is a (0,2) predictor
• Global branch history: an m-bit shift register keeping the T/NT status of the last m branches
• Each entry in the table has 2^m n-bit predictors
Correlating Branches

(2,2) predictor: the behavior of the two most recent branches selects between four predictions for the next branch, updating just that prediction.

[Figure: the branch address indexes a table of four 2-bit predictors per branch; a 2-bit global branch history selects which of the four supplies the prediction]
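The (m,n) scheme above can be sketched compactly. This is an illustrative model, not a hardware description: table size, PC indexing, and the initial counter value are assumptions.

```python
class CorrelatingPredictor:
    """An (m, n) correlating predictor: a global m-bit history
    selects one of 2**m n-bit saturating counters per table entry."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.entries = m, entries
        self.history = 0                   # global m-bit shift register
        self.max_count = (1 << n) - 1
        # start each counter weakly taken (assumption)
        self.table = [[self.max_count // 2 + 1] * (1 << m)
                      for _ in range(entries)]

    def predict(self, pc):
        ctr = self.table[pc % self.entries][self.history]
        return ctr > self.max_count // 2   # upper half => predict taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        ctr = row[self.history]
        if taken:
            row[self.history] = min(ctr + 1, self.max_count)
        else:
            row[self.history] = max(ctr - 1, 0)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

A strictly alternating branch defeats a plain 2-bit BHT, but a (2,2) predictor learns it almost immediately, because each 2-bit history context always precedes the same outcome.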
Accuracy of Different Schemes

[Figure: frequency of mispredictions on SPEC89 for three predictors — 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries organized as a (2,2) predictor. Across nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, rates range from 0% to 18%; the (2,2) predictor is as good as or better than the simple 2-bit BHT, and the unlimited-entry BHT is barely better than the 4,096-entry one]
Tournament Predictors

A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:
• Global predictor
  – 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  – Each entry is a standard 2-bit predictor
• Local predictor
  – Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  – The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
Tournament Predictors

• Multilevel branch predictor
• Uses an n-bit saturating counter to choose between predictors
• Usual choice is between a global and a local predictor

The selector is updated with the outcome pair P1/P2 = {0,1}/{0,1}, where 0 means that predictor's prediction was incorrect and 1 means it was correct.
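The selection counter can be sketched on its own. A minimal sketch, assuming a single 2-bit selector and taking the two component predictors' predictions as given inputs (the update rule moves toward whichever predictor was right when exactly one of them was):

```python
def tournament_select(outcomes, p1_preds, p2_preds):
    """2-bit selector between predictors P1 and P2. Returns the
    prediction chosen for each branch; sel 0-1 favors P1, 2-3
    favors P2."""
    sel = 0
    chosen = []
    for taken, p1, p2 in zip(outcomes, p1_preds, p2_preds):
        chosen.append(p1 if sel < 2 else p2)
        # update only when exactly one component was correct
        if p1 == taken and p2 != taken:
            sel = max(sel - 1, 0)
        elif p2 == taken and p1 != taken:
            sel = min(sel + 1, 3)
    return chosen
```

If P2 is consistently right and P1 consistently wrong, the selector switches to P2 after two branches, mirroring the 2-bit hysteresis used for prediction itself.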
Comparing Predictors (Fig. 2.8)
• Advantage of tournament predictor is ability to
select the right predictor for a particular branch
– Particularly crucial for integer benchmarks.
– A typical tournament predictor will select the global predictor
almost 40% of the time for the SPEC integer benchmarks and
less than 15% of the time for the SPEC FP benchmarks
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)

[Figure: branch mispredictions per 1000 instructions on SPEC2000. SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) ranges from about 5 to 13 per 1000 instructions — a 6% misprediction rate per branch (19% of INT instructions are branches). SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa) ranges from about 0 to 1 per 1000 instructions — a 2% misprediction rate per branch (5% of FP instructions are branches)]
Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls the
instruction fetch.
• BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found the corresponding
Predicted PC is returned
• If the branch was predicted taken, instruction
fetch continues at the returned predicted PC
Branch Target Buffers
Branch target folding: for unconditional branches store the
target instructions themselves in the buffer!
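The fetch-stage use of the BTB described above can be sketched in a few lines. This is an illustrative model, assuming 4-byte instructions and modeling the BTB as a dictionary mapping branch PC to a (predicted-taken, target) pair; the sample entry is hypothetical.

```python
def fetch_next_pc(pc, btb):
    """Sketch of branch-target-buffer use during instruction fetch:
    on a BTB hit with a taken prediction, fetch continues at the
    stored target; otherwise fall through to pc + 4."""
    entry = btb.get(pc)
    if entry is not None:
        predicted_taken, target = entry
        if predicted_taken:
            return target
    return pc + 4            # BTB miss or predicted not taken

btb = {0x40: (True, 0x100)}  # hypothetical entry: branch at 0x40
```

The point of the structure is that the predicted target is available at fetch time, before the branch is even decoded, so no fetch bubble is needed on a correct taken prediction.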
Dynamic Branch Prediction Summary

• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take the insight to the next level by using multiple predictors
  – usually one based on global information and one based on local information, combined with a selector
  – In 2006, tournament predictors using about 30K bits were in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes branch address & prediction
• Branch target folding
In-Class Exercise

Given a deeply pipelined processor and a branch-target buffer for conditional branches only, assume a misprediction penalty of 4 cycles, a buffer miss penalty of 3 cycles, a 90% hit rate, 90% accuracy, and 15% branch frequency. How much faster is the processor with the BTB than a processor with a fixed 2-cycle branch penalty?

Speedup = CPI_noBTB / CPI_BTB
        = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
Stalls  = Σ Frequency × Penalty
Stalls_noBTB = 15% × 2 = 0.30
Stalls_BTB = Stalls_miss-BTB + Stalls_hit-BTB-correct + Stalls_hit-BTB-wrong
           = 15% × 10% × 3 + 15% × 90% × 90% × 0 + 15% × 90% × 10% × 4
           = 0.045 + 0 + 0.054 = 0.099
Speedup = (1 + 0.30) / (1 + 0.099) ≈ 1.18
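The exercise's arithmetic can be recomputed directly from its formulas. A small sketch (parameter names are mine; the stall model is exactly the slide's three-term sum):

```python
def btb_speedup(branch_freq, hit_rate, accuracy,
                mispredict_penalty, miss_penalty, base_penalty,
                cpi_base=1.0):
    """Speedup of a BTB machine over one with a fixed branch penalty,
    using Stalls = sum(frequency * penalty) as on the slide."""
    stalls_no_btb = branch_freq * base_penalty
    stalls_btb = (branch_freq * (1 - hit_rate) * miss_penalty          # BTB miss
                  + branch_freq * hit_rate * accuracy * 0              # hit, correct
                  + branch_freq * hit_rate * (1 - accuracy)
                    * mispredict_penalty)                              # hit, wrong
    return (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)

# The exercise's numbers: 15% branches, 90% hit, 90% accuracy,
# 4-cycle mispredict, 3-cycle BTB miss, fixed 2-cycle penalty without BTB
speedup = btb_speedup(0.15, 0.90, 0.90, 4, 3, 2)
```

With these inputs the BTB stalls come to 0.099 cycles per instruction and the speedup is about 1.18.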

Dynamic Scheduling

Rearrange the order of instructions to reduce stalls while maintaining data flow.

Advantages:
• Compiler doesn't need to have knowledge of the microarchitecture
• Handles cases where dependences are unknown at compile time

Disadvantages:
• Substantial increase in hardware complexity
• Complicates exceptions

Dynamic Scheduling

Dynamic scheduling implies:
• Out-of-order execution
• Out-of-order completion
• This creates the possibility of WAR and WAW hazards

Tomasulo's approach:
• Tracks when operands are available
• Introduces register renaming in hardware
  – Minimizes WAW and WAR hazards

Register Renaming

Example:

DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14   ;antidependence on F8 (read by ADD.D)
MUL.D F6,F10,F8    ;antidependence on F6 (read by S.D) + output dependence on F6

Register Renaming

Example, renamed with temporaries S and T:

DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T

Now only RAW hazards remain, which can be strictly ordered.
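The renaming transformation above can be sketched as a one-pass rewrite. This is an illustrative sketch (the slides rename only the conflicting registers; here every destination gets a fresh temporary, which has the same effect): instructions are (op, dest, srcs) tuples, with dest = None for stores.

```python
def rename(instrs):
    """Rewrite each destination with a fresh temporary and redirect
    later reads to it. WAR and WAW hazards disappear; only true
    (RAW) dependences survive in the renamed sources."""
    current = {}               # architectural register -> current name
    fresh = iter("STUVWXYZ")   # pool of temporary names (assumption)
    out = []
    for op, dest, srcs in instrs:
        # RAW: reads see the most recent renamed producer
        srcs = tuple(current.get(s, s) for s in srcs)
        if dest is not None:
            new = next(fresh)
            current[dest] = new
            dest = new
        out.append((op, dest, srcs))
    return out

code = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6",)),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
```

On the slide's sequence, SUB.D and MUL.D get fresh destinations, so they no longer conflict with the earlier reads of F8 and F6.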
Register Renaming

Register renaming is provided by reservation stations (RS), each of which contains:
• The instruction
• Buffered operand values (when available)
• The reservation station number of the instruction providing the operand values

• An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
• Pending instructions designate the RS to which they will send their output
  – Result values are broadcast on a result bus, called the common data bus (CDB)
  – Only the last output updates the register file
• As instructions are issued, the register specifiers are renamed with the reservation station
• There may be more reservation stations than registers

Tomasulo's Algorithm

Load and store buffers contain data and addresses, and act like reservation stations.

Top-level design: [figure — see the Tomasulo Organization diagram]
Tomasulo's Algorithm

Three steps:
• Issue
  – Get the next instruction from the FIFO queue
  – If an RS is available, issue the instruction to the RS, with operand values if available
  – If no operand values are available, stall the instruction
• Execute
  – When an operand becomes available, store it in any reservation stations waiting for it
  – When all operands are ready, execute the instruction
  – Loads and stores are maintained in program order through the effective address
  – No instruction is allowed to initiate execution until all branches that precede it in program order have completed
• Write result
  – Write the result on the CDB into reservation stations and store buffers
  – (Stores must wait until both address and value are received)
Tomasulo Organization

[Figure: the FP op queue issues into reservation stations — Add1-Add3 feeding the FP adders and Mult1-Mult2 feeding the FP multipliers; six load buffers (Load1-Load6) bring data from memory, and store buffers send results to memory; the FP registers, reservation stations, and buffers are all connected by the Common Data Bus (CDB)]
Reservation Station Components

Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
• Store buffers have a V field for the result to be stored
Qj, Qk: reservation stations producing the source registers (value to be written)
• Note: Qj,Qk = 0 => ready
• Store buffers have only Qi, for the RS producing the result
Busy: indicates that the reservation station or FU is busy

Register result status: indicates which functional unit will write each register, if one exists; blank when there are no pending instructions that will write that register.
Three Stages of Tomasulo Algorithm

1. Issue — get instruction from FP op queue
   If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).
2. Execute — operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.

• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
  – 64 bits of data + 4 bits of functional-unit source address
  – A unit writes if the source matches the functional unit it expects (the producer of its result)
  – Does the broadcast
• Example speed: 3 clocks for FP +,-; 10 for *; 40 clocks for /
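The dataflow behavior of the three stages can be sketched with a toy simulator: each instruction "executes" as soon as its operands have been broadcast, regardless of program order. This is a simplification of the slides' machine — it ignores structural limits on reservation stations and the exact issue/execute/write-cycle accounting, so the absolute cycle numbers are approximate; the point it reproduces is the out-of-order completion of the upcoming example.

```python
# Latencies follow the slides' example speeds (load latency is an
# assumption); instructions are (op, dest, src1, src2) with None
# for absent sources.
LATENCY = {"LD": 2, "ADDD": 3, "SUBD": 3, "MULTD": 10, "DIVD": 40}

def run(program):
    """Return {instruction index: approximate write-result cycle},
    assuming one issue per cycle and execution starting as soon as
    all source operands have been broadcast on the CDB."""
    ready = {}                               # register -> broadcast cycle
    write = {}
    for idx, (op, dest, s1, s2) in enumerate(program):
        issue = idx + 1                      # in-order issue, one per cycle
        start = max([ready.get(s, 0) for s in (s1, s2) if s] + [issue])
        write[idx] = start + LATENCY[op]     # execute, then broadcast
        ready[dest] = write[idx]
    return write

program = [
    ("LD",    "F6",  None, None),
    ("LD",    "F2",  None, None),
    ("MULTD", "F0",  "F2", "F4"),
    ("SUBD",  "F8",  "F6", "F2"),
    ("DIVD",  "F10", "F0", "F6"),
    ("ADDD",  "F6",  "F8", "F2"),
]
cycles = run(program)
```

Even in this crude model, SUBD and ADDD write their results long before the earlier-issued MULTD, and DIVD finishes last: in-order issue, out-of-order execution, out-of-order completion.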
Tomasulo Example

Hardware: 3 load buffers (Load1-Load3), 3 FP adder reservation stations (Add1-Add3), 2 FP multiplier reservation stations (Mult1-Mult2). The Time column in each RS is an FU count-down; Clock is the cycle counter.

Instruction stream (Issue / ExecComp / WriteResult tracked per instruction):
LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Clock = 0: all load buffers and reservation stations are free (Busy = No), and the register result status row (F0, F2, F4, F6, F8, F10, F12, ..., F30) is blank.
Tomasulo Example Cycle 1

LD F6,34(R2) issues to Load1 (Busy = Yes, Address = 34+R2); register result status: F6 = Load1.
Tomasulo Example Cycle 2

LD F2,45(R3) issues to Load2 (Busy = Yes, Address = 45+R3); register result status: F2 = Load2, F6 = Load1.

• Note: can have multiple loads outstanding
Tomasulo Example Cycle 3

MULTD F0,F2,F4 issues to Mult1 with Vk = R(F4) and Qj = Load2; register result status: F0 = Mult1. Load1 completes execution.

• Note: register names are removed ("renamed") in the reservation stations; MULT issued
• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4

Load1 writes M(A1) on the CDB and is freed; F6 = M(A1). SUBD F8,F6,F2 issues to Add1 with Vj = M(A1) and Qk = Load2; register result status: F8 = Add1. Load2 completes execution.

• Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5

Load2 writes M(A2) on the CDB and is freed; F2 = M(A2), and Add1 and Mult1 capture the value. DIVD F10,F0,F6 issues to Mult2 with Vk = M(A1) and Qj = Mult1; register result status: F10 = Mult2.

• Timers start counting down: Add1 (SUBD, time = 2) and Mult1 (MULTD, time = 10)
Tomasulo Example Cycle 6

ADDD F6,F8,F2 issues to Add2 with Vk = M(A2) and Qj = Add1; register result status: F6 = Add2. Add1 is at time = 1, Mult1 at time = 9.

• Issue ADDD here despite the name dependency on F6?
Tomasulo Example Cycle 7

Add1 (time = 0) finishes executing SUBD; Mult1 is at time = 8.

• Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 8

Add1 writes (M-M) on the CDB and is freed; F8 = (M-M), and Add2 captures the value as Vj. Add2 (ADDD) now counts down (time = 2); Mult1 is at time = 7.
Tomasulo Example Cycle 9

No writes this cycle: Add2 counts down to time = 1, Mult1 to time = 6.
Tomasulo Example Cycle 10

Add2 (time = 0) finishes executing ADDD; Mult1 is at time = 5.

• Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11

Add2 writes (M-M+M) on the CDB and is freed; F6 = (M-M+M). Mult1 is at time = 4.

• Write the result of ADDD here, overwriting F6?
• All quick instructions complete in this cycle!
Tomasulo Example Cycles 12-14

No state changes except Mult1's count-down: time = 3 at cycle 12, 2 at cycle 13, 1 at cycle 14.
Tomasulo Example Cycle 15

Mult1 (time = 0) finishes executing MULTD.

• Mult1 (MULTD) completing; what is waiting for it?
Tomasulo Example Cycle 16

Mult1 writes M*F4 on the CDB and is freed; F0 = M*F4, and Mult2 captures the value as Vj. Mult2 (DIVD) now counts down (time = 40).

• Just waiting for Mult2 (DIVD) to complete
Faster than light computation
(skip a couple of cycles)
Tomasulo Example Cycle 55

Mult2 is at time = 1.
Tomasulo Example Cycle 56

Mult2 (time = 0) finishes executing DIVD.

• Mult2 (DIVD) is completing; what is waiting for it?
Tomasulo Example Cycle 57

Mult2 writes the result on the CDB and is freed; F10 = Result.

Instruction status (final):   Issue   ExecComp   WriteResult
LD    F6,34(R2)                 1        3           4
LD    F2,45(R3)                 2        4           5
MULTD F0,F2,F4                  3       15          16
SUBD  F8,F6,F2                  4        7           8
DIVD  F10,F0,F6                 5       56          57
ADDD  F6,F8,F2                  6       10          11

• Once again: in-order issue, out-of-order execution, and out-of-order completion.
Speculation to greater ILP
• Greater ILP: overcome control dependence by hardware speculating on the
  outcome of branches and executing the program as if the guesses were correct
  – Speculation => fetch, issue, and execute instructions as if branch
    predictions were always correct
  – Dynamic scheduling => only fetches and issues instructions
• Essentially a data-flow execution model: operations execute as soon as
  their operands are available
7/27/2016
CSCE 430/830 Thread Level Parallelism
77
Speculation to greater ILP
• 3 components of HW-based speculation:
1. Dynamic branch prediction to choose which
instructions to execute
2. Speculation to allow execution of instructions
before control dependences are resolved
+ ability to undo effects of incorrectly speculated
sequence
3. Dynamic scheduling to deal with scheduling of
different combinations of basic blocks
Adding Speculation to Tomasulo
• Must separate execution from allowing
instruction to finish or “commit”
• This additional step called instruction commit
• When an instruction is no longer speculative,
allow it to update the register file or memory
• Requires additional set of buffers to hold results
of instructions that have finished execution but
have not committed
• This reorder buffer (ROB) is also used to pass
results among instructions that may be
speculated
Reorder Buffer (ROB)
• In non-speculative Tomasulo’s algorithm, once
an instruction writes its result, any subsequently
issued instructions will find result in the register
file
• With speculation, the register file is not updated
until the instruction commits
– (we know definitively that the instruction should execute)
• Thus, the ROB supplies operands in interval
between completion of instruction execution and
instruction commit
– ROB is a source of operands for instructions, just as
reservation stations (RS) provide operands in Tomasulo’s
algorithm
– The ROB extends the architectural registers, just as the RS do
Reorder Buffer Entry Fields
Each entry in the ROB contains four fields:
1. Instruction type
   • a branch (has no destination result), a store (has a memory address
     destination), or a register operation (ALU operation or load, which
     has a register destination)
2. Destination
   • Register number (for loads and ALU operations) or memory address
     (for stores) where the instruction result should be written
3. Value
   • Value of the instruction result until the instruction commits
4. Ready
   • Indicates that the instruction has completed execution and the value
     is ready
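The four fields above can be sketched as a small record type. This is an illustrative Python sketch; the field names are mine, not from any particular processor.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    # 1. Instruction type: "branch", "store", or "regop" (ALU op or load)
    itype: str
    # 2. Destination: register name (regop) or memory address (store)
    dest: Optional[Union[str, int]]
    # 3. Value: result held here until the instruction commits
    value: Optional[int] = None
    # 4. Ready: execution has completed and value is valid
    ready: bool = False

entry = ROBEntry(itype="regop", dest="F0")
entry.value, entry.ready = 42, True   # marks execution as complete
print(entry.ready)   # True
```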
Reorder Buffer operation
• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results are placed into the ROB
  – Supplies operands to other instructions between execution complete &
    commit => more registers, like the RS
  – Tag results with the ROB entry number instead of the reservation station
• On commit, values at the head of the ROB are placed in registers (or
  memory locations)
• As a result, it is easy to undo speculated instructions on mispredicted
  branches or on exceptions
(Block diagram omitted: FP Op Queue -> Reservation Stations -> FP adders /
FP multipliers -> Reorder Buffer -> FP Regs, with the commit path running
through the ROB.)
Recall: 4 Steps of Speculative Tomasulo
Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr &
send operands & reorder buffer no. for destination (this stage
sometimes called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch
CDB for result; when both in reservation station, execute;
checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update
register with result (or store to memory) and remove instr from
reorder buffer. Mispredicted branch flushes reorder buffer
(sometimes called “graduation”)
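Step 4 (commit) can be sketched as follows. This is an illustrative model, not a real pipeline: entries commit from the head of the ROB while ready, and a mispredicted branch at the head flushes everything behind it.

```python
from collections import deque

def commit(rob, regs, mem):
    """Commit from the ROB head while the head entry is ready.
    Each entry: (itype, dest, value, ready, mispredicted)."""
    while rob and rob[0][3]:                   # head has its result
        itype, dest, value, _ready, mispred = rob.popleft()
        if itype == "branch":
            if mispred:
                rob.clear()                    # flush speculative work
                return
        elif itype == "store":
            mem[dest] = value                  # memory updated in order
        else:
            regs[dest] = value                 # register file updated

regs, mem = {}, {}
rob = deque([
    ("regop", "F0", 7, True, False),
    ("branch", None, None, True, True),   # mispredicted branch
    ("regop", "F4", 9, True, False),      # speculative: must be flushed
])
commit(rob, regs, mem)
print(regs)   # {'F0': 7} -- the instruction after the branch never commits
```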
Tomasulo With Reorder buffer
(Sequence of block-diagram slides omitted. The example issues, in order:

  LD   F0,10(R2)    ROB1  (oldest)
  ADDD F10,F4,F0    ROB2
  DIVD F2,F10,F6    ROB3
  BNE  F2,<…>       ROB4
  LD   F4,0(R3)     ROB5
  ADDD F0,F4,F6     ROB6
  ST   0(R3),F4     ROB7  (newest)

Reservation stations name ROB entries as operand sources, e.g.
ADDD R(F4),ROB1 and DIVD ROB2,R(F6). The later frames show the second LD
and the ST completing ("Done? = Y") and the second ADDD executing while
the BNE in ROB4 is still unresolved — none of these speculative results
can commit until the branch resolves.)
What about memory hazards???
Avoiding Memory Hazards
• WAW and WAR hazards through memory are
eliminated with speculation because actual
updating of memory occurs in order, when a
store is at head of the ROB, and hence, no
earlier loads or stores can still be pending
• RAW dependences through memory are
maintained by two restrictions:
1. not allowing a load to initiate the second step of its execution
if any active ROB entry occupied by a store has a Destination
field that matches the value of the A field of the load, and
2. maintaining the program order for the computation of an
effective address of a load with respect to all earlier stores.
• these restrictions ensure that any load that
accesses a memory location written to by an
earlier store cannot perform the memory access
until the store has written the data
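Restriction 1 above can be sketched as a simple check. This is illustrative: earlier ROB entries are modeled only as (type, destination address) pairs.

```python
def load_may_proceed(load_addr, earlier_entries):
    """A load may start its memory access only if no earlier, still-active
    store in the ROB has a destination address matching the load's A field."""
    return all(not (itype == "store" and dest == load_addr)
               for itype, dest in earlier_entries)

pending = [("store", 0x100), ("regop", None)]
print(load_may_proceed(0x100, pending))  # False: must wait for the store
print(load_may_proceed(0x200, pending))  # True: no address match
```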
Exceptions and Interrupts
• IBM 360/91 invented “imprecise interrupts”
– Computer stopped at this PC; it's likely close to this address
– Not so popular with programmers
– Also, what about Virtual Memory? (Not in IBM 360)
• Technique for both precise interrupts/exceptions
and speculation: in-order completion and in-order commit
– If we speculate and are wrong, need to back up and restart
execution to point at which we predicted incorrectly
– This is exactly the same as what we need to do for precise exceptions
• Exceptions are handled by not recognizing the
exception until instruction that caused it is ready
to commit in ROB
– If a speculated instruction raises an exception, the exception
is recorded in the ROB
– This is why reorder buffers are in all new processors
Multiple Issue and Static Scheduling
• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  – Statically scheduled superscalar processors
  – VLIW (very long instruction word) processors
  – Dynamically scheduled superscalar processors
Copyright © 2012, Elsevier Inc. All rights reserved.
VLIW: Very Long Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In IA-64, grouping called a “packet”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long
instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches
Recall: Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: L.D     F0,0(R1)
 2       L.D     F6,-8(R1)
 3       L.D     F10,-16(R1)
 4       L.D     F14,-24(R1)
 5       ADD.D   F4,F0,F2
 6       ADD.D   F8,F6,F2
 7       ADD.D   F12,F10,F2
 8       ADD.D   F16,F14,F2
 9       S.D     0(R1),F4
10       S.D     -8(R1),F8
11       S.D     -16(R1),F12
12       DSUBUI  R1,R1,#32
13       BNEZ    R1,LOOP
14       S.D     8(R1),F16    ; 8-32 = -24

Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles
14 clock cycles, or 3.5 per iteration
Loop Unrolling in VLIW

Memory          Memory          FP               FP               Int. op/          Clock
reference 1     reference 2     operation 1      op. 2            branch
L.D F0,0(R1)    L.D F6,-8(R1)                                                       1
L.D F10,-16(R1) L.D F14,-24(R1)                                                     2
L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2   ADD.D F8,F6,F2                     3
L.D F26,-48(R1)                 ADD.D F12,F10,F2 ADD.D F16,F14,F2                   4
                                ADD.D F20,F18,F2 ADD.D F24,F22,F2                   5
S.D 0(R1),F4    S.D -8(R1),F8   ADD.D F28,F26,F2                                    6
S.D -16(R1),F12 S.D -24(R1),F16                                                     7
S.D -32(R1),F20 S.D -40(R1),F24                                   DSUBUI R1,R1,#48  8
S.D -0(R1),F28                                                    BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)
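The per-iteration numbers on this slide can be checked with a little arithmetic. A sketch: the operation count of 23 is what I read off the schedule above, and the 5-slot issue width matches the table's columns.

```python
# Counts read off the VLIW schedule above (illustrative check).
ops = 23                 # operations placed across the 9 issue slots
clocks = 9
iterations = 7
slots_per_clock = 5      # 2 mem refs + 2 FP ops + 1 int/branch

print(round(clocks / iterations, 2))               # ~1.29 clocks/iteration
print(round(ops / clocks, 2))                      # ~2.56 ops per clock
print(round(ops / (clocks * slots_per_clock), 2))  # ~0.51 slot efficiency
```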
Problems with 1st Generation VLIW
• Increase in code size
– generating enough operations in a straight-line code fragment
requires ambitiously unrolling loops
– whenever VLIW instructions are not full, unused functional
units translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused entire processor
to stall, since all functional units must be kept synchronized
– The compiler might predict functional-unit latencies, but cache
behavior is hard to predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit
latencies require different versions of the code
Multiple Issue
• Limit the number of instructions of a given class that can be issued in
  a "bundle"
  – i.e., one FP, one integer, one load, one store
• Examine all the dependencies among the instructions in the bundle
• If dependencies exist in the bundle, encode them in the reservation
  stations
• Also need multiple completion/commit
Example

Loop: LD     R2,0(R1)      ;R2=array element
      DADDIU R2,R2,#1      ;increment R2
      SD     R2,0(R1)      ;store result
      DADDIU R1,R1,#8      ;increment pointer
      BNE    R2,R3,LOOP    ;branch if not last element
Example (No Speculation)
(Timing table omitted: issue, execute, and write-result cycles for the
loop without speculation.)

Example (Speculation)
(Timing table omitted: issue, execute, write-result, and commit cycles
for the loop with speculation.)
Branch-Target Buffer
• Need high instruction bandwidth!
• Branch-target buffers
  – Next-PC prediction buffer, indexed by the current PC
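A branch-target buffer can be sketched as a small tagged table indexed by low PC bits. The table size and the PC+4 fall-through are illustrative assumptions, not from the slides.

```python
BTB_ENTRIES = 1024          # illustrative size
btb = {}                    # index -> (tag, predicted target)

def btb_predict(pc):
    """Return the predicted next PC, or the fall-through PC on a miss."""
    idx, tag = pc % BTB_ENTRIES, pc // BTB_ENTRIES
    entry = btb.get(idx)
    if entry and entry[0] == tag:
        return entry[1]     # hit: predicted-taken target
    return pc + 4           # miss: fall through

def btb_update(pc, target):
    btb[pc % BTB_ENTRIES] = (pc // BTB_ENTRIES, target)

btb_update(0x40, 0x100)
print(hex(btb_predict(0x40)))  # 0x100 (hit)
print(hex(btb_predict(0x44)))  # 0x48 (miss: fall through)
```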
Branch Folding
• Optimizations:
  – Larger branch-target buffer
  – Add the target instruction into the buffer to deal with the longer
    decoding time required by a larger buffer
  – "Branch folding"
Return Address Predictor
• Most unconditional branches come from function returns
• The same procedure can be called from multiple sites
  – Causes the buffer to potentially forget the return address from
    previous calls
• Create a return-address buffer organized as a stack
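The return-address stack can be sketched as follows; the depth and the drop-oldest overflow policy are illustrative assumptions.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)       # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # None means no prediction available (stack empty / underflow)
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x104)       # call from 0x100
ras.on_call(0x204)       # nested call from 0x200
print(hex(ras.predict_return()))   # 0x204 -- innermost return first
print(hex(ras.predict_return()))   # 0x104
```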
Integrated Instruction Fetch Unit
• Design a monolithic unit that performs:
  – Branch prediction
  – Instruction prefetch
    » Fetch ahead
  – Instruction memory access and buffering
    » Deal with crossing cache lines
Register Renaming
• Register renaming vs. reorder buffers
  – Instead of virtual registers from reservation stations and the reorder
    buffer, create a single register pool
    » Contains visible registers and virtual registers
  – Use a hardware-based map to rename registers during issue
  – WAW and WAR hazards are avoided
  – Speculation recovery occurs by copying during commit
  – Still need a ROB-like queue to update the map in order
• Simplifies commit:
  – Record that the mapping between architectural register and physical
    register is no longer speculative
  – Free up the physical register used to hold the older value
  – In other words: SWAP physical registers on commit
• Physical register de-allocation is more difficult
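The hardware rename map can be sketched as below. This is a minimal model; the register names and free-list size are illustrative, and every write allocates a fresh physical register, which is what removes WAW/WAR hazards.

```python
free_list = [f"p{i}" for i in range(8)]   # physical register pool
rename_map = {}                           # architectural -> physical

def rename(dest, srcs):
    """Rename one instruction: read srcs through the map, then allocate
    a fresh physical register for dest."""
    phys_srcs = [rename_map.get(s, s) for s in srcs]
    phys_dest = free_list.pop(0)
    rename_map[dest] = phys_dest
    return phys_dest, phys_srcs

print(rename("r1", ["r2", "r3"]))   # ('p0', ['r2', 'r3'])
print(rename("r1", ["r1", "r4"]))   # ('p1', ['p0', 'r4']) -- WAW removed
```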
Integrated Issue and Renaming
• Combining instruction issue with register renaming:
  – Issue logic pre-reserves enough physical registers for the bundle
    (fixed number?)
  – Issue logic finds dependencies within the bundle, maps registers as
    necessary
  – Issue logic finds dependencies between the current bundle and already
    in-flight bundles, maps registers as necessary
How Much to Speculate?
• How much to speculate
  – Mis-speculation degrades performance and power relative to no
    speculation
    » May cause additional misses (cache, TLB)
  – Prevent speculative code from causing higher-cost misses (e.g., in L2)
• Speculating through multiple branches
  – Complicates speculation recovery
  – No processor can resolve multiple branches per cycle
Energy Efficiency
• Speculation and energy efficiency
  – Note: speculation is only energy efficient when it significantly
    improves performance
• Value prediction
  – Uses:
    » Loads that load from a constant pool
    » Instructions that produce a value from a small set of values
  – Has not been incorporated into modern processors
  – A similar idea, address aliasing prediction, is used on some processors
Pipelining
• Multiple instructions are overlapped in execution
• Takes advantage of parallelism
• Increases instruction throughput, but does not reduce the time to
  complete an individual instruction
• Ideal case:
  – Time per instruction = time per instruction on the unpipelined
    machine / number of pipe stages
  – Speedup = the number of pipe stages
• Intel Pentium 4: 20 stages; Cedar Mill: 31 stages
• Intel Core i7: 14 stages. Why?
The Basics of a RISC Instruction Set
• Key Properties:
– All operations on data apply to data in registers
– The only operations that affect memory are load and store
operations
– The instruction formats are few in number, with all
instructions typically being one size.
7/27/2016
CSCE 430/830, Basic Pipelining & Performance
118
Basic Pipelining: A "Typical" RISC ISA
• 32-bit fixed format instruction (3 formats)
– ALU instructions, Load/Store, Branches and Jumps
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store:
base + displacement
– no indirection
• Simple branch conditions /Jump
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Basic Pipelining: e.g., MIPS

Register-Register
  31      26 25     21 20     16 15     11 10      6 5       0
  |   Op    |  Rs1    |  Rs2    |   Rd    |         |  Opx    |

Register-Immediate
  31      26 25     21 20     16 15                           0
  |   Op    |  Rs1    |   Rd    |         immediate           |

Branch
  31      26 25     21 20     16 15                           0
  |   Op    |  Rs1    | Rs2/Opx |         immediate           |

Jump / Call
  31      26 25                                               0
  |   Op    |                  target                         |
A simple implementation of a RISC
Instruction Set
• Every instruction can be implemented in at most
5 cycles.
– Instruction fetch cycle (IF): send PC to memory and fetch the
current instruction; Update the PC to the next PC by adding 4.
– Instruction decode/register fetch cycle (ID): decode the
instruction and read the registers. For a branch, do the
equality test on the registers, and compute the target address
– Execution/effective address cycle (EX): The ALU operates on
the operands prepared in the prior cycle, performing:
» Memory reference: ALU adds the base register and the
offset.
LW 100(R1) => 100 + R1
» Register-Register ALU instruction: ALU performs the
operation specified by the opcode on the two values read from
the register file. ADD R3,R1,R2 => R1 + R2
» Register-Immediate ALU instruction: ALU performs the
operation specified by the opcode on the first value read
from the register file and the immediate. ADD R3,R1,#3 =>
R1 + 3
A simple implementation of a RISC
Instruction Set
• Every instruction can be implemented in at most
5 cycles.
– Memory access (MEM). Read the memory using the effective
address computed in the previous cycle if the instruction is a
load; If it is a store, write the data to the memory
– Write-back cycle (WB). For Register-Register ALU instruction
or load instruction, write the result into the register file,
whether it comes from the memory (for a load) or from the
ALU (for an ALU instruction)
• Branch instructions require 2 cycles, store
instructions require 4 cycles, others require 5
cycles.
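The cycle counts above (branch 2, store 4, others 5) fix the unpipelined CPI once an instruction mix is known. The mix below is an assumed example, not from the slides.

```python
# Cycles per class from the slide; the mix frequencies are assumptions.
cycles = {"branch": 2, "store": 4, "other": 5}
mix    = {"branch": 0.12, "store": 0.10, "other": 0.78}

cpi = sum(mix[c] * cycles[c] for c in cycles)
print(round(cpi, 2))   # 4.54 for this assumed mix
```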
An example of an instruction execution
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath diagram omitted: PC, instruction memory, IR, register file,
ALU.)

Instruction Fetch (IF)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• Fetch the instruction from memory into the IR
• Move the PC to the next instruction
Instruction Decode (ID)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• Decode the instruction: opcode ADD, sources R1 and R2, destination R3
• Read the values x and y from R1 and R2 in the register file
Execution (EX)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• The ALU executes the operation, computing x + y

Writeback (WB)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• Write the result x + y into R3 in the register file
Another example: Load
• LW R3, R31, 200  =>  R3 = mem[R31 + 200];
(Datapath animation frames omitted.)
• ID: read the base value n from R31 and the immediate 200
• EX: the ALU computes the effective address n + 200
• MEM: read memory at address n + 200 (here the value 123456)
• WB: write the loaded value into R3
5-stage pipeline

• IF:  memory    — used by alu, load/store, branch
• ID:  reg file  — used by alu, load/store, branch
• EX:  ALU       — used by alu, load/store
• MEM: memory    — used by load/store
• WB:  reg file  — used by alu, load

• Can we change the order of MEM and WB? Why?
5 Steps of MIPS Datapath (Figure A.2, Page A-8)
(Datapath diagram omitted: Instruction Fetch, Instr. Decode / Reg. Fetch,
Execute / Addr. Calc, Memory Access, Write Back.)

IR <= mem[PC];
PC <= PC + 4
Reg[IRrd] <= Reg[IRrs1] opIRop Reg[IRrs2]
Classic RISC Pipeline
Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline
Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows the overlap among the parts of
the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID
stage and as a destination in the WB stage, it appears twice. We show that it is read in one part of the stage and written in another
by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for
instruction memory, DM for data memory, and CC for clock cycle.
Inst. Set Processor Controller
(State-machine diagram omitted. Every instruction begins with
IR <= mem[PC]; PC <= PC + 4, then after operand fetch
(A <= Reg[IRrs1]; B <= Reg[IRrs2]) decodes into per-class sequences:
RR: r <= A opIRop B; RI: r <= A opIRop IRim; LD: r <= A + IRim then
WB <= Mem[r]; jmp/JSR: PC <= IRjaddr; br: PC <= PC+IRim if bop(A,b).
Results commit via WB <= r and Reg[IRrd] <= WB.)
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
(Pipelined datapath diagram omitted: IF/ID, ID/EX, EX/MEM, and MEM/WB
pipeline registers separate the five stages.)

IF:  IR <= mem[PC];  PC <= PC + 4
ID:  A <= Reg[IRrs1];  B <= Reg[IRrs2]
EX:  rslt <= A opIRop B
MEM: WB <= rslt
WB:  Reg[IRrd] <= WB
Visualizing Pipelining (Figure A.2, Page A-8)
(Pipeline diagram omitted: instructions flow through Ifetch, Reg, ALU,
DMem, Reg across clock cycles 1-7, one new instruction starting per
cycle.)
(Pipeline diagrams omitted: the same five-instruction overlap extended
across clock cycles 1-9, each instruction passing through Ifetch, Reg,
ALU, DMem, and Reg.)
Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next instruction
from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior instruction still
in the pipeline
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches
and jumps).
One Memory Port/Structural Hazards (Figure A.4, Page A-14)
(Pipeline diagram omitted: Load followed by Instr 1-4; in cycle 4 the
Load's DMem access and Instr 3's Ifetch both need the single memory
port.)
One Memory Port/Structural Hazards (Similar to Figure A.5, Page A-15)
(Pipeline diagram omitted: the structural hazard is resolved by stalling
Instr 3 for one cycle, inserting a bubble into the pipeline.)
Data Hazard on R1 (Figure A.6, Page A-17)

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

(Pipeline diagram omitted: add writes r1 in WB, but sub, and, and or read
r1 in ID/RF before that write completes.)
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler
nomenclature). This hazard results from an actual
need for communication.
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can it happen in MIPS 5 stage pipeline?
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
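The three hazard definitions above can be sketched as a small classifier. This is illustrative Python; instructions are modeled only as a destination register and a list of source registers.

```python
def classify_hazards(i, j):
    """Classify the data hazards that later instruction j has on earlier
    instruction i. Each instruction: (dest_reg, [src_regs])."""
    hazards = []
    if i[0] in j[1]:
        hazards.append("RAW")   # j reads what i writes
    if j[0] in i[1]:
        hazards.append("WAR")   # j writes what i reads
    if j[0] == i[0]:
        hazards.append("WAW")   # j writes what i writes
    return hazards

# I: add r1,r2,r3   J: sub r4,r1,r3  -> RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # ['RAW']
# I: sub r4,r1,r3   J: add r1,r2,r3  -> WAR on r1
print(classify_hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"])))  # ['WAR']
```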
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
(Pipeline diagram omitted: the ALU result of add r1,r2,r3 is forwarded
from the EX/MEM and MEM/WB registers directly to the ALU inputs of
sub r4,r1,r3, and r6,r1,r7, and or r8,r1,r9, avoiding the stalls.)
HW Change for Forwarding (Figure A.23, Page A-37)
(Datapath diagram omitted: multiplexers at the ALU inputs select among
the ID/EX, EX/MEM, and MEM/WR pipeline registers, plus the forwarding
path from data memory.)
Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)
(Pipeline diagram omitted: add r1,r2,r3; lw r4,0(r1); sw r4,12(r1) — the
loaded value is forwarded from the lw's MEM stage to the sw's MEM stage.)
Any hazard that cannot be avoided with forwarding?
Data Hazard Even with Forwarding (Figure A.9, Page A-21)
(Pipeline diagram omitted: lw r1,0(r2) produces r1 only at the end of
MEM, too late for sub r4,r1,r6 to use it in EX in the very next cycle.)
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
[Pipeline diagram: the same sequence with a one-cycle bubble inserted after the load. Stalling sub by one cycle lets the loaded r1 be forwarded from MEM to the ALU; the and and or instructions are delayed by the same bubble.]
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:              Fast code:
    LW  Rb,b                LW  Rb,b
    LW  Rc,c                LW  Rc,c
    ADD Ra,Rb,Rc            LW  Re,e
    SW  a,Ra                ADD Ra,Rb,Rc
    LW  Re,e                LW  Rf,f
    LW  Rf,f                SW  a,Ra
    SUB Rd,Re,Rf            SUB Rd,Re,Rf
    SW  d,Rd                SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.
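The scheduling win above can be checked mechanically. A small Python sketch — the tuple encoding of the instructions is an assumption for illustration, not the slides' notation — counts load-use stalls, the only data-hazard stalls forwarding cannot hide in the 5-stage pipeline:

```python
# Count load-use stalls in a classic 5-stage MIPS pipeline with full
# forwarding: a stall occurs only when an instruction uses the result
# of the load immediately before it. Instructions are (op, dest, srcs).
def load_use_stalls(code):
    stalls = 0
    for prev, curr in zip(code, code[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in curr[2]:
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
        ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]),
        ("SW", None, ["Rd"])]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Reordering the loads removes both load-use pairs without changing the computed values.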
Control Hazard:
[Datapath diagram: IF, ID/RF, EX/address calc, MEM, WB stages, with the Zero? test and the branch-target adder located in EX.]
• Branch condition determined @ end of EX stage
• Target instruction address ready @ end of MEM stage
Control Hazard on Branches: Three Stage Stall
[Pipeline diagram: 10: beq r1,r3,36; 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9; 36: xor r10,r1,r11. The three instructions after the beq enter the pipeline before the branch resolves.]
What do you do with the 3 instructions in between?
How do you do it?
Where is the “commit”?
Reg
Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches,
a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
– Determine branch outcome sooner, AND
– Compute taken-branch (target) address earlier
• MIPS branch tests whether a register = 0 or ≠ 0
• MIPS solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
– 1 clock cycle penalty for branch versus 3
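The CPI arithmetic on this slide is easy to replay. A quick Python sketch:

```python
# New CPI = base CPI + branch frequency * branch penalty (in cycles).
def cpi_with_branches(base_cpi, branch_freq, penalty):
    return base_cpi + branch_freq * penalty

print(round(cpi_with_branches(1.0, 0.30, 3), 2))  # 1.9 with a 3-cycle penalty
print(round(cpi_with_branches(1.0, 0.30, 1), 2))  # 1.3 with the 1-cycle penalty
```

Resolving branches in ID cuts the branch contribution to CPI from 0.9 cycles per instruction to 0.3.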
Two-part solution to the long stalls:
[Datapath diagram: the zero test and a second adder for the branch target are moved into the ID/RF stage, so the Next PC mux can select the target at the end of ID.]
1. Determine the condition earlier @ ID
2. Calculate the target instruction address earlier @ ID
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
  » MIPS still incurs 1 cycle branch penalty
  » Other machines: branch target known before branch condition determination
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction

    branch instruction
    sequential successor1
    sequential successor2      } branch delay of length n
    ........
    sequential successorn
    branch target if taken

– 1 slot delay allows proper decision and branch target address in a 5-stage pipeline
– MIPS uses this
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch:
        add $1,$2,$3
        if $2=0 then
           <delay slot>
    becomes
        if $2=0 then
           add $1,$2,$3

B. From branch target:
        sub $4,$5,$6
        add $1,$2,$3
        if $1=0 then
           <delay slot>
    becomes
        sub $4,$5,$6
        add $1,$2,$3
        if $1=0 then
           sub $4,$5,$6

C. From fall through:
        add $1,$2,$3
        if $1=0 then
           <delay slot>
        sub $4,$5,$6
        OR  $7,$8,$9
    becomes
        add $1,$2,$3
        if $1=0 then
           sub $4,$5,$6
        OR  $7,$8,$9

• A is the best choice, fills delay slot
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
Delayed Branch
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful
in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: as processors move to deeper
pipelines and multiple issue, the branch delay grows and
needs more than one delay slot
– Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic approaches
relatively cheaper
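The "about 50%" figure is just the product of the two measured rates. A one-line check (variable names are mine, for illustration):

```python
fill_rate = 0.60    # fraction of branch delay slots the compiler fills
useful_rate = 0.80  # fraction of filled slots doing useful computation
print(round(fill_rate * useful_rate, 2))  # 0.48, i.e. about 50%
```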
Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
          x (Cycle Time_unpipelined / Cycle Time_pipelined)

For a simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI)
          x (Cycle Time_unpipelined / Cycle Time_pipelined)
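The ideal-CPI-1 form of the formula translates directly into code. A small Python sketch (function and parameter names are mine):

```python
# Pipeline speedup for ideal CPI = 1, from the formula above:
#   speedup = depth / (1 + stall_cpi) * (t_unpipelined / t_pipelined)
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    # clock_ratio = cycle time unpipelined / cycle time pipelined
    return depth / (1 + stall_cpi) * clock_ratio

print(pipeline_speedup(5, 0.0))            # 5.0: ideal 5-stage pipeline
print(round(pipeline_speedup(5, 0.4), 2))  # 3.57: 0.4 stall cycles per inst
```

With no stalls the speedup equals the pipeline depth; every fraction of a stall cycle per instruction eats into that directly.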
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe/1.05))
         = (Pipeline Depth/1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
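The example can be replayed numerically; the 0.4 structural-stall cycles per instruction for machine B follow the slide's assumption that every load stalls one cycle:

```python
# Machine A: dual-ported memory, no structural hazard.
# Machine B: single-ported memory -> 1-cycle structural stall on the
# 40% of instructions that are loads, but a 1.05x faster clock.
def speedup(depth, stall_cpi, clock_ratio=1.0):
    return depth / (1 + stall_cpi) * clock_ratio

depth = 5  # any depth works: it cancels in the ratio
a = speedup(depth, 0.0)
b = speedup(depth, 0.4 * 1, clock_ratio=1.05)
print(round(a / b, 2))  # 1.33: A is faster despite B's quicker clock
```

The structural hazard costs B more than its 5% clock advantage buys back.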
Evaluating Branch Alternatives
Assume 4% unconditional branches, 6% conditional branches untaken,
and 10% conditional branches taken. Deeper pipeline: target known @
end of 3rd stage, condition known @ end of 4th stage.
What’s the addition to CPI?
Penalties (cycles):

Scheduling scheme    Uncond penalty   Untaken penalty   Taken penalty
Stall pipeline             2                 3                3
Predict taken              2                 3                2
Predict not taken          2                 0                3

Additions to CPI (frequency x penalty):

Scheduling scheme    Uncond (4%)   Untaken (6%)   Taken (10%)   All branches (20%)
Stall pipeline          0.08          0.18           0.3              0.56
Predict taken           0.08          0.18           0.2              0.46
Predict not taken       0.08          0.00           0.3              0.38
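The second table is just frequency times penalty, summed per scheme. A Python sketch computing it from the penalty table (dictionary layout is mine):

```python
# Added CPI per scheme = sum over branch classes of frequency * penalty,
# using the penalty table above.
freqs = {"uncond": 0.04, "untaken": 0.06, "taken": 0.10}
penalties = {
    "stall pipeline":    {"uncond": 2, "untaken": 3, "taken": 3},
    "predict taken":     {"uncond": 2, "untaken": 3, "taken": 2},
    "predict not taken": {"uncond": 2, "untaken": 0, "taken": 3},
}
for scheme, pens in penalties.items():
    extra_cpi = sum(freqs[cls] * pens[cls] for cls in freqs)
    print(f"{scheme}: {extra_cpi:.2f}")
```

Predict-not-taken wins here (0.38 added CPI) because untaken branches, which it gets right, cost it nothing.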
More branch evaluations
• Suppose the branch frequencies (as a percentage of all
instructions) are 15% conditional branches and 1% jumps and
calls, with 60% of conditional branches taken. Consider a
4-stage pipeline where the branch is resolved at the end of
the 2nd cycle for unconditional branches and at the end of
the 3rd cycle for conditional branches. How much faster would
the machine be without any branch hazards, ignoring other
pipeline stalls?

Pipeline speedup_ideal = Pipeline depth/(1 + Pipeline stalls) = 4/(1+0) = 4
Pipeline stalls_real = (1 x 1%) + (2 x 9%) + (0 x 6%) = 0.19
(jumps/calls: 1%, 1-cycle penalty; taken conditionals: 15% x 60% = 9%,
2-cycle penalty; untaken conditionals: 6%, no penalty)
Pipeline speedup_real = 4/(1 + 0.19) = 3.36
Speedup without control hazards = 4/3.36 = 1.19
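The same arithmetic in executable form, with the branch-class frequencies broken out as comments:

```python
# Control-hazard stall CPI: 1% jumps/calls at 1 cycle, 9% taken
# conditionals (15% * 60%) at 2 cycles, 6% untaken conditionals free.
stall_cpi = 0.01 * 1 + 0.09 * 2 + 0.06 * 0
depth = 4
speedup_real = depth / (1 + stall_cpi)
print(round(stall_cpi, 2))             # 0.19
print(round(speedup_real, 2))          # 3.36
print(round(depth / speedup_real, 2))  # 1.19: gain from removing hazards
```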
More branch question
• A reduced hardware implementation of the
classic 5-stage RISC pipeline might use the EX
stage hardware to perform a branch instruction
comparison and then not actually deliver the
branch target PC to the IF stage until the clock
cycle in which the branch reaches the MEM
stage. Control hazard stalls can be reduced by
resolving branch instructions in ID, but
improving performance in one aspect may reduce
performance in other circumstances.
• How does determining branch outcome in the ID
stage have the potential to increase data hazard
stall cycles?
Problems with Pipelining
• Exception: An unusual event happens to an
instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the
processor to a new instruction stream
– Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting)
• Problem: the exception or interrupt must appear to occur
between 2 instructions (Ii and Ii+1)
– The effect of all instructions up to and including Ii is totally
complete
– No effect of any instruction after Ii can take place
• The interrupt (exception) handler either aborts
program or restarts at instruction Ii+1
Homework
• C.1.(e)
– 10-stage pipeline, full forwarding & bypassing, predicted taken
– Branch outcomes and targets are obtained at the end of the ID stage.
LD R1,0(R2)
DADDI R1, R1, #1
SD R1, 0(R2)
Cycle:            1   2   3   4   5   6   7    8    9   10  11  12  13  14
LD   R1,0(R2)    IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1 WB2
DADDI R1,R1,#1       IF1 IF2 ………………….
SD   R1,0(R2)            IF1 ……………………
Homework
• C.1.(e)
– “For example, data are forwarded from the output of the second EX stage to the input of the
first EX stage, still causing 1-cycle delay”
– Example (not the same as the problem):

Data Dependence: Read After Write
Cycle:           1   2   3   4   5   6     7   8    9    10   11  12
ADD R1,R2,R3    IF1 IF2 ID1 ID2 EX1 EX2  MEM1 MEM2 WB1  WB2
SUB R4,R5,R1        IF1 IF2 ID1 ID2 stall EX1 EX2  MEM1 MEM2 WB1 WB2

No Data Dependence
Cycle:           1   2   3   4   5   6   7    8    9    10   11
ADD R1,R2,R3    IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1  WB2
SUB R4,R5,R6        IF1 IF2 ID1 ID2 EX1 EX2  MEM1 MEM2 WB1  WB2
Homework
• C.1.(g)
– CPI = The total number of clock cycles / the total number of
instructions of code segment
Homework
• C.2.(a) & C.2.(b)
– “Assuming that only the first pipe stage can always be done
independent of whether the branch goes”
– An indicator of “predicted not taken”
– The first stage of the instruction after the branch instruction
could be done
– Example: 4-stage pipeline
– Unconditional branch: “at the end of the second cycle”
Cycle:                1   2   3   4   5   6
J loop               IF  ID  EX  WB
Next instruction         IF
Target instruction           IF  ID  EX  WB
Homework
• C.2.(a) & C.2.(b)
– Conditional Branch: “at the end of the third cycle”
– If the branch is not taken, penalty = ?

Cycle:                 1   2   3     4   5   6   7
BNEZ R4, loop         IF  ID  EX    WB
Next instruction          IF  stall ID  EX  WB
Next + 1 instruction                IF  ID  EX  WB

– If the branch is taken, penalty = ?

Cycle:                 1   2   3     4   5   6   7
BNEZ R4, loop         IF  ID  EX    WB
Next instruction          IF  stall (squashed)
Target instruction                  IF  ID  EX  WB