Chapter 2: ILP and Its Exploitation

• Review of the simple static pipeline
• ILP overview
• Dynamic branch prediction
• Dynamic scheduling, out-of-order execution
• Multiple issue (superscalar)
• Hardware-based speculation
• ILP limitations
• Intel P6 microarchitecture

Advanced Processor Pipelining

• Focus on exploiting Instruction-Level Parallelism (ILP):
  – Definition: executing multiple instructions (within a single program thread) simultaneously
  – Note that even ordinary pipelining exploits some ILP (overlapping execution of multiple instructions).
  – Focus of this chapter: increasing ILP further by allowing out-of-order execution, with or without speculation
  – Using multiple-issue datapaths to initiate multiple instructions simultaneously for further improvement; microarchitectures that do this are called superscalar
  – Examples: PowerPC, Pentium, etc.

Pipeline Performance

• Ideal pipeline CPI = 1 is the minimum number of cycles per instruction issued, if no stalls occur.
  – May be < 1 in superscalar machines.
    • E.g., ideal CPI = 1/3 in a 3-way superscalar (often stated as IPC = 3 for superscalars)
• Real pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control hazard stalls
• Maximize performance using various techniques to eliminate stalls and reduce the ideal CPI.
• Note: real pipeline CPI still needs to account for cache misses (discussed later).

Advanced Pipelining Techniques

  Technique                                   Reduces
  ------------------------------------------  -----------------------------
  Loop unrolling                              Control stalls
  Basic pipeline scheduling / forwarding      RAW stalls
  Dynamic scheduling with scoreboarding       RAW stalls
  Dynamic scheduling with register renaming   WAR & WAW stalls
  Dynamic branch prediction                   Control stalls
  Issuing multiple instructions per cycle     Ideal pipeline CPI
  Compiler dependence analysis                Ideal CPI & data stalls
  Software pipelining & trace scheduling      Ideal CPI & data stalls
  Hardware speculation                        All data & control stalls
  Dynamic memory disambiguation               RAW stalls involving memory

Instruction-Level Parallelism (ILP)

• Basic Block (BB) ILP is quite small
  – BB: a straight-line code sequence with no branches in or out, except at the entry and at the exit
  – Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between two branches
  – Plus, instructions within a BB are likely to depend on each other
• To obtain significant performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism, exploiting parallelism among iterations of a loop. E.g.,

    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];

Dependences

• A dependence is a way in which one instruction can depend on (be impacted by) another for scheduling purposes.
• Three major dependence types:
  – Data (true) dependence: RAW
  – Name dependence: WAR, WAW
  – Control dependence: branch, jump, etc.
• A dependency (or dependence) is a particular instance of one instruction depending on another.
  – The instructions can't be effectively (as opposed to just syntactically) fully parallelized or reordered.

Data Dependence

• Recursive definition:
  – Instruction B is data dependent on instruction A iff:
    • B uses a data result produced by instruction A, or
    • There is another instruction C such that B is data dependent on C, and C is data dependent on A.
• Potential for a RAW hazard
• Example:

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4
          SUBI  R1,R1,#8
          BNEZ  R1,Loop

  [Figure: arrows mark the data dependences in this loop example: LD to ADDD through F0, ADDD to SD through F4, and SUBI to BNEZ through R1.]
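To make the RAW chain concrete, here is a minimal C analogue of the loop body above (a sketch for illustration; the names elem and sum are ours, not from the slides):

    /* Each statement reads the result of the previous one (RAW), */
    /* so the three cannot be reordered or fully overlapped.      */
    double elem = x[i];       /* like LD   F0,0(R1)               */
    double sum  = elem + s;   /* like ADDD F4,F0,F2: needs elem   */
    x[i] = sum;               /* like SD   0(R1),F4: needs sum    */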
Name Dependence

• Occurs when two instructions both access the same data storage location because the location is reused (also called storage dependence; at least one of the accesses must be a write).
• Two sub-types (for instruction B after instruction A):
  – Antidependence: A reads, then B writes.
    • Potential for a WAR hazard.
  – Output dependence: A writes, then B writes.
    • Potential for a WAW hazard.
• Note: name dependences can be avoided by changing instructions to use different locations (rather than reusing a location).

WAR, WAW Examples

• WAR hazard: Instr J writes an operand before Instr I reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• WAW hazard: Instr J writes an operand before Instr I writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

Control Dependence

• Occurs when the execution of an instruction depends on a conditional branch instruction.
• Program control flow must be followed for correct execution.
• However, only two things must really be preserved:
  – Data flow (how a given result is produced)
  – Exception behavior (exceptions must be handled in order)
• Example (exception behavior):

          DADDU R2,R3,R4
          BEQZ  R2,L1
          LW    R1,0(R2)   ; must not move before BEQZ,
    L1:   ...              ; due to a possible exception (and R1's value)

Control Dependence: Another Example

• Example (data flow):

          DADDU R2,R3,R4
          BEQZ  R5,L1
          DSUBU R2,R6,R7
    L1:   OR    R8,R2,R9

• OR depends on both DADDU and DSUBU; maintaining the data dependences alone is not enough
• Control flow decides where the correct R2 comes from (DADDU or DSUBU)

Relaxing Control Dependence

• Only two things must really be preserved:
  – Data flow (how a given result is produced)
  – Exception behavior
• Some techniques permit removing control dependence from instruction execution by conditionally ignoring instruction results instead:
  – Speculation (betting on branches, e.g., to fill delay slots)
    • Make instructions unconditional if no harm is done
  – Speculative multiple-execution
    • Take both paths, invalidate the results of one later
  – Conditional / predicated instructions (used in IA-64)
• Note: instructions reordered around a branch must have no harmful side effects

Loop Unrolling

• This code adds a scalar to a vector:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Assume the following latencies for all examples
  – Ignore the delayed branch in these examples

  Instruction producing result   Instruction using result   Latency (cycles)   Stalls between (cycles)
  ----------------------------   ------------------------   ----------------   -----------------------
  FP ALU op                      Another FP ALU op          4                  3
  FP ALU op                      Store double               3                  2
  Load double                    FP ALU op                  1                  1
  Load double                    Store double               1                  0
  Integer op                     Integer op                 1                  0

MIPS Code

• First translate into MIPS code:
  – To simplify, assume 8 is the lowest address

    Loop: L.D    F0,0(R1)    ;F0=vector element
          ADD.D  F4,F0,F2    ;add scalar in F2
          S.D    0(R1),F4    ;store result
          DADDUI R1,R1,#-8   ;decrement pointer 8B (DW)
          BNEZ   R1,Loop     ;branch if R1!=zero

Execution Cycles without Instruction Scheduling

    1  Loop: L.D    F0,0(R1)   ;F0=vector element
    2        stall
    3        ADD.D  F4,F0,F2   ;add scalar in F2
    4        stall
    5        stall
    6        S.D    0(R1),F4   ;store result
    7        DADDUI R1,R1,#-8  ;decrement pointer 8B (DW)
    8        stall             ;assumes result can't be forwarded to the branch
    9        BNEZ   R1,Loop    ;branch R1!=zero

  (Stall cycles needed, from the latency table: FP ALU op before another FP ALU op, 3; FP ALU op before store double, 2; load double before FP ALU op, 1.)
• 9 clock cycles. Can we rewrite the code to minimize stalls?

Apply Instruction Scheduling

    1  Loop: L.D    F0,0(R1)
    2        DADDUI R1,R1,#-8
    3        ADD.D  F4,F0,F2
    4        stall
    5        stall
    6        S.D    8(R1),F4   ;altered offset
    7        BNEZ   R1,Loop

• 7 clock cycles, but just 3 are for actual work (L.D, ADD.D, S.D) and 4 are loop overhead. Can we make it faster?
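Before looking at the unrolled assembly below, here is a source-level sketch of what four-way unrolling does (illustrative C, our sketch, assuming the trip count is a multiple of 4):

    for (i = 1000; i > 0; i -= 4) {   /* one test/decrement per 4 elements */
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;          /* the four statements are independent,  */
        x[i-2] = x[i-2] + s;          /* so their loads, adds, and stores can  */
        x[i-3] = x[i-3] + s;          /* be freely interleaved by the compiler */
    }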
Unroll Loop Four Times

    1   Loop: L.D    F0,0(R1)
    3         ADD.D  F4,F0,F2
    6         S.D    0(R1),F4       ;no DSUBUI/BNEZ
    7         L.D    F6,-8(R1)
    9         ADD.D  F8,F6,F2
    12        S.D    -8(R1),F8      ;drop loop maintenance
    13        L.D    F10,-16(R1)
    15        ADD.D  F12,F10,F2
    18        S.D    -16(R1),F12    ;drop loop maintenance
    19        L.D    F14,-24(R1)
    21        ADD.D  F16,F14,F2
    24        S.D    -24(R1),F16
    25        DADDUI R1,R1,#-32     ;alter to 4*8
    27        BNEZ   R1,LOOP

  (The numbers on the left are issue cycles; stalls fill the gaps.)
• 27 clock cycles, or 6.75 per original iteration (assumes the number of iterations is a multiple of 4)

Unrolled and Rescheduled Loop

    1   Loop: L.D    F0,0(R1)
    2         L.D    F6,-8(R1)
    3         L.D    F10,-16(R1)
    4         L.D    F14,-24(R1)
    5         ADD.D  F4,F0,F2
    6         ADD.D  F8,F6,F2
    7         ADD.D  F12,F10,F2
    8         ADD.D  F16,F14,F2
    9         S.D    0(R1),F4
    10        S.D    -8(R1),F8
    11        S.D    -16(R1),F12
    12        DSUBUI R1,R1,#32
    13        S.D    8(R1),F16   ; 8-32 = -24
    14        BNEZ   R1,LOOP

• 14 clock cycles, or 3.5 per original iteration

Loop Unrolling Decisions

• Requires understanding how one instruction depends on another, and how the instructions can be changed or reordered given the dependences:
  1. Determine that unrolling is useful by finding that the loop iterations are independent (except for loop maintenance code)
  2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
  3. Eliminate the extra test and branch instructions, and adjust the loop termination and iteration code
  4. Determine that the loads and stores in the unrolled loop can be interchanged, by observing that loads and stores from different iterations are independent
     – This transformation requires analyzing the memory addresses and finding that they do not refer to the same address
  5. Schedule the code, preserving any dependences needed to yield the same result as the original code

Three Unrolling Considerations

• Decrease in the amount of overhead amortized with each extra unrolling
  – Amdahl's Law
• Growth in code size
  – For larger loops, the concern is that it increases the instruction cache miss rate
• Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling
  – If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling

• Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction

Static Branch Prediction

• Delayed branch
• To reorder code around branches, we need to predict branches statically at compile time
  – Predicting every branch as taken: average misprediction rate = untaken frequency = 34% for SPEC
• More accurate prediction:
  – Profile-based

Dynamic Branch Prediction

• Why does prediction work?
  – The underlying algorithm has regularities
  – The data being operated on has regularities
  – The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems
• Is dynamic branch prediction better than static branch prediction?
  – It seems to be
  – There are a small number of important branches in programs that have dynamic behavior

Dynamic Branch Prediction

• As the amount of ILP exploited increases (CPI decreases), the impact of control stalls increases.
  – Branches come more often
  – An n-cycle delay postpones more instructions
• Dynamic hardware branch prediction:
  – "Learns" which branches are taken, or not
  – Makes the right guess (most of the time) about whether a branch is taken
  – The delay depends on whether the prediction is correct, and whether the branch is taken.
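To see why this matters as ILP grows, consider a back-of-the-envelope calculation (the numbers here are illustrative, not from the slides): if 20% of instructions are branches, 10% of those are mispredicted, and each misprediction costs 3 cycles, the added CPI is 0.20 x 0.10 x 3 = 0.06. Against an ideal CPI of 1 that is a 6% slowdown, but against an ideal CPI of 1/3 (a 3-way superscalar) the same 0.06 cycles per instruction is an 18% slowdown, so accurate prediction becomes more valuable as issue width grows.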
Branch-Prediction Buffers (BPB)

• Also called a "branch history table"
• The low-order n bits of the branch address are used to index a table of branch history data for prediction.
  – May have "collisions" between distant branches.
  – Associative tables are also possible
• In each entry, k bits of information about the history of that branch are stored.
  – Common values of k: 1, 2, and larger
• The entry is used to predict what the branch will do.
• The actual behavior of the branch then updates the entry.

1-bit Branch Prediction

• The entry for a branch has only two states:
  – Bit = 1
    • "The last time this branch was encountered, it was taken. I predict it will be taken next time."
  – Bit = 0
    • "The last time this branch was encountered, it was not taken. I predict it will not be taken next time."
• Makes 2 mistakes each time a loop is encountered:
  – At the end of the first & last iterations.
• May always mispredict in pathological cases!

2-bit Branch Prediction

• 4 states
  – Based on the most recent two branch history actions

  State   Most recent run of 2 (older, newer)   Predict
  -----   -----------------------------------   ---------
  0       Not taken, Not taken                  Not taken
  1       Not taken, Taken                      Not taken
  2       Taken, Not taken                      Taken
  3       Taken, Taken                          Taken

• Only 1 misprediction per loop execution, after the first time the loop is reached
  – On the last iteration
• What about an n-bit predictor?

State Transition Diagram

  [Figure: state transition diagram for the 2-bit predictor.]

Implementing Branch Histories

• Separate "cache" (prediction table) accessed during the IF stage
• Or extra bits in the instruction cache
• Problem with this approach in MIPS:
  – After fetch, we don't know whether the instruction is really a branch or not (until decoding)
  – We also don't know the target address.
  – In MIPS, by the time you know these things (in ID), you already know whether the branch is really taken!
  – So we haven't saved any time!
  – Branch-target buffers can fix this problem (later)...

Misprediction Rate for 2-bit BPB

  [Figure: misprediction rates of a 2-bit branch-prediction buffer across benchmarks.]

Branch-Prediction Performance

• Contribution to the cycle count depends on:
  – Branch frequency & misprediction frequency
  – Frequencies of taken/not taken, predicted/mispredicted
  – Delays of taken/not taken, predicted/mispredicted
• How to reduce misprediction frequency?
  – Increase the buffer size to avoid collisions.
    • Has little effect beyond ~4,096 entries.
  – Increase prediction accuracy:
    • Increase the number of bits per entry (little effect beyond 2)
    • Use a different prediction scheme (correlated predictors, tournament predictors)

Exhaustive Search for the Optimal 2-bit Predictor

• There are 2^20 possible state machines for 2-bit predictors
• Some machines are uninteresting; pruning them reduces the number of state machines to 5,248
• For each benchmark, determine the prediction accuracy of all the predictor state machines
• Optimal 2-bit predictor accuracy for each application (study by IBM):
  – spice2g6 97.2%
  – doduc 94.3%
  – gcc 89.1%
  – espresso 89.1%
  – li 87.1%
  – eqntott 87.9%

Correlated Prediction: Example

• Code fragment from eqntott:

    if (aa==2) aa=0;       /* if1 */
    if (bb==2) bb=0;       /* if2 */
    if (aa!=bb) { ... };   /* false if (if1 & if2) both taken */

• MIPS code (aa=R1, bb=R2):

          SUBUI R3,R1,#2    ; (aa-2)
          BNEZ  R3,L1       ; branch b1 (aa!=2)
          ADD   R1,R0,R0    ; aa=0
    L1:   SUBUI R3,R2,#2    ; (bb-2)
          BNEZ  R3,L2       ; branch b2 (bb!=2)
          ADD   R2,R0,R0    ; bb=0
    L2:   SUBU  R3,R1,R2    ; (aa-bb)
          BEQZ  R3,L3       ; branch b3 (aa==bb)
          ...

Even Simpler Example

• C code:

    if (d==0) d=1;
    if (d==1) ...

• MIPS code (d=R1):

          BNEZ  R1,L1       ; b1: d!=0
          ADDI  R1,R0,#1    ; d==0, so d=1
    L1:   SUBUI R3,R1,#1    ; (d-1)
          BNEZ  R3,L2       ; b2: d!=1
          ...               ; (and others)

• Any correlation between b1 and b2?

Using a 1-bit Predictor

• Suppose the value of d alternates between 2 and 0, and the code repeats multiple times
• All branches are mispredicted!! (with initial predictions NT, NT)
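Before moving on to correlation, here is a minimal C model of the 2-bit saturating-counter predictor from the earlier slides (our sketch, not code from the course). Run on the strictly alternating outcome pattern of branch b1 above (d = 2, 0, 2, 0, ...), it mispredicts half the time, showing that a plain 2-bit counter is no better than chance when outcomes strictly alternate:

    #include <stdbool.h>
    #include <stdio.h>

    /* 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken */
    static int state = 0;                 /* start at "strongly not taken" */

    static bool predict(void) { return state >= 2; }

    static void update(bool taken) {      /* saturate at 0 and 3 */
        if (taken)  { if (state < 3) state++; }
        else        { if (state > 0) state--; }
    }

    int main(void) {
        /* b1 (d != 0) with d alternating 2,0,2,0,...: outcomes T,NT,T,NT,... */
        bool outcome[] = { true, false, true, false, true, false, true, false };
        int miss = 0;
        for (int i = 0; i < 8; i++) {
            if (predict() != outcome[i]) miss++;
            update(outcome[i]);
        }
        printf("%d/8 mispredicted\n", miss);  /* prints 4/8: no better than chance */
        return 0;
    }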
Correlating Predictors

• Keep different predictions for the current branch depending on whether the previously executed branch was taken or not.
• Notation: X/Y, kept in a separate prediction table:
  – X is the prediction used if the last branch was NOT taken
  – Y is the prediction used if the last branch was TAKEN
  – [In the slide's table, the prediction actually used, based on the last outcome, is shown in bold.]

(m,n) Correlated Predictors

• Use the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch.
• Each of these predictors records n bits of history information for any given branch.
• The previous slide showed a (1,1) predictor.
• Easy to implement:
  – Behavior of the last m branches: an m-bit shift register
  – Branch-prediction buffer: accessed with the low-order bits of the branch address, concatenated with the shift register.

(2,2) Correlated Predictor

  [Figure: a (2,2) predictor; the prediction used is selected by the last 2 branch outcomes.]

Correlated Predictors Are Better

  [Figure: comparison of misprediction rates; correlated predictors outperform simple 2-bit predictors.]

Tournament Predictors

• Three predictors: a global correlated predictor, a local 2-bit predictor, and a tournament (selector) predictor
• The tournament predictor determines which predictor (global or local) is used for each prediction

Performance of the Tournament Predictor

  [Figure: prediction performance of the tournament predictor.]

Branch-Target Buffers (BTB)

• How can we know the address of the next instruction as soon as the current instruction is fetched?
• Normally, an extra (ID) cycle is needed to:
  – Determine that the fetched instruction is a branch
  – Determine whether the branch is taken
  – Compute the target address (PC+offset)
• Branch prediction alone doesn't help; we still need the next PC
• What if, instead, the next instruction's address could be fetched at the same time the current instruction is fetched? => BTB

BTB Schematic

  [Figure: BTB schematic.]

Handling an Instruction with the BTB

  [Figure: flowchart for instruction handling with a BTB, including the "fetch from target" path; this flowchart is based on a 1-bit predictor.]

Branch Penalties, Branch Folding

• If the instruction is not in the BTB and the branch is not taken (case not shown), the penalty is 0.
• Store target instructions instead of their addresses in the BTB:
  – Saves on fetch time.
  – Permits branch folding: zero-cycle branches! Substitute the destination instruction for the branch in the pipeline!

Return Address Predictor

• Predicting register/indirect branches
  – E.g., indirect function calls, switch statements, procedure returns
• Use a CPU-internal return-address stack

Dynamic Branch Prediction Summary

• Prediction is becoming an important part of execution
• Branch history table: 2 bits suffice for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take this insight to the next level by using multiple predictors
  – Usually one based on global information and one based on local information, combined with a selector
  – In 2006, tournament predictors using 30K bits were in processors like the Power5 and Pentium 4
• Branch target buffer: includes branch address & prediction
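As a closing sketch tying the summary together, here is a software model of a (2,2) correlating predictor in the same style as the earlier counter model (our illustration; the table size, PC_BITS, and the simple concatenation-based indexing are arbitrary choices for the sketch, not details from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    #define M        2                      /* global history bits (m = 2)        */
    #define PC_BITS  10                     /* low-order branch-address bits used */
    #define ENTRIES  (1u << (PC_BITS + M))  /* one 2-bit counter per entry (n=2)  */

    static uint8_t  counter[ENTRIES];       /* all start at 0 ("strongly not taken") */
    static uint32_t ghr;                    /* m-bit global history shift register   */

    /* Index = low-order PC bits concatenated with the last M branch outcomes. */
    static uint32_t index_of(uint32_t pc) {
        return ((pc & ((1u << PC_BITS) - 1)) << M) | (ghr & ((1u << M) - 1));
    }

    bool predict(uint32_t pc) {
        return counter[index_of(pc)] >= 2;      /* states 2,3 => predict taken */
    }

    void update(uint32_t pc, bool taken) {
        uint8_t *c = &counter[index_of(pc)];
        if (taken)  { if (*c < 3) (*c)++; }     /* saturating increment */
        else        { if (*c > 0) (*c)--; }     /* saturating decrement */
        ghr = (ghr << 1) | (taken ? 1u : 0u);   /* shift in the outcome */
    }

Because a branch whose outcome depends on earlier branches (as in the eqntott example) reaches different counters under different global histories, this predictor can learn patterns that defeat the plain 2-bit scheme.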