Data Hazards

Hazards: Key Points
• Hazards cause imperfect pipelining
  • They prevent us from achieving CPI = 1
  • They are generally caused by “counter flow” data dependences in the pipeline
• Three kinds
  • Structural -- contention for hardware resources
  • Data -- a data value is not available when/where it is needed.
  • Control -- the next instruction to execute is not known.
• Two ways to deal with hazards
  • Removal -- add hardware and/or complexity to work around the hazard so it does not exist
    • Bypassing/forwarding
    • Speculation
  • Stall -- Sacrifice performance to prevent the hazard from occurring
    • Stalling causes “bubbles”

Data Dependences
• A data dependence occurs whenever one instruction needs a value produced by another.
  • Register values (for now)
  • Also memory accesses (more on this later)

    add $s0, $t0, $t1
    sub $t2, $s0, $t3
    sw  $t1, 0($t2)
    ld  $t3, 0($t2)
    ld  $t4, 16($s4)
    add $t3, $s0, $t4
    and $t3, $t2, $t4

Dependences in the pipeline
• In our simple pipeline, these instructions cause a hazard

    Cycles -->
    add $s0, $t0, $t1    Fetch       Decode      EX          Mem         Write back
    sub $t2, $s0, $t3                Fetch       Decode      EX          Mem         Write back

How can we fix it?
• Ideas?

Solution 1: Make the compiler deal with it.
• Expose hazards to the big A architecture
  • A result is available N instructions after the instruction that generates it.
  • In the meantime, the register file has the old value.
  • “delay slots”
• What is N?
• Can it change?
• What can the compiler do?

[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back]

Compiling for delay slots
• The compiler must fill the delay slots with other instructions
• What if it can’t? No-ops
• Rearrange instructions

    Before                 After
    add $s0, $t0, $t1      add $s0, $t0, $t1
    sub $t2, $s0, $t3      and $t7, $t5, $t4
    add $t3, $s0, $t4      sub $t2, $s0, $t3
    and $t7, $t5, $t4      add $t3, $s0, $t4

Solution 2: Stall
• When you need a value that is not ready, “stall”
  • Suspend the execution of the executing instruction and those that follow.
  • This introduces a pipeline “bubble.”
  • A bubble is a lack of work to do.
  • It moves through the pipeline like an instruction.

[Figure: pipeline diagram over cycles. add $s0, $t0, $t1 goes through Fetch, Decode, EX, Mem, Write back; sub $t2, $s0, $t3 is fetched one cycle later, then shows Stall before Decode, EX, Mem, Write back.]

Stalling the pipeline
• Freeze all pipeline stages before the stage where the hazard occurred.
  • Disable the PC update
  • Disable the pipeline registers
• This is essentially equivalent to always inserting a nop when a hazard exists
  • Insert nop control bits at stalled stage (decode in our example)
  • How is this solution still potentially “better” than relying on the compiler?
    • The compiler can still act like there are delay slots to avoid stalls.
    • Implementation details are not exposed in the ISA

The Impact of Stalling On Performance
• ET = I * CPI * CT
• I and CT are constant
• What is the impact of stalling on CPI?
  • What do we need to know to figure it out?
• Fraction of instructions that stall: 30%
• Baseline CPI = 1
• Stall CPI = 1 + 2 = 3 (1 base cycle plus 2 stall cycles)
• New CPI = 0.3*3 + 0.7*1 = 1.6

Solution 3: Bypassing/Forwarding
• Data values are computed in Ex and Mem but “publicized in write back”
• The data exists! We should use it.
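
To make "use the data where it exists" concrete, here is a minimal sketch of the choice a forwarding unit might make for one source operand of the instruction in EX. It is an illustrative model, not the hardware from this deck: the `Producer` type and the `operand_source` function are hypothetical names, the pipeline register labels Exec/Mem and Mem/WB follow the deck's datapath figure, and the load case assumes a loaded value only exists once the Mem stage has finished.

```python
from typing import NamedTuple, Optional


class Producer(NamedTuple):
    """An older instruction still in flight (hypothetical model, not from the slides)."""
    dest: Optional[str]  # register it will write, or None if it writes nothing
    is_load: bool        # True for ld: its value only exists after the Mem stage


def operand_source(src: str,
                   exec_mem: Optional[Producer],
                   mem_wb: Optional[Producer]) -> str:
    """Decide where the instruction in EX should get register `src` from.

    exec_mem is the instruction one ahead (now in Mem); mem_wb is the
    instruction two ahead (now in Write back).
    """
    # $zero is hardwired in MIPS, so it never needs forwarding.
    if src == "$zero":
        return "RegFile"
    # Check the newest producer first: it holds the most recent value.
    if exec_mem is not None and exec_mem.dest == src:
        if exec_mem.is_load:
            return "stall"      # assumed: load data is not ready until Mem completes
        return "Exec/Mem"
    if mem_wb is not None and mem_wb.dest == src:
        return "Mem/WB"
    return "RegFile"            # no in-flight producer: the register file is current


# add $s0, $t0, $t1 followed immediately by sub $t2, $s0, $t3
add = Producer(dest="$s0", is_load=False)
print(operand_source("$s0", exec_mem=add, mem_wb=None))  # Exec/Mem
print(operand_source("$t3", exec_mem=add, mem_wb=None))  # RegFile
```

Checking Exec/Mem before Mem/WB matters when two in-flight instructions write the same register: the consumer must see the newer value.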
[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back, annotated "inputs are needed", "results known", and "Results "published" to registers"]

Bypassing or Forwarding
• Take the values, wherever they are

    Cycles -->
    add $s0, $t0, $t1    Fetch       Decode      EX          Mem         Write back
    sub $t2, $s0, $t3                Fetch       Decode      EX          Mem         Write back

Forwarding Paths

    Cycles -->
    add $s0, $t0, $t1    Fetch       Decode      EX          Mem         Write back
    sub $t2, $s0, $t3                Fetch       Decode      EX          Mem         Write back
    sub $t2, $s0, $t3                            Fetch       Decode      EX          Mem         Write back
    sub $t2, $s0, $t3                                        Fetch       Decode      EX          Mem         Write back

Forwarding in Hardware

[Figure: pipelined datapath. Components: PC, Instruction Memory, Register File, Sign Extend, Shift left 2, ALU, Data Memory, adders; pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB.]

Forwarding for Loads
• Load values come from the Mem stage

    Cycles -->
    ld  $s0, 0($t0)      Fetch       Decode      EX          Mem         Write back
    sub $t2, $s0, $t3                Fetch       Decode      EX          Mem         Write back

• The loaded value is not available until the end of Mem, but sub needs it at the start of EX in that same cycle.
• Time travel presents significant implementation challenges

What can we do?
• Punt to the compiler
  • Easy enough.
  • Will work.
  • Same dangers apply as before.
• Always stall.
• Forward when possible, stall otherwise
  • Here the compiler still has leverage
  • If the compiler can’t fix it, the hardware will stall

Hardware Cost of Forwarding
• In our pipeline, adding forwarding required relatively little hardware.
• For deeper pipelines it gets much more expensive
  • Roughly: (number of ALUs) * (pipeline stages you need to forward over)
  • Some modern processors have multiple ALUs (4-5)
  • And deeper pipelines (4-5 stages to forward across)
• Not all forwarding paths need to be supported.
  • If a path does not exist, the processor will need to stall.

Key Points: Control Hazards
• Control hazards occur when we don’t know what the next instruction is
• Mostly caused by branches
• Strategies for dealing with them
  • Stall
  • Guess!
    • Leads to speculation
    • Flushing the pipeline
    • Strategies for making better guesses
• Understand the difference between stall and flush

Control Hazards
• Computing the new PC

    add $s1, $s3, $s2
    sub $s6, $s5, $s2
    beq $s6, $s7, somewhere
    and $s2, $s3, $s1

[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back]

Computing the PC
• Non-branch instruction
  • PC = PC + 4
• When is PC ready?

[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back]

Computing the PC
• Branch instructions
  • bne $s1, $s2, offset
  • if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; }
• When is the value ready?

[Figure: pipeline stages Fetch, Decode, EX, Mem, Write back]

Option 2: Simple Prediction
• Can a processor tell the future?
• For non-taken branches, the new PC is ready immediately.
• Let’s just assume the branch is not taken
• Also called “branch prediction” or “control speculation”
• What if we are wrong?

Predict Not-taken

    Not-taken:  bne $t2, $s0, somewhere
    Taken:      bne $t2, $s4, else
                add $s0, $t0, $t1
                ...
    else:       sub $t2, $s0, $t3

[Figure: pipeline diagram over cycles for the instructions above. The add fetched after the taken bne is marked "Squash", and fetch then restarts at else: with the sub.]

• We start the add, and then, when we discover the branch outcome, we squash it.
• We “flush” the pipeline.

Simple “static” Prediction
• “static” means before run time
• Many prediction schemes are possible
• Predict taken
  • Pros? Loops are common
• Predict not-taken
  • Pros? Not all branches are for loops.
• Backward Taken/Forward not taken
  • Best of both worlds.

Implementing Backward taken/forward not taken
• Changes in control
  • New inputs to the control unit
    • The sign of the offset
    • The result of the branch
  • New outputs from control
    • The flush signal.
    • Inserts “noop” bits in datapath and control

The Importance of Pipeline depth
• There are two important parameters of the pipeline that determine the impact of branches on performance
  • Branch decode time -- how many cycles does it take to identify a branch (in our case, this is less than 1)
  • Branch resolution time -- cycles until the real branch outcome is known (in our case, this is 2 cycles)

Pentium 4 pipeline
1. Branches take 19 cycles to resolve
2. Identifying a branch takes 4 cycles.
3. Stalling is not an option.
4. Not quite as bad now, but branch prediction (BP) is still very important.

Dynamic Branch Prediction
• Long pipes demand higher accuracy than static schemes can deliver.
• Instead of making the guess once, make it every time we see the branch.
• Predict future behavior based on past behavior
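
One way to picture "predict future behavior based on past behavior" is a predictor that guesses each branch will do whatever it did last time. The sketch below is only an illustration under stated assumptions: the slides do not name a specific scheme, so `LastOutcomePredictor`, `count_mispredictions`, and the branch trace are all made up for this example, and real predictors are more elaborate.

```python
from typing import Dict, Iterable, Tuple


class LastOutcomePredictor:
    """Guess that each branch repeats its most recent outcome (illustrative only)."""

    def __init__(self, default_taken: bool = False) -> None:
        self.last: Dict[int, bool] = {}     # branch PC -> last observed outcome
        self.default_taken = default_taken  # guess for a branch never seen before

    def predict(self, pc: int) -> bool:
        return self.last.get(pc, self.default_taken)

    def update(self, pc: int, taken: bool) -> None:
        self.last[pc] = taken


def count_mispredictions(predictor: LastOutcomePredictor,
                         trace: Iterable[Tuple[int, bool]]) -> int:
    """Run the predictor over (pc, taken) pairs and count wrong guesses."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1                 # each miss would cost a pipeline flush
        predictor.update(pc, taken)     # learn the real outcome once it resolves
    return misses


# Made-up trace: a loop-closing branch at PC 0x40, taken 9 times and then
# not taken once, with the whole loop executed 3 times.
loop_trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 3

dynamic_misses = count_mispredictions(LastOutcomePredictor(), loop_trace)
static_not_taken_misses = sum(taken for _, taken in loop_trace)
print(dynamic_misses, static_not_taken_misses)
```

On this trace the last-outcome guess is wrong 6 times out of 30, while a static predict-not-taken scheme is wrong on all 27 taken branches. The guess is remade every time the branch is seen, which is the point of the slide.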