Data Hazards

advertisement
Data Hazards
1
Hazards: Key Points
•
•
Hazards cause imperfect pipelining
•
•
They prevent us from achieving CPI = 1
They are generally causes by “counter flow” data dependences in
the pipeline
Three kinds
• Structural -- contention for hardware resources
• Data -- a data value is not available when/where it is needed.
• Control -- the next instruction to execute is not known.
ways to deal with hazards
• TwoRemoval
hardware and/or complexity to work around the
• hazard so--itadd
does not exist
•
•
•
Bypassing/forwarding
Speculation
Stall -- Sacrifice performance to prevent the hazard from
occurring
Stalling causes “bubbles”
•
2
Data Dependences
data dependence occurs whenever one
• Ainstruction
needs a value produced by another.
•
•
Register values (for now)
Also memory accesses (more on this later)
add $s0, $t0, $t1
sub $t2, $s0, $t3
sw
$t1, 0($t2)
ld
$t3, 0($t2)
ld
$t4, 16($s4)
add $t3, $s0, $t4
and $t3, $t2, $t4
3
Dependences in the pipeline
our simple pipeline, these instructions cause a
• Inhazard
Cycles
add $s0, $t0, $t1
sub $t2, $s0, $t3
Fetch
Deco
de
Fetch
EX
Mem
Deco
de
EX
Write
back
Mem
Write
back
•
4
How can we fix it?
• Ideas?
5
Solution 1: Make the compiler deal with it.
hazards to the big A architecture
• Expose
A result is available N instructions after the instruction
•
•
•
that generates it.
In the meantime, the register file has the old value.
“delay slots”
is N?
• What
it change?
• Can
• What can the compiler do?
Fetch
Deco
de
EX
Mem
Write
back
6
Compiling for delay slots
compiler must fill the delay slots with other
• The
instructions
• What if it can’t? No-ops
•
add $s0, $t0, $t1
Rearrange
instructions add $s0, $t0, $t1
sub $t2, $s0, $t3
and $t7, $t5, $t4
add $t3, $s0, $t4
sub $t2, $s0, $t3
and $t7, $t5, $t4
add $t3, $s0, $t4
7
Solution 2: Stall
you need a value that is not ready, “stall”
• When
Suspend the execution of the executing instruction
•
•
•
and those that follow.
This introduces a pipeline “bubble.” A bubble is a lack of
work to do. It moves through the pipeline like an
instruction.
Cycles
add $s0, $t0, $t1
sub $t2, $s0, $t3
Fetch
Deco
de
Fetch
EX
Mem
Stall
Write
back
Deco
de
EX
Mem
Write
back
8
Stalling the pipeline
all pipeline stages before the stage where
• Freeze
the hazard occurred.
•
•
Disable the PC update
Disable the pipeline registers
•
•
Insert nop control bits at stalled stage (decode in our
example)
How is this solution still potentially “better” than relying
on the compiler?
essentially equivalent to always inserting a
• This
nop when a hazard exists
The compiler can still act like there are delay slots to avoid stalls.
Implementation details are not exposed in the ISA
9
The Impact of Stalling On Performance
= I * CPI * CT
• ET
and CT are constant
• IWhat
is the impact of stalling on CPI?
•
• What do we need to know to figure it out?
10
The Impact of Stalling On Performance
= I * CPI * CT
• ET
and CT are constant
• IWhat
is the impact of stalling on CPI?
•
of instructions that stall: 30%
• Fraction
CPI = 1
• Baseline
• Stall CPI = 1 + 2 = 3
• New CPI = 0.3*3 + 0.7*1 = 1.6
11
Solution 3: Bypassing/Forwarding
values are computed in Ex and Mem but
• Data
“publicized in write back”
• The data exists! We should use it.
Results "published"
to registers
results known
inputs are needed
Fetch
Deco
de
EX
Mem
Write
back
12
Bypassing or Forwarding
• Take the values, where ever they are
Cycles
add $s0, $t0, $t1
sub $t2, $s0, $t3
Fetch
Deco
de
Fetch
EX
Mem
Deco
de
EX
Write
back
Mem
Write
back
•
13
Forwarding Paths
Cycles
add $s0, $t0, $t1
sub $t2, $s0, $t3
sub $t2, $s0, $t3
sub $t2, $s0, $t3
Fetch
Deco
de
Fetch
EX
Mem
Deco
de
EX
Mem
Deco
de
EX
Mem
Deco
de
EX
Fetch
Fetch
Write
back
Write
back
Write
back
Mem
Write
back
14
Forwarding in Hardware
Add
Add
4
Shi<
le< 2
File
Write Addr
Write Data
16
Sign
Extend
Read
Data 2
32
ALU
Address
Write Data
Read
Data
Mem/WB
Read Addr 2
Data
Memory
Read
Data 1
Exec/Mem
Register
Dec/Exec
Read
Address
Read Addr 1
IFetch/Dec
PC
Instruc(on
Memory
Add
Forwarding for Loads
• Load values come from the Mem stage
Cycles
ld
$s0, (0)$t0
sub $t2, $s0, $t3
Fetch
Deco
de
Fetch
EX
Mem
Deco
de
EX
Write
back
Mem
Time travel presents significant
implementation challenges
16
What can we do?
to the compiler
• Punt
Easy enough.
•
•
•
Will work.
Same dangers apply as before.
•
•
If the compiler can’t fix it, the hardware will stall
stall.
• Always
when possible, stall otherwise
• Forward
Here the compiler still has leverage
17
Hardware Cost of Forwarding
our pipeline, adding forwarding required
• Inrelatively
little hardware.
deeper pipelines it gets much more
• For
expensive
ALU * pipeline stages you need to forward over
• Roughly:
modern processor have multiple ALUs (4-5)
• Some
• And deeper pipelines (4-5 stages of to forward across)
paths need to be supported.
• NotIf a allpathforwarding
does not exist, the processor will need to stall.
•
18
Key Points: Control Hazards
occur when we don’t know what the
• Control
next instruction is
caused by branches
• Mostly
for dealing with them
• Strategies
Stall
•
•
Guess!
•
•
•
Leads to speculation
Flushing the pipeline
Strategies for making better guesses
• Understand the difference between stall and flush
19
Control Hazards
•
add $s1, $s3, $s2
Computing the new PC
sub $s6, $s5, $s2
beq $s6, $s7, somewhere
and $s2, $s3, $s1
Fetch
Deco
de
EX
Mem
Write
back
20
Computing the PC
instruction
• Non-branch
PC = PC + 4
•
• When is PC ready?
Fetch
Deco
de
EX
Mem
Write
back
21
Computing the PC
instructions
• Branch
bne $s1, $s2, offset
•
•
if ($s1 != $s2) { PC = PC + offset} else {PC = PC + 4;}
• When is the value ready?
Fetch
Deco
de
EX
Mem
Write
back
22
Option 2: Simple Prediction
a processor tell the future?
• Can
non-taken branches, the new PC is ready
• For
immediately.
just assume the branch is not taken
• Let’s
called “branch prediction” or “control
• Also
speculation”
• What if we are wrong?
23
Predict Not-taken
Cycles
Not-taken
bne $t2, $s0, somewhere
Taken
bne $t2, $s4, else
Fetch
Deco
de
Fetch
add $s0, $t0, $t1
...
else:
sub $t2, $s0, $t3
EX
Mem
Deco
de
EX
Fetch
Deco
de
Write
back
Mem
EX
Write
back
Mem
Write
back
Squash
Fetch
Deco
de
start the add, and then, when we discover
• We
the branch outcome, we squash it.
•
We “flush” the pipeline.
24
Simple “static” Prediction
means before run time
• “static”
prediction schemes are possible
• Many
taken
• Predict
Loops are commons
• Pros?
not-taken
• Predict
Pros?
•
Not all branches are for loops.
Backward Taken/Forward not taken
Best of both worlds.
25
Implementing Backward taken/forward not
taken
in control
• Changes
inputs to the control unit
• •New
The sign of the offset
• The result of the branch
outputs from control
• New
flush signal.
• The
• Inserts “noop” bits in datapath and control
26
The Importance of Pipeline depth
are two important parameters of the
• There
pipeline that determine the impact of branches
on performance
•
•
Branch decode time -- how many cycles does it take to
identify a branch (in our case, this is less than 1)
Branch resolution time -- cycles until the real branch
outcome is known (in our case, this is 2 cycles)
27
Pentium 4 pipeline
1.Branches take 19 cycles to resolve
2.Identifying a branch takes 4 cycles.
3.Stalling is not an option.
4.Not quite as bad now, but BP is still very important.
Dynamic Branch Prediction
pipes demand higher accuracy than static
• Long
schemes can deliver.
of making the the guess once, make it
• Instead
every time we see the branch.
• Predict future behavior based on past behavior
29
Download