L9-Pipelining-Control-Hazards - Peer Instruction for Computer

advertisement
Abstraction Question
• General purpose processors have an abstraction layer fixed
at the ISA and have little control over the compilers or
code run on the machine
• Embedded processors tend to have entire control over the
code run, compiler (if any), the ISA, and the hardware – so
breaking the abstraction layer makes sense.
• Bottom line becomes who controls what elements of the
design.
Intel side note
We’ve been doing mips
– how’d intel do this for
a CISC x86 ISA?
Microops from Pentium
Pro on
Read 5.1, 5.2
Branch Hazards
or
“Which way did he go?”
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is
licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License.
Control Dependence
• Just as an instruction will be dependent on other data
instructions to provide its operands (
dependence), it
will also be dependent on other instructions to determine
branchwhether it gets executed or not (
dependence or
control
___________ dependence).
• Control dependences are particularly critical with
_____________ branches.
conditional
add $5, $3, $2
sub $6, $5, $2
beq $6, $7, somewhere
and $9, $6, $1
...
somewhere: or $10, $5, $2
add $12, $11, $9
...
Dealing With Branch Hazards
• Hardware
– stall until you know which direction
– reduce hazard through earlier computation of branch direction
– guess which direction
 assume not taken (easiest)
 more educated guess based on history (requires that you know it is a
branch before it is even decoded!)
• Hardware/Software
– noops, or instructions that get executed either way (delayed
branch).
Stalling the pipeline
Given our current pipeline – let’s assume we stall until we know the
branch outcome. How many cycles will you lose per branch?
Sel cycles
ecti
on
A
0
B
1
C
2
D
3
E
4
Stalling for Branch Hazards
CC1
beq $4, $0, there IM
and $12, $2, $5
or ...
add ...
sw ...
CC2
CC3
Reg
Bubble
CC4
DM
Bubble
Bubble
CC5
CC6
CC7
CC8
Reg
IM
Reg
IM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
DM
Reg
Stalling for Branch Hazards
•
•
Why?
Seems wasteful, particularly when the branch isn’t taken.
Makes all branches cost 4 cycles.
• What if we just assume ____________
branchesAren’t taken
Assume Branch Not Taken
• works pretty well when you’re right
CC1
CC2
beq $4, $0, there IM
Reg
and $12, $2, $5
IM
or ...
add ...
sw ...
CC3
CC4
DM
Reg
IM
CC5
CC7
Reg
DM
Reg
IM
CC6
Reg
DM
Reg
IM
Reg
DM
Reg
CC8
Assume Branch Not Taken
• same performance as stalling when you’re wrong
CC1
CC2
beq $4, $0, there IM
Reg
and $12, $2, $5
IM
CC3
CC4
DM
CC5
CC6
CC7
Reg
Flush
Reg
Wrong Path insts
or ...
add ...
there: sub $12, $4, $2
IM
Reg
Flush
IM
Flush
IM
Reg
CC8
Stalling the pipeline
Let’s improve the pipeline so we move branch resolution to Decode.
How many cycles would we lose then on a taken branch?
Sel cycles
ecti
on
A
0
B
1
C
2
D
3
E
4
Add drawing of
Resolving in decode.
The Pipeline with flushing for taken
branches
• Notice the IF/ID flush line added.
Branch Hazards – Assume Not Taken
• Great if most of your branches aren’t taken.
• What about loops which are taken 95% of the time?
– we would like the option of assuming not taken for some branches,
and taken for others, depending on ???
Branch Hazards – Predicting Taken
CC3
IM
Reg
DM
IM
Reg
ALU
here: lw
CC2
ALU
beq $2, $1, here
CC1
CC4
Required information to predict Taken:
CC5
CC6
CC7
CC8
Reading quiz
Reg
DM
Reg
Selection
Required
knowledge
1.
An instruction is a branch before decode
2.
The target of the branch
A
2,3
3.
The outcome of the branch
B
1,2,3
C
1,2
D
2
E
None of the
above
Branch Target Buffer
• Keeps track of the PCs of recently seen branches and their
•
targets.
Consult during Fetch (in parallel with Instruction Memory
read) to determine:
– Is this a branch?
– If so, what is the target
Branch Hazards – Predict Taken
• Static policy:
– Forward branches (if statements) predict not taken
– Backward branches (loops) predict taken
• Dynamic prediction (coming soon)
• First – Branch Delay Slots
Eliminating the Branch Stall
• There’s no rule that says we have to see the effect of the
•
•
branch immediately. Why not wait an extra instruction
before branching?
The original SPARC and MIPS processors each used a
single branch delay slot to eliminate single-cycle stalls
after branches.
The instruction after a conditional branch is always
executed in those machines, regardless of whether the
branch is taken or not!
Branch Delay Slot
CC1
CC2
beq $4, $0, there IM
Reg
and $12, $2, $5
IM
there: or ...
add ...
sw ...
CC3
CC4
DM
Reg
IM
CC5
CC7
CC8
Reg
DM
Reg
IM
CC6
Reg
DM
Reg
IM
Reg
DM
Reg
Branch delay slot instruction (next instruction after a branch) is executed even
if the branch is taken.
Filling the branch delay slot
add $5, $3, $7 No-R7 WAR
add $9, $1, $3 Safe, $1 and $3 are fine
sub $6, $1, $4 No-R6
No-R7
and $7, $8, $2
beq $6, $7, there
nop /* branch delay slot */
6 add $9, $1, $2 Not safe ($9 on nt path)
7 sub $2, $9, $5 Not safe (needs $9 not yet
produced)
...
there:
8 mult $2, $10, $11 Not safe ($2 is used before
overwritten)
…
1
2
3
4
5
E is the correct answer
* It is not safe to
assume anything
about the … code
Selection
Safe
instructions
A
1,2
B
2,6
C
6,8
D
1,2,7,8
E
None of the
above
Filling the branch delay slot
• The branch delay slot is only useful if you can find
•
something to put there.
If you can’t find anything, you must put a noop to insure
correctness.
Branch Delay Slots
• This works great for this implementation of the
•
architecture, but becomes a permanent part of the ISA.
What about the MIPS R10000, which has a 5-cycle branch
penalty, and executes 4 instructions per cycle???
• Bottom line: Exposed a detail of the hardware
implementation to the ISA.
Dynamic Branch Prediction
• Can we guess the outcome of branches?
• What should we base that guess on?
Branch Prediction
program counter
for (i=0;i<10;i++) {
...
...
}
1
0
1
...
...
add $i, $i, #1
beq $i, #10, loop
Accuracy?
Too easily swayed
Two-bit predictors give better loop
prediction
00
Weakly Taken
10
Weakly Not Taken
01
Strongly Not Taken
00
for (i=0;i<10;i++) {
...
...
}
Increment when taken
PHT
Decrement when not taken
branch address
Strongly Taken
11
...
...
add $i, $i, #1
beq $i, #10, loop
Better (less sway)
Slower learning
2-bit
A. T T T T N
B. T T N T N
B. T T N T N
C. N T N T N
C. N T N T N
Strongly Taken
11
Weakly Taken
10
Weakly Not Taken
01
Strongly Not Taken
00
Increment when taken
1-bit
A. T T T T N
Decrement when not taken
Suppose we have the following branch
patterns for 3 branches (A, B, C).
What is the accuracy of a 1-bit and 2bit Branch History Table. Assume
initial values of 1 (1-bit) and (10) 2-bit.
Modern Branch Prediction - Pentium 4
• Performance dependent on accurate Branch Prediction
• 20 Stage Pipeline – 3-way issue
– 60 instructions in flight (12 branches)
– 17th stage is branch resolution
– ~17*3=51 instructions lost on mispredict
1 2
3 4
5 6
7 8
9 10 11 12 13 14 15 16 17 18 19 20
X
Branch Prediction
• Latest branch predictors are significantly more
sophisticated, using more advanced correlating techniques,
larger structures, and even AI techniques
• Use patterns of branches (local history) and recent other
branch history (global history) to make predictions
Putting it all together.
For a given program on our 5-stage MIPS pipeline processor:
20% of insts are loads, 50% of instructions following a load
are arithmetic instructions depending on the load
20% of instructions are branches. Using dynamic branch
prediction, we achieve 80% prediction accuracy.
What is the CPI of your program?
Selection
CPI
A
0.76
B
0.9
C
1.0
D
1.1
E
1.14
Control Hazards -- Key Points
• Control (or branch) hazards arise because we must fetch
•
•
the next instruction before we know if we are branching or
where we are branching.
Control hazards are detected in hardware.
We can reduce the impact of control hazards through:
– early detection of branch address and condition
– branch prediction
– branch delay slots
Given our 5-stage MIPS pipeline – what is the steady state CPI
for the following code? Assume the branch is taken thousands
of times. Recall – a processor is in steady state when all stages
are active.
Steady-State CPI = (#insts+#stalls+#flushed_insts)
#insts
Loop: lw r1, 0 (r2)
add r2, r3, r4
sub r5, r1, r2
beq r5, $zero, Loop
Selection CPI
A
1
B
1.25
C
1.5
D
1.75
E
None of the above
Hardware engineers determine these to be the execution
IF = 200ps times per stage of the MIPS 5-stage pipeline processor.
Consider splitting IF and M into 2 stages each. (So IF1 IF2
ID = 100ps and M1 M2.) The most important code run by the
EX = 100ps company is (assume branch is taken most of the time):
M = 200ps Loop: lw r1, 0 (r2)
Isomorphic
add r2, r3, r4
WB = 100ps
sub r5, r1, r2
beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the
original 5-stage MIPS pipeline.. Assume the pipeline has forwarding where
available, predicts branch not taken, and resolves branches in ID.
Selection
CPI
CT
A
Increase
Increase
B
Increase
Decrease
C
Decrease
Increase
D
Decrease
Decrease
E
Increase
No Change
Hardware engineers determine these to be the execution
IF = 200ps times per stage of the MIPS 5-stage pipeline processor.
Consider splitting IF and M into 2 stages each. (So IF1 IF2
ID = 100ps and M1 M2.) The most important code run by the
EX = 200ps company is (assume branch is taken most of the time):
M = 200ps Loop: lw r1, 0 (r2)
add r2, r3, r4
WB = 100ps
sub r5, r1, r2
beq r5, $zero, Loop
What would be the impact of the new 7-stage pipeline compared to the
original 5-stage MIPS pipeline.. Assume the pipeline has forwarding where
available, predicts branch not taken, and resolves branches in ID.
Selection
CPI
CT
A
Increase
Increase
B
Increase
Decrease
C
Decrease
Increase
D
Decrease
Decrease
E
Increase
No Change
7-stage Pipeline
Loop: lw r1, 0 (r2)
add r2, r3, r4
sub r5, r1, r2
beq r5, $zero, Loop
For diagramming the
code in the last slide.
Drive home – more
stages = more hazards
Pipelining -- Key Points
• Pipelining focuses on improving instruction throughput,
•
•
not individual instruction latency.
Data hazards can be handled by hardware or software – but
most modern processors have hardware support for stalling
and forwarding.
Control hazards can be handled by hardware or software –
but most modern processors use Branch Target Buffers and
advanced dynamic branch prediction to reduce the hazard.
• ET = IC*CPI*CT
Download