Pipeline Hazards
Arvind
Computer Science and Artificial Intelligence Laboratory
M.I.T.
Based on the material prepared by
Arvind and Krste Asanovic
6.823 L6- 2
Arvind
Technology Assumptions
• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!)
These assumptions make the following timing assumption valid:
t_IM ≈ t_RF ≈ t_ALU ≈ t_DM ≈ t_RW
A 5-stage pipelined Harvard architecture will be
the focus of our detailed design
September 28, 2005
5-Stage Pipelined Execution
[Figure: 5-stage pipelined datapath. I-Fetch (IF): PC, 0x4 adder, instruction memory, IR. Decode, Reg. Fetch (ID): GPRs, Imm Ext. Execute (EX): ALU. Memory (MA): data memory. Write-Back (WB).]
time          t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5
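The pipeline diagram above follows one rule: instruction i is fetched in cycle t(i-1) and then advances one stage per cycle. A minimal Python sketch of that rule, assuming no hazards at all; the names `ideal_schedule` and `render` are illustrative and not part of the lecture:

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def ideal_schedule(n_instructions):
    """Map each instruction number to {cycle: stage}, assuming no stalls."""
    schedule = {}
    for i in range(1, n_instructions + 1):
        start = i - 1  # instruction i is fetched in cycle t(i-1)
        schedule[i] = {start + k: stage for k, stage in enumerate(STAGES)}
    return schedule

def render(schedule):
    """Format the schedule as a text pipeline diagram, one row per instruction."""
    last_cycle = max(max(cycles) for cycles in schedule.values())
    header = "      " + "".join(f"t{t:<4}" for t in range(last_cycle + 1))
    rows = [header.rstrip()]
    for i, cycles in schedule.items():
        cells = "".join(
            f"{cycles[t]}{i:<3}" if t in cycles else "     "
            for t in range(last_cycle + 1)
        )
        rows.append((f"I{i:<5}" + cells).rstrip())
    return "\n".join(rows)

print(render(ideal_schedule(5)))
```

With five instructions the last write-back lands in cycle t8, so n instructions need n + 4 cycles when nothing stalls.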
5-Stage Pipelined Execution
Resource Usage Diagram
[Figure: the same 5-stage pipelined datapath as on the previous slide.]
Resources   time
            t0   t1   t2   t3   t4   t5   t6   t7   ....
IF          I1   I2   I3   I4   I5
ID               I1   I2   I3   I4   I5
EX                    I1   I2   I3   I4   I5
MA                         I1   I2   I3   I4   I5
WB                              I1   I2   I3   I4   I5
Pipelined Execution:
ALU Instructions
[Figure: pipelined datapath with pipeline registers IR, A, B, Y, MD1, MD2 and R between the stages.]
Not quite correct!
We need an Instruction Reg (IR) for each stage
IR’s and Control points
[Figure: pipelined datapath with an IR for each stage.]
Are control points connected properly?
- ALU instructions
- Load/Store instructions
- Write back
Pipelined MIPS Datapath
without jumps
[Figure: pipelined MIPS datapath with stages F, D, E, M and W, an IR per stage, and the control signals RegDst, RegWrite, OpSel, MemWrite, ExtSel, BSrc and WBSrc.]
How Instructions can Interact
with each other in a pipeline
• An instruction in the pipeline may need a
resource being used by another instruction
in the pipeline
– structural hazard
• An instruction may produce data that is
needed by a later instruction
– data hazard
• In the extreme case, an instruction may
determine the next instruction to be
executed
– control hazard (branches, interrupts,...)
Data Hazards
[Figure: pipelined datapath executing the sequence below; the second instruction reads r1 before the first has written it back.]
...
r1 ← r0 + 10
r4 ← r1 + 17
...
r1 is stale. Oops!
Resolving Data Hazards
• Freeze earlier pipeline stages until the data
becomes available ⇒ interlocks
• If data is available somewhere in the datapath,
provide a bypass to get it to the right stage
• Speculate about the hazard resolution and kill
the instruction later if the speculation is wrong
Feedback to Resolve Hazards
[Figure: a four-stage pipeline (stage 1 to stage 4) with feedback signals FB1 to FB4.]
• Detect a hazard and provide feedback to previous
stages to stall or kill instructions
• Controlling a pipeline in this manner works provided
the instruction at stage i+1 can complete without
any interference from instructions in stages 1 to i
(otherwise deadlocks may occur)
Interlocks to resolve Data Hazards
[Figure: pipelined datapath with a Stall Condition signal and a nop input to the execute-stage IR, executing the sequence below.]
...
r1 ← r0 + 10
r4 ← r1 + 17
...
Stalled Stages and Pipeline Bubbles
time
                      t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10   IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17        IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                IF4  ID4  EX4  MA4  WB4
(I5)                                                     IF5  ID5  EX5  MA5  WB5
stalled stages: the repeated ID2 and IF3 entries

Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I3   I3   I3   I4   I5
ID          I1   I2   I2   I2   I2   I3   I4   I5
EX               I1   nop  nop  nop  I2   I3   I4   I5
MA                    I1   nop  nop  nop  I2   I3   I4   I5
WB                         I1   nop  nop  nop  I2   I3   I4   I5
nop ⇒ pipeline bubble
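The resource usage diagram for the stalled pipeline can be reproduced with a small cycle-by-cycle simulation. The sketch below is illustrative only: it assumes an interlocked pipeline with no bypassing, describes each instruction as a hypothetical `(dest, sources)` pair, and treats register 0 as hard-wired zero so that it never causes a hazard.

```python
def simulate(program):
    """program: list of (dest_reg, [source_regs]).

    Returns one entry per cycle: the occupants of [IF, ID, EX, MA, WB] as
    1-based instruction numbers, with None for an empty slot or a bubble.
    """
    def writes(k, reg):
        # True if the instruction numbered k (if any) writes register reg.
        return k is not None and reg != 0 and program[k - 1][0] == reg

    pipe = [1 if program else None, None, None, None, None]
    fetched = 1
    trace = []
    while any(slot is not None for slot in pipe):
        trace.append(list(pipe))
        f, d, e, m, w = pipe
        # Stall if the instruction in ID reads a register that an
        # instruction in EX, MA or WB has not yet finished writing.
        stall = d is not None and any(
            writes(k, src) for src in program[d - 1][1] for k in (e, m, w)
        )
        if stall:
            pipe = [f, d, None, e, m]  # hold IF and ID, send a bubble into EX
        else:
            nxt = fetched + 1 if fetched < len(program) else None
            if nxt is not None:
                fetched = nxt
            pipe = [nxt, f, d, e, m]
    return trace

# I1: r1 <- r0 + 10, I2: r4 <- r1 + 17, then three independent instructions.
program = [(1, [0]), (4, [1]), (5, [0]), (6, [0]), (7, [0])]
for t, stages in enumerate(simulate(program)):
    print(f"t{t}", ["nop" if s is None else f"I{s}" for s in stages])
```

In the printout, slots that are still empty at start-up or already drained at the end also appear as nop, because this sketch does not distinguish them from bubbles.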
Interlock Control Logic
[Figure: pipelined datapath with an interlock control block Cstall; its inputs are rs, rt and ws, its output is the stall signal.]
Compare the source registers of the instruction in the decode
stage with the destination register of the uncommitted
instructions.
Interlocks Control Logic
ignoring jumps & branches
[Figure: pipelined datapath in which Cdest blocks derive ws and we for the instructions in the later stages, a Cre block derives re1 and re2 for the instruction in decode, and Cstall combines them with rs and rt into the stall signal.]
Should we always stall if the rs field matches some rd?
• not every instruction writes a register ⇒ we (write enable)
• not every instruction reads a register ⇒ re (read enable)
Source & Destination Registers
R-type:  op | rs | rt | rd | func
I-type:  op | rs | rt | immediate16
J-type:  op | immediate26

                                        source(s)   destination
ALU    rd ← (rs) func (rt)              rs, rt      rd
ALUi   rt ← (rs) op imm                 rs          rt
LW     rt ← M [(rs) + imm]              rs          rt
SW     M [(rs) + imm] ← (rt)            rs, rt
BZ     cond (rs)                        rs
         true:  PC ← (PC) + imm
         false: PC ← (PC) + 4
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                  31
JR     PC ← (rs)                        rs
JALR   r31 ← (PC), PC ← (rs)            rs          31
Deriving the Stall Signal
Cdest
ws = Case opcode
       ALU           ⇒ rd
       ALUi, LW      ⇒ rt
       JAL, JALR     ⇒ R31
we = Case opcode
       ALU, ALUi, LW ⇒ (ws ≠ 0)
       JAL, JALR     ⇒ on
       ...           ⇒ off
Cre
re1 = Case opcode
       ALU, ALUi, LW, SW, BZ, JR, JALR ⇒ on
       J, JAL                          ⇒ off
re2 = Case opcode
       ALU, SW ⇒ on
       ...     ⇒ off
Cstall
stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D
This is not the full story!
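The Cdest, Cre and Cstall equations translate almost line for line into code. The following Python sketch is one possible rendering; the `Instr` record and the function names are assumptions made for illustration, and a pipeline stage holding a bubble is represented by `None`.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record; unused fields default to 0.
Instr = namedtuple("Instr", "opcode rs rt rd", defaults=(0, 0, 0))

def ws(i):
    """Cdest: destination register of instruction i, or None if it has none."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None

def we(i):
    """Cdest: does instruction i write a register? Bubbles never do."""
    if i is None:
        return False
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i):
    """Cre: does instruction i read its rs field?"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    """Cre: does instruction i read its rt field?"""
    return i.opcode in ("ALU", "SW")

def stall(d, e, m, w):
    """Cstall: d is the instruction in decode; e, m, w are the instructions
    (or None) in the execute, memory and write-back stages."""
    def pending(reg):
        return any(we(x) and ws(x) == reg for x in (e, m, w))
    return (re1(d) and pending(d.rs)) or (re2(d) and pending(d.rt))

i1 = Instr("ALUi", rs=0, rt=1)   # r1 <- r0 + 10
i2 = Instr("ALUi", rs=1, rt=4)   # r4 <- r1 + 17
print(stall(i2, i1, None, None))  # I2 in decode while I1 is in execute
```

Note that an ALUi instruction in decode does not stall on its rt field, because rt is its destination there and re2 is off.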
Hazards due to Loads & Stores
[Figure: pipelined datapath with the Stall Condition signal, executing the sequence below.]
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
Is there any possible data hazard
in this instruction sequence?
What if (r1)+7 = (r3)+5 ?
Load & Store Hazards
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
(r1)+7 = (r3)+5 ⇒ data hazard
However, the hazard is avoided because our
memory system completes writes in one cycle!
Load/Store hazards, even when they do exist, are
often resolved in the memory system itself.
More on this later in the course.
Five-minute break to stretch your legs
Complications due to Jumps
[Figure: fetch stage with a PCSrc mux (pc+4 / jabs / rind / br), a stall signal and a "Jump?" test on the decode-stage IR. The PC holds 104, I2 is in decode, I1 is in execute, and the instruction being fetched is marked kill.]
Instruction memory:
I1   096   ADD
I2   100   J 200
I3   104   ADD     (kill)
I4   304   ADD
Note: fetching the next instruction before decode is
speculation ⇒ kill
A jump instruction kills (not stalls)
the following instruction
How?
Pipelining Jumps
[Figure: the same fetch stage with an IRSrcD mux in front of the decode-stage IR that selects between the instruction memory output and nop. The PC changes from 104 to 304 and the decode-stage IR receives nop instead of I3.]
To kill a fetched instruction, insert a mux before IR
Instruction memory:
I1   096   ADD
I2   100   J 200
I3   104   ADD     (kill)
I4   304   ADD
IRSrcD = Case opcodeD
           J, JAL ⇒ nop
           ...    ⇒ IM
Any interaction between stall and jump?
Jump Pipeline Diagrams
time
                    t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD       IF1  ID1  EX1  MA1  WB1
(I2) 100: J 200          IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                 IF3  nop  nop  nop  nop
(I4) 304: ADD                      IF4  ID4  EX4  MA4  WB4

Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I4   I5
ID          I1   I2   nop  I4   I5
EX               I1   I2   nop  I4   I5
MA                    I1   I2   nop  I4   I5
WB                         I1   I2   nop  I4   I5
nop ⇒ pipeline bubble
Pipelining Conditional Branches
[Figure: fetch, decode and execute stages with the PCSrc mux (pc+4 / jabs / rind / br), the IRSrcD mux, and a "BEQZ?" / "zero?" test in the execute stage. The PC holds 104, I2 is in decode and I1 is in execute.]
Instruction memory:
I1   096   ADD
I2   100   BEQZ r1 200
I3   104   ADD
I4   304   ADD
Branch condition is not known until
the execute stage
What action should be taken in the
decode stage?
Pipelining Conditional Branches
[Figure: the same datapath one cycle later. The PC holds 108, I3 is in decode, the branch I2 is in execute and I1 is in the memory stage.]
Instruction memory:
I1   096   ADD
I2   100   BEQZ r1 200
I3   104   ADD
I4   304   ADD
If the branch is taken
- kill the two following instructions
- the instruction at the decode stage
is not valid
⇒ stall signal is not valid
Pipelining Conditional Branches
[Figure: the same datapath and instruction sequence, now with an IRSrcE mux in front of the execute-stage IR in addition to IRSrcD, so that both instructions following a taken branch can be replaced by nop.]
New Stall Signal
stall = (
((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D
+ ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D
) . !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z)
Don’t stall if the branch is taken. Why?
Instruction at the decode stage is invalid
Control Equations for PC and IR Muxes
PCSrc = Case opcodeE
          BEQZ.z, BNEZ.!z ⇒ br
          ...             ⇒ Case opcodeD
                               J, JAL   ⇒ jabs
                               JR, JALR ⇒ rind
                               ...      ⇒ pc+4
IRSrcD = Case opcodeE
          BEQZ.z, BNEZ.!z ⇒ nop
          ...             ⇒ Case opcodeD
                               J, JAL, JR, JALR ⇒ nop
                               ...              ⇒ IM
IRSrcE = Case opcodeE
          BEQZ.z, BNEZ.!z ⇒ nop
          ...             ⇒ stall.nop + !stall.IRD
Give priority to the older instruction, i.e., execute stage
instruction over decode stage instruction
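The nested Case expressions encode a priority order: a taken branch in the execute stage overrides whatever the decode stage holds. A hedged Python sketch of the three mux selects, plus the branch gating from the New Stall Signal slide; the function names and the use of strings for mux settings are illustrative assumptions, not the lecture's notation:

```python
def branch_taken(opcode_e, z):
    """(opcodeE = BEQZ).z + (opcodeE = BNEZ).!z"""
    return (opcode_e == "BEQZ" and z) or (opcode_e == "BNEZ" and not z)

def pc_src(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):      # older instruction has priority
        return "br"
    if opcode_d in ("J", "JAL"):
        return "jabs"
    if opcode_d in ("JR", "JALR"):
        return "rind"
    return "pc+4"

def ir_src_d(opcode_e, z, opcode_d):
    if branch_taken(opcode_e, z):
        return "nop"                   # kill the instruction being fetched
    if opcode_d in ("J", "JAL", "JR", "JALR"):
        return "nop"
    return "IM"

def ir_src_e(opcode_e, z, stall):
    if branch_taken(opcode_e, z):
        return "nop"                   # kill the instruction in decode
    return "nop" if stall else "IRD"

def gated_stall(raw_stall, opcode_e, z):
    """New stall signal: ignore the interlock when a taken branch makes the
    decode-stage instruction invalid."""
    return raw_stall and not branch_taken(opcode_e, z)

# A taken BEQZ in execute wins over a jump sitting in decode.
print(pc_src("BEQZ", True, "J"), ir_src_d("BEQZ", True, "J"))
```

Checking the execute-stage opcode first in every function is what implements the "give priority to the older instruction" rule.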
Branch Pipeline Diagrams
(resolved in execute stage)
time
                      t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD         IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ 200         IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                   IF3  ID3  nop  nop  nop
(I4) 108:                            IF4  nop  nop  nop  nop
(I5) 304: ADD                             IF5  ID5  EX5  MA5  WB5

Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I4   I5
ID          I1   I2   I3   nop  I5
EX               I1   I2   nop  nop  I5
MA                    I1   I2   nop  nop  I5
WB                         I1   I2   nop  nop  I5
nop ⇒ pipeline bubble
Reducing Branch Penalty
(resolve in decode stage)
• One pipeline bubble can be removed if an extra
comparator is used in the Decode stage
[Figure: fetch and decode stages with the PCSrc mux (pc+4 / jabs / rind / br) and a zero detect on the register file output in the decode stage.]
Pipeline diagram now same as for jumps
Branch Delay Slots
(expose control hazard to software)
• Change the ISA semantics so that the instruction that
follows a jump or branch is always executed
– gives compiler the flexibility to put in a useful instruction where
normally a pipeline bubble would have resulted.
I1   096   ADD
I2   100   BEQZ r1 200
I3   104   ADD     ← delay slot instruction, executed regardless of branch outcome
I4   304   ADD
• Other techniques include branch prediction,
which can dramatically reduce the branch
penalty... to come later
Bypassing
time
                      t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                                IF4  ID4  EX4
(I5)                                                     IF5  ID5
stalled stages: the repeated ID2 and IF3 entries
Each stall or kill introduces a bubble in the pipeline
⇒ CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time
                      t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  EX2  MA2  WB2
(I3)                            IF3  ID3  EX3  MA3  WB3
(I4)                                 IF4  ID4  EX4  MA4  WB4
(I5)                                      IF5  ID5  EX5  MA5  WB5
Adding a Bypass
[Figure: pipelined datapath with an ASrc mux that can feed the ALU output back to ALU input register A; r4 ← r1... is in decode while r1 ←... is in execute.]
When does this bypass help?
(I1) r1 ← r0 + 10        yes
(I2) r4 ← r1 + 17

r1 ← M[r0 + 10]          no
r4 ← r1 + 17

JAL 500                  no
r4 ← r31 + 17
The Bypass Signal
Deriving it from the Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D )
ws = Case opcode
       ALU           ⇒ rd
       ALUi, LW      ⇒ rt
       JAL, JALR     ⇒ R31
we = Case opcode
       ALU, ALUi, LW ⇒ (ws ≠ 0)
       JAL, JALR     ⇒ on
       ...           ⇒ off
ASrc = (rsD = wsE).weE.re1D
Is this correct?
No, because only ALU and ALUi instructions can benefit
from this bypass
Split weE into two components: we-bypass, we-stall
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
               ALU, ALUi ⇒ (ws ≠ 0)
               ...       ⇒ off
we-stallE = Case opcodeE
               LW        ⇒ (ws ≠ 0)
               JAL, JALR ⇒ on
               ...       ⇒ off
ASrc  = (rsD = wsE).we-bypassE . re1D
stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW) . re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW) . re2D
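To see how the split into we-bypass and we-stall plays out on the three example sequences from the Adding a Bypass slide, the equations can be written as a short Python sketch. The `Instr` record, the function names and the use of `None` for a bubble are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record; unused fields default to 0.
Instr = namedtuple("Instr", "opcode rs rt rd", defaults=(0, 0, 0))

def ws(i):
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return None

def we(i):
    if i is None:
        return False
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def we_bypass(i):
    # Only ALU and ALUi results are available at the ALU output.
    return i is not None and i.opcode in ("ALU", "ALUi") and ws(i) != 0

def we_stall(i):
    if i is None:
        return False
    if i.opcode == "LW":
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")

def re1(i):
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")

def re2(i):
    return i.opcode in ("ALU", "SW")

def a_src(d, e):
    """Select the bypass path for ALU input A."""
    return we_bypass(e) and re1(d) and d.rs == ws(e)

def stall(d, e, m, w):
    rs_hazard = (we_stall(e) and d.rs == ws(e)) or any(
        we(x) and d.rs == ws(x) for x in (m, w))
    # Only input A has a bypass here, so rt still uses the full weE.
    rt_hazard = any(we(x) and d.rt == ws(x) for x in (e, m, w))
    return (re1(d) and rs_hazard) or (re2(d) and rt_hazard)

use_r1 = Instr("ALUi", rs=1, rt=4)                  # r4 <- r1 + 17
for producer in (Instr("ALUi", rs=0, rt=1),         # r1 <- r0 + 10
                 Instr("LW", rs=0, rt=1)):          # r1 <- M[r0 + 10]
    print(producer.opcode, a_src(use_r1, producer),
          stall(use_r1, producer, None, None))
```

The rt comparison deliberately keeps the unsplit weE, matching the equation on the slide.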
Fully Bypassed Datapath
[Figure: fully bypassed pipelined datapath with stages D, E, M and W, ASrc and BSrc muxes on the ALU inputs, and a path labelled "PC for JAL, ...".]
Is there still a need for the stall signal?
stall = (rsD = wsE).(opcodeE = LW).(wsE ≠ 0).re1D
      + (rtD = wsE).(opcodeE = LW).(wsE ≠ 0).re2D
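With full bypassing the only remaining interlock is the load-use case. A minimal Python sketch of that equation, with hypothetical argument names and opcode strings, and with re1 and re2 taken from the Cre table earlier in the lecture:

```python
READS_RS = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}  # re1 = on
READS_RT = {"ALU", "SW"}                                     # re2 = on

def load_use_stall(opcode_d, rs_d, rt_d, opcode_e, ws_e):
    """Stall only when a load in execute produces a register that the
    instruction in decode actually reads."""
    load_in_execute = opcode_e == "LW" and ws_e != 0
    uses_rs = opcode_d in READS_RS and rs_d == ws_e
    uses_rt = opcode_d in READS_RT and rt_d == ws_e
    return load_in_execute and (uses_rs or uses_rt)

# r1 <- M[r0 + 10] in execute, r4 <- r1 + 17 in decode
print(load_use_stall("ALUi", 1, 4, "LW", 1))
```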
Why an Instruction may not be
dispatched every cycle (CPI>1)
• Full bypassing may be too expensive to
implement
– typically all frequently used paths are provided
– some infrequently used bypass paths may increase
cycle time and counteract the benefit of reducing CPI
• Loads have two cycle latency
– Instruction after load cannot use load result
– MIPS-I ISA defined load delay slots, a software-visible
pipeline hazard (compiler schedules independent
instruction or inserts NOP to avoid hazard). Removed
in MIPS-II.
• Conditional branches may cause bubbles
– kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute significant
number of NOP instructions inserted by the compiler.
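The effect of these bubbles on CPI can be estimated by adding, for each hazard, its frequency times the number of bubbles it costs. The Python sketch below uses the penalties from this lecture (one bubble for a load-use stall in the fully bypassed datapath, two for a taken branch resolved in the execute stage); the frequencies are made-up numbers for illustration only, not measurements.

```python
def effective_cpi(hazards, base_cpi=1.0):
    """hazards: iterable of (fraction of instructions affected, bubbles each)."""
    return base_cpi + sum(freq * bubbles for freq, bubbles in hazards)

# Hypothetical mix: 10% of instructions are load-use stalls (1 bubble),
# 15% are taken branches resolved in execute (2 bubbles).
example = effective_cpi([(0.10, 1), (0.15, 2)])
print(f"CPI = {example:.2f}")
```

Resolving branches in the decode stage would change the second penalty from 2 to 1 in this model.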
Thank you!