stalling

advertisement
Stalling
 The easiest solution is to stall the pipeline
 We could delay the AND instruction by introducing a one-cycle delay into
the pipeline, sometimes called a bubble
lw
$2, 20($3)
and $12, $2, $5
1
2
IM
Reg
IM
Clock cycle
3
4
DM
Reg
5
6
7
DM
Reg
Reg
 Notice that we’re still using forwarding in cycle 5, to get data from the
MEM/WB pipeline register to the ALU
1
Stalling and forwarding
 Without forwarding, we’d have to stall for two cycles to wait for the LW
instruction’s writeback stage
lw
$2, 20($3)
and $12, $2, $5
1
2
IM
Reg
IM
3
Clock cycle
4
5
DM
6
7
8
DM
Reg
Reg
Reg
 In general, you can always stall to avoid hazards—but dependencies are
very common in real code, and stalling often can reduce performance by
a significant amount
2
Load-Use Hazard Detection
• Check when using instruction is decoded in ID stage
• ALU operand register numbers in ID stage are given by
• IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
• ID/EX.MemRead and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or
(ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
How to Stall the Pipeline
• Force control values in ID/EX register
to 0
• EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
• Using instruction is decoded again
• Following instruction is fetched again
• 1-cycle stall allows MEM to read data for lw
• Can subsequently forward to EX stage
Stalling delays the entire pipeline
 If we delay the second instruction, we’ll have to delay the third one too
— This is necessary to make forwarding work between AND and OR
— It also prevents problems such as two instructions trying to write to
the same register in the same cycle
1
lw
$2, 20($3)
and $12, $2, $5
or
$13, $12, $2
IM
2
3
Reg
IM
Clock cycle
4
5
DM
7
8
Reg
Reg
IM
6
DM
Reg
Reg
DM
Reg
5
What about EX, MEM, WB
 But what about the ALU during cycle 4, the data memory in cycle 5, and
the register file write in cycle 6?
lw
$2, 20($3)
and $12, $2, $5
or
$13, $12, $2
1
2
IM
Reg
IM
3
Clock cycle
4
5
DM
Reg
Reg
IM
IM
6
7
DM
Reg
8
Reg
Reg
DM
Reg
 Those units aren’t used in those cycles because of the stall, so we can set
the EX, MEM and WB control signals to all 0s.
6
Detecting Stalls, cont.

DM
mem\wb
Reg
Reg
ex/mem
Reg
mem\wb
IM
DM
id/ex
and $12, $2, $5
Reg
if/id
IM
id/ex
$2, 20($3)
if/id
lw
ex/mem
When should stalls be detected?
EX stage (of the instruction causing the stall)
if/id

Reg
What is the stall condition?
if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt))
then stall
7
Adding hazard detection to the CPU
PC Write
IF/ID Write
ID/EX.MemRead
Hazard
Unit
ID/EX
Rs
Rt
0
0
1
Control
PC
WB
EX/MEM
M
WB
MEM/WB
EX
M
WB
IF/ID
Read
register 1
Addr
ID/EX.RegisterRt
Instr
Read
data 1
Read
register 2
Write
register
Instruction
memory
Write
data
Read
data 2
Registers
0
1
2
ALU
Zero
ALUSrc
0
1
2
Result
0
Address
Data
memory
1
Instr [15 - 0]
RegDst
Extend
Rt
Write Read
data
data
1
0
0
Rd
1
Rs
EX/MEM.RegisterRd
Forwarding
Unit
MEM/WB.RegisterRd
8
Stalls and Performance
 Stalls reduce performance
— But are required to get correct results
 Compiler can arrange code to avoid hazards and stalls
— Requires knowledge of the pipeline structure
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the
next instruction
Ex: c code for A = B + E; C = B + F;
stall
stall
lw
lw
add
sw
lw
add
sw
$t1,
$t2,
$t3,
$t3,
$t4,
$t5,
$t5,
0($t0)
4($t0)
$t1, $t2
12($t0)
8($t0)
$t1, $t4
16($t0)
13 cycles
lw
lw
lw
add
sw
add
sw
$t1,
$t2,
$t4,
$t3,
$t3,
$t5,
$t5,
0($t0)
4($t0)
8($t0)
$t1, $t2
12($t0)
$t1, $t4
16($t0)
11 cycles
Branches in the original pipelined datapath
1
0
PCSrc
Control
IF/ID
4
When are they resolved?
ID/EX
WB
EX/MEM
M
WB
MEM/WB
EX
M
WB
Add
P
C
Add
RegWrite
Read Instruction
address
[31-0]
Instruction
memory
Read
register 1
Read
data 1
Read
register 2
Read
data 2
Write
register
Write
data
Instr [15 - 0]
Instr [20 - 16]
Shift
left 2
ALU
0
MemWrite
Zero
Result
1
Registers
ALUOp
ALUSrc
Sign
extend
Address
Data
memory
Write
data
RegDst
MemToReg
Read
data
1
MemRead
0
0
Instr [15 - 11]
1
11
Branch Hazards
If branch outcome determined in MEM:
Flush these
instructions
(Set control
values to 0)
PC
Reducing Branch Delay
Move hardware to determine outcome to ID stage
— Target address adder
— Register comparator
Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 7
44: and $12, $2, $5
48: or
$13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
...
72: lw
$4, 50($7)
Example: Branch Taken
Example: Branch Taken
Data Hazards for Branches
If a comparison register is a destination of 2nd or 3rd preceding
ALU instruction
add $1, $2, $3
IF
add $4, $5, $6
…
beq $1, $4, target
Can resolve using forwarding
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
Data Hazards for Branches
If a comparison register is a destination of preceding ALU
instruction or 2nd preceding load instruction
Need 1 stall cycle
lw
$1, addr
IF
add $4, $5, $6
beq stalled
beq $1, $4, target
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
ID
EX
MEM
WB
Data Hazards for Branches
If a comparison register is a destination of immediately preceding
load instruction
— Need 2 stall cycles
lw
$1, addr
IF
beq stalled
beq stalled
beq $1, $0, target
ID
EX
IF
ID
MEM
WB
ID
ID
EX
MEM
WB
Branch Prediction
• Longer pipelines can’t readily determine branch
outcome early
• Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
• Only stall if prediction is wrong
• Simplest prediction strategy
• predict branches not taken
• Works well for loops if the loop tests are done at the
start.
• Fetch instruction after branch, with no delay
Dynamic Branch Prediction
 In deeper and superscalar pipelines, branch penalty is more
significant
 Use dynamic prediction
 Branch prediction buffer (aka branch history table)
 Indexed by recent branch instruction addresses
 Stores outcome (taken/not taken)
 To execute a branch
 Check table, expect the same outcome
 Start fetching from fall-through or target
 If wrong, flush pipeline and flip prediction
1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!
outer: …
…
inner: …
…
beq …, …, inner
…
beq …, …, outer
 Mispredict as taken on last iteration of inner loop
 Then mispredict as not taken on first iteration of inner loop
next time around
2-Bit Predictor
Only change prediction on two successive mispredictions
Calculating the Branch Target
 Even with predictor, still need to calculate the target
address
 1-cycle penalty for a taken branch
 Branch target buffer
 Cache of target addresses
 Indexed by PC when instruction fetched
 If hit and instruction is branch predicted taken,
can fetch target immediately
Concluding Remarks




ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput
using parallelism
 More instructions completed per second
 Latency for each instruction not reduced
Hazards: structural, data, control

Main additions in hardware:
 forwarding unit
 hazard detection and stalling
 branch predictor
 branch target table
Download