ENGS 116 Lecture 5
Pipelining and Hazards
Vincent H. Berk
October 3, 2008
Reading for today: Chapter A.1 – A.2, article: Patterson & Ditzel
Reading for Monday: A.3 – A.4, article: Yeager
Reading for Wednesday: A.5 – A.6, article: Smith & Pleszkun
Review: Pipelined DLX Datapath
Figure A.17, Page A-29
Hazards
Hazards are situations that hamper execution flow
• Structural Hazards:
– Resource conflict: the hardware cannot support all possible
combinations of instructions executing simultaneously.
• Data Hazards:
– Source operands are not yet available: an instruction depends on
the results of previous instructions still in the pipeline.
• Control Hazards:
– Branches and other instructions that change the program counter.
Structural Hazards
One Memory Port/Structural Hazards
[Figure (from the second edition): pipeline diagram, time in clock cycles CC 1 to CC 8, instructions in program order: Load, Instr 1, Instr 2, stall, Instr 3. With one memory port, the stall appears as a row of bubbles, one per pipeline stage.]
Structural Hazard: Single Memory
Instruction   Clock cycle number
              1     2     3     4     5     6     7     8     9     10
Load          IF    ID    EX    MEM   WB
Instr. 1            IF    ID    EX    MEM   WB
Instr. 2                  IF    ID    EX    MEM   WB
Instr. 3                        Stall IF    ID    EX    MEM   WB
Instr. 4                                    IF    ID    EX    MEM   WB
Instr. 5                                          IF    ID    EX    MEM
Instr. 6                                                IF    ID    EX
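The single-memory-port timing can be reproduced with a small scheduling sketch. It assumes the five stages IF, ID, EX, MEM, WB, that a data access in MEM always wins the one memory port, and that a fetch which would collide simply waits one cycle. The function names `fetch_cycles` and `timing_rows` are illustrative and do not come from the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def fetch_cycles(uses_data_memory):
    """Return the 1-based IF cycle of each instruction.

    uses_data_memory[i] is True when instruction i is a load or store,
    i.e. it needs the single memory port during its MEM stage.
    """
    busy = set()      # cycles in which the port serves a MEM stage
    cycles = []
    cycle = 1
    for is_mem in uses_data_memory:
        while cycle in busy:          # fetch would collide: stall one cycle
            cycle += 1
        cycles.append(cycle)
        if is_mem:
            busy.add(cycle + STAGES.index("MEM"))  # MEM is 3 cycles after IF
        cycle += 1
    return cycles

def timing_rows(names, starts):
    """Format one text row per instruction, six characters per clock cycle."""
    rows = []
    previous = 0
    for name, start in zip(names, starts):
        cells = ["stall"] * (start - previous - 1) + STAGES
        indent = " " * (6 * previous)
        rows.append(f"{name:<10}" + indent + "".join(f"{c:<6}" for c in cells))
        previous = start
    return rows

# One load followed by six instructions that do not touch data memory.
names = ["Load"] + [f"Instr. {k}" for k in range(1, 7)]
starts = fetch_cycles([True] + [False] * 6)
rows = timing_rows(names, starts)
for row in rows:
    print(row)
```

Under these assumptions only the instruction that would be fetched in the load's MEM cycle is delayed, so the load costs one extra cycle in total.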
Speed Up Equation for Pipelining
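\[
\text{Speedup from pipelining}
= \frac{\text{Avg. instr. time unpipelined}}{\text{Avg. instr. time pipelined}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]
\[
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]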
Speed Up Equation for Pipelining
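\[
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]
With Ideal CPI = 1:
\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
\]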
Example: Dual-port vs. Single-port
• Machine A: Dual-ported memory
• Machine B: Single-ported memory, but its pipelined implementation
has a clock rate that is 1.2 times faster
• Ideal CPI = 1 for both
• Loads and stores are 40% of instructions executed
\[
\text{SpeedUp}_A = \frac{\text{Pipeline depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline depth}
\]
\[
\text{SpeedUp}_B = \frac{\text{Pipeline depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.2}
= \frac{\text{Pipeline depth}}{1.4} \times 1.2
\approx 0.86 \times \text{Pipeline depth}
\]
\[
\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline depth}}{0.86 \times \text{Pipeline depth}} \approx 1.17
\]
• Machine A is 1.17 times faster
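The comparison can be checked numerically. This is a minimal sketch of the speedup formula from the preceding slides; the function name `pipeline_speedup`, its parameter names, and the pipeline depth of 5 are assumptions made for illustration (the depth cancels in the final ratio).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI) * clock ratio.

    clock_ratio is the unpipelined clock cycle divided by the pipelined
    clock cycle, normalized to 1.0 for the baseline machine.
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio

depth = 5  # assumed value; it cancels when the two machines are compared

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% of instructions are loads/stores and each costs one stall
# cycle, but the clock is 1.2 times faster.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.2)

ratio = speedup_a / speedup_b
print(round(speedup_b / depth, 2), round(ratio, 2))  # 0.86 1.17
```

The faster clock of Machine B does not make up for stalling on 40% of its instructions, which is the point of the example.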
Data Hazards
sub  R2, R1, R3      ; R2 written by sub
and  R12, R2, R5     ; first operand (R2) depends on sub
or   R13, R6, R2     ; second operand (R2) depends on sub
add  R14, R2, R2     ; both operands depend on sub
sw   100(R2), R15    ; index (R2) depends on sub
Notice that the value written into R2 by the subtract instruction is needed
in all of the following instructions.
Classification of Data Hazards
Consider instructions i and j, where i occurs before j.
• RAW (read after write): j tries to read a source before i writes it,
so j incorrectly gets the old value
• WAW (write after write): j tries to write an operand before it is
written by i, so the writes finish in the wrong order and the value
from i is left in the destination (only possible in pipelines that
write in more than one pipe stage or allow an instruction to proceed
even when a previous instruction is stalled)
• WAR (write after read): j tries to write a destination before it is read
by i, so i incorrectly gets the new value (only possible when some
instructions can write results early in the pipeline and other instructions
can read sources late in the pipeline)
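The three classes can be illustrated by comparing register names of an earlier instruction i and a later instruction j. This is a sketch only: the `Instr` tuple and the `dependences` function are hypothetical names, and the function reports potential hazards by register name. Whether one actually occurs depends on the pipeline, as the parenthetical conditions above state.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read

def dependences(i: Instr, j: Instr) -> set:
    """Classify potential data hazards, where i occurs before j."""
    kinds = set()
    if i.dest is not None and i.dest in j.srcs:
        kinds.add("RAW")     # j reads what i writes
    if i.dest is not None and i.dest == j.dest:
        kinds.add("WAW")     # both write the same register
    if j.dest is not None and j.dest in i.srcs:
        kinds.add("WAR")     # j writes what i reads
    return kinds

sub = Instr("sub", "R2", ("R1", "R3"))
and_ = Instr("and", "R12", ("R2", "R5"))
print(dependences(sub, and_))   # {'RAW'}
```

In the five-stage pipeline of this lecture every instruction reads in ID and writes in WB, so only the RAW case leads to a real hazard.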
Software Solution
Compiler recognizes data hazard and adds nops to eliminate it
sub  R2, R1, R3      ; register R2 written by sub
nop                  ; no operation
nop
nop
and  R12, R2, R5     ; now, result from sub available
or   R13, R6, R2
add  R14, R2, R2
sw   100(R2), R15
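A compiler pass of this kind can be sketched in a few lines. The sketch assumes that a consumer must sit at least four instruction slots after its producer, which matches the three nops shown on the slide. A pipeline that writes the register file in the first half of a cycle and reads in the second half would need a smaller distance. The function name `insert_nops` and the tuple layout are illustrative.

```python
def insert_nops(program, min_distance=4):
    """Pad a straight-line program with nops to remove RAW hazards.

    program: list of (text, dest, srcs); dest is None when nothing is written.
    min_distance: assumed minimum slot distance from producer to consumer.
    """
    out = []          # emitted instruction texts
    last_write = {}   # register -> slot index of its most recent writer
    for text, dest, srcs in program:
        # Earliest slot at which every source register is safe to read.
        earliest = max(
            (last_write[r] + min_distance for r in srcs if r in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append("nop")
        if dest is not None:
            last_write[dest] = len(out)
        out.append(text)
    return out

program = [
    ("sub R2, R1, R3", "R2", ("R1", "R3")),
    ("and R12, R2, R5", "R12", ("R2", "R5")),
    ("or R13, R6, R2", "R13", ("R6", "R2")),
    ("add R14, R2, R2", "R14", ("R2", "R2")),
    ("sw 100(R2), R15", None, ("R2", "R15")),
]
padded = insert_nops(program)
print("\n".join(padded))
```

For the sequence above, the sketch places the three nops directly after `sub` and leaves the remaining instructions untouched.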
Data Hazard Control: Stalls
• A hazard occurs when an instruction reads (in the ID stage) a register that
will be written by an earlier instruction (in the WB stage)
• Idea: Detect the hazard and stall instructions in the pipeline until the
hazard is resolved
• Detect the hazard by comparing the read fields in the IF/ID pipeline register
with the write fields in later pipeline registers (ID/EX, EX/MEM, MEM/WB)
• To add a bubble to the pipeline:
– Preserve the PC register and the IF/ID pipeline register
– Change the EX, MEM, and WB control fields of the ID/EX pipeline
register to do nothing
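The detection and bubble-insertion steps can be sketched as one decision made in the ID stage. The representation is an assumption: control fields are modeled as a small dictionary, and `id_stage_control` is a hypothetical name, not a signal from the DLX datapath.

```python
BUBBLE = {"EX": 0, "MEM": 0, "WB": 0}   # control fields that do nothing

def id_stage_control(reads, later_dests, decoded_control):
    """Decide whether the instruction in IF/ID must stall.

    reads: registers read by the instruction held in IF/ID.
    later_dests: destination registers recorded in ID/EX, EX/MEM and MEM/WB
                 (None where that instruction writes no register).
    Returns (hold, control): hold=True means the PC and IF/ID keep their
    values this cycle, and control is what gets written into ID/EX.
    """
    stall = any(d is not None and d in reads for d in later_dests)
    if stall:
        return True, dict(BUBBLE)        # inject a bubble into ID/EX
    return False, decoded_control        # no hazard: proceed normally

# "and R12, R2, R5" in ID while "sub R2, R1, R3" is still in EX.
hold, control = id_stage_control(
    reads={"R2", "R5"},
    later_dests=["R2", None, None],
    decoded_control={"EX": 1, "MEM": 0, "WB": 1},
)
print(hold, control)
```

The stalled instruction is re-examined every cycle and moves on once none of the later pipeline registers names one of its source registers.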
Data Hazard Reduction: Forwarding
• The needed result is available before it is written into the register file
in the WB stage
• Idea: Use temporary results instead of waiting for registers to be
written
• Cannot remove the stall when a load is immediately followed by an
instruction that reads the loaded register
• All pipelined machines today use some form of forwarding
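The choice a forwarding unit makes for one ALU operand can be sketched as follows. This assumes the usual five-stage arrangement in which the newest matching result wins; the function name `forward_source` and the `(dest, is_load)` pairs standing in for pipeline-register contents are illustrative.

```python
def forward_source(src, ex_mem, mem_wb):
    """Pick where an instruction in EX gets the value of register src.

    ex_mem, mem_wb: (dest, is_load) for the instruction whose result sits in
    that pipeline register, or None if it writes no register.
    """
    if ex_mem is not None and ex_mem[0] == src:
        if ex_mem[1]:
            # A load's data only arrives at the end of MEM: too late to forward.
            return "STALL"
        return "EX/MEM"      # ALU result of the previous instruction
    if mem_wb is not None and mem_wb[0] == src:
        return "MEM/WB"      # ALU result or loaded data from two instructions back
    return "REGFILE"         # no hazard: use the value read in ID

print(forward_source("r1", ("r1", False), None))          # EX/MEM
print(forward_source("r1", ("r1", True), None))           # STALL
print(forward_source("r1", ("r4", False), ("r1", True)))  # MEM/WB
```

The second call is the load-followed-by-read case from the list above: forwarding alone cannot supply the value, so one bubble is still needed.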
Data Hazard on R1
Figure A.6, Page A-17
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, for the instruction sequence below]
add r1, r2, r3
and r6, r1, r7
sub r4, r1, r3
or  r8, r1, r9
xor r10, r1, r11
Forwarding to Avoid Data Hazard
Figure A.7, Page A-18
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, for the instruction sequence below]
add r1, r2, r3
and r6, r1, r7
sub r4, r1, r3
or  r8, r1, r9
xor r10, r1, r11
Data Hazard Even with Forwarding
Figure A.9, Page A-20
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 5, for the instruction sequence below]
lw  r1, 0(r2)
and r6, r1, r7
sub r4, r1, r5
or  r8, r1, r9
Data Hazard Even with Forwarding
Figure A.10, Page A-21
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, for the instruction sequence below, with bubbles inserted after the load]
lw  r1, 0(r2)
and r6, r1, r7
sub r4, r1, r5
or  r8, r1, r9
LW  R1, 0(R2)     IF    ID    EX    MEM   WB
SUB R4, R1, R5          IF    ID    EX    MEM   WB
AND R6, R1, R7                IF    ID    EX    MEM   WB
OR  R8, R1, R9                      IF    ID    EX    MEM   WB

LW  R1, 0(R2)     IF    ID    EX    MEM   WB
SUB R4, R1, R5          IF    ID    stall EX    MEM   WB
AND R6, R1, R7                IF    stall ID    EX    MEM   WB
OR  R8, R1, R9                      stall IF    ID    EX    MEM   WB
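The one-cycle load interlock in the LW/SUB/AND/OR sequence can be sketched by tracking the cycle in which each instruction reaches EX. The sketch assumes full forwarding, so the only stall is a load immediately followed by a reader of the loaded register. The function name `ex_cycles` and the tuple layout are illustrative.

```python
def ex_cycles(program):
    """Return the clock cycle in which each instruction is in EX.

    program: list of (name, dest, srcs, is_load) in program order.
    """
    ex = []
    for k, (name, dest, srcs, is_load) in enumerate(program):
        # Normally one cycle after the previous instruction; EX is stage 3.
        cycle = ex[-1] + 1 if ex else 3
        if k > 0:
            _, prev_dest, _, prev_is_load = program[k - 1]
            if prev_is_load and prev_dest in srcs:
                cycle += 1   # loaded data is not ready yet: one bubble
        ex.append(cycle)
    return ex

program = [
    ("LW",  "R1", ("R2",),      True),
    ("SUB", "R4", ("R1", "R5"), False),
    ("AND", "R6", ("R1", "R7"), False),
    ("OR",  "R8", ("R1", "R9"), False),
]
cycles = ex_cycles(program)
total = cycles[-1] + 2        # MEM and WB follow the last EX
print(cycles, total)          # [3, 5, 6, 7] 9
```

Without the interlock the four instructions would finish in 8 cycles. The single bubble delays SUB and everything behind it by one cycle.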
Control Hazard on Branches
Three Stage Stall
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9; program execution order (in instructions):]
40  beqz R1, 36
44  and  R12, R2, R5
48  or   R13, R6, R2
52  add  R14, R2, R2
80  ld   R4, R7, 100
Branch instruction    IF    ID    EX    MEM   WB
Branch successor            IF    stall stall IF    ID    EX    MEM   WB
Branch successor + 1                          IF    ID    EX    MEM   WB
Branch successor + 2                                IF    ID    EX    MEM
Branch successor + 3                                      IF    ID    EX
Branch successor + 4                                            IF    ID
Branch successor + 5                                                  IF
Branch Stall Impact
• If CPI = 1, 30% of instructions are branches, and each branch stalls
3 cycles, then new CPI = 1 + 0.3 × 3 = 1.9!
Two simple solutions:
• Predict not taken
– Continue with decoding code that is already in Instruction Cache
– Usually < 50% correct, however, no stalls when correct
• Branch delay slot
– The first instruction following the branch is ALWAYS executed
– Compiler can figure out what to put there
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful
in computation
– About 50% (60% × 80% = 48%) of slots usefully filled
Evaluating Branch Alternatives
\[
\text{Pipeline Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stalls}}
= \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
\]
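The formula can be evaluated for the numbers used earlier in the lecture (30% branches, 3-cycle stall). The pipeline depth of 5 and the function name `branch_speedup` are assumptions for illustration, and the one-cycle penalty in the second call is a hypothetical comparison point rather than a figure from the slides.

```python
def branch_speedup(depth, branch_freq, branch_penalty):
    """Pipeline speedup = depth / (1 + branch frequency * branch penalty),
    assuming an ideal CPI of 1 and no stalls other than branches."""
    return depth / (1 + branch_freq * branch_penalty)

depth = 5  # assumed five-stage pipeline

stall_always = branch_speedup(depth, branch_freq=0.30, branch_penalty=3)
one_cycle = branch_speedup(depth, branch_freq=0.30, branch_penalty=1)  # hypothetical

print(round(stall_always, 2))  # 2.63
print(round(one_cycle, 2))     # 3.85
```

Comparing branch schemes then comes down to estimating the average branch penalty each scheme leaves and substituting it into the same expression.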