Pipelining - II - TAMU Computer Science Faculty Pages

Pipelining - II
Adapted from CS 152C (UC Berkeley) lecture notes, Spring 2002
Revisiting Pipelining Lessons
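[Figure: laundry timeline. Horizontal axis: time, starting at 6 PM (ticks at 7, 8, 9). Vertical axis: task order, loads A, B, C, D. Segment lengths along the time axis: 30, 40, 40, 40, 40, 20 minutes.]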
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• Stall for dependences
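The effect of the slowest stage and of fill/drain time can be checked with a little arithmetic. The sketch below is illustrative only: it assumes three stages of 30, 40 and 20 minutes (consistent with the segment lengths in the figure, but the stage names are hypothetical) and four identical loads.

```python
def sequential_time(stage_minutes, n_tasks):
    """Total time when each task finishes every stage before the next starts."""
    return n_tasks * sum(stage_minutes)


def pipelined_time(stage_minutes, n_tasks):
    """Total time for identical tasks flowing through the stages in order.

    The first task needs sum(stage_minutes); each later task finishes one
    slowest-stage interval after its predecessor.
    """
    return sum(stage_minutes) + (n_tasks - 1) * max(stage_minutes)


# Assumed stage times in minutes (hypothetical labels: wash, dry, fold).
stages = [30, 40, 20]
loads = 4

seq = sequential_time(stages, loads)   # 360
pipe = pipelined_time(stages, loads)   # 210
print(f"sequential: {seq} min, pipelined: {pipe} min, speedup: {seq / pipe:.2f}")
```

With these numbers the speedup is about 1.7, well short of the 3 stages, because the stages are unbalanced and only four loads share the fill and drain time.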
Revisiting Pipelining Hazards
• Structural Hazards
– Hardware design
• Control Hazard
– Decision based on results
• Data Hazard
– Data Dependency
Control Signals for existing Datapath
IF: Instruction Fetch
ID: Instruction Decode / register file read
EX: Execute / address calculation
MEM: Memory Access
WB: Write Back
[Figure: the existing datapath. Components: PC, +4 adder, Instruction Memory (Address, Instruction), Registers (Read Reg1, Read Reg2, Write Reg, Write Data, Read Data1, Read Data2), 16-to-32-bit Sign Extend, Shift left 2, branch adder, ALU with Zero output, Data Memory (Address, Write Data, Read Data), and MUXes.]
The Right to Left Control can lead to hazards
Place registers between each step
IF/ID
ID/EX
EX/MEM
MEM/WB
[Figure: the same datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB placed between the stages.]
Example
Addr  Instruction
10    lw   r1, r2(35)
14    addI r2, r2, 3
20    sub  r3, r4, r5
24    beq  r6, r7, 100
30    ori  r8, r9, 17
34    add  r10, r11, r12
100   and  r13, r14, 15
Start: Fetch 10
[Figure: pipelined datapath (Next PC / Inst. Mem, Decode / Reg File, Exec, Mem Access / Data Mem, Reg. File). PC = 10; all later stages are empty.]
Stage  Addr  Instruction
IF     10    lw   r1, r2(35)
       14    addI r2, r2, 3
       20    sub  r3, r4, r5
       24    beq  r6, r7, 100
       30    ori  r8, r9, 17
       34    add  r10, r11, r12
       100   and  r13, r14, 15
Fetch 14, Decode 10
[Figure: pipelined datapath. PC = 14; Decode holds lw r1, r2(35); later stages are empty.]
Stage  Addr  Instruction
ID     10    lw   r1, r2(35)
IF     14    addI r2, r2, 3
       20    sub  r3, r4, r5
       24    beq  r6, r7, 100
       30    ori  r8, r9, 17
       34    add  r10, r11, r12
       100   and  r13, r14, 15
Fetch 20, Decode 14, Exec 10
[Figure: pipelined datapath. PC = 20; Decode holds addI r2, r2, 3; Exec holds lw r1 with operands r2 and 35; later stages are empty.]
Stage  Addr  Instruction
EX     10    lw   r1, r2(35)
ID     14    addI r2, r2, 3
IF     20    sub  r3, r4, r5
       24    beq  r6, r7, 100
       30    ori  r8, r9, 17
       34    add  r10, r11, r12
       100   and  r13, r14, 15
Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: pipelined datapath. PC = 24; Decode holds sub r3, r4, r5; Exec holds addI r2, r2, 3 with operands r2 and 3; Mem holds lw r1 with address r2+35.]
Stage  Addr  Instruction
M      10    lw   r1, r2(35)
EX     14    addI r2, r2, 3
ID     20    sub  r3, r4, r5
IF     24    beq  r6, r7, 100
       30    ori  r8, r9, 17
       34    add  r10, r11, r12
       100   and  r13, r14, 15
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: pipelined datapath. PC = 30; Decode holds beq r6, r7, 100; Exec holds sub r3 with operands r4 and r5; Mem holds addI r2 with result r2+3; WB holds lw r1 with value M[r2+35].]
Stage  Addr  Instruction
WB     10    lw   r1, r2(35)
M      14    addI r2, r2, 3
EX     20    sub  r3, r4, r5
ID     24    beq  r6, r7, 100
IF     30    ori  r8, r9, 17
       34    add  r10, r11, r12
       100   and  r13, r14, 15
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: pipelined datapath. r1 = M[r2+35] has been written; PC = 100; Decode holds ori r8, r9, 17; Exec holds beq with operands r6, r7 and target 100; Mem holds sub r3 with result r4-r5; WB holds addI r2 with result r2+3.]
Stage  Addr  Instruction
       10    lw   r1, r2(35)
WB     14    addI r2, r2, 3
M      20    sub  r3, r4, r5
EX     24    beq  r6, r7, 100
ID     30    ori  r8, r9, 17
       34    add  r10, r11, r12
IF     100   and  r13, r14, 15
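The cycle-by-cycle walk above can be reproduced with a small sketch. It is a simplified model, not the datapath itself: it assumes one instruction is fetched per cycle with no stalls, and it takes the fetch order (including the jump from 30 to 100) directly from the slide titles rather than evaluating the beq. The names `STAGES`, `FETCH_ORDER` and `stage_occupancy` are made up for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Fetch order as it appears in the slide titles; the jump from 30 to 100
# is copied from the slides, not computed from the beq.
FETCH_ORDER = ["10", "14", "20", "24", "30", "100"]


def stage_occupancy(fetch_order, cycle):
    """Return {stage: address} for a 1-based cycle, assuming no stalls."""
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - 1 - depth  # instruction fetched `depth` cycles earlier
        if 0 <= index < len(fetch_order):
            occupancy[stage] = fetch_order[index]
    return occupancy


for cycle in range(1, len(FETCH_ORDER) + 1):
    state = stage_occupancy(FETCH_ORDER, cycle)
    print(f"cycle {cycle}: " + ", ".join(f"{s} {a}" for s, a in state.items()))
```

Cycle 5 of this model corresponds to "Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10", and cycle 6 to "Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14".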
Pipelining Load Instruction
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw   Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw            Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                     Ifetch   Reg/Dec  Exec     Mem      Wr
• The five independent functional units in the pipeline datapath are:
– Instruction Memory for the Ifetch stage
– Register File’s Read ports (busA and busB) for the Reg/Dec stage
– ALU for the Exec stage
– Data Memory for the Mem stage
– Register File’s Write port (busW) for the Wr stage
Pipelining the R Instruction
         Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type   Ifetch   Reg/Dec  Exec     Wr
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Register Fetch and Instruction Decode
• Exec:
– ALU operates on the two register operands
– Update PC
• Wr: Write the ALU output back to the register file
Pipelining Both L and R type
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Wr
R-type            Ifetch   Reg/Dec  Exec     Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Wr
R-type                                       Ifetch   Reg/Dec  Exec     Wr
Oops! We have a problem! (The Load and the R-type that follows it both reach Wr in Cycle 7.)
• We have a pipeline conflict, or structural hazard:
– Two instructions try to write to the register file at the same time!
– There is only one write port
Important Observations
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
– Load uses Register File’s Write Port during its 5th stage
         1        2        3        4        5
Load     Ifetch   Reg/Dec  Exec     Mem      Wr
– R-type uses Register File’s Write Port during its 4th stage
         1        2        3        4
R-type   Ifetch   Reg/Dec  Exec     Wr
Solution
• Delay R-type’s register write by one cycle:
– Now R-type instructions also use Reg File’s write port at Stage 5
– Mem stage is a NOOP stage: nothing is being done.
         1        2        3        4        5
R-type   Ifetch   Reg/Dec  Exec     Mem      Wr

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Mem      Wr
R-type            Ifetch   Reg/Dec  Exec     Mem      Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                       Ifetch   Reg/Dec  Exec     Mem      Wr
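The write-port conflict and the effect of padding R-type with a no-op Mem stage can be checked mechanically. This is a minimal sketch under stated assumptions: one instruction issues per cycle with no stalls, the instruction mix is R-type, R-type, Load, R-type, R-type, and the helper name `write_port_conflicts` is invented for the example.

```python
from collections import defaultdict

# Stage sequences taken from the slides.
LOAD = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
R_TYPE_4 = ["Ifetch", "Reg/Dec", "Exec", "Wr"]         # original R-type
R_TYPE_5 = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]  # Mem is a no-op


def write_port_conflicts(program):
    """Return {cycle: [instruction indices]} for cycles with more than one Wr.

    Instruction k (0-based) is assumed to start in cycle k + 1: one issue
    per cycle, no stalls.
    """
    writers = defaultdict(list)
    for k, stages in enumerate(program):
        wr_cycle = (k + 1) + stages.index("Wr")
        writers[wr_cycle].append(k)
    return {c: ks for c, ks in writers.items() if len(ks) > 1}


mixed = [R_TYPE_4, R_TYPE_4, LOAD, R_TYPE_4, R_TYPE_4]
padded = [R_TYPE_5, R_TYPE_5, LOAD, R_TYPE_5, R_TYPE_5]

print(write_port_conflicts(mixed))   # {7: [2, 3]}
print(write_port_conflicts(padded))  # {}
```

With 4-stage R-types, the Load (index 2) and the R-type after it (index 3) both write in cycle 7. Once every instruction writes in its 5th stage, no two instructions reach Wr in the same cycle.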
Datapath (Without Pipeline)
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
Then one of the following paths, depending on the instruction:
  S <- A + B;                     R[rd] <- S;
  S <- A or ZX;                   R[rt] <- S;
  S <- A + SX;    M <- Mem[S];    R[rd] <- M;
  S <- A + SX;    Mem[S] <- B
  Equal:          If Cond PC <- PC+SX;
[Figure: datapath blocks Next PC / PC, Inst. Mem / IR, Reg File (A, B), Exec (S), Mem Access / Data Mem (M, D), Reg. File.]
Datapath (With Pipeline)
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
Then one of the following paths, depending on the instruction:
  S <- A + B;     M <- S;         R[rd] <- M;
  S <- A or ZX;   M <- S;         R[rt] <- M;
  S <- A + SX;    M <- Mem[S];    R[rd] <- M;
  S <- A + SX;    Mem[S] <- B
  Equal:          if Cond PC <- PC+SX;
[Figure: datapath blocks Next PC / PC, Inst. Mem / IR, Reg File (A, B), Exec (S), Mem Access / Data Mem (M, D), Reg. File.]
Structural Hazard and Solution
[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Control Hazard - #1 Stall
[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Add, Beq, Load. Each instruction passes through Mem, Reg, ALU, Mem, Reg; the Load starts late, and the gap is marked "Lost potential".]
• Stall: wait until decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
Control Hazard – #2 Predict
[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Add, Beq, Load. Each instruction passes through Mem, Reg, ALU, Mem, Reg.]
• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time)
• More dynamic scheme: history of 1 branch
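The impact figures for stalling and predicting can be compared as an expected cost per branch. The sketch below only restates the numbers from the slides (2 lost cycles when stalling; 0 or 1 lost cycles when predicting, right 50% of the time); the function name `cycles_per_branch` is an assumption made for the example.

```python
def cycles_per_branch(lost_if_right, lost_if_wrong, p_right):
    """Average cycles per branch: 1 base cycle plus expected lost cycles."""
    expected_lost = p_right * lost_if_right + (1 - p_right) * lost_if_wrong
    return 1 + expected_lost


# Stall: the 2 cycles are always lost, so the "guess" never matters.
stall = cycles_per_branch(lost_if_right=2, lost_if_wrong=2, p_right=1.0)

# Predict: 0 lost cycles if right, 1 if wrong, right 50% of the time.
predict = cycles_per_branch(lost_if_right=0, lost_if_wrong=1, p_right=0.5)

print(f"stall: {stall} cycles/branch, predict: {predict} cycles/branch")
```

Under these assumptions stalling costs 3 cycles per branch and a 50%-accurate prediction costs 1.5 on average; a more accurate predictor moves the second number toward 1.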
Control Hazard - #3 Delayed Branch
[Figure: pipeline diagram, time (clock cycles) across, instruction order down: Add, Beq, Misc, Load. Each instruction passes through Mem, Reg, ALU, Mem, Reg.]
• Delayed Branch: redefine branch behavior (takes place after next instruction)
• Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (about 50% of time)
Data Hazards (RAW)
• Dependencies backwards in time are hazards
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram, time (clock cycles) across, instruction order down. Stages IF, ID/RF, EX, MEM, WB are drawn as Im, Reg, ALU, Dm, Reg.]
Data Hazards [contd…]
• “Forward” result from one stage to another
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: the same pipeline diagram (stages IF, ID/RF, EX, MEM, WB drawn as Im, Reg, ALU, Dm, Reg), time (clock cycles) across, instruction order down.]
Data Hazards [contd…]
• Dependencies backwards in time are hazards
  lw  r1,0(r2)
  sub r4,r1,r3    (Stall)
[Figure: pipeline diagram, time (clock cycles) across. Stages IF, ID/RF, EX, MEM, WB are drawn as Im, Reg, ALU, Dm, Reg; the sub is marked "Stall".]
• Can’t solve with forwarding:
• Must delay/stall instruction dependent on loads
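The load-use case can be expressed as a simple check on two adjacent instructions. This is a hedged sketch, not the hardware's actual interlock logic: the `Instr` record and the function `needs_load_use_stall` are hypothetical names, and only the `lw` opcode is treated as a load.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "lw" or "sub"
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def needs_load_use_stall(prev: Instr, cur: Instr) -> bool:
    """True if `cur` reads the register that the preceding load writes."""
    return prev.op == "lw" and prev.dest is not None and prev.dest in cur.srcs


lw = Instr("lw", "r1", ("r2",))           # lw  r1,0(r2)
sub = Instr("sub", "r4", ("r1", "r3"))    # sub r4,r1,r3
add = Instr("add", "r1", ("r2", "r3"))    # add r1,r2,r3

print(needs_load_use_stall(lw, sub))   # True: stall the sub
print(needs_load_use_stall(add, sub))  # False: forwarding is enough
```

The second call returns False because the add's result is available from the ALU in time to forward, whereas the load's data only arrives after the Mem stage.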
Hazard Detection
[Figure: example pipeline timing diagrams, one per hazard type, with stages labeled I-Fetch / IF, DCD, OpFetch / OF, MemOpFetch, Exec / EX, Mem, Store, WB, Jump and RS. The hazards illustrated are: Structural Hazard; Control Hazard; RAW (read after write) Data Hazard; WAW (write after write) Data Hazard; WAR (write after read) Data Hazard.]
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Data Dependence” (in compiler
nomenclature). This hazard results from an actual
need for communication.
CPSC614
Lec 2.28
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated
pipes
Hazard Detection
• Suppose instruction i is about to be issued and a
predecessor instruction j is in the instruction pipeline.
[Figure: instruction movement through the pipeline: New Inst, Inst I, Inst J. Window on execution: only pending instructions can cause hazards.]
• A RAW hazard exists on register r if r ∈ Rregs(i) ∩ Wregs(j)
• A WAW hazard exists on register r if r ∈ Wregs(i) ∩ Wregs(j)
• A WAR hazard exists on register r if r ∈ Wregs(i) ∩ Rregs(j)
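These three conditions are set intersections, which can be sketched directly. Note that on this slide i is the instruction about to issue and j is its predecessor already in the pipeline. The function name `classify_hazards` is an illustrative assumption.

```python
def classify_hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    """Map hazard name -> registers involved; j is the earlier instruction."""
    hazards = {
        "RAW": set(rregs_i) & set(wregs_j),  # i reads what j writes
        "WAW": set(wregs_i) & set(wregs_j),  # i writes what j writes
        "WAR": set(wregs_i) & set(rregs_j),  # i writes what j reads
    }
    return {name: regs for name, regs in hazards.items() if regs}


# j: add r1,r2,r3   followed by   i: sub r4,r1,r3
print(classify_hazards(
    rregs_i={"r1", "r3"}, wregs_i={"r4"},
    rregs_j={"r2", "r3"}, wregs_j={"r1"},
))  # {'RAW': {'r1'}}
```

Whether a detected condition is a real hazard still depends on the pipeline: as the preceding slides note, WAR and WAW cannot occur in the MIPS 5-stage pipeline.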
Computing CPI
• Start with Base CPI
• Add stalls
$CPI = CPI_{base} + CPI_{stall}$
$CPI_{stall} = STALL_{type1} \times freq_{type1} + STALL_{type2} \times freq_{type2}$
• Suppose:
– $CPI_{base} = 1$
– $freq_{branch} = 20\%$, $freq_{load} = 30\%$
– Branches always cause a 1-cycle stall
– Loads cause a 2-cycle stall
• Then: $CPI = 1 + (1 \times 0.20) + (2 \times 0.30) = 1.8$
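The same calculation as a short sketch, using the numbers from the example above; the function name `effective_cpi` and the pair-based input format are choices made for this illustration.

```python
def effective_cpi(cpi_base, stalls):
    """Base CPI plus the sum of stall_cycles * frequency over stall types."""
    return cpi_base + sum(cycles * freq for cycles, freq in stalls)


# (stall cycles, instruction frequency) for branches and loads
cpi = effective_cpi(1.0, [(1, 0.20), (2, 0.30)])
print(f"CPI = {cpi:.1f}")  # CPI = 1.8
```

Adding another stall source is just another (cycles, frequency) pair in the list.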
Summary
• Control signals need to be propagated
• Insert registers between every stage to “remember” and “propagate” values
• Solutions to Control Hazards are Stall, Predict and Delayed Branch
• The solution to Data Hazards is “Forwarding”, with a stall for instructions that depend on a load
• Effective CPI = CPIideal + CPIstall