CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic

advertisement
CS 152 Computer Architecture and
Engineering
Lecture 4 - Pipelining
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
January 28, 2010
CS152, Spring 2010
Last time in Lecture 3
• Microcoding became less attractive as gap between
RAM and ROM speeds reduced
• Complex instruction sets difficult to pipeline, so
difficult to increase performance as gate count grew
• Iron-law explains architecture design space
– Trade instructions/program, cycles/instruction, and time/cycle
• Load-Store RISC ISAs designed for efficient
pipelined implementations
– Very similar to vertical microcode
– Inspired by earlier Cray machines (more on these later)
January 28, 2010
CS152, Spring 2010
2
An Ideal Pipeline
stage
1
stage
2
stage
3
stage
4
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline
is not affected by the objects in other stages
These conditions generally hold for industrial
assembly lines, but instructions depend on each
other!
January 28, 2010
CS152, Spring 2010
3
Pipelined MIPS
To pipeline MIPS:
• First build MIPS without pipelining with
CPI=1
• Next, add pipeline registers to reduce
cycle time while maintaining CPI=1
January 28, 2010
CS152, Spring 2010
4
Lecture 3: Unpipelined Datapath for MIPS
PCSrc
br
rind
jabs
pc+4
RegWrite
MemWrite
WBSrc
0x4
Add
Add
clk
PC
clk
31
addr
inst
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
clk
we
addr
ALU
GPRs
z
rdata
Data
Memory
Imm
Ext
wdata
ALU
Control
OpCode RegDst
January 28, 2010
ExtSel
OpSel
BSrc
CS152, Spring 2010
zero?
5
Lecture 3: Hardwired Control Table
Opcode
ALU
ExtSel
BSrc
OpSel
MemW
RegW
WBSrc
RegDst
PCSrc
SW
*
sExt16
uExt16
sExt16
sExt16
Reg
Imm
Imm
Imm
Imm
Func
Op
Op
+
+
no
no
no
no
yes
yes
yes
yes
yes
no
ALU
ALU
ALU
Mem
*
rd
rt
rt
rt
*
pc+4
pc+4
pc+4
pc+4
pc+4
BEQZz=0
sExt16
*
0?
no
no
*
*
br
BEQZz=1
sExt16
*
*
*
*
*
no
no
no
no
no
*
*
*
*
pc+4
jabs
*
*
*
*
0?
*
*
*
*
yes
no
yes
PC
*
PC
R31
*
R31
jabs
rind
rind
ALUi
ALUiu
LW
J
JAL
JR
JALR
BSrc = Reg / Imm
RegDst = rt / rd / R31
January 28, 2010
no
no
WBSrc = ALU / Mem / PC
PCSrc = pc+4 / br / rind / jabs
CS152, Spring 2010
6
Pipelined Datapath
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
Imm
Ext
we
addr
rdata
Data
Memory
wdata
write
fetch
decode & Reg-fetch
execute
memory
-back
phase
phase
phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
January 28, 2010
CS152, Spring 2010
7
“Iron Law” of Processor Performance
Time = Instructions
Cycles
Time
Program
Program * Instruction * Cycle
– Instructions per program depends on source code,
compiler technology, and ISA
– Cycles per instructions (CPI) depends upon the
ISA and the microarchitecture
– Time per cycle depends upon the
microarchitecture and the base technology
Microarchitecture
Lecture 2 Microcoded
Lecture 3 Single-cycle unpipelined
Lecture 4 Pipelined
January 28, 2010
CS152, Spring 2010
CPI
>1
1
1
cycle time
short
long
short
8
CPI Examples
Microcoded machine
7 cycles
5 cycles
Inst 1
Time
10 cycles
Inst 2
Inst 3
3 instructions, 22 cycles, CPI=7.33
Unpipelined machine
Inst 1
Inst 2
Inst 3
3 instructions, 3 cycles, CPI=1
Pipelined machine
Inst 1
Inst 2
Inst 3
January 28, 2010
3 instructions, 3 cycles, CPI=1
CS152, Spring 2010
9
Technology Assumptions
• A small amount of very fast memory (caches)
backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!)
Thus, the following timing assumption is reasonable
tIM tRF tALU tDM  tRW
A 5-stage pipeline will be the focus of our
detailed design
- some commercial designs have over
30 pipeline stages to do an integer add!
January 28, 2010
CS152, Spring 2010
10
5-Stage Pipelined Execution
0x4
Add
PC
addr
rdata
we
rs1
rs2
rd1
ws
wd rd2
GPRs
IR
Inst.
Memory
I-Fetch
(IF)
Imm
Ext
Data
Memory
wdata
Write
Decode, Reg. Fetch Execute
Memory
(ID)
(EX)
(MA)
Back
t0 t1 t2 t3 t4 t5 t6 t7 . . . . (WB)
time
instruction1
instruction2
instruction3
instruction4
instruction5
January 28, 2010
ALU
we
addr
rdata
IF1
ID1 EX1 MA1 WB1
IF2 ID2 EX2 MA2 WB2
IF3 ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
CS152, Spring 2010
11
5-Stage Pipelined Execution
Resource Usage Diagram
0x4
Add
PC
addr
rdata
we
rs1
rs2
rd1
ws
wd rd2
GPRs
IR
Inst.
Memory
Resources
I-Fetch
(IF)
January 28, 2010
we
addr
rdata
ALU
Data
Memory
Imm
Ext
wdata
t5
Write
Memory
(MA)
Back
t6 t7 . . . . (WB)
I5
I4
I3
I2
I5
I4
I3
Decode, Reg. Fetch Execute
(ID)
(EX)
time
IF
ID
EX
MA
WB
t0
I1
t1
I2
I1
t2
I3
I2
I1
t3
I4
I3
I2
I1
t4
I5
I4
I3
I2
I1
CS152, Spring 2010
I5
I4
I5
12
Pipelined Execution:
ALU Instructions
0x4
IR
Add
PC
addr
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Not quite correct!
We need an Instruction Reg (IR) for each stage
January 28, 2010
CS152, Spring 2010
13
Pipelined MIPS Datapath
without jumps
F
D
E
M
IR
W
IR
IR
31
0x4
Add
RegDst
RegWrite
PC
addr
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
OpSel
MemWrite
A
ALU
GPRs
Y
we
addr
rdata
B
Data
Memory
wdata
Imm
Ext
R
wdata
MD1
ExtSel
WBSrc
MD2
BSrc
Control Points Need to
Be Connected
January 28, 2010
CS152, Spring 2010
14
Instructions interact with each other in
pipeline
• An instruction in the pipeline may need a
resource being used by another instruction
in the pipeline  structural hazard
• An instruction may depend on something
produced by an earlier instruction
– Dependence may be for a data value
 data hazard
– Dependence may be for the next instruction’s
address
 control hazard (branches, exceptions)
January 28, 2010
CS152, Spring 2010
15
Resolving Structural Hazards
• Structural hazards occurs when two
instruction need same hardware resource at
same time
– Can resolve in hardware by stalling newer instruction till
older instruction finished with resource
• A structural hazard can always be avoided by
adding more hardware to design
– E.g., if two instructions both need a port to memory at same
time, could avoid hazard by adding second port to memory
• Our 5-stage pipe has no structural hazards by
design
– Thanks to MIPS ISA, which was designed for pipelining
January 28, 2010
CS152, Spring 2010
16
Data Hazards
r1 …
r4  r1 …
0x4
IR
Add
PC
addr
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
...
r1 r0 + 10
r4 r1 + 17
...
January 28, 2010
IR
IR
MD2
r1 is stale. Oops!
CS152, Spring 2010
17
CS152 Administrivia
• PS 1 out Tuesday
• Lab 1 out today
• Scott Beamer will run section reviewing lab 1 at 2pm
in 320 Soda
January 28, 2010
CS152, Spring 2010
18
Resolving Data Hazards (1)
Strategy 1:
Wait for the result to be available by freezing
earlier pipeline stages  interlocks
January 28, 2010
CS152, Spring 2010
19
Feedback to Resolve Hazards
FB1
FB2
stage
1
FB3
stage
2
FB4
stage
3
stage
4
• Later stages provide dependence information to
earlier stages which can stall (or kill) instructions
• Controlling a pipeline in this manner works provided
the instruction at stage i+1 can complete without
any interference from instructions in stages 1 to i
(otherwise deadlocks may occur)
January 28, 2010
CS152, Spring 2010
20
Interlocks to resolve Data Hazards
Stall Condition
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
...
r1 r0 + 10
r4 r1 + 17
...
January 28, 2010
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
CS152, Spring 2010
MD2
21
Stalled Stages and Pipeline Bubbles
time
t0 t1 t2 t3 t4 t5
(I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1
(I2) r4 (r1) + 17
IF2 ID2 ID2 ID2 ID2
(I3)
IF3 IF3 IF3 IF3
(I4)
stalled stages
(I5)
Resource
Usage
IF
ID
EX
MA
WB
time
t0 t1
I1
I2
I1
t2
I3
I2
I1
t3
I3
I2
nop
I1
t4
I3
I2
nop
nop
I1
t5
I3
I2
nop
nop
nop
t6
CS152, Spring 2010
....
EX2 MA2 WB2
ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5
t6
I4
I3
I2
nop
nop
nop 
January 28, 2010
t7
t7
I5
I4
I3
I2
nop
....
I5
I4
I3
I2
I5
I4
I3
I5
I4
pipeline bubble
22
I5
Interlock Control Logic
stall
ws
Cstall
rs
rt
?
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Compare the source registers of the instruction in the decode
stage with the destination register of the uncommitted
instructions.
January 28, 2010
CS152, Spring 2010
23
Interlock Control Logic
ignoring jumps & branches
stall
Cstall
rs
rt
ws
we
?
re1
re2
Cre
0x4
PC
addr
Cdest
Cdest
nop
Add
ws
ws we
we
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Cdest
A
ALU
Y
B
we
addr
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
MD2
Should we always stall if the rs field matches some rd?
not every instruction writes a register we
not every instruction reads a register re
January 28, 2010
CS152, Spring 2010
24
Source & Destination Registers
ALU
ALUi
LW
SW
BZ
J
JAL
JR
JALR
R-type:
op
rs
rt
I-type:
op
rs
rt
J-type:
op
func
immediate16
immediate26
rd  (rs) func (rt)
rt  (rs) op imm
rt M [(rs) + imm]
M [(rs) + imm]  (rt)
cond (rs)
true: PC  (PC) + imm
false: PC  (PC) + 4
PC  (PC) + imm
r31  (PC), PC  (PC) + imm
PC  (rs)
r31  (PC), PC  (rs)
January 28, 2010
rd
CS152, Spring 2010
source(s) destination
rs, rt
rd
rs
rt
rs
rt
rs, rt
rs
rs
rs
rs
31
31
25
Deriving the Stall Signal
Cdest
ws = Case opcode
ALU
rd
ALUi, LW
rt
JAL, JALR
R31
Cre
we = Case opcode
ALU, ALUi, LW (ws  0)
JAL, JALR
on
...
off
Cstall
January 28, 2010
re1 = Case opcode
ALU, ALUi,
LW, SW, BZ,
JR, JALR
on
J, JAL
off
re2 = Case opcode
on
ALU, SW
off
...
stall = ((rsD =wsE).weE +
(rsD =wsM).weM +
(rsD =wsW).weW) . re1D
((rtD =wsE).weE +
(rtD =wsM).weM +
(rtD =wsW).weW) . re2D
CS152, Spring 2010
+
26
Hazards due to Loads & Stores
Stall Condition
What if
(r1)+7 = (r3)+5 ?
0x4
nop
Add
PC
addr
IR
IR
IR
31
inst IR
Inst
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
...
M[(r1)+7]  (r2)
r4  M[(r3)+5]
...
January 28, 2010
we
addr
R
wdata
wdata
MD1
MD2
Is there any possible data hazard
in this instruction sequence?
CS152, Spring 2010
27
Load & Store Hazards
...
M[(r1)+7]  (r2)
r4  M[(r3)+5]
...
(r1)+7 = (r3)+5  data hazard
However, the hazard is avoided because our
memory system completes writes in one cycle !
Load/Store hazards are sometimes resolved in
the pipeline and sometimes in the memory
system itself.
More on this later in the course.
January 28, 2010
CS152, Spring 2010
28
Resolving Data Hazards (2)
Strategy 2:
Route data as soon as possible after it is
calculated to the earlier pipeline stage  bypass
January 28, 2010
CS152, Spring 2010
29
Bypassing
time
(I1) r1 r0 + 10
(I2) r4 r1 + 17
(I3)
(I4)
(I5)
t0
IF1
t1 t2 t3 t4 t5
ID1 EX1 MA1 WB1
IF2 ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages
t6
t7
....
EX2 MA2 WB2
ID3 EX3 MA3
IF4 ID4 EX4
IF5 ID5
Each stall or kill introduces a bubble in the pipeline
CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time
(I1) r1 r0 + 10
(I2) r4  r1 + 17
(I3)
(I4)
(I5)
January 28, 2010
t0
t1
IF1
t2 t3
ID1 EX1
IF2 ID2
IF3
t4
MA1
EX2
ID3
IF4
CS152, Spring 2010
t5
WB1
MA2
EX3
ID4
IF5
t6
t7
....
WB2
MA3 WB3
EX4 MA4 WB4
ID5 EX5 MA5 WB5
30
Adding a Bypass
stall
r4  r1...
0x4
r1 ...
nop
Add
IR
E
IR
M
IR
31
ASrc
PC
addr
inst IR
Inst
Memory
D
we
rs1
rs2
rd1
ws
wd rd2
GPRs
A
ALU
Y
B
rdata
Data
Memory
Imm
Ext
R
wdata
wdata
MD1
(I1)
(I2)
we
addr
MD2
When does this bypass help?
...
r1 r0 + 10
r1 M[r0 + 10]
JAL 500
r4 r1 + 17
r4 r1 + 17
r4 r31 + 17
yes
no
no
January 28, 2010
CS152, Spring 2010
31
W
The Bypass Signal
Deriving it from the Stall Signal
stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D
+((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )
ws = Case opcode
ALU
rd
ALUi, LW
rt
JAL, JALR R31
we = Case opcode
ALU, ALUi, LW (ws  0)
JAL, JALR
on
...
off
ASrc = (rsD=wsE).weE.re1D
Is this correct?
No because only ALU and ALUi instructions can benefit
from this bypass
Split weE into two components: we-bypass, we-stall
January 28, 2010
CS152, Spring 2010
32
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
ALU, ALUi  (ws  0)
...
off
we-stallE = Case opcodeE
LW
 (ws  0)
JAL, JALR on
...
off
ASrc
= (rsD =wsE).we-bypassE . re1D
stall
= ((rsD =wsE).we-stallE +
(rsD=wsM).weM + (rsD=wsW).weW). re1D
+((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D
January 28, 2010
CS152, Spring 2010
33
Fully Bypassed Datapath
PC for JAL, ...
stall
0x4
nop
Add
PC
addr
ASrc
inst IR
Inst
Memory
D
Is there still
a need for the
stall signal ?
January 28, 2010
IR
M
IR
31
we
rs1
rs2
rd1
ws
wd rd2
A
ALU
GPRs
Imm
Ext
IR
E
Y
B
BSrc
we
addr
rdata
Data
Memory
R
wdata
wdata
MD1
MD2
stall = (rsD=wsE). (opcodeE=LWE).(wsE0 ).re1D
+ (rtD=wsE). (opcodeE=LWE).(wsE0 ).re2D
CS152, Spring 2010
34
W
Resolving Data Hazards (3)
Strategy 3:
Speculate on the dependence. Two cases:
Guessed correctly  do nothing
Guessed incorrectly  kill and restart
…. We’ll later see examples of this approach in
more complex processors.
January 28, 2010
CS152, Spring 2010
35
Next Time: Control Hazards
• Branches/Jumps
• Exceptions/Interrupts
January 28, 2010
CS152, Spring 2010
36
Acknowledgements
• These slides contain material developed and
copyright by:
–
–
–
–
–
–
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252
January 28, 2010
CS152, Spring 2010
37
Download