Pipelining review

COM506 Computer Design
Lecture 2. Pipelining
- COMP212 Review -
Prof. Taeweon Suh
Computer Science Education
Korea University
Processor Performance
• Performance of a single-cycle processor is limited by its long critical path delay
  - The critical path limits the operating clock frequency
• Can we do better?
  - New semiconductor technology reduces the critical path delay by manufacturing smaller transistors
    • Core 2 Duo is manufactured with 65nm technology
    • Core i7
      - Nehalem: 45nm technology
      - Sandy Bridge: 32nm technology
      - Ivy Bridge: 22nm technology
  - Can we increase processor performance with a different microarchitecture?
    • Yes! Pipelining
Korea Univ
Revisiting Performance
• Laundry Example
  - Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - Folder takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, and D run one after another in task order, each taking 30 + 40 + 20 minutes]
• Response time: 90 mins
• Throughput: 0.67 tasks / hr (= 90 mins/task, 6 hours for 4 loads)
Pipelined Laundry
[Figure: timeline starting at 6 PM; loads A, B, C, and D overlap in task order: one 30-minute wash, four back-to-back 40-minute dryer slots, and a final 20-minute fold]
• Pipelining doesn’t help the latency (response time) of a single task
• Pipelining helps the throughput of the entire workload
• Multiple tasks operate simultaneously
• Unbalanced lengths of pipeline stages reduce speedup
• Potential speedup = # of pipeline stages
• Response time: 90 mins
• Throughput: 1.14 tasks / hr (= 52.5 mins/task, 3.5 hours for 4 loads)
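The laundry numbers can be reproduced with a small simulation. This is only a sketch: the function names are made up for illustration, and it assumes a load may wait between machines until the next one is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """A load enters a stage as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stage_minutes)  # time at which each stage is idle again
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stage_minutes):
            start = max(t, stage_free_at[i])
            t = start + duration
            stage_free_at[i] = t
        finish = t
    return finish


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)
print(seq, pipe)  # 360 210
print(round(4 / (seq / 60), 2), round(4 / (pipe / 60), 2))  # 0.67 1.14
```

The 40-minute dryer is the slowest stage, so it sets the pace of the pipelined schedule.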
Pipelining
• Improve performance by increasing instruction throughput
Stage latencies
  Instruction fetch              2 ns
  Register file access (read)    1 ns
  ALU operation                  2 ns
  Data access                    2 ns
  Register access (write)        1 ns

Sequential execution
[Figure: program execution order versus time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg and takes 8 ns; the next instruction starts only after the previous one finishes]

Pipelined execution
[Figure: the same three lw instructions overlapped; every stage is allotted 2 ns and a new instruction starts every 2 ns]
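The timing of the three lw instructions can be checked with a few lines. This is a sketch, not part of the original slides: the dictionary keys and function names are illustrative, and the pipelined clock period is assumed to equal the slowest stage.

```python
# Stage latencies from the slide, in nanoseconds
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def sequential_total_ns(n):
    """n instructions, each taking the sum of all stage latencies."""
    return n * sum(STAGE_NS.values())


def pipelined_total_ns(n):
    """n hazard-free instructions; the clock period is the slowest stage."""
    clock = max(STAGE_NS.values())
    return (len(STAGE_NS) + n - 1) * clock


print(sequential_total_ns(3), pipelined_total_ns(3))  # 24 14
```

For a long instruction stream the ratio approaches 8 ns / 2 ns = 4 rather than 5, because the stages are not balanced.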
Pipelining (Cont.)
Multiple instructions are being executed simultaneously
[Figure: program execution order versus time (2 to 14 ns) for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) and following instructions; a new instruction enters Instruction fetch every 2 ns while earlier ones occupy Reg, ALU, Data access, Reg]

Pipeline Speedup
• If all stages are balanced (meaning that each stage takes the same amount of time):
  $$\text{Time to execute an instruction}_{\text{pipeline}} = \frac{\text{Time to execute an instruction}_{\text{sequential}}}{\text{Number of stages}}$$
• If not balanced, speedup is less
• Speedup comes from increased throughput (the latency of an instruction does not decrease)
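The balanced-stage relation can be worked through for a finite instruction count. A sketch, assuming k stages of equal duration τ and n hazard-free instructions (the symbols k, τ, and n are introduced here for illustration and do not appear on the slides):

```latex
\begin{aligned}
T_{\text{sequential}} &= n \, k \, \tau \\
T_{\text{pipelined}}  &= (k + n - 1)\,\tau \\
\text{Speedup} &= \frac{T_{\text{sequential}}}{T_{\text{pipelined}}}
                = \frac{n\,k}{k + n - 1}
                \;\longrightarrow\; k \quad (n \to \infty)
\end{aligned}
```

The first instruction needs k cycles to fill the pipeline, and each later instruction completes one cycle after its predecessor, which is where the k + n - 1 term comes from.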
Basic Idea
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: single-cycle MIPS datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders, and muxes) divided into the five stages above]
• What do we have to add to actually split the datapath into stages?
Basic Idea
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: the same datapath with clocked D flip-flop (F/F) registers inserted at each stage boundary]
Graphically Representing Pipelines
[Figure: multi-cycle pipeline diagram over time 2 to 10; lw passes through IF, ID, EX, MEM, WB and add follows one cycle later through the same stages]
• Shading indicates the unit is being used by the instruction
• Shading on the right half of the register file (ID or WB) or memory means the element is being read in that stage
• Shading on the left half means the element is being written in that stage
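A text version of such a diagram can be generated for any hazard-free instruction sequence. This is a sketch; the helper names `pipeline_diagram` and `render` are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return (name, cells) rows of an ideal, hazard-free pipeline diagram.

    cells[c] holds the stage the instruction occupies in cycle c + 1,
    or an empty string when it is not in the pipeline.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i enters IF in cycle i + 1
        rows.append((name, cells))
    return rows


def render(rows):
    """Format the rows as fixed-width text, one instruction per line."""
    width = max(len(name) for name, _ in rows)
    return "\n".join(
        name.ljust(width) + " | " + " ".join(c.ljust(3) for c in cells)
        for name, cells in rows
    )


rows = pipeline_diagram(["lw", "add"])
print(render(rows))
```

Each row is shifted one cycle to the right of the row above it, which matches the staircase shape of the diagram.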
Pipelined Datapath
[Figure: pipelined datapath; the datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders, and muxes) separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
lw: Instruction Fetch (IF)
[Figure: pipelined datapath with the instruction fetch stage highlighted]
lw: Instruction Decode (ID)
[Figure: pipelined datapath with the instruction decode stage highlighted]
lw: Execution (EX)
[Figure: pipelined datapath with the execution stage highlighted]
lw: Memory (MEM)
[Figure: pipelined datapath with the memory stage highlighted]
lw: Writeback (WB)
[Figure: pipelined datapath with the writeback stage highlighted]
sw: Memory (MEM)
[Figure: pipelined datapath with the memory stage highlighted]
sw: Writeback (WB): do nothing
[Figure: pipelined datapath with the writeback stage highlighted]
Corrected Datapath (for lw)
[Figure: pipelined datapath in which the write register number is carried through the pipeline registers to MEM/WB, so that it reaches the register file together with the loaded data]
Pipelining Example
[Figure: pipelined datapath with five instructions in flight, one per stage]
IF:  add $14, $5, $6
ID:  lw $13, 24($1)
EX:  add $12, $3, $4
MEM: sub $11, $2, $3
WB:  lw $10, 20($1)
Pipeline Control
[Figure: pipelined datapath annotated with the control signals PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, and MemtoReg; the ALU control takes the 6-bit function field, and Instruction[20-16] and Instruction[15-11] feed the RegDst mux]
• Note that in this implementation, the branch is resolved in the MEM stage
Pipeline Control
• What needs to be controlled in each stage (IF, ID, EX, MEM, WB)?
  - IF: Instruction fetch and PC increment
  - ID: Instruction decode and operand fetch from the register file and/or immediate
  - EX: Execution stage
    • RegDst
    • ALUOp[1:0]
    • ALUSrc
  - MEM: Memory stage
    • Branch
    • MemRead
    • MemWrite
  - WB: Writeback
    • MemtoReg
    • RegWrite (note that this signal drives the register file, which sits in the ID stage)
Pipeline Control
• Extend pipeline registers to include control information created in ID stage
• Pass control signals along just like the data
              Execution/Address Calculation       Memory access stage         Write-back stage
              stage control lines                 control lines               control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc      Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format      1       1       0       0           0       0        0          1         0
lw            0       0       0       1           0       1        0          1         1
sw            X       0       0       1           0       0        1          0         X
beq           X       0       1       0           1       0        0          0         X

[Figure: Control decodes the instruction from IF/ID and writes the EX, M, and WB control groups into ID/EX; M and WB continue into EX/MEM, and WB continues into MEM/WB]
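The way each control group travels with its instruction can be sketched in a few lines. The bit strings follow the table above; the dictionary layout and the function name are illustrative assumptions, not a description of any particular implementation.

```python
# Bit order per group, as in the table:
#   EX: RegDst, ALUOp1, ALUOp0, ALUSrc
#   M:  Branch, MemRead, MemWrite
#   WB: RegWrite, MemtoReg
CONTROL = {
    "R-format": {"EX": "1100", "M": "000", "WB": "10"},
    "lw":       {"EX": "0001", "M": "010", "WB": "11"},
    "sw":       {"EX": "X001", "M": "001", "WB": "0X"},
    "beq":      {"EX": "X010", "M": "100", "WB": "0X"},
}


def control_in_pipeline_registers(instr_class):
    """Show which control groups each pipeline register still carries."""
    bits = CONTROL[instr_class]
    return {
        "ID/EX":  {"EX": bits["EX"], "M": bits["M"], "WB": bits["WB"]},
        "EX/MEM": {"M": bits["M"], "WB": bits["WB"]},  # EX bits already used
        "MEM/WB": {"WB": bits["WB"]},                  # M bits already used
    }


for register, groups in control_in_pipeline_registers("lw").items():
    print(register, groups)
```

Each group is dropped once its stage has consumed it, so the later pipeline registers are narrower than ID/EX.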
Datapath with Control
[Figure: pipelined datapath with the control unit; the EX, M, and WB control groups are stored in ID/EX, EX/MEM, and MEM/WB and drive PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, and MemtoReg]
Datapath with Control
IF: lw $10, 9($1)
[Figure: pipelined datapath with control; the first instruction is being fetched]
Datapath with Control
IF: sub $11, $2, $3
ID: lw $10, 9($1)
[Figure: pipelined datapath with control; Control output for “lw”: WB = 11, M = 010, EX = 0001]
Datapath with Control
IF: and $12, $4, $5
ID: sub $11, $2, $3
EX: lw $10, 9($1)
[Figure: pipelined datapath with control; Control output for “sub”: WB = 10, M = 000, EX = 1100; the lw control bits are now in ID/EX]
Datapath with Control
IF: or $13, $6, $7
ID: and $12, $4, $5
EX: sub $11, $2, $3
MEM: lw $10, 9($1)
[Figure: pipelined datapath with control; Control output for “and”: WB = 10, M = 000, EX = 1100]
Datapath with Control
IF: add $14, $8, $9
ID: or $13, $6, $7
EX: and $12, $4, $5
MEM: sub $11, ..
WB: lw $10, 9($1)
[Figure: pipelined datapath with control; Control output for “or”: WB = 10, M = 000, EX = 1100]
Datapath with Control
IF: xxxx
ID: add $14, $8, $9
EX: or $13, $6, $7
MEM: and $12…
WB: sub $11, ..
[Figure: pipelined datapath with control; Control output for “add”: WB = 10, M = 000, EX = 1100]
Datapath with Control
IF: xxxx
ID: xxxx
EX: add $14, $8, $9
MEM: or $13, ..
WB: and $12…
[Figure: pipelined datapath with control; the remaining instructions drain from the pipeline]
Datapath with Control
IF: xxxx
ID: xxxx
EX: xxxx
MEM: add $14, ..
WB: or $13…
[Figure: pipelined datapath with control; the remaining instructions drain from the pipeline]
Datapath with Control
IF: xxxx
ID: xxxx
EX: xxxx
MEM: xxxx
WB: add $14..
[Figure: pipelined datapath with control; the last instruction writes back]
Hazards
• It would be nice if we could simply split the datapath into stages and have the CPU work correctly
  - But things are not as simple as you might expect
  - There are hazards!
• A hazard is a situation that prevents starting the next instruction in the next cycle
  - Structural hazard
    • Conflict over the use of a resource at the same time
  - Data hazard
    • Data is not ready for the subsequent dependent instruction
  - Control hazard
    • Fetching the next instruction depends on the previous branch outcome
Structural Hazards
• A structural hazard is a conflict over the use of a resource at the same time
• Suppose the MIPS CPU has a single memory
  - Load/store requires data access in the MEM stage
  - Instruction fetch requires instruction access from the same memory
    • Instruction fetch would have to stall for that cycle
    • This would cause a pipeline “bubble”
• Hence, pipelined datapaths require either separate ports to memory or separate memories for instruction and data
[Figure: two organizations: a MIPS CPU connected to one memory through a single address bus and data bus, and a MIPS CPU connected to memory through two separate pairs of address and data buses]
Structural Hazards (Cont.)
[Figure: multi-cycle pipeline diagram over time 2 to 10 for lw, add, sub, add, each passing through IF, ID, EX, MEM, WB; the MEM stage of lw overlaps the IF stage of a later instruction]
• Either provide separate ports to access memory, or provide separate instruction and data memories
Data Hazards
• Data is not ready for the subsequent dependent instruction
add $s0,$t0,$t1
sub $t2,$s0,$t3
[Figure: pipeline diagram; add passes through IF, ID, EX, MEM, WB, while the dependent sub is delayed by two bubbles before it can use $s0]
• To solve the data hazard problem, the pipeline needs to be stalled (a stall is typically referred to as a “bubble”)
• The stall costs performance
• A better solution?
• Forwarding (or bypassing)
Forwarding
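add $s0,$t0,$t1
sub $t2,$s0,$t3
[Figure: pipeline diagram of the same add/sub pair (IF, ID, EX, MEM, WB each); the result of add is forwarded to the EX stage of sub, so the two bubbles are no longer needed]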
Data Hazard - Load-Use Case
• Can’t always avoid stalls by forwarding
  - Can’t forward backward in time!
• A hardware interlock is needed to stall the pipeline
lw $s0, 8($t1)
sub $t2,$s0,$t3
[Figure: pipeline diagram; lw passes through IF, ID, EX, MEM, WB; the loaded value is available only after MEM, so the dependent sub is delayed by one bubble]
• This bubble can be hidden by proper instruction scheduling
Code Scheduling to Avoid Stalls
• Reorder code to avoid using a load result in the next instruction
A = B + E; // B is loaded to $t1, E is loaded to $t2
C = B + F; // F is loaded to $t4

Original order (13 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
stall   add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        lw   $t4, 8($t0)
stall   add  $t5, $t1, $t4
        sw   $t5, 16($t0)

Reordered (11 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
        lw   $t4, 8($t0)
        add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        add  $t5, $t1, $t4
        sw   $t5, 16($t0)
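The 13-cycle and 11-cycle counts can be checked mechanically. The sketch below assumes a five-stage pipeline with full forwarding, where the only stall is one bubble when an instruction reads the register loaded by the lw immediately before it. The `parse` and `count_cycles` helpers are hypothetical and handle only the lw, sw, and add forms used in this example.

```python
import re


def parse(line):
    """Return (opcode, destination or None, list of source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":                  # sw rt, offset(rs): reads both, writes none
        return op, None, regs
    return op, regs[0], regs[1:]    # lw / add: the first register is the destination


def count_cycles(program, stages=5):
    """Cycles to run the program, counting one bubble per load-use pair."""
    instrs = [parse(line) for line in program]
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stages + len(instrs) - 1 + stalls


original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
print(count_cycles(original), count_cycles(reordered))  # 13 11
```

Moving the third lw up puts an independent instruction between each load and its first use, which removes both bubbles.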
Data Hazard - Forwarding
• Don’t wait for results to be written to the register file
  - Use temporary results

Time (in clock cycles)   CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2     10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM          X     X     X     -20   X       X     X     X     X
Value of MEM/WB          X     X     X     X     -20     X     X     X     X

Program execution order (in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: multi-cycle pipeline diagram of the five instructions, each passing through IM, Reg, ALU, DM, Reg, with the dependences on $2 marked]

• OK, then do we have to do this forwarding?
  1. If the write to the register file occurs in the first half of the clock cycle, and the read occurs in the second half, then?
     • Our textbook follows this
  2. If you are asked to design a CPU using only the rising edge of the clock, then?
     • Let’s stick to this for our project
Forwarding
[Figure: register file, ID/EX, ALU, EX/MEM, data memory, MEM/WB, and the writeback MUX, without forwarding paths]
Forwarding (from EX/MEM)
[Figure: the same datapath with MUXes on the ALU inputs; the ALU result held in EX/MEM is routed back to the ALU input MUXes]
Forwarding (from MEM/WB)
[Figure: the same datapath; the value selected by the writeback MUX after MEM/WB is routed back to the ALU input MUXes]
Forwarding (operand selection)
[Figure: the same datapath with a forwarding unit that controls the ALU input MUXes]
Forwarding (operand propagation)
[Figure: the same datapath; Rs, Rt, and Rd are carried in ID/EX, and the forwarding unit compares Rs and Rt with EX/MEM Rd and MEM/WB Rd]
Forwarding
[Figure: pipelined datapath with control and a forwarding unit; IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are latched into ID/EX as Rs, Rt, and Rd, and the forwarding unit also receives EX/MEM.RegisterRd and MEM/WB.RegisterRd]
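The comparisons made by the forwarding unit can be sketched as a function. This follows the usual textbook-style conditions rather than anything spelled out on the slides. The function name, the argument names, and the "10"/"01"/"00" select encoding are assumptions made for illustration.

```python
def forward_select(id_ex_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Return the mux select for one ALU operand (Rs or Rt held in ID/EX).

    "10": take the ALU result held in EX/MEM
    "01": take the value being written back from MEM/WB
    "00": take the value read from the register file
    """
    # The most recent producer (EX/MEM) takes priority over the older one (MEM/WB).
    # Register 0 is never forwarded because it always reads as zero in MIPS.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "10"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "01"
    return "00"


# sub $2, $1, $3 followed by and $12, $2, $5:
# when "and" is in EX, "sub" is in MEM, so Rs = $2 comes from EX/MEM.
print(forward_select(2, True, 2, False, 0))   # 10
# or $13, $6, $2 two instructions after the sub:
# when "or" is in EX, "sub" is in WB, so Rt = $2 comes from MEM/WB.
print(forward_select(2, True, 12, True, 2))   # 01
```

The function would be evaluated once for Rs and once for Rt, giving the two mux selects in the figure.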
Can't always forward
• lw (load word) can still cause a hazard
  - An instruction tries to read a register following a load instruction that writes to the same register
Program execution order (in instructions)
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: multi-cycle pipeline diagram (CC 1 to CC 9) of the five instructions, each passing through IM, Reg, ALU, DM, Reg; the and instruction needs $2 before lw has read it from data memory]
• Thus, we need a hazard detection unit to “stall” the pipeline after the load instruction
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
Program execution order (in instructions)
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: multi-cycle pipeline diagram (CC 1 to CC 10); and is held in ID and or is held in IF for one extra cycle, and a bubble follows lw down the pipeline]
Hazard Detection Unit
• Stall the pipeline if the instruction in ID/EX is a load and its rt matches a source register of the instruction in IF/ID (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)
  - Stall by letting an instruction that won’t write anything go forward
[Figure: pipelined datapath with control, a forwarding unit, and a hazard detection unit; the hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite, and the mux that selects 0 instead of the control signals]
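The stall condition can be written out directly. This is a sketch: the function names and the `zero_control` key are hypothetical, while PCWrite and IF/IDWrite are the signal names from the figure.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register loaded by the lw in EX."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


def stall_outputs(stall):
    """Signals driven by the hazard detection unit for one cycle."""
    return {
        "PCWrite": not stall,      # hold the PC so the same instruction is fetched again
        "IF/IDWrite": not stall,   # hold IF/ID so the same instruction is decoded again
        "zero_control": stall,     # send all-zero control (a bubble) into ID/EX
    }


# lw $2, 20($1) in EX and and $4, $2, $5 in ID: rt of lw is 2, rs of and is 2
stall = load_use_stall(True, 2, 2, 5)
print(stall, stall_outputs(stall))
```

Holding the PC and IF/ID keeps the dependent instruction in ID, while zeroing the control signals makes the slot ahead of it harmless.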
Control Hazard
• A branch determines the flow of instructions
• Fetching the next instruction depends on the branch outcome
  - The pipeline can’t always fetch the correct instruction
  - The branch instruction is still in the ID stage when the next instruction is fetched
beq $1,$2,L1
add $1,$2,$3
sw $1, 4($2)
…
L1: sub $1,$2, $3
[Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the code above; annotations mark where the taken target address is known and where the branch is resolved; two bubbles follow the branch before the next instruction is fetched based on the comparison result]
Reducing Control Hazard
• To reduce 2 bubbles to 1 bubble, add hardware in the ID stage to compare registers (and generate the branch condition)
  - But this requires additional forwarding and hazard detection logic. Why?
beq $1,$2,L1
add $1,$2,$3
…
L1: sub $1,$2, $3
[Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the code above; annotations mark where the taken target address is known and where the branch is resolved; one bubble follows the branch before the next instruction is fetched based on the comparison result]
Delayed Branch
• Many CPUs adopt a technique called the delayed branch to further reduce the stall
  - A delayed branch always executes the next sequential instruction
    • The branch takes effect after that one-instruction delay
  - The delay slot is the slot right after a delayed branch instruction
beq $1,$2,L1
add $1,$2,$3 (delay slot)
…
L1: sub $1,$2, $3
[Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the code above; annotations mark where the taken target address is known and where the branch is resolved; the delay-slot instruction runs with no bubble, and the next instruction is fetched based on the comparison result]
Delay Slot (Cont.)
• The compiler needs to schedule a useful instruction in the delay slot, or fill it with a nop (no operation)
// $s1 = a, $s2 = b, $s3 = c
// $t0 = d, $t1 = f
a = b + c;
if (d == 0) {f = f + 1;}
f = f + 2;

Delay slot filled with a nop:
    add  $s1, $s2, $s3
    bne  $t0, $zero, L1
    nop                    // delay slot
    addi $t1, $t1, 1
L1: addi $t1, $t1, 2

• Can we do better? Fill the delay slot with a useful and valid instruction:
    bne  $t0, $zero, L1
    add  $s1, $s2, $s3     // delay slot
    addi $t1, $t1, 1
L1: addi $t1, $t1, 2
Branch Prediction
• Longer pipelines (implemented in Core 2 Duo, for example) can’t readily determine the branch outcome early
  - The stall penalty becomes unacceptable since branch instructions are used so frequently in programs
• Solution: Branch Prediction
  - Predict the branch outcome in hardware
  - Flush the instructions (that shouldn’t have been executed) from the pipeline if the prediction turns out to be wrong
  - Modern processors use sophisticated branch predictors
MIPS with Predict-Not-Taken
[Figure: two pipeline diagrams: when the prediction is correct, execution continues without delay; when the prediction is incorrect, the instruction that shouldn’t be executed is flushed]
Control Hazards - Branch
• When the branch condition is resolved, other instructions are already in the pipeline
Program execution order (in instructions)
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Figure: multi-cycle pipeline diagram (CC 1 to CC 9) of the instructions above, each passing through IM, Reg, ALU, DM, Reg]
• Note that in this implementation, the branch is resolved in the MEM stage
• We are predicting “branch not taken”
  • If we are wrong (if the branch is taken), flush the instructions
Alleviate Branch Hazards
• Reduce the penalty to 1 cycle
  - Move the branch compare to the ID stage of the pipeline
  - Add an adder to calculate the branch target in the ID stage
  - Add the IF.Flush signal that zeros (squashes) the instruction in the IF/ID pipeline register
beq $1,$2,L1
add $1,$2,$3
…
L1: sub $1,$2, $3
[Figure: pipeline diagram (IF, ID, EX, MEM, WB) of the code above; annotations mark where the taken target address is known and where the branch is resolved; one bubble follows the branch]
Flushing Instructions
[Figure: pipelined datapath with the branch comparison (=) and target adder in the ID stage, the IF.Flush signal into IF/ID, the hazard detection unit, and the forwarding unit]
Flushing Instructions (cycle N)
beq $1, $3, L2
and $12, $2, $5
or $13, $12, $1
…
L2:
lw $4, 40($7)
[Figure: the same datapath in cycle N; beq $1, $3, L2 is in ID and and $12, $2, $5 is in IF]
Flushing Instructions (cycle N)
[Figure: the same datapath in cycle N; the branch target L2 is selected as the next PC]
Flushing Instructions (cycle N+1)
[Figure: the same datapath in cycle N+1; lw $4, 40($7) is in IF, a nop has replaced the flushed and instruction in ID, and beq $1, $3, L2 is in EX]
Supporting Multiple FP Operations
[Figure: IF and ID feed four parallel execution paths that rejoin at MEM and WB:
  Integer unit: EX
  FP multiplier: M1 to M7 (7 cycles)
  FP add: A1 to A4 (4 cycles)
  FP divider (non-pipelined): 24 cycles]
• Complicates bypassing
• Potential structural hazard
• Multiple (FP) instructions can complete at the same time
  - RF might need to be multi-ported
  - Ordering issue: who gets to update the register?
• Out-of-order completion/retirement: precise exception issue
Modified from Prof Sean Lee’s Slide
Bypassing & Forwarding
Clock cycle       1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
L.D   F4,0(R2)    IF  ID  EX  M   WB
MUL.D F0,F4,F6        IF  ID  S   M1  M2  M3  M4  M5  M6  M7  M   WB
ADD.D F2,F0,F8            IF  S   ID  S   S   S   S   S   S   A1  A2  A3  A4  M   WB
S.D   F2,0(R2)                    IF  S   S   S   S   S   S   ID  EX  S   S   S   M   WB
(S = stall)
Structural Hazards
Clock cycle       1   2   3   4   5   6   7   8   9   10  11
MUL.D F0,F4,F6    IF  ID  M1  M2  M3  M4  M5  M6  M7  M   WB
. . . .               IF  ID  EX  M   WB
. . . .                   IF  ID  EX  M   WB
ADD.D F2,F4,F6                IF  ID  A1  A2  A3  A4  M   WB
. . . .                           IF  ID  EX  M   WB
. . . .                               IF  ID  EX  M   WB
L.D   F2,0(R2)                            IF  ID  EX  M   WB
• Write to the register file in the same cycle (cc11)
• Write to the same register (WAW)
• MEM in cc10
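The write-back collision in cc11 can be found by computing each instruction's WB cycle. This is a sketch under stated assumptions: one instruction is fetched per cycle with no stalls, the unnamed ". . . ." rows are modeled as integer instructions ("INT"), and the table and function names are illustrative.

```python
from collections import defaultdict

# Cycles spent between ID and MEM: 7-stage FP multiplier, 4-stage FP adder,
# one EX cycle for loads and integer instructions.
EX_CYCLES = {"MUL.D": 7, "ADD.D": 4, "L.D": 1, "INT": 1}


def wb_cycle(op, fetch_cycle):
    """Cycle in which WB happens if IF is in fetch_cycle and nothing stalls."""
    # ID (1 cycle) + execute stages + MEM (1 cycle) + WB (1 cycle)
    return fetch_cycle + 1 + EX_CYCLES[op] + 1 + 1


def wb_conflicts(program):
    """Map each WB cycle used by more than one instruction to those instructions."""
    by_cycle = defaultdict(list)
    for fetch_cycle, (op, dest) in enumerate(program, start=1):
        by_cycle[wb_cycle(op, fetch_cycle)].append((op, dest))
    return {cycle: ins for cycle, ins in by_cycle.items() if len(ins) > 1}


program = [
    ("MUL.D", "F0"), ("INT", None), ("INT", None),
    ("ADD.D", "F2"), ("INT", None), ("INT", None),
    ("L.D", "F2"),
]
print(wb_conflicts(program))
```

Three instructions reach WB in cycle 11, and two of them (ADD.D and L.D) target F2, which is the WAW case listed above.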
Precise Exception Issue
DIV.D F0,F2,F4       (exception!)
ADD.D F3,F10,F8      (completed)
SUB.D F12,F12,F14    (completed)
• Precise exception: if the pipeline can (or must) be stopped
  - All the instructions before the faulty (or intended) instruction must be completed
  - All the instructions after it must not be completed
  - Restart the execution from the faulty (or intended) instruction
• State must be consistent with the original program order
• Not straightforward with out-of-order completion
• Simple solution: stall until earlier long-latency instructions are guaranteed not to raise an exception
• Other modern solution: ROB (will dedicate a lecture to it)
Scalar Pipeline (Baseline)
[Figure: instruction sequence versus execution cycle (1 to 6) for a five-stage pipeline: IF, DE, EX, MEM, WB]
Modified from Prof Sean Lee’s Slide
Superpipeline
• Deeper pipelining is called superpipelining
• Cache access is particularly time critical, so the extra pipeline stages come from decomposing the memory access
• A deeper pipeline allows for higher clock rates
[Figure: instruction sequence versus execution cycle for a superpipelined processor; each of the IF, DE, EX, MEM, and WB stages is split into several shorter sub-stages (I, D, E, M, W), and successive instructions start one sub-stage apart]
Modified from Prof Sean Lee’s Slide
MIPS R4000 Pipeline
• Deeper pipeline (superpipelining)
• 2-cycle delay for loads
• Predicted-not-taken strategy
  - Not-taken (fall-through) branch: 1 delay slot
  - Taken branch: 1 delay slot + 2 idle cycles
[Figure: eight-stage pipeline IF, IS, RF, EX, DF, DS, TC, WB; IF and IS access the instruction memory, RF reads the registers, EX uses the ALU (branch target and condition evaluation), DF, DS, and TC access the data memory, and WB writes the registers]
Prof Sean Lee’s Slide
[Figure: R4000 pipeline diagram (IF, IS, RF, EX, DF, DS, TC, WB) over CC1 to CC11 showing LD R1, Inst 1, Inst 2, two bubbles, and the dependent ADD R2, R1; load delay (2 cycles)]
Modified from Prof Sean Lee’s Slide
Branches (Predicted-not-taken)
[Figure: R4000 pipeline diagrams (IF, IS, RF, EX, DF, DS, TC, WB) over CC1 to CC11 for the two actual branch directions:
  Not taken: Branch, Delay slot, Branch inst+2, Branch inst+3 proceed without stalls
  Taken: Branch, Delay slot, Stall, Stall, then Branch Target]
Prof Sean Lee’s Slide