Pipelining

advertisement
CS2100 Computer Organisation
Pipelining
(AY2015/6) Semester 1
Road Map: Part II
Performance
Assembly
Language
Processor:
Datapath
Processor:
Control
Pipelining
Cache

Pipelining




Pipelining
MIPS Pipeline
Pipeline datapath &
control
Pipeline hazards
Branch prediction
2
Pipelining: Overview

MIPS Pipeline Stages

Pipelined Datapath and Control

Pipeline Hazards





Structural
Data
Control
Forwarding
Branch prediction
CS2100
Pipelining
3
Inspiration: Assembly Line
CS2100
Pipelining
4
The Laundry Example

Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, fold and stash

Washer takes 30 minutes

Dryer takes 30 minutes

“Folder” takes 30 minutes

“Stasher” takes 30 minutes
to put clothes into drawers
CS2100
Pipelining
A
B
C
D
5
Sequential Laundry
6 PM
T
a
s
k
O
r
d
e
r
7
8
10
9
11
12
1
2 AM
30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
A
Time
B
C
D



Sequential laundry takes 8 hours for 4 loads
Steady state: 1 load every 2 hours
If they learned pipelining, how long would laundry
take?
CS2100
Pipelining
6
Pipelined Laundry
6 PM
T
a
s
k
O
r
d
e
r
7
8
10
9
30 30 30 30 30 30 30
11
12
1
2 AM
Time
A
B
C
D




Pipelined laundry takes 3.5 hours for 4 loads!
Steady state: 1 load every 30 min
Potential speedup = 2 hr/30 min = 4 (# stages)
Time to fill pipeline takes 2 hours  speedup 
CS2100
Pipelining
7
What IF: Slow Dryer
6 PM
T
a
s
k
7
8
9
10
11
12
30 30 30 30 30 30 30 30 30 30 30
1
2 AM
Time
A
B
O
r
d
e
r
C
D



Pipelined laundry now takes 5.5 hours!
Steady state: One load every 1 hr (dryer speed)
Pipeline rate is limited by the slowest stage
CS2100
Pipelining
8
What IF: Dependency
6 PM
T
a
s
k
7
8
10
9
30 30 30 30 30 30 30 30
11
12
1
2 AM
Time
A
Dependency
B
O
r
d
e
r
Stall
C
D


Brian is using the laundry for the first time; he wants to
see the outcome of one wash + dry cycle first before
putting in his clothes
Pipelined laundry now takes 4 hours
CS2100
Pipelining
9
Pipelining Lessons

Pipelining doesn’t help latency of single
task:

It helps the throughput of entire workload

Multiple tasks operating simultaneously
using different resources

Possible delays:

Pipeline rate limited by slowest pipeline stage

Stall for dependencies ( more on this later )
CS2100
Pipelining
10
Datapath and Control
MIPS PIPELINE
CS2100
Pipelining
11
MIPS Pipeline Stages (1/2)

Five Execution Stages






Idea:



IF: Instruction Fetch
ID: Instruction Decode and Register Read
EX: Execute an operation or calculate an address
MEM: Access an operand in data memory
WB: Write back the result into a register
Each execution stage takes 1 clock cycle
General flow of data is from one stage to the next
Exceptions:

CS2100
Update of PC and write back of register file – more
about this later…
Pipelining
12
MIPS Pipeline Stages (2/2)
IF: Instruction fetch
ID: Instruction decode/
register file read
EX: Execute/
address calculation
MEM: Memory access
WB: Write back
0
M
u
x
1
Add
4
Add
Add
result
Shift
left 2
PC
Read
register 1
Address
Instruction
Instruction
memory
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
CS2100
Sign
extend
Read
data
1
M
u
x
0
32
Pipelining
13
Pipelined Execution: Illustration
Time
IF
ID
EX
MEM
IF
ID
EX
MEM
IF
ID
EX
MEM
IF
ID
EX
MEM
IF
ID
EX
MEM
IF
ID
EX
Program Flow
(Instructions)
WB
WB
WB
WB
WB
MEM
WB
Several instructions
are in the pipeline
simultaneously!
CS2100
Pipelining
14
MIPS Pipeline: Datapath(1/3)

Single-cycle implementation:


Update all state elements (PC, register file, data
memory) at the end of a clock cycle
Pipelined implementation:


CS2100
One cycle per pipeline stage
Data required for each stage needs to be stored
separately (why?)
Pipelining
15
MIPS Pipeline: Datapath(2/3)

Data used by subsequent instructions:


Data used by same instruction in later pipeline
stages:






Store in programmer-visible state elements: PC, register
file and memory
Additional registers in datapath called pipeline
registers
IF/ID: register between IF and ID
ID/EX: register between ID and EX
EX/MEM: register between EX and MEM
MEM/WB: register between MEM and WB
Why no register at the end of WB stage?
CS2100
Pipelining
16
MIPS Pipeline: Datapath (3/3)
0
M
u
x
1
Coloured rectangles are pipeline registers
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
Add result
4
PC
Address
Instruction
memory
Instruction
Shift
left 2
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
CS2100
Sign
extend
Read
data
1
M
u
x
0
32
Pipelining
17
Pipeline Datapath: IF Stage

At the end of a cycle, IF/ID receives (stores):



Instruction read from DataMemory[ PC ]
PC + 4
PC + 4

CS2100
Also connected to one of the MUX's inputs (another coming later)
Pipelining
18
Pipeline Datapath: ID Stage
At the beginning of a cycle
IF/ID register supplies:
At the end of a cycle
ID/EX receives:
 Register numbers for reading
two registers
 16-bit offset to be signextended to 32-bit
CS2100
Pipelining
 Data values read from
register file
 32-bit immediate value
 PC + 4
19
Pipeline Datapath: EX Stage
At the beginning of a cycle
ID/EX register supplies:
At the end of a cycle
EX/MEM receives:




 Data values read from
register file
 32-bit immediate value
 PC + 4
CS2100
Pipelining
(PC + 4) + (Immediates x 4)
ALU result
isZero? signal
Data Read 2 from register file
20
Pipeline Datapath: MEM Stage
At the beginning of a cycle
EX/MEM register supplies:




CS2100
At the end of a cycle
MEM/WB receives:
 ALU result
 Memory read data
(PC + 4) + (Immediates x 4)
ALU result
isZero? signal
Data Read 2 from register file
Pipelining
21
Pipeline Datapath: WB Stage
At the beginning of a cycle
MEM/WB register supplies:
At the end of a cycle
 ALU result
 Memory read data
CS2100
 Result is written back to register
file (if applicable)
 There is a bug here…….
Pipelining
22
Corrected Datapath (1/2)


Observe the “Write register” number

Supplied by the IF/ID pipeline register

It is NOT the correct write register for the instruction
now in WB stage!
Solution:


CS2100
Pass “Write register” number from ID/EX through
EX/MEM to MEM/WB pipeline register for use in WB
stage
i.e. let the "Write register" number follows the
instruction through the pipeline until it is needed in WB
stage
Pipelining
23
Corrected Datapath (2/2)
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add Add
result
4
PC
Address
Instruction
memory
Instruction
Shift
left 2
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
CS2100
Sign
extend
Read
data
1
M
u
x
0
32
Pipelining
24
Pipeline Control: Main Idea


Same control signals as single-cycle datapath
Difference: Each control signal belongs to a particular pipeline stage
PCSrc
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
result
Add
4
Branch
Shift
left 2
PC
Address
Instruction
memory
Instruction
RegWrite
Read
register 1
MemWrite
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
ALUSrc
Zero
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Data
memory
Write
Read
data
1
M
u
x
0
data
Instruction
16
[15– 0]
Sign
extend
32
6
ALU
control
MemRead
Instruction
[20– 16]
Instruction
[15– 11]
0
M
u
x
1
ALUOp
RegDst
CS2100
Pipelining
25
Pipeline Control: Try it!
PCSrc
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
result
Add
4
Branch
Shift
left 2
PC
Address
Instruction
memory
Instruction
RegWrite
Read
register 1
MemWrite
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
ALUSrc
Zero
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Data
memory
Write
Try associating
each control
signal with the
correct pipeline
stage.
CS2100
Read
data
1
M
u
x
0
data
Instruction
16
[15– 0]
Sign
extend
32
6
ALU
control
MemRead
Instruction
[20– 16]
0
M
u
x
1
Instruction
[15– 11]
ALUOp
RegDst
Pipelining
26
Pipeline Control: Grouping

Group control signals according to pipeline stage
RegDst
ALUSrc
MemTo
Reg
Reg
Write
Mem
Read
Mem
Write
Branch
R-type
1
0
0
1
0
0
lw
0
1
1
1
1
sw
X
1
X
0
beq
X
0
X
0
op1
op0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
1
0
1
EX Stage
MEM Stage
ALUop
ALUop
WB Stage
op1
op0
Mem
Read
0
1
0
0
0
0
0
1
0
1
0
0
1
0
0
1
1
sw
X
1
0
0
0
1
0
X
0
beq
X
0
0
1
0
0
1
X
0
RegDst
ALUSrc
R-type
1
lw
CS2100
Pipelining
Mem
Write
Branch
MemTo
Reg
Reg
Write
27
Pipeline Control: Implementation
Instruction
Control
WB
Signal used in
EX stage and
onwards
M
WB
EX
M
Signal
propagated
along until
utilized
Signal
generated
in ID stage
IF/ID
CS2100
WB
ID/EX
EX/MEM
Pipelining
MEM/WB
28
MIPS Pipeline: Datapath and Control
PCSrc
ID/EX
0
M
u
x
1
WB
Control
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
Add
Add
Add result
Instruction
memory
ALUSrc
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Branch
Shift
left 2
MemWrite
PC
Instruction
RegWrite
4
Address
Data
memory
Read
data
Write
data
Instruction 16
[15– 0]
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
ALU
control
0
M
u
x
1
1
M
u
x
0
MemRead
ALUOp
RegDst
CS2100
Pipelining
29
Exercise #1: Pipeline Control Signals

Indicate the control signals utilized in each cycle for the
following sequence of instructions: add, sub, lw, sw
Cycle
RegDst
ALUSrc
ALUOp1
ALUOp0
1
2
3
1
0
1
0
1
0
1
0
4
5
6
7
MemRead
MemWrite
0
0
MemToReg
RegWrite
8

CS2100
Pipelining
30
PIPELINE PERFORMANCE
CS2100
Pipelining
31
Single Cycle Processor: Performance

Cycle time:




𝐶𝑇𝑠𝑒𝑞 =
𝑁
𝑘=1 𝑇𝑘
𝑇𝑘 = Time for operation in stage k
N = Number of stages
Total Execution Time for I instructions:

𝑇𝑖𝑚𝑒𝑠𝑒𝑞 = Cycles × 𝐶𝑦𝑐𝑙𝑒Time
= I × 𝐶𝑇𝑠𝑒𝑞 = 𝑰 ×
CS2100
Pipelining
𝑵
𝒌=𝟏 𝑻𝒌
32
Single Cycle Processor: Example

Instruction
Inst
Mem
Reg
read
ALU
ALU
2
1
2
lw
2
1
2
2
sw
2
1
2
2
beq
2
1
2
Reg
write
Total
1
6
1
8
7
5
Cycle Time


Data
Mem
Choose the longest total time = 8ns
To execute 100 instructions:

CS2100
100 x 8ns = 800ns
Pipelining
33
Multi-Cycle Processor: Performance

Cycle time:


𝐶𝑇𝑚𝑢𝑙𝑡𝑖 =max 𝑇𝑘
max 𝑇𝑘 = longest stage duration among
the N stages

Total Execution Time for I instructions:

𝑇𝑖𝑚𝑒𝑚𝑢𝑙𝑡𝑖 = Cycles × 𝐶𝑦𝑐𝑙𝑒Time
= I × 𝑨𝒗𝒆𝒓𝒂𝒈𝒆 𝑪𝑷𝑰 × 𝐶𝑇𝑚𝑢𝑙𝑡𝑖

CS2100
Average CPI is needed because each instruction
take different number of cycles to finish
Pipelining
34
Multi-Cycle Processor: Example

Instruction
Inst
Mem
Reg
read
ALU
ALU
2
1
2
lw
2
1
2
2
sw
2
1
2
2
beq
2
1
2
Reg
write
Total
1
6
1
8
7
5
Cycle Time:


Data
Mem
Choose the longest stage time = 2ns
To execute 100 instructions, with a given
average CPI of 4.6

CS2100
100 x 4.6 x 2ns = 920ns
Pipelining
35
Pipeline Processor: Performance


Cycle Time:

𝐶𝑇𝑝𝑖𝑝𝑒 = max 𝑇𝑘 + 𝑇𝑑

max 𝑇𝑘 = longest time among the N stages

𝑇𝑑 = Overhead for pipelining, e.g. pipeline register
Cycles needed for I instructions:



I+N−1
N − 1 is the cycles wasted in filling up the pipeline
Total Time needed for I instructions :

CS2100
𝑇𝑖𝑚𝑒𝑝𝑖𝑝𝑒 = 𝐶𝑦𝑐𝑙𝑒 × 𝐶𝑇𝑝𝑖𝑝𝑒
= I + N − 1 × ( max 𝑇𝑘 + 𝑇𝑑 )
Pipelining
36
Pipeline Processor: Example


Instruction
Inst
Mem
Reg
read
Data
Mem
ALU
ALU
2
1
2
lw
2
1
2
2
sw
2
1
2
2
beq
2
1
2
Reg
write
Total
1
6
1
8
7
5
Cycle Time

assume pipeline register delay of 0.5ns

longest stage time + overhead = 2 + 0.5 = 2.5ns
To execute 100 instructions:

CS2100
(100 + 5 – 1) x 2.5ns = 260ns
Pipelining
37
Pipeline Processor: Ideal Speedup

Assumptions for ideal case:




Every stage takes the same amount of time:
 𝑁
𝑘=1 𝑇𝑘 = 𝑁 × 𝑇𝑘
No pipeline overhead  𝑇𝑑 = 0
Number of instructions I, is much larger than
number of stages, N
Note: The above also shows how pipeline
processor loses performance
CS2100
Pipelining
38
Pipeline Processor: Ideal Speedup

𝑆𝑝𝑒𝑒𝑑𝑢𝑝𝑝𝑖𝑝𝑒 =
𝑇𝑖𝑚𝑒𝑝𝑖𝑝𝑒
I× 𝑁
𝑘=1 𝑇𝑘
I+N−1 ×( max 𝑇𝑘 + 𝑇𝑑 )
=
=
𝐼×𝑁×𝑇𝑘
𝐼+𝑁−1 ×𝑇𝑘
≈
𝐼×𝑁×𝑇𝑘
𝐼×𝑇𝑘
≈𝑵
CS2100
𝑇𝑖𝑚𝑒𝑠𝑒𝑞
Conclusion:
Pipeline processor can gain N
times speedup, where N is the
number of pipeline stages
Pipelining
39
Summary for different implementations
Single-Cycle
Cycle 1
Cycle 2
Clk
Load
Store
Waste
Multi-Cycle
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
Load
IF
Store
ID
EX
MEM
WB
EX
MEM
WB
IF
R-type
ID
EX
MEM
IF
Pipeline
Load IF
Store
ID
IF
ID
R-type IF
CS2100
EX
ID
MEM
EX
WB
MEM
Pipelining
WB
40
Review Question

Given this code:
add
sub
sll
srl
$t0,
$t1,
$t2,
$t3,
$s0,
$s0,
$s0,
$s1,
$s1
$s1
2
2
a) How many cycles will it take to execute the code on a
single-cycle datapath?
b) How long will it take to execute the code on a singlecycle datapath, assuming a 100 MHz clock?
c) How many cycles will it take to execute the code on a
5-stage MIPS pipeline?
d) How long will it take to execute the code on a 5-stage
MIPS pipeline, assuming a 500 MHz clock?

CS2100
Pipelining
41
Grinding to a halt…..
PIPELINE HAZARDS
CS2100
Pipelining
42
Pipeline Hazards

Speedup from pipeline implementation:


Based on the assumption that a new instructions can
be “pumped” into pipeline every cycle
However, there are pipeline hazards


Problems that prevent next instruction from
immediately following previous instruction
Structural hazards:


Data hazards:


Data dependencies between instructions
Control hazards:

CS2100
Simultaneous use of a hardware resource
Instruction
Dependencies
Change in program flow
Pipelining
43
Graphical Notation for Pipeline
Time (in clock cycles)
Program
execution
order
(in instructions)
lw $10, 20($1)
sub $11, $2, $3


CC 1
CC 2
CC 3
CC 4
CC 5
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
CC 6
Reg
Horizontal = the stages of an instruction
Vertical = the instructions in different pipeline stages
CS2100
Pipelining
44
Structure Hazard: Example

If there is only a single memory module:
Time (clock cycles)
Instruction Order
CS2100
Reg
Reg
Mem
Reg
Mem
Reg
Mem
Reg
Mem
Reg
ALU
Inst 3
Mem
Mem
ALU
Inst 2
Reg
ALU
Inst 1
Mem
ALU
Load
Load and Inst 3
access memory in
the same cycle!
Mem
Pipelining
Reg
45
Solution 1: Stall the Pipeline
Time (clock cycles)
Instruction Order
Stall
Mem
Mem
Reg
Reg
Mem
Reg
Mem
Reg
ALU
Inst 2
Reg
ALU
Inst 1
Mem
ALU
Load
Mem
Reg
Bubble
Bubble
Bubble
Bubble
CS2100
Pipelining
Reg
ALU
Mem
Inst 3
Delay (Stall)
Inst 3 for 1
cycle
Bubble
Mem
Reg
46
Solution 2: Separate Memory

Split memory into:

Data and Instruction memory
Time (clock cycles)
Instruction Order
Reg
Reg
Mem
Reg
Mem
Reg
Mem
Reg
Mem
Reg
ALU
Inst 3
Mem
Mem
ALU
Inst 2
Reg
ALU
Inst 1
Mem
ALU
Load
Mem
Load uses Data Memory
Reg
Inst 3 uses Instr. Memory
CS2100
Pipelining
47
Quiz: Is there another conflict?
Time (clock cycles)
Instruction Order
CS2100
Mem
Reg
Mem
Reg
Mem
Reg
Reg
Mem
Mem
Reg
ALU
Inst 3
Mem
ALU
Inst 2
Reg
ALU
Inst 1
Mem
ALU
Inst 0
Pipelining
Reg
Mem
Reg
48
Instruction Dependencies

Instructions can have relationship that prevent pipeline
execution:


When different instructions accesses (read/write) the same
register



Register contention is the cause of dependency
Known as data dependency
When the execution of an instruction depends on another
instruction



Although a partial overlap maybe possible in some cases
Control flow is the cause of dependency
Known as control dependency
Failure to handle dependencies can affect program
correctness!
CS2100
Pipelining
49
Data Dependency: RAW

"Read-After-Write" Definition:


Occurs when a later instruction reads from the
destination register written by an earlier
instruction
Also known as true data dependency
i1: add $1, $2, $3 #writes to $1
i2: sub $4, $1, $5 #reads from $1

Effect of incorrect execution:

CS2100
If i2 reads register $1 before i1 can write back the result,
i2 will get a stale result (old result)
Pipelining
50
Other Data Dependencies

Similarly, we have:


WAR: Write-after-Read dependency
WAW: Write-after-Write dependency

Fortunately, these dependencies do not
cause any pipeline hazards

They affects the processor only when
instructions are executed out of program
order:

CS2100
i.e. in Modern SuperScalar Processor
Pipelining
51
RAW Dependency: Hazards?

Suppose we are executing the following code
fragment:
sub
and
or
add
sw
$2, $1, $3
$12, $2, $5
$13, $6, $2
$14, $2, $2
$15, 100($2)
#i1
#i2
#i3
#i4
#i5

Note the multiple uses of register $2

Question:

CS2100
Which are the instructions require special handling?
Pipelining
52
RAW Data Hazards: Illustration

Value from prior instruction is needed before write back
T im e (in clock cycle s)
CC 1
V alue of
re giste r $ 2 :
10
sub $ 2 , $ 1 , $ 3
IM
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
10
10
1 0 /– 2 0
– 20
– 20
– 20
– 20
DM
Reg
Reg
Data
dependency
a nd $ 1 2 , $ 2 , $ 5
or $ 1 3 , $6 , $ 2
a dd $ 1 4, $ 2 , $ 2
IM
DM
R eg
IM
sw $ 1 5, 1 0 0 ($ 2 )
CS2100
DM
R eg
IM
No problem
R eg
DM
Reg
IM
Pipelining
R eg
R eg
R eg
DM
Reg
53
Observations

Questions:

When is the result from sub instruction actually
produced?


When is the data actually needed by and?


Beginning of and’s EX stage or clock cycle 4
When is the data actually needed by or?


End of EX stage for sub or clock cycle 3
Beginning of or’s EX stage or clock cycle 5
Solution:
Forward the result to any trailing (later)
instructions before it is reflected in register file
 Bypass (replace) the data read from register file

CS2100
Pipelining
54
Solution: Forwarding
Time (in clock cycles)
CC 1
Value of register $2 : 10
Value of EX/MEM : X
Value of MEM/WB : X
sub $2, $1, $3
IM
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
X
X
10
X
X
10
– 20
X
10/– 20
X
– 20
– 20
X
X
– 20
X
X
– 20
X
X
– 20
X
X
DM
Reg
Reg


and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
CS2100
IM
Reg
IM
DM
Reg
IM
Reg
DM
IM
Pipelining
Reg
DM
Reg
Reg
Forward results from
one stage to another
Bypass data read
from register file
Reg
DM
Reg
55
Data Hazard: LOAD Instruction
T im e (in cloc k cycle s )
lw $ 2, 20 ($ 1)
an d $4 , $ 2, $5
or $8 , $ 2, $6
ad d $9 , $ 4, $2
slt $ 1, $6 , $ 7
CS2100
CC 1
CC 2
IM
R eg
IM
CC 3
CC 4
CC 5
DM
R eg
R eg
IM
DM
R eg
IM
CC 6
Reg
DM
R eg
IM
Pipelining
CC 7
CC 9
Cannot solve with
forwarding!
Data is needed before it
is actually produced!
Reg
DM
Reg
CC 8
Reg
DM
Re g
56
Data Hazard: LOAD Instruction Solution
Program
Time(inclock cycles)
execution
CC1
CC2
order
(ininstructions)
lw$2, 20($1)
and$4, $2, $5
or $8, $2, $6
add$9, $4, $2
IM
CC3
Reg
IM
CC4
CC5
DM
Reg
Reg
Bubble
IM
Bubble
Reg
IM
CC8
CC9
CC10
Reg
DM
Reg
DM
Reg
IM
Pipelining
CC7
Stall the pipeline!
DM
slt $1, $6, $7
CS2100
CC6
Reg
Reg
DM
Reg
57
Exercise #2: Without Forwarding

How many cycles will it take to execute the following
code on a 5-stage pipeline without forwarding?
sub
and
or
add
sw
CS2100
$2,
$12,
$13,
$14,
$15,
$1, $3
$2, $5
$6, $2
$2, $2
100($2)
Pipelining
58
Exercise #2: Without Forwarding
sub
1
2
3
4
5
6
IF
ID
EX
MEM
WB
7
8
9
10 11
and
or
add
sw

CS2100
Pipelining
59
Exercise #3: Without Forwarding

How many cycles will it take to execute the following
code on a 5-stage pipeline without forwarding?
lw
and
or
add
sw
CS2100
$2,
$12,
$13,
$14,
$15,
20($3)
$2, $5
$6, $2
$2, $2
100($2)
Pipelining
60
Exercise #3: Without Forwarding
lw
1
2
3
4
5
IF
ID
EX
MEM
WB
6
7
8
9
10 11
and
or
add
sw

CS2100
Pipelining
61
Exercise #4: With Forwarding

How many cycles will it take to execute the following
code on a 5-stage pipeline with forwarding?
sub
and
or
add
sw
CS2100
$2,
$12,
$13,
$14,
$15,
$1, $3
$2, $5
$6, $2
$2, $2
100($2)
Pipelining
62
Exercise #4: With Forwarding
sub
1
2
3
4
5
6
IF
ID
EX
MEM
WB
7
8
9
10 11
and
or
add
sw

CS2100
Pipelining
63
Exercise #5: With Forwarding

How many cycles will it take to execute the following
code on a 5-stage pipeline with forwarding?
lw
and
or
add
sw
CS2100
$2,
$12,
$13,
$14,
$15,
20($3)
$2, $5
$6, $2
$2, $2
100($2)
Pipelining
64
Exercise #5: With Forwarding
lw
1
2
3
4
5
IF
ID
EX
MEM
WB
6
7
8
9
10 11
and
or
add
sw

CS2100
Pipelining
65
CONTROL DEPENDENCY
CS2100
Pipelining
66
Control Dependency

Definition:



An instruction j is control dependent on i if i controls
whether or not j executes
Typically i would be a branch instruction
Example:
i1: beq $3, $5, label
i2: add $1, $2, $4
...
...
...

# Branch
# depends on i1
Effect of incorrect execution:

CS2100
If i3 is allowed to execute before i2 is determined,
register $1 maybe incorrectly changed!
Pipelining
67
Control Dependency: Example

Let us turn to a code fragment with a
conditional branch:
r1  r3

40
44
48
52
..
72
beq $1, $3, 7
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
.........
lw $4, 5($7)
r1 = r3
How does the code affect a pipeline processor?
CS2100
Pipelining
68
Pipeline Execution: IF Stage

Read instruction from memory using the address in PC
and put it in IF/ID register

PC address is incremented by 4 and then written back to
the PC for next instruction
CS2100
Pipelining
69
Control Dependency: Why?
PCSrc
ID/EX
0
M
u
x
1
WB
Control
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
Add
Add
Add result
Instruction
memory
ALUSrc
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Branch
Shift
left 2
MemWrite
PC
Instruction
RegWrite
4
Address
Data
memory
Read
data
Write
data
Instruction 16
[15– 0]
Decision is made
in MEM stage:
Too late!
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
ALU
control
0
M
u
x
1
1
M
u
x
0
MemRead
ALUOp
RegDst
CS2100
Pipelining
70
Control Dependency: Example
Time (in clock cycles)
Program
execution
CC 1
CC 2
order
(in instructions)
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
IM
CC 3
Reg
IM
CC 4
DM
Reg
IM
Reg
IM
CC 7
CC 8
CC 9
Wrong
Execution!
Reg
DM
Reg
IM
Pipelining
CC 6
Reg
DM
72 lw $4, 50($7)
CS2100
CC 5
Reg
DM
Reg
Reg
DM
Reg
71
Control Hazards: Stall Pipeline?
Program
execution
order
(in instructions)
Time (in clock cycles)
40 beq $1, $3, 7
CC 1
CC 2
IM
Reg
Bubble
CC 3
CC 4
CC 5
DM
Reg
CC 6
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
Bubble
IM
72 lw $4, 50($7)
Bubble
Reg
CC 7
CC 8
CC 9
Bubble
Bubble
Bubble
DM
Reg
Wait until the branch outcome is known and then fetch
the correct instructions
 Introduces 3 clock cycles delay

CS2100
Pipelining
72
Control Hazards: Reducing the Penalty

Branching is very common in code:


A 3 cycles stall penalty is too heavy!
Many techniques invented to reduce the control
hazard penalty:

Move branch decision calculation to earlier pipeline
stage


Guess the outcome before it is produced


Branch Prediction
Do something useful while waiting for the outcome

CS2100
Early Branch Resolution
Delayed Branching
Pipelining
73
Reduce Stalls: Early Branch(1/3)

Make decision in ID stage instead of MEM


Move branch target address calculation
Move register comparison  cannot use ALU for
register comparison any more
Branch target
address calculation
Instruction
16
Add
Shift
left 2
Read
Read
register 1
data 1
Read
register 2
Write
Registers
register
Read
Write
data 2
data
RegWrite
CS2100
PC + 4
Sign
extend
4
Target address
ALUcontrol
Zero
ALU ALU
result
To branch
Control logic
Register Comparison
32
Pipelining
74
Reduce Stalls: Early Branch(2/3)
Register comparison
moved to ID stage
CS2100
Pipelining
75
Reduce Stalls: Early Branch(3/3)
Program
execution
order
(in instructions)
Time (in clock cycles)
40 beq $1, $3, 7
CC 1
CC 2
IM
Reg
Bubble
72 lw $4, 50($7)

CS2100
Bubble
IM
CC 4
CC 5
DM
Reg
Bubble
Bubble
Reg
CC 6
CC 7
CC 8
CC 9
Bubble
DM
Reg
Wait until the branch decision is known:


CC 3
Then fetch the correct instructions
Reduced to 1 clock cycle delay
Pipelining
76
Early Branch: Problems(1/3)

However, if the register(s) involved in the comparison is
produced by preceding instruction:

Further stall is still needed!
Time (in clock cycles)
Program
execution
order
(in instructions)
add $s0, $s1, $s2
beq $s0, $s3, Exit
CC 1
CC 2
IM
Reg
IM
CC 3
CC 5
DM
Reg
Reg
Bubble
CS2100
CC 4
DM
IM
Pipelining
Reg
CC 6
CC 7
CC 8
CC 9
$s0 is needed
before it is
produced!
Reg
DM
Reg
77
Early Branch: Problems(2/3)

Solution:


Add forwarding path from ALU to ID stage
One clock cycle delay is still needed
Time (in clock cycles)
Program
execution
order
(in instructions)
add $s0, $s1, $s2
beq $s0, $s3, Exit
CC 1
CC 2
IM
Reg
IM
CC 3
Bubble
CC 4
CC 5
DM
Reg
Reg
Bubble
CS2100
Pipelining
IM
CC 6
CC 7
DM
Reg
Reg
CC 8
CC 9
DM
Reg
78
Early Branch: Problems(3/3)


Problem is worse with load followed by branch
Solution:


MEM to ID forwarding and 2 more stall cycles!
In this case, we ended up with 3 total stall cycles  no
improvement!
Time (in clock cycles)
Program
execution
order
(in instructions)
lw $s0, 0($s1)
beq $s0, $s3, Exit
CC 1
IM
CC 2
CC 3
Reg
IM
CC 4
DM
Bubble
Bubble
CC 5
Pipelining
CC 7
CC 8
CC 9
ALU  ID
forwarding cannot help!
Reg
Reg
Bubble
CS2100
CC 6
DM
IM
Reg
Reg
DM
Reg
79
Reduce Stalls: Branch Prediction

There are many branch prediction schemes


Simple prediction:



We only cover the simplest in this course 
All branches are assumed to be not taken
Fetch the successor instruction and start pumping it through the
pipeline stages
When the actual branch outcome is known:


CS2100
Not taken: Guessed correctly  No pipeline stall
Taken: Guessed wrongly  Wrong instructions in the pipeline 
Flush successor instruction from the pipeline
Pipelining
80
Branch Prediction: Correct Prediction
Time (in clock cycles)
Program
execution
order
(in instructions)
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
CC 1
CC 2
IM
Reg
IM
CC 3
CC 4
CC 5
DM
Reg
Reg
IM
DM
Reg
52 add $14, $2, $2
CS2100
CC 6
CC 7
CC 8
CC 9
Reg
DM
Reg
Branch is known to be
not taken in cycle 3
 no stall needed!
Pipelining
81
Branch Prediction: Wrong Prediction
Time (in clock cycles)
Program
execution
order
(in instructions)
40 beq $1, $3, 7
44 and $12, $2, $5
72 lw $4, 50($7)
CC 1
CC 2
IM
Reg
IM
CC 3
Bubble
IM
CC 4
CC 5
DM
Reg
Bubble
Bubble
Reg
CC 6
CC 7
CC 8
CC 9
Bubble
DM
Reg
Branch is known to be taken in cycle 3
"and" instruction should not be executed
 Flushed from pipeline
CS2100
Pipelining
82
Exercise #6: No Branch Prediction

How many cycles will it take to execute the
following code on a 5-stage pipeline with
forwarding but no branch prediction?

Decision making moved to ID stage
addi
Loop: addi
bne
sub


$s0,
$s0,
$s0,
$t0,
$zero, 10
$s0, -1
$zero, Loop
$t1, $t2
Total instructions = 1 + 102 + 1 = 22
Ideal pipeline = 4 + 221 = 26 cycles
CS2100
Pipelining
83
Exercise #6: No Branch Prediction
addi1
addi2
bne
addi2
1
2
3
4
5
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
ID
IF
6
7
8
EX
MEM
WB
IF
ID
EX
9
MEM
10 11
WB
Data dependency between (addi $s0, $s0, -1) and bne incurs 1 cycle of delay.
There are 10 iterations, hence 10 cycles of delay.
Every bne incurs a cycle of delay to execute the next instruction. There are 10
iterations, hence 10 cycles of delay.
Total number of cycles of delay = 20.
Total execution cycles = 26 + 20 = 46 cycles.

CS2100
Pipelining
84
Exercise #7: Branch Prediction

How many cycles will it take to execute the
following code on a 5-stage pipeline with
forwarding and branch prediction?
Loop:


addi
addi
bne
sub
$s0,
$s0,
$s0,
$t0,
$zero, 10
$s0, -1
$zero, Loop
$t1, $t2
Total instructions = 1 + 102 + 1 = 22
Ideal pipeline = 4 + 221 = 26 cycles
CS2100
Pipelining
85
Exercise #7: Branch Prediction
addi1
addi2
bne
sub
addi2
1
2
3
4
5
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
ID
IF
6
7
8
EX
MEM
WB
IF
ID
EX
9
10 11
IF
MEM
WB
Predict not taken.
The data dependency remains, hence 10 cycles of delay for 10 iterations.
In the first 9 iterations, the branch prediction is wrong, hence 1 cycle of delay.
In the last iteration, the branch prediction is correct, hence saving 1 cycle of delay.
Total number of cycles of delay = 19.
Total execution cycles = 26 + 19 = 45 cycles.

CS2100
Pipelining
86
Reduce Stalls: Delayed Branch

Observation:



Branch outcome takes X number of cycles to be known
X cycles stall
Idea:

Move non-control dependent instructions into the X
slots following a branch



Known as the branch-delay slot
These instructions are executed regardless of the branch
outcome
In our MIPS processor:

CS2100
Branch-Delay slot = 1 (with the early branch)
Pipelining
87
Delayed Branch: Example
Nondelayed branch
or
add
sub
beq
xor
Exit:

Delayed branch
$8, $9, $10
$1, $2, $3
$4, $5, $6
$1, $4, Exit
$10, $1, $11
add
sub
beq
or
xor
$1, $2, $3
$4, $5, $6
$1, $4, Exit
$8, $9, $10
$10, $1, $11
Exit:
The "or" instruction is moved into the delayed slot:
Get executed regardless of the branch outcome
 Same behavior as the original code!

CS2100
Pipelining
88
Delayed Branch: Observation

Best case scenario

There is an instruction preceding the branch which
can be moved into delayed slot


Program correctness must be preserved!
Worst case scenario
Such instruction cannot be found
 Add a no-op (nop) instruction in the branch-delay slot


Re-ordering instructions is a common method of
program optimization


CS2100
Compiler must be smart enough to do this
Usually can find such an instruction at least 50% of
the time
Pipelining
89
For your reference
EXPLORATION
CS2100
Pipelining
90
Multiple Issue Processors(1/2)

Multiple Issue processors



Static multiple issue:




Multiple instructions in every pipeline stage
4 washer, 4 dryer…
EPIC (Explicitly Parallel Instruction Computer) or VLIW (Very
Long Instruction Word), e.g. IA64
Compiler specifies the set of instructions that execute together in
a given clock cycle
Simple hardware, complex compiler
Dynamic multiple issue:



CS2100
Superscalar processor: Dominant design of modern processors
Hardware decides which instructions to execute together
Complex hardware, simpler compiler
Pipelining
91
Multiple Issue Processors(2/2)

A 2-wide superscalar pipeline:

CS2100
By fetching and dispatching two instructions at a time, a
maximum of two instructions per cycle can be completed.
Pipelining
92
Summary

Pipelining is a fundamental concept in computer
systems!




Multiple instructions in flight
Limited by length of the longest stage
Hazards create trouble by stalling pipeline
Pentium 4 has 22 pipeline stages!
CS2100
Pipelining
93
Reading Assignment

3rd edition



Sections 6.1 – 6.3
Sections 6.4 – 6.6 (data hazards and control hazards
in details; read for interest; not in syllabus)
4th edition


CS2100
Sections 4.5 – 4.6
Sections 4.7 – 4.8 (data hazards and control hazards
in details; read for interest; not in syllabus)
Pipelining
94
End
CS2100
Pipelining
95
Download