Pipelined Processor Design
A basic technique to improve performance - always applied in
high-performance systems. Adopted in all processors.
ITCS 3181 Logic and Computer Systems 2014 B. Wilkinson Slides9.ppt
Modification date: Nov 3, 2014
1
Pipelined Processor Design
The operation of the processor is divided into a number of
sequential actions, e.g.:
1. Fetch instruction.
2. Fetch operands.
3. Execute operation.
4. Store results.
or more steps. Each action is performed by a separate logic unit
(stage), and the stages are linked together in a “pipeline.”
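As an illustrative sketch only (not from the slides; the function names are invented), the four actions can be modelled in Python as a chain of stage functions applied to each instruction. A hardware pipeline overlaps these stages across successive instructions rather than running them one instruction at a time:

    # Hypothetical model: each pipeline stage is a function acting on an instruction record.
    def fetch_instruction(instr):
        instr["fetched"] = True
        return instr

    def fetch_operands(instr):
        instr["operands"] = instr.get("sources", [])
        return instr

    def execute_operation(instr):
        instr["result"] = sum(instr["operands"])   # pretend every operation is an add
        return instr

    def store_results(instr):
        instr["stored"] = True
        return instr

    STAGES = [fetch_instruction, fetch_operands, execute_operation, store_results]

    # Non-pipelined execution: each instruction passes through all stages
    # before the next one starts.
    for instr in [{"sources": [1, 2]}, {"sources": [3, 4]}]:
        for stage in STAGES:
            stage(instr)
        print(instr["result"])   # 3, then 7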
2
Processor Pipeline Space-Time Diagram
[Figure: (a) Seven pipeline stages (Unit 1 to Unit 7) with instructions entering from memory. (b) Space-time diagram: each row is a stage and each column a clock cycle; entries have the form I with a subscript giving the instruction number and a superscript giving the stage, so the first instruction is processed by Unit 1, then Unit 2, and so on, while later instructions follow one cycle behind. Notation: subscript = instruction, superscript = stage.]
3
Pipeline Staging Latches
Usually, pipelines are designed using latches (registers) between units
(stages) to hold the information being transferred from one stage
to the next.
Transfer occurs in synchronism with a clock signal:
[Figure: Data passes through an alternating chain of latches and units (Latch, Unit, Latch, Unit, Latch, Unit, Latch); all latches are driven by a common Clock.]
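A minimal Python sketch (not the slides' hardware, just an analogy) of the clocked staging idea: on every clock tick all latches capture the outputs of the units feeding them at the same time, so data moves one stage along per cycle:

    # Toy model: latches[i] feeds units[i]; on a tick all latches update together.
    def clock_tick(latches, units, new_input):
        # Compute every unit's output from the current latch contents first,
        # then update all latches at once (edge-triggered behaviour).
        outputs = [unit(latches[i]) for i, unit in enumerate(units)]
        return [new_input] + outputs

    units = [lambda x: None if x is None else x + 1] * 3   # three identical dummy units
    latches = [None] * 4                                    # a latch before and after each unit

    for value in [10, 20, 30]:
        latches = clock_tick(latches, units, value)
        print(latches)   # values march one position further along each tick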
4
Processing time
Time to process s instructions using a pipeline with p stages
= p + s - 1 cycles
[Figure: Space-time diagram of Stage 1 to Stage p processing s instructions over p + s - 1 cycles: the pipeline takes p - 1 cycles to fill, after which the last instruction leaves Stage p at cycle p + s - 1.]
5
Speedup
How much faster is the pipeline than a single homogeneous unit doing
everything?
The speed-up available in a pipeline is given by:
Speedup, S = T1 / T2 = sp / (p + s - 1)
Note: This does not take into account the extra time due to the latches in the pipeline version.
Potential maximum speed-up is p, though this can only be achieved for an
infinite stream of tasks (s → ∞) and no hold-ups in the pipeline.
An alternative to pipelining is to use multiple units, each doing the
complete task. The units could be designed to operate faster than the
pipelined version, but the system would cost much more.
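A small Python sketch (illustrative function names) of the two formulas above, ignoring latch overhead as the note says:

    def pipeline_cycles(p, s):
        # Time to process s instructions in a p-stage pipeline, in cycles.
        return p + s - 1

    def pipeline_speedup(p, s):
        # T1 = s * p cycles on a single unit with the same cycle time;
        # T2 = p + s - 1 cycles on the pipeline.
        return (s * p) / pipeline_cycles(p, s)

    print(pipeline_cycles(5, 100))    # 104 cycles
    print(pipeline_speedup(5, 100))   # about 4.8, approaching p = 5 as s grows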
6
Dividing Processor Actions
The operation of the processor can be divided into:
• Fetch Cycle
• Execute Cycle
7
Two Stage Fetch/Execute Pipeline
[Figure: (a) Instructions flow from memory through a Fetch unit (IF) followed by an Execute unit (EX). (b) Space-time diagram with ideal overlap: while EX executes the 1st, 2nd, 3rd instruction, IF is already fetching the 2nd, 3rd, 4th instruction.]
8
A Two-Stage Pipeline Design
[Figure: Two-stage datapath. Fetch unit: the PC (incremented by +4) supplies the memory address via the MAR; the fetched instruction passes through the MDR into the IR and the staging latch; a branch/jump can affect the PC. Execute unit: control logic, the registers and the ALU act on the latched instruction; this unit accesses memory for data (LD and ST instructions).]
9
Fetch/decode/execute pipeline
Relevant for complex instruction formats. The decode stage recognizes the
instruction, separating the operation from the operand addresses.
[Figure: (a) Fetch, Decode and Execute units in series, with instructions entering the Fetch unit. (b) Ideal overlap: while the Execute unit handles the 1st, 2nd, 3rd instruction, the Decode unit is working on the 2nd, 3rd, 4th and the Fetch unit on the 3rd, 4th, 5th.]
10
Try to have each stage require the same time; otherwise the pipeline
has to operate at the rate of the slowest stage. Usually more stages are
used to equalize the times. Let's start with four stages:
Four-Stage Pipeline
[Figure: Four stages in series: Instruction fetch unit (IF), Operand fetch unit (OF), Execute unit (EX), Operand store unit (OS).]
Space-Time Diagram
[Figure: at a given point in time IF has handled Instructions 1-5, OF Instructions 1-4, EX Instructions 1-3 and OS Instructions 1-2, each stage lagging the one before it by one cycle.]
11
Four-stage Pipeline “Instruction-Time Diagram”
An alternative diagram:
Instruction   Time →
1st:  IF  OF  EX  OS
2nd:      IF  OF  EX  OS
3rd:          IF  OF  EX  OS
4th:              IF  OF  EX  OS

IF = Instruction fetch unit
OF = Operand fetch unit
EX = Execute unit
OS = Operand store unit
This form of diagram is used later to show pipeline dependencies.
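The diagram above can be generated for any pipeline length; the following Python sketch (invented helper name) prints one row per instruction, with instruction i entering stage j in cycle i + j:

    STAGES = ["IF", "OF", "EX", "OS"]

    def instruction_time_diagram(n_instructions, stages=STAGES):
        total_cycles = len(stages) + n_instructions - 1
        for i in range(n_instructions):
            row = ["    "] * total_cycles           # blank cells
            for j, stage in enumerate(stages):
                row[i + j] = f"{stage:<4}"          # instruction i is in stage j at cycle i + j
            print(f"{i + 1:>3}: " + "".join(row))

    instruction_time_diagram(4)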
12
Information Transfer in Four-Stage Pipeline
[Figure: Four-stage datapath. IF: the PC addresses memory and the fetched instruction enters the first latch. OF: register numbers are presented to the register file and their contents are latched. EX: the ALU operates on the latched operands. OS: the result is stored back to the register file. All latches are driven by a common Clock.]
13
Register-Register Instructions
ADD R3, R2, R1
After instruction fetched:
[Figure: the instruction Add R3, R2, R1 has been read from memory and PC = PC + 4.]
Note: where R3, R2, and R1 are shown in a latch, it actually holds just the register numbers.
14
Register-Register Instructions
ADD R3, R2, R1
After instruction fetched:
[Figure: the fields Add, R3, R2, R1 are now held in the latch feeding the OF stage.]
Note: where R3, R2, and R1 are shown in a latch, it actually holds just the register numbers.
15
Register-Register Instructions
ADD R3, R2, R1
After operands fetched:
[Figure: the latch feeding EX holds Add, R3, V2, V1; the earlier latch fields are cleared.]
V1 is contents of R1, V2 is contents of R2
16
Register-Register Instructions
ADD R3, R2, R1
After execution (addition):
[Figure: the ALU has added V1 and V2; the latch feeding OS holds R3 and the Result.]
Note: where R3, R2, and R1 are shown in a latch, it actually holds just the register numbers.
17
Register-Register Instructions
ADD R3, R2, R1
After result stored:
[Figure: the Result has been written to R3 in the register file; the latch fields for this instruction are now cleared.]
Note: where R3, R2, and R1 are shown in a latch, it actually holds just the register numbers.
18
Register-Register Instructions
ADD R3, R2, R1
Overall:
[Figure: ADD R3, R2, R1 passing through the pipeline. IF: the instruction is fetched and PC = PC + 4; the latch holds Add, R3, R2, R1. OF: the register file is read; the latch holds Add, R3, V2, V1. EX: the ALU adds V1 and V2; the latch holds R3 and the Result. OS: the Result is written to R3 in the register file.]
Note: where R3, R2, and R1 are shown in a latch, it actually holds just the register numbers.
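A minimal Python sketch (assumed data structures, not the slides' hardware) of how the latch contents evolve for ADD R3, R2, R1:

    # Hypothetical register file; the latches are plain dictionaries.
    regs = {"R1": 5, "R2": 7, "R3": 0}

    # IF: fetch/decode the instruction into the first latch (register numbers only).
    if_of = {"op": "Add", "dest": "R3", "src1": "R2", "src2": "R1"}

    # OF: read the register file; V2 and V1 are the contents of R2 and R1.
    of_ex = {"op": "Add", "dest": if_of["dest"],
             "v2": regs[if_of["src1"]], "v1": regs[if_of["src2"]]}

    # EX: the ALU performs the addition.
    ex_os = {"dest": of_ex["dest"], "result": of_ex["v1"] + of_ex["v2"]}

    # OS: write the result back to the register file.
    regs[ex_os["dest"]] = ex_os["result"]

    print(regs)   # {'R1': 5, 'R2': 7, 'R3': 12}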
19
Register-Constant Instructions
ADD R3, R2, 123
After instruction fetched:
[Figure: the instruction Add R3, R2, 123 has been read from memory and PC = PC + 4.]
Note: where R3 and R2 are shown in a latch, it actually holds just the register numbers.
20
Register-Constant Instructions
ADD R3, R2, 123
After instruction fetched:
[Figure: the fields Add, R3, R2, 123 are now held in the latch feeding the OF stage.]
Note: where R3 and R2 are shown in a latch, it actually holds just the register numbers.
21
Register-Constant Instructions
ADD R3, R2, 123
After operands fetched:
[Figure: the latch feeding EX holds Add, R3, V2, 123; the constant 123 is passed straight through rather than being read from the register file.]
V2 is contents of R2
22
Register-Constant Instructions
ADD R3, R2, 123
After execution (addition):
[Figure: the ALU has added V2 and 123; the latch feeding OS holds R3 and the Result.]
Note: where R3 and R2 are shown in a latch, it actually holds just the register numbers.
23
Register-Constant Instructions
ADD R3, R2, 123
After result stored:
[Figure: the Result has been written to R3 in the register file; the latch fields for this instruction are now cleared.]
Note: where R3 and R2 are shown in a latch, it actually holds just the register numbers.
24
Register-Constant Instructions
(Immediate addressing)
ADD R3, R2, 123
Overall:
[Figure: ADD R3, R2, 123 passing through the pipeline. IF: the instruction is fetched; the latch holds Add, R3, R2, 123. OF: R2 is read from the register file; the latch holds Add, R3, V2, 123. EX: the ALU adds V2 and 123; the latch holds R3 and the Result. OS: the Result is written to R3 in the register file.]
V2 is contents of R2
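A short Python sketch (illustrative only) of the operand-fetch step when the second source is an immediate constant rather than a register:

    regs = {"R2": 7, "R3": 0}

    def fetch_operands(latch, regs):
        # A register source (a name) is read from the register file;
        # an immediate (an int, e.g. 123) is passed straight through.
        def value(src):
            return regs[src] if isinstance(src, str) else src
        return {"op": latch["op"], "dest": latch["dest"],
                "v1": value(latch["src1"]), "v2": value(latch["src2"])}

    if_of = {"op": "Add", "dest": "R3", "src1": "R2", "src2": 123}
    of_ex = fetch_operands(if_of, regs)
    regs[of_ex["dest"]] = of_ex["v1"] + of_ex["v2"]   # EX and OS collapsed for brevity
    print(regs["R3"])                                 # 130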
25
Branch Instructions
A couple of issues to deal with here:
1. Number of steps needed.
2. Dealing with the program counter incrementing after each instruction fetch.
26
(Complex) Branch Instructions
Offset to L1 held
in instruction
Bcond R1, R2, L1
After instruction fetched:
[Figure: the latch feeding OF holds Bcond, R1, R2 and the Offset; the execute stage (EX/BR) contains a Test unit as well as the ALU.]
27
(Complex) Branch Instructions
Offset to L1 held
in instruction
Bcond R1, R2, L1
After operands fetched:
[Figure: the latch feeding EX/BR holds Bcond, V1, V2 and the Offset.]
V1 is contents of R1, V2 is contents of R2
28
(Complex) Branch Instructions
Offset to L1 held
in instruction
Bcond R1, R2, L1
After execution (condition test):
[Figure: the Test unit has compared V1 and V2; the latch feeding OS holds the Result (TRUE/FALSE) and the Offset.]
V1 is contents of R1
29
(Complex) Branch Instructions
Offset to L1 held
in instruction
Bcond R1, R2, L1
After result stored:
[Figure: the OS stage examines the Result (TRUE/FALSE); if TRUE it adds the Offset to the PC, else it does nothing.]
V1 is contents of R1
30
(Complex) Branch Instructions
Offset to L1 held
in instruction
Bcond R1, R2, L1
Overall:
[Figure: Bcond R1, R2, L1 passing through the pipeline. IF: the latch holds Bcond, R1, R2, Offset. OF: the latch holds Bcond, V1, V2, Offset. EX/BR: the Test unit compares V1 and V2; the latch holds the Result (TRUE/FALSE) and the Offset. OS: if the Result is TRUE the Offset is added to the PC, else nothing is done.]
V1 is contents of R1
31
Simpler Branch Instructions
Bcond R1, L1
Overall:
Tests R1 against zero.
[Figure: Bcond R1, L1 passing through the pipeline. IF: the latch holds Bcond, R1, Offset. OF: the latch holds Bcond, V1, Offset. EX/BR: the Test unit tests V1 against zero; the latch holds the Result (TRUE/FALSE) and the Offset. OS: if the Result is TRUE the Offset is added to the PC, else nothing is done.]
V1 is contents of R1
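A hedged Python sketch (illustrative values only; it ignores the PC-increment issue dealt with on the next slide) of the EX/BR test and the OS-stage update for the simpler branch:

    regs = {"R1": 0}
    pc = 100
    offset = 24

    # EX/BR stage: test R1 against zero; the result is latched as TRUE/FALSE.
    ex_os = {"result": regs["R1"] == 0, "offset": offset}

    # OS stage: if the result is TRUE, add the offset to the PC; else do nothing.
    if ex_os["result"]:
        pc = pc + ex_os["offset"]

    print(pc)   # 124 if R1 was zero, otherwise 100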
32
Dealing with program counter incrementing after
each instruction fetch
The previous design needs to take into account that by the time
the branch instruction is in the execute unit, the program counter
will have been incremented three times.
Solutions:
1. Modify the offset value in the instruction (subtract 12).
2. Modify the arithmetic operation to be PC + offset – 12
3. Feed the program counter value through the pipeline.
(This is the best way as it takes into account any pipeline length.
Done in the P-H book)
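An illustrative Python sketch (assumed numbers) of solution 3. The PC value captured when the branch is fetched travels with the instruction through the latches, so the target is computed from it and is independent of the pipeline length; the carried value is assumed here to be the PC before its own increment, which is consistent with the "subtract 12" figure above:

    fetch_pc = 100                 # PC value carried through the pipeline with the branch
    current_pc = fetch_pc + 3 * 4  # by EX time the real PC has been incremented three times
    offset = 24

    # Solution 3: compute the target from the carried PC value.
    target = fetch_pc + offset

    # Solutions 1/2 would instead correct for the three increments explicitly.
    assert target == current_pc + offset - 12

    print(target)   # 124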
33
Feeding PC value through pipeline
Bcond R1, L1
Overall:
Tests R1 against zero.
[Figure: the PC value is carried along with the instruction. IF: the latch holds Bcond, R1, Offset, PC. OF: the latch holds Bcond, V1, Offset, PC. EX/BR: the Test unit tests V1 and an adder computes the new PC value from the carried PC and the Offset; the latch holds the Result (TRUE/FALSE) and the new PC value. OS: if the Result is TRUE the PC is updated with the new value, else nothing is done.]
V1 is contents of R1
34
Load and Store Instructions
Need at least one extra stage to handle memory accesses. The early RISC
processor arrangement was to place a memory stage (MEM) between
EX and OS as below. This is now a five-stage pipeline.
LD R1, 100[R2]
[Figure: LD R1, 100[R2] in the five-stage pipeline (IF, OF, EX, MEM, OS), with separate instruction and data memories. IF: the instruction is fetched from the instruction memory; the latch holds LD, R1, R2, 100. OF: R2 is read; the latch holds LD, R1, V2, 100. EX: the ALU computes the effective address (V2 + 100). MEM: the data memory is read at that address; the latch holds R1 and the Value. OS: the Value is written to R1 in the register file.]
35
ST 100[R2], R1
[Figure: ST 100[R2], R1 in the five-stage pipeline. IF: the latch holds ST, R1, R2, 100. OF: R1 and R2 are read; the latch holds ST, V1, V2, 100. EX: the ALU computes the effective address (V2 + 100). MEM: V1 is written to the data memory at that address. OS: this stage is not used.]
Note: It is convenient to have separate instruction and data
memories connecting to the processor pipeline - usually separate
cache memories (see later).
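An illustrative Python sketch (assumed structures) of the EX and MEM stages handling a load; a store would write the data memory in MEM instead and leave nothing for OS to do:

    regs = {"R1": 0, "R2": 200}
    data_memory = {300: 42}            # toy data memory, address -> value

    def ex_stage(latch):
        # EX: compute the effective address = base register value + offset.
        return {"op": latch["op"], "reg": latch["reg"],
                "addr": latch["base_value"] + latch["offset"],
                "store_value": latch.get("store_value")}

    def mem_stage(latch, data_memory):
        # MEM: LD reads the data memory, ST writes it.
        if latch["op"] == "LD":
            return {"reg": latch["reg"], "value": data_memory[latch["addr"]]}
        if latch["op"] == "ST":
            data_memory[latch["addr"]] = latch["store_value"]
        return {}                      # nothing for OS to do after a store

    # LD R1, 100[R2]
    of_ex = {"op": "LD", "reg": "R1", "base_value": regs["R2"], "offset": 100}
    mem_os = mem_stage(ex_stage(of_ex), data_memory)
    regs[mem_os["reg"]] = mem_os["value"]   # OS: write the loaded value to R1
    print(regs["R1"])                       # 42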
36
Usage of Stages
[Figure: (a) The five units: Fetch, Fetch operands from registers, Execute instruction (compute), Access memory, Store results in register. (b) Instruction usage for Load, Store, Arithmetic and Branch instructions; an arithmetic instruction passes through the MEM stage but no actions take place there.]
37
Number of Pipeline Stages
As the number of stages is increased, one would expect the time for
each stage to decrease, i.e. the clock period to decrease and the
speed to increase.
However, one must take into account the pipeline latch delay.
The 5-stage pipeline represents an early RISC design - “underpipelined”.
Most recent processors have more stages.
38
Optimum Number of Pipeline Stages*
Suppose one homogeneous unit doing everything takes Ts time units.
With p pipeline stages with equally distributed work, each stage takes Ts/p.
Let tL = time for a latch to operate. Then:
Execution time Tex = (p + s - 1) × (Ts/p + tL)
[Plot: Tex versus the number of pipeline stages p (p = 2^1 to 2^7). Typical results for Ts = 128, tL = 2: Tex falls and then rises again, with the optimum at about 16 stages. In practice, there are a lot more factors involved - see later for some.]
* Adapted from “Computer Architecture and Implementation” by H. G. Cragon, Cambridge University Press, 2000.
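A short Python sketch reproducing the calculation behind the plot. The slide gives Ts = 128 and tL = 2 but not s; s = 5 is assumed here because it yields an optimum of about 16 stages, matching the plot:

    Ts, tL, s = 128, 2, 5

    def execution_time(p):
        # Tex = (p + s - 1) * (Ts/p + tL)
        return (p + s - 1) * (Ts / p + tL)

    for p in [2 ** k for k in range(1, 8)]:   # p = 2, 4, ..., 128 as on the x-axis
        print(p, execution_time(p))

    # Differentiating Tex with respect to p gives the continuous optimum
    # p = sqrt(Ts * (s - 1) / tL), which is 16 for these values.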
39
Questions
40