2015-11-06
Lecture 3: Instruction Pipelining




Basic concepts
Pipeline hazards
Branch handling
Branch prediction
Zebo Peng, IDA, LiTH
TDTS 08 – Lecture 3
What is a Pipeline?
Divide a task into a sequence of simpler sub-tasks.
Employ one worker for each simpler sub-task.
[Picture: H. Ford]
Basic Concepts

Sequential execution of an N-stage task:
[Figure: Task1 passes through stages 1, 2, 3, …, N; Task2 starts its stages 1, 2, 3, …, N only after Task1 has finished.]
- Production time: N time units.
- Resource needed: one general-purpose machine.
- Productivity: one product per N time units.

Pipelined execution of an N-stage task:
[Figure: each task starts one time unit after the previous one, so the stages 1, 2, 3, …, N of successive tasks overlap in time.]
- Production time: N time units.
- Resource needed: N special-purpose machines.
- Productivity ≈ one product per time unit.
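The two productivity figures above can be checked with a small calculation. This is only a sketch: it assumes every stage takes exactly one time unit and that the pipeline never stalls, and the function names are illustrative.

```python
def sequential_time(n_stages: int, n_tasks: int) -> int:
    """Time units when each task finishes all stages before the next one starts."""
    return n_stages * n_tasks


def pipelined_time(n_stages: int, n_tasks: int) -> int:
    """Time units when a new task enters the pipeline every time unit (no stalls).

    The first task needs n_stages units; every later task finishes one unit
    after its predecessor.
    """
    if n_tasks == 0:
        return 0
    return n_stages + (n_tasks - 1)


def speedup(n_stages: int, n_tasks: int) -> float:
    """Ratio of sequential to pipelined time for the same work."""
    return sequential_time(n_stages, n_tasks) / pipelined_time(n_stages, n_tasks)
```

For a small number of tasks the speed-up stays well below N; it approaches N only as the number of tasks grows.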
Instruction Execution Stages
A typical instruction execution sequence:
1. Fetch Instruction (FI): Fetch the instruction.
2. Decode Instruction (DI): Determine the op-code and
the operand specifiers.
3. Calculate Operands (CO): Calculate the effective
addresses.
4. Fetch Operands (FO): Fetch the operands.
5. Execute Instruction (EI): Perform the operation.
6. Write Operand (WO): Store the result in memory.
FI → DI → CO → FO → EI → WO
Instruction Pipelining
time →  1   2   3   4   5   6   7   8   9   10  11  12  13  14
I1      FI  DI  CO  FO  EI  WO
I2          FI  DI  CO  FO  EI  WO
I3              FI  DI  CO  FO  EI  WO
I4                  FI  DI  CO  FO  EI  WO
I5                      FI  DI  CO  FO  EI  WO
I6                          FI  DI  CO  FO  EI  WO
I7                              FI  DI  CO  FO  EI  WO
I8                                  FI  DI  CO  FO  EI  WO
I9                                      FI  DI  CO  FO  EI  WO

This is the ideal case: once the pipeline is full, one instruction completes in every cycle, so the speed-up approaches a factor of 6.
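The ideal space-time diagram can also be generated programmatically. The sketch below assumes the six stages named in this lecture and a pipeline without stalls; `pipeline_diagram` and `total_cycles` are illustrative names.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def pipeline_diagram(n_instructions: int, stages=STAGES) -> list:
    """Return one text row per instruction for an ideal (stall-free) pipeline."""
    rows = []
    for i in range(n_instructions):
        # Instruction i (0-based) enters the first stage in cycle i + 1,
        # so its row is shifted right by i empty cells.
        cells = ["  "] * i + list(stages)
        rows.append(f"I{i + 1:<3}" + " ".join(cells))
    return rows


def total_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Cycles until the last instruction leaves the pipeline."""
    return n_stages + n_instructions - 1 if n_instructions else 0


if __name__ == "__main__":
    print("\n".join(pipeline_diagram(9)))
    print("cycles:", total_cycles(9))
```

For nine instructions the pipelined run takes 14 cycles instead of 54, a speed-up of about 3.9; the factor of 6 is reached only in the limit of a long instruction stream.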
Typical Instruction Pipelining
[Figure: space-time diagram of instructions I1 to I9 in which not every instruction passes through all six stages, leaving gaps in the pipeline.]
Different execution patterns for different instructions!
In practice, there are many holes, which reduce the speed-up factor.
Number of Pipeline Stages


In general, a larger number of stages gives better performance.
However:
- A larger number of stages increases the overhead of moving information between stages and of synchronizing the stages.
- The complexity of the CPU grows with the number of stages.
- It is difficult to keep a large pipeline at its maximum rate because of pipeline hazards.
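The effect of inter-stage overhead can be illustrated with a simple model. This is an assumption, not taken from the slides: the total logic delay is split evenly over the stages and every stage adds a fixed latch/synchronization overhead. All numbers used with it are hypothetical.

```python
def cycle_time(total_logic_delay: float, n_stages: int, latch_overhead: float) -> float:
    """Clock period: one stage's share of the logic plus the per-stage overhead."""
    return total_logic_delay / n_stages + latch_overhead


def ideal_speedup(total_logic_delay: float, n_stages: int, latch_overhead: float) -> float:
    """Speed-up over an unpipelined unit, ignoring hazards entirely."""
    return total_logic_delay / cycle_time(total_logic_delay, n_stages, latch_overhead)
```

With a logic delay of 60 units and an overhead of 2 units per stage, 6 stages give a speed-up of 5 rather than 6, and 20 stages give 12 rather than 20. Hazards reduce these figures further.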

Intel 80486 and Pentium:
- Five-stage pipeline for integer instructions.
- Eight-stage pipeline for FP (floating-point) instructions.

The Pentium 4 has 20 stages (!).

IBM PowerPC has different numbers of stages (3-9) for different machines. The PowerPC 440 has 7 stages.
Pipeline Hazards (Conflicts)

Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
- The instruction is said to be stalled.

When an instruction is stalled:
- All instructions that follow it in the pipeline are also stalled;
- No new instructions are fetched during the stall;
- Instructions that entered the pipeline before the stalled one continue as usual.

Types of hazards:
- Structural hazards
- Data hazards
- Control hazards
Structural (Resource) Hazards

Hardware conflicts: caused by the use of the same hardware resource at the same time (e.g., memory conflicts).
[Figure: space-time diagram in which an instruction fetch (FI) needs the memory in the same cycle as the memory access of an earlier instruction; the fetch is stalled for one cycle.]
Harvard Architecture solves this problem!
Penalty: 1 cycle (NOTE: the performance loss is multiplied by the number of stages).
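A memory conflict of this kind can be located mechanically. The sketch below assumes a single-port memory that is used by every FI and by the FO of instructions with a memory operand, and it looks only at the ideal schedule (it does not model the knock-on effect of the stall itself). The function name and the input format are illustrative.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def memory_conflicts(uses_memory_operand):
    """Find cycles in which an FO memory access collides with a later FI.

    uses_memory_operand[i] is True if instruction i (0-based) reads an operand
    from memory in its FO stage. In the ideal schedule, instruction i is in
    stage s during cycle i + s + 1.
    Returns (cycle, fetching_instruction, operand_instruction) triples.
    """
    fo = STAGES.index("FO")
    conflicts = []
    for i, needs_memory in enumerate(uses_memory_operand):
        if not needs_memory:
            continue
        cycle = i + fo + 1      # cycle in which instruction i is in FO
        j = cycle - 1           # instruction whose FI falls in the same cycle
        if j < len(uses_memory_operand):
            conflicts.append((cycle, j, i))
    return conflicts
```

With separate instruction and operand caches (Harvard architecture) the two accesses go to different memories, so the list of conflicts would be empty by construction.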
Structural Hazard Solutions



In general, the hardware resources in conflict are duplicated in order to avoid structural hazards.
Functional units (ALU, FP unit) can also be pipelined themselves to support several instructions at the same time.
Memory conflicts can be solved by:
- having two separate caches, one for instructions and the other for operands (Harvard architecture);
- using multiple banks of the main memory; or
- keeping as many intermediate results as possible in the registers (!).
Data Hazards

Caused by reversing the order of data-dependent operations due to the pipeline (e.g., WRITE/READ conflicts).
ADD A, R1;    Mem(A) ← Mem(A) + R1;
SUB A, R2;    Mem(A) ← Mem(A) - R2;

time →  1   2   3   4   5   6   7
ADD     FI  DI  CO  FO  EI  WO
SUB         FI  DI  CO  FO  EI  WO

Value of Mem(A) needed: when SUB is in FO (cycle 5).
Value of Mem(A) available: when ADD has completed WO (cycle 6).
Ex. A=200, Mem(200)=100, R1=30, R2=50: sequential execution leaves 80 in Mem(200), whereas the pipelined execution shown here leaves 50.
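The numbers in the example (A=200, Mem(200)=100, R1=30, R2=50) can be reproduced with a few lines. This is a sketch of the effect only, not of the hardware; the flag `sub_reads_stale_value` is an illustrative way of saying that SUB fetched its operand before ADD wrote its result.

```python
def run_add_sub(mem: dict, a: int, r1: int, r2: int, sub_reads_stale_value: bool) -> int:
    """Execute ADD A,R1 then SUB A,R2 on a copy of mem and return the final Mem(A)."""
    mem = dict(mem)
    old = mem[a]                 # value SUB fetches if it reads before ADD writes
    mem[a] = old + r1            # ADD writes its result back
    operand = old if sub_reads_stale_value else mem[a]
    mem[a] = operand - r2        # SUB writes its result back
    return mem[a]
```

Calling it with `sub_reads_stale_value=False` models the sequential order, and `True` models the pipelined order in which SUB reads Mem(A) too early.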
Data Hazard Penalty
ADD A, R1;    Mem(A) ← Mem(A) + R1;
SUB A, R2;    Mem(A) ← Mem(A) - R2;

time →  1     2     3     4     5     6     7     8     9
ADD     FI    DI    CO    FO    EI    WO
SUB           FI    DI    CO    Stall Stall FO    EI    WO

The value of Mem(A) is available once ADD has completed WO (cycle 6), so SUB performs FO in cycle 7.
Penalty: 2 cycles.

Data hazard is an important issue:
- It happens very often, since we have many data dependencies.
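The two-cycle penalty follows from the stage positions. The sketch below assumes the six-stage pipeline of this lecture, no forwarding, and that a value written in WO can be read by FO only in a later cycle; the function name and parameters are illustrative.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def stall_cycles(producer_issue_cycle: int, consumer_issue_cycle: int,
                 write_stage: str = "WO", read_stage: str = "FO") -> int:
    """Stalls needed so the consumer's read comes strictly after the producer's write."""
    write_cycle = producer_issue_cycle + STAGES.index(write_stage)
    read_cycle = consumer_issue_cycle + STAGES.index(read_stage)
    return max(0, write_cycle + 1 - read_cycle)
```

For ADD fetched in cycle 1 and SUB fetched in cycle 2 this gives 2 stalls; a dependent instruction fetched three or more cycles after the producer needs none.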
Data Hazard Solutions

The penalty due to data hazards can be reduced by a technique
called forwarding (bypassing).
[Figure: the ALU output is connected through a bypass path to the MUXes at the ALU inputs; the other MUX inputs come from the memory system (registers, cache and main memory).]
The ALU result is fed back to the ALU input.
- If the hardware detects that the value needed for an operation is the one produced by the previous operation and not yet written back, the MUX selects the forwarded result instead of the value from the memory system.
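The selection performed by the MUX can be sketched in software. This is a behavioural illustration under simplifying assumptions (only the immediately preceding result is considered); `PendingResult` and `select_operand` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingResult:
    """ALU result of the previous operation that has not been written back yet."""
    dest: str
    value: int


def select_operand(source: str, storage: dict, pending: Optional[PendingResult]) -> int:
    """Model of the input MUX: take the bypassed value when it is the one needed."""
    if pending is not None and pending.dest == source:
        return pending.value     # bypass path from the ALU output
    return storage[source]       # normal path from the memory system


if __name__ == "__main__":
    storage = {"A": 100}
    pending = PendingResult(dest="A", value=130)   # ADD result, not yet stored
    print(select_operand("A", storage, pending) - 50)
```

In the ADD/SUB example, SUB would receive 130 through the bypass and produce 80 without waiting for ADD to write Mem(A).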
Control Hazards

Caused by branch instructions, which change the instruction
execution order.
[Figure: space-time diagram of instructions 1, 2, 3, 4, 5, 6, 25 and 26, each passing through FI, DI, CO, FO, EI and WO; the branch instruction is “BRA 25 IF Zero”.]
The branch should be taken!
Branch Handling (1)

Stop the pipeline until the branch instruction reaches the last stage.

time →  1     2     3     4     5     6     7     8     9     10    11    12    13
1       FI    DI    CO    FO    EI    WO
2             FI    DI    CO    FO    EI    WO
25                  Stall Stall Stall Stall FI    DI    CO    FO    EI    WO
26                                                FI    DI    CO    FO    EI    WO

Instruction 2 is “BRA 25 IF Zero”. The branch should be taken!
This leads to a very large loss of performance, in particular since 20% to 35% of the instructions executed are branches.
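The size of the loss can be estimated from the branch frequency. The sketch assumes an ideal rate of one cycle per instruction and that every branch costs the four stall cycles shown in the diagram; the function name is illustrative.

```python
def effective_cpi(branch_fraction: float, stall_cycles_per_branch: int,
                  base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when every branch stalls the pipeline."""
    return base_cpi + branch_fraction * stall_cycles_per_branch
```

Under these assumptions, 20% branches give about 1.8 cycles per instruction and 35% give about 2.4, i.e. the pipeline runs at roughly half of its ideal rate.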
Branch Handling (2)

Multiple streams — implement hardware resources to
deal with different branch alternatives.
[Figure: space-time diagram for “BRA 25 IF Zero”. Until the branch condition is known, both the instructions following the branch (3, 4, 5, 6) and the instructions at the branch target (25, 26, 27, 28) are fetched and processed.]
This is an expensive solution, and you need special memory techniques to fully utilize it!
Branch Handling (3)

Pre-fetch branch target: when a conditional branch is recognized, the following instruction is fetched, and the branch target is also pre-fetched.

Loop buffer: use a small, high-speed memory to keep the n most recently fetched instructions. If a branch is to be taken, the buffer is first checked to see if the branch target is in it.
- Special cache for branch target instructions.

Delayed branch: re-arrange the instructions so that branching occurs later than originally specified.
- Software solution.
Delayed Branch Example
Original inst. sequence:
ADD X;
BRA L;    (no data dependence between ADD and BRA)
…
[Figure: FI/EI space-time diagram of ADD followed by BRA.]

Delayed branch:
BRA L;
ADD X;
…
[Figure: FI/EI space-time diagram of the branch followed by ADD, with the point marked at which the branch condition is known.]

The compiler or the programmer has to find an instruction which can be moved from its original place to the branch delay slot (it will be executed regardless of the branch outcome).
- 60% to 85% success rate.
- This leads, however, to unreadable code.
Branch Prediction

When a branch is encountered, a prediction is made and the predicted path is followed.

The instructions on the predicted path are fetched.
The fetched instructions can also be executed; this is called speculative execution.
- Results produced by these executions should be marked as tentative.

When the branch outcome is decided, if the prediction is correct, the special tags on the tentative results are removed.

If not, the tentative results are discarded, and execution continues on the other path.

Branch prediction can be based on static or dynamic information.
Static Branch Prediction

Predict always taken
- Assume that the jump will happen.
- Always fetch the target instruction.

time →  1   2   3   4   5   6   7   8   9
1       FI  DI  CO  FO  EI  WO
2           FI  DI  CO  FO  EI  WO
25              FI  DI  CO  FO  EI  WO
26                  FI  DI  CO  FO  EI  WO

Instruction 2 is “BRA 25 IF Zero”.
Static Branch Prediction (Cont’d)
sum = 0;                        //Java code
for (i = 0; i < 1000; i++)
    sum += a[i];

Array a has elements 0, 1, 2, …, 999. R0 holds sum; R1 points to the current element (the array starts at a memory location, e.g., 100).

      MOVE  R0, #0        -- sum
      MOVE  R1, #100      -- index
L1:   ADD   R0, (R1)
      ADD   R1, #1
      COMP  R1, #1100
      BNZ   L1
      MOVE  R0, sum

The prediction will be correct 99.9% of the time with this example!
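The 99.9% figure can be checked by replaying the loop-closing branch. The sketch models only the index register and the BNZ outcome, with the start address 100 and the bound 1100 taken from the example; the function name is illustrative.

```python
def always_taken_accuracy(start: int = 100, end: int = 1100):
    """Replay the BNZ at the end of the loop and score an always-taken prediction.

    Returns (correct_predictions, total_branches).
    """
    r1 = start
    outcomes = []
    while True:
        r1 += 1                      # ADD  R1, #1
        taken = (r1 - end) != 0      # COMP R1, #1100 followed by BNZ L1
        outcomes.append(taken)
        if not taken:
            break
    correct = sum(outcomes)          # "always taken" is right whenever BNZ is taken
    return correct, len(outcomes)
```

The branch is executed 1000 times and taken 999 times, so only the final loop exit is mispredicted.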
Static Branch Prediction (Cont’d)

Predict never taken
- Assume that the jump will not happen.
- Always fetch the next instruction.

Predict by operation codes
- Some instructions are more likely to result in a jump than others.
  • BNZ (Branch if the result is Not Zero)
  • BEZ (Branch if the result Equals Zero)
- Can get up to 75% success.

Predict by relative positions
- Backward-pointing branches will be taken (usually loop back).
- Forward-pointing branches will not be taken (often loop exit).
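Prediction by relative position needs only a comparison of two addresses. A minimal sketch, assuming the branch address and the target address are both known when the prediction is made (the function name is illustrative):

```python
def predict_by_position(branch_address: int, target_address: int) -> bool:
    """Return True (predict taken) for backward branches, False for forward ones."""
    return target_address < branch_address
```

A loop-closing branch at address 105 that jumps back to 102 is predicted taken, while a forward branch from 2 to 25 is predicted not taken.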
Dynamic Branch Prediction

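Based on branch history:
- Store information regarding branches in a branch-history table so as to predict the branch outcome more accurately.
- E.g., assume that the branch will do what it did last time.

One-Bit Predictor:
- State Not Taken (0): stays in this state on N, moves to Taken (1) on T.
- State Taken (1): stays in this state on T, moves to Not Taken (0) on N.

Two errors each time the loop is run, one at the T and one at the N that follows it:
…NNTNNNNNNTNN…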
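The behaviour of the one-bit predictor on such a pattern can be counted directly. This sketch assumes the predictor starts in the Not Taken state unless told otherwise; the function name is illustrative.

```python
def one_bit_mispredictions(outcomes: str, initial: str = "N") -> int:
    """Count mispredictions of a one-bit predictor over a string of 'T'/'N' outcomes."""
    state = initial              # last observed outcome, used as the next prediction
    misses = 0
    for outcome in outcomes:
        if outcome != state:
            misses += 1
        state = outcome          # remember what the branch did this time
    return misses
```

On the excerpt `NNTNNNNNNTNN` it mispredicts four times: at each T and again at the N immediately after it.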
Bimodal Prediction

Use 2-bit saturating counters to predict the most common direction, where the first bit indicates the prediction.

Branches evaluated as not taken (N) decrement the counter towards strongly not taken, and branches evaluated as taken (T) increment the counter towards strongly taken.

It tolerates a branch going in an unusual direction one time.
- A loop-closing branch is mispredicted once rather than twice.
…NNTNNNNNNTNN…

States and transitions:
Strongly Not Taken (00) <-> Not Taken (01) <-> Taken (10) <-> Strongly Taken (11)
- T (+1) moves one state to the right; in Strongly Taken (11), T keeps the state.
- N (-1) moves one state to the left; in Strongly Not Taken (00), N keeps the state.
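The two-bit saturating counter is short enough to simulate. The sketch encodes the four states as the integers 0 (00) to 3 (11) and predicts taken when the first bit is set, i.e. for 2 and 3; the function name and the default start state are assumptions.

```python
def bimodal_mispredictions(outcomes: str, counter: int = 0) -> int:
    """Count mispredictions of a 2-bit saturating counter over 'T'/'N' outcomes."""
    misses = 0
    for outcome in outcomes:
        predicted_taken = counter >= 2          # first bit set: states 10 and 11
        if predicted_taken != (outcome == "T"):
            misses += 1
        if outcome == "T":
            counter = min(3, counter + 1)       # saturate at Strongly Taken (11)
        else:
            counter = max(0, counter - 1)       # saturate at Strongly Not Taken (00)
    return misses
```

On the excerpt `NNTNNNNNNTNN`, starting from Strongly Not Taken, only the two T outcomes are mispredicted, half as many errors as a one-bit predictor makes on the same pattern.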
Summary

Instruction execution can be substantially accelerated by instruction pipelining.

A pipeline is organized as a succession of N stages. Ideally, N instructions can be active inside a pipeline.

Keeping a pipeline at its maximal rate is prevented by pipeline hazards.
- Structural hazards are due to resource conflicts.
- Data hazards are caused by data dependencies between instructions.
- Control hazards are a consequence of branch instructions.

Branches can dramatically affect pipeline performance.
- It is very important to reduce the penalties produced by them.
- (Dynamic) prediction is an efficient way to address this problem.