Chapter 1: Fundamentals of Quantitative Design and Analysis

Computer Architecture
A Quantitative Approach, Fifth Edition
Appendix C
Pipelining: Basic and
Intermediate Concepts
1
Basic Pipelining


Pipelining is the organizational implementation
technique that has been responsible for the most
dramatic increase in computer performance.
Overview of basic pipelining
 What is pipelining?
 Computing pipeline speedup
 Clocking pipelines
 Pipelining MIPS
 Pipeline hazards
 Handling interrupts.
2
Pipelining
3
Pipelining 3 Stages

Assume a 2 ns flip-flop delay
4
Pipelining: Computing the speedup


Time per instruction
• TPI = CPI × cycle time
• We can think about pipelining as reducing either CPI or cycle time
Ideal speedup
• Speedup = TPI_{without pipeline} / TPI_{with pipeline} = number of pipeline stages
• Requires that all stages be perfectly balanced
• No synchronization (latch, flip-flop) overhead
• No stall cycles
The speedup from a pipeline is limited
• CPI_{real} = CPI_{ideal} + CPI_{stall}
• CCT_{real} = Time_{longest pipestage} + Time_{latch overhead}
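The limits above can be tried out numerically. The following is a minimal sketch, assuming an unpipelined machine with CPI = 1 whose cycle is the sum of the stage delays; the function name, the 10 ns stage delays, and the 0.2 stall CPI are hypothetical values chosen for illustration (the 2 ns latch overhead echoes the flip-flop delay assumed on the earlier slide).

```python
def pipeline_speedup(stage_delays_ns, latch_overhead_ns, stall_cpi=0.0):
    """Speedup of a pipelined design over the same logic left unpipelined.

    Assumption: the unpipelined machine has CPI = 1 and a cycle equal to
    the sum of all stage delays.
    """
    tpi_unpipelined = sum(stage_delays_ns)
    # CCT_real = Time_longest pipestage + Time_latch overhead
    cycle_time = max(stage_delays_ns) + latch_overhead_ns
    # CPI_real = CPI_ideal + CPI_stall, with CPI_ideal = 1
    cpi_real = 1.0 + stall_cpi
    return tpi_unpipelined / (cpi_real * cycle_time)


# Hypothetical 3-stage pipeline, delays in nanoseconds.
balanced = pipeline_speedup([10, 10, 10], latch_overhead_ns=0)
with_latch = pipeline_speedup([10, 10, 10], latch_overhead_ns=2)
unbalanced = pipeline_speedup([8, 12, 10], latch_overhead_ns=2, stall_cpi=0.2)
print(balanced, with_latch, round(unbalanced, 2))
```

Only the perfectly balanced, overhead-free case reaches a speedup equal to the number of stages; latch overhead, imbalance, and stalls each reduce it.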


5
MIPS Instruction Formats
6
Basic MIPS Pipeline
7
Basic MIPS Pipeline (simplified)
8
Pipelining By Adding Registers
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: MIPS datapath divided into the five stages above. Labeled components: PC, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, ALU, data memory, adders, and multiplexers.]
9
MIPS Pipelined Execution
Instruction  1    2    3    4    5    6    7    8    9
i            IF   ID   EX   MEM  WB
i+1               IF   ID   EX   MEM  WB
i+2                    IF   ID   EX   MEM  WB
i+3                         IF   ID   EX   MEM  WB
i+4                              IF   ID   EX   MEM  WB
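The staircase pattern in the table above follows a simple rule: instruction i+k occupies stage number (cycle - 1 - k). A minimal sketch, assuming an ideal five-stage pipeline with no stalls; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction i+instr_index in a 1-based clock cycle.

    Returns None when the instruction is not in the pipeline in that cycle.
    """
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def total_cycles(num_instructions):
    """Cycles needed to drain num_instructions through an ideal pipeline."""
    return len(STAGES) + num_instructions - 1


# Print the diagram for instructions i .. i+4.
for k in range(5):
    row = [stage_in_cycle(k, c) or "" for c in range(1, total_cycles(5) + 1)]
    print(f"i+{k}".ljust(6), " ".join(s.ljust(4) for s in row))
```

Five instructions finish in 5 + 5 - 1 = 9 cycles, which matches the nine clock columns of the table.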
10
Rules for pipeline registers

Each stage must be independent, so inter-stage registers must hold
• Data values
• Control signals, including
  - Decoded instruction fields
  - MUX controls
  - ALU controls
Think of the register file as two independent units
• Read file, accessed in ID
• Write file, accessed in WB
There is no "final" set of pipeline registers after WB (no WB/IF register), because the instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC)
11
A More Accurate Pipeline Schematic
[Figure: the five-stage MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages. Labeled components: PC, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, ALU, data memory, adders, and multiplexers.]
12
Pipeline Dataflow: the details
Instruction types: Reg-Reg ALU, Reg-immed ALU, Load, Store, Branch, Jump
IF (all instruction types)
  IR2 = IMEM[PC]
  PC2 = PC = PC + 4
ID (all instruction types)
  A3 = Regs[IR2[25..21]]; B3 = Regs[IR2[20..16]]
  IR3 = IR2; PC3 = PC2
  IM3 = (IR2[15])^16 ## IR2[14..0]
EX
  Reg-Reg ALU:    ALU4 = A3 op B3;    IR4 = IR3; PC4 = PC3
  Reg-immed ALU:  ALU4 = A3 op IM3;   IR4 = IR3; PC4 = PC3
  Load, Store:    ALU4 = A3 + IM3;    IR4 = IR3; PC4 = PC3; MD4 = B3
  Branch:         ALU4 = PC3 + IM3;   CO4 = A3 op 0; IR4 = IR3; PC4 = PC3
  Jump:           ALU4 = PC3 + IM3;   IR4 = IR3; PC4 = PC3
MEM
  Reg-Reg ALU, Reg-immed ALU:  IR5 = IR4; PC5 = PC4
  Load:           WB5 = DMEM[ALU4]
  Store:          DMEM[ALU4] = MD4
  Branch:         IR5 = IR4; PC5 = PC4; if (CO4) PC = ALU4
  Jump:           IR5 = IR4; PC5 = PC4; PC = ALU4
WB
  Reg-Reg ALU, Reg-immed ALU, Load:  Din = WB
13
Problems with Pipelining (Dependencies and
Hazards)

Dependencies: a property of the program
• Data dependencies
  - Instruction j uses the result produced by instruction i
• Control dependencies
  - The execution of instruction j depends upon the result of instruction i
14
Dependencies and Hazards

Hazard: a result of dependencies being exposed in the pipeline
• Hazards lead to pipeline stalls or the execution of the wrong instruction
Data hazard
• An instruction depends upon the result of an instruction still in the pipeline
Structural hazard
• Two instructions try to use the same hardware resource in a single cycle
Control hazard
• Caused by the delay between fetching an instruction and deciding on changes in instruction flow



15
Structural hazards

When two instructions need to use the same hardware resource in the same cycle
• The resource is not duplicated
  - Register file write ports
• The resource is not fully pipelined, i.e., it takes more than one cycle
  - Division, floating point
Fix #1: Stall the later instruction
• Low cost, but increases average CPI
• Best used for rare events
• Examples:
  - MIPS R2000 multi-cycle multiply
  - SPARC V1 single memory port for instruction and data
Fix #2: Duplicate the resource
• Increases cost, but preserves CPI
• Best used for cheap resources and/or frequent events
16
Structural hazards, continued


Fix #2, continued: example resource duplication
• Separate instruction and data memory
• Separate ALU and PC adders
• Register files with multiple ports
Fix #3: Pipeline the expensive resource
• Moderate cost compared to duplication, expensive compared to stalling
• Best used for high-performance or specialty machines
  - Fully pipelined floating-point units for scientific machines
How to avoid structural hazards altogether
• Design the ISA so that each resource needed by an instruction
  - Is used once
  - Is always used in the same pipeline stage
  - Takes one cycle
• MIPS is designed with pipelining in mind; x86 is not
17
Types of Data Hazards

RAW (Read After Write)
• Only hazard for "fixed" pipelines
• Later instruction must read after the earlier instruction writes
WAW (Write After Write)
• Variable-length pipeline
• Later instruction must write after the earlier instruction writes
WAR (Write After Read)
• Pipeline with late read
• Later instruction must write after the earlier instruction reads
[Figure: stage timelines (F, R, A, M, W) for each hazard type, showing the relative positions of the read and write stages of the two instructions.]
We can also have data hazards through memory locations
18
Example RAW pipeline hazard
Time (in clock cycles)   CC 1   CC 2   CC 3   CC 4   CC 5     CC 6   CC 7   CC 8   CC 9
Value of register $2     10     10     10     10     10/-20   -20    -20    -20    -20
Program execution order (in instructions)
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
[Figure: pipeline diagram (IM, Reg, DM, Reg per instruction). sub writes $2 in CC 5; the and and or instructions read $2 in CC 3 and CC 4, before the new value has been written.]
19
Stall for RAW hazards

Detected in the ID stage by comparing the registers to be read with the registers to be written by the instructions currently in the EX, MEM, or WB stages
• Stall if a match is found
Relatively cheap: just needs some extra compare and control logic
• Increases the average CPI
• Would happen much too frequently
ADD R1, R2, R3    F  R  X  M  W                 (write data to R1 here, in W)
ADD R4, R1, R5       F  [bubble]  R  X  M  W    (read from R1 here, in R)
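The comparison performed in ID can be sketched in a few lines. This is a minimal illustration, assuming no forwarding and no same-cycle write-then-read in the register file, so a match against any of the EX, MEM, or WB stages forces a stall; the Instr record and must_stall function are hypothetical names, not part of any real simulator.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[int]      # register written, or None if none is written
    srcs: Tuple[int, ...]    # registers read


def must_stall(in_id, in_ex, in_mem, in_wb):
    """True if the instruction in ID reads a register that an older
    instruction still in EX, MEM, or WB has yet to write."""
    pending = {
        i.dest
        for i in (in_ex, in_mem, in_wb)
        if i is not None and i.dest is not None
    }
    return any(src in pending for src in in_id.srcs)


add1 = Instr("ADD", dest=1, srcs=(2, 3))   # ADD R1, R2, R3
add2 = Instr("ADD", dest=4, srcs=(1, 5))   # ADD R4, R1, R5

# add2 must wait in ID while add1 is anywhere between EX and WB.
print(must_stall(add2, add1, None, None))
print(must_stall(add2, None, None, None))
```

One comparator is needed per (source register, pipeline stage) pair, which is why the slide calls the scheme cheap.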
20
Stall type #1: Freeze the whole pipeline
Cycle   1    2    3    4    5    6    7    8    9    10   11
I       IF   ID   EX   MEM  WB   WB
I+1          IF   ID   EX   MEM  MEM  WB
I+2               IF   ID   EX   EX   MEM  WB
I+3                    IF   ID   ID   EX   MEM  WB
I+4                         IF   IF   ID   EX   MEM  WB
I+5                                   IF   ID   EX   MEM  WB
I+6                                        IF   ID   EX   MEM
• Freeze all pipe stages for one or more cycles, and suppress writeback
• Needs only one global stall signal, which suppresses latching in all pipeline stages
• Sometimes called a "fixed pipe" or "frozen pipe" stall
• Works for cache misses
• Will not work to remove pipeline hazards
21
Stall type #2: Delay completion of an instruction
Cycle      1     2     3     4     5     6     7     8     9     10    11
I          IF    ID    EX    MEM   WB
I+1              IF    ID    EX    MEM   WB
I+2                    IF    ID    Stall EX    MEM   WB
I+3                          IF    Stall ID    EX    MEM   WB
I+4                                Stall IF    ID    EX    MEM   WB
I+5                                            IF    ID    EX    MEM   WB
I+6                                                  IF    ID    EX    MEM
Bubble in:                         EX    MEM   WB
• Instruction progress stops for one cycle
• Earlier instructions continue towards completion
• Later instructions must suspend and make no more progress
• An "elastic pipe" stall
• Good when the need for stalling is only detected after decode, as for pipeline hazards
22
Bypass (Forwarding)




If data is available elsewhere in the pipeline, there is no
need to stall
Detect condition
Bypass (or forward) data directly to the consuming
pipeline stage
Bypass eliminates stalls for single-cycle operations
 Reduces longest stall to N-1 cycles for N-cycle
operations
23
Physical Forwarding Paths
Time (in clock cycles)   CC 1   CC 2   CC 3   CC 4   CC 5     CC 6   CC 7   CC 8   CC 9
Value of register $2     10     10     10     10     10/-20   -20    -20    -20    -20
Value of EX/MEM          X      X      X      -20    X        X      X      X      X
Value of MEM/WB          X      X      X      X      -20      X      X      X      X
Program execution order (in instructions)
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
[Figure: pipeline diagram (IM, Reg, DM, Reg per instruction) with forwarding paths from the EX/MEM and MEM/WB pipeline registers to the dependent instructions.]
The third forwarding operation might not be necessary
• if the register file supports read-after-write within a cycle, so that a read returns the value written in that same cycle
24
Example forwarding decisions

If EX has just finished an operation whose result the following instruction wants to read as either operand, we must forward
• If IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS1_Reg_Num
  then ALUmuxA = SelectALU4
• If IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS2_Reg_Num
  then ALUmuxB = SelectALU4
Need one comparison and multiplexer control for each forwarding path
Be careful: if you can forward from more than one instruction, choose the closest one in the pipeline
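The "choose the closest" rule amounts to a priority check: test the EX/MEM register first and fall back to MEM/WB. A minimal sketch for one ALU operand, assuming MIPS-style register numbering where register 0 always reads as zero and is therefore never forwarded; the function name and the string labels for the mux selection are hypothetical.

```python
def forward_select(src_reg, ex_mem_dest, ex_mem_writes, mem_wb_dest, mem_wb_writes):
    """Pick the source for one ALU operand.

    Returns "EX/MEM", "MEM/WB", or "REGFILE". EX/MEM is checked first
    because it holds the result of the closest (youngest) producer.
    """
    if ex_mem_writes and ex_mem_dest != 0 and ex_mem_dest == src_reg:
        return "EX/MEM"
    if mem_wb_writes and mem_wb_dest != 0 and mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


# add $9, $4, $2 in EX while "or $4, ..." is in MEM and "and $4, ..." is in WB:
# both older instructions write $4, so the closer one (EX/MEM) must win.
print(forward_select(4, ex_mem_dest=4, ex_mem_writes=True,
                     mem_wb_dest=4, mem_wb_writes=True))
# Operand $2 matches neither destination and comes from the register file.
print(forward_select(2, ex_mem_dest=4, ex_mem_writes=True,
                     mem_wb_dest=4, mem_wb_writes=True))
```

The same function would be evaluated once per ALU input, giving the two mux controls named ALUmuxA and ALUmuxB on the slide.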
25
Physical Forwarding Paths
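[Figure: pipelined datapath (PC, instruction memory, registers, ALU, data memory) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, and a forwarding unit. The IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd fields are carried into ID/EX as Rs, Rt, and Rd. The forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the multiplexers on the two ALU inputs.]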
26
Forwarding Animation (1)
[Figure: forwarding datapath at clock 3. Stage contents from IF to WB: or $4, $4, $2; and $4, $2, $5; sub $2, $1, $3; before<1>; before<2>.]
27
Forwarding Animation (2)
[Figure: forwarding datapath at clock 4. Stage contents from IF to WB: add $9, $4, $2; or $4, $4, $2; and $4, $2, $5; sub $2, . . .; before<1>.]
28
Forwarding Animation (3)
[Figure: forwarding datapath at clock 5. Stage contents from IF to WB: after<1>; add $9, $4, $2; or $4, $4, $2; and $4, . . .; sub $2, . . .]
29
Forwarding Animation (4)
[Figure: forwarding datapath at clock 6. Stage contents from IF to WB: after<2>; after<1>; add $9, $4, $2; or $4, . . .; and $4, . . .]
30
Other Data Hazards

WAR (Write After Read)
• Can happen if the instruction pipeline has early writes and/or late reads; something like:
    DIV (R1), ..      Suppose that it does not read the destination indirect until after the divide
    ADD ..,(R1)+      Incremented value of R1 is written before DIV has read the value of R1
• Cannot happen in DLX because all reads are early (ID) and all writes are late (WB)
WAW (Write After Write)
• Can happen when a fast operation follows a slow one, like:
    LW  R1, 0(R2)     IF   ID   EX   MEM  WB
    ADD R1, R2, R3         IF   ID   EX   WB
• Cannot happen in DLX (integer) because there is only one WB stage and instructions use it in order
31
One data hazard left
Time (in clock cycles): CC 1 to CC 9
Program execution order (in instructions)
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
[Figure: pipeline diagram (IM, Reg, DM, Reg per instruction). The and instruction needs $2 at the start of CC 4, but lw only produces it at the end of CC 4.]
Loaded data is not available until the end of MEM, which is too late for the following instruction
Forwarding cannot help, so we must stall, or just "decree" that you cannot write code like this. Such a decree is called a "delayed load" and was used in the original MIPS R2000
32
Stalling to interlock
Time (in clock cycles): CC 1 to CC 10
Program execution order (in instructions)
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
[Figure: pipeline diagram (IM, Reg, DM, Reg per instruction) in which a bubble is inserted after lw, so the and instruction and everything behind it are delayed by one cycle.]
33
The software fix: instruction scheduling to avoid stalls

Since a stall cannot be avoided when an instruction uses a loaded value immediately after the load, avoid the stall by rearranging the code ("pipeline scheduling"), if possible
Replace
    sub r4, r5, r7
    lw  r1, 50(r2)
    add r3, r1, r4
With
    lw  r1, 50(r2)
    sub r4, r5, r7
    add r3, r1, r4
This can improve the performance of a simple RISC machine
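The effect of the reordering can be checked with a small stall counter. This is a rough sketch, assuming a five-stage pipeline with full forwarding in which the only remaining data stall is one cycle when an instruction uses a loaded register immediately after the lw; the tuple encoding (op, dest, srcs) and the function names are hypothetical.

```python
PIPELINE_DEPTH = 5


def count_load_use_stalls(program):
    """Count one stall for each instruction that reads the destination
    of the lw immediately before it. program holds (op, dest, srcs) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


def total_cycles(program):
    """Cycles to run the program: fill time plus one per instruction plus stalls."""
    return PIPELINE_DEPTH + len(program) - 1 + count_load_use_stalls(program)


original = [("sub", 4, (5, 7)), ("lw", 1, (2,)), ("add", 3, (1, 4))]
scheduled = [("lw", 1, (2,)), ("sub", 4, (5, 7)), ("add", 3, (1, 4))]
print(total_cycles(original), total_cycles(scheduled))
```

Moving the independent sub between the lw and the add fills the slot that would otherwise hold a bubble, so the scheduled version finishes one cycle earlier.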
34
The software fix: instruction scheduling to avoid stalls

But it is limited


Usually limited to basic blocks between branches, 5-7 instructions
Difficult to reorder accesses to variables referenced indirectly (pointer, array, or parameter) due to the risk of aliases.
35
Branches and jumps



Control point: the stage where both the target and the condition are known
• MEM control point
Branch penalty: number of pipeline stages to the control point
Cycle     1     2     3     4     5     6     7     8     9
Branch I  IF    ID    EX    MEM   WB
I+1             IF    Stall Stall IF    ID    EX    MEM   WB
I+2                                     IF    ID    EX    MEM
I+3                                           IF    ID    EX
This 3-cycle penalty works, but since branches occur every 5-7 instructions, it kills performance. What to do?
• Determine the branch condition earlier than EX
• Compute the target address earlier than MEM
36
Characteristics of MIPS branches and jumps

The branch condition
• Only has EQ/NE comparison to zero
• Fast and cheap, no need for a full ALU
• Use a 32-bit NOR gate instead
The branch target
• Always PC-relative
• Needs only a 16-bit adder (and carry propagation)
The jump target
• Pseudo-direct: keeps the upper four PC bits, no adder needed
• Target = {PC[31:28], offset, 00}
All can be moved to the ID stage, at the cost of additional hardware (and maybe increased cycle time)
• Still requires one stall
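Both target computations are simple enough to write out. A minimal sketch, assuming the pc argument is the already-incremented PC (PC + 4) and that the branch offset is a 16-bit signed word offset; the function names are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """PC-relative target: PC plus the sign-extended offset shifted left 2."""
    return (pc + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, offset26):
    """Target = {PC[31:28], offset, 00}: concatenation only, no addition."""
    return (pc & 0xF0000000) | ((offset26 & 0x03FFFFFF) << 2)


print(hex(branch_target(0x00400004, 0xFFFF)))   # offset of -1 word
print(hex(branch_target(0x00400004, 3)))        # offset of +3 words
print(hex(jump_target(0x10400004, 0x0100000)))
```

The branch needs an adder, while the jump only rearranges bits, which is part of why both fit into the ID stage.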
37
Pipelining and Branch ISA Design


Simple branches
 Makes ID control point possible
 Maybe increases cycle time
 1 cycle penalty
Complex branches
 Requires EX control point
 Maybe lower cycle time
 2 cycle branch penalty
38
Reducing branch penalties (1)

Predict that the branch will not be taken
• Continue fetching from sequential addresses
• Cancel later if the branch was taken
• Easy to do
  - If the branch is not taken, continue
  - If it is taken, change the following instruction into a NOP and thus take a 1-cycle penalty
• Helps a little, but bets the wrong way for loops
39
Reducing branch penalties (2)

Predict that the branch will be taken



Only useful if the target address is known
before the branch condition – not true for
MIPS
Cancel later if the branch was not taken
Always has some delay in fetching the branch
target
40
Reducing branch penalties (3)

Change the ISA: delay the effect of the
branch




Always execute the instruction(s) after the
branch or jump
Depends on the compiler to find something
useful to do in the branch delay slot(s).
An ugly dependence of ISA on implementation
– may change
Interaction with branch prediction, interrupts.
41
Filling the branch delay slot
a. From before
    add $s1, $s2, $s3
    if $s2 = 0 then
        [Delay slot]
  Becomes
    if $s2 = 0 then
        add $s1, $s2, $s3
b. From target
    sub $t4, $t5, $t6
    …
    add $s1, $s2, $s3
    if $s1 = 0 then
        [Delay slot]
  Becomes
    add $s1, $s2, $s3
    if $s1 = 0 then
        sub $t4, $t5, $t6
c. From fall through
    add $s1, $s2, $s3
    if $s1 = 0 then
        [Delay slot]
    sub $t4, $t5, $t6
  Becomes
    add $s1, $s2, $s3
    if $s1 = 0 then
        sub $t4, $t5, $t6
42
How useful are canceling branches
[Figure: bar chart of the percentage of conditional branches (0% to 50%) with a canceled delay slot or an empty slot, per benchmark: compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor.]
Integer: 35% of slots wasted
Floating point: 25% of slots wasted
43
Performance of Branch schemes?


Effective CPI = 1 + %branches × average branch penalty
For integer MIPS: 20% of instructions are branches or jumps; 70% of them go to the target
Strategy            Branch taken penalty   Branch not taken penalty   Effective CPI
Stall               3                      3                          1.60
Branch in ID        1                      1                          1.20
Predict taken       1                      1                          1.20
Predict not taken   1                      0                          1.14
Delay slot          0.5                    0.5                        1.10
Cancel branch       0.3                    0.3                        1.06
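The table can be reproduced directly from the formula. A minimal sketch, assuming the average branch penalty is the taken/not-taken penalties weighted by the 70%/30% split quoted above; the constant and function names are hypothetical.

```python
BRANCH_FREQ = 0.20      # fraction of instructions that are branches or jumps
TAKEN_FRACTION = 0.70   # fraction of those that go to the target


def effective_cpi(taken_penalty, not_taken_penalty):
    """Effective CPI = 1 + %branches x average branch penalty."""
    avg_penalty = (TAKEN_FRACTION * taken_penalty
                   + (1 - TAKEN_FRACTION) * not_taken_penalty)
    return 1 + BRANCH_FREQ * avg_penalty


# (taken penalty, not-taken penalty) per strategy, as listed in the table.
schemes = {
    "Stall": (3, 3),
    "Branch in ID": (1, 1),
    "Predict taken": (1, 1),
    "Predict not taken": (1, 0),
    "Delay slot": (0.5, 0.5),
    "Cancel branch": (0.3, 0.3),
}

for name, (taken, not_taken) in schemes.items():
    print(f"{name:<18} {effective_cpi(taken, not_taken):.2f}")
```

Only the predict-not-taken row depends on the 70% figure, since it is the only scheme whose two penalties differ.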
44
Pipeline example

Consider the following pipeline, which implements a MIPS-like ISA. The only variation on the MIPS ISA is the support of full register compares in branch instructions
Instruction  1        2        3        4        5        6        7        8        9        10       11
I            IF       ID       EX1      EX2/MEM1 MEM2     WB
I+1                   IF       ID       EX1      EX2/MEM1 MEM2     WB
I+2                            IF       ID       EX1      EX2/MEM1 MEM2     WB
I+3                                     IF       ID       EX1      EX2/MEM1 MEM2     WB
I+4                                              IF       ID       EX1      EX2/MEM1 MEM2     WB
I+5                                                       IF       ID       EX1      EX2/MEM1 MEM2     WB
45
The Pipeline stages
Stage      Function
IF         Instruction fetch
ID         Instruction decode; register fetch
EX1        Address generation (data and PC-target)
EX2/MEM1   ALU operation; branch condition resolution; first cycle of memory access
MEM2       Second cycle of memory access
WB         Register file writeback
46
Assumptions



Writes to the register file occur in the first
half of the clock cycle while reads from the
register file occur in the second half
All bypass paths have been implemented
to minimize pipeline stalls due to data
hazards
The pipeline implements hardware
interlocks
47
Questions

How many register file ports does the processor
need to minimize structural hazards?

Indicate all forwarding required to minimize stalls in the given pipeline. Also, specify the minimum number of comparators needed to implement forwarding.

What is the worst case delay due to RAW data
hazards?

What is the branch delay of this pipeline?
48
Instruction Dependencies

The frequencies in the table are presented as percentages of all instructions executed
Type  First instruction  Following instruction                  Frequency
1     ALUop Rx,-,-       ALUop -,-,Rx or ALUop -,Rx,-           10%
2     ALUop Rx,-,-       Store Rx,-(-)                          5%
3     ALUop Rx,-,-       Load -,-(Rx) or Store -,-(Rx)          5%
4     ALUop Rx,-,-       JumpRegister Rx                        1%
5     ALUop Rx,-,-       Branch Rx,-,# or Branch -,Rx,#         2%
6     Load Rx,-(-)       ALUop -,-,Rx or ALUop -,Rx,-           15%
7     Load Rx,-(-)       Load -,-(Rx) or Store -,-(Rx)          3%
8     Load Rx,-(-)       Branch Rx,-,# or Branch -,Rx,#         2%
9     Load Rx,-(-)       JumpRegister Rx                        1%
49
More Questions

List the instruction sequences from the previous table
that cause data stalls in the pipeline. Indicate the
corresponding number of stall cycles.

Compute the CPI for the pipeline due to data hazards
only. Ignore instruction sequences that are not listed in
the table

If the frequency of conditional branches is 10% of which
65% are taken and the frequency of unconditional
branches is 6%, compute the overall CPI assuming a
TAKEN branch prediction scheme.
50
Summary

Pipelining: overlaps execution of instructions
Problem: structural, data, and control hazards
• Hazards occur if there are dependences and the pipeline exposes them
• Common solutions: stall, forwarding, scheduling
Performance
• Improves instruction throughput, which reduces the latency of a long program
• CPI_{real} = CPI_{ideal} + Stalls_{structural} + Stalls_{data} + Stalls_{control}
• Cycle time_{real} = Time_{longest pipestage} + Register overhead
What makes pipelining easier
• Simple instructions (load-stores, branches)
• Fixed length, encoding with few formats
51