Pipeline Control and Performance (Chapter 6) ELEC 5200-001/6200-001 Computer Architecture and Design

ELEC 5200-001/6200-001
Computer Architecture and Design
Spring 2016
Pipeline Control and Performance
(Chapter 6)
Vishwani D. Agrawal
James J. Danaher Professor
Department of Electrical and Computer Engineering
Auburn University, Auburn, AL 36849
http://www.eng.auburn.edu/~vagrawal
vagrawal@eng.auburn.edu
Spr 2016, Mar 9 . . .
ELEC 5200-001/6200-001 Lecture 7
1
Pipelined Datapath (without Jump)
[Figure: five-stage pipelined datapath with PC, instruction memory, register file, sign extension, ALU and data memory, separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Instruction fields: opcode in bits 26-31, register fields in bits 21-25 and 16-20, immediate in bits 0-15. The write register comes from bits 16-20 for I-type lw and bits 11-15 for R-type.]
Mem. and Reg. File Need Controls
[Figure: the pipelined datapath with control inputs added: MemRead and MemWrite on the data memory, RegWrite on the register file.]
Multiplexers Need Controls
[Figure: the pipelined datapath with multiplexer controls added: PCSrc (shown with Branch and the ALU zero output), ALUSrc, RegDst and MemtoReg.]
ALU Needs a Control
[Figure: the pipelined datapath with an ALU control block (ALU cont.) added, driven by ALUOp and instruction bits 0-5.]
Compare with Single-Cycle Control
Control signals are the same as those
needed for a single-cycle datapath.
Control signals are generated using the
Opcode in the ID (instruction decode)
cycle and then distributed to other cycles.
Let us reexamine the implementation of
the single-cycle control (slides 19-21 of
Lecture 5).
Hardwired CU: Single-Cycle
Implemented by combinational logic.
[Figure: the 6-bit opcode feeds the control logic, which sends control signals to the datapath. The 2-bit ALUOp from the control logic and the 6-bit funct. code feed the ALU control, which sends a 3-bit code to the ALU.]
Single-cycle datapath
[Figure: single-cycle datapath with a CONTROL block driven by the opcode (bits 26-31) and an ALU control block driven by ALUOp and bits 0-5. Control signals shown: RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. The jump address is formed from bits 0-25 shifted left 2.]
Single-Cycle Control Logic
Instr.  Inputs: opcode       Outputs
type    (instruction bits)
        31 30 29 28 27 26    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
R        0  0  0  0  0  0    1       0       0         1         0        0         0       1       0       0
lw       1  0  0  0  1  1    0       1       1         1         1        0         0       0       0       0
sw       1  0  1  0  1  1    X       1       X         0         0        1         0       0       0       0
beq      0  0  0  1  0  0    X       0       X         0         0        0         1       0       1       0
J        0  0  0  0  1  0    X       X       X         0         X        0         X       X       X       1
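The truth table for the single-cycle control can be read as a lookup from opcode to control outputs. The sketch below is only an illustration of that reading: the names SIGNALS, CONTROL_TABLE and decode are hypothetical, and don't-cares are kept as the character "X" rather than being resolved to 0 or 1.

```python
# Control outputs, in the column order of the single-cycle truth table.
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0", "Jump")

# Opcode (instruction bits 31-26) -> (instruction type, output values).
# "X" marks a don't-care entry. These names are illustrative only.
CONTROL_TABLE = {
    0b000000: ("R",   "1001000100"),
    0b100011: ("lw",  "0111100000"),
    0b101011: ("sw",  "X1X0010000"),
    0b000100: ("beq", "X0X0001010"),
    0b000010: ("J",   "XXX0X0XXX1"),
}


def decode(instruction):
    """Return (type, {signal: value}) for a 32-bit instruction word."""
    opcode = (instruction >> 26) & 0x3F
    name, bits = CONTROL_TABLE[opcode]
    return name, dict(zip(SIGNALS, bits))


# lw $2, 20($1): opcode 100011, Rs = 1, Rt = 2, immediate = 20
name, ctrl = decode((0b100011 << 26) | (1 << 21) | (2 << 16) | 20)
print(name, ctrl["MemRead"], ctrl["RegWrite"])
```

A hardwired control unit implements the same mapping with combinational gates; the dictionary here only mirrors the table row by row.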
Single-Cycle Control Circuit
[Figure: logic circuit. The inputs Op5 to Op0 are decoded into one line per instruction type (R, lw, sw, beq, J); these lines produce the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0 and Jump.]
ALU Control Logic
Instr.   Inputs from CU    Funct. code from IR (bits 0-5)   Outputs to ALU
type     ALUOp1  ALUOp0    F5  F4  F3  F2  F1  F0           3-bit code  Operation
lw, sw   0       0         X   X   X   X   X   X            010         Add
B        0       1         X   X   X   X   X   X            110         Subtract
R        1       X         X   X   0   0   0   0            010         Add
R        1       X         X   X   0   0   1   0            110         Subtract
R        1       X         X   X   0   1   0   0            000         AND
R        1       X         X   X   0   1   0   1            001         OR
R        1       X         X   X   1   0   1   0            111         slt
ALU Control
[Figure: the ALU control block takes ALUOp1 and ALUOp0 from the control circuit and funct bits F3 to F0, and produces the 3-bit operation select for the ALU. ALU outputs: zero, result, overflow.]
Operation select   ALU function
000                AND
001                OR
010                Add
110                Subtract
111                Set on less than
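As a rough illustration of the ALU control logic, the function below maps ALUOp and the funct field to the 3-bit operation select. The function name alu_control and the string encoding of the result are assumptions made for this sketch; only the bit patterns come from the tables.

```python
def alu_control(alu_op1, alu_op0, funct):
    """Return the 3-bit ALU operation select as a string.

    alu_op1, alu_op0: ALUOp bits from the main control unit.
    funct: 6-bit funct field (instruction bits 0-5).
    """
    if alu_op1 == 0 and alu_op0 == 0:
        return "010"  # lw, sw: add to form the memory address
    if alu_op1 == 0 and alu_op0 == 1:
        return "110"  # branch: subtract to compare the operands
    # R-type (ALUOp1 = 1): F5 and F4 are don't-cares, so use F3..F0 only.
    r_type = {
        0b0000: "010",  # Add
        0b0010: "110",  # Subtract
        0b0100: "000",  # AND
        0b0101: "001",  # OR
        0b1010: "111",  # Set on less than
    }
    return r_type[funct & 0xF]


# sub has funct code 100010, so an R-type sub selects Subtract.
print(alu_control(1, 0, 0b100010))
```

Masking with 0xF reflects the don't-care entries for F5 and F4 in the table; a funct value outside the five listed rows raises KeyError in this sketch.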
Returning to Pipelined Control
Opcode input to control is supplied by the pipeline
register IF/ID in the ID (instruction decode) cycle.
Nine control signals are generated in the ID cycle,
but none is used. They are saved in the pipeline
register ID/EX.
ALUSrc, RegDst and ALUOp (2 bits) are used in the
EX (execute) cycle. Remaining 5 control signals are
saved in the pipeline register EX/MEM.
Branch, MemWrite and MemRead are used in the
MEM (memory access) cycle. Remaining 2 control
signals are saved in the pipeline register MEM/WB.
MemtoReg and RegWrite are used in the WB (write
back) cycle.
Pipelined control is shown without Jump.
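The way the nine control bits thin out from stage to stage can be sketched as follows. The grouping of signals is taken from the text above; the function split_control and the dictionary representation of a pipeline register are hypothetical.

```python
# Signals consumed in each stage, as listed in the text.
EX_SIGNALS = ("ALUSrc", "RegDst", "ALUOp1", "ALUOp0")
MEM_SIGNALS = ("Branch", "MemWrite", "MemRead")
WB_SIGNALS = ("MemtoReg", "RegWrite")


def split_control(id_ex):
    """Follow one instruction's control bits from ID/EX through to WB.

    id_ex: dict with the nine control bits saved in ID/EX.
    Returns (used in EX, saved in EX/MEM, used in MEM, saved in MEM/WB).
    """
    used_in_ex = {s: id_ex[s] for s in EX_SIGNALS}
    ex_mem = {s: v for s, v in id_ex.items() if s not in EX_SIGNALS}
    used_in_mem = {s: ex_mem[s] for s in MEM_SIGNALS}
    mem_wb = {s: v for s, v in ex_mem.items() if s not in MEM_SIGNALS}
    return used_in_ex, ex_mem, used_in_mem, mem_wb


# Control bits for lw, taken from the single-cycle truth table.
lw = dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1,
          MemWrite=0, Branch=0, ALUOp1=0, ALUOp0=0)
used_in_ex, ex_mem, used_in_mem, mem_wb = split_control(lw)
print(len(lw), len(ex_mem), len(mem_wb))
```

Each pipeline register only needs to carry the bits that a later stage still uses, which is why the register widths shrink from nine to five to two control bits.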
Placing Control in Pipelined Datapath
[Figure: pipelined datapath with the CONTROL block in the ID stage, driven by the opcode from IF/ID. Control signals travel through ID/EX, EX/MEM and MEM/WB to the stages that use them: ALUSrc, ALUOp and RegDst in EX; Branch, MemRead and MemWrite in MEM, where PCSrc is formed; MemtoReg and RegWrite in WB.]
Highlighted Pipelined Control
[Figure: the pipelined datapath with control, with the control lines highlighted.]
Single-Cycle Performance
Assume
200 ps for memory access
100 ps for ALU operation
50 ps for register file read or write
Cycle time set according to longest instruction:

lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite
   = 200 + 50 + 100 + 200 + 50
   = 600 ps
Av. instruction execution time = clock cycle time
= 600 ps
Multicycle Performance
Consider SPECINT2000* instruction mix:
Instr.        Fraction   Cycles
lw            25%        5 cycles
sw            10%        4 cycles
branch        11%        3 cycles
jump          2%         3 cycles
ALU instr.    52%        4 cycles
Av. CPI
= 0.25×5 + 0.10×4 + 0.11×3 + 0.02×3 + 0.52×4
= 4.12
Clock cycle time determined from longest operation
(memory access) = 200 ps
Av. instruction execution time = 4.12×200 = 824 ps
*Set of benchmark programs used for performance evaluation, to be
discussed in a later lecture.
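The multicycle average can be checked with a few lines of arithmetic. This is a minimal sketch; the dictionary names and the helper average_cpi are illustrative, while the mix, cycle counts and 200 ps clock are the values quoted above.

```python
# Instruction mix and multicycle cycle counts quoted in the text.
MIX = {"lw": 0.25, "sw": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}
MULTICYCLE_CYCLES = {"lw": 5, "sw": 4, "branch": 3, "jump": 3, "alu": 4}
CLOCK_PS = 200  # set by the longest operation, a memory access


def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction over the mix."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


multicycle_cpi = average_cpi(MIX, MULTICYCLE_CYCLES)
multicycle_time_ps = multicycle_cpi * CLOCK_PS
print(round(multicycle_cpi, 2), round(multicycle_time_ps))
```

Even with a clock three times faster than the single-cycle design, the average instruction takes longer than 600 ps because most instructions need four or five cycles.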
Pipeline Performance
Neglect initial latency (reasonable for long programs).
One instruction completed every clock cycle unless delayed
by hazard. Average CPI:
Instr.   Average cycles   Remark
lw       1.5 cycles       2 cycles in 50% cases due to hazard
sw       1 cycle
ALU      1 cycle
branch   1.25 cycles      2 cycles in 25% cases due to hazard
jump     2 cycles
For SPECINT2000
Av. CPI
= 0.25×1.5 + 0.10×1 + 0.11×1.25 + 0.02×2.0 + 0.52×1
= 1.17
Clock cycle time (longest operation: memory access) = 200 ps
Av. instruction execution time = 1.17×200 = 234 ps
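The pipelined figure can be rebuilt from the hazard probabilities instead of the pre-averaged cycle counts. In this sketch, each instruction type is described by its share of the mix, its base cycle count, and the probability of one extra cycle; the tuple layout and names are assumptions for illustration.

```python
# (fraction of mix, base cycles, probability of one extra hazard cycle)
PIPELINE_PROFILE = {
    "lw":     (0.25, 1, 0.50),  # 2 cycles in 50% of cases
    "sw":     (0.10, 1, 0.00),
    "branch": (0.11, 1, 0.25),  # 2 cycles in 25% of cases
    "jump":   (0.02, 2, 0.00),
    "alu":    (0.52, 1, 0.00),
}
CLOCK_PS = 200
SINGLE_CYCLE_PS = 600

pipeline_cpi = sum(frac * (base + p_extra)
                   for frac, base, p_extra in PIPELINE_PROFILE.values())
pipeline_time_ps = pipeline_cpi * CLOCK_PS
speedup = SINGLE_CYCLE_PS / pipeline_time_ps
print(round(pipeline_cpi, 2), round(speedup, 2))
```

The unrounded CPI is 1.1725, which the slide rounds to 1.17 before multiplying by 200 ps to get 234 ps. The speedup over the single-cycle design is therefore roughly 2.5, well short of the ideal factor set by the five pipeline stages.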
Comparing Alternatives
Type of datapath   Clock cycle   Average   Av. instruction
and control        time          CPI       execution time
Single-cycle       600 ps        1.00      600 ps
Multicycle         200 ps        4.12      824 ps
Pipelined          200 ps        1.17      234 ps
Other Controls for Pipeline
Forwarding
Stall
Branch hazard and branch prediction
Instruction flush
Exceptions
Forwarding
Consider a data hazard:
sub $2, $1, $3     # computes result in CC3, writes in $2 in CC5
and $12, $2, $5    # reads $2 in CC3, executes in CC4
[Pipeline diagram: sub passes through IF (CC1), ID (CC2), EX (CC3), MEM (CC4) and WB (CC5); and follows one cycle behind.]
CC3: ALU saves new data in EX/MEM, to be written to $2 in CC5.
CC3: and reads $2 to ID/EX, but the correct data is in EX/MEM.
CC4: forwarding allows execution of "and" with correct data.
Understanding Forwarding
Let's ask the following questions:
Q: Why is there a hazard?
A: Source register for the present instruction is the same as the destination register of the previous instruction.
Q: When is the source register data needed?
A: In the execute cycle (CC4).
Q: Is source register data available in CC4?
A: Yes – use forwarding. No – use stall.
Q: Where is the required data in CC4?
A: In the pipeline register EX/MEM as ALU output.
Forwarding Hardware
A forwarding unit is added to execute (ALU)
cycle hardware.
Functions of forwarding unit:
– Hazard detection
– Forward correct data to ALU
Inputs to forwarding unit:
– Source registers of the instruction in EX
– Destination registers of instructions in DM and WB
Outputs of forwarding unit: multiplexer controls
to route correct data to the ALU.
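A forwarding unit of this kind can be sketched as a pure function of the register numbers held in the pipeline registers. The slides do not spell out the comparison conditions, so the ones below follow the common textbook formulation and are assumptions: the RegWrite qualification, the exclusion of register $0, and the priority of EX/MEM over MEM/WB. The function name and the string labels for the multiplexer selections are also hypothetical.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_rd, ex_mem_regwrite,
                    mem_wb_rd, mem_wb_regwrite):
    """Return the source selected for each ALU input: (for Rs, for Rt).

    "REG"    : value read from the register file (no hazard)
    "EX/MEM" : forward the ALU output held in EX/MEM
    "MEM/WB" : forward the value held in MEM/WB
    """
    def select(src):
        # Assumed priority: the most recent result (EX/MEM) wins.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return "EX/MEM"
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
            return "MEM/WB"
        return "REG"

    return select(id_ex_rs), select(id_ex_rt)


# sub $2, $1, $3 followed by and $12, $2, $5:
# when "and" is in EX, "sub" is in MEM with destination register 2.
print(forwarding_unit(2, 5, ex_mem_rd=2, ex_mem_regwrite=True,
                      mem_wb_rd=0, mem_wb_regwrite=False))
```

The two returned selections correspond to the multiplexer controls that the forwarding unit drives at the ALU inputs.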
Recall Register Definitions
R-type instruction (add, sub, and, or, . . . )
opcode  Rs  Rt  Rd  shamt  funct
I-type instruction (beq, lw, sw, addi, . . . )
opcode  Rs  Rt  constant_or_address
J-type instruction (j, jal; jr is encoded as R-type)
opcode  address
where
Rs is the first source register
Rt is the second source register (for lw, Rt is the destination register)
Rd is the destination register of an R-type instruction
Forwarding Implemented
[Figure: pipelined datapath with a forwarding unit in the EX stage. Two multiplexers (MUX) at the ALU inputs are controlled by the forwarding unit, which receives Rs and Rt from ID/EX and Rd from EX/MEM and MEM/WB.]
Stall
Delay next instruction by sending nop through pipeline.
Necessary when hazard not resolved by forwarding.
lw $2, 20($1)
and $4, $2, $5
[Pipeline diagram, CC1 to CC6: lw is fetched in CC1 and and is fetched in CC2.]
CC4: new data in MEM/WB, to be written to $2.
CC4: execution of and is impossible; correct data unavailable until end of CC4.
Detecting Hazard Requiring Stall
Consider instruction in IF/ID being decoded:
If
Previous instruction (lw) activated MemRead, and
Instruction being decoded has a source register (Rs or
Rt) same as the destination register (Rt for lw) of the
previous instruction
Then, stall the pipeline:
Force all control outputs to 0
Prevent PC from changing
Prevent IF/ID from changing
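The stall condition above translates directly into a small function. This is a sketch under the stated rule only; the function name, argument names and the returned dictionary are hypothetical, and PCWrite and IF/IDWrite are modeled as booleans that are false while the pipeline is stalled.

```python
def hazard_detection(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction being decoded must stall.

    id_ex_memread: MemRead bit of the previous instruction (1 for lw).
    id_ex_rt:      destination register of that lw (its Rt field).
    if_id_rs/rt:   source registers of the instruction in IF/ID.
    """
    stall = bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "stall": stall,
        "PCWrite": not stall,        # prevent PC from changing
        "IF/IDWrite": not stall,     # prevent IF/ID from changing
        "zero_controls": stall,      # force all control outputs to 0
    }


# lw $2, 20($1) followed by and $4, $2, $5: Rs of "and" is register 2.
print(hazard_detection(1, 2, if_id_rs=2, if_id_rt=5)["stall"])
```

Forcing the control outputs to 0 turns the slot behind lw into a nop, and the frozen PC and IF/ID let the dependent instruction be decoded again one cycle later.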
Stall Implementation
[Figure: the forwarding datapath with a hazard detection unit added in the ID stage. Its inputs are MemRead and Rt from ID/EX and the Rs and Rt fields from IF/ID; its outputs are PCWrite, IF/IDWrite, and the select of a multiplexer that replaces the control outputs with 0.]
Stall
Execution with stall and forwarding:
lw $2, 20($1)
and $4, $2, $5
next
[Pipeline diagram, CC1 to CC7: lw proceeds normally; and is held for one cycle and a bubble (nop) is inserted ahead of it.]
State of IF/ID is frozen in CC3.
next is fetched twice since PC was frozen.
CC4: new data in MEM/WB, to be written to $2.
Branch Hazard
Consider the heuristic "branch not taken":
Continue fetching instructions in sequence
following the branch instruction.
If branch is taken (indicated by zero output of
ALU):
– Control generates the Branch signal in ID cycle.
– Branch activates the PCSrc signal in the MEM
cycle to load PC with new branch address.
– Three instructions in the pipeline must be flushed
if branch is taken – can this penalty be reduced?
Branch Hazard
[Figure: the pipelined datapath with control, annotated for a beq instruction. Branch and the ALU zero output determine PCSrc in the MEM stage.]
Branch Not Taken
Program order: Branch on condition to Z; A; B; C; D; . . . ; Z
Instr.   cycle b   cycle b+1   cycle b+2   cycle b+3                    cycle b+4
Branch   fetched   decoded     decision    PC keeps D (br. not taken)
A                  fetched     decoded     executed                     continues
B                              fetched     decoded                      executed
C                                          fetched                      decoded
D                                                                       fetched
Branch Taken
Program order: Branch on condition to Z; A; B; C; D; . . . ; Z
Instr.   cycle b   cycle b+1   cycle b+2   cycle b+3               cycle b+4
Branch   fetched   decoded     decision    PC gets Z (br. taken)
A                  fetched     decoded     executed                Nop
B                              fetched     decoded                 Nop
C                                          fetched                 Nop
Z                                                                  fetched
Three-cycle penalty: three instructions are flushed if branch is taken.
Branch Penalty Reduction
[Figure: the pipelined datapath with control, annotated for a beq instruction, illustrating how the branch penalty can be reduced.]
Branch Taken
Program order: Branch to Z; A; B; C; D; . . . ; Z
Instr.   cycle b   cycle b+1             cycle b+2   cycle b+3   cycle b+4
Branch   fetched   decision, PC gets Z
A                  fetched               flushed     Nop         Nop
Z                                        fetched     decoded     executed
One-cycle penalty: one instruction is flushed if branch is taken.
Pipeline Flush
If branch is taken (as indicated by zero), then
control does the following:
– Change all control signals to 0, similar to the case of stall
for data hazard, i.e., insert bubble in the pipeline.
– Generate a signal IF.Flush that changes the instruction in
the pipeline register IF/ID to 0 (nop).
Penalty of branch hazard is reduced by
– Adding branch detection and address generation
hardware in the decode cycle – one bubble needed – a
next address generation logic in the decode stage writes
PC+4, branch address, or jump address into PC.
– Using branch prediction.
– Unrolling loops.
Branch Prediction
Useful for program loops.
A one-bit prediction scheme: a one-bit buffer
carries a “history bit” that tells what happened on
the last branch instruction
History bit = 1, branch was taken
History bit = 0, branch was not taken
[State diagram: two states. State 1 predicts branch taken; state 0 predicts branch not taken. A taken branch leads to state 1 and a not-taken branch leads to state 0.]
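Branch Prediction
[Figure: branch prediction buffer. Low-order bits of the PC are used as index into a table holding the addresses of recent branch instructions, their target addresses and history bit(s). The stored address is compared (=) with the PC, and the prediction logic selects the next PC from PC+4 and the target address.]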
Branch Prediction for a Loop
Loop:
a: I = 0
b: I = I + 1
c: X = X + R(I)
d: I – 10 = 0?   (N: go to b, Y: go to e)
e: Store X in memory

Execution of Instruction d
Execution  Old hist.  Pred. next        Act. next  New hist.
seq.       bit        instr.       I    instr.     bit        Prediction
1          0          e            1    b          1          Bad
2          1          b            2    b          1          Good
3          1          b            3    b          1          Good
4          1          b            4    b          1          Good
5          1          b            5    b          1          Good
6          1          b            6    b          1          Good
7          1          b            7    b          1          Good
8          1          b            8    b          1          Good
9          1          b            9    b          1          Good
10         1          b            10   e          0          Bad
h.bit = 0 branch not taken, h.bit = 1 branch taken.
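The one-bit scheme is simple enough to simulate. The sketch below replays the loop branch at instruction d, which goes back to b nine times and then falls through to e, starting from history bit 0 as in the table. The function name and the encoding of outcomes as booleans (True for taken) are assumptions for illustration.

```python
def one_bit_predictor(outcomes, history_bit=0):
    """Simulate a one-bit branch predictor.

    outcomes: sequence of booleans, True when the branch is taken.
    Returns a list of (predicted_taken, actually_taken, correct).
    """
    log = []
    for taken in outcomes:
        predicted_taken = history_bit == 1
        log.append((predicted_taken, taken, predicted_taken == taken))
        history_bit = 1 if taken else 0  # remember only the last outcome
    return log


# Ten executions of instruction d: taken (to b) 9 times, then not taken.
loop_outcomes = [True] * 9 + [False]
one_bit_log = one_bit_predictor(loop_outcomes)
one_bit_errors = sum(not correct for _, _, correct in one_bit_log)
print(one_bit_errors, "errors out of", len(one_bit_log))
```

The two mispredictions fall on the first and last executions, matching the Bad rows of the table.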
Prediction Accuracy
One-bit predictor:
2 errors out of 10 predictions
Prediction accuracy = 80%
To improve prediction accuracy, use a two-bit predictor:
A prediction must be wrong twice before it is
changed
Two-Bit Prediction Buffer
Implemented as a two-bit counter.
Can improve correct prediction statistics.
[State diagram: four states. States 11 and 10 predict branch taken; states 00 and 01 predict branch not taken. Taken and not-taken outcomes move the buffer between states.]
Branch Prediction for a Loop
Loop:
1: I = 0
2: I = I + 1
3: X = X + R(I)
4: I – 10 = 0?   (N: go to 2, Y: go to 5)
5: Store X in memory

Execution of Instruction 4
Execution  Old pred.  Pred. next        Act. next  New pred.
seq.       buf        instr.       I    instr.     buf        Prediction
1          10         2            1    2          11         Good
2          11         2            2    2          11         Good
3          11         2            3    2          11         Good
4          11         2            4    2          11         Good
5          11         2            5    2          11         Good
6          11         2            6    2          11         Good
7          11         2            7    2          11         Good
8          11         2            8    2          11         Good
9          11         2            9    2          11         Good
10         11         2            10   5          10         Bad
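The two-bit buffer can be simulated in the same way. This sketch assumes a saturating two-bit counter, where a taken branch counts up and a not-taken branch counts down; that choice is consistent with the rows of the table, but the exact transitions of the state diagram are not recoverable from the text, so treat them as an assumption. The function name is hypothetical.

```python
def two_bit_predictor(outcomes, state=0b10):
    """Simulate a two-bit saturating-counter branch predictor.

    States 11 and 10 predict taken; 01 and 00 predict not taken.
    Returns (log, final_state) where log holds (old_state, correct).
    """
    log = []
    for taken in outcomes:
        predicted_taken = state >= 0b10
        log.append((format(state, "02b"), predicted_taken == taken))
        # Assumed transitions: count up on taken, down on not taken.
        state = min(state + 1, 0b11) if taken else max(state - 1, 0b00)
    return log, format(state, "02b")


# Ten executions of instruction 4: taken (to 2) 9 times, then not taken.
loop_outcomes = [True] * 9 + [False]
two_bit_log, final_state = two_bit_predictor(loop_outcomes)
two_bit_errors = sum(not correct for _, correct in two_bit_log)
print(two_bit_errors, "error out of", len(two_bit_log), "final state", final_state)
```

Starting from state 10, only the loop exit is mispredicted, and the buffer ends in state 10, so the next run of the loop would again start with a correct taken prediction.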
Exceptions
A typical exception occurs when ALU produces
an overflow signal.
Control takes the following actions on an exception:
– Change the PC address to 4000 0040hex. This is the
location of the exception routine. This is done by
adding an additional input to the PC input multiplexer.
– Overflow is detected in the EX cycle. Similar to data
hazard and pipeline flush:
  Set IF/ID to 0 (nop).
  Generate ID.Flush and EX.Flush signals to set all control
  signals to 0 in ID/EX and EX/MEM registers. This also
  prevents the ALU result (presumed contaminated) from being
  written in the WB cycle.