CPE 731 - Pipelining Review

advertisement
CPE 731 Advanced Computer
Architecture
Pipelining Review
Dr. Gheith Abandah
Adapted from the slides of Prof. David Patterson, University of
California, Berkeley
Outline
•
•
•
•
•
•
•
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Multi-cycle Operations
Conclusion
3/16/2016
CPE 731, Pipelining
2
Five-Stage Pipelining
Figure A.2, Page A-8
Time (clock cycles)
3/16/2016
Reg
DMem
Ifetch
Reg
DMem
Reg
ALU
DMem
Reg
ALU
O
r
d
e
r
Ifetch
ALU
I
n
s
t
r.
ALU
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Ifetch
Ifetch
Reg
CPE 731, Pipelining
Reg
Reg
DMem
Reg
3
Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next instruction
from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still
in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches
and jumps).
3/16/2016
CPE 731, Pipelining
4
One Memory Port/Structural Hazards
Figure A.4, Page A-14
Time (clock cycles)
3/16/2016
Reg
DMem
Reg
DMem
Reg
DMem
Reg
ALU
Instr 4
Ifetch
ALU
Instr 3
DMem
ALU
O
r
d
e
r
Instr 2
Reg
ALU
I Load Ifetch
n
s
Instr 1
t
r.
ALU
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Ifetch
Ifetch
Reg
Ifetch
CPE 731, Pipelining
Reg
Reg
Reg
Reg
DMem
5
One Memory Port/Structural Hazards
(Similar to Figure A.5, Page A-15)
Time (clock cycles)
Stall
DMem
Ifetch
Reg
DMem
Reg
ALU
Ifetch
Bubble
Instr 3
Reg
Reg
DMem
Bubble Bubble
Ifetch
Reg
Reg
Bubble
ALU
O
r
d
e
r
Instr 2
Reg
ALU
I Load Ifetch
n
s
Instr 1
t
r.
ALU
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7
Bubble
Reg
DMem
How do you “bubble” the pipe?
3/16/2016
CPE 731, Pipelining
6
Speed Up Equation for Pipelining
CPIpipelined  Ideal CPI  Average Stall cycles per Inst
Cycle Timeunpipelined
Ideal CPI  Pipeline depth
Speedup 

Ideal CPI  Pipeline stall CPI
Cycle Timepipelined
For simple RISC pipeline, CPI = 1:
Cycle Timeunpipelined
Pipeline depth
Speedup 

1  Pipeline stall CPI
Cycle Timepipelined
3/16/2016
CPE 731, Pipelining
7
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
3/16/2016
CPE 731, Pipelining
8
Data Hazard on R1
Figure A.6, Page A-17
Time (clock cycles)
and r6,r1,r7
or
r8,r1,r9
Ifetch
DMem
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
ALU
sub r4,r1,r3
Reg
ALU
Ifetch
ALU
O
r
d
e
r
add r1,r2,r3
xor r10,r1,r11
3/16/2016
WB
ALU
I
n
s
t
r.
MEM
ALU
IF ID/RF EX
CPE 731, Pipelining
Reg
Reg
Reg
Reg
DMem
9
Reg
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler
nomenclature). This hazard results from an actual
need for communication.
3/16/2016
CPE 731, Pipelining
10
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
3/16/2016
CPE 731, Pipelining
11
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
3/16/2016
CPE 731, Pipelining
12
Outline
•
•
•
•
•
•
•
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Multi-cycle Operations
Conclusion
3/16/2016
CPE 731, Pipelining
13
Forwarding to Avoid Data Hazard
Figure A.7, Page A-19
or
r8,r1,r9
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
ALU
and r6,r1,r7
Ifetch
DMem
ALU
sub r4,r1,r3
Reg
ALU
O
r
d
e
r
add r1,r2,r3 Ifetch
ALU
I
n
s
t
r.
ALU
Time (clock cycles)
xor r10,r1,r11
3/16/2016
CPE 731, Pipelining
Reg
Reg
Reg
Reg
DMem
14
Reg
HW Change for Forwarding
Figure A.23, Page A-37
NextPC
mux
MEM/WR
EX/MEM
ALU
mux
ID/EX
Registers
Data
Memory
mux
Immediate
What circuit detects and resolves this hazard?
3/16/2016
CPE 731, Pipelining
15
Forwarding to Avoid LW-SW Data Hazard
Figure A.8, Page A-20
or
r8,r6,r9
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
ALU
sw r4,12(r1)
Ifetch
DMem
ALU
lw r4, 0(r1)
Reg
ALU
O
r
d
e
r
add r1,r2,r3 Ifetch
ALU
I
n
s
t
r.
ALU
Time (clock cycles)
xor r10,r9,r11
3/16/2016
CPE 731, Pipelining
Reg
Reg
Reg
Reg
DMem
16
Reg
Data Hazard Even with Forwarding
Figure A.9, Page A-21
and r6,r1,r7
or
3/16/2016
DMem
Ifetch
Reg
DMem
Reg
Ifetch
Ifetch
r8,r1,r9
CPE 731, Pipelining
Reg
Reg
Reg
DMem
ALU
O
r
d
e
r
sub r4,r1,r6
Reg
ALU
lw r1, 0(r2) Ifetch
ALU
I
n
s
t
r.
ALU
Time (clock cycles)
Reg
DMem
17
Reg
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
Reg
DMem
Ifetch
Reg
Bubble
Ifetch
Bubble
Reg
Bubble
Ifetch
and r6,r1,r7
or r8,r1,r9
Reg
DMem
Reg
Reg
Reg
DMem
ALU
sub r4,r1,r6
Ifetch
ALU
O
r
d
e
r
lw r1, 0(r2)
ALU
I
n
s
t
r.
ALU
Time (clock cycles)
DMem
How is this detected?
3/16/2016
CPE 731, Pipelining
18
Software Scheduling to Avoid Load
Hazards
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory.
Slow code:
LW
LW
ADD
SW
LW
LW
SUB
SW
Rb,b
Rc,c
Ra,Rb,Rc
Ra,a
Re,e
Rf,f
Rd,Re,Rf
Rd,d
Fast code:
LW
LW
LW
ADD
LW
SW
SUB
SW
Rb,b
Rc,c
Re,e
Ra,Rb,Rc
Rf,f
Ra,a
Rd,Re,Rf
Rd,d
Compiler optimizes for performance. Hardware checks for safety.
3/16/2016
CPE 731, Pipelining
19
Outline
•
•
•
•
•
•
•
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Multi-cycle Operations
Conclusion
3/16/2016
CPE 731, Pipelining
20
Ifetch
Reg
DMem
Ifetch
Reg
DMem
Ifetch
Reg
Reg
Ifetch
r6,r1,r7
DMem
ALU
18: or
Reg
ALU
14: and r2,r3,r5
DMem
ALU
Ifetch
ALU
10: beq r1,r3,36
ALU
Control Hazard on Branches
Three Stage Stall
22: add r8,r1,r9
36: xor r10,r1,r11
Reg
Reg
Reg
What do you do with the 3 instructions in between?
How do you do it?
Where is the “commit”?
3/16/2016
CPE 731, Pipelining
21
Reg
DMem
Branch Stall Impact
• If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or  0
• MIPS Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
3/16/2016
CPE 731, Pipelining
22
Pipelined MIPS Datapath
Figure A.24, page A-38
Instruction
Fetch
Memory
Access
Write
Back
Adder
Adder
MUX
Next
SEQ PC
Next PC
Zero?
RS1
MUX
MEM/WB
Data
Memory
EX/MEM
ALU
MUX
Imm
ID/EX
Reg File
IF/ID
Memory
Address
RS2
WB Data
4
Execute
Addr. Calc
Instr. Decode
Reg. Fetch
Sign
Extend
RD
RD
RD
• Interplay of instruction set design and cycle time.
3/16/2016
CPE 731, Pipelining
23
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
–
–
–
–
–
Execute successor instructions in sequence
“Squash” instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% MIPS branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
» MIPS still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
3/16/2016
CPE 731, Pipelining
24
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target if taken
Branch delay of length n
– 1 slot delay allows proper decision and branch target
address in 5 stage pipeline
– MIPS uses this
3/16/2016
CPE 731, Pipelining
25
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch
add $1,$2,$3
if $2=0 then
delay slot
becomes
if $2=0 then
add $1,$2,$3
B. From branch target
sub $4,$5,$6
C. From fall through
add $1,$2,$3
if $1=0 then
delay slot
add $1,$2,$3
if $1=0 then
delay slot
sub $4,$5,$6
becomes
sub $4,$5,$6
becomes
add $1,$2,$3
if $1=0 then
add $1,$2,$3
if $1=0 then
sub $4,$5,$6
sub $4,$5,$6
• A is the best choice, fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
3/16/2016
CPE 731, Pipelining
26
Delayed Branch
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful
in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: As processor go to
deeper pipelines and multiple issue, the branch
delay grows and need more than one delay slot
– Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic approaches
relatively cheaper
3/16/2016
CPE 731, Pipelining
27
Evaluating Branch Alternatives
Pipeline speedup =
Pipeline depth
1 +Branch frequency Branch penalty
Assume 4% unconditional branch, 6% conditional branchuntaken, 10% conditional branch-taken
Scheduling
Branch CPI speedup v. speedup v.
scheme
penalty
unpipelined
stall
Stall pipeline
3 1.60
3.1
1.0
Predict taken
1 1.20
4.2
1.33
Predict not taken
1 1.14
4.4
1.40
Delayed branch
0.5 1.10
4.5
1.45
3/16/2016
CPE 731, Pipelining
28
Outline
•
•
•
•
•
•
•
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Multi-cycle Operations
Conclusion
3/16/2016
CPE 731, Pipelining
29
Problems with Pipelining
• Exception: An unusual event happens to an
instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the
processor to a new instruction stream
– Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting)
• Problem: It must appear that the exception or
interrupt must appear between 2 instructions (Ii
and Ii+1)
– The effect of all instructions up to and including Ii is totalling
complete
– No effect of any instruction after Ii can take place
• The interrupt (exception) handler either aborts
program or restarts at instruction Ii+1
3/16/2016
CPE 731, Pipelining
30
Precise Exceptions in Static Pipelines
Key observation: architected state only
change in memory and register write stages.
3/16/2016
CPE 731, Pipelining
31
Outline
•
•
•
•
•
•
•
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Multi-cycle Operations
Conclusion
3/16/2016
CPE 731, Pipelining
32
Multi-cycle Operations
3/16/2016
CPE 731, Pipelining
33
Multi-cycle Operations - Example
3/16/2016
CPE 731, Pipelining
34
Multi-cycle Operations - Example
Unit
Integer ALU
0
Initiation
Interval
1
Memory
1
1
FP Adder
3
1
FP Multiplier
6
1
FP Divider
24
25
3/16/2016
Latency
CPE 731, Pipelining
35
Problems
1/5
1. Structural hazard on divide
•
Solution: Detect at ID stage and stall
2. May have multiple register writes
Example:
mul.d f0, f4, f6 F
...
...
add.d f2, f4, f6
...
...
ld.d f2, 0(r2)
3/16/2016
D
F
M1 M2 M3
D E M
F D E
F D
F
CPE 731, Pipelining
M4
W
M
A1
D
F
M5 M6 M7 M
W
A2
E
D
F
A3
M
E
D
A4 M
W
M W
E M
W
W
W
36
Problems
2/5
2. May have multiple register writes
Solution:
Use shift register to reserve when need to write and stall
on conflict
Before
add.d
ld.d
ld.d
ld.d
3/16/2016
|0|0|0|0|
|0|0|0|1|
|1|0|1|0|
|1|1|0|0|
|1|0|0|0|
Need to stall this
CPE 731, Pipelining
37
Problems
3/5
3. There are long stalls due to RAW dependencies
Example:
ld.d
f4,
mul.d f0,
add.d f2,
s.D
f2,
0(r2)
f4, f6
f0, f8
0(r2)
1-cycle stall
long stall
long stall
Solution:
Mark registers pending as destinations, then in ID stage
stall any instruction that uses the marked registers.
3/16/2016
CPE 731, Pipelining
38
Problems
4/5
4. WAW hazards are possible
Example:
mul.d f0, f4, f6
add.d f0, f6, f8
Solution:
Stall the second instruction if its destination register is
pending as a destination.
3/16/2016
CPE 731, Pipelining
39
Problems
5/5
5. Out-of-order completion creates problems in
restarting after exceptions
Example:
div.d f0, f4, f6
ld.d
f2, 0(r2)
Assume a div by 0 exception
Finishes before div
Solution:
1) Accept imprecise exceptions
2) Buffer results and write-back in order.
3/16/2016
CPE 731, Pipelining
40
Outline
•
•
•
•
•
•
•
5 stage pipelining
Structural and Data Hazards
Forwarding
Branch Schemes
Exceptions and Interrupts
Multi-cycle Operations
Conclusion
3/16/2016
CPE 731, Pipelining
41
And In Conclusion: Control and Pipelining
• Just overlap tasks; easy if tasks are independent
• Speed Up  Pipeline Depth; if ideal CPI is 1, then:
Cycle Timeunpipelined
Pipeline depth
Speedup 

1  Pipeline stall CPI
Cycle Timepipelined
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW,WAR,WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction
• Exceptions, Interrupts add complexity
• Multi-cycle operations introduce several problems
3/16/2016
CPE 731, Pipelining
42
Download