Lecture 6

CENG 450
Computer Systems and Architecture
Lecture 6
Amirali Baniasadi
amirali@ece.uvic.ca
Overview of Today’s Lecture
• MIPS
• Pipelining
CPU Pipelining Example:
• Theoretically:
  - Speedup should equal the number of stages (n tasks, k stages, latency p per task)
  - $\text{Speedup} = \frac{n \cdot p}{\frac{p}{k}(n-1) + p} \approx k$ (for large n)
• Practically:
  - Stages are imperfectly balanced
  - Pipelining adds overhead
  - Speedup is less than the number of stages
• If we have 3 consecutive instructions:
  - Non-pipelined needs 8 x 3 = 24 ns
  - Pipelined needs 14 ns
  => Speedup = 24 / 14 = 1.7
• If we have 1003 consecutive instructions (1000 more than in the previous example):
  - Non-pipelined total time = 1000 x 8 + 24 = 8024 ns
  - Pipelined total time = 1000 x 2 + 14 = 2014 ns
  => Speedup = 8024 / 2014 ≈ 3.98, close to 8 ns / 2 ns = 4 (near-perfect speedup)
=> Performance increases for larger numbers of instructions (throughput)
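The arithmetic in the example above can be checked with a short Python sketch. The function names are hypothetical, and the 5-stage, 2 ns-per-stage pipeline is an assumption inferred from the slide's numbers (14 ns for 3 instructions, 2 ns for each additional one); the 8 ns non-pipelined instruction time comes from the slide.

```python
def pipeline_times(n, stage_ns=2, stages=5, unpipelined_ns=8):
    """Return (non_pipelined_ns, pipelined_ns) for n back-to-back instructions.

    stage_ns and stages are assumed values chosen to match the slide's totals.
    """
    non_pipelined = n * unpipelined_ns
    # The first instruction needs all stages; each later one completes
    # one stage time after its predecessor.
    pipelined = stages * stage_ns + (n - 1) * stage_ns
    return non_pipelined, pipelined


def speedup(n, **kwargs):
    """Ratio of non-pipelined to pipelined execution time."""
    non_pipelined, pipelined = pipeline_times(n, **kwargs)
    return non_pipelined / pipelined


for n in (3, 1003):
    print(n, pipeline_times(n), round(speedup(n), 2))
```

As n grows, the fill time of the pipeline is amortized and the ratio approaches 8 ns / 2 ns = 4.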
MIPS: Software conventions for Registers
Reg   Name   Use
0     zero   constant 0
1     at     reserved for assembler
2     v0     expression evaluation &
3     v1     function results
4     a0     arguments
5     a1
6     a2
7     a3
8     t0     temporary: caller saves
...          (callee can clobber)
15    t7
16    s0     callee saves
...          (caller can clobber)
23    s7
24    t8     temporary (cont’d)
25    t9
26    k0     reserved for OS kernel
27    k1
28    gp     pointer to global area
29    sp     stack pointer
30    fp     frame pointer
31    ra     return address (HW)
Plus a 3-deep stack of mode bits.
Example in C: swap
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}
• Assume swap is called as a procedure
• Assume temp is register $15; arguments in $a1, $a2; $16 is a scratch register
• Write MIPS code
swap: MIPS
swap:
    addiu $sp, $sp, -4     ; create space on stack
    sw    $16, 4($sp)      ; callee saved register put onto stack
    sll   $t2, $a2, 2      ; multiply k by 4
    addu  $t2, $a1, $t2    ; address of v[k]
    lw    $15, 0($t2)      ; load v[k]
    lw    $16, 4($t2)      ; load v[k+1]
    sw    $16, 0($t2)      ; store v[k+1] into v[k]
    sw    $15, 4($t2)      ; store old value of v[k] into v[k+1]
    lw    $16, 4($sp)      ; callee saved register restored from stack
    addiu $sp, $sp, 4      ; restore top of stack
    jr    $31              ; return to place that called swap
5 Steps of MIPS Datapath
[Figure: five-stage MIPS datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled blocks: Next PC, Next SEQ PC adder (+4), instruction memory, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU with Zero? test, data memory, WB Data. Datapath and Control Path are labeled separately.]
5 Steps of MIPS Datapath
[Figure: the same five-stage datapath, annotated with successive instructions (Inst 1, Inst 2, Inst 3, ...) occupying the pipeline stages.]
Review: Visualizing Pipelining
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instruction order. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction.]
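The staggered layout of the pipeline diagram can be reproduced with a small Python sketch. The function name and the tabular text output are illustrative assumptions; only the stage names come from the slide.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]  # stage names as on the slide


def pipeline_diagram(n_instructions):
    """Return one row per instruction; column c holds the stage active in cycle c + 1."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        # Instruction i enters the pipeline in cycle i + 1 and advances one stage per cycle.
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(row)
    return rows


for row in pipeline_diagram(3):
    print(" ".join(f"{cell:<7}" for cell in row))
```

With three instructions the diagram spans seven cycles, and from cycle 3 to cycle 5 three different stages are busy at once.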
Limits to pipelining
• Hazards: circumstances that would cause incorrect execution if the next instruction were launched
  - Structural hazards: attempting to use the same hardware to do two different things at the same time
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between fetching instructions and deciding on changes in control flow (branches and jumps)
Example: One Memory Port/Structural Hazard
[Figure: pipeline diagram over Cycles 1 to 7 for Load, Instr 1, Instr 2, Instr 3, Instr 4 (stages Ifetch, Reg, ALU, DMem, Reg). A structural hazard is marked where the Load's DMem access and a later instruction's Ifetch need the single memory port in the same cycle.]
Resolving structural hazards
• Definition: an attempt to use the same hardware for two different things at the same time
• Solution 1: Wait
  - must detect the hazard
  - must have a mechanism to stall
• Solution 2: Throw more hardware at the problem
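Detection for the single-memory-port case can be sketched in Python as follows. This is a simplified illustration, not hardware logic: the function name is hypothetical, the pipeline is assumed to be an ideal 5-stage one with no other stalls, and the shift that an inserted stall causes in later instructions is ignored.

```python
MEM_OPS = {"lw", "sw"}  # instructions assumed to use memory in their DMem stage


def memory_port_conflicts(program):
    """Return (cycle, mem_index, fetch_index) for each clash on one memory port.

    Instruction i (0-based) is fetched in cycle i + 1 and reaches DMem in
    cycle i + 4, so it competes with the fetch of instruction i + 3.
    """
    conflicts = []
    for i, op in enumerate(program):
        j = i + 3  # instruction being fetched while instruction i is in DMem
        if op in MEM_OPS and j < len(program):
            conflicts.append((i + 4, i, j))
    return conflicts


program = ["lw", "add", "sub", "and", "or"]  # Load, Instr 1, ..., Instr 4
print(memory_port_conflicts(program))
```

For the Load followed by four other instructions, the only clash is in cycle 4, between the Load and Instr 3. A stall mechanism would delay that fetch by one cycle.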
Detecting and Resolving Structural Hazard
[Figure: pipeline diagram over Cycles 1 to 7 for Load, Instr 1, Instr 2, Stall, Instr 3. The Stall row holds a bubble in every stage, so Instr 3 is fetched one cycle later, after the Load has finished using memory.]
Eliminating Structural Hazards at Design Time
[Figure: five-stage datapath with separate Instr Cache and Data Cache, so instruction fetch and data access no longer share one memory port. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.]
Role of Instruction Set Design in Structural Hazard Resolution
• Simple to determine the sequence of resources used by an instruction
  - opcode tells it all
• Uniformity in the resource usage
• Compare MIPS to IA32?
• MIPS approach => all instructions flow through the same 5-stage pipeline
Data Hazards
[Figure: pipeline diagram (stages IF, ID/RF, EX, MEM, WB) over time in clock cycles for the sequence below; every instruction after the add reads r1.]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
Three Generic Data Hazards
• Read After Write (RAW)
  InstrJ tries to read an operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “data dependence”. This hazard results from an actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR)
  InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
  InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes
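The three definitions can be expressed as a small Python check on register names. This is a sketch with a hypothetical function name and a hypothetical (dest, src1, src2) tuple encoding; it identifies the dependences only, and whether they become actual hazards depends on the pipeline, as noted on the slides.

```python
def classify_dependences(first, second):
    """Classify how `second` (InstrJ) depends on the earlier `first` (InstrI).

    Each instruction is a tuple (dest, src1, src2) of register names.
    Returns a subset of {"RAW", "WAR", "WAW"}.
    """
    dest_i, *srcs_i = first
    dest_j, *srcs_j = second
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")  # J reads a register that I writes
    if dest_j in srcs_i:
        found.add("WAR")  # J writes a register that I reads
    if dest_i == dest_j:
        found.add("WAW")  # J writes a register that I also writes
    return found


# The I/J pairs from the three slides.
print(classify_dependences(("r1", "r2", "r3"), ("r4", "r1", "r3")))  # RAW pair
print(classify_dependences(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # WAR pair
print(classify_dependences(("r1", "r4", "r3"), ("r1", "r2", "r3")))  # WAW pair
```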
Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over time in clock cycles for the sequence below, with the result of the add forwarded to the instructions that follow it.]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
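HW Change for Forwarding
[Figure: datapath with forwarding. Labeled components: NextPC, Registers, ID/EX, Immediate, three muxes, ALU, EX/MEM, Data Memory, MEM/WR. Muxes in front of the ALU select between register-file values and results fed back from the EX/MEM and MEM/WR pipeline registers.]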
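The selection made by the forwarding muxes can be sketched in Python. This is a simplified model with hypothetical names, not the actual control logic: it covers ALU-to-ALU forwarding only, and it assumes the register file is written in the first half of a cycle and read in the second half, so a value three instructions old is read normally. The code writes MEM/WB for the register the figure labels MEM/WR.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where the instruction in EX should take one ALU operand from.

    src_reg     -- register the instruction in EX wants to read
    ex_mem_dest -- destination of the instruction one ahead (None if it writes nothing)
    mem_wb_dest -- destination of the instruction two ahead (None if it writes nothing)
    """
    if src_reg == ex_mem_dest:
        return "EX/MEM"   # the most recent result takes priority
    if src_reg == mem_wb_dest:
        return "MEM/WB"
    return "REGFILE"      # no newer value in flight


# add r1,r2,r3 followed by sub, and, or, each reading r1.
print(forward_source("r1", "r1", None))  # sub: add is one ahead
print(forward_source("r1", "r4", "r1"))  # and: add is two ahead
print(forward_source("r1", "r6", "r4"))  # or: add has already written back
```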
Data Hazard Even with Forwarding
[Figure: pipeline diagram over time in clock cycles for the sequence below. The lw produces r1 only after its DMem stage, too late for the ALU stage of the sub that immediately follows.]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
Resolving this load hazard
• Adding hardware? ... no
• Detection?
• Compilation techniques?
• What is the cost of load delays?
Resolving the Load Data Hazard
[Figure: pipeline diagram over time in clock cycles for the sequence below, with a one-cycle bubble inserted so that sub, and, and or each proceed one cycle later.]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
How is this different from the instruction issue stall?
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd
Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd
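The benefit of the reordering can be checked by counting load-use stalls in both sequences. The Python sketch below uses hypothetical function names and a simplified model: full forwarding, one stall cycle whenever an instruction reads a register loaded by the immediately preceding LW, and (conservatively) the same rule for a store of a just-loaded register.

```python
def parse(line):
    """Split 'OP x,y,z' into (op, [operands])."""
    op, operands = line.split(None, 1)
    return op.upper(), [s.strip() for s in operands.split(",")]


def load_use_stalls(program):
    """Count one-cycle stalls caused by using a value right after loading it."""
    stalls = 0
    prev_loaded = None  # register written by the previous instruction if it was an LW
    for line in program:
        op, operands = parse(line)
        if op == "LW":
            sources, dest = [], operands[0]             # LW Rdest,addr
        elif op == "SW":
            sources, dest = [operands[1]], None         # SW addr,Rsrc (slide operand order)
        else:
            sources, dest = operands[1:], operands[0]   # OP Rdest,Rsrc1,Rsrc2
        if prev_loaded is not None and prev_loaded in sources:
            stalls += 1
        prev_loaded = dest if op == "LW" else None
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]
print(load_use_stalls(slow), load_use_stalls(fast))
```

Under these assumptions the slow sequence stalls twice (before ADD and before SUB), while the fast sequence places an independent instruction after each critical load and does not stall.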
Instruction Set Connection
• What is exposed about this organizational hazard in the instruction set?
• k cycle delay?
  - bad, CPI is not part of ISA
• k instruction slot delay
  - load should not be followed by use of the value in the next k instructions
• Nothing, but code can reduce run-time delays
• MIPS did the transformation in the assembler
Eliminating Control Hazards at Design Time
[Figure: five-stage MIPS datapath with separate Instr Cache and Data Cache; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; Next PC selected by a MUX from Next SEQ PC and the branch logic (Zero? test).]
Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant (with an ideal CPI of 1, CPI becomes 1 + 0.3 x 3 = 1.9)
• Two-part solution:
  - Determine whether the branch is taken sooner, AND
  - Compute the taken-branch address earlier
• MIPS branch tests whether a register = 0 or ≠ 0
• MIPS solution:
  - Move the Zero test to the ID/RF stage
  - Add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for a branch versus 3
Pipelined MIPS Datapath
[Figure: five-stage pipelined datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with EXTRA HARDWARE in the decode stage: an additional adder and the Zero? test, so the branch outcome and Next PC are available earlier.]
• Data stationary control
  - local decode for each instruction phase / pipeline stage
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • “Squash” instructions in pipeline if branch actually taken
  • Advantage of late pipeline state update
  • 47% of MIPS branches not taken on average
  • PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  • 53% of MIPS branches taken on average
  • But the branch target address has not yet been calculated in MIPS
    - MIPS still incurs a 1-cycle branch penalty
    - Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
  • Define branch to take place AFTER a following instruction
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n      (branch delay of length n)
      ........
      branch target if taken
  • A 1-slot delay allows proper decision and branch target address in the 5-stage pipeline
  • MIPS uses this
Delayed Branch
• Where to get instructions to fill the branch delay slot?
  - Before the branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
  - Canceling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80% = 48%) of slots usefully filled
• Delayed branch downside: less effective with 7-8 stage pipelines and multiple instructions issued per clock (superscalar)
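Recall: Speed Up Equation for Pipelining
$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$
$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$
For simple RISC pipeline, CPI = 1:
$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$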
Example: Evaluating Branch Alternatives
$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$
Assume: conditional and unconditional branches = 14% of instructions; 65% of them change the PC
Scheduling scheme    Branch penalty   CPI    Speedup v. stall
Stall pipeline       3                1.42   1.0
Predict taken        1                1.14   1.26
Predict not taken    1                1.09   1.29
Delayed branch       0.5              1.07   1.31
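The CPI column can be reproduced with a few lines of Python. The function name is hypothetical, and one interpretation is assumed in order to match the table: under predict-not-taken, only the 65% of branches that change the PC pay the penalty, while the other schemes pay their penalty on every branch.

```python
BRANCH_FREQ = 0.14     # conditional + unconditional branches (from the slide)
TAKEN_FRACTION = 0.65  # fraction of branches that change the PC (from the slide)


def cpi(penalty, fraction_paying=1.0, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency x fraction paying x branch penalty."""
    return ideal_cpi + BRANCH_FREQ * fraction_paying * penalty


schemes = {
    "Stall pipeline": cpi(3),
    "Predict taken": cpi(1),
    "Predict not taken": cpi(1, fraction_paying=TAKEN_FRACTION),  # assumed interpretation
    "Delayed branch": cpi(0.5),
}

for name, value in schemes.items():
    ratio = schemes["Stall pipeline"] / value  # speedup relative to stalling
    print(f"{name:<18} CPI = {value:.2f}  speedup v. stall = {ratio:.2f}")
```

The CPI values agree with the table. The directly computed CPI ratios come out slightly different from the table's speedup column, possibly because the table was derived from rounded intermediate speedups.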
Summary
• Hazards
  - Data Hazards & Control Hazards
• How to remove hazards?
  - Data Hazards:
      Forwarding
      Change program order
  - Control Hazards:
      Speculate branch outcome
      Delay slots
      Use extra hardware