Pipeline Hazards Hakim Weatherspoon CS 3410, Spring 2013 Computer Science

advertisement
Pipeline Hazards
Hakim Weatherspoon
CS 3410, Spring 2013
Computer Science
Cornell University
See P&H Chapter 4.7
Goals for Today
Data Hazards
• Revisit Pipelined Processors
• Data dependencies
• Problem, detection, and solutions
– (delaying, stalling, forwarding, bypass, etc)
• Hazard detection unit
• Forwarding unit
Next time
• Control Hazards
What is the next instruction to execute if
a branch is taken? Not taken?
MIPS Design Principles
Simplicity favors regularity
• 32 bit instructions
Smaller is faster
• Small register file
Make the common case fast
• Include support for constants
Good design demands good compromises
• Support for different type of interpretations/classes
Recall: MIPS instruction formats
All MIPS instructions are 32 bits long, has 3 formats
R-type
op
6 bits
I-type
op
6 bits
J-type
rs
rt
5 bits 5 bits
rs
rt
rd shamt func
5 bits
5 bits
6 bits
immediate
5 bits 5 bits
16 bits
op
immediate (target address)
6 bits
26 bits
Recall: MIPS Instruction Types
Arithmetic/Logical
• R-type: result and two source registers, shift amount
• I-type: 16-bit immediate with sign/zero extension
Memory Access
• load/store between registers and memory
• word, half-word and byte operations
Control flow
• conditional branches: pc-relative addresses
• jumps: fixed offsets, register absolute
Recall: MIPS Instruction Types
Arithmetic/Logical
• ADD, ADDU, SUB, SUBU, AND, OR, XOR, NOR, SLT, SLTU
• ADDI, ADDIU, ANDI, ORI, XORI, LUI, SLL, SRL, SLLV, SRLV, SRAV,
SLTI, SLTIU
• MULT, DIV, MFLO, MTLO, MFHI, MTHI
Memory Access
• LW, LH, LB, LHU, LBU, LWL, LWR
• SW, SH, SB, SWL, SWR
Control flow
• BEQ, BNE, BLEZ, BLTZ, BGEZ, BGTZ
• J, JR, JAL, JALR, BEQL, BNEL, BLEZL, BGTZL
Special
• LL, SC, SYSCALL, BREAK, SYNC, COPROC
Pipelined Processor
memory
register
file
alu
+4
addr
PC
din
control
new
pc
Fetch
memory
compute
jump/branch
targets
extend
Decode
dout
Execute
Memory
WB
register
file
B
alu
D
memory
D
A
Pipelined Processor
+4
IF/ID
M
B
ID/EX
Execute
EX/MEM
Memory
ctrl
Instruction
Decode
Instruction
Fetch
dout
compute
jump/branch
targets
ctrl
extend
din
memory
imm
new
pc
control
ctrl
inst
PC
addr
WriteBack
MEM/WB
Example: : Sample Code (Simple)
add
nand
lw
add
sw
r3,
r6,
r4,
r5,
r7,
r1, r2;
r4, r5;
20(r2);
r2, r5;
12(r3);
Example: Sample Code (Simple)
Assume eight-register machine
Run the following code on a pipelined datapath
add
nand
lw
add
sw
Slides thanks to Sally McKee
r3 r1 r2 ; reg 3 = reg 1 + reg 2
r6 r4 r5 ; reg 6 = ~(reg 4 & reg 5)
r4 20 (r2) ; reg 4 = Mem[reg2+20]
r5 r2 r5 ; reg 5 = reg 2 + reg 5
r7 12(r3) ; Mem[reg3+12] = reg 7
Clock cycle
add
nand
Time Graphs
1
2
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
lw
add
sw
Latency:
Throughput:
Concurrency:
3
4
5
6
7
8
9
EX MEM WB
CPI =
M
U
X
4
target
+
PC+4
PC+4
R0
R1
regB
R2
R3
Register file
instruction
PC
Inst
mem
regA
0
Bits 16-20
Bits 26-31
IF/ID
valA
R4
R5
R6
valB
R7
extend
Bits 11-15
ALU
result
M
U
X
A
L
U
ALU
result
mdata
Data
mem
M
U
X
data
imm
dest
valB
Rd
Rt
op
ID/EX
M
U
X
dest
dest
op
op
EX/MEM
MEM/WB
At time 1,
Fetch
add r3 r1 r2
data
dest
extend
0
0
IF/ID
ID/EX
M
U
X
EX/MEM
MEM/WB
add 3 1 2
M
U
X
4
0
+
/0 4
4
R0
R1
R3
Register file
add 3 1 2
PC
Inst
mem
R2
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
add 3 1 2
Bits 11-15
Bits 16-20
Bits 26-31
Time: 1/ 2
IF/ID
0
0
/0 36
/0 9
M
U
X
A
L
U
0
0
Data
mem
M
U
X
data
0
dest
0
/0 3
/0 2
nop
/ add
ID/EX
M
U
X
0
0
nop
nop
EX/MEM
MEM/WB
nand 6 4 5
add 3 1 2
M
U
X
4
/0 4
+
/4 8
8
R0
R2
2
R3
Register file
nand 6 4 5
PC
Inst
mem
R1
1
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
nand 6 4 5
Bits 11-15
Bits 16-20
Bits 26-31
Time: 2/ 3
IF/ID
0
36
/ 18
/9 7
3
0
36
9
M
U
X
A
L
U
/0 45
0
Data
mem
data
dest
/0 9
/3 6
/2 5
M
U
X
add
/ nand
ID/EX
3
/0 3
nop
/ add
EX/MEM
M
U
X
0
nop
MEM/WB
lw 4 20(2)
nand 6 4 5
add 3 1 2
M
U
X
nand (18 ∙ 7)
4
+
12
8
R0
R2
5
R3
Register file
lw 4 20(2)
PC
Inst
mem
R1
4
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
lw 4 20(2)
Bits 11-15
Bits 16-20
Bits 26-31
Time: 3/ 4
IF/ID
18 = 01 0010
7 = 00 0111
------------------3 = 11 1101
0
/0 45
/ 18
36
18
7
/4 8
/9 7
6
M
U
X
A
L
U
45
/ -3
0
M
U
X
Data
mem
data
dest
/9 7
6
5
3
2
nand
ID/EX
M
U
X
3
/3 6
add
/ nand
EX/MEM
/0 3
nop
/ add
MEM/WB
add 5 2 5
lw 4 20(2)
nand 6 4 5
add 3 1 2
M
U
X
4
8
+
16
12
R0
R2
4
R3
Register file
add 5 2 5
PC
Inst
mem
R1
2
R4
R5
R6
R7
0
36
9
12
18
7
41
22
extend
Fetch:
add 5 2 5
Bits 11-15
Bits 16-20
Bits 26-31
Time: 4
IF/ID
0
45
18
9
7
18
20
M
U
X
A
L
U
-3 45
0
Data
mem
M
U
X
data
dest
7
0
4
6
5
lw
ID/EX
M
U
X
6
6
3
nand
EX/MEM
3
add
MEM/WB
sw 7 12(3)
add 5 2 5
lw 4 20 (2)
nand 6 4 5
add 3 1 2
M
U
X
4
12
+
20
16
R0
R2
5
R3
Register file
sw 7 12(3)
PC
Inst
mem
R1
2
R4
R5
R6
R7
0
36
9
45
18
7
41
22
extend
Fetch:
sw 7 12(3)
Bits 11-15
Bits 16-20
Bits 26-31
Time: 5
IF/ID
0
-3
9
9
7
M
U
20 X
5
A
L
U
29 -3
45
0
Data
mem
M
U
X
data
dest
18
5
5
0
4
add
ID/EX
M
U
X
4
4
6
lw
EX/MEM
6
3
nand
MEM/WB
sw 7 12(3)
add 5 2 5
lw 4 20(2)
nand 6 4 5
M
U
X
4
16
+
20
R0
R1
3
R2
7
R3
Register file
PC
Inst
mem
R4
R5
R6
R7
0
36
9
45
18
7
-3
22
extend
No more
instructions
Bits 11-15
Bits 16-20
Bits 26-31
Time: 6
IF/ID
0
29
9
45
7
22
12
M
U
X
A
L
U
16 29
-3
99
Data
mem
M
U
X
data
dest
7
0
7
5
5
sw
ID/EX
M
U
X
5
5
4
add
EX/MEM
4
6
lw
MEM/WB
nop
nop
sw 7 12(3)
add 5 2 5
lw 4 20(2)
M
U
X
4
20
+
R0
R1
R2
Inst
mem
R3
Register file
PC
R4
R5
R6
R7
0
36
9
45
99
7
-3
22
0
16
45
M
U
12 X
A
L
U
57 16
Data
mem
Bits 11-15
Bits 16-20
22
0
7
Bits 26-31
Time: 7
IF/ID
data
dest
extend
No more
instructions
0
M
U
99 X
M
U
X
7
7
5
sw
ID/EX
EX/MEM
5
4
add
MEM/WB
nop
nop
nop
sw 7 12(3)
add 5 2 5
M
U
X
4
+
R0
R1
R2
Inst
mem
R3
Register file
PC
R4
R5
R6
R7
0
36
9
45
99
16
-3
22
57
M
U
X
57
Bits 11-15
M
U
X
Bits 16-20
IF/ID
Slides thanks to Sally McKee
0
Data
mem
M
U
X
data
dest
7
Bits 26-31
Time: 8
22
22
extend
No more
instructions
A
L
U
16
5
sw
ID/EX
EX/MEM
MEM/WB
nop
nop
nop
nop
sw 7 12(3)
M
U
X
4
+
R0
R1
R2
Inst
mem
R3
Register file
PC
R4
R5
R6
R7
0
36
9
45
99
16
-3
22
M
U
X
A
L
U
Data
mem
data
dest
extend
No more
instructions
M
U
X
Bits 11-15
M
U
X
Bits 16-20
Bits 21-23
Time: 9
IF/ID
ID/EX
EX/MEM
MEM/WB
Takeaway
Pipelining is a powerful technique to mask
latencies and increase throughput
• Logically, instructions execute one at a time
• Physically, instructions execute in parallel
– Instruction level parallelism
Abstraction promotes decoupling
• Interface (ISA) vs. implementation (Pipeline)
Next Goal
What about data dependencies (also known as a
data hazard in a pipelined processor)?
i.e. add r3, r1, r2
sub r5, r3, r4
Data Hazards
Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written
Data Hazards
time
add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)
Clock cycle
1 2
3
IF
ID
IF
4
5
MEM
WB
MEM
ID
IF
7
8
9
WB
MEM WB
ID
IF
6
MEM WB
ID
IF
ID
MEM
WB
Data Hazards
Data Hazards
• register file reads occur in stage 2 (ID)
• register file writes occur in stage 5 (WB)
• next instructions may read values about to be written
How to detect?
IF/ID
ID/EX
B
D
EX/MEM
Rd
OP
Rd
mem
OP
imm
Rt Rd PC+4
detect
hazard
OP
PC
PC+4
+4
addr
din dout
M
A
B
Ra Rb
D
A
Rd
D
inst
add r3, r1, r2
sub inst
r5, r3, r5
or r6, r3, r4
mem
add r6, r3, r8
B
Detecting Data Hazards
MEM/WB
Takeaway
Data hazards occur when a operand (register) depends on
the result of a previous instruction that may not be
computed yet. A pipelined processor needs to detect data
hazards.
Next Goal
What to do if data hazard detected?
Stalling
How to stall an instruction in ID stage
• prevent IF/ID pipeline register update
– stalls the ID stage instruction
• convert ID stage instr into nop for later stages
– innocuous “bubble” passes through pipeline
• prevent PC update
– stalls the next (IF stage) instruction
WE=0
IF/ID
ID/EX
B
D
Rd
Rd
EX/MEM
OP
MemWr=0
RegWr=0
mem
OP
imm
If detect hazard
OP
detect
hazard
PC+4
PC
Rt Rd PC+4
+4
addr
din dout
M
A
B
Ra Rb
D
A
Rd
D
inst
add r3, r1, r2
sub inst
r5, r3, r5
or r6, r3, r4
mem
add r6, r3, r8
B
Detecting Data Hazards
MEM/WB
Stalling
time
add r3, r1, r2
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8
Clock cycle
1
2
3
4
5
6
7
8
Stalling
time
r3 = 10
add r3, r1, r2
r3 = 20
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8
Clock cycle
1
2
3
4
5
6
7
8
Stalling
sub r5,r3,r5
or r6,r3,r4
(WE=0)
/stall
NOP = If(IF/ID.rA ≠ 0 &&
(IF/ID.rA==ID/Ex.Rd
IF/ID.rA==Ex/M.Rd
IF/ID.rA==M/W.Rd))
Rd
WE
Rd
add r3,r1,r2
M
Op
nop
data
mem
WE
PC
B
D
Op
(MemWr=0
RegWr=0)
B
Rd
+4
D
WE
inst
inst
mem
D
rD
B
rA rB
A
Op
A
Stalling
sub r5,r3,r5
or r6,r3,r4
(WE=0)
/stall
NOP = If(IF/ID.rA ≠ 0 &&
(IF/ID.rA==ID/Ex.Rd
IF/ID.rA==Ex/M.Rd
IF/ID.rA==M/W.Rd))
nop
Rd
WE
Rd
(MemWr=0
RegWr=0)
M
Op
nop
data
mem
WE
PC
B
D
Op
(MemWr=0
RegWr=0)
B
Rd
+4
D
WE
inst
inst
mem
D
rD
B
rA rB
A
Op
A
add r3,r1,r2
Stalling
sub r5,r3,r5
or r6,r3,r4
(WE=0)
/stall
NOP = If(IF/ID.rA ≠ 0 &&
(IF/ID.rA==ID/Ex.Rd
IF/ID.rA==Ex/M.Rd
IF/ID.rA==M/W.Rd))
D
(MemWr=0
RegWr=0)
nop
nop
WE
(MemWr=0
RegWr=0)
Rd
M
Rd
data
mem
Op
nop
WE
PC
B
Op
(MemWr=0
RegWr=0)
B
Rd
+4
D
WE
inst
inst
mem
D
rD
B
rA rB
A
Op
A
add r3,r1,r2
Stalling
How to stall an instruction in ID stage
• prevent IF/ID pipeline register update
– stalls the ID stage instruction
• convert ID stage instr into nop for later stages
– innocuous “bubble” passes through pipeline
• prevent PC update
– stalls the next (IF stage) instruction
Takeaway
Data hazards occur when a operand (register) depends on
the result of a previous instruction that may not be
computed yet. A pipelined processor needs to detect data
hazards.
Stalling, preventing a dependent instruction from
advancing, is one way to resolve data hazards. Stalling
introduces NOPs (“bubbles”) into a pipeline. Introduce
NOPs by (1) preventing the PC from updating, (2)
preventing writes to IF/ID registers from changing, and (3)
preventing writes to memory and register file. Bubbles in
pipeline significantly decrease performance.
Next Goal: Resolving Data Hazards via Forwarding
What to do if data hazard detected?
A) Wait/Stall
B) Reorder in Software (SW)
C) Forward/Bypass
Forwarding
Forwarding bypasses some pipelined stages
forwarding a result to a dependent instruction
operand (register).
Three types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (MEx)
• Forwarding from Mem/WB register to Ex stage (WEx)
• RegisterFile Bypass
Forwarding Datapath
B
B
D
D
B
IF/ID
ID/Ex
forward
unit
MC WE
Rb
Ra
detect
hazard
data
mem
Rd
imm
inst
mem
D
Ex/Mem
Three types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (MEx)
• Forwarding from Mem/WB register to Ex stage (W  Ex)
• RegisterFile Bypass
M
Rd
A
MC WE
A
Mem/WB
Forwarding Datapath
B
B
D
D
B
IF/ID
ID/Ex
forward
unit
MC WE
Rb
Ra
detect
hazard
data
mem
Rd
imm
inst
mem
D
Ex/Mem
Three types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (MEx)
• Forwarding from Mem/WB register to Ex stage (W  Ex)
• RegisterFile Bypass
M
Rd
A
MC WE
A
Mem/WB
Forwarding Datapath 1
Ex/MEM to EX Bypass
• EX needs ALU result that is still in MEM stage
• Resolve:
Add a bypass from EX/MEM.D to start of EX
How to detect? Logic in Ex Stage:
forward = (Ex/M.WE && EX/M.Rd != 0 &&
ID/Ex.Ra == Ex/M.Rd)
|| (same for Rb)
Forwarding Datapath 1
A
D
inst
mem
add r3, r1, r2
sub r5, r3, r1
B
IF
data
mem
ID
Ex
M
W
IF
ID
Ex
M
W
Forwarding Datapath 2
Mem/WB to EX Bypass
• EX needs value being written by WB
• Resolve:
Add bypass from WB final value to start of EX
How to detect? Logic in Ex Stage:
forward = (M/WB.WE && M/WB.Rd != 0 &&
ID/Ex.Ra == M/WB.Rd &&
not (ID/Ex.WE && Ex/M.Rd != 0 &&
ID/Ex.Ra == Ex/M.Rd)
|| (same for Rb)
Check pg. 369
Forwarding Datapath 2
A
D
inst
mem
add r3, r1, r2
sub r5, r3, r1
or r6, r3, r4
B
IF
data
mem
ID
Ex
M
W
IF
ID
IF
Ex
ID
M W
Ex M
W
Register File Bypass
Register File Bypass
• Reading a value that is currently being written
Detect:
((Ra == MEM/WB.Rd) or (Rb == MEM/WB.Rd))
and (WB is writing a register)
Resolve:
Add a bypass around register file (WB to ID)
Better: (Hack) just negate register file clock
– writes happen at end of first half of each clock cycle
– reads happen during second half of each clock cycle
Register File Bypass
A
D
inst
mem
add r3, r1, r2
sub r5, r3, r1
or r6, r3, r4
add r6, r3, r8
B
IF
data
mem
ID
Ex
M
W
IF
ID
IF
Ex
ID
IF
M W
Ex M
ID Ex
W
M W
Forwarding Example
time
r3 = 10
add r3, r1, r2
r3 = 20
sub r5, r3, r5
or r6, r3, r4
add r6, r3, r8
Clock cycle
1
2
3
4
5
6
7
8
Forwarding Example 2
time
add r3, r1, r2
sub r5, r3, r4
lw r6, 4(r3)
or r5, r3, r5
sw r6, 12(r3)
Clock cycle
1
2
3
4
5
IF
ID
Ex
M
W
IF
ID
Ex
M
6
W
7
8
Tricky Example
A
D
inst
mem
add r1, r1, r2
SUB r1, r1, r3
OR r1, r4, r1
B
data
mem
Forwarding Datapath
B
B
D
D
B
IF/ID
ID/Ex
forward
unit
MC WE
Rb
Ra
detect
hazard
data
mem
Rd
imm
inst
mem
D
Ex/Mem
Three types of forwarding/bypass
• Forwarding from Ex/Mem registers to Ex stage (MEx)
• Forwarding from Mem/WB register to Ex stage (W  Ex)
• Register File Bypass
M
Rd
A
MC WE
A
Mem/WB
Takeaway
Data hazards occur when a operand (register) depends on the result of
a previous instruction that may not be computed yet. A pipelined
processor needs to detect data hazards.
Stalling, preventing a dependent instruction from advancing, is one way
to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a
pipeline. Introduce NOPs by (1) preventing the PC from updating, (2)
preventing writes to IF/ID registers from changing, and (3) preventing
writes to memory and register file. Bubbles (nops) in pipeline
significantly decrease performance.
Forwarding bypasses some pipelined stages forwarding a result to a
dependent instruction operand (register). Better performance than
stalling.
Administrivia
Prelim1: next Tuesday, February 26th in evening
•
•
Time: We will start at 7:30pm sharp, so come early
Prelim Review: This Thur 6-8pm in Upson B14 and Fri, 5-7pm in
Phillips 203
•
Closed Book
•
•
•
Cannot use electronic device or outside material
Practice prelims are online in CMS
Material covered everything up to end of this week
•
•
•
•
•
Appendix C (logic, gates, FSMs, memory, ALUs)
Chapter 4 (pipelined [and non-pipeline] MIPS processor with hazards)
Chapters 2 (Numbers / Arithmetic, simple MIPS instructions)
Chapter 1 (Performance)
HW1, HW2, Lab0, Lab1, Lab2
Administrivia
HW2 is due tomorrow
• Fill out Survey online. Receive credit/points on homework for survey:
• Should have received email from Kathryn Dimiduk
• Survey is anonymous
Project1 (PA1) due week after prelim
• Continue working diligently. Use design doc momentum
Save your work!
• Save often. Verify file is non-zero. Periodically save to Dropbox, email.
• Beware of MacOSX 10.5 (leopard) and 10.6 (snow-leopard)
Use your resources
• Lab Section, Piazza.com, Office Hours, Homework Help Session,
• Class notes, book, Sections, CSUGLab
Administrivia
Check online syllabus/schedule
• http://www.cs.cornell.edu/Courses/CS3410/2013sp/schedule.html
Slides and Reading for lectures
Office Hours
Homework and Programming Assignments
Prelims (in evenings):
• Tuesday, February 26th
• Thursday, March 28th
• Thursday, April 25th
Schedule is subject to change
Collaboration, Late, Re-grading Policies
“Black Board” Collaboration Policy
• Can discuss approach together on a “black board”
• Leave and write up solution independently
• Do not copy solutions
Late Policy
• Each person has a total of four “slip days”
• Max of two slip days for any individual assignment
• Slip days deducted first for any late assignment,
cannot selectively apply slip days
• For projects, slip days are deducted from all partners
• 25% deducted per day late after slip days are exhausted
Regrade policy
• Submit written request to lead TA,
and lead TA will pick a different grader
• Submit another written request,
lead TA will regrade directly
• Submit yet another written request for professor to regrade.
Quiz
Find all hazards, and say how they are resolved:
add r3, r1, r2
sub r3, r2, r1
nand r4, r3, r1
or r0, r3, r4
xor r1, r4, r3
sb r4, 1(r0)
Memory Load Data Hazard
A
D
inst
mem
lw r4, 20(r8)
sub r6, r4, r1
B
data
mem
Resolving Memory Load Hazard
Load Data Hazard
• Value not available until WB stage
• So: next instruction can’t proceed if hazard detected
Resolution:
• MIPS 2000/3000: one delay slot
– ISA says results of loads are not available until one cycle
later
– Assembler inserts nop, or reorders to fill delay slot
• MIPS 4000 onwards: stall
– But really, programmer/compiler reorders to avoid stalling
in the load delay slot
Quiz 2
add r3, r1, r2
nand r5, r3, r4
add r2, r6, r3
lw r6, 24(r3)
sw r6, 12(r2)
Data Hazard Recap
Delay Slot(s)
• Modify ISA to match implementation
Stall
• Pause current and all subsequent instructions
Forward/Bypass
• Try to steal correct value from elsewhere in pipeline
• Otherwise, fall back to stalling or require a delay slot
Tradeoffs?
Download