IC220 Set #20:
Laundry, Co-dependency, and other Hazards
of Modern (Architecture) Life
Chapter 6
1
ADMIN
• Reading for Chapter 6: 6.1, 6.9-6.12
2
Midnight Laundry
[Figure: laundry loads A, B, C, D in task order vs. time, 6 PM to 2 AM]
3
Smarty Laundry
[Figure: laundry loads A, B, C, D in task order vs. time, 6 PM to 2 AM]
4
Pipelining
• Improve performance by increasing instruction throughput
[Figure, top: program execution order (in instructions) vs. time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) without pipelining. Each instruction goes through Instruction fetch, Reg, ALU, Data access, Reg, and the next one starts 800 ps later.]
[Figure, bottom: the same three instructions pipelined. Each instruction starts 200 ps after the previous one, and every stage takes 200 ps.]
Ideal speedup is number of stages in the pipeline. Do we achieve this?
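A quick way to check the question above is to add up the times in the figure. The sketch below assumes the numbers shown on this slide (800 ps per instruction without pipelining, five stages clocked at 200 ps with pipelining); the function names are illustrative only.

```python
def unpipelined_time_ps(n, instr_time_ps=800):
    """Total time when each instruction finishes before the next one starts."""
    return n * instr_time_ps


def pipelined_time_ps(n, stages=5, cycle_ps=200):
    """The first instruction needs `stages` cycles; each later one finishes one cycle after it."""
    return (stages + n - 1) * cycle_ps


for n in (3, 1_000_000):
    t_old = unpipelined_time_ps(n)
    t_new = pipelined_time_ps(n)
    print(f"{n} instructions: {t_old} ps vs {t_new} ps, speedup {t_old / t_new:.2f}")
```

For the three lw instructions this gives 2400 ps vs. 1400 ps. For long sequences the ratio approaches 800/200 = 4 rather than 5, because every stage is stretched to the 200 ps clock even though a full lw needs only 800 ps unpipelined.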
5
Basic Idea
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back
[Figure: single-cycle datapath divided into the five stages above: PC, instruction memory, registers, sign extend, ALU, data memory, the adders for PC + 4 and the branch target, and the multiplexors]
6
Pipelined Datapath
[Figure: pipelined datapath. The single-cycle datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]
7
Pipeline Diagrams
[Figure: pipeline diagram, clock cycles 1-7 (200 ps each), for overlapped instructions including add $s0, $t0, $t1; sub $a1, $s2, $a3; add $t0, $t1, $t2. Each instruction passes through IF, ID, EX, MEM, WB, starting one cycle after the previous one.]
Assumptions:
• Reads to memory or register file in 2nd half of clock cycle
• Writes to memory or register file in 1st half of clock cycle
What could go wrong?
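A text-only pipeline diagram is easy to generate when nothing goes wrong. The sketch below assumes the ideal case (no hazards, no stalls), so instruction i simply enters IF in cycle i + 1; the helper name `pipeline_diagram` is made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(program):
    """Return one text row per instruction, assuming no hazards and no stalls."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(program):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i is in stage s during cycle i + s + 1
        rows.append(f"{instr:<20}" + "".join(f"{c:<5}" for c in cells).rstrip())
    return rows


program = ["add $s0, $t0, $t1", "sub $a1, $s2, $a3", "add $t0, $t1, $t2"]
header = f"{'Clock cycle:':<20}" + "".join(f"{c:<5}" for c in range(1, 8))
print(header.rstrip())
for row in pipeline_diagram(program):
    print(row)
```

Three instructions finish in 3 + 5 - 1 = 7 cycles under these assumptions.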
8
Problem: Dependencies
• Problem with starting next instruction before first is finished
[Figure: pipeline diagram, clock cycles 1-8. The first instruction writes $s0; the following instructions and $a1, $s0, $a3; add $t0, $t1, $s0; or $t2, $s0, $s0 all read $s0.]
Dependencies that “go backward in time” are ____________________
Will the “or” instruction work properly?
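The dependencies in a sequence like this can be found mechanically. The sketch below handles only R-format instructions of the form `op rd, rs, rt`, assumes the first instruction is add $s0, $t0, $t1 (any instruction that writes $s0 would behave the same), and relies on the earlier register-file assumption that writes happen in the first half of a cycle and reads in the second half. All function names are illustrative.

```python
def reads_and_writes(instr):
    """Split an R-format instruction 'op rd, rs, rt' into (dest, set of sources)."""
    _, operands = instr.split(None, 1)
    rd, rs, rt = [r.strip() for r in operands.split(",")]
    return rd, {rs, rt}


def raw_dependencies(program):
    """List (producer_index, consumer_index, register) for each read-after-write pair."""
    deps = []
    for j, consumer in enumerate(program):
        _, sources = reads_and_writes(consumer)
        for reg in sorted(sources):
            # Only the most recent earlier writer of the register matters.
            for i in range(j - 1, -1, -1):
                if reads_and_writes(program[i])[0] == reg:
                    deps.append((i, j, reg))
                    break
    return deps


def needs_forwarding(producer, consumer):
    """Without stalls, instruction k is in ID in cycle k + 2 and in WB in cycle k + 5."""
    return (consumer + 2) < (producer + 5)


program = [
    "add $s0, $t0, $t1",
    "and $a1, $s0, $a3",
    "add $t0, $t1, $s0",
    "or $t2, $s0, $s0",
]
for i, j, reg in raw_dependencies(program):
    print(f"{program[j]!r} reads {reg} written by {program[i]!r};",
          "needs forwarding" if needs_forwarding(i, j) else "register file is enough")
```

Under these assumptions a consumer three instructions after the producer reads the register in the same cycle in which it is written, so only the two closest consumers depend on forwarding.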
9
Solution: Forwarding
Use temporary results, don’t wait for them to be written
[Figure: the same instruction sequence over clock cycles 1-8 (and $a1, $s0, $a3; add $t0, $t1, $s0; or $t2, $s0, $s0 following the instruction that writes $s0), with forwarding of $s0 shown]
Where do we need this?
Will this deal with all hazards?
10
Problem?
[Figure: pipeline diagram, clock cycles 1-7, for lw $t0, 0($s1); sub $a1, $t0, $a3; add $a2, $t0, $t2. Both later instructions read $t0, which the lw loads.]
Forwarding not enough…
When an instruction tries to ___________ a register following a ____________ to the same register.
11
Solution: “Stall” later instruction until result is ready
Clock cycle: 1 2 3 4 5 6 7
lw $t0, 0($s1)
sub $a1, $t0, $a3
add $a2, $t0, $t2
[Figure: pipeline stages for each instruction, with the stall shown]
Why does the stall start after ID stage?
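One way to see where the stall lands is to compute, for each instruction, the cycle in which it enters each stage. The sketch below is a simplified model, not the actual hazard-detection hardware: it assumes full forwarding, so the only stall is a load followed immediately by a user of the loaded register, and it handles only lw and R-format instructions. A held cycle is printed as `**`; all names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def parse(instr):
    """Return (op, dest, sources) for 'lw rt, off(rs)' or R-format 'op rd, rs, rt'."""
    op, operands = instr.split(None, 1)
    args = [a.strip() for a in operands.split(",")]
    if op == "lw":
        base = args[1][args[1].index("(") + 1:-1]
        return op, args[0], [base]
    return op, args[0], args[1:]


def schedule(program):
    """For each instruction, map stage name -> cycle in which it enters that stage."""
    table = []
    for instr in program:
        _, _, srcs = parse(instr)
        prev = table[-1] if table else None
        entry = {}
        for s, stage in enumerate(STAGES):
            cycle = entry[STAGES[s - 1]] + 1 if s > 0 else 1
            if prev is not None:
                # The previous instruction must have moved on before we can enter.
                if s + 1 < len(STAGES):
                    cycle = max(cycle, prev["entry"][STAGES[s + 1]])
                else:
                    cycle = max(cycle, prev["entry"]["WB"] + 1)
                # Load-use: the loaded value exists only after the load's MEM stage.
                if stage == "EX" and prev["op"] == "lw" and prev["dest"] in srcs:
                    cycle = max(cycle, prev["entry"]["MEM"] + 1)
            entry[stage] = cycle
        table.append({"op": parse(instr)[0], "dest": parse(instr)[1], "entry": entry})
    return [row["entry"] for row in table]


def render(program, entries):
    """Text diagram: stage name in the cycle it is entered, '**' while held."""
    last = max(e["WB"] for e in entries)
    lines = []
    for instr, e in zip(program, entries):
        by_cycle = {c: s for s, c in e.items()}
        cells = []
        for c in range(1, last + 1):
            if c in by_cycle:
                cells.append(by_cycle[c])
            elif e["IF"] < c < e["WB"]:
                cells.append("**")
            else:
                cells.append("")
        lines.append(f"{instr:<20}" + "".join(f"{x:<5}" for x in cells).rstrip())
    return lines


program = ["lw $t0, 0($s1)", "sub $a1, $t0, $a3", "add $a2, $t0, $t2"]
entries = schedule(program)
print("\n".join(render(program, entries)))
```

In this model the sub is fetched and decoded on schedule, waits one cycle, and then executes with the loaded value forwarded from the end of the lw's MEM stage; the add behind it is delayed by the same cycle.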
12
Assumptions
• For exercises/exams/everything assume…
– The MIPS 5-stage pipeline
– That we have forwarding
…unless told otherwise
13
Exercise #1 – Pipeline diagrams
• Draw a pipeline stage diagram for the following sequence of instructions.
  Start at cycle #1.
  You don’t need fancy pictures – just text for each stage: ID, MEM, etc.
    add $s1, $s3, $s4
    lw $v0, 0($a0)
    sub $t0, $t1, $t2
• What is the total number of cycles needed to complete this sequence?
• What is the ALU doing during cycle #4?
• When does the sub instruction write back its result?
• When does the lw instruction access memory?
14
Exercise #2 – Data hazards
• Consider this code:
1. add $s1, $s3, $s4
2. add $v0, $s1, $s3
3. sub $t0, $v0, $t2
4. and $a0, $v0, $s1
1. Draw lines showing all the data dependencies in this code
2. Which of these dependencies do not need forwarding to avoid stalling?
15
Exercise #3 – Data hazards
• Draw a pipeline diagram for this code. Show stalls where needed.
1. add $s1, $s3, $s4
2. lw $v0, 0($s1)
3. sub $v0, $v0, $s1
16
Exercise #4 – More Data hazards
• Draw a pipeline diagram for this code. Show stalls where needed.
1. lw $s1, 0($t0)
2. lw $v0, 0($s1)
3. sw $v0, 4($s1)
4. sw $t0, 0($t1)
17
Exercise #5 – Stretch
• What might be a problem with pipelining the following code?
      beq $a0, $a1, Else
      lw $v0, 0($s1)
      sw $v0, 4($s1)
Else: add $a1, $a2, $a3
18
Exercise #6 – Stretch
•
This diagram (from before) has a serious bug. What is it?
[Figure: the pipelined datapath from the earlier "Pipelined Datapath" slide, with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
19
Big Picture
• Remember the single-cycle implementation
– Inefficient because low utilization of hardware resources
– Each instruction takes one long cycle
• Two possible ways to improve on this:

                                                        Multicycle    Pipelined
Clock cycle time (vs. single cycle)
Amount of hardware used (vs. single cycle)
Split instruction into multiple stages (1 per cycle)?
Each stage has its own set of hardware?
How many instructions executing at once?
20
The Pipeline Paradox
• Pipelining does not ________________ the execution time of any ______________ instruction
• But by _____________________ instruction execution, it can greatly improve performance by ________________ the ________________
21
Implementing Pipelining
• What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– data hazards
– structural hazards
– control hazards
• What makes it really hard?
– exception handling
– improving performance with out-of-order execution, etc.
22
Structural Hazards
• Occur when the hardware can’t support the combination of instructions that we want to execute in the same clock cycle
• MIPS instruction set designed to reduce this problem
• But could occur if:
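One commonly cited case, offered here as an assumption because the slide leaves the condition open, is a machine with a single memory shared by instruction fetch and data access. The sketch below assumes no stalls (instruction i is in IF in cycle i + 1 and in MEM in cycle i + 4) and reports the cycles in which both stages would need that one memory. The function name and the sample program are illustrative.

```python
def memory_conflicts(program):
    """Find cycles where a lw/sw is in MEM while a later instruction is in IF."""
    conflicts = []
    for i, instr in enumerate(program):
        op = instr.split()[0]
        if op in ("lw", "sw"):
            mem_cycle = i + 4          # cycle in which instruction i accesses data memory
            fetcher = mem_cycle - 1    # index of the instruction fetched in that same cycle
            if fetcher < len(program):
                conflicts.append((mem_cycle, instr, program[fetcher]))
    return conflicts


program = [
    "lw $t0, 0($s1)",
    "add $s0, $t1, $t2",
    "sub $a1, $s2, $a3",
    "and $a2, $t3, $t4",
]
for cycle, mem_instr, if_instr in memory_conflicts(program):
    print(f"cycle {cycle}: {mem_instr!r} in MEM while {if_instr!r} in IF")
```

With separate instruction and data memories, as in the datapath on the earlier slides, this particular conflict does not arise.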
23
Control Hazards
• What might be a problem with pipelining the following code?
      beq $a0, $a1, Else
      lw $v0, 0($s1)
      sw $v0, 4($s1)
Else: add $a1, $a2, $a3
• What other kinds of instructions would cause this problem?
24
Control Hazard Strategy #1: Predict not taken
• What if we are wrong?
• Assume branch target and decision known at end of ID cycle. Show a pipeline diagram for when branch is taken.
beq $a0, $a1, Else
lw $v0, 0($s1)
sw $v0, 4($s1)
Else: add $a1, $a2, $a3
25
Control Hazard Strategies
1. Predict not taken
One cycle penalty when we are wrong – not so bad
Penalty gets bigger with longer pipelines – bigger problem
2.
3.
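The cost of a wrong prediction can be estimated with a simple CPI model. This is a rough sketch: the branch fraction, misprediction rate, and the longer penalty below are hypothetical values chosen for illustration, not figures from the course.

```python
def effective_cpi(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when each wrong prediction costs penalty_cycles."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, 40% of those are mispredicted.
for penalty in (1, 10):
    cpi = effective_cpi(branch_fraction=0.20, mispredict_rate=0.40, penalty_cycles=penalty)
    print(f"penalty {penalty} cycle(s): CPI is about {cpi:.2f}")
```

The same misprediction rate hurts far more when the penalty grows, which is why longer pipelines care more about prediction accuracy.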
26
Branch Prediction
[Figure: branch prediction state diagram with four states, two labeled "Predict taken" and two labeled "Predict not taken", connected by transitions labeled "Taken" and "Not taken"]
With more sophistication can get 90-95% accuracy
Good prediction key to enabling more advanced pipelining techniques!
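A four-state predictor of this kind can be sketched as a 2-bit saturating counter. This is one common variant and an assumption here: the exact transitions in the slide's figure are not recoverable from the text. The loop-branch pattern used to exercise it is also made up.

```python
class TwoBitPredictor:
    """States 0 and 1 predict not taken; states 2 and 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes, predictor):
    """Fraction of branch outcomes that the predictor guessed correctly."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(loop_branch, TwoBitPredictor()))
```

Because a single surprise only moves the counter one step, the loop exit is mispredicted once per pass while the first iteration of the next pass is still predicted correctly.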
27
Pipeline Control
• Generate control signal during the ________ stage
• _________ control signals along just like the __________

             Execution/Address Calculation     Memory access stage         Write-back stage
             stage control lines               control lines               control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0         0       0        0          1         0
lw           0       0       0       1         0       1        0          1         1
sw           X       0       0       1         0       0        1          0         X
beq          X       0       1       0         1       0        0          0         X

[Figure: the Control unit takes the instruction from IF/ID and produces the EX, M, and WB signal groups. ID/EX holds WB, M, and EX; EX/MEM holds WB and M; MEM/WB holds WB.]
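The control-line table on this slide can be written down as a lookup table, with each pipeline register keeping only the signal groups that later stages still need. This is a sketch of the idea rather than a hardware description; the dictionary layout and the function name are illustrative, and "X" marks a don't-care.

```python
CONTROL = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": "X", "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": "X"},
    },
    "beq": {
        "EX": {"RegDst": "X", "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
        "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": "X"},
    },
}


def control_in_pipeline_registers(kind):
    """Signal groups held by each pipeline register for one instruction type."""
    signals = CONTROL[kind]
    return {
        "ID/EX": {group: signals[group] for group in ("EX", "M", "WB")},
        "EX/MEM": {group: signals[group] for group in ("M", "WB")},
        "MEM/WB": {"WB": signals["WB"]},
    }


for register, groups in control_in_pipeline_registers("lw").items():
    print(register, groups)
```

Each group is dropped once its stage has used it, so MEM/WB carries only the two write-back signals.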
28
Code Scheduling to Improve Performance
• Can we avoid stalls by rescheduling?
    lw $t0, 0($t1)
    add $t2, $t0, $t2
    lw $t3, 4($t1)
    add $t4, $t3, $t4
• Dynamic Pipeline Scheduling
– Hardware chooses which instructions to execute next
– Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!)
– Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect)
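For the rescheduling question on this slide, a small stall counter makes it easy to compare orderings. The sketch assumes the 5-stage pipeline with forwarding, so the only stall counted is a lw followed immediately by an instruction that reads the loaded register. The reordered sequence is one possible answer, not necessarily the intended one, and the names are illustrative.

```python
def parse(instr):
    """Return (op, dest, sources) for 'lw rt, off(rs)' or R-format 'op rd, rs, rt'."""
    op, operands = instr.split(None, 1)
    args = [a.strip() for a in operands.split(",")]
    if op == "lw":
        base = args[1][args[1].index("(") + 1:-1]
        return op, args[0], [base]
    return op, args[0], args[1:]


def load_use_stalls(program):
    """Count adjacent pairs where a lw is followed by a reader of its destination."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, sources = parse(cur)
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls


original = [
    "lw $t0, 0($t1)",
    "add $t2, $t0, $t2",
    "lw $t3, 4($t1)",
    "add $t4, $t3, $t4",
]
# Hoist the second load so neither add directly follows the load it depends on.
rescheduled = [original[0], original[2], original[1], original[3]]

for name, code in (("original", original), ("rescheduled", rescheduled)):
    stalls = load_use_stalls(code)
    print(f"{name}: {stalls} stall(s), {len(code) + 4 + stalls} cycles")
```

The reordering is legal because the second lw neither reads nor writes a register used by the first add.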
29
Dynamic Pipeline Scheduling
• Let hardware choose which instruction to execute next (might execute instructions out of program order)
• Why might hardware do better job than programmer/compiler?
Example #1
lw $t0, 0($t1)
add $t2, $t0, $t2
lw $t3, 4($t1)
add $t4, $t3, $t4
Example #2
sw $s0, 0($s3)
lw $t0, 0($t1)
add $t2, $t0, $t2
30
Exercise #1
• Can you rewrite this code to eliminate stalls?
1. lw $s1, 0($t0)
2. lw $v0, 0($s1)
3. sw $v0, 4($s1)
4. add $t0, $t1, $t2
31
Exercise #2
• Show a pipeline diagram for the following code, assuming:
– The branch is predicted not taken
– The branch actually is taken
        lw $t1, 0($t0)
        beq $s1, $s2, Label2
        sub $v0, $v1, $v2
Label2: add $t0, $t1, $t2
32
Exercise #3 – True or False?
1. A pipelined implementation will have a faster clock rate than a comparable single-cycle implementation.
2. Pipelining increases performance by splitting up each instruction into stages, thereby decreasing the time needed to execute each instruction.
3. A structural hazard could occur if an instruction produced two results that needed to be written to the register file.
4. “Backwards” branches are likely to not be taken.
33
Exercise #4 – Stretch
• What is problematic about the following code?
• Show a pipeline diagram – assume branch is predicted not taken, but is taken.
        lw $s1, 0($t0)
        beq $s1, $s2, Label2
        sub $v0, $v1, $v2
Label2: add $t0, $t1, $t2
34