document

advertisement
Chapter 4
Processor Architecture
Pipelined Implementation
Instructor: Dr. Hyunyoung Lee
Texas A&M University
–1–
Based on slides provided by Randal E. Bryant, CMU
Overview
General Principles of Pipelining


Goal: Enhancing Performance
Difficulties: Complex implementation, Non-deterministic behavior etc.
Creating a Pipelined Y86 Processor



Rearranging SEQ
Inserting pipeline registers
Problems with data and control hazards
 Data Hazards
» Instruction having register R as source follows shortly after instruction having
register R as destination
» Common condition, don’t want to slow down pipeline
 Control Hazards
» Mispredict conditional branch
• Our design predicts all branches as being taken
• Naïve pipeline executes two extra instructions
» Getting return address for ret instruction
• Naïve pipeline executes three extra instructions
–2–
Real-World Pipelines: Car Washes
Sequential
Parallel
Pipelined
Idea



–3–
Divide process into
independent stages
Move objects through stages
in sequence
At any given times, multiple
objects being processed
Computational Example
300 ps
20 ps
Combinational
logic
R
e
g
Delay = 320 ps
Throughput = 3.12 GIPS
Clock
System




–4–
Computation requires total of 300 picoseconds
Additional 20 picoseconds to save result in register
Can must have clock cycle of at least 320 ps
Throughput = 1/(320 ps) = 0.003125 instructions per ps
= 3125000000 instructions per sec = 3.12 GIPS
3-Way Pipelined Version
100 ps
20 ps
100 ps
20 ps
100 ps
Comb.
logic
A
R
e
g
Comb.
logic
B
R
e
g
Comb.
logic
C
20 ps
R
Delay = 360 ps
e
Throughput = 8.33 GIPS
g
Clock
System


Divide combinational logic into 3 blocks of 100 ps each
Can begin new operation as soon as previous one passes
through stage A.
 Begin new operation every 120 ps

Throughput increases by a factor of 8.33/3.12=2.67 at the
expense of increased latency* (360/320=1.12)
* Latency: total time required to perform a single instruction from beginning to end
–5–
Pipeline Diagrams
Unpipelined
OP1
OP2
OP3

Time
Cannot start new operation until previous one completes
3-Way Pipelined
OP1
OP2
A
B
C
A
B
C
A
B
OP3
C
Time

–6–
Up to 3 operations in process simultaneously
Operating a Pipeline
239
241 300 359
Clock
OP1
A
OP2
B
C
A
B
C
A
B
OP3
0
120
240
360
C
480
640
Time
100 ps
20 ps
100 ps
20 ps
100 ps
20 ps
Comb.
logic
A
R
e
g
Comb.
logic
B
R
e
g
Comb.
logic
C
R
e
g
Clock
–7–
Limitations: Nonuniform Delays
50 ps
20 ps
150 ps
20 ps
100 ps
Comb.
logic
R
e
g
Comb.
logic
B
R
e
g
Comb.
logic
C
A
OP1
OP2
A
B
OP3
B
A
R
Delay = 510 ps
e
Throughput = 5.88 GOPS
g
Clock
C
A
20 ps
C
B
C
Time



–8–
Throughput limited by slowest stage
Other stages sit idle for much of the time
Challenging to partition system into balanced stages
Limitations: Register Overhead
50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps
Comb.
logic
R
e
g
Comb.
logic
R
e
g
Comb.
logic
R
e
g
Clock


 3-stage pipeline:
 6-stage pipeline:
–9–
R
e
g
Comb.
logic
R
e
g
Comb.
logic
R
e
g
Delay = 420 ps, Throughput = 14.29 GOPS
As try to deepen pipeline, overhead of loading registers
becomes more significant
Percentage of clock cycle spent loading register:
 1-stage pipeline:

Comb.
logic
6.25%
16.67%
28.57%
High speeds of modern processor designs obtained through
very deep pipelining
Data Hazards and Control Hazards
Make the pipelined processor work!
Data Hazards


Instruction having register R as source follows shortly after
instruction having register R as destination
Common condition, don’t want to slow down pipeline
Control Hazards

Mispredict conditional branch
 Our design predicts all branches as being taken
 Naïve pipeline executes two extra instructions

Getting return address for ret instruction
 Naïve pipeline executes three extra instructions
– 10 –
Data Dependencies
Combinational
logic
R
e
g
Clock
OP1
OP2
OP3
Time
System

– 11 –
Each operation depends on result from preceding one
Data Hazards
Comb.
logic
A
OP1
OP2
R
e
g
A
Comb.
logic
B
R
e
g
Comb.
logic
C
Clock
B
C
A
B
C
A
B
C
A
B
OP3
OP4
R
e
g
C
Time


– 12 –
Result does not feed back around in time for next operation
Pipelining has changed behavior of system
Data Dependencies in Processors

1
irmovl $50, %eax
2
addl %eax ,
3
mrmovl 100( %ebx ),
%ebx
%edx
Result from one instruction used as operand for another
 Read-after-write (RAW) dependency


Very common in actual programs
Must make sure our pipeline handles these properly
 Get correct results
 Minimize performance impact
– 13 –
newPC
SEQ Hardware


Stages occur in sequence
One operation in process
at a time
New
PC
PC
valM
data out
read
write
Addr
Execute
Bch
valE
CC
CC
ALU
ALU
Key






– 14 –
Blue boxes: predesigned
hardware blocks
 E.g., memories, ALU
Gray boxes: control logic
 Describe in HCL
White ovals: labels for
signals
Thick lines: 32-bit word
values
Thin lines: 4-8 bit values
Dotted lines: 1-bit values
Data
Data
memory
memory
Mem.
control
Memory
ALU
A
Data
ALU
fun.
ALU
B
valA
Decode
A
valB
dstE dstM srcA
srcB
dstE dstM srcA
srcB
B
Register
Register M
file
file E
Write back
icode
Fetch
ifun
rA
rB
Instruction
Instruction
memory
memory
PC
valC
valP
PC
PC
increment
increment
valM
SEQ+ Hardware
data out
read
Data
Data
memory
memory
Mem.
control
Memory
write
Addr


Still sequential
implementation
Reorder PC stage to put at
beginning
Execute
Bch
valE
CC
CC
ALU
ALU
ALU
A
Data
ALU
fun.
ALU
B
PC Stage


Task is to select PC for
current instruction
Based on results
computed by previous
instruction
Processor State


– 15 –
PC is no longer stored in
register
But, can determine PC
based on other stored
information
valA
valB
dstE dstM srcA srcB
dstE dstM srcA srcB
Decode
A
B
Register
Register M
file
file E
Write back
icode
Fetch
ifun
rA
rB
valC
Instruction
Instruction
memory
memory
valP
PC
PC
increment
increment
PC
PC
PC
pIcode pBch
pValM
pValC
pValP
Adding Pipeline Registers
valE, valM
W_icode, W_valM
Write back
valM
W_valE, W_valM, W_dstE, W_dstM
valM
W
valM
Data
Data
memory
memory
Memory
Memory
Data
Data
memory
memory
M_icode,
M_Bch,
M_valA
Addr, Data
Addr, Data
M
valE
Bch
Bch
CC
CC
Execute
ALU
ALU
valE
CC
CC
Execute
aluA, aluB
ALU
ALU
aluA, aluB
E
valA, valB
Decode
valA, valB
srcA, srcB
dstA, dstB
icode, valC
valP
A
B
Register
Register M
file
file
d_srcA,
d_srcB
Decode
A
B
Register
Register M
file
file
E
E
Write back
valP
icode, ifun
rA, rB
valC
Fetch
Instruction
Instruction
memory
memory
D
valP
icode, ifun,
rA, rB, valC
PC
PC
increment
increment
Instruction
Instruction
memory
memory
Fetch
valP
PC
PC
increment
increment
PC
PC
predPC
pState
PC
f_PC
F
– 16 –
W_icode, W_valM
Pipeline Stages
W_valE, W_valM, W_dstE, W_dstM
W
valM
Fetch
Memory
Data
Data
memory
memory
M_icode,
M_Bch,
M_valA
Addr, Data



Select current PC
Read instruction
Compute incremented PC
M
Bch
valE
CC
CC
Execute
ALU
ALU
aluA, aluB
Decode

E
Read program registers
valA, valB
Execute
d_srcA,
d_srcB
Decode
A
B
Register
Register M
file
file
E
Write back

Operate ALU
Memory

D
icode, ifun,
rA, rB, valC
Instruction
Instruction
memory
memory
Fetch
Write Back
– 17 –
valP
PC
PC
increment
increment
predPC
Read or write data memory
PC

valP
Update register file
f_PC
F
Write back
PIPE- Hardware
W
icode
valE
Mem.
control
write
Pipeline registers hold
intermediate values
from instruction
execution
Forward (Upward) Paths


Values passed from one
stage to next
Cannot jump past
stages
dstE dstM
data out
read

valM
Data
Data
memory
memory
Memory
data in
Addr
M_valA
M_Bch
M
icode
Bch
valE
valA
dstE dstM
e_Bch
Execute
E
ALU
fun.
ALU
ALU
CC
CC
icode ifun
ALU
A
ALU
B
valC
valA
valB
d_srcA d_srcB
Select
A
Decode
d_rvalA
A
dstE dstM srcA srcB
W_valM
B
Register
Register M
file
file
W_valE
E
 e.g., valC passes
through decode
dstE dstM srcA srcB
D
Fetch
icode ifun
rA
rB
Instruction
Instruction
memory
memory
valC
valP
PC
PC
increment
increment
Predict
PC
f_PC
M_valA
Select
PC
F
– 18 –
predPC
W_valM
Write back
Feedback Paths
W
icode
valE
dstE dstM
data out
read
Mem.
control
write
Predicted PC
valM
Data
Data
memory
memory
Memory
data in
Addr
M_valA
M_Bch

Guess value of next PC
M
icode
Bch
valE
valA
dstE dstM
e_Bch
Branch information


Jump taken/not-taken
Fall-through or target
address
Execute
E
ALU
fun.
ALU
ALU
CC
CC
icode ifun
ALU
A
ALU
B
valC
valA
valB
dstE dstM srcA srcB
d_srcA d_srcB
Return point

Decode
To register file write
ports
A
dstE dstM srcA srcB
W_valM
B
W_valE
E
D
Fetch
icode ifun
rA
rB
Instruction
Instruction
memory
memory
valC
valP
PC
PC
increment
increment
Predict
PC
f_PC
M_valA
Select
PC
F
– 19 –
d_rvalA
Register
Register M
file
file
Read from memory
Register updates

Select
A
predPC
W_valM
Predicting the
PC
D
M_icode
M_Bch
M_valA
W_icode
W_valM
icode ifun
rA
rB
valC
valP
Predict
PC
Need
valC
Instr
valid
Need
regids
Split
Split
PC
PC
increment
increment
Align
Align
Byte 0
Bytes 1-5
Instruction
Instruction
memory
memory
Select
PC
F

predPC
Start fetch of new instruction after current one has completed
fetch stage
 Not enough time to reliably determine next instruction

Guess which instruction will follow
 Recover if prediction was incorrect
– 20 –
Our Prediction Strategy
Instructions that don’t transfer control


Predict next PC to be valP
Always reliable
Call and unconditional Jumps


Predict next PC to be valC (destination)
Always reliable
Conditional jumps


Predict next PC to be valC (destination)
Only correct if branch is taken
 Typically right 60% of time
Return instruction

– 21 –
Don’t try to predict
Recovering
from PC
Misprediction
M_icode
M_Bch
M_valA
W_icode
W_valM
D
icode ifun
rA
rB
valC
valP
Predict
PC
Need
valC
Instr
valid
Need
regids
Split
Split
PC
PC
increment
increment
Align
Align
Byte 0
Bytes 1-5
Instruction
Instruction
memory
memory
Select
PC
F

predPC
Mispredicted Jump
 Will see branch flag once instruction reaches memory stage
 Can get fall-through PC from valA

Return Instruction
 Will get return PC when ret reaches write-back stage
– 22 –
Pipeline Demonstration
irmovl
$1,%eax
#I1
irmovl
$2,%ecx
#I2
irmovl
$3,%edx
#I3
irmovl
$4,%ebx
#I4
halt
#I5
1
2
3
4
5
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
Cycle 5
File: demo-basic.ys
W
I1
M
I2
E
I3
D
I4
F
I5
– 23 –
6
7
8
9
W
Data Dependencies: 3 Nop’s
# demo-h3.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
0x006: irmovl
0x00c:
nop
0x00d:
nop
0x00e:
nop
$3,%eax
0x00f: addl %edx,%eax
0x011: halt
6
7
8
9
10
Cycle 6
W
R[ %eax] f 3
Cycle 7
D
– 24 –
valA f R[ %edx] = 10
valB f R[ %eax] = 3
11
W
Data Dependencies: 2 Nop’s
# demo-h2.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
0x006: irmovl
$3,%eax
0x00c: nop
0x00d:
nop
0x00e: addl %edx,%eax
0x010: halt
6
7
8
9
Cycle 6
W
R[ %eax] f 3
•
•
•
D
valA f R[ %edx] = 10
valB f R[ %eax] = 0
– 25 –
Error
10
W
Data Dependencies: 1 Nop
# demo-h1.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
0x006: irmovl
0x00c:
$3,%eax
nop
0x00d: addl %edx,%eax
0x00f: halt
6
7
8
Cycle 5
W
R[ %edx] f 10
M
M_valE = 3
M_dstE = %eax
•
•
•
D
– 26 –
valA f R[ %edx] = 0
valB f R[ %eax] = 0
Error
9
W
Data Dependencies: No Nop
# demo-h0.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
0x006: irmovl
$3,%eax
0x00c: addl %edx,%eax
0x00e: halt
6
7
Cycle 4
M
M_valE = 10
M_dstE = %edx
E
e_valE f 0 + 3 = 3
E_dstE = %eax
D
valA f R[ %edx] = 0
valB f R[ %eax] = 0
– 27 –
Error
8
W
Stalling for Data Dependencies
# demo-h2.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
E
M
W
D
D
E
M
W
F
F
D
E
M
0x006: irmovl
$3,%eax
0x00c: nop
0x00d: nop
6
bubble
0x00e: addl %edx,%eax
0x010: halt



– 28 –
F
7
8
9
If instruction follows too closely after one that writes
register, slow it down
Hold instruction in decode
Dynamically inject nop into execute stage
10
11
W
Stall Condition
Source Registers

srcA and srcB of current
instruction in decode
stage
Destination Registers


dstE and dstM fields
Instructions in execute,
memory, and write-back
stages
Special Case

Don’t stall for register ID
15 (0xF)
 Indicates absence of
register operand
– 29 –

Don’t stall for failed
conditional move
Detecting Stall Condition
# demo-h2.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
E
M
W
D
D
E
M
W
F
F
D
E
M
0x006: irmovl
$3,%eax
0x00c: nop
0x00d: nop
6
bubble
0x00e: addl %edx,%eax
0x010: halt
F
7
Cycle 6
W
W_dstE = %eax
W_valE = 3
•
•
•
D
– 30 –
srcA = %edx
srcB = %eax
8
9
10
11
W
Stalling X3
# demo-h0.ys
1
2
3
4
5
F
D
E
M
W
F
D
E
M
W
E
M
W
E
M
W
E
M
W
0x000: irmovl $10,%edx
0x006: irmovl
$3,%eax
bubble
6
bubble
bubble
F
0x00c: addl %edx,%eax
7
8
D
D
D
E
M
W
F
F
F
F
D
E
M
Cycle 6
W
Cycle 5
W_dstE = %eax
M
Cycle 4
M_dstE = %eax
E
•
•
•
D
D
– 31 –
srcA = %edx
srcB = %eax
10
D
0x00e: halt
E_dstE = %eax
9
srcA = %edx
srcB = %eax
•
•
•
D
srcA = %edx
srcB = %eax
11
W
What Happens When Stalling?
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl
$3,%eax
0x00c: addl %edx,%eax
0x00e: halt



Cycle 8
4
5
6
7
Write Back
Memory
Execute
Decode
Fetch
0x000: bubble
0x006:
irmovl $10,%edx
$3,%eax
0x006: bubble
0x00c:
irmovl
addl
%edx,%eax
$3,%eax
0x00c: halt
0x00e:
addl %edx,%eax
0x00e: halt
Stalling instruction held back in decode stage
Following instruction stays in fetch stage
Bubbles injected into execute stage
 Like dynamically generated nop’s
 Move through later stages
– 32 –
0x000: bubble
0x006:
irmovl $10,%edx
$3,%eax
Implementing Stalling
W_stat
W_stall
W
stat icode
valE
valM
dstE dstM
valE
valA
dstE dstM
M_icode
M_bubble
M
stat icode
Cnd
m_stat
e_Cnd
stat
set_cc
Pipe
control
logic
CC
E_dstM
E_icode
E_bubble
E
stat icode ifun
valC
valA
valB
dstE dstM srcA srcB
d_srcB
d_srcA
srcB
D_icode
srcA
D_bubble
D_stall
D
F_stall
F
stat icode ifun
rA
rB
valC
valP
predPC
Pipeline Control


– 33 –
Combinational logic detects stall condition
Sets mode signals for how pipeline registers should update
Pipeline Register Modes
Input = y
Output = x
x
Normal
stall
=0
Output = x
x
stall
=1
– 34 –
Output = x
x
stall
=0
y
_
Rising
clock
_
Output = x
x
bubble
=0
Input = y
Bubble
_
Output = y
bubble
=0
Input = y
Stall
_
Rising
clock
bubble
=1
_
Rising
clock
_
n
o
p
Output = nop
Data Forwarding
Naïve Pipeline


Register isn’t written until completion of write-back stage
Source operands read from register file in decode stage
 Needs to be in register file at start of stage
Observation

Value generated in execute or memory stage
Trick


– 35 –
Pass value directly from generating instruction to decode
stage
Needs to be available at end of decode stage
Data Forwarding Example
# demo-h2.ys
1
2
3
4
5
0x000: irmovl $10,%edx
F
D
F
E
D
F
M
E
D
F
W
M
E
D
F
0x006: irmovl
$3,%eax
0x00c: nop
0x00d: nop
0x00e: addl %edx,%eax
0x010: halt



– 36 –
irmovl in writeback stage
Destination value in
W pipeline register
Forward as valB for
decode stage
6
7
8
9
10
W
M
E
D
F
W
M
E
D
W
M
E
W
M
W
Cycle 6
W
R[ %eax] f 3
W_dstE = %eax
W_valE = 3
•
•
•
D
srcA = %edx
srcB = %eax
valA f R[ %edx] = 10
valB f W_valE = 3
Bypass Paths
Decode Stage



Forwarding logic selects
valA and valB
Normally from register
file
Forwarding: get valA or
valB from later pipeline
stage
Forwarding Sources



– 37 –
Execute: valE
Memory: valE, valM
Write back: valE, valM
Data Forwarding Example #2
# demo-h0.ys
1
2
3
4
5
6
0x000: irmovl $10,%edx
F
D
F
E
D
M
E
W
M
W
F
D
E
M
W
F
D
E
M
0x006: irmovl
$3,%eax
0x00c: addl %edx,%eax
0x00e: halt
Register %edx


Generated by ALU
during previous cycle
Forward from memory
as valA
Register %eax
– 38 –

Value just generated
by ALU

Forward from execute
as valB
7
Cycle 4
M
M_dstE = %edx
M_valE = 10
E
E_dstE = %eax
e_valE f 0 + 3 = 3
D
srcA = %edx
srcB = %eax
valA f M_valE = 10
valB f e_valE = 3
8
W
Forwarding Priority
# demo-priority.ys
1
2
3
4
5
0x000: irmovl $1, %eax
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
W
F
D
E
M
0x006: irmovl $2, %eax
0x00c: irmovl $3, %eax
0x012: rrmovl %eax, %edx
0x014: halt
Multiple Forwarding
Choices



– 39 –
Which one should
have priority
Match serial
semantics
Use matching value
from earliest pipeline
stage
6
Cycle 5
W
R[ %eax] f 3
1
M
W
R[ %eax] f 3
2
E
W
R[ %eax] f 3
D
valA f R[ %eax
10
%edx] = ?
valB f 0
R[ %eax] = 0
7
8
9
W
10
Limitation of Forwarding
# demo-luh.ys
0x000:
0x006:
0x00c:
0x012:
0x018:
0x01e:
0x020:
1
2

4
irmovl $128,%edx
F
D
E M
irmovl $3,%ecx
F
D
E
rmmovl %ecx, 0(%edx)
F
D
irmovl $10,%ebx
F
mrmovl 0(%edx),%eax # Load %eax
addl %ebx,%eax # Use %eax
halt
Load-use dependency

3
Value needed by end of
decode stage in cycle 7
Value read from memory in
memory stage of cycle 8
5
6
W
M
E
D
F
W
M
E
D
F
7
8
9
10
11
W
M
E
D
F
W
M
E
D
W
M
E
W
M
W
Cycle 7
Cycle 8
M
M
M_dstE = %ebx
M_valE = 10
M_dstM = %eax
m_valM f M[128] = 3
•
•
•
D
valA f M_valE = 10
valB f R[%eax] = 0
– 40 –
Error
Avoiding Load/Use Hazard
# demo-luh.ys
1
2
3
4
5
irmovl $128,%edx
F
irmovl $3,%ecx
rmmovl %ecx, 0(%edx)
irmovl $10,%ebx
D
E
M
W
F
D
F
E
D
F
0x018: mrmovl 0(%edx),%eax # Load %eax
bubble
0x01e: addl %ebx,%eax # Use %eax
0x020: halt
M
E
D
F
0x000:
0x006:
0x00c:
0x012:


Stall using instruction for
one cycle
Can then pick up loaded
value by forwarding from
memory stage
6
7
8
9
W
M
E
D
W
M
E
W
M
W
D
F
E
D
F
M
E
D
F
Cycle 8
W
W_dstE = %ebx
W_valE = 10
M
M_dstM = %eax
m_valM f M[128] = 3
•
•
•
D
– 41 –
valA f W_valE = 10
valB f m_valM = 3
10
11
W
M
E
W
M
12
W
Detecting Load/Use Hazard
Condition
Trigger
Load/Use Hazard
E_icode in { IMRMOVL, IPOPL } &&
E_dstM in { d_srcA, d_srcB }
– 42 –
Control for Load/Use Hazard
# demo-luh.ys
1
2
3
0x000:
0x006:
0x00c:
0x012:
0x018:
4
irmovl $128,%edx
F
D E M
irmovl $3,%ecx
F
D E
rmmovl %ecx, 0(%edx)
F
D
irmovl $10,%ebx
F
mrmovl 0(%edx),%eax # Load %eax
bubble
0x01e: addl %ebx,%eax # Use %eax
0x020: halt


6
7
W
M
E
D
F
W
M
E
D
W
M
E
F
D
F
8
9
10
11
W
M
E
D
F
W
M
E
D
W
M
E
W
M
12
Stall instructions in fetch
and decode stages
Inject bubble into execute
stage
Condition
Load/Use Hazard
– 43 –
5
F
D
E
M
W
stall
stall
bubble
normal
normal
W
Branch Misprediction Example
demo-j.ys
0x000:
xorl %eax,%eax
0x002:
jne t
0x007:
irmovl $1, %eax
0x00d:
nop
0x00e:
nop
0x00f:
nop
0x010:
halt
0x011: t: irmovl $3, %edx
0x017:
irmovl $4, %ecx
0x01d:
irmovl $5, %edx

– 44 –
# Not taken
# Fall through
# Target (Should not execute)
# Should not execute
# Should not execute
Should only execute first 7 instructions
Branch Misprediction Trace
# demo-j
0x000:
xorl %eax,%eax
0x002:
jne t # Not taken
1
2
3
4
5
6
F
D
F
E
D
M
E
W
M
W
F
D
F
E
D
F
M
E
D
0x011: t: irmovl $3, %edx # Target
0x017:
irmovl $4, %ecx # Target+1
0x007:
irmovl $1, %eax # Fall Through
Cycle 5
M

Incorrectly execute two
instructions at branch target
M_Bch = 0
M_valA = 0x007
E
valE f 3
dstE = %edx
D
valC = 4
dstE = %ecx
F
– 45 –
valC f 1
rB f %eax
7
W
M
E
8
9
cancelled
W
M
W
Return Example
0x000:
0x006:
0x007:
0x008:
0x009:
0x00e:
0x014:
0x020:
0x020:
0x021:
0x022:
0x023:
0x024:
0x02a:
0x030:
0x036:
0x100:
0x100:

– 46 –
demo-ret.ys
irmovl Stack,%esp # Intialize stack pointer
nop
# Avoid hazard on %esp
nop
nop
call p
# Procedure call
irmovl $5,%esi
# Return point
halt
.pos 0x20
p: nop
# procedure
nop
nop
ret
irmovl $1,%eax
# Should not be executed
irmovl $2,%ecx
# Should not be executed
irmovl $3,%edx
# Should not be executed
irmovl $4,%ebx
# Should not be executed
.pos 0x100
Stack:
# Stack: Stack pointer
Require lots of nops to avoid data hazards
Incorrect Return Example
# demo-ret
0x023:

ret
0x024:
D
irmovl $1,%eax # Oops! F
0x02a:
irmovl $2,%ecx # Oops!
0x030:
irmovl $3,%edx # Oops!
0x00e:
irmovl $5,%esi # Return
Incorrectly execute 3
instructions following ret
F
E
D
F
M
E
D
F
W
M
E
D
F
W
valM = 0x0e
M
valE = 1
dstE = %eax
E
valE f 2
dstE = %ecx
D
valC = 3
dstE = %edx
F
– 47 –
valC f 5
rB f %esi
W
M
E
D
W
M
E
W
M
W
Pipeline Summary (1)
Concept


Break instruction execution into 5 stages
Run instructions through in pipelined mode
Limitations


Can’t handle dependencies between instructions when
instructions follow too closely
Data dependencies
 One instruction writes register, later one reads it

Control dependency
 Instruction sets PC in way that pipeline did not predict correctly
 Mispredicted branch and return
– 48 –
Pipeline Summary (2)
Fixing the Pipeline

Data Hazards
 Most handled by forwarding
» No performance penalty
 Load/use hazard requires one cycle stall

Control Hazards
 Cancel instructions when detect mispredicted branch
» Two clock cycles wasted
 Stall fetch stage while ret passes through pipeline
» Three clock cycles wasted
– 49 –
Download