Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 3: Instruction-Level Parallelism and Its Exploitation
Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction

Pipelining became a universal technique in 1985
• Overlaps execution of instructions
• Exploits "instruction-level parallelism" (ILP)

Beyond the potential overlap among instructions in a simple pipeline, there are two main approaches to exploiting ILP:
• Hardware-based dynamic approaches
  – Used in server and desktop processors
  – Not used as extensively in PMD (personal mobile device) processors
• Compiler-based static approaches
  – Not as successful outside of scientific applications

Instruction-Level Parallelism

When exploiting instruction-level parallelism, the goal is to maximize IPC (or minimize CPI):

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Parallelism within a basic block is limited
• A basic block is a code sequence with no branch in except at the entry and no branch out except at the exit
• Typical size of a basic block = 3-6 instructions
• Must optimize across branches

Loop-Level Parallelism

• Unroll loops statically or dynamically
• Use SIMD (vector processors and GPUs)
• Challenge: data dependence

Data Dependence

Instruction j is data dependent on instruction i if either
• instruction i produces a result that may be used by instruction j, or
• instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a dependence chain)

Dependent instructions cannot be executed simultaneously.

Data Dependence

• Dependences are a property of programs; the pipeline organization determines whether a dependence is detected and whether it causes a stall
• A data dependence conveys:
  – the possibility of a hazard
  – the order in which results must be calculated
  – an upper bound on exploitable instruction-level parallelism
• Dependences that flow through memory locations are difficult to detect

Name Dependence

Two instructions use the same name but there is no flow of information
• Not a true data dependence, but a problem when reordering instructions
• Antidependence: instruction j writes a register or memory location that instruction i reads
  – The initial ordering (i before j) must be preserved
• Output dependence: instruction i and instruction j write the same register or memory location
  – Ordering must be preserved
• To resolve, use renaming techniques

Other Factors

Data hazards:
• Read after write (RAW)
• Write after write (WAW)
• Write after read (WAR)

Control Dependence

Ordering of instruction i with respect to a branch instruction:
• An instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch (e.g., cannot move s1 before if (p1))
• An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch (e.g., cannot move p3=0 into {s2})

Example:
if (p1) {s1};
p3 = 0;
if (p2) {s2};
Examples

Example 1: the OR instruction is dependent on DADDU and DSUBU

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R1,R6
L:      ...
        OR    R7,R1,R8

Example 2: assume R4 isn't used after skip; it is possible to move DSUBU before the branch

        DADDU R1,R2,R3
        BEQZ  R12,skip
        DSUBU R4,R5,R6
        DADDU R5,R4,R9
skip:   OR    R7,R8,R9

Compiler Techniques for Exposing ILP

Pipeline scheduling: separate a dependent instruction from the source instruction by the pipeline latency of the source instruction.

Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
FP Loop: Where are the Hazards?

First translate into MIPS code. To simplify, assume 8 is the lowest address, R1 = base address of x, and F2 = s:

Loop: L.D    F0,0(R1)   ;F0=vector element
      ADD.D  F4,F0,F2   ;add scalar from F2
      S.D    0(R1),F4   ;store result
      DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
      BNEZ   R1,Loop    ;branch R1!=zero

7/27/2016, CSCE 430/830, Instruction Level Parallelism
FP Loop Showing Stalls

1 Loop: L.D    F0,0(R1)   ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2   ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4   ;store result
7       DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
8       stall             ;assumes can't forward to branch
9       BNEZ   R1,Loop    ;branch R1!=zero

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

9 clock cycles: rewrite code to minimize stalls?
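The 9-cycle count above can be recomputed mechanically. A minimal sketch, assuming stalls occur only between adjacent producer/consumer pairs as in the slide's latency table (expressed as extra stall cycles inserted between the two instructions):

```python
def count_cycles(instrs, stalls_between):
    """instrs: list of opcode names in program order.
    stalls_between maps (producer, consumer) -> stall cycles
    inserted between them when they are adjacent."""
    cycle = 0
    prev = None
    for ins in instrs:
        # one cycle for the instruction itself, plus any stalls
        cycle += 1 + stalls_between.get((prev, ins), 0)
        prev = ins
    return cycle

loop = ["L.D", "ADD.D", "S.D", "DADDUI", "BNEZ"]
stalls = {("L.D", "ADD.D"): 1,     # load double -> FP ALU op
          ("ADD.D", "S.D"): 2,     # FP ALU op -> store double
          ("DADDUI", "BNEZ"): 1}   # assumes can't forward to branch
```

Running `count_cycles(loop, stalls)` reproduces the slide's 9 clock cycles per iteration.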
Revised FP Loop Minimizing Stalls

1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4   ;altered offset when move DADDUI
7       BNEZ   R1,Loop

Swap DADDUI and S.D by changing the address of S.D.

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; how to make it faster?
Unroll Loop Four Times (straightforward way)

1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2      ;1 cycle stall after L.D
6        S.D    0(R1),F4      ;2 cycles stall after ADD.D; drop DSUBUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12   ;drop DSUBUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32    ;alter to 4*8
26       BNEZ   R1,LOOP

27 clock cycles, or 6.75 per iteration (assumes R1 is a multiple of 4)
Rewrite the loop to minimize stalls?
Unrolled Loop That Minimizes Stalls

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       S.D    8(R1),F16   ;8-32 = -24
14       BNEZ   R1,LOOP

14 clock cycles, or 3.5 per iteration
Unrolled Loop Detail
• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the
loop to make k copies of the body
• Instead of a single unrolled loop, we generate a
pair of consecutive loops:
– 1st executes (n mod k) times and has a body that is the
original loop
– 2nd is the unrolled body surrounded by an outer loop that
iterates (n/k) times
• For large values of n, most of the execution time
will be spent in the unrolled loop
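The pair-of-loops transformation above can be sketched in a high-level language. A minimal sketch, assuming the scalar-add loop from the earlier slides and a fixed unroll factor of 4 (function name and argument names are illustrative):

```python
def saxpy_unrolled(x, s, k=4):
    """Add scalar s to every element of x, unrolled by k = 4.

    Mirrors the slides' transformation: a prologue loop runs
    n mod k iterations of the original body, then the unrolled
    loop iterates n // k times with k copies of the body."""
    assert k == 4, "the unrolled body below is hardwired to 4 copies"
    n = len(x)
    i = 0
    # 1st loop: executes (n mod k) times; body is the original loop
    for _ in range(n % k):
        x[i] = x[i] + s
        i += 1
    # 2nd loop: the unrolled body, iterating n // k times
    for _ in range(n // k):
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += k
    return x
```

For large n, most of the execution time is spent in the second (unrolled) loop, as the slide notes.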
5 Loop Unrolling Decisions

Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that loop unrolling is useful by finding that loop iterations are independent (except for loop-maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
   • This transformation requires analyzing memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code
In-class Exercise

• Identify the data hazards in the code below:

  MULTD F3, F4, F2
  ADDD  F2, F6, F1
  SD    F2, 0(F3)

• For each of the following code fragments, identify each type of dependence that a compiler will find (a fragment may have no dependences) and whether a compiler could schedule the two instructions (i.e., change their order).

1. DADDI R1, R1, #4        2. DADD R3, R1, R2
   LD    R2, 7(R1)            SD   R2, 7(R1)

3. SD R2, 7(R1)            4. BEZ R1, place
   SD F2, 200(R7)             SD  R1, 7(R1)
3 Limits to Loop Unrolling

1. Decrease in the amount of overhead amortized with each extra unrolling
   • Amdahl's Law
2. Growth in code size
   • For larger loops, it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
   • If it is not possible to allocate all live values to registers, the transformation may lose some or all of its advantage

Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
Static Branch Prediction

• Previous lectures showed scheduling code around a delayed branch
• To reorder code around branches, we need to predict the branch statically at compile time
• The simplest scheme is to predict a branch as taken
  – Average misprediction = untaken branch frequency = 34% for SPEC
• A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run

[Figure: misprediction rate of profile-based prediction on SPEC benchmarks — integer (compress, eqntott, espresso, gcc, li) and floating point (doduc, ear, hydro2d, mdljdp, su2cor) — with rates ranging from about 4% to 25%]
Dynamic Branch Prediction
• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of
way that humans/compilers think about problems
• Is dynamic branch prediction better than static
branch prediction?
– Seems to be
– There are a small number of important branches in programs
which have dynamic behavior
Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address
index table of 1-bit values
– Says whether or not branch taken last time
– No address check
• Problem: in a loop, 1-bit BHT will cause two
mispredictions (avg is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it
predicts exit instead of looping
Dynamic Branch Prediction

• Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions

[State diagram: two "predict taken" states and two "predict not taken" states; a taken outcome (T) moves toward predict-taken, a not-taken outcome (NT) moves toward predict-not-taken, and the prediction flips only after two outcomes in the wrong direction]

• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to the decision-making process
BHT Accuracy

Mispredict because either:
• Wrong guess for that branch
• Got the branch history of the wrong branch when indexing the table

[Figure: misprediction rate of a 4096-entry 2-bit BHT on SPEC89 — integer benchmarks (eqntott, espresso, gcc, li) fall roughly between 5% and 18%, while floating-point benchmarks (nasa7, matrix300, doduc, spice, fpppp) fall roughly between 0% and 10%]
Correlated Branch Prediction

• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
• In general, an (m,n) predictor means recording the last m branches to select between 2^m history tables, each with n-bit counters
  – Thus, the old 2-bit BHT is a (0,2) predictor
• Global branch history: an m-bit shift register keeping the T/NT status of the last m branches
• Each entry in the table has 2^m n-bit predictors
Correlating Branches

(2,2) predictor: the behavior of the two most recent branches selects between four predictions for the next branch, updating just that prediction.

[Figure: the branch address indexes a table of four 2-bit predictors per branch; a 2-bit global branch history selects which of the four supplies the prediction]
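The (m,n) scheme above can be sketched compactly. This is an illustrative model, not a hardware description: table size, PC indexing, and the initial counter value are assumptions.

```python
class CorrelatingPredictor:
    """An (m, n) correlating predictor: a global m-bit history
    selects one of 2**m n-bit saturating counters per table entry."""

    def __init__(self, m=2, n=2, entries=1024):
        self.m, self.entries = m, entries
        self.history = 0                   # global m-bit shift register
        self.max_count = (1 << n) - 1
        # start each counter weakly taken (assumption)
        self.table = [[self.max_count // 2 + 1] * (1 << m)
                      for _ in range(entries)]

    def predict(self, pc):
        ctr = self.table[pc % self.entries][self.history]
        return ctr > self.max_count // 2   # upper half => predict taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        ctr = row[self.history]
        if taken:
            row[self.history] = min(ctr + 1, self.max_count)
        else:
            row[self.history] = max(ctr - 1, 0)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

A strictly alternating branch defeats a plain 2-bit BHT, but a (2,2) predictor learns it almost immediately, because each 2-bit history context always precedes the same outcome.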
Accuracy of Different Schemes

[Figure: frequency of mispredictions on SPEC89 for three predictors — 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries organized as a (2,2) predictor. Across nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, rates range from 0% to 18%; the (2,2) predictor is as good as or better than the simple 2-bit BHT, and the unlimited-entry BHT is barely better than the 4,096-entry one]
Tournament Predictors

A tournament predictor using, say, 4K 2-bit counters indexed by the local branch address chooses between:
• Global predictor
  – 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
  – Each entry is a standard 2-bit predictor
• Local predictor
  – Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
  – The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters
Tournament Predictors

• Multilevel branch predictor
• Uses an n-bit saturating counter to choose between predictors
• Usual choice is between a global and a local predictor

The selector is updated with the outcome pair P1/P2 = {0,1}/{0,1}, where 0 means that predictor's prediction was incorrect and 1 means it was correct.
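The selection counter can be sketched on its own. A minimal sketch, assuming a single 2-bit selector and taking the two component predictors' predictions as given inputs (the update rule moves toward whichever predictor was right when exactly one of them was):

```python
def tournament_select(outcomes, p1_preds, p2_preds):
    """2-bit selector between predictors P1 and P2. Returns the
    prediction chosen for each branch; sel 0-1 favors P1, 2-3
    favors P2."""
    sel = 0
    chosen = []
    for taken, p1, p2 in zip(outcomes, p1_preds, p2_preds):
        chosen.append(p1 if sel < 2 else p2)
        # update only when exactly one component was correct
        if p1 == taken and p2 != taken:
            sel = max(sel - 1, 0)
        elif p2 == taken and p1 != taken:
            sel = min(sel + 1, 3)
    return chosen
```

If P2 is consistently right and P1 consistently wrong, the selector switches to P2 after two branches, mirroring the 2-bit hysteresis used for prediction itself.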
Comparing Predictors (Fig. 2.8)
• Advantage of tournament predictor is ability to
select the right predictor for a particular branch
– Particularly crucial for integer benchmarks.
– A typical tournament predictor will select the global predictor
almost 40% of the time for the SPEC integer benchmarks and
less than 15% of the time for the SPEC FP benchmarks
Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)

[Figure: branch mispredictions per 1000 instructions on SPEC2000. SPECint2000 (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty) ranges from about 5 to 13 per 1000 instructions — a 6% misprediction rate per branch (19% of INT instructions are branches). SPECfp2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa) ranges from about 0 to 1 per 1000 instructions — a 2% misprediction rate per branch (5% of FP instructions are branches)]
Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls the
instruction fetch.
• BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found the corresponding
Predicted PC is returned
• If the branch was predicted taken, instruction
fetch continues at the returned predicted PC
Branch Target Buffers
Branch target folding: for unconditional branches store the
target instructions themselves in the buffer!
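The fetch-stage use of the BTB described above can be sketched in a few lines. This is an illustrative model, assuming 4-byte instructions and modeling the BTB as a dictionary mapping branch PC to a (predicted-taken, target) pair; the sample entry is hypothetical.

```python
def fetch_next_pc(pc, btb):
    """Sketch of branch-target-buffer use during instruction fetch:
    on a BTB hit with a taken prediction, fetch continues at the
    stored target; otherwise fall through to pc + 4."""
    entry = btb.get(pc)
    if entry is not None:
        predicted_taken, target = entry
        if predicted_taken:
            return target
    return pc + 4            # BTB miss or predicted not taken

btb = {0x40: (True, 0x100)}  # hypothetical entry: branch at 0x40
```

The point of the structure is that the predicted target is available at fetch time, before the branch is even decoded, so no fetch bubble is needed on a correct taken prediction.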
Dynamic Branch Prediction Summary

• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches (GA)
  – Or different executions of the same branch (PA)
• Tournament predictors take the insight to the next level by using multiple predictors
  – usually one based on global information and one based on local information, combined with a selector
  – In 2006, tournament predictors using about 30K bits were in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes branch address & prediction
• Branch target folding
In-Class Exercise

Given a deeply pipelined processor and a branch-target buffer for conditional branches only, assume a misprediction penalty of 4 cycles, a buffer miss penalty of 3 cycles, a 90% hit rate, 90% accuracy, and 15% branch frequency. How much faster is the processor with the BTB than a processor with a fixed 2-cycle branch penalty?

Speedup = CPI_noBTB / CPI_BTB
        = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
Stalls  = Σ Frequency × Penalty
Stalls_noBTB = 15% × 2 = 0.30
Stalls_BTB = Stalls_miss-BTB + Stalls_hit-BTB-correct + Stalls_hit-BTB-wrong
           = 15% × 10% × 3 + 15% × 90% × 90% × 0 + 15% × 90% × 10% × 4
           = 0.045 + 0 + 0.054 = 0.099
Speedup = (1 + 0.30) / (1 + 0.099) ≈ 1.18
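The exercise's arithmetic can be recomputed directly from its formulas. A small sketch (parameter names are mine; the stall model is exactly the slide's three-term sum):

```python
def btb_speedup(branch_freq, hit_rate, accuracy,
                mispredict_penalty, miss_penalty, base_penalty,
                cpi_base=1.0):
    """Speedup of a BTB machine over one with a fixed branch penalty,
    using Stalls = sum(frequency * penalty) as on the slide."""
    stalls_no_btb = branch_freq * base_penalty
    stalls_btb = (branch_freq * (1 - hit_rate) * miss_penalty          # BTB miss
                  + branch_freq * hit_rate * accuracy * 0              # hit, correct
                  + branch_freq * hit_rate * (1 - accuracy)
                    * mispredict_penalty)                              # hit, wrong
    return (cpi_base + stalls_no_btb) / (cpi_base + stalls_btb)

# The exercise's numbers: 15% branches, 90% hit, 90% accuracy,
# 4-cycle mispredict, 3-cycle BTB miss, fixed 2-cycle penalty without BTB
speedup = btb_speedup(0.15, 0.90, 0.90, 4, 3, 2)
```

With these inputs the BTB stalls come to 0.099 cycles per instruction and the speedup is about 1.18.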

Dynamic Scheduling

Rearrange the order of instructions to reduce stalls while maintaining data flow.

Advantages:
• Compiler doesn't need to have knowledge of the microarchitecture
• Handles cases where dependences are unknown at compile time

Disadvantages:
• Substantial increase in hardware complexity
• Complicates exceptions

Dynamic Scheduling

Dynamic scheduling implies:
• Out-of-order execution
• Out-of-order completion
• This creates the possibility of WAR and WAW hazards

Tomasulo's approach:
• Tracks when operands are available
• Introduces register renaming in hardware
  – Minimizes WAW and WAR hazards

Register Renaming

Example:

DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14   ;antidependence on F8 (read by ADD.D)
MUL.D F6,F10,F8    ;antidependence on F6 (read by S.D) + output dependence on F6

Register Renaming

Example, renamed with temporaries S and T:

DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T

Now only RAW hazards remain, which can be strictly ordered.
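The renaming transformation above can be sketched as a one-pass rewrite. This is an illustrative sketch (the slides rename only the conflicting registers; here every destination gets a fresh temporary, which has the same effect): instructions are (op, dest, srcs) tuples, with dest = None for stores.

```python
def rename(instrs):
    """Rewrite each destination with a fresh temporary and redirect
    later reads to it. WAR and WAW hazards disappear; only true
    (RAW) dependences survive in the renamed sources."""
    current = {}               # architectural register -> current name
    fresh = iter("STUVWXYZ")   # pool of temporary names (assumption)
    out = []
    for op, dest, srcs in instrs:
        # RAW: reads see the most recent renamed producer
        srcs = tuple(current.get(s, s) for s in srcs)
        if dest is not None:
            new = next(fresh)
            current[dest] = new
            dest = new
        out.append((op, dest, srcs))
    return out

code = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6",)),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
```

On the slide's sequence, SUB.D and MUL.D get fresh destinations, so they no longer conflict with the earlier reads of F8 and F6.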
Register Renaming

Register renaming is provided by reservation stations (RS), each of which contains:
• The instruction
• Buffered operand values (when available)
• The reservation station number of the instruction providing the operand values

• An RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
• Pending instructions designate the RS to which they will send their output
  – Result values are broadcast on a result bus, called the common data bus (CDB)
  – Only the last output updates the register file
• As instructions are issued, the register specifiers are renamed with the reservation station
• There may be more reservation stations than registers

Tomasulo's Algorithm

Load and store buffers contain data and addresses, and act like reservation stations.

Top-level design: [figure — see the Tomasulo Organization diagram]
Tomasulo's Algorithm

Three steps:
• Issue
  – Get the next instruction from the FIFO queue
  – If an RS is available, issue the instruction to the RS, with operand values if available
  – If no operand values are available, stall the instruction
• Execute
  – When an operand becomes available, store it in any reservation stations waiting for it
  – When all operands are ready, execute the instruction
  – Loads and stores are maintained in program order through the effective address
  – No instruction is allowed to initiate execution until all branches that precede it in program order have completed
• Write result
  – Write the result on the CDB into reservation stations and store buffers
  – (Stores must wait until both address and value are received)
Tomasulo Organization

[Figure: the FP op queue issues into reservation stations — Add1-Add3 feeding the FP adders and Mult1-Mult2 feeding the FP multipliers; six load buffers (Load1-Load6) bring data from memory, and store buffers send results to memory; the FP registers, reservation stations, and buffers are all connected by the Common Data Bus (CDB)]
Reservation Station Components

Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
• Store buffers have a V field for the result to be stored
Qj, Qk: reservation stations producing the source registers (value to be written)
• Note: Qj,Qk = 0 => ready
• Store buffers have only Qi, for the RS producing the result
Busy: indicates that the reservation station or FU is busy

Register result status: indicates which functional unit will write each register, if one exists; blank when there are no pending instructions that will write that register.
Three Stages of Tomasulo Algorithm

1. Issue — get instruction from FP op queue
   If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).
2. Execute — operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result — finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.

• Normal data bus: data + destination ("go to" bus)
• Common data bus: data + source ("come from" bus)
  – 64 bits of data + 4 bits of functional-unit source address
  – A unit writes if the source matches the functional unit it expects (the producer of its result)
  – Does the broadcast
• Example speed: 3 clocks for FP +,-; 10 for *; 40 clocks for /
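The dataflow behavior of the three stages can be sketched with a toy simulator: each instruction "executes" as soon as its operands have been broadcast, regardless of program order. This is a simplification of the slides' machine — it ignores structural limits on reservation stations and the exact issue/execute/write-cycle accounting, so the absolute cycle numbers are approximate; the point it reproduces is the out-of-order completion of the upcoming example.

```python
# Latencies follow the slides' example speeds (load latency is an
# assumption); instructions are (op, dest, src1, src2) with None
# for absent sources.
LATENCY = {"LD": 2, "ADDD": 3, "SUBD": 3, "MULTD": 10, "DIVD": 40}

def run(program):
    """Return {instruction index: approximate write-result cycle},
    assuming one issue per cycle and execution starting as soon as
    all source operands have been broadcast on the CDB."""
    ready = {}                               # register -> broadcast cycle
    write = {}
    for idx, (op, dest, s1, s2) in enumerate(program):
        issue = idx + 1                      # in-order issue, one per cycle
        start = max([ready.get(s, 0) for s in (s1, s2) if s] + [issue])
        write[idx] = start + LATENCY[op]     # execute, then broadcast
        ready[dest] = write[idx]
    return write

program = [
    ("LD",    "F6",  None, None),
    ("LD",    "F2",  None, None),
    ("MULTD", "F0",  "F2", "F4"),
    ("SUBD",  "F8",  "F6", "F2"),
    ("DIVD",  "F10", "F0", "F6"),
    ("ADDD",  "F6",  "F8", "F2"),
]
cycles = run(program)
```

Even in this crude model, SUBD and ADDD write their results long before the earlier-issued MULTD, and DIVD finishes last: in-order issue, out-of-order execution, out-of-order completion.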
Tomasulo Example

Hardware: 3 load buffers (Load1-Load3), 3 FP adder reservation stations (Add1-Add3), 2 FP multiplier reservation stations (Mult1-Mult2). The Time column in each RS is an FU count-down; Clock is the cycle counter.

Instruction stream (Issue / ExecComp / WriteResult tracked per instruction):
LD    F6,34(R2)
LD    F2,45(R3)
MULTD F0,F2,F4
SUBD  F8,F6,F2
DIVD  F10,F0,F6
ADDD  F6,F8,F2

Clock = 0: all load buffers and reservation stations are free (Busy = No), and the register result status row (F0, F2, F4, F6, F8, F10, F12, ..., F30) is blank.
Tomasulo Example Cycle 1

LD F6,34(R2) issues to Load1 (Busy = Yes, Address = 34+R2); register result status: F6 = Load1.
Tomasulo Example Cycle 2

LD F2,45(R3) issues to Load2 (Busy = Yes, Address = 45+R3); register result status: F2 = Load2, F6 = Load1.

• Note: can have multiple loads outstanding
Tomasulo Example Cycle 3

MULTD F0,F2,F4 issues to Mult1 with Vk = R(F4) and Qj = Load2; register result status: F0 = Mult1. Load1 completes execution.

• Note: register names are removed ("renamed") in the reservation stations; MULT issued
• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4

Load1 writes M(A1) on the CDB and is freed; F6 = M(A1). SUBD F8,F6,F2 issues to Add1 with Vj = M(A1) and Qk = Load2; register result status: F8 = Add1. Load2 completes execution.

• Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5

Load2 writes M(A2) on the CDB and is freed; F2 = M(A2), and Add1 and Mult1 capture the value. DIVD F10,F0,F6 issues to Mult2 with Vk = M(A1) and Qj = Mult1; register result status: F10 = Mult2.

• Timers start counting down: Add1 (SUBD, time = 2) and Mult1 (MULTD, time = 10)
Tomasulo Example Cycle 6

ADDD F6,F8,F2 issues to Add2 with Vk = M(A2) and Qj = Add1; register result status: F6 = Add2. Add1 is at time = 1, Mult1 at time = 9.

• Issue ADDD here despite the name dependency on F6?
Tomasulo Example Cycle 7

Add1 (time = 0) finishes executing SUBD; Mult1 is at time = 8.

• Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 8

Add1 writes (M-M) on the CDB and is freed; F8 = (M-M), and Add2 captures the value as Vj. Add2 (ADDD) now counts down (time = 2); Mult1 is at time = 7.
Tomasulo Example Cycle 9

No writes this cycle: Add2 counts down to time = 1, Mult1 to time = 6.
Tomasulo Example Cycle 10

Add2 (time = 0) finishes executing ADDD; Mult1 is at time = 5.

• Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11

Add2 writes (M-M+M) on the CDB and is freed; F6 = (M-M+M). Mult1 is at time = 4.

• Write the result of ADDD here, overwriting F6?
• All quick instructions complete in this cycle!
Tomasulo Example Cycles 12-14

No state changes except Mult1's count-down: time = 3 at cycle 12, 2 at cycle 13, 1 at cycle 14.
Tomasulo Example Cycle 15

Mult1 (time = 0) finishes executing MULTD.

• Mult1 (MULTD) completing; what is waiting for it?
Tomasulo Example Cycle 16

Mult1 writes M*F4 on the CDB and is freed; F0 = M*F4, and Mult2 captures the value as Vj. Mult2 (DIVD) now counts down (time = 40).

• Just waiting for Mult2 (DIVD) to complete
Faster than light computation
(skip a couple of cycles)
Tomasulo Example Cycle 55

Mult2 is at time = 1.
Tomasulo Example Cycle 56

Mult2 (time = 0) finishes executing DIVD.

• Mult2 (DIVD) is completing; what is waiting for it?
Tomasulo Example Cycle 57

Mult2 writes the result on the CDB and is freed; F10 = Result.

Instruction status (final):   Issue   ExecComp   WriteResult
LD    F6,34(R2)                 1        3           4
LD    F2,45(R3)                 2        4           5
MULTD F0,F2,F4                  3       15          16
SUBD  F8,F6,F2                  4        7           8
DIVD  F10,F0,F6                 5       56          57
ADDD  F6,F8,F2                  6       10          11

• Once again: in-order issue, out-of-order execution, and out-of-order completion.
Speculation to greater ILP
• Greater ILP: overcome control dependence by hardware speculating on the
  outcome of branches and executing the program as if the guesses were correct
  – Speculation => fetch, issue, and execute instructions as if branch
    predictions were always correct
  – Dynamic scheduling => only fetches and issues instructions
• Essentially a data-flow execution model: operations execute as soon as
  their operands are available
7/27/2016
CSCE 430/830 Thread Level Parallelism
77
Speculation to greater ILP
• 3 components of HW-based speculation:
1. Dynamic branch prediction to choose which
instructions to execute
2. Speculation to allow execution of instructions
before control dependences are resolved
+ ability to undo effects of incorrectly speculated
sequence
3. Dynamic scheduling to deal with scheduling of
different combinations of basic blocks
Adding Speculation to Tomasulo
• Must separate execution from allowing
instruction to finish or “commit”
• This additional step called instruction commit
• When an instruction is no longer speculative,
allow it to update the register file or memory
• Requires additional set of buffers to hold results
of instructions that have finished execution but
have not committed
• This reorder buffer (ROB) is also used to pass
results among instructions that may be
speculated
Reorder Buffer (ROB)
• In non-speculative Tomasulo’s algorithm, once
an instruction writes its result, any subsequently
issued instructions will find result in the register
file
• With speculation, the register file is not updated
until the instruction commits
– (we know definitively that the instruction should execute)
• Thus, the ROB supplies operands in interval
between completion of instruction execution and
instruction commit
– ROB is a source of operands for instructions, just as
reservation stations (RS) provide operands in Tomasulo’s
algorithm
– The ROB extends the architectural registers, just as the RS do
Reorder Buffer Entry Fields
Each entry in the ROB contains four fields:
1. Instruction type
   • a branch (has no destination result), a store (has a memory address
     destination), or a register operation (ALU operation or load, which
     has a register destination)
2. Destination
   • Register number (for loads and ALU operations) or memory address
     (for stores) where the instruction result should be written
3. Value
   • Value of the instruction result until the instruction commits
4. Ready
   • Indicates that the instruction has completed execution and the value
     is ready
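The four fields above can be sketched as a small record type. This is an illustrative Python sketch; the field names are mine, not from any particular processor.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    # 1. Instruction type: "branch", "store", or "regop" (ALU op or load)
    itype: str
    # 2. Destination: register name (regop) or memory address (store)
    dest: Optional[Union[str, int]]
    # 3. Value: result held here until the instruction commits
    value: Optional[int] = None
    # 4. Ready: execution has completed and value is valid
    ready: bool = False

entry = ROBEntry(itype="regop", dest="F0")
entry.value, entry.ready = 42, True   # marks execution as complete
print(entry.ready)   # True
```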
Reorder Buffer operation
• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results are placed into the ROB
  – Supplies operands to other instructions between execution complete &
    commit => more registers, like the RS
  – Tag results with the ROB entry number instead of the reservation station
• On commit, values at the head of the ROB are placed in registers (or
  memory locations)
• As a result, it is easy to undo speculated instructions on mispredicted
  branches or on exceptions
(Block diagram omitted: FP Op Queue -> Reservation Stations -> FP adders /
FP multipliers -> Reorder Buffer -> FP Regs, with the commit path running
through the ROB.)
Recall: 4 Steps of Speculative Tomasulo
Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr &
send operands & reorder buffer no. for destination (this stage
sometimes called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch
CDB for result; when both in reservation station, execute;
checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update
register with result (or store to memory) and remove instr from
reorder buffer. Mispredicted branch flushes reorder buffer
(sometimes called “graduation”)
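Step 4 (commit) can be sketched as follows. This is an illustrative model, not a real pipeline: entries commit from the head of the ROB while ready, and a mispredicted branch at the head flushes everything behind it.

```python
from collections import deque

def commit(rob, regs, mem):
    """Commit from the ROB head while the head entry is ready.
    Each entry: (itype, dest, value, ready, mispredicted)."""
    while rob and rob[0][3]:                   # head has its result
        itype, dest, value, _ready, mispred = rob.popleft()
        if itype == "branch":
            if mispred:
                rob.clear()                    # flush speculative work
                return
        elif itype == "store":
            mem[dest] = value                  # memory updated in order
        else:
            regs[dest] = value                 # register file updated

regs, mem = {}, {}
rob = deque([
    ("regop", "F0", 7, True, False),
    ("branch", None, None, True, True),   # mispredicted branch
    ("regop", "F4", 9, True, False),      # speculative: must be flushed
])
commit(rob, regs, mem)
print(regs)   # {'F0': 7} -- the instruction after the branch never commits
```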
Tomasulo With Reorder buffer
(Sequence of block-diagram slides omitted. The example issues, in order:

  LD   F0,10(R2)    ROB1  (oldest)
  ADDD F10,F4,F0    ROB2
  DIVD F2,F10,F6    ROB3
  BNE  F2,<…>       ROB4
  LD   F4,0(R3)     ROB5
  ADDD F0,F4,F6     ROB6
  ST   0(R3),F4     ROB7  (newest)

Reservation stations name ROB entries as operand sources, e.g.
ADDD R(F4),ROB1 and DIVD ROB2,R(F6). The later frames show the second LD
and the ST completing ("Done? = Y") and the second ADDD executing while
the BNE in ROB4 is still unresolved — none of these speculative results
can commit until the branch resolves.)
What about memory hazards???
Avoiding Memory Hazards
• WAW and WAR hazards through memory are
eliminated with speculation because actual
updating of memory occurs in order, when a
store is at head of the ROB, and hence, no
earlier loads or stores can still be pending
• RAW dependences through memory are
maintained by two restrictions:
1. not allowing a load to initiate the second step of its execution
if any active ROB entry occupied by a store has a Destination
field that matches the value of the A field of the load, and
2. maintaining the program order for the computation of an
effective address of a load with respect to all earlier stores.
• these restrictions ensure that any load that
accesses a memory location written to by an
earlier store cannot perform the memory access
until the store has written the data
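Restriction 1 above can be sketched as a simple check. This is illustrative: earlier ROB entries are modeled only as (type, destination address) pairs.

```python
def load_may_proceed(load_addr, earlier_entries):
    """A load may start its memory access only if no earlier, still-active
    store in the ROB has a destination address matching the load's A field."""
    return all(not (itype == "store" and dest == load_addr)
               for itype, dest in earlier_entries)

pending = [("store", 0x100), ("regop", None)]
print(load_may_proceed(0x100, pending))  # False: must wait for the store
print(load_may_proceed(0x200, pending))  # True: no address match
```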
Exceptions and Interrupts
• IBM 360/91 invented “imprecise interrupts”
– Computer stopped at this PC; it's likely close to this address
– Not so popular with programmers
– Also, what about Virtual Memory? (Not in IBM 360)
• Technique for both precise interrupts/exceptions
and speculation: in-order completion and in-order commit
– If we speculate and are wrong, need to back up and restart
execution to point at which we predicted incorrectly
– This is exactly the same as what we need to do for precise exceptions
• Exceptions are handled by not recognizing the
exception until instruction that caused it is ready
to commit in ROB
– If a speculated instruction raises an exception, the exception
is recorded in the ROB
– This is why reorder buffers are in all new processors
Multiple Issue and Static Scheduling
• To achieve CPI < 1, need to complete multiple instructions per clock
• Solutions:
  – Statically scheduled superscalar processors
  – VLIW (very long instruction word) processors
  – Dynamically scheduled superscalar processors
Copyright © 2012, Elsevier Inc. All rights reserved.
VLIW: Very Long Instruction Word
• Each “instruction” has explicit coding for multiple
operations
– In IA-64, grouping called a “packet”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long
instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches
Recall: Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: L.D     F0,0(R1)
 2       L.D     F6,-8(R1)
 3       L.D     F10,-16(R1)
 4       L.D     F14,-24(R1)
 5       ADD.D   F4,F0,F2
 6       ADD.D   F8,F6,F2
 7       ADD.D   F12,F10,F2
 8       ADD.D   F16,F14,F2
 9       S.D     0(R1),F4
10       S.D     -8(R1),F8
11       S.D     -16(R1),F12
12       DSUBUI  R1,R1,#32
13       BNEZ    R1,LOOP
14       S.D     8(R1),F16    ; 8-32 = -24

Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles
14 clock cycles, or 3.5 per iteration
Loop Unrolling in VLIW

Memory          Memory          FP               FP               Int. op/          Clock
reference 1     reference 2     operation 1      op. 2            branch
L.D F0,0(R1)    L.D F6,-8(R1)                                                       1
L.D F10,-16(R1) L.D F14,-24(R1)                                                     2
L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2   ADD.D F8,F6,F2                     3
L.D F26,-48(R1)                 ADD.D F12,F10,F2 ADD.D F16,F14,F2                   4
                                ADD.D F20,F18,F2 ADD.D F24,F22,F2                   5
S.D 0(R1),F4    S.D -8(R1),F8   ADD.D F28,F26,F2                                    6
S.D -16(R1),F12 S.D -24(R1),F16                                                     7
S.D -32(R1),F20 S.D -40(R1),F24                                   DSUBUI R1,R1,#48  8
S.D -0(R1),F28                                                    BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)
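The per-iteration numbers on this slide can be checked with a little arithmetic. A sketch: the operation count of 23 is what I read off the schedule above, and the 5-slot issue width matches the table's columns.

```python
# Counts read off the VLIW schedule above (illustrative check).
ops = 23                 # operations placed across the 9 issue slots
clocks = 9
iterations = 7
slots_per_clock = 5      # 2 mem refs + 2 FP ops + 1 int/branch

print(round(clocks / iterations, 2))               # ~1.29 clocks/iteration
print(round(ops / clocks, 2))                      # ~2.56 ops per clock
print(round(ops / (clocks * slots_per_clock), 2))  # ~0.51 slot efficiency
```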
Problems with 1st Generation VLIW
• Increase in code size
– generating enough operations in a straight-line code fragment
requires ambitiously unrolling loops
– whenever VLIW instructions are not full, unused functional
units translate to wasted bits in instruction encoding
• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused entire processor
to stall, since all functional units must be kept synchronized
– The compiler might predict functional-unit latencies, but cache
behavior is hard to predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit
latencies require different versions of the code
Multiple Issue
• Limit the number of instructions of a given class that can be issued in
  a "bundle"
  – i.e., one FP, one integer, one load, one store
• Examine all the dependencies among the instructions in the bundle
• If dependencies exist in the bundle, encode them in the reservation
  stations
• Also need multiple completion/commit
Example

Loop: LD     R2,0(R1)      ;R2=array element
      DADDIU R2,R2,#1      ;increment R2
      SD     R2,0(R1)      ;store result
      DADDIU R1,R1,#8      ;increment pointer
      BNE    R2,R3,LOOP    ;branch if not last element
Example (No Speculation)
(Timing table omitted: issue, execute, and write-result cycles for the
loop without speculation.)

Example (Speculation)
(Timing table omitted: issue, execute, write-result, and commit cycles
for the loop with speculation.)
Branch-Target Buffer
• Need high instruction bandwidth!
• Branch-target buffers
  – Next-PC prediction buffer, indexed by the current PC
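A branch-target buffer can be sketched as a small tagged table indexed by low PC bits. The table size and the PC+4 fall-through are illustrative assumptions, not from the slides.

```python
BTB_ENTRIES = 1024          # illustrative size
btb = {}                    # index -> (tag, predicted target)

def btb_predict(pc):
    """Return the predicted next PC, or the fall-through PC on a miss."""
    idx, tag = pc % BTB_ENTRIES, pc // BTB_ENTRIES
    entry = btb.get(idx)
    if entry and entry[0] == tag:
        return entry[1]     # hit: predicted-taken target
    return pc + 4           # miss: fall through

def btb_update(pc, target):
    btb[pc % BTB_ENTRIES] = (pc // BTB_ENTRIES, target)

btb_update(0x40, 0x100)
print(hex(btb_predict(0x40)))  # 0x100 (hit)
print(hex(btb_predict(0x44)))  # 0x48 (miss: fall through)
```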
Branch Folding
• Optimizations:
  – Larger branch-target buffer
  – Add the target instruction into the buffer to deal with the longer
    decoding time required by a larger buffer
  – "Branch folding"
Return Address Predictor
• Most unconditional branches come from function returns
• The same procedure can be called from multiple sites
  – Causes the buffer to potentially forget the return address from
    previous calls
• Create a return-address buffer organized as a stack
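The return-address stack can be sketched as follows; the depth and the drop-oldest overflow policy are illustrative assumptions.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)       # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        # None means no prediction available (stack empty / underflow)
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x104)       # call from 0x100
ras.on_call(0x204)       # nested call from 0x200
print(hex(ras.predict_return()))   # 0x204 -- innermost return first
print(hex(ras.predict_return()))   # 0x104
```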
Integrated Instruction Fetch Unit
• Design a monolithic unit that performs:
  – Branch prediction
  – Instruction prefetch
    » Fetch ahead
  – Instruction memory access and buffering
    » Deal with crossing cache lines
Register Renaming
• Register renaming vs. reorder buffers
  – Instead of virtual registers from reservation stations and the reorder
    buffer, create a single register pool
    » Contains visible registers and virtual registers
  – Use a hardware-based map to rename registers during issue
  – WAW and WAR hazards are avoided
  – Speculation recovery occurs by copying during commit
  – Still need a ROB-like queue to update the map in order
• Simplifies commit:
  – Record that the mapping between architectural register and physical
    register is no longer speculative
  – Free up the physical register used to hold the older value
  – In other words: SWAP physical registers on commit
• Physical register de-allocation is more difficult
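The hardware rename map can be sketched as below. This is a minimal model; the register names and free-list size are illustrative, and every write allocates a fresh physical register, which is what removes WAW/WAR hazards.

```python
free_list = [f"p{i}" for i in range(8)]   # physical register pool
rename_map = {}                           # architectural -> physical

def rename(dest, srcs):
    """Rename one instruction: read srcs through the map, then allocate
    a fresh physical register for dest."""
    phys_srcs = [rename_map.get(s, s) for s in srcs]
    phys_dest = free_list.pop(0)
    rename_map[dest] = phys_dest
    return phys_dest, phys_srcs

print(rename("r1", ["r2", "r3"]))   # ('p0', ['r2', 'r3'])
print(rename("r1", ["r1", "r4"]))   # ('p1', ['p0', 'r4']) -- WAW removed
```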
Integrated Issue and Renaming
• Combining instruction issue with register renaming:
  – Issue logic pre-reserves enough physical registers for the bundle
    (fixed number?)
  – Issue logic finds dependencies within the bundle, maps registers as
    necessary
  – Issue logic finds dependencies between the current bundle and already
    in-flight bundles, maps registers as necessary
How Much to Speculate?
• How much to speculate
  – Mis-speculation degrades performance and power relative to no
    speculation
    » May cause additional misses (cache, TLB)
  – Prevent speculative code from causing higher-cost misses (e.g., in L2)
• Speculating through multiple branches
  – Complicates speculation recovery
  – No processor can resolve multiple branches per cycle
Energy Efficiency
• Speculation and energy efficiency
  – Note: speculation is only energy efficient when it significantly
    improves performance
• Value prediction
  – Uses:
    » Loads that load from a constant pool
    » Instructions that produce a value from a small set of values
  – Has not been incorporated into modern processors
  – A similar idea, address aliasing prediction, is used on some processors
Pipelining
• Multiple instructions are overlapped in execution
• Takes advantage of parallelism
• Increases instruction throughput, but does not reduce the time to
  complete an individual instruction
• Ideal case:
  – Time per instruction = time per instruction on the unpipelined
    machine / number of pipe stages
  – Speedup = the number of pipe stages
• Intel Pentium 4: 20 stages; Cedar Mill: 31 stages
• Intel Core i7: 14 stages. Why?
The Basics of a RISC Instruction Set
• Key Properties:
– All operations on data apply to data in registers
– The only operations that affect memory are load and store
operations
– The instruction formats are few in number, with all
instructions typically being one size.
7/27/2016
CSCE 430/830, Basic Pipelining & Performance
118
Basic Pipelining: A "Typical" RISC ISA
• 32-bit fixed format instruction (3 formats)
– ALU instructions, Load/Store, Branches and Jumps
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store:
base + displacement
– no indirection
• Simple branch conditions /Jump
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Basic Pipelining: e.g., MIPS

Register-Register
  31      26 25     21 20     16 15     11 10      6 5       0
  |   Op    |  Rs1    |  Rs2    |   Rd    |         |  Opx    |

Register-Immediate
  31      26 25     21 20     16 15                           0
  |   Op    |  Rs1    |   Rd    |         immediate           |

Branch
  31      26 25     21 20     16 15                           0
  |   Op    |  Rs1    | Rs2/Opx |         immediate           |

Jump / Call
  31      26 25                                               0
  |   Op    |                  target                         |
A simple implementation of a RISC
Instruction Set
• Every instruction can be implemented in at most
5 cycles.
– Instruction fetch cycle (IF): send PC to memory and fetch the
current instruction; Update the PC to the next PC by adding 4.
– Instruction decode/register fetch cycle (ID): decode the
instruction and read the registers. For a branch, do the
equality test on the registers, and compute the target address
– Execution/effective address cycle (EX): The ALU operates on
the operands prepared in the prior cycle, performing:
» Memory reference: ALU adds the base register and the
offset.
LW 100(R1) => 100 + R1
» Register-Register ALU instruction: ALU performs the
operation specified by the opcode on the two values read from
the register file. ADD R3,R1,R2 => R1 + R2
» Register-Immediate ALU instruction: ALU performs the
operation specified by the opcode on the first value read
from the register file and the immediate. ADD R3,R1,#3 =>
R1 + 3
A simple implementation of a RISC
Instruction Set
• Every instruction can be implemented in at most
5 cycles.
– Memory access (MEM). Read the memory using the effective
address computed in the previous cycle if the instruction is a
load; If it is a store, write the data to the memory
– Write-back cycle (WB). For Register-Register ALU instruction
or load instruction, write the result into the register file,
whether it comes from the memory (for a load) or from the
ALU (for an ALU instruction)
• Branch instructions require 2 cycles, store
instructions require 4 cycles, others require 5
cycles.
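The cycle counts above (branch 2, store 4, others 5) fix the unpipelined CPI once an instruction mix is known. The mix below is an assumed example, not from the slides.

```python
# Cycles per class from the slide; the mix frequencies are assumptions.
cycles = {"branch": 2, "store": 4, "other": 5}
mix    = {"branch": 0.12, "store": 0.10, "other": 0.78}

cpi = sum(mix[c] * cycles[c] for c in cycles)
print(round(cpi, 2))   # 4.54 for this assumed mix
```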
An example of an instruction execution
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath diagram omitted: PC, instruction memory, IR, register file,
ALU.)

Instruction Fetch (IF)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• Fetch the instruction from memory into the IR
• Move the PC to the next instruction
Instruction Decode (ID)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• Decode the instruction: opcode ADD, sources R1 and R2, destination R3
• Read the values x and y from R1 and R2 in the register file
Execution (EX)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• The ALU executes the operation, computing x + y

Writeback (WB)
• ADD R3, R1, R2  =>  R3 = R1 + R2;
(Datapath animation frames omitted.)
• Write the result x + y into R3 in the register file
Another example: Load
• LW R3, R31, 200  =>  R3 = mem[R31 + 200];
(Datapath animation frames omitted.)
• ID: read the base value n from R31 and the immediate 200
• EX: the ALU computes the effective address n + 200
• MEM: read memory at address n + 200 (here the value 123456)
• WB: write the loaded value into R3
5-stage pipeline

• IF:  memory    — used by alu, load/store, branch
• ID:  reg file  — used by alu, load/store, branch
• EX:  ALU       — used by alu, load/store
• MEM: memory    — used by load/store
• WB:  reg file  — used by alu, load

• Can we change the order of MEM and WB? Why?
5 Steps of MIPS Datapath (Figure A.2, Page A-8)
(Datapath diagram omitted: Instruction Fetch, Instr. Decode / Reg. Fetch,
Execute / Addr. Calc, Memory Access, Write Back.)

IR <= mem[PC];
PC <= PC + 4
Reg[IRrd] <= Reg[IRrs1] opIRop Reg[IRrs2]
Classic RISC Pipeline
Source: http://en.wikipedia.org/wiki/Classic_RISC_pipeline
Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows the overlap among the parts of
the data path, with clock cycle 5 (CC 5) showing the steady-state situation. Because the register file is used as a source in the ID
stage and as a destination in the WB stage, it appears twice. We show that it is read in one part of the stage and written in another
by using a solid line, on the right or left, respectively, and a dashed line on the other side. The abbreviation IM is used for
instruction memory, DM for data memory, and CC for clock cycle.
Inst. Set Processor Controller
(State-machine diagram omitted. Every instruction begins with
IR <= mem[PC]; PC <= PC + 4, then after operand fetch
(A <= Reg[IRrs1]; B <= Reg[IRrs2]) decodes into per-class sequences:
RR: r <= A opIRop B; RI: r <= A opIRop IRim; LD: r <= A + IRim then
WB <= Mem[r]; jmp/JSR: PC <= IRjaddr; br: PC <= PC+IRim if bop(A,b).
Results commit via WB <= r and Reg[IRrd] <= WB.)
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
(Pipelined datapath diagram omitted: IF/ID, ID/EX, EX/MEM, and MEM/WB
pipeline registers separate the five stages.)

IF:  IR <= mem[PC];  PC <= PC + 4
ID:  A <= Reg[IRrs1];  B <= Reg[IRrs2]
EX:  rslt <= A opIRop B
MEM: WB <= rslt
WB:  Reg[IRrd] <= WB
Visualizing Pipelining (Figure A.2, Page A-8)
(Pipeline diagram omitted: instructions flow through Ifetch, Reg, ALU,
DMem, Reg across clock cycles 1-7, one new instruction starting per
cycle.)
(Pipeline diagrams omitted: the same five-instruction overlap extended
across clock cycles 1-9, each instruction passing through Ifetch, Reg,
ALU, DMem, and Reg.)
Pipelining is not quite that easy!
• Limits to pipelining: Hazards prevent next instruction
from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior instruction still
in the pipeline
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches
and jumps).
One Memory Port/Structural Hazards (Figure A.4, Page A-14)
(Pipeline diagram omitted: Load followed by Instr 1-4; in cycle 4 the
Load's DMem access and Instr 3's Ifetch both need the single memory
port.)
One Memory Port/Structural Hazards (Similar to Figure A.5, Page A-15)
(Pipeline diagram omitted: the structural hazard is resolved by stalling
Instr 3 for one cycle, inserting a bubble into the pipeline.)
Data Hazard on R1 (Figure A.6, Page A-17)

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

(Pipeline diagram omitted: add writes r1 in WB, but sub, and, and or read
r1 in ID/RF before that write completes.)
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler
nomenclature). This hazard results from an actual
need for communication.
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can it happen in MIPS 5 stage pipeline?
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
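The three hazard definitions above can be sketched as a small classifier. This is illustrative Python; instructions are modeled only as a destination register and a list of source registers.

```python
def classify_hazards(i, j):
    """Classify the data hazards that later instruction j has on earlier
    instruction i. Each instruction: (dest_reg, [src_regs])."""
    hazards = []
    if i[0] in j[1]:
        hazards.append("RAW")   # j reads what i writes
    if j[0] in i[1]:
        hazards.append("WAR")   # j writes what i reads
    if j[0] == i[0]:
        hazards.append("WAW")   # j writes what i writes
    return hazards

# I: add r1,r2,r3   J: sub r4,r1,r3  -> RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # ['RAW']
# I: sub r4,r1,r3   J: add r1,r2,r3  -> WAR on r1
print(classify_hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"])))  # ['WAR']
```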
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
(Pipeline diagram omitted: the ALU result of add r1,r2,r3 is forwarded
from the EX/MEM and MEM/WB registers directly to the ALU inputs of
sub r4,r1,r3, and r6,r1,r7, and or r8,r1,r9, avoiding the stalls.)
HW Change for Forwarding (Figure A.23, Page A-37)
(Datapath diagram omitted: multiplexers at the ALU inputs select among
the ID/EX, EX/MEM, and MEM/WR pipeline registers, plus the forwarding
path from data memory.)
Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)
(Pipeline diagram omitted: add r1,r2,r3; lw r4,0(r1); sw r4,12(r1) — the
loaded value is forwarded from the lw's MEM stage to the sw's MEM stage.)
Any hazard that cannot be avoided with forwarding?
Data Hazard Even with Forwarding (Figure A.9, Page A-21)
(Pipeline diagram omitted: lw r1,0(r2) produces r1 only at the end of
MEM, too late for sub r4,r1,r6 to use it in EX in the very next cycle.)
Data Hazard Even with Forwarding
(Similar to Figure A.10, Page A-21)
[Pipeline diagram: the same sequence with a one-cycle bubble inserted after the load. Stalling sub by one cycle lets the loaded r1 be forwarded from MEM to the ALU; the and and or instructions are delayed by the same bubble.]
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:              Fast code:
    LW  Rb,b                LW  Rb,b
    LW  Rc,c                LW  Rc,c
    ADD Ra,Rb,Rc            LW  Re,e
    SW  a,Ra                ADD Ra,Rb,Rc
    LW  Re,e                LW  Rf,f
    LW  Rf,f                SW  a,Ra
    SUB Rd,Re,Rf            SUB Rd,Re,Rf
    SW  d,Rd                SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.
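The scheduling win above can be checked mechanically. A small Python sketch — the tuple encoding of the instructions is an assumption for illustration, not the slides' notation — counts load-use stalls, the only data-hazard stalls forwarding cannot hide in the 5-stage pipeline:

```python
# Count load-use stalls in a classic 5-stage MIPS pipeline with full
# forwarding: a stall occurs only when an instruction uses the result
# of the load immediately before it. Instructions are (op, dest, srcs).
def load_use_stalls(code):
    stalls = 0
    for prev, curr in zip(code, code[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in curr[2]:
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
        ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]),
        ("SW", None, ["Rd"])]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```

Reordering the loads removes both load-use pairs without changing the computed values.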
Control Hazard:
[Datapath diagram: IF, ID/RF, EX/address calc, MEM, WB stages, with the Zero? test and the branch-target adder located in EX.]
• Branch condition determined @ end of EX stage
• Target instruction address ready @ end of MEM stage
Control Hazard on Branches: Three Stage Stall
[Pipeline diagram: 10: beq r1,r3,36; 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9; 36: xor r10,r1,r11. The three instructions after the beq enter the pipeline before the branch resolves.]
What do you do with the 3 instructions in between?
How do you do it?
Where is the “commit”?
Reg
Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches,
a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
– Determine branch outcome sooner, AND
– Compute taken-branch (target) address earlier
• MIPS branch tests whether a register = 0 or ≠ 0
• MIPS solution:
– Move the zero test to the ID/RF stage
– Add an adder to calculate the new PC in the ID/RF stage
– 1 clock cycle penalty for branch versus 3
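The CPI arithmetic on this slide is easy to replay. A quick Python sketch:

```python
# New CPI = base CPI + branch frequency * branch penalty (in cycles).
def cpi_with_branches(base_cpi, branch_freq, penalty):
    return base_cpi + branch_freq * penalty

print(round(cpi_with_branches(1.0, 0.30, 3), 2))  # 1.9 with a 3-cycle penalty
print(round(cpi_with_branches(1.0, 0.30, 1), 2))  # 1.3 with the 1-cycle penalty
```

Resolving branches in ID cuts the branch contribution to CPI from 0.9 cycles per instruction to 0.3.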
Two-part solution to the long stalls:
[Datapath diagram: the zero test and a second adder for the branch target are moved into the ID/RF stage, so the Next PC mux can select the target at the end of ID.]
1. Determine the condition earlier @ ID
2. Calculate the target instruction address earlier @ ID
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
  » MIPS still incurs 1 cycle branch penalty
  » Other machines: branch target known before branch condition determination
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction

    branch instruction
    sequential successor1
    sequential successor2      } branch delay of length n
    ........
    sequential successorn
    branch target if taken

– 1 slot delay allows proper decision and branch target address in a 5-stage pipeline
– MIPS uses this
Scheduling Branch Delay Slots (Fig A.14)
A. From before branch:
        add $1,$2,$3
        if $2=0 then
           <delay slot>
    becomes
        if $2=0 then
           add $1,$2,$3

B. From branch target:
        sub $4,$5,$6
        add $1,$2,$3
        if $1=0 then
           <delay slot>
    becomes
        sub $4,$5,$6
        add $1,$2,$3
        if $1=0 then
           sub $4,$5,$6

C. From fall through:
        add $1,$2,$3
        if $1=0 then
           <delay slot>
        sub $4,$5,$6
        OR  $7,$8,$9
    becomes
        add $1,$2,$3
        if $1=0 then
           sub $4,$5,$6
        OR  $7,$8,$9

• A is the best choice, fills delay slot
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
Delayed Branch
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful
in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: as processors move to deeper
pipelines and multiple issue, the branch delay grows and
needs more than one delay slot
– Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic approaches
relatively cheaper
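The "about 50%" figure is just the product of the two measured rates. A one-line check (variable names are mine, for illustration):

```python
fill_rate = 0.60    # fraction of branch delay slots the compiler fills
useful_rate = 0.80  # fraction of filled slots doing useful computation
print(round(fill_rate * useful_rate, 2))  # 0.48, i.e. about 50%
```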
Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
          x (Cycle Time_unpipelined / Cycle Time_pipelined)

For a simple RISC pipeline, Ideal CPI = 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI)
          x (Cycle Time_unpipelined / Cycle Time_pipelined)
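The ideal-CPI-1 form of the formula translates directly into code. A small Python sketch (function and parameter names are mine):

```python
# Pipeline speedup for ideal CPI = 1, from the formula above:
#   speedup = depth / (1 + stall_cpi) * (t_unpipelined / t_pipelined)
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    # clock_ratio = cycle time unpipelined / cycle time pipelined
    return depth / (1 + stall_cpi) * clock_ratio

print(pipeline_speedup(5, 0.0))            # 5.0: ideal 5-stage pipeline
print(round(pipeline_speedup(5, 0.4), 2))  # 3.57: 0.4 stall cycles per inst
```

With no stalls the speedup equals the pipeline depth; every fraction of a stall cycle per instruction eats into that directly.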
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard Architecture”)
• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe)
         = Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clock_unpipe/(clock_unpipe/1.05))
         = (Pipeline Depth/1.4) x 1.05
         = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
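The example can be replayed numerically; the 0.4 structural-stall cycles per instruction for machine B follow the slide's assumption that every load stalls one cycle:

```python
# Machine A: dual-ported memory, no structural hazard.
# Machine B: single-ported memory -> 1-cycle structural stall on the
# 40% of instructions that are loads, but a 1.05x faster clock.
def speedup(depth, stall_cpi, clock_ratio=1.0):
    return depth / (1 + stall_cpi) * clock_ratio

depth = 5  # any depth works: it cancels in the ratio
a = speedup(depth, 0.0)
b = speedup(depth, 0.4 * 1, clock_ratio=1.05)
print(round(a / b, 2))  # 1.33: A is faster despite B's quicker clock
```

The structural hazard costs B more than its 5% clock advantage buys back.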
Evaluating Branch Alternatives
Assume 4% unconditional branches, 6% conditional branches untaken,
and 10% conditional branches taken. Deeper pipeline: target known @
end of 3rd stage, condition known @ end of 4th stage.
What’s the addition to CPI?
Penalties (cycles):

Scheduling scheme    Uncond penalty   Untaken penalty   Taken penalty
Stall pipeline             2                 3                3
Predict taken              2                 3                2
Predict not taken          2                 0                3

Additions to CPI (frequency x penalty):

Scheduling scheme    Uncond (4%)   Untaken (6%)   Taken (10%)   All branches (20%)
Stall pipeline          0.08          0.18           0.3              0.56
Predict taken           0.08          0.18           0.2              0.46
Predict not taken       0.08          0.00           0.3              0.38
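The second table is just frequency times penalty, summed per scheme. A Python sketch computing it from the penalty table (dictionary layout is mine):

```python
# Added CPI per scheme = sum over branch classes of frequency * penalty,
# using the penalty table above.
freqs = {"uncond": 0.04, "untaken": 0.06, "taken": 0.10}
penalties = {
    "stall pipeline":    {"uncond": 2, "untaken": 3, "taken": 3},
    "predict taken":     {"uncond": 2, "untaken": 3, "taken": 2},
    "predict not taken": {"uncond": 2, "untaken": 0, "taken": 3},
}
for scheme, pens in penalties.items():
    extra_cpi = sum(freqs[cls] * pens[cls] for cls in freqs)
    print(f"{scheme}: {extra_cpi:.2f}")
```

Predict-not-taken wins here (0.38 added CPI) because untaken branches, which it gets right, cost it nothing.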
More branch evaluations
• Suppose the branch frequencies (as a percentage of all
instructions) are 15% conditional branches and 1% jumps and
calls, with 60% of conditional branches taken. Consider a
4-stage pipeline where the branch is resolved at the end of
the 2nd cycle for unconditional branches and at the end of
the 3rd cycle for conditional branches. How much faster would
the machine be without any branch hazards, ignoring other
pipeline stalls?

Pipeline speedup_ideal = Pipeline depth/(1 + Pipeline stalls) = 4/(1+0) = 4
Pipeline stalls_real = (1 x 1%) + (2 x 9%) + (0 x 6%) = 0.19
(jumps/calls: 1%, 1-cycle penalty; taken conditionals: 15% x 60% = 9%,
2-cycle penalty; untaken conditionals: 6%, no penalty)
Pipeline speedup_real = 4/(1 + 0.19) = 3.36
Speedup without control hazards = 4/3.36 = 1.19
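The same arithmetic in executable form, with the branch-class frequencies broken out as comments:

```python
# Control-hazard stall CPI: 1% jumps/calls at 1 cycle, 9% taken
# conditionals (15% * 60%) at 2 cycles, 6% untaken conditionals free.
stall_cpi = 0.01 * 1 + 0.09 * 2 + 0.06 * 0
depth = 4
speedup_real = depth / (1 + stall_cpi)
print(round(stall_cpi, 2))             # 0.19
print(round(speedup_real, 2))          # 3.36
print(round(depth / speedup_real, 2))  # 1.19: gain from removing hazards
```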
More branch question
• A reduced hardware implementation of the
classic 5-stage RISC pipeline might use the EX
stage hardware to perform a branch instruction
comparison and then not actually deliver the
branch target PC to the IF stage until the clock
cycle in which the branch reaches the MEM
stage. Control hazard stalls can be reduced by
resolving branch instructions in ID, but
improving performance in one aspect may reduce
performance in other circumstances.
• How does determining branch outcome in the ID
stage have the potential to increase data hazard
stall cycles?
Problems with Pipelining
• Exception: An unusual event happens to an
instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the
processor to a new instruction stream
– Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting)
• Problem: the exception or interrupt must appear to occur
between 2 instructions (Ii and Ii+1)
– The effect of all instructions up to and including Ii is totally
complete
– No effect of any instruction after Ii can take place
• The interrupt (exception) handler either aborts
program or restarts at instruction Ii+1
Homework
• C.1.(e)
– 10-stage pipeline, full forwarding & bypassing, predicted taken
– Branch outcomes and targets are obtained at the end of the ID stage.
LD R1,0(R2)
DADDI R1, R1, #1
SD R1, 0(R2)
Cycle:            1   2   3   4   5   6   7    8    9   10  11  12  13  14
LD   R1,0(R2)    IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1 WB2
DADDI R1,R1,#1       IF1 IF2 ………………….
SD   R1,0(R2)            IF1 ……………………
Homework
• C.1.(e)
– “For example, data are forwarded from the output of the second EX stage to the input of the
first EX stage, still causing 1-cycle delay”
– Example (not the same as the problem):

Data Dependence: Read After Write
Cycle:           1   2   3   4   5   6     7   8    9    10   11  12
ADD R1,R2,R3    IF1 IF2 ID1 ID2 EX1 EX2  MEM1 MEM2 WB1  WB2
SUB R4,R5,R1        IF1 IF2 ID1 ID2 stall EX1 EX2  MEM1 MEM2 WB1 WB2

No Data Dependence
Cycle:           1   2   3   4   5   6   7    8    9    10   11
ADD R1,R2,R3    IF1 IF2 ID1 ID2 EX1 EX2 MEM1 MEM2 WB1  WB2
SUB R4,R5,R6        IF1 IF2 ID1 ID2 EX1 EX2  MEM1 MEM2 WB1  WB2
Homework
• C.1.(g)
– CPI = The total number of clock cycles / the total number of
instructions of code segment
Homework
• C.2.(a) & C.2.(b)
– “Assuming that only the first pipe stage can always be done
independent of whether the branch goes”
– An indicator of “predicted not taken”
– The first stage of the instruction after the branch instruction
could be done
– Example: 4-stage pipeline
– Unconditional branch: “at the end of the second cycle”
Cycle:                1   2   3   4   5   6
J loop               IF  ID  EX  WB
Next instruction         IF
Target instruction           IF  ID  EX  WB
Homework
• C.2.(a) & C.2.(b)
– Conditional Branch: “at the end of the third cycle”
– If the branch is not taken, penalty = ?

Cycle:                 1   2   3     4   5   6   7
BNEZ R4, loop         IF  ID  EX    WB
Next instruction          IF  stall ID  EX  WB
Next + 1 instruction                IF  ID  EX  WB

– If the branch is taken, penalty = ?

Cycle:                 1   2   3     4   5   6   7
BNEZ R4, loop         IF  ID  EX    WB
Next instruction          IF  stall (squashed)
Target instruction                  IF  ID  EX  WB