ECE369
The MIPS Processor
Datapath & control design
• We will design a simplified MIPS processor
• Some of the instructions supported are
  – Memory-reference instructions: lw, sw
  – Arithmetic-logical instructions: add, sub, and, or, slt
  – Control flow instructions: beq, j
• Generic implementation
  – Use the program counter (PC) to supply the instruction address
  – Get the instruction from memory
  – Read registers
  – Use the instruction to decide exactly what to do
• All R-type and I-type instructions use the ALU after reading the registers
Summary of Instruction Types

R-Type (op = 0):
  | 31:26 | 25:21 | 20:16 | 15:11 | 10:6  | 5:0   |
  | op    | rs    | rt    | rd    | shamt | funct |

Load/Store (op = 35 or 43):
  | 31:26 | 25:21 | 20:16 | 15:0    |
  | op    | rs    | rt    | address |

Branch (op = 4):
  | 31:26 | 25:21 | 20:16 | 15:0    |
  | op    | rs    | rt    | address |
Building blocks
[Figure: datapath building blocks: (a) instruction memory, (b) program counter (PC), (c) adder; (a) register file with two read ports, one write port, and RegWrite control, (b) ALU with Zero output and 3-bit ALU control; (a) data memory unit with MemRead/MemWrite controls, (b) sign-extension unit (16 bits to 32 bits)]
Fetching instructions
Reading registers
[Figure: R-type instruction fields (op, rs, rt, rd, shamt, funct) feeding the register file]
Load/Store memory access
[Figure: I-type instruction fields op (31:26), rs (25:21), rt (20:16), address (15:0) driving the register file, sign-extension unit, ALU, and data memory]
Branch target
Combining datapath for memory and R-type instructions
Appending instruction fetch
Now Insert Branch
The simple datapath
Adding control to datapath
ALU Control
• ALUOp is derived from the instruction type:
  – 00 = lw, sw (ALU adds to form the address)
  – 01 = beq (ALU subtracts to compare)
  – 10 = arithmetic/R-format (operation determined by the funct field)
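As a rough illustration of the two-level decoding, here is a minimal C sketch of the ALU control truth table. The 4-bit operation encodings (0010 add, 0110 subtract, 0000 AND, 0001 OR, 0111 set-on-less-than) follow the usual textbook convention and are an assumption, not part of the lecture.

```c
#include <stdint.h>

/* ALU control: combine the 2-bit ALUOp from the main control unit
   with the funct field (instr[5:0]) to pick the 4-bit ALU operation. */
uint8_t alu_control(uint8_t alu_op, uint8_t funct) {
    if (alu_op == 0x0) return 0x2;           /* lw/sw: add (base + offset) */
    if (alu_op == 0x1) return 0x6;           /* beq: subtract (compare)    */
    /* alu_op == 0x2: R-format, decode the funct field */
    switch (funct & 0x3F) {
        case 0x20: return 0x2;               /* add */
        case 0x22: return 0x6;               /* sub */
        case 0x24: return 0x0;               /* and */
        case 0x25: return 0x1;               /* or  */
        case 0x2A: return 0x7;               /* slt */
        default:   return 0xF;               /* not handled in this subset */
    }
}
```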
Control (Reading Assignment: Appendix C.2)
• Simple combinational logic (truth tables)
[Figure: main control unit and ALU control block. Inputs: opcode bits Op5-Op0 for the main control; ALUOp1, ALUOp0 and F(5-0) for the ALU control block. The main control decodes R-format, lw, sw, beq and drives RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0. The ALU control block produces Operation2, Operation1, Operation0.]
Control signal settings (values filled in on the following slides):

| Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
| R-format    |        |        |          |          |         |          |        |        |        |
| lw          |        |        |          |          |         |          |        |        |        |
| sw          |        |        |          |          |         |          |        |        |        |
| beq         |        |        |          |          |         |          |        |        |        |
Datapath in Operation for R-Type Instruction

| Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
| R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0      |
| lw          |        |        |          |          |         |          |        |        |        |
| sw          |        |        |          |          |         |          |        |        |        |
| beq         |        |        |          |          |         |          |        |        |        |
Datapath in Operation for Load Instruction

| Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
| R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0      |
| lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0      |
| sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0      |
| beq         |        |        |          |          |         |          |        |        |        |
Datapath in Operation for Branch Equal Instruction

| Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
| R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0      |
| lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0      |
| sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0      |
| beq         | X      | 0      | X        | 0        | 0       | 0        | 1      | 0      | 1      |
Datapath with control for Jump instruction
• J-type instructions use 6 bits for the opcode, and 26 bits for the immediate value (called the target).
• newPC <- { PC[31:28], IR[25:0], 00 }
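Written out explicitly, that concatenation is a mask, a shift, and an OR; a minimal C sketch (variable names are just for illustration):

```c
#include <stdint.h>

/* Jump target: upper 4 bits of the PC, the 26-bit target field, two zero bits. */
uint32_t jump_target(uint32_t pc, uint32_t instr) {
    uint32_t target26 = instr & 0x03FFFFFFu;            /* IR[25:0]                  */
    return (pc & 0xF0000000u) | (target26 << 2);        /* PC[31:28] : target : 00   */
}
```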
Timing: Single cycle implementation
• Calculate cycle time assuming negligible delays except
  – Memory (2 ns), ALU and adders (2 ns), Register file access (1 ns)
Why is Single Cycle not GOOD???
• Memory - 2 ns; ALU - 2 ns; Adder - 2 ns; Reg - 1 ns

| Instruction class | Instruction memory | Register read | ALU | Data memory | Register write | Total (ns) |
| ALU type          | 2                  | 1             | 2   | 0           | 1              | 6          |
| Load word         | 2                  | 1             | 2   | 2           | 1              | 8          |
| Store word        | 2                  | 1             | 2   | 2           |                | 7          |
| Branch            | 2                  | 1             | 2   |             |                | 5          |
| Jump              | 2                  |               |     |             |                | 2          |

• The single-cycle clock must accommodate the slowest instruction (load word, 8 ns) — what if we had floating point instructions to handle?
• Memory - 2 ns; ALU - 2 ns; Adder - 2 ns; Reg - 1 ns
• Instruction mix: Loads 24%, Stores 12%, R-type 44%, Branch 18%, Jumps 2%
• Per-class times (from the table above): ALU type 6 ns, Load word 8 ns, Store word 7 ns, Branch 5 ns, Jump 2 ns
• Memory - 2 ns; ALU - 2 ns; Adder - 2 ns; Reg - 1 ns
• Instruction mix: Loads 24%, Stores 12%, R-type 44%, Branch 18%, Jumps 2%
• Average instruction time = 8×24% + 7×12% + 6×44% + 5×18% + 2×2% = 6.3 ns
  (a variable-length clock would average 6.3 ns per instruction, versus the 8 ns worst case that a fixed single-cycle clock must use)
Single Cycle Problems
– Wasteful of area
• Each unit used once per clock cycle
– Clock cycle equal to worst case scenario
• Will reducing the delay of common case help?
Pipelining
• Four loads: Speedup = 8/3.5 = 2.3
• Non-stop: Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages
Pipelining
• Five stages, one step per stage
  – IF: Instruction fetch from memory
  – ID: Instruction decode & register read
  – EX: Execute operation or calculate address
  – MEM: Access memory operand
  – WB: Write result back to register
Pipelining
• Improve performance by increasing instruction throughput
  – Single-cycle: Tc = 800 ps
  – Pipelined: Tc = 200 ps
• Ideal speedup is the number of stages in the pipeline. Do we achieve this?
Pipelining
• MIPS ISA designed for pipelining
  – All instructions are 32 bits
    • Easier to fetch and decode in one cycle
  – Few and regular instruction formats
    • Can decode and read registers in one step
  – Load/store addressing
    • Can calculate address in 3rd stage, access memory in 4th stage
Pipelining: What makes it hard?
• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
  – A required resource is busy
• Data hazards
  – Need to wait for previous instruction to complete its data read/write
• Control hazards
  – Deciding on control action depends on previous instruction
Pipelining: Structure Hazards
• Conflict for use of a resource
• In a MIPS pipeline with a single memory
  – Load/store requires data access
  – Instruction fetch would have to stall for that cycle
    • Would cause a pipeline "bubble"
• Hence, pipelined datapaths require separate instruction/data memories
  – Or separate instruction/data caches
Pipelining: Data Hazards
• An instruction depends on completion of data access by a previous instruction

    add $s0, $t0, $t1
    sub $t2, $s0, $t3
Pipelining: Control Hazards
• Branch determines flow of control
  – Fetching the next instruction depends on the branch outcome
  – Pipeline can't always fetch the correct instruction
    • Still working on ID stage of the branch
• Wait until branch outcome determined before fetching the next instruction
Pipelining: Summary
• Pipelining improves performance by increasing instruction throughput
  – Executes multiple instructions in parallel
  – Each instruction has the same latency
• Subject to hazards
  – Structure, data, control
• Instruction set design affects complexity of pipeline implementation
Representation
What do we need to add to actually split the datapath into stages?
Pipelined datapath
IF for Load, Store, …
• Memory and registers: left half = write, right half = read
ID for Load, Store, …

EX for Load

MEM for Load

WB for Load
What is wrong with this datapath?

WB for Load

EX for Store

MEM for Store

WB for Store

Graphically representing pipelines

Graphically representing pipelines
Pipeline operation
• One operation begins in every cycle
• One operation completes in each cycle
• Each instruction takes 5 clock cycles
• When a stage is not used, no control needs to be applied
• How to generate control signals for them is an issue

Pipeline control
Pipeline operation
• Control signals derived from instruction
  – As in single-cycle implementation
Pipeline control
(Execution/Address Calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc.
 Memory access stage control lines: Branch, MemRead, MemWrite.
 Write-back stage control lines: RegWrite, MemtoReg.)

| Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg |
| R-format    | 1      | 1      | 0      | 0      | 0      | 0       | 0        | 1        | 0        |
| lw          | 0      | 0      | 0      | 1      | 0      | 1       | 0        | 1        | 1        |
| sw          | X      | 0      | 0      | 1      | 0      | 0       | 1        | 0        | X        |
| beq         | X      | 0      | 1      | 0      | 1      | 0       | 0        | 0        | X        |
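One way to picture how these grouped control bits travel with the instruction is a small C sketch of the pipeline-register control fields; the struct and field names are illustrative, not from the lecture.

```c
/* Control bits generated in ID and carried along the pipeline registers. */
struct ex_ctrl  { unsigned RegDst:1, ALUOp1:1, ALUOp0:1, ALUSrc:1; };
struct mem_ctrl { unsigned Branch:1, MemRead:1, MemWrite:1; };
struct wb_ctrl  { unsigned RegWrite:1, MemtoReg:1; };

struct id_ex  { struct ex_ctrl ex; struct mem_ctrl mem; struct wb_ctrl wb; /* ...data... */ };
struct ex_mem { struct mem_ctrl mem; struct wb_ctrl wb; /* ...data... */ };
struct mem_wb { struct wb_ctrl wb; /* ...data... */ };

/* Each clock edge the EX-stage bits are consumed and the rest move forward:
   ex_mem.mem = id_ex.mem; ex_mem.wb = id_ex.wb; mem_wb.wb = ex_mem.wb;       */
```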
Pipelining is not quite that easy!
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
  – Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  – Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
Data Hazards in ALU Instructions
• Consider this sequence:

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Dependencies
• Problem with starting the next instruction before the first has finished
  – Dependencies that "go backward in time" are data hazards
Three Generic Data Hazards
Instruction i occurs before instruction j in the program.
• Read After Write (RAW)
  Instr_j tries to read an operand before Instr_i writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR)
  Instr_j writes an operand before Instr_i reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
  Instr_j writes an operand before Instr_i writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
• Can't happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
Hazards
Forwarding
• Use temporary results, don't wait for them to be written
  – register file forwarding to handle read/write to the same register
  – ALU forwarding
Detecting the Need to Forward
• Pass register numbers along the pipeline
  – e.g., ID/EX.RegisterRs = register number for Rs sitting in the ID/EX pipeline register
• ALU operand register numbers in the EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt
• Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs   (fwd from EX/MEM pipeline reg)
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt   (fwd from EX/MEM pipeline reg)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs   (fwd from MEM/WB pipeline reg)
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt   (fwd from MEM/WB pipeline reg)
Detecting the Need to Forward
• But only if the forwarding instruction will write to a register!
  – EX/MEM.RegWrite, MEM/WB.RegWrite
• And only if Rd for that instruction is not $zero
  – EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
Forwarding
Forwarding Conditions
• EX hazard
  – if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
  – if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
• MEM hazard
  – if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
  – if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
Double Data Hazard
• Consider the sequence:
    add $1,$1,$2
    add $1,$1,$3
    add $1,$1,$4
• Both hazards occur
  – Want to use the most recent value
• Revise the MEM hazard condition
  – Only forward if the EX hazard condition isn't true
Revised Forwarding Condition
• MEM hazard
  – if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
  – if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
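A compact way to read these conditions is to write the forwarding unit as a small function. Below is a minimal C sketch of the EX hazard plus the revised MEM hazard for the A operand (names mirror the pipeline-register fields above and are illustrative, not a fixed interface):

```c
#include <stdbool.h>
#include <stdint.h>

struct ex_mem_reg { bool reg_write; uint8_t rd; };
struct mem_wb_reg { bool reg_write; uint8_t rd; };
struct id_ex_reg  { uint8_t rs, rt; };

/* Returns the 2-bit mux select: 00 = register file, 10 = EX/MEM, 01 = MEM/WB. */
uint8_t forward_a(struct id_ex_reg idex, struct ex_mem_reg exmem, struct mem_wb_reg memwb) {
    bool ex_hazard = exmem.reg_write && exmem.rd != 0 && exmem.rd == idex.rs;
    if (ex_hazard)
        return 0x2;                                           /* ForwardA = 10 */
    if (memwb.reg_write && memwb.rd != 0 && !ex_hazard && memwb.rd == idex.rs)
        return 0x1;                                           /* ForwardA = 01 */
    return 0x0;
}
/* forward_b is identical with idex.rt in place of idex.rs. */
```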
Datapath with Forwarding
Can't always forward
Need to stall
for one cycle
Stalling
• Hardware detection and no-op insertion is called stalling
• Stall the pipeline by keeping the instruction in the same stage
[Figure: program execution order over clock cycles CC1-CC10 (IM, Reg, DM stages) for the sequence
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
with a one-cycle bubble inserted so the dependent and instruction gets the loaded value]
Load-Use Hazard Detection
• Check when the using instruction is decoded in the ID stage
• ALU operand register numbers in the ID stage are given by IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
  – ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
How to Stall the Pipeline
• Force control values in the ID/EX register to 0
  – EX, MEM and WB do nop (no-operation)
• Prevent update of the PC and IF/ID register
  – Using instruction is decoded again
  – Following instruction is fetched again
• 1-cycle stall allows MEM to read data for lw
  – Can subsequently forward to the EX stage
Stall logic
• Stall condition
  – If (ID/EX.MemRead)   // load word instruction, AND
  – If ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt))
• Insert no-op (no-operation)
  – Deassert all control signals
• Stall the following instruction
  – Do not write the program counter (PCWrite)
  – Do not write the IF/ID registers
(A minimal code sketch of this hazard detection unit is shown below.)
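The sketch, in C, of the load-use check and the resulting stall actions (names are illustrative; the real unit is combinational logic):

```c
#include <stdbool.h>
#include <stdint.h>

/* Load-use hazard: the instruction in EX is a load (ID/EX.MemRead) and its
   destination rt matches a source register of the instruction now in ID. */
bool load_use_hazard(bool idex_mem_read, uint8_t idex_rt,
                     uint8_t ifid_rs, uint8_t ifid_rt) {
    return idex_mem_read && (idex_rt == ifid_rs || idex_rt == ifid_rt);
}

/* On a detected hazard:
   - PCWrite = 0 and IF/ID write = 0 (the same instructions are refetched/redecoded)
   - zero the control bits entering ID/EX so EX/MEM/WB see a bubble (nop)          */
```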
Pipeline with hazard detection
Assume that the register file is written in the first half and read in the second half of the clock cycle.

    load r2 <- mem(r1+0)     ; LOAD1
    r3 <- r3 + r2            ; ADD
    load r4 <- mem(r2+r3)    ; LOAD2
    r4 <- r5 - r3            ; SUB

| Cycle                  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 |
| load r2 <- mem(r1+0)   | IF | ID | EX | ME | WB |    |    |    |    |    |    |
| r3 <- r3 + r2          |    | IF | ID | S  | S  | EX | ME | WB |    |    |    |
| load r4 <- mem(r2+r3)  |    |    | IF | S  | S  | ID | EX | ME | WB |    |    |
| r4 <- r5 - r3          |    |    |    | S  | S  | IF | ID | S  | EX | ME | WB |
Summary

Multi-cycle

Multi-cycle

Multi-cycle Pipeline

Branch Hazards
Branch hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting "branch not taken"
  – need to add hardware for flushing instructions if we are wrong
[Figure: datapath marking the instructions to flush (set control values to 0) and the PC update on a taken branch]
Solution to control hazards
• Branch prediction
  – By executing the next instruction we are predicting "branch not taken"
  – Need to add hardware for flushing instructions if we are wrong
• Reduce the branch penalty
  – By advancing the branch decision to the ID stage
  – Compare the data read from the two registers read in the ID stage
  – Comparison for equality is a simpler design! (Why?)
  – Still need to flush the instruction in the IF stage
• Make the hazard into a feature!
  – Always execute the instruction following the branch (delayed branch)
Branch detection in ID stage
Flush IF/ID if we mispredict
Data Hazards for Branches
• If a comparison register is a destination of the 2nd or 3rd preceding ALU instruction, we can resolve it using forwarding

| add $1, $2, $3      | IF | ID | EX | MEM | WB  |     |     |    |
| add $4, $5, $6      |    | IF | ID | EX  | MEM | WB  |     |    |
| …                   |    |    | IF | ID  | EX  | MEM | WB  |    |
| beq $1, $4, target  |    |    |    | IF  | ID  | EX  | MEM | WB |
Data Hazards for Branches
• If a comparison register is a destination of the preceding ALU instruction or of the 2nd preceding load instruction, we need 1 stall cycle

| lw  $1, addr        | IF | ID | EX | MEM | WB  |     |     |    |
| add $4, $5, $6      |    | IF | ID | EX  | MEM | WB  |     |    |
| beq stalled         |    |    | IF | ID  |     |     |     |    |
| beq $1, $4, target  |    |    |    |     | ID  | EX  | MEM | WB |
Data Hazards for Branches
• If a comparison register is a destination of the immediately preceding load instruction, we need 2 stall cycles

| lw  $1, addr        | IF | ID | EX | MEM | WB  |     |     |    |
| beq stalled         |    | IF | ID |     |     |     |     |    |
| beq stalled         |    |    | ID |     |     |     |     |    |
| beq $1, $0, target  |    |    |    | ID  | EX  | MEM | WB  |    |
Static Branch Prediction
• Scheduling (reordering) code around a delayed branch
  • need to predict the branch statically at compile time
  • use profile information collected from earlier runs
• Behavior of a branch is often bimodally distributed!
  • Biased toward taken or not taken
• Effectiveness depends on
  • frequency of branches and accuracy of the scheme
[Figure: misprediction rate of the profile-based static scheme across SPEC benchmarks, roughly 4% to 22%; integer benchmarks have higher branch frequency]
Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear: the branch penalty is fixed and cannot be reduced by software (this is the example of MIPS)
#2: Predict Branch Not Taken (treat every branch as not taken)
  – Execute successor instructions in sequence
  – "Flush" instructions in the pipeline if the branch is actually taken
  – 47% of MIPS branches are not taken on average
  – PC+4 is already calculated, so use it to get the next instruction
Four Branch Hazard Alternatives
#3: Predict Branch Taken (treat every branch as taken)
  As soon as the branch is decoded and the target address is computed, we assume the branch is taken and begin fetching and executing at the target address.
  – 53% of MIPS branches are taken on average
  – Because in our MIPS pipeline we don't know the target address any earlier than we know the branch outcome, there is no advantage in this approach for MIPS.
  – MIPS still incurs a 1-cycle branch penalty
    • Other machines: branch target known before the outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
  – In a delayed branch, the execution cycle with a branch delay of length n is:
      branch instruction
      sequential successor 1
      sequential successor 2        <- branch delay of length n
      ........
      sequential successor n
      branch target if taken
  These sequential successor instructions are in branch-delay slots.
  The sequential successors are executed whether or not the branch is taken.
  The job of the compiler is to make the successor instructions valid and useful.
Scheduling Branch Delay Slots (Fig A.14)

A. From before branch:
      add $1,$2,$3
      if $2=0 then
        <delay slot>
   becomes:
      if $2=0 then
        add $1,$2,$3

B. From branch target:
      sub $4,$5,$6
      ...
      add $1,$2,$3
      if $1=0 then
        <delay slot>
   becomes:
      add $1,$2,$3
      if $1=0 then
        sub $4,$5,$6

C. From fall through:
      add $1,$2,$3
      if $1=0 then
        <delay slot>
      or  $7,$8,$9
      sub $4,$5,$6
   becomes:
      add $1,$2,$3
      if $1=0 then
        or $7,$8,$9
      sub $4,$5,$6
Delayed Branch
• Where to get instructions to fill the branch delay slot?
  – Before the branch instruction: this is the best choice if feasible
  – From the target address: only valuable when the branch is taken
  – From fall through: only valuable when the branch is not taken
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot
  – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  – Growth in available transistors has made dynamic approaches relatively cheaper
Reducing branch penalty (loop unrolling)
Source:
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Direct translation:
    Loop:  LD      F0, 0(R1)      ; R1 points to x[1000]
           ADDD    F4, F0, F2     ; F2 = scalar value
           SD      F4, 0(R1)
           DADDUI  R1, R1, #-8
           BNE     R1, R2, Loop   ; R2 = last element

| Producer     | Consumer          | Latency |
| FP ALU op    | Another FP ALU op | 3       |
| FP ALU op    | Store double      | 2       |
| Load double  | FP ALU op         | 1       |
| Store double | Store double      | 0       |

Assume 1 cycle latency from unsigned integer arithmetic to a dependent instruction.
Reducing stalls
• Pipeline implementation, with latencies as in the table:

  Unscheduled (9 cycles per iteration):
    Loop:  LD      F0, 0(R1)      ; cycle 1
           stall                  ; cycle 2
           ADDD    F4, F0, F2     ; cycle 3
           stall                  ; cycle 4
           stall                  ; cycle 5
           SD      F4, 0(R1)      ; cycle 6
           DADDUI  R1, R1, #-8    ; cycle 7
           stall                  ; cycle 8
           BNE     R1, R2, Loop   ; cycle 9

  Scheduled (SD moved after DADDUI, offset adjusted to 8(R1)), 7 cycles:
    Loop:  LD      F0, 0(R1)
           DADDUI  R1, R1, #-8
           ADDD    F4, F0, F2
           stall
           stall
           SD      F4, 8(R1)
           BNE     R1, R2, Loop

| Producer     | Consumer          | Latency |
| FP ALU op    | Another FP ALU op | 3       |
| FP ALU op    | Store double      | 2       |
| Load double  | FP ALU op         | 1       |
| Store double | Store double      | 0       |
Loop Unrolling
    Loop:  LD      F0, 0(R1)
           ADDD    F4, F0, F2
           SD      F4, 0(R1)       ; drop DADDUI & BNE
           LD      F6, -8(R1)
           ADDD    F8, F6, F2
           SD      F8, -8(R1)      ; drop DADDUI & BNE
           LD      F10, -16(R1)
           ADDD    F12, F10, F2
           SD      F12, -16(R1)    ; drop DADDUI & BNE
           LD      F14, -24(R1)
           ADDD    F16, F14, F2
           SD      F16, -24(R1)
           DADDUI  R1, R1, #-32
           BNE     R1, R2, Loop

27 cycles: 14 instructions, plus 1 stall for each LD, 2 for each ADDD, and 1 for DADDUI.

| Producer     | Consumer          | Latency |
| FP ALU op    | Another FP ALU op | 3       |
| FP ALU op    | Store double      | 2       |
| Load double  | FP ALU op         | 1       |
| Store double | Store double      | 0       |
    Loop:  LD      F0, 0(R1)
           LD      F6, -8(R1)
           LD      F10, -16(R1)
           LD      F14, -24(R1)
           ADDD    F4, F0, F2
           ADDD    F8, F6, F2
           ADDD    F12, F10, F2
           ADDD    F16, F14, F2
           SD      F4, 0(R1)
           SD      F8, -8(R1)
           DADDUI  R1, R1, #-32
           SD      F12, -16(R1)
           SD      F16, 8(R1)      ; 8 - 32 = -24
           BNE     R1, R2, Loop

Design Issues:
• Code size!
• Instruction cache
• Register space
• Iteration dependence
• Loop termination
• Memory addressing

14 instructions (3.5 cycles per element vs. 9 cycles!)
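At the source level the same transformation looks roughly like the C sketch below (unroll by 4; it assumes the trip count is a multiple of 4, as in the 1000-iteration example, and the function name is just for illustration):

```c
/* x[1] .. x[1000] are processed, matching the lecture example. */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 1000; i > 0; i = i - 4) {
        /* four independent element updates per iteration: only one
           index update and one branch of loop overhead per 4 elements,
           and the loads/adds/stores can be scheduled to hide latencies */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```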
Dynamic Branch Prediction
• In deeper and superscalar pipelines, the branch penalty is more significant
• Use dynamic prediction
  – Branch prediction buffer (aka branch history table)
  – Indexed by recent branch instruction addresses
  – Stores the outcome (taken/not taken)
• To execute a branch
  – Check the table, expect the same outcome
  – Start fetching from fall-through or target
  – If wrong, flush the pipeline and flip the prediction

1-Bit Branch Prediction
• Branch History Table: lower bits of the PC address index a table of 1-bit values
  • Says whether or not the branch was taken last time

    for (i=0; i<100; i++) {
        ....
    }

    0x40010100  addi r10, r0, 100
    0x40010104  addi r1, r1, r0
    0x40010108  L1:
                ......
    0x40010A04  addi r1, r1, 1
    0x40010A08  bne r1, r10, L1
                ......

[Figure: 1-bit Branch History Table indexed by low PC bits, each entry holding a T/NT prediction]
1-Bit Bimodal Prediction (SimpleScalar Term)
• For each branch, keep track of what happened last time and use that outcome as the prediction
• Changes its mind fast
1-Bit Branch Prediction
• What is the drawback of using lower bits of the PC?
  • Different branches may have the same lower bit value (aliasing)
• What is the performance shortcoming of a 1-bit BHT?
  • In a loop, a 1-bit BHT will cause two mispredictions:
    • End of loop case, when it exits instead of looping as before
    • First time through the loop on the next time through the code, when it predicts exit instead of looping
2-bit Saturating Up/Down Counter Predictor
• Solution: 2-bit scheme where we change the prediction only if we get a misprediction twice
[Figure: 2-bit branch prediction state diagram]
2-Bit Bimodal Prediction (SimpleScalar Term)
• For each branch, maintain a 2-bit saturating counter:
    if the branch is taken:     counter = min(3, counter + 1)
    if the branch is not taken: counter = max(0, counter - 1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the prediction (a better measure of "the common case")
• Can be easily extended to N bits (in most processors, N = 2)
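A minimal C sketch of such a bimodal table follows; the table size and the hashing by low PC bits are illustrative choices, not fixed by the lecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 4096                      /* illustrative size                */
static uint8_t bht[BHT_ENTRIES];              /* 2-bit counters, 0..3, init to 0  */

static inline unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);     /* low PC bits (word-aligned)       */
}

bool predict_taken(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;           /* counter 2 or 3 => predict taken  */
}

void train(uint32_t pc, bool taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken)  { if (*c < 3) (*c)++; }       /* counter = min(3, counter + 1)    */
    else        { if (*c > 0) (*c)--; }       /* counter = max(0, counter - 1)    */
}
```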
Branch History Table
•Misprediction reasons:
• Wrong guess for that branch
• Got branch history of wrong branch when indexing the table
Branch History Table (4096-entry, 2-bit)
• Branch-intensive benchmarks have a higher miss rate. How can we solve this problem?
  • Increase the buffer size, or
  • Increase the accuracy
Branch History Table (Increase the size?)
• Need to focus on increasing the accuracy of the scheme!
Correlated Branch Prediction
• The standard 2-bit predictor uses only local information
  • Fails to look at the global picture
• Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
Correlated Branch Prediction
• A shift register captures the local path through the program
• For each unique path a predictor is maintained
• Prediction is based on the behavior history of each local path
• Shift register length determines the program region size
Correlated Branch Prediction
• Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
  • In general, an (m,n) predictor means: record the last m branches to select between 2^m history tables, each with n-bit counters
  • The old 2-bit BHT is then a (0,2) predictor

    if (aa == 2)
        aa = 0;
    if (bb == 2)
        bb = 0;
    if (aa != bb)
        do something;
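Under those definitions, a rough C sketch of a (2,2) predictor looks like the following (table sizes and PC hashing are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define M       2                              /* bits of global history         */
#define ENTRIES 1024                           /* entries per history table      */

static uint8_t table[1u << M][ENTRIES];        /* 2^m tables of 2-bit counters   */
static uint8_t ghr;                            /* m-bit global history register  */

bool predict(uint32_t pc) {
    return table[ghr][(pc >> 2) % ENTRIES] >= 2;
}

void update_predictor(uint32_t pc, bool taken) {
    uint8_t *c = &table[ghr][(pc >> 2) % ENTRIES];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    ghr = ((ghr << 1) | (taken ? 1 : 0)) & ((1u << M) - 1);   /* shift in outcome */
}
```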
Correlated Branch Prediction
• Global Branch History: an m-bit shift register keeping the T/NT status of the last m branches.
Accuracy of Different Schemes
[Figure: frequency of mispredictions across SPEC benchmarks for a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT; rates range from about 0% to 18%]
Tournament Predictors
• A local predictor might work well for some branches or programs, while a global predictor might work well for others
• Provide one of each and maintain another predictor to identify which predictor is best for each branch
[Figure: the branch PC indexes a Local Predictor and a Global Predictor; a Tournament Predictor selects between them through a MUX]
Tournament Predictors
• Multilevel branch predictor
  • Selector for the global and local predictors of correlating branch prediction
• Use an n-bit saturating counter to choose between predictors
• Usual choice is between global and local predictors
Tournament Predictors
• The advantage of a tournament predictor is its ability to select the right predictor for a particular branch
• A typical tournament predictor selects the global predictor 40% of the time for SPEC integer benchmarks
• AMD Opteron and Phenom use a tournament style
Tournament Predictors (Intel Core i7)
• Based on predictors used in the Core Duo chip
• Combines three different predictors
  • Two-bit
  • Global history
  • Loop exit predictor
    • Uses a counter to predict the exact number of taken branches (number of loop iterations) for a branch that is detected as a loop branch
• Tournament: tracks the accuracy of each predictor
• Main problem of speculation:
  • A mispredicted branch may lead to another branch being mispredicted!
Branch Prediction
• Sophisticated techniques:
  • A "branch target buffer" to help us look up the destination
  • Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  • Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  • A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
Branch Target Buffers (BTB)
• Branch target calculation is costly and stalls the instruction fetch
• The BTB stores PCs the same way as caches
• The PC of a branch is sent to the BTB
• When a match is found, the corresponding Predicted PC is returned
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC

Branch Target Buffers (BTB)
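A rough C sketch of a direct-mapped BTB lookup, along the lines described above (entry format, size, and indexing are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512

struct btb_entry { uint32_t tag; uint32_t predicted_pc; bool valid; };
static struct btb_entry btb[BTB_ENTRIES];

/* Returns true and sets *next_pc if the fetch PC hits in the BTB,
   i.e., this PC was previously seen as a taken branch. */
bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == pc) { *next_pc = e->predicted_pc; return true; }
    return false;                      /* miss: fetch falls through to PC + 4 */
}
```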
Pipeline without Branch Predictor
[Figure: in IF, the PC feeds instruction fetch and PC + 4; the branch is resolved (register read, compare, branch target) in the following stage]
In the 5-stage pipeline, a branch completes in two cycles.
If the branch went the wrong way, one incorrect instruction is fetched.
One stall cycle per incorrect branch.

Pipeline with Branch Predictor
[Figure: the branch predictor sits alongside the PC in IF and supplies the next fetch address before the branch is resolved]
Other issues in pipelines
• Exceptions
  – Errors in the ALU for arithmetic instructions
  – Memory non-availability
• Exceptions lead to a jump in a program
• However, the current PC value must be saved so that the program can return to it for recoverable errors
• Multiple exceptions can occur in a pipeline
• Preciseness of the exception location is important in some cases
• I/O exceptions are handled in the same manner
Exceptions
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)

• Dynamic Pipeline Scheduling
  – Hardware chooses which instructions to execute next
  – Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  – Speculates on branches and keeps the pipeline full (may need to roll back if the prediction is incorrect)
• Trying to exploit instruction-level parallelism
Dynamically scheduled pipeline

Dynamic Scheduling using Tomasulo's Method
[Figure: Tomasulo-style dynamically scheduled pipeline; the FIFO instruction queue and the numbered steps (1)-(3) are annotated in the figure]

Where is the store queue?
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• "Superscalar" processors
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
• All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
• VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
• This class has given you the background you need to learn more!
Superscalar architecture: two instructions executed in parallel
VLIW: Very Long Instruction Word
• Each "instruction" has explicit coding for multiple operations
  – In IA-64, the grouping is called a "packet"
  – In Transmeta, the grouping is called a "molecule" (with "atoms" as ops)
  – Moderate LIW also used in Cray/Tera MTA-2
• Tradeoff: instruction space for simple decoding
  – The long instruction word has room for many operations
  – By definition, all the operations the compiler puts in one long instruction word are independent => can execute in parallel
  – E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  – Need compiling techniques to schedule across several branches (called "trace scheduling")
Thrice Unrolled Loop that Eliminates Stalls for Scalar Pipeline Computers
    1  Loop: L.D     F0,0(R1)
    2        L.D     F6,-8(R1)
    3        L.D     F10,-16(R1)
    4        ADD.D   F4,F0,F2
    5        ADD.D   F8,F6,F2
    6        ADD.D   F12,F10,F2
    7        S.D     0(R1),F4
    8        S.D     -8(R1),F8
    9        DSUBUI  R1,R1,#24
    10       BNEZ    R1,LOOP
    11       S.D     8(R1),F12     ; 8-24 = -16

Minimum times between pairs of instructions:
  L.D to ADD.D: 1 cycle
  ADD.D to S.D: 2 cycles
A single branch delay slot follows the BNEZ.

11 clock cycles, or 3.67 per iteration
Loop Unrolling in VLIW
L.D to ADD.D: +1 cycle; ADD.D to S.D: +2 cycles
Code to schedule (from the previous slide):
    1  Loop: L.D     F0,0(R1)
    2        L.D     F6,-8(R1)
    3        L.D     F10,-16(R1)
    4        ADD.D   F4,F0,F2
    5        ADD.D   F8,F6,F2
    6        ADD.D   F12,F10,F2
    7        S.D     0(R1),F4
    8        S.D     -8(R1),F8
    9        DSUBUI  R1,R1,#24
    10       BNEZ    R1,LOOP
    11       S.D     8(R1),F12
[Table: empty VLIW schedule with slots per clock cycle: Memory reference 1, Memory reference 2, FP operation 1, FP operation 2, Int. op/branch]
Loop Unrolling in VLIW
L.D to ADD.D: +1 cycle; ADD.D to S.D: +2 cycles
[Table: VLIW schedule partially filled for the thrice-unrolled loop, showing where L.D F0,0(R1), ADD.D F4,F0,F2, and S.D 0(R1),F4 fall across clock cycles 1-6, consistent with the stated latencies; the remaining slots are still empty]
Loop Unrolling in VLIW (unrolled 7 times)
L.D to ADD.D: +1 cycle; ADD.D to S.D: +2 cycles

| Clock | Memory reference 1 | Memory reference 2 | FP operation 1   | FP operation 2   | Int. op/branch   |
| 1     | L.D F0,0(R1)       | L.D F6,-8(R1)      |                  |                  |                  |
| 2     | L.D F10,-16(R1)    | L.D F14,-24(R1)    |                  |                  |                  |
| 3     | L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |                  |
| 4     | L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |                  |
| 5     |                    |                    | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |                  |
| 6     | S.D 0(R1),F4       | S.D -8(R1),F8      | ADD.D F28,F26,F2 |                  |                  |
| 7     | S.D -16(R1),F12    | S.D -24(R1),F16    |                  |                  |                  |
| 8     | S.D -32(R1),F20    | S.D -40(R1),F24    |                  |                  | DSUBUI R1,R1,#56 |
| 9     | S.D 8(R1),F28      |                    |                  |                  | BNEZ R1,LOOP     |

Unrolled 7 times to avoid the stall delays from ADD.D to S.D.
7 iterations in 9 clocks, or 1.3 clocks per iteration (2.8X: 1.3 vs 3.67).
Average: 2.5 ops per clock (23 ops in 45 slots), 51% efficiency.
Note: the last store uses offset 8, not -48, because it comes after DSUBUI R1,R1,#56.
Note: we needed more registers in VLIW (used 15 register pairs vs. 6 in the superscalar version).
Problems with 1st Generation VLIW
• Increase in code size
  – generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
  – whenever VLIW instructions are not full, unused functional units translate to wasted bits in the instruction encoding
• Operated in lock-step; no hazard detection HW
  – a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
  – the compiler might predict functional unit stalls, but cache stalls are hard to predict
• Binary code incompatibility
  – Pure VLIW => different numbers of functional units and unit latencies require different versions of the code
Multiple Issue Processors
• Exploiting ILP
  – Unrolling simple loops
  – More importantly, able to exploit parallelism in less structured code
• Modern Processors:
  – Multiple issue
  – Dynamic scheduling
  – Speculation
Multiple Issue, Dynamic, Speculative Processors
• How do you issue two instructions concurrently?
  – What happens at the reservation station if two instructions issued concurrently have a true dependency?
  – Solution 1:
    » Issue the first instruction during the first half and the second instruction during the second half of the clock cycle
  – Problem:
    » Can we issue 4 instructions?
  – Solution 2:
    » Pipeline and widen the issue logic
    » Make instruction issue take multiple clock cycles!
  – Problem:
    » Cannot pipeline indefinitely; new instructions are issued every clock cycle
    » Must be able to assign a reservation station
    » A dependent instruction being issued should be able to refer to the correct reservation stations for its operands
• The issue step is the bottleneck in dynamically scheduled superscalars!
ARM Cortex-A8 and Intel Core i7
• A8:
  • Multiple issue
  • iPad, Motorola Droid, iPhones
• i7:
  • Multiple issue
  • High-end, dynamically scheduled, speculative
  • High-end desktops, servers
ARM Cortex-A8
• A8 design goal: low power, reasonably high clock rate
• Dual-issue
  • Statically scheduled superscalar
  • Dynamic issue detection
  • Issues one or two instructions per clock (in-order)
• 13-stage pipeline
  • Fully bypassing
• Dynamic branch predictor
  • 512-entry, 2-way set associative branch target buffer
  • 4K-entry global history buffer
  • If the branch target buffer misses
    • Prediction through the global history buffer
  • 8-entry return address stack
• i7: aggressive 4-issue dynamically scheduled speculative pipeline
ARM Cortex-A8
The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled.

ARM Cortex-A8
The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue stage, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.

ARM Cortex-A8
The instruction execution portion of the A8 pipeline. Multiply operations are always performed in ALU pipeline 0.
ARM Cortex-A8
Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1- and L2-generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.
ARM Cortex-A8 vs A9
A9:
Issue 2 instructions/clk
Dynamic scheduling
Speculation
Figure 3.40 The performance ratio for the A9 compared to the A8, both using a 1 GHz clock and the same size caches for
L1 and L2, shows that the A9 is about 1.28 times faster. Both runs use a 32 KB primary cache and a 1 MB secondary
cache, which is 8-way set associative for the A8 and 16-way for the A9. The block sizes in the caches are 64 bytes for the
A8 and 32 bytes for the A9. As mentioned in the caption of Figure 3.39, eon makes intensive use of integer multiply, and
the combination of dynamic scheduling and a faster multiply pipeline significantly improves performance on the A9.
twolf experiences a small slowdown, likely due to the fact that its cache behavior is worse with the smaller L1 block size of
the A9.
Intel Core i7
• The total pipeline depth is 14 stages.
• There are 48 load and 32 store buffers.
• The six independent functional units can each begin execution of a ready micro-op in the same cycle.
Intel Core i7
• Instruction fetch:
  • Multilevel branch target buffer
  • Return address stack (function return)
  • Fetch 16 bytes from the instruction cache
• 16 bytes in the predecode instruction buffer
  • Macro-op fusion: a compare followed by a branch is fused into one instruction
  • Break the 16 bytes into individual instructions
  • Place into an 18-entry queue
Intel Core i7
• Micro-op decode: translate x86 instructions into micro-ops (directly executable by the pipeline)
  • Generate up to 4 micro-ops/cycle
  • Place into a 28-entry buffer
• Micro-op buffer:
  • Loop stream detection:
    • A small sequence of instructions in a loop (< 28 instructions)
    • Eliminates fetch and decode
  • Microfusion
    • Fuse load/ALU and ALU/store pairs
    • Issue to a single reservation station
Intel Core i7 vs. Atom 230 (45 nm technology)

|             | Intel i7 920                              | ARM A8                                                        | Intel Atom 230                          |
| Cores       | 4 cores, each with FP                     | 1 core, no FP                                                 | 1 core, with FP                         |
| Clock rate  | 2.66 GHz                                  | 1 GHz                                                         | 1.66 GHz                                |
| Power       | 130 W                                     | 2 W                                                           | 4 W                                     |
| Cache       | 3-level, all 4-way; 128 I, 64 D, 512 L2   | 1-level, fully associative; 32 I, 32 D                        | 2-level, all 4-way; 16 I, 16 D, 64 L2   |
| Pipeline    | 4 ops/cycle; speculative, OOO             | 2 ops/cycle; in-order, dynamic issue                          | 2 ops/cycle; in-order, dynamic issue    |
| Branch pred | Two-level                                 | Two-level; 512-entry BTB, 4K global history, 8-entry return   | Two-level                               |
Intel Core i7 vs. Atom 230 (45 nm technology)
Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) Hotspot Java VM. Only one core is active on the i7, and the rest are in deep power saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency. Copyright © 2011, Elsevier Inc. All rights reserved.
Improving Performance
• Techniques to increase performance:
 pipelining
 improves clock speed
 increases number of in-flight instructions
 hazard/stall elimination
 branch prediction
 register renaming
 out-of-order execution
 bypassing
 increased pipeline bandwidth
Deep Pipelining
• Increases the number of in-flight instructions
• Decreases the gap between successive independent instructions
• Increases the gap between dependent instructions
• Depending on the ILP in a program, there is an optimal pipeline depth
• Tough to pipeline some structures; increases the cost of bypassing
Increasing Width
• Difficult to find more than four independent instructions
• Difficult to fetch more than six instructions (else, must predict multiple branches)
• Increases the number of ports per structure
Reducing Stalls in Fetch
• Better branch prediction
   novel ways to index/update and avoid aliasing
   cascading branch predictors
• Trace cache
   stores instructions in the common order of execution, not in sequential order
   in Intel processors, the trace cache stores pre-decoded instructions
Reducing Stalls in Rename/Regfile
• Larger ROB/register file/issue queue
• Virtual physical registers: assign virtual register names to instructions, but assign a physical register only when the value is made available
• Runahead: while a long instruction waits, let a thread run ahead to prefetch (this thread can deallocate resources more aggressively than a processor supporting precise execution)
• Two-level register files: values being kept around in the register file for precise exceptions can be moved to the 2nd level
Performance beyond single thread ILP
• There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
• Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: process with its own instructions and data
  • a thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  • each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Data Level Parallelism: perform identical operations on data, and lots of data
Thread-Level Parallelism
• Motivation:
   a single thread leaves a processor under-utilized for most of the time
   by doubling processor area, single-thread performance barely improves
• Strategies for thread-level parallelism:
   multiple threads share the same large processor → reduces under-utilization, efficient resource allocation: Simultaneous Multi-Threading (SMT)
   each thread executes on its own mini processor → simple design, low interference between threads: Chip Multi-Processing (CMP)
New Approach: Multithreaded Execution
• Multithreading: multiple threads share the functional units of 1 processor via overlapping
  • the processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  • memory is shared through the virtual memory mechanisms, which already support multiple processes
  • HW for fast thread switch; much faster than a full process switch of ~100s to 1000s of clocks
• When to switch?
  • Alternate instruction per thread (fine grain)
  • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Simultaneous Multi-threading ...
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
[Figure: issue slots (M M FX FX FP FP BR CC) over cycles 1-9 for one thread vs. two threads sharing 8 units; with two threads, more of the issue slots are filled each cycle]
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; colors distinguish Thread 1 through Thread 5 and idle slots]
Data-Level Parallelism in Vector, SIMD, and GPU Architectures
SIMD Variations
• Vector architectures
• SIMD extensions
  • MMX: Multimedia Extensions (1996)
  • SSE: Streaming SIMD Extensions
  • AVX: Advanced Vector Extensions (2010)
• Graphics Processing Units (GPUs)
Vector Architectures
• Basic idea:
  • Read sets of data elements into "vector registers"
  • Operate on those registers
  • Disperse the results back into memory

VMIPS Instructions
Example: VMIPS
• Vector registers
  • Each register holds a 64-element, 64 bits/element vector
  • Register file has 16 read ports and 8 write ports
• Vector functional units
  • Fully pipelined
  • Data and control hazards are detected
• Vector load-store unit
  • Fully pipelined
  • Words move between registers and memory
  • One word per clock cycle after the initial latency
• Scalar registers
  • 32 general-purpose registers
  • 32 floating-point registers
VMIPS Instructions
• Example: DAXPY

    L.D      F0,a         ;load scalar a
    LV       V1,Rx        ;load vector X
    MULVS.D  V2,V1,F0     ;vector-scalar multiply
    LV       V3,Ry        ;load vector Y
    ADDVV    V4,V2,V3     ;add
    SV       Ry,V4        ;store result

• In MIPS code
  • ADD waits for MUL, SD waits for ADD
• In VMIPS
  • Stall once for the first vector element; subsequent elements flow smoothly down the pipeline.
  • Pipeline stall required only once per vector instruction!
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: timing of Load -> Mul -> Add with and without chaining]
Vector Chaining
• Vector version of register bypassing
• Chaining
  • Allows a vector operation to start as soon as the individual elements of its vector source operand become available
  • Results from the first functional unit are forwarded to the second unit

    LV   V1
    MULV V3,V1,V2
    ADDV V5,V3,V4

[Figure: memory feeds the load unit into V1, which chains into the multiply unit (V1 x V2 -> V3), which chains into the adder (V3 + V4 -> V5)]
VMIPS Instructions
• Flexible
  • 64 64-bit / 128 32-bit / 256 16-bit / 512 8-bit elements
  • Matches the needs of multimedia (8-bit) and of scientific applications that require high precision
Vector Instruction Execution
ADDV C,A,B
[Figure: execution using one pipelined functional unit (one element pair, e.g. A[6]+B[6], enters per cycle, producing C[0], C[1], C[2], ...) vs. four-lane execution using four pipelined functional units (four element pairs, e.g. A[24..27]+B[24..27], enter per cycle, producing C[0..11] four at a time)]
Multiple Lanes
• Element n of vector register A is "hardwired" to element n of vector register B
  • Allows for multiple hardware lanes
  • No communication between lanes
  • Little increase in control overhead
  • No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance!
Memory Banks
• The memory system must be designed to support high bandwidth for vector loads and stores
• Spread accesses across multiple banks
  • Control bank addresses independently
  • Load or store non-sequential words
  • Support multiple vector processors sharing the same memory
Vector Summary
• Vector is an alternative model for exploiting ILP
• If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  • Scalable if elements are independent
  • If there is a dependency
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  • Especially with virtual address translation and caching
SIMD
• Implementations:
  • Intel MMX (1996)
    • Repurposed the 64-bit floating-point registers
    • Eight 8-bit integer ops or four 16-bit integer ops
  • Streaming SIMD Extensions (SSE) (1999)
    • Separate 128-bit registers
    • Eight 16-bit ops, four 32-bit ops, or two 64-bit ops
    • Single-precision floating-point arithmetic
    • Double-precision floating point in SSE2 (2001), SSE3 (2004), SSE4 (2007)
  • Advanced Vector Extensions (2010)
    • Four 64-bit integer/fp ops
SIMD
• Implementations:
  • Advanced Vector Extensions (2010)
    • Doubles the width to 256 bits
    • Four 64-bit integer/fp ops
    • Extendible to 512 and 1024 bits for future generations
    • Operands must be consecutive and aligned memory locations
Example SIMD
• Example DAXPY:

          L.D     F0,a          ;load scalar a
          MOV     F1,F0         ;copy a into F1 for SIMD MUL
          MOV     F2,F0         ;copy a into F2 for SIMD MUL
          MOV     F3,F0         ;copy a into F3 for SIMD MUL
          DADDIU  R4,Rx,#512    ;last address to load
    Loop: L.4D    F4,0[Rx]      ;load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D    F8,0[Ry]      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D    0[Ry],F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU  Rx,Rx,#32     ;increment index to X
          DADDIU  Ry,Ry,#32     ;increment index to Y
          DSUBU   R20,R4,Rx     ;compute bound
          BNEZ    R20,Loop      ;check if done
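For comparison, roughly the same four-wide computation written with x86 AVX intrinsics in C; this is a sketch under the assumption that the target supports AVX, and the tail loop handles trip counts that are not a multiple of 4.

```c
#include <immintrin.h>

/* y[i] = a*x[i] + y[i], four doubles per iteration (256-bit AVX). */
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);                    /* broadcast scalar a */
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);           /* X[i..i+3]          */
        __m256d vy = _mm256_loadu_pd(&y[i]);           /* Y[i..i+3]          */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(&y[i], vy);
    }
    for (; i < n; i++)                                 /* scalar remainder   */
        y[i] = a * x[i] + y[i];
}
```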
SIMD extensions
• Meant for programmers to utilize
• Not for compilers to generate
  • Recent x86 compilers are capable for FP-intensive apps
• Why is it popular?
  • Costs little to add to the standard arithmetic unit
  • Easy to implement
  • Needs smaller memory bandwidth than vector
  • Separate data transfers aligned in memory
    • Vector: single instruction, 64 memory accesses, a page fault in the middle of the vector is likely!
  • Uses much smaller register space
  • Fewer operands
  • No need for the sophisticated mechanisms of vector architecture
Graphics Processing Unit
• Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
• Basic idea:
  • Heterogeneous execution model
    • CPU is the host, GPU is the device
  • Develop a C-like programming language for the GPU
  • Unify all forms of GPU parallelism as the CUDA thread
  • Programming model: "Single Instruction Multiple Thread"
Graphics Processing Unit
• CUDA's design goals
  • extend a standard sequential programming language, specifically C/C++
  • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  • a minimalist set of abstractions for expressing parallelism
  • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
GTX570 GPU
[Figure: GTX570 memory hierarchy and SM resources: Global Memory 1,280 MB; L2 Cache 640 KB; per SM (SM 0 ... SM 14): up to 1,536 threads/SM, Texture Cache 8 KB, Constant Cache 8 KB, L1 Cache 16 KB, Shared Memory 48 KB, 32,768 registers, 32 cores]
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
    • GPU hardware handles thread management, not applications or the OS

GTX570 GPU
• 32 threads within a block work collectively
   Memory access optimization, latency hiding
GTX570 GPU
[Figure: a kernel grid of 16 blocks (Block 0 ... Block 15) scheduled onto a device with 4 multiprocessors (MP 0 ... MP 3); Blocks 0-7 run first and the rest are assigned as multiprocessors free up]
• Up to 1024 threads/block and 8 active blocks per SM
Programming the GPU
Matrix Multiplication
• For a 4096x4096 matrix multiplication
  - Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.
Solving Systems of Equations
Thread Organization
• If we expand to 4096 equations, we can process each row completely in parallel with 4096 threads
• We will require 4096 kernel launches, one for each equation

Results
CPU configuration: Intel Xeon @ 2.33 GHz with 2 GB RAM
GPU configuration: NVIDIA Tesla C1060 @ 1.3 GHz
* For single precision, speedup improves by at least a factor of 2X
Execution time includes data transfer from host to device
Programming the GPU
• Distinguishing the execution place of functions:
   __device__ or __global__ => GPU device
     Variables declared are allocated to the GPU memory
   __host__ => system processor (host)
• Function call
   Name<<<dimGrid, dimBlock>>>(..parameter list..)
   blockIdx: block identifier
   threadIdx: thread-within-block identifier
   blockDim: threads per block
Programming the GPU

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);

    // DAXPY in C
    void daxpy(int n, double a, double* x, double* y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }
Programming the GPU

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

    // DAXPY in CUDA
    __global__
    void daxpy(int n, double a, double* x, double* y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }
Programming the GPU
• CUDA
  • Hardware handles thread management
    • Invisible to the programmer (productivity),
    • but performance programmers need to know the operating principles of the threads!
  • Productivity vs. performance
    • How much power to give to the programmer; CUDA is still evolving!
Efficiency Considerations
• Avoid execution divergence
  – divergence occurs when threads within a warp follow different execution paths
  – divergence between warps is OK
• Allow loading a block of data into the SM
  – process it there, and then write the final result back out to external memory
• Coalesce memory accesses
  – access consecutive words instead of gather-scatter
• Create enough parallel work
  – 5K to 10K threads
Efficiency Considerations
• GPU Architecture
  – Each SM executes multiple warps in a time-sharing fashion while one or more are waiting for memory values
    • Hiding the execution cost of warps that are executed concurrently
  – How many memory requests can be serviced, and how many warps can be executed together, while one warp is waiting for memory values?

Easy to learn, takes time to master.
Logic Design
State Elements
• Unclocked vs. clocked
• Clocks are used in synchronous logic
• Clocks are needed in sequential logic to decide when an element that contains state should be updated.
[Figure: state element 1 -> combinational logic -> state element 2, within one clock cycle]
Latches and Flip-flops
[Figure: D latch with inputs D and C (clock) and outputs Q and Q̄]

Latches and Flip-flops
[Figure: D flip-flop built from two D latches (master-slave), clocked by C, output Q]
Latches and Flip-flops
• Latch: the output changes whenever the inputs change and the clock is asserted
• Flip-flop: the state changes only on a clock edge (edge-triggered methodology)
SRAM
SRAM vs. DRAM
Which one has a better memory density?
• Static RAM (SRAM): the value stored in a cell is kept on a pair of inverting gates
• Dynamic RAM (DRAM): the value kept in a cell is stored as a charge on a capacitor
• DRAMs use only a single transistor per bit of storage; by comparison, SRAMs require four to six transistors per bit
Which one is faster?
• In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must periodically be refreshed (which is why it is called dynamic)
Synchronous RAMs?
• The key capability is the ability to transfer a burst of data from a series of sequential addresses within an array or row