Computer Architecture, Part 4 - University of California, Santa Barbara

advertisement
Part IV
Data Path and Control
Nov. 2014
Computer Architecture, Data Path and Control
Slide 1
About This Presentation
This presentation is intended to support the use of the textbook
Computer Architecture: From Microprocessors to Supercomputers,
Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated
regularly by the author as part of his teaching of the upper-division
course ECE 154, Introduction to Computer Architecture, at the
University of California, Santa Barbara. Instructors can use these
slides freely in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition
Released
Revised
Revised
Revised
Revised
First
July 2003
July 2004
July 2005
Mar. 2006
Feb. 2007
Feb. 2008
Feb. 2009
Feb. 2011
Nov. 2014
Nov. 2014
Computer Architecture, Data Path and Control
Slide 2
A Few Words About Where We Are Headed
Performance = 1 / Execution time
simplified to 1 / CPU execution time
CPU execution time = Instructions  CPI / (Clock rate)
Performance = Clock rate / ( Instructions  CPI )
Try to achieve CPI = 1
with clock that is as
high as that for CPI > 1
designs; is CPI < 1
feasible? (Chap 15-16)
Design memory & I/O
structures to support
ultrahigh-speed CPUs
(chap 17-24)
Nov. 2014
Define an instruction set;
make it simple enough
to require a small number
of cycles and allow high
clock rate, but not so
simple that we need many
instructions, even for very
simple tasks (Chap 5-8)
Computer Architecture, Data Path and Control
Design hardware
for CPI = 1; seek
improvements with
CPI > 1 (Chap 13-14)
Design ALU for
arithmetic & logic
ops (Chap 9-12)
Slide 3
IV Data Path and Control
Design a simple computer (MicroMIPS) to learn about:
• Data path – part of the CPU where data signals flow
• Control unit – guides data signals through data path
• Pipelining – a way of achieving greater performance
Topics in This Part
Chapter 13 Instruction Execution Steps
Chapter 14 Control Unit Synthesis
Chapter 15 Pipelined Data Paths
Chapter 16 Pipeline Performance Limits
Nov. 2014
Computer Architecture, Data Path and Control
Slide 4
13 Instruction Execution Steps
A simple computer executes instructions one at a time
• Fetches an instruction from the loc pointed to by PC
• Interprets and executes the instruction, then repeats
Topics in This Chapter
13.1 A Small Set of Instructions
13.2 The Instruction Execution Unit
13.3 A Single-Cycle Data Path
13.4 Branching and Jumping
13.5 Deriving the Control Signals
13.6 Performance of the Single-Cycle Design
Nov. 2014
Computer Architecture, Data Path and Control
Slide 5
13.1 A Small Set of Instructions
31
R
I
op
25
rs
20
rt
15
rd
10
sh
fn
5
6 bits
5 bits
5 bits
5 bits
5 bits
6 bits
Opcode
Source 1
or base
Source 2
or dest’n
Destination
Unused
Opcode ext
imm
Operand / Offset, 16 bits
jta
J
Jump target address, 26 bits
inst
Instruction, 32 bits
Fig. 13.1
MicroMIPS instruction formats and naming of the various fields.
We will refer to this diagram later
Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor)
Six I-format ALU instructions (lui, addi, slti, andi, ori, xori)
Two I-format memory access instructions (lw, sw)
Three I-format conditional branch instructions (bltz, beq, bne)
Four unconditional jump instructions (j, jr, jal, syscall)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 6
0
The MicroMIPS
Instruction Set
Copy
Arithmetic
Including
la
rt,imm(rs)
(load address)
makes it easier
to write useful
programs
Logic
Memory access
Control transfer
Table 13.1
Nov. 2014
Instruction
Usage
Load upper immediate
Add
Subtract
Set less than
Add immediate
Set less than immediate
AND
OR
XOR
NOR
AND immediate
OR immediate
XOR immediate
Load word
Store word
Jump
Jump register
Branch less than 0
Branch equal
Branch not equal
Jump and link
System call
lui
rt,imm
add
rd,rs,rt
sub
rd,rs,rt
slt
rd,rs,rt
addi rt,rs,imm
slti rd,rs,imm
and
rd,rs,rt
or
rd,rs,rt
xor
rd,rs,rt
nor
rd,rs,rt
andi rt,rs,imm
ori
rt,rs,imm
xori rt,rs,imm
lw
rt,imm(rs)
sw
rt,imm(rs)
j
L
jr
rs
bltz rs,L
beq
rs,rt,L
bne
rs,rt,L
jal
L
syscall
Computer Architecture, Data Path and Control
op fn
15
0
0
0
8
10
0
0
0
0
12
13
14
35
43
2
0
1
4
5
3
0
Slide 7
32
34
42
36
37
38
39
8
12
13.2 The Instruction Execution Unit
beq,bne
syscall
31
R
I
Next addr
bltz,jr
jta
op
25
rs
20
10
sh
fn
5
5 bits
5 bits
5 bits
5 bits
6 bits
Opcode
Source 1
or base
Source 2
or dest’n
Destination
Unused
Opcode ext
0
imm
Operand / Offset, 16 bits
jta
J
Jump target address, 26 bits
inst
22 instructions
(rs)
Reg
file
inst
rd
Instruction, 32 bits
rs,rt,rd
Instr
cache
15
6 bits
j,jal
PC
rt
12 A/L,
lui,
lw,sw
ALU
Address
Data
Data
cache
(rt)
imm
op fn
Control
Harvard
architecture
Fig. 13.2
Abstract view of the instruction execution unit for MicroMIPS.
For naming of instruction fields, see Fig. 13.1.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 8
13.3 A Single-Cycle Data Path
Incr PC
Next addr
jta
Next PC
ALUOvfl
(PC)
PC
(rs)
rs
rt
Instr
cache
inst
rd
31
imm
op
Instruction fetch
Nov. 2014
Reg
file
ALU
(rt)
/
16
ALU
out
Data
addr
Data
cache
Data
out
Data
in
Func
0
32
SE / 1
0
1
2
Register input
fn
RegDst
RegWrite
Br&Jump
Fig. 13.3
0
1
2
Ovfl
Reg access / decode
ALUSrc
ALUFunc
ALU operation
DataRead
RegInSrc
DataWrite
Data access
Reg
writeback
Key elements of the single-cycle MicroMIPS data path.
Computer Architecture, Data Path and Control
Slide 9
ConstVar
Shift function
Constant
5
amount
0
Amount
5
1
5
Variable
amount
2
00
01
10
11
No shift
Logical left
Logical right
Arith right
Shifter
Function
class
32
5 LSBs
imm
Adder
y
32
k
/
c 31
xy
Shift
Set less
Arithmetic
Logic
0
0 or 1
c0
32
00
01
10
11
An ALU for
MicroMIPS
2
Shifted y
x
lui
s
32
2
32
Shorthand
symbol
for ALU
1
MSB
We use only 5
control signals
(no shifts)
Control
c 32
3
5
x
Func
AddSub
s
ALU
Logic
unit
AND
OR
XOR
NOR
00
01
10
11
Ovfl
y
32input
NOR
Zero
2
Logic function
Zero
Ovfl
Fig. 10.19 A multifunction ALU with 8 control signals (2 for function class,
1 arithmetic, 3 shift, 2 logic) specifying the operation.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 10
13.4 Branching and Jumping
Update
options
for PC
(PC)31:2 + 1
(PC)31:2 + 1 + imm
(PC)31:28 | jta
(rs)31:2
SysCallAddr
Default option
When instruction is branch and condition is met
When instruction is j or jal
When the instruction is jr
Start address of an operating system routine
Lowest 2 bits of
PC always 00
BrTrue
/
30
IncrPC
/
30
Adder
c in
0
1
2
3
NextPC
/
30
PCSrc
Fig. 13.4
Nov. 2014
/
30
/
30
/
30
/
30
4 MSBs 1
/
30
Branch
condition
checker
(rt)
/
32
(rs)
/
30
(PC)31:2
/
26
jta
30
MSBs
SE
/
30
/
32
4
16
imm
MSBs
SysCallAddr
BrType
Next-address logic for MicroMIPS (see top part of Fig. 13.3).
Computer Architecture, Data Path and Control
Slide 11
13.5 Deriving the Control Signals
Table 13.2 Control signals for the single-cycle MicroMIPS implementation.
Control signal
Reg
file
ALU
Data
cache
Next
addr
Nov. 2014
0
1
2
3
RegWrite
Don’t write
Write
RegDst1, RegDst0
rt
rd
$31
RegInSrc1, RegInSrc0
Data out
ALU out
IncrPC
ALUSrc
(rt )
imm
AddSub
Add
Subtract
LogicFn1, LogicFn0
AND
OR
XOR
NOR
FnClass1, FnClass0
lui
Set less
Arithmetic
Logic
DataRead
Don’t read
Read
DataWrite
Don’t write
Write
BrType1, BrType0
No branch
beq
bne
bltz
PCSrc1, PCSrc0
IncrPC
jta
(rs)
SysCallAddr
Computer Architecture, Data Path and Control
Slide 12
Single-Cycle Data Path, Repeated for Reference
Incr PC
Outcome of an executed instruction:
A new value loaded into PC
Possible new value in a reg or memory loc
Next addr
jta
Next PC
ALUOvfl
(PC)
PC
(rs)
rs
rt
Instr
cache
inst
rd
31
imm
op
Instruction fetch
Nov. 2014
Reg
file
ALU
(rt)
/
16
ALU
out
Data
addr
Data
cache
Data
out
Data
in
Func
0
32
SE / 1
0
1
2
Register input
fn
RegDst
RegWrite
Br&Jump
Fig. 13.3
0
1
2
Ovfl
Reg access / decode
ALUSrc
ALUFunc
ALU operation
DataRead
RegInSrc
DataWrite
Data access
Reg
writeback
Key elements of the single-cycle MicroMIPS data path.
Computer Architecture, Data Path and Control
Slide 13
Nov. 2014
Computer Architecture, Data Path and Control
00
01
10
11
00
01
10
0
0
FnClass
LogicFn
0
1
1
0
1
00
10
10
01
10
01
11
11
11
11
11
11
11
10
10
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
PCSrc
10 10
1
0
0
0
1
1
0
0
0
0
1
1
1
1
1
Add’Sub
01
01
01
01
01
01
01
01
01
01
01
01
01
00
ALUSrc
00
01
01
01
00
00
01
01
01
01
00
00
00
00
BrType
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
0
DataW rite
001111
000000 100000
000000 100010
000000 101010
001000
001010
000000 100100
000000 100101
000000 100110
000000 100111
001100
001101
001110
100011
101011
000010
000000 001000
000001
000100
000101
000011
000000 001100
DataRead
Load upper immediate
Add
Subtract
Set less than
Add immediate
Set less than immediate
AND
OR
XOR
NOR
AND immediate
OR immediate
XOR immediate
Load word
Store word
Jump
Jump register
Branch on less than 0
Branch on equal
Branch on not equal
Jump and link
System call
fn
RegInS rc
op
RegDst
Table 13.3
Instruction
RegWrite
Control
Signal
Settings
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
01
10
00
00
00
01
11
11
01
10
00
Slide 14
Control Signals in the Single-Cycle Data Path
Incr PC
Next addr
jta
Next PC
ALUOvfl
(PC)
PC
Instr
cache
inst
001111
000000
Br&Jump
BrType
Fig. 13.3
Nov. 2014
Ovfl
Reg
file
ALU
(rt)
/
16
imm
op
00 00
00 00
0
1
2
rd
31
lui
slt
PCSrc
(rs)
rs
rt
ALU
out
Register input
fn
00
01
1
1
RegDst
RegWrite
010101
1
0
ALUSrc
Data
cache
Data
out
Data
in
Func
0
32
SE / 1
Data
addr
x xx 00
1 xx 01
ALUFunc
0
0
0
0
01
01
DataRead
RegInSrc
DataWrite
AddSub LogicFn FnClass
Key elements of the single-cycle MicroMIPS data path.
Computer Architecture, Data Path and Control
0
1
2
Slide 15
0
3
4
5
bltzInst
jInst
jalInst
beqInst
bneInst
8
addiInst
1
2
10
sltiInst
12
13
14
15
andiInst
oriInst
xoriInst
luiInst
35
lwInst
43
/6
RtypeInst
0
8
fn Decoder
1
fn
/6
op Decoder
Instruction
Decoding
op
Nov. 2014
12
syscallInst
32
addInst
34
subInst
36
37
38
39
andInst
orInst
xorInst
norInst
42
sltInst
swInst
63
Fig. 13.5
jrInst
63
Instruction decoder for MicroMIPS built of two 6-to-64 decoders.
Computer Architecture, Data Path and Control
Slide 16
Nov. 2014
Computer Architecture, Data Path and Control
00
01
10
11
00
01
10
0
0
FnClass
LogicFn
0
1
1
0
1
00
10
10
01
10
01
11
11
11
11
11
11
11
10
10
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
PCSrc
10 10
1
0
0
0
1
1
0
0
0
0
1
1
1
1
1
Add’Sub
01
01
01
01
01
01
01
01
01
01
01
01
01
00
ALUSrc
00
01
01
01
00
00
01
01
01
01
00
00
00
00
BrType
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
1
0
DataW rite
001111
000000 100000
000000 100010
000000 101010
001000
001010
000000 100100
000000 100101
000000 100110
000000 100111
001100
001101
001110
100011
101011
000010
000000 001000
000001
000100
000101
000011
000000 001100
DataRead
Load upper immediate
Add
Subtract
Set less than
Add immediate
Set less than immediate
AND
OR
XOR
NOR
AND immediate
OR immediate
XOR immediate
Load word
Store word
Jump
Jump register
Branch on less than 0
Branch on equal
Branch on not equal
Jump and link
System call
fn
RegInS rc
Table 13.3
op
RegDst
Repeated
for
Reference
Instruction
RegWrite
Control
Signal
Settings:
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
01
10
00
00
00
01
11
11
01
10
00
Slide 17
Control Signal Generation
Auxiliary signals identifying instruction classes
arithInst = addInst  subInst  sltInst  addiInst  sltiInst
logicInst = andInst  orInst  xorInst  norInst  andiInst  oriInst  xoriInst
immInst = luiInst  addiInst  sltiInst  andiInst  oriInst  xoriInst
Example logic expressions for control signals
RegWrite = luiInst  arithInst  logicInst  lwInst  jalInst
ALUSrc = immInst  lwInst  swInst
addInst
subInst
jInst
AddSub = subInst  sltInst  sltiInst
DataRead = lwInst
PCSrc0 = jInst  jalInst  syscallInst
Nov. 2014
Computer Architecture, Data Path and Control
.
.
.
Control
.
.
.
sltInst
Slide 18
Putting It All Together
Fig. 10.19
ConstVar
Fig. 13.4
/
30
/
30
IncrPC
/
30
Adder
0
1
2
3
/
30
/
30
/
30
/
30
/
30
/
30
4 MSBs1
/
32
/
32
(rs)
4
16
imm
MSBs
Shift function
Constant
5
amount
0
Amount
5
1
5
Variable
amount
2
/
30
(PC)31:2
/
26
jta
Function
class
imm
5 LSBs
lui
Shift
Set less
Arithmetic
Logic
0
y
0 or 1
c0
32
BrType
00
01
10
11
2
Shifted y
x
SysCallAddr
No shift
Logical left
Logical right
Arith right
Shifter
Adder
PCSrc
00
01
10
11
32
30
MSBs
SE
c in
NextPC
Branch
condition
checker
BrTrue
(rt)
32
k
/
c
c 32 31
xy
s
32
2
32
Shortha
symb
for AL
1
MSB
Cont
3
x
Fun
AddSub
A
Incr PC
Next addr
jta
Next PC
rd
31
imm
op
0
1
2
00
01
10
11
Ovfl
Reg
file
ALU
(rt)
/
16
0
32
SE / 1
Func
ALU
out
Data
addr
Data
cache
Data
out
Data
in
Zero
Ovfl
addInst
subInst
jInst
0
1
2
Register input
fn
Zero
.
.
.
.
.
.
Control
sltInst
Br&Jump
Nov. 2014
RegDst
RegWrite
ALUSrc
ALUFunc
DataRead
RegInSrc
DataWrite
Computer Architecture, Data Path and Control
O
y
32input
NOR
2
Logic function
(rs)
rs
rt
inst
AND
OR
XOR
NOR
ALUOvfl
(PC)
PC
Instr
cache
Logic
unit
Fig. 13.3
Slide 19
13.6 Performance of the Single-Cycle Design
An example combinational-logic data path to compute z := (u + v)(w – x) / y
Add/Sub
latency
2 ns
Multiply
latency
6 ns
Divide
latency
15 ns
Total
latency
23 ns
u
+
v
Note that the divider gets its
correct inputs after 9 ns,
but this won’t cause a problem
if we allow enough total time

w
-
/
z
x
y
Nov. 2014
Beginning with inputs u, v, w, x, and y
stored in registers, the entire computation
can be completed in 25 ns, allowing 1
ns each for register readout and write
Computer Architecture, Data Path and Control
Slide 20
Performance Estimation for Single-Cycle MicroMIPS
Instruction access
2 ns
Register read
1 ns
ALU operation
2 ns
Data cache access
2 ns
Register write
1 ns
Total
8 ns
Single-cycle clock = 125 MHz
R-type 44%
6 ns
Load
24%
8 ns
Store
12%
7 ns
Branch 18%
5 ns
Jump
2%
3 ns
Weighted mean  6.36 ns
ALU-type
P
C
Load
P
C
Store
P
C
Branch
P
C
(and jr)
Jump
(except
jr & jal)
P
C
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write
step as a separate block) so as to better visualize the critical-path latencies.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 21
How Good is Our Single-Cycle Design?
Clock rate of 125 MHz not impressive
How does this compare with
current processors on the market?
Not bad, where latency is concerned
Instruction access
2 ns
Register read
1 ns
ALU operation
2 ns
Data cache access
2 ns
Register write
1 ns
Total
8 ns
Single-cycle clock = 125 MHz
A 2.5 GHz processor with 20 or so pipeline stages has a latency of about
0.4 ns/cycle  20 cycles = 8 ns
Throughput, however, is much better for the pipelined processor:
Up to 20 times better with single issue
Perhaps up to 100 times better with multiple issue
Nov. 2014
Computer Architecture, Data Path and Control
Slide 22
14 Control Unit Synthesis
The control unit for the single-cycle design is memoryless
• Problematic when instructions vary greatly in complexity
• Multiple cycles needed when resources must be reused
Topics in This Chapter
14.1 A Multicycle Implementation
14.2 Choosing the Clock Cycle
14.3 The Control State Machine
14.4 Performance of the Multicycle Design
14.5 Microprogramming
14.6 Exception Handling
Nov. 2014
Computer Architecture, Data Path and Control
Slide 23
14.1 A Multicycle Implementation
Appointment
book for a
dentist
Single-cycle Multicycle
Assume longest
treatment takes
one hour
Nov. 2014
Computer Architecture, Data Path and Control
Slide 24
Single-Cycle vs. Multicycle MicroMIPS
Clock
Time
needed
Time
allotted
Instr 1
Instr 2
Instr 3
Instr 4
Clock
Time
needed
Time
allotted
Time
saved
3 cycles
5 cycles
3 cycles
4 cycles
Instr 1
Instr 2
Instr 3
Instr 4
Fig. 14.1
Nov. 2014
Single-cycle versus multicycle instruction execution.
Computer Architecture, Data Path and Control
Slide 25
A Multicycle Data Path
Inst Reg
x Reg
jta
Address
rs,rt,rd
(rs)
PC
imm
Cache
z Reg
Reg
file
ALU
(rt)
Data
Data Reg
op
y Reg
fn
Control
von Neumann
(Princeton)
architecture
Fig. 14.2
Abstract view of a multicycle instruction execution unit for
MicroMIPS. For naming of instruction fields, see Fig. 13.1.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 26
Multicycle Data Path with Control Signals Shown
Three major changes relative to
the single-cycle data path:
26
/
1. Instruction & data
caches combined
Corrections are
shown in red
Inst Reg
3. Registers added for
jta intercycle data x Reg
rt
0
rd 1
31 2
Cache
Data Reg
PCWrite
MemWrite
MemRead
Fig. 14.3
Nov. 2014
op
Reg
file
IRWrite
(rt)
imm 16
/
fn
32 y Reg
SE /
RegInSrc
RegDst
ALUZero
x Mux
ALUOvfl
0
Zero
z Reg
1
Ovfl
(rs)
0
12
Data
0
1
SysCallAddr
rs
PC
InstData
30
/
4 MSBs
Address
0
1
2. ALU performs double duty
for address calculation
RegWrite
y Mux
4
0
1
2
4 3
ALUSrcX
ALUSrcY
30
4
ALU
Func
ALU out
ALUFunc
PCSrc
JumpAddr
Key elements of the multicycle MicroMIPS data path.
Computer Architecture, Data Path and Control
0
1
2
3
Slide 27
14.2 Clock Cycle and Control Signals
Table 14.1
Program
counter
Cache
Register
file
ALU
Nov. 2014
Control signal
0
1
2
JumpAddr
jta
SysCallAddr
PCSrc1, PCSrc0
Jump addr
x reg
PCWrite
Don’t write
Write
InstData
PC
z reg
MemRead
Don’t read
Read
MemWrite
Don’t write
Write
IRWrite
Don’t write
Write
RegWrite
Don’t write
Write
RegDst1, RegDst0
rt
rd
$31
RegInSrc1, RegInSrc0
Data reg
z reg
PC
ALUSrcX
PC
x reg
ALUSrcY1, ALUSrcY0
4
y reg
AddSub
Add
Subtract
LogicFn1, LogicFn0
AND
FnClass1, FnClass0
lui
z reg
3
ALU out
imm
4  imm
OR
XOR
NOR
Set less
Arithmetic
Logic
Computer Architecture, Data Path and Control
Slide 28
Multicycle Data Path, Repeated for Reference
26
/
30
/
4 MSBs
Corrections are
shown in red
Inst Reg
rt
0
rd 1
31 2
Cache
Data Reg
InstData
PCWrite
MemWrite
MemRead
Fig. 14.3
Nov. 2014
(rs)
Reg
file
0
12
Data
op
IRWrite
(rt)
imm 16
/
fn
ALUZero
x Mux
ALUOvfl
0
Zero
z Reg
1
Ovfl
x Reg
rs
PC
0
1
SysCallAddr
jta
Address
32 y Reg
SE /
RegInSrc
RegDst
0
1
RegWrite
y Mux
4
0
1
2
4 3
ALUSrcX
ALUSrcY
30
4
ALU
Func
ALU out
ALUFunc
PCSrc
JumpAddr
Key elements of the multicycle MicroMIPS data path.
Computer Architecture, Data Path and Control
0
1
2
3
Slide 29
Execution
Cycles
Fetch &
PC incr
Decode &
reg read
Table 14.2
Instruction Operations
Signal settings
Any
Read out the instruction and
write it into instruction
register, increment PC
Any
Read out rs & rt into x & y
registers, compute branch
address and save in z register
Perform ALU operation and
save the result in z register
Add base and offset values,
save in z register
If (x reg) =  < (y reg), set PC
to branch target address
InstData = 0, MemRead = 1
IRWrite = 1, ALUSrcX = 0
ALUSrcY = 0, ALUFunc = ‘+’
PCSrc = 3, PCWrite = 1
ALUSrcX = 0, ALUSrcY = 3
ALUFunc = ‘+’
1
2
ALU type
Set PC to the target address
jta, SysCallAddr, or (rs)
Write back z reg into rd
Load
Read memory into data reg
ALUSrcX = 1, ALUSrcY = 1 or 2
ALUFunc: Varies
ALUSrcX = 1, ALUSrcY = 2
ALUFunc = ‘+’
ALUSrcX = 1, ALUSrcY = 1
ALUFunc= ‘-’, PCSrc = 2
PCWrite = ALUZero or
ALUZero or ALUOut31
JumpAddr = 0 or 1,
PCSrc = 0 or 1, PCWrite = 1
RegDst = 1, RegInSrc = 1
RegWrite = 1
InstData = 1, MemRead = 1
Store
Copy y reg into memory
InstData = 1, MemWrite = 1
Load
Copy data register into rt
RegDst = 0, RegInSrc = 0
RegWrite = 1
ALU type
Load/Store
ALU oper
& PC
update
3
Branch
Jump
Reg write
or mem
access
Reg write
for lw
Nov. 2014
4
5
Execution cycles for multicycle MicroMIPS
Computer Architecture, Data Path and Control
Slide 30
14.3 The Control State Machine
Cycle 1
Cycle 2
Notes for State 5:
% 0 for j or jal, 1 for syscall,
don’t-care for other instr’s
@ 0 for j, jal, and syscall,
1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall,
ALUZero () for beq (bne),
bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1,
RegWrite = 1
State 0
InstData = 0
MemRead = 1
IRWrite = 1
ALUSrcX = 0
ALUSrcY = 0
ALUFunc = ‘+’
PCSrc = 3
PCWrite = 1
Jump/
Branch
State 1
ALUSrcX = 0
ALUSrcY = 3
ALUFunc = ‘+’
Note for State 7:
ALUFunc is determined based
on the op and fn fields
Fig. 14.4
Nov. 2014
Cycle 4
State 5
ALUSrcX = 1
ALUSrcY = 1
ALUFunc = ‘-’
JumpAddr = %
PCSrc = @
PCWrite = #
State 6
lw/
sw
ALUtype
State 2
ALUSrcX = 1
ALUSrcY = 2
ALUFunc = ‘+’
Cycle 5
InstData = 1
MemWrite = 1
Branches based
on instruction
sw
Speculative
calculation of
branch address
Start
Cycle 3
lw
State 3
State 4
InstData = 1
MemRead = 1
RegDst = 0
RegInSrc = 0
RegWrite = 1
State 7
State 8
ALUSrcX = 1
ALUSrcY = 1 or 2
ALUFunc = Varies
RegDst = 0 or 1
RegInSrc = 1
RegWrite = 1
The control state machine for multicycle MicroMIPS.
Computer Architecture, Data Path and Control
Slide 31
State and Instruction Decoding
op
/4
5
6
7
8
9
10
11
12
13
14
15
ControlSt0
ControlSt1
ControlSt2
ControlSt3
ControlSt4
ControlSt5
ControlSt6
ControlSt7
ControlSt8
0
1
2
3
4
1
op Decoder
st Decoder
0
1
2
3
4
1
fn
/6
5
bltzInst
jInst
jalInst
beqInst
bneInst
8
addiInst
andiInst
10
sltiInst
12
13
14
15
andiInst
oriInst
xoriInst
luiInst
35
lwInst
43
Nov. 2014
0
jrInst
8
12
syscallInst
32
addInst
34
subInst
36
37
38
39
andInst
orInst
xorInst
norInst
42
sltInst
swInst
63
Fig. 14.5
/6
RtypeInst
fn Decoder
st
63
State and instruction decoders for multicycle MicroMIPS.
Computer Architecture, Data Path and Control
Slide 32
Control Signal Generation
Certain control signals depend only on the control state
ALUSrcX = ControlSt2  ControlSt5  ControlSt7
RegWrite = ControlSt4  ControlSt8
Auxiliary signals identifying instruction classes
addsubInst = addInst  subInst  addiInst
logicInst = andInst  orInst  xorInst  norInst  andiInst  oriInst  xoriInst
Logic expressions for ALU control signals
AddSub = ControlSt5  (ControlSt7  subInst)
FnClass1 = ControlSt7  addsubInst  logicInst
FnClass0 = ControlSt7  (logicInst  sltInst  sltiInst)
LogicFn1 = ControlSt7  (xorInst  xoriInst  norInst)
LogicFn0 = ControlSt7  (orInst  oriInst  norInst)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 33
14.4 Performance of the Multicycle Design
R-type
Load
Store
Branch
Jump
44%
24%
12%
18%
2%
4 cycles
5 cycles
4 cycles
3 cycles
3 cycles
Contribution to CPI
R-type
0.444 = 1.76
Load
0.245 = 1.20
Store
0.124 = 0.48
Branch 0.183 = 0.54
Jump
0.023 = 0.06
ALU-type
P
C
Load
P
C
Store
P
C
Branch
P
C
(and jr)
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
Not
used
_____________________________
Average CPI  4.04
Jump
(except
jr & jal)
P
C
Not
used
Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write
step as a separate block) so as to better visualize the critical-path latencies.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 34
How Good is Our Multicycle Design?
Clock rate of 500 MHz better than 125 MHz
of single-cycle design, but still unimpressive
Cycle time = 2 ns
Clock rate = 500 MHz
How does the performance compare with
current processors on the market?
Not bad, where latency is concerned
R-type
Load
Store
Branch
Jump
A 2.5 GHz processor with 20 or so pipeline
stages has a latency of about 0.4  20 = 8 ns
Contribution to CPI
R-type
0.444 = 1.76
Throughput, however, is much better for
the pipelined processor:
Up to 20 times better with single issue
Load
Store
Branch
Jump
44%
24%
12%
18%
2%
0.245
0.124
0.183
0.023
4 cycles
5 cycles
4 cycles
3 cycles
3 cycles
=
=
=
=
1.20
0.48
0.54
0.06
_____________________________
Perhaps up to 100 with multiple issue
Nov. 2014
Computer Architecture, Data Path and Control
Average CPI  4.04
Slide 35
14.5 Microprogramming
PC
control
Cache
control
Register
control
ALU
inputs
ALU Sequence
function control
2
bits
JumpAddr
PCSrc
PCWrite
FnType
LogicFn
AddSub
ALUSrcY
ALUSrcX
RegInSrc
Microinstruction
RegDst
RegWrite
InstData
MemRead
MemWrite
IRWrite
Cycle 1
Cycle 2
Notes for State 5:
% 0 for j or jal, 1 for syscall,
don’t-care for other instr’s
@ 0 for j, jal, and syscall,
1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall,
ALUZero () for beq (bne),
bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1,
RegWrite = 1
23
Fig. 14.6
Possible 22-bit microinstruction
format for MicroMIPS.
State 0
InstData = 0
MemRead = 1
IRWrite = 1
ALUSrcX = 0
ALUSrcY = 0
ALUFunc = ‘+’
PCSrc = 3
PCWrite = 1
Jump/
Branch
State 1
ALUSrcX = 0
ALUSrcY = 3
ALUFunc = ‘+’
Nov. 2014
Note for State 7:
ALUFunc is determined based
on the op and fn fields
Computer Architecture, Data Path and Control
Cycle 4
State 5
ALUSrcX = 1
ALUSrcY = 1
ALUFunc = ‘-’
JumpAddr = %
PCSrc = @
PCWrite = #
State 6
Cycle 5
InstData = 1
MemWrite = 1
sw
lw/
sw
Start
The control state machine resembles
a program (microprogram)
Cycle 3
ALUtype
State 2
ALUSrcX = 1
ALUSrcY = 2
ALUFunc = ‘+’
lw
State 3
State 4
InstData = 1
MemRead = 1
RegDst = 0
RegInSrc = 0
RegWrite = 1
State 7
State 8
ALUSrcX = 1
ALUSrcY = 1 or 2
ALUFunc = Varies
RegDst = 0 or 1
RegInSrc = 1
RegWrite = 1
Slide 36
The Control State Machine as a Microprogram
Cycle 1
Cycle 2
Notes for State 5:
% 0 for j or jal, 1 for syscall,
don’t-care for other instr’s
@ 0 for j, jal, and syscall,
1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall,
ALUZero () for beq (bne),
bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1,
RegWrite = 1
State 0
InstData = 0
MemRead = 1
IRWrite = 1
ALUSrcX = 0
ALUSrcY = 0
ALUFunc = ‘+’
PCSrc = 3
PCWrite = 1
Cycle 4
State 5
ALUSrcX = 1
ALUSrcY = 1
ALUFunc = ‘-’
JumpAddr = %
PCSrc = @
PCWrite = #
State 6
Jump/
Branch
ALUSrcX = 0
ALUSrcY = 3
ALUFunc = ‘+’
lw/
sw
Start
InstData = 1
MemWrite = 1
ALUtype
State 2
ALUSrcX = 1
ALUSrcY = 2
ALUFunc = ‘+’
lw
State 3
State 4
InstData = 1
MemRead = 1
RegDst = 0
RegInSrc = 0
RegWrite = 1
State 7
State 8
ALUSrcX = 1
ALUSrcY = 1 or 2
ALUFunc = Varies
RegDst = 0 or 1
RegInSrc = 1
RegWrite = 1
Multiple substates
Fig. 14.4
Nov. 2014
Cycle 5
sw
State 1
Note for State 7:
ALUFunc is determined based
on the op and fn fields
Cycle 3
Multiple substates
Decompose
into 2 substates
The control state machine for multicycle MicroMIPS.
Computer Architecture, Data Path and Control
Slide 37
Symbolic Names for Microinstruction Field Values
Table 14.3 Microinstruction field values and their symbolic names.
The default value for each unspecified field is the all 0s bit pattern.
Field name
PC control
Cache control
Register control
ALU inputs*
ALU function*
Seq. control
Possible field values and their symbolic names
0001
1001
x011
x101
x111
PCjump
PCsyscall
PCjreg
PCbranch
PCnext
0101
1010
1100
CacheFetch
CacheStore
CacheLoad
1000 10000
rt  Data
1001 10001
rt  z
1011 10101
rd  z
1101 11010
$31  PC
000
011
101
110
x10
PC  4
PC  4imm
x  y
x  imm
(imm)
0xx10
1xx01
1xx10
x0011
x0111
+
<
-


x1011
x1111
xxx00


lui
01
10
11
PCdisp1
PCdisp2
PCfetch
* The operator symbol  stands for any of the ALU functions defined above (except for “lui”).
Nov. 2014
Computer Architecture, Data Path and Control
Slide 38
fetch: -------------
Control Unit for
Microprogramming
Multiway
branch
andi: ---------
64 entries
in each table
Dispatch
table 1
Dispatch
table 2
0
0
1
2
3
MicroPC
1
Address
Microprogram
memory or PLA
Incr
Data
Microinstruction register
op (from
instruction
register)
Fig. 14.7
Nov. 2014
Control signals to data path
Sequence
control
Microprogrammed control unit for MicroMIPS .
Computer Architecture, Data Path and Control
Slide 39
fetch:
Microprogram
for MicroMIPS
37 microinstructions
Fig. 14.8
The complete
MicroMIPS
microprogram.
Nov. 2014
PCnext, CacheFetch
PC + 4imm, PCdisp1
lui1:
lui(imm)
rt  z, PCfetch
add1:
x + y
rd  z, PCfetch
sub1:
x - y
rd  z, PCfetch
slt1:
x - y
rd  z, PCfetch
addi1:
x + imm
rt  z, PCfetch
slti1:
x - imm
rt  z, PCfetch
and1:
x  y
rd  z, PCfetch
or1:
x  y
rd  z, PCfetch
xor1:
x  y
rd  z, PCfetch
nor1:
x  y
rd  z, PCfetch
andi1:
x  imm
rt  z, PCfetch
ori1:
x  imm
rt  z, PCfetch
xori:
x  imm
rt  z, PCfetch
lwsw1:
x + imm, mPCdisp2
lw2:
CacheLoad
rt  Data, PCfetch
sw2:
CacheStore, PCfetch
j1:
PCjump, PCfetch
jr1:
PCjreg, PCfetch
branch1: PCbranch, PCfetch
jal1:
PCjump, $31PC, PCfetch
syscall1:PCsyscall, PCfetch
Computer Architecture, Data Path and Control
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
State
0 (start)
1
7lui
8lui
7add
8add
7sub
8sub
7slt
8slt
7addi
8addi
7slti
8slti
7and
8and
7or
8or
7xor
8xor
7nor
8nor
7andi
8andi
7ori
8ori
7xori
8xori
2
3
4
6
5j
5jr
5branch
5jal
5syscall
Slide 40
14.6 Exception Handling
Exceptions and interrupts alter the normal program flow
Examples of exceptions (things that can go wrong):




ALU operation leads to overflow (incorrect result is obtained)
Opcode field holds a pattern not representing a legal operation
Cache error-code checker deems an accessed word invalid
Sensor signals a hazardous condition (e.g., overheating)
Exception handler is an OS program that takes care of the problem
 Derives correct result of overflowing computation, if possible
 Invalid operation may be a software-implemented instruction
Interrupts are similar, but usually have external causes (e.g., I/O)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 41
Exception
Control
States
Cycle 1
Cycle 2
Jump/
Branch
Cycle 3
Cycle 4
State 5
ALUSrcX = 1
ALUSrcY = 1
ALUFunc = ‘-’
JumpAddr = %
PCSrc = @
PCWrite = #
State 6
Cycle 5
InstData = 1
MemWrite = 1
sw
State 0
InstData = 0
MemRead = 1
IRWrite = 1
ALUSrcX = 0
ALUSrcY = 0
ALUFunc = ‘+’
PCSrc = 3
PCWrite = 1
State 1
ALUSrcX = 0
ALUSrcY = 3
ALUFunc = ‘+’
lw/
sw
Start
ALUtype
Illegal
operation
Fig. 14.10
Nov. 2014
State 2
ALUSrcX = 1
ALUSrcY = 2
ALUFunc = ‘+’
lw
State 3
State 4
InstData = 1
MemRead = 1
RegDst = 0
RegInSrc = 0
RegWrite = 1
State 7
State 8
ALUSrcX = 1
ALUSrcY = 1 or 2
ALUFunc = Varies
RegDst = 0 or 1
RegInSrc = 1
RegWrite = 1
State 10
IntCause = 0
CauseWrite = 1
ALUSrcX = 0
ALUSrcY = 0
ALUFunc = ‘-’
EPCWrite = 1
JumpAddr = 1
PCSrc = 0
PCWrite = 1
Overflow
State 9
IntCause = 1
CauseWrite = 1
ALUSrcX = 0
ALUSrcY = 0
ALUFunc = ‘-’
EPCWrite = 1
JumpAddr = 1
PCSrc = 0
PCWrite = 1
Exception states 9 and 10 added to the control state machine.
Computer Architecture, Data Path and Control
Slide 42
15 Pipelined Data Paths
Pipelining is now used in even the simplest of processors
• Same principles as assembly lines in manufacturing
• Unlike in assembly lines, instructions not independent
Topics in This Chapter
15.1 Pipelining Concepts
15.2 Pipeline Stalls or Bubbles
15.3 Pipeline Timing and Performance
15.4 Pipelined Data Path Design
15.5 Pipelined Control
15.6 Optimal Pipelining
Nov. 2014
Computer Architecture, Data Path and Control
Slide 43
Nov. 2014
Computer Architecture, Data Path and Control
Slide 44
Single-Cycle Data Path of Chapter 13
Incr PC
Next addr
jta
Next PC
ALUOvfl
(PC)
PC
(rs)
rs
rt
Instr
cache
inst
rd
31
imm
op
Br&Jump
Fig. 13.3
Nov. 2014
0
1
2
Clock rate = 125 MHz
CPI = 1 (125 MIPS)
Ovfl
Reg
file
ALU
(rt)
/
16
0
32
SE / 1
ALU
out
Data
addr
Data
cache
Data
out
Data
in
Func
0
1
2
Register input
fn
RegDst
RegWrite
ALUSrc
ALUFunc
DataRead
RegInSrc
DataWrite
Key elements of the single-cycle MicroMIPS data path.
Computer Architecture, Data Path and Control
Slide 45
Multicycle Data Path of Chapter 14
Clock rate = 500 MHz
CPI  4 ( 125 MIPS)
Inst Reg
26
/
4 MSBs
rt
0
rd 1
31 2
Cache
Data Reg
InstData
PCWrite
MemWrite
MemRead
Fig. 14.3
Nov. 2014
(rs)
Reg
file
0
12
Data
op
IRWrite
(rt)
imm 16
/
fn
ALUZero
x Mux
ALUOvfl
0
Zero
z Reg
1
Ovfl
x Reg
rs
PC
32 y Reg
SE /
RegInSrc
RegDst
0
1
SysCallAddr
jta
Address
0
1
30
/
RegWrite
y Mux
4
0
1
2
4 3
ALUSrcX
ALUSrcY
30
4
ALU
Func
ALU out
ALUFunc
PCSrc
JumpAddr
Key elements of the multicycle MicroMIPS data path.
Computer Architecture, Data Path and Control
0
1
2
3
Slide 46
Getting the Best of Both Worlds
Pipelined:
Clock rate = 500 MHz
CPI  1
Single-cycle:
Multicycle:
Clock rate = 125 MHz
CPI = 1
Clock rate = 500 MHz
CPI  4
Single-cycle analogy:
Doctor appointments
scheduled for 60 min
per patient
Nov. 2014
Multicycle analogy:
Doctor appointments
scheduled in 15-min
increments
Computer Architecture, Data Path and Control
Slide 47
15.1 Pipelining Concepts
Strategies for improving performance
1 – Use multiple independent data paths accepting several instructions
that are read out at once: multiple-instruction-issue or superscalar
2 – Overlap execution of several instructions, starting the next instruction
before the previous one has run to completion: (super)pipelined
Approval
1
Cashier
2
2
Registrar
ID photo
3
4
Pickup
5
Start
here
Exit
Fig. 15.1
Nov. 2014
Pipelining in the student registration process.
Computer Architecture, Data Path and Control
Slide 48
Pipelined Instruction Execution
Instr
cache
Instr 3
Instr 4
Instr 5
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Time dimension
Instr 2
Instr 1
Cycle 1
Task
dimension
Fig. 15.2
Nov. 2014
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Pipelining in the MicroMIPS instruction execution process.
Computer Architecture, Data Path and Control
Slide 49
Alternate Representations of a Pipeline
Except for start-up and drainage overheads, a pipeline can execute
one instruction per clock tick; IPS is dictated by the clock frequency
1
2
1
2
3
4
5
f
r
a
d
w
f
r
a
d
w
f
r
a
d
w
f
f = Fetch
r = Reg read
a = ALU op
d = Data access
w = Writeback
r
a
d
w
f
r
a
d
w
f
r
a
d
w
f
r
a
d
3
4
5
6
7
6
7
8
9
10
11
Cycle
1
2
1
2
3
4
5
6
7
f
f
f
f
f
f
f
r
r
r
r
r
r
r
a
a
a
a
a
a
a
d
d
d
d
d
d
d
w
w
w
w
w
w
3
4
5
w
Start-up
region
8
9
10
Cycle
Drainage
region
Pipeline
stage
Instruction
(a) Task-time diagram
(b) Space-time diagram
Fig. 15.3 Two abstract graphical representations of a
5-stage pipeline executing 7 tasks (instructions).
Nov. 2014
Computer Architecture, Data Path and Control
11
Slide 50
w
Pipelining Example in a Photocopier
Example 15.1
A photocopier with an x-sheet document feeder copies the first sheet
in 4 s and each subsequent sheet in 1 s. The copier’s paper path is a
4-stage pipeline with each stage having a latency of 1s. The first
sheet goes through all 4 pipeline stages and emerges after 4 s.
Each subsequent sheet emerges 1s after the previous sheet.
How does the throughput of this photocopier vary with x, assuming
that loading the document feeder and removing the copies takes 15 s.
Solution
Each batch of x sheets is copied in 15 + 4 + (x – 1) = 18 + x seconds.
A nonpipelined copier would require 4x seconds to copy x sheets.
For x > 6, the pipelined version has a performance edge.
When x = 50, the pipelining speedup is (4  50) / (18 + 50) = 2.94.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 51
15.2 Pipeline Stalls or Bubbles
First type of data dependency
$5 = $6 + $7
$8 = $8 + $6
$9 = $8 + $2
sw $9, 0($3)
Cycle 1
Cycle 2
Instr
cache
Reg
file
Instr
cache
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
ALU
Data
cache
Reg
file
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Cycle 8
Data
forwarding
Reg
file
Fig. 15.4 Read-after-write data dependency and its possible resolution
through data forwarding .
Nov. 2014
Computer Architecture, Data Path and Control
Slide 52
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
Instr 3
Instr 5
Instr 4
Instr
cache
Reg
file
Cycle 7
Cycle 8
Cycle 9
Cycle 2
Cycle 3
Writes into $8
Data
cache
ALU
Bubble
Reg
file
Task
dimension
Cycle 4
Reg
file
Data
cache
Bubble
ALU
Instr
cache
Reg
file
Cycle 5
Cycle 6
Reg
file
Data
cache
Reg
file
ALU
Data
cache
Reg
file
Cycle 8
Cycle 9
Cycle 7
Without data forwarding,
three bubbles are needed
to resolve a read-after-write
data dependency
Reads from $8
Time dimension
Instr
cache
Reg
file
ALU
Instr
cache
Reg
file
Instr 3
Instr 2
Instr 1
ALU
Bubble
Instr
cache
Cycle 1
Instr
cache
Instr 4
Instr 5
Cycle 6
Time dimension
Instr 2
Instr 1
Inserting Bubbles in a Pipeline
Data
cache
ALU
Bubble
Reg
file
Instr
cache
Task
dimension
Nov. 2014
Reg
file
Data
cache
Writes into $8
Reg
file
Data
cache
Reg
file
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
ALU
Bubble
Two bubbles, if we assume
that a register can be updated
and read from in one cycle
Reads from $8
Reg
file
Computer Architecture, Data Path and Control
Slide 53
Second Type of Data Dependency
C ycle 1
C ycle 2
Instr
mem
Reg
file
sw $6, . . .
C ycle 3
ALU
C ycle 4
C ycle 5
Data
mem
Reg
file
C ycle 6
C ycle 7
C ycle 8
Reorder?
lw $8, . . .
Insert bubble?
$9 = $8 + $2
Instr
mem
Reg
file
ALU
Data
mem
Reg
file
Instr
mem
Reg
file
ALU
Data
mem
Reg
file
Instr
mem
Reg
file
ALU
Data
mem
Reg
file
Without data forwarding, three (two) bubbles are needed
to resolve a read-after-load data dependency
Fig. 15.5 Read-after-load data dependency and its possible resolution
through bubble insertion and data forwarding.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 54
Control Dependency in a Pipeline
C ycle 1
C ycle 2
Instr
mem
Reg
file
Instr
mem
$6 = $3 + $5
beq $1, $2, . . .
Insert bubble?
C ycle 3
C ycle 4
C ycle 5
ALU
Data
mem
Reg
file
Reg
file
ALU
Data
mem
Reg
file
Instr
mem
Reg
file
ALU
Data
mem
Reg
file
Instr
mem
Reg
file
ALU
Data
mem
$9 = $8 + $2
Assume branch
resolved here
Fig. 15.6
Nov. 2014
C ycle 6
C ycle 7
C ycle 8
Reorder?
(delayed
branch)
Reg
file
Here would need
1-2 more bubbles
Control dependency due to conditional branch.
Computer Architecture, Data Path and Control
Slide 55
15.3 Pipeline Timing and Performance
Latching
of results
t
Function unit
Stage
1
t/q
Fig. 15.7
Nov. 2014
Stage
2
Stage
3
.. .
Stage
q-1
Stage
q

Pipelined form of a function unit with latching overhead.
Computer Architecture, Data Path and Control
Slide 56
Throughput Increase in a q-Stage Pipeline
Throughput improvement factor
8
7
t
t/q + 
Ideal:
/t = 0
6
/t = 0.05
5
or
4
/t = 0.1
q
1 + q / t
3
2
1
1
2
3
4
5
6
Number q of pipeline stages
7
8
Fig. 15.8 Throughput improvement due to pipelining as a function
of the number of pipeline stages for different pipelining overheads.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 57
Pipeline Throughput with Dependencies
Assume that one bubble must be inserted due to read-after-load
dependency and after a branch when its delay slot cannot be filled.
Let  be the fraction of all instructions that are followed by a bubble.
q
Pipeline speedup = (1 + q / t)(1 + )
Effective
CPI
R-type
Load
Store
Branch
Jump
44%
24%
12%
18%
2%
Example 15.3
Calculate the effective CPI for MicroMIPS, assuming that a quarter
of branch and load instructions are followed by bubbles.
Solution
Fraction of bubbles  = 0.25(0.24 + 0.18) = 0.105
CPI = 1 +  = 1.105 (which is very close to the ideal value of 1)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 58
15.4 Pipelined Data Path Design
Stage 1
Stage 2
ALUOvfl
1
PC
inst
Instr
cache
rs
rt
(rs)
1
Reg
file
ALU
imm SE
Incr
IncrPC
SeqInst
op
Nov. 2014
Stage 5
Data
Data
addr
Address
Ovfl
(rt)
Fig. 15.9
Stage 4
Next addr
NextPC
0
Stage 3
Data
cache
0
1
Func
0
1
0
1
2
rt
rd 0
1
31 2
Br&Jump
RegDst
fn
RegWrite
ALUSrc
ALUFunc
DataRead
RetAddr
DataWrite
RegInSrc
Key elements of the pipelined MicroMIPS data path.
Computer Architecture, Data Path and Control
Slide 59
15.5 Pipelined Control
Stage 1
Stage 2
Stage 4
Data
Data
addr
Address
ALUOvfl
1
PC
Stage 5
Next addr
NextPC
0
Stage 3
inst
Instr
cache
rs
rt
(rs)
Ovfl
Reg
file
(rt)
1
imm SE
Incr
Data
cache
ALU
Func
0
1
0
1
2
rt
rd 0
1
31 2
IncrPC
0
1
2
3
5
SeqInst
op
Br&Jump
RegDst
fn
RegWrite
ALUSrc
Fig. 15.10
Nov. 2014
ALUFunc
DataRead RetAddr
DataWrite
RegInSrc
Pipelined control signals.
Computer Architecture, Data Path and Control
Slide 60
15.6 Optimal Pipelining
MicroMIPS pipeline with more than four-fold improvement
PC
Instruction
fetch
Register
readout
Instr
cache
Reg
file
Instr
cache
ALU
operation
ALU
Reg
file
Instr
cache
Data
read/store
Register
writeback
Data
cache
Reg
file
ALU
Reg
file
Data
cache
ALU
Reg
file
Reg
file
Data
cache
Fig. 15.11 Higher-throughput pipelined data path for MicroMIPS
and the execution of consecutive instructions in it .
Nov. 2014
Computer Architecture, Data Path and Control
Slide 61
Optimal Number of Pipeline Stages
Assumptions:
Pipeline sliced into q stages
Stage overhead is 
q/2 bubbles per branch
(decision made midway)
Fraction b of all instructions
are taken branches
Derivation of q opt
Latching
of results
t
Function unit
Stage
1
t/q
Fig. 15.7
Stage
2
Stage
3
.. .
Stage
q-1
Stage
q

Pipelined form of a function unit with latching overhead.
Average CPI = 1 + b q / 2
Throughput = Clock rate / CPI =
1
(t / q + )(1 + b q / 2)
Differentiate throughput expression with respect to q and equate with 0
q opt
=
Nov. 2014
2t / 
b
Varies directly with t /  and inversely with b
Computer Architecture, Data Path and Control
Slide 62
Pipelining Example
An example combinational-logic data path to compute z := (u + v)(w – x) / y
Add/Sub
latency
2 ns
Multiply
latency
6 ns
Divide
latency
15 ns
Throughput, original
= 1/(25  10–9)
= 40 M computations / s
u
+
v
Pipeline register
placement, Option 2

w
-
/
Throughput, option 1
= 1/(17  10–9)
= 58.8 M computations / s
Write, 1 ns
z
x
y
Readout, 1 ns
Nov. 2014
Pipeline register
placement, Option 1
Throughput, Option 2
= 1/(10  10–9)
= 100 M computations / s
Computer Architecture, Data Path and Control
Slide 63
16 Pipeline Performance Limits
Pipeline performance limited by data & control dependencies
• Hardware provisions: data forwarding, branch prediction
• Software remedies: delayed branch, instruction reordering
Topics in This Chapter
16.1 Data Dependencies and Hazards
16.2 Data Forwarding
16.3 Pipeline Branch Hazards
16.4 Delayed Branch and Branch Prediction
16.5 Dealing with Exceptions
16.6 Advanced Pipelining
Nov. 2014
Computer Architecture, Data Path and Control
Slide 64
16.1 Data Dependencies and Hazards
Cycle 1
Cycle 2
Instr
cache
Reg
file
Instr
cache
Cycle 3
Cycle 4
Cycle 5
ALU
Data
cache
Reg
file
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Fig. 16.1
Nov. 2014
Cycle 6
Cycle 7
Cycle 8
Cycle 9
$2 = $1 - $3
Instructions
that read
register $2
Reg
file
Data dependency in a pipeline.
Computer Architecture, Data Path and Control
Slide 65
Resolving Data Dependencies via Forwarding
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Instr
cache
Reg
file
ALU
Instr
cache
Cycle 6
Cycle 7
Data
cache
Reg
file
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Cycle 8
Cycle 9
$2 = $1 - $3
Instructions
that read
register $2
Reg
file
Fig. 16.2 When a previous instruction writes back a value
computed by the ALU into a register, the data dependency
can always be resolved through forwarding.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 66
Pipelined MicroMIPS – Repeated for Reference
Stage 1
Stage 2
Stage 4
Data
Data
addr
Address
ALUOvfl
1
PC
Stage 5
Next addr
NextPC
0
Stage 3
inst
Instr
cache
rs
rt
(rs)
Ovfl
Reg
file
(rt)
1
imm SE
Incr
Data
cache
ALU
Func
0
1
0
1
2
rt
rd 0
1
31 2
IncrPC
0
1
2
3
5
SeqInst
op
Br&Jump
RegDst
fn
RegWrite
ALUSrc
Fig. 15.10
Nov. 2014
ALUFunc
DataRead RetAddr
DataWrite
RegInSrc
Pipelined control signals.
Computer Architecture, Data Path and Control
Slide 67
Certain Data Dependencies Lead to Bubbles
Cycle 1
Cycle 2
Instr
cache
Reg
file
Instr
cache
Cycle 3
ALU
Cycle 4
Cycle 5
Cycle 6
Data
cache
Reg
file
lw
Cycle 7
Cycle 8
Cycle 9
$2,4($12)
Instructions
that read
register $2
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Instr
cache
Reg
file
ALU
Data
cache
Reg
file
Fig. 16.3 When the immediately preceding instruction writes a
value read out from the data memory into a register, the data
dependency cannot be resolved through forwarding (i.e., we cannot
go back in time) and a bubble must be inserted in the pipeline.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 68
16.2 Data Forwarding
Stage 2
Stage 3
s2
rt
x2
Reg
file
SE
RetAddr3, RegWrite3, RegWrite4
RegInSrc3, RegInSrc4
Ovfl
ALU
(rt)
x3
0
1
2
y2
x3
y3
x4
y4
t2
Forw arding
unit, lower
ALUSrc2
Fig. 16.4
Nov. 2014
Data
cache
x4
0
1
Func
y3
d3
ALUSrc1
Stage 5
d3
d4
Forw arding
unit, upper
x3
y3
x4
y4
(rs)
rs
Stage 4
0
1
RegInSrc3
d3
RegWrite3
d4
RetAddr3
RetAddr3, RegWrite3, RegWrite4
RegInSrc3, RegInSrc4
y4
d4
RegInSrc4
RegWrite4
Forwarding unit for the pipelined MicroMIPS data path.
Computer Architecture, Data Path and Control
Slide 69
Design of the Data Forwarding Units
Stage 2
Stage 3
s2
Let’s focus on designing
the upper data forwarding unit
rt
Ovfl
x2
Reg
file
ALU
(rt)
Table 16.1 Partial truth table for
the upper forwarding unit in the
pipelined MicroMIPS data path.
Data
cache
x3
x3
y3
x4
y4
y2
0
1
0
1
y3
d3
Forw arding
unit, lower
t2
ALUSrc1
ALUSrc2
Fig. 16.4
x4
Func
0
1
2
SE
Stage 5
RetAddr3, RegWrite3, RegWrite4
RegInSrc3, RegInSrc4
x3
y3
x4
y4
(rs)
rs
Stage 4
d3
d4
Forw arding
unit, upper
y4
RegInSrc3
RegWrite3
RetAddr3
RetAddr3, RegWrite3, RegWrite4
RegInSrc3, RegInSrc4
d3
d4
d4
RegInSrc4
RegWrite4
Forwarding unit for the pipelined MicroMIPS data path.
RegWrite3
RegWrite4
s2matchesd3
s2matchesd4
RetAddr3
RegInSrc3
RegInSrc4
Choose
0
0
x
x
x
x
x
x2
0
1
x
0
x
x
x
x2
0
1
x
1
x
x
0
x4
0
1
x
1
x
x
1
y4
1
0
1
x
0
1
x
x3
1
0
1
x
1
1
x
y3
1
1
1
1
0
1
x
x3
Incorrect in textbook
Nov. 2014
Computer Architecture, Data Path and Control
Slide 70
Hardware for Inserting Bubbles
Stage 1
Stage 2
Stage 3
Data hazard
detector
LoadPC
LoadInst
Inst
Instr
cache
PC
(rs)
rs
rt
x2
Reg
file
(rt)
LoadIncrPC Inst
reg
Corrections to textbook
figure shown in red
0
1
2
IncrPC
y2
Bubble
Control signals
from decoder
All-0s
t2
0
1
Controls
or all-0s
DataRead2
Fig. 16.5 Data hazard detector for the pipelined MicroMIPS data path.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 71
Augmentations to Pipelined Data Path and Control
Branch
predictor
Next addr
forwarders
Hazard
detector
Stage 1
Data cache
forwarder
Stage 3
Stage 4
Stage 2
ALUOvfl
1
PC
Stage 5
Next addr
NextPC
0
ALU
forwarders
inst
Instr
cache
rs
rt
(rs)
Ovfl
Reg
file
imm SE
Incr
IncrPC
Data
cache
ALU
(rt)
1
Data
Data
addr
Address
Func
0
1
0
1
0
1
2
rt
rd 0
1
31 2
2
3
5
SeqInst
Fig. 15.10
Nov. 2014
op
Br&Jump
RegDst
fn
RegWrite
ALUSrc
ALUFunc
Computer Architecture, Data Path and Control
DataRead RetAddr
DataWrite
RegInSrc
Slide 72
16.3 Pipeline Branch Hazards
Software-based solutions
Compiler inserts a “no-op” after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction:
Always predict that the branch is not taken, flush if mistaken
More elaborate branch prediction strategies possible
Nov. 2014
Computer Architecture, Data Path and Control
Slide 73
16.4 Branch Prediction
Predicting whether a branch will be taken
 Always predict that the branch will not be taken
 Use program context to decide (backward branch
is likely taken, forward branch is likely not taken)
 Allow programmer or compiler to supply clues
 Decide based on past history (maintain a small
history table); to be discussed later
 Apply a combination of factors: modern processors
use elaborate techniques due to deep pipelines
Nov. 2014
Computer Architecture, Data Path and Control
Slide 74
Forward and Backward Branches
Example 5.5
List A is stored in memory beginning at the address given in $s1.
List length is given in $s2.
Find the largest integer in the list and copy it into $t0.
Solution
Scan the list, holding the largest element identified thus far in $t0.
lw
addi
loop: add
beq
add
add
add
lw
slt
beq
addi
j
done: ...
Nov. 2014
$t0,0($s1)
$t1,$zero,0
$t1,$t1,1
$t1,$s2,done
$t2,$t1,$t1
$t2,$t2,$t2
$t2,$t2,$s1
$t3,0($t2)
$t4,$t0,$t3
$t4,$zero,loop
$t0,$t3,0
loop
#
#
#
#
#
#
#
#
#
#
#
#
#
initialize maximum to A[0]
initialize index i to 0
increment index i by 1
if all elements examined, quit
compute 2i in $t2
compute 4i in $t2
form address of A[i] in $t2
load value of A[i] into $t3
maximum < A[i]?
if not, repeat with no change
if so, A[i] is the new maximum
change completed; now repeat
continuation of the program
Computer Architecture, Data Path and Control
Slide 75
Simple Branch Prediction: 1-Bit History
Taken
Not taken
Not taken
Predict
taken
Predict
not taken
Taken
Two-state branch prediction scheme.
Problem with this approach:
Each branch in a loop entails two mispredictions:
Once in first iteration (loop is repeated, but the history indicates exit from loop)
Once in last iteration (when loop is terminated, but history indicates repetition)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 76
Simple Branch Prediction: 2-Bit History
Taken
Not taken
Not taken
Predict
taken
Taken
Not taken
Predict
taken
again
Taken
Predict
not taken
Not taken
Predict
not taken
again
Taken
Fig. 16.6
Four-state branch prediction scheme.
Example 16.1
L1: ---10 iter’s
---20 iter’s
L2: ------br <c2> L2
---br <c1> L1
Nov. 2014
Impact of different branch prediction schemes
Solution
Always taken: 11 mispredictions, 94.8% accurate
1-bit history: 20 mispredictions, 90.5% accurate
2-bit history: Same as always taken
Computer Architecture, Data Path and Control
Slide 77
Other Branch Prediction Algorithms
Problem 16.3
Taken
Not taken
Not taken
Part a
Predict
taken
Taken
Not taken
Predict
taken
again
Taken
Part b
Predict
taken
Taken
Taken
Not taken
Predict
not taken
Not taken
Taken
Predict
taken
again
Taken
Predict
not taken
again
Not taken
Taken
Predict
not taken
Not taken
Predict
not taken
again
Not taken
Taken
Not taken
Not taken
Fig. 16.6
Predict
taken
Taken
Not taken
Predict
taken
again
Taken
Predict
not taken
Not taken
Predict
not taken
again
Taken
Nov. 2014
Computer Architecture, Data Path and Control
Slide 78
Hardware Implementation of Branch Prediction
Low-order
bits used
as index
Addresses of recent
branch instructions
Target
addresses
History
bit(s)
Incremented
PC
Next
PC
0
Read-out table entry
From
PC
Fig. 16.7
Compare
=
1
Logic
Hardware elements for a branch prediction scheme.
The mapping scheme used to go from PC contents to a table entry
is the same as that used in direct-mapped caches (Chapter 18)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 79
Pipeline Augmentations – Repeated for Reference
Branch
predictor
Next addr
forwarders
Hazard
detector
Stage 1
Data cache
forwarder
Stage 3
Stage 4
Stage 2
ALUOvfl
1
PC
Stage 5
Next addr
NextPC
0
ALU
forwarders
inst
Instr
cache
rs
rt
(rs)
Ovfl
Reg
file
imm SE
Incr
IncrPC
Data
cache
ALU
(rt)
1
Data
Data
addr
Address
Func
0
1
0
1
0
1
2
rt
rd 0
1
31 2
2
3
5
SeqInst
Fig. 15.10
Nov. 2014
op
Br&Jump
RegDst
fn
RegWrite
ALUSrc
ALUFunc
Computer Architecture, Data Path and Control
DataRead RetAddr
DataWrite
RegInSrc
Slide 80
16.5 Advanced Pipelining
Deep pipeline = superpipeline; also, superpipelined, superpipelining
Parallel instruction issue = superscalar, j-way issue (2-4 is typical)
Stage 1
Stage 2
Stage 3
Stage 4
Stage 5
Variable # of stages
Stage q-2 Stage q-1 Stage q
Function unit 1
Instr cache
Instr
decode
Operand
prep
Instr
issue
Function unit 2
Function unit 3
Instruction fetch
Retirement & commit stages
Fig. 16.8
Dynamic instruction pipeline with in-order issue,
possible out-of-order completion, and in-order retirement.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 81
Design Space for Advanced Superscalar Pipelines
Front end:
Instr. issue:
Writeback:
Commit:
In-order or out-of-order
In-order or out-of-order
In-order or out-of-order
In-order or out-of-order
The more OoO stages,
the higher the complexity
Example of complexity due to
out-of-order processing:
MIPS R10000
Source: Ahi, A. et al., “MIPS R10000
Superscalar Microprocessor,”
Proc. Hot Chips Conf., 1995.
Nov. 2014
Computer Architecture, Data Path and Control
Slide 82
Performance Improvement for Deep Pipelines
Hardware-based methods
Lookahead past an instruction that will/may stall in the pipeline
(out-of-order execution; requires in-order retirement)
Issue multiple instructions (requires more ports on register file)
Eliminate false data dependencies via register renaming
Predict branch outcomes more accurately, or speculate
Software-based method
Pipeline-aware compilation
Loop unrolling to reduce the number of branches
Loop: Compute with index i
Increment i by 1
Go to Loop if not done
Nov. 2014
Loop: Compute with index i
Compute with index i + 1
Increment i by 2
Go to Loop if not done
Computer Architecture, Data Path and Control
Slide 83
CPI Variations with Architectural Features
Table 16.2 Effect of processor architecture, branch
prediction methods, and speculative execution on CPI.
Architecture
Methods used in practice
CPI
Nonpipelined, multicycle
Strict in-order instruction issue and exec
5-10
Nonpipelined, overlapped
In-order issue, with multiple function units
3-5
Pipelined, static
In-order exec, simple branch prediction
2-3
Superpipelined, dynamic
Out-of-order exec, adv branch prediction
1-2
Superscalar
2- to 4-way issue, interlock & speculation
0.5-1
Advanced superscalar
4- to 8-way issue, aggressive speculation
0.2-0.5
Need 100 for TIPS performance
Need 100,000 for 1 PIPS
Nov. 2014
3.3 inst / cycle  3 Gigacycles / s
 10 GIPS
Computer Architecture, Data Path and Control
Slide 84
Development of Intel’s Desktop/Laptop Micros
In the beginning, there was the 8080; led to the 80x86 = IA32 ISA
Half a dozen or so pipeline stages
80286
80386
80486
Pentium (80586)
More advanced
technology
A dozen or so pipeline stages, with out-of-order instruction execution
Pentium Pro
Pentium II
Pentium III
Celeron
More advanced
technology
Instructions are broken
into micro-ops which are
executed out-of-order
but retired in-order
Two dozens or so pipeline stages
Pentium 4
Nov. 2014
Computer Architecture, Data Path and Control
Slide 85
Current State of Computer Performance
Multi-GIPS/GFLOPS desktops and laptops
Very few users need even greater computing power
Users unwilling to upgrade just to get a faster processor
Current emphasis on power reduction and ease of use
Multi-TIPS/TFLOPS in large computer centers
World’s top 500 supercomputers, http://www.top500.org
Next list due in June 2009; as of Nov. 2008:
All 500 >> 10 TFLOPS, 30 > 100 TFLOPS, 1 > PFLOPS
Multi-PIPS/PFLOPS supercomputers on the drawing board
IBM “smarter planet” TV commercial proclaims (in early 2009):
“We just broke the petaflop [sic] barrier.”
The technical term “petaflops” is now in the public sphere
Nov. 2014
Computer Architecture, Data Path and Control
Slide 86
The Shrinking Supercomputer
Nov. 2014
Computer Architecture, Data Path and Control
Slide 87
16.6 Dealing with Exceptions
Exceptions present the same problems as branches
How to handle instructions that are ahead in the pipeline?
(let them run to completion and retirement of their results)
What to do with instructions after the exception point?
(flush them out so that they do not affect the state)
Precise versus imprecise exceptions
Precise exceptions hide the effects of pipelining and parallelism
by forcing the same state as that of strict sequential execution
(desirable, because exception handling is not complicated)
Imprecise exceptions are messy, but lead to faster hardware
(interrupt handler can clean up to offer precise exception)
Nov. 2014
Computer Architecture, Data Path and Control
Slide 88
The Three Hardware Designs for MicroMIPS
Incr PC
Single-cycle
Next addr
jta
Next PC
ALUOvfl
(PC)
PC
Instr
cache
inst
rd
31
imm
op
0
1
2
Ovfl
Inst Reg
Reg
file
ALU
(rt)
/
16
ALU
out
Data
cache
0
1
0
1
2
Cache
Data Reg
InstData
RegDst
RegWrite
ALUSrc
Stage 1
y Mux
0
1
2
4 3
(rt)
imm 16
/
4
32 y Reg
SE /
Stage 2
IRWrite
RegInSrc
RegDst
ALUSrcX
RegWrite
Stage 4
ALUOvfl
PC
fn
Stage 3
1
inst
Instr
cache
rs
rt
(rs)
4
ALU
0
1
2
3
Func
ALU out
ALUSrcY
Stage 5
Reg
file
1
Incr
IncrPC
rt
rd
31
PCSrc
JumpAddr
500 MHz
CPI  4
Address
Data
cache
ALU
imm SE
ALUFunc
Data
Data
addr
Ovfl
(rt)
500 MHz
CPI  1.1
op
MemWrite
MemRead
Next addr
NextPC
0
PCWrite
DataRead
RegInSrc
DataWrite
ALUFunc
125 MHz
CPI = 1
Func
0
1
0
1
0
1
2
0
1
2
2
3
5
SeqInst
op
Nov. 2014
Reg
file
30
Register input
fn
Br&Jump
(rs)
0
1
Data
ALUZero
x Mux
ALUOvfl
0
Zero
z Reg
1
Ovfl
x Reg
rs
rt
0
1
rd
31 2
0
1
SysCallAddr
jta
PC
Data
out
Data
in
Func
0
32
SE / 1
Data
addr
30
/
4 MSBs
Address
(rs)
rs
rt
26
/
Multicycle
Br&Jump
RegDst
fn
RegWrite
ALUSrc
ALUFunc
DataRead RetAddr
DataWrite
RegInSrc
Computer Architecture, Data Path and Control
Slide 89
Where Do We
Go from Here?
Memory Design:
How to build a memory unit
that responds in 1 clock
Input and Output:
Peripheral devices,
I/O programming,
interfacing, interrupts
Higher Performance:
Vector/array processing
Parallel processing
Nov. 2014
Computer Architecture, Data Path and Control
Slide 90
Download