Part I
Background and Motivation
About This Presentation
This presentation is intended to support the use of the textbook
Computer Architecture: From Microprocessors to Supercomputers,
Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated
regularly by the author as part of his teaching of the upper-division
course ECE 154, Introduction to Computer Architecture, at the
University of California, Santa Barbara. Instructors can use these
slides freely in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition   Released    Revised
First     June 2003   July 2004, June 2005, Mar. 2006, Jan. 2007, Jan. 2008, Jan. 2009, Jan. 2011
Second    Jan. 2011
I Background and Motivation
Provide motivation, paint the big picture, introduce tools:
• Review components used in building digital circuits
• Present an overview of computer technology
• Understand the meaning of computer performance
(or why a 2 GHz processor isn’t 2× as fast as a 1 GHz model)
Topics in This Part
Chapter 1 Combinational Digital Circuits
Chapter 2 Digital Circuits with Memory
Chapter 3 Computer System Technology
Chapter 4 Computer Performance
1 Combinational Digital Circuits
First of two chapters containing a review of digital design:
• Combinational, or memoryless, circuits in Chapter 1
• Sequential circuits, with memory, in Chapter 2
Topics in This Chapter
1.1 Signals, Logic Operators, and Gates
1.2 Boolean Functions and Expressions
1.3 Designing Gate Networks
1.4 Useful Combinational Parts
1.5 Programmable Combinational Parts
1.6 Timing and Circuit Considerations
1.1 Signals, Logic Operators, and Gates
Name   Operator sign and alternate(s)   Output is 1 iff:          Arithmetic expression
NOT    x′ (also ¬x or x̄)                Input is 0                1 − x
AND    xy (also x ∧ y)                  Both inputs are 1s        x × y or xy
OR     x ∨ y (also x + y)               At least one input is 1   x + y − xy
XOR    x ⊕ y (also x ≢ y)               Inputs are not equal      x + y − 2xy
Figure 1.1 Some basic elements of digital logic circuits, with
operator signs used in this book highlighted.
The Arithmetic Substitution Method
z′ = 1 – z             NOT converted to arithmetic form
xy                     AND same as multiplication
x ∨ y = x + y − xy     OR converted to arithmetic form
x ⊕ y = x + y − 2xy    XOR converted to arithmetic form

(When doing the algebra, set z^k = z, since z is either 0 or 1.)

Example: Prove the identity xyz ∨ x′ ∨ y′ ∨ z′ ≟ 1

LHS = [xyz ∨ x′] ∨ [y′ ∨ z′]
    = [xyz + 1 – x – (1 – x)xyz] ∨ [1 – y + 1 – z – (1 – y)(1 – z)]
    = [xyz + 1 – x] ∨ [1 – yz]
    = (xyz + 1 – x) + (1 – yz) – (xyz + 1 – x)(1 – yz)    ← this is addition, not logical OR
    = 1 + xy²z² – xyz = 1 + xyz – xyz
    = 1 = RHS
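
The substitution method is easy to mechanize. Below is a minimal Python sketch (not from the book) that re-checks the identity above by evaluating the arithmetic forms over all 0/1 assignments:

from itertools import product

def NOT(a):    return 1 - a          # x'      ->  1 - x
def AND(a, b): return a * b          # xy      ->  x * y
def OR(a, b):  return a + b - a * b  # x OR y  ->  x + y - xy

for x, y, z in product((0, 1), repeat=3):
    lhs = OR(OR(AND(AND(x, y), z), NOT(x)), OR(NOT(y), NOT(z)))
    assert lhs == 1  # xyz OR x' OR y' OR z' evaluates to 1 for every input
print("identity holds for all 8 input combinations")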
Variations in Gate Symbols
Figure 1.2 Gates with more than two inputs and/or with inverted signals at input or output: variations of the AND, OR, NAND, NOR, and XNOR symbols.
Gates as Control Elements
With an enable/pass signal e and data in x:
(a) AND gate for controlled transfer: data out = x or 0
(b) Tristate buffer: data out = x or “high impedance”
(c) Model for AND switch: e = 0 passes 0; e = 1 passes x
(d) Model for tristate buffer: e = 0 passes no data; e = 1 passes x
Figure 1.3 An AND gate and a tristate buffer act as controlled switches
or valves. An inverting buffer is logically the same as a NOT gate.
Wired OR and Bus Connections
Control signals ex, ey, ez gate data inputs x, y, z:
(a) Wired OR of product terms: data out = x, y, z, or 0
(b) Wired OR of tristate outputs: data out = x, y, z, or high impedance
Figure 1.4 Wired OR allows tying together of several
controlled signals.
Control/Data Signals and Signal Bundles
Figure 1.5 Arrays of logic gates represented by a single gate symbol:
(a) 8 NOR gates, with an Enable control and 8-bit signal bundles
(b) 32 AND gates, with 32-bit signal bundles
(c) k XOR gates, with a Compl control and k-bit signal bundles
1.2 Boolean Functions and Expressions
Ways of specifying a logic function
• Truth table: 2^n rows; “don’t-care” entries in input or output columns
• Logic expression: w ′ (x ∨ y ∨ z), product-of-sums,
sum-of-products, equivalent expressions
• Word statement: Alarm will sound if the door
is opened while the security system is engaged,
or when the smoke detector is triggered
• Logic circuit diagram: Synthesis vs analysis
Manipulating Logic Expressions
Table 1.2
Laws (basic identities) of Boolean algebra.
Name of law    OR version                    AND version
Identity       x ∨ 0 = x                     x 1 = x
One/Zero       x ∨ 1 = 1                     x 0 = 0
Idempotent     x ∨ x = x                     x x = x
Inverse        x ∨ x′ = 1                    x x′ = 0
Commutative    x ∨ y = y ∨ x                 x y = y x
Associative    (x ∨ y) ∨ z = x ∨ (y ∨ z)     (x y) z = x (y z)
Distributive   x ∨ (y z) = (x ∨ y) (x ∨ z)   x (y ∨ z) = (x y) ∨ (x z)
DeMorgan’s     (x ∨ y)′ = x′ y′              (x y)′ = x′ ∨ y′
Proving the Equivalence of Logic Expressions
Example 1.1
• Truth-table method: Exhaustive verification
• Arithmetic substitution
x ∨ y = x + y − xy
x ⊕ y = x + y − 2xy
Example: x ⊕ y ≡? x′y ∨ xy ′
x + y – 2xy ≡? (1 – x)y + x(1 – y) – (1 – x)yx(1 – y)
• Case analysis: two cases, x = 0 or x = 1
• Logic expression manipulation
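
The truth-table method, mechanized in the same spirit (a minimal Python sketch, not from the book): exhaustively compare the two expressions of the example over all four input combinations.

from itertools import product

for x, y in product((0, 1), repeat=2):
    xor_form = x ^ y                          # x XOR y
    sop_form = ((1 - x) & y) | (x & (1 - y))  # x'y OR xy'
    assert xor_form == sop_form
print("x XOR y is equivalent to x'y OR xy'")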
1.3 Designing Gate Networks
• AND-OR, NAND-NAND, OR-AND, NOR-NOR
• Logic optimization: cost, speed, power dissipation
(a ∨ b ∨ c)′ = a ′b ′c ′
Figure 1.6 A two-level AND-OR circuit and two equivalent circuits, over inputs x, y, z: (a) AND-OR circuit, (b) intermediate circuit, (c) NAND-NAND equivalent.
Seven-Segment Display of Decimal Digits
Figure 1.7 Seven-segment display of decimal digits. The three open segments may be optionally used. The digit 1 can be displayed in two ways, with the more common right-side version shown.
BCD-to-Seven-Segment Decoder
Example 1.2
The 4-bit input x3 x2 x1 x0, a value in [0, 9], is decoded into signals e0–e6 that enable or turn on the segments, numbered 0 (top), 1 (upper right), 2 (lower right), 3 (bottom), 4 (lower left), 5 (upper left), and 6 (middle).

Figure 1.8 The logic circuit that generates the enable signal for the lowermost segment (number 3) in a seven-segment display unit.
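
A behavioral sketch of the decoder in Python (illustrative, not the book’s gate network): a lookup table records which segments light for each digit, assuming the common display convention with the segment numbering of Figure 1.8 and the optional segments left unused; e3, the lowermost segment, then comes out on for 0, 2, 3, 5, 6, and 8.

SEGMENTS = {  # digit -> set of lit segments, numbered 0-6 as in Figure 1.8
    0: {0, 1, 2, 3, 4, 5}, 1: {1, 2},       2: {0, 1, 3, 4, 6},
    3: {0, 1, 2, 3, 6},    4: {1, 2, 5, 6}, 5: {0, 2, 3, 5, 6},
    6: {0, 2, 3, 4, 5, 6}, 7: {0, 1, 2},    8: {0, 1, 2, 3, 4, 5, 6},
    9: {0, 1, 2, 5, 6},    # optional segments assumed unused
}

def enables(digit):
    """Return the tuple (e0, ..., e6) for a BCD digit in [0, 9]."""
    return tuple(int(s in SEGMENTS[digit]) for s in range(7))

print([d for d in range(10) if enables(d)[3] == 1])  # digits lighting e3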
1.4 Useful Combinational Parts
• High-level building blocks
• Much like prefab parts used in building a house
• Arithmetic components (adders, multipliers, ALUs)
will be covered in Part III
• Here we cover three useful parts:
multiplexers, decoders/demultiplexers, encoders
Multiplexers
Figure 1.9 panels: (a) 2-to-1 mux, with data inputs x0 and x1, select input y, and output z; (b) switch view; (c) mux symbol; (d) mux array for 32-bit signal bundles; (e) 4-to-1 mux with enable e, selecting among x0–x3 by y1 y0; (f) 4-to-1 mux design built of three 2-to-1 muxes.
Figure 1.9 Multiplexer (mux), or selector, allows one of several inputs
to be selected and routed to output depending on the binary value of a
set of selection or address signals provided to it.
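
A behavioral sketch of these muxes in Python (illustrative, not a gate-level model): a 2-to-1 mux, and a 4-to-1 mux composed of three 2-to-1 muxes as in the figure’s design panel.

def mux2(x0, x1, y):
    """2-to-1 mux: output is x0 when y = 0 and x1 when y = 1."""
    return x1 if y else x0

def mux4(x0, x1, x2, x3, y1, y0):
    """4-to-1 mux built from three 2-to-1 muxes."""
    return mux2(mux2(x0, x1, y0), mux2(x2, x3, y0), y1)

assert mux4(10, 11, 12, 13, y1=1, y0=0) == 12  # y1 y0 = 10 selects x2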
Decoders/Demultiplexers
Panels: (a) 2-to-4 decoder, with address inputs y1 y0 and outputs x0–x3; (b) decoder symbol; (c) demultiplexer, or decoder with “enable”.

Figure 1.10 A decoder allows the selection of one of 2^a options using an a-bit address as input. A demultiplexer (demux) is a decoder that only selects an output if its enable signal is asserted.
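
A behavioral sketch in Python (illustrative): the decoder asserts exactly one of its 2^a outputs, and gating the outputs with an enable signal turns it into a demux.

def decoder(address, a, enable=1):
    """Return the 2**a output lines for an a-bit address (one-hot if enabled)."""
    return [int(bool(enable) and i == address) for i in range(2 ** a)]

print(decoder(2, a=2))            # 2-to-4 decoder: [0, 0, 1, 0]
print(decoder(2, a=2, enable=0))  # demux with enable deasserted: [0, 0, 0, 0]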
Encoders
Panels: (a) 4-to-2 encoder, with inputs x0–x3 and outputs y1 y0 (in the example, x2 = 1 encodes to y1 y0 = 10); (b) encoder symbol.

Figure 1.11 A 2^a-to-a encoder outputs an a-bit binary number equal to the index of the single 1 among its 2^a inputs.
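
The matching behavioral sketch in Python (illustrative), which assumes exactly one input is asserted, as the caption specifies:

def encoder(inputs):
    """inputs: list of 2**a bits holding a single 1; returns its index."""
    assert sum(inputs) == 1, "exactly one input must be 1"
    return inputs.index(1)

assert encoder([0, 0, 1, 0]) == 2  # 4-to-2 encoder: y1 y0 = 10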
1.5 Programmable Combinational Parts
A programmable combinational part can do the job of
many gates or gate networks
Programmed by cutting existing connections (fuses)
or establishing new connections (antifuses)
• Programmable ROM (PROM)
• Programmable array logic (PAL)
• Programmable logic array (PLA)
PROMs
Inputs w, x, y, z feed a decoder whose outputs drive programmable OR connections.

Figure 1.12 Programmable connections and their use in a PROM: (a) programmable OR gates, (b) logic equivalent of part a, (c) programmable read-only memory (PROM).
PALs and PLAs
Inputs feed an AND array (AND plane) whose product terms feed an OR array (OR plane) that produces the outputs; the example devices use 8-input ANDs, 6-input ANDs, and 4-input ORs.
(a) General programmable combinational logic
(b) PAL: programmable AND array, fixed OR array
(c) PLA: programmable AND and OR arrays
Figure 1.13 Programmable combinational logic: general structure and
two classes known as PAL and PLA devices. Not shown is PROM with
fixed AND array (a decoder) and programmable OR array.
1.6 Timing and Circuit Considerations
Changes in gate/circuit output, triggered by changes in its
inputs, are not instantaneous
• Gate delay δ: from a fraction of a nanosecond to a few nanoseconds
• Wire delay, previously negligible, is now important
(electronic signals travel about 15 cm per ns)
• Circuit simulation to verify function and timing
Glitching
Using the PAL in Fig. 1.13b to implement f = x ∨ y ∨ z
With x = 0, the first AND-OR (PAL) level computes a = x ∨ y and the second computes f = a ∨ z, each with a 2δ delay; a change at the inputs can therefore produce a momentary wrong value (glitch) at f before it settles.

Figure 1.14 Timing diagram for a circuit that exhibits glitching.
CMOS Transmission Gates
(a) CMOS transmission gate (TG): circuit, with parallel N and P transistors driven by complementary gate signals, and symbol.
(b) Two-input mux built of two transmission gates: select signal y routes x0 or x1 to output z.
Figure 1.15 A CMOS transmission gate and its use in building
a 2-to-1 mux.
2 Digital Circuits with Memory
Second of two chapters containing a review of digital design:
• Combinational (memoryless) circuits in Chapter 1
• Sequential circuits (with memory) in Chapter 2
Topics in This Chapter
2.1 Latches, Flip-Flops, and Registers
2.2 Finite-State Machines
2.3 Designing Sequential Circuits
2.4 Useful Sequential Parts
2.5 Programmable Sequential Parts
2.6 Clocks and Timing of Events
2.1 Latches, Flip-Flops, and Registers
Figure 2.1 Latches, flip-flops, and registers: (a) SR latch, with inputs R and S and outputs Q and Q′; (b) D latch, with data input D and clock input C; (c) master-slave D flip-flop built of two D latches; (d) D flip-flop symbol (FF); (e) k-bit register.
Latches vs Flip-Flops
A D latch is transparent while C is asserted, so D must be held stable for the setup time before and the hold time after the latching edge; an edge-triggered D flip-flop samples D only at the clock edge, with the same setup- and hold-time requirement around that edge.
Figure 2.2 Operations of D latch and
negative-edge-triggered D flip-flop.
Reading and Modifying FFs in the Same Cycle
A k-bit register of D flip-flops feeds a computation module (combinational logic) whose k-bit result returns to the register’s D input; with edge-triggered flip-flops, the register can be read and modified in the same clock cycle, provided the clock period covers the flip-flop propagation delay plus the combinational delay.

Figure 2.3 Register-to-register operation with edge-triggered flip-flops.
2.2 Finite-State Machines
Example 2.1
State table (inputs: Dime, Quarter, Reset):

Current state   Next on Dime   Next on Quarter   Next on Reset
S00             S10            S25               S00
S10             S20            S35               S00
S20             S30            S35               S00
S25             S35            S35               S00
S30             S35            S35               S00
S35             S35            S35               S00

S00 is the initial state (Start); S35 is the final state. The state diagram shows the same transitions graphically.
Figure 2.4 State table and state diagram for a vending
machine coin reception unit.
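
The state table transcribes directly into executable form; here is a minimal Python sketch (not from the book) that replays coin sequences through it:

NEXT = {  # (current state, input) -> next state, from the table above
    ("S00", "dime"): "S10", ("S00", "quarter"): "S25",
    ("S10", "dime"): "S20", ("S10", "quarter"): "S35",
    ("S20", "dime"): "S30", ("S20", "quarter"): "S35",
    ("S25", "dime"): "S35", ("S25", "quarter"): "S35",
    ("S30", "dime"): "S35", ("S30", "quarter"): "S35",
    ("S35", "dime"): "S35", ("S35", "quarter"): "S35",
}

def run(inputs, state="S00"):
    for coin in inputs:
        state = "S00" if coin == "reset" else NEXT[(state, coin)]
    return state

assert run(["dime", "dime", "dime"]) == "S30"  # 30 cents: not yet final
assert run(["dime", "quarter"]) == "S35"       # 35 cents: final state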
Sequential Machine Implementation
Inputs and the present state (held in the state register) feed the next-state logic, whose excitation signals update the state register; output logic produces the outputs from the present state and, only for a Mealy machine, also directly from the inputs.
Figure 2.5 Hardware realization of Moore and Mealy
sequential machines.
2.3 Designing Sequential Circuits
Example 2.3
Inputs q (quarter in) and d (dime in) drive three D flip-flops, FF2, FF1, and FF0, that hold the state code; the output e indicates the final state, whose code is 1xx.

Figure 2.7 Hardware realization of a coin reception unit (Example 2.3).
2.4 Useful Sequential Parts
• High-level building blocks
• Much like prefab closets used in building a house
• Other memory components will be covered in
Chapter 17 (SRAM details, DRAM, Flash)
• Here we cover three useful parts:
shift register, register file (SRAM basics), counter
Shift Register
A mux ahead of each flip-flop selects, under the Shift/Load control, between the k-bit parallel data in and the shifted contents (the register’s k – 1 LSBs joined with serial data in); parallel data out is k bits wide, and the MSB serves as serial data out.
Figure 2.8 Register with single-bit left shift and parallel load
capabilities. For logical left shift, serial data in line is connected to 0.
Register File and FIFO
(a) Register file with random access: 2^h k-bit registers built of D flip-flops. A decoder on the h-bit write address, gated by write enable, steers the k-bit write data into one register; muxes on the two h-bit read addresses deliver read data 0 and read data 1.
(b) Graphic symbol for register file: write data, write address, write enable; read address 0 and 1, read data 0 and 1, read enable.
(c) FIFO symbol: k-bit input and output, push and pop controls, empty and full flags.

Figure 2.9 Register file with random access and FIFO.
SRAM

(a) SRAM block diagram: an h-bit address, g-bit data in, write enable, chip select, and output enable surround a square or almost-square memory matrix that delivers g-bit data out.
(b) SRAM read mechanism: the row part of the address drives a row decoder that reads one row into the row buffer; a column mux then selects the g data bits out.

Figure 2.10 SRAM memory is simply a large, single-port register file.
Binary Counter
Under control of Incr′Init, a mux selects either an input value (for initialization) or x + 1 from the incrementer (with carry-in c_in and carry-out c_out) for loading into the count register holding x.

Figure 2.11 Synchronous binary counter with initialization capability.
2.5 Programmable Sequential Parts
A programmable sequential part contains gates and memory elements
Programmed by cutting existing connections (fuses)
or establishing new connections (antifuses)
• Programmable array logic (PAL)
• Field-programmable gate array (FPGA)
• Both types contain macrocells and interconnects
PAL and FPGA
(a) Portion of PAL with storable output: an 8-input AND term feeds a D flip-flop, and muxes select between the combinational and the registered (FF) output.
(b) Generic structure of an FPGA: an array of configurable logic blocks (CLBs) surrounded by I/O blocks, joined by programmable connections.

Figure 2.12 Examples of programmable sequential logic.
2.6 Clocks and Timing of Events
Clock is a periodic signal: clock rate = clock frequency
The inverse of clock rate is the clock period: 1 GHz ↔ 1 ns
Constraint: Clock period ≥ tprop + tcomb + tsetup + tskew
A value launched from FF1 on one clock edge passes through combinational logic (with other inputs) and must reach FF2 before the next edge: FF1 begins to change at the edge, the change is observed after the propagation delay, and the clock period must be wide enough to accommodate worst-case delays.

Figure 2.13 Determining the required length of the clock period.
Synchronization
(a) Simple synchronizer: a single D flip-flop samples the asynchronous input to produce a synchronized version.
(b) Two-FF synchronizer: two flip-flops in series (FF1, FF2) give a metastable first-stage output time to settle before it is sampled again.
(c) Input and output waveforms: the synchronized version changes only on clock edges.
Figure 2.14 Synchronizers are used to prevent timing problems
arising from untimely changes in asynchronous signals.
Level-Sensitive Operation
Latches clocked by φ1 alternate with latches clocked by φ2, with combinational logic (and other inputs) between stages; because the two clocks have nonoverlapping highs, data advances exactly one stage per clock phase within each clock period.

Figure 2.15 Two-phase clocking with nonoverlapping clock signals.
3 Computer System Technology
Interplay between architecture, hardware, and software
• Architectural innovations influence technology
• Technological advances drive changes in architecture
Topics in This Chapter
3.1 From Components to Applications
3.2 Computer Systems and Their Parts
3.3 Generations of Progress
3.4 Processor and Memory Technologies
3.5 Peripherals, I/O, and Communications
3.6 Software Systems and Applications
3.1 From Components to Applications
The views range from low level to high level: electronic components, circuit designer, and logic designer (hardware); computer organization and computer architecture (computer designer); then system designer, application designer, and application domains (software).

Figure 3.1 Subfields or views in computer system engineering.
What Is (Computer) Architecture?
The architect stands at the interface between goals and means, and between engineering and arts. Goals: the client’s requirements (function, cost, ...) and the client’s taste (mood, style, ...). Means: construction technology (material, codes, ...) and the world of arts (aesthetics, trends, ...).
Figure 3.2 Like a building architect, whose place at the
engineering/arts and goals/means interfaces is seen in this diagram, a
computer architect reconciles many conflicting or competing demands.
3.2 Computer Systems and Their Parts
Computer → analog or digital; digital → fixed-function or stored-program; stored-program → electronic or nonelectronic; electronic → general-purpose or special-purpose; general-purpose → number cruncher or data manipulator.
Figure 3.3
The space of computer systems, with what we normally
mean by the word “computer” highlighted.
Price/Performance Pyramid
Super         $Millions
Mainframe     $100s Ks
Server        $10s Ks
Workstation   $1000s
Personal      $100s
Embedded      $10s

Differences in scale, not in substance.
Figure 3.4 Classifying computers by computational
power and price range.
Automotive Embedded Computers
A central controller links embedded nodes such as impact sensors, brakes, airbags, engine control, and navigation & entertainment.
Figure 3.5 Embedded computers are ubiquitous, yet invisible. They
are found in our automobiles, appliances, and many other places.
Personal Computers and Workstations
Figure 3.6 Notebooks, a common class of portable computers,
are much smaller than desktops but offer substantially the same
capabilities. What are the main reasons for the size difference?
Digital Computer Subsystems
The units: memory; control and datapath, together forming the processor (CPU); input and output, together forming I/O; and a link unit to/from the network.
Figure 3.7
The (three, four, five, or) six main units of a digital
computer. Usually, the link unit (a simple bus or a more elaborate
network) is not explicitly included in such diagrams.
3.3 Generations of Progress
Table 3.2 The 5 generations of digital computers, and their ancestors.
Generation (begun)  Processor technology  Memory innovations  I/O devices introduced         Dominant look & feel
0 (1600s)           (Electro-)mechanical  Wheel, card         Lever, dial, punched card      Factory equipment
1 (1950s)           Vacuum tube           Magnetic drum       Paper tape, magnetic tape      Hall-size cabinet
2 (1960s)           Transistor            Magnetic core       Drum, printer, text terminal   Room-size mainframe
3 (1970s)           SSI/MSI               RAM/ROM chip        Disk, keyboard, video monitor  Desk-size mini
4 (1980s)           LSI/VLSI              SRAM/DRAM           Network, CD, mouse, sound      Desktop/laptop micro
5 (1990s)           ULSI/GSI/WSI, SOC     SDRAM, flash        Sensor/actuator, point/click   Invisible, embedded
IC Production and Yield
A silicon crystal ingot (30-60 cm long, 15-30 cm in diameter) goes to a slicer that produces blank wafers 0.2 cm thick, some carrying defects; processing in 20-30 steps yields a patterned wafer (100s of simple or scores of complex processors); a dicer cuts out ~1 cm dies, a die tester passes the good dies, mounting produces the microchip or other part, and a part tester delivers usable parts to ship.

Figure 3.8 The manufacturing process for an IC part.
Effect of Die Size on Yield
Figure 3.9 Visualizing the dramatic decrease in yield with larger dies: the same wafer holds 120 dies of which 109 are good, or 26 larger dies of which only 15 are good.

Die yield =def (number of good dies) / (total number of dies)
Die yield = Wafer yield × [1 + (Defect density × Die area) / a]^(–a)
Die cost = (cost of wafer) / (total number of dies × die yield)
         = (cost of wafer) × (die area / wafer area) / (die yield)
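
The formulas in code, as a minimal Python sketch; the wafer cost, defect density, and exponent a = 3 below are assumed illustrative values, not figures from the book:

def die_yield(wafer_yield, defect_density, die_area, a=3.0):
    # Die yield = wafer yield * [1 + (defect density * die area) / a] ** (-a)
    return wafer_yield * (1 + defect_density * die_area / a) ** (-a)

def die_cost(wafer_cost, dies_per_wafer, yield_):
    # Die cost = (cost of wafer) / (total number of dies * die yield)
    return wafer_cost / (dies_per_wafer * yield_)

y = die_yield(1.0, defect_density=0.8, die_area=1.0)    # assumed 0.8/cm^2, 1 cm^2 die
print(round(y, 2), round(die_cost(1000.0, 120, y), 2))  # yield, $ per good die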
3.4 Processor and Memory Technologies
(a) 2D or 2.5D packaging now common: dies (CPU, memory) on a PC board, joined by a bus and connector to a backplane.
(b) 3D packaging of the future: stacked layers glued together, with interlayer connections deposited on the outside of the stack.

Figure 3.11 Packaging of processor, memory, and other components.
Moore’s Law

Over calendar years 1980-2010, processor performance climbed from kIPS toward TIPS at roughly ×1.6 per year (×2 per 18 months, ×10 per 5 years), through the 68000, 80286, 80386, 80486, 68040, Pentium, Pentium II, and R10000; DRAM memory chip capacity grew ×4 per 3 years, from 64 kb through 256 kb, 1 Mb, 4 Mb, 16 Mb, 64 Mb, and 256 Mb toward 1 Gb and beyond, on a kb-to-Tb scale.

Figure 3.10 Trends in processor performance and DRAM memory chip capacity (Moore’s law).
Pitfalls of Computer Technology Forecasting
“DOS addresses only 1 MB of RAM because we cannot
imagine any applications needing more.” Microsoft, 1980
“640K ought to be enough for anybody.” Bill Gates, 1981
“Computers in the future may weigh no more than 1.5
tons.” Popular Mechanics
“I think there is a world market for maybe five
computers.” Thomas Watson, IBM Chairman, 1943
“There is no reason anyone would want a computer in
their home.” Ken Olsen, DEC founder, 1977
“The 32-bit machine would be an overkill for a personal
computer.” Sol Libes, ByteLines
3.5 Input/Output and Communications
(a) Cutaway view of a hard disk drive, with platters typically 2-9 cm across.
(b) Some removable storage media: floppy disk, CD-ROM, magnetic tape cartridge.

Figure 3.12 Magnetic and optical disk memory units.
Communication Technologies

Plotted by bandwidth (10^3 to 10^12 b/s) against latency (from nanoseconds up to hours): processor buses, I/O networks, and system-area networks (SANs) serve components at the same geographic location, with low latency and high bandwidth; local-area networks (LANs), metro-area networks (MANs), and wide-area networks (WANs) are geographically distributed, with progressively longer latencies.

Figure 3.13 Latency and bandwidth characteristics of different classes of communication links.
3.6 Software Systems and Applications
Software divides into application software (word processor, spreadsheet, circuit simulator, ...) and system software, chiefly the operating system:
• Translator: MIPS assembler, C compiler, ...
• Manager: virtual memory, security, file system, ...
• Enabler: disk driver, display driver, printing, ...
• Coordinator: scheduling, load balancing, diagnostics, ...

Figure 3.15 Categorization of software, with examples in each class.
High- vs Low-Level Programming
Very high-level language objectives or tasks (one task = many statements):
    Swap v[i] and v[i+1]

High-level language statements, handled by a compiler or an interpreter (one statement = several instructions):
    temp=v[i]
    v[i]=v[i+1]
    v[i+1]=temp

Assembly language instructions, mnemonic, produced by the compiler:
    add $2,$5,$5
    add $2,$2,$2
    add $2,$4,$2
    lw  $15,0($2)
    lw  $16,4($2)
    sw  $16,0($2)
    sw  $15,4($2)
    jr  $31

Machine language instructions, binary (hex), produced by the assembler (mostly one-to-one):
    00a51020
    00421020
    00821020
    8c620000
    8cf20004
    acf20000
    ac620004
    03e00008

Moving down the levels, code becomes more concrete, machine-specific, and error-prone: harder to write, read, debug, or maintain. Moving up, it becomes more abstract and machine-independent: easier to write, read, debug, or maintain.

Figure 3.14 Models and abstractions in programming.
4 Computer Performance
Performance is key in design decisions; also cost and power
• It has been a driving force for innovation
• Isn’t quite the same as speed (higher clock rate)
Topics in This Chapter
4.1 Cost, Performance, and Cost/Performance
4.2 Defining Computer Performance
4.3 Performance Enhancement and Amdahl’s Law
4.4 Performance Measurement vs Modeling
4.5 Reporting Computer Performance
4.6 The Quest for Higher Performance
4.1 Cost, Performance, and Cost/Performance
(Graph: computer cost, from $1 to $1 G on a logarithmic scale, over calendar years 1960-2020.)
Cost/Performance
Performance as a function of cost may be superlinear (economy of scale), linear (ideal?), or sublinear (diminishing returns).

Figure 4.1 Performance improvement as a function of cost.
4.2 Defining Computer Performance
A CPU-bound task stresses the processing stage of the input → processing → output pipeline, while an I/O-bound task stresses the input and output stages.

Figure 4.2 Pipeline analogy shows that imbalance between processing power and I/O capabilities leads to a performance bottleneck.
Six Passenger Aircraft to Be Compared
(Photos of the six aircraft compared in Table 4.1, among them the Boeing 747 and the DC-8-50.)
Performance of Aircraft: An Analogy
Table 4.1 Key characteristics of six passenger aircraft: all figures
are approximate; some relate to a specific model/configuration of
the aircraft or are averages of cited range of values.
Aircraft      Passengers   Range (km)   Speed (km/h)   Price ($M)
Airbus A310      250          8,300          895           120
Boeing 747       470          6,700          980           200
Boeing 767       250         12,300          885           120
Boeing 777       375          7,450          980           180
Concorde         130          6,400        2,200           350
DC-8-50          145         14,000          875            80

Speed of sound ≈ 1220 km/h
Different Views of Performance
Performance from the viewpoint of a passenger: Speed
Note, however, that flight time is but one part of total travel time.
Also, if the travel distance exceeds the range of a faster plane,
a slower plane may be better due to not needing a refueling stop
Performance from the viewpoint of an airline: Throughput
Measured in passenger-km per hour (relevant if ticket price were
proportional to distance traveled, which in reality it is not)
Airbus A310   250 × 895  = 0.224 M passenger-km/hr
Boeing 747    470 × 980  = 0.461 M passenger-km/hr
Boeing 767    250 × 885  = 0.221 M passenger-km/hr
Boeing 777    375 × 980  = 0.368 M passenger-km/hr
Concorde      130 × 2200 = 0.286 M passenger-km/hr
DC-8-50       145 × 875  = 0.127 M passenger-km/hr
Performance from the viewpoint of FAA: Safety
Cost Effectiveness: Cost/Performance
Table 4.1 Key characteristics of six passenger
aircraft: all figures are approximate; some relate to
a specific model/configuration of the aircraft or are
averages of cited range of values.
Larger throughput values are better; smaller cost/performance values are better.

Aircraft    Passengers   Range (km)   Speed (km/h)   Price ($M)   Throughput (M P-km/hr)   Cost/Performance
A310           250          8,300          895           120              0.224                  536
B 747          470          6,700          980           200              0.461                  434
B 767          250         12,300          885           120              0.221                  543
B 777          375          7,450          980           180              0.368                  489
Concorde       130          6,400        2,200           350              0.286                 1224
DC-8-50        145         14,000          875            80              0.127                  630
Concepts of Performance and Speedup
Performance = 1 / Execution time
is simplified to
Performance = 1 / CPU execution time
(Performance of M1) / (Performance of M2) = Speedup of M1 over M2
= (Execution time of M2) / (Execution time M1)
Terminology:
M1 is x times as fast as M2 (e.g., 1.5 times as fast)
M1 is 100(x – 1)% faster than M2 (e.g., 50% faster)
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
= Instructions × CPI / (Clock rate)
Instruction count, CPI, and clock rate are not completely independent,
so improving one by a given factor may not lead to overall execution
time improvement by the same factor.
Elaboration on the CPU Time Formula
CPU time = Instructions × (Cycles per instruction) × (Secs per cycle)
= Instructions × Average CPI / (Clock rate)
Instructions:
Number of instructions executed, not number of
instructions in our program (dynamic count)
Average CPI:
Is calculated based on the dynamic instruction mix
and knowledge of how many clock cycles are needed
to execute various instructions (or instruction classes)
Clock rate:
1 GHz = 10^9 cycles/s (cycle time 10^–9 s = 1 ns)
200 MHz = 200 × 10^6 cycles/s (cycle time = 5 ns)
The cycle time is also called the clock period.
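
The formula in code (a minimal Python sketch with made-up numbers):

def cpu_time(instructions, avg_cpi, clock_rate_hz):
    # CPU time = Instructions x Average CPI / Clock rate
    return instructions * avg_cpi / clock_rate_hz

# e.g., 10 M executed instructions at an average CPI of 2 on a 200 MHz clock:
print(cpu_time(10_000_000, 2.0, 200e6))  # 0.1 s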
Dynamic Instruction Count
How many instructions are executed in this program fragment? Each “for” consists of two instructions: increment index, check exit condition.

250 instructions
for i = 1, 100 do
    20 instructions
    for j = 1, 100 do
        40 instructions
        for k = 1, 100 do
            10 instructions
        endfor
    endfor
endfor

Innermost loop: 2 + 10 instructions, 100 iterations → 1200 instructions in all.
Middle loop: 2 + 40 + 1200 instructions, 100 iterations → 124,200 instructions in all.
Outer loop: 2 + 20 + 124,200 instructions, 100 iterations → 12,422,200 instructions in all.
Grand total: 250 + 12,422,200 = 12,422,450 instructions executed, versus a static count of 326. (Loops such as “for i = 1, n” or “while x > 0” likewise contribute according to how many times they actually iterate.)
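
The same count, computed from the inside out (a minimal Python sketch):

inner  = 100 * (2 + 10)           # 1,200 instructions
middle = 100 * (2 + 40 + inner)   # 124,200 instructions
outer  = 100 * (2 + 20 + middle)  # 12,422,200 instructions
print(250 + outer)                # 12,422,450 instructions executed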
Faster Clock ≠ Shorter Running Time
Suppose an addition takes 1 ns. With a 1 GHz clock (period = 1 ns), it completes in 1 cycle; with a 2 GHz clock (period = ½ ns), it needs 2 cycles. The addition time does not improve in going from a 1 GHz to a 2 GHz clock.

Figure 4.3 Faster steps do not necessarily mean shorter travel time (climbing the same staircase in 4 long or 20 short steps).
4.3 Performance Enhancement: Amdahl’s Law
f = fraction unaffected; p = speedup of the rest.

s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)

Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast. (Plot: speedup s versus enhancement factor p, from 0 to 50, for f = 0, 0.01, 0.02, 0.05, 0.1.)
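
Amdahl’s law in code (a minimal Python sketch), reproducing part a of Example 4.1 below:

def amdahl_speedup(f, p):
    # f: fraction unaffected; p: speedup applied to the remaining 1 - f
    return 1.0 / (f + (1.0 - f) / p)

print(round(amdahl_speedup(0.7, 2), 2))  # 1.18: adder (30% of time) made 2x faster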
Amdahl’s Law Used in Design
Example 4.1
A processor spends 30% of its time on flp addition, 25% on flp mult,
and 10% on flp division. Evaluate the following enhancements, each
costing the same to implement:
a. Redesign of the flp adder to make it twice as fast.
b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign the flp divider to make it 10 times as fast.
Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18
b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
What if both the adder and the multiplier are redesigned?
Amdahl’s Law Used in Management
Example 4.2
Members of a university research group frequently visit the library.
Each library trip takes 20 minutes. The group decides to subscribe
to a handful of publications that account for 90% of the library trips;
access time to these publications is reduced to 2 minutes.
a. What is the average speedup in access to publications?
b. If the group has 20 members, each making two weekly trips to
the library, what is the justifiable expense for the subscriptions?
Assume 50 working weeks/yr and $25/h for a researcher’s time.
Solution
a. Speedup in publication access time = 1 / [0.1 + 0.9 / 10] = 5.26
b. Time saved = 20 × 2 × 50 × 0.9 × (20 – 2) = 32,400 min = 540 h
Cost recovery = 540 × $25 = $13,500 = Max justifiable expense
4.4 Performance Measurement vs Modeling
(Graph: execution times of programs A through F on machines 1, 2, and 3.)

Figure 4.5 Running times of six programs on three machines.
Generalized Amdahl’s Law
Original running time of a program = 1 = f1 + f2 + . . . + fk

New running time after the fraction fi is speeded up by a factor pi:
    f1/p1 + f2/p2 + . . . + fk/pk

Speedup formula:
    S = 1 / (f1/p1 + f2/p2 + . . . + fk/pk)

If a particular fraction is slowed down rather than speeded up, use sj fj instead of fj/pj, where sj > 1 is the slowdown factor.
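
The generalized formula in code (a minimal Python sketch), applied to the benchmark question of Example 4.3 below: fraction f sped up 2.5x, the rest slowed down 1.2x (a slowdown sj is expressed here as a factor pj = 1/sj < 1):

def gen_speedup(fractions, factors):
    # fractions f_i summing to 1; factors p_i speed up (p > 1) or slow down (p < 1)
    return 1.0 / sum(f / p for f, p in zip(fractions, factors))

print(gen_speedup([0.875, 0.125], [2.5, 1 / 1.2]))  # 2.0 at the threshold f = 0.875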
Performance Benchmarks
Example 4.3
You are an engineer at Outtel, a start-up aspiring to compete with Intel
via its new processor design that outperforms the latest Intel processor
by a factor of 2.5 on floating-point instructions. This level of performance
was achieved by design compromises that led to a 20% increase in the
execution time of all other instructions. You are in charge of choosing
benchmarks that would showcase Outtel’s performance edge.
a. What is the minimum required fraction f of time spent on floating-point
instructions in a program on the Intel processor to show a speedup of
2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl’s formula in which a fraction f
is speeded up by a given factor (2.5) and the rest is slowed down by
another factor (1.2): 1 / [1.2(1 – f) + f / 2.5] ≥ 2 ⇒ f ≥ 0.875
Performance Estimation
Average CPI = Σ over all instruction classes of (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)
Table 4.3 Usage frequency, in percentage, for various
instruction classes in four representative applications.
Application →      Data          C language   Reactor      Atomic motion
Instr’n class ↓    compression   compiler     simulation   modeling
A: Load/Store          25            37           32            37
B: Integer             32            28           17             5
C: Shift/Logic         16            13            2             1
D: Float                0             0           34            42
E: Branch              19            13            9            10
F: All others           8             9            6             4
CPI and IPS Calculations
Example 4.4 (2 of 5 parts)
Consider two implementations M1 (600 MHz) and M2 (500 MHz) of
an instruction set containing three classes of instructions:
Class   CPI for M1   CPI for M2   Comments
F          5.0          4.0       Floating-point
I          2.0          3.8       Integer arithmetic
N          2.4          2.0       Nonarithmetic
a. What are the peak performances of M1 and M2 in MIPS?
b. If 50% of instructions executed are class-N, with the rest divided
equally among F and I, which machine is faster? By what factor?
Solution
a. Peak MIPS for M1 = 600 / 2.0 = 300; for M2 = 500 / 2.0 = 250
b. Average CPI for M1 = 5.0 / 4 + 2.0 / 4 + 2.4 / 2 = 2.95;
for M2 = 4.0 / 4 + 3.8 / 4 + 2.0 / 2 = 2.95 → M1 is faster; factor 1.2
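
Part b, recomputed (a minimal Python sketch):

mix    = {"F": 0.25, "I": 0.25, "N": 0.50}
cpi_m1 = {"F": 5.0, "I": 2.0, "N": 2.4}
cpi_m2 = {"F": 4.0, "I": 3.8, "N": 2.0}

avg1 = sum(mix[c] * cpi_m1[c] for c in mix)     # 2.95
avg2 = sum(mix[c] * cpi_m2[c] for c in mix)     # 2.95
print(avg1, avg2, (600 / avg1) / (500 / avg2))  # equal CPIs -> factor 600/500 = 1.2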
MIPS Rating Can Be Misleading
Example 4.5
Two compilers produce machine code for a program on a machine
with two classes of instructions. Here are the number of instructions:
Class   CPI   Compiler 1   Compiler 2
A        1       600M         400M
B        2       400M         400M
a. What are run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler’s output runs at a higher MIPS rate?
Solution
a. Running time with compiler 1 = (600M × 1 + 400M × 2) / 10^9 = 1.4 s; with compiler 2 = (400M × 1 + 400M × 2) / 10^9 = 1.2 s
b. Compiler 2’s output runs 1.4 / 1.2 = 1.17 times as fast
c. Compiler 1’s output (CPI = 1.4) runs at 1000 / 1.4 = 714 MIPS; compiler 2’s (CPI = 1.5) at 1000 / 1.5 = 667 MIPS. The slower code thus has the higher MIPS rating.
4.5 Reporting Computer Performance
Table 4.4
Measured or estimated execution times for three programs.
                Time on      Time on      Speedup of
                machine X    machine Y    Y over X
Program A           20           200          0.1
Program B         1000           100         10.0
Program C         1500           150         10.0
All 3 prog’s      2520           450          5.6
Analogy: If a car is driven to a city 100 km away at 100 km/hr
and returns at 50 km/hr, the average speed is not (100 + 50) / 2
but is obtained from the fact that it travels 200 km in 3 hours.
Comparing the Overall Performance
Table 4.4 Measured or estimated execution times for three programs.
                  Time on     Time on     Speedup of   Speedup of
                  machine X   machine Y   Y over X     X over Y
Program A             20          200         0.1         10
Program B           1000          100        10.0          0.1
Program C           1500          150        10.0          0.1
Arithmetic mean                               6.7          3.4
Geometric mean                                2.15         0.46
Geometric mean does not yield a measure of overall speedup,
but provides an indicator that at least moves in the right direction
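
The two means, computed (a minimal Python sketch); neither reproduces the true overall speedup of 5.6 from Table 4.4:

from math import prod

speedups_y_over_x = [0.1, 10.0, 10.0]
arith = sum(speedups_y_over_x) / 3          # 6.7
geo   = prod(speedups_y_over_x) ** (1 / 3)  # 2.15
print(round(arith, 1), round(geo, 2))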
Effect of Instruction Mix on Performance
Example 4.6 (1 of 3 parts)
Consider two applications DC and RS and two machines M1 and M2:
Class          Data Comp.   Reactor Sim.   M1’s CPI   M2’s CPI
A: Ld/Str         25%           32%           4.0        3.8
B: Integer        32%           17%           1.5        2.5
C: Sh/Logic       16%            2%           1.2        1.2
D: Float           0%           34%           6.0        2.6
E: Branch         19%            9%           2.5        2.2
F: Other           8%            6%           2.0        2.3
a. Find the effective CPI for the two applications on both machines.
Solution
a. CPI of DC on M1: 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 +
0.19 × 2.5 + 0.08 × 2.0 = 2.31
DC on M2: 2.54
RS on M1: 3.94
RS on M2: 2.89
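
The full set of effective CPIs, recomputed (a minimal Python sketch):

dc_mix = [0.25, 0.32, 0.16, 0.00, 0.19, 0.08]  # classes A-F, Data Comp.
rs_mix = [0.32, 0.17, 0.02, 0.34, 0.09, 0.06]  # classes A-F, Reactor Sim.
m1_cpi = [4.0, 1.5, 1.2, 6.0, 2.5, 2.0]
m2_cpi = [3.8, 2.5, 1.2, 2.6, 2.2, 2.3]

def eff_cpi(mix, cpi):
    return sum(f * c for f, c in zip(mix, cpi))

for name, mix in [("DC", dc_mix), ("RS", rs_mix)]:
    print(name, round(eff_cpi(mix, m1_cpi), 2), round(eff_cpi(mix, m2_cpi), 2))
# DC 2.31 2.54 / RS 3.94 2.89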
4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s:
Gigaflops on the desktop
Teraflops in the supercomputer center
Petaflops on the drawing board
Note on terminology (see Table 3.1)
Prefixes for large units:
Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15
For memory:
K = 2^10 = 1024, M = 2^20, G = 2^30, T = 2^40, P = 2^50
Prefixes for small units:
micro = 10^–6, nano = 10^–9, pico = 10^–12, femto = 10^–15
Performance Trends and Obsolescence
(Figure 3.10 repeated: processor performance rising from kIPS toward TIPS at ×1.6/yr, and DRAM chip capacity from kb toward Tb at ×4/3 yrs, over 1980-2010.)
“Can I call you back? We
just bought a new computer
and we’re trying to set it up
before it’s obsolete.”
Supercomputer Performance

Supercomputer performance has grown from MFLOPS through GFLOPS (Cray X-MP and Y-MP vector supercomputers) and TFLOPS (CM-2 and CM-5 massively parallel processors) toward PFLOPS ($30M and $240M MPPs), over calendar years 1980-2010.

Figure 4.7 Exponential growth of supercomputer performance.
The Most Powerful Computers
Each ASCI machine is planned, developed, and used in turn: ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White (10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), and ASCI Purple (100+ TFLOPS, 20 TB), over calendar years 1995-2010.

Figure 4.8 Milestones in the DOE’s Accelerated Strategic Computing Initiative (ASCI) program with extrapolation up to the PFLOPS level.
Performance is Important, But It Isn’t Everything
Plotted over 1980-2010 (kIPS to TIPS): absolute processor performance keeps climbing, while DSP performance per watt stays well above general-purpose (GP) processor performance per watt.

Figure 25.1 Trend in computational performance per watt of power used in general-purpose processors and DSPs.
Roadmap for the Rest of the Book
Fasten your seatbelts
as we begin our ride!
Ch. 5-8: A simple ISA,
variations in ISA
Ch. 9-12: ALU design
Ch. 13-14: Data path
and control unit design
Ch. 15-16: Pipelining
and its limits
Ch. 17-20: Memory
(main, mass, cache, virtual)
Ch. 21-24: I/O, buses,
interrupts, interfacing
Ch. 25-28: Vector and
parallel processing
Part II
Instruction-Set Architecture
A Few Words About Where We Are Headed
Performance = 1 / Execution time, simplified to 1 / CPU execution time
CPU execution time = Instructions × CPI / (Clock rate)
Performance = Clock rate / (Instructions × CPI)

• Define an instruction set; make it simple enough to require a small number of cycles and allow high clock rate, but not so simple that we need many instructions, even for very simple tasks (Chap 5-8)
• Design ALU for arithmetic & logic ops (Chap 9-12)
• Design hardware for CPI = 1; seek improvements with CPI > 1 (Chap 13-14)
• Try to achieve CPI = 1 with clock that is as high as that for CPI > 1 designs; is CPI < 1 feasible? (Chap 15-16)
• Design memory & I/O structures to support ultrahigh-speed CPUs
Strategies for Speeding Up Instruction Execution
Performance = 1 / Execution time
simplified to 1 / CPU execution time
CPU execution time = Instructions × CPI / (Clock rate)
Performance = Clock rate / ( Instructions × CPI )
Assembly line analogy: items that take longest to inspect dictate the speed of the assembly line. Both the single-cycle approach (CPI = 1) and the multicycle approach (CPI > 1) can be made faster, and parallel processing or pipelining goes further still.
II Instruction Set Architecture
Introduce the machine’s “words” and its “vocabulary,” learning:
• A simple, yet realistic and useful instruction set
• Machine language programs; how they are executed
• RISC vs CISC instruction-set design philosophy
Topics in This Part
Chapter 5 Instructions and Addressing
Chapter 6 Procedures and Data
Chapter 7 Assembly Language Programs
Chapter 8 Instruction Set Variations
5 Instructions and Addressing
First of two chapters on the instruction set of MiniMIPS:
• Required for hardware concepts in later chapters
• Not aiming for proficiency in assembler programming
Topics in This Chapter
5.1 Abstract View of Hardware
5.2 Instruction Formats
5.3 Simple Arithmetic / Logic Instructions
5.4 Load and Store Instructions
5.5 Jump and Branch Instructions
5.6 Addressing Modes
5.1 Abstract View of Hardware
Memory: up to 2^30 words (byte capacity m ≤ 2^32; locations 0, 4, 8, ..., m – 8, m – 4, at 4 B per location).
EIU (execution & integer unit; the main processor): registers $0-$31, an ALU, and an integer mul/div unit with Hi and Lo registers (Chapters 10 and 11).
FPU (floating-point unit; Coproc. 1): registers $0-$31 and FP arithmetic (Chapter 12).
TMU (trap & memory unit; Coproc. 0): BadVaddr, Status, Cause, and EPC registers.

Figure 5.1 Memory and processing subsystems for MiniMIPS.
Data Types
Byte = 8 bits
Halfword = 2 bytes
Word = 4 bytes
Doubleword = 8 bytes (used only for floating-point data, so safe to ignore in this course)
Quadword (16 bytes) also used occasionally

MiniMIPS registers hold 32-bit (4-byte) words. Other common data sizes include byte, halfword, and doubleword.
Register Conventions

$0        $zero     Constant 0
$1        $at       Reserved for assembler use
$2-$3     $v0-$v1   Procedure results
$4-$7     $a0-$a3   Procedure arguments
$8-$15    $t0-$t7   Temporary values
$16-$23   $s0-$s7   Operands, saved across procedure calls
$24-$25   $t8-$t9   More temporaries
$26-$27   $k0-$k1   Reserved for OS (kernel)
$28       $gp       Global pointer
$29       $sp       Stack pointer
$30       $fp       Frame pointer
$31       $ra       Return address (saved)

Data sizes: byte, word, doubleword. Byte numbering within a word: 3, 2, 1, 0. A 4-byte word sits in consecutive memory addresses according to the big-endian order (most significant byte has the lowest address), and a doubleword sits in consecutive registers or memory locations likewise (most significant word comes first). When loading a byte into a register, it goes in the low end.

Figure 5.2 Registers and data sizes in MiniMIPS.
Registers Used in This Chapter
This chapter uses the 10 temporary registers $8-$15 and $24-$25 ($t0-$t7 and $t8-$t9, for temporary values and more temporaries) and the 8 operand registers $16-$23 ($s0-$s7, saved across procedure calls). Analogy for register usage conventions: temporaries are like loose change; saved operands are like a wallet and keys.

Figure 5.2 (partial) Registers and data sizes in MiniMIPS.
5.2 Instruction Formats
High-level language statement:   a = b + c
Assembly language instruction:   add $t8, $s2, $s1
Machine language instruction:    000000 10010 10001 11000 00000 100000
                                 (ALU-type opcode | register 18 | register 17 |
                                  register 24 | unused | addition fn code)

Execution steps: instruction fetch (PC into the instruction cache), register readout ($17 and $18 from the register file), ALU operation, data read/store (data cache, not used here), and register writeback ($24).

Figure 5.3 A typical instruction for MiniMIPS and steps in its execution.
Add, Subtract, and Specification of Constants
MiniMIPS add & subtract instructions; e.g., compute:
g = (b + c) − (e + f)
add $t8,$s2,$s3   # put the sum b + c in $t8
add $t9,$s5,$s6   # put the sum e + f in $t9
sub $s7,$t8,$t9   # set g to ($t8) − ($t9)
Decimal and hex constants
Decimal
Hexadecimal
25, 123456, −2873
0x59, 0x12b4c6, 0xffff0000
Machine instruction typically contains
an opcode
one or more source operands
possibly a destination operand
MiniMIPS Instruction Formats
R format (bits 31-0): op (6 bits) | rs (5) | rt (5) | rd (5) | sh (5) | fn (6)
    Opcode; source register 1; source register 2; destination register; shift amount; opcode extension.
I format: op (6 bits) | rs (5) | rt (5) | operand/offset (16)
    Opcode; source or base; destination or data; immediate operand or address offset.
J format: op (6 bits) | jump target address (26)
    Opcode; memory word address (byte address divided by 4).
Figure 5.4 MiniMIPS instructions come in only three formats:
register (R), immediate (I), and jump (J).
5.3 Simple Arithmetic/Logic Instructions
Add and subtract already discussed; logical instructions are similar
add $t0,$s0,$s1   # set $t0 to ($s0)+($s1)
sub $t0,$s0,$s1   # set $t0 to ($s0)-($s1)
and $t0,$s0,$s1   # set $t0 to ($s0)∧($s1)
or  $t0,$s0,$s1   # set $t0 to ($s0)∨($s1)
xor $t0,$s0,$s1   # set $t0 to ($s0)⊕($s1)
nor $t0,$s0,$s1   # set $t0 to (($s0)∨($s1))′

(R format: ALU-instruction opcode; source registers 1 and 2; destination register; unused shift amount; fn field, with add = 32 and sub = 34.)
Figure 5.5 The arithmetic instructions add and sub have a format that
is common to all two-operand ALU instructions. For these, the fn field
specifies the arithmetic/logic operation to be performed.
Arithmetic/Logic with One Immediate Operand
An operand in the range [−32 768, 32 767], or [0x0000, 0xffff],
can be specified in the immediate field.
addi $t0,$s0,61       # set $t0 to ($s0)+61
andi $t0,$s0,61       # set $t0 to ($s0)∧61
ori  $t0,$s0,61       # set $t0 to ($s0)∨61
xori $t0,$s0,0x00ff   # set $t0 to ($s0)⊕0x00ff

For arithmetic instructions, the immediate operand is sign-extended.

(I format shown for addi = 8: source $s0, destination $t0, immediate operand 61.)
Figure 5.6 Instructions such as addi allow us to perform an
arithmetic or logic operation for which one operand is a small constant.
5.4 Load and Store Instructions
lw $t0,40($s3)   # address = base register $s3 plus offset 40
lw $t0,A($s3)    # address = base register $s3 plus offset A

(I format: lw = 35, sw = 43; base register rs, data register rt, 16-bit offset. With the base register pointing at array A in memory, offset = 4i reaches element A[i].)
Offset relative to base
Note on base and offset:
The memory address is the sum
of (rs) and an immediate value.
Calling one of these the base
and the other the offset is quite
arbitrary. It would make perfect
sense to interpret the address
A($s3) as having the base A
and the offset ($s3). However,
a 16-bit base confines us to a
small portion of memory space.
Figure 5.7 MiniMIPS lw and sw instructions and their memory
addressing convention that allows for simple access to array elements
via a base address and an offset (offset = 4i leads us to the i th word).
lw, sw, and lui Instructions
lw  $t0,40($s3)   # load mem[40+($s3)] in $t0
sw  $t0,A($s3)    # store ($t0) in mem[A+($s3)]
                  # “($s3)” means “content of $s3”
lui $s0,61        # the immediate value 61 is loaded in the upper
                  # half of $s0, with the lower 16 bits set to 0s

(I format: lui = 15; unused rs; destination rt; immediate operand. Content of $s0 after the instruction is executed: 0000 0000 0011 1101 0000 0000 0000 0000.)
Figure 5.8 The lui instruction allows us to load an arbitrary 16-bit
value into the upper half of a register while setting its lower half to 0s.
Initializing a Register
Example 5.2
Show how each of these bit patterns can be loaded into $s0:
0010 0001 0001 0000 0000 0000 0011 1101
1111 1111 1111 1111 1111 1111 1111 1111
Solution
The first bit pattern has the hex representation: 0x2110003d
lui $s0,0x2110   # put the upper half in $s0
ori $s0,0x003d   # put the lower half in $s0

Same can be done, with immediate values changed to 0xffff for the second bit pattern. But, the following is simpler and faster:

nor $s0,$zero,$zero   # because (0 ∨ 0)′ = 1
5.5 Jump and Branch Instructions
Unconditional jump and jump through register instructions
j  verify   # go to mem loc named “verify”
jr $ra      # go to address that is in $ra;
            # $ra may hold a return address

($ra is the symbolic name for reg. $31, the return address.)

(J format: j = 2, with a 26-bit jump target; the effective 32-bit target address is formed from the upper bits of the incremented PC and the word address multiplied by 4. R format: jr = 8, with only the source register field used.)
Figure 5.9 The jump instruction j of MiniMIPS is a J-type instruction which
is shown along with how its effective target address is obtained. The jump
register (jr) instruction is R-type, with its specified register often being $ra.
Conditional Branch Instructions
Conditional branches use PC-relative addressing
bltz $s1,L       # branch on ($s1)<0
beq  $s1,$s2,L   # branch on ($s1)=($s2)
bne  $s1,$s2,L   # branch on ($s1)≠($s2)

(I format: bltz = 1, with a source register and a zero rt field; beq = 4 and bne = 5, with source 1 and source 2. In all cases the 16-bit operand/offset is the relative branch distance in words.)

Figure 5.10 (part 1) Conditional branch instructions of MiniMIPS.
Comparison Instructions for Conditional Branching
slt  $s1,$s2,$s3   # if ($s2)<($s3), set $s1 to 1;
                   # else set $s1 to 0;
                   # often followed by beq/bne
slti $s1,$s2,61    # if ($s2)<61, set $s1 to 1;
                   # else set $s1 to 0

(R format for slt: fn = 42; I format for slti: op = 10.)

Figure 5.10 (part 2) Comparison instructions of MiniMIPS.
Examples for Conditional Branching
If the branch target is too far to be reachable with a 16-bit offset
(rare occurrence), the assembler automatically replaces the branch
instruction beq $s0,$s1,L1 with:
    bne $s1,$s2,L2   # skip jump if (s1)≠(s2)
    j   L1           # goto L1 if (s1)=(s2)
L2: ...

Forming if-then constructs; e.g., if (i == j) x = x + y

       bne $s1,$s2,endif   # branch on i≠j
       add $t1,$t1,$t2     # execute the “then” part
endif: ...

If the condition were (i < j), we would change the first line to:

    slt $t0,$s1,$s2    # set $t0 to 1 if i<j
    beq $t0,$0,endif   # branch if ($t0)=0;
                       # i.e., i not< j or i≥j
Compiling if-then-else Statements
Example 5.3
Show a sequence of MiniMIPS instructions corresponding to:
if (i<=j) x = x+1; z = 1; else y = y–1; z = 2*z
Solution
Similar to the “if-then” statement, but we need instructions for the
“else” part and a way of skipping the “else” part after the “then” part.
       slt  $t0,$s2,$s1      # j<i? (inverse condition)
       bne  $t0,$zero,else   # if j<i goto else part
       addi $t1,$t1,1        # begin then part: x = x+1
       addi $t3,$zero,1      # z = 1
       j    endif            # skip the else part
else:  addi $t2,$t2,-1       # begin else part: y = y–1
       add  $t3,$t3,$t3      # z = z+z
endif: ...
5.6 Addressing Modes
Addressing modes differ in which elements of the machine are involved in forming the operand or its address:
• Implied: the operand comes from some place in the machine known implicitly.
• Immediate: the operand is a constant in the instruction, extended if required.
• Register: a register spec selects the operand from the register file.
• Base: register data plus a constant offset form the memory address of the operand.
• PC-relative: a constant offset added to the incremented PC forms the memory address.
• Pseudodirect: a constant from the instruction supplies most of the memory address.

Figure 5.11 Schematic representation of addressing modes in MiniMIPS.
Finding the Maximum Value in a List of Integers
Example 5.5
List A is stored in memory beginning at the address given in $s1.
List length is given in $s2.
Find the largest integer in the list and copy it into $t0.
Solution
Scan the list, holding the largest element identified thus far in $t0.
       lw   $t0,0($s1)       # initialize maximum to A[0]
       addi $t1,$zero,0      # initialize index i to 0
loop:  add  $t1,$t1,1        # increment index i by 1
       beq  $t1,$s2,done     # if all elements examined, quit
       add  $t2,$t1,$t1      # compute 2i in $t2
       add  $t2,$t2,$t2      # compute 4i in $t2
       add  $t2,$t2,$s1      # form address of A[i] in $t2
       lw   $t3,0($t2)       # load value of A[i] into $t3
       slt  $t4,$t0,$t3      # maximum < A[i]?
       beq  $t4,$zero,loop   # if not, repeat with no change
       addi $t0,$t3,0        # if so, A[i] is the new maximum
       j    loop             # change completed; now repeat
done:  ...                   # continuation of the program
The 20 MiniMIPS
Instructions
Covered So Far
(Instruction formats R, I, and J are as given in Figure 5.4.)

Table 5.1

Class              Instruction               Usage             op   fn
Copy               Load upper immediate      lui  rt,imm       15
Arithmetic         Add                       add  rd,rs,rt      0   32
                   Subtract                  sub  rd,rs,rt      0   34
                   Set less than             slt  rd,rs,rt      0   42
                   Add immediate             addi rt,rs,imm     8
                   Set less than immediate   slti rt,rs,imm    10
Logic              AND                       and  rd,rs,rt      0   36
                   OR                        or   rd,rs,rt      0   37
                   XOR                       xor  rd,rs,rt      0   38
                   NOR                       nor  rd,rs,rt      0   39
                   AND immediate             andi rt,rs,imm    12
                   OR immediate              ori  rt,rs,imm    13
                   XOR immediate             xori rt,rs,imm    14
Memory access      Load word                 lw   rt,imm(rs)   35
                   Store word                sw   rt,imm(rs)   43
Control transfer   Jump                      j    L             2
                   Jump register             jr   rs            0    8
                   Branch less than 0        bltz rs,L          1
                   Branch equal              beq  rs,rt,L       4
                   Branch not equal          bne  rs,rt,L       5
6 Procedures and Data
Finish our study of MiniMIPS instructions and data types:
• Instructions for procedure call/return, misc. instructions
• Procedure parameters and results, utility of stack
Topics in This Chapter
6.1 Simple Procedure Calls
6.2 Using the Stack for Data Storage
6.3 Parameters and Results
6.4 Data Types
6.5 Arrays and Pointers
6.6 Additional Instructions
6.1 Simple Procedure Calls

Using a procedure involves the following sequence of actions:

1. Put arguments in places known to the procedure (reg’s $a0-$a3)
2. Transfer control to the procedure, saving the return address (jal)
3. Acquire storage space, if required, for use by the procedure
4. Perform the desired task
5. Put results in places known to the calling program (reg’s $v0-$v1)
6. Return control to the calling point (jr)

MiniMIPS instructions for procedure call and return from procedure
(a minimal calling sketch follows):

   jal   proc   # jump to loc “proc” and link;
                # “link” means “save the return
                # address” (PC)+4 in $ra ($31)
   jr    rs     # go to loc addressed by rs
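As a concrete illustration, here is a minimal sketch (our own example,
not from the text) of a caller and a callee that doubles its argument:

main:   addi  $a0,$zero,21   # argument = 21
        jal   double         # call; $ra gets the return address
        ...                  # upon return, $v0 holds 42

double: add   $v0,$a0,$a0    # result = 2 × argument
        jr    $ra            # return to the caller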
Illustrating a Procedure Call

[Figure: the main program prepares to call and executes jal proc,
which sets the PC to the procedure proc; the procedure saves
registers, etc., does its work, restores registers, and returns with
jr $ra; main then prepares to continue.]

Figure 6.1  Relationship between the main program and a procedure.
Recalling Register Conventions

$0        $zero     Constant 0
$1        $at       Reserved for assembler use
$2-$3     $v0-$v1   Procedure results
$4-$7     $a0-$a3   Procedure arguments
$8-$15    $t0-$t7   Temporary values
$16-$23   $s0-$s7   Operands saved across procedure calls
$24-$25   $t8-$t9   More temporaries
$26-$27   $k0-$k1   Reserved for OS (kernel)
$28       $gp       Global pointer
$29       $sp       Stack pointer
$30       $fp       Frame pointer
$31       $ra       Return address

A 4-byte word sits in consecutive memory addresses according to the
big-endian order (most significant byte has the lowest address); byte
numbering within a word is 3, 2, 1, 0. When loading a byte into a
register, it goes in the low end of the register. A doubleword sits in
consecutive registers or memory locations according to the big-endian
order (most significant word comes first).

Figure 5.2  Registers and data sizes in MiniMIPS.
A Simple MiniMIPS Procedure

Example 6.1

Procedure to find the absolute value of an integer:  $v0 ← |($a0)|

Solution

The absolute value of x is –x if x < 0 and x otherwise.

abs:   sub   $v0,$zero,$a0  # put -($a0) in $v0;
                            # in case ($a0) < 0
       bltz  $a0,done       # if ($a0)<0 then done
       add   $v0,$a0,$zero  # else put ($a0) in $v0
done:  jr    $ra            # return to calling program

In practice, we seldom use such short procedures because of the
overhead that they entail. In this example, we have 3-4 instructions
of overhead for 3 instructions of useful computation.
Nested Procedure Calls

[Figure: main calls procedure abc with jal abc; abc saves $ra (among
other things), calls procedure xyz with jal xyz, restores, and returns
with jr $ra; xyz returns to abc with jr $ra. Note: the version of this
figure in the text is incorrect.]

Figure 6.2  Example of nested procedure calls.
6.2 Using the Stack for Data Storage

Analogy: a cafeteria stack of plates/trays.

Push c:  sp = sp – 4;  mem[sp] = c
Pop x:   x = mem[sp];  sp = sp + 4

push:  addi  $sp,$sp,-4
       sw    $t4,0($sp)

pop:   lw    $t5,0($sp)
       addi  $sp,$sp,4

Figure 6.4  Effects of push and pop operations on a stack.
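For instance, a caller that needs $t0 preserved across a call can
bracket the call with a push and a pop (a sketch of the common idiom;
the procedure name “helper” is hypothetical):

       addi  $sp,$sp,-4     # push $t0 onto the stack
       sw    $t0,0($sp)
       jal   helper         # the call may clobber $t0
       lw    $t0,0($sp)     # pop $t0 back off the stack
       addi  $sp,$sp,4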
Memory Map in MiniMIPS

Hex address 00000000:  Reserved (1 M words)
Hex address 00400000:  Text segment, holding the program (63 M words)
Hex address 10000000:  Data segment: static data, then dynamic data
                       (448 M words); $gp ($28) points at 10008000, so
                       static data up to 1000ffff is addressable with
                       a 16-bit signed offset
Hex address 7ffffffc:  Top of the stack segment; the stack grows
                       downward, with $sp ($29) and $fp ($30)
                       pointing into it
Hex address 80000000:  Second half of the address space, reserved for
                       memory-mapped I/O

Figure 6.3  Overview of the memory address space in MiniMIPS.
6.3 Parameters and Results

The stack allows us to pass/return an arbitrary number of values.

[Figure: before calling, $fp and $sp delimit the frame for the current
procedure, holding items a, b, c, ...; after calling, the old ($fp) is
saved, $fp points at the frame for the (new) current procedure, which
holds saved registers and local variables such as y and z, with $sp at
its top and the previous procedure’s frame below.]

Figure 6.5  Use of the stack by a procedure.
Example of Using the Stack

Saving $fp, $ra, and $s0 onto the stack and restoring
them at the end of the procedure:

proc:  sw    $fp,-4($sp)    # save the old frame pointer
       addi  $fp,$sp,0      # save ($sp) into $fp
       addi  $sp,$sp,-12    # create 3 spaces on top of stack
       sw    $ra,-8($fp)    # save ($ra) in 2nd stack element
       sw    $s0,-12($fp)   # save ($s0) in top stack element
       .
       .
       .
       lw    $s0,-12($fp)   # put top stack element in $s0
       lw    $ra,-8($fp)    # put 2nd stack element in $ra
       addi  $sp,$fp,0      # restore $sp to original state
       lw    $fp,-4($sp)    # restore $fp to original state
       jr    $ra            # return from procedure
6.4 Data Types

Data size (number of bits), data type (meaning assigned to bits)

Signed integer:         byte, word
Unsigned integer:       byte, word
Floating-point number:  word, doubleword
Bit string:             byte, word, doubleword

Converting from one size to another:

Type      8-bit number  Value  32-bit version of the number
Unsigned  0010 1011      43    0000 0000 0000 0000 0000 0000 0010 1011
Unsigned  1010 1011     171    0000 0000 0000 0000 0000 0000 1010 1011
Signed    0010 1011     +43    0000 0000 0000 0000 0000 0000 0010 1011
Signed    1010 1011     –85    1111 1111 1111 1111 1111 1111 1010 1011
ASCII Characters

Table 6.1  ASCII (American standard code for information interchange)

Row   Col: 0    1    2   3   4   5   6   7
 0        NUL  DLE  SP   0   @   P   `   p
 1        SOH  DC1   !   1   A   Q   a   q
 2        STX  DC2   “   2   B   R   b   r
 3        ETX  DC3   #   3   C   S   c   s
 4        EOT  DC4   $   4   D   T   d   t
 5        ENQ  NAK   %   5   E   U   e   u
 6        ACK  SYN   &   6   F   V   f   v
 7        BEL  ETB   ‘   7   G   W   g   w
 8        BS   CAN   (   8   H   X   h   x
 9        HT   EM    )   9   I   Y   i   y
 a        LF   SUB   *   :   J   Z   j   z
 b        VT   ESC   +   ;   K   [   k   {
 c        FF   FS    ,   <   L   \   l   |
 d        CR   GS    -   =   M   ]   m   }
 e        SO   RS    .   >   N   ^   n   ~
 f        SI   US    /   ?   O   _   o   DEL

Columns 8-9 (more controls) and a-f (more symbols) complete the 8-bit
ASCII code. A character’s code is (col #, row #)hex; e.g., the code
for + is (2b)hex or (0010 1011)two.
Loading and Storing Bytes

Bytes can be used to store ASCII characters or small integers.
MiniMIPS addresses refer to bytes, but registers hold words.

lb    $t0,8($s3)   # load rt with mem[8+($s3)];
                   # sign-extend to fill reg
lbu   $t0,8($s3)   # load rt with mem[8+($s3)];
                   # zero-extend to fill reg
sb    $t0,A($s3)   # LSB of rt to mem[A+($s3)]

[Figure: I-format encoding: op (lb = 32, lbu = 36, sb = 40) |
rs (base register) | rt (data register) | 16-bit address offset.]

Figure 6.6  Load and store instructions for byte-size data elements.
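As a worked illustration of the difference (our own example): if the
addressed byte holds 1111 1111, then lb places (ffffffff)hex, i.e.,
–1, in the register, whereas lbu places (000000ff)hex, i.e., 255;
for the byte 0010 1011, both place (0000002b)hex, i.e., 43.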
Meaning of a Word in Memory

The bit pattern (02114020)hex, that is,
0000 0010 0001 0001 0100 0000 0010 0000, can be read as an Add
instruction, as a positive integer, or as a four-character string.

Figure 6.7  A 32-bit word has no inherent meaning and can be
interpreted in a number of equally valid ways in the absence of
other cues (e.g., context) for the intended meaning.
6.5 Arrays and Pointers

Index: Use a register that holds the index i and increment the
register in each step to effect moving from element i of the list to
element i + 1

Pointer: Use a register that points to (holds the address of) the
list element being examined and update it in each step to point to
the next element

[Figure: with the indexing method, each step adds 1 to i, computes 4i,
and adds 4i to the base address of array A to reach A[i + 1]; with the
pointer method, each step simply adds 4 to the pointer to A[i].]

Figure 6.8  Stepping through the elements of an array using the
indexing method and the pointer updating method.
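To make the pointer method concrete, here is a small sketch (our own
example) that sums a word array whose starting address is in $s0 and
whose one-past-the-end address is in $s1:

       add   $t0,$zero,$zero  # running sum = 0
       add   $t1,$s0,$zero    # $t1 points to the current element
sloop: beq   $t1,$s1,sdone    # reached the end of the array?
       lw    $t2,0($t1)       # load the current element
       add   $t0,$t0,$t2      # add it to the sum
       addi  $t1,$t1,4        # advance pointer to the next element
       j     sloop
sdone: ...                    # the sum is in $t0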
Selection Sort

Example 6.4

To sort a list of numbers, repeatedly perform the following:
Find the max element, swap it with the last item, move up the “last”
pointer

[Figure: at the start of an iteration, pointers first and last delimit
the unsorted part of list A and the maximum element x sits somewhere
in that range; once the maximum is identified, x and the last element
y are swapped, and at the end of the iteration last has moved up by
one element.]

Figure 6.9  One iteration of selection sort.
Selection Sort Using the Procedure max

Example 6.4 (continued)

Inputs to proc max: first in $a0, last in $a1.
Outputs from proc max: address of the maximum in $v0, its value in $v1.

sort:  beq   $a0,$a1,done   # single-element list is sorted
       jal   max            # call the max procedure
       lw    $t0,0($a1)     # load last element into $t0
       sw    $t0,0($v0)     # copy the last element to max loc
       sw    $v1,0($a1)     # copy max value to last element
       addi  $a1,$a1,-4     # decrement pointer to last element
       j     sort           # repeat sort for smaller list
done:  ...                  # continue with rest of program
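The max procedure itself is not shown on these slides; here is a
minimal sketch consistent with the register conventions above (inputs
in $a0/$a1, outputs in $v0/$v1; the labels mloop and mdone are ours):

max:   add   $v0,$a0,$zero  # address of max so far = first
       lw    $v1,0($a0)     # value of max so far = A[first]
       add   $t0,$a0,$zero  # $t0 scans the list
mloop: beq   $t0,$a1,mdone  # all elements examined?
       addi  $t0,$t0,4      # advance to the next element
       lw    $t1,0($t0)     # load that element
       slt   $t2,$v1,$t1    # current max < element?
       beq   $t2,$zero,mloop # if not, keep scanning
       add   $v0,$t0,$zero  # else record new max address
       add   $v1,$t1,$zero  # and new max value
       j     mloop
mdone: jr    $ra            # return with max addr/value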
6.6 Additional Instructions

MiniMIPS instructions for multiplication and division:

mult  $s0,$s1   # set Hi,Lo to ($s0)×($s1)
div   $s0,$s1   # set Hi to ($s0)mod($s1)
                # and Lo to ($s0)/($s1)
mfhi  $t0       # set $t0 to (Hi)
mflo  $t0       # set $t0 to (Lo)

[Figure: R-format encodings of mult/div: op = 0 | rs (source 1) |
rt (source 2) | rd, sh unused | fn (mult = 24, div = 26); the
multiply/divide unit deposits its results in the Hi and Lo registers
next to the register file.]

Figure 6.10  The multiply (mult) and divide (div) instructions of
MiniMIPS.

[Figure: R-format encodings of mfhi/mflo: op = 0 | rs, rt, sh unused |
rd (destination) | fn (mfhi = 16, mflo = 18).]

Figure 6.11  MiniMIPS instructions for copying the contents of the Hi
and Lo registers into general registers.
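For instance, to compute x*y + z for word operands in $s0, $s1, and
$s2 (our own sketch, ignoring any part of the product that spills
into Hi):

       mult  $s0,$s1        # Hi,Lo = ($s0)×($s1)
       mflo  $t0            # $t0 = low word of the product
       add   $t0,$t0,$s2    # $t0 = x*y + z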
Logical Shifts

MiniMIPS instructions for left and right shifting:

sll   $t0,$s1,2     # $t0=($s1) left-shifted by 2
srl   $t0,$s1,2     # $t0=($s1) right-shifted by 2
sllv  $t0,$s1,$s0   # $t0=($s1) left-shifted by ($s0)
srlv  $t0,$s1,$s0   # $t0=($s1) right-shifted by ($s0)

[Figure: R-format encodings. sll/srl: op = 0 | rs unused | rt (source)
| rd (destination) | sh (shift amount) | fn (sll = 0, srl = 2).
sllv/srlv: op = 0 | rs (amount register) | rt (source) |
rd (destination) | sh unused | fn (sllv = 4, srlv = 6).]

Figure 6.12  The four logical shift instructions of MiniMIPS.
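Shifts often stand in for multiplication by small constants; e.g.,
10x = 8x + 2x can be computed without mult (a sketch of the standard
trick, with x in $s0):

       sll   $t1,$s0,3      # $t1 = 8x
       sll   $t2,$s0,1      # $t2 = 2x
       add   $t0,$t1,$t2    # $t0 = 10x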
Unsigned Arithmetic and Miscellaneous Instructions

MiniMIPS instructions for unsigned arithmetic (no overflow exception):

addu   $t0,$s0,$s1   # set $t0 to ($s0)+($s1)
subu   $t0,$s0,$s1   # set $t0 to ($s0)–($s1)
multu  $s0,$s1       # set Hi,Lo to ($s0)×($s1)
divu   $s0,$s1       # set Hi to ($s0)mod($s1)
                     # and Lo to ($s0)/($s1)
addiu  $t0,$s0,61    # set $t0 to ($s0)+61;
                     # the immediate operand is
                     # sign extended

To make MiniMIPS more powerful and complete, we introduce later:

sra    $t0,$s1,2     # shift right arithmetic (Sec. 10.5)
srav   $t0,$s1,$s0   # shift right arithmetic variable
syscall              # system call (Sec. 7.6)
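Arithmetic right shifts preserve the sign, whereas logical ones do
not; e.g., if $s1 holds –8, i.e., (fffffff8)hex, then (our own
example):

       sra   $t0,$s1,2      # $t0 = fffffffe = –2
       srl   $t1,$s1,2      # $t1 = 3ffffffe, a large positive value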
The 20 MiniMIPS Instructions from Chapter 6 (40 in all so far)

[Figure: the R, I, and J instruction formats, repeated from Chapter 5.]

Table 6.2 (partial)

Class            Instruction                Usage             op   fn
Copy             Move from Hi               mfhi  rd           0   16
                 Move from Lo               mflo  rd           0   18
Arithmetic       Add unsigned               addu  rd,rs,rt     0   33
                 Subtract unsigned          subu  rd,rs,rt     0   35
                 Multiply                   mult  rs,rt        0   24
                 Multiply unsigned          multu rs,rt        0   25
                 Divide                     div   rs,rt        0   26
                 Divide unsigned            divu  rs,rt        0   27
                 Add immediate unsigned     addiu rt,rs,imm    9
Shift            Shift left logical         sll   rd,rt,sh     0    0
                 Shift right logical        srl   rd,rt,sh     0    2
                 Shift right arithmetic     sra   rd,rt,sh     0    3
                 Shift left logical var.    sllv  rd,rt,rs     0    4
                 Shift right logical var.   srlv  rd,rt,rs     0    6
                 Shift right arith var.     srav  rd,rt,rs     0    7
Memory access    Load byte                  lb    rt,imm(rs)  32
                 Load byte unsigned         lbu   rt,imm(rs)  36
                 Store byte                 sb    rt,imm(rs)  40
Control transfer Jump and link              jal   L            3
                 System call                syscall            0   12
Table 6.2  The 37 + 3 MiniMIPS Instructions Covered So Far

Instruction               Usage            Instruction               Usage
Load upper immediate      lui  rt,imm      Move from Hi              mfhi  rd
Add                       add  rd,rs,rt    Move from Lo              mflo  rd
Subtract                  sub  rd,rs,rt    Add unsigned              addu  rd,rs,rt
Set less than             slt  rd,rs,rt    Subtract unsigned         subu  rd,rs,rt
Add immediate             addi rt,rs,imm   Multiply                  mult  rs,rt
Set less than immediate   slti rt,rs,imm   Multiply unsigned         multu rs,rt
AND                       and  rd,rs,rt    Divide                    div   rs,rt
OR                        or   rd,rs,rt    Divide unsigned           divu  rs,rt
XOR                       xor  rd,rs,rt    Add immediate unsigned    addiu rt,rs,imm
NOR                       nor  rd,rs,rt    Shift left logical        sll   rd,rt,sh
AND immediate             andi rt,rs,imm   Shift right logical       srl   rd,rt,sh
OR immediate              ori  rt,rs,imm   Shift right arithmetic    sra   rd,rt,sh
XOR immediate             xori rt,rs,imm   Shift left logical var.   sllv  rd,rt,rs
Load word                 lw   rt,imm(rs)  Shift right logical var.  srlv  rd,rt,rs
Store word                sw   rt,imm(rs)  Shift right arith var.    srav  rd,rt,rs
Jump                      j    L           Load byte                 lb    rt,imm(rs)
Jump register             jr   rs          Load byte unsigned        lbu   rt,imm(rs)
Branch less than 0        bltz rs,L        Store byte                sb    rt,imm(rs)
Branch equal              beq  rs,rt,L     Jump and link             jal   L
Branch not equal          bne  rs,rt,L     System call               syscall
7 Assembly Language Programs
Everything else needed to build and run assembly programs:
• Supply info to assembler about program and its data
• Non-hardware-supported instructions for convenience
Topics in This Chapter
7.1 Machine and Assembly Languages
7.2 Assembler Directives
7.3 Pseudoinstructions
7.4 Macroinstructions
7.5 Linking and Loading
7.6 Running Assembler Programs
7.1 Machine and Assembly Languages

[Figure: an assembly language program,

   add  $2,$5,$5        00a51020
   add  $2,$2,$2        00421020
   add  $2,$4,$2        00821020
   lw   $15,0($2)       8c620000
   lw   $16,4($2)       8cf20004
   sw   $16,0($2)       acf20000
   sw   $15,4($2)       ac620004
   jr   $31             03e00008

is translated by the assembler into a machine language program (MIPS,
80x86, PowerPC, etc.); the linker combines it with library routines
(in machine language) to form an executable machine language program,
whose memory content the loader places in memory.]

Figure 7.1  Steps in transforming an assembly language program to
an executable program residing in memory.
Symbol Table

Assembly language program (location of each instruction shown):

 0        addi  $s0,$zero,9
 4        sub   $t0,$s0,$s0
 8        add   $t1,$zero,$zero
12  test: bne   $t0,$s0,done
16        addi  $t0,$t0,1
20        add   $t1,$s0,$zero
24        j     test
28  done: sw    $t1,result($gp)

Symbol table (with some entries determined from assembler directives
not shown here):  done = 28,  result = 248,  test = 12

Machine language program (field boundaries op|rs|rt|rd|sh|fn shown
only to facilitate understanding):

00100000000100000000000000001001
00000010000100000100000000100010
00000001001000000000000000100000
00010101000100000000000000001100
00100001000010000000000000000001
00000010000000000100100000100000
00001000000000000000000000000011
10101111100010010000000011111000

Figure 7.2  An assembly-language program, its machine-language
version, and the symbol table created during the assembly process.
7.2 Assembler Directives

Assembler directives provide the assembler with info on how to
translate the program but do not lead to the generation of machine
instructions.

        .macro             # start macro (see Section 7.4)
        .end_macro         # end macro (see Section 7.4)
        .text              # start program’s text segment
        ...                # program text goes here
        .data              # start program’s data segment
tiny:   .byte   156,0x7a   # name & initialize data byte(s)
max:    .word   35000      # name & initialize data word(s)
small:  .float  2E-3       # name short float (see Chapter 12)
big:    .double 2E-3       # name long float (see Chapter 12)
        .align  2          # align next item on word boundary
array:  .space  600        # reserve 600 bytes = 150 words
str1:   .ascii  “a*b”      # name & initialize ASCII string
str2:   .asciiz “xyz”      # null-terminated ASCII string
        .global main       # consider “main” a global name
Composing Simple Assembler Directives

Example 7.1

Write an assembler directive to achieve each of the following
objectives:

a. Put the error message “Warning: The printer is out of paper!” in
   memory.
b. Set up a constant called “size” with the value 4.
c. Set up an integer variable called “width” and initialize it to 4.
d. Set up a constant called “mill” with the value 1,000,000 (one
   million).
e. Reserve space for an integer vector “vect” of length 250.

Solution:

a. noppr: .asciiz “Warning: The printer is out of paper!”
b. size:  .byte  4        # small constant fits in one byte
c. width: .word  4        # byte could be enough, but ...
d. mill:  .word  1000000  # constant too large for byte
e. vect:  .space 1000     # 250 words = 1000 bytes
7.3 Pseudoinstructions

Example of a one-to-one pseudoinstruction: The following

   not   $s0             # complement ($s0)

is converted to the real instruction:

   nor   $s0,$s0,$zero   # complement ($s0)

Example of a one-to-several pseudoinstruction: The following

   abs   $t0,$s0         # put |($s0)| into $t0

is converted to the sequence of real instructions:

   add   $t0,$s0,$zero   # copy x into $t0
   slt   $at,$t0,$zero   # is x negative?
   beq   $at,$zero,+4    # if not, skip next instr
   sub   $t0,$zero,$s0   # the result is 0 – x
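As another one-to-several example, the branch pseudoinstruction
blt reg1,reg2,L of Table 7.1 can be synthesized as follows (a sketch
of the standard expansion, using the assembler’s $at register):

   slt   $at,$s1,$s2     # ($s1) < ($s2)?
   bne   $at,$zero,L     # if so, branch to L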
MiniMIPS Pseudoinstructions

Table 7.1

Class            Pseudoinstruction         Usage
Copy             Move                      move regd,regs
                 Load address              la   regd,address
                 Load immediate            li   regd,anyimm
Arithmetic       Absolute value            abs  regd,regs
                 Negate                    neg  regd,regs
                 Multiply (into register)  mul  regd,reg1,reg2
                 Divide (into register)    div  regd,reg1,reg2
                 Remainder                 rem  regd,reg1,reg2
                 Set greater than          sgt  regd,reg1,reg2
                 Set less or equal         sle  regd,reg1,reg2
                 Set greater or equal      sge  regd,reg1,reg2
Shift            Rotate left               rol  regd,reg1,reg2
                 Rotate right              ror  regd,reg1,reg2
Logic            NOT                       not  reg
Memory access    Load doubleword           ld   regd,address
                 Store doubleword          sd   regd,address
Control transfer Branch less than          blt  reg1,reg2,L
                 Branch greater than       bgt  reg1,reg2,L
                 Branch less or equal      ble  reg1,reg2,L
                 Branch greater or equal   bge  reg1,reg2,L
7.4 Macroinstructions

A macro is a mechanism to give a name to an often-used
sequence of instructions (shorthand notation).

   .macro name(args)   # macro and arguments named
   ...                 # instr’s defining the macro
   .end_macro          # macro terminator

How is a macro different from a pseudoinstruction?
Pseudos are predefined, fixed, and look like machine instructions
Macros are user-defined and resemble procedures (have arguments)

How is a macro different from a procedure?
Control is transferred to and returns from a procedure
After a macro has been replaced, no trace of it remains
Macro to Find the Largest of Three Values

Example 7.4

Write a macro to determine the largest of three values in registers
and to put the result in a fourth register.

Solution:

.macro mx3r(m,a1,a2,a3)  # macro and arguments named
   move  m,a1            # assume (a1) is largest; m = (a1)
   bge   m,a2,+4         # if (a2) is not larger, ignore it
   move  m,a2            # else set m = (a2)
   bge   m,a3,+4         # if (a3) is not larger, ignore it
   move  m,a3            # else set m = (a3)
.end_macro               # macro terminator

If the macro is used as mx3r($t0,$s0,$s4,$s3), the assembler replaces
the arguments m, a1, a2, a3 with $t0, $s0, $s4, $s3, respectively.
7.5 Linking and Loading
The linker has the following responsibilities:
Ensuring correct interpretation (resolution) of labels in all modules
Determining the placement of text and data segments in memory
Evaluating all data addresses and instruction labels
Forming an executable program with no unresolved references
The loader is in charge of the following:
Determining the memory needs of the program from its header
Copying text and data from the executable program file into memory
Modifying (shifting) addresses, where needed, during copying
Placing program parameters onto the stack (as in a procedure call)
Initializing all machine registers, including the stack pointer
Jumping to a start-up routine that calls the program’s main routine
7.6 Running Assembler Programs

Spim is a simulator that can run MiniMIPS programs;
the name Spim comes from reversing MIPS.

Three versions of Spim are available for free downloading:
   PCSpim   for Windows machines
   xspim    for X-windows
   spim     for Unix systems

You can download SPIM from:
http://www.cs.wisc.edu/~larus/spim.html

[Web-page excerpt: “SPIM: A MIPS32 Simulator,” James Larus
(spim@larusstone.org), Microsoft Research; formerly Professor, CS
Dept., Univ. Wisconsin-Madison. “spim is a self-contained simulator
that will run MIPS32 assembly language programs. It reads and
executes assembly . . .”]
Input/Output Conventions for MiniMIPS

Table 7.2  Input/output and control functions of syscall in PCSpim.

        ($v0) Function         Arguments                    Result
Output  1  Print integer       Integer in $a0               Integer displayed
        2  Print floating-pt   Float in $f12                Float displayed
        3  Print double-float  Double-float in $f12,$f13    Double-float displayed
        4  Print string        Pointer in $a0               Null-terminated string displayed
Input   5  Read integer                                     Integer returned in $v0
        6  Read floating-pt                                 Float returned in $f0
        7  Read double-float                                Double-float returned in $f0,$f1
        8  Read string         Pointer in $a0, length       String returned in buffer at pointer
                               in $a1
Cntl    9  Allocate memory     Number of bytes in $a0       Pointer to memory block in $v0
        10 Exit from program                                Program execution terminated
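For example, to display the integer in $s0 and then terminate (a
minimal sketch following the conventions of Table 7.2):

   addi  $v0,$zero,1    # function 1: print integer
   add   $a0,$s0,$zero  # argument: the integer to print
   syscall
   addi  $v0,$zero,10   # function 10: exit from program
   syscall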
PCSpim User Interface

[Figure: the PCSpim window, with menu bar (File, Simulator, Window,
Help), toolbar, status bar, and panes showing the registers (PC, EPC,
Cause, Status, HI, LO, and general registers R0-R31), the text
segment (disassembled instructions with their addresses), the data
segment, and a messages area.]

Figure 7.3  (The PCSpim user interface.)
8 Instruction Set Variations
The MiniMIPS instruction set is only one example
• How instruction sets may differ from that of MiniMIPS
• RISC and CISC instruction set design philosophies
Topics in This Chapter
8.1 Complex Instructions
8.2 Alternative Addressing Modes
8.3 Variations in Instruction Formats
8.4 Instruction Set Design and Evolution
8.5 The RISC/CISC Dichotomy
8.6 Where to Draw the Line
Review of Some Key Concepts

A macroinstruction expands into several instructions; it differs from
a procedure in that the macro is replaced with equivalent instructions
at assembly time. A machine instruction may in turn be implemented by
a sequence of microinstructions.

Instruction format for a simple RISC design (the MiniMIPS R, I, and J
formats): all instructions are of the same length, giving short,
uniform execution; fields are used consistently, making for simple
decoding; and reading of registers can be initiated even before the
instruction is decoded.
8.1 Complex Instructions

Table 8.1 (partial)  Examples of complex instructions in two popular
modern microprocessors and two computer families of historical
significance

Machine      Instruction  Effect
Pentium      MOVS         Move one element in a string of bytes,
                          words, or doublewords using addresses
                          specified in two pointer registers; after
                          the operation, increment or decrement the
                          registers to point to the next element of
                          the string
PowerPC      cntlzd       Count the number of consecutive 0s in a
                          specified source register beginning with
                          bit position 0 and place the count in a
                          destination register
IBM 360-370  CS           Compare and swap: Compare the content of a
                          register to that of a memory location; if
                          unequal, load the memory word into the
                          register, else store the content of a
                          different register into the same memory
                          location
Digital VAX  POLYD        Polynomial evaluation with double flp
                          arithmetic: Evaluate a polynomial in x,
                          with very high precision in intermediate
                          results, using a coefficient table whose
                          location in memory is given within the
                          instruction
Some Details of Sample Complex Instructions

[Figure: MOVS (move string) copies elements from a source string to a
destination string; cntlzd (count leading 0s) applied to
0000 0010 1100 0111 finds 6 leading 0s and returns 0000 0000 0000 0110;
POLYD (polynomial evaluation in double floating-point) evaluates
cn–1 x^(n–1) + . . . + c2 x^2 + c1 x + c0 from a table of coefficients
in memory.]
Benefits and Drawbacks of Complex Instructions

Benefits:
• Fewer instructions in program (less memory)
• Fewer memory accesses for instructions
• Programs may become easier to write/read/understand
• Potentially faster execution (complex steps are still done
  sequentially in multiple cycles, but hardware control can be
  faster than software loops)

Drawbacks:
• More complex format (slower decoding)
• Less flexible (one algorithm for polynomial evaluation or sorting
  may not be the best in all cases)
• If interrupts are processed at the end of the instruction cycle,
  the machine may become less responsive to time-critical events
  (interrupt handling)
8.2 Alternative Addressing Modes

Let’s refresh our memory (from Chap. 5):

[Figure: the six MiniMIPS addressing modes, Implied, Immediate,
Register, Base, PC-relative, and Pseudodirect, repeated here.]

Figure 5.11  Schematic representation of addressing modes in MiniMIPS.
Addressing Mode Examples in the MiniMIPS ISA

(Table 6.2, the 37 + 3 MiniMIPS instructions covered so far, is
repeated here from Chapter 6.)
More Elaborate Addressing Modes

[Figure: addressing modes not supported in MiniMIPS.
Indexed: a base register plus an index register form the address;
   x := B[i].
Update (with base): the base register is incremented after use;
   x := Mem[p]; p := p + 1.
Update (with indexed): the index register is incremented after use;
   x := B[i]; i := i + 1.
Indirect: the memory word fetched in a first access is itself the
   address for a second access; t := Mem[p]; x := Mem[t], i.e.,
   x := Mem[Mem[p]]; the address-specification part may be replaced
   with any other form of address specification.]

Figure 8.1  Schematic representation of more elaborate
addressing modes not supported in MiniMIPS.
Usefulness of Some Elaborate Addressing Modes

Update mode: XORing a string of bytes

loop:  lb    $t0,A($s0)
       xor   $s1,$s1,$t0
       addi  $s0,$s0,-1
       bne   $s0,$zero,loop

With update addressing, the lb and addi above would collapse into one
instruction.

Indirect mode: Case statement. Branch to location Li if s = i (the
switch variable), using a jump table T that holds the entries L0, L1,
L2, L3, L4, L5 at T, T+4, T+8, T+12, T+16, T+20:

case:  lw    $t0,0($s0)    # get s
       add   $t0,$t0,$t0   # form 2s
       add   $t0,$t0,$t0   # form 4s
       la    $t1,T         # base T
       add   $t1,$t0,$t1
       lw    $t2,0($t1)    # get table entry
       jr    $t2
8.3 Variations in Instruction Formats

0-, 1-, 2-, and 3-address instructions in MiniMIPS:

Category   Example (op, fn)  Description of operand(s)
0-address  syscall (fn 12)   One implied operand in register $v0
1-address  j (op 2)          Jump target addressed (in pseudodirect form)
2-address  mult (fn 24)      Two source registers (rs, rt) addressed,
                             destination implied
3-address  add (fn 32)       Destination (rd) and two source registers
                             (rs, rt) addressed

Figure 8.2  Examples of MiniMIPS instructions with 0 to 3
addresses; shaded fields are unused.
Zero-Address Architecture: Stack Machine

The stack holds all the operands (replacing our register file);
load/store operations become push/pop, and arithmetic/logic operations
need only an opcode: they pop operand(s) from the top of the stack and
push the result onto the stack.

Example: Evaluating the expression (a + b) × (c – d)

Push a      stack: a
Push b      stack: b, a
Add         stack: a+b
Push d      stack: d, a+b
Push c      stack: c, d, a+b
Subtract    stack: c–d, a+b
Multiply    stack: Result

Polish string: a b + d c – ×

If a variable is used again, you may have to push it multiple times.
Special instructions such as “Duplicate” and “Swap” are helpful.
One-Address Architecture: Accumulator Machine

The accumulator, a special register attached to the ALU, always holds
operand 1 and the operation result; only one operand needs to be
specified by the instruction.

Example: Evaluating the expression (a + b) × (c – d)

load      a
add       b
store     t
load      c
subtract  d
multiply  t

Within branch instructions, the condition or target address must be
implied: e.g., “Branch to L if acc negative,” or “If register x is
negative, skip the next instruction.”

We may have to store accumulator contents in memory (as in the example
above); no store is needed for a + b + c + d + . . . (hence the name
“accumulator”).
Two-Address Architectures

Two addresses may be used in different ways:
Operand1/result and operand 2
Condition to be checked and branch target address

Example: Evaluating the expression (a + b) × (c – d)
with instructions of a hypothetical two-address machine:

load      $1,a
add       $1,b
load      $2,c
subtract  $2,d
multiply  $1,$2

A variation is to use one of the addresses as in a one-address
machine and the second one to specify a branch in every instruction.
Example of a Complex Instruction Format

[Figure: components that form a variable-length IA-32 (80x86)
instruction: instruction prefixes (zero to four, 1 B each, for
operand/address size overrides and other modifiers), opcode (1-2 B),
ModR/M byte (Mod, Reg/Op, R/M fields) and SIB byte (Scale, Index,
Base fields) that most memory operands need, offset or displacement
(0, 1, 2, or 4 B), and immediate (0, 1, 2, or 4 B). Instructions can
contain up to 15 bytes.]
Some of IA-32’s Variable-Width Instructions

Type    Opcode  Description of operand(s)
1-byte  PUSH    3-bit register specification
2-byte  JE      4-bit condition, 8-bit jump offset
3-byte  MOV     8-bit register/mode, 8-bit offset
4-byte  XOR     8-bit register/mode, 8-bit base/index, 8-bit offset
5-byte  ADD     3-bit register spec, 32-bit immediate
6-byte  TEST    8-bit register/mode, 32-bit immediate

Figure 8.3  Example 80x86 instructions ranging in width from 1 to 6
bytes; much wider instructions (up to 15 bytes) also exist.
8.4 Instruction Set Design and Evolution

Desirable attributes of an instruction set:

Consistent, with uniform and generally applicable rules
Orthogonal, with independent features noninterfering
Transparent, with no visible side effect due to implementation details
Easy to learn/use (often a byproduct of the three attributes above)
Extensible, so as to allow the addition of future capabilities
Efficient, in terms of both memory needs and hardware realization

[Figure: a new machine project takes the processor design team from
performance objectives through instruction-set definition,
implementation, and fabrication & testing to sales & use, with tuning,
bug fixes, and user feedback closing the loop.]

Figure 8.4  Processor design and implementation process.
8.5 The RISC/CISC Dichotomy
The RISC (reduced instruction set computer) philosophy:
Complex instruction sets are undesirable because inclusion of
mechanisms to interpret all the possible combinations of opcodes
and operands might slow down even very simple operations.
Ad hoc extension of instruction sets, while maintaining backward
compatibility, leads to CISC; imagine modern English containing
every English word that has been used through the ages
Features of RISC architecture
1. Small set of inst’s, each executable in roughly the same time
2. Load/store architecture (leading to more registers)
3. Limited addressing modes to simplify address calculations
4. Simple, uniform instruction formats (ease of decoding)
RISC/CISC Comparison via Generalized Amdahl’s Law
Example 8.1
An ISA has two classes of simple (S) and complex (C) instructions.
On a reference implementation of the ISA, class-S instructions
account for 95% of the running time for programs of interest. A RISC
version of the machine is being considered that executes only class-S
instructions directly in hardware, with class-C instructions treated as
pseudoinstructions. It is estimated that in the RISC version, class-S
instructions will run 20% faster while class-C instructions will be
slowed down by a factor of 3. Does the RISC approach offer better or
worse performance compared to the reference implementation?
Solution
Per assumptions, 0.95 of the work is speeded up by a factor of 1.0 /
0.8 = 1.25, while the remaining 5% is slowed down by a factor of 3.
The RISC speedup is 1 / [0.95 / 1.25 + 0.05 × 3] = 1.1. Thus, a 10%
improvement in performance can be expected in the RISC version.
Some Hidden Benefits of RISC
In Example 8.1, we established that a speedup factor of 1.1 can be
expected from the RISC version of a hypothetical machine
This is not the entire story, however!
If the speedup of 1.1 came with some additional cost, then one might
legitimately wonder whether it is worth the expense and design effort
The RISC version of the architecture also:
Reduces the effort and team size for design
Shortens the testing and debugging phase
Simplifies documentation and maintenance
These translate into a cheaper product and shorter time-to-market.
MIPS Performance Rating Revisited
An m-MIPS processor can execute m million instructions per second
Comparing an m-MIPS processor with a 10m-MIPS processor
Like comparing two people who read m pages and 10m pages per hour
Reading 100 pages per hour, as opposed to 10 pages per hour, may
not allow you to finish the same reading assignment in 1/10 the time.
RISC / CISC Convergence
The earliest RISC designs:
CDC 6600, highly innovative supercomputer of the mid 1960s
IBM 801, influential single-chip processor project of the late 1970s
In the early 1980s, two projects brought RISC to the forefront:
UC Berkeley’s RISC 1 and 2, forerunners of the Sun SPARC
Stanford’s MIPS, later marketed by a company of the same name
Throughout the 1980s, there were heated debates about the relative
merits of RISC and CISC architectures
Since the 1990s, the debate has cooled down!
We can now enjoy both sets of benefits by having complex instructions
automatically translated to sequences of very simple instructions that
are then executed on RISC-based underlying hardware
8.6 Where to Draw the Line

The ultimate reduced instruction set computer (URISC):
How many instructions are absolutely needed for useful computation?

Only one!

Subtract source1 from source2, replace source2 with the result, and
jump to the target address if the result is negative.

Assembly language form (dest plays the role of source2):

label: urisc  dest,src1,target

Pseudoinstructions can be synthesized using the single instruction;
e.g., the move pseudoinstruction (corrected version):

stop:  .word  0
start: urisc  dest,dest,+1   # dest = 0
       urisc  temp,temp,+1   # temp = 0
       urisc  temp,src,+1    # temp = -(src)
       urisc  dest,temp,+1   # dest = -(temp); i.e., (src)
       ...                   # rest of program
Some Useful Pseudoinstructions for URISC

Example 8.2 (2 parts of 5)

Write the sequence of instructions that are produced by the URISC
assembler for each of the following pseudoinstructions.

parta: uadd  dest,src1,src2  # dest=(src1)+(src2)
partc: uj    label           # goto label

Solution

at1 and at2 are temporary memory locations for the assembler’s use.

parta: urisc  at1,at1,+1     # at1 = 0
       urisc  at1,src1,+1    # at1 = -(src1)
       urisc  at1,src2,+1    # at1 = -(src1)–(src2)
       urisc  dest,dest,+1   # dest = 0
       urisc  dest,at1,+1    # dest = -(at1)
partc: urisc  at1,at1,+1     # at1 = 0
       urisc  at1,one,label  # at1 = -1 to force the jump
URISC Hardware

URISC instruction: three memory words. Word 1: source 1;
word 2: source 2 / destination; word 3: jump target.

[Figure: the URISC datapath: PC, MAR, and MDR registers, an adder
whose second input is complemented (with carry-in 1) to perform
subtraction, N and Z flag flip-flops, a memory unit with Read/Write
controls, and a mux that selects the next PC.]

Figure 8.5  Instruction format and hardware structure for URISC.
Part III
The Arithmetic/Logic Unit
III The Arithmetic/Logic Unit
Overview of computer arithmetic and ALU design:
• Review representation methods for signed integers
• Discuss algorithms & hardware for arithmetic ops
• Consider floating-point representation & arithmetic
Topics in This Part
Chapter 9
Number Representation
Chapter 10 Adders and Simple ALUs
Chapter 11 Multipliers and Dividers
Chapter 12 Floating-Point Arithmetic
Preview of Arithmetic Unit in the Data Path

[Figure: key elements of the single-cycle MicroMIPS data path:
instruction fetch (PC, instruction cache, next-address logic using
the jump target address jta and incremented PC), register access /
decode (register file; sign extension of the 16-bit immediate),
ALU operation (ALU with overflow output ALUOvfl and function control
ALUFunc), data access (data cache with DataRead/DataWrite), and
register writeback, governed by control signals such as RegDst,
RegWrite, ALUSrc, RegInSrc, and Br&Jump.]

Fig. 13.3  Key elements of the single-cycle MicroMIPS data path.
Computer Arithmetic as a Topic of Study

Brief overview article: Encyclopedia of Info Systems,
Academic Press, 2002, Vol. 3, pp. 317-333

Graduate course ECE 252B; text: Computer Arithmetic,
Oxford U Press, 2000 (2nd ed., 2010)

Our textbook’s treatment of the topic falls between
the extremes (4 chaps.)
9 Number Representation
Arguably the most important topic in computer arithmetic:
• Affects system compatibility and ease of arithmetic
• Two’s complement, flp, and unconventional methods
Topics in This Chapter
9.1 Positional Number Systems
9.2 Digit Sets and Encodings
9.3 Number-Radix Conversion
9.4 Signed Integers
9.5 Fixed-Point Numbers
9.6 Floating-Point Numbers
9.1 Positional Number Systems

Representations of natural numbers {0, 1, 2, 3, …}:

||||| ||||| ||||| ||||| ||||| ||   sticks or unary code
27                                 radix-10 or decimal code
11011                              radix-2 or binary code
XXVII                              Roman numerals

Fixed-radix positional representation with k digits

Value of a number:  x = (xk–1 xk–2 . . . x1 x0)r = Σ xi r^i,
where the sum is taken over i = 0 to k – 1

For example:
27 = (11011)two = (1×2^4) + (1×2^3) + (0×2^2) + (1×2^1) + (1×2^0)

Number of digits for [0, P]:  k = ⎡logr (P + 1)⎤ = ⎣logr P⎦ + 1
Unsigned Binary Integers

[Figure: a 4-bit code wheel for the integers 0-15; the natural number
appears inside and its 4-bit encoding (0000 through 1111) outside.
Turning x notches counterclockwise adds x; turning y notches
clockwise subtracts y.]

Figure 9.1  Schematic representation of 4-bit code for
integers in [0, 15].
Representation Range and Overflow

[Figure: the finite set of representable numbers spans [max–, max+];
numbers smaller than max– fall in the negative overflow region and
numbers larger than max+ fall in the positive overflow region.]

Figure 9.2  Overflow regions in finite number representation systems.
For unsigned representations covered in this section, max– = 0.

Example 9.2, Part d

Discuss whether overflow will occur when computing 3^17 – 3^16 in a
number system with k = 8 digits in radix r = 10.

Solution

The result, 86 093 442, is representable in the number system, which
has the range [0, 99 999 999]; however, if 3^17 = 129 140 163 is
computed en route to the final result, overflow will occur.
9.2 Digit Sets and Encodings

Conventional and unconventional digit sets

• Decimal digits in [0, 9]; 4-bit BCD, 8-bit ASCII
• Hexadecimal, or hex for short: digits 0-9 & a-f
• Conventional binary digit set in [0, 1];
  conventional ternary digit set in [0, 2];
  the conventional digit set for radix r is [0, r – 1]
• Symmetric ternary digit set in [–1, 1]
• Redundant digit set [0, 2], encoded in 2 bits;
  e.g., (0 2 1 1 0)two and (1 0 1 0 2)two both represent 22
Carry-Save Numbers

Radix-2 numbers using the digits 0, 1, and 2

Example: (1 0 2 1)two = (1×2^3) + (0×2^2) + (2×2^1) + (1×2^0) = 13

Possible encodings:

(a) Binary: 0 → 00, 1 → 01, 2 → 10 (11 unused).
    The digits 1 0 2 1 encode as the bit planes 0 0 1 0 = 2 (MSBs)
    and 1 0 0 1 = 9 (LSBs); 2×2 + 9 = 13.

(b) Unary: 0 → 00, 1 → 01 (first alternate) or 10 (second alternate),
    2 → 11. The digits 1 0 2 1 encode as the bit planes 0 0 1 1 = 3
    (first bits) and 1 0 1 0 = 10 (second bits); 3 + 10 = 13.
The Notion of Carry-Save Addition

Digit-set combination: {0, 1, 2} + {0, 1} = {0, 1, 2, 3} = {0, 2} + {0, 1}

[Figure: (a) carry-save addition combines a carry-save input and a
binary input into a carry-save output; (b) two carry-save numbers are
added by two layers of carry-save addition; a final high-end bit of 1
represents overflow and is ignored.]

Figure 9.3  Adding a binary number or another
carry-save number to a carry-save number.
9.3 Number Radix Conversion

Two ways to convert numbers from an old radix r to a new radix R:

• Perform arithmetic in the new radix R
  Suitable for conversion from radix r to radix 10
  Horner’s rule:
  (xk–1 xk–2 . . . x1 x0)r = (…((0 + xk–1)r + xk–2)r + . . . + x1)r + x0
  (1 0 1 1 0 1 0 1)two = 0 + 1 → 1 × 2 + 0 → 2 × 2 + 1 → 5 × 2 + 1 →
  11 × 2 + 0 → 22 × 2 + 1 → 45 × 2 + 0 → 90 × 2 + 1 → 181

• Perform arithmetic in the old radix r
  Suitable for conversion from radix 10 to radix R
  Divide the number by R, use the remainder as the LSD
  and the quotient to repeat the process
  19 / 3 → rem 1, quo 6;  6 / 3 → rem 0, quo 2;  2 / 3 → rem 2, quo 0
  Thus, 19 = (2 0 1)three
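As another worked example, converting 75 to binary by repeated
division by 2:
75/2 → rem 1, quo 37;  37/2 → rem 1, quo 18;  18/2 → rem 0, quo 9;
9/2 → rem 1, quo 4;  4/2 → rem 0, quo 2;  2/2 → rem 0, quo 1;
1/2 → rem 1, quo 0.
Reading the remainders from last to first, 75 = (1 0 0 1 0 1 1)two.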
Justifications for Radix Conversion Rules

Justifying Horner’s rule:

(xk–1 xk–2 … x0)r = xk–1 r^(k–1) + xk–2 r^(k–2) + … + x1 r + x0
                  = x0 + r (x1 + r (x2 + r (…)))

[Figure: the binary representation of x is the binary representation
of ⎣x/2⎦ followed by the bit x mod 2.]

Figure 9.4  Justifying one step of the conversion of x to radix 2.
9.4 Signed Integers

• We dealt with representing the natural numbers
• Signed or directed whole numbers = integers
  { . . . , −3, −2, −1, 0, 1, 2, 3, . . . }

• Signed-magnitude representation
  +27 in 8-bit signed-magnitude binary code:  0 0011011
  –27 in 8-bit signed-magnitude binary code:  1 0011011
  –27 in 2-digit decimal code with BCD digits:  1 0010 0111

• Biased representation
  Represent the interval of numbers [−N, P] by the unsigned
  interval [0, P + N]; i.e., by adding N to every number
Two’s-Complement Representation

With k bits, numbers in the range [–2^(k–1), 2^(k–1) – 1] are
represented. Negation is performed by inverting all bits and adding 1.

[Figure: a 4-bit code wheel for the integers –8 (1000) through +7
(0111); turning x notches counterclockwise adds x, and turning 16 – y
notches counterclockwise adds –y (subtracts y).]

Figure 9.5  Schematic representation of 4-bit 2’s-complement
code for integers in [–8, +7].
Conversion from 2’s-Complement to Decimal

Example 9.7

Convert x = (1 0 1 1 0 1 0 1)2’s-compl to decimal.

Solution

Given that x is negative, one could change its sign and evaluate –x.
Shortcut: Use Horner’s rule, but take the MSB as negative:
–1 × 2 + 0 → –2 × 2 + 1 → –3 × 2 + 1 → –5 × 2 + 0 → –10 × 2 + 1
→ –19 × 2 + 0 → –38 × 2 + 1 → –75

Sign Change for a 2’s-Complement Number

Example 9.8

Given y = (1 0 1 1 0 1 0 1)2’s-compl, find the representation of –y.

Solution

–y = (0 1 0 0 1 0 1 0) + 1 = (0 1 0 0 1 0 1 1)2’s-compl  (i.e., 75)
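In MiniMIPS, the same sign change can be carried out either by
subtraction from zero or by the invert-and-add-1 recipe (our own
sketch, with the operand in $s0):

   sub   $t0,$zero,$s0   # $t0 = –($s0)

   nor   $t0,$s0,$zero   # $t0 = one’s complement of ($s0)
   addi  $t0,$t0,1       # add 1 to complete the negation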
Two’s-Complement Addition and Subtraction

[Figure: a k-bit binary adder computes x ± y; the control signal
Add′Sub selects y or y′ as the second adder input and doubles as the
carry-in c_in, so that subtraction is performed as x + y′ + 1.]

Figure 9.6  Binary adder used as 2’s-complement adder/subtractor.
9.5 Fixed-Point Numbers

Positional representation: k whole and l fractional digits

Value of a number:
x = (xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l)r = Σ xi r^i

For example:
2.375 = (10.011)two = (1×2^1) + (0×2^0) + (0×2^–1) + (1×2^–2) + (1×2^–3)

Numbers in the range [0, r^k – ulp] are representable, where ulp = r^–l

Fixed-point arithmetic is the same as integer arithmetic
(the radix point is implied, not explicit)

Two’s-complement properties (including sign change) hold here as well:
(01.011)2’s-compl = (–0×2^1) + (1×2^0) + (0×2^–1) + (1×2^–2) + (1×2^–3) = +1.375
(11.011)2’s-compl = (–1×2^1) + (1×2^0) + (0×2^–1) + (1×2^–2) + (1×2^–3) = –0.625
Fixed-Point 2’s-Complement Numbers

[Figure: a 4-bit code wheel for (1 + 3)-bit fixed-point values, from
–1 (1.000) through +.875 (0.111) in steps of .125.]

Figure 9.7  Schematic representation of 4-bit 2’s-complement
encoding for (1 + 3)-bit fixed-point numbers in the range [–1, +7/8].
Radix Conversion for Fixed-Point Numbers

Convert the whole and fractional parts separately.
To convert the fractional part from an old radix r to a new radix R:

• Perform arithmetic in the new radix R
  Evaluate a polynomial in r^–1:  (.011)two = 0 × 2^–1 + 1 × 2^–2 + 1 × 2^–3
  Simpler: View the fractional part as an integer, convert, divide by r^l
  (.011)two = (?)ten
  Multiply by 8 to make the number an integer: (011)two = (3)ten
  Thus, (.011)two = (3 / 8)ten = (.375)ten

• Perform arithmetic in the old radix r
  Multiply the given fraction by R, use the whole part as the MSD
  and the fractional part to repeat the process
  (.72)ten = (?)two
  0.72 × 2 = 1.44, so the answer begins with 0.1
  0.44 × 2 = 0.88, so the answer continues with 0.10
9.6 Floating-Point Numbers

Useful for applications where very large and very small
numbers are needed simultaneously

• Fixed-point representation must sacrifice precision
  for small values to represent large values
  x = (0000 0000 . 0000 1001)two   Small number
  y = (1001 0000 . 0000 0000)two   Large number

• Neither y^2 nor y / x is representable in the format above

• Floating-point representation is like scientific notation:
  −20 000 000 = −2 × 10^7
  +0.000 000 007 = +7 × 10^–9  (also written 7E−9)
  The parts are the sign, the significand, the exponent base, and
  the exponent.
ANSI/IEEE Standard Floating-Point Format (IEEE 754)

A revision (IEEE 754R) was completed in 2008: the revised version
includes 16-bit and 128-bit binary formats, as well as 64- and
128-bit decimal formats.

Short (32-bit) format: sign bit, 8 exponent bits (bias = 127,
exponents –126 to 127), and 23 bits for the fractional part of the
significand (plus a hidden 1 in the integer part). The short exponent
range would be –127 to 128, but the two extreme values are reserved
for special operands (similarly for the long format).

Long (64-bit) format: sign bit, 11 exponent bits (bias = 1023,
exponents –1022 to 1023), and 52 bits for the fractional part (plus a
hidden 1 in the integer part).

Figure 9.8  The two ANSI/IEEE standard floating-point formats.
Short and Long IEEE 754 Formats: Features

Table 9.1  Some features of ANSI/IEEE standard floating-point formats

Feature              Single/Short               Double/Long
Word width in bits   32                         64
Significand in bits  23 + 1 hidden              52 + 1 hidden
Significand range    [1, 2 – 2^–23]             [1, 2 – 2^–52]
Exponent bits        8                          11
Exponent bias        127                        1023
Zero (±0)            e + bias = 0, f = 0        e + bias = 0, f = 0
Denormal             e + bias = 0, f ≠ 0        e + bias = 0, f ≠ 0
                     represents ±0.f × 2^–126   represents ±0.f × 2^–1022
Infinity (±∞)        e + bias = 255, f = 0      e + bias = 2047, f = 0
Not-a-number (NaN)   e + bias = 255, f ≠ 0      e + bias = 2047, f ≠ 0
Ordinary number      e + bias ∈ [1, 254]        e + bias ∈ [1, 2046]
                     e ∈ [–126, 127]            e ∈ [–1022, 1023]
                     represents 1.f × 2^e       represents 1.f × 2^e
min                  2^–126 ≅ 1.2 × 10^–38      2^–1022 ≅ 2.2 × 10^–308
max                  ≅ 2^128 ≅ 3.4 × 10^38      ≅ 2^1024 ≅ 1.8 × 10^308
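As a worked example (ours, not from the table): to encode –0.75 in the
short format, note that –0.75 = –(1.1)two × 2^–1. The sign bit is 1,
the biased exponent is –1 + 127 = 126 = (0111 1110)two, and the
fraction field is 100 0000 0000 0000 0000 0000, so the encoded word is
1 01111110 10000000000000000000000.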
10 Adders and Simple ALUs
Addition is the most important arith operation in computers:
• Even the simplest computers must have an adder
• An adder, plus a little extra logic, forms a simple ALU
Topics in This Chapter
10.1 Simple Adders
10.2 Carry Propagation Networks
10.3 Counting and Incrementation
10.4 Design of Fast Adders
10.5 Logic and Shift Operations
10.6 Multifunction ALUs
10.1 Simple Adders

Half-adder (HA) truth table:

Inputs   Outputs
x  y     c  s
0  0     0  0
0  1     0  1
1  0     0  1
1  1     1  0

Digit-set interpretation: {0, 1} + {0, 1} = {0, 2} + {0, 1}

Full-adder (FA) truth table:

Inputs      Outputs
x  y  cin   cout  s
0  0  0     0     0
0  0  1     0     1
0  1  0     0     1
0  1  1     1     0
1  0  0     0     1
1  0  1     1     0
1  1  0     1     0
1  1  1     1     1

Digit-set interpretation: {0, 1} + {0, 1} + {0, 1} = {0, 2} + {0, 1}

Figures 10.1/10.2  Binary half-adder (HA) and full-adder (FA).
Full-Adder Implementations

[Figure: (a) an FA built of two HAs, with cout formed by ORing the two
HA carries; (b) a CMOS mux-based FA, using two 4-input multiplexers;
(c) a two-level AND-OR FA.]

Figure 10.3  Full adder implemented with two half-adders, by means
of two 4-input multiplexers, and as a two-level gate network.
Ripple-Carry Adder: Slow But Simple

[Figure: a chain of 32 FAs; the FA in bit position i takes x_i, y_i,
and c_i and produces s_i and c_(i+1), with c_0 = cin and c_32 = cout;
the critical path runs through the entire carry chain.]

Figure 10.4  Ripple-carry binary adder with 32-bit inputs and output.
Carry Chains and Auxiliary Signals

[Figure: a 16-bit addition example in which the carries form chains of
lengths 4, 6, 3, and 2 across the bit positions, illustrating how far
carries must travel; the auxiliary signals are the generate signal
g = xy and the propagate signal p = x ⊕ y.]
10.2 Carry Propagation Networks

gi pi   Carry is:
0  0    annihilated or killed
0  1    propagated
1  0    generated
1  1    (impossible)

gi = xi yi ,  pi = xi ⊕ yi

[Figure: the g and p signals of all bit positions feed a carry network
that produces the carries c1 through ck from c0; a final set of gates
then forms the sum bits si.]

Figure 10.5  The main part of an adder is the carry network. The rest
is just a set of gates to produce the g and p signals and the sum bits.
Ripple-Carry Adder Revisited
The carry recurrence: ci+1 = gi ∨ pi ci
Latency of k-bit adder is roughly 2k gate delays:
1 gate delay for production of p and g signals, plus
2(k – 1) gate delays for carry propagation, plus
1 XOR gate delay for generation of the sum bits
Figure 10.6 The carry propagation network of a ripple-carry adder.
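A minimal Python sketch of the carry recurrence ci+1 = gi ∨ pi ci applied bit by bit, as in the ripple-carry network above; the list-of-bits representation is an illustrative choice.

def ripple_carry_add(x_bits, y_bits, c0=0):
    """x_bits, y_bits: lists of bits, index 0 = LSB."""
    c, s_bits = c0, []
    for xi, yi in zip(x_bits, y_bits):
        g, p = xi & yi, xi ^ yi   # generate and propagate (1 gate delay)
        s_bits.append(p ^ c)      # sum bit (1 XOR delay)
        c = g | (p & c)           # carry ripples: 2 gate delays per position
    return s_bits, c              # k-bit sum and carry-out c_k

s, cout = ripple_carry_add([1, 0, 1, 1], [1, 1, 0, 1])   # 13 + 11 = 24
assert cout * 16 + sum(b << i for i, b in enumerate(s)) == 24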
The Complete Design of a Ripple-Carry Adder
Figure 10.6 (ripple-carry network) superimposed on Figure 10.5 (general
structure of an adder): the gi, pi logic and the sum XOR gates surround the
rippling carry chain.
First Carry Speed-Up Method: Carry Skip
Figures 10.7/10.8 A 4-bit section of a ripple-carry network with skip paths,
and the driving analogy: the rippling carries travel a one-way street, while
a carry entering a block whose bits all propagate takes the freeway.
Mux-Based Skip Carry Logic
The block propagate signal p[4j, 4j+3] drives a 2-to-1 mux that selects either
the rippled carry or the incoming carry c4j as the block's outgoing carry
c4j+4 (Fig. 10.7 redrawn with mux-based skip logic).

The carry-skip adder of Fig. 10.7 works fine if we begin with a clean slate,
where all signals are 0s; otherwise, it can run into problems that do not
exist in this mux-based implementation.
10.3 Counting and Incrementation
A k-bit count register feeds an adder whose other input is the increment
amount a; the Incr′Init control selects between loading Data in and updating
the register with the incremented value on each clock.

Figure 10.9 Schematic diagram of an initializable synchronous counter.
Circuit for Incrementation by 1
Setting y = 0 and cin = 1 in the carry network of Figure 10.6 makes every
gi = 0 and pi = xi, so each carry reduces to ci+1 = xi ci: the network
collapses to a chain of AND gates, substantially simpler than an adder.

Figure 10.10 Carry propagation network and sum logic for an incrementer.
10.4 Design of Fast Adders
• Carries can be computed directly without propagation
• For example, by unrolling the equation for c3, we get:
c3 = g2 ∨ p2 c2 = g2 ∨ p2 g1 ∨ p2 p1 g0 ∨ p2 p1 p0 c0
• We define “generate” and “propagate” signals for a block
extending from bit position a to bit position b as follows:
g[a,b] = gb ∨ pb gb–1 ∨ pb pb–1 gb–2 ∨ . . . ∨ pb pb–1 … pa+1 ga
p[a,b] = pb pb–1 . . . pa+1 pa
• Combining g and p signals for adjacent blocks:
g[h,j] = g[i+1,j] ∨ p[i+1,j] g[h,i]
p[h,j] = p[i+1,j] p[h,i]
[h, j] = [i + 1, j] ¢ [h, i]   (the carry operator ¢ merges the g and p
signals of the adjacent blocks [i + 1, j] and [h, i] into those of [h, j])
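A minimal Python sketch of the carry operator ¢ defined above, combining the (g, p) pairs of adjacent blocks; the small 3-bit check values are illustrative.

def carry_op(gp_high, gp_low):
    """¢: gp_high = (g[i+1,j], p[i+1,j]), gp_low = (g[h,i], p[h,i])."""
    g_hi, p_hi = gp_high
    g_lo, p_lo = gp_low
    return g_hi | (p_hi & g_lo), p_hi & p_lo

# Bitwise g = x & y, p = x ^ y for x = 111, y = 001 (LSB first):
g = [1, 0, 0]
p = [0, 1, 1]
block = (g[0], p[0])
for i in (1, 2):
    block = carry_op((g[i], p[i]), block)   # build (g[0,2], p[0,2])
assert block == (1, 0)   # with c0 = 0, c3 = g[0,2] = 1 (indeed 7 + 1 = 8)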
Carries as Generate Signals for Blocks [ 0, i ]
Assuming c0 = 0, we have ci = g[0, i–1]: each carry is the generate signal of
the block extending from bit position 0 to position i – 1 (Figure 10.5,
repeated).
Second Carry Speed-Up Method: Carry Lookahead
Figure 10.11 Brent-Kung lookahead carry network for an 8-digit adder, along
with details of one of the carry operator (¢) blocks, which computes
g[h,j] = g[i+1,j] ∨ p[i+1,j] g[h,i] and p[h,j] = p[i+1,j] p[h,i].
Recursive Structure of Brent-Kung Carry Network
Figure 10.12 Brent-Kung lookahead carry network for an 8-digit adder, with
only its top and bottom rows of carry operators shown; the middle is a
recursive 4-input Brent-Kung carry network.
An Alternate Design: Kogge-Stone Network
Kogge-Stone lookahead carry network for an 8-digit adder: after three levels
of carry operators, every carry emerges directly as ci = g[0, i–1]
(c1 = g[0,0], c2 = g[0,1], ..., c8 = g[0,7]).
Brent-Kung vs. Kogge-Stone Carry Network
For 8 digits, the Brent-Kung network uses 11 carry operators in 4 levels,
whereas the Kogge-Stone network uses 17 carry operators in only 3 levels:
Kogge-Stone trades extra hardware for lower latency.
Carry-Lookahead Logic with 4-Bit Block
A block signal generation unit takes g(i+3) p(i+3), ..., g(i) p(i) and ci, and
produces the block signals g[i, i+3] and p[i, i+3] plus the intermediate
carries c(i+1), c(i+2), c(i+3).

Figure 10.13 Blocks needed in the design of carry-lookahead adders
with four-way grouping of bits.
Third Carry Speed-Up Method: Carry Select
Allows doubling of adder width with a single-mux additional delay
Two copies of the adder for bit positions a to b compute the sum with cin = 0
and cin = 1; the true carry ca then selects the correct version of the sum
bits (and of cout) through a mux. The lower a positions (0 to a – 1) are
added as usual.

Figure 10.14 Carry-select addition principle.
10.5 Logic and Shift Operations
Conceptually, shifts can be implemented by multiplexing
A 6-bit code specifies shift direction and amount: a 64-input, 32-bit-wide
multiplexer selects among all right-shifted values (00...0,x[31] through
0,x[31,1] and x[31,0]) and all left-shifted values (x[30,0],0 through
x[0],00...0).

Figure 10.15 Multiplexer-based logical shifting unit.
Arithmetic Shifts
Purpose: Multiplication and division by powers of 2
sra   $t0,$s1,2     # $t0 ← ($s1) right-shifted by 2
srav  $t0,$s1,$s0   # $t0 ← ($s1) right-shifted by ($s0)

Both are R-format ALU instructions: sra takes its shift amount from the sh
field (fn = 3); srav takes it from the amount register in the rs field
(fn = 7).

Figure 10.16 The two arithmetic shift instructions of MiniMIPS.
Practical Shifting in Multiple Stages
Shift code: 00 = no shift, 01 = logical left, 10 = logical right,
11 = arithmetic right.

Figure 10.17 Multistage shifting in a barrel shifter: (a) a single-bit
shifter built of 4-input muxes; (b) shifting by up to 7 bits via cascaded
(0 or 4)-bit, (0 or 2)-bit, and (0 or 1)-bit stages.
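A minimal Python sketch of the multistage idea, extended from the 7-bit example of Fig. 10.17b to a full 32-bit shifter with five fixed-size stages; the function name and the 32-bit width are illustrative assumptions.

MASK32 = 0xFFFFFFFF

def barrel_shift(x, amount, code):
    """Shift in five stages of fixed sizes 16, 8, 4, 2, 1.
    code: 0 = no shift, 1 = logical left, 2 = logical right, 3 = arith right."""
    sign = (x >> 31) & 1
    for stage in (16, 8, 4, 2, 1):
        if amount & stage:            # each stage shifts by its size or not at all
            if code == 1:
                x = (x << stage) & MASK32
            elif code == 2:
                x = x >> stage
            elif code == 3:           # replicate the sign bit into vacated positions
                fill = (MASK32 << (32 - stage)) & MASK32 if sign else 0
                x = (x >> stage) | fill
    return x

assert barrel_shift(0x80000000, 3, 3) == 0xF0000000   # arithmetic right by 3
assert barrel_shift(0x00000001, 4, 1) == 0x00000010   # logical left by 4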
Bit Manipulation via Shifts and Logical Operations
AND with a mask to isolate a field (here bits 10-15):
0000 0000 0000 0000 1111 1100 0000 0000
Then right-shift by 10 positions to move the field to the right end of the
word; the resulting value ranges from 0 to 63, depending on the field pattern.

A 32-pixel (4 × 8) block of a black-and-white image (rows 0-3) can be
represented as the 32-bit word
1010 0000 0101 1000 0000 0110 0001 0111   (hex equivalent: 0xa0580617)

Figure 10.18 A 4 × 8 block of a black-and-white image represented
as a 32-bit word.
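A minimal Python sketch of the mask-and-shift field isolation just described, using the image word above:

word = 0xA0580617          # the example 32-bit image word above

# Isolate bits 10-15: AND with the mask, then right-shift by 10.
mask = 0b1111110000000000  # = 0x0000FC00, covering bit positions 10-15
field = (word & mask) >> 10
assert 0 <= field <= 63    # a 6-bit field ranges over 0..63

# Equivalent one-liner: shift first, then mask off 6 bits.
assert field == (word >> 10) & 0x3F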
10.6 Multifunction ALUs
An arithmetic unit and a logic unit process operands 1 and 2 in parallel; a
mux controlled by a function-type select signal (logic or arithmetic) picks
which result to pass on. Further control bits select the arithmetic function
(add, sub, ...) and the logic function (AND, OR, ...).

General structure of a simple arithmetic/logic unit.
An ALU for MiniMIPS

Function class (2 bits):  00 = shift, 01 = set less, 10 = arithmetic,
                          11 = logic
Shift function (2 bits):  00 = no shift, 01 = logical left,
                          10 = logical right, 11 = arithmetic right
                          (constant or variable amount, Const′Var)
Logic function (2 bits):  00 = AND, 01 = OR, 10 = XOR, 11 = NOR
Arithmetic (1 bit):       Add′Sub selects x + y or x – y

"Set less" passes the MSB of x – y; a 32-input NOR on the adder output
produces the Zero flag, and the carries c31 and c32 yield the Ovfl flag.

Figure 10.19 A multifunction ALU with 8 control signals (2 for function
class, 1 arithmetic, 3 shift, 2 logic) specifying the operation.
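A minimal behavioral sketch, in Python, of the Fig. 10.19 ALU; the function signature and the use of Python ints as 32-bit vectors are illustrative assumptions, not part of the figure.

def alu(x, y, fn_class, arith_sub=0, logic_fn=0, shifted_y=None):
    """fn_class: 0 = shift, 1 = set less, 2 = arithmetic, 3 = logic."""
    MASK = 0xFFFFFFFF
    s = (x + ((~y if arith_sub else y) & MASK) + arith_sub) & MASK   # x +- y
    if fn_class == 0:
        return shifted_y                 # shifter output passes through
    if fn_class == 1:
        return (s >> 31) & 1             # set less: MSB of x - y
    if fn_class == 2:
        return s
    return [x & y, x | y, x ^ y, ~(x | y) & MASK][logic_fn]

assert alu(5, 7, 1, arith_sub=1) == 1    # 5 < 7, so "set less" yields 1
assert alu(5, 7, 2) == 12                # arithmetic add
assert alu(5, 7, 3, logic_fn=2) == 2     # XOR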
11 Multipliers and Dividers
Modern processors perform many multiplications & divisions:
• Encryption, image compression, graphic rendering
• Hardware vs programmed shift-add/sub algorithms
Topics in This Chapter
11.1 Shift-Add Multiplication
11.2 Hardware Multipliers
11.3 Programmed Multiplication
11.4 Shift-Subtract Division
11.5 Hardware Dividers
11.6 Programmed Division
11.1 Shift-Add Multiplication
The multiplicand x and multiplier y produce a bit-matrix of partial products
y0 x 2^0, y1 x 2^1, y2 x 2^2, y3 x 2^3, whose sum is the product z.

Figure 11.1 Multiplication of 4-bit numbers in dot notation.

z^(j+1) = (z^(j) + yj x 2^k) 2^–1   with z^(0) = 0 and z^(k) = z
(add yj x 2^k, then shift right one position)
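A minimal Python sketch of the add/shift recurrence just stated, for unsigned operands; the function name is illustrative.

def shift_add_multiply(x, y, k):
    """Unsigned k-bit multiply via z(j+1) = (z(j) + y_j * x * 2^k) / 2."""
    z = 0                              # double-width partial product, z(0) = 0
    for j in range(k):
        yj = (y >> j) & 1              # next multiplier bit
        z = (z + yj * (x << k)) >> 1   # add x * 2^k (or 0), then shift right
    return z                           # z(k) = x * y

assert shift_add_multiply(10, 11, 4) == 110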
Binary and Decimal Multiplication
Figure 11.2 / Example 11.1 Step-by-step multiplication examples for 4-digit
unsigned numbers, following the add/shift recurrence above.

Binary:  x = 1010, y = 0011. Adding y0 x = 1010 and y1 x = 1010 (y2 = y3 = 0),
each addition followed by a right shift, gives z = z^(4) = 0001 1110
(10 × 3 = 30).

Decimal: x = 3528, y = 4067. Adding y0 x = 24696, y1 x = 21168, y2 x = 0, and
y3 x = 14112, each followed by a right shift, gives z = z^(4) = 1434 8376
(3528 × 4067 = 14,348,376).
Two’s-Complement Multiplication
Figure 11.3 / Example 11.2 Step-by-step multiplication examples for
2's-complement numbers. Partial products are sign-extended, and in the final
step –y3 x 2^4 is added; that is, the multiplicand is subtracted when the
multiplier's sign bit y3 is 1.

x = 1010 (–6), y = 0011 (+3): z = z^(4) = 1110 1110 (–18).
x = 1010 (–6), y = 1011 (–5): the final subtraction of x makes
z = z^(4) = 0001 1110 (+30).
11.2 Hardware Multipliers
The multiplier y and the double-width partial product z^(j) (Hi and Lo
halves) sit in shift registers; each cycle, a mux gated by the next
multiplier bit yj feeds either 0 or the multiplicand x to the adder, and the
sum (with its carry-out) is shifted into the partial product.

Figure 11.4 Hardware multiplier based on the shift-add algorithm.
The Shift Part of Shift-Add
Figure 11.5 Shifting incorporated in the connections to the partial product
register rather than as a separate phase: the adder output is simply written
back one bit position to the right.
High-Radix Multipliers
Radix-4 multiplication in dot notation: each step adds 0, x, 2x, or 3x (a
radix-4 multiplier digit times the multiplicand), halving the number of
add/shift cycles.

z^(j+1) = (z^(j) + yj x 2^k) 4^–1   with z^(0) = 0 and z^(k/2) = z,
where yj is a radix-4 digit (add, then shift right two positions;
assume k even).
Tree Multipliers
(a) Full-tree multiplier: all partial products enter a large, log-depth tree
of carry-save adders, followed by a log-depth final adder.
(b) Partial-tree multiplier: several partial products at a time enter a small
tree of carry-save adders, with the running sum fed back.

Figure 11.6 Schematic diagram for full/partial-tree multipliers.
Array Multipliers
An array of MA (multiply-add) cells — recalling carry-save addition,
Fig. 9.3a — accumulates the rows y0 x, y1 x, y2 x, y3 x; a final row of
FA/HA cells assimilates the remaining carries to produce z7 ... z0. The
straightened dots depict the same multiplication as our original dot
notation.

Figure 11.7 Array multiplier for 4-bit unsigned operands.
11.3 Programmed Multiplication
MiniMIPS instructions related to multiplication
mult   $s0,$s1   # set Hi,Lo to ($s0)×($s1); signed
multu  $s2,$s3   # set Hi,Lo to ($s2)×($s3); unsigned
mfhi   $t0       # set $t0 to (Hi)
mflo   $t1       # set $t1 to (Lo)

Example 11.3
Finding the 32-bit product of 32-bit integers in MiniMIPS:
Multiply; the result will be obtained in Hi,Lo.
For unsigned multiplication: Hi should be all-0s, and Lo holds the 32-bit
result.
For signed multiplication: Hi should be all-0s or all-1s, depending on the
sign bit of Lo.
Emulating a Hardware Multiplier in Software
Example 11.4 (MiniMIPS shift-add program for multiplication)
Register usage, superimposed on the Fig. 11.4 multiplier:
$a0 = multiplicand x;  $a1 = multiplier y;  $v0 = Hi part of z;
$v1 = Lo part of z;  $t0 = carry-out (also holds the LSB of Hi during
shifts);  $t1 = bit j of y;  $t2 = counter. The mux, adder, and part of the
control are emulated in software.

Figure 11.8 Register usage for programmed multiplication superimposed
on the block diagram for a hardware multiplier.
Multiplication When There Is No Multiply Instruction
Example 11.4 (MiniMIPS shift-add program for multiplication)
shamu: move  $v0,$zero        # initialize Hi to 0
       move  $v1,$zero        # initialize Lo to 0
       addi  $t2,$zero,32     # init repetition counter to 32
mloop: move  $t0,$zero        # set c-out to 0 in case of no add
       move  $t1,$a1          # copy ($a1) into $t1
       srl   $a1,1            # halve the unsigned value in $a1
       subu  $t1,$t1,$a1      # subtract ($a1) from ($t1) twice to
       subu  $t1,$t1,$a1      #   obtain LSB of ($a1), or y[j], in $t1
       beqz  $t1,noadd        # no addition needed if y[j] = 0
       addu  $v0,$v0,$a0      # add x to upper part of z
       sltu  $t0,$v0,$a0      # form carry-out of addition in $t0
noadd: move  $t1,$v0          # copy ($v0) into $t1
       srl   $v0,1            # halve the unsigned value in $v0
       subu  $t1,$t1,$v0      # subtract ($v0) from ($t1) twice to
       subu  $t1,$t1,$v0      #   obtain LSB of Hi in $t1
       sll   $t0,$t0,31       # carry-out converted to 1 in MSB of $t0
       addu  $v0,$v0,$t0      # right-shifted $v0 corrected
       srl   $v1,1            # halve the unsigned value in $v1
       sll   $t1,$t1,31       # LSB of Hi converted to 1 in MSB of $t1
       addu  $v1,$v1,$t1      # right-shifted $v1 corrected
       addi  $t2,$t2,-1       # decrement repetition counter by 1
       bne   $t2,$zero,mloop  # if counter > 0, repeat multiply loop
       jr    $ra              # return to the calling program
11.4 Shift-Subtract Division
Dividend z, divisor x, quotient y, remainder s: the subtracted bit-matrix
consists of the terms y3 x 2^3, y2 x 2^2, y1 x 2^1, y0 x 2^0.

Figure 11.9 Division of an 8-bit number by a 4-bit number in dot notation.

z^(j) = 2 z^(j–1) – y(k–j) x 2^k   with z^(0) = z and z^(k) = 2^k s
(shift left, then subtract)
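A minimal Python sketch of the shift/subtract recurrence, using a trial subtraction to choose each quotient bit (restoring division); the function name and overflow check are illustrative.

def shift_subtract_divide(z, x, k):
    """Unsigned restoring division: z (2k bits) / x (k bits) -> (y, s).
    Implements z(j) = 2 z(j-1) - y_{k-j} * x * 2^k."""
    assert (z >> k) < x, "quotient would overflow k bits"
    y = 0
    for _ in range(k):
        z <<= 1                    # shift left
        y <<= 1
        if z >= (x << k):          # trial subtraction decides the quotient bit
            z -= (x << k)
            y |= 1
    return y, z >> k               # z(k) = 2^k * s, so remainder s = z / 2^k

assert shift_subtract_divide(117, 10, 4) == (11, 7)   # Example 11.5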
Integer and Fractional Unsigned Division
Figure 11.10 / Example 11.5 Division examples for binary integers and
decimal fractions. In each step, the shifted partial remainder is compared
against multiples of the divisor to choose the quotient digit.

Binary:  z = 0111 0101, x = 1010; quotient digits y3, y2, y1, y0 = 1, 0, 1, 1
give y = 1011 and remainder s = 0111 (117 = 10 × 11 + 7).

Decimal: z = .1435 1502, x = .4067; quotient digits 3, 5, 2, 8 give
y = .3528 and s = .0000 3126.
Division with Same-Width Operands
Figure 11.11 / Example 11.6 Division examples for 4/4-digit binary integers
and fractions (dividend and divisor of the same width).

Integers:  z = 0000 1101, x = 0101; quotient digits 0, 0, 1, 0 give y = 0010
and remainder s = 0011 (13 = 5 × 2 + 3).

Fractions: z = .0101, x = .1101; quotient digits 0, 1, 1, 0 give y = .0110
and s = .0000 0010.
Signed Division
Method 1 (indirect): strip operand signs, divide, set result signs
Dividend    Divisor     Quotient    Remainder
z =  5      x =  3   ⇒  y =  1      s =  2
z =  5      x = –3   ⇒  y = –1      s =  2
z = –5      x =  3   ⇒  y = –1      s = –2
z = –5      x = –3   ⇒  y =  1      s = –2
Method 2 (direct 2’s complement): develop quotient with digits
–1 and 1, chosen based on signs, convert to digits 0 and 1
Restoring division: perform trial subtraction, choose 0 for q digit
if partial remainder negative
Nonrestoring division: if sign of partial remainder is correct,
then subtract (choose 1 for q digit) else add (choose –1)
11.5 Hardware Dividers
The quotient y and the partial remainder z^(j) (initially z) share shift
registers; a quotient-digit selector sets y(k–j), which enables the divisor x
into the adder (cin = 1: always subtract) to form the trial difference.

Figure 11.12 Hardware divider based on the shift-subtract algorithm.
The Shift Part of Shift-Subtract
Figure 11.13 Shifting incorporated in the connections to the partial
remainder register rather than as a separate phase; the MSB feeds the
quotient-digit logic.
High-Radix Dividers
Radix-4 division in dot notation: each step subtracts 0, x, 2x, or 3x (a
radix-4 quotient digit times the divisor).

z^(j) = 4 z^(j–1) – (y(k–2j+1) y(k–2j))two x 2^k   with z^(0) = z and
z^(k/2) = 2^k s   (shift left two positions, then subtract; assume k even)
Array Dividers
An array of MS (controlled-subtraction) cells processes the dividend bits
z7 ... z0 row by row, producing the quotient digits y3 ... y0 and the
remainder s3 ... s0; the straightened dots depict the same division as our
original dot notation.

Figure 11.14 Array divider for 8/4-bit unsigned integers.
11.6 Programmed Division
MiniMIPS instructions related to division
div
divu
mfhi
mflo
$s0,$s1
$s2,$s3
$t0
$t1
#
#
#
#
Lo = quotient, Hi = remainder
unsigned version of division
set $t0 to (Hi)
set $t1 to (Lo)
Example 11.7
Compute z mod x, where z (singed) and x > 0 are integers
Divide; remainder will be obtained in Hi
if remainder is negative,
then add |x| to (Hi) to obtain z mod x
else Hi holds z mod x
Jan. 2011
Computer Architecture, The Arithmetic/Logic Unit
Slide 71
Emulating a Hardware Divider in Software
Example 11.8 (MiniMIPS shift-add program for division)
Register usage, superimposed on the Fig. 11.12 divider:
$a0 = divisor x;  $a1 = quotient y;  $v0 = Hi part of the partial remainder
z^(j) (initially z);  $v1 = Lo part of z;  $t0 = MSB of Hi;
$t1 = bit k – j of y;  $t2 = counter. The quotient-digit selector, mux, and
adder (always subtracting) are emulated in software.

Figure 11.15 Register usage for programmed division superimposed
on the block diagram for a hardware divider.
Division When There Is No Divide Instruction
Example 11.8 (MiniMIPS shift-add program for division)

shsdi: move  $v0,$a2          # initialize Hi to ($a2)
       move  $v1,$a3          # initialize Lo to ($a3)
       addi  $t2,$zero,32     # initialize repetition counter to 32
dloop: slt   $t0,$v0,$zero    # copy MSB of Hi into $t0
       sll   $v0,$v0,1        # left-shift the Hi part of z
       slt   $t1,$v1,$zero    # copy MSB of Lo into $t1
       or    $v0,$v0,$t1      # move MSB of Lo into LSB of Hi
       sll   $v1,$v1,1        # left-shift the Lo part of z
       sge   $t1,$v0,$a0      # quotient digit is 1 if (Hi) ≥ x,
       or    $t1,$t1,$t0      #   or if MSB of Hi was 1 before shifting
       sll   $a1,$a1,1        # shift y to make room for new digit
       or    $a1,$a1,$t1      # copy y[k-j] into LSB of $a1
       beq   $t1,$zero,nosub  # if y[k-j] = 0, do not subtract
       subu  $v0,$v0,$a0      # subtract divisor x from Hi part of z
nosub: addi  $t2,$t2,-1       # decrement repetition counter by 1
       bne   $t2,$zero,dloop  # if counter > 0, repeat divide loop
       move  $v1,$a1          # copy the quotient y into $v1
       jr    $ra              # return to the calling program
Divider vs Multiplier: Hardware Similarities
The shift-subtract divider (Fig. 11.12) and the shift-add multiplier
(Fig. 11.4) share the same structure: shift registers for y and the
double-width z, a mux enabling x into an adder, and similar control.
Likewise, turning the array multiplier of Fig. 11.7 upside-down and
replacing its MA cells with MS cells yields the array divider of Fig. 11.14.
12 Floating-Point Arithmetic
Floating-point is no longer reserved for high-end machines
• Multimedia and signal processing require flp arithmetic
• Details of standard flp format and arithmetic operations
Topics in This Chapter
12.1 Rounding Modes
12.2 Special Values and Exceptions
12.3 Floating-Point Addition
12.4 Other Floating-Point Operations
12.5 Floating-Point Instructions
12.6 Result Precision and Errors
12.1 Rounding Modes
IEEE 754 format (1.f × 2^e, plus ±0, ±∞, NaN):
Short (32-bit): sign, 8 exponent bits (bias = 127, e from –126 to 127),
23 bits for the fractional part (plus hidden 1 in the integer part).
Long (64-bit):  sign, 11 exponent bits (bias = 1023, e from –1022 to 1023),
52 bits for the fractional part (plus hidden 1 in the integer part).
Denormals (0.f × 2^emin) allow graceful underflow.

Figure 12.1 Distribution of floating-point numbers on the real line: denser
near ±0 and sparser toward ±max, with overflow regions beyond ±max and
underflow regions between ±min and 0.
Round-to-Nearest (Even)
Figure 12.2 Two round-to-nearest-integer functions for x in [–4, 4]:
(a) round to nearest even integer, rtnei(x); (b) round to nearest integer,
rtni(x).
Directed Rounding

Figure 12.3 Two directed round-to-nearest-integer functions for x in
[–4, 4]: (a) round inward (toward 0) to nearest integer, ritni(x);
(b) round upward (toward +∞) to nearest integer, rutni(x).
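A minimal Python sketch of the three integer-rounding functions named above; the implementations are illustrative, built on the standard math module.

import math

def rtnei(x):
    """Round to nearest, ties to the even integer (the IEEE default rule)."""
    f = math.floor(x)
    frac = x - f
    if frac > 0.5 or (frac == 0.5 and f % 2 == 1):
        return f + 1
    return f

def rutni(x):        # round upward (toward +inf)
    return math.ceil(x)

def ritni(x):        # round inward (toward 0)
    return math.trunc(x)

assert [rtnei(v) for v in (0.5, 1.5, 2.5, -0.5)] == [0, 2, 2, 0]
assert (rutni(-1.5), ritni(-1.5)) == (-1, -1)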
12.2 Special Values and Exceptions
Zeros, infinities, and NaNs (not a number)
± 0 Biased exponent = 0, significand = 0 (no hidden 1)
± ∞ Biased exponent = 255 (short) or 2047 (long), significand = 0
NaN Biased exponent = 255 (short) or 2047 (long), significand ≠ 0
Arithmetic operations with special operands
(+0) + (+0) = (+0) – (–0) = +0
(+0) × (+5) = +0
(+0) / (–5) = –0
(+∞) + (+∞) = +∞
x – (+∞) = –∞
(+∞) × x = ±∞, depending on the sign of x
x / (+∞) = ±0, depending on the sign of x
√(+∞) = +∞
Exceptions
Undefined results lead to NaN (not a number)
(±0) / (±0) = NaN
(+∞) + (–∞) = NaN
(±0) × (±∞) = NaN
(±∞) / (±∞) = NaN
Arithmetic operations and comparisons with NaNs
NaN + x = NaN
NaN + NaN = NaN
NaN × 0 = NaN
NaN × NaN = NaN
NaN < 2 → false
NaN = NaN → false
NaN ≠ (+∞) → true
NaN ≠ NaN → true
Examples of invalid-operation exceptions
Addition:        (+∞) + (–∞)
Multiplication:  0 × ∞
Division:        0 / 0 or ∞ / ∞
Square-root:     operand < 0
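Python doubles follow these IEEE rules, so the special-operand identities above can be checked directly; this is a small illustrative demo (Python raises an exception for the float 0/0 case, so that one is omitted).

import math

inf, nan = float('inf'), float('nan')

assert 5.0 / inf == 0.0 and (-5.0) / inf == -0.0   # x / (+inf) = +-0
assert inf + inf == inf
assert math.isnan(inf - inf)                       # (+inf) + (-inf) = NaN
assert math.isnan(0.0 * inf)                       # (+-0) x (+-inf) = NaN
assert not (nan == nan) and (nan != nan)           # NaN ≠ NaN is true
assert not (nan < 2)                               # ordered comparisons: false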
12.3 Floating-Point Addition
(±2^e1 s1) + (±2^e2 s2) = (±2^e1 s1) + (±2^e1 (s2 / 2^(e1–e2)))
                        = ±2^e1 (s1 ± s2 / 2^(e1–e2))

Numbers to be added (the operand with the smaller exponent is preshifted):
x = 2^5 × 1.00101101
y = 2^1 × 1.11101101

Operands after alignment shift:
x = 2^5 × 1.00101101
y = 2^5 × 0.000111101101

Result of addition:  s = 2^5 × 1.010010111101  (extra bits to be rounded off)
Rounded sum:         s = 2^5 × 1.01001100

Figure 12.4 Alignment shift and rounding in floating-point addition.
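A minimal Python sketch of the alignment/add/normalize/round steps of Figure 12.4, for positive operands only; representing significands as scaled values with a fixed number of fraction bits is an illustrative assumption.

def fp_add(e1, s1, e2, s2, frac_bits=8):
    """(+2^e1 * s1) + (+2^e2 * s2); significands are integers scaled by
    2^frac_bits (so 1.00101101 -> 0b100101101)."""
    if e1 < e2:
        e1, s1, e2, s2 = e2, s2, e1, s1
    s = s1 + s2 / 2 ** (e1 - e2)        # preshift the smaller operand, add
    while s >= 2 ** (frac_bits + 1):    # normalize if the sum reached [2, 4)
        s /= 2
        e1 += 1
    return e1, round(s)                 # round off the extra bits (to nearest)

# Figure 12.4: x = 2^5 x 1.00101101, y = 2^1 x 1.11101101
e, s = fp_add(5, 0b100101101, 1, 0b111101101)
assert (e, s) == (5, 0b101001100)       # 2^5 x 1.01001100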
Hardware for Floating-Point Addition

Unpack the two inputs into signs, exponents, and significands; possibly swap
and complement; align the significands by the exponent difference; add; then
normalize and round, with control and sign logic steering each step; finally
pack the sign, exponent, and significand of the output.

Figure 12.5 Simplified schematic of a floating-point adder.
12.4 Other Floating-Point Operations
Floating-point multiplication (overflow or underflow possible):
(±2^e1 s1) × (±2^e2 s2) = ±2^(e1+e2) (s1 × s2)
Product of significands is in [1, 4); if it is in [2, 4), halve to normalize
(increment the exponent).

Floating-point division (overflow or underflow possible):
(±2^e1 s1) / (±2^e2 s2) = ±2^(e1–e2) (s1 / s2)
Ratio of significands is in (1/2, 2); if it is in (1/2, 1), double to
normalize (decrement the exponent).

Floating-point square-rooting (normalization not needed):
(2^e s)^1/2 = 2^(e/2) s^1/2               when e is even
            = 2^((e–1)/2) (2s)^1/2        when e is odd
Hardware for Floating-Point Multiplication and Division

Unpack the inputs; multiply or divide the significands while adding or
subtracting the exponents; normalize and round; pack the result.

Figure 12.6 Simplified schematic of a floating-point multiply/divide unit.
12.5 Floating-Point Instructions
Floating-point arithmetic instructions for MiniMIPS:
add.s  $f0,$f8,$f10   # set $f0 to ($f8) +fp ($f10)
sub.d  $f0,$f8,$f10   # set $f0 to ($f8) –fp ($f10)
mul.d  $f0,$f8,$f10   # set $f0 to ($f8) ×fp ($f10)
div.s  $f0,$f8,$f10   # set $f0 to ($f8) /fp ($f10)
neg.s  $f0,$f8        # set $f0 to –($f8)

Figure 12.7 The common floating-point instruction format for MiniMIPS and
components for arithmetic instructions: op, ex (extension), ft (source
register 2), fs (source register 1), fd (destination register), and fn
(add.* = 0, sub.* = 1, mul.* = 2, div.* = 3, neg.* = 7). The extension (ex)
field distinguishes single (* = s, ex = 0) from double (* = d, ex = 1)
operands.
The Floating-Point Unit in MiniMIPS
Alongside the EIU (execution & integer unit, with registers $0–$31, the ALU
of Chapter 10, and the integer mul/div unit of Chapter 11), Coprocessor 1 is
the FPU (floating-point unit, Chapter 12) with its own registers $0–$31;
pairs of registers, beginning with an even-numbered one, are used for double
operands. Coprocessor 0 is the TMU (trap & memory unit: BadVaddr, Status,
Cause, EPC). Memory holds up to 2^30 words (4 B per location).

Figure 5.1 Memory and processing subsystems for MiniMIPS.
Floating-Point Format Conversions
MiniMIPS instructions for number format conversion:
cvt.s.w  $f0,$f8   # set $f0 to single(integer $f8)
cvt.d.w  $f0,$f8   # set $f0 to double(integer $f8)
cvt.d.s  $f0,$f8   # set $f0 to double($f8)
cvt.s.d  $f0,$f8   # set $f0 to single($f8,$f9)
cvt.w.s  $f0,$f8   # set $f0 to integer($f8)
cvt.w.d  $f0,$f8   # set $f0 to integer($f8,$f9)

Figure 12.8 Floating-point instructions for format conversion in MiniMIPS.
The fn field encodes the destination format (s = 32, d = 33, w = 36); the
ex field encodes the source (*.w = 0, w.s = 0, w.d = 1, *.* = 1).
Floating-Point Data Transfers
MiniMIPS instructions for floating-point load, store, and move:
lwc1   $f8,40($s3)   # load mem[40+($s3)] into $f8
swc1   $f8,A($s3)    # store ($f8) into mem[A+($s3)]
mov.s  $f0,$f8       # load $f0 with ($f8)
mov.d  $f0,$f8       # load $f0,$f1 with ($f8,$f9)
mfc1   $t0,$f12      # load $t0 with ($f12)
mtc1   $f8,$t4       # load $f8 with ($t4)

Figure 12.9 Instructions for floating-point data movement in MiniMIPS
(mov.* uses fn = 6 with ex: s = 0, d = 1; mfc1 and mtc1 are encoded with
field values 0 and 4, respectively).
Floating-Point Branches and Comparisons
MiniMIPS instructions for floating-point branch and comparison:

bc1t    L          # branch on fp flag true
bc1f    L          # branch on fp flag false
c.eq.*  $f0,$f8    # if ($f0) = ($f8), set flag to “true”
c.lt.*  $f0,$f8    # if ($f0) < ($f8), set flag to “true”
c.le.*  $f0,$f8    # if ($f0) ≤ ($f8), set flag to “true”

Figure 12.10 Floating-point branch and comparison instructions in MiniMIPS:
bc1? uses ex = 8 (with true = 1, false = 0) and a 16-bit offset; the
comparisons use fn: c.eq.* = 50, c.lt.* = 60, c.le.* = 62, with ex
distinguishing single (s = 0) from double (d = 1).
Floating-Point Instructions of MiniMIPS

Table 12.1 (* stands for s or d; in the ex column, # means 0 for single and
1 for double)

Class            Instruction                 Usage                ex   fn
Copy             Move s/d registers          mov.*  fd,fs         #     6
                 Move from coprocessor 1     mfc1   rt,rd         0
                 Move to coprocessor 1       mtc1   rd,rt         4
Arithmetic       Add single/double           add.*  fd,fs,ft      #     0
                 Subtract single/double      sub.*  fd,fs,ft      #     1
                 Multiply single/double      mul.*  fd,fs,ft      #     2
                 Divide single/double        div.*  fd,fs,ft      #     3
                 Negate single/double        neg.*  fd,fs         #     7
                 Compare equal s/d           c.eq.* fs,ft         #    50
                 Compare less s/d            c.lt.* fs,ft         #    60
                 Compare less or eq s/d      c.le.* fs,ft         #    62
Conversions      Convert integer to single   cvt.s.w fd,fs        0    32
                 Convert integer to double   cvt.d.w fd,fs        0    33
                 Convert single to double    cvt.d.s fd,fs        1    33
                 Convert double to single    cvt.s.d fd,fs        1    32
                 Convert single to integer   cvt.w.s fd,fs        0    36
                 Convert double to integer   cvt.w.d fd,fs        1    36
Memory access    Load word coprocessor 1     lwc1   ft,imm(rs)
                 Store word coprocessor 1    swc1   ft,imm(rs)
Control transfer Branch coproc 1 true        bc1t   L             8
                 Branch coproc 1 false       bc1f   L             8
12.6 Result Precision and Errors
Example 12.4
Laws of algebra may not hold in floating-point arithmetic. For example,
the following computations show that the associative law of addition,
(a + b) + c = a + (b + c), is violated for the three numbers shown.
a = –2^5 × 1.10101011,  b = 2^5 × 1.10101110,  c = –2^–2 × 1.01100101

Adding a and b first:
a + b = 2^5 × 0.00000011 = 2^–2 × 1.10000000
(a + b) + c = 2^–2 × 0.00011011;  Sum = 2^–6 × 1.10110000

Adding b and c first (after preshifting c):
b + c = 2^5 × 1.101010110011011, rounded to 2^5 × 1.10101011
a + (b + c) = 2^5 × 0.00000000;  Sum = 0 (normalized to the special code
for 0)
Error Control and Certifiable Arithmetic
Catastrophic cancellation in subtracting almost equal numbers:
Example: the area of a needlelike triangle with sides a, b, c,
A = [s(s – a)(s – b)(s – c)]^1/2 with s = (a + b + c)/2,
involves subtracting nearly equal quantities when a ≈ b + c.
Possible remedies
Carry extra precision in intermediate results (guard digits):
commonly used in calculators
Use alternate formula that does not produce cancellation errors
Certifiable arithmetic with intervals
A number is represented by its lower and upper bounds [xl, xu]
Example of arithmetic: [xl, xu] +interval [yl, yu] = [xl +fp∇ yl, xu +fpΔ yu],
where ∇ and Δ denote floating-point addition rounded downward and upward,
respectively
Evaluation of Elementary Functions
Approximating polynomials
ln x = 2(z + z^3/3 + z^5/5 + z^7/7 + ...)   where z = (x – 1)/(x + 1)
e^x = 1 + x/1! + x^2/2! + x^3/3! + x^4/4! + ...
cos x = 1 – x^2/2! + x^4/4! – x^6/6! + x^8/8! – ...
tan^–1 x = x – x^3/3 + x^5/5 – x^7/7 + x^9/9 – ...
Iterative (convergence) schemes
For example, beginning with an estimate for x^1/2, the following iterative
formula provides a more accurate estimate in each step:
q^(i+1) = 0.5 (q^(i) + x/q^(i))
Table lookup (with interpolation)
A pure table lookup scheme results in huge tables (impractical);
hence, often a hybrid approach, involving interpolation, is used.
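A minimal Python sketch of the convergence scheme above for the square root; the starting estimate and iteration count are illustrative choices.

def newton_sqrt(x, iterations=6):
    """Converge on x**0.5 via q(i+1) = 0.5 * (q(i) + x / q(i))."""
    q = x if x >= 1 else 1.0      # any positive starting estimate works
    for _ in range(iterations):
        q = 0.5 * (q + x / q)     # the error roughly squares each step
    return q

assert abs(newton_sqrt(2.0) - 2.0 ** 0.5) < 1e-12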
Function Evaluation by Table Lookup
The k-bit input x is split into xH (h bits) and xL (k – h bits). Two tables
indexed by xH supply a and b, the coefficients of the best linear
approximation a + b × xL to f(x) in that subinterval; a multiply and an add
then produce the output.

Figure 12.12 Function evaluation by table lookup and linear interpolation.
Part IV
Data Path and Control
A Few Words About Where We Are Headed
Performance = 1 / Execution time
simplified to 1 / CPU execution time
CPU execution time = Instructions × CPI / (Clock rate)
Performance = Clock rate / ( Instructions × CPI )
• Define an instruction set; make it simple enough to require a small number
  of cycles and allow a high clock rate, but not so simple that we need many
  instructions, even for very simple tasks (Chap 5-8)
• Design an ALU for arithmetic & logic ops (Chap 9-12)
• Design hardware for CPI = 1; seek improvements with CPI > 1 (Chap 13-14)
• Try to achieve CPI = 1 with a clock as high as that of CPI > 1 designs;
  is CPI < 1 feasible? (Chap 15-16)
• Design memory & I/O structures to support ultrahigh-speed CPUs (Chap 17-24)
IV Data Path and Control
Design a simple computer (MicroMIPS) to learn about:
• Data path – part of the CPU where data signals flow
• Control unit – guides data signals through data path
• Pipelining – a way of achieving greater performance
Topics in This Part
Chapter 13 Instruction Execution Steps
Chapter 14 Control Unit Synthesis
Chapter 15 Pipelined Data Paths
Chapter 16 Pipeline Performance Limits
13 Instruction Execution Steps
A simple computer executes instructions one at a time
• Fetches an instruction from the loc pointed to by PC
• Interprets and executes the instruction, then repeats
Topics in This Chapter
13.1 A Small Set of Instructions
13.2 The Instruction Execution Unit
13.3 A Single-Cycle Data Path
13.4 Branching and Jumping
13.5 Deriving the Control Signals
13.6 Performance of the Single-Cycle Design
13.1 A Small Set of Instructions
R format: op (6 bits) | rs (5) | rt (5) | rd (5) | sh (5) | fn (6)
          opcode | source 1 or base | source 2 or dest’n | destination |
          unused | opcode extension
I format: op | rs | rt | imm (operand / offset, 16 bits)
J format: op | jta (jump target address, 26 bits)

Fig. 13.1 MicroMIPS instruction formats and naming of the various fields.
We will refer to this diagram later.
Seven R-format ALU instructions (add, sub, slt, and, or, xor, nor)
Six I-format ALU instructions (lui, addi, slti, andi, ori, xori)
Two I-format memory access instructions (lw, sw)
Three I-format conditional branch instructions (bltz, beq, bne)
Four unconditional jump instructions (j, jr, jal, syscall)
The MicroMIPS Instruction Set

Table 13.1

Class            Instruction                Usage              op   fn
Copy             Load upper immediate       lui  rt,imm        15
Arithmetic       Add                        add  rd,rs,rt       0   32
                 Subtract                   sub  rd,rs,rt       0   34
                 Set less than              slt  rd,rs,rt       0   42
                 Add immediate              addi rt,rs,imm      8
                 Set less than immediate    slti rd,rs,imm     10
Logic            AND                        and  rd,rs,rt       0   36
                 OR                         or   rd,rs,rt       0   37
                 XOR                        xor  rd,rs,rt       0   38
                 NOR                        nor  rd,rs,rt       0   39
                 AND immediate              andi rt,rs,imm     12
                 OR immediate               ori  rt,rs,imm     13
                 XOR immediate              xori rt,rs,imm     14
Memory access    Load word                  lw   rt,imm(rs)    35
                 Store word                 sw   rt,imm(rs)    43
Control transfer Jump                       j    L              2
                 Jump register              jr   rs             0    8
                 Branch less than 0         bltz rs,L           1
                 Branch equal               beq  rs,rt,L        4
                 Branch not equal           bne  rs,rt,L        5
                 Jump and link              jal  L              3
                 System call                syscall             0   12
13.2 The Instruction Execution Unit
All 22 instructions flow through the same stages: the next-address block
updates PC (handling beq, bne, bltz, jr, j, jal, and syscall); the
instruction cache supplies inst, whose fields rs, rt, rd, imm, jta, op, and
fn (Fig. 13.1) feed the register file, ALU, and control; the ALU serves the
12 A/L instructions plus lui, lw, and sw; and the data cache handles loads
and stores. Separate instruction and data caches make this a Harvard
architecture.

Fig. 13.2 Abstract view of the instruction execution unit for MicroMIPS.
For naming of instruction fields, see Fig. 13.1.
13.3 A Single-Cycle Data Path
The data path has five sections: instruction fetch (PC, incrementer,
next-address logic, instruction cache); register access / decode (register
file, with RegDst choosing rt, rd, or $31 and RegWrite enabling writes);
ALU operation (ALUSrc selecting (rt) or the sign-extended imm, plus ALUFunc);
data access (data cache with DataRead and DataWrite); and register writeback
(RegInSrc selecting the data-cache output, ALU output, or IncrPC).

Fig. 13.3 Key elements of the single-cycle MicroMIPS data path.
An ALU for MicroMIPS: the multifunction ALU of Fig. 10.19, with the function
classes now 00 = lui, 01 = set less, 10 = arithmetic, 11 = logic; the lui
class routes the immediate through the shifter. We use only 5 of the 8
control signals (no general shifts).

Fig. 10.19 A multifunction ALU with 8 control signals (2 for function class,
1 arithmetic, 3 shift, 2 logic) specifying the operation.
13.4 Branching and Jumping
Update options for PC (the lowest 2 bits of PC are always 00):
(PC)31:2 + 1           Default option
(PC)31:2 + 1 + imm     When instruction is branch and condition is met
(PC)31:28 | jta        When instruction is j or jal
(rs)31:2               When the instruction is jr
SysCallAddr            Start address of an operating system routine

A branch condition checker examines (rs) and (rt) under BrType; BrTrue
selects the branch adder's output, and PCSrc chooses among IncrPC, the jump
target, (rs)31:2, and SysCallAddr for NextPC.

Fig. 13.4 Next-address logic for MicroMIPS (see top part of Fig. 13.3).
13.5 Deriving the Control Signals
Table 13.2 Control signals for the single-cycle MicroMIPS implementation.

Unit        Control signal         0            1           2           3
Reg file    RegWrite               Don’t write  Write
            RegDst1, RegDst0       rt           rd          $31
            RegInSrc1, RegInSrc0   Data out     ALU out     IncrPC
ALU         ALUSrc                 (rt)         imm
            Add′Sub                Add          Subtract
            LogicFn1, LogicFn0     AND          OR          XOR         NOR
            FnClass1, FnClass0     lui          Set less    Arithmetic  Logic
Data cache  DataRead               Don’t read   Read
            DataWrite              Don’t write  Write
Next addr   BrType1, BrType0       No branch    beq         bne         bltz
            PCSrc1, PCSrc0         IncrPC       jta         (rs)        SysCallAddr
Single-Cycle Data Path, Repeated for Reference

Outcome of an executed instruction: a new value loaded into PC, and possibly
a new value in a register or memory location.

Fig. 13.3 Key elements of the single-cycle MicroMIPS data path (repeated).
Control Signal Settings

Table 13.3 Control signal settings for the 22 MicroMIPS instructions: for
each instruction, the op and fn fields determine RegWrite, RegDst, RegInSrc,
ALUSrc, Add′Sub, LogicFn, FnClass, DataRead, DataWrite, BrType, and PCSrc,
with the values encoded as in Table 13.2.
Control Signals in the Single-Cycle Data Path
Fig. 13.3 annotated with example control signal values: for lui
(op = 001111) and slt (op = 000000, fn = 101010), each signal from
Table 13.3 is shown at the point in the data path where it takes effect.
Instruction Decoding

The op field drives a 6-to-64 op decoder whose outputs include bltzInst,
jInst, jalInst, beqInst, bneInst, addiInst, sltiInst, andiInst, oriInst,
xoriInst, luiInst, lwInst, swInst, and RtypeInst; for R-type instructions
(op = 0), the fn field drives a second 6-to-64 decoder producing jrInst,
syscallInst, addInst, subInst, andInst, orInst, xorInst, norInst, and
sltInst.

Fig. 13.5 Instruction decoder for MicroMIPS built of two 6-to-64 decoders.
Table 13.3 (control signal settings), repeated here for reference.
Control Signal Generation
Auxiliary signals identifying instruction classes
arithInst = addInst ∨ subInst ∨ sltInst ∨ addiInst ∨ sltiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
immInst = luiInst ∨ addiInst ∨ sltiInst ∨ andiInst ∨ oriInst ∨ xoriInst
Example logic expressions for control signals
RegWrite = luiInst ∨ arithInst ∨ logicInst ∨ lwInst ∨ jalInst
ALUSrc = immInst ∨ lwInst ∨ swInst
Add′Sub = subInst ∨ sltInst ∨ sltiInst
DataRead = lwInst
PCSrc0 = jInst ∨ jalInst ∨ syscallInst

The control block forms all such signals from the decoder outputs
(addInst, subInst, jInst, ..., sltInst).
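A hypothetical Python sketch of this signal derivation, standing in for the one-hot decoder outputs of Fig. 13.5 with a mnemonic string; it mirrors the OR expressions above exactly.

def control_signals(inst):
    """inst: the mnemonic of the decoded instruction."""
    arith = inst in ('add', 'sub', 'slt', 'addi', 'slti')
    logic = inst in ('and', 'or', 'xor', 'nor', 'andi', 'ori', 'xori')
    imm   = inst in ('lui', 'addi', 'slti', 'andi', 'ori', 'xori')
    return {
        'RegWrite': inst == 'lui' or arith or logic or inst in ('lw', 'jal'),
        'ALUSrc':   imm or inst in ('lw', 'sw'),
        "Add'Sub":  inst in ('sub', 'slt', 'slti'),
        'DataRead': inst == 'lw',
        'PCSrc0':   inst in ('j', 'jal', 'syscall'),
    }

assert control_signals('lw') == {'RegWrite': True, 'ALUSrc': True,
                                 "Add'Sub": False, 'DataRead': True,
                                 'PCSrc0': False}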
Putting It All Together
The complete single-cycle design combines the next-address logic of
Fig. 13.4 (feeding NextPC), the multifunction ALU of Fig. 10.19 (with lui
routed through its shifter), and the data path of Fig. 13.3, all steered by
the control block's signals (RegDst, RegWrite, ALUSrc, ALUFunc, RegInSrc,
DataRead, DataWrite, Br&Jump).
13.6 Performance of the Single-Cycle Design
An example combinational-logic data path to compute z := (u + v)(w – x) / y

Add/Sub latency: 2 ns;  Multiply latency: 6 ns;  Divide latency: 15 ns;
Total latency: 2 + 6 + 15 = 23 ns

Note that the divider gets its correct inputs after ≅9 ns, but this won’t
cause a problem if we allow enough total time.

Beginning with inputs u, v, w, x, and y stored in registers, the entire
computation can be completed in ≅25 ns, allowing 1 ns each for register
readout and write.
Performance Estimation for Single-Cycle MicroMIPS

Instruction access    2 ns
Register read         1 ns
ALU operation         2 ns
Data cache access     2 ns
Register write        1 ns
Total                 8 ns      Single-cycle clock = 125 MHz

R-type   44%   6 ns
Load     24%   8 ns
Store    12%   7 ns
Branch   18%   5 ns
Jump      2%   3 ns
Weighted mean ≅ 6.36 ns

Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write
step as a separate block) so as to better visualize the critical-path
latencies.
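The weighted mean above is a one-line computation; this small Python check reproduces it from the instruction mix:

# Instruction mix and per-class latency from the slide above.
mix = {'R-type': (0.44, 6), 'Load': (0.24, 8), 'Store': (0.12, 7),
       'Branch': (0.18, 5), 'Jump': (0.02, 3)}   # (fraction, ns)

weighted_mean = sum(f * ns for f, ns in mix.values())
assert abs(weighted_mean - 6.36) < 1e-9

# The single-cycle clock must still accommodate the slowest class
# (8 ns -> 125 MHz), so the 6.36 ns mean goes unexploited; a multicycle
# design can reclaim the difference.
print(f'Weighted mean latency = {weighted_mean:.2f} ns')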
How Good is Our Single-Cycle Design?

A clock rate of 125 MHz is not impressive. How does this compare with
current processors on the market? Not bad, where latency is concerned: a
2.5 GHz processor with 20 or so pipeline stages has a latency of about
0.4 ns/cycle × 20 cycles = 8 ns.

Throughput, however, is much better for the pipelined processor: up to 20
times better with single issue, and perhaps up to 100 times better with
multiple issue.
14 Control Unit Synthesis
The control unit for the single-cycle design is memoryless
• Problematic when instructions vary greatly in complexity
• Multiple cycles needed when resources must be reused
Topics in This Chapter
14.1 A Multicycle Implementation
14.2 Choosing the Clock Cycle
14.3 The Control State Machine
14.4 Performance of the Multicycle Design
14.5 Microprogramming
14.6 Exception Handling
14.1 A Multicycle Implementation
Dentist-appointment analogy: a single-cycle schedule must allot every patient
the duration of the longest treatment (assume one hour), while a multicycle
schedule books each patient for only the time needed.
Single-Cycle vs. Multicycle MicroMIPS
With a single cycle, every instruction is allotted the worst-case time; with
multiple short cycles, instructions 1-4 take only the 3, 5, 3, and 4 cycles
they need, and the difference between time allotted and time needed is saved.

Fig. 14.1 Single-cycle versus multicycle instruction execution.
A Multicycle Data Path
A single cache holds both instructions and data (von Neumann / Princeton
architecture). The fetched word is latched in an instruction register, and
the x, y, z, and Data registers hold values between cycles around the
register file and the ALU.

Fig. 14.2 Abstract view of a multicycle instruction execution unit for
MicroMIPS. For naming of instruction fields, see Fig. 13.1.
Multicycle Data Path with Control Signals Shown
Three major changes relative to the single-cycle data path:
1. The instruction and data caches are combined (Inst′Data selects PC or the
   computed address).
2. The ALU performs double duty, also handling PC incrementing and address
   calculation (ALUSrcX and ALUSrcY select its inputs).
3. Registers (Inst Reg, x, y, z, Data Reg) are added for intercycle data.

Fig. 14.3 Key elements of the multicycle MicroMIPS data path.
14.2 Clock Cycle and Control Signals
Table 14.1 Control signals for the multicycle MicroMIPS implementation.

Unit        Control signal         0            1            2        3
Program     JumpAddr               jta          SysCallAddr
counter     PCSrc1, PCSrc0         Jump addr    x reg        z reg    ALU out
            PCWrite                Don’t write  Write
Cache       Inst′Data              PC           z reg
            MemRead                Don’t read   Read
            MemWrite               Don’t write  Write
            IRWrite                Don’t write  Write
Register    RegWrite               Don’t write  Write
file        RegDst1, RegDst0       rt           rd           $31
            RegInSrc1, RegInSrc0   Data reg     z reg        PC
ALU         ALUSrcX                PC           x reg
            ALUSrcY1, ALUSrcY0     4            y reg        imm      4 × imm
            Add′Sub                Add          Subtract
            LogicFn1, LogicFn0     AND          OR           XOR      NOR
            FnClass1, FnClass0     lui          Set less     Arithmetic  Logic
Multicycle Data Path, Repeated for Reference
26
/
30
/
4 MSBs
Corrections are
shown in red
Inst Reg
rt
0
rd 1
31 2
Cache
PCWrite
MemWrite
MemRead
Fig. 14.3
Feb. 2011
Reg
file
op
IRWrite
(rt)
imm 16
/
Data Reg
Inst′Data
(rs)
0
12
Data
fn
ALUZero
x Mux
ALUOvfl
0
Zero
z Reg
1
Ovfl
x Reg
rs
PC
0
1
SysCallAddr
jta
Address
32 y Reg
SE /
RegInSrc
RegDst
0
1
RegWrite
y Mux
4
0
1
2
×4 3
ALUSrcX
ALUSrcY
30
×4
ALU
Func
ALU out
ALUFunc
PCSrc
JumpAddr
Key elements of the multicycle MicroMIPS data path.
Computer Architecture, Data Path and Control
0
1
2
3
Slide 29
Execution Cycles

Table 14.2 Execution cycles for multicycle MicroMIPS.

Cycle 1, fetch & PC incr (any instruction): read out the instruction, write
it into the instruction register, and increment PC. Inst′Data = 0,
MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘+’,
PCSrc = 3, PCWrite = 1.

Cycle 2, decode & reg read (any): read out rs & rt into the x & y registers;
compute the branch address and save it in the z register. ALUSrcX = 0,
ALUSrcY = 3, ALUFunc = ‘+’.

Cycle 3, ALU operation & PC update:
  ALU type: perform the ALU operation and save the result in z.
    ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc varies.
  Load/Store: add base and offset values, save in z. ALUSrcX = 1,
    ALUSrcY = 2, ALUFunc = ‘+’.
  Branch: if (x reg) =, ≠, or < (y reg), set PC to the branch target
    address. ALUSrcX = 1, ALUSrcY = 1, ALUFunc = ‘−’, PCSrc = 2,
    PCWrite = ALUZero, ALUZero′, or ALUOut31.
  Jump: set PC to the target address jta, SysCallAddr, or (rs).
    JumpAddr = 0 or 1, PCSrc = 0 or 1, PCWrite = 1.

Cycle 4, reg write or memory access:
  ALU type: write z into rd. RegDst = 1, RegInSrc = 1, RegWrite = 1.
  Load: read memory into the data register. Inst′Data = 1, MemRead = 1.
  Store: copy the y register into memory. Inst′Data = 1, MemWrite = 1.

Cycle 5, reg write for lw: copy the data register into rt. RegDst = 0,
RegInSrc = 0, RegWrite = 1.
14.3 The Control State Machine
Execution starts in State 0 (cycle 1); after State 1 (cycle 2), the machine
branches on the instruction class. State 1 speculatively calculates the
branch address.

State 0 (fetch): Inst′Data = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0,
  ALUSrcY = 0, ALUFunc = ‘+’, PCSrc = 3, PCWrite = 1
State 1 (decode): ALUSrcX = 0, ALUSrcY = 3, ALUFunc = ‘+’
State 5 (jump/branch): ALUSrcX = 1, ALUSrcY = 1, ALUFunc = ‘−’,
  JumpAddr = %, PCSrc = @, PCWrite = #
State 2 (lw/sw address): ALUSrcX = 1, ALUSrcY = 2, ALUFunc = ‘+’
State 3 (lw memory read): Inst′Data = 1, MemRead = 1
State 4 (lw register write): RegDst = 0, RegInSrc = 0, RegWrite = 1
State 6 (sw memory write): Inst′Data = 1, MemWrite = 1
State 7 (ALU-type operation): ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc varies
State 8 (ALU-type register write): RegDst = 0 or 1, RegInSrc = 1,
  RegWrite = 1

Notes for State 5:
% 0 for j or jal, 1 for syscall, don’t-care for other instructions
@ 0 for j, jal, and syscall; 1 for jr; 2 for branches
# 1 for j, jr, jal, and syscall; ALUZero (′) for beq (bne); bit 31 of ALUout
  for bltz
For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1

Note for State 7: ALUFunc is determined based on the op and fn fields.

Fig. 14.4 The control state machine for multicycle MicroMIPS.
State and Instruction Decoding
Fig. 14.5 State and instruction decoders for multicycle MicroMIPS.

st decoder (4-bit control state st): states 0-8 yield ControlSt0 through ControlSt8.
op decoder (6-bit op field): 0 → RtypeInst, 1 → bltzInst, 2 → jInst, 3 → jalInst, 4 → beqInst, 5 → bneInst, 8 → addiInst, 10 → sltiInst, 12 → andiInst, 13 → oriInst, 14 → xoriInst, 15 → luiInst, 35 → lwInst, 43 → swInst
fn decoder (6-bit fn field, used when RtypeInst): 8 → jrInst, 12 → syscallInst, 32 → addInst, 34 → subInst, 36 → andInst, 37 → orInst, 38 → xorInst, 39 → norInst, 42 → sltInst
Feb. 2011
Computer Architecture, Data Path and Control
Slide 32
Control Signal Generation
Certain control signals depend only on the control state
ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7
RegWrite = ControlSt4 ∨ ControlSt8
Auxiliary signals identifying instruction classes
addsubInst = addInst ∨ subInst ∨ addiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
Logic expressions for ALU control signals
Add′Sub = ControlSt5 ∨ (ControlSt7 ∧ subInst)
FnClass1 = ControlSt7′ ∨ addsubInst ∨ logicInst
FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst)
LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst)
LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 33
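Supplementary note: the logic expressions on the slide above translate directly into code. Here is a minimal Python sketch (the control states and decoded instruction signals are modeled as sets of names; this is an illustration, not any particular HDL):

    # Minimal sketch of the combinational control-signal logic above.
    def control_signals(st, inst):
        """st: set of active control states, e.g. {7};
        inst: set of decoded instruction signals, e.g. {'sub'}."""
        addsub = {'add', 'sub', 'addi'} & inst
        logic  = {'and', 'or', 'xor', 'nor', 'andi', 'ori', 'xori'} & inst

        ALUSrcX  = 2 in st or 5 in st or 7 in st
        RegWrite = 4 in st or 8 in st
        AddSub   = 5 in st or (7 in st and 'sub' in inst)
        FnClass1 = 7 not in st or bool(addsub) or bool(logic)
        FnClass0 = 7 in st and bool(logic | ({'slt', 'slti'} & inst))
        LogicFn1 = 7 in st and bool({'xor', 'xori', 'nor'} & inst)
        LogicFn0 = 7 in st and bool({'or', 'ori', 'nor'} & inst)
        return dict(ALUSrcX=ALUSrcX, RegWrite=RegWrite, AddSub=AddSub,
                    FnClass1=FnClass1, FnClass0=FnClass0,
                    LogicFn1=LogicFn1, LogicFn0=LogicFn0)

    # Example: an R-type sub in state 7 selects subtract, arithmetic class.
    print(control_signals({7}, {'sub'}))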
14.4 Performance of the Multicycle Design
Instruction mix, cycle counts, and contributions to CPI:
    R-type  44%  4 cycles   0.44 × 4 = 1.76
    Load    24%  5 cycles   0.24 × 5 = 1.20
    Store   12%  4 cycles   0.12 × 4 = 0.48
    Branch  18%  3 cycles   0.18 × 3 = 0.54
    Jump     2%  3 cycles   0.02 × 3 = 0.06
    _____________________________
    Average CPI ≅ 4.04

Fig. 13.6 The MicroMIPS data path unfolded (by depicting the register write step as a separate block) so as to better visualize the critical-path latencies. [Diagram omitted: for each instruction class (ALU-type, Load, Store, Branch and jr, Jump except jr & jal), the pipeline-stage blocks not exercised by that class are marked “Not used.”]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 34
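Supplementary note: the average CPI is just the mix-weighted sum of the per-class cycle counts. A quick check in Python, using the numbers from the table above:

    mix = {'R-type': (0.44, 4), 'Load': (0.24, 5), 'Store': (0.12, 4),
           'Branch': (0.18, 3), 'Jump': (0.02, 3)}
    cpi = sum(frac * cycles for frac, cycles in mix.values())
    print(f"Average CPI = {cpi:.2f}")   # 4.04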
How Good is Our Multicycle Design?
Cycle time = 2 ns; clock rate = 500 MHz

The 500 MHz clock rate is better than the 125 MHz of the single-cycle design, but still unimpressive.

How does the performance compare with current processors on the market?

Not bad, where latency is concerned: a 2.5 GHz processor with 20 or so pipeline stages has a latency of about 0.4 ns × 20 = 8 ns.

Throughput, however, is much better for the pipelined processor: up to 20 times better with single issue, and perhaps up to 100× with multiple issue.

(Instruction mix, cycle counts, and average CPI ≅ 4.04 repeated from the previous slide.)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 35
14.5 Microprogramming
Fig. 14.6 Possible 22-bit microinstruction format for MicroMIPS.

Microinstruction fields and the control signals they encode:
    PC control:        JumpAddr, PCSrc, PCWrite
    Cache control:     Inst′Data, MemRead, MemWrite, IRWrite
    Register control:  RegWrite, RegDst, RegInSrc
    ALU inputs:        ALUSrcX, ALUSrcY
    ALU function:      Add′Sub, LogicFn, FnType
    Sequence control:  2 bits

The control state machine (Fig. 14.4, repeated on this slide) resembles a program: a microprogram.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 36
The Control State Machine as a Microprogram
Fig. 14.4, repeated: the control state machine for multicycle MicroMIPS, viewed as a microprogram. Annotations on the figure mark states whose settings depend on the instruction and therefore expand into multiple substates in the microprogram (one substate per instruction, e.g., States 5, 7, and 8), plus one state that decomposes into 2 substates.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 37
Symbolic Names for Microinstruction Field Values
Table 14.3 Microinstruction field values and their symbolic names.
The default value for each unspecified field is the all 0s bit pattern.
Field name         Possible field values and their symbolic names
PC control         0001 PCjump, 1001 PCsyscall, x011 PCjreg, x101 PCbranch, x111 PCnext
Cache control      0101 CacheFetch, 1010 CacheStore, 1100 CacheLoad
Register control   10000 rt ← Data, 10001 rt ← z, 10101 rd ← z, 11010 $31 ← PC
ALU inputs*        000 PC ⊗ 4, 011 PC ⊗ 4imm, 101 x ⊗ y, 110 x ⊗ imm, x10 (imm)
ALU function*      0xx10 ‘+’, 1xx01 ‘<’, 1xx10 ‘−’, x0011 ‘∧’, x0111 ‘∨’, x1011 ‘⊕’, x1111 ‘∼∨’, xxx00 ‘lui’
Seq. control       01 μPCdisp1, 10 μPCdisp2, 11 μPCfetch

* The operator symbol ⊗ stands for any of the ALU functions defined above (except for “lui”).
Feb. 2011
Computer Architecture, Data Path and Control
Slide 38
Control Unit for Microprogramming

Fig. 14.7 Microprogrammed control unit for MicroMIPS. [Diagram omitted: the MicroPC loads, via a 4-way mux, either its incremented value, an entry from dispatch table 1 or dispatch table 2 (64 entries each, indexed by the op field from the instruction register, supporting multiway branches such as fetch: ... and andi: ...), or the fixed address of the fetch microinstruction; the microprogram memory or PLA is addressed by the MicroPC, and its output is latched in the microinstruction register, which drives the control signals to the data path and the sequence control.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 39
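Supplementary note: the MicroPC update rule implied by Fig. 14.7 is easy to express in code. A small Python sketch (the dispatch-table contents and label addresses below are illustrative placeholders, not the book's actual encodings):

    # Sketch of MicroPC sequencing for the control unit of Fig. 14.7.
    FETCH_ADDR = 0
    dispatch1 = {8: 10, 12: 14, 35: 20, 43: 20}   # op -> microroutine (e.g., addi1, andi1, lwsw1)
    dispatch2 = {35: 22, 43: 24}                  # op -> lw2 or sw2

    def next_micro_pc(micro_pc, seq_ctl, op):
        """seq_ctl: 0 = continue, 1 = uPCdisp1, 2 = uPCdisp2, 3 = uPCfetch."""
        if seq_ctl == 1:
            return dispatch1[op]
        if seq_ctl == 2:
            return dispatch2[op]
        if seq_ctl == 3:
            return FETCH_ADDR
        return micro_pc + 1   # default: next microinstruction in sequence

    # Example: after state 1 (seq_ctl = uPCdisp1), an lw (op = 35)
    # dispatches to the lwsw1 microroutine.
    print(next_micro_pc(1, 1, 35))   # -> 20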
Microprogram for MicroMIPS

Fig. 14.8 The complete MicroMIPS microprogram (37 microinstructions).

fetch:    PCnext, CacheFetch           # State 0 (start)
          PC + 4imm, μPCdisp1          # State 1
lui1:     lui(imm)                     # State 7lui
          rt ← z, μPCfetch             # State 8lui
add1:     x + y                        # State 7add
          rd ← z, μPCfetch             # State 8add
sub1:     x - y                        # State 7sub
          rd ← z, μPCfetch             # State 8sub
slt1:     x - y                        # State 7slt
          rd ← z, μPCfetch             # State 8slt
addi1:    x + imm                      # State 7addi
          rt ← z, μPCfetch             # State 8addi
slti1:    x - imm                      # State 7slti
          rt ← z, μPCfetch             # State 8slti
and1:     x ∧ y                        # State 7and
          rd ← z, μPCfetch             # State 8and
or1:      x ∨ y                        # State 7or
          rd ← z, μPCfetch             # State 8or
xor1:     x ⊕ y                        # State 7xor
          rd ← z, μPCfetch             # State 8xor
nor1:     x ∼∨ y                       # State 7nor
          rd ← z, μPCfetch             # State 8nor
andi1:    x ∧ imm                      # State 7andi
          rt ← z, μPCfetch             # State 8andi
ori1:     x ∨ imm                      # State 7ori
          rt ← z, μPCfetch             # State 8ori
xori1:    x ⊕ imm                      # State 7xori
          rt ← z, μPCfetch             # State 8xori
lwsw1:    x + imm, μPCdisp2            # State 2
lw2:      CacheLoad                    # State 3
          rt ← Data, μPCfetch          # State 4
sw2:      CacheStore, μPCfetch         # State 6
j1:       PCjump, μPCfetch             # State 5j
jr1:      PCjreg, μPCfetch             # State 5jr
branch1:  PCbranch, μPCfetch           # State 5branch
jal1:     PCjump, $31←PC, μPCfetch     # State 5jal
syscall1: PCsyscall, μPCfetch          # State 5syscall
Feb. 2011
Computer Architecture, Data Path and Control
Slide 40
14.6 Exception Handling
Exceptions and interrupts alter the normal program flow
Examples of exceptions (things that can go wrong):
• ALU operation leads to overflow (an incorrect result is obtained)
• Opcode field holds a pattern not representing a legal operation
• Cache error-code checker deems an accessed word invalid
• Sensor signals a hazardous condition (e.g., overheating)
Exception handler is an OS program that takes care of the problem
• Derives correct result of overflowing computation, if possible
• Invalid operation may be a software-implemented instruction
Interrupts are similar, but usually have external causes (e.g., I/O)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 41
Exception Control States

Fig. 14.10 Exception states 9 and 10 added to the control state machine (Fig. 14.4, repeated on this slide). Overflow leads to State 9; an illegal operation leads to State 10.

State 9 (Overflow): IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
State 10 (Illegal operation): IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1

In both states, the cause is recorded, PC − 4 is saved in EPC, and control transfers to SysCallAddr.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 42
15 Pipelined Data Paths
Pipelining is now used in even the simplest of processors
• Same principles as assembly lines in manufacturing
• Unlike in assembly lines, instructions not independent
Topics in This Chapter
15.1 Pipelining Concepts
15.2 Pipeline Stalls or Bubbles
15.3 Pipeline Timing and Performance
15.4 Pipelined Data Path Design
15.5 Pipelined Control
15.6 Optimal Pipelining
Feb. 2011
Computer Architecture, Data Path and Control
Slide 43
[Image slide: the five MicroMIPS pipeline stages, Fetch, Reg read, ALU, Data memory, and Reg write.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 44
Single-Cycle Data Path of Chapter 13
Clock rate = 125 MHz; CPI = 1 (125 MIPS)

Fig. 13.3 Key elements of the single-cycle MicroMIPS data path. [Diagram omitted: next-address logic, PC, instruction cache, register file, ALU, and data cache, with control signals RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc, and Br&Jump.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 45
Multicycle Data Path of Chapter 14
Clock rate = 500 MHz; CPI ≅ 4 (≅ 125 MIPS)

Fig. 14.3 Key elements of the multicycle MicroMIPS data path. [Diagram omitted; see Slide 29.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 46
Getting the Best of Both Worlds
Single-cycle: Clock rate = 125 MHz, CPI = 1
Multicycle:   Clock rate = 500 MHz, CPI ≅ 4
Pipelined:    Clock rate = 500 MHz, CPI ≅ 1

Single-cycle analogy: Doctor appointments scheduled for 60 min per patient
Multicycle analogy: Doctor appointments scheduled in 15-min increments
Feb. 2011
Computer Architecture, Data Path and Control
Slide 47
15.1 Pipelining Concepts
Strategies for improving performance
1 – Use multiple independent data paths accepting several instructions
that are read out at once: multiple-instruction-issue or superscalar
2 – Overlap execution of several instructions, starting the next instruction
before the previous one has run to completion: (super)pipelined
Fig. 15.1 Pipelining in the student registration process. [Diagram omitted: students enter at Start and pass through five stations, including Registrar, Approval, Cashier, ID photo, and Pickup, before Exit.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 48
Pipelined Instruction Execution
Fig. 15.2 Pipelining in the MicroMIPS instruction execution process. [Diagram omitted: instructions 1-5, each flowing through the Instr cache, Reg file, ALU, Data cache, and Reg file stages, overlap across cycles 1-9; time runs along one dimension and tasks along the other.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 49
Alternate Representations of a Pipeline
Except for start-up and drainage overheads, a pipeline can execute
one instruction per clock tick; IPS is dictated by the clock frequency
Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback

Fig. 15.3 Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions): (a) a task-time diagram and (b) a space-time diagram, each spanning 11 cycles with start-up and drainage regions.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 50
Pipelining Example in a Photocopier
Example 15.1
A photocopier with an x-sheet document feeder copies the first sheet
in 4 s and each subsequent sheet in 1 s. The copier’s paper path is a
4-stage pipeline, with each stage having a latency of 1 s. The first
sheet goes through all 4 pipeline stages and emerges after 4 s;
each subsequent sheet emerges 1 s after the previous sheet.
How does the throughput of this photocopier vary with x, assuming
that loading the document feeder and removing the copies takes 15 s?
Solution
Each batch of x sheets is copied in 15 + 4 + (x – 1) = 18 + x seconds.
A nonpipelined copier would require 4x seconds to copy x sheets.
For x > 6, the pipelined version has a performance edge.
When x = 50, the pipelining speedup is (4 × 50) / (18 + 50) = 2.94.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 51
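Supplementary note: a quick numeric check of the batch-time and speedup formulas in Example 15.1 (a sketch; the x values are chosen arbitrarily):

    # Batch copy time with the 4-stage pipelined paper path vs. no pipelining.
    for x in (5, 6, 7, 50):
        pipelined = 15 + 4 + (x - 1)     # load/unload + first sheet + rest
        nonpipelined = 4 * x             # every sheet takes 4 s
        print(f"x = {x:3d}: {pipelined} s vs {nonpipelined} s, "
              f"speedup = {nonpipelined / pipelined:.2f}")
    # x = 50 gives speedup (4 * 50) / (18 + 50) = 2.94, as computed above.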
15.2 Pipeline Stalls or Bubbles
First type of data dependency
$5 = $6 + $7
$8 = $8 + $6
$9 = $8 + $2
sw $9, 0($3)
Fig. 15.4 Read-after-write data dependency and its possible resolution through data forwarding. [Pipeline diagram omitted: the ALU result of the second instruction is forwarded directly to the ALU input of the third, avoiding a stall.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 52
Inserting Bubbles in a Pipeline

[Two pipeline diagrams omitted: one instruction writes into $8 and a later instruction reads from $8.]

Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency.

Two bubbles suffice if we assume that a register can be updated and read from in the same cycle.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 53
Second Type of Data Dependency
    sw $6, . . .      (Reorder?)
    lw $8, . . .
    $9 = $8 + $2      (Insert bubble?)

Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency.

Fig. 15.5 Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding. [Pipeline diagram omitted.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 54
Control Dependency in a Pipeline
    $6 = $3 + $5        (Reorder? delayed branch)
    beq $1, $2, . . .   (assume the branch is resolved early in the pipeline)
    $9 = $8 + $2        (Insert bubble?)

If the branch were resolved later in the pipeline, 1-2 more bubbles would be needed.

Fig. 15.6 Control dependency due to conditional branch. [Pipeline diagram omitted.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 55
15.3 Pipeline Timing and Performance
Fig. 15.7 Pipelined form of a function unit with latching overhead. [Diagram omitted: a function unit of latency t is divided into q stages, Stage 1 through Stage q, each of latency t/q; latching of results adds an overhead of τ per stage.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 56
Throughput Increase in a q-Stage Pipeline
Throughput improvement factor = t / (t/q + τ) = q / (1 + qτ/t)

Fig. 15.8 Throughput improvement due to pipelining as a function of the number of pipeline stages for different pipelining overheads. [Plot omitted: the improvement factor vs. q = 1 to 8 for τ/t = 0 (ideal, factor q), 0.05, and 0.1.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 57
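Supplementary note: evaluating the improvement factor for the overhead ratios in Fig. 15.8 (a sketch; q and τ/t are taken from the plot's range):

    # Throughput improvement factor q / (1 + q * tau_over_t) of a q-stage pipeline.
    for tau_over_t in (0.0, 0.05, 0.1):
        factors = [q / (1 + q * tau_over_t) for q in range(1, 9)]
        print(f"tau/t = {tau_over_t}: " + ", ".join(f"{f:.2f}" for f in factors))
    # With tau/t = 0.1, eight stages yield only a 4.44x improvement, not 8x.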
Pipeline Throughput with Dependencies
Assume that one bubble must be inserted due to a read-after-load
dependency and after a branch when its delay slot cannot be filled.
Let β be the fraction of all instructions that are followed by a bubble.

Pipeline speedup = q / [(1 + qτ/t)(1 + β)]        Effective CPI = 1 + β

Example 15.3
Calculate the effective CPI for MicroMIPS, assuming that a quarter
of branch and load instructions are followed by bubbles.
(Instruction mix: R-type 44%, Load 24%, Store 12%, Branch 18%, Jump 2%.)

Solution
Fraction of bubbles β = 0.25(0.24 + 0.18) = 0.105
CPI = 1 + β = 1.105 (which is very close to the ideal value of 1)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 58
15.4 Pipelined Data Path Design
Fig. 15.9 Key elements of the pipelined MicroMIPS data path. [Diagram omitted: Stage 1 (next-address logic, PC, instruction cache), Stage 2 (register file readout, sign extension), Stage 3 (ALU), Stage 4 (data cache), Stage 5 (register writeback), with control signals RegDst, Br&Jump, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RetAddr, and RegInSrc.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 59
15.5 Pipelined Control
Fig. 15.10 Pipelined control signals. [Diagram omitted: the pipelined data path of Fig. 15.9, annotated with where each control signal (RegDst, Br&Jump, ALUSrc, ALUFunc, DataRead, DataWrite, RetAddr, RegWrite, RegInSrc) is consumed along the five stages.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 60
15.6 Optimal Pipelining
MicroMIPS pipeline with more than four-fold improvement
Fig. 15.11 Higher-throughput pipelined data path for MicroMIPS and the execution of consecutive instructions in it. [Diagram omitted: stages for instruction fetch, register readout, ALU operation, data read/store, and register writeback, with the slower cache-access steps spread over more than one stage.]
Feb. 2011
Feb. 2011
Computer Architecture, Data Path and Control
Slide 61
Optimal Number of Pipeline Stages
Assumptions:
• Pipeline sliced into q stages, with stage overhead τ (Fig. 15.7, repeated on this slide)
• q/2 bubbles per branch (decision made midway through the pipeline)
• A fraction b of all instructions are taken branches

Derivation of q_opt:
Average CPI = 1 + bq/2
Throughput = Clock rate / CPI = 1 / [(t/q + τ)(1 + bq/2)]
Differentiate the throughput expression with respect to q and equate with 0:
q_opt = √(2t / (bτ))
So q_opt varies directly with √(t/τ) and inversely with √b.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 62
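Supplementary note: a short computation of q_opt and the resulting throughput (a sketch; the t, τ, and b values below are made-up illustrations):

    import math

    def q_opt(t, tau, b):
        """Optimal stage count sqrt(2t / (b * tau)) from the derivation above."""
        return math.sqrt(2 * t / (b * tau))

    def throughput(q, t, tau, b):
        return 1 / ((t / q + tau) * (1 + b * q / 2))

    t, tau, b = 8.0, 0.2, 0.2          # ns, ns, fraction of taken branches
    q = q_opt(t, tau, b)               # = sqrt(400) = 20 stages
    print(f"q_opt = {q:.1f}")
    for stages in (4, round(q), 30):   # throughput peaks near q_opt
        print(stages, f"{throughput(stages, t, tau, b):.3f} instructions/ns")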
Pipelining Example
An example combinational-logic data path to compute z := (u + v) × (w − x) / y

Latencies: readout 1 ns; add/subtract 2 ns; multiply 6 ns; divide 15 ns; write 1 ns

Throughput, original (no pipelining):      1/(25 × 10⁻⁹) = 40 M computations/s
Throughput, Option 1 register placement:   1/(17 × 10⁻⁹) ≅ 58.8 M computations/s
Throughput, Option 2 register placement:   1/(10 × 10⁻⁹) = 100 M computations/s
Feb. 2011
Computer Architecture, Data Path and Control
Slide 63
16 Pipeline Performance Limits
Pipeline performance limited by data & control dependencies
• Hardware provisions: data forwarding, branch prediction
• Software remedies: delayed branch, instruction reordering
Topics in This Chapter
16.1 Data Dependencies and Hazards
16.2 Data Forwarding
16.3 Pipeline Branch Hazards
16.4 Delayed Branch and Branch Prediction
16.5 Advanced Pipelining
16.6 Dealing with Exceptions
Feb. 2011
Computer Architecture, Data Path and Control
Slide 64
16.1 Data Dependencies and Hazards
[Pipeline diagram omitted: an instruction computes $2 = $1 - $3, while several subsequent instructions read register $2 over cycles 1-9.]

Fig. 16.1 Data dependency in a pipeline.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 65
Resolving Data Dependencies via Forwarding
[Pipeline diagram omitted: the ALU result of $2 = $1 - $3 is forwarded to the instructions that read register $2.]

Fig. 16.2 When a previous instruction writes back a value computed by the ALU into a register, the data dependency can always be resolved through forwarding.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 66
Pipelined MicroMIPS – Repeated for Reference
Fig. 15.10 Pipelined control signals (repeated; see Slide 60).
Feb. 2011
Computer Architecture, Data Path and Control
Slide 67
Certain Data Dependencies Lead to Bubbles
[Pipeline diagram omitted: lw $2,4($12) is followed by instructions that read register $2.]

Fig. 16.3 When the immediately preceding instruction writes a value read out from the data memory into a register, the data dependency cannot be resolved through forwarding (i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 68
16.2 Data Forwarding
Fig. 16.4 Forwarding unit for the pipelined MicroMIPS data path. [Diagram omitted: upper and lower forwarding units, placed before the two ALU inputs in stage 3, choose among the register-file readouts (x2, y2) and the stage-3 and stage-4 results (x3, y3, d3, x4, y4, d4), based on ALUSrc1/ALUSrc2, RetAddr3, RegWrite3, RegWrite4, RegInSrc3, and RegInSrc4.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 69
Design of the Data Forwarding Units
Let’s focus on designing the upper data forwarding unit (Fig. 16.4, repeated on this slide).

Table 16.1 Partial truth table for the upper forwarding unit in the pipelined MicroMIPS data path. (Note: one entry here corrects an entry that is incorrect in the textbook.)

RegWrite3  RegWrite4  s2matchesd3  s2matchesd4  RetAddr3  RegInSrc3  RegInSrc4  Choose
    0          0           x            x           x         x          x        x2
    0          1           x            0           x         x          x        x2
    0          1           x            1           x         x          0        x4
    0          1           x            1           x         x          1        y4
    1          0           1            x           0         1          x        x3
    1          0           1            x           1         1          x        y3
    1          1           1            1           0         1          x        x3
Feb. 2011
Computer Architecture, Data Path and Control
Slide 70
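Supplementary note: the truth table translates into a priority selection in which the newest (stage-3) value wins. A minimal Python sketch of the choice function (signal names follow Table 16.1; rows not listed in the partial table are don't-cares here):

    def upper_forward(regwrite3, regwrite4, s2_matches_d3, s2_matches_d4,
                      retaddr3, reginsrc3, reginsrc4):
        """Select the upper ALU input: x2 (no forwarding), or a value
        forwarded from stage 3 (x3/y3) or stage 4 (x4/y4)."""
        if regwrite3 and s2_matches_d3 and reginsrc3:
            return 'y3' if retaddr3 else 'x3'   # newest value wins (stage 3)
        if regwrite4 and s2_matches_d4:
            return 'y4' if reginsrc4 else 'x4'  # otherwise the stage-4 value
        return 'x2'                             # register-file readout

    # Example: both stages would write the register; stage 3 takes priority.
    print(upper_forward(1, 1, 1, 1, 0, 1, 0))   # -> 'x3'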
Hardware for Inserting Bubbles
Fig. 16.5 Data hazard detector for the pipelined MicroMIPS data path (corrections to the textbook figure shown in red). [Diagram omitted: a detector in stage 2 examines rs, rt, and DataRead2; on detecting a load hazard, it deasserts LoadPC, LoadInst, and LoadIncrPC to freeze the front of the pipeline and selects all-0s in place of the decoder’s control signals, inserting a bubble.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 71
Augmentations to Pipelined Data Path and Control
[Diagram omitted: the pipelined data path and control of Fig. 15.10, augmented with a branch predictor, next-address forwarders, a hazard detector, ALU forwarders, and a data cache forwarder.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 72
16.3 Pipeline Branch Hazards
Software-based solutions
Compiler inserts a “no-op” after every branch (simple, but wasteful)
Branch is redefined to take effect after the instruction that follows it
Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
Mechanism similar to data hazard detector to flush the pipeline
Constitutes a rudimentary form of branch prediction:
Always predict that the branch is not taken, flush if mistaken
More elaborate branch prediction strategies possible
Feb. 2011
Computer Architecture, Data Path and Control
Slide 73
16.4 Branch Prediction
Predicting whether a branch will be taken
• Always predict that the branch will not be taken
• Use program context to decide (backward branch
is likely taken, forward branch is likely not taken)
• Allow programmer or compiler to supply clues
• Decide based on past history (maintain a small
history table); to be discussed later
• Apply a combination of factors: modern processors
use elaborate techniques due to deep pipelines
Feb. 2011
Computer Architecture, Data Path and Control
Slide 74
Forward and Backward Branches
Example 5.5
List A is stored in memory beginning at the address given in $s1.
List length is given in $s2.
Find the largest integer in the list and copy it into $t0.
Solution
Scan the list, holding the largest element identified thus far in $t0.
        lw   $t0,0($s1)       # initialize maximum to A[0]
        addi $t1,$zero,0      # initialize index i to 0
loop:   add  $t1,$t1,1        # increment index i by 1
        beq  $t1,$s2,done     # if all elements examined, quit
        add  $t2,$t1,$t1      # compute 2i in $t2
        add  $t2,$t2,$t2      # compute 4i in $t2
        add  $t2,$t2,$s1      # form address of A[i] in $t2
        lw   $t3,0($t2)       # load value of A[i] into $t3
        slt  $t4,$t0,$t3      # maximum < A[i]?
        beq  $t4,$zero,loop   # if not, repeat with no change
        addi $t0,$t3,0        # if so, A[i] is the new maximum
        j    loop             # change completed; now repeat
done:   ...                   # continuation of the program
Feb. 2011
Computer Architecture, Data Path and Control
Slide 75
Simple Branch Prediction: 1-Bit History
[Two-state branch prediction scheme: in state “Predict taken”, a taken branch keeps the state and a not-taken branch moves to “Predict not taken”; in “Predict not taken”, a not-taken branch keeps the state and a taken branch moves back to “Predict taken”.]
Problem with this approach:
Each branch in a loop entails two mispredictions:
Once in first iteration (loop is repeated, but the history indicates exit from loop)
Once in last iteration (when loop is terminated, but history indicates repetition)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 76
Simple Branch Prediction: 2-Bit History
Fig. 16.6 Four-state branch prediction scheme. [State diagram omitted: states “Predict taken”, “Predict taken again”, “Predict not taken”, and “Predict not taken again”; each taken branch moves the state one step toward “Predict taken again”, each not-taken branch one step toward “Predict not taken again”, so two consecutive mispredictions are needed to flip the prediction.]

Example 16.1 Impact of different branch prediction schemes

L1: ---              10 iter’s (outer loop)
    ---
L2: ---              20 iter’s (inner loop)
    ---
    br <c2> L2
    ---
    br <c1> L1

Solution
Always taken: 11 mispredictions, 94.8% accurate
1-bit history: 20 mispredictions, 90.5% accurate
2-bit history: Same as always taken
Feb. 2011
Computer Architecture, Data Path and Control
Slide 77
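Supplementary note: Example 16.1's misprediction counts can be reproduced with a short simulation (a sketch; it assumes each branch has its own predictor, the 1-bit predictor starts at "taken", and the 2-bit counter starts strongly taken, which matches the counts above):

    # Branch outcome traces for the nested loops: True = taken.
    inner = ([True] * 19 + [False]) * 10      # br <c2> L2: 200 executions
    outer = [True] * 9 + [False]              # br <c1> L1: 10 executions

    def run(update, predict, init, outcomes):
        s, miss = init, 0
        for taken in outcomes:
            miss += (predict(s) != taken)
            s = update(s, taken)
        return miss

    # Always taken: mispredict exactly the not-taken outcomes.
    always = lambda trace: sum(1 for t in trace if not t)
    # 1-bit: state is the last outcome.
    one_bit = lambda trace: run(lambda s, t: t, lambda s: s, True, trace)
    # 2-bit: saturating counter 0..3, predict taken when >= 2.
    two_bit = lambda trace: run(lambda s, t: min(s + 1, 3) if t else max(s - 1, 0),
                                lambda s: s >= 2, 3, trace)

    total = len(inner) + len(outer)
    for name, f in [("always taken", always), ("1-bit", one_bit), ("2-bit", two_bit)]:
        m = f(inner) + f(outer)
        print(f"{name}: {m} mispredictions, {100 * (1 - m / total):.1f}% accurate")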
Other Branch Prediction Algorithms
Problem 16.3 [State diagrams omitted: parts a and b show two four-state variants of the prediction scheme of Fig. 16.6, with the same four states (“Predict taken”, “Predict taken again”, “Predict not taken”, “Predict not taken again”) but different transitions on taken and not-taken outcomes.]
Feb. 2011
Feb. 2011
Computer Architecture, Data Path and Control
Slide 78
Hardware Implementation of Branch Prediction
Fig. 16.7 Hardware elements for a branch prediction scheme. [Diagram omitted: a table indexed by low-order bits of the PC holds the addresses of recent branch instructions, their target addresses, and history bit(s); the read-out entry’s address is compared with the PC, and logic chooses the next PC from the incremented PC or the predicted target.]
The mapping scheme used to go from PC contents to a table entry
is the same as that used in direct-mapped caches (Chapter 18)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 79
Pipeline Augmentations – Repeated for Reference
[Diagram omitted: the pipelined data path and control of Fig. 15.10, augmented with a branch predictor, next-address forwarders, a hazard detector, ALU forwarders, and a data cache forwarder; repeated from Slide 72.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 80
16.5 Advanced Pipelining
Deep pipeline = superpipeline; also, superpipelined, superpipelining
Parallel instruction issue = superscalar, j-way issue (2-4 is typical)
Fig. 16.8 Dynamic instruction pipeline with in-order issue, possible out-of-order completion, and in-order retirement. [Diagram omitted: instruction fetch (instr cache), instr decode, operand prep, and instr issue in stages 1-5, feeding several function units with a variable number of stages, followed by retirement & commit stages.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 81
Performance Improvement for Deep Pipelines
Hardware-based methods
Lookahead past an instruction that will/may stall in the pipeline
(out-of-order execution; requires in-order retirement)
Issue multiple instructions (requires more ports on register file)
Eliminate false data dependencies via register renaming
Predict branch outcomes more accurately, or speculate
Software-based method
Pipeline-aware compilation
Loop unrolling to reduce the number of branches
Before unrolling:
    Loop: Compute with index i
          Increment i by 1
          Go to Loop if not done

After unrolling by a factor of 2:
    Loop: Compute with index i
          Compute with index i + 1
          Increment i by 2
          Go to Loop if not done
Feb. 2011
Computer Architecture, Data Path and Control
Slide 82
CPI Variations with Architectural Features
Table 16.2 Effect of processor architecture, branch
prediction methods, and speculative execution on CPI.
Architecture               Methods used in practice                       CPI
Nonpipelined, multicycle   Strict in-order instruction issue and exec     5-10
Nonpipelined, overlapped   In-order issue, with multiple function units   3-5
Pipelined, static          In-order exec, simple branch prediction        2-3
Superpipelined, dynamic    Out-of-order exec, adv branch prediction       1-2
Superscalar                2- to 4-way issue, interlock & speculation     0.5-1
Advanced superscalar       4- to 8-way issue, aggressive speculation      0.2-0.5

Example: 3.3 inst/cycle × 3 Gigacycles/s ≅ 10 GIPS. We would need 100 such processors for TIPS performance, and 100,000 for 1 PIPS.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 83
Development of Intel’s Desktop/Laptop Micros
In the beginning, there was the 8080; it led to the 80x86 = IA32 ISA.

Half a dozen or so pipeline stages: 80286, 80386, 80486, Pentium (80586)

(More advanced technology) A dozen or so pipeline stages, with out-of-order instruction execution: Pentium Pro, Pentium II, Pentium III, Celeron

(More advanced technology) Two dozen or so pipeline stages: Pentium 4. Instructions are broken into micro-ops, which are executed out-of-order but retired in-order.
Feb. 2011
Computer Architecture, Data Path and Control
Slide 84
Current State of Computer Performance
Multi-GIPS/GFLOPS desktops and laptops
Very few users need even greater computing power
Users unwilling to upgrade just to get a faster processor
Current emphasis on power reduction and ease of use
Multi-TIPS/TFLOPS in large computer centers
World’s top 500 supercomputers, http://www.top500.org
Next list due in June 2009; as of Nov. 2008:
All 500 >> 10 TFLOPS, ≈30 > 100 TFLOPS, 1 > PFLOPS
Multi-PIPS/PFLOPS supercomputers on the drawing board
IBM “smarter planet” TV commercial proclaims (in early 2009):
“We just broke the petaflop [sic] barrier.”
The technical term “petaflops” is now in the public sphere
Feb. 2011
Computer Architecture, Data Path and Control
Slide 85
The Shrinking Supercomputer
Feb. 2011
Computer Architecture, Data Path and Control
Slide 86
16.6 Dealing with Exceptions
Exceptions present the same problems as branches
How to handle instructions that are ahead in the pipeline?
(let them run to completion and retirement of their results)
What to do with instructions after the exception point?
(flush them out so that they do not affect the state)
Precise versus imprecise exceptions
Precise exceptions hide the effects of pipelining and parallelism
by forcing the same state as that of strict sequential execution
(desirable, because exception handling is not complicated)
Imprecise exceptions are messy, but lead to faster hardware
(interrupt handler can clean up to offer precise exception)
Feb. 2011
Computer Architecture, Data Path and Control
Slide 87
The Three Hardware Designs for MicroMIPS
Single-cycle: 125 MHz, CPI = 1
Multicycle:   500 MHz, CPI ≅ 4
Pipelined:    500 MHz, CPI ≅ 1.1

[Diagrams omitted: the single-cycle (Fig. 13.3), multicycle (Fig. 14.3), and pipelined (Fig. 15.10) MicroMIPS data paths, shown side by side for comparison.]
Feb. 2011
Computer Architecture, Data Path and Control
Slide 88
Where Do We
Go from Here?
Memory Design:
How to build a memory unit
that responds in 1 clock
Input and Output:
Peripheral devices,
I/O programming,
interfacing, interrupts
Higher Performance:
Vector/array processing
Parallel processing
Feb. 2011
Computer Architecture, Data Path and Control
Slide 89
Part V
Memory System Design
Feb. 2011
Computer Architecture, Memory System Design
Slide 1
About This Presentation
This presentation is intended to support the use of the textbook
Computer Architecture: From Microprocessors to Supercomputers,
Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated
regularly by the author as part of his teaching of the upper-division
course ECE 154, Introduction to Computer Architecture, at the
University of California, Santa Barbara. Instructors can use these
slides freely in classroom teaching and for other educational
purposes. Any other use is strictly prohibited. © Behrooz Parhami
First edition: released July 2003; revised July 2004, July 2005, Mar. 2006, Mar. 2007, Mar. 2008, Feb. 2009
Second edition: released Feb. 2011
Feb. 2011
Computer Architecture, Memory System Design
Slide 2
V Memory System Design
Design problem – We want a memory unit that:
• Can keep up with the CPU’s processing speed
• Has enough capacity for programs and data
• Is inexpensive, reliable, and energy-efficient
Topics in This Part
Chapter 17 Main Memory Concepts
Chapter 18 Cache Memory Organization
Chapter 19 Mass Memory Concepts
Chapter 20 Virtual Memory and Paging
Feb. 2011
Computer Architecture, Memory System Design
Slide 3
17 Main Memory Concepts
Technologies & organizations for computer’s main memory
• SRAM (cache), DRAM (main), and flash (nonvolatile)
• Interleaving & pipelining to get around “memory wall”
Topics in This Chapter
17.1 Memory Structure and SRAM
17.2 DRAM and Refresh Cycles
17.3 Hitting the Memory Wall
17.4 Interleaved and Pipelined Memory
17.5 Nonvolatile Memory
17.6 The Need for a Memory Hierarchy
Feb. 2011
Computer Architecture, Memory System Design
Slide 4
17.1 Memory Structure and SRAM
Fig. 17.1 Conceptual inner structure of a 2^h × g SRAM chip and its shorthand representation. [Diagram omitted: an h-bit address feeds an address decoder that selects one of 2^h rows of g flip-flop storage cells; control inputs are Write enable, Chip select, and Output enable; the shorthand symbol has pins WE, D in, D out, Addr, CS, and OE.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 5
Multiple-Chip SRAM
Fig. 17.2 Eight 128K × 8 SRAM chips forming a 256K × 32 memory unit. [Diagram omitted: a 32-bit data input and an 18-bit address; 17 address bits go to every chip, and the MSB selects, via CS, between two groups of four chips, each group supplying data-out bytes 3 through 0.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 6
SRAM with Bidirectional Data Bus
[Diagram omitted: an SRAM chip with Output enable, Chip select, Write enable, an h-bit address, and a shared g-bit Data in/out connection.]
Fig. 17.3
When data input and output of an SRAM chip
are shared or connected to a bidirectional data bus, output
must be disabled during write operations.
Feb. 2011
Computer Architecture, Memory System Design
Slide 7
17.2 DRAM and Refresh Cycles
DRAM vs. SRAM Memory Cell Complexity
[Diagrams omitted: (a) a DRAM cell, a single pass transistor plus capacitor between the word line and bit line; (b) a typical SRAM cell, a multi-transistor latch connected to Vcc, with a bit line and a complement bit line.]
Fig. 17.4 Single-transistor DRAM cell, which is considerably simpler than
SRAM cell, leads to dense, high-capacity DRAM memory chips.
Feb. 2011
Computer Architecture, Memory System Design
Slide 8
DRAM Refresh Cycles and Refresh Rate
[Plot omitted: cell voltage vs. time; after a 1 is written, the voltage decays toward the threshold voltage and is restored by each refresh, while a stored 0 stays at the 0 level. A cell can go 10s of ms before needing a refresh cycle.]
Fig. 17.5 Variations in the voltage across a DRAM cell capacitor after
writing a 1 and subsequent refresh operations.
Feb. 2011
Computer Architecture, Memory System Design
Slide 9
Loss of Bandwidth to Refresh Cycles
Example 17.2
A 256 Mb DRAM chip is organized as a 32M × 8 memory externally
and as a 16K × 16K array internally. Rows must be refreshed at least
once every 50 ms to forestall data loss; refreshing a row takes 100 ns.
What fraction of the total memory bandwidth is lost to refresh cycles?
[Diagram omitted: the chip’s 16K × 16K internal memory matrix; a 14-bit row address selects a row into the row buffer, and an 11-bit column address selects a byte via the column mux.]
Solution
Refreshing all 16K rows takes 16 × 1024 × 100 ns = 1.64 ms. Loss of
1.64 ms every 50 ms amounts to 1.64/50 = 3.3% of the total bandwidth.
Feb. 2011
Computer Architecture, Memory System Design
Slide 10
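Supplementary note: the refresh-overhead arithmetic of Example 17.2 generalizes easily (a sketch using the example's numbers):

    rows = 16 * 1024          # 16K rows internally
    t_refresh_row = 100e-9    # 100 ns to refresh one row
    period = 50e-3            # every row refreshed at least once per 50 ms

    busy = rows * t_refresh_row               # time spent refreshing
    print(f"Refresh time per period: {busy * 1e3:.2f} ms")   # 1.64 ms
    print(f"Bandwidth lost: {100 * busy / period:.1f}%")     # 3.3%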
DRAM Packaging
Fig. 17.6 Typical DRAM package housing a 16M × 4 memory: a 24-pin dual in-line package (DIP).
Pins 24-13 (top row): Vss, D4, D3, CAS, OE, A9, A8, A7, A6, A5, A4, Vss
Pins 1-12 (bottom row): Vcc, D1, D2, WE, RAS, NC, A10, A0, A1, A2, A3, Vcc
Legend: Ai = address bit i; CAS = column address strobe; Dj = data bit j; NC = no connection; OE = output enable; RAS = row address strobe; WE = write enable
Feb. 2011
Computer Architecture, Memory System Design
Slide 11
DRAM Evolution

Fig. 17.7 Trends in DRAM main memory. [Plot omitted: number of memory chips (1 to 1000) vs. calendar year (1980-2010), with memory-size contours from 1 MB to 1 TB and bands for computer classes: supercomputers, servers, workstations, large PCs, and small PCs.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 12
17.3 Hitting the Memory Wall
[Plot omitted: relative performance (1 to 10^6, log scale) vs. calendar year (1980-2010); the processor curve climbs far faster than the memory curve.]
Fig. 17.8 Memory density and capacity have grown along with the
CPU power and complexity, but memory speed has not kept pace.
Feb. 2011
Computer Architecture, Memory System Design
Slide 13
Bridging the CPU-Memory Speed Gap
Idea: Retrieve more data from memory with each access.

Fig. 17.9 Two ways of using a wide-access memory to bridge the speed gap between the processor and memory. [Diagrams omitted: (a) buffer and multiplexer at the memory side, with a narrow bus to the processor; (b) buffer and multiplexer at the processor side, with a wide bus to the processor.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 14
17.4 Pipelined and Interleaved Memory
Memory latency may involve other supporting operations
besides the physical access itself
Virtual-to-physical address translation (Chap 20)
Tag comparison to determine cache hit/miss (Chap 18)
Fig. 17.10 Pipelined cache memory. [Diagram omitted: four pipeline stages in sequence: address translation, row decoding & readout, column decoding & selection, and tag comparison & validation.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 15
Memory Interleaving
[Diagram omitted: a dispatch unit routes each access, based on the 2 LSBs of the address, to one of four modules; module i handles addresses that are i mod 4 (module 0: addresses 0, 4, 8, ...; module 1: 1, 5, 9, ...; module 2: 2, 6, 10, ...; module 3: 3, 7, 11, ...), and returned data is merged onto the data-out bus. The timing diagram shows a short bus cycle and a longer memory cycle overlapped across the four modules.]

Fig. 17.11 Interleaved memory is more flexible than wide-access memory in that it can handle multiple independent accesses at once.
Feb. 2011
Computer Architecture, Memory System Design
Slide 16
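Supplementary note: a toy Python model of four-way low-order interleaving (a sketch; module-busy bookkeeping is simplified to whole bus cycles):

    # Four-way interleaved memory: address LSBs pick the module, so
    # consecutive addresses land in different modules and can overlap.
    MODULES = 4
    MEMORY_CYCLE = 4   # memory cycle = 4 bus cycles, as in the timing diagram

    def schedule(addresses):
        """Return (address, module, start_cycle), with each module busy
        for MEMORY_CYCLE bus cycles per access."""
        free_at = [0] * MODULES
        out = []
        for t, addr in enumerate(addresses):
            m = addr % MODULES                  # dispatch on 2 LSBs
            start = max(t, free_at[m])
            free_at[m] = start + MEMORY_CYCLE
            out.append((addr, m, start))
        return out

    # Sequential addresses stream at one access per bus cycle...
    print(schedule([0, 1, 2, 3, 4]))
    # ...but addresses hitting the same module serialize:
    print(schedule([0, 4, 8]))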
17.5 Nonvolatile Memory
[Diagram omitted: word lines crossing bit lines, with a connection to the supply voltage wherever a stored bit is 1; variants include ROM, PROM, and EPROM.]

Fig. 17.12 Read-only memory organization, with the fixed contents (here the words 1010, 1001, 0010, 1101) shown on the right.
Feb. 2011
Computer Architecture, Memory System Design
Slide 17
Flash Memory
[Diagram omitted: an array of cells at the crossings of source lines, word lines, and bit lines; each cell is a MOS transistor with a control gate and a floating gate, with source and drain diffusions (n+, n−) in a p substrate.]
Fig. 17.13 EEPROM or Flash memory organization.
Each memory cell is built of a floating-gate MOS transistor.
Feb. 2011
Computer Architecture, Memory System Design
Slide 18
17.6 The Need for a Memory Hierarchy
The widening speed gap between CPU and main memory
Processor operations take of the order of 1 ns
Memory access requires 10s or even 100s of ns
Memory bandwidth limits the instruction execution rate
Each instruction executed involves at least one memory access
Hence, a few to 100s of MIPS is the best that can be achieved
A fast buffer memory can help bridge the CPU-memory gap
The fastest memories are expensive and thus not very large
A second (third?) intermediate cache level is thus often used
Feb. 2011
Computer Architecture, Memory System Design
Slide 19
Typical Levels in a Hierarchical Memory
Level       Capacity   Access latency   Cost per GB
Reg’s       100s B     ns               $Millions
Cache 1     10s KB     a few ns         $100s Ks
Cache 2     MBs        10s ns           $10s Ks
    (speed gap)
Main        100s MB    100s ns          $1000s
Secondary   10s GB     10s ms           $10s
Tertiary    TBs        min+             $1s

Fig. 17.14 Names and key characteristics of levels in a memory hierarchy.
Feb. 2011
Computer Architecture, Memory System Design
Slide 20
Memory Price Trends
[Plot omitted: $/GByte (0.1 to 100K, log scale) vs. calendar year for DRAM, flash, and hard disk drives.]
Source: https://www1.hitachigst.com/hdd/technolo/overview/chart03.html
Feb. 2011
Computer Architecture, Memory System Design
Slide 21
18 Cache Memory Organization
Processor speed is improving at a faster rate than memory’s
• Processor-memory speed gap has been widening
• Cache is to main as desk drawer is to file cabinet
Topics in This Chapter
18.1 The Need for a Cache
18.2 What Makes a Cache Work?
18.3 Direct-Mapped Cache
18.4 Set-Associative Cache
18.5 Cache and Main Memory
18.6 Improving Cache Performance
Feb. 2011
Computer Architecture, Memory System Design
Slide 22
18.1 The Need for a Cache
All three of our MicroMIPS designs assumed 2-ns data and instruction memories; however, typical RAMs are 10-50 times slower.

Single-cycle: 125 MHz, CPI = 1
Multicycle:   500 MHz, CPI ≅ 4
Pipelined:    500 MHz, CPI ≅ 1.1

[Diagrams omitted: the single-cycle, multicycle, and pipelined MicroMIPS data paths, repeated for reference.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 23
Cache, Hit/Miss Rate, and Effective Access Time
Cache is transparent to the user; transfers occur automatically.

[Diagram omitted: CPU and register file, cache (fast) memory, and main (slow) memory; a word moves between CPU and cache, a line between cache and main.]

One level of cache with hit rate h: data is in the cache a fraction h of the time (say, hit rate of 98%); we go to main memory 1 - h of the time (say, cache miss rate of 2%).

Ceff = hCfast + (1 - h)(Cslow + Cfast) = Cfast + (1 - h)Cslow
Feb. 2011
Computer Architecture, Memory System Design
Slide 24
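Supplementary note: plugging sample numbers into the Ceff formula (a sketch; the 98% hit rate is the slide's example, the cycle counts are made up):

    def c_eff(h, c_fast, c_slow):
        """Effective access time: Cfast + (1 - h) * Cslow."""
        return c_fast + (1 - h) * c_slow

    # 98% hit rate, 1-cycle cache, 50-cycle main memory:
    print(c_eff(0.98, 1, 50))   # 2.0 cycles on average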
Multiple Cache Levels
[Diagrams omitted: (a) level-2 cache between the level-1 cache and main memory (cleaner and easier to analyze); (b) level-2 cache connected to a “backside” bus off the CPU.]
Fig. 18.1 Cache memories act as intermediaries between
the superfast processor and the much slower main memory.
Feb. 2011
Computer Architecture, Memory System Design
Slide 25
Performance of a Two-Level Cache System
Example 18.1
A system with L1 and L2 caches has a CPI of 1.2 with no cache miss.
There are 1.1 memory accesses on average per instruction.

Level   Local hit rate   Miss penalty
L1      95%              8 cycles
L2      80%              60 cycles

What is the effective CPI with cache misses factored in?
What are the effective hit rate and miss penalty overall if the L1 and L2
caches are modeled as a single cache?

Solution
Ceff = Cfast + (1 - h1)[Cmedium + (1 - h2)Cslow]
Because Cfast is included in the CPI of 1.2, we must account for the rest:
CPI = 1.2 + 1.1(1 - 0.95)[8 + (1 - 0.8)60] = 1.2 + 1.1 × 0.05 × 20 = 2.3
Overall: hit rate 99% (95% + 80% of 5%), miss penalty 60 cycles
Feb. 2011
Computer Architecture, Memory System Design
Slide 26
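Supplementary note: the same computation in Python, with the numbers from Example 18.1:

    base_cpi, accesses = 1.2, 1.1
    h1, pen1 = 0.95, 8          # L1 local hit rate and miss penalty
    h2, pen2 = 0.80, 60         # L2 local hit rate and miss penalty

    cpi = base_cpi + accesses * (1 - h1) * (pen1 + (1 - h2) * pen2)
    print(f"Effective CPI = {cpi:.1f}")                    # 2.3
    print(f"Overall hit rate = {h1 + (1 - h1) * h2:.0%}")  # 99%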
Cache Memory Design Parameters
Cache size (in bytes or words). A larger cache can hold more of the
program’s useful data but is more costly and likely to be slower.
Block or cache-line size (unit of data transfer between cache and main).
With a larger cache line, more data is brought in cache with each miss.
This can improve the hit rate but also may bring low-utility data in.
Placement policy. Determining where an incoming cache line is stored.
More flexible policies imply higher hardware cost and may or may not
have performance benefits (due to more complex data location).
Replacement policy. Determining which of several existing cache blocks
(into which a new cache line can be mapped) should be overwritten.
Typical policies: choosing a random or the least recently used block.
Write policy. Determining if updates to cache words are immediately
forwarded to main (write-through) or modified blocks are copied back to
main if and when they must be replaced (write-back or copy-back).
Feb. 2011
Computer Architecture, Memory System Design
Slide 27
18.2 What Makes a Cache Work?
Temporal locality; spatial locality.

Fig. 18.2 Assuming no conflict in address mapping, the cache will hold a small program loop (e.g., a 9-instruction loop) in its entirety, leading to fast execution. [Diagram omitted: a many-to-one address mapping from main memory to cache memory; the cache line/block is the unit of transfer between the main and cache memories.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 28
Desktop, Drawer, and File Cabinet Analogy
Once the “working set” is in
the drawer, very few trips to
the file cabinet are needed.
Access desktop in 2 s (register file); access drawer in 5 s (cache memory); access cabinet in 30 s (main memory).
Fig. 18.3 Items on a desktop (register) or in a drawer (cache) are
more readily accessible than those in a file cabinet (main memory).
Feb. 2011
Computer Architecture, Memory System Design
Slide 29
Temporal and Spatial Localities
From Peter Denning’s CACM paper, July 2005 (Vol. 48, No. 7, pp. 19-24):

Temporal locality: accesses to the same address are typically clustered in time.
Spatial locality: when a location is accessed, nearby locations tend to be accessed also.

[Plot omitted: addresses vs. time, with accesses clustering into a working set.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 30
Caching Benefits Related to Amdahl’s Law
Example 18.2
In the drawer & file cabinet analogy, assume a hit rate h in the drawer.
Formulate the situation shown in Fig. 18.2 in terms of Amdahl’s law.
Solution
Without the drawer, a document is accessed in 30 s. So, fetching 1000
documents, say, would take 30 000 s. The drawer causes a fraction h
of the cases to be done 6 times as fast, with access time unchanged for
the remaining 1 – h. Speedup is thus 1/(1 – h + h/6) = 6 / (6 – 5h).
Improving the drawer access time can increase the speedup factor but
as long as the miss rate remains at 1 – h, the speedup can never
exceed 1 / (1 – h). Given h = 0.9, for instance, the speedup is 4, with
the upper bound being 10 for an extremely short drawer access time.
Note: Some would place everything on their desktop, thinking that this
yields even greater speedup. This strategy is not recommended!
Feb. 2011
Computer Architecture, Memory System Design
Slide 31
Compulsory, Capacity, and Conflict Misses
Compulsory misses: With on-demand fetching, first access to any item
is a miss. Some “compulsory” misses can be avoided by prefetching.
Capacity misses: We have to oust some items to make room for others.
This leads to misses that are not incurred with an infinitely large cache.
Conflict misses: Occasionally, there is free room, or space occupied by
useless data, but the mapping/placement scheme forces us to displace
useful items to bring in other items. This may lead to misses in future.
Given a fixed-size cache, dictated, e.g., by cost factors or availability of
space on the processor chip, compulsory and capacity misses are
pretty much fixed. Conflict misses, on the other hand, are influenced by
the data mapping scheme which is under our control.
We study two popular mapping schemes: direct and set-associative.
Feb. 2011
Computer Architecture, Memory System Design
Slide 32
18.3 Direct-Mapped Cache
Word address = tag + 3-bit line index in cache + 2-bit word offset in line

[Diagram omitted: main memory locations 0-3, 4-7, 8-11, ..., 32-35, 36-39, 40-43, ..., 64-67, ..., 96-99, ... map onto eight 4-word cache lines; each line has a tag and a valid bit. On an access, the tag and the specified word are read out; a comparator checks the stored tag (with valid bit) against the address tag, signaling a cache miss if they differ.]
Fig. 18.4 Direct-mapped cache holding 32 words within eight 4-word lines.
Each line is associated with a tag and a valid bit.
Feb. 2011
Computer Architecture, Memory System Design
Slide 33
Accessing a Direct-Mapped Cache
Example 18.4
Show cache addressing for a byte-addressable memory with 32-bit
addresses. Cache line width 2^W = 16 B. Cache size 2^L = 4096 lines (64 KB).
Solution
Byte offset in line is log2 16 = 4 b. Cache line index is log2 4096 = 12 b.
This leaves 32 - 12 - 4 = 16 b for the tag.
32-bit address = 16-bit line tag + 12-bit line index in cache + 4-bit byte offset in line
(The line index plus byte offset form the byte address in cache.)

Fig. 18.5 Components of the 32-bit address in an example direct-mapped cache with byte addressing.
Feb. 2011
Computer Architecture, Memory System Design
Slide 34
Direct-Mapped
Cache Behavior
Address trace:
1, 7, 6, 5, 32, 33, 1, 2, . . .
1: miss, line 3, 2, 1, 0 fetched
7: miss, line 7, 6, 5, 4 fetched
6: hit
5: hit
32: miss, line 35, 34, 33, 32 fetched
(replaces 3, 2, 1, 0)
33: hit
1: miss, line 3, 2, 1, 0 fetched
(replaces 35, 34, 33, 32)
2: hit
... and so on

[Fig. 18.4, repeated for reference: the fetch of words 32-35 replaces words 0-3 in cache line 0, and vice versa.]
Feb. 2011
Computer Architecture, Memory System Design
Slide 35
18.4 Set-Associative Cache
Word address = tag + 2-bit set index in cache + 2-bit word offset in line

[Diagram omitted: main memory locations 0-3, 16-19, 32-35, 48-51, 64-67, 80-83, 96-99, 112-115, ... map onto four 2-line sets; each access reads the tag and specified word from both options (option 0 and option 1, each with valid bits) and compares the two tags in parallel, signaling a cache miss if neither matches.]

Fig. 18.6 Two-way set-associative cache holding 32 words of data within 4-word lines and 2-line sets.
Feb. 2011
Computer Architecture, Memory System Design
Slide 36
Accessing a Set-Associative Cache
Example 18.5
Show cache addressing scheme for a byte-addressable memory with
32-bit addresses. Cache line width 2W = 16 B. Set size 2S = 2 lines.
Cache size 2L = 4096 lines (64 KB).
Solution
Byte offset in line is log2 16 = 4 b. Cache set index is log2(4096/2) = 11 b.
This leaves 32 - 11 - 4 = 17 b for the tag.

32-bit address = 17-bit line tag + 11-bit set index in cache + 4-bit byte offset in line

Fig. 18.7 Components of the 32-bit address in an example two-way set-associative cache: the address in cache is used to read out two candidate items and their control info.
Feb. 2011
Computer Architecture, Memory System Design
Slide 37
Cache Address Mapping
Example 18.6
A 64 KB four-way set-associative cache is byte-addressable and
contains 32 B lines. Memory addresses are 32 b wide.
a. How wide are the tags in this cache?
b. Which main memory addresses are mapped to set number 5?
Solution
a. Address (32 b) = 5 b byte offset + 9 b set index + 18 b tag
   Line width = 32 B = 2^5 B; set size = 4 × 32 B = 128 B;
   number of sets = 2^16/2^7 = 2^9; tag width = 32 - 5 - 9 = 18 b
b. Addresses that have their 9-bit set index equal to 5. These are of
   the general form 2^14 a + 2^5 × 5 + b; e.g., 160-191, 16 544-16 575, . . .
Feb. 2011
Computer Architecture, Memory System Design
Slide 38
18.5 Cache and Main Memory
Split cache: separate instruction and data caches (L1)
Unified cache: holds instructions and data (L1, L2, L3)
Harvard architecture: separate instruction and data memories
von Neumann architecture: one memory for instructions and data
The writing problem:
Write-through slows down the cache to allow main to catch up
Write-back or copy-back is less problematic, but still hurts
performance due to two main memory accesses in some cases.
Solution: Provide write buffers for the cache so that it does not
have to wait for main memory to catch up.
Feb. 2011
Computer Architecture, Memory System Design
Slide 39
Faster Main-Cache Data Transfers
Fig. 18.8 A 256 Mb DRAM chip organized as a 32M × 8 memory module: four such chips could form a 128 MB main memory unit. [Diagram omitted: a 14-bit row address drives the row address decoder of a 16Kb × 16Kb memory matrix; the selected row (16 Kb = 2 KB) feeds a column mux, which an 11-bit column address uses to select the data byte out.]
Feb. 2011
Feb. 2011
Computer Architecture, Memory System Design
Slide 40
18.6 Improving Cache Performance
For a given cache size, the following design issues and tradeoffs exist:
Line width (2W). Too small a value for W causes a lot of main memory
accesses; too large a value increases the miss penalty and may tie up
cache space with low-utility items that are replaced before being used.
Set size or associativity (2^S). Direct mapping (S = 0) is simple and fast;
greater associativity leads to more complexity, and thus slower access,
but tends to reduce conflict misses. More on this later.
Line replacement policy. Usually LRU (least recently used) algorithm or
some approximation thereof; not an issue for direct-mapped caches.
Somewhat surprisingly, random selection works quite well in practice.
Write policy. Modern caches are very fast, so that write-through is
seldom a good choice. We usually implement write-back or copy-back,
using write buffers to soften the impact of main memory latency.
Feb. 2011
Computer Architecture, Memory System Design
Slide 41
Effect of Associativity on Cache Performance
(Plot: miss rate, from 0 to 0.3, falling as associativity increases from direct-mapped through 2-way, 4-way, 8-way, 16-way, and 32-way to 64-way.)
Fig. 18.9 Performance improvement of caches with increased associativity.
Feb. 2011
Computer Architecture, Memory System Design
Slide 42
19 Mass Memory Concepts
Today’s main memory is huge, but still inadequate for all needs
• Magnetic disks provide extended and back-up storage
• Optical disks & disk arrays are other mass storage options
Topics in This Chapter
19.1 Disk Memory Basics
19.2 Organizing Data on Disk
19.3 Disk Performance
19.4 Disk Caching
19.5 Disk Arrays and RAID
19.6 Other Types of Mass Memory
Feb. 2011
Computer Architecture, Memory System Design
Slide 43
19.1 Disk Memory Basics
(Figure labels: sector, read/write head, actuator, recording area, tracks 0 through c – 1, arm, direction of rotation, platter, spindle.)
Fig. 19.1 Disk memory elements and key terms.
Feb. 2011
Computer Architecture, Memory System Design
Slide 44
Disk Drives
(Photos: disk drives, typically 2-8 cm in diameter.)
Feb. 2011
Computer Architecture, Memory System Design
Slide 45
Access Time for a Disk
1. Head movement from current position to desired cylinder: Seek time (0-10s ms)
2. Disk rotation until the desired sector arrives under the head: Rotational latency (0-10s ms)
3. Disk rotation until the sector has passed under the head: Data transfer time (< 1 ms)
The three components of disk access time. Disks that spin faster
have a shorter average and worst-case access time.
Feb. 2011
Computer Architecture, Memory System Design
Slide 46
Representative Magnetic Disks
Table 19.1 Key attributes of three representative magnetic disks,
from the highest capacity to the smallest physical size (ca. early 2003).
[More detail (weight, dimensions, recording density, etc.) in textbook.]
Manufacturer and Model Name  Seagate Barracuda 180  Hitachi DK23DA  IBM Microdrive
Application domain           Server                 Laptop          Pocket device
Capacity                     180 GB                 40 GB           1 GB
Platters / Surfaces          12 / 24                2 / 4           1 / 2
Cylinders                    24 247                 33 067          7 167
Sectors per track, avg       604                    591             140
Buffer size                  16 MB                  2 MB            1/8 MB
Seek time, min, avg, max     1, 8, 17 ms            3, 13, 25 ms    1, 12, 19 ms
Diameter                     3.5″                   2.5″            1.0″
Rotation speed, rpm          7 200                  4 200           3 600
Typical power                14.1 W                 2.3 W           0.8 W
Feb. 2011
Computer Architecture, Memory System Design
Slide 47
19.2 Organizing Data on Disk
(Figure: sectors 1 through 5 laid out along a track with gaps between them; a thin-film head flies over the magnetic medium, reading bits such as 0, 1, 0.)
Fig. 19.2 Magnetic recording along the tracks and the read/write head.
(Figure: sectors numbered 0-62 on tracks i through i + 3; the logical numbering is skewed from one track to the next so that, after a track switch, the next logical sector has not yet passed under the head.)
Fig. 19.3 Logical numbering of sectors on several adjacent tracks.
Feb. 2011
Computer Architecture, Memory System Design
Slide 48
19.3 Disk Performance
Seek time = a + b(c – 1) + β(c – 1)^(1/2)
Average rotational latency = (30 / rpm) s = (30 000 / rpm) ms
(Figure: access requests A, B, C, D, E, F arrive in that order at scattered positions around the rotating disk; a possible out-of-order reading is C, F, D, E, B, A.)
Fig. 19.4 Reducing average seek time and rotational
latency by performing disk accesses out of order.
Feb. 2011
Computer Architecture, Memory System Design
Slide 49
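The access-time components combine additively; below is a minimal sketch. The media transfer rate is an assumed figure for illustration, since Table 19.1 does not list one.

    def avg_access_ms(avg_seek_ms, rpm, xfer_bytes, media_MBps):
        """Average access time = seek + rotational latency + data transfer."""
        rotational = 30_000 / rpm                      # (30 000 / rpm) ms
        transfer = xfer_bytes / (media_MBps * 1000.0)  # bytes over MB/s, in ms
        return avg_seek_ms + rotational + transfer

    # Seagate Barracuda 180 of Table 19.1: 8 ms average seek, 7 200 rpm;
    # the 50 MB/s media rate here is an assumption, not taken from the table.
    print(f"{avg_access_ms(8, 7200, 4096, 50):.2f} ms to read 4 KB")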
19.4 Disk Caching
Same idea as processor cache: bridge main-disk speed gap
Read/write an entire track with each disk access:
“Access one sector, get 100s free,” hit rate around 90%
Disks listed in Table 19.1 have buffers from 1/8 to 16 MB
Rotational latency eliminated; can start from any sector
Need back-up power so as not to lose changes in disk cache
(need it anyway for head retraction upon power loss)
Placement options for disk cache
In the disk controller:
Suffers from bus and controller latencies even for a cache hit
Closer to the CPU:
Avoids latencies and allows for better utilization of space
Intermediate or multilevel solutions
Feb. 2011
Computer Architecture, Memory System Design
Slide 50
19.5 Disk Arrays and RAID
The need for high-capacity, high-throughput secondary (disk) memory
Processor speed  RAM size  Disk I/O rate  Number of disks  Disk capacity  Number of disks
1 GIPS           1 GB      100 MB/s       1                100 GB         1
1 TIPS           1 TB      100 GB/s       1 000            100 TB         100
1 PIPS           1 PB      100 TB/s       1 Million        100 PB         100 000
1 EIPS           1 EB      100 PB/s       1 Billion        100 EB         100 Million

Amdahl's rules of thumb for system balance:
1 RAM byte for each IPS; 1 I/O bit per sec for each IPS; 100 disk bytes for each RAM byte
Feb. 2011
Computer Architecture, Memory System Design
Slide 51
Redundant Array of Independent Disks (RAID)
Data organization on multiple disks (the figure shows data disks 0-3 plus mirror, parity, and spare disks for each level):
RAID0: Multiple disks for higher data rate; no redundancy
RAID1: Mirrored disks
RAID2: Error-correcting code
RAID3: Bit- or byte-level striping with parity/checksum disk
RAID4: Parity/checksum applied to sectors, not bits or bytes
RAID5: Parity/checksum distributed across several disks
RAID6: Parity and 2nd check distributed across several disks
A ⊕ B ⊕ C ⊕ D ⊕ P = 0  →  B = A ⊕ C ⊕ D ⊕ P
Fig. 19.5 RAID levels 0-6, with a simplified view of data organization.
Feb. 2011
Computer Architecture, Memory System Design
Slide 52
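The parity relation A ⊕ B ⊕ C ⊕ D ⊕ P = 0 is easy to exercise in code. The sketch below (block contents are made up) builds a RAID5-style parity block and rebuilds a "lost" data block from the survivors, exactly as the identity B = A ⊕ C ⊕ D ⊕ P states.

    from functools import reduce
    import secrets

    def xor_blocks(*blocks):
        """Bytewise XOR of equal-length data blocks."""
        return bytes(reduce(lambda u, v: u ^ v, col) for col in zip(*blocks))

    a, b, c, d = (secrets.token_bytes(16) for _ in range(4))  # four data blocks
    p = xor_blocks(a, b, c, d)          # parity chosen so A ⊕ B ⊕ C ⊕ D ⊕ P = 0
    rebuilt_b = xor_blocks(a, c, d, p)  # disk B fails; XOR the surviving blocks
    assert rebuilt_b == b               # the lost block is recovered exactly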
RAID Product Examples
IBM ESS Model 750
Feb. 2011
Computer Architecture, Memory System Design
Slide 53
19.6 Other Types of Mass Memory
(Figure: (a) cutaway view of a hard disk drive; (b) some removable storage media: floppy disk, typically 2-9 cm; CD-ROM; magnetic tape cartridge; and flash drives, also sold as thumb drives or travel drives.)
Fig. 3.12 Magnetic and optical disk memory units.
Feb. 2011
Computer Architecture, Memory System Design
Slide 54
Optical Disks
(Figure: a laser diode's beam passes through a beam splitter and lenses to pits on spiral, rather than concentric, tracks; the reflected light reaches a detector. A side view of one track shows pits in the substrate under a protective coating, encoding 0s and 1s.)
Fig. 19.6 Simplified view of recording format and access mechanism for
data on a CD-ROM or DVD-ROM.
Feb. 2011
Computer Architecture, Memory System Design
Slide 55
Automated Tape Libraries
Feb. 2011
Computer Architecture, Memory System Design
Slide 56
20 Virtual Memory and Paging
Managing data transfers between main & mass is cumbersome
• Virtual memory automates this process
• Key to virtual memory’s success is the same as for cache
Topics in This Chapter
20.1 The Need for Virtual Memory
20.2 Address Translation in Virtual Memory
20.3 Translation Lookaside Buffer
20.4 Page Placement and Replacement
20.5 Main and Mass Memories
20.6 Improving Virtual Memory Performance
Feb. 2011
Computer Architecture, Memory System Design
Slide 57
20.1 The Need for Virtual Memory
(Figure: active pieces of program and data reside in memory, along with unused space and the stack; the full program and data occupy several disk tracks.)
Fig. 20.1 Program segments in main memory and on disk.
Feb. 2011
Computer Architecture, Memory System Design
Slide 58
Memory Hierarchy: The Big Picture
Registers ↔ cache: words, transferred explicitly via load/store
Cache ↔ main memory: lines, transferred automatically upon cache miss
Main memory ↔ virtual memory: pages, transferred automatically upon page fault
Fig. 20.2 Data movement in a memory hierarchy.
Feb. 2011
Computer Architecture, Memory System Design
Slide 59
20.2 Address Translation in Virtual Memory
Virtual address = virtual page number (V − P bits) | offset in page (P bits)
Physical address = physical page number (M − P bits) | offset in page (P bits)
The offset passes through unchanged; address translation maps the virtual page number to the physical page number.
Fig. 20.3 Virtual-to-physical address translation parameters.
Feb. 2011
Example 20.1
Determine the parameters in Fig. 20.3 for 32-bit virtual addresses,
4 KB pages, and 128 MB byte-addressable main memory.
Solution: Physical addresses are 27 b, byte offset in page is 12 b;
thus, virtual (physical) page numbers are 32 – 12 = 20 b (15 b)
Computer Architecture, Memory System Design
Slide 60
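The widths in Fig. 20.3 follow directly from the logarithms of the page and memory sizes. A small sketch (names are mine) reproduces Example 20.1:

    from math import log2

    def vm_params(v_bits, page_bytes, mem_bytes):
        """Fig. 20.3 widths: offset P, virtual page number, physical page number."""
        P = int(log2(page_bytes))     # offset in page
        M = int(log2(mem_bytes))      # physical address width
        return {"offset": P, "vpn": v_bits - P, "ppn": M - P, "phys_addr": M}

    # Example 20.1: 32-bit virtual addresses, 4 KB pages, 128 MB main memory
    print(vm_params(32, 4096, 128 * 2**20))
    # -> {'offset': 12, 'vpn': 20, 'ppn': 15, 'phys_addr': 27}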
Page Tables and Address Translation
(Figure: the page table register points to the page table in main memory; the virtual page number indexes an entry containing valid bits, other flags, and the corresponding physical page number.)
Fig. 20.4 The role of page table in the virtual-to-physical
address translation process.
Feb. 2011
Computer Architecture, Memory System Design
Slide 61
Protection and Sharing in Virtual Memory
(Figure: page tables for process 1 and process 2, each with entries 0-7 containing a pointer, flags, and permission bits; some pages allow read & write accesses, others only read accesses; entries point into main memory or to disk memory, and a shared page appears in both tables.)
Fig. 20.5 Virtual memory as a facilitator of sharing and
memory protection.
Feb. 2011
Computer Architecture, Memory System Design
Slide 62
The Latency Penalty of Virtual Memory
(Figure: translating a virtual address costs memory access 1 to read the page table entry, located via the page table register and virtual page number and holding valid bits and other flags, followed by memory access 2 to the resulting physical address in main memory.)
Fig. 20.4
Feb. 2011
Computer Architecture, Memory System Design
Slide 63
20.3 Translation Lookaside Buffer
(Figure: the virtual address divides into a virtual page number and byte offset; the TLB holds tags, valid bits, other flags, and physical page numbers. When the TLB tags match and the entry is valid, translation yields the physical page number, and the resulting physical address supplies the physical address tag, cache index, and byte offset in word for the cache access.)

Program page in virtual memory:

         lw    $t0,0($s1)
         addi  $t1,$zero,0
    L:   add   $t1,$t1,1
         beq   $t1,$s2,D
         add   $t2,$t1,$t1
         add   $t2,$t2,$t2
         add   $t2,$t2,$s1
         lw    $t3,0($t2)
         slt   $t4,$t0,$t3
         beq   $t4,$zero,L
         addi  $t0,$t3,0
         j     L
    D:   ...

All instructions on this page have the same virtual page address and thus entail the same translation.
Fig. 20.6 Virtual-to-physical address translation by a TLB and how
the resulting physical address is used to access the cache memory.
Feb. 2011
Computer Architecture, Memory System Design
Slide 64
Address Translation via TLB
Example 20.2
An address translation process converts a 32-bit virtual address to a
32-bit physical address. Memory is byte-addressable with 4 KB pages.
A 16-entry, direct-mapped TLB is used. Specify the components of the
virtual and physical addresses and the width of the various TLB fields.
Solution
Virtual address (32 b) = 20-bit virtual page number + 12-bit byte offset in page; the virtual page number splits into a 16-bit tag and a 4-bit TLB index for the 16-entry, direct-mapped TLB.
On a TLB hit (tags match and the entry is valid), translation supplies the 20-bit physical page number, which joins the 12-bit offset to form the physical address; its tag, cache index, and byte offset in word then address the cache (Fig. 20.6).
TLB word width = 16-bit tag + 20-bit phys page # + 1 valid bit + other flags ≥ 37 bits
Feb. 2011
Computer Architecture, Memory System Design
Slide 65
Virtual- or Physical-Address Cache?
(Figure: three options. Virtual-address cache: the cache is accessed first, and the TLB is consulted only on the way to main memory. Physical-address cache: the TLB translates first, then the cache is accessed. Hybrid-address cache: the cache may be accessed with the part of the address that is common between virtual and physical addresses, overlapping the TLB lookup. A TLB access may form an extra pipeline stage, thus the penalty in throughput can be insignificant.)
Fig. 20.7 Options for where virtual-to-physical
address translation occurs.
Feb. 2011
Computer Architecture, Memory System Design
Slide 66
20.4 Page Replacement Policies
Least-recently used (LRU) policy
Implemented by maintaining a stack
(Figure: the page reference string A, B, A, F, B, E, A is applied to an LRU stack; on each reference the accessed page moves to the MRU position at the top of the stack, pushing the others down, so the page in the LRU position at the bottom is always the replacement victim.)
Feb. 2011
Computer Architecture, Memory System Design
Slide 67
Approximate LRU Replacement Policy
Least-recently used policy: effective, but hard to implement
Approximate versions of LRU are more easily implemented
Clock policy: diagram below shows the reason for name
Use bit is set to 1 whenever a page is accessed
(Figure: page slots 0-7 arranged in a circle, each with a use bit; (a) before replacement, (b) after replacement. The clock hand sweeps the circle, clearing use bits that are 1 and replacing the first page whose use bit is 0.)
Fig. 20.8 A scheme for the approximate implementation of LRU.
Feb. 2011
Computer Architecture, Memory System Design
Slide 68
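One replacement step of the clock policy can be sketched as below (the use-bit values are made up); the hand gives a second chance to any page whose use bit is 1.

    def clock_replace(use_bits, hand):
        """Clear use bits that are 1; replace the first page whose bit is 0."""
        while use_bits[hand] == 1:
            use_bits[hand] = 0               # second chance: clear and move on
            hand = (hand + 1) % len(use_bits)
        victim = hand
        use_bits[victim] = 1                 # the incoming page starts as used
        return victim, (victim + 1) % len(use_bits)

    use = [1, 1, 0, 1, 0, 1, 0, 1]           # page slots 0-7, as in Fig. 20.8
    victim, hand = clock_replace(use, 0)
    print("replace slot", victim, "; use bits now", use)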
LRU Is Not Always the Best Policy
Example 20.2
Computing column averages for a 17 × 1024 table; 16-page memory
for j = [0 … 1023] {
temp = 0;
for i = [0 … 16]
temp = temp + T[i][j]
print(temp/17.0); }
Evaluate the page faults for row-major and column-major storage.
Solution
With row-major storage each of the 17 rows fills a 1024-word page, so computing one column average touches 17 pages; with only 16 page frames under LRU, essentially every access faults. With column-major storage each page holds about 60 complete columns (61 at the start), so the whole computation incurs only about 17 page faults.
Fig. 20.9 Pagination of a 17×1024 table with row- or column-major storage.
Feb. 2011
Computer Architecture, Memory System Design
Slide 69
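The claim can be checked by simulation. The sketch below counts LRU page faults for the loop of the example, assuming 1024-word pages (consistent with the roughly 60 columns per page in Fig. 20.9).

    def column_scan_faults(rows, cols, page_words, frames, row_major):
        """LRU page faults for the column-averaging loop of the example."""
        resident, faults = [], 0
        for j in range(cols):
            for i in range(rows):
                word = i * cols + j if row_major else j * rows + i
                page = word // page_words
                if page in resident:
                    resident.remove(page)
                else:
                    faults += 1
                    if len(resident) == frames:
                        resident.pop()       # evict the least recently used page
                resident.insert(0, page)
        return faults

    print(column_scan_faults(17, 1024, 1024, 16, row_major=True))   # 17 408
    print(column_scan_faults(17, 1024, 1024, 16, row_major=False))  # 17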
20.5 Main and Mass Memories
Working set of a process, W(t, x): The set of pages accessed over
the last x instructions at time t
Principle of locality ensures that the working set changes slowly
(Plot: working set size W(t, x) versus time t.)
Fig. 20.10 Variations in the size of a program's working set.
Feb. 2011
Computer Architecture, Memory System Design
Slide 70
20.6 Improving Virtual Memory Performance
Table 20.1
Memory hierarchy parameters and their effects on performance
Parameter variation                     Potential advantages              Possible disadvantages
Larger main or cache size               Fewer capacity misses             Longer access time
Larger pages or longer lines            Fewer compulsory misses           Greater miss penalty
                                        (prefetching effect)
Greater associativity (for cache only)  Fewer conflict misses             Longer access time
More sophisticated replacement policy   Fewer conflict misses             Longer decision time, more hardware
Write-through policy (for cache only)   No write-back time penalty,       Wasted memory bandwidth,
                                        easier write-miss handling        longer access time
Feb. 2011
Computer Architecture, Memory System Design
Slide 71
Impact of Technology on Virtual Memory
(Plot: time, from ps up to s, versus calendar year 1980-2010; disk seek time stays in the ms range, DRAM access time in the ns range, and CPU cycle time falls toward ps, so the gaps between the levels keep widening.)
Fig. 20.11 Trends in disk, main memory, and CPU speeds.
Feb. 2011
Computer Architecture, Memory System Design
Slide 72
Performance Impact of the Replacement Policy
(Plot: page fault rate, 0.00-0.04, versus pages allocated, 0-15, for first in, first out; approximate LRU; least recently used; and ideal (best possible) policies; the fault rate falls with more pages and with better policies.)
Fig. 20.12 Dependence of page faults on the number of pages allocated and the page replacement policy.
Feb. 2011
Computer Architecture, Memory System Design
Slide 73
Summary of Memory Hierarchy
Cache memory:
provides illusion of
very high speed
Main memory:
reasonable cost,
but slow & small
Virtual memory:
provides illusion of
very large size
Registers ↔ cache: words, transferred explicitly via load/store
Cache ↔ main memory: lines, transferred automatically upon cache miss
Main memory ↔ virtual memory: pages, transferred automatically upon page fault
Locality makes the illusions work.
Fig. 20.2 Data movement in a memory hierarchy.
Feb. 2011
Computer Architecture, Memory System Design
Slide 74
Part VI
Input/Output and Interfacing
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 1
VI Input/Output and Interfacing
Effective computer design & use requires awareness of:
• I/O device types, technologies, and performance
• Interaction of I/O with memory and CPU
• Automatic data collection and device actuation
Topics in This Part
Chapter 21 Input/Output Devices
Chapter 22 Input/Output Programming
Chapter 23 Buses, Links, and Interfacing
Chapter 24 Context Switching and Interrupts
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 3
21 Input/Output Devices
Learn about input and output devices as categorized by:
• Type of data presentation or recording
• Data rate, which influences interaction with system
Topics in This Chapter
21.1 Input/Output Devices and Controllers
21.2 Keyboard and Mouse
21.3 Visual Display Units
21.4 Hard-Copy Input/Output Devices
21.5 Other Input/Output Devices
21.6 Networking of Input/Output Devices
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 4
21.1 Input/Output Devices and Controllers
Table 3.3
Some input, output, and two-way I/O devices.
Input type    Prime examples        Other examples         Data rate (b/s)  Main uses
Symbol        Keyboard, keypad      Music note, OCR        10s              Ubiquitous
Position      Mouse, touchpad       Stick, wheel, glove    100s             Ubiquitous
Identity      Barcode reader        Badge, fingerprint     100s             Sales, security
Sensory       Touch, motion, light  Scent, brain signal    100s             Control, security
Audio         Microphone            Phone, radio, tape     1000s            Ubiquitous
Image         Scanner, camera       Graphic tablet         1000s-10^6s      Photos, publishing
Video         Camcorder, DVD        VCR, TV cable          1000s-10^9s      Entertainment

Output type   Prime examples        Other examples         Data rate (b/s)  Main uses
Symbol        LCD line segments     LED, status light      10s              Ubiquitous
Position      Stepper motor         Robotic motion         100s             Ubiquitous
Warning       Buzzer, bell, siren   Flashing light         A few            Safety, security
Sensory       Braille text          Scent, brain stimulus  100s             Personal assistance
Audio         Speaker, audiotape    Voice synthesizer      1000s            Ubiquitous
Image         Monitor, printer      Plotter, microfilm     1000s            Ubiquitous
Video         Monitor, TV screen    Film/video recorder    1000s-10^9s      Entertainment

Two-way I/O   Prime examples        Other examples         Data rate (b/s)  Main uses
Mass storage  Hard/floppy disk      CD, tape, archive      10^6s            Ubiquitous
Network       Modem, fax, LAN       Cable, DSL, ATM        1000s-10^9s      Ubiquitous
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 6
Simple Organization for Input/Output
(Figure: the CPU, receiving interrupts, shares a common system bus with the cache, main memory, and I/O controllers for two disks, a graphics display, and a network.)
Figure 21.1 Input/output via a single common bus.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 7
I/O Organization for Greater Performance
(Figure: the CPU and cache sit on a memory bus with main memory; bus adapters lead to intermediate buses/ports: AGP for the graphics display, a standard PCI bus with I/O controllers for the network, and a proprietary I/O bus with I/O controllers for disks and CD/DVD.)
Figure 21.2 Input/output via intermediate and dedicated I/O buses
(to be explained in Chapter 23).
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 8
21.2 Keyboard and Mouse
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 9
Keyboard Switches and Encoding
(Figure: (a) mechanical switch with a key cap, spring, and plunger; (b) membrane switch with contacts on a conductor-coated membrane; (c) logical arrangement of keys on a hex keypad, a 4 × 4 grid of 0-9 and a-f.)
Figure 21.3 Two mechanical switch designs and the logical layout of a hex keypad.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 10
Projection Virtual Keyboard
Hardware: a tiny laser device projects the image of a full-size keyboard on any surface.
Software: emulates a real keyboard, even producing key-click sounds.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 11
Pointing Devices
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 12
How a Mouse Works
(Figure: (a) mechanical mouse: a ball touching the x and y rollers rotates them via friction as the mouse moves on the mouse pad; (b) simple optical mouse: a photosensor detects crossing of grid lines on a special pad.)
Figure 21.4 Mechanical and simple optical mice.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 13
21.3 Visual Display Units
(Figure: (a) image formation on a CRT: the electron gun's beam, steered by deflection coils, scans ≅ 1K lines of ≅ 1K pixels each on a sensitive screen; (b) the data defining the image: pixel info (brightness, color, etc.) held in a frame buffer.)
Figure 21.5 CRT display unit and image storage in frame buffer.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 14
How Color CRT Displays Work
(Figure: (a) the RGB color stripes on the faceplate; (b) use of a shadow mask: the red, green, and blue beams arrive from slightly different directions and each strikes only stripes of its own color.)
Figure 21.6 The RGB color scheme of modern CRT displays.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 15
Encoding Colors in RGB Format
Besides hue, saturation is used to affect the color’s appearance
(high saturation at the top, low saturation at the bottom)
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 16
Flat-Panel Displays
(Figure: (a) passive display: row lines receive an address pulse while column (data) lines carry column pulses; (b) active display: similarly organized, with an active element per pixel.)
Figure 21.7 Passive and active LCD displays.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 17
Flexible Display Devices
Paper-thin tablet-size
display unit by E Ink
Sony organic light-emitting
diode (OLED) flexible display
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 18
Other Display Technologies
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 19
21.4 Hard-Copy Input/Output Devices
(Figure: a light source illuminates the document (face down); mirrors, filters, and a lens direct the light beam to a charge-coupled device (CCD) detector; an A/D converter and scanning software produce the image file.)
Figure 21.8 Scanning mechanism for hard-copy input.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 20
Character Formation by Dot Matrices
(Figure: the letter "D" rendered as dot patterns: a coarse dot matrix, the same dot matrix size but with greater resolution, and progressively larger matrices yielding smoother character shapes.)
Figure 21.9 Forming the letter "D" via dot matrices of varying sizes.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 21
Simulating Intensity Levels via Dithering
Forming five gray levels on a device that supports
only black and white (e.g., ink-jet or laser printer)
Using the dithering patterns above on each of
three colors forms 5 × 5 × 5 = 125 different colors
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 22
Simple Dot-Matrix Printer Mechanism
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 23
Common Hard-Copy Output Devices
(Figure: (a) ink jet printing: a print head assembly with an ink supply moves across the sheet of paper, firing ink droplets as the paper advances; (b) laser printing: light from an optical system writes on a rotating drum charged by a corona wire; toner adheres to the written areas, is transferred to the sheet of paper by rollers, the toner is fused by a heater, and excess toner is cleaned off.)
Figure 21.10 Ink-jet and laser printers.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 24
How Color Printers Work
The RGB scheme of color monitors is additive: various amounts of the three primary colors, red, green, and blue, are added to form a desired color.

The CMY scheme of color printers is subtractive: various amounts of the three primary colors, cyan, magenta, and yellow, are removed from white to form a desired color (magenta, for instance, is the absence of green).

To produce a more satisfactory shade of black, the CMYK scheme is often used (K = black).
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 25
The CMYK Printing Process
Illusion of
full color
created with
CMYK dots
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 26
Color Wheels
Artist’s color wheel,
used for mixing paint
Subtractive color wheel,
used in printing (CMYK)
Additive color wheel,
used for projection
Primary colors appear at center and equally spaced around the perimeter
Secondary colors are midway between primary colors
Tertiary colors are between primary and secondary colors
Source of this and several other slides on color: http://www.devx.com/projectcool/Article/19954/0/
(see also color theory tutorial: http://graphics.kodak.com/documents/Introducing%20Color%20Theory.pdf)
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 27
21.5 Other Input/Output Devices
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 28
Sensors and Actuators
Collecting info about the environment and other conditions
• Light sensors (photocells)
• Temperature sensors (contact and noncontact types)
• Pressure sensors
(Figure: S and N magnetic poles around a rotor, shown in (a) the initial state and (b) after rotation by one step.)
Figure 21.11 Stepper motor principles of operation.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 29
Converting Circular Motion to Linear Motion
(Photos: a locomotive's drive mechanism; a screw drive.)
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 30
21.6 Networking of Input/Output Devices
(Figure: computers 1-3, printers 1-3, and a camera all attach to an Ethernet.)
Figure 21.12 With network-enabled peripherals,
I/O is done via file transfers.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 31
Input/Output in Control and Embedded Systems
(Figure: the CPU and memory connect to five interfaces:
• a network interface, linking to intelligent devices, other computers, archival storage, ...
• a digital output interface with signal conversion, driving digital actuators: stepper motors, relays, alarms, ...
• a D/A output interface with signal conversion, driving analog actuators: valves, pumps, speed regulators, ...
• a digital input interface with digital signal conditioning, reading digital sensors: detectors, counters, on/off switches, ...
• an A/D input interface with analog signal conditioning, reading analog sensors: thermocouples, pressure sensors, ...)
Figure 21.13 The structure of a closed-loop
computer-based control system.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 32
22 Input/Output Programming
Like everything else, I/O is controlled by machine instructions
• I/O addressing (memory-mapped) and performance
• Scheduled vs demand-based I/O: polling vs interrupts
Topics in This Chapter
22.1 I/O Performance and Benchmarks
22.2 Input/Output Addressing
22.3 Scheduled I/O: Polling
22.4 Demand-Based I/O: Interrupts
22.5 I/O Data Transfer and DMA
22.6 Improving I/O Performance
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 33
22.1 I/O Performance and Benchmarks
Example 22.1: The I/O wall
An industrial control application spent 90% of its time on CPU
operations when it was originally developed in the early 1980s.
Since then, the CPU component has been upgraded every 5 years,
but the I/O components have remained the same. Assuming that
CPU performance improved tenfold with each upgrade, derive the
fraction of time spent on I/O over the life of the system.
Solution
Apply Amdahl’s law with 90% of the task speeded up by factors of
10, 100, 1000, and 10000 over a 20-year period. In the course of
these upgrades the running time has been reduced from the original
1 to 0.1 + 0.9/10 = 0.19, 0.109, 0.1009, and 0.10009, making the
fraction of time spent on input/output 52.6, 91.7, 99.1, and 99.9%,
respectively. The last couple of CPU upgrades did not really help.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 34
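The arithmetic of the solution is one line per upgrade; a minimal sketch:

    f_io = 0.10                      # fraction of the original time spent on I/O
    for k in range(1, 5):            # four CPU upgrades, each 10x
        runtime = f_io + (1 - f_io) / 10**k      # Amdahl's law, original time = 1
        print(f"upgrade {k}: runtime {runtime:.5f}, "
              f"I/O fraction {100 * f_io / runtime:.1f}%")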
Types of Input/Output Benchmark
Supercomputer I/O benchmarks
Reading large volumes of input data
Writing many snapshots for checkpointing
Saving a relatively small set of results
I/O data throughput, in MB/s, is important
Transaction processing I/O benchmarks
Huge database, but each transaction fairly small
A handful (2-10) of disk accesses per transaction
I/O rate (disk accesses per second) is important
File system I/O benchmarks
File creation, directory management, indexing, . . .
Benchmarks are usually domain-specific
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 35
22.2 Input/Output Addressing

(Figure: four memory-mapped 32-bit device registers. Hex address 0xffff0000: keyboard control, with interrupt enable and device ready bits among bits 7-0; 0xffff0004: keyboard data, a data byte in bits 7-0; 0xffff0008: display control, with the same bit layout; 0xffff000c: display data.)
Figure 22.1 Control and data registers for keyboard
and display unit in MiniMIPS.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 36
Hardware for I/O Addressing
(Figure: the memory bus's control, address, and data lines enter the device controller; a comparator matches the bus address against the device address, and control logic then steers data to or from the device data and device status registers.)
Figure 22.2 Addressing logic for an I/O device controller.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 37
Data Input from Keyboard
Example 22.2
Write a sequence of MiniMIPS assembly language instructions to make the program wait until the keyboard has a symbol to transmit and then read the symbol into register $v0 (device registers as in Figure 22.1).
Solution
The program must continually examine the keyboard control register, ending its "busy wait" when the R bit has been asserted.

           lui   $t0,0xffff        # put 0xffff0000 in $t0
    idle:  lw    $t1,0($t0)        # get keyboard's control word
           andi  $t1,$t1,0x0001    # isolate the LSB (R bit)
           beq   $t1,$zero,idle    # if not ready (R = 0), wait
           lw    $v0,4($t0)        # retrieve data from keyboard

This type of input is appropriate only if the computer is waiting for a critical input and cannot continue in the absence of such input.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 38
Data Output to Display Unit
Example 22.3
Write a sequence of MiniMIPS assembly language instructions to make the program wait until the display unit is ready to accept a new symbol and then write the symbol from $a0 to the display unit (device registers as in Figure 22.1).
Solution
The program must continually examine the display unit's control register, ending its "busy wait" when the R bit has been asserted.

           lui   $t0,0xffff        # put 0xffff0000 in $t0
    idle:  lw    $t1,8($t0)        # get display's control word
           andi  $t1,$t1,0x0001    # isolate the LSB (R bit)
           beq   $t1,$zero,idle    # if not ready (R = 0), wait
           sw    $a0,12($t0)       # supply data to display unit

This type of output is appropriate only if we can afford to have the CPU dedicated to data transmission to the display unit.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 39
22.3 Scheduled I/O: Polling
Examples 22.4, 22.5, 22.6
What fraction of a 1 GHz CPU’s time is spent polling the following
devices if each polling action takes 800 clock cycles?
Keyboard must be interrogated at least 10 times per second
Floppy sends data 4 bytes at a time at a rate of 50 KB/s
Hard drive sends data 4 bytes at a time at a rate of 3 MB/s
Solution
For the keyboard, divide the number of cycles needed for 10 interrogations by the total number of cycles available in 1 second:
(10 × 800)/10^9 ≅ 0.001%
The floppy disk must be interrogated 50K/4 = 12.5K times per sec:
(12.5K × 800)/10^9 ≅ 1%
The hard disk must be interrogated 3M/4 = 750K times per sec:
(750K × 800)/10^9 ≅ 60%
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 40
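All three computations share one formula: polls per second × 800 cycles, divided by the 10^9 cycles available per second. A minimal sketch:

    CLOCK_HZ, POLL_CYCLES = 1e9, 800

    def polling_fraction(polls_per_sec):
        """Fraction of CPU time consumed by polling at the given rate."""
        return polls_per_sec * POLL_CYCLES / CLOCK_HZ

    print(f"keyboard:   {polling_fraction(10):.5%}")        # ~0.001%
    print(f"floppy:     {polling_fraction(50e3 / 4):.1%}")  # ~1%
    print(f"hard drive: {polling_fraction(3e6 / 4):.0%}")   # ~60%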
22.4 Demand-Based I/O: Interrupts
Example 22.7
Consider the disk in Example 22.6 (transferring 4 B chunks of data at
3 MB/s when active). Assume that the disk is active 5% of the time.
The overhead of interrupting the CPU and performing the transfer is
1200 clock cycles. What fraction of a 1 GHz CPU’s time is spent
attending to the hard disk drive?
Solution
When active, the hard disk produces 750K interrupts per second
0.05 × (750K × 1200)/10^9 ≅ 4.5% (compare with 60% for polling)
Note that even though the overhead of interrupting the CPU is higher
than that of polling, because the disk is usually idle, demand-based
I/O leads to better performance.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 41
Interrupt Handling
Upon detecting an interrupt signal, provided the particular
interrupt or interrupt class is not masked, the CPU acknowledges
the interrupt (so that the device can deassert its request signal)
and begins executing an interrupt service routine.
1. Save the CPU state and call the interrupt service routine.
2. Disable all interrupts.
3. Save minimal information about the interrupt on the stack.
4. Enable interrupts (or at least higher priority ones).
5. Identify cause of interrupt and attend to the underlying request.
6. Restore CPU state to what existed before the last interrupt.
7. Return from interrupt service routine.
The capability to handle nested interrupts is important in dealing with
multiple high-speed I/O devices.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 42
22.5 I/O Data Transfer and DMA

(Figure: the system bus, with address, data, and control lines such as ReadWrite′ and DataReady′, links the CPU and cache, main memory, and a typical I/O device; the DMA controller, holding status, source, destination (dest'n), and length registers, asserts a bus request and drives the bus once it receives the bus grant.)
Figure 22.3 DMA controller shares the system
or memory bus with the CPU.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 43
DMA Operation
(Figure: BusRequest and BusGrant exchanges between the CPU and the DMA controller for (a) DMA transfer in one continuous burst and (b) DMA transfer in several shorter bursts.)
Figure 22.4 DMA operation and the associated
transfers of bus control.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 44
22.6 Improving I/O Performance
Example 22.9: Effective I/O bandwidth from disk
Consider a hard disk drive with 512 B sectors, average access latency
of 10 ms, and peak throughput of 10 MB/s. Plot the variation of the
effective I/O bandwidth as the unit of data transfer (block) varies in
size from 1 sector (0.5 KB) to 1024 sectors (500 KB).
Solution
(Plot: throughput in MB/s, 0-10, versus block size in KB, 0-500; the effective bandwidth climbs from 0.05 MB/s for 0.5 KB blocks through 5 MB/s at 100 KB toward the 10 MB/s peak.)
Figure 22.5
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 45
Computing the Effective Throughput
Elaboration on Example 22.9: Effective I/O bandwidth from disk
Total access time for x bytes = 10 ms + transfer time = (0.01 + 10^-7 x) s
Effective access time per byte = (0.01 + 10^-7 x)/x s/B
Effective transfer rate = x/(0.01 + 10^-7 x) B/s
For x = 100 KB: Effective transfer rate = 10^5/(0.01 + 10^-2) = 5×10^6 B/s
(Plot repeated from Figure 22.5, annotated with the average access latency of 10 ms and the peak throughput of 10 MB/s.)
Figure 22.5
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 46
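The curve in Figure 22.5 comes straight from the last formula; a minimal sketch:

    def effective_MBps(x_bytes, latency_s=0.01, peak_Bps=10e6):
        """Effective transfer rate x / (0.01 + 1e-7 x), expressed in MB/s."""
        return x_bytes / (latency_s + x_bytes / peak_Bps) / 1e6

    for kb in (0.5, 10, 100, 500):
        print(f"{kb:6.1f} KB block -> {effective_MBps(kb * 1000):.2f} MB/s")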
Distributed Input/Output
(Figure: CPU-memory modules connect through HCAs, host channel adapters, to a fabric of switches, including a module with a built-in switch; a router leads to other subnets, and I/O units attach through their own HCAs.)
Figure 22.6 Example configuration for the Infiniband distributed I/O.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 47
23 Buses, Links, and Interfacing
Shared links or buses are common in modern computers:
• Fewer wires and pins, greater flexibility & expandability
• Require dealing with arbitration and synchronization
Topics in This Chapter
23.1 Intra- and Intersystem Links
23.2 Buses and Their Appeal
23.3 Bus Communication Protocols
23.4 Bus Arbitration and Performance
23.5 Basics of Interfacing
23.6 Interfacing Standards
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 48
23.1 Intra- and Intersystem Links

(Figure: (a) cross section of layers: metal layers 1-4 joined by vias and contacts; a trench with via is 1. etched and insulated, 2. coated with copper, 3. excess copper removed; (b) 3D view of wires on multiple metal layers.)
Figure 23.1 Multiple metal layers provide intrasystem connectivity
on microchips or printed-circuit boards.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 49
Multiple Metal Layers on a Chip or PC Board
(Photo: cross section of metal layers above the active elements and their connectors.)
Modern chips have 8-9 metal layers. Upper layers carry longer wires, as well as those that need more power.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 50
Intersystem Links
Figure 23.2 Example intersystem connectivity schemes: (a) RS-232 between a computer and a device, (b) Ethernet, (c) ATM.

Figure 23.3 RS-232 serial interface 9-pin connector (standard DE-9 assignment: pin 2: receive data; pin 3: transmit data; pin 4: DTR, data terminal ready; pin 5: signal ground; pin 6: DSR, data set ready; pin 7: RTS, request to send; pin 8: CTS, clear to send).
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 51
Intersystem Communication Media
(Figure: twisted pair: insulated copper wires twisted inside a plastic jacket; coaxial cable: copper core, insulator, outer conductor; optical fiber: light from a source travels along the silica core by repeated reflection.)
Figure 23.4 Commonly used communication media
for intersystem connections.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 52
Comparing Intersystem Links
Table 23.1
Summary of three interconnection schemes.
Interconnection properties       RS-232        Ethernet      ATM
Maximum segment length (m)       10s           100s          1000s
Maximum network span (m)         10s           100s          Unlimited
Bit rate (Mb/s)                  Up to 0.02    10/100/1000   155-2500
Unit of transmission (B)         1             100s          53
Typical end-to-end latency (ms)  <1            10s-100s      100s
Typical application domain       Input/Output  LAN           Backbone
Transceiver complexity or cost   Low           Low           High
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 53
23.2 Buses and Their Appeal

(Figure: n units, numbered 0, 1, 2, 3, ..., n–2, n–1, connected point-to-point versus over a shared bus.)

Point-to-point connections between n units require n(n – 1) channels,
or n(n – 1)/2 bidirectional links; that is, O(n²) links
Bus connectivity requires only one input and one output port per unit,
or O(n) links in all
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 54
Bus Components and Types
(Figure: a bus comprises control lines for handshaking, direction, transfer mode, arbitration, ...; address lines; and data lines, from one bit (serial) to several bytes wide, which may be shared with the address lines.)
Figure 23.5 The three sets of lines found in a bus.
A typical computer may use a dozen or so different buses:
1. Legacy Buses: PC bus, ISA, RS-232, parallel port
2. Standard buses: PCI, SCSI, USB, Ethernet
3. Proprietary buses: for specific devices and max performance
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 55
23.3 Bus Communication Protocols

(Timing diagram: in each transaction the address is placed on the bus on a clock edge; after fixed wait cycles, data availability is ensured and the data lines are sampled.)
Figure 23.6 Synchronous bus with fixed-latency devices.
(Timing diagram: on an asynchronous bus, Request is asserted along with the address or data lines; Ack answers it and Ready signals validity, each transition acknowledging the previous one, with wait periods inserted as needed.)
Figure 23.7 Handshaking on an asynchronous bus for an
input operation (e.g., reading from memory).
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 56
Example Bus Operation
(Timing diagram: over successive clock cycles, FRAME′ is asserted for transfer initiation; C/BE′ carries the I/O read command and then the byte enables; AD carries the address, then, after an AD turnaround cycle, data 0 through data 3; IRDY′ and TRDY′ insert wait cycles between the data transfers; DEVSEL′ signals device selection.)
Figure 23.8 I/O read operation via PCI bus.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 57
23.4 Bus Arbitration and Performance

(Figure: request lines R0, R1, R2, ..., Rn−1 pass through a synchronizer into the arbiter, which asserts exactly one grant line G0, G1, G2, ..., Gn−1 at a time; a bus release signal returns control to the arbiter.)
Figure 23.9 General structure of a centralized bus arbiter.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 58
Some Simple Bus Arbiters
(Figure: a round-robin arbiter built from a ring counter holding a one-hot pattern, such as 0 0 0 0 1 0 0 0, that rotates among the requesters, versus a fixed-priority arbiter in which request Ri beats all lower-priority requests, yielding grants G0, ..., Gi, ..., Gn–1.)
Starvation avoidance
Rotating priority
With fixed priorities, low-priority
units may never get to use the
bus (they could “starve”)
Idea: Order the units circularly,
rather than linearly, and allow the
highest-priority status to rotate
among the units (combine a ring
counter with a priority circuit)
Combining priority with service
guarantee is desirable
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 59
Daisy Chaining
(Figure: requests R0, R1, R2, ... reach a small central arbiter through a synchronizer; the single bus grant propagates along a daisy chain of devices A, B, C, D, each passing it on if it has no pending bus request; shared bus request and bus release lines complete the protocol.)
Figure 23.9 Daisy chaining allows a small centralized arbiter to
service a large number of devices that use a shared resource.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 60
23.5 Basics of Interfacing

(Figure: a wind vane's wiper makes a contact point on a circular resistive element connected between ground and +5 V DC, with the N, E, S, W directions marked; the tapped voltage feeds pin x of port y on a microcontroller with internal A/D converter.)
Figure 23.11 Wind vane supplying an output voltage in
the range 0-5 V depending on wind direction.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 61
23.6 Interfacing Standards

Table 23.2 Summary of four standard interface buses.

Attributes                      PCI          SCSI          FireWire     USB
Type of bus                     Backplane    Parallel I/O  Serial I/O   Serial I/O
Standard designation            PCI          ANSI X3.131   IEEE 1394    USB 2.0
Typical application domain      System       Fast I/O      Fast I/O     Low-cost I/O
Bus width (data bits)           32-64        8-32          2            1
Peak bandwidth (MB/s)           133-512      5-40          12.5-50      0.2-15
Maximum number of devices       1024*        7-31#         63           127$
Maximum span (m)                <1           3-25          4.5-72$      5-30$
Arbitration method              Centralized  Self-select   Distributed  Daisy chain
Transceiver complexity or cost  High         Medium        Medium       Low

Notes: * 32 per bus segment; # One less than bus width; $ With hubs (repeaters)
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 62
Standard Connectors
(Figure: USB A connector on the host side and USB B on the device side; pin 1: +5V DC, pin 2: Data −, pin 3: Data +, pin 4: Ground; max cable length: 5 m. The host (controller & hub) fans out through hubs to devices, and a single product may combine hub & device.)
Figure 23.12 USB connectors and connectivity structure.

(Figure: IEEE 1394 connector pinout: pin 1: 8-40V DC, 1.5 A; pin 2: Ground; pin 3: Twisted pair B −; pin 4: Twisted pair B +; pin 5: Twisted pair A −; pin 6: Twisted pair A +; shell: Outer shield.)
Figure 23.13 IEEE 1394 (FireWire) connector. The same connector is used at both ends.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 63
24 Context Switching and Interrupts
OS initiates I/O transfers and awaits notification via interrupts
• When an interrupt is detected, the CPU switches context
• Context switch can also be used between users/threads
Topics in This Chapter
24.1 System Calls for I/O
24.2 Interrupts, Exceptions, and Traps
24.3 Simple Interrupt Handling
24.4 Nested Interrupts
24.5 Types of Context Switching
24.6 Threads and Multithreading
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 64
24.1 System Calls for I/O
Why the user must be isolated from details of I/O operations
Protection: User must be barred from accessing some disk areas
Convenience: No need to learn details of each device’s operation
Efficiency: Most users incapable of finding the best I/O scheme
I/O abstraction: grouping of I/O devices into a small number of generic types so as to make I/O device-independent
Character stream I/O: get(●), put(●) – e.g., keyboard, printer
Block I/O: seek(●), read(●), write(●) – e.g., disk
Network Sockets: create socket, connect, send/receive packet
Clocks or timers: set up timer (get notified via an interrupt)
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 65
24.2 Interrupts, Exceptions, and Traps

Interrupt: both the general term for any diversion and the I/O type
Exception: caused by an illegal operation (often unpredictable)
Trap: AKA "software interrupt" (preplanned and not rare)
(Figure, a timeline analogy: studying Parhami's book for a test begins at 6:55; at 7:40 the stomach sends an interrupt signal and dinner is eaten until 8:01; e-mail arrives and reading/sending e-mail follows; a telemarketer calls at 8:42 and the best friend calls at 8:53, with talking on the phone until 9:20; studying then resumes until 9:46.)
Figure 24.1 The notions of interrupts and nested interrupts.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 66
24.3 Simple Interrupt Handling

Acknowledge the interrupt by asserting the IntAck signal
Notify the CPU's next-address logic that an interrupt is pending
Set the interrupt mask so that no new interrupt is accepted

(Figure: a device's IntReq is synchronized and latched in a flip-flop; gated by the interrupt mask flip-flop, which IntDisable sets and IntEnable resets, it raises IntAlert to the CPU, while IntAck returns the interrupt acknowledge to the device.)
Figure 24.2 Simple interrupt logic for the single-cycle MicroMIPS.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 67
Interrupt Timing
(Timing diagram: IntReq rises asynchronously; a synchronized version is captured on a clock edge, after which IntAck, IntMask, and IntAlert are asserted in turn.)
Figure 24.3 Timing of interrupt request and acknowledge signals.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 68
Next-Address Logic with Interrupts Added
(Figure: the NextPC logic selects, under PCSrc control, among IncrPC, (PC)31:28 | jta, (rs)31:2, SysCallAddr, and IntHandlerAddr; a mux controlled by IntAlert diverts instruction fetch to the interrupt handler address while the old PC is preserved; all paths are 30 bits wide.)
Figure 24.4 Part of the next-address logic for single-cycle MicroMIPS, with an interrupt capability added (compare with the lower left part of Figure 13.4).
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 69
24.4 Nested Interrupts

(Figure: a program prog executes inst(a), inst(b), ...; when int1 is detected, interrupts are disabled and (PC) is saved, and the first interrupt handler saves state, saves int info, and re-enables interrupts. If int2 is detected while the handler runs inst(c), inst(d), ..., the same sequence invokes a second handler, which restores state and returns; the first handler then restores state and returns to the program.)
Figure 24.6 Example of nested interrupts.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 70
24.5 Types of Context Switching

(Figure: (a) human multitasking: scanning e-mail messages, taking notes, and talking on the telephone; (b) computer multitasking: tasks 1, 2, and 3 each run for a time slice, with a context switch between them.)
Figure 24.7 Multitasking in humans and computers.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 71
24.6 Threads and Multithreading

(Figure: (a) task graph of a program, in which tasks spawn additional threads and meet at sync points; (b) thread structure of a task, with threads 1, 2, and 3 synchronizing as they proceed.)
Figure 24.8 A program divided into tasks (subcomputations) or threads.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 72
Multithreaded Processors
(Figure: instructions from threads in memory enter the issue pipelines, with bubbles where a thread stalls, pass through the function units, and complete in the retirement and commit pipeline.)
Figure 24.9 Instructions from multiple threads as they make
their way through a processor’s execution pipeline.
Feb. 2011
Computer Architecture, Input/Output and Interfacing
Slide 73
Part VII
Advanced Architectures
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 1
VII Advanced Architectures
Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing
Topics in This Part
Chapter 25 Road to Higher Performance
Chapter 26 Vector and Array Processing
Chapter 27 Shared-Memory Multiprocessing
Chapter 28 Distributed Multicomputing
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 3
25 Road to Higher Performance
Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism
Topics in This Chapter
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 4
25.1 Past and Current Performance Trends

Intel 4004, the first μp (1971): 0.06 MIPS (4-bit processor)
Intel Pentium 4, circa 2005: 10,000 MIPS (32-bit processor)

(Figure: the Intel processor lineage from the 4004 through the 8-bit 8008, 8080, and 8085; the 16-bit 8086, 8088, 80186, 80188, and 80286; and the 32-bit 80386, 80486, Pentium, MMX, Pentium Pro, II, Pentium III, M, and Celeron.)
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 5
Architectural Innovations for Improved Performance
Architectural method                         Improvement factor
Established methods (previously discussed):
1. Pipelining (and superpipelining)          3-8 √
2. Cache memory, 2-3 levels                  2-5 √
3. RISC and related ideas                    2-3 √
4. Multiple instruction issue (superscalar)  2-3 √
5. ISA extensions (e.g., for multimedia)     1-3 √
Newer methods (covered in Part VII):
6. Multithreading (super-, hyper-)           2-5 ?
7. Speculation and value prediction          2-3 ?
8. Hardware acceleration                     2-10 ?
9. Vector and array processing               2-10 ?
10. Parallel/distributed computing           2-1000s ?

Available computing power ca. 2000: GFLOPS on desktop, TFLOPS in supercomputer center, PFLOPS on drawing board

Computer performance grew by a factor of about 10,000 between 1980 and 2000: 100 due to faster technology, 100 due to better architecture
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 6
Peak Performance of Supercomputers
(Plot: peak supercomputer performance from GFLOPS to PFLOPS, 1980-2010, growing about ×10 every 5 years: Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, Earth Simulator.)
Dongarra, J., “Trends in High Performance Computing,”
Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 7
Energy Consumption is Getting out of Hand
(Plot: performance from kIPS to TIPS versus calendar year, 1980-2010, showing absolute processor performance, GP processor performance per watt, and DSP performance per watt; the per-watt curves lag the absolute curve.)
Figure 25.1 Trend in energy consumption for each MIPS of
computational power in general-purpose processors and DSPs.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 8
25.2 Performance-Driven ISA Extensions
Adding instructions that do more work per cycle
Shift-add: replace two instructions with one (e.g., multiply by 5)
Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications)
Intel MMX: multimedia extension
64-bit registers can hold multiple integer operands
Intel SSE: Streaming SIMD extension
128-bit registers can hold several floating-point operands
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 9
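Subword parallelism can be mimicked in software to see what "inhibit carry at boundaries" means. The sketch below (operand values are made up) adds two 64-bit words as eight independent byte lanes with wraparound semantics; saturation would clamp instead of wrapping.

    def parallel_add_bytes(x, y):
        """Add two 64-bit words as eight byte lanes; carries cannot cross lanes."""
        result = 0
        for lane in range(8):
            a = (x >> (8 * lane)) & 0xFF
            b = (y >> (8 * lane)) & 0xFF
            result |= ((a + b) & 0xFF) << (8 * lane)   # wrap within the lane
        return result

    x = 0x01020304050607FF
    y = 0x0101010101010101
    print(hex(parallel_add_bytes(x, y)))   # 0x203040506070800: low lane wrapped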
Table 25.1 Intel MMX ISA extension.

Class       Instruction                   Vector         Op type         Function or results
Copy        Register copy                 32 bits                        Integer register ↔ MMX register
            Parallel pack                 4, 2           Saturate        Convert to narrower elements
            Parallel unpack low           8, 4, 2                        Merge lower halves of 2 vectors
            Parallel unpack high          8, 4, 2                        Merge upper halves of 2 vectors
Arithmetic  Parallel add                  8, 4, 2        Wrap/Saturate#  Add; inhibit carry at boundaries
            Parallel subtract             8, 4, 2        Wrap/Saturate#  Subtract with carry inhibition
            Parallel multiply low         4                              Multiply, keep the 4 low halves
            Parallel multiply high        4                              Multiply, keep the 4 high halves
            Parallel multiply-add         4                              Multiply, add adjacent products*
            Parallel compare equal        8, 4, 2                        All 1s where equal, else all 0s
            Parallel compare greater      8, 4, 2                        All 1s where greater, else all 0s
Shift       Parallel left shift logical   4, 2, 1                        Shift left, respect boundaries
            Parallel right shift logical  4, 2, 1                        Shift right, respect boundaries
            Parallel right shift arith    4, 2                           Arith shift within each (half)word
Logic       Parallel AND                  1              Bitwise         dest ← (src1) ∧ (src2)
            Parallel ANDNOT               1              Bitwise         dest ← (src1) ∧ (src2)′
            Parallel OR                   1              Bitwise         dest ← (src1) ∨ (src2)
            Parallel XOR                  1              Bitwise         dest ← (src1) ⊕ (src2)
Memory      Parallel load MMX reg         32 or 64 bits                  Address given in integer register
access      Parallel store MMX reg        32 or 64 bits                  Address given in integer register
Control     Empty FP tag bits                                            Required for compatibility$
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 10
MMX Multiplication and Multiply-Add
(Figure: (a) parallel multiply low: the four halfword elements of two vectors, such as (a, b, d, e) and (e, f, g, h), are multiplied pairwise and the low halves of the products are kept; (b) parallel multiply-add: adjacent products s, t, u, v are summed in pairs, producing s + t and u + v.)
Figure 25.2 Parallel multiplication and multiply-add in MMX.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 11
MMX Parallel Comparisons
(Figure: (a) parallel compare equal on 4-element halfword vectors, e.g., (14, 3, 58, 66) vs (79, 1, 58, 65) yields (0, 0, 65 535, 0), all 1s where equal and all 0s elsewhere; (b) parallel compare greater on 8-element byte vectors, e.g., (5, 3, 12, 32, 5, 6, 12, 9) vs (3, 12, 22, 17, 5, 12, 90, 8) yields 255 (all 1s) in each position where the first operand is greater, else 0.)
Figure 25.3 Parallel comparisons in MMX.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 12
25.3 Instruction-Level Parallelism

(Plots: (a) fraction of cycles, 0-30%, versus issuable instructions per cycle, 0-8; (b) speedup attained, 1-3, versus instruction issue width, 2-8.)
Figure 25.4 Available instruction-level parallelism and the speedup
due to multiple instruction issue in superscalar processors [John91].
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 13
Instruction-Level Parallelism
Figure 25.5 A computation with inherent instruction-level parallelism.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 14
VLIW and EPIC Architectures
VLIW: very long instruction word architecture
EPIC: explicitly parallel instruction computing

(Figure: in IA-64, memory feeds multiple execution units that share general registers (128), floating-point registers (128), and predicates (64).)
Figure 25.6 Hardware organization for IA-64. General and floating-point registers are 64-bit wide. Predicates are single-bit registers.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 15
25.4 Speculation and Value Prediction

(Figure: (a) control speculation: a load is hoisted above a branch as a "spec load", leaving a "check load" at the original site; (b) data speculation: a load is hoisted above a possibly conflicting store as a "spec load", again with a "check load" where the load originally stood.)
Figure 25.7 Examples of software speculation in IA-64.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 16
Value Prediction
(Figure: the operand inputs address a memo table in parallel with the mult/div unit; on a memo-table hit, a mux selects the remembered output and "output ready" is signaled early, while on a miss the unit's result is used when done and recorded.)
Figure 25.8 Value prediction for multiplication or
division via a memo table.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 17
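The memo-table idea reduces to a dictionary lookup in front of the slow operation. A minimal sketch, with integer division standing in for the mult/div pipeline:

    memo = {}                 # maps operand pairs to previously computed results

    def memoized_divide(a, b):
        """On a memo hit the result is ready early; on a miss it is computed
        by the (slow) functional unit and recorded for next time."""
        if (a, b) in memo:
            return memo[(a, b)], "memo hit"
        q = a // b                            # stand-in for the divide pipeline
        memo[(a, b)] = q
        return q, "computed"

    print(memoized_divide(1000, 24))          # (41, 'computed')
    print(memoized_divide(1000, 24))          # (41, 'memo hit')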
25.5 Special-Purpose Hardware Accelerators
(Figure: a CPU with data and program memory couples to an FPGA-like unit on which accelerators can be formed via loading of configuration registers from a configuration memory; accelerators 1-3 occupy parts of the fabric, leaving unused resources.)
Figure 25.9 General structure of a processor with configurable hardware accelerators.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 18
Graphic Processors, Network Processors, etc.
(Figure: an input buffer feeds a 4 × 4 grid of processing elements, PE 0 through PE 15, arranged in four columns with column memories; a feedback path allows multiple passes before results leave through the output buffer.)
Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems' network processor.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 19
25.6 Vector, Array, and Parallel Processing

Flynn's categories (single/multiple instruction streams × single/multiple data streams):
SISD: uniprocessors
SIMD: array or vector processors
MISD: rarely used
MIMD: multiprocessors or multicomputers

Johnson's expansion of MIMD (global/distributed memory × shared variables/message passing):
GMSV: shared-memory multiprocessors
GMMP: rarely used
DMSV: distributed shared memory
DMMP: distributed-memory multicomputers

Figure 25.11 The Flynn-Johnson classification of computer systems.
Feb. 2011
Computer Architecture, Advanced Architectures
Slide 20
SIMD Architectures
Data parallelism: executing one operation on multiple data streams
Concurrency in time – vector processing
Concurrency in space – array processing
Example to provide context
Multiplying a coefficient vector by a data vector (e.g., in filtering)
y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing
(details in the first half of Chapter 26)
One instruction is fetched and decoded for the entire operation
The multiplications are known to be independent (no checking)
Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar (details in the second half of Chapter 26)
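A minimal C rendering of the y[i] := c[i] × x[i] example above (an illustrative sketch; the restrict qualifiers are an added assumption telling a vectorizing compiler that the multiplications are independent, mirroring the "no checking" point):

/* Scalar form of y[i] := c[i] * x[i], 0 <= i < n (e.g., in filtering).
   A vectorizing compiler can map this loop onto vector or array
   hardware, since restrict promises the iterations are independent. */
#include <stdio.h>

void vec_mult(float *restrict y, const float *restrict c,
              const float *restrict x, int n) {
    for (int i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

int main(void) {
    float c[4] = {1, 2, 3, 4}, x[4] = {5, 6, 7, 8}, y[4];
    vec_mult(y, c, x, 4);
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* 5 12 21 32 */
    return 0;
}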
MISD Architecture Example

Figure 25.12 Multiple instruction streams (1-5) operating on a single data stream (MISD): data flows in, is processed by a cascade of units each executing its own instruction stream, and flows out.
MIMD Architectures
Control parallelism: executing several instruction streams in parallel
GMSV: Shared global memory – symmetric multiprocessors
DMSV: Shared distributed memory – asymmetric multiprocessors
DMMP: Message passing – multicomputers
Figure 27.1 Centralized shared memory: p processors reach m memory modules through a processor-to-memory network, with a separate processor-to-processor network and parallel input/output.

Figure 28.1 Distributed memory: p computing nodes, each combining memories and processors with a router, attach to an interconnection network with parallel I/O.
Amdahl's Law Revisited

f = sequential fraction
p = speedup of the rest with p processors

s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)

Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast. Speedup s (0-50) is plotted against the enhancement factor p (0-50) for f = 0 (the line s = p), 0.01, 0.02, 0.05, and 0.1.
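A small numeric companion to the formula above (an illustrative sketch; the sampled p values are arbitrary):

/* Amdahl's law: s = 1 / (f + (1 - f) / p), bounded by min(p, 1/f).
   Prints points on the curves plotted in Figure 4.4. */
#include <stdio.h>

static double amdahl(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    const double fs[] = {0.0, 0.01, 0.02, 0.05, 0.1};
    for (int i = 0; i < 5; i++) {
        printf("f = %.2f:", fs[i]);
        for (double p = 10; p <= 50; p += 10)
            printf("  s(%2.0f) = %6.2f", p, amdahl(fs[i], p));
        printf("\n");
    }
    return 0;   /* e.g., f = 0.05 gives s(50) = 14.49, well below p */
}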
26 Vector and Array Processing
Single instruction stream operating on multiple data streams
• Data parallelism in time = vector processing
• Data parallelism in space = array processing
Topics in This Chapter
26.1 Operations on Vectors
26.2 Vector Processor Implementation
26.3 Vector Processor Performance
26.4 Shared-Control Systems
26.5 Array Processor Implementation
26.6 Array Processor Performance
26.1 Operations on Vectors

Sequential processor:
for i = 0 to 63 do
  P[i] := W[i] × D[i]
endfor

Vector processor:
load W
load D
P := W × D
store P

Unparallelizable (each iteration depends on the previous one):
for i = 0 to 63 do
  X[i+1] := X[i] + Z[i]
  Y[i+1] := X[i+1] + Y[i]
endfor
26.2 Vector Processor Implementation

Figure 26.1 Simplified generic structure of a vector processor: load units A and B and a store unit connect the memory unit to a vector register file, which feeds function unit pipelines 1, 2, and 3 through forwarding muxes; scalar operands arrive from scalar registers.
Conflict-Free Memory Access

Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory; elements of column 0 are highlighted in both diagrams. (a) Conventional row-major order puts element (i, j) in bank j, so an entire column resides in a single bank and column accesses conflict. (b) Skewed row-major order rotates row i right by i positions, putting element (i, j) in bank (i + j) mod 64, so the elements of any row or any column fall in 64 distinct banks.
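A quick check of the skewing rule in Figure 26.2(b) (a sketch; the function and constant names are illustrative):

/* Skewed row-major placement for a 64 x 64 matrix in a 64-way
   interleaved memory: element (i, j) goes to bank (i + j) mod 64,
   so the 64 elements of any row or any column land in 64 distinct
   banks and can be accessed without conflicts. */
#include <stdio.h>

#define B 64   /* number of banks = matrix dimension */

static int bank_of(int i, int j) { return (i + j) % B; }

int main(void) {
    int used[B] = {0};
    for (int i = 0; i < B; i++)        /* walk column 0 */
        used[bank_of(i, 0)]++;
    for (int b = 0; b < B; b++)
        if (used[b] != 1) { printf("conflict in bank %d\n", b); return 1; }
    printf("column 0 touches each of the %d banks exactly once\n", B);
    return 0;
}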
Overlapped Memory Access and Computation

Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme: six vector registers are paired so that, while the pipelined adder operates on one register pair and deposits results in a third register, the load X, load Y, and store Z units move the next segment to and from the memory unit. Solid (dashed) lines show data flow in the current (next) segment.
26.3 Vector Processor Performance

Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining: without chaining, the addition (after its own start-up time) must wait for the multiplication and its start-up to finish; with chaining, the add pipeline consumes products as the multiply pipeline produces them, overlapping the two operations.
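A back-of-the-envelope model consistent with Figure 26.4 (the start-up latencies and vector length below are invented for illustration; the slides give no numbers):

/* Latency of S := X * Y + Z on n-element vectors, with illustrative
   start-up times; chaining lets the add pipeline start as soon as the
   first product appears instead of after the whole multiply finishes. */
#include <stdio.h>

int main(void) {
    int n = 64, mul_startup = 7, add_startup = 6;     /* assumed values */
    int no_chain = (mul_startup + n) + (add_startup + n);
    int chained  = mul_startup + add_startup + n;
    printf("without chaining: %d cycles\n", no_chain);  /* 141 */
    printf("with chaining:    %d cycles\n", chained);   /*  77 */
    return 0;
}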
Performance as a Function of Vector Length

Figure 26.5 The per-element execution time in a vector processor as a function of the vector length: clock cycles per vector element (0-5) fall as the vector length grows toward 400, since start-up overhead is amortized over more elements.
26.4 Shared-Control Systems

Figure 26.6 From completely shared control to totally separate controls:
(a) Shared-control array processor, SIMD: one control unit drives all processing nodes
(b) Multiple shared controls, MSIMD: a few control units, each driving a subset of the processing nodes
(c) Separate controls, MIMD: every processing node has its own control
Example Array Processor

Figure 26.7 Array processor with 2D torus interprocessor communication network: a control unit broadcasts to the processor array, and switches on the array's wraparound links provide parallel I/O.
26.5 Array Processor Implementation

Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding: within each PE, a register file, data memory, and ALU sit beside a communication buffer wired to the north, east, west, and south (NEWS) neighbors; CommunEn and CommunDir select whether and in which direction data moves, and a PE state flip-flop reports to the array state register.
Configuration Switches

Figure 26.9 I/O switch states in the array processor of Figure 26.7: (a) torus operation, with the wraparound links closed; (b) clockwise I/O and (c) counterclockwise I/O, with the switches opened so that data streams into and out of the array.
26.6 Array Processor Performance
Array processors perform well for the same class of problems that
are suitable for vector processors
For embarrassingly (pleasantly) parallel problems, array processors
can be faster and more energy-efficient than vector processors
A criticism of array processing:
For conditional computations, a significant part of the array remains
idle while the “then” part is performed; subsequently, idle and busy
processors reverse roles during the “else” part
However:
Considering array processors inefficient due to idle processors
is like criticizing mass transportation because many seats are
unoccupied most of the time
It’s the total cost of computation that counts, not hardware utilization!
27 Shared-Memory Multiprocessing
Multiple processors sharing a memory unit seems naïve
• Didn’t we conclude that memory is the bottleneck?
• How then does it make sense to share the memory?
Topics in This Chapter
27.1 Centralized Shared Memory
27.2 Multiple Caches and Cache Coherence
27.3 Implementing Symmetric Multiprocessors
27.4 Distributed Shared Memory
27.5 Directories to Guide Data Access
27.6 Implementing Asymmetric Multiprocessors
Parallel Processing as a Topic of Study

An important area of study that allows us to overcome fundamental speed limits

Our treatment of the topic is quite brief (Chapters 26-27)

Graduate course ECE 254B: Adv. Computer Architecture – Parallel Processing
27.1 Centralized Shared Memory

Figure 27.1 Structure of a multiprocessor with centralized shared memory: p processors reach m memory modules through a processor-to-memory network, exchange data through a processor-to-processor network, and share parallel I/O.
Processor-to-Memory Interconnection Network

Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection networks in a multiprocessor: (a) a butterfly network connecting 16 processors to 16 memories through eight rows (Rows 0-7) of 2 × 2 switches; (b) a Beneš network connecting 8 processors to 8 memories, with extra switching stages that provide multiple paths between each processor-memory pair.
Processor-to-Memory Interconnection Network

Figure 27.3 Interconnection of eight processors to 256 memory banks in Cray Y-MP, a supercomputer with multiple vector processors: each processor fans out through 1 × 8 switches to 8 × 8 sections and then 4 × 4 subsections; banks are assigned by residue mod 4, so one group of subsections serves banks 0, 4, 8, ..., 252, the next banks 1, 5, 9, ..., 253, and so on up to bank 255.
Shared-Memory Programming: Broadcasting

Copy B[0] into all B[i] so that multiple processors
can read its value without memory access conflicts

for k = 0 to ⌈log2 p⌉ – 1 processor j, 0 ≤ j < p, do
  B[j + 2^k] := B[j]
endfor

Recursive doubling: each step doubles the number of copies of B[0].
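A sequential C simulation of the broadcast above (a sketch: real shared-memory code would run the inner loop across processors, with each step's reads completing before its writes; the guard j < stride reads only slots already holding the value):

/* Sequential simulation of recursive-doubling broadcast: in step k,
   every valid copy B[j], j < 2^k, is written into B[j + 2^k], so the
   number of copies of B[0] doubles until all p slots hold it. */
#include <stdio.h>

int main(void) {
    int p = 12, B[12];
    B[0] = 42;                                 /* the value to broadcast */
    for (int stride = 1; stride < p; stride *= 2)            /* 2^k */
        for (int j = 0; j < stride && j + stride < p; j++)
            B[j + stride] = B[j];
    for (int j = 0; j < p; j++) printf("%d ", B[j]);
    printf("\n");                              /* twelve copies of 42 */
    return 0;
}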
Shared-Memory Programming: Summation

Sum reduction of vector X

processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p
  processor j, 0 ≤ j < p – s, do
    Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
endwhile

Recursive doubling: writing i:j for X[i] + … + X[j], the entries evolve from j:j through partial sums such as 0:1, 1:2, … and 0:3, 1:4, … until Z[j] holds the prefix sum 0:j for every j (illustrated for p = 10).
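A sequential C simulation of the summation above (a sketch; the temporary copy T stands in for the parallel rule that each step's reads precede its writes):

/* Sequential simulation of recursive-doubling summation. After
   ceil(log2 p) steps, Z[j] = X[0] + ... + X[j], matching the i:j
   partial sums traced on the slide; Z[p-1] is the total. */
#include <stdio.h>
#include <string.h>

int main(void) {
    int p = 10, X[10] = {1,2,3,4,5,6,7,8,9,10}, Z[10], T[10];
    memcpy(Z, X, sizeof Z);                    /* Z[j] := X[j] */
    for (int s = 1; s < p; s *= 2) {
        memcpy(T, Z, sizeof Z);                /* parallel reads  */
        for (int j = 0; j < p - s; j++)
            Z[j + s] = T[j] + T[j + s];        /* parallel writes */
    }
    for (int j = 0; j < p; j++) printf("%d ", Z[j]);
    printf("\n");   /* 1 3 6 10 15 21 28 36 45 55: prefix sums */
    return 0;
}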
27.2 Multiple Caches and Cache Coherence

[Figure: the multiprocessor of Figure 27.1, with a private cache inserted between each of the p processors and the processor-to-memory network leading to the m memory modules.]

Private processor caches reduce memory access traffic through the
interconnection network but lead to challenging consistency problems.
Status of Data Copies

Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches: a block may exist as multiple consistent copies (w, cached by several processors), a single consistent copy (x), a single inconsistent copy (z′ in a cache, differing from z in memory), or an invalid copy (y in a cache, superseded by y′ elsewhere).
A Snoopy Cache Coherence Protocol

Multiple processors (P), each with a private cache (C), share a bus to memory.

Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches. Each cache line is Invalid, Shared (read-only), or Exclusive (writable):
• Invalid → Shared on CPU read miss: signal read miss on bus
• Invalid → Exclusive on CPU write miss: signal write miss on bus
• Shared → Exclusive on CPU write hit: signal write miss on bus
• Shared → Invalid on bus write miss
• Exclusive → Shared on bus read miss: write back cache line
• Exclusive → Invalid on bus write miss: write back cache line
• CPU read hit (Shared) and CPU read or write hit (Exclusive) cause no transition
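The state diagram of Figure 27.5 rendered as a C transition function (a sketch; bus signaling and write-backs appear only as comments):

/* Minimal model of the three-state snoopy protocol of Figure 27.5:
   one cache line's state reacts to its own CPU's accesses and to
   misses snooped on the bus. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } LineState;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } Event;

static LineState next_state(LineState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;     /* signal read miss  */
        if (e == CPU_WRITE) return EXCLUSIVE;  /* signal write miss */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE;  /* signal write miss */
        if (e == BUS_WRITE_MISS) return INVALID;
        return SHARED;                              /* CPU read hit      */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;     /* write back line   */
        if (e == BUS_WRITE_MISS) return INVALID;    /* write back line   */
        return EXCLUSIVE;                           /* read/write hit    */
    }
    return s;
}

int main(void) {
    LineState s = INVALID;
    s = next_state(s, CPU_READ);        /* -> SHARED    */
    s = next_state(s, CPU_WRITE);       /* -> EXCLUSIVE */
    s = next_state(s, BUS_READ_MISS);   /* -> SHARED    */
    printf("final state = %d (SHARED)\n", s);
    return 0;
}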
27.3 Implementing Symmetric Multiprocessors

Figure 27.6 Structure of a generic bus-based symmetric multiprocessor: computing nodes (typically, 1-4 CPUs and caches per node), interleaved memory, and I/O modules connect through standard interfaces and bus adapters to a very wide, high-bandwidth bus.
Bus Bandwidth Limits Performance

Example 27.1

Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Compute an upper bound on the multiprocessor performance in GIPS. Address lines are separate and do not affect the bus data bandwidth.

Solution

Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic, given that buses operate in the range 0.1 to 1 GHz. Thus, a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is beyond reach with this type of architecture.
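The arithmetic of Example 27.1, checked in a few lines of C (x and y are parameters, set to 1 here purely for illustration):

/* Reproduces Example 27.1: with a 98% cache hit rate, each
   instruction moves 1.4 * 0.02 * 4 = 0.112 bytes over the bus, so a
   bus of x GB/s caps performance at x / 0.112 GIPS. */
#include <stdio.h>

int main(void) {
    double words_per_instr = 1.4, miss_rate = 0.02, word_bytes = 4.0;
    double bytes_per_instr = words_per_instr * miss_rate * word_bytes;
    double x = 1.0;                           /* bus bandwidth, GB/s */
    double y = 1.0;                           /* bus clock rate, GHz */
    printf("traffic per instr = %.3f B\n", bytes_per_instr);     /* 0.112 */
    printf("bound = %.2f GIPS per GB/s\n", x / bytes_per_instr); /* 8.93  */
    printf("32-B bus at %g GHz: %.0f GIPS\n",
           y, 32.0 * y / bytes_per_instr);    /* about 286y GIPS */
    return 0;
}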
Implementing Snoopy Caches

Figure 27.7 Main structure for a snoop-based cache coherence algorithm: the cache data array is flanked by a main tags and state store for the processor side and a duplicate tags and state store for the snoop side, so the processor-side and snoop-side cache controllers can perform tag comparisons (=?) concurrently; address/command buffers connect both sides to the system bus, and the snoop state feeds back to the CPU side.
27.4 Distributed Shared Memory

Figure 27.8 Structure of a distributed shared-memory multiprocessor: p processors with local memories and routers attach to an interconnection network with parallel input/output. Variables live in particular nodes' memories (here x: 0 and y: 1 in one node, z: 0 in another), yet any processor can execute code such as "y := -1; z := 1" or "while z = 0 do x := x + y endwhile" that references them all.
27.5 Directories to Guide Data Access

Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor: each of the p nodes combines processors and caches, memories, directories, and communication and memory interfaces, attached to an interconnection network with parallel input/output.
Directory-Based Cache Coherence

Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache). A memory block is Uncached, Shared (read-only), or Exclusive (writable):
• Uncached → Shared on read miss: return value, set sharing set to {c}
• Uncached → Exclusive on write miss: return value, set sharing set to {c}
• Shared → Shared on read miss: return value, include c in sharing set
• Shared → Exclusive on write miss: invalidate all cached copies, set sharing set to {c}, return value
• Exclusive → Shared on read miss: fetch data from owner, return value, include c in sharing set
• Exclusive → Exclusive on write miss: fetch data from owner, request invalidation, return value, set sharing set to {c}
• Exclusive → Uncached on data write-back: set sharing set to { }
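A sketch of a directory entry implementing the transitions of Figure 27.10 (illustrative: the sharing set is kept as a 64-bit mask, and data movement and invalidation messages are only noted in comments):

/* Directory entry per Figure 27.10: a state plus a sharing set,
   modeled as a bitmask over at most 64 caches. Only the bookkeeping
   is modeled here. */
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED_RO, EXCLUSIVE_RW } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;                  /* bit c set => cache c has a copy */
} DirEntry;

static void on_read_miss(DirEntry *d, int c) {
    if (d->state == EXCLUSIVE_RW) {
        /* fetch data from owner, then demote the block */
        d->state = SHARED_RO;
    } else if (d->state == UNCACHED) {
        d->state = SHARED_RO;
    }
    d->sharers |= 1ULL << c;           /* include c in sharing set */
}

static void on_write_miss(DirEntry *d, int c) {
    /* Shared: invalidate all cached copies; Exclusive: fetch from
       owner and request invalidation; Uncached: nothing to undo */
    d->state = EXCLUSIVE_RW;
    d->sharers = 1ULL << c;            /* sharing set := {c} */
}

static void on_write_back(DirEntry *d) {
    d->state = UNCACHED;
    d->sharers = 0;                    /* sharing set := { } */
}

int main(void) {
    DirEntry d = {UNCACHED, 0};
    on_read_miss(&d, 2);   /* Shared, sharers = {2}    */
    on_read_miss(&d, 5);   /* Shared, sharers = {2, 5} */
    on_write_miss(&d, 5);  /* Exclusive, sharers = {5} */
    on_write_back(&d);     /* Uncached, sharers = { }  */
    printf("state = %d, sharers = %llu\n",
           d.state, (unsigned long long)d.sharers);
    return 0;
}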
27.6 Implementing Asymmetric Multiprocessors

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor: computing nodes (typically, 1-4 CPUs and associated memory), labeled Node 0 through Node 3, each attach through a link to a ring network, with connections to I/O controllers.
Figure 27.11 (detail) Each computing node pairs processors and caches with memories; the nodes (0-3) are joined by Scalable Coherent Interface (SCI) rings, with a connection to the interconnection network.
28 Distributed Multicomputing
Computer architects’ dream: connect computers like toy blocks
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing
Topics in This Chapter
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.4 Building and Using Multicomputers
28.5 Network-Based Distributed Computing
28.6 Grid Computing and Beyond
28.1 Communication by Message Passing

Figure 28.1 Structure of a distributed multicomputer: p computing nodes, each combining memories and processors with a router, attach to an interconnection network; parallel input/output is provided at the nodes.
Router Design

Figure 28.2 The structure of a generic router: input channels pass through link controllers (LC) into input queues (Q), a switch routes traffic to output queues and link controllers on the output channels, and a routing and arbitration unit controls the switch; injection and ejection channels connect the router to its local node, with a message queue for locally injected traffic.
Building Networks from Switches

Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities: its four settings are straight through, crossed connection, lower broadcast, and upper broadcast.

Figure 27.2 (repeated) Butterfly and Beneš networks: (a) a butterfly network connecting 16 processors to 16 memories through rows 0-7 of such 2 × 2 switches; (b) a Beneš network connecting 8 processors to 8 memories.
Interprocess Communication via Messages

Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes: process A executes "send x" and continues; process B, reaching "receive x" earlier, is suspended for the communication latency and is awakened when the message arrives.
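The send/receive synchronization of Figure 28.4, as a hedged MPI sketch (assumes an MPI installation; rank 0 plays process A, rank 1 plays process B; compile with mpicc and run with mpirun -np 2):

/* Blocking send/receive: MPI_Recv suspends process B until A's
   message arrives, exactly the suspension/awakening in Figure 28.4. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                   /* process A */
        x = 7;
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {            /* process B: suspended in recv */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received x = %d\n", x);   /* awakened */
    }
    MPI_Finalize();
    return 0;
}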
28.2 Interconnection Networks

Figure 28.5 Examples of direct and indirect interconnection networks: (a) a direct network links the nodes' routers to one another; (b) an indirect network places the nodes around a separate switching fabric.
Direct Interconnection Networks

Figure 28.6 A sampling of common direct interconnection networks: (a) 2D torus, (b) 4D hypercube, (c) chordal ring, (d) ring of rings. Only routers are shown; a computing node is implicit for each router.
Indirect Interconnection Networks

Figure 28.7 Two commonly used indirect interconnection networks: (a) hierarchical buses (level-1, level-2, and level-3 buses), (b) omega network.
28.3 Message Composition and Routing

Figure 28.8 Messages and their parts for message passing: a message is divided into packets (first through last), each carrying packet data plus padding; a transmitted packet consists of a header, data or payload, and a trailer, and is further divided into flow control digits (flits).
Wormhole Switching

Figure 28.9 Concepts of wormhole switching: (a) two worms en route to their respective destinations, worm 1 moving and worm 2 blocked; (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of attempted right turn.
28.4 Building and Using Multicomputers

Figure 28.10 A task system and schedules on 1, 2, and 3 computers: (a) a static task graph over tasks A-H, each annotated with its execution time (t = 1 to 3); (b) the resulting schedules on one, two, and three computers, plotted over a time axis from 0 to 15.
Building Multicomputers from Commodity Nodes

Figure 28.11 Growing clusters using modular nodes: (a) current racks of modules, each module holding CPU(s), memory, and disks, with expansion slots; (b) futuristic toy-block construction with wireless connection surfaces.
28.5 Network-Based Distributed Computing

Figure 28.12 Network of workstations: each PC attaches via its system or I/O bus to a fast network interface (NIC) with large memory, joining a network built of high-speed wormhole switches.
28.6 Grid Computing and Beyond
Computational grid is analogous to the power grid
Decouples the “production” and “consumption” of computational power
Homes don’t have an electricity generator; why should they have a computer?
Advantages of computational grid:
Near continuous availability of computational and related resources
Resource requirements based on sum of averages, rather than sum of peaks
Paying for services based on actual usage rather than peak demand
Distributed data storage for higher reliability, availability, and security
Universal access to specialized and one-of-a-kind computing resources
Still to be worked out as of late 2000s: How to charge for compute usage
Computing in the Cloud

Computational resources, both hardware and software, are provided by, and managed within, the cloud; users pay a fee for access.

Managing and upgrading are much more efficient in large, centralized facilities (warehouse-sized data centers or server farms). (Illustration: image from Wikipedia.)

This is a natural continuation of the outsourcing trend for special services, so that companies can focus their energies on their main business.
The Shrinking Supercomputer
Warehouse-Sized Data Centers

(Illustration: image from IEEE Spectrum, June 2009.)