Lectures for 2nd Edition

CS2810
Spring 2007
Dan Watson
dan.watson@usu.edu
Course syllabus, calendar, and assignments found at
http://www.cs.usu.edu/~watson/cs2810
These overheads are based on presentations courtesy of
Professor Mary Jane Irwin, Penn State University
and
Professor Tod Amon, Southern Utah University
2004 Morgan Kaufmann Publishers
Chapter 1
Introduction
•
This course is all about how computers work
•
But what do we mean by a computer?
– Different types: desktop, servers, embedded devices
– Different uses: automobiles, graphics, finance, genomics…
– Different manufacturers: Intel, Apple, IBM, Microsoft, Sun…
– Different underlying technologies and different costs!
•
Analogy: Consider a course on “automotive vehicles”
– Many similarities from vehicle to vehicle (e.g., wheels)
– Huge differences from vehicle to vehicle (e.g., gas vs. electric)
•
Best way to learn:
– Focus on a specific instance and learn how it works
– While learning general principles and historical perspectives
Why learn this stuff?
•
You want to call yourself a “computer scientist”
•
You want to build software people use (need performance)
•
You need to make a purchasing decision or offer “expert” advice
•
Both Hardware and Software affect performance:
– Algorithm determines number of source-level statements
– Language/Compiler/Architecture determine machine instructions
(Chapter 2 and 3)
– Processor/Memory determine how fast instructions are executed
(Chapter 5, 6, and 7)
•
Assessing and Understanding Performance in Chapter 4
What is a computer?
• Components:
– input (mouse, keyboard)
– output (display, printer)
– memory (disk drives, DRAM, SRAM, CD)
– network
• Our primary focus: the processor (datapath and control)
– implemented using millions of transistors
– Impossible to understand by looking at each transistor
– We need...
Where is the Market?
Millions of computers sold per year (bar chart values):
Year   Embedded   Desktop   Servers
1998   290        93        3
1999   488        114       3
2000   892        135       4
2001   862        129       4
2002   1122       131       5
By the architecture of a system, I mean the complete and detailed
specification of the user interface. … As Blaauw has said,
“Where architecture tells what happens, implementation tells how
it is made to happen.”
The Mythical Man-Month, Brooks, pg 45
Instruction Set Architecture (ISA)
• ISA: An abstract interface between the hardware and the lowest
level software of a machine that encompasses all the information
necessary to write a machine language program that will run
correctly, including instructions, registers, memory access, I/O,
and so on.
“... the attributes of a [computing] system as seen by the
programmer, i.e., the conceptual structure and functional
behavior, as distinct from the organization of the data flows
and controls, the logic design, and the physical
implementation.”
– Amdahl, Blaauw, and Brooks, 1964
– Enables implementations of varying cost and performance to
run identical software
• ABI (application binary interface): The user portion of the
instruction set plus the operating system interfaces used by
application programmers. Defines a standard for binary
portability across computers.
ISA Type Sales
[Figure: stacked bar chart of millions of processors sold per year, 1998 to 2002, by ISA: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other. PowerPoint “comic” bar chart with approximate values (see text for correct values).]
Moore’s Law
•
In 1965, Gordon Moore predicted that the number of transistors
that can be integrated on a die would double every 18 to 24
months (i.e., grow exponentially with time).
•
Amazingly visionary – million transistor/chip barrier was
crossed in the 1980’s.
– 2300 transistors, 1 MHz clock (Intel 4004) - 1971
– 16 Million transistors (Ultra Sparc III)
– 42 Million transistors, 2 GHz clock (Intel Xeon) – 2001
– 55 Million transistors, 3 GHz, 130nm technology, 250mm2 die
(Intel Pentium 4) - 2004
– 140 Million transistor (HP PA-8500)
Historical Perspective
•
ENIAC built in World War II was the first general purpose computer
– Used for computing artillery firing tables
– 80 feet long by 8.5 feet high and several feet wide
– Each of the twenty 10 digit registers was 2 feet long
– Used 18,000 vacuum tubes
– Performed 1900 additions per second
• Since then, Moore’s Law: transistor capacity doubles every 18-24 months
Processor Performance Increase
[Figure: Performance (SPEC Int, log scale from 1 to 10000) versus Year, 1987 to 2003. Labeled machines: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000.]
DRAM Capacity Growth
[Figure: DRAM Kbit capacity (log scale from 10 to 1000000) versus Year of introduction, 1976 to 2002. Labeled generations: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M.]
Impacts of Advancing Technology
• Processor
– logic capacity: increases about 30% per year
– performance: 2x every 1.5 years
• Memory
– DRAM capacity: 4x every 3 years, now 2x every 2 years
– memory speed: 1.5x every 10 years
– cost per bit: decreases about 25% per year
• Disk
– capacity: increases about 60% per year
ClockCycle = 1/ClockRate
500 MHz ClockRate = 2 nsec ClockCycle
1 GHz ClockRate = 1 nsec ClockCycle
4 GHz ClockRate = 250 psec ClockCycle
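The ClockCycle = 1/ClockRate relation on this slide can be checked with a few lines of Python. This is only an illustrative sketch and not part of the original overheads; the function name clock_cycle_seconds is made up for the example.

```python
def clock_cycle_seconds(clock_rate_hz: float) -> float:
    """Clock cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


# The three clock rates quoted on the slide, in Hz.
for rate_hz in (500e6, 1e9, 4e9):
    cycle_ps = clock_cycle_seconds(rate_hz) * 1e12  # seconds -> picoseconds
    print(f"{rate_hz / 1e6:.0f} MHz -> {cycle_ps:.0f} ps per cycle")
```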
Example Machine Organization
•
Workstation design target
– 25% of cost on processor
– 25% of cost on memory (minimum memory size)
– Rest on I/O devices, power supplies, box
[Figure: Computer = CPU (Control, Datapath) + Memory + Devices (Input, Output)]
PC Motherboard Closeup
Inside the Pentium 4 Processor Chip
Instruction Set Architecture
•
A very important abstraction
– interface between hardware and low-level software
– standardizes instructions, machine language bit patterns, etc.
– advantage: different implementations of the same architecture
– disadvantage: sometimes prevents using new innovations
True or False: Binary compatibility is extraordinarily important?
•
Modern instruction set architectures:
– IA-32, PowerPC, MIPS, SPARC, ARM, and others
Abstraction
•
Delving into the depths
reveals more information
•
An abstraction omits unneeded detail,
helps us cope with complexity
What are some of the details that
appear in these familiar abstractions?
MIPS R3000 Instruction Set Architecture
• Instruction Categories
– Load/Store
– Computational
– Jump and Branch
– Floating Point
  • coprocessor
– Memory Management
– Special
Registers: R0 - R31, PC, HI, LO
3 Instruction Formats: all 32 bits wide
OP | rs | rt | rd | sa | funct
OP | rs | rt | immediate
OP | jump target
Q: How many already familiar with MIPS ISA?
How do computers work?
•
Need to understand abstractions such as:
– Applications software
– Systems software
– Assembly Language
– Machine Language
– Architectural Issues: i.e., Caches, Virtual Memory, Pipelining
– Sequential logic, finite state machines
– Combinational logic, arithmetic circuits
– Boolean logic, 1s and 0s
– Transistors used to build logic gates (CMOS)
– Semiconductors/Silicon used to build transistors
– Properties of atoms, electrons, and quantum dynamics
•
So much to learn!
Chapter 2
Instructions:
• Language of the Machine
• We’ll be working with the MIPS instruction set architecture
– similar to other architectures developed since the 1980's
– Almost 100 million MIPS processors manufactured in 2002
– used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, …
[Figure: stacked bar chart of millions of processors sold per year, 1998 to 2002, by ISA: ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, Other.]
MIPS arithmetic
• All instructions have 3 operands
• Operand order is fixed (destination first)
Example:
C code: a = b + c
MIPS ‘code’: add a, b, c
(we’ll talk about registers in a bit)
“The natural number of operands for an operation like addition is
three…requiring every instruction to have exactly three operands, no
more and no less, conforms to the philosophy of keeping the
hardware simple”
MIPS arithmetic
• Design Principle: simplicity favors regularity.
• Of course this complicates some things...
C code:
a = b + c + d;
MIPS code:
add a, b, c
add a, a, d
• Operands must be registers, only 32 registers provided
• Each register contains 32 bits
• Design Principle: smaller is faster. Why?
Registers vs. Memory
• Arithmetic instruction operands must be registers,
– only 32 registers provided
• Compiler associates variables with registers
• What about programs with lots of variables?
[Figure: Processor (Control, Datapath), Memory, and I/O (Input, Output)]
Memory Organization
• Viewed as a large, single-dimension array, with an address.
• A memory address is an index into the array
• "Byte addressing" means that the index points to a byte of memory.
Address   Contents
0         8 bits of data
1         8 bits of data
2         8 bits of data
3         8 bits of data
4         8 bits of data
5         8 bits of data
6         8 bits of data
...
Memory Organization
• Bytes are nice, but most data items use larger "words"
• For MIPS, a word is 32 bits or 4 bytes.
Address   Contents
0         32 bits of data
4         32 bits of data
8         32 bits of data
12        32 bits of data
...
Registers hold 32 bits of data
• 2^32 bytes with byte addresses from 0 to 2^32 - 1
• 2^30 words with byte addresses 0, 4, 8, ... 2^32 - 4
• Words are aligned
i.e., what are the least 2 significant bits of a word address?
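One way to explore the alignment question above is the following Python sketch. It is an illustration only, not part of the original slides; the helper names is_word_aligned and word_index are hypothetical.

```python
WORD_BYTES = 4  # a MIPS word is 32 bits, i.e. 4 bytes


def is_word_aligned(byte_address: int) -> bool:
    """A word address is a multiple of 4, so its two low-order bits are 00."""
    return byte_address & 0b11 == 0


def word_index(byte_address: int) -> int:
    """Map an aligned byte address to its word number (0 .. 2^30 - 1)."""
    if not is_word_aligned(byte_address):
        raise ValueError(f"address {byte_address:#x} is not word aligned")
    return byte_address >> 2  # dividing by 4 drops the two zero bits


for address in (0, 4, 8, 12, 2**32 - 4):
    print(f"{address:#012x} -> word {word_index(address)}")
```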
Instructions
• Load and store instructions
• Example:
C code:
A[12] = h + A[8];
MIPS code:
lw $t0, 32($s3)
add $t0, $s2, $t0
sw $t0, 48($s3)
• Can refer to registers by name (e.g., $s2, $t2) instead of number
• Store word has destination last
• Remember arithmetic operands are registers, not memory!
Can’t write:
add 48($s3), $s2, 32($s3)
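The offsets 32 and 48 in the load/store example are the array indices 8 and 12 scaled by the 4-byte word size. The Python sketch below mimics the three MIPS instructions on a toy memory; it is an illustration only, and the base address, the value of h, and the dictionary used as memory are all assumptions made for the example.

```python
WORD_BYTES = 4  # a MIPS word is 4 bytes


def element_offset(index: int) -> int:
    """Byte offset of A[index] from the base address of a word array."""
    return index * WORD_BYTES


base = 0x1000  # hypothetical contents of $s3 (address of A[0])
h = 5          # hypothetical contents of $s2
memory = {base + element_offset(i): 0 for i in range(16)}  # toy memory, word-addressed by byte address
memory[base + element_offset(8)] = 7                       # A[8] = 7

t0 = memory[base + element_offset(8)]    # lw  $t0, 32($s3)
t0 = h + t0                              # add $t0, $s2, $t0
memory[base + element_offset(12)] = t0   # sw  $t0, 48($s3)

print(element_offset(8), element_offset(12), memory[base + 48])
```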
Our First Example
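• Can we figure out the code?
swap(int v[], int k)
{ int temp;
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;
}
swap:
sll $2, $5, 2     # $2 = k * 4 (byte offset of v[k])
add $2, $4, $2    # $2 = address of v[k]
lw $15, 0($2)     # $15 = v[k]
lw $16, 4($2)     # $16 = v[k+1]
sw $16, 0($2)     # v[k] = $16
sw $15, 4($2)     # v[k+1] = $15
jr $31            # return to caller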
So far we’ve learned:
• MIPS
– loading words but addressing bytes
– arithmetic on registers only
• Instruction          Meaning
add $s1, $s2, $s3    $s1 = $s2 + $s3
sub $s1, $s2, $s3    $s1 = $s2 – $s3
lw $s1, 100($s2)     $s1 = Memory[$s2+100]
sw $s1, 100($s2)     Memory[$s2+100] = $s1
Machine Language
•
Instructions, like registers and words of data, are also 32 bits long
– Example: add $t1, $s1, $s2
– registers have numbers, $t1=9, $s1=17, $s2=18
•
Instruction Format:
op       rs      rt      rd      shamt   funct
000000   10001   10010   01001   00000   100000
• Can you guess what the field names stand for?
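The R-format field layout can be sketched in Python by shifting each field into place. This is an illustration rather than part of the original slides; the function name encode_r is hypothetical, and the field widths (6, 5, 5, 5, 5, 6 bits) are the ones used for the MIPS R format.

```python
def encode_r(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack the six R-format fields (6+5+5+5+5+6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t1, $s1, $s2 with $t1 = 9, $s1 = 17, $s2 = 18; op = 0, funct = 32
word = encode_r(op=0, rs=17, rt=18, rd=9, shamt=0, funct=32)
print(f"{word:032b}")
print(f"{word:#010x}")
```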
Machine Language
• Consider the load-word and store-word instructions,
– What would the regularity principle have us do?
– New principle: Good design demands a compromise
• Introduce a new type of instruction format
– I-type for data transfer instructions
– other format was R-type for register
• Example: lw $t0, 32($s2)
op   rs   rt   16 bit number
35   18   8    32
• Where's the compromise?
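The I-format layout can be sketched the same way in Python. This is an illustration only; encode_i is a hypothetical helper name, and the register numbers used are $s2 = 18 and $t0 = 8, following the register convention table in this deck.

```python
def encode_i(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack the I-format fields (6+5+5+16 bits) into one 32-bit word.

    The immediate is a signed 16-bit value stored in two's complement.
    """
    if not -2**15 <= imm < 2**15:
        raise ValueError("immediate does not fit in 16 bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# lw $t0, 32($s2): op = 35, rs = $s2 = 18, rt = $t0 = 8, offset = 32
word = encode_i(op=35, rs=18, rt=8, imm=32)
print(f"{word:#010x}")
```

The 16-bit immediate is the compromise in question: keeping every instruction 32 bits wide leaves room for only a small constant.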
Stored Program Concept
• Instructions are bits
• Programs are stored in memory
– to be read or written just like data
[Figure: Processor connected to Memory; memory for data, programs, compilers, editors, etc.]
• Fetch & Execute Cycle
– Instructions are fetched and put into a special register
– Bits in the register "control" the subsequent actions
– Fetch the “next” instruction and continue
Control
•
Decision making instructions
– alter the control flow,
– i.e., change the "next" instruction to be executed
•
MIPS conditional branch instructions:
bne $t0, $t1, Label
beq $t0, $t1, Label
•
Example:
if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label: ....
Control
• MIPS unconditional branch instructions:
j label
• Example:
if (i!=j)
h=i+j;
else
h=i-j;
beq $s4, $s5, Lab1
add $s3, $s4, $s5
j Lab2
Lab1: sub $s3, $s4, $s5
Lab2: ...
• Can you build a simple for loop?
So far:
• Instruction          Meaning
add $s1,$s2,$s3      $s1 = $s2 + $s3
sub $s1,$s2,$s3      $s1 = $s2 – $s3
lw $s1,100($s2)      $s1 = Memory[$s2+100]
sw $s1,100($s2)      Memory[$s2+100] = $s1
bne $s4,$s5,L        Next instr. is at Label if $s4 ≠ $s5
beq $s4,$s5,L        Next instr. is at Label if $s4 = $s5
j Label              Next instr. is at Label
• Formats:
R   op | rs | rt | rd | shamt | funct
I   op | rs | rt | 16 bit address
J   op | 26 bit address
Control Flow
• We have: beq, bne, what about Branch-if-less-than?
• New instruction:
slt $t0, $s1, $s2
if $s1 < $s2 then $t0 = 1 else $t0 = 0
• Can use this instruction to build "blt $s1, $s2, Label"
– can now build general control structures
• Note that the assembler needs a register to do this,
– there are policy of use conventions for registers
Policy of Use Conventions
Name      Register number   Usage
$zero     0                 the constant value 0
$v0-$v1   2-3               values for results and expression evaluation
$a0-$a3   4-7               arguments
$t0-$t7   8-15              temporaries
$s0-$s7   16-23             saved
$t8-$t9   24-25             more temporaries
$gp       28                global pointer
$sp       29                stack pointer
$fp       30                frame pointer
$ra       31                return address
Register 1 ($at) reserved for assembler, 26-27 for operating system
Constants
• Small constants are used quite frequently (50% of operands)
e.g.,
A = A + 5;
B = B + 1;
C = C - 18;
• Solutions? Why not?
– put 'typical constants' in memory and load them.
– create hard-wired registers (like $zero) for constants like one.
• MIPS Instructions:
addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori $29, $29, 4
• Design Principle: Make the common case fast. Which format?
How about larger constants?
• We'd like to be able to load a 32 bit constant into a register
• Must use two instructions, new "load upper immediate" instruction
lui $t0, 1010101010101010
$t0 after lui:  1010101010101010 0000000000000000   (lower 16 bits filled with zeros)
• Then must get the lower order bits right, i.e.,
ori $t0, $t0, 1010101010101010
      1010101010101010 0000000000000000   ($t0 after lui)
ori   0000000000000000 1010101010101010   (immediate)
  =   1010101010101010 1010101010101010
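The lui/ori pair can be modelled in Python to show how a 32-bit constant is assembled from two 16-bit halves. This is a sketch for illustration, not part of the original slides; the function names lui, ori, and load_constant are stand-ins that only imitate what the two instructions do to a register value.

```python
MASK32 = 0xFFFFFFFF


def lui(imm16: int) -> int:
    """Place a 16-bit immediate in the upper half; the lower half is zero."""
    return ((imm16 & 0xFFFF) << 16) & MASK32


def ori(reg: int, imm16: int) -> int:
    """OR a zero-extended 16-bit immediate into a register value."""
    return (reg | (imm16 & 0xFFFF)) & MASK32


def load_constant(value: int) -> int:
    """Build a 32-bit constant the way the two-instruction sequence does."""
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    t0 = lui(upper)        # lui $t0, upper
    return ori(t0, lower)  # ori $t0, $t0, lower


print(f"{load_constant(0xAAAAAAAA):032b}")
```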
Assembly Language vs. Machine Language
• Assembly provides convenient symbolic representation
– much easier than writing down numbers
– e.g., destination first
• Machine language is the underlying reality
– e.g., destination is no longer first
• Assembly can provide 'pseudoinstructions'
– e.g., “move $t0, $t1” exists only in Assembly
– would be implemented using “add $t0,$t1,$zero”
• When considering performance you should count real instructions
Other Issues
•
Discussed in your assembly language programming lab:
support for procedures
linkers, loaders, memory layout
stacks, frames, recursion
manipulating strings and pointers
interrupts and exceptions
system calls and conventions
•
Some of these we'll talk more about later
•
We’ll talk about compiler optimizations when we hit chapter 4.
Overview of MIPS
• simple instructions all 32 bits wide
• very structured, no unnecessary baggage
• only three instruction formats
R   op | rs | rt | rd | shamt | funct
I   op | rs | rt | 16 bit address
J   op | 26 bit address
• rely on compiler to achieve performance
– what are the compiler's goals?
• help compiler where we can
Addresses in Branches and Jumps
• Instructions:
bne $t4,$t5,Label    Next instruction is at Label if $t4 ≠ $t5
beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
j Label              Next instruction is at Label
• Formats:
I   op | rs | rt | 16 bit address
J   op | 26 bit address
• Addresses are not 32 bits
– How do we handle this with load and store instructions?
Addresses in Branches
• Instructions:
bne $t4,$t5,Label    Next instruction is at Label if $t4≠$t5
beq $t4,$t5,Label    Next instruction is at Label if $t4=$t5
• Formats:
I   op | rs | rt | 16 bit address
• Could specify a register (like lw and sw) and add it to address
– use Instruction Address Register (PC = program counter)
– most branches are local (principle of locality)
• Jump instructions just use high order bits of PC
– address boundaries of 256 MB
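The two address calculations described here can be sketched in Python. This is an illustration under the usual MIPS conventions (word offsets shifted left by two bits, branches relative to PC + 4, jumps keeping the upper 4 bits of PC + 4); the function names are hypothetical and not from the slides.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """PC-relative: (PC + 4) plus the sign-extended word offset times 4."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & MASK32


def jump_target(pc: int, address26: int) -> int:
    """Pseudo-direct: upper 4 bits of PC + 4, then the 26-bit field, then 00."""
    return ((pc + 4) & 0xF0000000) | ((address26 & 0x3FFFFFF) << 2)


print(hex(branch_target(0x1000, 25)))      # offset 25 words = 100 bytes past PC + 4
print(hex(jump_target(0x10000000, 2500)))  # word address 2500 = byte address 10000
```

Because the jump keeps only 4 bits from the PC and supplies 28 bits itself, a jump can only reach targets inside the current 2^28-byte (256 MB) region.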
To summarize:
MIPS operands
Name | Example | Comments
32 registers | $s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp, $fp, $sp, $ra, $at | Fast locations for data. In MIPS, data must be in registers to perform arithmetic. MIPS register $zero always equals 0. Register $at is reserved for the assembler to handle large constants.
2^30 memory words | Memory[0], Memory[4], ..., Memory[4294967292] | Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential words differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls.

MIPS assembly language
Category | Instruction | Example | Meaning | Comments
Arithmetic | add | add $s1, $s2, $s3 | $s1 = $s2 + $s3 | Three operands; data in registers
Arithmetic | subtract | sub $s1, $s2, $s3 | $s1 = $s2 - $s3 | Three operands; data in registers
Arithmetic | add immediate | addi $s1, $s2, 100 | $s1 = $s2 + 100 | Used to add constants
Data transfer | load word | lw $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Word from memory to register
Data transfer | store word | sw $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Word from register to memory
Data transfer | load byte | lb $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Byte from memory to register
Data transfer | store byte | sb $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Byte from register to memory
Data transfer | load upper immediate | lui $s1, 100 | $s1 = 100 * 2^16 | Loads constant in upper 16 bits
Conditional branch | branch on equal | beq $s1, $s2, 25 | if ($s1 == $s2) go to PC + 4 + 100 | Equal test; PC-relative branch
Conditional branch | branch on not equal | bne $s1, $s2, 25 | if ($s1 != $s2) go to PC + 4 + 100 | Not equal test; PC-relative
Conditional branch | set on less than | slt $s1, $s2, $s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than; for beq, bne
Conditional branch | set less than immediate | slti $s1, $s2, 100 | if ($s2 < 100) $s1 = 1; else $s1 = 0 | Compare less than constant
Unconditional jump | jump | j 2500 | go to 10000 | Jump to target address
Unconditional jump | jump register | jr $ra | go to $ra | For switch, procedure return
Unconditional jump | jump and link | jal 2500 | $ra = PC + 4; go to 10000 | For procedure call
1. Immediate addressing: op | rs | rt | Immediate
2. Register addressing: op | rs | rt | rd | ... | funct; the operand is in a Register
3. Base addressing: op | rs | rt | Address; Register + Address selects a Byte, Halfword, or Word in Memory
4. PC-relative addressing: op | rs | rt | Address; PC + Address selects a Word in Memory
5. Pseudodirect addressing: op | Address; Address combined with PC selects a Word in Memory
CSE 431
Computer Architecture
Fall 2005
Lecture 02: MIPS ISA Review
Mary Jane Irwin ( www.cse.psu.edu/~mji )
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design,
Patterson & Hennessy, © 2005, UCB]
(vonNeumann) Processor Organization
• Control needs to
1. input instructions from Memory
2. issue signals to control the information flow between the
Datapath components and to control what operations they perform
3. control instruction sequencing
• Datapath needs to have the
– components – the functional units and
storage (e.g., register file) needed to execute instructions
– interconnects - components connected so that the instructions can
be accomplished and so that data can be loaded from and stored to
Memory
[Figure: CPU (Control, Datapath), Memory, and Devices (Input, Output), with the Fetch, Decode, Exec cycle]
RISC - Reduced Instruction Set Computer
• RISC philosophy
– fixed instruction lengths
– load-store instruction sets
– limited addressing modes
– limited operations
• MIPS, Sun SPARC, HP PA-RISC, IBM PowerPC, Intel (Compaq)
Alpha, …
• Instruction sets are measured by how well compilers use them
as opposed to how well assembly language programmers use
them
Design goals: speed, cost (design, fabrication, test,
packaging), size, power consumption, reliability,
memory space (embedded systems)
MIPS R3000 Instruction Set Architecture (ISA)
• Instruction Categories
– Computational
– Load/Store
– Jump and Branch
– Floating Point
  • coprocessor
– Memory Management
– Special
Registers: R0 - R31, PC, HI, LO
3 Instruction Formats: all 32 bits wide
R format   OP | rs | rt | rd | sa | funct
I format   OP | rs | rt | immediate
J format   OP | jump target
Review: Unsigned Binary Representation
Hex          Binary   Decimal
0x00000000   0…0000   0
0x00000001   0…0001   1
0x00000002   0…0010   2
0x00000003   0…0011   3
0x00000004   0…0100   4
0x00000005   0…0101   5
0x00000006   0…0110   6
0x00000007   0…0111   7
0x00000008   0…1000   8
0x00000009   0…1001   9
…
0xFFFFFFFC   1…1100   2^32 - 4
0xFFFFFFFD   1…1101   2^32 - 3
0xFFFFFFFE   1…1110   2^32 - 2
0xFFFFFFFF   1…1111   2^32 - 1

bit weight:     2^31 2^30 2^29 ... 2^3 2^2 2^1 2^0
bit position:   31 30 29 ... 3 2 1 0
bit:            1 1 1 ... 1 1 1 1
1 0 0 0 ... 0 0 0 0 - 1 = 2^32 - 1
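The bit-weight view of an unsigned number can be written out directly in Python. This is a small illustrative sketch, not part of the original slides; unsigned_value is a hypothetical helper that sums 2^position for every 1 bit.

```python
def unsigned_value(bits: str) -> int:
    """Sum the weights 2^i of the 1 bits; position 0 is the rightmost bit."""
    return sum(int(bit) << position for position, bit in enumerate(reversed(bits)))


print(unsigned_value("1001"))           # weights 8 and 1
print(unsigned_value("1" * 32))         # all ones in a 32-bit word
print(unsigned_value("1" * 30 + "00"))  # pattern 1...1100
```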
Aside: Beyond Numbers
• American Std Code for Info Interchange (ASCII): 8-bit bytes
representing characters
ASCII Char    ASCII Char    ASCII Char   ASCII Char   ASCII Char   ASCII Char
0     Null    32    space   48    0      64    @      96    `      112   p
1             33    !       49    1      65    A      97    a      113   q
2             34    "       50    2      66    B      98    b      114   r
3             35    #       51    3      67    C      99    c      115   s
4     EOT     36    $       52    4      68    D      100   d      116   t
5             37    %       53    5      69    E      101   e      117   u
6     ACK     38    &       54    6      70    F      102   f      118   v
7             39    '       55    7      71    G      103   g      119   w
8     bksp    40    (       56    8      72    H      104   h      120   x
9     tab     41    )       57    9      73    I      105   i      121   y
10    LF      42    *       58    :      74    J      106   j      122   z
11            43    +       59    ;      75    K      107   k      123   {
12    FF      44    ,       60    <      76    L      108   l      124   |
15            47    /       63    ?      79    O      111   o      127   DEL
MIPS Arithmetic Instructions
• MIPS assembly language arithmetic statement
add $t0, $s1, $s2
sub $t0, $s1, $s2
• Each arithmetic instruction performs only one operation
• Each arithmetic instruction fits in 32 bits and specifies exactly three
operands
destination ← source1 op source2
• Operand order is fixed (destination first)
• Those operands are all contained in the datapath’s register file
($t0,$s1,$s2) – indicated by $
Aside: MIPS Register Convention
Name        Register Number   Usage                    Preserve on call?
$zero       0                 constant 0 (hardware)    n.a.
$at         1                 reserved for assembler   n.a.
$v0 - $v1   2-3               returned values          no
$a0 - $a3   4-7               arguments                no
$t0 - $t7   8-15              temporaries              no
$s0 - $s7   16-23             saved values             yes
$t8 - $t9   24-25             temporaries              no
$gp         28                global pointer           yes
$sp         29                stack pointer            yes
$fp         30                frame pointer            yes
$ra         31                return addr (hardware)   yes
MIPS Register File
• Holds thirty-two 32-bit registers
– Two read ports and
– One write port
• Registers are
– Faster than main memory
  • But register files with more locations are slower (e.g., a 64 word file could
  be as much as 50% slower than a 32 word file)
  • Read/write port increase impacts speed quadratically
– Easier for a compiler to use
  • e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order
  vs. stack
– Can hold variables so that
  • code density improves (since registers are named with
  fewer bits than a memory location)
[Figure: Register File with 32 locations of 32 bits. Inputs: src1 addr (5 bits), src2 addr (5 bits), dst addr (5 bits), write data (32 bits), write control. Outputs: src1 data (32 bits), src2 data (32 bits).]
Machine Language - Add Instruction
• Instructions, like registers and words of data, are 32 bits long
• Arithmetic Instruction Format (R format):
add $t0, $s1, $s2
op | rs | rt | rd | shamt | funct
op      6-bits   opcode that specifies the operation
rs      5-bits   register file address of the first source operand
rt      5-bits   register file address of the second source operand
rd      5-bits   register file address of the result’s destination
shamt   5-bits   shift amount (for shift instructions)
funct   6-bits   function code augmenting the opcode
MIPS Memory Access Instructions
• MIPS has two basic data transfer instructions for accessing
memory
lw $t0, 4($s3) #load word from memory
sw $t0, 8($s3) #store word to memory
• The data is loaded into (lw) or stored from (sw) a register in the
register file – a 5 bit address
• The memory address – a 32 bit address – is formed by adding the
contents of the base address register to the offset value
– A 16-bit field meaning access is limited to memory locations
within a region of ±2^13 or 8,192 words (±2^15 or 32,768 bytes) of
the address in the base register
– Note that the offset can be positive or negative
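The address formation for lw and sw can be sketched in Python as base register plus sign-extended 16-bit offset. This is an illustration only; the function names are hypothetical, and the sample base value 0x12004094 is borrowed from the load example elsewhere in this deck.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Memory address = contents of the base register + signed 16-bit offset."""
    return (base + sign_extend16(offset16)) & MASK32


print(hex(effective_address(0x12004094, 24)))  # positive offset
print(hex(effective_address(0x1000, 0xFFFC)))  # offset field encoding -4
```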
Machine Language - Load Instruction
• Load/Store Instruction Format (I format):
lw $t0, 24($s2)
op | rs | rt | 16 bit offset
Address calculation (24 decimal + contents of $s2):
  . . . 0001 1000
+ . . . 1001 0100
= . . . 1010 1100 = 0x120040ac
[Figure: Memory drawn as data words at word addresses (hex) 0x00000000, 0x00000004, 0x00000008, 0x0000000c, ... up to 0xffffffff; $s2 holds 0x12004094 and the word at 0x120040ac is loaded into $t0]
Byte Addresses
• Since 8-bit bytes are so useful, most architectures address
individual bytes in memory
– The memory address of a word must be a multiple of 4
(alignment restriction)
• Big Endian: leftmost byte is word address
IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
• Little Endian: rightmost byte is word address
Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
[Figure: the four bytes of a word from msb to lsb are numbered 3 2 1 0 under little endian (byte 0 at the lsb end) and 0 1 2 3 under big endian (byte 0 at the msb end)]
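The two byte orders can be seen with Python's standard struct module, which packs the same 32-bit word either way. This is only an illustration; the sample word 0x0A0B0C0D is an arbitrary value chosen so that each byte is easy to spot.

```python
import struct

word = 0x0A0B0C0D  # arbitrary sample value; 0x0A is the most significant byte

big = struct.pack(">I", word)     # big endian: most significant byte stored first
little = struct.pack("<I", word)  # little endian: least significant byte stored first

# Index 0 is the byte stored at the word's own address.
print("big endian   :", big.hex(" "))
print("little endian:", little.hex(" "))
```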
Aside: Loading and Storing Bytes
•
MIPS provides special instructions to move bytes
lb $t0, 1($s3) #load byte from memory
sb $t0, 6($s3) #store byte to memory
op | rs | rt | 16 bit offset
• What 8 bits get loaded and stored?
– load byte places the byte from memory in the rightmost 8 bits of
the destination register
• what happens to the other bits in the register?
– store byte takes the byte from the rightmost 8 bits of a register and
writes it to a byte in memory
• what happens to the other bits in the memory word?
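A Python sketch of the two questions above, assuming the standard MIPS behavior: lb sign-extends the loaded byte into the upper 24 bits of the register, and sb changes only the addressed byte, leaving the rest of the memory word untouched. The function names and the bytearray used as memory are assumptions made for the illustration.

```python
def load_byte(memory: bytearray, address: int) -> int:
    """Model lb: the byte lands in the low 8 bits and is sign-extended to 32 bits."""
    byte = memory[address]
    value = byte - 0x100 if byte & 0x80 else byte
    return value & 0xFFFFFFFF  # register contents as an unsigned 32-bit pattern


def store_byte(memory: bytearray, address: int, register: int) -> None:
    """Model sb: only the low 8 bits of the register are written, to one byte."""
    memory[address] = register & 0xFF


memory = bytearray([0x11, 0x80, 0x33, 0x44])  # one toy memory word
print(hex(load_byte(memory, 0)))  # high bit clear: upper bits become 0
print(hex(load_byte(memory, 1)))  # high bit set: upper bits become 1
store_byte(memory, 2, 0x12345678)
print(memory.hex(" "))            # only byte 2 changed
```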
MIPS Control Flow Instructions
• MIPS conditional branch instructions:
bne $s0, $s1, Lbl #go to Lbl if $s0≠$s1
beq $s0, $s1, Lbl #go to Lbl if $s0=$s1
– Ex:
if (i==j) h = i + j;
bne $s0, $s1, Lbl1
add $s3, $s0, $s1
Lbl1: ...
• Instruction Format (I format):
op | rs | rt | 16 bit offset
• How is the branch destination address specified?
Specifying Branch Destinations
• Use a register (like in lw and sw) added to the 16-bit offset
– which register? Instruction Address Register (the PC)
  • its use is automatically implied by instruction
  • PC gets updated (PC+4) during the fetch cycle so that it
  holds the address of the next instruction
– limits the branch distance to -2^15 to +2^15-1 instructions from the
(instruction after the) branch instruction, but most branches are
local anyway
[Figure: the 16-bit offset from the low order 16 bits of the branch instruction is sign-extended to 32 bits with 00 appended, then added to PC+4 to form the 32-bit branch dst address; a selector (marked ?) chooses between PC+4 and the branch dst address]
More Branch Instructions
• We have beq, bne, but what about other kinds of branches (e.g.,
branch-if-less-than)? For this, we need yet another instruction, slt
• Set on less than instruction:
slt $t0, $s0, $s1    # if $s0 < $s1 then
                     # $t0 = 1 else
                     # $t0 = 0
• Instruction format (R format):
op | rs | rt | rd | funct
More Branch Instructions, Cont’d
• Can use slt, beq, bne, and the fixed value of 0 in register $zero
to create other conditions
– less than: blt $s1, $s2, Label
slt $at, $s1, $s2        #$at set to 1 if $s1 < $s2
bne $at, $zero, Label
– less than or equal to: ble $s1, $s2, Label
– greater than: bgt $s1, $s2, Label
– greater than or equal to: bge $s1, $s2, Label
• Such branches are included in the instruction set as pseudo
instructions - recognized (and expanded) by the assembler
– It’s why the assembler needs a reserved register ($at)
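The slide spells out only the blt expansion. The Python sketch below shows one possible expansion for all four pseudo instructions using slt with beq or bne against $zero; it is an illustration of the idea, not the output of any particular assembler, and expand_branch is a hypothetical helper.

```python
def expand_branch(pseudo: str, rs: str, rt: str, label: str) -> list[str]:
    """Expand blt/bgt/ble/bge into slt on $at followed by bne or beq."""
    # (first slt operand, second slt operand, real branch to emit)
    table = {
        "blt": (rs, rt, "bne"),  # rs <  rt  <=>  (rs < rt) is true
        "bgt": (rt, rs, "bne"),  # rs >  rt  <=>  (rt < rs) is true
        "ble": (rt, rs, "beq"),  # rs <= rt  <=>  (rt < rs) is false
        "bge": (rs, rt, "beq"),  # rs >= rt  <=>  (rs < rt) is false
    }
    first, second, branch = table[pseudo]
    return [f"slt $at, {first}, {second}", f"{branch} $at, $zero, {label}"]


for name in ("blt", "ble", "bgt", "bge"):
    print(name, "->", "; ".join(expand_branch(name, "$s1", "$s2", "Label")))
```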
Other Control Flow Instructions
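• MIPS also has an unconditional branch instruction or jump
instruction:
j label #go to label
• Instruction Format (J Format):
op | 26-bit address
[Figure: the 26-bit address from the low order 26 bits of the jump instruction has 00 appended and is combined with the upper 4 bits of the PC to form the 32-bit jump address]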
Aside: Branching Far Away
• What if the branch destination is further away than can be captured
in 16 bits?
• The assembler comes to the rescue – it inserts an
unconditional jump to the branch target and inverts the
condition
beq $s0, $s1, L1
becomes
bne $s0, $s1, L2
j L1
L2:
Instructions for Accessing Procedures
• MIPS procedure call instruction:
jal ProcedureAddress #jump and link
• Saves PC+4 in register $ra to have a link to the next instruction for the
procedure return
• Machine format (J format):
op | 26 bit address
• Then can do procedure return with a
jr $ra #return
• Instruction format (R format):
op | rs | funct
Aside: Spilling Registers
• What if the callee needs more registers? What if the procedure is
recursive?
– uses a stack – a last-in-first-out queue – in memory for passing
additional values or saving (recursive) return address(es)
• One of the general registers, $sp, is
used to address the stack (which
“grows” from high address to low
address)
– add data onto the stack – push
$sp = $sp – 4
data on stack at new $sp
– remove data from the stack – pop
data from stack at $sp
$sp = $sp + 4
[Figure: stack in memory between high addr and low addr, with $sp pointing at the top of stack]
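The push and pop steps above can be modelled in a few lines of Python. This is a sketch for illustration only; the Stack class, its dictionary-backed memory, and the starting address are assumptions, not part of the slides.

```python
class Stack:
    """Toy model of a downward-growing stack addressed through $sp."""

    def __init__(self, top: int = 0x7FFFFFFC) -> None:
        self.sp = top     # hypothetical initial $sp (a high address)
        self.memory = {}  # byte address -> word stored there

    def push(self, value: int) -> None:
        self.sp -= 4                  # $sp = $sp - 4
        self.memory[self.sp] = value  # data on stack at new $sp

    def pop(self) -> int:
        value = self.memory[self.sp]  # data from stack at $sp
        self.sp += 4                  # $sp = $sp + 4
        return value


stack = Stack(top=100)
stack.push(0x400018)  # e.g. a saved return address
stack.push(42)        # e.g. a spilled register value
print(stack.sp, stack.pop(), hex(stack.pop()), stack.sp)
```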
MIPS Immediate Instructions
• Small constants are used often in typical code
• Possible approaches?
– put “typical constants” in memory and load them
– create hard-wired registers (like $zero) for constants like 1
– have special instructions that contain constants !
addi $sp, $sp, 4 #$sp = $sp + 4
slti $t0, $s2, 15 #$t0 = 1 if $s2<15
• Machine format (I format):
op | rs | rt | 16 bit immediate
• The constant is kept inside the instruction itself!
– Immediate format limits values to the range +2^15–1 to -2^15
Aside: How About Larger Constants?
• We'd also like to be able to load a 32 bit constant into a register, for this
we must use two instructions
• a new "load upper immediate" instruction
lui $t0, 1010101010101010
$t0 after lui:  1010101010101010 0000000000000000
• Then must get the lower order bits right, use
ori $t0, $t0, 1010101010101010
      1010101010101010 0000000000000000   ($t0 after lui)
ori   0000000000000000 1010101010101010   (immediate)
  =   1010101010101010 1010101010101010
MIPS Organization So Far
[Figure: Processor and Memory. Processor: Register File with 32 registers ($zero - $ra) of 32 bits, 5-bit src1 addr, src2 addr, and dst addr inputs, 32-bit write data input, and 32-bit src1 data and src2 data outputs; a 32-bit ALU; the PC with one adder for PC = PC+4 and one for the branch offset; the Fetch, Decode, Exec cycle. Memory: 2^30 words of 32 bits with read/write addr, read data, and write data ports; word addresses (binary) 0…0000, 0…0100, 0…1000, 0…1100, up to 1…1100; byte addresses within the words numbered 0 1 2 3, 4 5 6 7 (big Endian).]
MIPS ISA So Far
Category                       Instr               Op Code    Example             Meaning
Arithmetic (R & I format)      add                 0 and 32   add $s1, $s2, $s3   $s1 = $s2 + $s3
                               subtract            0 and 34   sub $s1, $s2, $s3   $s1 = $s2 - $s3
                               add immediate       8          addi $s1, $s2, 6    $s1 = $s2 + 6
                               or immediate        13         ori $s1, $s2, 6     $s1 = $s2 OR 6
Data Transfer (I format)       load word           35         lw $s1, 24($s2)     $s1 = Memory($s2+24)
                               store word          43         sw $s1, 24($s2)     Memory($s2+24) = $s1
                               load byte           32         lb $s1, 25($s2)     $s1 = Memory($s2+25)
                               store byte          40         sb $s1, 25($s2)     Memory($s2+25) = $s1
                               load upper imm      15         lui $s1, 6          $s1 = 6 * 2^16
Cond. Branch (I & R format)    br on equal         4          beq $s1, $s2, L     if ($s1==$s2) go to L
                               br on not equal     5          bne $s1, $s2, L     if ($s1 !=$s2) go to L
                               set on less than    0 and 42   slt $s1, $s2, $s3   if ($s2<$s3) $s1=1 else $s1=0
                               set on less than    10         slti $s1, $s2, 6    if ($s2<6) $s1=1 else $s1=0
                               immediate
Uncond. Jump (J & R format)    jump                2          j 2500              go to 10000
                               jump register       0 and 8    jr $t1              go to $t1
                               jump and link       3          jal 2500            go to 10000; $ra=PC+4
Review of MIPS Operand Addressing Modes
• Register addressing – operand is in a register
op | rs | rt | rd | funct  ->  Register: word operand
• Base (displacement) addressing – operand is at the memory location whose address is the sum of a register and a 16-bit constant contained within the instruction
op | rs | rt | offset  ->  base register + offset  ->  Memory: word or byte operand
– Register relative (indirect) with 0($a0)
– Pseudo-direct with addr($zero)
• Immediate addressing – operand is a 16-bit constant contained within the instruction
op | rs | rt | operand
Review of MIPS Instruction Addressing Modes
• PC-relative addressing – instruction address is the sum of the PC and a 16-bit constant contained within the instruction
op | rs | rt | offset  ->  Program Counter (PC) + offset  ->  Memory: branch destination instruction
• Pseudo-direct addressing – instruction address is the 26-bit constant contained within the instruction concatenated with the upper 4 bits of the PC
op | jump address  ->  upper 4 bits of Program Counter (PC) || jump address  ->  Memory: jump destination instruction
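The two instruction addressing modes can be sketched as address arithmetic. The function names are hypothetical. Two details are filled in from standard MIPS behavior rather than from the slide text: both constants count words, so they are shifted left two bits (which is why `j 2500` goes to 10000 in the ISA table), and the base is the address of the next instruction, PC+4.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, offset16):
    """PC-relative: signed word offset added to the address of the next instruction."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF

def jump_target(pc, addr26):
    """Pseudo-direct: upper 4 bits of PC+4, then the 26-bit constant, then two zero bits."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)

print(jump_target(0x00400000, 2500))
print(hex(branch_target(0x00400000, 25)))
```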
MIPS (RISC) Design Principles
• Simplicity favors regularity
– fixed size instructions – 32-bits
– small number of instruction formats
– opcode always the first 6 bits
• Good design demands good compromises
– three instruction formats
• Smaller is faster
– limited instruction set
– limited number of registers in register file
– limited number of addressing modes
• Make the common case fast
– arithmetic operands from the register file (load-store machine)
– allow instructions to contain immediate operands
Chapter Three
Numbers
• Bits are just bits (no inherent meaning)
– conventions define relationship between bits and numbers
• Binary numbers (base 2)
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...
decimal: 0...2^n-1
• Of course it gets more complicated:
numbers are finite (overflow)
fractions and real numbers
negative numbers
e.g., no MIPS subi instruction; addi can add a negative number
• How do we represent negative numbers?
i.e., which bit patterns will represent which numbers?
Possible Representations
Sign Magnitude | One's Complement | Two's Complement
000 = +0 | 000 = +0 | 000 = +0
001 = +1 | 001 = +1 | 001 = +1
010 = +2 | 010 = +2 | 010 = +2
011 = +3 | 011 = +3 | 011 = +3
100 = -0 | 100 = -3 | 100 = -4
101 = -1 | 101 = -2 | 101 = -3
110 = -2 | 110 = -1 | 110 = -2
111 = -3 | 111 = -0 | 111 = -1
• Issues: balance, number of zeros, ease of operations
• Which one is best? Why?
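To compare the three candidates, the sketch below decodes every 3-bit pattern under each convention. The function names are made up for illustration, and Python integers have no negative zero, so the two "-0" patterns print as 0.

```python
def sign_magnitude(bits, n=3):
    """Top bit is the sign, remaining bits are the magnitude."""
    magnitude = bits & ((1 << (n - 1)) - 1)
    return -magnitude if bits >> (n - 1) else magnitude

def ones_complement(bits, n=3):
    """Negative numbers are the bitwise inverse of the positive value."""
    mask = (1 << n) - 1
    return -((~bits) & mask) if bits >> (n - 1) else bits

def twos_complement(bits, n=3):
    """The top bit carries weight -2^(n-1)."""
    return bits - (1 << n) if bits >> (n - 1) else bits

for pattern in range(8):
    print(f"{pattern:03b}", sign_magnitude(pattern),
          ones_complement(pattern), twos_complement(pattern))
```

Only two's complement gives a single zero, which is one of the "issues" listed above.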
MIPS
• 32 bit signed numbers:
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten
0000 0000 0000 0000 0000 0000 0000 0010two = + 2ten
...
0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647ten   (maxint)
1000 0000 0000 0000 0000 0000 0000 0000two = – 2,147,483,648ten   (minint)
1000 0000 0000 0000 0000 0000 0000 0001two = – 2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = – 2,147,483,646ten
...
1111 1111 1111 1111 1111 1111 1111 1101two = – 3ten
1111 1111 1111 1111 1111 1111 1111 1110two = – 2ten
1111 1111 1111 1111 1111 1111 1111 1111two = – 1ten
MIPS Number Representations
• 32-bit signed numbers (2’s complement), MSB on the left and LSB on the right:
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = + 1ten
...
0111 1111 1111 1111 1111 1111 1111 1110two = + 2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = + 2,147,483,647ten   (maxint)
1000 0000 0000 0000 0000 0000 0000 0000two = – 2,147,483,648ten   (minint)
1000 0000 0000 0000 0000 0000 0000 0001two = – 2,147,483,647ten
...
1111 1111 1111 1111 1111 1111 1111 1110two = – 2ten
1111 1111 1111 1111 1111 1111 1111 1111two = – 1ten
• Converting <32-bit values into 32-bit values
– copy the most significant bit (the sign bit) into the “empty” bits
0010 -> 0000 0010
1010 -> 1111 1010
– sign extend versus zero extend (lb vs. lbu)
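A small sketch of the two widening rules, using the 4-bit to 8-bit examples from the slide. The function names and the `from_bits`/`to_bits` parameters are illustrative only.

```python
def sign_extend(value, from_bits, to_bits=32):
    """Copy the sign bit of a from_bits-wide value into the new upper bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):                       # sign bit is set
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

def zero_extend(value, from_bits):
    """Fill the new upper bits with zeros."""
    return value & ((1 << from_bits) - 1)

print(f"{sign_extend(0b0010, 4, 8):08b}")
print(f"{sign_extend(0b1010, 4, 8):08b}")
print(f"{zero_extend(0b1010, 4):08b}")
```

Loading the byte 0x80 with lb would correspond to `sign_extend(0x80, 8)`, while lbu would correspond to `zero_extend(0x80, 8)`.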
MIPS Arithmetic Logic Unit (ALU)
• Must support the Arithmetic/Logic operations of the ISA
add, addi, addiu, addu
sub, subu, neg
mult, multu, div, divu
sqrt
and, andi, nor, or, ori, xor, xori
beq, bne, slt, slti, sltiu, sltu
[Figure: ALU with 32-bit inputs A and B, a 4-bit operation select m, a 32-bit result, and 1-bit zero and ovf outputs]
• With special handling for
– sign extend – addi, addiu, slti, sltiu
– zero extend – lbu, andi, ori, xori
– no overflow detected – addu, addiu, subu, multu, divu, sltiu, sltu
Two's Complement Operations
•
Negating a two's complement number: invert all bits and add 1
– remember: “negate” and “invert” are quite different!
•
Converting n bit numbers into numbers with more than n bits:
– MIPS 16 bit immediate gets converted to 32 bits for arithmetic
– copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
– "sign extension" (lbu vs. lb)
Review: 2’s Complement Binary Representation
2’sc binary | decimal
1000 | -8   (-2^3)
1001 | -7   (-(2^3 - 1))
1010 | -6
1011 | -5
1100 | -4
1101 | -3
1110 | -2
1111 | -1
0000 | 0
0001 | 1
0010 | 2
0011 | 3
0100 | 4
0101 | 5
0110 | 6
0111 | 7   (2^3 - 1)
• Negate: complement all the bits (0101 -> 1010) and add a 1 (1010 -> 1011)
• Note: negate and invert are different!
Review: A Full Adder
[Figure: 1-bit full adder with inputs A, B, and carry_in and outputs carry_out and S]
A | B | carry_in | carry_out | S
0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1
0 | 1 | 0 | 0 | 1
0 | 1 | 1 | 1 | 0
1 | 0 | 0 | 0 | 1
1 | 0 | 1 | 1 | 0
1 | 1 | 0 | 1 | 0
1 | 1 | 1 | 1 | 1
S = A xor B xor carry_in   (odd parity function)
carry_out = A&B | A&carry_in | B&carry_in   (majority function)
• How can we use it to build a 32-bit adder?
• How can we modify it easily to build an adder/subtractor?
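The two equations can be checked directly against the truth table. This is a minimal sketch; the function name `full_adder` and its return order (carry_out, S) are choices made here.

```python
def full_adder(a, b, carry_in):
    """Return (carry_out, s) for three 1-bit inputs."""
    s = a ^ b ^ carry_in                                    # odd parity function
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)   # majority function
    return carry_out, s

# Print the rows in the same order as the truth table: A, B, carry_in, carry_out, S
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, *full_adder(a, b, c))
```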
Addition & Subtraction
• Just like in grade school (carry/borrow 1s)
  0111      0111      0110
+ 0110    - 0110    - 0101
• Two's complement operations easy
– subtraction using addition of negative numbers
  0111
+ 1010
• Overflow (result too large for finite computer word):
– e.g., adding two n-bit numbers does not yield an n-bit number
  0111
+ 0001
  1000
note that overflow term is somewhat misleading, it does not mean a carry “overflowed”
A 32-bit Ripple Carry Adder/Subtractor
• Remember 2’s complement is just
– complement all the bits: the add/sub control (0=add, 1=sub) selects B0 if control = 0, !B0 if control = 1
– add a 1 in the least significant bit
Example, A - B:
  0111          0111
- 0110   ->   + 1001
              +    1
  0001        1 0001
[Figure: chain of 32 1-bit full adders (FA); stage i takes Ai, Bi, and carry ci and produces Si and the next carry, from c0 = carry_in through c31 to c32 = carry_out, with the add/sub control applied to every B input]
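The adder/subtractor can be sketched bit by bit. This is an illustrative model, not the slide's circuit: it assumes the add/sub control is also fed in as c0, which is how the "add a 1 in the least significant bit" step is usually wired, and it uses an XOR with the control to pick B or !B.

```python
def full_adder(a, b, carry_in):
    """Return (carry_out, s) for three 1-bit inputs."""
    return (a & b) | (a & carry_in) | (b & carry_in), a ^ b ^ carry_in

def ripple_add_sub(a, b, control, n=32):
    """control = 0 computes a + b, control = 1 computes a - b; returns (result, carry_out)."""
    carry = control                        # c0 supplies the "+1" when subtracting
    result = 0
    for i in range(n):                     # carry ripples from bit 0 up to bit n-1
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ control     # B if control = 0, !B if control = 1
        carry, s_i = full_adder(a_i, b_i, carry)
        result |= s_i << i
    return result, carry

print(ripple_add_sub(0b0111, 0b0110, 1, n=4))
```

With `n=4` the call reproduces the slide's example: the 4-bit result is 0001 and the carry out is 1.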
Detecting Overflow
• No overflow when adding a positive and a negative number
• No overflow when signs are the same for subtraction
• Overflow occurs when the value affects the sign:
– overflow when adding two positives yields a negative
– or, adding two negatives gives a positive
– or, subtract a negative from a positive and get a negative
– or, subtract a positive from a negative and get a positive
• Consider the operations A + B, and A – B
– Can overflow occur if B is 0 ?
– Can overflow occur if A is 0 ?
Overflow Detection
• Overflow: the result is too large to represent in 32 bits
• Overflow occurs when
– adding two positives yields a negative
– or, adding two negatives gives a positive
– or, subtract a negative from a positive gives a negative
– or, subtract a positive from a negative gives a positive
• On your own: Prove you can detect overflow by:
– Carry into MSB xor Carry out of MSB, ex for 4 bit signed numbers
carry out of MSB = 0, carry into MSB = 1:
  0111    7
+ 0011    3
  1010   –6
carry out of MSB = 1, carry into MSB = 0:
  1100   –4
+ 1011   –5
  0111    7
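The carry-based test can be tried on the two 4-bit examples. A minimal sketch, with a hypothetical helper `add_with_overflow` that computes both carries arithmetically rather than gate by gate:

```python
def add_with_overflow(a, b, n=4):
    """Add two n-bit patterns; return (sum bits, carry into MSB, carry out of MSB, overflow)."""
    mask = (1 << n) - 1
    low_mask = (1 << (n - 1)) - 1
    # Adding only the low n-1 bits exposes the carry that enters the MSB position
    carry_in_msb = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = (a & mask) + (b & mask)
    carry_out_msb = total >> n
    return total & mask, carry_in_msb, carry_out_msb, carry_in_msb ^ carry_out_msb

print(add_with_overflow(0b0111, 0b0011))   # 7 + 3
print(add_with_overflow(0b1100, 0b1011))   # -4 + -5
print(add_with_overflow(0b0111, 0b1010))   # 7 + -6, no overflow
```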
Tailoring the ALU to the MIPS ISA
•
Need to support the logic operations (and, nor, or, xor)
– Bit wise operations (no carry operation involved)
– Need a logic gate for each function, mux to choose the output
•
Need to support the set-on-less-than instruction (slt)
– Use subtraction to determine if (a – b) < 0 (implies a < b)
– Copy the sign bit into the low order bit of the result, set
remaining result bits to 0
•
Need to support test for equality (bne, beq)
– Again use subtraction: (a - b) = 0 implies a = b
– Additional logic to “nor” all result bits together
•
Immediates are sign extended outside the ALU with wiring (i.e., no
logic needed)
Shift Operations
•
Also need operations to pack and unpack 8-bit characters into
32-bit words
• Shifts move all the bits in a word left or right
sll $t2, $s0, 8    #$t2 = $s0 << 8 bits
srl $t2, $s0, 8    #$t2 = $s0 >> 8 bits
op | rs | rt | rd | shamt | funct
• Notice that a 5-bit shamt field is enough to shift a 32-bit value 2^5 – 1 or 31 bit positions
• Such shifts are logical because they fill with zeros
Shift Operations, con’t
• An arithmetic shift (sra) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value)
– so sra uses the most significant bit (sign bit) as the bit shifted in
– note that there is no need for a sla when using two’s complement number representation
sra $t2, $s0, 8    #$t2 = $s0 >> 8 bits
• The shift operation is implemented by hardware separate from the ALU
– using a barrel shifter (which would take lots of gates in discrete logic, but is pretty easy to implement in VLSI)
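The difference between logical and arithmetic shifts shows up only for values with the sign bit set. The sketch below models the three instructions on 32-bit patterns; the Python functions are named after the instructions for convenience and assume a shift amount from 0 to 31.

```python
MASK32 = 0xFFFFFFFF

def sll(value, shamt):
    """Shift left logical: zeros enter on the right."""
    return (value << shamt) & MASK32

def srl(value, shamt):
    """Shift right logical: zeros enter on the left."""
    return (value & MASK32) >> shamt

def sra(value, shamt):
    """Shift right arithmetic: copies of the sign bit enter on the left."""
    value &= MASK32
    shifted = value >> shamt
    if value & 0x80000000:
        shifted |= (MASK32 << (32 - shamt)) & MASK32
    return shifted

print(hex(srl(0xFF000000, 8)), hex(sra(0xFF000000, 8)))
```

Unpacking the second-lowest character of a word, for example, could be written as `srl(word, 8) & 0xFF`.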
Multiply
•
Binary multiplication is just a bunch of right shifts and adds
[Figure: an n-bit multiplicand and an n-bit multiplier form a partial product array, which can be formed in parallel and added in parallel for faster multiplication, giving a 2n-bit double precision product]
MIPS Multiply Instruction
• Multiply produces a double precision product
mult $s0, $s1    # hi||lo = $s0 * $s1
op | rs | rt | rd | shamt | funct
– Low-order word of the product is left in processor register lo and the high-order word is left in register hi
– Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file
• Multiplies are done by fast, dedicated hardware and are much more complex (and slower) than adders
• Hardware dividers are even more complex and even slower; ditto for hardware square root
Effects of Overflow
• An exception (interrupt) occurs
– Control jumps to predefined address for exception
– Interrupted address is saved for possible resumption
• Details based on software system / language
– example: flight control vs. homework assignment
• Don't always want to detect overflow
– new MIPS instructions: addu, addiu, subu
note: addiu still sign-extends!
note: sltu, sltiu for unsigned comparisons
Multiplication
• More complicated than addition
– accomplished via shifting and addition
• More time and more area
• Let's look at 3 versions based on a gradeschool algorithm
    0010   (multiplicand)
__x_1011   (multiplier)
• Negative numbers: convert and multiply
– there are better techniques, we won’t look at them
Multiplication: Implementation
[Figure: datapath with a 64-bit Multiplicand register (shift left), a 32-bit Multiplier register (shift right), a 64-bit ALU, a 64-bit Product register (write), and a control test]
Start
1. Test Multiplier0
– Multiplier0 = 1: 1a. Add multiplicand to product and place the result in Product register
– Multiplier0 = 0: continue with step 2
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition?
– No: < 32 repetitions (return to step 1)
– Yes: 32 repetitions (Done)
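The flowchart steps translate almost line for line into a loop. This sketch covers unsigned operands only and uses Python integers masked to 64 bits to stand in for the registers; the function name `multiply_v1` is an invention for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def multiply_v1(multiplicand, multiplier):
    """Unsigned 32-bit x 32-bit -> 64-bit product using shift and add."""
    multiplicand &= 0xFFFFFFFF      # starts in the right half of a 64-bit register
    multiplier &= 0xFFFFFFFF
    product = 0
    for _ in range(32):                                   # 32 repetitions
        if multiplier & 1:                                # 1. test Multiplier0
            product = (product + multiplicand) & MASK64   # 1a. add multiplicand to product
        multiplicand = (multiplicand << 1) & MASK64       # 2. shift Multiplicand left 1 bit
        multiplier >>= 1                                  # 3. shift Multiplier right 1 bit
    return product

print(multiply_v1(0b0010, 0b1011))
```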
Final Version
• Multiplier starts in right half of product
[Figure: datapath with a 32-bit Multiplicand register, a 32-bit ALU, a 64-bit Product register (shift right, write), and a control test]
Start
1. Test Product0
– Product0 = 1: What goes here?
– Product0 = 0: continue with step 3
3. Shift the Product register right 1 bit
32nd repetition?
– No: < 32 repetitions (return to step 1)
– Yes: 32 repetitions (Done)
Floating Point (a brief look)
•
We need a way to represent
– numbers with fractions, e.g., 3.1416
– very small numbers, e.g., .000000001
– very large numbers, e.g., 3.15576 × 10^9
•
Representation:
– sign, exponent, significand:
\( (-1)^{\text{sign}} \times \text{significand} \times 2^{\text{exponent}} \)
– more bits for significand gives more accuracy
– more bits for exponent increases range
•
IEEE 754 floating point standard:
– single precision: 8 bit exponent, 23 bit significand
– double precision: 11 bit exponent, 52 bit significand
Representing Big (and Small) Numbers
•
What if we want to encode the approx. age of the earth?
4,600,000,000 or 4.6 x 10^9
or the weight in kg of one a.m.u. (atomic mass unit)
0.0000000000000000000000000166 or 1.6 x 10^-27
There is no way we can encode either of the above in a 32-bit integer.
•
Floating point representation
\( (-1)^{\text{sign}} \times F \times 2^{E} \)
– Still have to fit everything in 32 bits (single precision)
s (1 bit) | E (exponent) (8 bits) | F (fraction) (23 bits)
– The base (2, not 10) is hardwired in the design of the FPALU
– More bits in the fraction (F) or the exponent (E) is a trade-off
between precision (accuracy of the number) and range (size of the
number)
IEEE 754 floating-point standard
•
Leading “1” bit of significand is implicit
•
Exponent is “biased” to make sorting easier
– all 0s is smallest exponent, all 1s is largest
– bias of 127 for single precision and 1023 for double precision
– summary: \( (-1)^{\text{sign}} \times (1 + \text{significand}) \times 2^{\text{exponent} - \text{bias}} \)
•
Example:
– decimal: -.75 = - ( ½ + ¼ )
– binary: -.11 = -1.1 x 2^-1
– floating point: exponent = 126 = 01111110
– IEEE single precision: 10111111010000000000000000000000
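The -.75 example can be checked with Python's standard `struct` module, which packs a value as an IEEE 754 single precision word. The helper name `float_to_fields` is made up for this sketch.

```python
import struct

def float_to_fields(x):
    """Return (32-bit word, sign, biased exponent, fraction) for a single precision value."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31
    exponent = (word >> 23) & 0xFF      # 8-bit biased exponent
    fraction = word & 0x7FFFFF          # 23-bit fraction, hidden 1 not stored
    return word, sign, exponent, fraction

word, sign, exponent, fraction = float_to_fields(-0.75)
print(f"{word:032b}")

# Rebuild the value from the summary formula with bias 127
value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
print(value)
```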
IEEE 754 FP Standard Encoding
•
Most (all?) computers these days conform to the IEEE 754 floating
point standard
\( (-1)^{\text{sign}} \times (1 + F) \times 2^{E - \text{bias}} \)
– Formats for both single and double precision
– F is stored in normalized form where the msb in the fraction is
1 (so there is no need to store it!) – called the hidden bit
– To simplify sorting FP numbers, E comes before F in the word
and E is represented in excess (biased) notation
Single Precision E (8) | Single Precision F (23) | Double Precision E (11) | Double Precision F (52) | Object Represented
0 | 0 | 0 | 0 | true zero (0)
0 | nonzero | 0 | nonzero | ± denormalized number
1-254 | anything | 1-2046 | anything | ± floating point number
255 | 0 | 2047 | 0 | ± infinity
255 | nonzero | 2047 | nonzero | not a number (NaN)
Floating Point Addition
•
Addition (and subtraction)
\( (F_1 \times 2^{E_1}) + (F_2 \times 2^{E_2}) = F_3 \times 2^{E_3} \)
– Step 0: Restore the hidden bit in F1 and in F2
– Step 1: Align fractions by right shifting F2 by E1 - E2 positions (assuming E1 ≥ E2) keeping track of (three of) the bits shifted out in a round bit, a guard bit, and a sticky bit
– Step 2: Add the resulting F2 to F1 to form F3
– Step 3: Normalize F3 (so it is in the form 1.XXXXX …)
• If F1 and F2 have the same sign → F3 ∈ [1,4) → at most a 1 bit right shift of F3 and an increment of E3
• If F1 and F2 have different signs → F3 may require many left shifts each time decrementing E3
– Step 4: Round F3 and possibly normalize F3 again
– Step 5: Rehide the most significant bit of F3 before storing the result
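The align, add, and normalize steps can be sketched for the easy case of two positive operands. This is a deliberately simplified model: significands are 24-bit integers with the hidden bit already restored, shifted-out bits are simply dropped (truncation, no guard, round, or sticky bits), and signs, rounding, and exceptions are left out. The function name and the value convention f × 2^(e - 23) are assumptions made for this example.

```python
def fp_add_positive(f1, e1, f2, e2, bits=24):
    """Add two positive values f * 2**(e - (bits - 1)); return (f3, e3) in the same form."""
    if e1 < e2:                          # make operand 1 the one with the larger exponent
        f1, e1, f2, e2 = f2, e2, f1, e1
    f2 >>= e1 - e2                       # Step 1: align by right shifting the smaller operand
    f3, e3 = f1 + f2, e1                 # Step 2: add the significands
    if f3 >= 1 << bits:                  # Step 3: sum reached [2,4), shift right once
        f3 >>= 1
        e3 += 1
    return f3, e3

# 1.5 + 0.75: 1.5 is 1.1two x 2^0 and 0.75 is 1.1two x 2^-1
f3, e3 = fp_add_positive(0xC00000, 0, 0xC00000, -1)
print(f3 / 2**23 * 2.0**e3)
```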
Floating point addition
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
Overflow or underflow?
– Yes: Exception
– No: continue with step 4
4. Round the significand to the appropriate number of bits
Still normalized?
– No: return to step 3
– Yes: Done
[Figure: floating point adder datapath taking the sign, exponent, and fraction of each operand, with a small ALU computing the exponent difference, a shift right unit, the big ALU, increment or decrement and shift left or right units, rounding hardware, and control, producing the sign, exponent, and fraction of the result]
MIPS Floating Point Instructions
• MIPS has a separate Floating Point Register File ($f0, $f1, …, $f31) (whose registers are used in pairs for double precision values) with special instructions to load to and store from them
lwc1 $f1,54($s2)    #$f1 = Memory[$s2+54]
swc1 $f1,58($s4)    #Memory[$s4+58] = $f1
• And supports IEEE 754 single and double precision operations
add.s $f2,$f4,$f6   #$f2 = $f4 + $f6
add.d $f2,$f4,$f6   #$f2||$f3 = $f4||$f5 + $f6||$f7
similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d
MIPS Floating Point Instructions, Con’t
• And floating point single precision comparison operations
c.x.s $f2,$f4    #if($f2 < $f4) cond=1; else cond=0
where x may be eq, neq, lt, le, gt, ge
and branch operations
bc1t 25    #if(cond==1) go to PC+4+25
bc1f 25    #if(cond==0) go to PC+4+25
• And double precision comparison operations
c.x.d $f2,$f4    #if($f2||$f3 < $f4||$f5) cond=1; else cond=0
Floating Point Complexities
•
Operations are somewhat more complicated (see text)
•
In addition to overflow we can have “underflow”
•
Accuracy can be a big problem
– IEEE 754 keeps two extra bits, guard and round
– four rounding modes
– positive divided by zero yields “infinity”
– zero divided by zero yields “not a number”
– other complexities
• Implementing the standard can be tricky
• Not using the standard can be even worse
– see text for description of 80x86 and Pentium bug!
Chapter Three Summary
•
Computer arithmetic is constrained by limited precision
•
Bit patterns have no inherent meaning but standards do exist
– two’s complement
– IEEE 754 floating point
•
Computer instructions determine “meaning” of the bit patterns
•
Performance and accuracy are important so there are many
complexities in real machines
•
Algorithm choice is important and may lead to hardware
optimizations for both space and time (e.g., multiplication)
•
You may want to look back (Section 3.10 is great reading!)
Chapter 4
Performance
• Measure, Report, and Summarize
• Make intelligent choices
• See through the marketing hype
• Key to understanding underlying organizational motivation
Why is some hardware better than others for different programs?
What factors of system performance are hardware related?
(e.g., Do we need a new machine, or a new operating system?)
How does the machine's instruction set affect performance?
Which of these airplanes has the best performance?
Airplane | Passengers | Range (mi) | Speed (mph)
Boeing 737-100 | 101 | 630 | 598
Boeing 747 | 470 | 4150 | 610
BAC/Sud Concorde | 132 | 4000 | 1350
Douglas DC-8-50 | 146 | 8720 | 544
• How much faster is the Concorde compared to the 747?
• How much bigger is the 747 than the Douglas DC-8?
Computer Performance: TIME, TIME, TIME
•
Response Time (latency)
— How long does it take for my job to run?
— How long does it take to execute a job?
— How long must I wait for the database query?
•
Throughput
— How many jobs can the machine run at once?
— What is the average execution rate?
— How much work is getting done?
•
If we upgrade a machine with a new processor what do we increase?
•
If we add a new machine to the lab what do we increase?
Execution Time
• Elapsed Time
– counts everything (disk and memory accesses, I/O , etc.)
– a useful number, but often not good for comparison purposes
• CPU time
– doesn't count I/O or time spent running other programs
– can be broken up into system time, and user time
• Our focus: user CPU time
– time spent executing the lines of code that are "in" our program
Book's Definition of Performance
•
For some program running on machine X,
\( \text{Performance}_X = 1 / \text{Execution time}_X \)
•
"X is n times faster than Y"
\( \text{Performance}_X / \text{Performance}_Y = n \)
•
Problem:
– machine A runs a program in 20 seconds
– machine B runs the same program in 25 seconds
Clock Cycles
•
Instead of reporting execution time in seconds, we often use cycles
\( \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \)
• Clock “ticks” indicate when to start activities (one abstraction):
[Figure: clock ticks along a time axis]
• cycle time = time between ticks = seconds per cycle
• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
A 4 GHz clock has a \( \frac{1}{4 \times 10^{9}} \times 10^{12} = 250 \) picoseconds (ps) cycle time
How to Improve Performance
\( \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \)
So, to improve performance (everything else being equal) you can
either (increase or decrease?)
________ the # of required cycles for a program, or
________ the clock cycle time or, said another way,
________ the clock rate.
How many cycles are required for a program?
• Could assume that number of cycles equals number of instructions
[Figure: 1st instruction, 2nd instruction, 3rd instruction, 4th, 5th, 6th, ... laid out one per cycle along a time axis]
This assumption is incorrect,
different instructions take different amounts of time on different machines.
Why? hint: remember that these are machine instructions, not lines of C code
Different numbers of cycles for different instructions
•
Multiplication takes more time than addition
•
Floating point operations take longer than integer ones
•
Accessing memory takes more time than accessing registers
•
Important point: changing the cycle time often changes the number of
cycles required for various instructions (more later)
Example
•
"Our favorite program runs in 10 seconds on computer A, which has a
4 GHz. clock. We are trying to help a computer designer build a new
machine B, that will run this program in 6 seconds. The designer can use
new (or perhaps more expensive) technology to substantially increase the
clock rate, but has informed us that this increase will affect the rest of the
CPU design, causing machine B to require 1.2 times as many clock cycles as
machine A for the same program. What clock rate should we tell the
designer to target?"
•
Don't Panic, can easily work this out from basic principles
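One way to work the example from basic principles is to count cycles first: cycles = seconds × clock rate on machine A, then scale by 1.2 and divide by the target time. The function name and parameter names below are illustrative.

```python
def required_clock_rate(time_a, clock_rate_a, cycle_factor, time_b):
    """Clock rate (Hz) machine B needs to finish in time_b seconds."""
    cycles_a = time_a * clock_rate_a        # cycles = seconds x cycles/second
    cycles_b = cycle_factor * cycles_a      # B needs 1.2 times as many cycles
    return cycles_b / time_b                # cycles/second = cycles / seconds

rate_b = required_clock_rate(time_a=10.0, clock_rate_a=4e9, cycle_factor=1.2, time_b=6.0)
print(rate_b / 1e9, "GHz")
```

Under these numbers machine B needs a clock of about 8 GHz, twice that of machine A.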
Now that we understand cycles
•
A given program will require
– some number of instructions (machine instructions)
– some number of cycles
– some number of seconds
•
We have a vocabulary that relates these quantities:
– cycle time (seconds per cycle)
– clock rate (cycles per second)
– CPI (cycles per instruction)
a floating point intensive application might have a higher CPI
– MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
Performance
• Performance is determined by execution time
• Do any of the other variables equal performance?
– # of cycles to execute program?
– # of instructions in program?
– # of cycles per second?
– average # of cycles per instruction?
– average # of instructions per second?
•
Common pitfall: thinking one of the variables is indicative of
performance when it really isn’t.
CPI Example
•
Suppose we have two implementations of the same instruction set
architecture (ISA).
For some program,
Machine A has a clock cycle time of 250 ps and a CPI of 2.0
Machine B has a clock cycle time of 500 ps and a CPI of 1.2
What machine is faster for this program, and by how much?
•
If two machines have the same ISA which of our quantities (e.g., clock rate,
CPI, execution time, # of instructions, MIPS) will always be identical?
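Because both machines implement the same ISA, the program has the same instruction count on each, so it is enough to compare the average time per instruction (CPI × cycle time). A small sketch of that comparison, with a hypothetical helper name:

```python
def time_per_instruction_ps(cpi, cycle_time_ps):
    """Average time per instruction in picoseconds."""
    return cpi * cycle_time_ps

time_a = time_per_instruction_ps(2.0, 250)   # machine A
time_b = time_per_instruction_ps(1.2, 500)   # machine B
ratio = time_b / time_a                      # how many times faster A is than B
print(time_a, time_b, ratio)
```

With these figures machine A spends about 500 ps per instruction and machine B about 600 ps, so A is about 1.2 times faster for this program.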
# of Instructions Example
•
A compiler designer is trying to decide between two code sequences
for a particular machine. Based on the hardware implementation,
there are three different classes of instructions: Class A, Class B, and
Class C, and they require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much?
What is the CPI for each sequence?
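The comparison comes down to counting cycles per class. A minimal sketch, where the dictionary layout and the function name `cycles_and_cpi` are choices made for this example:

```python
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

def cycles_and_cpi(counts):
    """Return (total cycles, CPI) for a dict mapping instruction class to count."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    instructions = sum(counts.values())
    return cycles, cycles / instructions

seq1 = cycles_and_cpi({"A": 2, "B": 1, "C": 2})
seq2 = cycles_and_cpi({"A": 4, "B": 1, "C": 1})
print(seq1, seq2)
```

The second sequence executes more instructions but takes fewer cycles (9 versus 10), so it is faster and has the lower CPI.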
MIPS example
•
Two different compilers are being tested for a 4 GHz. machine with
three different classes of instructions: Class A, Class B, and Class
C, which require one, two, and three cycles (respectively). Both
compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1
million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1
million Class B instructions, and 1 million Class C instructions.
• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution time?
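Both questions can be answered by computing cycles, execution time, and the MIPS rating for each compiler's output. The sketch below assumes MIPS here means instruction count divided by (execution time × 10^6); the names are illustrative.

```python
CLOCK_RATE = 4e9                              # 4 GHz
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

def exec_time_and_mips(counts):
    """Return (seconds, MIPS rating) for a dict mapping instruction class to count."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    seconds = cycles / CLOCK_RATE
    mips = sum(counts.values()) / (seconds * 1e6)
    return seconds, mips

time1, mips1 = exec_time_and_mips({"A": 5_000_000, "B": 1_000_000, "C": 1_000_000})
time2, mips2 = exec_time_and_mips({"A": 10_000_000, "B": 1_000_000, "C": 1_000_000})
print(time1, mips1)
print(time2, mips2)
```

The second compiler earns the higher MIPS rating, yet the first compiler's code finishes sooner, which is the pitfall the example is built to show.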
Benchmarks
• Performance best determined by running a real application
– Use programs typical of expected workload
– Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
– nice for architects and designers
– easy to standardize
– can be abused
• SPEC (System Performance Evaluation Cooperative)
– companies have agreed on a set of real programs and inputs
– valuable indicator of performance (and compiler technology)
– can still be abused
Benchmark Games
•
An embarrassed Intel Corp. acknowledged Friday that a bug in a
software program known as a compiler had led the company to
overstate the speed of its microprocessor chips on an industry
benchmark by 10 percent. However, industry analysts said the
coding error…was a sad commentary on a common industry
practice of “cheating” on standardized performance tests…The error
was pointed out to Intel two days ago by a competitor, Motorola
…came in a test known as SPECint92…Intel acknowledged that it
had “optimized” its compiler to improve its test scores. The
company had also said that it did not like the practice but felt
compelled to make the optimizations because its competitors were
doing the same thing…At the heart of Intel’s problem is the practice
of “tuning” compiler programs to recognize certain computing
problems in the test and then substituting special handwritten
pieces of code…
Saturday, January 6, 1996 New York Times
SPEC ‘89
• Compiler “enhancements” and performance
[Figure: bar chart of SPEC performance ratio (0 to 800) for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv, comparing Compiler with Enhanced compiler]
SPEC CPU2000
SPEC 2000
Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better performance?
[Figure: SPEC CINT2000 and CFP2000 ratings (0 to 1400) versus clock rate in MHz (500 to 3500) for Pentium III and Pentium 4]
[Figure: relative SPECINT2000 and SPECFP2000 performance (0.0 to 1.6) of Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz by benchmark and power mode: always on/maximum clock, laptop mode/adaptive clock, minimum power/minimum clock]
Experiment
•
Phone a major computer retailer and tell them you are having trouble
deciding between two different computers; specifically, you are
confused about the processors' strengths and weaknesses
(e.g., Pentium 4 at 2 GHz vs. Celeron M at 1.4 GHz)
•
What kind of response are you likely to get?
•
What kind of response could you give a friend with the same
question?
Amdahl's Law
Execution Time After Improvement =
Execution Time Unaffected +( Execution Time Affected / Amount of Improvement )
•
Example:
"Suppose a program runs in 100 seconds on a machine, with
multiply responsible for 80 seconds of this time. How much do we have to
improve the speed of multiplication if we want the program to run 4 times
faster?"
How about making it 5 times faster?
•
Principle: Make the common case fast
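The multiply example can be worked by solving Amdahl's Law for the amount of improvement. In this sketch, the function name is hypothetical and `None` is returned when the target cannot be reached no matter how fast the affected part becomes.

```python
def required_improvement(total_time, affected_time, target_speedup):
    """Factor by which the affected part must speed up, or None if the target is unreachable."""
    unaffected = total_time - affected_time
    target_time = total_time / target_speedup
    remaining = target_time - unaffected      # time budget left for the affected part
    if remaining <= 0:
        return None
    return affected_time / remaining

print(required_improvement(100, 80, 4))
print(required_improvement(100, 80, 5))
```

For a 4 times speedup, multiply has to become 16 times faster. A 5 times speedup would need the 80 seconds of multiply to shrink to zero, since the other 20 seconds are unaffected.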
Example
•
Suppose we enhance a machine making all floating-point instructions run
five times faster. If the execution time of some benchmark before the
floating-point enhancement is 10 seconds, what will the speedup be if half of
the 10 seconds is spent executing floating-point instructions?
•
We are looking for a benchmark to show off the new floating-point unit
described above, and want the overall benchmark to show a speedup of 3.
One benchmark we are considering runs for 100 seconds with the old
floating-point hardware. How much of the execution time would floating-point
instructions have to account for in this program in order to yield our
desired speedup on this benchmark?
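Both parts of this example follow from the Amdahl's Law formula. The sketch below uses two illustrative helpers: one applies the formula directly, the other solves total/speedup = (total - x) + x/improvement for the floating-point time x.

```python
def time_after_improvement(total, affected, improvement):
    """Execution time after the affected part runs `improvement` times faster."""
    return (total - affected) + affected / improvement

def affected_time_needed(total, improvement, target_speedup):
    """Affected time x that gives the target overall speedup."""
    return (total - total / target_speedup) / (1 - 1 / improvement)

speedup = 10 / time_after_improvement(10, 5, 5)
fp_seconds = affected_time_needed(100, 5, 3)
print(speedup, fp_seconds)
```

The first benchmark speeds up by about 1.67. To show a speedup of 3, floating-point instructions would have to account for roughly 83.3 of the 100 seconds.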
Remember
•
Performance is specific to a particular program/s
– Total execution time is a consistent summary of performance
•
For a given architecture performance increases come from:
– increases in clock rate (without adverse CPI effects)
– improvements in processor organization that lower CPI
– compiler enhancements that lower CPI and/or instruction count
– Algorithm/Language choices that affect instruction count
• Pitfall: expecting improvement in one aspect of a machine’s
performance to affect the total performance
Performance Metrics
•
Purchasing perspective
– given a collection of machines, which has the
• best performance ?
• least cost ?
• best cost/performance?
•
Design perspective
– faced with design options, which has the
• best performance improvement ?
• least cost ?
• best cost/performance?
• Both require
– basis for comparison
– metric for evaluation
• Our goal is to understand what factors in the architecture contribute
to overall system performance and the relative importance (and
cost) of these factors
Defining (Speed) Performance
•
Normally interested in reducing
– Response time (aka execution time) – the time between the start
and the completion of a task
• Important to individual users
– Thus, to maximize performance, need to minimize execution time
\( \text{performance}_X = 1 / \text{execution\_time}_X \)
If X is n times faster than Y, then
\( \frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n \)
– Throughput – the total amount of work done in a given time
• Important to data center managers
– Decreasing response time almost always improves throughput
Performance Factors
• Want to distinguish elapsed time and the time spent on our task
• CPU execution time (CPU time) – time the CPU spends working on a task
– Does not include time waiting for I/O or running other programs
\( \text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time} \)
or
\( \text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}} \)
• Can improve performance by reducing either the length of the clock
cycle or the number of clock cycles required for a program
Review: Machine Clock Rate
•
Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)
CC = 1 / CR
[Figure: clock waveform marking one clock period]
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
Clock Cycles per Instruction
•
Not all instructions take the same amount of time to execute
– One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction
\[ \text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction} \]
• Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
– A way to compare two different implementations of the same ISA

CPI for this instruction class
        A    B    C
CPI     1    2    3
Effective CPI
•
Computing the overall effective CPI is done by looking at the different
types of instructions and their individual cycle counts and averaging
\[ \text{Overall effective CPI} = \sum_{i=1}^{n} (CPI_i \times IC_i) \]
– Where ICi is the count (percentage) of the number of instructions
of class i executed
– CPIi is the (average) number of clock cycles per instruction for that
instruction class
– n is the number of instruction classes
•
The overall effective CPI varies by instruction mix – a measure of the
dynamic frequency of instructions across one or many programs
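The weighted sum above can be sketched in a few lines of Python. The function name `effective_cpi` and the example mix are assumptions made for illustration; they are not from the slides.

```python
def effective_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles_per_instruction) pairs."""
    total_fraction = sum(frac for frac, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    # Sum of CPI_i x IC_i over all instruction classes.
    return sum(frac * cpi for frac, cpi in mix)

# Hypothetical mix: 60% one-cycle, 30% two-cycle, 10% three-cycle instructions.
example_mix = [(0.6, 1), (0.3, 2), (0.1, 3)]
example_cpi = effective_cpi(example_mix)
print(round(example_cpi, 2))
```

Changing the fractions while keeping the per-class CPI fixed shows how the overall CPI depends on the instruction mix.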
THE Performance Equation
•
Our basic performance equation is then
\[ \text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle} \]
or
\[ \text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}} \]
• These equations separate the three key factors that affect performance
– Can measure the CPU execution time by running the program
– The clock rate is usually given
– Can measure overall instruction count by using profilers/
simulators without knowing all of the implementation details
– CPI varies by instruction type and ISA implementation for which
we must know the implementation details
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                          Instruction_count   CPI   clock_cycle
Algorithm                 X                   X
Programming language      X                   X
Compiler                  X                   X
ISA                       X                   X     X
Processor organization                        X     X
Technology                                          X
A Simple Example
Op       Freq   CPIi   Freq x CPIi   Load CPI = 2   Branch CPI = 1   Two ALU ops at once
ALU      50%    1      .5            .5             .5               .25
Load     20%    5      1.0           .4             1.0              1.0
Store    10%    3      .3            .3             .3               .3
Branch   20%    2      .4            .4             .2               .4
Sum =                  2.2           1.6            2.0              1.95

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
• What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
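The three answers above can be rechecked with a short script. This is only a sketch: the helper names (`cpi`, `with_cpi`, `percent_faster`) are made up here, and it assumes instruction count and clock cycle stay fixed so that CPU time scales with CPI.

```python
BASE_MIX = {          # op: (frequency, CPI), taken from the table above
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

def cpi(mix):
    return sum(freq * cycles for freq, cycles in mix.values())

def with_cpi(mix, op, new_cycles):
    """Copy of mix with one instruction class given a new cycle count."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cycles)
    return changed

def percent_faster(old_cpi, new_cpi):
    # IC and clock cycle are unchanged, so the CPU time ratio equals the CPI ratio.
    return (old_cpi / new_cpi - 1) * 100

base = cpi(BASE_MIX)
better_cache = cpi(with_cpi(BASE_MIX, "Load", 2))
branch_pred = cpi(with_cpi(BASE_MIX, "Branch", 1))
dual_alu = cpi(with_cpi(BASE_MIX, "ALU", 0.5))   # two ALU ops per cycle

for name, new in [("better cache", better_cache),
                  ("branch prediction", branch_pred),
                  ("two ALU ops at once", dual_alu)]:
    print(f"{name}: CPI {new:.2f}, {percent_faster(base, new):.1f}% faster")
```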
Comparing and Summarizing Performance
•
How do we summarize the performance for a benchmark set with a single number?
– The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
\[ AM = \frac{1}{n} \sum_{i=1}^{n} Time_i \]
– Where Timei is the execution time for the ith program of a total of n
programs in the workload
– A smaller mean indicates a smaller average execution time and
thus improved performance
•
Guiding principle in reporting performance measurements is
reproducibility – list everything another experimenter would need to
duplicate the experiment (version of the operating system, compiler
settings, input set used, specific computer configuration (clock rate,
cache sizes and speed, memory size and speed, etc.))
SPEC Benchmarks www.spec.org
Integer benchmarks
gzip       compression
vpr        FPGA place & route
gcc        GNU C compiler
mcf        Combinatorial optimization
crafty     Chess program
parser     Word processing program
eon        Computer visualization
perlbmk    perl application
gap        Group theory interpreter
vortex     Object oriented database
bzip2      compression
twolf      Circuit place & route

FP benchmarks
wupwise    Quantum chromodynamics
swim       Shallow water model
mgrid      Multigrid solver in 3D fields
applu      Parabolic/elliptic pde
mesa       3D graphics library
galgel     Computational fluid dynamics
art        Image recognition (NN)
equake     Seismic wave propagation simulation
facerec    Facial image recognition
ammp       Computational chemistry
lucas      Primality testing
fma3d      Crash simulation fem
sixtrack   Nuclear physics accel
apsi       Pollutant distribution
Example SPEC Ratings
Other Performance Metrics
•
Power consumption – especially in the embedded market where
battery life is important (and passive cooling)
– For power-limited applications, the most important metric is
energy efficiency
Summary: Evaluating ISAs
• Design-time metrics:
– Can it be implemented, in how long, at what cost?
– Can it be programmed? Ease of compilation?
• Static Metrics:
– How many bytes does the program occupy in memory?
• Dynamic Metrics:
– How many instructions are executed? How many bytes does the
processor fetch to execute the program?
– How many clocks are required per instruction?
– How "lean" a clock is practical?
Best Metric: Time to execute the program!
Depends on the instruction set, the processor organization, and compilation techniques.
[Figure: Inst. Count, CPI, and Cycle Time]
Chapter --Five
Let's Build a Processor
• Almost ready to move into chapter 5 and start building a processor
• First, let’s review Boolean Logic and build the ALU we’ll need (Material from Appendix B)
[Figure: ALU with 32-bit inputs a and b, an operation control input, and a 32-bit result]
Review: Boolean Algebra & Gates
•
Problem: Consider a logic function with three inputs: A, B, and C.
Output D is true if at least one input is true
Output E is true if exactly two inputs are true
Output F is true only if all three inputs are true
•
Show the truth table for these three functions.
•
Show the Boolean equations for these three functions.
•
Show an implementation consisting of inverters, AND, and OR gates.
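One way to check an answer to this problem is to enumerate all eight input combinations. The sketch below writes D, E, and F using only AND, OR, and NOT, which matches the gates allowed in the last bullet; the function name `logic_outputs` is an assumption.

```python
from itertools import product

def logic_outputs(a, b, c):
    d = a or b or c                       # at least one input is true
    e = ((a and b and not c) or
         (a and not b and c) or
         (not a and b and c))             # exactly two inputs are true
    f = a and b and c                     # all three inputs are true
    return int(d), int(e), int(f)

print("A B C | D E F")
for a, b, c in product([0, 1], repeat=3):
    d, e, f = logic_outputs(a, b, c)
    print(f"{a} {b} {c} | {d} {e} {f}")
```

Each product term in `e` corresponds to one AND gate with inverters on its inputs, and the `or` between the terms corresponds to one OR gate (a sum-of-products form).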
An ALU (arithmetic logic unit)
•
Let's build an ALU to support the andi and ori instructions
– we'll just build a 1 bit ALU, and use 32 of them
[Figure: 1-bit ALU with inputs a and b, an operation control input, and output result; truth table with columns op, a, b, res]
•
Possible Implementation (sum-of-products):
Review: The Multiplexor
•
Selects one of the inputs to be the output, based on a control input
[Figure: 2-input mux with select input S, data inputs A (mux input 0) and B (mux input 1), and output C]
note: we call this a 2-input mux even though it has 3 inputs!
• Let's build our ALU using a MUX:
Different Implementations
• Not easy to decide the “best” way to build something
– Don't want too many inputs to a single gate
– Don’t want to have to go through too many gates
– for our purposes, ease of comprehension is important
• Let's look at a 1-bit ALU for addition:
[Figure: 1-bit adder with inputs a, b, and CarryIn and outputs Sum and CarryOut]
cout = a b + a cin + b cin
sum = a xor b xor cin
• How could we build a 1-bit ALU for add, and, and or?
• How could we build a 32-bit ALU?
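As a rough sketch of one answer to the last question, the two equations above can be coded as a 1-bit full adder and then chained 32 times, with each CarryOut feeding the next CarryIn. The function names are assumptions, and this models only the adder, not the and/or paths.

```python
def full_adder(a, b, cin):
    """1-bit adder: sum = a xor b xor cin, cout = a b + a cin + b cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(x, y, width=32, carry_in=0):
    """Chain `width` full adders; the carry ripples from bit 0 upward."""
    result, carry = 0, carry_in
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_add(5, 7))
print(ripple_add(0xFFFFFFFF, 1))
```

The loop makes the cost of this design visible: bit 31 cannot finish until the carry has passed through all 31 adders below it.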
Building a 32 bit ALU
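[Figure: left, a 1-bit ALU with inputs a, b, and CarryIn, where Operation selects mux input 0, 1, or 2 as Result, plus a CarryOut; right, a 32-bit ALU built from ALU0 through ALU31 with inputs a0–a31 and b0–b31, outputs Result0–Result31, a shared Operation input, and each CarryOut wired to the next CarryIn]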
What about subtraction (a – b) ?
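• Two's complement approach: just negate b and add.
• How do we negate?
• A very clever solution:
[Figure: 1-bit ALU with a Binvert control that selects b or its complement (mux inputs 0 and 1), CarryIn, an Operation mux with inputs 0, 1, and 2 producing Result, and CarryOut]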
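The figure itself did not survive, so the following is a sketch of the standard two's complement trick rather than a transcription of the slide: invert every bit of b (Binvert = 1) and feed a 1 into the carry-in of bit 0, since a + ~b + 1 = a - b. The function names and the 32-bit default width are assumptions.

```python
def add_bits(x, y, carry_in, width=32):
    """Behavioral model of a width-bit adder: returns (sum, carry_out)."""
    mask = (1 << width) - 1
    total = (x & mask) + (y & mask) + carry_in
    return total & mask, total >> width

def alu_add_sub(a, b, binvert, width=32):
    """binvert = 0 computes a + b; binvert = 1 computes a - b."""
    mask = (1 << width) - 1
    operand = (~b & mask) if binvert else (b & mask)
    # The same control bit supplies the +1: a + ~b + 1 equals a - b.
    result, _carry_out = add_bits(a, operand, carry_in=binvert, width=width)
    return result

print(alu_add_sub(7, 5, binvert=1))        # 7 - 5
print(hex(alu_add_sub(5, 7, binvert=1)))   # 5 - 7 as a 32-bit two's complement value
```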
Adding a NOR function
•
Can also choose to invert a. How do we get “a NOR b” ?
[Figure: 1-bit ALU with Ainvert and Binvert controls on inputs a and b, CarryIn, an adder, an Operation mux with inputs 0, 1, and 2 producing Result, and CarryOut]
Tailoring the ALU to the MIPS
•
Need to support the set-on-less-than instruction (slt)
– remember: slt is an arithmetic instruction
– produces a 1 if rs < rt and 0 otherwise
– use subtraction: (a-b) < 0 implies a < b
•
Need to support test for equality (beq $t5, $t6, Label)
– use subtraction: (a-b) = 0 implies a = b
Supporting slt
•
Can we figure out the idea?
[Figure: two 1-bit ALUs, each with Ainvert, Binvert, CarryIn, an adder, and a Less input on mux input 3. One is labeled "Use this ALU for most significant bit" and adds a Set output and overflow detection producing Overflow; the other is labeled "all other bits" and produces CarryOut.]
Supporting slt
[Figure: 32-bit ALU built from ALU0 through ALU31 with shared Operation, Binvert, and Ainvert controls and a chained carry. The Less input is 0 for ALU1 through ALU31; ALU31 produces Set and Overflow in addition to Result31.]
Test for equality
•
Notice control lines (Ainvert, Bnegate, Operation):
0000 = and
0001 = or
0010 = add
0110 = subtract
0111 = slt
1100 = NOR
• Note: zero is a 1 when the result is zero!
[Figure: 32-bit ALU built from ALU0 through ALU31 with Ainvert, Bnegate, and Operation controls; Result0 through Result31 are combined to form the Zero output, and ALU31 also produces Set and Overflow]
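A behavioral sketch of this ALU, assuming the 4-bit control value is read as Ainvert, Bnegate, and a 2-bit Operation (from most to least significant bit), could look like the following. It models the whole 32-bit word at once instead of 32 bit slices, and the slt result here is just the sign bit of a - b with overflow ignored.

```python
MASK = 0xFFFFFFFF

def alu(control, a, b):
    """Return (result, zero) for a 4-bit control value."""
    ainvert = (control >> 3) & 1
    bnegate = (control >> 2) & 1
    operation = control & 0b11
    x = (~a & MASK) if ainvert else (a & MASK)
    y = (~b & MASK) if bnegate else (b & MASK)
    total = (x + y + bnegate) & MASK      # Bnegate also supplies the carry-in of bit 0
    if operation == 0:
        result = x & y                    # and; with both inputs inverted this is NOR
    elif operation == 1:
        result = x | y                    # or
    elif operation == 2:
        result = total                    # add or subtract
    else:
        result = (total >> 31) & 1        # slt: sign bit of a - b
    zero = int(result == 0)               # zero is 1 when the result is zero
    return result, zero

print(alu(0b0110, 9, 9))   # subtract equal values: result 0, zero 1
print(alu(0b0111, 3, 5))   # slt: 3 < 5
```

NOR falls out of the existing hardware: with Ainvert = Bnegate = 1 and Operation = and, the AND gate computes (not a) and (not b), which is a NOR b by De Morgan's law.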
Conclusion
•
We can build an ALU to support the MIPS instruction set
– key idea: use multiplexor to select the output we want
– we can efficiently perform subtraction using two’s complement
– we can replicate a 1-bit ALU to produce a 32-bit ALU
•
Important points about hardware
– all of the gates are always working
– the speed of a gate is affected by the number of inputs to the
gate
– the speed of a circuit is affected by the number of gates in series
(on the “critical path” or the “deepest level of logic”)
•
Our primary focus: comprehension, however,
– Clever changes to organization can improve performance
(similar to using better algorithms in software)
– We saw this in multiplication, let’s look at addition now
Problem: ripple carry adder is slow
• Is a 32-bit ALU as fast as a 1-bit ALU?
• Is there more than one way to do addition?
– two extremes: ripple carry and sum-of-products
Can you see the ripple? How could you get rid of it?
c1 = b0c0 + a0c0 + a0b0
c2 = b1c1 + a1c1 + a1b1     c2 =
c3 = b2c2 + a2c2 + a2b2     c3 =
c4 = b3c3 + a3c3 + a3b3     c4 =
Not feasible! Why?
Carry-lookahead adder
• An approach in-between our two extremes
• Motivation:
– If we didn't know the value of carry-in, what could we do?
– When would we always generate a carry?     gi = ai bi
– When would we propagate the carry?     pi = ai + bi
• Did we get rid of the ripple?
c1 = g0 + p0c0
c2 = g1 + p1c1     c2 =
c3 = g2 + p2c2     c3 =
c4 = g3 + p3c3     c4 =
Feasible! Why?
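One way to fill in the blank right-hand sides is to substitute each carry into the next, so that every carry depends only on the g and p signals and c0. The sketch below does that for four bits; the function name `cla4` is an assumption.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead add; a and b are integers in 0..15."""
    abit = [(a >> i) & 1 for i in range(4)]
    bbit = [(b >> i) & 1 for i in range(4)]
    g = [abit[i] & bbit[i] for i in range(4)]   # generate:  gi = ai bi
    p = [abit[i] | bbit[i] for i in range(4)]   # propagate: pi = ai + bi
    # Fully expanded carries: none of them waits on another carry.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (abit[i] ^ bbit[i] ^ carries[i]) << i
    return total, c4

print(cla4(9, 8))
```

The product terms grow by one per bit instead of multiplying as they do in the a, b form, which is why this version stays buildable for small widths.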
Use principle to build bigger adders
• Can’t build a 16 bit adder this way... (too big)
• Could use ripple carry of 4-bit CLA adders
• Better: use the CLA principle again!
[Figure: 16-bit adder built from four 4-bit ALUs (ALU0 through ALU3) with inputs a0–a15 and b0–b15, producing Result0–3, Result4–7, Result8–11, and Result12–15. Each ALU sends its block signals (P0/G0 through P3/G3) to a carry-lookahead unit, which supplies the carries C1 through C4; C4 is the CarryOut.]
ALU Summary
• We can build an ALU to support MIPS addition
• Our focus is on comprehension, not performance
• Real processors use more sophisticated techniques for arithmetic
• Where performance is not critical, hardware description languages allow designers to completely automate the creation of hardware!
Chapter Five
The Processor: Datapath & Control
• We're ready to look at an implementation of the MIPS
• Simplified to contain only:
– memory-reference instructions: lw, sw
– arithmetic-logical instructions: add, sub, and, or, slt
– control flow instructions: beq, j
• Generic Implementation:
– use the program counter (PC) to supply instruction address
– get the instruction from memory
– read registers
– use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?
More Implementation Details
•
Abstract / Simplified View:
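[Figure: abstract datapath containing the PC, instruction memory (Address, Instruction), Registers (three Register # inputs and Data), ALU, data memory (Address, Data), and two adders, one of which has the constant 4 as an input]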
•
Two types of functional units:
– elements that operate on data values (combinational)
– elements that contain state (sequential)
State Elements
• Unclocked vs. Clocked
• Clocks used in synchronous logic
– when should an element that contains state be updated?
[Figure: clock waveform showing the clock period, rising edge, and falling edge]
An unclocked state element
•
The set-reset latch
– output depends on present inputs and also on past inputs
[Figure: set-reset latch with inputs R and S and outputs Q and its complement]
Latches and Flip-flops
• Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
• Change of state (value) is based on the clock
• Latches: whenever the inputs change, and the clock is asserted
• Flip-flop: state changes only on a clock edge (edge-triggered methodology)
Note: "asserted" means "logically true", which could mean electrically low
A clocking methodology defines when signals can be read and written
– wouldn't want to read a signal at the same time it was being written
D-latch
• Two inputs:
– the data value to be stored (D)
– the clock signal (C) indicating when to read & store D
• Two outputs:
– the value of the internal state (Q) and its complement
[Figure: D latch with inputs C and D and outputs Q and its complement, with a timing diagram of D, C, and Q]
D flip-flop
•
Output changes only on the clock edge
[Figure: D flip-flop built from two D latches in series, sharing the clock input C, with a timing diagram of D, C, and Q]
Our Implementation
• An edge triggered methodology
• Typical execution:
– read contents of some state elements,
– send values through some combinational logic
– write results to one or more state elements
[Figure: state element 1 feeding combinational logic feeding state element 2, all within one clock cycle]
Register File
•
Built using D flip-flops
[Figure: register file with inputs Read register number 1, Read register number 2, Write register, Write data, and Write, and outputs Read data 1 and Read data 2. Internally, each read port is a Mux that selects among Register 0 through Register n – 1 using its read register number.]
Do you understand? What is the “Mux” above?
Abstraction
• Make sure you understand the abstractions!
• Sometimes it is easy to think you do, when you don’t
[Figure: a 32-bit mux with inputs A and B, Select, and output C, drawn next to its expansion into 32 one-bit muxes (A31/B31/C31 down to A0/B0/C0) that share the same Select]
Register File
•
Note: we still use the real clock to determine when to write
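[Figure: write port of the register file. The Register number feeds an n-to-2^n decoder; each decoder output is combined with the Write signal to drive the C input of one register (Register 0 through Register n – 1), and Register data is connected to the D input of every register.]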
Simple Implementation
•
Include the functional units we need for each instruction
[Figure: the functional units needed, in three groups:
– a. Instruction memory (Instruction address in, Instruction out), b. Program counter (PC), c. Adder (Add, Sum)
– a. Data memory unit (Address, Write data, Read data; MemWrite and MemRead controls), b. Sign-extension unit (16 bits in, 32 bits out)
– a. Registers (5-bit Read register 1, Read register 2, and Write register numbers; Write Data; Read data 1 and Read data 2; RegWrite control), b. ALU (4-bit ALU operation control; Zero and ALU result outputs)]
Why do we need this stuff?
Building the Datapath
•
Use multiplexors to stitch them together
[Figure: single-cycle datapath. Muxes controlled by PCSrc, ALUSrc, and MemtoReg connect the PC, instruction memory, Registers, sign-extend and shift-left-2 units, ALU, two adders (one adding 4 to the PC), and data memory. Other control inputs shown: RegWrite, 4-bit ALU operation, MemWrite, and MemRead.]
Control
•
Selecting the operations to perform (ALU, read/write, etc.)
•
Controlling the flow of data (multiplexor inputs)
•
Information comes from the 32 bits of the instruction
• Example: add $8, $17, $18
• Instruction Format:
000000   10001   10010   01000   00000   100000
op       rs      rt      rd      shamt   funct
• ALU's operation based on instruction type and function code
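To make the field boundaries concrete, here is a small sketch that slices a 32-bit R-format word into the six fields shown above. The function name `decode_r_type` is an assumption; the bit pattern is the one from the slide.

```python
def decode_r_type(word):
    """Split a 32-bit R-format instruction into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31-26
        "rs":    (word >> 21) & 0x1F,   # bits 25-21
        "rt":    (word >> 16) & 0x1F,   # bits 20-16
        "rd":    (word >> 11) & 0x1F,   # bits 15-11
        "shamt": (word >> 6) & 0x1F,    # bits 10-6
        "funct": word & 0x3F,           # bits 5-0
    }

# add $8, $17, $18 written field by field, as on the slide
ADD_8_17_18 = 0b000000_10001_10010_01000_00000_100000
fields = decode_r_type(ADD_8_17_18)
print(hex(ADD_8_17_18), fields)
```

Note that the destination register $8 appears in the rd field, after the two source registers, even though it is written first in the assembly syntax.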
Control
• e.g., what should the ALU do with this instruction
• Example: lw $1, 100($2)
35    2     1     100
op    rs    rt    16 bit offset
• ALU control input
0000   AND
0001   OR
0010   add
0110   subtract
0111   set-on-less-than
1100   NOR
• Why is the code for subtract 0110 and not 0011?
Control
•
Must describe hardware to compute 4-bit ALU control input
– given instruction type, encoded as ALUOp (computed from instruction type):
00 = lw, sw
01 = beq
10 = arithmetic
– function code for arithmetic
•
Describe it using a truth table (can turn into gates):
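The truth table itself is not reproduced in this text, so the following is a sketch of what it computes rather than a copy of it. The ALUOp encodings and the 4-bit outputs come from the slides; the funct values are the standard MIPS function codes, which is an assumption here because the slides do not list them.

```python
FUNCT_TO_ALU = {       # standard MIPS funct codes (assumed, not listed on the slide)
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}

def alu_control(alu_op, funct=0):
    """Map 2-bit ALUOp (and funct for arithmetic) to the 4-bit ALU control input."""
    if alu_op == 0b00:       # lw, sw: add to form the memory address
        return 0b0010
    if alu_op == 0b01:       # beq: subtract to compare the registers
        return 0b0110
    if alu_op == 0b10:       # arithmetic: the function code decides
        return FUNCT_TO_ALU[funct]
    raise ValueError("unused ALUOp encoding")

print(format(alu_control(0b10, 0b100010), "04b"))
```

Because the function only looks at ALUOp for loads, stores, and branches, the funct bits are don't-cares in those rows, which is what keeps the gate-level version small.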
[Figure: single-cycle datapath with control. The Control unit takes Instruction [31–26] and produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. Instruction [25–21] and Instruction [20–16] select Read register 1 and Read register 2; a mux chooses between Instruction [20–16] and Instruction [15–11] for the Write register; Instruction [15–0] goes through sign extend (16 to 32 bits) and shift left 2; Instruction [5–0] feeds the ALU control.]
Instruction   RegDst   ALUSrc   MemtoReg   RegWrite   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format      1        0        0          1          0         0          0        1        0
lw            0        1        1          1          1         0          0        0        0
sw            X        1        X          0          0         1          0        0        0
beq           X        0        X          0          0         0          1        0        1
Control
•
Simple combinational logic (truth tables)
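[Figure: two blocks of combinational logic. The ALU control block takes ALUOp (ALUOp1, ALUOp0) and the function field F (5–0) (bits F3 through F0 are used) and produces Operation2, Operation1, and Operation0. The main control takes inputs Op5 through Op0, decodes R-format, lw, sw, and beq, and produces the outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, and ALUOp0.]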
Our Simple Control Structure
•
All of the logic is combinational
•
We wait for everything to settle down, and the right thing to be done
– ALU might not produce “right answer” right away
– we use write signals along with clock to determine when to write
•
Cycle time determined by length of the longest path
[Figure: state element 1 feeding combinational logic feeding state element 2, all within one clock cycle]
We are ignoring some details like setup and hold times
Single Cycle Implementation
•
Calculate cycle time assuming negligible delays except:
– memory (200ps), ALU and adders (100ps), register file access (50ps)
[Figure: the single-cycle datapath with PCSrc, ALUSrc, MemtoReg, RegWrite, ALU operation, MemWrite, and MemRead controls, as on the "Building the Datapath" slide]
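A possible way to work the cycle-time question is to add up the delays along each instruction class's path and take the longest. The delays come from the slide; which units each class passes through is an assumption based on the usual single-cycle MIPS datapath, and the names `PATHS` and `latency` are made up for this sketch.

```python
MEM, ALU, REG = 200, 100, 50   # delays in ps, from the slide

# Units used by each instruction class, in order (assumed paths).
PATHS = {
    "R-type": [MEM, REG, ALU, REG],        # fetch, register read, ALU, register write
    "lw":     [MEM, REG, ALU, MEM, REG],   # fetch, register read, address, data read, register write
    "sw":     [MEM, REG, ALU, MEM],        # fetch, register read, address, data write
    "beq":    [MEM, REG, ALU],             # fetch, register read, compare
    "j":      [MEM],                       # fetch only
}

latency = {name: sum(path) for name, path in PATHS.items()}
cycle_time = max(latency.values())   # the clock must fit the slowest instruction

for name, ps in latency.items():
    print(f"{name}: {ps} ps")
print(f"cycle time: {cycle_time} ps")
```

Under these assumptions the load sets the clock, so every other instruction class finishes early and waits out the rest of the cycle.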
Where we are headed
• Single Cycle Problems:
– what if we had a more complicated instruction like floating point?
– wasteful of area
• One Solution:
– use a “smaller” cycle time
– have different instructions take different numbers of cycles
– a “multicycle” datapath:
[Figure: multicycle datapath with the PC, a single Memory for instructions or data, Instruction register, Memory data register, Registers, A and B registers, ALU, and ALUOut]
Multicycle Approach
• We will be reusing functional units
– ALU used to compute address and to increment PC
– Memory used for instruction and data
• Our control signals will not be determined directly by instruction
– e.g., what should the ALU do for a “subtract” instruction?
• We’ll use a finite state machine for control
Multicycle Approach
• Break up the instructions into steps, each step takes a cycle
– balance the amount of work to be done
– restrict each cycle to use only one major functional unit
• At the end of a cycle
– store values for use in later cycles (easiest thing to do)
– introduce additional “internal” registers
[Figure: multicycle datapath with the PC, Memory (Address, Write data, MemData), Instruction register (fields [25–21], [20–16], [15–11], [15–0]), Memory data register, Registers, A and B registers, sign extend, shift left 2, ALU (Zero, ALU result), and ALUOut, connected by muxes; the ALU's second input mux selects among B, the constant 4, the sign-extended field, and the shifted sign-extended field]
Instructions from ISA perspective
• Consider each instruction from perspective of ISA.
• Example:
– The add instruction changes a register.
– Register specified by bits 15:11 of instruction.
– Instruction specified by the PC.
– New value is the sum (“op”) of two registers.
– Registers specified by bits 25:21 and 20:16 of the instruction
Reg[Memory[PC][15:11]] <=
Reg[Memory[PC][25:21]] op
Reg[Memory[PC][20:16]]
– In order to accomplish this we must break up the instruction.
(kind of like introducing variables when programming)
Breaking down an instruction
•
ISA definition of arithmetic:
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op
Reg[Memory[PC][20:16]]
•
Could break down to:
– IR <= Memory[PC]
– A <= Reg[IR[25:21]]
– B <= Reg[IR[20:16]]
– ALUOut <= A op B
– Reg[IR[15:11]] <= ALUOut
•
We forgot an important part of the definition of arithmetic!
– PC <= PC + 4
Idea behind multicycle approach
•
We define each instruction from the ISA perspective (do this!)
•
Break it down into steps following our rule that data flows through at
most one major functional unit (e.g., balance work across steps)
•
Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.)
•
Finally try and pack as much work into each step
(avoid unnecessary cycles)
while also trying to share steps where possible
(minimizes control, helps to simplify solution)
•
Result: Our book’s multicycle Implementation!
Five Execution Steps
•
Instruction Fetch
•
Instruction Decode and Register Fetch
•
Execution, Memory Address Computation, or Branch Completion
•
Memory Access or R-type instruction completion
•
Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Step 1: Instruction Fetch
• Use PC to get instruction and put it in the Instruction Register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL "Register-Transfer Language"
IR <= Memory[PC];
PC <= PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch
• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:
A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
•
We aren't setting any control lines based on the instruction type
(we are busy "decoding" it in our control logic)
Step 3 (instruction dependent)
•
ALU is performing one of three functions, based on instruction type
•
Memory Reference:
ALUOut <= A + sign-extend(IR[15:0]);
•
R-type:
ALUOut <= A op B;
•
Branch:
if (A==B) PC <= ALUOut;
Step 4 (R-type or memory-access)
•
Loads and stores access memory
MDR <= Memory[ALUOut];
or
Memory[ALUOut] <= B;
•
R-type instructions finish
Reg[IR[15:11]] <= ALUOut;
The write actually takes place at the end of the cycle on the edge
Write-back step
• Reg[IR[20:16]] <= MDR;
Which instruction needs this?
Summary:
Simple Questions
•
How many cycles will it take to execute this code?
        lw  $t2, 0($t3)
        lw  $t3, 4($t3)
        beq $t2, $t3, Label    #assume not
        add $t5, $t2, $t3
        sw  $t5, 8($t3)
Label:  ...
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
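These questions can be checked by laying the instructions end to end, one step per cycle. The sketch assumes the step counts implied by the five execution steps: 5 cycles for lw, 4 for sw and R-type instructions, and 3 for beq, with the branch not taken. The names `CYCLES` and `schedule` are hypothetical.

```python
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}   # steps per instruction class
program = ["lw", "lw", "beq", "add", "sw"]          # beq assumed not taken

def schedule(instrs):
    """Return (instruction index, step number) for every clock cycle, in order."""
    timeline = []
    for index, op in enumerate(instrs):
        timeline.extend((index, step) for step in range(1, CYCLES[op] + 1))
    return timeline

timeline = schedule(program)
total_cycles = len(timeline)
eighth = timeline[8 - 1]                  # what runs in the 8th cycle
add_cycle = timeline.index((3, 3)) + 1    # step 3 of the add is ALUOut <= A op B

print(total_cycles, eighth, add_cycle)
```

Here `eighth` identifies the second lw in its third step (the memory address computation), and `add_cycle` is the cycle in which the add reaches its execution step.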
[Figure: complete multicycle datapath with control. The Control unit takes Op [5–0] (Instruction [31–26]) and produces the outputs PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst. The datapath adds a jump path: Instruction [25–0] is shifted left 2 to 28 bits and combined with PC [31–28] to form the Jump address [31–0], which is one of three inputs to the mux that selects the next PC. Instruction [5–0] feeds the ALU control.]
Review: finite state machines
•
Finite state machines:
– a set of states and
– next state function (determined by current state and the input)
– output function (determined by current state and possibly input)
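[Figure: the Current state and the Inputs feed the Next-state function, which produces the Next state loaded on the Clock; they also feed the Output function, which produces the Outputs]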
– We’ll use a Moore machine (output based only on current state)
Review: finite state machines
•
Example:
B. 37 A friend would like you to build an “electronic eye” for use as a fake security
device. The device consists of three lights lined up in a row, controlled by the outputs
Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one
light is on at a time, and the light “moves” from left to right and then from right to left,
thus scaring away thieves who believe that the device is monitoring their activity. Draw
the graphical representation for the finite state machine used to specify the electronic
eye. Note that the rate of the eye’s movement will be controlled by the clock speed (which
should not be too great) and that there are essentially no inputs.
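One possible solution, sketched as a table-driven Moore machine, uses four states: the middle light needs two states so the machine remembers which direction it is moving. The state names and the `run_eye` helper are assumptions; since there are no inputs, each state has exactly one successor.

```python
# state -> (next state, (Left, Middle, Right) outputs)
EYE_FSM = {
    "LEFT":         ("MIDDLE_RIGHT", (1, 0, 0)),
    "MIDDLE_RIGHT": ("RIGHT",        (0, 1, 0)),   # middle light, moving right
    "RIGHT":        ("MIDDLE_LEFT",  (0, 0, 1)),
    "MIDDLE_LEFT":  ("LEFT",         (0, 1, 0)),   # middle light, moving left
}

def run_eye(cycles, state="LEFT"):
    """Return the (Left, Middle, Right) outputs for each clock cycle."""
    outputs = []
    for _ in range(cycles):
        next_state, lights = EYE_FSM[state]
        outputs.append(lights)      # Moore machine: outputs depend only on the state
        state = next_state          # one transition per clock tick
    return outputs

for lights in run_eye(6):
    print(lights)
```

Four states fit in two state bits, and the graphical version is simply these four states drawn as circles with one arrow leaving each.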
Implementing the Control
•
Value of control signals is dependent upon:
– what instruction is being executed
– which step is being performed
•
Use the information we’ve accumulated to specify a finite state machine
– specify the finite state machine graphically, or
– use microprogramming
•
Implementation can be derived from specification
Graphical Specification of FSM
• Note:
– don’t care if not mentioned
– asserted if name only
– otherwise exact value
• How many state bits will we need?

State 0 (Start), Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00. Next: state 1
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00. Next: state 2 (lw or sw), 6 (R-type), 8 (beq), or 9 (j)
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00. Next: state 3 (lw) or 5 (sw)
State 3, Memory access: MemRead, IorD = 1. Next: state 4
State 4, Memory read completion step: RegDst = 0, RegWrite, MemtoReg = 1. Next: state 0
State 5, Memory access: MemWrite, IorD = 1. Next: state 0
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10. Next: state 7
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0. Next: state 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01. Next: state 0
State 9, Jump completion: PCWrite, PCSource = 10. Next: state 0
Finite State Machine for Control
Implementation:
[Figure: control logic block with inputs Op5 through Op0 (instruction register opcode field) and S3 through S0 (state register), and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3 through NS0, which are loaded into the state register]
PLA Implementation
•
If I picked a horizontal or vertical line could you explain it?
[Figure: PLA with inputs Op5 through Op0 and S3 through S0, and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, and NS0]
ROM Implementation
• ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
– if the address is m-bits, we can address 2^m entries in the ROM.
– our outputs are the bits of data that the address points to.

m (address)   n (data)
0 0 0         0 0 1 1
0 0 1         1 1 0 0
0 1 0         1 1 0 0
0 1 1         1 0 0 0
1 0 0         0 0 0 0
1 0 1         0 0 0 1
1 1 0         0 1 1 0
1 1 1         0 1 1 1

m is the "height", and n is the "width"
ROM Implementation
• How many inputs are there?
6 bits for opcode, 4 bits for state = 10 address lines
(i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
• Rather wasteful, since for lots of the entries, the outputs are the same
– i.e., opcode is often ignored
ROM vs PLA
•
Break up the table into two parts
– 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
– 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
– Total: 4.3K bits of ROM
•
PLA is much smaller
– can share product terms
– only need entries that produce an active output
– can take into account don't cares
•
Size is (#inputs x #product-terms) + (#outputs x #product-terms)
For this example = (10x17)+(20x17) = 510 PLA cells
•
PLA cells usually about the size of a ROM cell (slightly bigger)
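The size comparison can be redone as plain arithmetic. The variable names below are assumptions; the bit counts and the 17 product terms are the ones quoted on the slides.

```python
OPCODE_BITS, STATE_BITS = 6, 4
CONTROL_OUTPUTS, NEXT_STATE_OUTPUTS = 16, 4
PRODUCT_TERMS = 17

inputs = OPCODE_BITS + STATE_BITS                  # 10 address lines
outputs = CONTROL_OUTPUTS + NEXT_STATE_OUTPUTS     # 20 outputs

# One big ROM: every (opcode, state) pair stores all 20 outputs.
single_rom_bits = 2**inputs * outputs

# Two ROMs: control outputs depend on the state only, next state on all 10 bits.
split_rom_bits = 2**STATE_BITS * CONTROL_OUTPUTS + 2**inputs * NEXT_STATE_OUTPUTS

# PLA: an AND plane (inputs x terms) plus an OR plane (outputs x terms).
pla_cells = inputs * PRODUCT_TERMS + outputs * PRODUCT_TERMS

print(single_rom_bits, split_rom_bits, pla_cells)
print(f"{single_rom_bits / 1024:.0f}K, {split_rom_bits / 1024:.2f}K")
```

The split ROM comes to 4352 bits, which is 4.25K and matches the slide's 4.3K after rounding.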
Another Implementation Style
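• Complex instructions: the "next state" is often current state + 1
[Figure: control unit in which a PLA or ROM, addressed by the State register, produces the outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and AddrCtl. An Adder computes State + 1, and the address select logic picks the next state using AddrCtl and Op[5–0] from the instruction register opcode field.]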
Details
Dispatch ROM 1
Op       Opcode name   Value
000000   R-format      0110
000010   jmp           1001
000100   beq           1000
100011   lw            0010
101011   sw            0010

Dispatch ROM 2
Op       Opcode name   Value
100011   lw            0011
101011   sw            0101

[Figure: address select logic. A Mux controlled by AddrCtl chooses the next State from input 3 (State + 1 from the Adder), input 2 (Dispatch ROM 2), input 1 (Dispatch ROM 1), or input 0 (the constant 0); both dispatch ROMs are indexed by the instruction register opcode field, and the State addresses the PLA or ROM.]

State number   Address-control action       Value of AddrCtl
0              Use incremented state        3
1              Use dispatch ROM 1           1
2              Use dispatch ROM 2           2
3              Use incremented state        3
4              Replace state number by 0    0
5              Replace state number by 0    0
6              Use incremented state        3
7              Replace state number by 0    0
8              Replace state number by 0    0
9              Replace state number by 0    0
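The three tables above are enough to simulate the sequencer. The sketch below is a behavioral model with assumed names (`next_state`, `trace`); it follows the AddrCtl encoding from the last table, where 0 returns to state 0, 1 and 2 use the dispatch ROMs, and 3 uses the incremented state.

```python
DISPATCH_ROM_1 = {0b000000: 6, 0b000010: 9, 0b000100: 8, 0b100011: 2, 0b101011: 2}
DISPATCH_ROM_2 = {0b100011: 3, 0b101011: 5}
ADDR_CTL = [3, 1, 2, 3, 0, 0, 3, 0, 0, 0]   # indexed by state number 0..9

def next_state(state, opcode):
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0                        # replace state number by 0
    if ctl == 1:
        return DISPATCH_ROM_1[opcode]   # dispatch on the opcode after decode
    if ctl == 2:
        return DISPATCH_ROM_2[opcode]   # separate lw from sw after the address step
    return state + 1                    # use incremented state

def trace(opcode):
    """States visited while executing one instruction, starting at fetch."""
    states, state = [0], next_state(0, opcode)
    while state != 0:
        states.append(state)
        state = next_state(state, opcode)
    return states

LW, SW, RTYPE, BEQ, JMP = 0b100011, 0b101011, 0b000000, 0b000100, 0b000010
for name, op in [("lw", LW), ("sw", SW), ("R-format", RTYPE), ("beq", BEQ), ("jmp", JMP)]:
    print(name, trace(op))
```

The length of each trace is the instruction's cycle count, which ties this slide back to the 3 to 5 cycles quoted for the five execution steps.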
Microprogramming
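[Figure: control unit in which Microcode memory, addressed by the Microprogram counter, produces the outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, and RegDst to the Datapath, plus AddrCtl. An Adder and the address select logic, which also takes the instruction register opcode field, compute the next value of the Microprogram counter.]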
•
What are the “microinstructions” ?
Microprogramming
•
A specification methodology
– appropriate if hundreds of opcodes, modes, cycles, etc.
– signals specified symbolically using microinstructions
Label      ALU control   SRC1   SRC2      Register control   Memory      PCWrite control   Sequencing
Fetch      Add           PC     4         -                  Read PC     ALU               Seq
-          Add           PC     Extshft   Read               -           -                 Dispatch 1
Mem1       Add           A      Extend    -                  -           -                 Dispatch 2
LW2        -             -      -         -                  Read ALU    -                 Seq
-          -             -      -         Write MDR          -           -                 Fetch
SW2        -             -      -         -                  Write ALU   -                 Fetch
Rformat1   Func code     A      B         -                  -           -                 Seq
-          -             -      -         Write ALU          -           -                 Fetch
BEQ1       Subt          A      B         -                  -           ALUOut-cond       Fetch
JUMP1      -             -      -         -                  -           Jump address      Fetch

(A "-" marks a blank field.)
• Will two implementations of the same architecture have the same microcode?
• What would a microassembler do?
Microinstruction format
Field name       | Value        | Signals active                       | Comment
ALU control      | Add          | ALUOp = 00                           | Cause the ALU to add.
                 | Subt         | ALUOp = 01                           | Cause the ALU to subtract; this implements the compare for branches.
                 | Func code    | ALUOp = 10                           | Use the instruction's function code to determine ALU control.
SRC1             | PC           | ALUSrcA = 0                          | Use the PC as the first ALU input.
                 | A            | ALUSrcA = 1                          | Register A is the first ALU input.
SRC2             | B            | ALUSrcB = 00                         | Register B is the second ALU input.
                 | 4            | ALUSrcB = 01                         | Use 4 as the second ALU input.
                 | Extend       | ALUSrcB = 10                         | Use output of the sign extension unit as the second ALU input.
                 | Extshft      | ALUSrcB = 11                         | Use the output of the shift-by-two unit as the second ALU input.
Register control | Read         |                                      | Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B.
                 | Write ALU    | RegWrite, RegDst = 1, MemtoReg = 0   | Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data.
                 | Write MDR    | RegWrite, RegDst = 0, MemtoReg = 1   | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory           | Read PC      | MemRead, IorD = 0                    | Read memory using the PC as address; write result into IR (and the MDR).
                 | Read ALU     | MemRead, IorD = 1                    | Read memory using the ALUOut as address; write result into MDR.
                 | Write ALU    | MemWrite, IorD = 1                   | Write memory using the ALUOut as address, contents of B as the data.
PC write control | ALU          | PCSource = 00, PCWrite               | Write the output of the ALU into the PC.
                 | ALUOut-cond  | PCSource = 01, PCWriteCond           | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
                 | jump address | PCSource = 10, PCWrite               | Write the PC with the jump address from the instruction.
Sequencing       | Seq          | AddrCtl = 11                         | Choose the next microinstruction sequentially.
                 | Fetch        | AddrCtl = 00                         | Go to the first microinstruction to begin a new instruction.
                 | Dispatch 1   | AddrCtl = 01                         | Dispatch using the ROM 1.
                 | Dispatch 2   | AddrCtl = 10                         | Dispatch using the ROM 2.
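One way to picture what a microassembler does with this format: it looks up each symbolic field value and emits the control signals listed for it. The sketch below is a minimal illustration, not the course's tool; the names FIELD_SIGNALS, assemble, and FETCH are hypothetical, and the signal values are transcribed from the table above.

```python
# Symbolic (field, value) pairs mapped to the control signals they assert.
# Values are transcribed from the microinstruction format table.
FIELD_SIGNALS = {
    ("ALU control", "Add"): {"ALUOp": 0b00},
    ("ALU control", "Subt"): {"ALUOp": 0b01},
    ("ALU control", "Func code"): {"ALUOp": 0b10},
    ("SRC1", "PC"): {"ALUSrcA": 0},
    ("SRC1", "A"): {"ALUSrcA": 1},
    ("SRC2", "B"): {"ALUSrcB": 0b00},
    ("SRC2", "4"): {"ALUSrcB": 0b01},
    ("SRC2", "Extend"): {"ALUSrcB": 0b10},
    ("SRC2", "Extshft"): {"ALUSrcB": 0b11},
    ("Register control", "Read"): {},
    ("Register control", "Write ALU"): {"RegWrite": 1, "RegDst": 1, "MemtoReg": 0},
    ("Register control", "Write MDR"): {"RegWrite": 1, "RegDst": 0, "MemtoReg": 1},
    ("Memory", "Read PC"): {"MemRead": 1, "IorD": 0},
    ("Memory", "Read ALU"): {"MemRead": 1, "IorD": 1},
    ("Memory", "Write ALU"): {"MemWrite": 1, "IorD": 1},
    ("PC write control", "ALU"): {"PCSource": 0b00, "PCWrite": 1},
    ("PC write control", "ALUOut-cond"): {"PCSource": 0b01, "PCWriteCond": 1},
    ("PC write control", "jump address"): {"PCSource": 0b10, "PCWrite": 1},
    ("Sequencing", "Seq"): {"AddrCtl": 0b11},
    ("Sequencing", "Fetch"): {"AddrCtl": 0b00},
    ("Sequencing", "Dispatch 1"): {"AddrCtl": 0b01},
    ("Sequencing", "Dispatch 2"): {"AddrCtl": 0b10},
}


def assemble(microinstruction):
    """Expand one symbolic microinstruction into its control signals."""
    signals = {}
    for field, value in microinstruction.items():
        signals.update(FIELD_SIGNALS[(field, value)])
    return signals


# First microinstruction of the microprogram (label Fetch).
FETCH = {
    "ALU control": "Add",
    "SRC1": "PC",
    "SRC2": "4",
    "Memory": "Read PC",
    "PC write control": "ALU",
    "Sequencing": "Seq",
}
fetch_signals = assemble(FETCH)
```

Fields left blank in a microinstruction simply contribute no signals, which is why the dictionary for Fetch omits Register control.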
2004 Morgan Kaufmann Publishers
223
Maximally vs. Minimally Encoded
•
No encoding:
– 1 bit for each datapath operation
– faster, requires more memory (logic)
– used for the VAX-11/780: an astonishing 400K of memory!
•
Lots of encoding:
– send the microinstructions through logic to get control signals
– uses less memory, slower
•
Historical context of CISC:
– Too much logic to put on a single chip with everything else
– Use a ROM (or even RAM) to hold the microcode
– It’s easy to add new instructions
2004 Morgan Kaufmann Publishers
224
Microcode: Trade-offs
•
Distinction between specification and implementation is sometimes blurred
•
Specification Advantages:
– Easy to design and write
– Design architecture and microcode in parallel
•
Implementation (off-chip ROM) Advantages
– Easy to change since values are in memory
– Can emulate other architectures
– Can make use of internal registers
•
Implementation Disadvantages, SLOWER now that:
– Control is implemented on same chip as processor
– ROM is no longer faster than RAM
– No need to go back and make changes
2004 Morgan Kaufmann Publishers
225
Historical Perspective
•
In the ’60s and ’70s microprogramming was very important for implementing machines
•
This led to more sophisticated ISAs and the VAX
•
In the ’80s RISC processors based on pipelining became popular
•
Pipelining the microinstructions is also possible!
•
Implementations of IA-32 architecture processors since 486 use:
– “hardwired control” for simpler instructions
(few cycles, FSM control implemented using PLA or random logic)
– “microcoded control” for more complex instructions
(large numbers of cycles, central control store)
•
The IA-64 architecture uses a RISC-style ISA and can be
implemented without a large central control store
2004 Morgan Kaufmann Publishers
226
Pentium 4
•
Pipelining is important (last IA-32 without it was 80386 in 1985)
[Figure: Pentium 4 die layout with blocks labeled Control (several), I/O interface, Instruction cache, Data cache, Enhanced floating point and multimedia, Integer datapath, Advanced pipelining hyperthreading support, and Secondary cache and memory interface; annotations point to Chapter 6 and Chapter 7]
•
Pipelining is used for the simple instructions favored by compilers
“Simply put, a high performance implementation needs to ensure that the simple
instructions execute quickly, and that the burden of the complexities of the
instruction set penalize the complex, less frequently used, instructions”
2004 Morgan Kaufmann Publishers
227
Pentium 4
•
Somewhere in all that “control” we must handle complex instructions
[Figure: Pentium 4 die layout with blocks labeled Control (several), I/O interface, Instruction cache, Data cache, Enhanced floating point and multimedia, Integer datapath, Advanced pipelining hyperthreading support, and Secondary cache and memory interface]
•
Processor executes simple microinstructions, 70 bits wide (hardwired)
•
120 control lines for integer datapath (400 for floating point)
•
If an instruction requires more than 4 microinstructions to implement,
control from microcode ROM (8000 microinstructions)
•
It’s complicated!
2004 Morgan Kaufmann Publishers
228
Chapter 5 Summary
•
If we understand the instructions…
We can build a simple processor!
•
If instructions take different amounts of time, multi-cycle is better
•
Datapath implemented using:
– Combinational logic for arithmetic
– State holding elements to remember bits
•
Control implemented using:
– Combinational logic for single-cycle implementation
– Finite state machine for multi-cycle implementation
2004 Morgan Kaufmann Publishers
229
Chapter Six
2004 Morgan Kaufmann Publishers
230
Pipelining
•
Improve performance by increasing instruction throughput
[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) plotted as program execution order against time. Top, nonpipelined: each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg, and a new instruction starts every 800 ps. Bottom, pipelined: the same five stages overlap, and a new instruction starts every 200 ps.]
Note: timing assumptions changed for this example
Ideal speedup is number of stages in the pipeline. Do we achieve this?
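A quick calculation helps answer the question. The sketch below assumes the timing in the figure (800 ps per instruction without pipelining, five stages of 200 ps with pipelining); the function names are hypothetical.

```python
def nonpipelined_time_ps(n_instructions, instruction_ps=800):
    """Each instruction runs to completion before the next one starts."""
    return n_instructions * instruction_ps


def pipelined_time_ps(n_instructions, stages=5, stage_ps=200):
    """The first instruction fills the pipeline; each later one finishes one stage time after it."""
    return (stages + n_instructions - 1) * stage_ps


def speedup(n_instructions):
    return nonpipelined_time_ps(n_instructions) / pipelined_time_ps(n_instructions)


three_loads = (nonpipelined_time_ps(3), pipelined_time_ps(3))  # (2400, 1400)
many_loads = speedup(1_000_000)  # approaches 800 / 200 = 4
```

Under these assumptions the speedup approaches 4 rather than 5: the stages are not perfectly balanced, so the 200 ps clock is set by the slowest stage, and for short programs the fill time dominates.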
2004 Morgan Kaufmann Publishers
231
Pipelining
•
What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
•
What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
•
We’ll build a simple pipeline and look at these issues
•
We’ll talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.
2004 Morgan Kaufmann Publishers
232
Basic Idea
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back
[Figure: the single-cycle datapath (PC, instruction memory, registers, sign extend, shift left 2, adders, ALU, data memory) divided into the five stages above]
•
What do we need to add to actually split the datapath into stages?
2004 Morgan Kaufmann Publishers
233
Pipelined Datapath
[Figure: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the five stages]
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
2004 Morgan Kaufmann Publishers
234
Corrected Datapath
[Figure: the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; the write register number now travels through the pipeline registers with its instruction and returns to the register file from MEM/WB]
2004 Morgan Kaufmann Publishers
235
Graphically Representing Pipelines
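Time (in clock cycles) runs left to right; program execution order (in instructions) runs top to bottom.

Instruction    | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7
lw $1, 100($0) | IM   | Reg  | ALU  | DM   | Reg  |      |
lw $2, 200($0) |      | IM   | Reg  | ALU  | DM   | Reg  |
lw $3, 300($0) |      |      | IM   | Reg  | ALU  | DM   | Reg
•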
Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
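The same diagram can be generated mechanically, which makes the two questions above easy to answer. This is a sketch assuming an ideal five-stage pipeline with no stalls; pipeline_diagram and busy_in_cycle are hypothetical helper names.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]
PROGRAM = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]


def pipeline_diagram(instructions):
    """Return one row per instruction; column k holds the unit used in clock cycle k+1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for start, _ in enumerate(instructions):
        row = [""] * total_cycles
        for offset, unit in enumerate(STAGES):
            row[start + offset] = unit
        rows.append(row)
    return rows


def busy_in_cycle(instructions, unit, cycle):
    """List the instructions that occupy the given unit in the given clock cycle (1-based)."""
    rows = pipeline_diagram(instructions)
    return [ins for ins, row in zip(instructions, rows) if row[cycle - 1] == unit]


rows = pipeline_diagram(PROGRAM)
cycles_needed = len(rows[0])                     # 7
alu_in_cc4 = busy_in_cycle(PROGRAM, "ALU", 4)    # the second lw
```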
2004 Morgan Kaufmann Publishers
236
Pipeline Control
[Figure: the pipelined datapath with its control signals labeled: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, and MemtoReg. Instruction bits [15–0] feed the sign extend unit, and the low 6 bits go on to ALU control; instruction bits [20–16] and [15–11] are the two candidate write register numbers selected by RegDst.]
2004 Morgan Kaufmann Publishers
237
Pipeline control
•
We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
•
How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
2004 Morgan Kaufmann Publishers
238
Pipeline Control
•
Pass control signals along just like the data
Control lines, grouped by the stage that uses them:

            | Execution/Address Calculation stage | Memory access stage       | Write-back stage
Instruction | RegDst  ALUOp1  ALUOp0  ALUSrc      | Branch  MemRead  MemWrite | RegWrite  MemtoReg
R-format    | 1       1       0       0           | 0       0        0        | 1         0
lw          | 0       0       0       1           | 0       1        0        | 1         1
sw          | X       0       0       1           | 0       0        1        | 0         X
beq         | X       0       1       0           | 1       0        0        | 0         X

[Figure: the Control unit produces the EX, M, and WB signal groups from the instruction in IF/ID. ID/EX holds all three groups, EX/MEM holds M and WB, and MEM/WB holds only WB.]
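A small model makes the "pass control along with the data" idea concrete. The sketch below stores the control table as a dictionary and shows which signals each pipeline register still carries; don't-care entries (X) are represented as None, and the names CONTROL and control_in_flight are hypothetical.

```python
# Control values per instruction class, grouped by the stage that consumes them.
# None stands for a don't-care (X) entry.
CONTROL = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
    "beq": {
        "EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
        "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
}


def control_in_flight(instruction):
    """Signals held in each pipeline register as the instruction advances."""
    c = CONTROL[instruction]
    return {
        "ID/EX": {**c["EX"], **c["M"], **c["WB"]},  # all nine signals
        "EX/MEM": {**c["M"], **c["WB"]},            # EX group already used
        "MEM/WB": dict(c["WB"]),                    # only write-back signals remain
    }


lw_control = control_in_flight("lw")
```

Each stage consumes its own group and forwards the rest, so no central state machine has to track where every instruction is.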
2004 Morgan Kaufmann Publishers
239
Datapath with Control
[Figure: the pipelined datapath with the Control unit in the ID stage. The EX, M, and WB control groups are carried in ID/EX, EX/MEM, and MEM/WB and drive PCSrc, Branch, ALUSrc, ALUOp, RegDst, MemRead, and the other control lines in their stages. Instruction bits [15–0] feed the sign extend unit and ALU control (low 6 bits); bits [20–16] and [15–11] are selected by RegDst.]
2004 Morgan Kaufmann Publishers
240
Dependencies
•
Problem with starting next instruction before first is finished
– dependencies that “go backward in time” are data hazards
Time (in clock cycles) | CC 1 | CC 2 | CC 3 | CC 4 | CC 5   | CC 6 | CC 7 | CC 8 | CC 9
Value of register $2   | 10   | 10   | 10   | 10   | 10/–20 | –20  | –20  | –20  | –20

Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: each instruction flows through IM, Reg, ALU, DM, Reg, starting one clock cycle after the previous one; sub starts in CC 1 and writes $2 in CC 5]
2004 Morgan Kaufmann Publishers
241
Software Solution
•
Have compiler guarantee no hazards
•
Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
•
Problem: this really slows us down!
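To see where the nops would go, the sketch below pads a program so that every consumer sits at least three slots after its producer. That distance is an assumption: a five-stage pipeline with no ALU forwarding, where the register file can be written and then read in the same clock cycle. The tuple layout and the name insert_nops are hypothetical.

```python
# Each instruction: (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]


def insert_nops(program, min_distance=3):
    """Insert nops so a register is read no sooner than min_distance slots after it is written."""
    scheduled = []  # (text, destination) pairs, including inserted nops
    for text, dest, sources in program:
        needed = 0
        for back, (_, prev_dest) in enumerate(reversed(scheduled), start=1):
            if back >= min_distance:
                break
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([("nop", None)] * needed)
        scheduled.append((text, dest))
    return [text for text, _ in scheduled]


padded = insert_nops(PROGRAM)
```

Under these assumptions two nops land between sub and and; the or, add, and sw are then far enough from sub to read the new value of $2.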
2004 Morgan Kaufmann Publishers
242
Forwarding
•
Use temporary results, don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
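Time (in clock cycles) | CC 1 | CC 2 | CC 3 | CC 4 | CC 5   | CC 6 | CC 7 | CC 8 | CC 9
Value of register $2   | 10   | 10   | 10   | 10   | 10/–20 | –20  | –20  | –20  | –20
Value of EX/MEM        | X    | X    | X    | –20  | X      | X    | X    | X    | X
Value of MEM/WB        | X    | X    | X    | X    | –20    | X    | X    | X    | X

Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
(annotation on one of the $2 operands: what if this $2 was $13?)
[Figure: each instruction flows through IM, Reg, ALU, DM, Reg, starting one clock cycle after the previous one; the result of sub is forwarded from the pipeline registers to the later instructions]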
2004 Morgan Kaufmann Publishers
243
Forwarding
•
The main idea (some details not shown)
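[Figure: forwarding hardware between ID/EX, EX/MEM, and MEM/WB. Two multiplexors, controlled by ForwardA and ForwardB, select each ALU input from the register values in ID/EX or from results held in EX/MEM and MEM/WB. The forwarding unit compares the Rs and Rt fields carried in ID/EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]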
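The comparison the forwarding unit performs can be sketched in a few lines. The slide omits details, so the conditions below are the commonly taught ones and should be read as an assumption: forward only from an instruction that writes a register, never forward register $0, and prefer the newer EX/MEM result over MEM/WB. The function name and dictionary keys are hypothetical.

```python
FROM_REGISTER_FILE = 0b00
FROM_MEM_WB = 0b01
FROM_EX_MEM = 0b10


def forward_select(source_reg, ex_mem, mem_wb):
    """Mux select for one ALU input (ForwardA for Rs, ForwardB for Rt)."""
    if ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0 and ex_mem["RegisterRd"] == source_reg:
        return FROM_EX_MEM   # newest value, produced by the previous instruction
    if mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0 and mem_wb["RegisterRd"] == source_reg:
        return FROM_MEM_WB   # value produced two instructions earlier
    return FROM_REGISTER_FILE


# "and $12, $2, $5" in EX while "sub $2, $1, $3" is in MEM.
idle = {"RegWrite": 0, "RegisterRd": 0}
sub_result = {"RegWrite": 1, "RegisterRd": 2}
forward_a_for_and = forward_select(2, ex_mem=sub_result, mem_wb=idle)

# "or $13, $6, $2" in EX: "and" is in MEM, "sub" is in WB.
and_result = {"RegWrite": 1, "RegisterRd": 12}
forward_b_for_or = forward_select(2, ex_mem=and_result, mem_wb=sub_result)
```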
2004 Morgan Kaufmann Publishers
244
Can't always forward
•
Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction
that writes to the same register.
Time (in clock cycles): CC 1 to CC 9
Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: each instruction flows through IM, Reg, ALU, DM, Reg, starting one clock cycle after the previous one. The lw has $2 only at the end of its DM stage in CC 4, but the and needs it at the start of its ALU stage in CC 4.]
•
Thus, we need a hazard detection unit to “stall” the instruction that follows the load
2004 Morgan Kaufmann Publishers
245
Stalling
•
We can stall the pipeline by keeping an instruction in the same stage
Time (in clock cycles): CC 1 to CC 10
Program execution order (in instructions):
lw $2, 20($1)
and becomes nop (bubble)
add $4, $2, $5
or $8, $2, $6
add $9, $4, $2
[Figure: the lw flows through IM, Reg, ALU, DM, Reg. The following instruction is held in the same stage for one cycle and a bubble moves down the pipeline in its place, so every later instruction finishes one clock cycle later.]
2004 Morgan Kaufmann Publishers
246
Hazard Detection Unit
•
Stall by letting an instruction that won’t write anything go forward
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt; it can hold the PC and IF/ID and, through a multiplexor, replace the control values entering ID/EX with 0. IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are carried forward as Rs, Rt, and Rd for the forwarding unit.]
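The stall condition can be written directly from the signals named in the figure. This is a sketch of the usual load-use check, stated as an assumption since the slide does not spell it out; must_stall is a hypothetical name.

```python
def must_stall(id_ex_mem_read, id_ex_register_rt, if_id_register_rs, if_id_register_rt):
    """True when the instruction in EX is a load whose destination is read by the instruction in ID."""
    return bool(id_ex_mem_read) and id_ex_register_rt in (if_id_register_rs, if_id_register_rt)


# lw $2, 20($1) in EX; and $4, $2, $5 in ID: Rs of the and matches Rt of the load.
load_use = must_stall(1, 2, 2, 5)

# Same registers, but the instruction in EX is not a load: forwarding is enough.
alu_use = must_stall(0, 2, 2, 5)
```

When the check fires, the PC and IF/ID keep their values for a cycle and the control signals entering ID/EX are zeroed, which is the instruction that "won't write anything" going forward.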
2004 Morgan Kaufmann Publishers
247
Branch Hazards
•
When we decide to branch, other instructions are in the pipeline!
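Time (in clock cycles): CC 1 to CC 9
Program execution order (in instructions):
40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Figure: each instruction flows through IM, Reg, ALU, DM, Reg, starting one clock cycle after the previous one; the instructions at 44, 48, and 52 are already in the pipeline when the branch at 40 is resolved and the target at 72 is fetched]
•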
We are predicting “branch not taken”
– need to add hardware for flushing instructions if we are wrong
2004 Morgan Kaufmann Publishers
248
Flushing Instructions
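[Figure: pipelined datapath with the branch hardware moved into the ID stage: an adder for the branch target (sign extend, shift left 2) and an equality comparator (=) on the register outputs. An IF.Flush signal clears the instruction in IF/ID. The hazard detection unit, the multiplexor that zeroes the control values, and the forwarding unit remain.]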
Note: we’ve also moved branch decision to ID stage
2004 Morgan Kaufmann Publishers
249
Branches
•
If the branch is taken, we have a penalty of one cycle
•
For our simple design, this is reasonable
•
With deeper pipelines, penalty increases and static branch prediction
drastically hurts performance
•
Solution: dynamic branch prediction
[Figure: state diagram with four states, two labeled “Predict taken” and two labeled “Predict not taken”, connected by Taken and Not taken transitions]
A 2-bit prediction scheme
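A 2-bit scheme can be modeled as a saturating counter: the prediction flips only after two wrong guesses in a row. The exact transitions of the slide's state diagram are not recoverable from the text, so the sketch below is one common realization and an assumption; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Saturating counter: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, counter=3):
        self.counter = counter

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes, predictor=None):
    if predictor is None:
        predictor = TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch taken 9 times and then not taken once, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
loop_mispredictions = count_mispredictions(loop_outcomes)
```

With this pattern the predictor is wrong only at each loop exit; a 1-bit scheme would also mispredict the first iteration of the second pass.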
2004 Morgan Kaufmann Publishers
250
Branch Prediction
•
Sophisticated Techniques:
– A “branch target buffer” to help us look up the destination
– Correlating predictors that base prediction on global behavior
and recently executed branches (e.g., prediction for a specific
branch instruction based on what happened in previous branches)
– Tournament predictors that use different types of prediction
strategies and keep track of which one is performing best.
– A “branch delay slot” which the compiler tries to fill with a useful
instruction (make the one cycle delay part of the ISA)
•
Branch prediction is especially important because it enables other
more advanced pipelining techniques to be effective!
•
Modern processors predict correctly 95% of the time!
2004 Morgan Kaufmann Publishers
251
Improving Performance
•
Try and avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
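One possible reordering swaps the two stores, so that the instruction right after the second load no longer needs the value just loaded. The sketch below counts load-use stalls before and after, assuming a one-cycle stall whenever the instruction immediately following a lw reads the loaded register (including store data); the tuple layout and function name are hypothetical.

```python
# Each instruction: (opcode, destination register or None, registers read).
ORIGINAL = [
    ("lw", "$t0", ["$t1"]),          # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),          # lw $t2, 4($t1)
    ("sw", None, ["$t2", "$t1"]),    # sw $t2, 0($t1)
    ("sw", None, ["$t0", "$t1"]),    # sw $t0, 4($t1)
]

# The stores write different addresses, so swapping them preserves the result.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]


def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before them."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        if previous[0] == "lw" and previous[1] in current[2]:
            stalls += 1
    return stalls


stalls_before = load_use_stalls(ORIGINAL)
stalls_after = load_use_stalls(REORDERED)
```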
•
Dynamic Pipeline Scheduling
– Hardware chooses which instructions to execute next
– Will execute instructions out of order (e.g., doesn’t wait for a
dependency to be resolved, but rather keeps going!)
– Speculates on branches and keeps the pipeline full
(may need to rollback if prediction incorrect)
•
Trying to exploit instruction-level parallelism
2004 Morgan Kaufmann Publishers
252
Advanced Pipelining
•
•
•
•
Increase the depth of the pipeline
Start more than one instruction each cycle (multiple issue)
Loop unrolling to expose more ILP (better scheduling)
“Superscalar” processors
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
•
All modern processors are superscalar and issue multiple
instructions usually with some limitations (e.g., different “pipes”)
•
VLIW: very long instruction word, static multiple issue
(relies more on compiler technology)
•
This class has given you the background you need to learn more!
2004 Morgan Kaufmann Publishers
253
Chapter 6 Summary
•
Pipelining does not improve latency, but does improve throughput
[Figure: two charts comparing Single-cycle (Section 5.4), Multicycle (Section 5.5), Pipelined, Deeply pipelined, Multiple-issue pipelined (Section 6.9), and Multiple issue with deep pipeline (Section 6.10). Axis labels: Slower to Faster; Instructions per clock (IPC = 1/CPI); Use latency in instructions (1 to Several).]
2004 Morgan Kaufmann Publishers
254
Chapter Seven
2004 Morgan Kaufmann Publishers
255
Memories: Review
•
SRAM:
– value is stored on a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors)
•
DRAM:
– value is stored as a charge on capacitor (must be refreshed)
– very small but slower than SRAM (factor of 5 to 10)
Word line
A
A
B
B
Pass transistor
Capacitor
Bit line
2004 Morgan Kaufmann Publishers
256
Exploiting Memory Hierarchy
•
Users want large and fast memories!
Typical figures for 2004:
– SRAM access times are 0.5 to 5 ns at a cost of $4000 to $10,000 per GB.
– DRAM access times are 50 to 70 ns at a cost of $100 to $200 per GB.
– Disk access times are 5 to 20 million ns at a cost of $0.50 to $2 per GB.
•
Try and give it to them anyway
– build a memory hierarchy
[Figure: levels in the memory hierarchy. The CPU sits above Level 1, Level 2, ..., Level n; distance from the CPU in access time increases going down, as does the size of the memory at each level.]
2004 Morgan Kaufmann Publishers
257
Locality
•
A principle that makes having a memory hierarchy a good idea
•
If an item is referenced,
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon.
Why does code have locality?
•
Our initial focus: two levels (upper, lower)
– block: minimum unit of data
– hit: data requested is in the upper level
– miss: data requested is not in the upper level
2004 Morgan Kaufmann Publishers
258
Cache
•
Two issues:
– How do we know if a data item is in the cache?
– If it is, how do we find it?
•
Our first example:
– block size is one word of data
– "direct mapped"
For each item of data at the lower level,
there is exactly one location in the cache where it might be.
e.g., lots of items at the lower level share locations in the upper level
2004 Morgan Kaufmann Publishers
259
Direct Mapped Cache
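•
Mapping: cache location = (block address) modulo (number of blocks in the cache)
[Figure: an 8-block cache (locations 000 to 111) and memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101. Addresses ending in 001 map to cache location 001, and addresses ending in 101 map to cache location 101.]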
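The mapping in the figure is a single modulo operation. The sketch below uses the eight memory addresses shown and an 8-block cache; cache_location is a hypothetical name.

```python
NUM_BLOCKS = 8
MEMORY_ADDRESSES = [0b00001, 0b00101, 0b01001, 0b01101,
                    0b10001, 0b10101, 0b11001, 0b11101]


def cache_location(block_address, num_blocks=NUM_BLOCKS):
    """Direct-mapped placement: the low-order bits of the block address pick the location."""
    return block_address % num_blocks


locations = [cache_location(a) for a in MEMORY_ADDRESSES]
# Half of the addresses land on 0b001 and the other half on 0b101.
```

Because the number of blocks is a power of two, the modulo is just the low three address bits, and the remaining upper bits become the tag that tells the sharing addresses apart.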
2004 Morgan Kaufmann Publishers
260
Direct Mapped Cache
•
For MIPS:
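[Figure: direct-mapped cache with 1024 one-word entries (index 0 to 1023), each holding Valid, Tag, and Data. The 32-bit address (showing bit positions) splits into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). Hit is asserted when the stored tag equals the address tag and the entry is valid; the 32-bit Data word is read out.]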
What kind of locality are we taking advantage of?
2004 Morgan Kaufmann Publishers
261
Direct Mapped Cache
•
Taking advantage of spatial locality:
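[Figure: direct-mapped cache with 256 entries, each holding V, an 18-bit Tag, and 512 bits of Data (16 words). The 32-bit address (showing bit positions) splits into an 18-bit tag (bits 31–14), an 8-bit index (bits 13–6), a 4-bit block offset (bits 5–2), and a 2-bit byte offset (bits 1–0). A 16-to-1 Mux, controlled by the block offset, selects one 32-bit word; Hit comes from the tag comparison.]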
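Both address layouts can be expressed with one helper that peels fields off the low end of the address. The field widths come from the two figures (20/10/0/2 for one-word blocks, 18/8/4/2 for 16-word blocks); split_address and the sample address are hypothetical.

```python
def split_address(address, index_bits, block_offset_bits, byte_offset_bits=2):
    """Return (tag, index, block_offset, byte_offset) for a 32-bit byte address."""
    byte_offset = address & ((1 << byte_offset_bits) - 1)
    address >>= byte_offset_bits
    block_offset = address & ((1 << block_offset_bits) - 1)
    address >>= block_offset_bits
    index = address & ((1 << index_bits) - 1)
    tag = address >> index_bits
    return tag, index, block_offset, byte_offset


SAMPLE = 0x12345678

# 1024 entries, one word per block: 20-bit tag, 10-bit index.
one_word_blocks = split_address(SAMPLE, index_bits=10, block_offset_bits=0)

# 256 entries, 16 words per block: 18-bit tag, 8-bit index, 4-bit block offset.
sixteen_word_blocks = split_address(SAMPLE, index_bits=8, block_offset_bits=4)
```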
2004 Morgan Kaufmann Publishers
262
Hits vs. Misses
•
Read hits
– this is what we want!
•
Read misses
– stall the CPU, fetch block from memory, deliver to cache, restart
•
Write hits:
– can replace data in cache and memory (write-through)
– write the data only into the cache (write-back the cache later)
•
Write misses:
– read the entire block into the cache, then write the word
2004 Morgan Kaufmann Publishers
263
Hardware Issues
•
Make reading multiple words easier by using banks of memory
[Figure: three ways to connect CPU, Cache, Bus, and Memory.
a. One-word-wide memory organization
b. Wide memory organization (a Multiplexor sits between the CPU and the wide Cache)
c. Interleaved memory organization (Memory bank 0, bank 1, bank 2, bank 3 share one Bus)]
•
It can get a lot more complicated...
2004 Morgan Kaufmann Publishers
264
Performance
•
Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0% to 40%) versus block size (4, 16, 64, 256 bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB]
•
Use split caches because there is more spatial locality in code:

Program | Block size in words | Instruction miss rate | Data miss rate | Effective combined miss rate
gcc     | 1                   | 6.1%                  | 2.1%           | 5.4%
gcc     | 4                   | 2.0%                  | 1.7%           | 1.9%
spice   | 1                   | 1.2%                  | 1.3%           | 1.2%
spice   | 4                   | 0.3%                  | 0.6%           | 0.4%
2004 Morgan Kaufmann Publishers
265
Performance
•
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
•
Two ways of improving performance:
– decreasing the miss ratio
– decreasing the miss penalty
What happens if we increase block size?
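Plugging numbers into the simplified model shows how the two terms trade off. All values below are made up for illustration (they are not from the slides), and the function names are hypothetical.

```python
def stall_cycles(instructions, miss_ratio, miss_penalty):
    return instructions * miss_ratio * miss_penalty


def execution_time_ns(execution_cycles, instructions, miss_ratio, miss_penalty, cycle_time_ns):
    stalls = stall_cycles(instructions, miss_ratio, miss_penalty)
    return (execution_cycles + stalls) * cycle_time_ns


# Hypothetical program: 1,000,000 instructions at one cycle each, 0.5 ns clock.
baseline = execution_time_ns(1_000_000, 1_000_000, miss_ratio=0.02,
                             miss_penalty=100, cycle_time_ns=0.5)

# A larger block might lower the miss ratio but raise the miss penalty.
larger_block = execution_time_ns(1_000_000, 1_000_000, miss_ratio=0.015,
                                 miss_penalty=140, cycle_time_ns=0.5)
```

With these invented numbers the larger block is slightly slower even though it misses less often, because each miss takes longer to service.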
2004 Morgan Kaufmann Publishers
266
Decreasing miss ratio with associativity
[Figure: an eight-block cache organized four ways, each block holding a Tag and Data:
– One-way set associative (direct mapped): 8 blocks, numbered 0 to 7
– Two-way set associative: 4 sets, numbered 0 to 3
– Four-way set associative: 2 sets, numbered 0 and 1
– Eight-way set associative (fully associative): a single set of 8 blocks]
Compared to direct mapped, give a series of references that:
– results in a lower miss ratio using a 2-way set associative cache
– results in a higher miss ratio using a 2-way set associative cache
assuming we use the “least recently used” replacement strategy
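A tiny simulator is enough to try candidate reference strings. The sketch below assumes an eight-block cache, block addresses as references, and LRU replacement within each set; count_misses and the two example strings are hypothetical, not from the slides.

```python
def count_misses(references, num_blocks=8, ways=1):
    """Count misses for a set-associative cache with LRU replacement."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]  # each list is ordered least to most recently used
    misses = 0
    for block in references:
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)       # hit: refresh its recency below
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)          # evict the least recently used block
        lines.append(block)
    return misses


# Blocks 0 and 8 collide in the direct-mapped cache but share a two-way set.
helps = [0, 8, 0, 8]
helps_misses = (count_misses(helps, ways=1), count_misses(helps, ways=2))

# Blocks 0, 4, 8 all fall in set 0 of the two-way cache and thrash under LRU,
# while block 4 has its own location in the direct-mapped cache.
hurts = [0, 4, 8, 0, 4, 8]
hurts_misses = (count_misses(hurts, ways=1), count_misses(hurts, ways=2))
```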
2004 Morgan Kaufmann Publishers
267
An implementation
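[Figure: a four-way set-associative cache with 256 sets (index 0 to 255). The 32-bit address splits into a 22-bit tag (bits 31–10), an 8-bit index (bits 9–2), and a 2-bit byte offset (bits 1–0). Each set holds four (V, Tag, Data) entries; four comparators check the 22-bit tags in parallel, and a 4-to-1 multiplexor selects the 32-bit Data word. Outputs: Hit and Data.]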
2004 Morgan Kaufmann Publishers
268
Performance
[Figure: miss rate (0 to 15%) versus associativity (One-way, Two-way, Four-way, Eight-way), one curve per cache size: 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB]
2004 Morgan Kaufmann Publishers
269
Decreasing miss penalty with multilevel caches
•
Add a second level cache:
– often primary cache is on the same chip as the processor
– use SRAMs to add another cache above primary memory (DRAM)
– miss penalty goes down if data is in 2nd level cache
•
Example:
– CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access
– Adding 2nd level cache with 5 ns access time decreases miss rate to main memory to 0.5%
•
Using multilevel caches:
– try and optimize the hit time on the 1st level cache
– try and optimize the miss rate on the 2nd level cache
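The example can be worked through numerically. The sketch below uses the slide's numbers and reads the 0.5% figure as the fraction of instructions that still go to main memory once the second-level cache is added (an assumption); variable names are hypothetical.

```python
CLOCK_GHZ = 5  # 5 GHz clock, so 1 ns is 5 cycles


def cycles(nanoseconds):
    return nanoseconds * CLOCK_GHZ


base_cpi = 1.0
primary_miss_rate = 0.05        # misses per instruction in the first-level cache
main_memory_miss_rate = 0.005   # misses per instruction that also miss in the second level

main_memory_penalty = cycles(100)   # 500 cycles
second_level_penalty = cycles(5)    # 25 cycles

# One level: every primary miss pays the full DRAM access.
cpi_one_level = base_cpi + primary_miss_rate * main_memory_penalty

# Two levels: every primary miss pays the second-level access,
# and the few that also miss there pay the DRAM access on top.
cpi_two_level = (base_cpi
                 + primary_miss_rate * second_level_penalty
                 + main_memory_miss_rate * main_memory_penalty)

speedup = cpi_one_level / cpi_two_level
```

Under these assumptions the CPI drops from about 26 to about 4.75, a speedup of roughly 5.5.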
2004 Morgan Kaufmann Publishers
270
Cache Complexities
•
Not always easy to understand implications of caches:
[Figure: two plots of Radix sort and Quicksort against Size (K items to sort), from 4 to 4096. Left panel: Theoretical behavior of Radix sort vs. Quicksort. Right panel: Observed behavior of Radix sort vs. Quicksort.]
2004 Morgan Kaufmann Publishers
271
Cache Complexities
•
Here is why:
[Figure: Radix sort and Quicksort plotted on a scale of 0 to 5 against Size (K items to sort), from 4 to 4096; the Radix sort curve lies above the Quicksort curve]
•
Memory system performance is often critical factor
– multilevel caches, pipelined processors, make it harder to predict outcomes
– Compiler optimizations to increase locality sometimes hurt ILP
•
Difficult to predict best algorithm: need experimental data
2004 Morgan Kaufmann Publishers
272
Virtual Memory
•
Main memory can act as a cache for the secondary storage (disk)
Virtual addresses
Physical addresses
Address translation
Disk addresses
•
Advantages:
– illusion of having more physical memory
– program relocation
– protection
2004 Morgan Kaufmann Publishers
273
Pages: virtual memory blocks
•
Page faults: the data is not in memory, retrieve it from disk
– huge miss penalty, thus pages should be fairly large (e.g., 4KB)
– reducing page faults is important (LRU is worth the price)
– can handle the faults in software instead of hardware
– using write-through is too expensive so we use writeback
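[Figure: address translation with 4 KB pages. The 32-bit virtual address is a virtual page number (bits 31–12) and a page offset (bits 11–0). Translation replaces the virtual page number with a physical page number (bits 29–12 of the physical address); the page offset (bits 11–0) is copied unchanged.]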
2004 Morgan Kaufmann Publishers
274
Page Tables
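[Figure: the virtual page number indexes the page table. Each entry has a Valid bit and a physical page or disk address: entries with Valid = 1 point into physical memory, and entries with Valid = 0 point to disk storage.]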
2004 Morgan Kaufmann Publishers
275
Page Tables
[Figure: the page table register points to the page table. The virtual address is a 20-bit virtual page number (bits 31–12) and a 12-bit page offset (bits 11–0). The virtual page number selects a page table entry holding a Valid bit and an 18-bit physical page number; if Valid is 0 then the page is not present in memory. The physical address is the physical page number (bits 29–12) followed by the page offset (bits 11–0).]
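The translation step can be sketched with the field widths from the figure (20-bit virtual page number, 12-bit page offset, 18-bit physical page number). The page table is modeled as a dictionary here purely for illustration; the names and the example entries are hypothetical.

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages


class PageFault(Exception):
    """Raised when the valid bit is 0: the page must be fetched from disk."""


def translate(virtual_address, page_table):
    """Map a 32-bit virtual address to a physical address via (valid, ppn) entries."""
    virtual_page_number = virtual_address >> PAGE_OFFSET_BITS
    page_offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, physical_page_number = page_table.get(virtual_page_number, (0, None))
    if not valid:
        raise PageFault(hex(virtual_page_number))
    return (physical_page_number << PAGE_OFFSET_BITS) | page_offset


# Hypothetical entries: one resident page and one page that is on disk.
EXAMPLE_PAGE_TABLE = {
    0x12345: (1, 0x2ABCD),
    0x00001: (0, None),
}
physical = translate(0x12345678, EXAMPLE_PAGE_TABLE)
```

Only the page number is translated; the 12-bit offset passes through unchanged, which is why the page size fixes the split point.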
2004 Morgan Kaufmann Publishers
276
Making Address Translation Fast
•
A cache for address translations: translation lookaside buffer
[Figure: the TLB holds a subset of the page table. Each TLB entry has Valid, Dirty, and Ref bits, a Tag (the virtual page number), and a physical page address. Each page table entry has Valid, Dirty, and Ref bits and a physical page or disk address, pointing into physical memory or disk storage.]
Typical values:
– 16–512 entries
– miss rate: 0.01%–1%
– miss penalty: 10–100 cycles
2004 Morgan Kaufmann Publishers
277
TLBs and caches
[Flowchart]
1. Virtual address: TLB access
2. TLB hit?
   – No: TLB miss exception
   – Yes: physical address is available; continue
3. Write?
   – No (read): try to read data from cache
     – Cache hit? No: cache miss stall while read block
     – Cache hit? Yes: deliver data to the CPU
   – Yes (write): write access bit on?
     – No: write protection exception
     – Yes: try to write data to cache
       – Cache hit? No: cache miss stall while read block
       – Cache hit? Yes: write data into cache, update the dirty bit, and put the data and the address into the write buffer
2004 Morgan Kaufmann Publishers
278
TLBs and Caches
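[Figure: TLB and cache in series. The virtual address is a 20-bit virtual page number (bits 31–12) and a 12-bit page offset (bits 11–0). The TLB compares the virtual page number against the Tag of every entry (each with Valid, Dirty, Tag, and a 20-bit physical page number) and signals TLB hit. The physical address (physical page number plus page offset) is then split into an 18-bit physical address tag, an 8-bit cache index, a 4-bit block offset, and a 2-bit byte offset. The cache entry (Valid, Tag, Data) at that index is compared with the tag to produce Cache hit and 32 bits of Data.]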
2004 Morgan Kaufmann Publishers
279
Modern Systems
•
2004 Morgan Kaufmann Publishers
280
Modern Systems
•
Things are getting complicated!
2004 Morgan Kaufmann Publishers
281
Some Issues
•
Processor speeds continue to increase very fast
— much faster than either DRAM or disk access times
[Figure: Performance (log scale, 1 to 100,000) plotted against Year for CPU and Memory; the CPU curve rises far faster than the Memory curve]
•
Design challenge: dealing with this growing disparity
– Prefetching? 3rd level caches and more? Memory design?
2004 Morgan Kaufmann Publishers
282
Chapters 8 & 9
(partial coverage)
2004 Morgan Kaufmann Publishers
283
Interfacing Processors and Peripherals
•
I/O Design affected by many factors (expandability, resilience)
•
Performance:
– access latency
– throughput
– connection between devices and the system
– the memory hierarchy
– the operating system
•
A variety of different users (e.g., banks, supercomputers, engineers)
[Figure: a Processor with Cache attached to a Memory-I/O bus, which also connects Main memory and three I/O controllers (two Disks, Graphics output, Network); the I/O controllers signal the processor with Interrupts]
2004 Morgan Kaufmann Publishers
284
I/O
•
Important but neglected
“The difficulties in assessing and designing I/O systems have
often relegated I/O to second class status”
“courses in every aspect of computing, from programming to
computer architecture often ignore I/O or give it scanty coverage”
“textbooks leave the subject to near the end, making it easier
for students and instructors to skip it!”
•
GUILTY!
— we won’t be looking at I/O in much detail
— be sure and read Chapter 8 in its entirety.
— you should probably take a networking class!
2004 Morgan Kaufmann Publishers
285
I/O Devices
•
Very diverse devices
— behavior (i.e., input vs. output)
— partner (who is at the other end?)
— data rate
2004 Morgan Kaufmann Publishers
286
I/O Example: Disk Drives
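[Figure: a disk drive with several Platters; each Platter surface is divided into Tracks, and each Track into Sectors]
•
To access data:
– seek: position head over the proper track (3 to 14 ms avg.)
– rotational latency: wait for desired sector (on average half a rotation: 0.5 / RPM minutes)
– transfer: grab the data (one or more sectors) at 30 to 80 MB/sec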
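The three components simply add up. The sketch below computes an average access time for a hypothetical disk (6 ms seek, 10,000 RPM, 50 MB/sec, one 512-byte sector); those parameter values are illustrative picks within the ranges on the slide, not measurements, and the function name is hypothetical.

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_per_sec, bytes_requested):
    """Average access time = seek + half a rotation + transfer."""
    rotational_ms = 0.5 / rpm * 60_000  # half a rotation, minutes converted to ms
    transfer_ms = bytes_requested / (transfer_mb_per_sec * 1_000_000) * 1_000
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical disk: 6 ms average seek, 10,000 RPM, 50 MB/sec, one 512-byte sector.
access_ms = disk_access_ms(seek_ms=6, rpm=10_000, transfer_mb_per_sec=50,
                           bytes_requested=512)
```

For a single sector the transfer term is about 0.01 ms, so seek and rotational latency (6 ms and 3 ms here) dominate the total.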
2004 Morgan Kaufmann Publishers
287
I/O Example: Buses
•
Shared communication link (one or more wires)
•
Difficult design:
– may be bottleneck
– length of the bus
– number of devices
– tradeoffs (buffers for higher bandwidth increases latency)
– support for many different devices
– cost
•
Types of buses:
– processor-memory (short high speed, custom design)
– backplane (high speed, often standardized, e.g., PCI)
– I/O (lengthy, different devices, e.g., USB, Firewire)
•
Synchronous vs. Asynchronous
– use a clock and a synchronous protocol, fast and small,
but every device must operate at same rate and
clock skew requires the bus to be short
– don’t use a clock and instead use handshaking
2004 Morgan Kaufmann Publishers
288
I/O Bus Standards
•
Today we have two dominant bus standards:
2004 Morgan Kaufmann Publishers
289
Other important issues
•
Bus Arbitration:
— daisy chain arbitration (not very fair)
— centralized arbitration (requires an arbiter), e.g., PCI
— collision detection, e.g., Ethernet
•
Operating system:
— polling
— interrupts
— direct memory access (DMA)
•
Performance Analysis techniques:
— queuing theory
— simulation
— analysis, i.e., find the weakest link (see “I/O System
Design”)
•
Many new developments
2004 Morgan Kaufmann Publishers
290
Pentium 4
•
I/O Options
[Figure: Pentium 4 I/O organization, with the link rates labeled in the figure]
Pentium 4 processor
– System bus (800 MHz, 6.4 GB/sec) to the memory controller hub
Memory controller hub (north bridge) 82875P:
– Main memory DIMMs: two DDR 400 channels (3.2 GB/sec each)
– Graphics output: AGP 8X (2.1 GB/sec)
– 1 Gbit Ethernet: CSA (0.266 GB/sec)
– Link to the I/O controller hub (266 MB/sec)
I/O controller hub (south bridge) 82801EB:
– Disks: two Serial ATA ports (150 MB/sec each)
– CD/DVD and Tape: two Parallel ATA ports (100 MB/sec each)
– Stereo (surround sound): AC/97 (1 MB/sec)
– USB 2.0 (60 MB/sec)
– 10/100 Mbit Ethernet (20 MB/sec)
– PCI bus (132 MB/sec)
2004 Morgan Kaufmann Publishers
291
Fallacies and Pitfalls
•
Fallacy: the rated mean time to failure of disks is 1,200,000 hours,
so disks practically never fail.
•
Fallacy: magnetic disk storage is on its last legs, will be replaced.
•
Fallacy: A 100 MB/sec bus can transfer 100 MB/sec.
•
Pitfall: Moving functions from the CPU to the I/O processor,
expecting to improve performance without analysis.
2004 Morgan Kaufmann Publishers
292
Multiprocessors
•
Idea: create powerful computers by connecting many smaller ones
– good news: works for timesharing (better than supercomputer)
– bad news: it’s really hard to write good concurrent programs
– many commercial failures
[Figure: two organizations. In one, Processors, each with its own Cache, share a Single bus with Memory and I/O. In the other, Processors, each with its own Cache and Memory, are connected by a Network.]
2004 Morgan Kaufmann Publishers
293
Questions
•
How do parallel processors share data?
— single address space (SMP vs. NUMA)
— message passing
•
How do parallel processors coordinate?
— synchronization (locks, semaphores)
— built into send / receive primitives
— operating system protocols
•
How are they implemented?
— connected by a single bus
— connected by a network
2004 Morgan Kaufmann Publishers
294
Supercomputers
Plot of top 500 supercomputer sites over a decade:
[Figure: stacked plot of the 500 sites by architecture, 1993 to 2000 (vertical axis 0 to 500): Single instruction multiple data (SIMD), Cluster (network of workstations), Cluster (network of SMPs), Massively parallel processors (MPPs), Shared-memory multiprocessors (SMPs), Uniprocessors]
2004 Morgan Kaufmann Publishers
295
Using multiple processors: an old idea
•
Some SIMD designs:
•
Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in
1972 despite completion of only ¼ of the machine. It took three more years
before it was operational!
“For better or worse, computer architects are not easily discouraged”
Lots of interesting designs and ideas, lots of failures, few successes
2004 Morgan Kaufmann Publishers
296
Topologies
[Figure: four network topologies.
a. 2-D grid or mesh of 16 nodes
b. n-cube tree of 8 nodes (8 = 2^3 so n = 3)
a. Crossbar (processors P0 to P7)
b. Omega network (processors P0 to P7)]
2004 Morgan Kaufmann Publishers
297
Clusters
•
Constructed from whole computers
•
Independent, scalable networks
•
Strengths:
– Many applications amenable to loosely coupled machines
– Exploit local area networks
– Cost effective / Easy to expand
•
Weaknesses:
– Administration costs not necessarily lower
– Connected using I/O bus
•
Highly available due to separation of memories
•
In theory, we should be able to do better
2004 Morgan Kaufmann Publishers
298
Google
•
Serve an average of 1000 queries per second
•
Google uses 6,000 processors and 12,000 disks
•
Two sites in Silicon Valley, two in Virginia
•
Each site connected to internet using OC48 (2488 Mbit/sec)
•
Reliability:
– On an average day, 20 machines need to be rebooted (software error)
– 2% of the machines replaced each year
•
In some sense, simple ideas well executed. Better (and cheaper)
than other approaches involving increased complexity
2004 Morgan Kaufmann Publishers
299
Concluding Remarks
•
Evolution vs. Revolution
“More often the expense of innovation comes from being too disruptive
to computer users”
“Acceptance of hardware ideas requires acceptance by software
people; therefore hardware people should learn about software. And if
software people want good machines, they must learn more about hardware
to be able to communicate with and thereby influence hardware engineers.”
2004 Morgan Kaufmann Publishers
300