Computer Science 3724
Fall Semester, 2014
1
What will we do in this course?
• We will look at the design of an instruction set for a simple
processor.
The processor is based on a “real” processor, the MIPS R2000.
• We will see how logic relates to switching (and transistors) and
how logic forms a calculus for designing digital circuits.
• We will construct the basic logic blocks required to build a simple
computer.
• We will look at the internal structure of that simple processor,
having a 32 bit instruction length and a 32 bit data word.
• We will design the processor, and add enhancements to improve
the speed of execution of its instructions.
• Then we will design a memory system for the processor, and see
how we can match its speed to the processor.
2
Why bother with all this?
• Both software and hardware affect performance. Understanding
how they interact is essential.
• Understanding how computers work helps us be better programmers.
• We may have to provide advice on which computer to purchase
for some application.
• Computing performance has improved exponentially for 40 years.
– Why is the growth rate so fast?
– How long can this continue?
– How does this growth affect the programs I design?
– How does it affect the value of hardware and software?
• How does increased computation speed affect computer peripherals? (e.g., input/output devices.)
3
About questions???
“Who questions much, shall learn much, and retain much.” —
Francis Bacon
“Asking a question is embarrassing for a moment, but not asking
is embarrassing for a lifetime.” — Haruki Murakami, Kafka on
the Shore, 2006, p. 255.
If there is something you don’t understand, or need clarified, ask.
If you think of a question after class, come to my office and ask.
If you don’t understand the answer, ask again!
4
A possible user's view of a computer system
[Diagram: the user's desktop environment (browser, office environment, editor, mail, spreadsheet, database, compilers) sits above the operating system and libraries, which run on the computer itself: processor, memory, and input and output devices.]
5
A typical “desktop system”:
In this course we will be concerned mainly with the processor. (The
part typically not on the desktop.)
6
Inside the processor box:
Where is the processor?
7
A look at the “motherboard”:
8
The basic functional blocks of a simple computer
[Diagram: three blocks, CPU, memory, and input/output, connected together.]
We sometimes refer to the five classic components of a computer: input, output, memory, datapath, and control.
We often consider the CPU, or processor, as two components — a
datapath and a control unit.
The datapath performs arithmetic and logical operations on data
stored temporarily in internal registers.
The control unit determines exactly what operations are performed.
It also controls access to memory and I/O devices.
9
What are some characteristics of those components?
Characteristics of input:
• wide range of speed — keyboard, touch screen, network, video
• different modes — touch, video, voice, . . .
• different sampling rates — temperature, speed, . . .
Characteristics of output:
• again a wide range of speed — text, speech, video, . . .
• range of technologies — almost any controllable device
Characteristics of the processor:
• Does relatively simple operations at high speed
• Does exactly as it is instructed
• Very efficient for repetitive operations
• Technology developing at a consistent (rapid) rate — roughly
doubling every two years
Characteristics of memory:
• Processors require fast memory, to match processor speeds.
• Very fast memory is relatively expensive, slow memory is relatively cheap.
10
Here some of the inputs and outputs are obvious, but note the lack
of wire connections.
Where is the processor?
11
Inputs? Outputs? Processor?
12
A typical instruction for a computer
w := a + b + c
What does this mean to the computer?
What are a, b, c and w?
How is the expression evaluated?
Is it the same as
w := a + c + b?
How many computer instructions will this expression require?
How long will it take to execute?
Does the execution time depend on the values of a, b, and c?
Is the result exact? Why or why not?
Does the speed or accuracy depend on the particular processor?
Could using more than one processor speed up the calculation? How
about if the calculation was more complex?
13
Historical performance increase:
The following shows the increase in number of transistors for Intel
processors, memory, and other devices (from cmg.org):
These processors span the range of 4 to 64 bit processors.
Note the exponential growth in number of transistors, roughly doubling every two years.
This growth was first observed by Gordon Moore, the co-founder of
Intel, and is called Moore’s law.
14
Projections for the future:
The following graphs use data from the International Technology
Roadmap for Semiconductors (ITRS) 2004 update documentation.
We can see that the predictions were actually pessimistic!
ITRS produces a new roadmap every two years, and the latest is the
2013 roadmap.
(See http://www.itrs.net)
Transistor size:
[Plot: transistor channel width (nm) versus year, 2002 to 2018.]
15
Memory size (GB/chip):
[Plot: memory size in GB per chip (log scale) versus year, 2002 to 2018.]
Note the log scale on the y-axis.
This plot shows a stepwise exponential growth with time.
Why does memory have this behavior?
What happens between the beginning and end of each step?
16
Clock Frequency:
[Plot: on-chip clock frequency in GHz (log scale) versus year, 2002 to 2018.]
17
Number of transistors on chip - processors:
[Plot: millions of transistors on chip (log scale) versus year, 2002 to 2018, for low cost and for high performance processors.]
18
Can this continue?
What will stop this kind of growth?
Presently, a transistor in a high performance processor has a “size”
of about 25 nm.
A silicon atom has a “size” of about 0.54 nm (actually, the distance
between atoms in a silicon crystal.)
When transistors change state, or “switch”, they use energy.
For a fixed power supply voltage, the heat energy produced depends
on the number of transistors.
Presently, processors require high speed fans to keep them cool enough.
Cooling a processor is a serious problem, even now.
What limits the size of a switching device?
What is the minimum energy required to remember a bit of information?
Is there a limit to the speed at which a computation can be performed?
19
Other technologies following a type of Moore’s Law:
The following data was taken from the Technium website
(http://www.kk.org/thetechnium)
Doubling time of various technologies, in months:

Technology                     Measure             Doubling time (months)
Optical network                dollars/bit          9
Wireless network               bits/second         10
Data communication             bits/dollar         12
Digital cameras                pixels/dollar       12
Magnetic storage               GB/in^2             12
RAM (dynamic)                  bits/dollar         18
Processor power consumption    watts/cm^2          18
DNA sequencing                 dollars/base pair   18
Disk storage                   GB/dollar           20
Why does this happen for some technologies and not others?
What limits the growth in these cases?
20
Instruction set architectures
What is the minimum instruction set required for a processor?
Consider a flowchart for a program.
[Flowchart: a data box "i := i - 1" and a decision "i < 0 ?", with yes/no arrows linking them in a loop.]
Only two symbols are really necessary; data operations (boxes) and
control operations (arrows, or links).
Does this mean that we really only need two instructions?
Can input and output be handled, as well?
21
Actually, it is possible to combine both types of operation in one
instruction, and this is all that is required to have a fully functioning
computer.
Can you figure out what this instruction could be?
A machine with only one instruction would have interesting properties.
It is an interesting exercise to determine what they are.
Although a single instruction processor is interesting, it is not very
efficient, since many instructions are required to do even simple operations.
The course home page has a link to a simulator for a single instruction
processor.
A more useful exercise is to determine a small but efficient instruction
set for a particular processor.
22
What must an instruction contain?
• An encoding for the operation to be performed (op code)
• The addresses of the operands, and a destination address for the
result
The instruction encoding (op code) depends on the number of different instructions to be encoded.
An instruction may require 0, 1, 2, or more operands.
An example of a type of instruction which requires no operand is an
operation on a stack. Here, the operation (e.g., addition) uses the
top value and the next value on the stack, and the result replaces the
top of the stack.
Typical stack operations are push (place a value on the stack) and
pop (remove a value from the stack).
Some operations are inherently unary operations; e.g., negation.
More complex operations (e.g., addition) can add an operand to the
value in a fixed register (often called an accumulator) and store the
result in this accumulator.
It would have the form
Acc ← Mem[addr] op Acc
where op is an arbitrary binary operator.
23
Operations using two addresses can have a number of forms. For
example:
Mem[addr1] ← Mem[addr1] op Mem[addr2]
or
Acc ← Mem[addr1] op Mem[addr2]
Operations using three addresses can implement a full binary operation (like c = a + b) directly:
Mem[addr3] ← Mem[addr1] op Mem[addr2]
Encoding several memory addresses in an instruction requires a large
instruction size. Most processors have at least 32 address bits (4GB
memory), so an instruction using three memory operands would require more than 3 × 32 = 96 bits.
Some processors have variable length instructions (e.g., INTEL processors, used in the PC); others have fixed length instructions (e.g.,
the MIPS style processors, used in many game processors).
Generally, the decoding of fixed length instructions is simpler than
the decoding of variable length instructions.
It is also common for certain instructions to encode data within the
instruction. Typically, the data would be restricted to a constant
with a small range of values. (Incrementing by a small number is a
common operation, and encoding the data directly in an instruction
is efficient.)
This is usually called immediate data.
24
How complex should the instructions be?
It is possible to have instructions that are quite complex.
For example, a well-known processor from the past had a single instruction which could evaluate a polynomial of arbitrary order.
There are (at least) two schools of thought on the design of instruction
sets.
One is that the instruction set should attempt to be as close as
possible to modern computer languages.
Such ISA’s are called Complex Instruction Set architectures (CISC
architectures).
The idea is that compilers for such architectures are simpler.
These architectures typically have variable length instructions, with
many addressing modes. Each instruction may take several (or many)
machine cycles (clock periods).
The PC (Intel, AMD) architectures are of this type.
Another is that the instructions should be as simple and fast as
possible. These instruction set architectures usually have fixed size
instructions, and each instruction completes in a single clock cycle.
Such ISA’s are called Reduced Instruction Set architectures (RISC
architectures).
The MIPS architecture which we will be discussing later is of this
type.
25
Register files
Many processors have sets of registers, often called register files.
Instructions can address individual registers in the register file, using
far fewer bits than a full memory address.
For example, the MIPS processor has thirty-two 32-bit registers; each register
therefore requires only a 5 bit address, and a three operand instruction operating on registers only would require 3 × 5 = 15 bits for the
operand addresses.
The PC has eight 32-bit general registers, and a number of special purpose registers.
Some processors (those in the PC, for example) allow instructions
which mix memory and register operations. Other processors permit
arithmetic and logic operations only on registers. Of course, both
types have instructions to copy values between the register file and
memory.
26
Addressing modes
Many processors have several ways of constructing the memory address for a data operand. The register file may be used to provide
part of the address. This is particularly useful for list or tabular data
structures.
The simplest addressing mode is where the address is part of the instruction itself. It may be used for accessing data, or for determining
the target address for a branch or jump.
Another form of addressing is relative addressing. Here the instruction contains a displacement from the current address. This is most
commonly used with branch instructions.
For such a branch, the target address is calculated as
address = PC + displacement
where PC is the program counter, which contains the address of the
current instruction.
Addressing modes which involve a register often add the value in a
register to a displacement value from the instruction. These would
be calculated as
address = Ri + displacement
where Ri is the register designated for the address by the instruction.
This type of addressing is called indexed or based addressing.
This type of addressing is useful for manipulating list data structures;
the list can be traversed simply by incrementing or decrementing the
register value.
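A rough sketch in C (not MIPS code; the helper name effective_address is invented for this example) of how indexed addressing supports list traversal: a "register" holds a base address, the instruction supplies a fixed displacement, and the list is walked simply by stepping the register value.

#include <stdint.h>
#include <stdio.h>

/* Compute an address the way an indexed (base + displacement)
 * addressing mode would: base plays the role of the register Ri,
 * disp the displacement encoded in the instruction. */
static uintptr_t effective_address(uintptr_t base, intptr_t disp) {
    return base + disp;                   /* address = Ri + displacement */
}

int main(void) {
    int list[8] = {0};
    uintptr_t ri = (uintptr_t)&list[0];   /* "register" holds the start of the list */

    /* Traverse the list by incrementing the register value by the
     * element size while the displacement stays fixed at 0. */
    for (int i = 0; i < 8; i++) {
        int *p = (int *)effective_address(ri, 0);
        *p = i;
        ri += sizeof(int);
    }
    printf("%d %d\n", list[3], list[7]);  /* prints 3 7 */
    return 0;
}

With a nonzero displacement the same pattern reaches a field at a fixed offset inside each element, which is the based (indexed) addressing case described above.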
27
This idea can be extended to the use of two registers. This would be
useful for addressing data in a 2-dimensional structure like a table.
The address would be calculated as
address = Ri + Rj + displacement
and is usually called based indexed addressing.
Here, each register can be manipulated independently, allowing for
row and column operations.
More complex addressing modes are also possible.
The target of an address can be a data value (corresponding to a
variable in a program — the address is the variable name).
It can also be an instruction, such as the target of a jump or branch
instruction.
It can also be another address (this corresponds to a pointer in languages like C, or a reference in other languages.)
This capability is called indirect addressing, and may be supported
by the instruction set architecture of the processor.
Indirect addressing is used in the construction of more complex data
structures like linked lists, trees, etc.
Many processors support several different addressing modes. In fact,
the PC supports all the addressing modes mentioned, and several
others.
28
Relating instruction sets to logic
It is also useful to consider what the internal structure of a computer
would be, independent of any particular instruction set.
For example, the requirement that instructions and data be fetched
from memory (and that memory is independent from the processor)
requires that the processor be able to generate and maintain a memory address, and that it be able to provide data to or receive data
from memory.
This implies that there are two entities which can hold information
(an address and a data word) stable long enough for a memory read
or write.
Typically, in logic, we would implement these with registers.
The address is held in the memory address register (MAR).
The data is held in the memory data register (MDR).
Circuitry is also required to generate and maintain instruction addresses. Most often, the next instruction to be executed is the next
instruction in memory. This circuitry is usually called the program
counter (PC).
The instructions themselves contain addresses for data, and there
must be a control unit to decode instructions and manage flow of
control (e.g., branches).
Data, and computational results, are stored in internal registers.
There must be circuitry to perform the required arithmetic and/or
logical operations (the datapath).
29
Combining all these observations, we require a structure similar to
the following:
[Block diagram: general registers and/or an accumulator, the ALU, an instruction decode and control unit, the program counter (PC) with its address generator, the memory data register (MDR), and the memory address register (MAR), interconnected by internal buses.]
30
The internal structure of a modern style processor —
the MIPS R5000:
More such photomicrographs are available at url
http://micro.magnet.fsu.edu/chipshots
31
The MIPS instruction set architecture
The MIPS has a 32 bit architecture, with 32 bit instructions, a 32
bit data word, and 32 bit addresses.
It has 32 addressable internal registers requiring a 5 bit register address. Register 0 always has the constant value 0.
Addresses are for individual bytes (8 bits) but instructions must have
addresses which are a multiple of 4. This is usually stated as “instructions must be word aligned in memory.”
There are three basic instruction types with the following formats:
R-type (register):
   op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)

I-type (immediate):
   op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)

J-type (jump):
   op (bits 31-26, 6 bits) | target (25-0, 26 bits)
All op codes are 6 bits.
All register addresses are 5 bits.
32
R-type (register):
   op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
The R-type instructions are 3 operand arithmetic and logic instructions, where the operands are contained in the registers indicated by
rs, rt, and rd.
For all R-type instructions, the op field is 000000.
The funct field selects the particular type of operation for R-type
operations.
The shamt field determines the number of bits to be shifted (0 to
31).
These instructions perform the following:
R[rd] ← R[rs] op R[rt]
Following are examples of R-type instructions:

Instruction         Example               Meaning
add                 add $s1, $s2, $s3     $s1 = $s2 + $s3
add unsigned        addu $s1, $s2, $s3    $s1 = $s2 + $s3
subtract            sub $s1, $s2, $s3     $s1 = $s2 - $s3
subtract unsigned   subu $s1, $s2, $s3    $s1 = $s2 - $s3
and                 and $s1, $s2, $s3     $s1 = $s2 & $s3
or                  or $s1, $s2, $s3      $s1 = $s2 | $s3
33
I-type (immediate):
   op (bits 31-26) | rs (25-21) | rt (20-16) | immediate (15-0)
The 16 bit immediate field contains a data constant for an arithmetic
or logical operation, or an address offset for a branch instruction.
This type of branch is called a relative branch.
Following are examples of I-type instructions of the form:
R[rt] ← R[rs] op imm

Instruction     Example                Meaning
add             addi $s1, $s2, imm     $s1 = $s2 + imm
add unsigned    addiu $s1, $s2, imm    $s1 = $s2 + imm
subtract        subi $s1, $s2, imm     $s1 = $s2 - imm
and             andi $s1, $s2, imm     $s1 = $s2 & imm
Another I-type instruction is the branch instruction.
Examples of this are:
Instruction           Example              Meaning
branch on equal       beq $s1, $s2, imm    if $s1 == $s2 go to PC + 4 + (4 × imm)
branch on not equal   bne $s1, $s2, imm    if $s1 != $s2 go to PC + 4 + (4 × imm)
Why is the imm field multiplied by 4 here?
34
J-type (jump):
   op (bits 31-26) | target (25-0)
The J-type instructions are all jump instructions.
The two we will discuss are the following:
Instruction     Example      Meaning
jump            j target     go to address PC[31:28] : (4 × target)
jump and link   jal target   $31 = PC + 4; go to address PC[31:28] : (4 × target)
Why is the PC incremented by 4?
Why is the target field multiplied by 4?
Recall that the MIPS processor addresses data at the byte level, but
instructions are addressed at the word level.
Moreover, all instructions must be aligned on a word boundary (an
integer multiple of 4 bytes).
Therefore, the next instruction is 4 bytes beyond the current
instruction.
Since jumps must have an instruction as target, shifting the target
address by 2 bits (which is the same as multiplying by 4) allows the
instruction to specify larger jumps.
Note that the jump instruction cannot span (jump across) all of
memory.
35
There are a few more interesting instructions, for comparison, and
memory access:
R-type instructions:

Instruction     Example              Meaning
set less than   slt $s1, $s2, $s3    if ($s2 < $s3) $s1 = 1; else $s1 = 0
jump register   jr $ra               go to $ra

set less than also has an unsigned form.
jump register is typically used to return from a subprogram.

I-type instructions:

Instruction               Example              Meaning
set less than immediate   slti $s1, $s2, imm   if ($s2 < imm) $s1 = 1; else $s1 = 0
load word                 lw $s1, imm($s2)     $s1 = Memory[$s2 + imm]
store word                sw $s1, imm($s2)     Memory[$s2 + imm] = $s1
load word and store word are the only instructions that access
memory directly.
Because data must be explicitly loaded before it is operated on, and
explicitly stored afterwards, the MIPS is said to be a load/store
architecture.
This is often considered to be an essential feature of a reduced instruction set architecture (RISC).
36
The MIPS assembly language
The previous diagrams showed examples of code in a general form
which is commonly used as a simple kind of language for a processor
— a language in which each line in the code corresponds to a single
instruction in the language understood by the machine.
For example,
add $1, $2, $3
means add together the contents of registers $2 and $3 and
store the result in register $1.
We call this type of language an assembly language.
The language of the machine itself, called the machine language,
consists only of 0’s and 1’s — a binary code.
The machine language instruction corresponding to the previous instruction (with the different fields identified) is:
   op       rs     rt     rd     shamt   funct
   000000   00010  00011  00001  00000   100000
There are usually programs, called assemblers, to translate the more
human readable assembly code to machine language.
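As a small illustration (a sketch, not part of any real assembler), the following C function packs the six R-type fields into one 32-bit word; applied to the field values shown above it reproduces the machine word for add $1, $2, $3.

#include <stdint.h>
#include <stdio.h>

/* Pack the R-type fields (op, rs, rt, rd, shamt, funct) into a
 * 32-bit instruction word, using the bit positions given above. */
static uint32_t encode_rtype(uint32_t op, uint32_t rs, uint32_t rt,
                             uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16) |
           (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    /* add $1, $2, $3: op = 0, rs = 2, rt = 3, rd = 1, shamt = 0, funct = 0x20 */
    uint32_t word = encode_rtype(0, 2, 3, 1, 0, 0x20);
    printf("0x%08x\n", word);             /* prints 0x00430820 */
    return 0;
}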
37
Compilers and assemblers
A compiler translates a “high level” language like C or Java into the
“machine language” for a particular environment (operating system
and target machine type.)
It is generally possible to compile a high-level language program to
run on almost any commercial computer system.
A single high level language statement corresponds to several, and
often many, machine instructions.
Some modern language compilers (e.g., Java) produce output that
does not correspond to any “real” computer, but rather to a “virtual” or model computer. This output (called bytecode, or p-code)
can then be executed by a software model of the virtual machine
(interpreted) or further translated into the machine language of the
underlying processor.
38
An assembler translates an “assembly language” into the “machine
language” for a particular target machine.
Assembly languages for different target machines are different.
Assembly language instructions normally translate one-for-one to
machine instructions. (Some particular combinations of a few instructions may correspond to only one assembly instruction.)
Assembly code has a simple format. It normally includes labels,
instructions, and directives.
labels correspond directly to addresses (much like variable names in
high-level languages), but are also used to label instructions —
for example, a jump target.
Labels are character strings followed by “:”
For example, in the code following, loop: is a label.
instructions define the particular operations to be executed
directives provide information for the assembler itself.
Directives are preceded by a “.”
For example, the directive .align 2 forces the next item to
align itself on a word boundary.
Typically, there are at least two separate sections, indicated by directives, dividing the program into instructions and data.
39
A simple assembly language program
The following shows a short assembly code segment, for an infinite
loop:
        .text
        .align 2
loop:
        addi $1, $0, 0       # set register 1 to 0
        sw   $0, 128($1)     # store 0 at 128 + the location
                             #   pointed to by register 1
        addi $1, $1, 4       # increment register 1 by 4
        j    loop            # go back to loop
Here, loop is a label, .text and .align are directives. The text
following # are comments.
This corresponds to the following machine language program (assuming it starts at memory location 0):
location   instruction
0          001000 00000 00001 00000 00000 000000
4          101011 00001 00000 00000 00010 000000
8          001000 00001 00001 00000 00000 000100
12         000010 00000 00000 00000 00000 000001
40
How does an assembler work?
It is a fairly simple process to write a program to translate these
instructions into machine code; it is a simple one-for-one translation.
The main problem is with labels — forward references, in particular.
Most simple assemblers make two “passes” over the assembler code;
in the first pass all the labels and their corresponding addresses are
placed in a symbol table. In the second pass, the instructions are
generated, using the addresses from the symbol table.
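A minimal sketch in C of the first-pass symbol table (the structure, sizes, and function names here are invented for illustration): pass one records each label with its address; pass two looks labels up while generating instructions.

#include <stdio.h>
#include <string.h>

#define MAX_SYMS 100

struct symbol { char name[32]; unsigned addr; };
static struct symbol symtab[MAX_SYMS];
static int nsyms = 0;

/* Pass 1: remember the address at which a label was defined. */
static void record_label(const char *name, unsigned addr) {
    strncpy(symtab[nsyms].name, name, 31);
    symtab[nsyms].name[31] = '\0';
    symtab[nsyms].addr = addr;
    nsyms++;
}

/* Pass 2: look a label up; a failed lookup is an unresolved
 * reference left for the linker. */
static int lookup(const char *name, unsigned *addr) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0) { *addr = symtab[i].addr; return 1; }
    return 0;
}

int main(void) {
    record_label("loop", 0);          /* loop: marks the instruction at address 0 */
    unsigned a;
    if (lookup("loop", &a))           /* resolve a (possibly forward) reference   */
        printf("loop -> %u\n", a);
    return 0;
}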
The output of the assembler is an object file. This object file still
may contain unresolved references (say, to library functions) which
are resolved by the linker.
We will look in more detail at how functions work in assembly language later, but it is usual to provide functions for common operations in a library.
For example, there is a function printf which accepts a format string
and one or more values to print as arguments. (This is actually a
standard C function.)
41
In UNIX systems, object files have six components:
• An object file header describing the sizes of the other sections
• The text segment containing the actual machine code
• The data segment containing the data in the source file
• relocation information identifying data and instructions that
rely on absolute addresses, which must be changed if the program
is moved from one part of memory to another.
• The symbol table associating labels with addresses, and holding
places for unresolved references.
• debugging information, containing concise information about
how the program was compiled, so a debugger can associate
memory addresses with lines in the source file.
The following diagram shows the steps involved in assembling and
running a program:
42
[Diagram: the programmer writes an assembly language program; the assembler translates it into a machine language program; the loader combines that program with library routines and other object files and places the result in memory, where the processor executes it, reading input and producing output.]
43
MIPS memory usage
MIPS systems typically divide memory into three parts, called segments.
These segments are the text segment which contains the program’s
instructions, the data segment, which contains the program’s data,
and the stack segment which contains the return addresses for function calls, and also contains register values which are to be saved and
restored. It may also contain local variables.
[Memory map:
  7fffffff (hex)   top of the stack segment (the stack grows downward)
                   Stack segment
                   Dynamic data (grows upward)
  10000000 (hex)   Static data (start of the data segment)
    400000 (hex)   Text segment
         0         Reserved]
The data segment is divided into 2 parts, the lower part for static
data (with size known at compile time) and the upper part, which
can grow, upward, for dynamic data structures.
The stack segment varies in size during the execution of a program,
as functions are called and returned from.
It starts at the top of memory and grows down.
44
More about assemblers
Sometimes, an assembler will accept a statement that does not correspond exactly to a machine instruction. For example, it may correspond to a small set of machine instructions. These are called
pseudoinstructions.
This is done when a particular set of statements are frequently used,
and have a simple translation to a set of machine instructions.
The original MIPS assembly language had a number of these.
For example, the pseudoinstruction load double
ld $4, 0($1)
would generate the following two instructions:
lw $4, 0($1)
lw $5, 4($1)
The pseudoinstruction load address
la $4, label
generates the instructions
lui $4, imm_u
and
ori $4, $4, imm_l
which load the upper and lower 16 bits of the address, respectively.
The pseudoinstruction
mov $5, $1
moves the contents of register $1 to register $5
What single MIPS instruction corresponds to this pseudoinstruction?
45
Macros
Assemblers also provide a way to define a set of instructions similar to a function, which
can accept a formal argument. These are called macros.
A macro is expanded as text, so code is generated each time the macro
is used, and the formal argument is replaced as text in the macro.
Consequently, there is no function call — the macro is expanded
directly in the code.
Following is a macro which uses the function printf to print an
integer:
        .data
int_str: .asciiz "%d"
        .text
        .macro print_int($arg)
        la   $a0, int_str    # load format string address
                             #   into first argument register
        mov  $a1, $arg       # load macro's parameter ($arg)
                             #   into second argument register
        jal  printf
        .end_macro
This macro would be “called” with a formal argument like
print_int($7)
and would have the effect of inserting the above code, with register
$7 replacing the string $arg.
46
Translating programs to assembly language
Given the program statement
y=a+b−c+d
what is an equivalent assembly code?
Assuming that a, b, c, d are in registers $5 to $8, respectively, and
that y is in $9, then we could have:
add $9, $5, $6      # y = a + b
sub $10, $8, $7     # tmp = d - c
add $9, $9, $10     # y = y + tmp
Note that we have introduced a temporary register, $10 (tmp) here.
This is not really necessary.
To place the values of a, b, c, and d in the registers, from memory,
assuming register $20 contains the address for variable a, and variables b,c, d, and y are the next consecutive words in memory, we
could write
lw $5, 0($20)       # load a in reg $5
lw $6, 4($20)       # load b in reg $6
lw $7, 8($20)       # load c in reg $7
lw $8, 12($20)      # load d in reg $8
To store the value of y in memory, we could write
sw $9, 16($20)      # store reg $9 in y
47
Simple data structures
It is common to use some kind of data structure in a high-level programming language.
How would the following be translated into MIPS assembly language?
A[i] = A[i] + B;
Assuming there is a label Astart at the beginning of the data array
A[], and that register $19 has the value 4 × i and that the value of
B is in register $18:
lw  $8, Astart($19)     # load A[i] in reg $8
add $8, $18, $8         # add B to A[i]
sw  $8, Astart($19)     # store reg $8 in variable A[i]
48
Program structures — loops
Extending the previous example to a simple loop; how would the
following be translated to MIPS assembly language?
for (i = 0; i < 10; i++) {
A[i] = A[i] + B;
}
Here, we need to set up a counter, say, in register $6, and compare
it to 10.
        addi $6, $0, 0          # initialize counter i to 0
        addi $19, $0, 0         # initialize array address
        addi $5, $0, 10         # set test value for loop
loop:
        lw   $8, Astart($19)    # load A[i] in reg $8
        add  $8, $18, $8        # add B to A[i] (B is in $18)
        sw   $8, Astart($19)    # store reg $8 in variable A[i]
        addi $6, $6, 1          # increment counter
        addi $19, $19, 4        # increment array address
        bne  $5, $6, loop       # jump back until counter
                                #   equals 10
Note that this is not the most efficient code; the array index itself
could be used to terminate the loop, using one less register, and one
less instruction in the loop.
49
Conditional expressions
Consider the following C code:
if (i==j)
x = x + h;
else
x = x - h;
Assume i, j, x and h are already in registers $4, $5, $6, and $7,
respectively.
In MIPS assembly language, this could be written as:
        bne $4, $5, else    # jump to the "else" clause
        add $6, $6, $7      # execute the "then" clause
        j   endif           # jump past the "else" clause
else:   sub $6, $6, $7      # execute the "else" clause
endif:  . . .
A similar, but extended, structure could be written for case structures.
50
Subprograms
We have already seen the instruction to jump to a subprogram, jal
which places the value of PC + 4 (the address of the next instruction
in memory) into register $31.
We have also seen how the subprogram returns back to the main
program using the instruction
jr $31
There are still some questions about subprograms, however.
First, what happens when a subprogram calls another subprogram?
There must be some way to save the “old” return address before
overwriting the value in register $31.
Next, how are arguments passed to the subprogram?
To answer the first question, a stack data structure is used to save
the return address in register $31 before a subprogram is called.
The operation of placing a value on the stack is called pushing a
value onto the stack.
Returning a value from the stack to the register is called popping a
value from the stack.
By convention, register $29 is used as a stack pointer.
It is initially set to a high value (7fffffff hex) and decremented
every time a value is pushed, and incremented whenever a value is
popped.
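The convention can be modeled in a few lines of C (an illustrative sketch; the array and its size are made up): the pointer starts at the top of the stack area, moves down on a push, and moves back up on a pop.

#include <stdio.h>

#define STACK_WORDS 64

static unsigned stack_mem[STACK_WORDS];
static unsigned *sp = &stack_mem[STACK_WORDS];   /* "high" initial value */

static void push(unsigned value) { sp = sp - 1; *sp = value; }          /* decrement, then store  */
static unsigned pop(void) { unsigned v = *sp; sp = sp + 1; return v; }  /* load, then increment   */

int main(void) {
    push(0x400020);                   /* e.g., save a return address      */
    push(0x400100);                   /* a nested call saves another      */
    unsigned first = pop();           /* values come back in LIFO order   */
    unsigned second = pop();
    printf("%x %x\n", first, second); /* prints 400100 400020 */
    return 0;
}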
51
The following diagram shows the state of the stack after three nested
subprogram calls:
[Diagram: the main program calls subprogram 1, which calls subprogram 2, which calls subprogram 3. The stack holds, from the top of memory downward, the return address to main, the return address to subprogram 1, and the return address to subprogram 2; sp points to the last entry pushed. Subprogram 3 then returns.]
Note that the stack pointer always points to the last element placed
on the stack. It is decremented before pushing, and incremented
after popping.
The return address is not the only thing which must be saved during
the execution of a subprogram. Arguments may also be passed to a
subprogram on the stack.
If a subprogram can call itself (recursion) then its entire state must
be saved. This includes the contents of registers used by the subprogram, and values of local variables, etc. These are also saved on the
stack. The whole block of memory used by the stack in handling a
procedure call is referred to as a procedure call frame.
52
The procedure call frame is usually completely contained in the stack,
and is often called simply a stack frame. In order to facilitate accessing data in the stack frame, there is usually a frame pointer which
points to the start of a frame. The stack pointer points to the end
of the frame.
[Diagram of a stack frame: arguments 5, 6, ... passed by the caller sit just above $fp; the frame itself (of size "frame size") holds saved registers and local variables; $sp marks the end of the frame, below which lies the argument build area for calls this procedure makes.]
In the MIPS convention, register $30 is the frame pointer.
In order to properly preserve the contents of registers in a procedure
call, both the caller and callee must agree on who is responsible for
saving each register. The following convention was used with most
MIPS compilers:
53
MIPS register names and conventions about their use
Name    Number   Usage
zero    0        Constant 0
at      1        Reserved for assembler
v0      2        Expression evaluation and results of a function
v1      3        Expression evaluation and results of a function
a0      4        Argument 1
a1      5        Argument 2
a2      6        Argument 3
a3      7        Argument 4
t0-t7   8-15     Temporaries (not preserved across call)
s0-s7   16-23    Saved temporaries (preserved across call)
t8      24       Temporary (not preserved across call)
t9      25       Temporary (not preserved across call)
k0      26       Reserved for OS kernel
k1      27       Reserved for OS kernel
gp      28       Pointer to global area
sp      29       Stack pointer
fp      30       Frame pointer
ra      31       Return address (used by function call)
54
What happens when a procedure is called
Before calling a procedure, the caller must:
1. Pass the arguments to the callee procedure;
The first 4 arguments are passed in registers $a0 - $a3 ($4 - $7). The remaining arguments are placed on the stack.
2. Save any caller-saved registers that the caller expects to use after
the call. This includes the argument registers and the temporary registers $t0 - $t9. (The callee may use these registers,
altering the contents.)
3. Execute a jal to the called procedure (callee). This saves the
return address in $ra.
At this point, the callee must set up its stack frame:
1. Allocate memory on the stack by subtracting the frame size from
the $sp.
2. Save any registers the caller expects to have left unchanged.
These include $ra, $fp, and the registers $s0 - $s7.
3. Set the value of the frame pointer by adding the stack frame size
to $sp and subtracting 4.
The procedure can then execute its function.
Note that the argument list on the stack belongs to the stack frame
of the caller.
55
Returning from a procedure
When the callee returns to the caller, the following steps are required:
1. If the procedure is a function returning a value, the value is
placed in register $v0 and, if two words are required, $v1 (registers $2 and $3).
2. All callee-saved registers are restored by popping the values from
the stack, in the reverse order from which they were pushed.
3. The stack frame is popped by adding the frame size to $sp.
4. The callee returns control to the caller by executing jr $ra
Note that some of the operations may not be required for every
procedure call, and modern compilers would only generate the steps
required for each particular procedure.
For example, the lowest level subprograms to be called (“leaf nodes”)
would not have to save $ra.
If a programming language does not allow a subprogram to call itself
(recursion) then implementing a stack frame may not be required,
but a stack is still required for nested procedure calls.
56
Who does what operation (caller or callee) is to some extent arbitrary,
and different systems may use quite different conventions.
For example, in some systems the subprogram arguments are part of
the callee stack frame, unlike the MIPS in which they belong to the
frame of the caller.
The designation of certain registers as caller save, and others as callee
saved is also arbitrary, and to some extent depends on how many
registers are available.
Having registers which a procedure can use without the overhead of
saving and restoring tends to lower the overhead of a procedure call.
Processors with few general registers (e.g., the INTEL processors)
would likely construct a stack frame quite differently.
It is imperative, however, that all the programs which will be linked
together strictly follow the same conventions.
Typically, procedures from several languages (e.g., assembly, C, Java)
can be intermixed at run time, so the compilers and linkers must
follow the same conventions if they are to interact correctly.
57
An example of a recursive function (factorial)
The following is a simple factorial function written in C:
C program for factorial (recursive)
main ()
{
    printf ("the factorial of 10 is %d\n", fact(10));
}

int fact (int n)
{
    if (n < 1)
        return 1;
    else
        return (n * fact (n-1));
}
Following is the same code, in MIPS assembly language. First, the
main program is shown, followed by the factorial function itself.
Note that the MIPS specifies a minimum size of 32 bytes for a stack
frame.
58
# MIPS assembly code showing recursive function calls:
        .text                 # Text section
        .align 2              # Align following on word boundary.
        .globl main           # Global symbol main is the entry
        .ent main             #   point of the program.
main:
        subiu $sp, $sp, 32    # Allocate stack space for return
                              #   address and local variables (32 bytes).
                              #   (Stack "grows" downward.)
        sw    $ra, 20($sp)    # Save return address
        sw    $fp, 16($sp)    # Save old frame pointer
        addiu $fp, $sp, 28    # Set up frame pointer
        li    $a0, 10         # put argument (10) in $a0
        jal   fact            # jump to factorial function

# the factorial function returns a value in register $v0
#       la    $a0, $LC        # Put format string pointer in $a0
#       move  $a1, $v0        # put result in $a1
#       jal   printf          # print the result
59
# Instead of using printf, we can use a syscall
        move  $s0, $v0        # put result in $s0

# Print label for output.
        li    $v0, 4          # Syscall code for print string
                              #   goes in register $v0
        la    $a0, $LC        # Put format string pointer in $a0
        syscall               # print string

# Print integer result
        li    $v0, 1          # Syscall code for print integer
        move  $a0, $s0        # Put integer to be printed in $a0
        syscall               # print integer

        move  $v0, $0         # Clear register v0.
# end of print output

# restore saved registers
        lw    $ra, 20($sp)    # restore return address
        lw    $fp, 16($sp)    # restore old frame pointer
        addiu $sp, $sp, 32    # Pop stack frame
        jr    $ra             # return to caller (shell)

        .rdata
$LC:    .ascii "The factorial of 10 is "
60
# factorial function
        .text                 # Text section
fact:
        subiu $sp, $sp, 32    # Allocate stack frame (32 bytes)
        sw    $ra, 20($sp)    # Save return address
        sw    $fp, 16($sp)    # Save old frame pointer
        addiu $fp, $sp, 28    # Set up frame pointer
        sw    $a0, 0($fp)     # Save argument (n)

# here we do the required calculation
# first check for terminal condition
        bgtz  $a0, $L2        # Branch if n > 0
        li    $v0, 1          # Return 1
        j     $L1             # Jump to code to return

# do recursion
$L2:
        subiu $a0, $a0, 1     # subtract 1 from n
        jal   fact            # jump to factorial function,
                              #   returning fact(n-1) in $v0
        lw    $v1, 0($fp)     # Load n (saved earlier) into $v1
        mul   $v0, $v0, $v1   # compute (fact(n-1) * n)
                              #   and return result in $v0
61
# restore saved registers and return
$L1:                          # result is in $2
        lw    $ra, 20($sp)    # restore return address
        lw    $fp, 16($sp)    # Restore old frame pointer
        addiu $sp, $sp, 32    # pop stack
        jr    $ra             # return to calling program
62
When is assembly language used?
Modern compilers optimize code so well that assembly language is
rarely used to increase performance. Consequently, in large computer
systems, assembly language is rarely used.
Today its main application is in small systems (typically single chip
microcontrollers) where some special function is being implemented,
or there is a need to meet some particular timing constraint.
Typically, such systems have limited memory for programs and data,
and are dedicated to performing a small number of very specific functions.
These kinds of constraints are often typical of I/O functions, and it is
for this type of application that assembly language is still occasionally
useful.
Generally, a programmer will solve a problem using a higher level
language like C first (this makes the resulting code more portable).
Only if the timing or size constraints are not met will the programmer
resort to recoding part or all of the function in assembler.
63
Switching Functions - logic:
Many things can be described by two distinct states; for example,
a light can be “on” or “off;” a switch can be “open” or “closed;” a
statement can be “true” or “false.”
Devices which have exactly two distinct states, say “on” and “off,”
are often particularly simple to construct; in particular, electronic
devices with two distinct states are much simpler to construct than
devices with, say 10 states. A typical electronic device with 2 states
is a switch, which can be “on” (switch closed) or “off” (switch open).
A very effective switch can be made with a single transistor. Transistor switches can be very small; current commercial integrated circuit technology routinely manufactures devices containing many millions of these switches in a single integrated circuit, with each switch
capable of switching, or changing state, in a time of less than 0.1
nanosecond (abbreviated ns, 1 ns is the time required for light to
travel approximately 30 cm, or 1 foot.)
Since such binary (i.e., 2-state) devices are so simple, it is useful to
examine the kinds of operations which can be performed involving
only 2 states. An “algebra” of entities having exactly two states
(“true” and “false”, or “1” and “0”) was developed by the mathematician George Boole, and later called Boolean Algebra. This algebra
was applied to electronic switching circuits by Shannon, as a “switching algebra.”
64
Exam logic — not what Boole intended?
65
We can define a switching algebra as an algebraic system consisting of the set {0, 1}, two binary operations called OR (+) and AND (·), and one unary operation called NOT, or complementation, or inversion, written here with a prime (often also denoted by an overbar).
These operations are defined as follows:

OR (+)        AND (·)       NOT (′)
0 + 0 = 0     0 · 0 = 0     0′ = 1
0 + 1 = 1     0 · 1 = 0     1′ = 0
1 + 0 = 1     1 · 0 = 0
1 + 1 = 1     1 · 1 = 1
These relations are often expressed in “truth tables” as follows:

OR             AND            NOT
A B  A+B       A B  A·B       A  A′
0 0   0        0 0   0        0  1
0 1   1        0 1   0        1  0
1 0   1        1 0   0
1 1   1        1 1   1
Following are the “circuit symbols” for these functions:
[Figure: gate symbols for OR (inputs A, B; output A + B), AND (inputs A, B; output A · B), and NOT (input A; output A′).]
Note that the symbol for NOT is actually the ◦.
66
This switching algebra has a number of properties which can be
readily shown:
Idempotency       A + A = A
                  A · A = A
Commutativity     A + B = B + A
                  A · B = B · A
Associativity     (A + B) + C = A + (B + C)
                  (A · B) · C = A · (B · C)
Distributivity    A · (B + C) = A · B + A · C
                  A + (B · C) = (A + B) · (A + C)
Absorption        A + (A · B) = A
                  A · (A + B) = A
Consensus         A · B + A′ · C + B · C = A · B + A′ · C
                  (A + B) · (A′ + C) · (B + C) = (A + B) · (A′ + C)
67
Another useful property is de Morgan's theorem:
(A + B)′ = A′ · B′
(A · B)′ = A′ + B′
de Morgan's theorem can be generalized to
[F(A1, A2, . . . , An, 0, 1, ·, +)]′ = F(A1′, A2′, . . . , An′, 1, 0, +, ·)
That is, the complement of any expression can be obtained by replacing each variable and element with its complement, and at the
same time interchanging the OR and AND operators.
Duality
Note that each of the preceding properties occur in pairs, and that
one of the pairs can be obtained from the other simply by replacing
the AND operator by the OR operator, and vice versa. This property
is called duality, and the operators AND and OR are said to be dual.
This property comes about because if the 1’s and 0’s are interchanged
in the definition of the AND function, the function becomes the OR
function; similarly, the OR function becomes the AND function. This
property is general, and if one theorem is shown to be true, then its
dual will also be true.
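Because there are only two values, each identity can be checked exhaustively. The following C sketch verifies de Morgan's laws and both distributive laws for every 0/1 assignment of A, B, and C, using C's bitwise operators on the values 0 and 1.

#include <stdio.h>

int main(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++) {
                /* de Morgan: (A+B)' = A'.B'  and  (A.B)' = A'+B' */
                if ((!(a | b)) != (!a & !b)) printf("de Morgan 1 fails\n");
                if ((!(a & b)) != (!a | !b)) printf("de Morgan 2 fails\n");
                /* distributivity, both forms */
                if ((a & (b | c)) != ((a & b) | (a & c))) printf("distributivity 1 fails\n");
                if ((a | (b & c)) != ((a | b) & (a | c))) printf("distributivity 2 fails\n");
            }
    printf("all identities hold for every 0/1 assignment\n");
    return 0;
}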
68
The circuit symbol notation can be extended to other logic gates; for
example, the following represents the functions NAND (not AND)
and NOR (not OR):
[Figure: gate symbols for NOR (inputs A, B; output (A + B)′) and NAND (inputs A, B; output (A · B)′).]
Note that the NOT function is represented by the ◦.
N-input OR and N-input AND gates are represented by the symbols:
[Figure: an n-input OR gate (inputs A1, A2, ..., An; output A1 + A2 + ... + An) and an n-input AND gate (inputs A1, A2, ..., An; output A1 · A2 · ... · An).]
There is another commonly used circuit symbol, the exclusive-OR
function, denoted by the symbol ⊕. It is defined as follows:
A B   A ⊕ B
0 0     0
0 1     1
1 0     1
1 1     0

[Figure: XOR gate symbol, inputs A and B, output A ⊕ B.]
69
Switch implementation of switching functions:
The functions NOT, AND and OR can be implemented with simple
switches. In fact, in digital electronic circuits, transistors are used as
simple switches in circuits similar to those which follow.
Note that the power supplied to the circuit is shown, (a battery), as
is a device to detect the output (a lamp). They are not part of the
logic, but are required to make the switching logic useful.
In the AND function, the two switches are in series with each other;
in the OR function, the two switches are connected in parallel. For
the NOT function, the switch is connected in parallel with the output
(the lamp). NAND and NOR gates can be constructed similarly.
[Figure: switch circuits, each with a battery and a lamp. (a) NOT gate: switch A connected in parallel with the lamp. (b) AND gate: switches A and B in series with the lamp. (c) OR gate: switches A and B in parallel with each other, in series with the lamp.]
These circuits can be combined to form more complex switching
functions. (If you have not seen it before, try to construct a simple
switching circuit for the XOR function).
Note that the inputs for these simple switches are mechanical; e.g.
the press of a finger. For electronic switches such as transistors, the
inputs can be the outputs of other logic functions, so very complex
logic circuits can be designed which operate “automatically”.
70
Canonical forms of switching functions:
It is possible to construct a truth table for any switching function
(i.e., any function of switching variables.)
The truth table provides a complete, unique description of the switching function, but it is cumbersome.
We can derive from the truth table certain unique expressions which
define the function exactly; in fact, each such expression is exactly equivalent to the truth table.
One such expression is called the minterm form of the expression,
or, alternately, the sum of products (SOP) form.
e.g., for the function Y = A ⊕ B, the truth table is:
A B   Y = A ⊕ B
0 0       0
0 1       1
1 0       1
1 1       0

This is equivalent to
Y = A′·B + A·B′
This is the minterm form of the function. It is obtained by ORing
together all the minterms.
Minterms are the AND terms corresponding to each 1 in the function
column.
71
Minterms are obtained by ANDing together the variables, or their complements, for each row that has a 1 in the function column. If a variable has value 1 in that row, the variable itself is taken; if it has value 0, its complement is taken.
The minterms are then ORed together to give the function specified
in the truth table.
Note:
1. Each minterm contains all the variables or their complements,
exactly once.
2. Each minterm is unique, except for permutation of the variables.
Therefore, the minterm form of the function is unique.
3. Any expression which contains only variables in the minterm
(sum of products) form, where each product term contains all
the variables, or their complement, exactly once, is a minterm
expression. This means that, no matter how a function is derived, if it contains only minterms then it must be a minterm
form of the function.
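The rule is mechanical, so it is easy to automate. The following C sketch (the truth table here happens to be Y = A XOR B) walks a 2-variable truth table and prints the minterm form, writing a complemented variable with a prime.

#include <stdio.h>

int main(void) {
    int Y[4] = {0, 1, 1, 0};            /* Y for AB = 00, 01, 10, 11 */
    int first = 1;

    printf("Y = ");
    for (int row = 0; row < 4; row++) {
        if (!Y[row]) continue;          /* minterms come only from rows where Y = 1 */
        int A = (row >> 1) & 1;
        int B = row & 1;
        if (!first) printf(" + ");
        /* take the variable if it is 1 in this row, its complement if 0 */
        printf("%s.%s", A ? "A" : "A'", B ? "B" : "B'");
        first = 0;
    }
    printf("\n");                       /* prints  Y = A'.B + A.B'  */
    return 0;
}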
72
A dual form of the preceding, called a maxterm form, or product
of sums (POS) form can also be written. The maxterm form of
a function can be obtained from the truth table by applying the
principle of duality to the way described previously for deriving the
minterm form of a function.
Equivalently, we can write down the minterm expression for the complement of the function, Y′, and apply de Morgan's theorem; e.g.,

A B   Y = A ⊕ B   Y′
0 0       0        1
0 1       1        0
1 0       1        0
1 1       0        1

In minterm form,
Y′ = A′·B′ + A·B
Complementing both sides,
Y = (Y′)′ = (A′·B′ + A·B)′
Applying de Morgan's theorem,
Y = (A′·B′)′ · (A·B)′
  = (A + B) · (A′ + B′)
This is the maxterm form of the switching function Y = A ⊕ B
73
The maxterm form can more easily be obtained from the truth table
by ORing together all the variables or their complements which give
a zero for the function; if the variable has value 0 then it is ORed
directly, if it has a value 1, it is complemented.
Each term is called a maxterm, or a sum term. The function is equal
to the AND of all the maxterms.
Example: Find the minterm and maxterm expressions corresponding to the following truth table:
     A B C   Y   Minterm       Maxterm
0    0 0 0   1   A′·B′·C′
1    0 0 1   0                 A + B + C′
2    0 1 0   1   A′·B·C′
3    0 1 1   1   A′·B·C
4    1 0 0   0                 A′ + B + C
5    1 0 1   0                 A′ + B + C′
6    1 1 0   1   A·B·C′
7    1 1 1   1   A·B·C

Minterm form:
Y = A′·B′·C′ + A′·B·C′ + A′·B·C + A·B·C′ + A·B·C
Maxterm form:
Y = (A + B + C′) · (A′ + B + C) · (A′ + B + C′)
74
Sometimes the minterm and maxterm expression are written in a
kind of “shorthand,” where the values (0 or 1) of the set of variables
is used to form a binary number, the decimal equivalent of which
designates the appropriate minterm or maxterm. e.g., the minterm
form of the previous function is written as:
Y = Σ (0, 2, 3, 6, 7)
The maxterm form is written as:
Y = Π (1, 4, 5)
Note that the order in which the variable are written down in the
truth table is important, in this case. The numbers which appear
in the minterm form do not appear in the maxterm form, and vice
versa.
The minterm or maxterm form of the function is not usually the simplest or most concise; e.g., the preceding function could be simplified
to the following:
Y = A′·B′·C′ + A′·B·C′ + A′·B·C + A·B·C′ + A·B·C
  = A′·C′ + B
Systematic ways exist to reduce the complexity of minterm or maxterm forms of switching functions but we will not discuss them here.
The problem is computationally complex (NP-hard).
75
Practical examples
(1) Design of a half adder
A binary half adder is a switching circuit which will add together two
binary digits (called binary bits), producing two output bits, a sum
bit, S, and a carry bit, C. It has the following truth table:
A B S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
It is immediately obvious from the truth table that the two functions
S and C can be implemented as S = A ⊕ B, and C = A · B as
shown in the following, which also shows a logic symbol for a half
adder. (Logic symbols for devices more complex than the basic logic
gates are usually just boxes with appropriately labeled inputs and
outputs).
[Figure: (a) a half adder circuit: an XOR gate producing S = A ⊕ B and an AND gate producing C = A · B. (b) the half adder logic symbol: a box with inputs A, B and outputs S, C.]
76
(2) Design of a full adder
A binary full adder is a switching circuit which will add together two
binary digits (called binary bits), and a third bit called a carry bit
which may have come from a previous full adder. It produces both
a sum bit and a carry bit as output. The full adder therefore has 3
inputs A, B and C where C is the carry bit, and 2 outputs; the sum,
S, and the carry C+. It is possible to connect N such full adders
together to add two N bit numbers.
It should be immediately obvious that the sum bit for a full adder
can be obtained using two half adders, one which adds digits A and
B together producing, say, Z as the sum and the other adding Z and
C together to form the sum of A, B and C. Clearly, a carry output
should be produced when either of the half adders produces a carry, so
the carry output for the full adder can be obtained by ORing together
the carry outputs of the two half adders. Such an implementation of a
full adder is shown in the following:
[Figure: a full adder built from two half adders. The first half adder adds A and B, producing an intermediate sum Z and a carry; the second half adder adds Z and C, producing the sum S and a carry; the two carries are ORed together to give C+.]
77
The preceding implementation relied on our knowledge of the half
adder.
We will now consider the design of a full adder starting from its
description as a truth table:
A B C   S  C+
0 0 0   0  0
0 0 1   1  0
0 1 0   1  0
0 1 1   0  1
1 0 0   1  0
1 0 1   0  1
1 1 0   0  1
1 1 1   1  1

We can write the outputs in minterm form as:
S  = A′·B′·C + A′·B·C′ + A·B′·C′ + A·B·C
C+ = A′·B·C + A·B′·C + A·B·C′ + A·B·C
These functions can be implemented directly as shown in the next
slide.
78
[Figure: a two-level AND-OR implementation of the full adder. Each minterm of S and of C+ is formed by a 3-input AND gate fed with A, B, C or their complements; the AND outputs are ORed together to produce S and C+ respectively.]
Note that this implementation of the full adder requires more logic
gates than the implementation shown earlier, but the circuit is implemented using only three “levels” of logic; a NOT level (not shown),
an AND level and an OR level. The previous implementation would
require more logic levels if it were implemented using only AND-OR-NOT (AON) logic. This implementation would consequently be
a “slower” implementation, due to the inherent delay in each logic
gate.
The function S cannot be simplified any further using only AON
logic, but C+ can be rewritten as:
C+ = B · C + A · C + A · B
79
Both the half adder and the full adder are useful functional blocks.
In particular, the full adder is often used as the basic building block
to construct larger adders which add many bits simultaneously. The
following figure shows the implementation of a four bit adder, which
adds together two four bit numbers (A3A2 A1A0, and B3B2B1B0)
and produces a 5 bit result (S4S3S2S1 S0), using four full adders.
In general, n full adders are required to implement an adder for
two n-bit words. The general expression for the sum and carry bits
generated in the ith addition stage is
Si = (Ai ⊕ Bi) ⊕ Ci
Ci+1 = Ai · Bi + ((Ai + Bi) · Ci)
[Figure: a 4-bit ripple carry adder built from four full adders. Stage i adds Ai, Bi, and the carry from stage i-1 (stage 0 has a carry-in of 0), producing Si and a carry into stage i+1; the final carry out is S4.]
80
Note that, for this type of adder, before the result of the add operation is correct, the carry result must be allowed to propagate through
all of the full adders. Because of this, this implementation of a wide
word adder is called a ripple carry adder.
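The structure translates almost directly into C. The following sketch (an illustration, not production code) chains a one-bit full adder so the carry ripples from bit 0 upward; the 8-bit width is arbitrary.

#include <stdio.h>

/* One-bit full adder: S = (A xor B) xor C,  C+ = A.B + (A+B).C */
static void full_adder(int a, int b, int cin, int *s, int *cout) {
    *s    = (a ^ b) ^ cin;
    *cout = (a & b) | ((a | b) & cin);
}

static unsigned ripple_add(unsigned A, unsigned B, int nbits) {
    unsigned sum = 0;
    int carry = 0;
    for (int i = 0; i < nbits; i++) {   /* stage i must wait for stage i-1's carry */
        int s;
        full_adder((A >> i) & 1, (B >> i) & 1, carry, &s, &carry);
        sum |= (unsigned)s << i;
    }
    sum |= (unsigned)carry << nbits;    /* the final carry is the top sum bit */
    return sum;
}

int main(void) {
    printf("%u\n", ripple_add(200, 100, 8));   /* prints 300 */
    return 0;
}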
It is possible, of course, to design an adder which has no such ripple
at all, simply by creating a truth table for each bit of the n-bit
adder and implementing each bit from, say, its minterm form. This
becomes quite tedious, however, for large n (specification of an 8-bit
adder would require 9 truth tables, each with 256 lines).
It is also possible to devise logic functions which generate only
the n carry bits for an add operation on two n bit numbers. These
functions are called carry look-ahead functions, and are commonly
used in the construction of fast wide word adders. Such carry look-ahead adders are commonly used to implement the add operations on
the fastest computers. As we will see, it is possible to connect the
carry look-ahead units in a tree-like fashion to give reasonably fast
carry generation in a much smaller time than required in the ripple
carry adder.
81
Carry look-ahead adders
Logic expressions for this “carry look-ahead” function can be derived
from the logic functions for the full adder. Recall that, for 2 n-bit
words A = (An−1 , An−2 , . . . , A1 , A0) and B = (Bn−1 , Bn−2, . . . , B1, B0)
we saw earlier that the bit Si for the ith digit of the sum S was
Si = (Ai ⊕ Bi) ⊕ Ci
and the carry Ci+1 was
Ci+1 = Ai · Bi + (Ai + Bi) · Ci
The expression for the carry, Ci+1, can be rewritten as
Ci+1 = Gi+1 + Pi+1 · Ci
where
    Gi+1 = Ai · Bi   is the carry generation term
and
    Pi+1 = Ai + Bi   is the carry propagation term
Note that Gi+1 and Pi+1 depend only on Ai and Bi.
From this recurrence relation, we see that, if we have an initial carry
in C0 = G0 then
C 1 = G1 + P 1 · G0
C2 = G2 + P2 · (G1 + P1 · G0)
= G2 + P 2 · G1 + P 2 · P 1 · G0
C 3 = G3 + P 3 · G2 + P 3 · P 2 · G1 + P 3 · P 2 · P 1 · G0
...
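The expansion can be checked with a short C sketch for one 4-bit group. (The indexing below starts the generate and propagate terms at 0 rather than 1, which only shifts the subscripts; the input values are arbitrary.)

#include <stdio.h>

int main(void) {
    int A[4] = {1, 1, 0, 1};            /* A = 1011, bit 0 listed first */
    int B[4] = {1, 0, 1, 0};            /* B = 0101                     */
    int G[4], P[4], C[5];

    for (int i = 0; i < 4; i++) {       /* G = A.B, P = A + B           */
        G[i] = A[i] & B[i];
        P[i] = A[i] | B[i];
    }

    C[0] = 0;                           /* initial carry in             */
    /* every carry is computed directly from G, P and C0: no ripple    */
    C[1] = G[0] | (P[0] & C[0]);
    C[2] = G[1] | (P[1] & G[0]) | (P[1] & P[0] & C[0]);
    C[3] = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & C[0]);
    C[4] = G[3] | (P[3] & C[3]);        /* carry out of the 4-bit group */

    for (int i = 1; i <= 4; i++)
        printf("C%d = %d\n", i, C[i]);
    return 0;
}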
82
Note that the product terms for the ith carry bit correspond to AND
gates with inputs numbering from 2 to i; consequently for large i the
AND gates will require a large number of inputs. (They cannot be
connected in series because this would reintroduce a ripple effect.)
Fortunately, carry look-ahead units can be cascaded in a kind of tree
fashion, as shown below.
The fact that this cascading is possible is apparent from the original
relation, Ci+1 = Gi+1 + Pi+1 · Ci. All that is necessary at any
level, say, l, is to have the term Cl−1 available. This can come from a
previous level of carry look-ahead units, and replaces the initial carry
input, C0. (Note that the adders used with a carry look-ahead unit should have outputs for generate, G, and propagate, P, rather than for the carry, Cout.)
The following figure shows an implementation of part (half) of a 16
bit adder, using 4-bit carry look-ahead functions:
[Figure: half of a 16-bit adder: four-bit adder blocks (each with P and G outputs) feed two first-level 4-bit carry look-ahead units, which produce the carries C1–C3 within their group as well as group Pout and Gout signals; a second-level carry look-ahead unit combines the group signals and supplies each group's Cin]
83
Combinational Logic — Using MSI circuits:
When designing logic circuits, "discrete logic gates" (i.e., individual AND, OR, NOT, etc. gates) are often neither the simplest nor the most effective devices we could use. Many standard MSI (medium scale integrated) functions are available which can do many of the things commonly required in logic circuits.
These devices, or similar devices, are often used as components of
“programmable logic devices.”
The digital multiplexer
One MSI function which has been available for a long time is the
digital selector, or multiplexer. It is the digital equivalent of the
rotary switch or selector switch (e.g., the channel selector on a TV
set). Its function is to accept a binary number as a “selector input,”
and present the logic level connected to that input line as output
from the data selector.
A circuit diagram for a possible 4-line to 1-line data selector/multiplexer
(abbreviated as MUX for multiplexer) is shown in the following slide.
Here, the output Y is equal to the input I0, I1, I2, or I3, depending on whether the select lines S1 and S0 have the values 00, 01, 10, or 11, respectively. That is, the output Y is selected to be equal to the input on the line given by the binary value of the select lines (or address) S1S0.
84
[Figure: gate-level circuit of the 4-line to 1-line MUX: inverters generate S̄1 and S̄0, four AND gates each combine one data input I0–I3 with the appropriate true or complemented select lines, and an OR gate combines the AND outputs to form Y]
The logic equation for this 4-line to 1-line MUX is:
Y = I0 · S̄1 · S̄0 + I1 · S̄1 · S0 + I2 · S1 · S̄0 + I3 · S1 · S0
This device can be used simply as a data selector/multiplexer, or it can be used to perform logic functions. Its simplest application is to implement a truth table directly; e.g., with a 4-line to 1-line MUX, it is possible to implement any 2-variable function directly, simply by connecting I0, I1, I2, I3 to logic 1 or logic 0, as dictated by the truth table. In this way, a MUX can be used as a simple look-up table for switching functions. This facility makes the MUX a very general purpose logic device.
Connecting the inputs to a 4-bit memory makes the device a programmable logic device.
85
Example: Use a 4 line to 1 line MUX to implement the function
shown in the following truth table (Y = Ā · B̄ + A · B):

A B | Y
0 0 | 1 = I0
0 1 | 0 = I1
1 0 | 0 = I2
1 1 | 1 = I3

[Figure: a 4-line to 1-line MUX with I0 = 1, I1 = 0, I2 = 0, I3 = 1, and A, B driving the select inputs S1, S0; output Y]
Simply connecting I0 = 1, I1 = 0, I2 = 0, I3 = 1, and the inputs A and B to the S1 and S0 selector inputs of the 4-line to 1-line MUX implements this truth table, as shown above.
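As a small sketch of my own (not from the notes), the MUX-as-lookup-table idea is easy to check in software; mux4 is an invented helper name.

    def mux4(i, s1, s0):
        # 4-line to 1-line MUX: i is the tuple (I0, I1, I2, I3)
        return i[2 * s1 + s0]

    # Implement Y = Ā·B̄ + A·B by wiring I0..I3 = 1, 0, 0, 1 and using A, B as selects
    inputs = (1, 0, 0, 1)
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, mux4(inputs, a, b))
    # prints the truth table above: 0 0 1, 0 1 0, 1 0 0, 1 1 1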
The 4-line to 1-line MUX can also be used to implement any function
of three logical variables, as well. To see this, we need note only that the only possible functions of one variable C are C, C̄, and the constants 0 and 1 (i.e., C, C̄, C + C̄ = 1, and C · C̄ = 0). We need only connect the appropriate value, C, C̄, 0, or 1, to I0, I1, I2, I3 to obtain a function of 3 variables. The MUX still behaves as a table lookup device; it is now simply looking up values of another variable.
86
Example: Implement the function
Y (A, B, C) = Ā · B̄ · C + Ā · B · C̄ + A · B̄ · C̄ + A · B · C
using a 4-line to 1-line MUX.
Here, again, we use the A and B variables as data select inputs. We
can use the above equation to construct the table shown below.
The residues are what is "left over" in each minterm when the "address" variables are taken away. To implement this circuit, we connect I0 and I3 to C, and I1 and I2 to C̄, as shown:

Input | "Address" | Other variables (residues)
I0    | Ā · B̄     | C
I1    | Ā · B     | C̄
I2    | A · B̄     | C̄
I3    | A · B     | C

[Figure: a 4-line to 1-line MUX with C, C̄, C̄, C connected to I0–I3 and A, B driving S1, S0; output Y]
In general, a 4-input MUX can give any function of 3 variables, an 8-input MUX can give any function of 4 variables, and a 16-input MUX any function of 5 variables.
87
Example: Use an 8 input MUX to implement the following equation:
Y = Ā·B̄·C̄·D + Ā·B̄·C·D̄ + Ā·B·C̄·D + Ā·B·C̄·D̄ +
    A·B̄·C̄·D̄ + A·B̄·C·D + A·B·C̄·D + A·B·C̄·D̄
Again, we will use A, B, C as data select inputs, or address inputs,
connected to S2, S1 and S0, respectively.
Input | Address  | Residues
I0    | Ā·B̄·C̄    | D
I1    | Ā·B̄·C    | D̄
I2    | Ā·B·C̄    | D + D̄ = 1
I3    | Ā·B·C    | (none)
I4    | A·B̄·C̄    | D̄
I5    | A·B̄·C    | D
I6    | A·B·C̄    | D + D̄ = 1
I7    | A·B·C    | (none)

[Figure: an 8-line to 1-line MUX with D, D̄, 1, 0, D̄, D, 1, 0 connected to I0–I7 and A, B, C driving S2, S1, S0; output Y]
Address values of A, B, C with no residues corresponding to them in the above table must have logic value 0 connected to the corresponding data input. The select variables A, B, C must be connected to S2, S1, and S0, respectively.
88
MUX “trees”
In practice, 16-line to 1-line MUXs are about the largest which can be reasonably constructed as a single circuit.
It is possible to use a “tree” of smaller MUX’s to make arbitrarily
large MUX’s. The following shows an implementation of a 16 line to
1 line MUX using five 4 line to 1 line MUX’s.
[Figure: a 16-line to 1-line MUX built from five 4-line to 1-line MUXs: four first-level MUXs take inputs I0–I3, I4–I7, I8–I11, and I12–I15, all selected by S1 and S0; a fifth MUX selects among their outputs using S3 and S2]

89
Decoders (demultiplexers):
Another commonly used MSI device is the decoder. Decoders, in
general, transform a set of inputs into a different set of outputs,
which are coded in a particular manner; e.g., certain decoders are
designed to decode binary or BCD coded numbers and produce the
correct output to display a digit on a 7 segment (calculator type)
display.
Normally, however, the term “decoder” implies a device which performs, in a sense, the inverse operation of a multiplexer. A decoder
accepts an n digit number as its n “select” inputs and produces an
output (usually a logic 0) at one of its 2^n possible outputs. Decoders are usually referred to as n-line to 2^n-line decoders; e.g., a 3 line to 8
line decoder. This type of decoder is really a kind of binary to unary
decoder.
Most decoders have inverted outputs, so the selected output is set to logic 0, while all the other outputs remain at logic 1. As well, most decoders have an "enable" input Ē, which "enables" the operation of the decoder — when the Ē input is set to 0, the device behaves as a decoder and selects the output determined by the select inputs; when the Ē input is set to 1, the outputs of the decoder are all set to 1. (The bar over the E indicates that it is an "active low" input; that is, a logic 0 enables the function.) The enable input allows decoders to be connected together in a treelike fashion, much as we saw for MUX's.
90
A typical 3 line to 8 line decoder with an enable input behaves according to the following truth table, and has the circuit symbol as
shown.
Ē S2 S1 S0 | O0 O1 O2 O3 O4 O5 O6 O7
1  x  x  x |  1  1  1  1  1  1  1  1
0  0  0  0 |  0  1  1  1  1  1  1  1
0  0  0  1 |  1  0  1  1  1  1  1  1
0  0  1  0 |  1  1  0  1  1  1  1  1
0  0  1  1 |  1  1  1  0  1  1  1  1
0  1  0  0 |  1  1  1  1  0  1  1  1
0  1  0  1 |  1  1  1  1  1  0  1  1
0  1  1  0 |  1  1  1  1  1  1  0  1
0  1  1  1 |  1  1  1  1  1  1  1  0

[Figure: circuit symbol for the 3-line to 8-line decoder, with select inputs S0–S2, active-low enable Ē, and active-low outputs O0–O7]
Note that, when the Ē input is enabled, an output of 0 is produced corresponding to each minterm of S2, S1, S0. These minterms can be combined using other logic gates to form any required logic function of the input variables. In fact, the minterms can be used to produce several functions at the same time.
Using de Morgan's theorem, we can see that when the outputs are inverted, as is normally the case, the minterm form of a function can be obtained by NANDing the required terms together.
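A behavioural sketch of my own (not from the notes) showing the decoder-plus-NAND idea; the function names are invented, and the minterms chosen match the Y1 column of the example that follows.

    def decode3to8(e_bar, s2, s1, s0):
        # 3-line to 8-line decoder with active-low enable and active-low outputs
        if e_bar == 1:
            return [1] * 8                      # disabled: every output stays at 1
        sel = 4 * s2 + 2 * s1 + s0
        return [0 if k == sel else 1 for k in range(8)]

    def nand(*bits):
        return 0 if all(bits) else 1

    # Y1 uses minterms 1, 2, 4 of (A, B, C): NAND the corresponding (inverted) outputs
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                o = decode3to8(0, a, b, c)
                print(a, b, c, nand(o[1], o[2], o[4]))

Only when the current input is one of the chosen minterms is one of the NANDed lines low, so the NAND output is 1 exactly for those rows.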
91
Example: An implementation of the functions defined by the following truth table, using a decoder and NAND gates, is shown below:
A B C | Y1 Y2
0 0 0 |  0  1
0 0 1 |  1  1
0 1 0 |  1  0
0 1 1 |  0  0
1 0 0 |  1  0
1 0 1 |  0  1
1 1 0 |  0  1
1 1 1 |  0  0
[Figure: A, B, C drive the S2, S1, S0 inputs of a 3-line to 8-line decoder; one NAND gate combines the decoder outputs for the minterms of Y1 and another combines those for Y2]

92

Note that additional functions of the same variables would only require another NAND gate for each function.
Read only memory (ROM):
Often, devices requiring 8 or more input variables are implemented
using a ROM. A simple type of ROM can be constructed from a
decoder, a MUX, and a number of wires. We will look at a small (16
bit) ROM constructed in this way. Normally, memory is arranged in
a square array, as shown in the following slide. This general organization is used for other types of memory, as well.
To use the ROM to implement a logic function, the address lines
are used as the variable inputs, and the contents of the memory
are the function values.
Usually the memory has a word length of more than 1 bit; typically
4 or 8 bits, so several functions can be implemented simultaneously.
In the following figure, it is assumed that the decoder produces a
logic 1 as output when the input code selects that output; otherwise
it produces logic 0, and that a logic 1 output “outvotes” a logic 0
output, in the sense that if both are present on the same wire, the
logic 1 will dominate. (“Real” circuits usually have the opposite
behavior). A bit is “programmed” when a link is present.
93
[Figure: a 16-bit ROM: address bits A3, A2 (inputs A, B) drive a 2-line to 4-line decoder whose outputs O0–O3 select a row; programmable links join the rows and columns; address bits A1, A0 (inputs C, D) drive a 4-line to 1-line MUX which selects the column and produces the output Y]
In this example, the function is
Y = Ā·B̄·C·D̄ + Ā·B·C·D̄ + A·B̄·C̄·D + A·B̄·C·D̄
corresponding to memory locations 0010, 0110, 1001, and 1010.
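As a sketch of my own (not from the notes), a ROM used this way is simply a lookup table indexed by the input variables; the names below are invented for the illustration.

    # 16 x 1-bit ROM; a 1 is stored at addresses 0010, 0110, 1001 and 1010
    rom = [0] * 16
    for addr in (0b0010, 0b0110, 0b1001, 0b1010):
        rom[addr] = 1

    def y(a, b, c, d):
        # Evaluate the function by using (A, B, C, D) as the ROM address (A is the high bit)
        return rom[8 * a + 4 * b + 2 * c + d]

    print(y(0, 0, 1, 0), y(1, 0, 0, 1), y(1, 1, 1, 1))   # 1 1 0

A wider ROM word would simply store one bit per function at each address, giving several functions of the same variables at once.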
A type of programmable read-only memory uses small “fuse links” to
connect the horizontal and vertical wires at each intersection. This
device is “programmed” by passing sufficient current through a link
to “blow” the fuse. The link could also be a transistor which could
be turned on or off, allowing a read-write type of memory to be
implemented. We will see logic devices which could be used for this
purpose shortly.
94
Programmable logic arrays (PLA’s)
The ROM implementation of a function may become quite expensive
for functions with a large number of variables, because all potential
minterms of the function are implemented, whether or not they are
needed. A programmable logic array (PLA) requires that only the
minterms required for a function be implemented, and allows the
implementation of several functions simultaneously. Moreover, the
functions can be implemented directly from their minterm forms (although it is often possible to eliminate some of the minterms, further
decreasing the cost of the PLA).
The PLA can be considered as a direct SOP (or POS) implementation of a set of switching functions, with a set of AND functions followed by a set of OR functions. A PLA is often said to have an "AND" plane followed by an "OR" plane.
In practice, either NAND or NOR gates are used, with the resulting
PLA said to be a NAND/NAND or a NOR/NOR device. The next
slide shows a full adder implemented using a NAND/NAND PLA.
Note that, since the full adder does not require the minterm Ā·B̄·C̄, this minterm is not included in the "AND" plane of the PLA. Note
also that the PLA can implement a function in POS form directly,
without reducing the function to minterm form. This often leads to
opportunities for minimizing the area of a PLA. Also, a PLA can
implement additional functions of the same set of variables simply
by adding another logic gate to the “OR” plane.
95
The PLA is an efficient device for the implementation of several
functions of the same set of variables.
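Before the figure, here is a behavioural sketch of my own (not from the notes) of the same two-plane idea, using the full-adder terms that appear in the PLA shown below; all names are invented for the illustration.

    from itertools import product

    # AND plane: the seven minterms of (A, B, C) used by the full adder (A'B'C' is not needed).
    # Each pattern entry is 1 for the variable itself, 0 for its complement.
    and_plane = {
        "A'B'C": (0, 0, 1), "A'BC'": (0, 1, 0), "A'BC": (0, 1, 1),
        "AB'C'": (1, 0, 0), "AB'C": (1, 0, 1), "ABC'": (1, 1, 0), "ABC": (1, 1, 1),
    }
    # OR plane: which product terms feed each output
    or_plane = {
        "S":  ["A'B'C", "A'BC'", "AB'C'", "ABC"],
        "C+": ["A'BC", "AB'C", "ABC'", "ABC"],
    }

    def pla(a, b, c):
        terms = {name: all((a, b, c)[i] == v for i, v in enumerate(pattern))
                 for name, pattern in and_plane.items()}
        return {out: int(any(terms[t] for t in used)) for out, used in or_plane.items()}

    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, pla(a, b, c))

Adding a third output function would only require another entry in the OR plane, mirroring the extra OR-plane gate described above.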
[Figure: a NAND/NAND PLA implementing the full adder: the AND plane forms the seven product terms ĀB̄C, ĀBC̄, ĀBC, AB̄C̄, AB̄C, ABC̄, ABC from inputs A, B, C (the unused minterm ĀB̄C̄ is omitted), and the OR plane combines the appropriate terms to produce S and C+]

96
Sequential Logic
Sequential logic differs from combinational logic in that the output
of the logic device is dependent not only on the present inputs to the
device, but also on past inputs; i.e., the output of a sequential logic
device depends on its present internal state and the present inputs.
This implies that a sequential logic device has some kind of memory
of at least part of its “history” (i.e., its previous inputs).
A simple memory device can be constructed from combinational devices with which we are already familiar. By a memory device, we
mean a device which can remember if a signal of logic level 0 or 1 has
been connected to one of its inputs, and can make this fact available
at an output. A very simple, but still useful, memory device can be
constructed from a simple OR gate, as shown:
[Figure: an OR gate whose output Q is fed back to one of its inputs; the other input is A]
In this memory device, if A and Q are initially at logic 0, then Q
remains at logic 0. However if the single input A ever becomes a
logic 1, then the output Q will be logic 1 ever after, regardless of
any further changes in the input at A. In this simple memory, the
output is a function of the state of the memory element only; after
the memory is “written” then it cannot be changed back. However,
it can be “read.” Such a device could be used as a simple read only
memory, which could be “programmed” only once.
97
Often a state table or timing diagram is used to describe the behaviour of a sequential device. Following is both a state table and a
timing diagram for this simple memory shown previously. The state
table shows the state which the device enters after an input (the
“next state”), for all possible states and inputs. For this device, the
output is the value stored in the memory.
State table

Present State (Qn) | Input (A) | Next State (Qn+1) | Output
        0          |     0     |         0         |   0
        0          |     1     |         1         |   1
        1          |     0     |         1         |   1
        1          |     1     |         1         |   1

[Timing diagram: input A and output Q versus time; once A becomes 1, Q goes to 1 and remains 1 thereafter]
Note that the output of the memory is used as one of the inputs; this
is called feedback and is characteristic of programmable memory devices. (Without feedback, a “permanent” electronic memory device
would not be possible.) The use of feedback can introduce problems
which are not found in strictly combinational circuits. In particular,
it is possible to inadvertently construct devices for which the output
is not determined by the inputs, and for which it is not possible to
predict the output. A simple example is an inverter with its input
connected to its output. Such a device is logically inconsistent; in a
physical implementation the device would probably either oscillate
from 1 to 0 to 1 · · · or remain at an intermediate value between logic
0 and logic 1, producing an invalid and erroneous output.
98
The R-S latch
More complicated, stable, memory elements could be constructed
using simple logic gates. In particular, simple, alterable memory
cells can be readily constructed. One basic (but not often used in
this form) memory device is the following, called an RS (reset-set)
latch, or flip flop. It is the most basic of all the class of circuits
which are called latches or flip flops. A logic diagram for this device
is shown in the following, together with its circuit symbol and state
table.
S R | Qn+1  Q̄n+1
0 0 |  Qn    Q̄n
0 1 |  0     1
1 0 |  1     0
1 1 |  0     0

[Figure: an RS latch built from two cross-coupled NOR gates, with inputs R and S and outputs Q and Q̄, together with its circuit symbol]
We can analyze this circuit to determine all possible outputs Q and Q̄ for all inputs to R and S; e.g., suppose we raise S to logic 1, with Q at logic 0. Then Q̄ must be 0 (the output of a NOR gate is 0 if any input is 1), and Q must be 1. If S is returned to 0, then Q̄ remains 0 and Q remains 1; i.e., the RS latch "remembers" that S was set to 1 while R was 0. If R is raised to logic 1 while S is at logic 0, then Q is set to logic 0 and Q̄ is set to logic 1; i.e., the latch is reset. If both R and S are raised to logic 1, then both Q and Q̄ will be at logic 0. This output is inconsistent with the identification of Q and Q̄ as complementary outputs, and therefore should be avoided.
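The cross-coupled NOR behaviour can be imitated in a few lines (a sketch of my own, not from the notes); the latch is settled by iterating the two gate equations a few times.

    def nor(x, y):
        return 1 - (x | y)

    def rs_latch(r, s, q, q_bar):
        # Settle the cross-coupled NOR gates for the given R, S and current outputs
        for _ in range(4):                     # a few passes are enough to stabilize
            q, q_bar = nor(r, q_bar), nor(s, q)
        return q, q_bar

    q, qb = 0, 1
    q, qb = rs_latch(r=0, s=1, q=q, q_bar=qb); print(q, qb)   # set:   1 0
    q, qb = rs_latch(r=0, s=0, q=q, q_bar=qb); print(q, qb)   # hold:  1 0
    q, qb = rs_latch(r=1, s=0, q=q, q_bar=qb); print(q, qb)   # reset: 0 1

Note that both gate outputs are updated "simultaneously" here, which is also why the R = S = 1 to R = S = 0 transition discussed next is problematic in the real circuit.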
99
Race conditions
Clearly, setting both R and S to 1 should be avoided to prevent
logical inconsistency.
However, a far more serious problem occurs if R and S change from logic 1 to logic 0 simultaneously. This situation is called a race condition. If both R and S are at logic 1, then Q and Q̄ are at logic 0. When R and S are both set to 0, then both Q and Q̄ should switch to logic 1. However, when they switch to logic 1, they should then cause a switch back to logic 0 again, because of the logic 1 input to each NOR gate. If both NOR gates were identical, this would occur over and over again, indefinitely — an oscillation of the outputs Q and Q̄ from 0 to 1 and back, with a period depending on the time delay of the NOR gates. In practice, one gate is a little faster than the other, and the final outcome depends on the relative speeds of the two gates; the final outcome cannot be predicted.
100
Clocked latches
There are three control signals often associated with flip flops; they are the clock or enable signal, the preset signal, and the clear signal. The clock signal is ANDed with the inputs R and S, so that these signals can reach the flip-flop only when the clock pulse is 1; at all other times the inputs to the flip-flop are 0, and it retains its previous value.
The clock input is used for several purposes; it is used to “capture”
data which is available for only a short time; and it is used to synchronize several flip-flops so they can all operate simultaneously, or
synchronously. The following figure shows a circuit diagram for a
clocked RS flip-flop, together with its circuit symbol.
Note the special symbol, similar to an arrowhead, which denotes the
clock input.
[Figure: a clocked RS flip-flop: R and S are each ANDed with the clock before the cross-coupled NOR latch; the circuit symbol shows R, S, a clock input marked with the arrowhead symbol (>), and outputs Q and Q̄]

101
Asynchronous preset and clear
The preset and clear signals are used to set the state of the flip-flop regardless of the state of the clock input. Because they are not
synchronized by the clock pulse, they are said to be asynchronous.
The following figure shows how a simple clocked RS flip flop with
preset and clear inputs could be constructed from simple logic gates.
[Figure: a clocked RS flip-flop with asynchronous Preset and Clear inputs added directly to the latch gates, together with its circuit symbol]
The preset and clear act as an unclocked RS flip-flop, and consequently a logic 1 should not be applied to both at the same time.
A dual form of the RS flip flop, the R̄-S̄ flip flop, can be implemented with NAND gates, as follows:
[Figure: an R̄-S̄ latch built from two cross-coupled NAND gates, with inputs R̄ and S̄ and outputs Q and Q̄]
Clock inputs, together with preset and clear inputs could similarly be
provided for this device. Since the inputs to this device are inverted,
the preset and clear inputs would also be inverted.
102
The D Latch and the D flip-flop
It is possible to create a latch which has no race condition, simply by providing only one input to an RS latch and generating an inverted signal to present to the other terminal of the latch. In this case, the S and R inputs are always inverted with respect to each other, and no race condition can occur. A circuit for a D latch follows:
[Figure: a D latch: the D input feeds S directly and R through an inverter, both gated by the clock, driving a cross-coupled NOR latch; the circuit symbol shows D and clock inputs and outputs Q and Q̄]
The D latch is used to capture, or “latch” the logic level which is
present on the Data line when the clock input is high. If the data
on the D line changes state while the clock pulse is high, then the
output, Q, follows the input, D. This effect can be seen in the timing
diagram in the next slide.
The D flip-flop, while a slightly more complicated circuit, performs
a function very similar to the D latch. In the case of the D flip-flop,
however, the rising edge of the clock pulse is used to “capture” the
input to the flip flop. This device is very useful when it is necessary
to “capture” a logic level on a line which is very rapidly varying.
103
The following figure shows a timing diagram for a D-type flip-flop. This type of device is said to be "edge triggered" — either rising edge triggered (i.e., a 0–1 transition) or falling edge triggered (i.e., a 1–0 transition) devices are available.
[Timing diagram: CLOCK, D, and Q versus time for (a) the D latch, where Q follows D whenever the clock is high, and (b) the D flip-flop, where Q changes only on the clock edge]
Both the D latch and the D flip-flop have the following truth table:

Preset Clear Clock  D |  Q   Q̄
  0      1     x    x |  1   0
  1      0     x    x |  0   1
  0      0     x    x |  1   1
  1      1  ↑ or 1  0 |  0   1
  1      1  ↑ or 1  1 |  1   0
  1      1     0    X |  Q0  Q̄0
The symbol ↑ means a leading edge, or 0–1 transition, at the clock input to the flip-flop. For a D latch, it would be the level 1.
104
The JK flip-flop
The JK flip flop is the most versatile flip-flop, and the most commonly
used flip flop when discrete devices are used to implement arbitrary
state machines. Like the RS flip-flop, it has two data inputs, J and
K, and a clock input. It has no undefined states or race condition, however. It always behaves as if it were edge triggered, normally on the falling edge.
The JK flip-flop has the following characteristics:
1. If one input (J or K) is at logic 0, and the other is at logic 1,
then the output is set or reset (by J and K respectively), just
like the RS flip-flop, but on the (falling) clock edge.
2. If both inputs are 0, then it remains in the same state as it was
before the clock pulse occurred; again like the RS flip flop.
3. If both inputs are high, however, the flip-flop changes state whenever the (falling) edge of a clock pulse occurs; i.e., the clock pulse toggles the flip-flop.
105
There are two basic types of JK flip-flops. The first type is basically an RS flip-flop with its outputs fed back and ANDed with the J and K inputs (Q̄ with J, and Q with K). This type of JK flip-flop has no special name.
Note that the connection between the outputs and the inputs to the
AND gates determines the input conditions to R and S when J = K
= 1. This connection is what causes the toggling, and eliminates the
invalid condition which occurs in the RS flip flop. A simplified form
of this flip-flop is shown in (a) below.
The second type of JK flip-flop is called a master-slave flip flop. This
consists of two RS flip flops arranged so that when the clock pulse
enables the first, or master, latch, it disables the second, or slave,
latch. When the clock changes state again (i.e., on its falling edge)
the output of the master latch is transferred to the slave latch. Again,
toggling is accomplished by the connection of the output with the
input AND gates. An example of this type of flip-flop is shown in
(b). The circuit symbol for a JK flip flop is shown in (c).
[Figure: (a) a JK flip-flop built from a clocked RS flip-flop with Q̄ and Q fed back to AND gates on the J and K inputs; (b) a master-slave JK flip-flop made from a master latch and a slave latch clocked on opposite phases of the clock; (c) the circuit symbol for a JK flip-flop]

106
The T flip flop
This type of flip-flop is a simplified version of the JK flip-flop. It is not usually found as an IC chip by itself, but is used in many kinds of circuits, especially counters and dividers. Its only function is to toggle itself with every clock pulse (on either the leading edge or the trailing edge). It can be constructed from the RS flip-flop as shown below.
[Figure: (a) a T flip-flop built from a clocked RS flip-flop with Q fed back to R and Q̄ fed back to S; (b) its circuit symbol, with a single clock (T) input]
This flip flop is normally set, or “loaded” with the preset and clear
inputs. It can be used to obtain an output pulse train with a frequency of half that of the clock pulse train, as seen from the timing
diagram. In this example, the T flip flop is triggered on the falling
edge of the clock pulse.
Several T flip-flops are often connected together to form a “divide by
N” counter, where N is usually a power of 2.
107
Data registers:
The simplest type of register is a data register, which is used for
the temporary storage of a “word” of data. In its simplest form,
it consists of a set of N D flip flops, all sharing a common clock.
All of the digits in the N bit data word are connected to the data
register by an N line “data bus”. Following is a four bit data register,
implemented with four D flip flops.
[Figure: a four-bit data register: four D flip-flops with inputs I0–I3 and outputs O0–O3, all sharing a common clock]
The data register is said to be a synchronous device, because all the
flip flops change state at the same time (they share a common clock
input).
108
Shift registers
Another common form of register used in computers and in many
other types of logic circuits is a shift register. It is simply a set of
flip flops (usually D latches or RS flip-flops) connected together so
that the output of one becomes the input of the next, and so on in
series. It is called a shift register because the data is shifted through
the register by one bit position on each clock pulse. Following is a
four bit shift register, implemented with D flip flops.
[Figure: a four-bit serial-in shift register: four D flip-flops connected in series, the Q output of each feeding the D input of the next, all sharing a common clock; serial input in, serial output out]
On the leading edge of the first clock pulse, the signal on the D input is latched into the first flip-flop. On the leading edge of the next clock pulse, the contents of the first flip-flop are stored in the second flip-flop, and the signal which is then present at the DATA input is stored in the first flip-flop, and so on. Because the data is entered one bit at a time, this is called a serial-in shift register.
Since there is only one output, and data leaves the shift register one bit at a time, it is also a serial-out shift register. (Shift registers are named by their method of input and output; either serial or parallel.)
109
Parallel input can be provided through the use of the preset and
clear inputs to the flip-flop. The parallel loading of the flip-flop can
be synchronous (i.e., occurs with the clock pulse) or asynchronous
(independent of the clock pulse) depending on the design of the shift
register. Parallel output can be obtained from the outputs of each
flip-flop as shown.
[Figure: a four-bit serial-in, parallel-out shift register: the same chain of D flip-flops, with parallel outputs O0–O3 taken from the Q output of each flip-flop]
Communication between a computer and a peripheral device is often
done serially, while computation in the computer itself is usually
performed with parallel logic circuitry. A shift register can be used
to convert information from serial form to parallel form, and vice
versa. Many different kinds of shift registers can be constructed,
depending upon the particular function required.
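A serial-in, parallel-out shift register is easy to model in software (a sketch of my own, not from the notes): each clock pulse shifts every stored bit along by one position and latches the current serial input into the first flip-flop.

    def shift_in(register, bit):
        # One clock pulse of a serial-in shift register (index 0 = first flip-flop)
        return [bit] + register[:-1]

    reg = [0, 0, 0, 0]
    for bit in (1, 0, 1, 1):          # serial input stream, one bit per clock
        reg = shift_in(reg, bit)
        print(reg)                    # parallel outputs O0..O3 after each pulse
    # after four pulses all four input bits are available in parallel (newest at O0)

Reading the list element by element instead would model the serial output, which is how such a register converts between serial and parallel form.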
110
Counters — weighted coding of binary numbers
A simple binary counter can be made using T flip-flops. The flip-flops are connected so that the output of one acts as the clock for the next, and so on. In this case, the position of a flip-flop in the chain determines its weight; i.e., for a binary counter, the "power of two" it corresponds to. A 3-bit (modulo 8) binary counter could be configured with T flip-flops as shown:
[Figure: a 3-bit ripple counter: three T flip-flops, the output of each acting as the clock for the next; outputs O0, O1, O2]
Following is a timing diagram for this circuit:
[Timing diagram: CLOCK and outputs O0, O1, O2 over counts 0 through 11; O0 toggles on each falling clock edge, O1 on each falling edge of O0, and O2 on each falling edge of O1]
Note that in this counter, each flip-flop changes state on the falling edge of the pulse from the previous flip-flop. Therefore there will be a slight time delay, due to the propagation delay of the flip-flops, between the time one flip-flop changes state and the time the next one changes state; i.e., the change of state ripples through the counter, and these counters are therefore called ripple counters.
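The ripple behaviour can be imitated in software (my own sketch, not from the notes) by toggling each bit only when the previous one falls from 1 to 0:

    def clock_pulse(bits):
        # One falling clock edge applied to a ripple counter (bits[0] = O0)
        bits = bits[:]
        i = 0
        while i < len(bits):
            bits[i] ^= 1              # this flip-flop toggles
            if bits[i] == 1:          # it rose 0 -> 1: no falling edge is passed on
                break
            i += 1                    # it fell 1 -> 0: that edge clocks the next flip-flop
        return bits

    state = [0, 0, 0]
    for _ in range(9):
        state = clock_pulse(state)
        print(state)                  # counts 1, 2, ..., 7, 0, 1 (LSB first)

The while loop makes the dependence explicit: a higher-order bit cannot change until the lower-order bit has finished changing, which is the propagation delay described above.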
111
It is possible to design counters which will count up, count down,
and which can be preset to any desired number. Counters can also
be made which count in BCD, base 12 or any other number base.
A count-down counter can be made by connecting the Q̄ output of each stage to the clock input of the next in the previous counter.
Using the preset and clear inputs, and by gating the output of each T flip-flop with another logic level using AND gates (say logic 0 for counting down, logic 1 for counting up), a presettable up-down binary counter can be constructed.
The following figure shows an up-down counter, without preset or
clear:
[Figure: a four-bit up/down counter built from JK flip-flops, with count-enable and count up/down (1 = up, 0 = down) inputs; gating selects whether the Q or Q̄ output of each stage clocks the next stage; outputs O0–O3]

112
q
Synchronous counters
The counters shown previously have been “asynchronous counters”;
so called because the flip flops do not all change state at the same
time, but change as a result of a previous output. The output of
one flip flop is the input to the next; the state changes consequently
“ripple through” the flip flops, requiring a time proportional to the
length of the counter. It is possible to design synchronous counters,
using JK flip flops, where all flip flops change state at the same time;
i.e., the clock pulse is presented to each JK flip flop at the same
time. This can be easily done by noting that, for a binary counter,
any given digit changes its value (from 1 to 0 or from 0 to 1) whenever
all the previous digits have a value of 1. Following is an example of
a 4-bit binary synchronous counter.
[Figure: a four-bit synchronous binary counter: four JK flip-flops share a common clock, and AND gates ensure that each flip-flop toggles only when all lower-order outputs are 1; outputs O0–O3]

113
State machines
A “state machine” is a device in which the output depends in some
systematic way on variables other than the immediate inputs to the
device. These “other variables” are called the state variables for the
machine, and depend on the history of the machine. For example, in
a counter, the state variables are the values stored in the flip flops.
For a binary machine with n state variables, there may be as many as 2^n possible states, with each state corresponding to a unique assignment of values to the state variables.
The behavior of a state machine can be completely described by a
“state table,” or equivalently, a “state diagram.” The next slide
shows a state table which describes the operation of a modulo 8
counter; the counter has 8 states, denoted S0 to S7, a single input,
the clock input, and 3 output digits, O2, O1 and O0 . In this state
table, the entries where the clock input is 0 have been expressed on
a single line; in a full state table, this line would actually correspond
to 8 lines.
The essence of a state table can be captured in a state diagram. A
state diagram is a graph with labelled nodes and arcs; the nodes are
the states (denoted by circles, labelled with the state), and the arcs
are the possible transitions between states. The arcs are labelled
with the input which causes the transition, and the output which
results from the input. The next slide also shows a state diagram for
a modulo 8 counter.
114
input | present state | next state | outputs O2 O1 O0
  0   |      Sx       | no change  |   no change
  1   |      S0       |     S1     |   0  0  1
  1   |      S1       |     S2     |   0  1  0
  1   |      S2       |     S3     |   0  1  1
  1   |      S3       |     S4     |   1  0  0
  1   |      S4       |     S5     |   1  0  1
  1   |      S5       |     S6     |   1  1  0
  1   |      S6       |     S7     |   1  1  1
  1   |      S7       |     S0     |   0  0  0
[State diagram: the eight states S0–S7 arranged in a ring; each arc is labelled clock input/output, so a 1 input takes Si to Si+1 (modulo 8) with the new count as the output, and a 0 input leaves the state and output unchanged]

115
Designing a state machine
Typically, when we design a state machine, we first identify the required states (i.e., identify what information must be remembered),
and then consider how to go from state to state, and, finally, what
output to produce (i.e., identify state transitions and outputs). The
following examples show how a state machine can be obtained from
a written description of the device.
Example — the serial adder
The serial adder accepts as input two serial strings of digits of arbitrary length, starting with the low order bits, and produces the sum
of the two bit streams as its output. (The input bit streams could
come from, say, two shift registers clocked simultaneously.) This
device can be easily described as a state machine.
We first decide what must be “remembered” — in this case, it is
easy; all that must be remembered is whether or not there is a carry
to be added into the next highest order bits. Therefore, the device
will have two states, carry = 0 (C0), and carry = 1 (C1), as shown
below. We next identify the transitions between the states, and the
necessary outputs, also shown in the state diagram.
[State diagram: two states, C0 (carry = 0) and C1 (carry = 1); arcs are labelled input/output. C0 loops to itself on 00/0, 01/1, and 10/1, and moves to C1 on 11/0; C1 loops to itself on 01/0, 10/0, and 11/1, and returns to C0 on 00/1]

116
The corresponding state table, containing exactly the same information as the state diagram is as follows:
Present state | Inputs | Next state | Output
     C0       |  0 0   |     C0     |   0
     C0       |  0 1   |     C0     |   1
     C0       |  1 0   |     C0     |   1
     C0       |  1 1   |     C1     |   0
     C1       |  0 0   |     C0     |   1
     C1       |  0 1   |     C1     |   0
     C1       |  1 0   |     C1     |   0
     C1       |  1 1   |     C1     |   1
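The state table translates almost directly into code (a sketch of my own, not from the notes): the state is simply the saved carry.

    def serial_add(a_bits, b_bits):
        # Serial adder as a two-state machine; the bit streams arrive LSB first
        carry = 0                                  # state C0
        out = []
        for a, b in zip(a_bits, b_bits):
            out.append(a ^ b ^ carry)              # output for this transition
            carry = (a & b) | ((a | b) & carry)    # next state (C0 or C1)
        return out

    # 0110 (6) plus 1100 read LSB first (3) gives 1001 read LSB first (9)
    print(serial_add([0, 1, 1, 0], [1, 1, 0, 0]))   # [1, 0, 0, 1]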
117
Example — a sequence detector
A state machine is required which outputs a logic 1 whenever the input sequence 0101 is detected, and which outputs a 0 otherwise. The input is supplied serially, one bit at a time. The following is an
example input sequence and output sequence:
input
0 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0
output
0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0
This state machine can be designed in a straightforward way. Assume
that the machine is initially in some state, say, state A. If a 0 is input,
then this is may be the start of the required sequence, so the machine
should output a 0 and go to the next state, state B. If a 1 is input,
then this is certainly not the start of the required sequence, so the
machine should output a 0 and stay in state A. When the machine is
in state A, therefore, it has detected no digits of the required input
sequence. When the machine is in state B, it has detected exactly
one digit (the first 0) of the required input sequence.
[State diagram fragment: states A and B; A loops to itself on 1/0 and moves to B on 0/0]
118
If the machine is in state B and a 0 is input, then two consecutive
0’s must have been input; this input is clearly not the second digit of
the required sequence, but it may be the first digit of the sequence.
Therefore, the machine should stay in state B and output a 0. If a 1
is input while the machine is in state B, then the first two digits of
the required sequence have been detected, and the machine should
go to the next state, state C, and output a 0. When the machine is
in state C, it has detected exactly two digits (0 1) from the required
sequence.
[State diagram fragment: states A, B, and C; A loops on 1/0 and moves to B on 0/0; B loops on 0/0 and moves to C on 1/0]
If the machine is in state C and a 0 is input, then three digits of
the required sequence have been input, so the machine should go
to its next state, state D, and output a 0. If a 1 is input when the
machine is in state C, then this input is clearly no part of the required
sequence, so the machine should start over in state A and output a
0.
[State diagram fragment: states A, B, C, and D; A loops on 1/0 and moves to B on 0/0; B loops on 0/0 and moves to C on 1/0; C moves to D on 0/0 and back to A on 1/0]

119
If the machine is in state D and a 0 is input, then this is not the
required input (the input has been 0 1 0 0), but this may be the
first digit of another sequence, so the machine should go to state B
and output a 0. If a 1 is input while the machine is in state D, then
the required sequence (0 1 0 1) has been detected, so a 1 should be
output. Moreover, the last two digits input may be the first two
digits of another sequence, so the machine should go to state C. This
completes the state diagram, as shown in the following figure:
[State diagram: the complete 0101 sequence detector; A loops on 1/0 and moves to B on 0/0; B loops on 0/0 and moves to C on 1/0; C moves to D on 0/0 and back to A on 1/0; D moves back to B on 0/0 and to C on 1/1, the transition which outputs the 1]
The following is a state table corresponding to the state diagram:
Present State | Input | Next State | Output
      A       |   0   |     B      |   0
      A       |   1   |     A      |   0
      B       |   0   |     B      |   0
      B       |   1   |     C      |   0
      C       |   0   |     D      |   0
      C       |   1   |     A      |   0
      D       |   0   |     B      |   0
      D       |   1   |     C      |   1
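The state table can be coded directly as a dictionary keyed by (present state, input); this is a sketch of my own, not from the notes, and it reproduces part of the example output sequence given earlier.

    # (present state, input) -> (next state, output), copied from the table above
    fsm = {
        ('A', 0): ('B', 0), ('A', 1): ('A', 0),
        ('B', 0): ('B', 0), ('B', 1): ('C', 0),
        ('C', 0): ('D', 0), ('C', 1): ('A', 0),
        ('D', 0): ('B', 0), ('D', 1): ('C', 1),
    }

    def detect(bits):
        state, outputs = 'A', []
        for x in bits:
            state, z = fsm[(state, x)]
            outputs.append(z)
        return outputs

    print(detect([0, 0, 1, 0, 1, 0, 1, 1]))   # [0, 0, 0, 0, 1, 0, 1, 0]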
120
Algorithmic State Machines:
An interesting way of specifying a state machine, equivalent to the
use of a state table or a state diagram, is by the use of a “flowchart”;
actually, a particular type of flowchart called an ASM (algorithmic
state machine) diagram. In an ASM diagram, or flowchart, the “algorithm” which is implemented by the state machine is presented
in a clear fashion. The following figure shows a flowchart for the
controller for the traffic light at an intersection where there is both
East/West and North/South traffic.
[Flowchart: NS green / EW red (50 seconds) → NS yellow / EW red (10 seconds) → NS red / EW green (50 seconds) → NS red / EW yellow (10 seconds) → back to the beginning. Arrows in the blocks indicate the traffic directions N↑, S↓, E→, ←W]
121
We call the rectangular blocks “action blocks,” because they specify
some action. Note that, although not specifically shown, “time” is
implicitly an input in this flowchart; moreover, each of the individual
blocks does not necessarily correspond to an individual “state” of
the system. Since the blocks specify different time periods, they
imply some method to measure time, for example, by counting clock
pulses. A more explicit flowchart is shown in the following slide,
which assumes that a clock pulse occurs every 10 seconds.
The state diagram shown in the figure is equivalent to the flowchart;
note, however, that the flowchart looks simpler, in that the clock
input is implicit, and only transitions out of the state are explicitly
shown.
The square blocks correspond to the states of the system. Transitions
are specified by the arrows in the flowchart. The output is specified
in the blocks, and, in this example, the clock input is explicit.
Recall that, in the first flowchart, each block did not necessarily
correspond to an individual state of the traffic light controller.
Either flow chart could be considered an ASM diagram, but we will
prefer the style in which each block corresponds to a unique state.
122
[Figure: the same controller expanded so that each block corresponds to one 10-second clock period: five consecutive "NS green / EW red (10 seconds)" states, one "NS yellow / EW red (10 seconds)" state, five "NS red / EW green (10 seconds)" states, and one "NS red / EW yellow (10 seconds)" state (twelve states, A through L); the equivalent state diagram labels each transition CLOCK/COLOR, e.g. 1/GR, 0/GR, 1/YR, 1/RG, 1/RY]
123
Consider another traffic light example, this time with external inputs,
shown in the next slide. In this example, the East/West traffic is
less frequent than North/South traffic, and if there is no East/West
traffic, the North/South light should remain green. Eastbound or
westbound traffic is sensed by traffic sensors, labeled ET and WT
respectively, as shown in the diagram.
Again, this flowchart has implicit time, or clock, inputs, and introduces decision blocks (the diamond shaped blocks) for the traffic
sensor inputs. The decision blocks cause a transition from one block
to one of several others, depending on whether or not some condition is met. As before, this flowchart does not correspond directly
to a state diagram, but can readily be expanded to one which does,
as shown in the slide following, where each rectangular block corresponds to a 10 second interval.
This structure, consisting of arrows, action blocks, and decision blocks
is sufficiently general to specify any algorithm. These flowcharts are
often called ASM diagrams, when they refer to control devices in particular. They are (if carefully drawn) equivalent to state diagrams,
where the rectangular blocks correspond to the states of the state
machine (or, as in some of the examples, they correspond to blocks
of states which can readily be expanded). The decision boxes and
arrows correspond to transitions between the states, and the clock
input is implicit in the ordering of the blocks.
124
[Flowchart: NS green / EW red (20 seconds), then NS green / EW red (10 seconds) repeated while neither the ET nor the WT sensor is asserted (diamond-shaped decision blocks test ET and WT); when east- or westbound traffic is sensed, NS yellow / EW red (10 seconds), NS red / EW green (20 seconds), NS red / EW yellow (10 seconds), then back to the start]
125
[ASM diagram: the sensor-controlled traffic light expanded so that each rectangular block is one 10-second state (A through G): NS green / EW red for states A, B, and C, with C re-entered while neither ET nor WT is asserted; NS yellow / EW red for D; NS red / EW green for E and F; and NS red / EW yellow for G; transitions are labelled ET WT/color, e.g. X/GR, 00/GR, 01/YR, 10/YR, 11/YR, X/RG, X/RY]

126
Implementation of state machines:
There are a number of ways in which state machines described by a
state diagram or state table or ASM diagram can be implemented.
The actual “details” of the implementation depend on a number of
things such as the type of flip-flop used to hold the state information
(e.g. D ff’s or JK ff’s), the way in which the next-state logic is to
be implemented (e.g., using simple logic gates, MUX’s, PLA’s, etc.),
and the way in which the state information is stored in the flip flops
(this is usually referred to as the “state coding”).
Although these details determine the actual physical implementation
of the device, the method used to arrive at this implementation is
quite general, and can be summarized as follows:
1. Construct the state table (or state diagram, or a complete ASM
diagram) for the device, and ensure that it correctly describes
the required device.
2. Assign binary values to the states, to be encoded using flip-flops.
3. Design the logic necessary to produce the appropriate values for the flip flops to enter the next state, using the present state and input values as inputs to this logic. Also, design the logic to
produce the appropriate outputs.
127
For the second step, two commonly used codings are:
1. a binary weighted coding, in which each state is specified by a
binary number. For N states this coding requires [log2(N )] flipflops. [log2(N )] means “the integer equal to or the next integer
greater than log2(N )”.
2. a unary coding, often called a “one hot” coding, in which each
state is assigned to a flip flop. A value 1 stored in the flip flop
means that the device is in the state corresponding to that particular flip flop. Since a device can be in only one state at any
time, there will always be exactly one flip flop with a value 1
stored in it. Although this coding requires more flip flops to store the state information (N, rather than ⌈log2(N)⌉), the next-state logic is usually much simpler to design, because only two flip flops need to have their values changed: the one corresponding to the present state, and the one corresponding to the next state.
128
Step (3) is not difficult, since the logic required is only simple combinational logic, but it can be quite tedious, especially for a binary
weighted coding since several (possibly all) flip flops may have to
change their values to produce the appropriate next-state. (For example, in a 3 bit counter, the change from state 011 to state 100
requires that all flip flops change state at the same time). The type
of flip flop used also affects the design effort. D flip flops have only
one input, which is equal to the value to be stored in the flip flop. JK
flip flops have two inputs to be controlled, each by a separate block
of logic. This means that the design effort is easier for D flip flops
(because the next-state logic must only produce one input for each
flip flop), but the number of logic gates may be fewer for a JK flip
flop (because there are more ways to change the state). Since we are
more interested in reducing the design effort, we will use D flip flops
for our designs.
The next slide shows an ASM diagram for a device with four states,
A,B,C,D, one input, Y, and one output, Z, together with the corresponding state diagram. The state table is also shown. The outputs
from the device are specified on the arrows in the ASM diagram, and
in the state diagram.
129
[ASM diagram and state diagram for the device: state A loops to itself while Y = 0 (with Z = 0) and moves to B when Y = 1; B moves unconditionally to C, and C unconditionally to D (Z = 0 in both); from D, an input Y = 1 moves to B with Z = 0, and Y = 0 moves to A with Z = 1]

130
State Table

Present State | Input | Next State | Output
      A       |   0   |     A      |   0
      A       |   1   |     B      |   0
      B       |   0   |     C      |   0
      B       |   1   |     C      |   0
      C       |   0   |     D      |   0
      C       |   1   |     D      |   0
      D       |   0   |     A      |   1
      D       |   1   |     B      |   0
We will first design a state machine corresponding to this state table
using a binary weighted coding for the states. (Later we will design
the same state machine using the unary, or “one hot” state coding.)
The design requires log2(4) = 2 flip flops; we will use D flip flops as
memory elements. We choose the following coding for the states:
State | FF1 | FF0
  A   |  0  |  0
  B   |  0  |  1
  C   |  1  |  0
  D   |  1  |  1
(This choice of state coding is arbitrary; the problem of finding a
state coding which requires a minimum number of logic gates for its
implementation is NP-hard).
131
We next reconstruct the state table, including the values for each flip flop, as follows:

Present State | Input | D FF inputs required to | Output
(QFF1 QFF0)   |   Y   | produce the next state  |   Z
              |       | (DFF1 DFF0)             |
 0 0  (A)     |   0   |  0 0  (A)               |   0
 0 0  (A)     |   1   |  0 1  (B)               |   0
 0 1  (B)     |   0   |  1 0  (C)               |   0
 0 1  (B)     |   1   |  1 0  (C)               |   0
 1 0  (C)     |   0   |  1 1  (D)               |   0
 1 0  (C)     |   1   |  1 1  (D)               |   0
 1 1  (D)     |   0   |  0 0  (A)               |   1
 1 1  (D)     |   1   |  0 1  (B)               |   0
This table is, effectively, three truth tables, one for each of the D
inputs to F F1 and F F0, and one for the output, Z. Note that there
are three inputs to each truth table; namely, the outputs of F F1
and F F0, and Y. The required logic could be implemented in several
ways; using simple logic gates, using three 4 line to 1 line MUX’s (or
three 8 line to 1 line MUX’s), or using one 3 line to 8 line decoder,
and several NAND gates. The implementation shown in the following
slide uses the 3 line to 8 line decoder, and three NAND gates.
(A PLA implementation is particularly attractive for state machines,
because all the logic functions to be implemented are functions of the
same set of input variables.)
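Before the gate-level version, the three truth tables can be checked in a few lines (a sketch of my own, not from the notes): the next values of FF1 and FF0 and the output Z are each simple functions of QFF1, QFF0, and Y.

    # (QFF1, QFF0, Y) -> (DFF1, DFF0, Z), copied from the table above
    table = {
        (0, 0, 0): (0, 0, 0), (0, 0, 1): (0, 1, 0),
        (0, 1, 0): (1, 0, 0), (0, 1, 1): (1, 0, 0),
        (1, 0, 0): (1, 1, 0), (1, 0, 1): (1, 1, 0),
        (1, 1, 0): (0, 0, 1), (1, 1, 1): (0, 1, 0),
    }

    def step(q1, q0, y):
        # On a clock edge the D inputs become the new flip-flop values
        d1, d0, z = table[(q1, q0, y)]
        return d1, d0, z

    # Walk the machine A -> B -> C -> D -> A, collecting Z
    q1 = q0 = 0
    for y in (1, 0, 0, 0):
        q1, q0, z = step(q1, q0, y)
        print((q1, q0), z)     # (0, 1) 0, (1, 0) 0, (1, 1) 0, (0, 0) 1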
132
[Figure: an implementation of this state machine using a 3-line to 8-line decoder whose select inputs are Y and the outputs of FF1 and FF0; three NAND gates combine the appropriate decoder outputs to form DFF1, DFF0, and Z, and two D flip-flops (FF1, FF0) hold the state, sharing a common clock]
The preceding technique would be quite tedious for the one hot state
coding, since there would be four flip flops, and the state table would
consequently require 32 lines. For the one hot state coding, we can
consider each flip flop individually, and design the logic required to
set it to 1 or 0, without concern for the other flip flops (except, of
course, that one or more of them may potentially provide an input
to the logic.)
In this case, we can break up the state table into a separate truth
table for the inputs required to produce each state; that is, we group
together in separate tables the lines corresponding to each separate
“next state.” For the previous example, we have the following four
tables:
133
For State A
Present State | Input, Y | Next State | Output
      A       |    0     |     A      |   0
      D       |    0     |     A      |   1

For State B
Present State | Input, Y | Next State | Output
      A       |    1     |     B      |   0
      D       |    1     |     B      |   0

For State C
Present State | Input, Y | Next State | Output
      B       |    0     |     C      |   0
      B       |    1     |     C      |   0

For State D
Present State | Input, Y | Next State | Output
      C       |    0     |     D      |   0
      C       |    1     |     D      |   0
The “Next State” column is, of course, not required in the tables.
Each table can be used to design the logic required to set the corresponding flip flop. The following equations follow directly from these tables:
For FFA, DA = A · Ȳ + D · Ȳ = (A + D) · Ȳ
For FFB, DB = A · Y + D · Y = (A + D) · Y
For FFC, DC = B · Ȳ + B · Y = B
For FFD, DD = C · Ȳ + C · Y = C
The output, Z, would be evaluated as Z = D · Ȳ
134
With a little practice, these design equations can be obtained directly
from a state diagram or ASM diagram. A corresponding circuit
diagram would be as shown below.
Note that, for the one hot coding, although more “next state” circuits must be designed, they are normally much simpler than for the
binary coded state assignment. In fact, for a small device, the implementation effort may be much less for a “one hot” implementation
than for an implementation using binary coded states.
[Figure: the one-hot implementation: four D flip-flops, one per state A, B, C, D, sharing a common clock, with next-state logic built from Y, Ȳ, and the flip-flop outputs according to the equations below, and an AND gate forming Z from D and Ȳ]
Repeating the design equations from the previous page:
DA = A · Ȳ + D · Ȳ = (A + D) · Ȳ
DB = A · Y + D · Y = (A + D) · Y
DC = B · Ȳ + B · Y = B
DD = C · Ȳ + C · Y = C
Z = D · Ȳ
135
Of course, sometimes a simple solution to a design problem is apparent, without requiring much design effort. For example, if we wanted
to design a device to detect the sequence 0101, say, and output a 1
when this sequence was detected, we could construct a state table
for the device, and complete the design as in the previous examples.
(Recall that we constructed a state diagram and state table for this
device earlier). There is another, simpler solution, however, using a
4-bit serial-in, parallel-out shift register (i.e., 4 D FFs), four comparators, and an AND gate, as shown below. (The comparator is the complement of the XOR function, and is often drawn in a circuit as an XNOR gate.) This simple design can be used to
detect any four bit sequence, by simply changing the inputs to the
comparators.
[Figure: a 0101 detector built from a 4-bit serial-in, parallel-out shift register clocked with the input X; four comparators compare the stored bits against the pattern 0, 1, 0, 1, and an AND gate combines the comparator outputs to produce Z]

136
Structured implementation of state machines
State machines are typically implemented in three ways; using individual logic gates, typically called a “random logic” implementation,
using a PLA, and using memory as a “look-up” for the combinational
logic. The PLA and memory (typically read-only memory) implementations are quite effective, because a state machine has a fixed
(usually relatively small) number of inputs and outputs, and both
those approaches can be readily automated.
An implementation based on memory is often called a microcoded
implementation. (This term is often reserved for the memory-based
implementation of the control unit of a computer.) It is common
to have a small “microcode engine” including simple functions like a
counter and registers, and for the microcode itself to be “sequenced”
by the counter. (In a sense, the counter is used to fetch microcode
words from memory, and these microcode words control the external
state changes and outputs.)
137
State machine models
Mealy state machines
Up to this point, we have implicitly considered only one model for
a state machine; a model in which the outputs are a function of
both the present state and the input. This model, shown pictorially
below, is called the Mealy model for sequential devices. It is a general
model for state machines, and assumes that there are two types of
inputs; clock inputs and data inputs. The clock inputs cause the
state transitions and “gate” the outputs, (so the outputs are really
“pulse” outputs; i.e., they are valid only when the clock is asserted).
The data inputs determine the values of the next states and outputs.
Essentially, the clock inputs control the timing of the state transitions
and outputs, while the data inputs determine their values.
[Figure: the Mealy model: the primary inputs and the current state values feed a block of combinational logic which produces both the primary outputs and the next-state values; the next-state values are stored in the memory (state register) under control of the clock]
So far, the state diagrams we have drawn correspond to this model;
we have labeled the transitions with the inputs which cause the transition, and the output corresponding to the transition.
138
Moore state machines
Another, model for state machines is the Moore model, in which the
outputs are associated with the states of the device. In the Moore
machine, the outputs are stable for the full time the device is in a
given state. (The outputs are said to be “level” outputs, and are
valid even when the clock inputs are not asserted.) Again, there
are two types of inputs, clock inputs and data inputs. In this case,
however, the clock inputs only directly enable the state transitions.
In this model, the transitions are functions of the present states and
inputs, but the outputs are functions of the states only. Below is a
pictorial representation of the Moore model of a state machine. (The
Moore model describes state machines like the traffic light controllers
we have seen as ASM diagrams in a very natural manner.)
[Figure: the Moore model: the primary inputs and the state values feed the next-state combinational logic, whose outputs are stored in the memory under control of the clock; a separate block of output combinational logic derives the primary outputs from the state values only]

139
The following figure shows a state diagram for the Mealy machine
derived earlier which produces a 1 as output whenever the sequence
0101 is input. This machine has four states, and the outputs are associated with the transitions (they depend on both the present state and the input).
[State diagram: the Mealy 0101 detector, states A, B, C, D with arcs labelled input/output; the D to C transition on 1/1 produces the 1 output]
The next figure shows a state diagram for a Moore machine which
performs the same function.
[State diagram: an equivalent Moore machine with five states, A/0, B/0, C/0, D/0, and E/1; arcs are labelled with the input only, and each state carries its output; state E, entered from D on a 1 input, produces the output 1]
Note that a state diagram for a Moore machine is labeled differently
from a Mealy machine state diagram; the transitions are labeled
only with the inputs which cause the transition, while the states are
labeled with the corresponding outputs. (The output is a function
of the state only, and does not depend directly on the input.)
140
Comparing the Mealy and Moore state diagrams, it is clear that they
are very similar for states A, B, C and D. State E is a “new” state,
because the output of 1 must be associated with some state. In fact,
state E is equivalent to state C in the Mealy diagram — Mealy state
C has been split into two Moore states, one (C) with output 0, and
one (E) with output 1. They are equivalent in the sense that Mealy state C and Moore state E (and Moore state C, too) have the same next-states for the same inputs.
Note that state C was the only Mealy state in which the incoming
arcs (or arrows) correspond to different outputs. Moreover, the state
which was “split” retained all transitions to the corresponding next-states in the Mealy machine. (In this example, all of the other states
are associated with only one output.)
In general, from this observation, it is possible to convert any Mealy
type machine into an equivalent Moore type machine, and vice-versa.
First, we must define what we mean for two state machines to be
equivalent. Two state machines are said to be equivalent if they
produce exactly the same output for all inputs. Consequently, to
derive an equivalent Moore machine from a Mealy machine, it must
be possible to guarantee that the two machines produce the same
output after any arbitrary input string has been input. This can
be done by splitting all the Mealy states corresponding to different
outputs, and ensuring that these states are connected to next-states
which correspond to equivalent states in the original Mealy machine.
141
As a slightly more complex example, the Mealy machine specified by
the following state table, where x is the single external input, and
y is the output, and having a state diagram as shown below can be
converted into a Moore machine as follows:
Present State | Next State   | Output, y, for
              | x=0    x=1   | x=0    x=1
      A       |  C      B    |  0      0
      B       |  A      D    |  1      0
      C       |  B      A    |  1      1
      D       |  D      C    |  1      0
[State diagram for this Mealy machine: states A, B, C, D, with arcs labelled x/y according to the table above]
Each state with different output values associated with transitions
into the state is split into states corresponding to each different output; e.g., state B has a transition from state A with an output of
0, and from state C with an output of 1. Therefore, State B is split
into two states, B0, with an output of 0, and B1, with an output of
1. Every transition to B with output 0 goes to B0; every transition
to B with an output 1 goes to B1. The next-states of B0 and B1 are
exactly the same as for B. State D is split into two states D0 and
D1, similarly.
142
The state table becomes the following, corresponding to the state
diagram shown below.
Present State | Next State   | State
              | x=0    x=1   | output
      A       |  C      B0   |   1
      B0      |  A      D0   |   0
      B1      |  A      D0   |   1
      C       |  B1     A    |   0
      D0      |  D1     C    |   0
      D1      |  D1     C    |   1
Here we have added a column called state output, which is the output
the device has while in a given state. The output no longer depends
on the input, x.
[State diagram for the equivalent Moore machine: states A/1, B0/0, B1/1, C/0, D0/0, D1/1, with arcs labelled by the input x according to the table above]
143
We can see that the Moore machine accepts the same input sequences
as the Mealy machine we started with, and produces the same output
sequence. In addition, it produces the output 1 when started in
state A, without having any input sequence. i.e., a Moore machine
accepts a zero length sequence, called a null sequence, and produces
an output (while in its initial state.) If we wish, we can add a new
state A0 as the initial state, which produces a different output, say,
0, indicating that the machine is in its initial state. (There will be
no transitions back into this initial state.)
Note that, in general, any Mealy machine with N internal states
and P outputs can be converted to a Moore machine with at most
P × N + 1 states. (The Mealy machine and its corresponding Moore
machine will be equivalent, in the sense that both will give exactly
the same output for all possible input sequences.)
144
Present State | Output | Next State
              |        | x=0    x=1
     A0       |   0    |  C      B0
     A1       |   1    |  C      B0
     B0       |   0    |  A1     D0
     B1       |   1    |  A1     D0
     C        |   0    |  B1     A1
     D0       |   0    |  D1     C
     D1       |   1    |  D1     C
[State diagram: the Moore machine with the added initial state A0/0; A0 has the same next states as A1 (C on x = 0, B0 on x = 1), but no transitions lead back into it]
145
“Computer arithmetic”
What kinds of numbers are typically represented in a computer system?
Positive integers (for addressing, pointers.)
Signed integers (for integer arithmetic.)
“Real numbers” for arithmetic using real numbers.
Perhaps the most important characteristic of the representation of
numbers in a computer system is that numbers are represented using
a fixed number of binary digits.
This introduces the problems of overflow or underflow.
For example, the sum of the following four bit (unsigned) numbers
would be
1101 + 0101 = 10010
Using only four bits, the result would be 0010, and, depending on
the particular processor and instruction, the overflow may or may
not be detected.
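A quick sketch of this in Python (the function name and the 4-bit width are just for illustration): unsigned addition is done modulo 2^n, and overflow is signalled whenever the true sum does not fit in n bits.

    def add_unsigned(a, b, n=4):
        """Add two n-bit unsigned integers; return (result, overflow)."""
        s = a + b
        return s & ((1 << n) - 1), s >= (1 << n)

    # 1101 + 0101 = 10010; only four bits are kept, so the result is 0010.
    result, overflow = add_unsigned(0b1101, 0b0101)
    print(format(result, '04b'), overflow)   # 0010 True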
When an overflow or underflow is detected, the processor usually
causes an exception.
In this case, the program jumps to a predetermined location in memory which handles the exception, and the return address is also saved so the program can continue after the exception is handled.
We will discuss exceptions later.
146
How are signed integers represented?
Following are four possibilities for representing negative numbers:
one’s complement representation A negative number is the
complement of a positive number; e.g.
00000001   represents  1
11111110   represents -1
Note that there are two representations of zero.
This representation is not commonly used.
two’s compliment representation This is the one’s complement
representation plus 1; e.g.
00000001   represents  1
11111111   represents -1
This is the usual representation for signed integers.
To extend the size of a 2’s complement integer, it is sign extended; e.g., to make the 8-bit representation of -7 a 16 bit representation, the high order bit in the 8-bit representation is used
to fill in the higher order bits.
11111001           -7 (8-bit 2's complement)
1111111111111001   -7 (16-bit 2's complement)
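A small Python sketch of sign extension (the helper name and widths are mine): the value is widened by copying the high order (sign) bit into all of the new bit positions.

    def sign_extend(value, from_bits, to_bits):
        """Sign-extend a from_bits-wide two's complement value to to_bits."""
        sign = (value >> (from_bits - 1)) & 1
        if sign:
            value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
        return value

    print(format(sign_extend(0b11111001, 8, 16), '016b'))  # 1111111111111001 (-7)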
147
sign-magnitude representation An integer has a single bit, say
the high order bit, which represents the sign. For an eight bit
number, the representation would be
sddddddd
00000001   represents  1
10000001   represents -1
This is the usual representation for the mantissa (significand) of
a real (floating point) number.
Note that for an integer, there are again two representations for
zero.
biased representation (sometimes called excess n representation.)
A bias is added to the representation to get the number being
represented. The following shows a representation with a bias of
127 (or excess 127):
10000000   represents  1   (1 + 01111111)
01111110   represents -1   (-1 + 01111111)
This is the usual representation for the exponent of a real (floating point) number.
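A sketch of the encode/decode for a biased (excess-n) representation, again in Python with illustrative names; the bias of 127 matches the excess-127 example above.

    BIAS = 127  # excess-127, as used for the single precision exponent

    def to_biased(value):
        return value + BIAS

    def from_biased(code):
        return code - BIAS

    print(format(to_biased(1), '08b'))    # 10000000
    print(format(to_biased(-1), '08b'))   # 01111110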
148
Given that, for integer arithmetic, we will use a 2’s complement
representation, and that we want to combine arithmetic and logic
operations in one unit (the ALU), how would an ALU for the MIPS
be implemented?
So far, we have the following instructions to implement: add, addu,
addi, sub, subu, addiu, and, andi, or, ori, slt, sltu,
slti.
Consider the following single-bit ALU:
[Figure: a single-bit ALU. Inputs a, b, Less, and Carryin; control inputs Binvert (which inverts b) and a 2-bit Operation select on the output MUX; outputs Result and Carryout. The MUX inputs 0–3 are AND, OR, the adder sum, and Less.]
Note that there are three control bits; the single bit Binvert, and
the two bit input to the MUX, labeled Operation.
The ALU performs the operations and, or, add, and subtract.
149
This ALU implements all the integer arithmetic and logical instructions seen so far. Note that the control inputs must be set according
to the particular operation required of the ALU.
A controller will also be required, to determine what particular operation is required of the ALU for each individual instruction.
The ALU will be a component of the datapath of the processor we
will design; the high order bit should detect overflow, as below:
[Figure: the high order single-bit ALU, extended with overflow detection logic and a Set output (used for slt), producing the Overflow signal.]
A set of 32 of these components can be used to implement a full 32
bit ALU, as shown in the next slide. It also produces an output,
zero, which is set to 1 whenever the 32 bit output is 0.
150
The 32 bit ALU
[Figure: the 32-bit ALU built from 32 single-bit ALUs (ALU0–ALU31) connected in a ripple-carry chain. Binvert and the 2-bit Operation signal drive all stages; the Less input of ALU0 is driven by the Set output of ALU31; a NOR of all the Result bits produces the zero output; ALU31 also produces Overflow.]
ALU control lines   Function
0 00                and
0 01                or
0 10                add
1 10                subtract
1 11                set on less than
151
The ALU depicted on the previous slide uses ripple-carry adders.
We have seen how to build a carry look-ahead adder which would
permit faster arithmetic operations.
Using the carry look-ahead units we considered earlier, the changes
required are not difficult since they merely compute the Carryin
inputs to the ALU. In order to handle 2's complement arithmetic, however, inverted inputs would be required.
[Figure: the 32-bit ALU with carry look-ahead units. Each stage ALUi receives its Carryin ci from a carry look-ahead block whose inputs are ai and bi, selected between true and inverted forms under the control of Binvert; the remaining connections, the Set/Less path, and the Overflow output are as in the ripple-carry version.]
152
Integer Multiplication
Integer multiplication is really repeated addition.
The basic algorithm is a kind of “shift and add” of partial products,
obtained by multiplying individual digits in the multiplier by the
multiplicand, as follows:
        1010      multiplicand
      × 0110      multiplier
        0000      0 × 1010
       1010       1 × 1010, shifted left 1 bit
      1010        1 × 1010, shifted left 2 bits
     0000         0 × 1010, shifted left 3 bits
     0111100      product — sum of partial products
Note that the product of 2 n-bit numbers requires 2n bits.
This multiplication algorithm can be implemented in a number of
ways.
Recalling that single bit binary multiplication is the same as the AND
function, then simply ANDing the multiplicand with the digits of the
multiplier, shifting, and adding them together is all that is required.
This type of multiplier can be implemented using a single 32 bit
adder, and 32 shift operations.
One such implementation is shown in the following slide.
153
Hardware:
[Figure: a 64-bit Product register (with shift right and write controls), a 32-bit Multiplicand register, a 32-bit Multiplier register (shift right), the 32-bit ALU, and a control unit that tests the low order multiplier bit.]
Algorithm:
Start
1. Test the low order bit of the multiplier register.
   If it is 1, add the multiplicand to the left half of the product register and place the result in the left half of the product register.
2. Shift the product register right 1 bit.
3. Shift the multiplier register right 1 bit.
4. If fewer than 32 repetitions have been done, repeat from step 1; otherwise done.
154
Note that this implementation requires 32 shift and add operations,
and requires that the multiplier, multiplicand, and product all be
stored in separate registers.
Noting that the product register “consumes” one additional bit on
each iteration, and the multiplier register effectively removes one bit
in the same iteration, we can reduce the hardware further by storing
the multiplier in the right hand half of the product register.
It will be shifted out at the same rate product digits are shifted in.
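The refined algorithm is easy to model in software. The following Python sketch (variable names and the small bit width in the test are mine) keeps the multiplier in the right half of a double-width product register, exactly as described above:

    def multiply(multiplicand, multiplier, n=32):
        """Unsigned shift-and-add multiply using a 2n-bit product register."""
        product = multiplier  # the multiplier occupies the right half of the register
        for _ in range(n):
            if product & 1:                      # low order bit of the multiplier
                product += multiplicand << n     # add multiplicand to the left half
            product >>= 1                        # shift the product register right
        return product

    print(multiply(0b1010, 0b0110, 4))   # 60, i.e. 0111100 in binary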
155
Hardware:
[Figure: as before, but the separate multiplier register is removed — the 64-bit Product register (shift right, write), the 32-bit Multiplicand register, the 32-bit ALU, and a control unit that tests the low order bit of the product register.]
Algorithm:
Start
1. Test the low order bit of the product register (the current multiplier bit).
   If it is 1, add the multiplicand to the left half of the product register and place the result in the left half of the product register.
2. Shift the product register right 1 bit.
3. If fewer than 32 repetitions have been done, repeat from step 1; otherwise done.
156
Another possibility is to use an array of adders and AND gates to
directly implement each partial product and sum all the partial products.
This would require n² adders for an n-bit multiplier.
Following is a single multiply-add unit, consisting of an adder and
an AND gate:
[Figure: a single multiply-add cell. An AND gate forms the partial product bit xj · yi; a full adder (inputs A, B, C, outputs Sum S and Carry C) adds it to the incoming partial sum Pi and carry Ci, producing Pi+1 and Ci+1; xj passes through to the next cell.]
157
A 4 bit parallel multiplier:
[Figure: a 4-bit parallel (array) multiplier built from a 4 × 4 array of the multiply-add cells above. The multiplicand bits x3–x0 propagate across the rows, the multiplier bits y0–y3 select the rows, the partial sums and carries propagate downward, and the outputs are the product bits P7–P0.]
158
Signed multiplication
The previous algorithms work fine for positive numbers, but how
could negative numbers be handled?
One way is to convert the negative numbers to positive (2’s complementation), perform the multiplication, and adjust the sign by
performing 2’s complementation again if necessary.
It turns out that there is a more elegant solution.
Recoded multiplication
If the adders which are used to construct the multiplier can also subtract, another possibility for speeding up the multiplication process
is to “recode” the multiplication operation. We can consider the
following as a “recoding” of the multiply operation:
ai  ai−1   operation   comment
0   0      —           no action
0   1      1×M         add
1   0      2×M         shift and add
1   1      4×M − M     shift 2 and subtract
This recoding applied to the previous algorithms would allow the
multiply operations to complete in 16 iterations, rather than 32, using
the same hardware.
Essentially, this algorithm does a 2-bit multiply in each step.
159
Booth’s algorithm
If the multiplier contains a string of 1’s, then this can be rewritten
as a string of 0’s, provided we can do a subtraction. For example, if
we were to perform the following multiplication:
    0 0 0 1 1 0 1
  × 0 0 1 1 1 1 0
We could rewrite the multiplier as 0100000 - 0000010 to give the
following equivalent operation:
    0 0 0 1 1 0 1
  × 0 1 0 0 0 -1 0
Note that this “recoding” of the multiplier allows many more shifts
over 0’s than the previous multiplier. It is possible to use this observation to recode groups of 2 bits in a multiplier to use only shifts
and add or subtract operations to implement the multiplication.
The main idea behind Booth’s algorithm is to identify strings of 1’s
and replace them with 0’s, and apply the above observation.
This can be accomplished by examining pairs of bits:
Left bit   Right bit   Explanation                 Example
1          0           Beginning of a run of 1's   00011110
1          1           Middle of a run of 1's      00011110
0          1           End of a run of 1's         00011110
0          0           Middle of a run of 0's      00011110
160
Booth’s algorithm simply examines pairs of bits in the multiplier
and does the following, using the hardware for the multiplier shown
previously:
1. Depending on the values of the current bit and previous bit, do
the following:
00: Middle of a string of 0’s, so do nothing
01: End of a string of 1’s, so add multiplicand to the left half
of the product register
10: Beginning of a string of 1’s, so subtract multiplicand from
the left half of the product register
11: Middle of a string of 1’s, so do nothing
2. Shift the product register to the right by 1 bit.
3. Repeat from step 1 until all the multiplier bits have been consumed.
The original purpose was to speed up multiplication, since shifting
was much faster than adding.
One of the major advantages of Booth’s algorithm is that it also
works for 2’s complement numbers.
In modern processors, the multiply operation is usually implemented
directly in the hardware with a multiply unit as part of the datapath.
161
Division
Division is similar to multiplication, in that it is based on repeated
subtraction. The main difference is that the quotient of two integers
is not necessarily an integer.
The basic algorithm is a “shift and subtract” procedure, similar to
multiplication.
           1010     Quotient
   1101101          Dividend
  -1010             subtract Divisor
     11             difference
    111             shift in next bit, compare to Divisor (set 0 in Quotient)
   1110             shift in next bit, compare to Divisor
  -1010             subtract Divisor (set 1 in Quotient)
    100             difference
   1001             shift in next bit, compare to Divisor (set 0 in Quotient)
                    Remainder (last bit was shifted in)
This algorithm is slightly more complex than multiplication, because of the comparison with the Divisor at each step. The subtraction is actually done only if the partial dividend is greater than or equal to the divisor.
In practice, the operations compare and subtract are essentially the same, so it is common just to subtract and check whether the result is positive; otherwise the previous value is restored, a 0 is set as the quotient bit, and the shift is performed.
Note that the divisor and remainder can be the same size.
162
In the same way that the multiplier and product could share the
same 64 bit register, the quotient and remainder bits can share a
single 64 bit register for division.
[Figure: the division hardware — a 32-bit Divisor register, the 32-bit ALU, and a 64-bit Remainder register with shift left, shift right, and write controls, driven by a control unit that tests the sign of the result.]
Note the similarity between the hardware for multiplication and division — the same hardware can be used for both functions.
The difference is only in the control algorithm for each.
Following is a control algorithm for the hardware for division:
163
Start: shift the remainder register left 1 bit.
1. Subtract the divisor register from the left half of the remainder register and place the result in the left half of the remainder register.
2. If the result is >= 0, shift the remainder register to the left, setting the new rightmost bit to 1.
   If the result is < 0, restore the original value by adding the divisor register to the left half of the remainder register, placing the sum in the left half of the remainder register, and then shift the remainder register to the left, setting the new rightmost bit to 0.
3. If fewer than 32 repetitions have been done, repeat from step 1.
Done. Shift the left half of the remainder 1 bit right.
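The control algorithm above can be modelled directly in software. Below is a Python sketch (register names and the width parameter are mine) of the shared-register restoring division:

    def divide(dividend, divisor, n=32):
        """Unsigned restoring division; returns (quotient, remainder)."""
        remainder = dividend            # dividend starts in the double-width remainder register
        remainder <<= 1                 # initial shift left
        for _ in range(n):
            remainder -= divisor << n   # subtract divisor from the left half
            if remainder >= 0:
                remainder = (remainder << 1) | 1   # success: shift in a quotient bit of 1
            else:
                remainder += divisor << n          # restore
                remainder = remainder << 1         # shift in a quotient bit of 0
        quotient = remainder & ((1 << n) - 1)      # right half holds the quotient
        remainder = remainder >> (n + 1)           # left half (shifted back right 1 bit)
        return quotient, remainder

    print(divide(0b1101101, 0b1010, 4))   # (10, 9): 109 / 10 = 10 remainder 9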
164
“Real” arithmetic — floating point numbers
Often we want to represent numbers over a wider range than can be
represented with 32 bit integers. Most computers support floating
point numbers. These numbers are similar to the representation in
scientific notation. They consist of two parts, a mantissa (significand)
and an exponent.
The representation is of the form
    (−1)^s × M × 2^E
where s denotes the sign, M is the mantissa, and E is the exponent.
The exponent is adjusted such that the mantissa is of the form
1.xxxxxx...
There is a standard for representing binary floating point numbers,
(IEEE 754 floating point standard) universally supported by manufacturers of computer systems.
The single precision (32 bit) form of a floating point number is:
31 | 30                    23 | 22                                    0
 s |        exponent         |               mantissa
   |         8 bits          |               23 bits
The exponent is in excess 127 representation, and the mantissa is
normalized so the leading digit is 1. (Since it is always 1, it is not
stored explicitly, permitting an additional digit to be represented in
the mantissa.)
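A short Python sketch (using the standard struct module; the variable names are mine) that pulls the three fields out of a single precision value and reconstructs it, illustrating the excess-127 exponent and the implicit leading 1:

    import struct

    def decode_single(x):
        """Return (sign, exponent field, mantissa field) of a 32-bit float."""
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        sign = bits >> 31
        exponent = (bits >> 23) & 0xFF          # excess-127
        mantissa = bits & 0x7FFFFF              # 23 bits, implicit leading 1
        return sign, exponent, mantissa

    s, e, m = decode_single(-6.5)
    value = (-1) ** s * (1 + m / 2**23) * 2 ** (e - 127)
    print(s, e, m, value)    # 1 129 5242880 -6.5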
165
There is a double precision (64 bit minimum) form for floating point
numbers. Here the exponent is 11 bits, in excess 1023 form, and the
mantissa is 52 bits.
Single precision numbers are in the range 2.0 × 10^−38 to 2.0 × 10^38.
Double precision numbers are in the range 2.0 × 10^−308 to 2.0 × 10^308.
Note that floating point numbers are not distributed uniformly.
The following shows the case for a 2-bit mantissa:
[Figure: a number line marked at the powers of two from 2^−1 to 2^3, with the few representable mantissa values shown between each pair of powers; the gaps between representable numbers double each time the exponent increases.]
This shows that the separation between floating point numbers depends on the value of the exponent.
This means that “computer arithmetic” with floating point numbers
does not behave like “real” arithmetic.
For example, the following property does not hold:
(A + B) + C = A + (B + C)
Also, subtracting two numbers which are nearly equal causes “loss of
precision.”
For example, (using decimal arithmetic)
   1.11010110 × 10^2
 − 1.11010000 × 10^2
   0.00000110 × 10^2  =  1.10000000 × 10^−4
which now really has only 3 significant digits.
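Both effects are easy to demonstrate in Python, whose float type is IEEE 754 double precision (the particular constants are just convenient examples of my own):

    # Floating point addition is not associative:
    a, b, c = 1.0e20, -1.0e20, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0

    # Subtracting nearly equal values loses significant digits:
    x, y = 1.11010110e2, 1.11010000e2
    print(x - y)         # about 1.1e-04; only the leading digits are meaningful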
166
The IEEE 754 floating point standard attempts to minimize problems
with floating point arithmetic, by several means, including:
• Providing several user-selected rounding modes, including
1. Round to nearest (the default mode)
2. Round towards +∞
3. Round towards −∞
4. Round towards 0
These rounding modes allow the user to determine the effect of
rounding errors.
• Providing representations for +∞, −∞, and “not a number” (NAN).
• Insisting that all calculations produce the same result as if the
floating point calculations were exact, and then reduced to the
number of bits used in the mantissa.
In order to do this, three additional bits are required when performing arithmetic operations. Two bits, called the guard and round bits, are required to ensure that normal rounding is accurate. A third bit, called the sticky bit, is set if any of the discarded bits in the exact calculation would have been 1. This is required for rounding towards ∞.
167
• Providing well-defined exceptions, and a provision for trapping
and handling those exceptions.
The five possible exceptions are:
1. Invalid operation; e.g., 0/0, 0 × ∞
2. Overflow
3. Underflow
4. Divide by 0 (this produces ±∞ if the exception is not trapped).
5. Inexact — the rounded result is not the actual result
The standard also provides for denormalized numbers — numbers
where the leading digits in the mantissa are 0 (and the implied digit
is not present before the decimal.) This allows a graceful underflow.
Special combinations of the exponent (E) and fractional part of the
mantissa (f ) represent denormalized numbers, 0, ∞, and NAN:
• if E = Emax and f = 0 then the number represents ±∞, depending on the sign bit.
• if E = Emax, its maximum value (255 for single precision numbers), and f ≠ 0 then the number represents NAN.
• if E = 0 and f = 0 then the number represents 0.
• if E = 0 and f ≠ 0 then the number is denormalized, and represents (−1)^s × 2^−(Emax−1) × 0.f
168
Another feature of the floating point standard was the provision for
extended formats for both single and double precision floating point
numbers.
The format parameters are summarized in the following table:
Parameter            Single   Single Extended   Double   Double Extended
Mantissa (bits)      23       ≥ 32              52       ≥ 64
Exponent (bits)      8        ≥ 11              11       ≥ 15
Total width (bits)   32       ≥ 43              64       ≥ 79
Exponent bias        127      unspecified       1023     unspecified
Emax                 127      ≥ 1023            1023     ≥ 16383
Emin                 −126     ≤ −1022           −1022    ≤ −16382
Recall that the leading 1 of the mantissa for normalized numbers is
not included in the table, so the actual precision of the mantissa is
one more bit than indicated.
The extended precision formats are used in the floating point processors designed by INTEL, which use an 80 bit representation for floating point numbers in their floating point units.
The IEEE floating point standard is presently being reviewed, and
proposals have been made to combine this standard with another
standard for decimal floating point arithmetic.
169
Implementation of floating point arithmetic
Floating point addition:
Start
1. Compare the exponents, and shift the mantissa of the smaller number to the right until its exponent equals the larger exponent.
2. Add the mantissas.
3. Normalize the sum, shifting either right or left, incrementing or decrementing the exponent with each shift.
4. If overflow or underflow occurs, signal an exception.
5. Otherwise, round the mantissa to the appropriate number of bits.
6. If the result is no longer normalized, return to step 3; otherwise, done.
170
Floating point multiplication:
Start
1. Add the exponents, and subtract the bias from the sum to get the new biased exponent.
2. Multiply the mantissas.
3. Normalize the product, shifting right and incrementing the exponent as required.
4. If overflow or underflow occurs, signal an exception.
5. Otherwise, round the mantissa to the appropriate number of bits.
6. If the result is no longer normalized, return to step 3.
7. Set the sign bit appropriately. Done.
171
Hardware implementations of the basic floating point operations (addition, multiplication, and division) are provided in virtually all modern microprocessors.
Some processors have independent units for multiplication and addition, so both operations can execute in parallel.
The MIPS had a separate floating point unit which was used in
combination with the processor chip. Later versions integrated the
floating point unit with the processor.
A similar evolution happened earlier with the INTEL 80x8x architecture — the floating point unit was a separate co-processor, and
operated in parallel with the main processor.
Internally, INTEL’s floating point processor (which was the first floating point unit to comply with the then new floating point standard)
used 80 bit arithmetic, in a stack architecture.
172
How can we determine performance?
Let us look at an example from the transportation industry:
Aircraft          Passenger   Fuel       Range    Cruising   Throughput   Cost
                  Capacity    Capacity            Speed
Boeing 747-400    421         216,847    10,734   920        387,320      0.048
Boeing 767-300    270         91,380     10,548   853        230,310      0.032
Airbus 340-300    284         139,681    12,493   869        246,796      0.039
Airbus 319-100    120         23,859     4,442    837        100,440      0.045
BAE-146-200       77          11,750     2,406    708        54,516       0.063
Concorde          132         119,501    6,230    2180       287,760      0.145
Dash-8            50          3,202      1,389    531        26,550       0.046
My car            5           60         700      100        500          0.017
Where fuel capacity is in litres, range is in Km., and speed is
in Km/h,
throughput is the
(number of passengers) × (cruising speed)
and cost is the
(fuel) per (passenger - Km.)
determined as (fuel capacity)/(passengers × range)
173
Which of these has the best “performance?”
This depends on how you define the term “performance.”
For raw speed, (getting from one place to another quickly) the Concorde is over twice as fast as its closest competitor.
If we are interested in the rate at which people are carried (we call
this throughput) then the Boeing 747-400 clearly has the best
performance.
Often we are interested in relating performance and cost. In this example, if we consider cost as the amount of fuel used per passenger-Km., then the most economical plane is the Boeing 767-300. Clearly, though, the car is easily the most economical overall.
Note that we could also define cost in many different ways.
We can define similar measures of performance and cost for computers.
In a computer system, we are interested in the number of computations done per unit time, as well as the cost of the computation.
Typically, we are interested in several aspects of the cost; e.g. the
initial purchase price, the operating cost, or the cost of training users
of the system.
174
In a computer system, we may be interested in the amount of time
it takes a program to complete, (speed or response time), or the
rate at which a number of processes complete (throughput), or in
the cost of the system, relative to its performance.
Since a computer program is merely a set of instructions for the
particular computer, one might think that comparing the average
instruction speed for two computers would be a good measure of
performance.
This turns out not to be so, for a number of reasons; for example:
• Different computers have different instruction sets; some have
very powerful instructions and others very simple, so the number
of instructions required for a program might be very different on
two different computers.
• The instructions themselves may be implemented differently, and have different execution times. This may even be true for two machines which have the same instruction set (e.g., the Pentium and the Pentium IV, or the AMD Athlon).
• Different compilers may produce very different machine code
from the same source code.
175
Typically, a processor has a basic “clock speed,” and each instruction requires some number of clock cycles to execute.
In order to determine the time required to execute a particular program (TP) we might think that we could take each instruction (I) to be executed, multiply it by the number of clock cycles for that instruction (CPI_I), and sum the result:
    TP = [ Σ over all instructions I of (I × CPI_I) ] × (time for one clock cycle)
This does not work for several reasons:
• Many processors have instructions with variable execution times
• Most processors today execute several instructions simultaneously
It is possible, however, to approximate the run time of a program if
we can determine an average number of cycles per instruction for the
particular processor (and the program to be run).
In this case, the execution time can be approximated by
    TP = [ Σ (I × average CPI) ] × (time for one clock cycle)
176
This can be rewritten as
TP = (N × average CPI) × (time for one clock cycle)
where N is the number of instructions executed by the program.
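For instance, the following small Python calculation (the instruction counts, CPI values, and clock rate are made-up numbers, purely for illustration) applies this formula:

    # Hypothetical instruction mix for one program run
    counts = {'alu': 5_000_000, 'load/store': 2_000_000, 'branch': 1_000_000}
    cpi    = {'alu': 1,         'load/store': 2,         'branch': 3}

    clock_period = 1 / 2.0e9    # a 2 GHz clock, i.e. 0.5 ns per cycle

    cycles = sum(counts[k] * cpi[k] for k in counts)
    n = sum(counts.values())
    print("average CPI:", cycles / n)                 # 1.5
    print("execution time:", cycles * clock_period)   # about 0.006 s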
Note the following:
• The number of instructions executed by the program depends on
the compiler used to generate the machine language code, and
on the particular instruction set of the processor.
• The average CP I depends on the particular instruction mix of
the program.
• The clock cycle time depends on the detailed implementation of
the processor, including the speed of the underlying technology,
the complexity of the individual instructions, and the degree of
parallelism in the processor.
Improvements in compiler technology typically produce about a 5%
speedup per year.
Improvements in technology typically produce about a 50% speedup
per year.
177
All of the previous discussion makes several assumptions:
• The process under consideration is the only process running on
the machine.
In a “real” computing environment, many processes may be running simultaneously. (In a Linux system, run the program top
to see what processes are presently using resources).
• The processor speed determines the rate at which instructions
are executed.
In reality, memory access can be much slower than the processor
speed, especially for large programs where the entire data and
instructions cannot fit in main memory. (We will discuss memory
performance later in the course.)
• In high performance systems, several processors may work simultaneously on a single process.
At present, most processes run on a single processor, but it is
possible to break up a computation into several “threads” which
can be executed on different, interconnected, processors. (We
will discuss this later in the course.)
178
Why not use “typical” programs to measure performance?
This seems reasonable, since our idea of performance for a computer
system is related to the time required to run the programs in which
we are interested.
Performance = 1/execution time
If we can find a “typical set of programs” which fairly reflect the type
of code we run, then comparing the time to run these on different
machines may be a good measure of performance.
Generally, though, we do not know exactly what programs will be
run on a system throughout its lifetime.
Also, the typical load (set of programs to be run) usually changes
over the useful life of a computer system.
Usually, our goal is more modest — to determine the “best” processor
for a particular set of programs, at a given price, at a given time.
179
Consider the following example:
Program   Time on Machine A   Time on Machine B
P1        10 s                20 s
P2        50 s                25 s
Here, if we consider P1, then Machine A is twice as fast as Machine
B. If we consider program P2, Machine B is twice as fast as Machine
A.
It may be reasonable to use a weighted average of the programs,
where the weight is the relative number of times each program is
usually run. For example, if P1 is run 3 times as frequently as P2,
then the relative time required for Machines A and B is:
    Machine A: (3 × 10) + 50 = 80 s        Machine B: (3 × 20) + 25 = 85 s
So, Machine A requires 80/85 the time of Machine B.
Alternately, Machine A has 85/80 × the performance of Machine B.
Note that, for different weightings of the two programs, the conclusion as to which machine has the higher performance could be
different.
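The weighted comparison is a one-line calculation; the sketch below (in Python, with the weights and times taken from the example above) just formalizes it:

    times_a = {'P1': 10, 'P2': 50}
    times_b = {'P1': 20, 'P2': 25}
    weights = {'P1': 3, 'P2': 1}     # P1 is run 3 times as often as P2

    weighted_a = sum(weights[p] * times_a[p] for p in weights)   # 80
    weighted_b = sum(weights[p] * times_b[p] for p in weights)   # 85
    print("relative performance of A over B:", weighted_b / weighted_a)   # 1.0625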
180
Performance benchmarks
In order to compare different processors, or different implementations
of a single processor, people use various measures of performance, or
benchmarks. Many benchmarks exist, often providing contradictory
information about various processors.
Several “standard” benchmark suites are available, and many of these
also specify how the benchmark programs are to be compiled and run.
One of the most famous benchmark suites (and also one of the most
useful) is the SPEC benchmark suite. Information about it can be
found at URL
http://www.spec.org/
The SPEC benchmark uses the weighted running times of a set of
programs. The programs have changed with time; the present SPEC
CPU (SPEC CPU2006) was preceded by SPEC CPU 2000, SPEC95,
SPEC92, and SPEC89.
There are now sets of SPEC benchmarks for different aspects of systems performance, including integer and floating point performance,
and graphics processor performance.
181
SPEC 2006 Benchmarks
Integer
Benchmark        Language   Category
400.perlbench    C          Programming Language
401.bzip2        C          Compression
403.gcc          C          C Programming Language Compiler
429.mcf          C          Combinatorial Optimization
445.gobmk        C          AI, Game Playing: Go
456.hmmer        C          Bioinf., Gene Sequence Search
458.sjeng        C          AI, Game Playing: Chess
462.libquantum   C          Physics/Quantum Computing
464.h264ref      C          Video Compression
471.omnetpp      C++        Discrete Event Simulation
473.astar        C++        Path-finding Algorithms
483.xalancbmk    C++        XML Processing
182
Floating point
Benchmark        Language     Category
410.bwaves       Fortran      Fluid Dynamics
416.gamess       Fortran      Quantum Chemistry
433.milc         C            Physics/Quantum Chromodynamics
434.zeusmp       Fortran      Physics/Computational Fluid Dynamics
435.gromacs      C, Fortran   Biochemistry/Molecular Dynamics
436.cactusADM    C, Fortran   Physics/General Relativity
437.leslie3d     Fortran      Fluid Dynamics
447.dealII       C++          Finite Element Analysis
450.soplex       C++          Linear Programming, Optimization
453.povray       C++          Image Ray-tracing
454.calculix     C, Fortran   Structural Mechanics
459.GemsFDTD     Fortran      Computational Electromagnetics
465.tonto        Fortran      Quantum Chemistry
470.lbm          C            Fluid Dynamics
481.wrf          C, Fortran   Weather
482.sphinx3      C            Speech Recognition
183
Determining the effect of performance “improvements”:
Consider the case where some aspect of the performance of a processor is improved, without making other improvements.
For example, consider a numerically intensive problem in which 25%
of the time is spent doing floating point arithmetic.
Suppose the floating point unit is improved to perform five times
faster. How much faster does the program run now?
Clearly, only the part of the program that has improved performance
will run faster, and we can easily calculate by how much:
0.75 + 0.25/5 = 0.8 — i.e., it will require 80% of the original time.
This observation can be expressed as
    execution time after improvement =
        (execution time of unimproved part) + (execution time of improved part) / (amount of improvement)
This relationship is called Amdahl’s law.
Note that the overall speedup is relatively small (20%) even though
the performance increase for part of the code was dramatic.
Amdahl’s law has interesting consequences for parallel machines —
ultimately, it is the serial, or unparallelizable, component of the code
that determines its running time.
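Amdahl's law is easy to capture as a small function; the sketch below (in Python, with my own parameter names) reproduces the numbers in the example above and shows the consequence for parallel machines:

    def speedup(fraction_improved, improvement):
        """Overall speedup when only a fraction of the execution time is improved."""
        new_time = (1 - fraction_improved) + fraction_improved / improvement
        return 1 / new_time

    # 25% of the time is floating point work, made 5 times faster:
    print(speedup(0.25, 5))      # about 1.25 (the program takes 80% of the original time)

    # Even with 1000 processors, a 5% serial part caps the speedup:
    print(speedup(0.95, 1000))   # about 19.6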
184
Brief summary of performance measures
The only meaningful measure of performance is execution time for
your “job mix”
The time to execute a program depends on:
Clock speed (MHz)
Code size
Cycles per instruction (CPI)
Composite or other measures of performance — what problems arise
from their use?
MIPS
(Millions of Instructions Per Second, or
Meaningless Indicator of Processor Speed)
MFLOPS
(Millions of Floating Point Operations Per Second)
SPEC
185
Where are we now?
We have built up a “toolbox” of components (logic gates, adders, ALU's, MUX's, registers, etc.) and skills (combinational logic design, state machine design), and want to use those to implement a small MIPS-like processor.
186
Design and implementation of the processor
We now have all the raw material to design a processor with the
instruction set we examined earlier.
We will actually design several implementations of the processor,
each with different performance characteristics.
The first implementation will be a “single cycle processor” in which
each instruction will take exactly one clock period. (In other words,
the CPI will be 1.)
In the next implementation, each instruction will require several cycles for execution, and different instructions may require a different
number of cycles. In this case, the clock period may be shorter than
the single cycle machine, but the CPI will be larger.
We will begin by reviewing the instruction set and designing a data
path.
Earlier, when discussing the instruction set, we identified a rough
structure for a computer system:
[Figure: the CPU connected to MEMORY and to INPUT/OUTPUT devices.]
187
Presently, we are interested in the CPU only, which we concluded
would have a structure similar to the following:
[Figure: the internal structure of the CPU — the general registers and/or accumulator and the ALU form the datapath; the PC, program control unit (PCU), and address generator sequence instructions; the instruction decode and control unit directs them; the MAR and MDR connect the processor to memory.]
The memory address register (MAR) and memory data register(MDR)
are the interface to memory.
The ALU and register file are the core of the data path.
The program control unit (PCU) fetches instructions and data, and
handles branches and jumps.
The instruction decode unit (IDU) is the control unit for the processor.
188
The “building blocks”
We have already designed many of the major components for the
processor, or have at least identified how they could be implemented.
For example, we have already designed an ALU, a data register, and
a register file.
A controller is merely a state machine, and we can implement one
using, say, a PLA, after identifying the required states and transitions.
Following are some of the combinational logic components we will
use:
[Figure: symbols for the combinational building blocks — a 32-bit Adder with inputs A and B and outputs Sum and Carry; a 32-bit ALU with inputs A and B, control input OP, and outputs Result and Zero; and a Multiplexor with data inputs, select input S, and output Y.]
Note that the diagram highlights the control signals (OP and S).
189
Following are some of the register components we will use:
[Figure: the register components — a counter (the PC), a 32-bit register with Data in, Data out, a Write enable, and a Clock input, and the register file with Read register 1/2 inputs (5 bits each), Read data 1/2 outputs (32 bits each), a Write Register input (5 bits), Write data (32 bits), a Write enable, and a Clock.]
Note that the registers have a write enable input as well as a clock
input. This input must be asserted in order for the register to be
written.
We have already seen how to construct a register file from simple D
registers.
190
Timing considerations
In a single-cycle implementation of the processor, a single instruction
(e.g., add) may require that a register be read from and written into
in the same clock period. In order to accomplish this, the register
file (and other register elements) must be edge triggered.
This can be done by using edge triggered elements directly, or by
using a master-slave arrangement similar to one we saw earlier:
[Figure: a master-slave register — two D elements in series, the master clocked on one phase of the clock and the slave on the opposite (inverted) phase.]
Another observation about a single cycle processor — the memory
for instructions must be different from the memory for data, because
both must be addressed in the same cycle. Therefore, there must be
two memories; one for instructions, and one for data.
[Figure: the two memories — the Data Memory, with a 32-bit Address input, 32-bit Write data and Read data ports, and MemWr/MemRd control inputs, and the Instruction Memory, with a Read Address input and a 32-bit Instruction[31−0] output.]
191
The MIPS instruction set:
Following is the MIPS instruction format:
R-type (register)
    31       26 25     21 20     16 15     11 10      6 5        0
    |   op     |   rs    |   rt    |   rd    |  shamt  |  funct   |
      6 bits     5 bits    5 bits    5 bits    5 bits     6 bits

I-type (immediate)
    31       26 25     21 20     16 15                            0
    |   op     |   rs    |   rt    |          immediate           |
      6 bits     5 bits    5 bits             16 bits

J-type (jump)
    31       26 25                                                0
    |   op     |                    target                        |
      6 bits                        26 bits
We will develop an implementation of a very basic processor having
the instructions:
R-type instructions
add, sub, and, or, slt
I-type instructions
addi, lw, sw, beq
J-type instructions
j
Later, we will add additional instructions.
192
Steps in designing a processor
• Express the instruction architecture in a Register Transfer Language (RTL)
• From the RTL description of each instruction, determine
– the required datapath components
– the datapath interconnections
• Determine the control signals required to enable the datapath
elements in the appropriate sequence for each instruction
• Design the control logic required to generate the appropriate
control signals at the correct time
193
A Register Transfer Language description of some operations:
The ADD instruction
add rd, rs, rt
• mem[PC]                     Fetch the instruction from memory
• R[rd] ← R[rs] + R[rt]       Set register rd to the sum of the contents of registers rs and rt
• PC ← PC + 4                 Calculate the address of the next instruction
All other R-type instructions will be similar.
The addi instruction
addi rs, rt, imm16
• mem[PC]                            Fetch the instruction from memory
• R[rt] ← R[rs] + SignExt(imm16)     Set register rt to the sum of the contents of register rs and the sign extended immediate data word imm16
• PC ← PC + 4                        Calculate the address of the next instruction
All immediate arithmetic and logical instructions will be similar.
194
The load instruction
lw rs, rt, imm16
• mem[PC]                            Fetch the instruction from memory
• Addr ← R[rs] + SignExt(imm16)      Set the memory address to the sum of the contents of register rs and the sign extended immediate data word imm16
• R[rt] ← Mem[Addr]                  Load the data at address Addr into register rt
• PC ← PC + 4                        Calculate the address of the next instruction

The store instruction
sw rs, rt, imm16
• mem[PC]                            Fetch the instruction from memory
• Addr ← R[rs] + SignExt(imm16)      Set the memory address to the sum of the contents of register rs and the sign extended immediate data word imm16
• Mem[Addr] ← R[rt]                  Store the data from register rt into memory at address Addr
• PC ← PC + 4                        Calculate the address of the next instruction
195
The branch instruction
beq rs, rt, imm16
• mem[PC]                                        Fetch the instruction from memory
• Cond ← R[rs] − R[rt]                           Evaluate the branch condition
• if (Cond eq 0)
      PC ← PC + 4 + (SignExt(imm16) × 4)         Calculate the address of the next instruction
  else PC ← PC + 4

The jump instruction
j target                                         target is a memory address
• mem[PC]                                        Fetch the instruction from memory
• PC ← PC + 4                                    Increment PC by 4
• PC<31:2> ← PC<31:28> concat I<25:0>            Replace the low order 28 bits of the PC with the low order 26 bits from the instruction, left shifted by 2
196
The Instruction Fetch Unit
Note that all instructions require that the PC be incremented.
We will design a datapath which performs this function — the Instruction Fetch Unit.
Its operation is described by the following:
• mem[PC]          Fetch the instruction from memory
• PC ← PC + 4      Increment the PC
[Figure: the instruction fetch unit — the PC supplies the Read address of the Instruction Memory, which outputs Instruction[31−0]; an adder computes PC + 4.]
Note that this does not yet handle branches or jumps.
Since it is the same for all instructions, when describing individual
instructions this component will normally be omitted.
197
Datapath for R-type instructions
• R[rd] ← R[rs] op R[rt]
Example: add rd, rs, rt
Recall that this instruction type has the following format:
R−type (register)
    31       26 25     21 20     16 15     11 10      6 5        0
    |   op     |   rs    |   rt    |   rd    |  shamt  |  funct   |
      6 bits     5 bits    5 bits    5 bits    5 bits     6 bits
The datapath contains the 32 bit register file and an ALU capable
of performing all the required arithmetic and logic functions.
[Figure: the R-type datapath — Inst[25−21] (rs) and Inst[20−16] (rt) address the two read ports of the register file, which drive BusA and BusB into the ALU (controlled by ALUCtr); the 32-bit Result is written back to the register selected by Inst[15−11] (rd) when RegWr is asserted, on the clock edge.]
Note that the register is read from and written to at the “same
time.” This implies that the register’s memory elements must be
edge triggered, or are read and written on different clock phases, to
allow the arithmetic operation to complete before the data is written
in the register.
198
This datapath contains everything required to implement the required instructions add, sub, and, or, slt. All that is required
is that the appropriate values be provided for the ALUCtr input for
the required operation.
The register operands in the instruction field determine the registers which are read from and written to, and the funct field of the instruction determines which particular ALU operation is executed.
Recalling the control inputs for the ALU seen earlier, the values for
the control input are:
ALU control lines   Function
000                  and
001                  or
010                  add
110                  subtract
111                  set on less than
A control unit for the processor will be designed later.
It will set all the required control signals for each instruction, depending both on the particular instruction being executed (the op
code) and, for r-type instructions, the funct field.
199
Datapath for Immediate arithmetic and logical instructions
• R[rt] ← R[rs] op imm16
Example: addi rt, rs, imm16
Recall that this instruction type has the following format:
I−type (immediate)
    31       26 25     21 20     16 15                            0
    |   op     |   rs    |   rt    |          immediate           |
      6 bits     5 bits    5 bits             16 bits
The main difference between this and an r-type instruction is that
here one operand is taken from the instruction, and sign extended (for
signed data) or zero extended (for logical and unsigned operations.)
[Figure: the datapath extended for immediate instructions — a MUX controlled by RegDst selects Inst[20−16] (rt) or Inst[15−11] (rd) as the write register, a sign extender widens Inst[15−0] to 32 bits, and a MUX controlled by ALUSrc selects either BusB (register data) or the extended immediate as the second ALU operand.]
Note the use of MUX’s (with control inputs) to add functionality.
200
Datapath for the Load instruction
lw rt, rs, imm16
• Addr ← R[rs] + SignExt(imm16)   Calculate the memory address
• R[rt] ← Mem[Addr]               Load the data into register rt
This is also an immediate type instruction:
I−type (immediate)
    31       26 25     21 20     16 15                            0
    |   op     |   rs    |   rt    |          immediate           |
      6 bits     5 bits    5 bits             16 bits
[Figure: the datapath extended for the load instruction — the ALU output addresses the Data Memory (with MemRd asserted), and a MUX controlled by MemtoReg selects either the ALU result or the memory Read data as the value written back to the register file.]
201
Datapath for the Store instruction
sw rt, rs, imm16
• Addr ← R[rs] + SignExt(imm16)   Calculate the memory address
• Mem[Addr] ← R[rt]               Store the data from register rt to memory
This is also an immediate type instruction:
I−type (immediate)
    31       26 25     21 20     16 15                            0
    |   op     |   rs    |   rt    |          immediate           |
      6 bits     5 bits    5 bits             16 bits
[Figure: the datapath for the store instruction — the ALU output addresses the Data Memory, BusB (register rt) drives the memory Write data port, and MemWr is asserted; no register write occurs.]
202
Datapath for the Branch instruction
beq rt, rs, imm16
• Cond ← R[rs] − R[rt]                           Calculate the branch condition
• if (Cond eq 0)
      PC ← PC + 4 + (SignExt(imm16) × 4)         Calculate the address of the next instruction
  else PC ← PC + 4
This is also an immediate type instruction.
In the load and store instructions, the ALU was used to calculate
the address for data memory.
It is possible to do this for the branch instructions as well, but it
would require first performing the comparison using the ALU, and
then using the ALU to calculate the address.
This would require two clock periods, in order to sequence the operations correctly.
A faster implementation would be to provide another adder to implement the address calculation. This is what we will do, for the
present example.
203
[Figure: the datapath extended for the branch instruction — one adder computes PC + 4, a second adder adds PC + 4 to the sign extended immediate shifted left 2 bits, and a MUX controlled by PCSrc (Branch ANDed with the ALU Zero output) selects which value is loaded into the PC.]
204
Datapath for the Jump instruction
j target
• PC<31:2> ← PC<31:28> concat target<25:0>       Calculate the jump address by concatenating the high order 4 bits of the PC with the target address
The additions to the datapath are straightforward.
J−type (jump)
    31       26 25                                                0
    |   op     |                target address                    |
      6 bits                        26 bits
205
[Figure: the datapath extended for the jump instruction — the 26-bit target field is shifted left 2 bits and concatenated with PC+4[31−28] to form the jump address; a MUX controlled by Jump selects between this address and the branch/sequential address as the next PC.]
206
Putting it together
The datapath was shown in segments, some of which built on each
other.
Required control signals were identified, and all that remains is to:
1. Combine the datapath elements
2. Design the appropriate control signals
Combining the datapath elements is rather straightforward, since
we have mainly built up the datapath by adding functionality to
accommodate the different instruction types.
When two paths are required, we have implemented both and used
multiplexors to choose the appropriate results.
The required control signals are mainly the inputs for those MUX’s
and the signals required by the ALU.
The next slide shows the combined data path, and the required control signals.
The actual control logic is yet to be designed.
207
[Figure: the combined single-cycle datapath and control. The PC addresses the Instruction Memory; Inst[31−26] drives the main Control unit, which generates RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Inst[5−0] (funct) drives the ALU control; MUXes select the write register (RegDst), the second ALU operand (ALUSrc), the register write data (MemtoReg), and the next PC (PCSrc and Jump).]
208
Designing the control logic
The control logic depends on the details of the devices in the control
path, and on the individual bits in the op code for the instructions.
The arithmetic and logic operations for the r-type instructions also
depend on the funct field of the instruction.
The datapath elements we have used are:
• a 32 bit ALU with an output indicating if the result is zero
• adders
• MUX’s (2 line to 1-line)
• a 32 register × 32 bits/register register file
• individual 32 bit registers
• a sign extender
• instruction memory
• data memory
209
The ALU — a single bit
[Figure: the single-bit ALU, as before — inputs a, b, Less, and Carryin; control inputs Binvert and the 2-bit Operation MUX select; outputs Result and Carryout.]
Note that there are three control bits; the single bit Binvert, and
the two bit input to the MUX, labeled Operation.
The ALU performs the operations and, or, add, and subtract.
210
The 32 bit ALU
[Figure: the 32-bit ALU, as before — 32 single-bit ALUs in a ripple-carry chain, with the zero, Set, and Overflow outputs.]
ALU control lines   Function
000                  and
001                  or
010                  add
110                  subtract
111                  set on less than
211
We will design the control logic to implement the following instructions (others can be added similarly):
Name        Op-code
            Op5  Op4  Op3  Op2  Op1  Op0
R-format     0    0    0    0    0    0
lw           1    0    0    0    1    1
sw           1    0    1    0    1    1
beq          0    0    0    1    0    0
j            0    0    0    0    1    0
Note that we have omitted the immediate arithmetic and logic functions.
The funct field will also have to be decoded to produce the required
control signals for the ALU.
A separate decoder will be used for the main control signals and the
ALU control. This approach is sometimes called local decoding. Its
main advantage is in reducing the size of the main controller.
212
The control signals
The signals required to control the datapath are the following:
• Jump — set to 1 for a jump instruction
• Branch — set to 1 for a branch instruction
• MemtoReg — set to 1 for a load instruction
• ALUSrc — set to 0 for r-type instructions, and 1 for instructions
using immediate data in the ALU (beq requires this set to 0)
• RegDst — set to 1 for r-type instructions, and 0 for immediate
instructions
• MemRead — set to 1 for a load instruction
• MemWrite — set to 1 for a store instruction
• RegWrite — set to 1 for any instruction writing to a register
• ALUOp (k bits) — encodes ALU operations except for r-type
operations, which are encoded by the funct field
For the instructions we are implementing, ALUOp can be encoded
using 2 bits as follows:
ALUOp[1]   ALUOp[0]   Instruction
0          0          memory operations (load, store)
0          1          beq
1          0          r-type operations
213
The following tables show the required values for the control signals
as a function of the instruction op codes:
Instruction   Op-code        RegDst   ALUSrc   MemtoReg   RegWrite
r-type        0 0 0 0 0 0    1        0        0          1
lw            1 0 0 0 1 1    0        1        1          1
sw            1 0 1 0 1 1    x        1        x          0
beq           0 0 0 1 0 0    x        0        x          0
j             0 0 0 0 1 0    x        x        x          0
Instruction   Op-code        MemRead   MemWrite   Branch   ALUOp[1:0]   Jump
r-type        0 0 0 0 0 0    0         0          0        10           0
lw            1 0 0 0 1 1    1         0          0        00           0
sw            1 0 1 0 1 1    0         1          0        00           0
beq           0 0 0 1 0 0    0         0          1        01           0
j             0 0 0 0 1 0    0         0          0        xx           1
This is all that is required to implement the control signals; each
control signal can be expressed as a function of the op-code bits.
For example, writing Opn' for the complement of op-code bit Opn,
    RegDst = Op5' · Op4' · Op3' · Op2' · Op1' · Op0'
    ALUSrc = Op5 · Op4' · Op2' · Op1 · Op0
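These sum-of-products equations can be checked with a few lines of code. The sketch below (Python, my own structure; don't-care entries from the tables are simply shown as 0) decodes the main control signals directly from the 6-bit op code:

    def main_control(op):
        """Decode the main control signals from a 6-bit op code (tables above)."""
        signals = {
            0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,      # r-type
                           MemRead=0, MemWrite=0, Branch=0, Jump=0, ALUOp=0b10),
            0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,      # lw
                           MemRead=1, MemWrite=0, Branch=0, Jump=0, ALUOp=0b00),
            0b101011: dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=0,      # sw
                           MemRead=0, MemWrite=1, Branch=0, Jump=0, ALUOp=0b00),
            0b000100: dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0,      # beq
                           MemRead=0, MemWrite=0, Branch=1, Jump=0, ALUOp=0b01),
            0b000010: dict(RegDst=0, ALUSrc=0, MemtoReg=0, RegWrite=0,      # j
                           MemRead=0, MemWrite=0, Branch=0, Jump=1, ALUOp=0b00),
        }
        return signals[op]

    print(main_control(0b100011)['ALUSrc'])   # 1 for lw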
All that remains is to design the control for the ALU.
214
The ALU control
The inputs to the ALU control are the ALUOp control signals, and
the 6 bit funct field.
The funct field determines the ALU operations for the r-type operations, and ALUOp signals determine the ALU operations for the
other types of instructions.
Previously, we saw that if ALUOp[1] was 1, it indicated an r-type
operation. ALUOp[0] was set to 0 for memory operations (requiring
the ALU to perform an add operation to calculate the address for
data) and to 1 for the beq operation, requiring a subtraction to
compare the two operands.
The ALU itself requires a three bit control input.
The following table shows the required inputs and outputs for the
instructions using the ALU:
Instruction   ALUOp   funct          ALU operation      ALU control input
lw            00      x x x x x x    add                010
sw            00      x x x x x x    add                010
beq           01      x x x x x x    subtract           110
add           10      1 0 0 0 0 0    add                010
sub           10      1 0 0 0 1 0    subtract           110
and           10      1 0 0 1 0 0    AND                000
or            10      1 0 0 1 0 1    OR                 001
slt           10      1 0 1 0 1 0    set on less than   111
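The local ALU decoder is a small function of ALUOp and the funct field. A Python sketch of the table above (the encoding of the output as a 3-bit integer is my own convenience):

    def alu_control(aluop, funct):
        """Return the 3-bit ALU control value from ALUOp and the funct field."""
        if aluop == 0b00:            # lw / sw: address calculation
            return 0b010             # add
        if aluop == 0b01:            # beq: compare by subtracting
            return 0b110             # subtract
        # aluop == 0b10: r-type, decode the funct field
        return {0b100000: 0b010,     # add
                0b100010: 0b110,     # sub
                0b100100: 0b000,     # and
                0b100101: 0b001,     # or
                0b101010: 0b111}[funct]   # slt

    print(format(alu_control(0b10, 0b100010), '03b'))   # 110 (subtract)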
215
Extending the instruction set
What is necessary to add another instruction to the instruction set?
First, the appropriate elements must be added to the datapath.
Second, any control elements must be added, and appropriate control signals identified.
Third, the control logic must be extended to enable the appropriate
elements in the datapath.
Let us consider adding the instruction or immediate (ori)
It has the form
ori $s1, $s2, imm
Its function is to perform the logical OR of the contents of register
$s2 with the zero extended immediate data field imm, storing the
result in register $s1.
$s1 ← $s2 | ZeroExtend[imm]
It has op-code 0 0 1 1 0 1 and is an immediate type instruction.
216
First — add elements to the data path
Examining the data path, the ALU can perform the OR operation,
but the extender unit only supports sign extension. It can be replaced by a unit, sign or zero extend, which can perform both
functions.
Second — add control elements
This new unit requires a new control signal to select the zero extend
function (0) or the sign extend function (1).
We will label the new signal ExtOp.
Also, the 2-bit control signal ALUOp only encodes the operations add
and subtract. Adding a third bit would allow the encoding of the
operations AND and OR.
It can be encoded as follows:
ALUOp[2]   ALUOp[1]   ALUOp[0]   Instruction
0          0          0          memory operations (load, store)
0          0          1          beq (subtract, in the ALU)
0          1          0          ori
1          x          x          r-type operations
The following diagram shows the changes required to the datapath:
217
[Figure: the combined datapath modified for ori — the sign extender is replaced by a “sign or zero extend” unit controlled by the new ExtOp signal, and the main Control unit now generates the 3-bit ALUOp in addition to the signals shown before.]
218
Third - the control logic
The truth table for the ALU control unit extends to:
Instruction   ALUOp   funct          ALU operation      ALU control input
lw            000     x x x x x x    add                010
sw            000     x x x x x x    add                010
beq           001     x x x x x x    subtract           110
ori           010     x x x x x x    OR                 001
add           100     1 0 0 0 0 0    add                010
sub           100     1 0 0 0 1 0    subtract           110
and           100     1 0 0 1 0 0    AND                000
or            100     1 0 0 1 0 1    OR                 001
slt           100     1 0 1 0 1 0    set on less than   111
For the ori instruction, the following settings are required for the
remaining control signals:
Jump       0
Branch     0
MemRead    0
MemWrite   0
MemtoReg   0
ALUSrc     1     ALU operand is from the extender
RegDst     0     rt is the destination register
RegWrite   1     result will be written in reg[rt]
ExtOp      0     zero extend
219
The modified tables for the control signals are:
Inst.    Op-code        RegDst   ALUSrc   MemtoReg   RegWrite
r-type   0 0 0 0 0 0    1        0        0          1
lw       1 0 0 0 1 1    0        1        1          1
sw       1 0 1 0 1 1    x        1        x          0
beq      0 0 0 1 0 0    x        0        x          0
j        0 0 0 0 1 0    x        x        x          0
ori      0 0 1 1 0 1    0        1        0          1

Inst.    Op-code        MemRead   MemWrite   Branch   Jump   ALUOp[2:0]   ExtOp
r-type   0 0 0 0 0 0    0         0          0        0      1 0 0        x
lw       1 0 0 0 1 1    1         0          0        0      0 0 0        1
sw       1 0 1 0 1 1    0         1          0        0      0 0 0        1
beq      0 0 0 1 0 0    0         0          1        0      0 0 1        1
j        0 0 0 0 1 0    0         0          0        1      x x x        x
ori      0 0 1 1 0 1    0         0          0        0      0 1 0        0
Some of the control logic may have to be modified. For example,
the logic generating the signal ALUSrc would have to ensure that the
value 1 was set for the ori instruction:
    ALUSrc = Op5 · Op4' · Op2' · Op1 · Op0  +  Op5' · Op4' · Op3 · Op2 · Op1' · Op0
The new control signal, ExtOp, can be evaluated as:
    ExtOp = Op5 · Op4' · Op2' · Op1 · Op0  +  Op5' · Op4' · Op3' · Op2 · Op1' · Op0'
220
Other control logic implementations
Because there are only a few instructions to be implemented, the
control logic for this processor implementation is probably best implemented using simple logic functions as shown previously.
It is quite common to implement simple controllers as a PLA.
Following is a PLA implementation for the processor we have designed so far:
[Figure: a PLA implementation of the main control. The op-code bits op5–op0 feed an AND plane with one product term per instruction (000000 R-type, 001101 ori, 100011 lw, 101011 sw, 000100 beq, 000010 j); the OR plane combines these terms to produce RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, Jump, ExtOp, and ALUOp[2]–ALUOp[0].]
221
The controller for the ALU could be implemented similarly, although
it is also probably best implemented using simple logic functions, as
well.
Note that, in the preceding controller, there was an AND term corresponding to each instruction. For a small number of instructions, this
is effective. However, if the number of instructions is large (i.e., there
are op codes for most of the 6 bit instruction combinations) then the
controller could also be implemented as a read-only memory (ROM).
In this case, the op codes would be used as address inputs to the
ROM, and the outputs would be the values stored at those addresses.
There are 12 output bits, and the total size of the memory would be
2^6 = 64 words of 12 bits. The encoding would be quite straightforward; merely the contents of the logic table for each control bit.
This would not be an efficient implementation for the ALU control,
however. The funct field has 6 bits, and the ALUOp control input
has 3 bits, for a total of 9 bits, requiring 2^9 = 512 memory words of
3 bits.
Another option for the ALU control bits is to use the funct field
to generate the required three control signals, and have the main
controller also generate these control signals directly. They could
then be selected by a MUX, which would select the control signals
evaluated from the funct field only if the instruction is r-type.
The input to the MUX could be the logical OR of the instruction
field, which evaluates to 0 only for r-type instructions.
222
The time required for single cycle instructions
Arithmetic and logical instructions:
    PC → Inst. Memory → Reg. Read → mux → ALU → mux → Reg. Write
Branch:
    PC → Inst. Memory → Reg. Read → mux → ALU → mux → mux → PC
    (The sign extension, shift, and add occur in parallel with the register read and the ALU comparison.)
Load:
    PC → Inst. Memory → Reg. Read → mux → ALU → Data Memory → mux → Reg. Write    (the "critical path")
Store:
    PC → Inst. Memory → Reg. Read → mux → ALU → Data Mem.
Jump:
    PC → Inst. Memory → mux
The clock period must be at least as long as the time for the critical
path.
223
R−type operations
[Figure: the combined datapath with the elements active for R-type operations highlighted — instruction fetch, the two register file read ports, the ALU (operation selected by the funct field), and the register file write port.]
224
The Branch instruction − beq
[Figure: the combined datapath with the elements active for the beq instruction highlighted — instruction fetch, the register file read ports, the ALU subtraction producing the Zero output, the sign extend/shift/add path producing the branch target, and the PCSrc MUX.]
225
The Load instruction
[Figure: the same single-cycle datapath, highlighting the elements used by lw: instruction fetch, the base register read, sign extension of imm16, the ALU address calculation, the data memory read, and write-back of the memory data to register rt.]
226
The Store instruction
[Figure: the same single-cycle datapath, highlighting the elements used by sw: instruction fetch, the register reads, sign extension of imm16, the ALU address calculation, and the data memory write of the value from register rt.]
227
The Jump instruction
[Figure: the same single-cycle datapath, highlighting the elements used by j: instruction fetch and the jump address (PC+4[31−28] concatenated with Inst[25−0] shifted left 2) selected into the PC by the Jump MUX.]
228
Why is the single cycle implementation not used?
In order to have a single cycle implementation, each instruction in
the instruction set must have all the operands and control signals
available to implement the full instruction.
This means, for example, that an instruction could not require two
operands from memory. It also means that data and instructions
cannot share the same memory.
A multi-cycle implementation could have instructions and data stored
in the same memory; instructions could be fetched in one cycle, and
data fetched in another.
As well, every instruction will use exactly the same amount of time for its execution. Instructions like the jump instruction, which involves only a few datapath elements, require the same time as, say, the load instruction, which involves almost all the elements in the datapath.
With more than one clock cycle, instructions using few datapath
elements could complete in fewer clock cycles than instructions using
many elements.
Also, there may be opportunities to reuse some datapath elements if
instructions used more than one clock cycle. For example, the ALU
could also be used to calculate branch addresses.
229
Considerations in a multi-cycle implementation
There may be many considerations in the design of a multi-cycle processor.
For example, the first version of the IBM PC used a processor with an
8-bit path to memory, although the internal data paths were 16 bits.
This meant that, for a full data word to be fetched from memory,
two (8-bit) memory accesses were required. (At the time, external connections (pins on the integrated circuit “chip”) were expensive, so a smaller path to memory made the processor cheaper to manufacture.)
Other operations could also be performed in several cycles to reduce hardware costs; for example, a 32-bit add function could be implemented using an 8-bit adder, at the cost of four clock cycles to complete the add operation.
In general, a multi-cycle implementation attempts to find a compromise between the number of cycles required for a particular function
and the hardware complexity for its implementation, at a given cost.
It is a trade-off between resources and time.
The problem of designing a multi-cycle processor is therefore an optimization problem:
For a given cost (i.e., amount of hardware, or logic), what is the fastest processor that can be implemented with the specified instruction set?
230
This is really a multi-dimensional problem, and at any given time,
different manufacturers of similar hardware have had very different
implementations of a processor, with different performance characteristics.
(Consider INTEL and AMD today; they implement much the same
instruction sets, but with different price and performance characteristics. Years ago, IBM and AMDAHL processors implemented the
same instruction sets very differently, as well.)
We will consider the problem in two steps; first, decide on the hardware resources to be available, then decide the minimum clock period,
and what operations should be done in each cycle.
For the hardware resources in our implementation we will have:
• a single memory, 32 bits wide, for instructions and data
• a single ALU similar to that designed earlier
• a full 32 bit internal datapath
• as few other arithmetic elements as possible (we will attempt to
eliminate the adders required for addressing)
231
How are instructions broken down into cycles?
This is also a complex problem. A reasonable approach might be to:
• find the single indivisible operation which requires the longest
time
• attempt to do as many of the shorter operations as possible in
single cycles of the same length
In its simplest form, this is a “greedy algorithm.”
It is made more complex by the fact that the operations may have
to be performed in some particular order.
These are called precedence relations, and discovering them is important whenever looking for opportunities for parallelism.
For example, an instruction must be fetched from memory before the
arithmetic or logic function it specifies can be executed.
In many processors, fetching a value (instructions or data) from memory is the operation which takes the longest time.
In others, it is possible to divide even this operation into sub-operations;
e.g., generate a memory address in one cycle and read or write the
value in the next cycle.
For our purposes, we will consider the fetching of an operand from
memory as the single indivisible operation which will define our basic
cycle time.
232
Looking back at the instruction timing for the single cycle processor,
we see that the load instruction requires two memory accesses, and
therefore will require at least two cycles.
[Timing diagram, repeated from the single-cycle discussion: the arithmetic/logical, load (the "critical path"), and jump instruction timings.]
Considering the option of using the ALU to increment the PC, note
also that if the PC is read at the beginning of a cycle and loaded at
the end of the cycle, then it can be incremented in parallel with the
memory access. Also, if the diagram accurately represents the time for the various operations, the register and MUX operations together require approximately the same time as a memory operation, so the load instruction requires five cycles in total.
[Timing diagram: the load path (the "critical path") divided into five cycles of roughly equal length: 1 Inst. Memory, 2 Reg. Read and mux, 3 ALU, 4 Data Memory, 5 mux and Reg. Write.]
233
A multi-cycle implementation
We will consider the design of a multi-cycle implementation of the
processor developed so far. The processor will have:
• a single memory for instructions and data
• a single ALU for both addressing and data operations
• instructions requiring different numbers of cycles
There are now resource limitations — only one access to memory,
one access to the register file, and one ALU operation can occur in
each clock cycle.
It is clear that both the instruction and data would be required
during the execution of an instruction. Additional registers, the
instruction register (IR) and the memory data register (MDR)
will be required to hold the instruction and data words from memory
between cycles.
Registers may also be required to hold the register operands from
BusA and BusB (registers A and B, respectively).
(Recall that the branch instructions require an arithmetic comparison before an address calculation.)
We will look at each type of instruction individually to determine if
it can actually be done with the time and resources available.
234
The R-type instructions
• R[rd] ← R[rs] op R[rt]
Example: add rd, rs, rt

[Timing diagram: 1 Inst. Memory, 2 Reg. Read and mux, 3 ALU, 4 mux and Reg. Write.]
Clearly, the instruction can be completed in four cycles, from the
timing. We need only determine if the required resources are available.
• In the first cycle, the instruction is fetched from memory, and
the ALU is used to increment the PC. The instruction must be
saved in the instruction register (IR) so it can be used in the
following cycles. (This may extend the cycle time).
• In the second cycle, the registers are read, and the values from
the registers to be used by the ALU must be saved, in registers
A and B, again new registers.
• In the third cycle, the r-type operation is completed in the ALU,
and the result saved in another new register, ALUOut.
• In the fourth cycle, the value in register ALUOut is written into
the register file.
Four registers had to be added to preserve values from one cycle
to the next, but there were no resource conflicts — the ALU was
required only in the first and third cycle.
235
We can capture these steps in an RTL description:
Cycle 1   IR ← mem[PC]              Save instruction in IR
          PC ← PC + 4               increment PC
Cycle 2   A ← R[rs]                 save register values for next cycle
          B ← R[rt]
Cycle 3   ALUOut ← A op B           calculate result and store in ALUOut
Cycle 4   R[rd] ← ALUOut            store result in register file
This is really an expansion of the original RTL description of the
R-type instructions, where the internal registers are also used. The
original description was:
mem[PC]                       Fetch the instruction from memory
R[rd] ← R[rs] op R[rt]        Set register rd to the value of the operation applied
                              to the contents of registers rs and rt
PC ← PC + 4                   calculate the address of the next instruction
When using a “silicon compiler” to design a processor, designers often
refine the RTL description in a similar way in order to achieve a more
efficient implementation for the datapath or control.
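
To make the cycle-by-cycle breakdown concrete, here is a small behavioural sketch (Python, not hardware) that steps an add instruction through the four cycles using the internal registers IR, A, B, and ALUOut. The word-indexed memory dictionary is a simplification used only for the sketch.

    # Behavioural sketch of the four R-type cycles, using the internal
    # registers IR, A, B and ALUOut.  Memory is indexed by (byte address / 4).
    mem  = {0: (0 << 26) | (8 << 21) | (9 << 16) | (10 << 11) | 0x20}  # add $10, $8, $9
    regs = [0] * 32
    regs[8], regs[9] = 5, 7
    PC = 0

    # Cycle 1: IR <- mem[PC]; PC <- PC + 4
    IR = mem[PC // 4]
    PC = PC + 4

    # Cycle 2: A <- R[rs]; B <- R[rt]
    rs, rt, rd = (IR >> 21) & 0x1F, (IR >> 16) & 0x1F, (IR >> 11) & 0x1F
    A, B = regs[rs], regs[rt]

    # Cycle 3: ALUOut <- A op B   (funct 0x20 is add)
    ALUOut = (A + B) & 0xFFFFFFFF

    # Cycle 4: R[rd] <- ALUOut
    regs[rd] = ALUOut

    print(f"after 4 cycles: $10 = {regs[10]}, PC = {PC}")   # $10 = 12, PC = 4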
236
The Branch instruction — beq
• Cond ← R[rs] - R[rt]                       Calculate the branch condition
• if (Cond eq 0)
      PC ← PC + 4 + (SignExt(imm16) × 4)     calculate the address of the next instruction
• else PC ← PC + 4

[Timing diagram: 1 Inst. Memory, 2 Reg. Read and mux (sign extend and add in parallel), 3 ALU and muxes.]
In this case, three arithmetic operations are required (incrementing the PC, comparing the register values, and adding the immediate field to the PC).
Clearly, the comparison could not be done until the values have been
read from the register, so this must be done in cycle 3.
The address calculation could be done in cycle 2, however, since it
uses only data from the instruction (the immediate field) and the
new value of the PC, and the ALU is not being used in this cycle.
The result would have to be stored in a register, to be used in the
next cycle. We could use the register ALUOut for this, since the
R-type operations only require it at the end of cycle 3.
Recall that the ALU produced an output Zero which could be used
to implement the comparison. It is available during the third cycle,
and could be used to enable the replacement of the PC with the value
stored in ALUOut in the previous cycle.
237
The original RTL for the beq was:
• mem[PC]                                    Fetch the instruction from memory
• Cond ← R[rs] - R[rt]                       Evaluate the branch condition
• if (Cond eq 0)
      PC ← PC + 4 + (SignExt(imm16) × 4)     calculate the address of the next instruction
• else PC ← PC + 4
Rewriting the RTL code for the beq instruction, including the operations on the internal registers, we have:
Cycle 1   IR ← mem[PC]                    Save instruction in IR
          PC ← PC + 4                     increment PC
Cycle 2   A ← R[rs]                       save register values for next cycle
          B ← R[rt]                       (for comparison)
          ALUOut ← PC +                   calculate address for branch
            signextend(imm16) << 2        and place in ALUOut
Cycle 3   Compare A and B;                replace PC with ALUOut if Zero
          if Zero is set                  is set, otherwise do not change PC
          then PC ← ALUOut
Note that this instruction now requires three cycles.
Also, the first cycle is identical to that of the R-type instructions.
The second cycle does the same as the R-type, and also does the
address calculation. Note that, at this point, the instruction may
not require the result of the address calculation, but it is calculated
anyway.
238
The Load instruction
• Addr ← R[rs] + SignExt(imm16)    Calculate the memory address
• R[rt] ← Mem[Addr]                load data into register rt

[Timing diagram: 1 Inst. Memory, 2 Reg. Read and mux, 3 ALU, 4 Data Memory, 5 mux and Reg. Write.]
Clearly, the first cycle is the same as in the previous examples.
For the second cycle, register R[rs] contains part of an address, and
register R[rt] contains a value to be saved in memory (for store)
or to be replaced from memory (for load). They must therefore be
saved in registers (A and B) for future use, like the previous instructions.
In the third cycle, the address is calculated from the contents of A and
the imm16 field of the instruction and stored in a register (ALUOut)
for use in the next cycle.
This address (now in ALUOut) is used to access the appropriate memory location in the fourth cycle, and the contents of memory are
placed in a register MDR, the memory data register.
In the fifth cycle, the contents of the MDR are stored in the register
file in register R[rt].
239
The original RTL for load was:
• mem[PC]                          Fetch the instruction from memory
• Addr ← R[rs] + SignExt(imm16)    Set memory address to the sum of the contents of
                                   register rs and the immediate data word imm16
• R[rt] ← Mem[Addr]                load the data at address Addr into register rt
• PC ← PC + 4                      calculate the address of the next instruction
The RTL for this implementation is:
Cycle 1   IR ← mem[PC]               Save instruction in IR
          PC ← PC + 4                increment PC
Cycle 2   A ← R[rs]                  save address register for next cycle
          B ← R[rt]
Cycle 3   ALUOut ← A +               calculate address for data
            signextend(imm16)        and place in ALUOut
Cycle 4   MDR ← Mem[ALUOut]          store contents of memory at address ALUOut in MDR
Cycle 5   R[rt] ← MDR                store value originally from memory in R[rt]
Recall that this instruction was the longest instruction in the single
cycle implementation.
240
The Store instruction
• Addr ← R[rs] + SignExt(imm16)    Calculate the memory address
• Mem[Addr] ← R[rt]                store the contents of register rt in memory

[Timing diagram: 1 Inst. Memory, 2 Reg. Read and mux, 3 ALU, 4 Data Memory.]
The store instruction is much like the load instruction, except that
the value in register R[rt] is written into memory, rather than read
from it.
The main difference is that, in the fourth cycle, the address calculated
from R[rs] and imm16 (and saved in ALUOut) is used to store the
value from register R[rt] in memory.
A fifth cycle is not required.
241
The original RTL for store was:
• mem[PC]                          Fetch the instruction from memory
• Addr ← R[rs] + SignExt(imm16)    Set memory address to the value of the sum of the contents
                                   of register rs and the immediate data word imm16
• Mem[Addr] ← R[rt]                store the contents of register rt in memory at address Addr
• PC ← PC + 4                      calculate the address of the next instruction
The RTL for this implementation is:
Cycle 1   IR ← mem[PC]               Save instruction in IR
          PC ← PC + 4                increment PC
Cycle 2   A ← R[rs]                  save address register for next cycle
          B ← R[rt]                  save value to be written
Cycle 3   ALUOut ← A +               calculate address for data
            signextend(imm16)        and place in ALUOut
Cycle 4   Mem[ALUOut] ← B            store contents of register rt in
                                     memory at address ALUOut
242
The Jump instruction
• PC<31:2> ← PC<31:28> concat target<25:0>    Calculate the jump address by concatenating the
                                              high order 4 bits of the PC with the target address

[Timing diagram: 1 Inst. Memory, 2 mux.]
The first cycle, which fetches the instruction from memory and places
it in IR, and increments PC by 4, is the same as other instructions.
The next operation is to concatenate the low order 26 bits of the
instruction with the high order 4 bits of the PC.
In the PC, the low order 2 bits are 0, so they are not actually loaded
or stored.
The shift of the bits from the instruction can be accomplished without
any additional hardware, merely by connecting bit IR[25] to bit
PC[27], etc.
Note that adding 4 to the PC may cause the four high order bits to
change.
Could this cause problems?
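
The concatenation itself is easy to check with a short sketch; note that the high order bits used are those of the already-incremented PC. The values below are arbitrary examples.

    # Forming the jump target: keep the high-order 4 bits of the (incremented) PC
    # and replace the low-order 28 bits with IR[25:0] shifted left by 2.
    def jump_target(pc_plus_4, ir):
        target26 = ir & 0x03FF_FFFF          # IR[25:0]
        return (pc_plus_4 & 0xF000_0000) | (target26 << 2)

    # Arbitrary example values:
    pc_plus_4 = 0x4000_0104
    ir        = 0x0800_1234                  # a j instruction; target field 0x0001234
    print(hex(jump_target(pc_plus_4, ir)))   # 0x400048d0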
243
The original RTL for jump was:
• mem[PC]                                      Fetch the instruction from memory
• PC ← PC + 4                                  increment PC by 4
• PC<31:2> ← PC<31:28> concat I<25:0> << 2     replace low order 28 bits with the low order 26 bits
                                               from the instruction left shifted by 2
The RTL for this implementation is:
Cycle 1   IR ← mem[PC]                          Save instruction in IR
          PC ← PC + 4                           increment PC
Cycle 2   (nothing)
Cycle 3   PC<31:2> ← PC<31:28>                  replace low order 28 bits with the low order 26 bits
            concat IR<25:0> << 2                from the instruction left shifted by 2
Note that nothing is done for this instruction in cycle 2.
There is no clear reason for this, except that cycle 2 is substantially
the same for all other instructions, and following this gives a clearer
distinction between the fetch–decode–execute cycles.
244
Changes to the datapath for a multi-cycle implementation
We have found that several additional registers are required in the
multi-cycle datapath in order to save information from one cycle to
the next.
These were the registers IR, MDR, A, B, and ALUOut.
The overall hardware complexity may be reduced, however, since the
adders required for addressing have been replaced by the ALU.
Recall that the primary reason for choosing five cycles was the assumption that the time to obtain a value from memory was the single
slowest operation in the datapath. Also, we assumed that the register
file operations take a smaller, but comparable, amount of time.
If either of these conditions were not true, then quite a different
schedule of operations might have been chosen.
245
What is done during each cycle?
For this implementation, we have determined that instructions will be
divided into five cycles. (Other divisions are possible, of course, but
the original MIPS also used five cycles for the longest instruction.)
These cycles are as follows:
1. Instruction fetch (IF)
The instruction is fetched and the next address calculated.
IR ← Memory[PC]
PC ← PC + 4
2. Instruction decode (ID)
The instruction is decoded, and the register values to be read
(the contents of registers rs and rt) are stored in registers A
and B respectively.
A ← reg[IR[25:21]]
B ← reg[IR[20:16]]
At this time, the target of a branch instruction can also be calculated, because both the PC and the instruction are available.
It will have to be stored in a register (ALUOut) until it is used.
ALUOut ← PC + sign-extend(IR[15:0]) <<2
where <<2 means a left-shift of 2.
246
3. Execution (EX)
In this cycle, either
• the ALU operation is completed (for r-type and arithmetic
immediate instructions),
ALUOut ← A op B ,
or
ALUOut ← A op sign-extend(IR[15:0])
• or the memory address of a data word is calculated (for load
or store),
ALUOut ← A + sign-extend(IR[15:0])
• or the branch instruction is completed if the conditional
expression evaluates to TRUE,
if A = B
PC ← ALUOut
Note that if the target address were not calculated in the
previous clock cycle, it would have to be calculated in the
next one; the ALU is used for the comparison in this cycle.
• or the jump instruction is completed
PC ← PC[31:28] || IR[25:0] <<2
where the operator || denotes concatenation.
247
4. Memory (MEM)
Only the load and store instructions require this cycle. In this
cycle, data is read from memory,
MDR ← Memory[ALUOut]
or data is written to memory,
Memory[ALUOut] ← B
5. Writeback (WB)
In this cycle, a value is written to a register in the register file.
Either the r-type and immediate arithmetic operations write
their results to the register file
reg[IR[15:11]] ← ALUOut
or the value read from memory in the previous cycle (for a load
instruction) is written into the register file,
reg[IR[20:16]] ← MDR
Note that not all instructions require every cycle. In particular,
branch and jump instructions require only the first 3 cycles (IF,
ID, and EX).
The R-type instructions require 4 cycles (IF, ID, EX, and WB).
Store also requires 4 cycles (IF, ID, EX, and MEM).
Load requires all 5 cycles (IF, ID, EX, MEM, and WB).
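
One payoff of the variable cycle counts can be estimated with a quick CPI calculation. The cycle counts are those just listed; the instruction mix below is hypothetical, used only to illustrate the arithmetic.

    # Cycles per instruction class in this multi-cycle design, and a
    # *hypothetical* instruction mix used only to illustrate the calculation.
    cycles = {"load": 5, "store": 4, "r-type": 4, "branch": 3, "jump": 3}
    mix    = {"load": 0.25, "store": 0.10, "r-type": 0.45, "branch": 0.15, "jump": 0.05}

    cpi = sum(mix[k] * cycles[k] for k in cycles)
    print(f"average CPI = {cpi:.2f}")   # 4.05 with this assumed mix

In the single cycle design, by contrast, every instruction effectively pays the time of the longest (load) path.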
248
The datapath for the multi-cycle processor
Fortunately, after our design of the single cycle processor, we have
a good idea of the datapath elements required to implement each
individual instruction. We can also seek opportunities to reuse functional blocks in different cycles, potentially reducing the number of
hardware blocks (and hence the complexity and cost) of the datapath.
The datapath for the multi-cycle processor is similar to that of the
single cycle processor, with
• the addition of the registers noted (IR, MDR, A, B, and ALUOut)
• the elimination of the adders for address calculation
• a MUX must be extended because there are now three separate
calculations for the next address (jump, branch, and the normal
incrementing of the PC).
• additional control signals controlling the writing of the registers.
The following diagrams show the datapath for the multi-cycle implementation of the processor.
The additions to the datapath for each cycle are shown in red.
The required control signals are shown in green in the final figure.
249
[Figures (slides 250 to 255): the multi-cycle datapath, built up in stages. A single memory is shared for instructions and data, addressed through the IorD MUX from either the PC or ALUOut; the Instruction Register (IR) and Memory Data Register (MDR) hold values read from memory; registers A and B hold the register-file outputs, and ALUOut holds the ALU result; a single ALU handles data operations, PC incrementing, and branch/jump address formation, with its inputs selected by the ALUSrcA and ALUSrcB MUXes and the next PC selected by the PCSource MUX (PC + 4, ALUOut, or the jump address). The final figure adds the control unit with outputs PCWrite, PCWriteCond, PCSource, IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, MemtoReg, ALUSrcA, ALUSrcB, and ALUOp, shown in green.]
255
The control signals
The following control signals are identified in the datapath:
Signal      Action when 0 (deasserted)            Action when 1 (asserted)
RegDst      the register written is the           the register written is the
            rt field                              rd field
RegWrite    the register file will not be         the register addressed by the
            written into                          instruction will be written into
ALUSrcA     the first ALU operand is the PC       the first ALU operand is register A
MemRead     no memory read occurs                 the contents of memory at the specified
                                                  address is placed on the data bus
MemWrite    no memory write occurs                the contents of register B is written
                                                  to memory at the specified address
MemtoReg    the value written to the register     the value written to the register
            file comes from ALUOut                file comes from the MDR
256
Signal        Action when 0 (deasserted)            Action when 1 (asserted)
IorD          the memory address comes from         the memory address comes from
              the PC (an instruction)               ALUOut (a data read)
IRWrite       the IR is not written into            the IR is written into (an
                                                    instruction is read)
PCWrite       none (see below)                      the PC is written into; the value comes
                                                    from the MUX controlled by the signal
                                                    PCSource
PCWriteCond   if both it and PCWrite are not        the PC is written if the ALU
              asserted, the PC is not written       output Zero is active
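
The interaction of PCWrite, PCWriteCond, and the ALU Zero output reduces to a single Boolean expression for the PC write enable; the following small sketch just states that logic.

    # The PC is loaded when PCWrite is asserted (fetch, jump) or when
    # PCWriteCond is asserted and the ALU's Zero output is active (taken beq).
    def pc_write_enable(pc_write, pc_write_cond, zero):
        return pc_write or (pc_write_cond and zero)

    print(pc_write_enable(False, True, True))    # taken beq    -> True
    print(pc_write_enable(False, True, False))   # untaken beq  -> False
    print(pc_write_enable(True, False, False))   # fetch / jump -> True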
257
Following are the 2-bit control signals:
Signal     Value   Action taken
ALUOp      00      ALU performs ADD operation
           01      ALU performs SUBTRACT operation
           10      ALU performs operation specified by funct field
ALUSrcB    00      the second ALU operand is from register B
           01      the second ALU operand is 4
           10      the second ALU operand is the sign extended low order 16 bits of the IR (imm16)
           11      the second ALU operand is the sign extended low order 16 bits of the IR shifted left by 2 bits
PCSource   00      the PC is updated with the value PC + 4
           01      the PC is updated with the value in register ALUOut (the branch target address,
                   for a branch instruction)
           10      the PC is updated with the jump target address
The control unit must now be designed.
Since the instructions will now require several states, the control will
be a state machine, with the instruction op codes as inputs and the
control signals as outputs.
258
Review of instruction cycles and actions
Cycle   Instruction type   Action
IF      all                IR ← Memory[PC]
                           PC ← PC + 4
ID      all                A ← Reg[rs]
                           B ← Reg[rt]
                           ALUOut ← PC + (imm16 <<2)
EX      R-type             ALUOut ← A op B
        Load/Store         ALUOut ← A + sign-extend(imm16)
        Branch             if (A == B) then PC ← ALUOut
        Jump               PC ← PC[31:28] || (IR[25:0] <<2)
MEM     Load               MDR ← Memory[ALUOut]
        Store              Memory[ALUOut] ← B
WB      R-type             Reg[rd] ← ALUOut
        Load               Reg[rt] ← MDR
Note that the first two steps are required for all instructions, and all
instructions require at least the first 3 cycles.
The MEM step is required only by the load and store instructions.
The ALU control unit is still a combinational logic block, as before.
259
Design of the control unit
The control unit is a state machine, implementing the state sequencing for every instruction.
Following is a partial state machine, detailing the IF and ID stages,
which are the same for all instructions:
[State diagram:
 Start → State 0 (IF):
     MemRead = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1,
     ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00
 State 0 → State 1 (ID):
     ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
 From state 1, the next state is selected by the op code
 (OP = 'LW', 'SW', 'R-type', 'BEQ', or 'J').]
The partial state machines which implement each of the instructions
follow.
260
The memory reference instructions (Load and Store)
[State diagram, entered from state 1 when OP = 'LW' or OP = 'SW':
 State 2 (memory address computation): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00;
     then to state 3 if OP = 'LW', or to state 5 if OP = 'SW'
 State 3 (memory read):   MemRead = 1, IorD = 1; then to state 4
 State 4 (write-back):    RegWrite = 1, MemtoReg = 1, RegDst = 0; then to state 0 (instruction completed)
 State 5 (memory write):  MemWrite = 1, IorD = 1; then to state 0 (instruction completed)]
261
R-type instructions
[State diagram, entered from state 1 when OP = 'R-type':
 State 6 (execution):  ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10; then to state 7
 State 7 (write-back): RegDst = 1, MemtoReg = 0, RegWrite = 1; then to state 0 (instruction completed)]
262
Branch and Jump instructions
[State diagram, entered from state 1:
 State 8 (OP = 'BEQ'): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01,
                       PCWriteCond = 1, PCSource = 01; then to state 0 (instruction completed)
 State 9 (OP = 'J'):   PCWrite = 1, PCSource = 10; then to state 0 (instruction completed)]
These can be combined into a single state diagram, and a state machine derived from this.
263
The combined control unit
[State diagram: the combined control unit. State 0 (fetch) and state 1 (decode) are followed, according to the op code, by states 2 to 5 (memory reference), 6 and 7 (R-type), 8 (branch), or 9 (jump), with the control signal settings listed in the partial diagrams above; every path returns to state 0.]
264
Implementing the control unit
All that remains to implement the control unit is to design the control logic itself.
Inputs are the instruction op codes, as before, and the outputs are
the control signals.
The following steps are typically followed in the implementation of
any sequential device:
• Construct the state diagram or equivalent (done).
• Assign numeric (binary) values to the states.
• Choose a memory element for state memory. (Normally, these
would be D flip flops or JK flip flops.)
• Design the combinational logic blocks to implement the next-state functions.
• Design the combinational logic blocks to implement the outputs.
The actual implementation can be done in a number of ways; as
discrete logic, a PLA, read-only memory, etc.
Typically, the control unit would be automatically generated from a
description in some high level design language.
265
The control unit we have described is a Moore state machine, where
the outputs are a function only of the state.
[Block diagram of a Moore machine: the primary inputs and the present state drive the combinational next-state logic, whose outputs are captured in the state memory; the primary outputs are generated from the state alone.]
Following is a state table corresponding to the previous state diagram. Note that the outputs are missing, but they depend only on
the state values, not the inputs.
266
Present State   INPUT           Next State
0  0000         X               1  0001
1  0001         lw    100011    2  0010
1  0001         sw    101011    2  0010
1  0001         R     000000    6  0110
1  0001         BEQ   000100    8  1000
1  0001         J     000010    9  1001
2  0010         lw    100011    3  0011
2  0010         sw    101011    5  0101
3  0011         X               4  0100
4  0100         X               0  0000
5  0101         X               0  0000
6  0110         X               7  0111
7  0111         X               0  0000
8  1000         X               0  0000
9  1001         X               0  0000
Note that the outputs are not shown in this table. The notation X
in the input column means that this state change does not depend
on the particular instruction, only on the previous state.
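
The table translates directly into a small next-state function; the following Python sketch is one way to express it, using the op codes from the table.

    # Next-state function corresponding to the state table above.
    # States: 0 IF, 1 ID, 2 mem address, 3 mem read, 4 load write-back,
    # 5 mem write, 6 R-type execute, 7 R-type write-back, 8 branch, 9 jump.
    LW, SW, RTYPE, BEQ, J = 0b100011, 0b101011, 0b000000, 0b000100, 0b000010

    def next_state(state, op):
        if state == 0:
            return 1
        if state == 1:
            return {LW: 2, SW: 2, RTYPE: 6, BEQ: 8, J: 9}[op]
        if state == 2:
            return {LW: 3, SW: 5}[op]
        if state == 3:
            return 4
        if state == 6:
            return 7
        return 0          # states 4, 5, 7, 8, 9 all return to instruction fetch

    # Example: trace a lw instruction through its five states.
    s, trace = 0, []
    for _ in range(5):
        trace.append(s)
        s = next_state(s, LW)
    print(trace)   # [0, 1, 2, 3, 4]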
The following figure shows an implementation of the next-state logic
for the state machine shown previously.
267
[Figure: PLA implementation of the next-state logic. The op code bits OP5 to OP0 and the present-state bits S3 to S0 (and their complements) feed an AND plane whose product terms correspond to the transitions in the state table (decoding lw, sw, R, beq, and j where the transition depends on the instruction); the OR plane generates the next-state bits S3 to S0, which are captured in D flip-flops (the state memory). The AND plane and OR plane are labelled in the figure.]
268
Other controller implementations
An alternative to a PLA implementation is an implementation using a read-only memory (or even a read-write memory.)
In this case, the inputs to the memory would be the OP codes (6
bits) and the state codes (4 bits in our case, but larger in a processor
with a richer instruction set.)
The next-state values and outputs would be stored in the memory.
For this example, there would be 10 (6 + 4) address bits, so the size of the memory would be 2^10 = 1024 words of 16 bits (10 single-bit control signals and 3 2-bit control signals.)
This is a large memory for the simple control function we have implemented with a PLA.
[Figure: a ROM-based controller. The op code bits op5 . . . op0 and the state bits form the address inputs; the control outputs and the next-state outputs are read from the memory.]
A hybrid approach could be to use the PLA to generate the next-state
values, and a memory for the outputs associated with each state.
In this case, the memory size is 2^4 = 16 words of 16 bits.
269
Microprogrammed control
An alternative to designing a classical state machine is to design a microprogrammed control unit.
A microprogrammed control unit is a simple processor that generates
the control signals for a more complex processor.
[Figure: a microprogrammed control unit. The microprogram counter addresses the microcode ROM, which supplies the datapath control outputs; an adder (+1) and address select logic, driven by the op code, form the microprogram sequencer that chooses the address of the next microinstruction.]
It has storage for the microcode (values of the control outputs and
microinstructions) and a microprogram sequencer which decides the
next microprogram operation. (This is essentially a next-state generator.)
270
Next-state generation
Note in the previous state diagram that, in many cases (states 0, 3,
6 in the example), the next state is the numerically next state in the
state machine; the state value is merely incremented.
For many other states (states 4,5,7,8,9 in the example), the next state
is the first state (instruction fetch).
For the other states, (states 1 and 2 in the example) there is a small
subset of next-states reachable from those states. Typically, a dispatch table (stored in ROM) is associated with each such state.
For the state machine described earlier, there would be two dispatch
tables; one for state 1 and the other for state 2.
They would contain the next-state information as follows:
Dispatch ROM 1
OP       Name     Value      state
000000   R-type   Rformat1   0110
000010   j        JUMP1      1001
000100   beq      BEQ1       1000
100011   lw       Mem1       0010
101011   sw       Mem1       0010
271
Dispatch ROM 2
OP       Name   Value   state
100011   lw     LW2     0011
101011   sw     SW2     0101
The microprogram sequencer can be expanded to the following, where
the Address Select Logic block has been expanded to include the
four possible sources for the next instruction described earlier:
[Figure: the microprogram sequencer with the address select logic expanded into a MUX with four inputs: the constant 0 (the address of the fetch microinstruction), the outputs of dispatch ROM 1 and dispatch ROM 2 (both addressed by the op code), and the incremented microprogram counter (sequential execution). The MUX output loads the microprogram counter, which addresses the microcode ROM.]
Each microcode instruction will have to include a control word (input
to the MUX above) to control the microprogram sequencer.
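
Behaviourally, the sequencer amounts to a small next-address function. The following sketch assumes one microinstruction per state, as in the dispatch ROM tables above, and uses the AddrCtl encoding given later in these notes (0 fetch, 1 dispatch 1, 2 dispatch 2, 3 sequential).

    # A behavioural sketch of the microprogram sequencer.  AddrCtl selects the
    # source of the next microinstruction address.
    DISPATCH_ROM_1 = {0b000000: 6, 0b000010: 9, 0b000100: 8, 0b100011: 2, 0b101011: 2}
    DISPATCH_ROM_2 = {0b100011: 3, 0b101011: 5}

    def next_micro_address(upc, addr_ctl, opcode):
        if addr_ctl == 0:
            return 0                        # back to the fetch microinstruction
        if addr_ctl == 1:
            return DISPATCH_ROM_1[opcode]   # dispatch on the op code after decode
        if addr_ctl == 2:
            return DISPATCH_ROM_2[opcode]   # second-level dispatch (lw vs sw)
        return upc + 1                      # sequential

    # Example: after the decode line (address 1) with AddrCtl = dispatch 1,
    # a lw instruction lands at microinstruction 2.
    print(next_micro_address(1, 1, 0b100011))   # 2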
272
Designing the microcode
The basic function of a microprogram is to supply the control signals
required to implement each instruction in the appropriate order.
A microprogram is made up of microinstructions (microcode).
The way the microcode is organized, or formatted, depends on a number of things. Two extremes of microcode are horizontal microcode
and vertical microcode.
Horizontal microcode usually requires more storage. It provides all
the control signals for a single cycle directly.
Vertical microcode is more compact. Typically, operations are encoded so that the operations can be specified in fewer bits. This
supports less parallelism, and the second level of decoding may extend the cycle time.
In either case, it is usual to group together the required outputs and
control information into fields. These fields are merely collections
of outputs that perform related functions. For example, it might be
useful to group together all the signals that control memory, or the
ALU.
Often, the values in different fields are given labels, much as in assembly language programs.
We can identify the following fields for a microprogrammed implementation of the simple MIPS:
273
Field name         Function
ALU control        Specify the ALU operation for this clock cycle
SRC1               Specify the source for the first ALU operand
SRC2               Specify the source for the second ALU operand
Register control   Specify read or write for the register file, and the source for the write
Memory             Specify read or write and the source for the memory.
                   For a read, specify the destination register
PCWrite control    Specify writing of the PC
Sequencing         Determines how to choose the next microinstruction
Note that the first six fields correspond to sets of control signals for
the datapath, the last (Sequencing) determines the source for the
address of the next micro-code instruction (next-state).
Typically, those fields would have symbolic values, which would later
be translated to actual control signal values, somewhat like the translation of an assembly language program to machine code.
274
The following tables show the values for each field:
Field name   Values      Function
ALU          Add         Add, using the ALU
control      Subt        Subtract, using the ALU
             Funct       Use the funct field to determine ALU control
SRC1         PC          Use PC as first ALU input
             A           Use register A as first ALU input
SRC2         B           Use register B as second ALU input
             4           Use the constant 4 as second ALU input
             Extend      Use the sign extended imm16 field as the second ALU input
             Extshft     Use the 2-bit left shifted sign extended imm16 field as the second ALU input
Register     Read        Read two registers using rs and rt fields, placing results in A and B
control      Write ALU   Write the contents of ALUOut into the register file in register rd
             Write MDR   Write the contents of MDR into the register file in register rt
275
Field name   Values         Function
Memory       Read PC        Read memory at the address in PC and write result in IR
             Read ALU       Read memory at the address in ALUOut and write result in MDR
             Write ALU      Write memory at the address in ALUOut using the contents of B as data
PCWrite      ALU            Write the output of the ALU into the PC
             ALUOut-cond    Write the contents of ALUOut into the PC if the Zero output of the ALU is active
             Jump-address   Write the jump address from the instruction into the PC
Sequencing   Seq            The next microinstruction is the next sequentially
             Fetch          The next microinstruction is instruction fetch (state 0)
             Dispatch i     The next microinstruction is obtained from dispatch ROM i (1 or 2)
Every line of microcode will have a value for each of these fields.
Eventually, as in the translation of assembly language instructions,
these (symbolic) values will be translated into the actual values of
the control signals.
276
Creating a microprogram
Let us look at writing the microcode for a few operations:
The first thing done is the fetching and decoding of an instruction
(states 0 and 1 in the state diagram):
        ALU                           Register                  PCWrite
Label   control   SRC1   SRC2         Control     Memory        control   Sequencing
Fetch   Add       PC     4                        Read PC       ALU       Seq
        Add       PC     Extshft      Read                                Dispatch 1
The first line describes the (now familiar) operations of fetching an
instruction, storing it in the IR, adding 4 to the PC, and writing the
value back to the PC.
The second line describes the calculation of the branch address, and
the storing of register values in registers A and B.
The Sequencing field determines where the next microcode instruction comes from.
For the first microinstruction, it is the next in sequence.
For the second, it depends on the op code (Dispatch ROM 1).
277
The memory access instructions lw and sw:
        ALU                           Register                  PCWrite
Label   control   SRC1   SRC2         Control      Memory       control   Sequencing
Mem1    Add       A      Extend                                           Dispatch 2
LW2                                                Read ALU               Seq
                                      Write MDR                           Fetch
SW2                                                Write ALU              Fetch
Note that the value in the Dispatch 2 table will cause a jump to either LW2 or SW2.
R-type instructions:
           ALU                            Register                 PCWrite
Label      control     SRC1   SRC2        Control      Memory      control   Sequencing
Rformat1   Func code   A      B                                              Seq
                                          Write ALU                          Fetch
Branch and jump instructions (beq and j):
        ALU                           Register               PCWrite
Label   control   SRC1   SRC2         Control    Memory      control        Sequencing
BEQ1    Subt      A      B                                   ALUOut-cond    Fetch
JUMP1                                                        Jump address   Fetch
278
What remains is to translate these microinstructions into actual values to be stored in the microcode ROM.
In this case, it is fairly straightforward to identify the values in each
field with appropriate values for the control signals:
Field                Signals
name     Value         active         Comment
ALU      Add           ALUOp = 00     Cause the ALU to add
control  Subt          ALUOp = 01     Cause the ALU to subtract
         Func code     ALUOp = 10     Use the funct field to determine ALU operation
SRC1     PC            ALUSrcA = 0    Use the PC as the ALU's first input
         A             ALUSrcA = 1    Use register A as the ALU's first input
SRC2     B             ALUSrcB = 00   Use register B as the second ALU input
         4             ALUSrcB = 01   Use 4 as the second ALU input
         Extend        ALUSrcB = 10   Use the sign extended imm16 field as the second ALU input
         Extshft       ALUSrcB = 11   The shifted sign extended imm16 field is the second ALU input
279
Field                Signals
name      Value        active            Comment
Register  Read                           Place contents of registers referenced
control                                  by rs, rt in registers A, B
          Write ALU    RegWrite,         Write the contents of
                       RegDst=1,         ALUOut to register rd
                       MemtoReg=0
          Write MDR    RegWrite,         Write the contents of
                       RegDst=0,         MDR to register rt
                       MemtoReg=1
Memory    Read PC      MemRead,          Place the value in memory at the
                       IorD=0, IRWrite   address referenced by PC into IR and MDR
          Read ALU     MemRead,          Place the value in memory at
                       IorD=1            address ALUOut into MDR
          Write ALU    MemWrite,         Write memory using ALUOut
                       IorD=1            as address, B contents as data
280
Field                   Signals
name        Value          active          Comment
PC write    ALU            PCSource=00,    Write ALU output to PC
control                    PCWrite
            ALUOut-cond    PCSource=01,    If ALU output is zero,
                           PCWriteCond     write ALUOut to PC
            Jump address   PCSource=10,    Write jump address from
                           PCWrite         instruction to PC
Sequencing  Fetch          AddrCtl=00      Go to the first microinstruction
            Dispatch 1     AddrCtl=01      Microcode address from ROM 1
            Dispatch 2     AddrCtl=10      Microcode address from ROM 2
            Seq            AddrCtl=11      Next microinstruction is sequential
281
It is now a matter of straightforward substitution to arrive at the
microcode to be stored in the ROM:
        ALU                    Register              PCWrite
State   control  SRC1   SRC2   Control    Memory     control   Sequencing
0       00       0      01     000        1001       0010      11
1       00       0      11     000        0000       0000      01
2       00       1      10     000        0000       0000      10
3       00       0      00     000        1010       0000      11
4       00       0      00     101        0000       0000      00
5       00       0      00     000        0110       0000      00
6       10       1      00     000        0000       0000      11
7       00       0      00     110        0000       0000      00
8       01       1      00     000        0000       0101      00
9       00       0      00     000        0000       1010      00
The 18 control signals here are, in order:
ALU control        ALUOp[2]
SRC1               ALUSrcA
SRC2               ALUSrcB[2]
Register Control   RegWrite, RegDst, MemtoReg
Memory             MemRead, MemWrite, IorD, IRWrite
PCWrite control    PCSource[2], PCWrite, PCWriteCond
Sequencing         AddrCtl[2]
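
One way to see how such a word is assembled is to pack and unpack the fields in the order just listed. The sketch below uses state 0 from the table above as its example; the short field names are just local abbreviations.

    # Pack / unpack an 18-bit microinstruction in the field order listed above:
    # ALUOp(2) | ALUSrcA(1) | ALUSrcB(2) | RegWrite,RegDst,MemtoReg(3) |
    # MemRead,MemWrite,IorD,IRWrite(4) | PCSource,PCWrite,PCWriteCond(4) | AddrCtl(2)
    FIELDS = [("ALUOp", 2), ("SRC1", 1), ("SRC2", 2), ("RegCtl", 3),
              ("Mem", 4), ("PCWriteCtl", 4), ("Seq", 2)]

    def pack(values):
        word = 0
        for (name, width) in FIELDS:
            word = (word << width) | (values[name] & ((1 << width) - 1))
        return word

    def unpack(word):
        values = {}
        for (name, width) in reversed(FIELDS):
            values[name] = word & ((1 << width) - 1)
            word >>= width
        return values

    # State 0 (instruction fetch) from the table above.
    state0 = pack({"ALUOp": 0b00, "SRC1": 0, "SRC2": 0b01, "RegCtl": 0b000,
                   "Mem": 0b1001, "PCWriteCtl": 0b0010, "Seq": 0b11})
    print(f"{state0:018b}")    # 000010001001001011
    print(unpack(state0))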
282
Advantages/disadvantages of microprogram control:
For large instruction sets:
• The control is easier to design — similar to programming
• The control is more flexible — easier to adapt or modify
• Changes to the instruction set can be made late in the design
cycle
• Very powerful instruction sets can be implemented in different
datapaths
Generality:
• Different instruction sets can be implemented on the same machine
• Instruction sets can potentially be adapted to the particular application
• Many different datapath organizations can be used with the same
instruction set (cost/performance tradeoffs)
Microcode control can be slower than direct logic implementation of
the control, and may require more circuitry (transistors).
It also may encourage “instruction set bloat” — adding instructions
because they can easily be provided.
283
Adding additional instructions
Clearly, adding an additional instruction can be accomplished by
adding to the control unit, provided that the instruction can actually
be implemented in the datapath.
For example, adding the ori instruction would require adding a third
bit to the control signal ALUOp in order to be able to encode logic
operations, and adding the capability to zero extend the imm16 field
(with control signal ExtOp, as before). These additions are the same as those required for the single cycle implementation, and their truth tables can be referred to for the appropriate values for these control signals.
Note that the new control signals may have to be added to the existing states, as well.
The additional control signals would also have to be generated by
the controller.
A microprogrammed control unit is usually easier to modify than a
conventional controller. It may be slower, though, because of the
(local) memory access time for the microinstructions.
The following diagram shows the modified datapath and controls for
the processor with the ori instruction.
284
[Figure: the multi-cycle datapath and control modified for the ori instruction. The control unit gains the ExtOp output, the immediate extender becomes a sign-or-zero extender controlled by ExtOp, and the ALUOp field driving the ALU control block is extended as described above.]
285
The following shows the additions to the state diagram required to
implement the ORI instruction:
[State diagram additions, entered from state 1 when OP = 'ORI':
 State 14: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 010, ExtOp = 0; then to state 15
 State 15: RegDst = 0, MemtoReg = 0, RegWrite = 1; then to state 0 (instruction completed)]
Also, in states 0, 1, and 2, the control signal ALUOp would have to
change from 00 to 000. In state 6, it would change from 10 to 100,
and in state 8, from 01 to 001.
The control signal ExtOp would have to be set to a value of 1 in
states 1 and 2.
286
Modifying the microcode to add the ori instruction
The two additional control signals would have to be added. The
third bit in the ALUOp control would naturally be added to the ALU
control field, as would a label for the OR function. The control signal
ExtOp would also have to be added to one of the fields, say, SRC2.
Field               Signals
name     Value        active          Comment
ALU      Add          ALUOp = 000     Cause the ALU to add
control  Subt         ALUOp = 001     Cause the ALU to subtract
         Or           ALUOp = 010     Cause the ALU to perform OR
         Func code    ALUOp = 100     Use the funct field to determine ALU operation
SRC2     B            ALUSrcB = 00    Use register B as the second ALU input
         4            ALUSrcB = 01    Use 4 as the second ALU input
         Extend       ExtOp = 1,      Sign extension of imm16;
                      ALUSrcB = 10    use the sign extended imm16 field as the second ALU input
         Extshft      ExtOp = 1,      Sign extension of imm16;
                      ALUSrcB = 11    the shifted sign extended imm16 field is the second ALU input
         UExtend      ExtOp = 0,      Unsigned extension of imm16;
                      ALUSrcB = 10    use the imm16 field as the second ALU input
287
Note that two labels have been added; OR, to specify an OR operation in the ALU, and UExtend to specify unsigned extension.
Sign extension was also explicitly specified, where required.
Less obviously, another label, say, Write ALUi has to be added to the
Register control field, because the value to be written comes from
the register ALUOut and is to be written to the register indexed by
rt, which requires a new combination of control signals.
Field                 Signals
name       Value         active        Comment
Register   Read                        Read 2 registers using the rs and rt fields
control                                and save the results in registers A and B
           Write ALU     RegWrite=1    Write to the register file using
                         RegDst=1      the rd field as destination
                         MemtoReg=0    and ALUOut as source
           Write MDR     RegWrite=1    Write to the register file using
                         RegDst=0      the rt field as destination
                         MemtoReg=1    and MDR as the source
           Write ALUi    RegWrite=1    Write to the register file using
                         RegDst=0      the rt field as destination
                         MemtoReg=0    and ALUOut as source
288
Since two states were added, two microcode instructions would also
be required. The microcode is similar to that for the R-type instructions.
The ori instruction:
        ALU                            Register                   PCWrite
Label   control   SRC1   SRC2          Control       Memory       control   Sequencing
ORi     OR        A      UExtend                                            Seq
                                       Write ALUi                           Fetch
Note that the additional two signals would now automatically be
provided for all instructions, since they are specified in the microcode
fields.
One other change is required. The op code for the instruction ori
has to be added to Dispatch ROM 1.
Dispatch ROM 1
OP       Name     Value
000000   R-type   Rformat1
000010   j        JUMP1
000100   beq      BEQ1
001101   ori      ORi
100011   lw       Mem1
101011   sw       Mem1
289
Exceptions and interrupts
A feature of virtually all processors is the capability to respond to
error conditions, and to be “interrupted” by some external condition.
These interruptions to the normal flow of events in the processor are
called exceptions or interrupts.
We will call something with a cause external to the processor an interrupt, and an exception when the cause is internal to the processor
(say, an illegal instruction).
Note that this is by no means a standard nomenclature; the terms
are often used interchangeably.
Normally, interrupts and exceptions are handled by a combination
of hardware (the processor) and software (the operating system.)
Three things are required when an exception occurs:
1. The cause of the exception must be recorded.
2. The exception must be “handled” in some way. Normally, the
processor jumps to some location in memory where there is code
for an “exception handler.” The PC is set to this address by the
processor hardware.
3. The processor must have some way to return to the code that
was originally running, after handling the exception.
290
Adding exception handling
We will implement the hardware and control functions to handle two
types of exceptions; undefined instruction and arithmetic overflow.
Recall that the ALU had an overflow detection output, which can be
used as an input to the controller.
1. We will use a register labeled Cause to store a number (0 or 1)
to identify the type of exception, (0 for undefined instruction, 1
for arithmetic overflow).
It requires a control signal CauseWrite to be generated by the
controller. The controller also must set the value written to
the register, depending on whether or not the exception was an
arithmetic overflow.
The control signal IntCause is used to set this value.
2. The PC will be set to memory address C0000000 where the
operating system is expected to provide an event handler.
This is accomplished by adding another input (input 3) to the
MUX which updates the PC address. The MUX is controlled by
the 2-bit signal PCSource.
3. The address of the instruction which caused the exception is
stored in the register EPC, a 32 bit register.
Writing to this register is controlled by the new signal EPCWrite.
291
Storing the address of the instruction can be done several ways; for
example, it could be stored at the beginning of each instruction.
This would require a change to the datapath, and a way to disable
the storing of the address after each exception.
It is possible to store the address with only a small change to the
datapath (merely adding the EPC register to accept the output of the
ALU).
Recall that the next address (PC + 4) is calculated in the ALU, and
is written to the PC in the first cycle of every instruction. The ALU
can be used to subtract the value 4 from the PC after an exception
is detected, but before it is written into the EPC, so it contains the
actual address of the present instruction.
(Actually, there would be no real problem with saving the value
PC + 4 in the EPC; the interrupt handler could be responsible for
the subtraction.)
So, in order to handle these two exceptions, we have added two
registers — EPC and Cause, and three control signals — EPCWrite,
IntCause, and CauseWrite.
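
Behaviourally, the two exception states perform the following sequence; the sketch below (Python, not hardware) uses the cause codes and handler address given above.

    # Behavioural sketch of what the exception hardware does:
    # save PC - 4 in EPC, record the cause, and jump to the handler.
    HANDLER_ADDRESS = 0xC0000000
    CAUSE_UNDEFINED_INSTRUCTION = 0     # IntCause = 0
    CAUSE_ARITHMETIC_OVERFLOW   = 1     # IntCause = 1

    def take_exception(pc_after_fetch, overflow):
        """pc_after_fetch is PC + 4, as left by the instruction fetch cycle."""
        epc   = (pc_after_fetch - 4) & 0xFFFFFFFF      # address of the faulting instruction
        cause = CAUSE_ARITHMETIC_OVERFLOW if overflow else CAUSE_UNDEFINED_INSTRUCTION
        pc    = HANDLER_ADDRESS                        # PCSource selects the constant handler address
        return pc, epc, cause

    pc, epc, cause = take_exception(0x00400024, overflow=True)
    print(hex(pc), hex(epc), cause)    # 0xc0000000 0x400020 1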
The changes to the processor datapath and control signals required
for the implementation of the exceptions detailed above are shown
in the following diagram.
292
[Figure: the multi-cycle datapath and control extended for exception handling. The Cause register (written under CauseWrite, with its value selected by IntCause) and the 32-bit EPC register (written under EPCWrite, loaded from the ALU output) are added, and the PCSource MUX gains a fourth input (3), the constant handler address C0000000.]
293
Adding exception handling to the control unit
The exceptions overflow and undefined can be implemented by
the addition of only one state each:
[State diagram additions:
 State 10, entered when OP = 'other' (undefined instruction):
     IntCause = 0, CauseWrite = 1, ALUSrcA = 0, ALUSrcB = 01,
     ALUOp = 01, PCSource = 11, EPCWrite = 1, PCWrite = 1
 State 11, entered on overflow:
     IntCause = 1, CauseWrite = 1, ALUSrcA = 0, ALUSrcB = 01,
     ALUOp = 01, PCSource = 11, EPCWrite = 1, PCWrite = 1
 Both return to state 0.]
The input overflow is an output from the ALU. It is a combinational logic output, produced while the ALU is performing the
selected operation.
294
The control unit, with exception handling

[State diagram: the complete control unit, combining states 0 to 9 with the two new exception states: state 10 (undefined instruction, entered from state 1 when OP = 'other', with IntCause = 0) and state 11 (arithmetic overflow, entered via the arcs labelled overflow, with IntCause = 1). Both exception states assert CauseWrite, EPCWrite, and PCWrite with ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01, PCSource = 11, and return to state 0.]

The ALU operation which could result in an overflow is done in the EX cycle, and the overflow signal is only available then, unless it is saved in a register.
295
Adding interrupts and exceptions with microcode
It is not difficult to add exception handling with microcode.
Note that three additional control signals were added. The simplest
thing to do is to add another microcode field, called, say, Exception.
It determines the values of the three control signals, EPCWrite,
IntCause, and CauseWrite.
Field                 Signals
name       Value         active         Comment
Exception  Overflow      EPCWrite=1     Save the output of the ALU (PC - 4) in EPC
                         IntCause=1     Select the cause input
                         CauseWrite=1   Write the selected value in the Cause register
           Undefined     EPCWrite=1     Save the output of the ALU (PC - 4) in EPC
                         IntCause=0     Select the cause input
                         CauseWrite=1   Write the selected value in the Cause register
The operations required are that the PC is decremented by 4, saved
in the EPC, the appropriate value set in the Cause register, and a
jump effected to the exception handler at address C0000000.
296
Another required modification to the microcode is to add a selection to the PC write control field to accommodate the jump to the
exception handler:
Field                  Signals
name      Value           active           Comment
PC write  ALU             PCSource=00      Select the ALU output as the source for the PC
control                   PCWrite=1        Write into the PC
          ALUOut-cond     PCSource=01      Select the ALUOut register as the source for the PC
                          PCWritecond=1    Write into the PC if the zero output of the ALU is set
          jump-address    PCSource=10      Select the jump address field as the source for the PC
          Exception       PCSource=11      Select address C0000000 as the value to be written to the PC
The other changes required are:
• Add the exception state address to Dispatch Rom 1
• Add Dispatch Rom 3 (with the Overflow output as address) to
the microprogram sequencer. This requires adding another bit
to the Sequencing field, as well.
• Changing the Sequencing field of the microcode for R-type instructions at label Rformat1 from Seq to Dispatch 3.
297
Microcode for the exceptions:
            ALU                         Register              PCWrite
Label       control  SRC1  SRC2         Control    Memory     control     Exception   Sequencing
Overflow    Subt     PC    4                                  Exception   Overflow    Fetch
Undefined   Subt     PC    4                                  Exception   Undefined   Fetch
Of course, the Exception field must be added to all the other microcode lines, but the values will all be the default (0) values.
298
More about interrupts
The ability to handle interrupts and exceptions is an important feature for processors.
We have added the control logic to detect the two types of exceptions described earlier, but note that the Cause and the EPC register
cannot be read.
Instructions would have to be provided to allow these registers to be
read and manipulated.
Processors usually have policies relating to exceptions. The MIPS processor had the policy that instructions which cause an exception have no effect (e.g., nothing is written into a register.)
For some exceptions, if this policy is used, the operation may have
to complete before the exception can be detected, and the result of
the operation must then be “rolled back.”
This makes the implementation of exceptions difficult — sometimes
the state prior to an operation must be saved so it can be restored.
This constraint alone sometimes results in instructions requiring more
cycles for their implementation.
299
Exceptions and interrupts in other processors
A common type of interrupt is a vectored interrupt. Here, different interrupts or exceptions jump to different addresses in memory.
The operating system places an appropriate interrupt handler for the
particular interrupt at each of these locations.
A vectored interrupt both identifies the type of interrupt, and provides the handler at the same time. (Since different interrupts or
exceptions have different vectors.)
In the INTEL processors, it is the responsibility of the interrupting
device to provide the interrupt vector. (This is usually done by
one of the peripheral controller chips, under control of the operating
system.)
A major problem with the PC architecture is that only a small number of interrupts (typically 16) can be handled by the controller chip.
This has led to many problems with hardware devices “sharing interrupts” — defeating the advantages of vectored interrupts.
We will look at interrupts again, later, when we discuss input and
output devices.
300
Some questions about exceptions and interrupts
The following questions often have different answers for different processors:
• How does a processor return control of the program flow from
the exception or interrupt handler to the interrupted program?
Some processors have explicit instructions for this (e.g., the MIPS
processors); others treat interrupts and exceptions as being similar to subprogram calls (INTEL processors do this.)
• What happens when an exception or interrupt is itself interrupted?
Some processors save the return addresses in a stack data structure, and successive levels of interrupts just increase the stack
depth. Typically, this is the way subprogram return addresses
are also stored.
Some processors automatically turn off the interrupt capability
at the beginning of an interrupt, and it must be explicitly turned
back on by the interrupt or exception handler to accept another
interrupt.
Some processors have both features — instructions can turn the
interrupt capability on and off, and can allow interrupts to be
interrupted themselves. (This turns out to be important for
implementing certain operating system functions.)
301
Comments on our implementation of exceptions
Note that our implementation has only one register for the address
of the interrupting instruction, and no way to read that address and
modify it to resume the program where the exception occurred.
What changes would be required to the instruction set to accomplish
this?
The simplest solution would probably be to allow only one interrupt
at a time, by disabling the interrupt capability, and to provide:
1. An instruction to store the EPC in the register file.
2. An instruction to store the Cause register in the register file.
3. An instruction to turn on the interrupt capability after the next
instruction has completed execution. (This assumes that the next
instruction restores the PC to the address of the instruction following the one that caused the exception.)
Note that these would require changes to the datapath and control.
This example was just to give the flavor of the problems involved
with handling exceptions in the processor. More complex instruction
sets and architectures exacerbate the problems.
302
Comments on handling interrupts
Although exception handling is complex, it is often simpler than the
handling of external interrupts.
Exceptions occur as a result of events internal to the processor.
Consequently, they are usually predictable, and they occur and are
detected at known points in the execution of a particular instruction.
Interrupts are external events, and are not at all synchronized with
the execution of instructions in the processor.
Since interrupts may be notification of an urgent event, they usually
require fast servicing.
Decisions therefore have to be taken about exactly when in the execution of an instruction an interrupt will be detected and handled.
Some of the considerations are:
• If the instruction is not allowed to complete, information must
be retained in order to either continue or restart the interrupted
instruction. How will this be done?
• If the interrupted instruction is allowed to complete, how will the
processor return to the next instruction in the current program?
• Can the interrupt handler be interrupted?
• Can interrupts be prioritized so that a high priority interrupt
can interrupt a lower priority interrupt?
303
How can we “speed up” the processor?
One idea is to try to make the most frequently used instructions as
fast as possible.
Instruction distributions for some common program types
                   Type of program
Instruction type   LaTeX    C compiler   Fortran (numerical)
calls              0.012    0.006        0.010
branches           0.115    0.229        0.068
loads/stores       0.331    0.231        0.456
flops              0.001    0.000        0.163
data (R-type)      0.414    0.293        0.284
nops               0.127    0.241        0.059
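As an illustration of why these frequencies matter, here is a small Python sketch using the LaTeX column of the table above; the per-class cycle counts are invented for the example, not measured values.

# Fraction of execution time contributed by each instruction class
# (instruction mix from the LaTeX column; cycle counts are assumptions).
mix    = {"calls": 0.012, "branches": 0.115, "loads/stores": 0.331,
          "flops": 0.001, "data (R-type)": 0.414, "nops": 0.127}
cycles = {"calls": 5, "branches": 3, "loads/stores": 5,
          "flops": 5, "data (R-type)": 4, "nops": 1}

cpi = sum(mix[c] * cycles[c] for c in mix)
for c in mix:
    print(f"{c:14s} {mix[c] * cycles[c] / cpi:6.1%} of execution time")

With these (invented) weights, the loads/stores and R-type instructions dominate the execution time, which is why they are natural targets for speedup.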
304
Instruction counts for a 60 page LaTeX document (the GWM manual)
count        percent  type
1387566431   (1.004)  cycles (55.5s @ 25.0MHz)
1382108615   (1.000)  instructions
 206864803   (0.150)  basic blocks
  19570428   (0.014)  calls
 342200862   (0.248)  loads
 181925435   (0.132)  stores
 524126297   (0.379)  loads+stores
 524252660   (0.379)  data bus use
  50344780   (0.036)  partial word references
 150139046   (0.109)  branches
 316292645   (0.229)  nops
         0   (0.000)  load interlock cycles
   5292110   (0.004)  multiply/divide interlock cycles
    124148   (0.000)  flops (0.00224 mflops/s @ 25.0MHz)
305
FORTRAN number crunching – hard-sphere molecular dynamics calculation
count        percent  type
 873855362   (1.050)  cycles (35s @ 25.0MHz)
 832495305   (1.000)  instructions
  81119362   (0.097)  basic blocks
   8071192   (0.010)  calls
 289695712   (0.348)  loads
 112932164   (0.136)  stores
 402627876   (0.484)  loads+stores
 426704925   (0.513)  data bus use
    258649   (0.000)  partial word references
  56782868   (0.068)  branches
  20814556   (0.025)  nops
         0   (0.000)  load interlock cycles
    751865   (0.001)  multiply/divide interlock cycles
 124343032   (0.149)  flops (3.56 mflop/s @ 25.0MHz)
  40496083   (0.049)  floating point data interlock cycles
      8015   (0.000)  floating point add interlock cycles
     46335   (0.000)  floating point multiply interlock cycles
         0   (0.000)  floating point divide interlock cycles
     57759   (0.000)  other floating point interlock cycles
  24071112   (0.029)  1 cycle interlocks
  24052679   (0.029)  overlapped floating point cycles
306
Other ideas for “speedup”
There are a number of ways of “speeding up” a multicycle processor
— generally by doing certain operations in parallel.
For example, in the INTEL 80x86 processors, the fetching of instructions from memory is decoupled from the instruction execution.
There is a logically separate bus interface unit which attempts to
fill an instruction queue during the times when the execution unit is
not receiving operands from memory. (The 80x86 is not a load/store
processor.)
Would this be a useful idea for our multicycle implementation of the
MIPS?
Another possibility is performing operations in the datapath in parallel. For example, it is not unusual for a processor to have different
adders for integer and floating point operations, and those operations
can be performed simultaneously.
(The MIPS R2000/R3000 performs floating point operations in parallel with integer operations.)
307
A Gantt chart showing a simple, multicycle implementation
[Gantt chart: the stages IF, RD, ALU, MEM, WB are used one after another for each instruction, over clock cycles 0 to 14; time (clock cycles) is on the horizontal axis]
The thick lines indicate memory accesses.
Not all instructions would require all the cycles shown.
308
A simple overlap implementation — here the instruction fetch proceeds during the WB clock phase, in which results are written into
internal registers.
[Gantt chart: as before, but the IF of each instruction overlaps the WB of the previous instruction, over clock cycles 0 to 13]
Could this implementation be done with the multicycle datapath
shown earlier?
Are the resources used in the IF cycle also used in the WB cycle?
What parts of the instruction register are required?
When in the cycle are they used?
Which instructions do not have a WB cycle, and how could they be
handled?
309
An implementation which makes full use of a single memory bus —
data reads and writes do not interfere with instruction fetch.
[Gantt chart: instruction fetches are scheduled into the cycles left free by data reads and writes, so the single memory bus is busy on every cycle, over clock cycles 0 to 13]
Note that the single memory is a bottleneck.
In reality, not every instruction accesses data from memory; in sample
codes earlier, only 1/4 to 1/2 of the instructions were loads or stores.
The Gantt chart for this situation would be more complex.
Would the single cycle datapath be sufficient in this case?
What instructions would cause problems with this implementation?
310
Pipelining
Pipelining is a technique which allows several instructions to overlap in time; different parts of several consecutive instructions are
executed simultaneously.
The basic structure of a pipelined system is not very different from
the multicycle implementation previously discussed.
In the pipelined implementation, however, resources from one cycle
cannot be reused by another cycle. Also, the results from each stage
in the pipeline must be saved in a pipeline register for use in the next
pipeline stage.
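As a back-of-the-envelope sketch (Python; it assumes a five-stage pipeline, that every instruction uses all five stages, and that there are no stalls), the cycle counts for the two approaches compare as follows:

def multicycle_cycles(n, stages=5):
    # One instruction at a time: each uses its stages in sequence.
    return n * stages

def pipelined_cycles(n, stages=5):
    # Fill the pipeline once, then one instruction completes per cycle.
    return stages + (n - 1)

n = 1_000_000
print(multicycle_cycles(n))   # 5000000
print(pipelined_cycles(n))    # 1000004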
311
A pipelined implementation
[Gantt chart: every stage (IF, RD, ALU, MEM, WB) is busy on every clock cycle, with a new instruction starting each cycle, over clock cycles 0 to 13]
Note that two memory accesses may be required in each machine
cycle (an instruction fetch, and a memory read or write.)
How could this problem be reduced or eliminated?
312
What is required to pipeline the datapath?
Recall that when the multi-cycle implementation was designed, information which had to be retained from cycle to cycle was stored in
a register until it was needed.
In a pipelined implementation, the results from each pipeline stage
must be saved if they will be required in the next stage.
In a multi-cycle implementation, resources could be “shared” by
different cycles.
In a pipelined implementation, every pipeline stage must have all the
resources it requires on every clock cycle.
A pipelined implementation will therefore require more hardware
than either a single cycle or a multicycle implementation.
A reasonable starting point for a pipelined implementation would be
to add pipeline registers to the single cycle implementation.
We could have each pipeline stage do the operations in each cycle of
the multi-cycle implementation.
The next figure shows a first attempt at the datapath with pipeline
registers added.
313
[Figure (slide 314): a first attempt at the pipelined datapath; pipeline registers are inserted between the IF, ID, EX, MEM, and WB stages of the single-cycle datapath (PC, instruction memory, register file, sign extension, ALU, data memory, and the associated MUXes)]
It is useful to note the changes that have been made to the datapath.
The most obvious change is, of course, the addition of the pipeline
registers.
The addition of these registers introduces some questions.
How large should the pipeline registers be?
Will they be the same size in each stage?
The next change is to the location of the MUX that updates the PC.
This must be associated with the IF stage. In this stage, the PC
should also be incremented.
The third change is to preserve the address of the register to be
written in the register file. This is done by passing the address along
the pipeline registers until it is required in the WB stage.
The write address provided by the MUX now reaches the register file
by way of the pipeline registers.
315
Pipeline control
Since five instructions are now executing simultaneously, the controller for the pipelined implementation is, in general, more complex.
It is not as complex as it appears at first glance, however.
For a processor like the MIPS, it is possible to decode the instruction
in the early pipeline stages, and to pass the control signals along the
pipeline in the same way as the data elements are passed through
the pipeline.
(This is what will be done in our implementation.)
A variant of this would be to pass the instruction field (or parts of
it) and to decode the instruction as needed for each stage.
For our processor example, since the datapath elements are the same
as for the single cycle processor, the control signals required
must be similar, and can be implemented in a similar way.
All the signals can be generated early (in the ID stage) and passed
along the pipeline until they are required.
316
[Figure (slide 317): the pipelined datapath with control added; the control signals (RegDst, ALUSrc, ALUop, MemRead, MemWrite, Branch, MemtoReg, RegWrite, PCSrc) are generated in the ID stage and carried along in the EX, MEM, and WB fields of the pipeline registers]
Executing an instruction
In the following figures, we will follow the execution of an instruction
through the pipeline.
The instructions we have implemented in the datapath are those of
the simplest version of the single cycle processor, namely:
• the R-type instructions
• load
• store
• beq
We will follow the load instruction, as an example.
318
[Figures (slides 319 to 323): the pipelined datapath with the IF/ID, ID/EX, EX/MEM, and MEM/WB registers, highlighting in turn the IF, ID, EX, MEM, and WB stages as the load instruction moves through the pipeline]
Representing a pipeline pictorially
These diagrams are rather complex, so we often represent a pipeline
with simpler figures, as follows:
[Diagram: three instructions (LW, SW, ADD), each drawn as the sequence of units it uses (IM, REG, ALU, DM, REG), offset by one clock cycle]
Often an even simpler representation is sufficient:
[Diagram: the same three instructions shown simply as overlapping rows of IF, ID, ALU, MEM, WB stages]
The following figure shows a pipeline with several instructions in
progress:
324
[Diagram (slide 325): six instructions (LW, ADD, SW, SUB, BEQ, AND) in the pipeline at once, each starting one clock cycle after the previous one]
Pipeline “hazards”
There are three types of “hazards” in pipelined implementations —
structural hazards, control hazards, and data hazards.
Structural hazards
Structural hazards occur when there are insufficient hardware resources to support the particular combination of instructions presently
being executed.
The present implementation has a potential structural hazard if there
is a single memory for data and instructions.
Other structural hazards cannot happen in a simple linear pipeline,
but for more complex pipelines they may occur.
Control hazards
These hazards happen when the flow of control changes as a result
of some computation in the pipeline.
One question here is what happens to the rest of the instructions in
the pipeline?
Consider the beq instruction.
The branch address calculation and the comparison are performed in
the EX cycle, and the branch address returned to the PC in the next
cycle.
326
What happens to the instructions in the pipeline following a successful branch?
There are several possibilities.
One is to stall the instructions following a branch until the branch
result is determined. (Some texts refer to a stall as a “bubble.”)
This can be done by the hardware (stopping, or stalling the pipeline
for several cycles when a branch instruction is detected.)
[Diagram: the beq proceeds through IF, ID, ALU, MEM, WB; the following add and lw are held back by three stall cycles until the branch result is known]
It can also be done by the compiler, by placing several nop instructions following a branch. (It is not called a pipeline stall then.)
[Diagram (slide 327): the beq followed by three compiler-inserted nops, then the add and lw; every instruction flows through IF, ID, ALU, MEM, WB without hardware stalls]
Another possibility is to execute the instructions in the pipeline. It
is left to the compiler to ensure that those instructions are either
nops or useful instructions which should be executed regardless of
the branch test result.
This is, in fact, what was done in the MIPS. It had one “branch delay
slot” which the compiler could fill with a useful instruction about 50%
of the time.
[Diagram: the beq, the instruction in the branch delay slot, and then the instruction at the branch target, entering the pipeline in consecutive cycles]
We saw earlier that branches are quite common, and inserting many
stalls or nops is inefficient.
For long pipelines, however, it is difficult to find useful instructions to
fill several branch delay slots, so this idea is not used in most modern
processors.
328
Branch prediction
If branches could be predicted, there would be no need for stalls.
Most modern processors do some form of branch prediction.
Perhaps the simplest is to predict that no branch will be taken.
In this case, the pipeline is flushed if the branch prediction is wrong,
and none of the results of the instructions in the pipeline are written
to the register file.
How effective is this prediction method?
What branches are most common?
Consider the most common control structure in most programs —
the loop.
In this structure, the most common result of a branch is that it is
taken; consequently the next instruction in memory is a poor prediction. In fact, in a loop, the branch is not taken exactly once — at
the end of the loop.
A better choice may be to record the last branch decision, (or the
last few decisions) and make a decision based on the branch history.
Branches are problematic in that they are frequent, and cause inefficiencies by requiring pipeline flushes. In deep pipelines, this can be
computationally expensive.
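A minimal sketch (Python) of the history-based predictor just described: a small table of two-bit saturating counters indexed by low-order bits of the branch address. The table size and the two-bit scheme are illustrative choices, not a description of any particular processor.

TABLE_BITS = 10
table = [1] * (1 << TABLE_BITS)       # 2-bit counters, 0..3; start "weakly not taken"

def index(pc):
    return (pc >> 2) & ((1 << TABLE_BITS) - 1)   # low-order bits identify the branch

def predict(pc):
    return table[index(pc)] >= 2                  # counter >= 2 means "predict taken"

def update(pc, taken):
    i = index(pc)
    table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)

# A loop-closing branch: taken nine times, then not taken once at the end.
pc, correct = 0x00400040, 0
for taken in [True] * 9 + [False]:
    correct += (predict(pc) == taken)
    update(pc, taken)
print(correct, "of 10 predicted correctly")       # 8 of 10 with this history

Note that two different branches whose addresses share the same low-order bits would share a counter; this aliasing is discussed again below.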
329
Data hazards
Another common pipeline hazard is a data hazard. Consider the
following instructions:
add $r2, $r1, $r3
add $r5, $r2, $r3
Note that $r2 is written in the first instruction, and read in the
second.
In our pipelined implementation, however, $r2 is not written until
four cycles after the second instruction begins, and therefore three
bubbles or nops would have to be inserted before the correct value
would be read.
[Diagram: the two add instructions in consecutive pipeline slots; the second reads $r2 in its ID stage before the first has written it in its WB stage, producing a data hazard]
The following would produce a correct result:
[Diagram: the first add, then three nops, then the second add; by the time the second add reaches its ID stage, $r2 has been written]
The following figure shows a series of data hazards.
330
add $2, $1, $3
sub $5, $2, $3
and $7, $6, $2
beq $0, $2, -25
sw  $7, 100($2)
[Diagram (slide 331): these five instructions in the pipeline; the sub, and, beq, and sw all read register $2, which is written by the add]
Handling data hazards
There are a number of ways to reduce data hazards.
The compiler could attempt to reorder instructions so that instructions reading registers recently written are not too close together,
and insert nops where it is not possible to do so.
For deep pipelines, this is difficult.
Hardware could be constructed to detect hazards, and insert stalls
in the pipeline where necessary.
This also slows down the pipeline (it is equivalent to adding nops.)
An astute observer could note that the result of the ALU operation
is stored in the pipeline register at the end of the ALU stage, two
cycles before it is written into the register file.
If instructions could take the value from the pipeline register, it could
reduce or eliminate many of the data hazards.
This idea is called forwarding.
The following figure shows how forwarding would help in the pipeline
example shown earlier.
332
add $2, $1, $3
sub $5, $2, $3
and $7, $6, $2
beq $0, $2, -25
sw  $7, 100($2)
[Diagram (slide 333): the same five instructions, with forwarding paths carrying the value of $2 from the add's pipeline registers to the later instructions that read it]
Note how forwarding eliminates the data hazards in these cases.
Implementing forwarding
Note that from the previous examples there are now two potential
additional sources of operands for the ALU during the EX cycle —
the EX/MEM pipeline register and the MEM/WB pipeline register.
What additional hardware would be required to provide the data
from the pipeline stages?
The data to be forwarded could be required by either of the inputs
to the ALU, so two MUX’s would be required — one for each ALU
input.
The MUX’s would have three sources of data; the original data from
the registers (in pipeline stage ID/EX) or the two pipeline stages to
be forwarded.
Looking only at the datapath for R-type operations, the additional
hardware would be as follows:
334
[Figure: the EX-stage datapath for R-type instructions with forwarding added; a MUX at each ALU input selects among the register value from ID/EX, the value in EX/MEM, and the value in MEM/WB, under the control signals ForwardA and ForwardB, while the rt/rd write-register number is carried along to WB]
There would also have to be a “forwarding unit” which provides
control signals for these MUX’s.
335
Forwarding control
Under what conditions does a data hazard (for R-type operations)
occur?
It is when a register to be read in the EX cycle is the same register
as one targeted to be written, and is held in either the EX/MEM
pipeline register or the MEM/WB pipeline register.
These conditions can be expressed as:
1. EX/MEM.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
2. MEM/WB.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt
Some instructions do not write registers, so the forwarding unit
should check to see if the register actually will be written. (If it
is to be written, the control signal RegWrite, also in the pipeline,
will be set.)
Also, an instruction may try to write some value in register 0. More
importantly, it may try to write a non-zero value there, which should
not be forwarded — register 0 is always zero.
Therefore, register 0 should never be forwarded.
336
The register control signals ForwardA and ForwardB have values
defined as:
MUX control  Source   Explanation
00           ID/EX    Operand comes from the register file (no forwarding)
01           MEM/WB   Operand forwarded from a memory operation or an earlier ALU operation
10           EX/MEM   Operand forwarded from the previous ALU operation
The conditions for a hazard with a value in the EX/MEM stage are:
if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 10
337
For hazards with the MEM/WB stage, an additional constraint is
required in order to make sure the most recent value is used:
if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 01
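These conditions translate almost directly into code. Here is a Python sketch of the forwarding unit; the dictionaries stand in for fields carried in the pipeline registers, and the returned values are the MUX controls defined in the table above.

def forwarding_unit(ex_mem, mem_wb, id_ex):
    # Default: both ALU operands come from the register file values in ID/EX.
    forward_a, forward_b = 0b00, 0b00

    # EX hazard: forward the previous ALU result from EX/MEM.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == id_ex["Rs"]:
            forward_a = 0b10
        if ex_mem["Rd"] == id_ex["Rt"]:
            forward_b = 0b10

    # MEM hazard: forward from MEM/WB only if EX/MEM is not supplying the
    # same register (the more recent value must win), as in the conditions above.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0:
        if ex_mem["Rd"] != id_ex["Rs"] and mem_wb["Rd"] == id_ex["Rs"]:
            forward_a = 0b01
        if ex_mem["Rd"] != id_ex["Rt"] and mem_wb["Rd"] == id_ex["Rt"]:
            forward_b = 0b01

    return forward_a, forward_b

# Example: add $2, $1, $3 is in MEM while sub $5, $2, $3 is in EX.
print(forwarding_unit({"RegWrite": True,  "Rd": 2},
                      {"RegWrite": False, "Rd": 0},
                      {"Rs": 2, "Rt": 3}))        # (2, 0): ForwardA = 10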
The datapath with the forwarding control is shown in the next figure.
338
[Figure: the same EX-stage datapath with a forwarding unit added; the unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the rs and rt fields carried in ID/EX, and drives the ForwardA and ForwardB MUXes]
For a datapath with forwarding, the hazards which are fixed by forwarding are not considered hazards any more.
339
Forwarding for other instructions
What considerations would have to be made if other instructions
were to make use of forwarding?
The immediate instructions
The major difference is that the B input to the ALU comes from the
instruction and sign extension unit, so the present MUX controlled
by the ALUSrc signal could still be used as input to the ALU.
The major change is that one input to this MUX is the output of the
MUX controlled by ForwardB.
The load and store instructions
These will work fine, for loads and stores following R-type instructions.
There is a problem, however, for a store following a load.
lw $2, 100($3)
sw $2, 400($3)
[Diagram: the lw and sw in adjacent pipeline slots; the sw needs the value of $2 for its MEM stage, just after the lw has read it from memory]
Note that this situation can also be resolved by forwarding.
It would require another forwarding controller in the MEM stage.
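A sketch of what that extra check might look like (Python; the field names and the exact condition are assumptions about one possible design, not taken from the MIPS):

def forward_store_data(mem_wb, store_in_mem):
    # If the register being stored is the one the instruction now in WB is
    # about to write, take the store data from the MEM/WB pipeline register.
    return (mem_wb["RegWrite"]
            and mem_wb["Rd"] != 0
            and mem_wb["Rd"] == store_in_mem["Rt"])

# Example: lw $2, 100($3) is in WB while sw $2, 400($3) is in MEM.
print(forward_store_data({"RegWrite": True, "Rd": 2}, {"Rt": 2}))   # True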
340
There is a situation which cannot be handled by forwarding, however.
Consider a load followed by an R-type operation:
lw  $2, 100($3)
add $4, $3, $2
[Diagram: the add needs $2 at the start of its ALU stage, but the lw does not have the data until the end of its MEM stage in the same clock cycle, so forwarding alone cannot resolve this]
Here, the data from the load is not ready when the r-type instruction
requires it — we have a hazard.
What can be done here?
lw  $2, 100($3)
add $4, $3, $2
[Diagram: with a one-cycle stall inserted, the loaded value can be forwarded from the MEM/WB register to the add's ALU stage]
With a “stall”, forwarding is now possible.
It is possible to accomplish this with a nop, generated by a compiler.
Another option is to build a “hazard detection unit” in the control
hardware to detect this situation.
341
The condition under which the “hazard detection circuit” is required
to insert a pipeline stall is when an operation requiring the ALU
follows a load instruction, and one of the operands comes from the
register to be written.
The condition for this is simply:
if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
         or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
then STALL
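In code form, a Python sketch of the hazard detection unit (mirroring the condition above; the dictionaries stand in for pipeline-register fields):

def must_stall(id_ex, if_id):
    # Stall when the load currently in EX will write a register that the
    # instruction now in ID wants to read.
    return (id_ex["MemRead"]
            and (id_ex["Rt"] == if_id["Rs"] or id_ex["Rt"] == if_id["Rt"]))

# Example: lw $2, 100($3) is in EX while add $4, $3, $2 is in ID.
print(must_stall({"MemRead": True, "Rt": 2}, {"Rs": 3, "Rt": 2}))   # True: stall one cycle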
342
Forwarding with branches
For the beq instruction, if the comparison is done in the ALU, the
forwarding already implemented is sufficient.
add $2, $3, $4
beq $2, $3, 25
[Diagram: with the comparison done in the beq's ALU stage, the value of $2 can be forwarded from the add's EX/MEM register in time]
In the MIPS processor, however, the branch instructions were implemented to require only two cycles. The instruction following the
branch was always executed. (The compiler attempted to place a
useful instruction in this “jump delay slot”, but if it could not, a
nop was placed there.)
The original MIPS did not have forwarding, but it is useful to consider
the kinds of hazards which could arise with this instruction.
Consider the sequence
add $2, $3, $4
beq $2, $5, 25
[Diagram: the add and beq in consecutive pipeline slots; if the comparison is done in the beq's ID stage, it needs $2 before the add's ALU result exists]
Here, if the conditional test is done in the ID stage, there is a hazard
which cannot be resolved by forwarding.
343
In order to correctly implement this instruction in a processor with
forwarding, both forwarding and hazard detection must be employed.
The forwarding must be similar to that for the ALU instructions,
and the hazard detection similar to that for the load/ALU type instructions.
Presently, most processors do not use a “branch delay slot” for branch
instructions, but use branch prediction.
Typically, there is a small amount of memory contained in the processor which records information about the last few branch decisions
for each branch.
In fact, individual branches are not identified directly in this memory;
the low order address bits of the branch instruction are used as an
identifier for the branch.
This means that sometimes several branches will be indistinguishable
in the branch prediction unit. (The frequency of this occurrence
depends on the size of the memory used for branch prediction.)
We will discuss branch prediction in more depth later.
344
Exceptions and interrupts
Exceptions are a kind of control hazard.
Consider the overflow exception discussed previously for the multicycle implementation.
In the pipelined implementation, the exception will not be identified
until the ALU performs the arithmetic operation, in stage 3.
The operations in the pipeline following the instruction causing the
exception must be flushed. As discussed earlier, this can be done by
setting the control signals (now in pipeline registers) to 0.
The instruction in the IF stage can be turned into a nop.
The control signals ID.Flush and EX.Flush control the MUX's
which zero the control lines.
The PC must be loaded with a memory value at which the exception
handler resides (some fixed memory location).
This can be done by adding another input to the PC MUX.
The address of the instruction causing the exception must then be
saved in the EPC register. (Actually, the value PC + 4 is saved).
Note that the instruction causing the exception cannot be allowed
to complete, or it may overwrite the register value which caused the
overflow. Consider the following instruction:
add $1, $1, $2
The value in register 1 would be overwritten if the instruction finished.
345
The datapath, with exception handling for overflow:
[Figure (slide 346): the pipelined datapath extended to handle the overflow exception; the flush signals IF.Flush, ID.Flush, and EX.Flush zero the control values carried in the pipeline registers, Cause and EPC registers are added, and an extra input on the PC MUX supplies the exception handler address (40000040 in the figure)]
Interrupts can be handled in a way similar to that for exceptions.
Here, though, the instruction presently being completed may be allowed to finish, and the pipeline flushed.
(Another possibility is to simply allow all instructions presently in
the pipeline to complete, but this will increase the interrupt latency.)
The value of the PC + 4 is stored in the EPC, and this will be the
return address from the interrupt, as discussed earlier.
Note that the effect of an interrupt on every instruction will have to
be carefully considered — what happens if an interrupt occurs near
a branch instruction?
347
Superscalar and superpipelined processors
Most modern processors have longer pipelines (superpipelined) and
two or more pipelines (superscalar) with instructions sent to each
pipeline simultaneously.
In a superpipelined processor, the clock speed of the pipeline can be
increased, while the computation done in each stage is decreased.
In this case, there is more opportunity for data hazards, and control
hazards.
In the Pentium IV processor, pipelines are 20 stages long.
In a superscalar machine, there may be hazards among the separate
pipelines, and forwarding can become quite complex.
Typically, there are different pipelines for different instruction types,
so two arbitrary instructions cannot be issued at the same time.
Optimizing compilers try to generate instructions that can be issued
simultaneously, in order to keep such pipelines full.
In the Pentium IV processor, there are six independent pipelines,
most of which handle different instruction types.
In each cycle, an instruction can be issued for each pipeline, if there
is an instruction of the appropriate type available.
348
Dynamic pipeline scheduling
Many processors today use dynamic pipeline scheduling to find
instructions which can be executed while waiting for pipeline stalls
to be resolved.
The basic model is a set of independent state machines performing
instruction execution; one unit fetching and decoding instructions
(possibly several at a time), several functional units performing the
operations (these may be simple pipelines), and a commit unit which
writes results in registers and memory in program execution order.
Generally, the commit unit also “kills off” results obtained from
branch prediction misses and other speculative computation.
In the Pentium IV processor, up to six instructions can be issued in
each clock cycle, while four instructions can be retired in each cycle.
(This clearly shows that the designers anticipated that there would
be many instructions issued — on average 1/3 of the instructions —
that would be aborted.)
349
[Figure: dynamic pipeline scheduling; an instruction fetch and decode unit issues instructions in order to reservation stations in front of the functional units (integer, integer, …, floating point, load/store), which execute out of order, and a commit unit retires results in order]
Dynamic pipeline scheduling is used in the three most popular processors in machines today — the Pentium II, III, and IV machines,
the AMD Athlon, and the Power PC.
350
A generic view of the Pentium P-X and the Power PC
pipeline
[Figure (slide 351): a generic view of such a pipeline; the PC, branch prediction, and instruction cache feed an instruction queue, a decode/dispatch unit sends operations to reservation stations for the branch, integer, complex integer, floating point, load, and store units, and a commit unit with a reorder buffer writes results back to the register file and data cache]
Speculative execution
One of the more important ways in which modern processors keep
their pipelines full is by executing instructions “out of order” and
hoping that the dynamic data required will be available, or that the
execution thread will continue.
There are two common cases of speculative computation. The first is
the “store before load” case, where normally, if a data element is stored,
the element being loaded does not depend on the element being stored.
The second case is at a branch — both threads following the branch
may be executed before the branch decision is taken, but only the
thread for the successful path would be committed.
Note that the type of speculation in each case is different — in the
first, the decision may be incorrect; in the second, one thread will
be incorrect.
352