Computer Organization
(EENG 3710)
Instructor: Partha Guturu
EE Department
Quick Recap on our respective roles
• Who is responsible for your learning of Computer
Organization?
Some aphorisms on Teaching philosophy:
“I do not teach my pupils. I provide conditions in which they
can learn.” -- Albert Einstein
“I hear and I forget. I see and I remember. I do and I
understand.” -- Chinese proverb
"Give a man a fish and you feed him for a day. Teach a man to
fish and you feed him for a lifetime." -- Chinese proverb
What does the data say?
– Even if you are fascinating…
– People only remember the first 15 minutes of what you say
[Figure: percent of students paying attention (0–100) vs. time from start of lecture (0–60 minutes)]
What’s so good about our approach?
• Learner-Centric Approach
• Life-long learning
• Proactive versus reactive
Course Objectives: What do you need to learn?
• High level view of a computer
• Different types
– Desk/lap tops
– Servers
– Embedded systems
• Anatomy of a computer and our focus here
• Computer Organization versus Architecture
• Instruction sets
• Different components of a computer and their interworking
• Computer Performance Issues
Different Applications & Requirements
• Desktop Applications
– Emphasis on performance of integer and Floating Point (FP) data types
– Little regard for program (code) size and power consumption
• Server Applications
– Database, file system, web applications, time-sharing
– FP (Floating Point) performance is much less important than integer and
character strings
– Little regard for program (code) size and power consumption
• Embedded Applications
– Digital Signal Processors (DSPs), media processors, control
– High value placed on program size and power consumption
• Less memory is cheaper and lower power
• Reduced chip costs: FP instructions may be optional
Embedded Computers in Your Car
Relative levels of demand for different
computer types
Anatomy of Computer & Our Focus
Levels (layers) of abstraction, top to bottom:
• Application (ex: browser)        (software)
• Operating System                 (software)
• Compiler, Assembler              (software)
• Instruction Set Architecture     (software/hardware interface)
• Processor, Memory, I/O system    (hardware)
• Datapath & Control               (hardware)
• Digital Design                   (hardware)
• Circuit Design (transistors)     (hardware)
* Coordination of many levels (layers) of abstraction
Why a Compiler?
In Paris they simply stared
when I spoke to them in
French; I never did succeed in
making those idiots
understand their own
language.
Mark Twain, The Innocents
Abroad, 1869
Why High Level Language?
• Ease of thinking and coding in an English/Math-like language
• Enhanced productivity because of the ease of debugging and validation
• Maintainability
• Target-independent development
• Availability of optimizing compilers
A Dissection to Reveal Finer Details

High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
        ↓ Compiler
Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
        ↓ Assembler
Machine Language Program (MIPS):
    0000 1010 1100 0101 1001 1111 0110 1000
    1100 0101 1010 0000 0110 1000 1111 1001
    1010 0000 0101 1100 1111 1001 1000 0110
    0101 1100 0000 1010 1000 0110 1001 1111
        ↓ Machine Interpretation
Hardware Architecture Description (Logisim, VHDL, Verilog, etc.)
        ↓ Architecture Implementation
Logic Circuit Description (Logisim, etc.)
What is in a Computer?
• Components:
  – processor (datapath, control)
  – input (mouse, keyboard)
  – output (display, printer)
  – memory (cache (SRAM), main memory (DRAM))
• Our primary focus: the processor (datapath and
control)
– Implemented using millions of transistors
– Impossible to understand by looking at each transistor
– We need abstraction!
5 Major Components of a Computer
Personal Computer:
• Processor
  – Control (“brain”)
  – Datapath (“brawn”)
• Memory (where programs and data live when running)
• Devices
  – Input: keyboard, mouse; disk (where programs and data live when not running)
  – Output: display, printer
Processor Chip (CPU) Components
Motherboard Layout
Dramatic Changes in Technology
• Processor
  – Logic capacity: about 30%–35% per year
  – Clock rate: about 30% per year
• Memory
  – DRAM: Dynamic Random Access Memory
  – Capacity: about 60% per year (4x every 3 years)
  – Memory speed: about 10% per year
  – Cost per bit: improves about 25% per year
• Disk
  – Capacity: about 60%–100% per year
  – Speed: about 10% per year
• Network Bandwidth
  – 10 Mb ------(10 years)-- 100 Mb ------(5 years)-- 1 Gb
Growth Capacity of DRAM Chips
• K = 1024 (2^10)
• In recent years the growth rate has slowed to 2x every 2 years
[Figure: DRAM chip capacity vs. year]

Dramatic Changes in Technology
[Figure: # of transistors on an IC vs. year]
• Gordon Moore, Intel Cofounder
• 2X Transistors / Chip every 1.5 years
• Called “Moore’s Law”
The Underlying Technologies

Year   Technology                    Relative Performance/Unit Cost
1951   Vacuum Tube                   1
1965   Transistor                    35
1975   Integrated Circuit (IC)       900
1995   Very Large Scale IC (VLSI)    2,400,000
2005   Ultra VLSI                    6,200,000,000
What if technology in the automobile industry
advanced at the same rate?
What if the automobile …
“If the automobile had followed the same
development cycle as the computer,
a Rolls-Royce would today cost $100,
get a million miles per gallon,
and explode once a year,
killing everyone inside.”
– Robert X. Cringely,
InfoWorld magazine
Complex Chip Manufacturing Process
Enabled by Technological Breakthroughs
Computer Architecture versus Computer Organization
Computer architecture is the abstract image of a computing system that
is seen by a machine language (or assembly language) programmer,
including the instruction set, memory address modes, processor
registers, and address and data formats;
whereas the computer organization is a lower level, more concrete,
description of the system that involves how the constituent parts of
the system are interconnected and how they interoperate in order to
implement the architectural specification
--Phillip A. Laplante (2001), Dictionary of Computer Science,
Engineering, and Technology
-> One can change the organization without changing the architecture (e.g., a 64-bit
architecture implemented as a 16-bit machine that uses 4 clock cycles per operation)
Course Outline
Topic (# weeks):
• Introduction to Computer Organization (1)
• Computer Instructions (2)
• Arithmetic and Logic Unit (1)
• Performance Analysis (1)
• Data Path and Control (2)
• Performance Enhancement with Pipelining (2)
• Memory Hierarchy and Virtual Memory Concepts (2)
• Storage, Networks, and other Peripherals (1)
• Engineering Design with Microcomputers (2)
Course Objectives
• Know about the different software and hardware components of a digital computer.
• Comprehend how different components of the digital computer collaborate to produce the end result in an application development process.
• Apply principles of logic design to digital computer design.
• Analyze a digital computer and decompose it into modules and lower-level logical blocks involving both combinational and sequential circuit elements.
• Synthesize various components of a computer’s Arithmetic Logic Unit, Control Units, and Data Paths.
• Understand and assess (evaluate) computer CPU performance, and learn methods to enhance computer performance.
Language of the Computer
• We will have a quick look at MIPS language
• MIPS- Not to be confused with million instructions per second
• MIPS- Microprocessor without Interlocked Pipelined Stages- a
RISC (Reduced Instruction Set Computer) processor developed
by MIPS Technologies.
• By 1990 1 out of 3 RISC processors was using MIPS;
Architecture also called MIPS
• Cisco routers, the Nintendo 64, the Sony PlayStation, PlayStation 2,
etc. use MIPS designs
Why bother to learn assembly language?
• “The difference between mediocre and star programmers is that star
programmers understand assembly language, whether or not they use it on a
daily basis.”
• “Assembly language is the language of the computer
itself. To be a programmer without ever learning
assembly language is like being a professional race
car driver without understanding how your
carburetor works. To be a truly successful
programmer, you have to understand exactly what
the computer sees when it is running a program.
Nothing short of learning assembly language will do
that for you. Assembly language is often seen as a
black art among today's programmers - with those
knowing this art being more productive, more
knowledgeable, and better paid, even if they
primarily work in other languages.”
Basic Instruction Format
Three Instruction Formats:

R: Opcode (31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
I: Opcode (31-26) | rs (25-21) | rt (20-16) | Immediate (15-0)
J: Opcode (31-26) | Memory Address (25-0)
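As a sanity check on the field boundaries above, the following sketch splits a 32-bit word into its R-format fields with shifts and masks (Python is used here purely for illustration; the example word encodes add $t3, $t1, $t2):

```python
def decode_r_format(word):
    """Split a 32-bit MIPS R-format instruction word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26 (6 bits)
        "rs":     (word >> 21) & 0x1F,  # bits 25-21 (5 bits)
        "rt":     (word >> 16) & 0x1F,  # bits 20-16 (5 bits)
        "rd":     (word >> 11) & 0x1F,  # bits 15-11 (5 bits)
        "shamt":  (word >> 6)  & 0x1F,  # bits 10-6  (5 bits)
        "funct":  word         & 0x3F,  # bits 5-0   (6 bits)
    }

# add $t3, $t1, $t2: opcode=0, rs=9 ($t1), rt=10 ($t2), rd=11 ($t3),
# shamt=0, funct=0x20 (add)
print(decode_r_format(0x012A5820))
```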
Now Guess the MIPS Architecture
• How many registers?
• How big a memory can be supported?
• What is the memory word size?
• How is data in RAM handled?
Non-architectural design/implementation issues that vary from design to design:
• Roles of registers
Instruction Set Architecture (ISA)
• Instructions
– The words of a computer’s language are called
instructions
• Instruction set
– The vocabulary of a computer’s language is
called the instruction set
• Instruction Set Architecture (ISA)
– The set of instructions a particular CPU
implements is an Instruction Set Architecture.
The Instruction Set Architecture (ISA)

  software
  -----------------------------
  instruction set architecture
  -----------------------------
  hardware

The interface description separating the software and hardware.
ISA Sales
ISA: CISC vs. RISC
• Early trend was to add more and more instructions to new CPUs to do
elaborate operations
– CISC (Complex Instruction Set Computer)
– The primary goal of CISC architecture is to complete a task in as few
lines of assembly as possible.
– VAX architecture had an instruction to multiply polynomials!
• RISC philosophy (Cocke IBM, Patterson, Hennessy, 1980s) –
Reduced Instruction Set Computer
– Keeping the instruction set small and simple makes it easier to build
fast hardware.
– Let software do complicated operations by composing simpler ones.
The MIPS ISA
• Instruction Categories
  – Load/Store
  – Computational
  – Jump and Branch
  – Floating Point (coprocessor)
  – Memory Management
  – Special
• Registers
  – R0 - R31
  – PC, HI, LO
• 3 Instruction Formats: all 32 bits wide
  R: OP | rs | rt | rd | sa | funct
  I: OP | rs | rt | immediate
  J: OP | jump target
MIPS Registers and their Roles

Name       Number  Use                                               Preserved across a Call?
$zero      0       The constant value 0                              N.A.
$at        1       Assembler Temporary                               No
$v0-$v1    2-3     Values for function results, expression evaluation  No
$a0-$a3    4-7     Arguments                                         No
$t0-$t7    8-15    Temporaries                                       No
$s0-$s7    16-23   Saved Temporaries                                 Yes
$t8-$t9    24-25   Temporaries                                       No
$k0-$k1    26-27   Reserved for OS kernel                            No

$gp (28) global pointer, $sp (29) stack pointer, $fp (30) frame
pointer, and $ra (31) return address are all preserved across a call.
Simple operations
• Compute f = (a+b) - (c-d) assuming these
variables are in some $s registers
• Memory operation: the base register concept
• Why a multiplication factor of 4 is required
for the array index n. Answer: memory
addresses in MIPS are byte addresses.
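To make the factor of 4 concrete: each 32-bit word occupies four consecutive byte addresses, so element n of a word array starting at some base byte address lives at base + 4*n. A minimal sketch (Python for illustration; the base address is just an example):

```python
WORD_SIZE = 4  # bytes per 32-bit MIPS word

def element_address(base, n):
    """Byte address of A[n] when A is an array of 32-bit words at 'base'."""
    return base + WORD_SIZE * n

# With A starting at byte address 0x10000000 (a typical SPIM .data address),
# consecutive elements sit 4 byte addresses apart:
print([hex(element_address(0x10000000, n)) for n in range(3)])
```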
Quick Recap- Compilers

Input: C++ program to sort 10 numbers
  -> C++ Compiler (Machine-X code to translate any C++ program into an assembly program for Machine X), running on Machine X
  -> Output: Machine X assembly program to sort 10 numbers
  -> Assembler (Machine-X code to translate any Machine X assembly program into Machine X code), running on Machine X
  -> Output: Machine X code to sort 10 numbers
Input: 10 numbers -> Machine X code to sort 10 numbers, running on Machine X -> Output: sorted list of 10 numbers

Note: the two steps in the dotted area (compiling to assembly, then assembling) can be merged together into a single step.
Quick Recap- Shortcut Compilers

Input: C++ program to sort 10 numbers
  -> C++ Compiler (Machine-X code to translate any C++ program directly into Machine X code), running on Machine X
  -> Output: Machine X code to sort 10 numbers
Input: 10 numbers -> Machine X code to sort 10 numbers, running on Machine X -> Output: sorted list of 10 numbers
Quick Recap- Bootstrapping

Input: C++ program to translate any C++ program into Machine Y code
  -> C++ Compiler (Machine-X code to translate any C++ program directly into Machine X code), running on Machine X
  -> Output: Machine X code to translate any C++ program into Machine Y code
Input: the same C++ program (the compiler's own source)
  -> that Machine X code, running on Machine X
  -> Output: Machine Y code for the C++ compiler (i.e., to translate any C++ program into Machine Y code)
This can be installed and run on Machine Y; thus you have a compiler for Machine Y.
Chapter 2- MIPS Programming
Quick Recap- MIPS
• MIPS language- expansion of the acronym
• No of registers and architecture in general
• The 3 Instruction formats and the various
fields (e.g. rs, rt, rd, shamt, etc.)
• Now, we proceed along with
– MIPS Assembly Instruction formats
– Coding simple problems and translating into MIPS
machine code
Simple Statements
• C Code: d = (a + b) – (c + d)
• Machine code assuming a, b, c, and d are in
MIPS registers
• Machine code assuming that a, b, c, and d are
in consecutive memory locations from a given
starting address (use lw, sw)
Loops and Branches
• Develop assembly code for a typical C-code to
add 100 numbers as follows:
// Read 100 numbers into an array A
sum = 0;
for (i = 0; i < 100; i++)
{
sum = sum + A[i];
}
// Print sum
Procedure Calls
• Caller and Callee- who should preserve
which registers?
• Leaf and recursive procedure examples for
explaining the conventions, and the jal and jr
instructions.
SPIM
Courtesy: Prof. Jerry Breecher
Clark University
Appendix A
MIPS Simulation
• SPIM is a simulator.
  – Reads a MIPS assembly language program.
  – Simulates each instruction.
  – Displays values of registers and memory.
  – Supports breakpoints and single stepping.
  – Provides simple I/O for interacting with the user.
SPIM Versions
• SPIM is the command line version.
• XSPIM is x-windows version (Unix workstations).
• There is also a windows version. You can use this at home and it can
be downloaded from:
http://www.cs.wisc.edu/~larus/spim.html.
Resources On the Web
• There’s a very good SPIM tutorial at
  http://chortle.ccsu.edu/AssemblyTutorial/Chapter-09/ass09_1.html
• In fact, there’s a tutorial for a good chunk of the ISA portion of this course at:
  http://chortle.ccsu.edu/AssemblyTutorial/tutorialContents.html
• Here are a couple of other good references you can look at:
  Patterson_Hennessy_AppendixA.pdf
  and
  http://babbage.clarku.edu/~jbreecher/comp_org/labs/Introduction_To_SPIM.pdf
SPIM Program
• MIPS assembly language.
• Must include a label “main” – this will be called by the SPIM startup
code (allows you to have command line arguments).
• Can include named memory locations, constants and string literals in a
“data segment”.
General Layout
• Data definitions start with the .data directive.
• Code definition starts with the .text directive.
  – “Text” is the traditional name for the memory that holds a program.
• Usually have a bunch of subroutine definitions and a “main”.
Simple Example

.data              # data memory
foo: .word 0       # 32 bit variable
.text              # program memory
.align 2           # word alignment
.globl main        # main is global
main:
    lw $a0, foo
Data Definitions
• You can define variables/constants with:
  – .word:   defines 32 bit quantities.
  – .byte:   defines 8 bit quantities.
  – .asciiz: zero-terminated ascii strings.
  – .space:  allocates some bytes.
Data Examples

.data
prompt: .asciiz "Hello World\n"
msg:    .asciiz "The answer is "
x:      .space 4
y:      .word  4
str:    .space 100
MIPS: Software Conventions For
Registers
Simple I/O
SPIM provides some simple I/O using the “syscall” instruction. The specific
I/O done depends on some registers:
– You set $v0 to indicate the operation.
– Parameters go in $a0, $a1.

I/O Functions
A system call is used to communicate with the system and do simple I/O:
1. Load the function code into $v0.
2. Load arguments (if any) into registers $a0, $a1 or $f12 (for floating point).
3. Do: syscall
4. Results are returned in registers $v0 or $f0.
Example: Reading an int

li $v0, 5        # Indicate we want function 5 (read integer)
syscall          # Upon return from the syscall, $v0 has the integer typed by
                 # a human in the SPIM console
# Now print that same integer
move $a0, $v0    # Get the number to be printed into the argument register
li $v0, 1        # Indicate we’re doing a write-integer
syscall
Printing A String

.data
msg: .asciiz "SPIM IS FUN"
.text
.globl main
main:
    li $v0, 4     # function code for print-string
    la $a0, msg   # address of the string
    syscall
    jr $ra
A Typical MIPS READ and WRITE Program

.data 0x10000000
A: .word 0, 0
.text
main: la $t0, A
      li $v0, 5       # set up $v0 with the function code for read-integer
      syscall
      sw $v0, ($t0)
      li $v0, 5       # set up $v0 with the function code for read-integer
      syscall
      sw $v0, 4($t0)
      lw $t1, 0($t0)
      lw $t2, 4($t0)
      add $t3, $t1, $t2
      li $v0, 1       # set up $v0 with the function code for print-integer
      move $a0, $t3
      syscall
A C-Program with Read and Sum Loops

int main (int argc, char **argv) // Older versions of C accept: void main()
{
    int A[5], i, sum;
    for (i = 0; i <= 4; i++)
    {
        scanf("%d", &A[i]);
    }
    sum = 0;
    for (i = 0; i <= 4; i++)
    {
        sum = sum + A[i];
    }
    printf("The sum of 5 numbers is: %d\n", sum);
    return 0;
}
The MIPS equivalent of the C-Program with Read and Sum Loops

.data
A:    .word 0      # Create space for the first word A[0] and initialize it to 0
      .space 16    # Create space for 4 more words A[1] .. A[4]
msg:  .asciiz "The sum of 5 numbers is: "
.text
main: la $t0, A          # Store in $t0 the address of A[0], the first of five words
      li $t1, 0          # Store in $t1 the initial value of the loop variable
      li $t2, 4          # Store in $t2 the final value of the loop variable
      li $t3, 0          # Initialize $t3, which increments by 4 with each word read
loop: add $t4, $t0, $t3  # Put in $t4 the address of the next word
      li $v0, 5          # Initialize $v0 for Read
      syscall
      sw $v0, ($t4)      # Put the newly read integer into the word location pointed to by $t4
      addi $t3, $t3, 4   # Increment $t3 by 4 for calculation of the next word address
      addi $t1, $t1, 1
      ble $t1, $t2, loop
(continued on next slide …)
The MIPS equivalent of the C-Program
with Read and Sum Loops
… Continued from previous slide.
      li $t1, 0             # Do the same initialization for the identical loop at addloop
      li $t2, 4
      li $t3, 0
      li $s0, 0
addloop:
      add $t4, $t0, $t3
      lw $t5, ($t4)         # Read the integer at the address in $t4 into $t5
      add $s0, $s0, $t5     # Update the partial sum in $s0 by adding the new integer
      addi $t3, $t3, 4
      addi $t1, $t1, 1
      ble $t1, $t2, addloop
      li $v0, 4             # Make the system ready to print a string
      la $a0, msg           # Load the starting address (msg) of the string into $a0, the argument register
      syscall
      li $v0, 1             # Make the system ready to print the integer (sum)
      move $a0, $s0
      syscall
SPIM Subroutines
• The stack is set up for you – just use $sp.
• You can view the stack in the data window.
• main is called as a subroutine (have it return using jr $ra).
• For now, don’t worry about details. But the next few pages give some
excellent examples of how stacks work.
Why Are Stacks So Great?
• Some machines provide a memory stack as part of the architecture (e.g.,
VAX)
• Sometimes stacks are implemented via software convention (e.g., MIPS)
MIPS Function Calling Conventions

fact:
    addiu $sp, $sp, -32    # allocate a stack frame
    sw $ra, 20($sp)        # save the return address
    ...
    sw $s0, 4($sp)         # save any $s registers used
    ...
    lw $ra, 20($sp)        # restore the return address
    addiu $sp, $sp, 32     # pop the stack frame
    jr $ra
C-Program for a leaf-procedure

void main()
{
    int e, f, g, h, result;
    scanf("%d", &e);
    scanf("%d", &f);
    scanf("%d", &g);
    scanf("%d", &h);
    result = leaf_procedure(e, f, g, h);
    printf("Result = %d\n", result);
}

int leaf_procedure(int e, int f, int g, int h)
{
    int res;
    int temp1, temp2; // Not required (only for making it close to the MIPS code)
    temp1 = e + f;
    temp2 = g + h;
    res = temp1 - temp2;
    return (res);
}
Page 1: MIPS code for the main
(calling program ) of leaf_procedure
.data
e: .word 0
f: .word 0
g: .word 0
h: .word 0
.text
main: la $t0, e             # Load address of e into $t0
      li $t1, 0             # Set the loop iteration variable to 0
readLoop:
      sll $t2, $t1, 2       # Since each word is 4 bytes long, multiply the loop variable by 4
      add $t3, $t0, $t2     # First time in the loop, $t3 will have the address of e
      li $v0, 5             # Prepare for read
      syscall
      sw $v0, ($t3)         # Newly read value will go to e, f, g, or h depending upon
                            # whether the loop variable $t1 contains 0, 1, 2, or 3,
                            # that is, whether $t2 is 0, 4, 8, or 12.
      addi $t1, $t1, 1
      xori $t2, $t1, 4      # You can destroy the original $t2 value because you are recomputing
                            # it from $t1 at the beginning of the loop!
      bne $t2, $zero, readLoop   # You haven't read all 4 integers; go back to readLoop.
Page 2: MIPS Code Continuation for
the main of leaf_procedure
# Reading complete. Make preparations for the leaf_procedure that computes (e+f)-(g+h)
# by saving arguments in argument registers.
      lw $a0, 0($t0)       # load e into $a0
      lw $a1, 4($t0)       # load f into $a1
      lw $a2, 8($t0)       # load g into $a2
      lw $a3, 12($t0)      # load h into $a3
      jal leaf_procedure   # this instruction stores the address of the next instruction (the return
                           # address, that is, the address of the instruction at the “print” label)
                           # in $ra and jumps to the label leaf_procedure
print: move $t0, $v0
      li $v0, 1            # Prepare for print
      move $a0, $t0
      syscall
      j last
Page 3: MIPS Code for the
leaf_procedure itself
leaf_procedure:
      addi $sp, $sp, -12  # Make space on the stack for 3 integers
      sw $s0, 0($sp)      # Save the contents of the registers you plan to temporarily use
                          # in this procedure on the stack so that the original values can be
                          # restored before returning to the calling program
      sw $s1, 4($sp)
      sw $s2, 8($sp)
      add $s1, $a0, $a1   # Add e and f in $a0 and $a1, respectively, and put the sum in $s1
      add $s2, $a2, $a3   # Add g and h in $a2 and $a3, respectively, and put the sum in $s2
      sub $s0, $s1, $s2   # Subtract g+h in $s2 from e+f in $s1, and put the result in $s0
# Make preparations for returning to the calling procedure (main in this case)
      move $v0, $s0       # Put the computed value into the return value register
      lw $s0, 0($sp)      # Restore the original values from the stack to the registers
      lw $s1, 4($sp)
      lw $s2, 8($sp)
      addi $sp, $sp, 12   # Update the stack pointer
      jr $ra              # Jump to the location pointed to by $ra (print, in our case)
last:                     # The main program will stop here as there is no valid instruction here.
MIPS Function Calling Conventions
main() {
printf("The factorial of 10 is %d\n", fact(10));
}
int fact (int n) {
if (n <= 1) return(1);
return (n * fact (n-1));
}
MIPS Function Calling Conventions

.text
.globl main
main:
    subu $sp, $sp, 32    # stack frame size is 32 bytes
    sw   $ra, 20($sp)    # save return address
    li   $a0, 10         # load argument (10) in $a0
    jal  fact            # call fact
    la   $a0, LC         # load string address in $a0
    move $a1, $v0        # load fact result in $a1
    jal  printf          # call printf
    lw   $ra, 20($sp)    # restore $ra
    addu $sp, $sp, 32    # pop the stack
    jr   $ra             # exit()
.data
LC: .asciiz "The factorial of 10 is %d\n"
MIPS Function Calling Conventions

.text
fact:
    subu $sp, $sp, 8     # stack frame is 8 bytes
    sw   $ra, 8($sp)     # save return address
    sw   $a0, 4($sp)     # save argument (n)
    subu $a0, $a0, 1     # compute n-1
    bgtz $a0, L2         # if n-1 > 0 (i.e. n > 1) go to L2
    li   $v0, 1          #
    j    L1              # return(1)
L2:                      # new argument (n-1) is already in $a0
    jal  fact            # call fact
    lw   $a0, 4($sp)     # load n
    mul  $v0, $v0, $a0   # fact(n-1)*n
L1:
    lw   $ra, 8($sp)     # restore $ra
    addu $sp, $sp, 8     # pop the stack
    jr   $ra             # return, result in $v0
Sample SPIM Programs (on the web)
multiply.s: multiplication subroutine based on repeated addition and a test
program that calls it.
http://babbage.clarku.edu/~jbreecher/comp_org/labs/multiply.s
fact.s: computes factorials using the multiply subroutine.
http://babbage.clarku.edu/~jbreecher/comp_org/labs/fact.s
sort.s: the sorting program from the text.
http://babbage.clarku.edu/~jbreecher/comp_org/labs/sort.s
strcpy.s: the strcpy subroutine and test code.
http://babbage.clarku.edu/~jbreecher/comp_org/labs/strcpy.s
EENG 3710
Computer Organization
Arithmetic 3
ALU Design – Integer Addition, Multiplication & Division
Adapted from David H. Albonesi
Copyright David H. Albonesi and the University of Rochester.
E. J. Kim
Integer multiplication
• Pencil and paper binary multiplication:

        1000    (multiplicand)
    x   1001    (multiplier)
    ---------
        1000    (partial products)
       00000
      00000
  + 1000000
  ---------
    1001000    (product)

• Key elements
  – Examine multiplier bits from right to left
  – Shift multiplicand left one position each step
  – Simplification: each step, add the multiplicand to the running product
    only when the current multiplier bit is 1
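The key elements above amount to a shift-and-add loop. A minimal sketch of the unsigned version (Python chosen only to make the algorithm runnable; the hardware operates on fixed-width registers):

```python
def shift_add_multiply(multiplicand, multiplier, bits=32):
    """Unsigned shift-and-add multiplication, mirroring the pencil-and-paper method."""
    product = 0
    for _ in range(bits):
        if multiplier & 1:           # current multiplier bit is 1:
            product += multiplicand  # add the (shifted) multiplicand to the product
        multiplicand <<= 1           # shift multiplicand left one position
        multiplier >>= 1             # move on to the next multiplier bit
    return product

print(bin(shift_add_multiply(0b1000, 0b1001)))  # 0b1001000 (8 x 9 = 72)
```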
Integer multiplication
• Step-by-step with a running product (1000 x 1001):
  – Initialize product register to 0: 00000000
  – Multiplier bit = 1: add multiplicand to product: 00000000 + 1000 = 00001000
  – Shift multiplicand left: 10000
  – Multiplier bit = 0: do nothing
  – Shift multiplicand left: 100000
  – Multiplier bit = 0: do nothing
  – Shift multiplicand left: 1000000
  – Multiplier bit = 1: add multiplicand to product: 00001000 + 1000000 = 01001000 (product)
Integer multiplication
• 64-bit hardware implementation
  – Multiplicand loaded into the right half of the 64-bit multiplicand register
  – Product register initialized to all 0’s
  – Repeat the following 32 times:
    • If the multiplier register LSB = 1, add multiplicand to product
    • Shift multiplicand one bit left
    • Shift multiplier one bit right
[Figure: integer multiplication algorithm flowchart]
Integer multiplication
• Drawback: half of the 64-bit multiplicand register is zeros
  – Half of the 64-bit adder is adding zeros
• Solution: shift the product right instead of the multiplicand left
  – Only the left half of the product register is added to the multiplicand
• Step-by-step (1000 x 1001):
  – 00000000 (running product)
  – Multiplier bit = 1: add multiplicand to the left half: 10000000
  – Shift product right: 01000000
  – Multiplier bit = 0: do nothing; shift product right: 00100000
  – Multiplier bit = 0: do nothing; shift product right: 00010000
  – Multiplier bit = 1: add multiplicand to the left half: 10010000
  – Shift product right: 01001000 (product)
Integer multiplication
• Hardware implementation
[Figure: refined multiplication hardware]
• Final improvement: use the right half of the product register for the multiplier
[Figure: final integer multiplication algorithm]
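The refined algorithm above can be sketched as follows: the multiplicand stays fixed, it is added into the left half of a double-width product register, and the whole register shifts right each step; with the multiplier held in the right half, its LSB is always the bit being examined. (Python for illustration, 4-bit operands for readability.)

```python
def multiply_shift_right(multiplicand, multiplier, bits=4):
    """Refined unsigned multiply: the product register's right half holds the multiplier."""
    product = multiplier                     # right half = multiplier, left half = 0
    for _ in range(bits):
        if product & 1:                      # LSB of product register = current multiplier bit
            product += multiplicand << bits  # add multiplicand into the left half
        product >>= 1                        # shift the whole product register right
    return product

print(bin(multiply_shift_right(0b1000, 0b1001)))  # 0b1001000 (8 x 9 = 72)
```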
Multiplication of signed numbers
• Naïve approach
  – Convert to positive numbers
  – Multiply
  – Negate the product if the multiplier and multiplicand signs differ
  – Slow, and requires extra hardware
Multiplication of signed numbers
• Booth’s algorithm
  – Invented for speed
    • Shifting was faster than addition at the time
    • Objective: reduce the number of additions required
  – Fortunately, it works for signed numbers as well
  – Basic idea: the additions from a string of 1’s in the
    multiplier can be converted to a single addition
    and a single subtraction operation
  – Example: 00111110 is equivalent to 01000000 - 00000010;
    instead of an addition for each bit position in the string of 1’s,
    this requires one addition (for the bit position just left of the
    string) and one subtraction (for the rightmost bit position of
    the string)
Booth’s algorithm
• Starting from right to left, look at two adjacent
bits of the multiplier
  – Place a zero at the right of the LSB to start
• If bits = 00, do nothing
• If bits = 10, subtract the multiplicand from the product
  – Beginning of a string of 1’s
• If bits = 01, add the multiplicand to the product
  – End of a string of 1’s
• If bits = 11, do nothing
  – Middle of a string of 1’s
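A sketch of these two-bit inspection rules for fixed-width two's-complement operands (Python for illustration; the `prod` variable models the product+multiplier register with an extra bit appended at the right of the LSB):

```python
def booth_multiply(multiplicand, multiplier, bits=4):
    """Booth's algorithm for 'bits'-wide two's-complement operands.
    Returns the signed 2*bits-wide product."""
    reg_mask = (1 << (2 * bits + 1)) - 1          # product + multiplier + extra bit
    prod = (multiplier & ((1 << bits) - 1)) << 1  # multiplier in low half, extra bit = 0
    for _ in range(bits):
        pair = prod & 0b11                        # current bit and the bit to its right
        if pair == 0b10:                          # beginning of a string of 1's: subtract
            prod -= multiplicand << (bits + 1)
        elif pair == 0b01:                        # end of a string of 1's: add
            prod += multiplicand << (bits + 1)
        prod &= reg_mask                          # keep register width (2*bits + 1 bits)
        sign = prod & (1 << (2 * bits))           # arithmetic shift right by one,
        prod = (prod >> 1) | sign                 # preserving the sign bit
    result = prod >> 1                            # drop the extra bit
    if result & (1 << (2 * bits - 1)):            # interpret the result as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(0b0010, 0b1101))  # 2 x (-3) = -6
```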
Booth recoding
• Example: 0010 (multiplicand) x 1101 (multiplier)

  00001101 0    (product + multiplier, with an extra bit position at the right of the LSB)
  Bits 10: subtract multiplicand (add 1110):  11101101 0
  Arithmetic shift right:                     11110110 1
  Bits 01: add multiplicand (0010):           00010110 1
  Shift right:                                00001011 0
  Bits 10: subtract multiplicand (add 1110):  11101011 0
  Shift right:                                11110101 1
  Bits 11: do nothing; shift right:           11111010 1
  Product: 11111010 (i.e., 2 x -3 = -6)
Integer division
• Pencil and paper binary division (01001000 / 1000):
             1001          (quotient)
  (divisor) 1000 ) 01001000   (dividend)
           - 1000
             0001000       (partial remainder; bring down the next bits,
                            appending quotient bits 0, 0 along the way)
           -    1000
             0000000       (remainder)
• Steps in hardware
– Shift the dividend left one position
– Subtract the divisor from the left half of the
dividend
– If the result is positive, shift left a 1 into the quotient
– Else, restore the partial remainder and shift left a 0 into the quotient
– Repeat from the first step until all quotient bits are produced
Integer division
• Hardware walkthrough of 01001000 / 1000
– Initial state: (divisor) 1000, (dividend) 01001000, (quotient) 0000
– Shift dividend left one position: 10010000
– Subtract divisor from left half of dividend: 1001 - 1000 = 0001, giving 00010000 (keep the right-half bits)
– Result positive, left shift a 1 into the quotient: 0001
– Shift partial remainder left one position: 00100000
– Subtract divisor from left half: 0010 - 1000 = 1010 (negative), giving 10100000
– Result negative, left shift 0 into quotient (0010) and restore the original partial remainder (by adding the divisor back): 00100000
– Shift partial remainder left one position: 01000000
– Subtract divisor from left half: 0100 - 1000 = 1100 (negative), giving 11000000
– Result negative, left shift 0 into quotient (0100) and restore: 01000000
– Shift partial remainder left one position: 10000000
– Subtract divisor from left half: 1000 - 1000 = 0000
– Result positive, left shift 1 into quotient: 1001 (quotient), remainder 0000
Integer division
• Hardware implementation
  [figure: the divisor register feeds the ALU; the combined remainder/quotient register is shifted left each step; the dividend is loaded into this register initially; control chooses subtract vs. restore and supplies the quotient bit]
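The restoring-division steps above can be sketched in Python. The comparison below stands in for subtract-then-restore (nothing is ever literally added back), so this is a behavioral sketch of the hardware, with operand widths taken from the slides' example.

```python
def restoring_divide(dividend, divisor, bits=4):
    """Restoring division: shift the combined (remainder | dividend)
    register left, trial-subtract the divisor from the left half, and
    shift a 1 or 0 into the quotient. Unsigned operands; the divisor is
    `bits` wide and the dividend 2*bits wide."""
    reg = dividend                   # dividend loaded into the register
    quotient = 0
    for _ in range(bits):
        reg <<= 1                    # shift left one position
        if (reg >> bits) >= divisor: # subtraction would be non-negative
            reg -= divisor << bits
            quotient = (quotient << 1) | 1
        else:                        # would go negative: "restore", shift 0
            quotient <<= 1
    return quotient, reg >> bits     # remainder ends up in the left half

# the slides' example: 01001000 / 1000  (72 / 8)
print(restoring_divide(0b01001000, 0b1000))  # (9, 0)
```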
Integer and floating point revisited
• [datapath figure: PC → instruction memory → integer register file; an integer ALU and an integer multiplier (with HI and LO registers) on the integer side; a flt pt register file feeding a flt pt adder and a flt pt multiplier; data memory shared by both sides]
• Integer ALU handles add, subtract, logical, set less than,
equality test, and effective address calculations
• Integer multiplier handles multiply and divide
– HI and LO registers hold the result of integer multiply and divide
Floating point representation
• Floating point (fp) numbers represent reals
– Example reals: 5.6745, 1.23 x 10^-19, 345.67 x 10^6
– Floats and doubles in C
• Fp numbers are in signed magnitude representation of the form
(-1)^S x M x B^E where
– S is the sign bit (0=positive, 1=negative)
– M is the mantissa (also called the significand)
– B is the base (implied)
– E is the exponent
• Example: 22.34 x 10^-4
– S=0, M=22.34, B=10, E=-4
Floating point representation
• Fp numbers are normalized in that M has only
one digit to the left of the “decimal point”
– Between 1.0 and 9.9999… in decimal
– Between 1.0 and 1.1111… in binary
– Simplifies fp arithmetic and comparisons
– Normalized: 5.6745 x 10^2, 1.23 x 10^-19
– Not normalized: 345.67 x 10^6, 22.34 x 10^-4, 0.123 x 10^-45
– In binary format, normalized numbers are of the
form (-1)^S x 1.M x B^E
• Leading 1 in 1.M is implied
Floating point representation tradeoffs
• Representing a wide enough range of fp values
with enough precision (“decimal” places) given
limited bits for (-1)^S x 1.M x B^E
  [figure: a 32-bit word split into S, E??, and M?? fields — how many bits for each?]
– More E bits increases the range
– More M bits increases the precision
– A larger B increases the range but decreases the
precision
– The distance between consecutive fp numbers is
not constant!
  [number line: representable values are denser between B^E and B^(E+1) than between B^(E+1) and B^(E+2)]
Floating point representation tradeoffs
• Allowing for fast arithmetic implementations
– Different exponents require lining up the
significands; a larger base increases the probability of
equal exponents
• Handling very small and very large numbers
  [number line: exponent overflow beyond the representable negative numbers (S=1), exponent underflow in the gap around 0, and exponent overflow beyond the representable positive numbers (S=0)]
Sorting/comparing fp numbers
• Fp numbers can be treated as integers for
sorting and comparing purposes if E is placed to
the left of M: format S | E | M
– A bigger E is a bigger number; if E’s are the same, a
bigger M is a bigger number
• Example
– 3.67 x 10^6 > 6.34 x 10^-4 > 1.23 x 10^-4
Biased exponent notation
• 111…111 represents the most positive E and
000…000 represents the most negative E for
sorting/comparing purposes
• To get the correct signed value for E, need to subtract a bias of
011…111
• Biased fp numbers are of the form
(-1)^S x 1.M x B^(E-bias)
• Example: assume 8 bits for E
– Bias is 01111111 = 127
– Largest E is represented by 11111111, which is
255 – 127 = 128
– Smallest E is represented by 00000000, which is
0 – 127 = -127
IEEE 754 floating point standard
• Created in 1985 in response to the wide range
of fp formats used by different companies
– Has greatly improved portability of scientific
applications
• B=2
• Single precision (sp) format (“float” in C): S (1 bit) | E (8 bits) | M (23 bits)
• Double precision (dp) format (“double” in C): S (1 bit) | E (11 bits) | M (52 bits)
IEEE 754 floating point standard
• Exponent bias is 127 for sp and 1023 for dp
• Fp numbers are of the form (-1)^S x 1.M x 2^(E-bias)
– 1 in mantissa and base of 2 are implied
– Sp form is
(-1)^S x 1.M22M21…M0 x 2^(E-127)
and value is
(-1)^S x (1 + (M22 x 2^-1) + (M21 x 2^-2) + … + (M0 x 2^-23)) x 2^(E-127)
• Sp example: S=1, E=00000001, M=1000…00
– Number is –1.1000…000 x 2^(1-127) = –1.5 x 2^-126 ≈ –1.76 x 10^-38
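The sp field layout can be checked with Python's struct module, which exposes the raw bit pattern of a C float; the reconstruction below follows the sp value formula above and is valid for normalized numbers only.

```python
import struct

def decode_sp(x):
    """Unpack an IEEE 754 single-precision value into its S, E, M fields
    and rebuild it via (-1)^S x (1 + M/2^23) x 2^(E-127). Normalized
    numbers only (the rebuild is wrong for E=0 denormals, infs, NaNs)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    value = (-1.0) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - 127)
    return s, e, m, value

# -0.75 = -1.1 (binary) x 2^-1, so S=1, E = -1+127 = 126, M = 100...0
print(decode_sp(-0.75))  # (1, 126, 4194304, -0.75)
```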
IEEE 754 floating point standard
• Denormalized numbers
– Allow for representation of very small numbers,
filling the exponent-underflow gap around 0
– Identified by E=0 and a non-zero M
– Format is (-1)^S x 0.M x 2^-(bias-1)
– Smallest positive dp denormalized number is
0.00…01 x 2^-1022 = 2^-1074, while the
smallest positive dp normalized number is 1.0 x 2^-1022
– Hardware support is complex, and so often handled
by software
Floating point addition
• Make both exponents the same
– Find the number with the smaller one
– Shift its mantissa to the right until the exponents match
• Must include the implicit 1 (1.M)
• Add the mantissas
• Choose the larger exponent for the result
• Put the result in normalized form
– Shift mantissa left or right until in form 1.M
– Adjust exponent accordingly
• Handle overflow or underflow if necessary
• Round
• Renormalize if rounding produced an unnormalized result
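The steps above can be sketched on (significand, exponent) pairs. This toy model ignores rounding, overflow/underflow, and the packed IEEE bit fields; it only shows the align/add/renormalize skeleton.

```python
def fp_add(m1, e1, m2, e2):
    """Toy floating point add: align exponents, add significands,
    renormalize. Significands are signed Python floats with magnitude
    in [1, 2); exponents are plain ints."""
    # make both exponents the same: shift the smaller number's
    # significand right (divide by 2 per exponent step)
    if e1 < e2:
        m1 /= 2 ** (e2 - e1)
        e = e2
    else:
        m2 /= 2 ** (e1 - e2)
        e = e1
    m = m1 + m2                      # add the mantissas
    # put the result in normalized form, adjusting the exponent
    while abs(m) >= 2:
        m /= 2
        e += 1
    while 0 < abs(m) < 1:
        m *= 2
        e -= 1
    return m, e

# 1.5 x 2^3 + 1.0 x 2^1 = 12 + 2 = 14 = 1.75 x 2^3
print(fp_add(1.5, 3, 1.0, 1))  # (1.75, 3)
```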
Floating point addition
• Algorithm [flowchart figure omitted; it follows the steps listed above]
Floating point addition example
• Initial values (sp operands):
  A: S=1  E=00000001  M=0000…0110
  B: S=0  E=00000011  M=0100…0011
• Identify smaller E and calculate E difference: difference = 2
• Shift smaller M right by the E difference (including the implicit 1): A’s M becomes 0100…0001 1
• Add mantissas:
  -0.0100…00011 + 1.0100…00111 = 1.0000…00100
  giving result M = 0000…0010
• Choose the larger exponent for the result: E = 00000011
• Final answer (already normalized): S=0  E=00000011  M=0000…0010
Floating point addition hardware design
• [figure, annotated step by step:]
– determine smaller exponent
– shift mantissa of smaller number right by exponent difference
– add mantissas
– normalize result by shifting mantissa of result and adjusting larger exponent
– round result
– renormalize if necessary
Floating point multiply
• Add the exponents and subtract the bias from the sum
– Example: (5+127) + (2+127) – 127 = 7+127
• Multiply the mantissas
• Put the result in normalized form
– Shift mantissa left or right until in form 1.M
– Adjust exponent accordingly
• Handle overflow or underflow if necessary
• Round
• Renormalize if rounding produced an unnormalized result
• Set S=0 if the signs of both operands are the same, S=1 otherwise
Floating point multiply
• Algorithm [flowchart figure omitted]
• Example:
  A: S=1  E=00000111  M=1000…0000   (-1.5 x 2^(7-127))
  B: S=0  E=11100000  M=1000…0000   (1.5 x 2^(224-127))
– Add exponents: 00000111 + 11100000 = 11100111 (231)
– Subtract bias: 11100111 – 01111111 = 11100111 + 10000001 = 01101000 (104)
– Multiply the mantissas: 1.1000… x 1.1000… = 10.01000…
– Normalize by shifting 1.M right one position and adding one to E:
  10.01000… => 1.001000…, so E = 01101001 (105) and M = 001000…
– Set S=1 since the signs are different
– Result: S=1  E=01101001  M=001000…, i.e. -1.125 x 2^(105-127)
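The multiply steps can be sketched on (sign, significand, biased exponent) triples, using the slides' example operands; rounding and overflow/underflow handling are omitted, so this is a skeleton only.

```python
BIAS = 127

def fp_mul(s1, m1, eb1, s2, m2, eb2):
    """Toy floating point multiply. Significands are floats in [1, 2);
    exponents eb1/eb2 are biased (bias 127, as in sp format)."""
    eb = eb1 + eb2 - BIAS            # add exponents, subtract the bias
    m = m1 * m2                      # multiply the mantissas: in [1, 4)
    if m >= 2:                       # normalize: shift right, bump exponent
        m /= 2
        eb += 1
    s = s1 ^ s2                      # S=1 exactly when the signs differ
    return s, m, eb

# the slides' example: -1.5 x 2^(7-127) times 1.5 x 2^(224-127)
print(fp_mul(1, 1.5, 7, 0, 1.5, 224))  # (1, 1.125, 105)
```

The result matches the worked example: sign 1, significand 1.125, biased exponent 105, i.e. -1.125 x 2^(105-127).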
Rounding
• Fp arithmetic operations may produce a result
with more digits than can be represented in
1.M
• The result must be rounded to fit into the
available number of M positions
• Tradeoff of hardware cost (keeping extra bits)
and speed versus accumulated rounding error
Rounding
• Examples from decimal multiplication [figure with cases a)–c) omitted]
• Renormalization is required after rounding in case c)
Rounding
• Examples from binary multiplication (assuming two bits for M)
  1.01 x 1.01 = 1.1001    (1.25 x 1.25 = 1.5625)
  1.11 x 1.01 = 10.0011   (1.75 x 1.25 = 2.1875)  — result has twice as many bits
  1.10 x 1.01 = 1.111     (1.5 x 1.25 = 1.875)    — may require renormalization after rounding
Rounding
• In binary, an extra bit of 1 is halfway in between
the two possible representations
1.001 (1.125) is halfway between 1.00 (1) and 1.01 (1.25)
1.101 (1.625) is halfway between 1.10 (1.5) and 1.11 (1.75)
IEEE 754 rounding modes
• Truncate
– Remove all digits beyond those supported
– 1.00100 -> 1.00
• Round up to the next value
– 1.00100 -> 1.01
• Round down to the previous value
– 1.00100 -> 1.00
– Differs from Truncate for negative numbers
• Round-to-nearest-even
– Rounds to the even value (the one with an LSB of 0)
– 1.00100 -> 1.00
Implementing rounding
• A product may have twice as many digits as
the multiplier and multiplicand
– 1.11 x 1.01 = 10.0011
• For round-to-nearest-even, we need to know
– The LSB of the final rounded result
– The value to the right of the LSB (the round bit)
– Whether any of the other digits to the right of the round bit are 1’s
• The sticky bit is the OR of those digits
– 1.00101 rounds to 1.01 (round bit = 1, sticky bit = 0 OR 1 = 1)
– 1.00100 rounds to 1.00 (round bit = 1, sticky bit = 0; the tie goes to the even value)
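The LSB/round/sticky rule can be sketched on integer mantissas. `extra_bits` (a name chosen here for illustration) is the number of low-order bits being discarded; the round bit is the first of them and the sticky bit is the OR of the rest.

```python
def round_nearest_even(mant, extra_bits):
    """Round-to-nearest-even on an integer mantissa, discarding
    `extra_bits` low-order bits, per the LSB / round / sticky rule."""
    lsb = (mant >> extra_bits) & 1
    rnd = (mant >> (extra_bits - 1)) & 1
    sticky = 1 if mant & ((1 << (extra_bits - 1)) - 1) else 0
    result = mant >> extra_bits
    # round up when the round bit is 1 and either some lower bit is 1
    # (sticky) or the tie must go up to make the LSB even
    if rnd and (sticky or lsb):
        result += 1
    return result

# 1.00101, keeping 2 fraction bits: rounds up to 1.01
print(bin(round_nearest_even(0b100101, 3)))  # 0b101
# 1.00100: exact tie, LSB already even, stays 1.00
print(bin(round_nearest_even(0b100100, 3)))  # 0b100
```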
Implementing rounding
• The product before normalization may have 2
digits to the left of the binary point: bb.bbbb…
• The product register format needs to cover two possible cases:
  1b.bbbb… g r sssss…
  01.bbbb… g r sssss…   (here the guard bit g is needed as a result bit!)
• The guard bit (g) becomes part of the
unrounded result when the MSB = 0
• g, r, and s suffice for rounding addition as well
MIPS floating point registers
• [figure: 32-bit floating point registers f0–f31; 32-bit control/status register FCR31; 32-bit implementation/revision register FCR0]
• 32 32-bit FPRs
– 16 64-bit registers (32-bit register pairs) for dp
floating point
– Software conventions for their usage (as with GPRs)
• Control/status register
– Holds the status of compare operations and sets the rounding mode
MIPS floating point instruction overview
• Operate on single and double precision
operands
• Computation
– Add, sub, multiply, divide, sqrt, absolute value,
negate
– Multiply-add, multiply-subtract
• Added as part of MIPS-IV revision of ISA specification
• Load and store
– Integer register read for EA calculation
– Data to be loaded or stored in fp register file
MIPS R10000 arithmetic units
• [figure: PC → instruction memory → integer register file feeding an EA calc unit, an integer ALU, and an integer ALU + multiplier; the flt pt register file feeds a flt pt adder, flt pt multiplier, flt pt divider, and flt pt square root unit; both sides share the data memory]
MIPS R10000 arithmetic units
• Integer ALU + shifter
– All instructions take one cycle
• Integer ALU + multiplier
– Booth’s algorithm for multiplication (5-10 cycles)
– Non-restoring division (34-67 cycles)
• Floating point adder
– Carry propagate (2 cycles)
• Floating point multiplier (3 cycles)
– Booth’s algorithm
• Floating point divider (12-19 cycles)
• Floating point square root unit
Processor Design - 1
Adopted from notes by David A. Patterson, John Kubiatowicz, and others.
Copyright © 2001
University of California at Berkeley
Outline of Slides
• Overview
• Design a processor: step-by-step
• Requirements of the instruction set
• Components and clocking
• Assembling an adequate Data path
• Controlling the data path
Chapter 5.1 - Processor Design 1
The Big Picture: Where Are We Now?
• The five classic components of a computer: Input, Output, Memory, and the Processor (Control + Datapath)
• Today’s topic: design a single cycle processor
  [figure: machine design sits between instruction set design (above) and technology (below)]
The CPU
°Processor (CPU): the active part of the computer, which does all the work
(data manipulation and decision-making)
°Datapath: portion of the processor which contains hardware necessary to
perform operations required by the processor (the brawn)
°Control: portion of the processor (also in hardware) which tells the
datapath what needs to be done (the brain)
Big Picture: The Performance Perspective
• Performance of a machine is determined by:
– Instruction count
– Clock cycle time
– Clock cycles per instruction (CPI)
• Processor design (datapath and control) will determine:
– Clock cycle time
– Clock cycles per instruction
• What we will do today:
– Single cycle processor:
• Advantage: one clock cycle per instruction
• Disadvantage: long cycle time
How to Design a Processor: Step-by-step
1. Analyze instruction set → datapath requirements
– the meaning of each instruction is given by the register transfers
– datapath must include storage element for ISA registers
• possibly more
– datapath must support each register transfer
2. Select set of datapath components and establish clocking methodology
3. Assemble datapath meeting the requirements
4. Analyze implementation of each instruction to determine the setting of control points
that affect the register transfer
5. Assemble the control logic
The MIPS Instruction Formats
• All MIPS instructions are 32 bits long. The three instruction formats:
– R-type: op (bits 31-26) | rs (25-21) | rt (20-16) | rd (15-11) | shamt (10-6) | funct (5-0)
– I-type: op (bits 31-26) | rs (25-21) | rt (20-16) | immediate (15-0)
– J-type: op (bits 31-26) | target address (25-0)
• The different fields are:
– op: operation of the instruction
– rs, rt, rd: the source and destination register specifiers
– shamt: shift amount
– funct: selects the variant of the operation in the “op” field
– address / immediate: address offset or immediate value
– target address: target address of the jump instruction
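The field boundaries can be checked with a small decoder. The `addu` encoding in the example (op=0, funct=0x21) is standard MIPS, but verify against your ISA manual before relying on it.

```python
def decode_mips(word):
    """Split a 32-bit MIPS instruction word into its R/I/J-type fields,
    following the format table above."""
    return {
        'op':     (word >> 26) & 0x3F,
        'rs':     (word >> 21) & 0x1F,
        'rt':     (word >> 16) & 0x1F,
        'rd':     (word >> 11) & 0x1F,
        'shamt':  (word >> 6)  & 0x1F,
        'funct':  word & 0x3F,
        'imm16':  word & 0xFFFF,      # I-type view of the low bits
        'target': word & 0x3FFFFFF,   # J-type view of the low bits
    }

# addu $r3, $r1, $r2 is R-type: op=0, rs=1, rt=2, rd=3, funct=0x21
f = decode_mips(0x00221821)
print(f['op'], f['rs'], f['rt'], f['rd'], f['funct'])  # 0 1 2 3 33
```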
Step 1a: The MIPS-lite Subset for Today
• ADD and SUB: addU rd, rs, rt / subU rd, rs, rt
– op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
• OR Immediate: ori rt, rs, imm16
– op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
• LOAD / STORE Word: lw rt, rs, imm16 / sw rt, rs, imm16
– op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
• BRANCH: beq rs, rt, imm16
– op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
Logical Register Transfers
• Register Transfer Logic gives the meaning of the instructions
• All start by fetching the instruction
  op | rs | rt | rd | shamt | funct ← MEM[ PC ]
  op | rs | rt | Imm16 ← MEM[ PC ]
• inst: Register Transfers
  ADDU:  R[rd] ← R[rs] + R[rt];  PC ← PC + 4
  SUBU:  R[rd] ← R[rs] – R[rt];  PC ← PC + 4
  ORi:   R[rt] ← R[rs] | zero_ext(Imm16);  PC ← PC + 4
  LOAD:  R[rt] ← MEM[ R[rs] + sign_ext(Imm16) ];  PC ← PC + 4
  STORE: MEM[ R[rs] + sign_ext(Imm16) ] ← R[rt];  PC ← PC + 4
  BEQ:   if ( R[rs] == R[rt] ) then PC ← PC + 4 + (sign_ext(Imm16) || 00)
         else PC ← PC + 4
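The register transfers can be sketched as a one-instruction behavioral interpreter. The decoded-instruction dictionary format is an assumption made here for illustration, not part of the slides.

```python
def sign_ext(imm16):
    """Sign-extend a 16-bit value to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def step(inst, R, MEM, pc):
    """Execute one MIPS-lite register transfer. R is the register file
    (a 32-entry list), MEM a dict keyed by byte address. Returns the
    next PC."""
    op, npc = inst['op'], pc + 4
    if op == 'addu':
        R[inst['rd']] = (R[inst['rs']] + R[inst['rt']]) & 0xFFFFFFFF
    elif op == 'subu':
        R[inst['rd']] = (R[inst['rs']] - R[inst['rt']]) & 0xFFFFFFFF
    elif op == 'ori':
        R[inst['rt']] = R[inst['rs']] | (inst['imm'] & 0xFFFF)  # zero_ext
    elif op == 'lw':
        R[inst['rt']] = MEM[R[inst['rs']] + sign_ext(inst['imm'])]
    elif op == 'sw':
        MEM[R[inst['rs']] + sign_ext(inst['imm'])] = R[inst['rt']]
    elif op == 'beq' and R[inst['rs']] == R[inst['rt']]:
        npc = pc + 4 + (sign_ext(inst['imm']) << 2)  # Imm16 || 00
    return npc

R = [0] * 32
R[1], R[2] = 5, 7
pc = step({'op': 'addu', 'rd': 3, 'rs': 1, 'rt': 2}, R, {}, 0)
print(R[3], pc)  # 12 4
```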
Step 1: Requirements of the Instruction Set
• Memory
– instruction & data
• Registers (32 x 32)
– read RS
– read RT
– Write RT or RD
• PC
• Extender
• Add and Sub register or extended immediate
• Add 4 or extended immediate to PC
Step 2: Components of the Datapath
• Combinational Elements
• Storage Elements
–Clocking methodology
Combinational Logic Elements (Basic Building Blocks)
• [figure: a 32-bit Adder (inputs A, B, CarryIn; outputs Sum, Carry), a 32-bit MUX (inputs A, B, Select; output Y), and a 32-bit ALU (inputs A, B, OP; output Result)]
Storage Element: Register File
• Register File consists of 32 registers:
– Two 32-bit output busses: busA and busB
– One 32-bit input bus: busW
• Register is selected by:
– RA (number) selects the register to put on busA (data)
– RB (number) selects the register to put on busB (data)
– RW (number) selects the register to be written
via busW (data) when Write Enable is 1
• Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
• RA or RB valid → busA or busB valid after “access time.”
Storage Element: Idealized Memory
• Memory (idealized)
– One input bus: Data In
– One output bus: Data Out
• Memory word is selected by:
– Address selects the word to put on Data Out
– Write Enable = 1: address selects the memory
word to be written via the Data In bus
• Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block:
• Address valid → Data Out valid after “access time.”
Memory Hierarchy (Ch. 7)
• Want a single main memory, both large and fast
• Problem 1: large memories are slow while fast memories are small
• Example: MIPS registers (fast, but few)
• Solution: mix of memories provides illusion of single large, fast memory
• Cache: a small, fast memory; Holds a copy of part
of a larger, slower memory
• Imem, Dmem are really separate cache memories
Digression: Sequential Logic, Clocking
• Combinational circuits: no memory
• Output depends only on the inputs
• Sequential circuits: have memory
• How to ensure memory element is updated neither
too soon, nor too late?
• Recall hardware multiplier
• Product/multiplier register is the writable memory
element
• Gate propagation delay means ALU result takes time to
stabilize; Delay varies with inputs
• Must wait until result stable before write to
product/multiplier register else get garbage
• How to be certain ALU output is stable?
Adding a Clock to a Circuit
• Clock: free running signal with fixed cycle time (clock period)
  [waveform: alternating high (1) and low (0) levels; one period spans a rising edge and a falling edge]
° Clock determines when to write memory element
• level-triggered - store while clock is high (low)
• edge-triggered - store only on clock edge
° We will use negative (falling) edge-triggered methodology
Role of Clock in MIPS Processors
• single-cycle machine: does everything in one clock cycle
• instruction execution = up to 5 steps
• must complete 5th step before cycle ends
  [timing: the five steps execute during the cycle while the datapath settles; the register(s) are written at the falling clock edge]
SR-Latches
• SR-latch with NOR Gates
• S = 1 and R = 1 not allowed
° Symbol for SR-Latch with NOR gates
SR-Latches
• SR-latch with NAND Gates, also known as S´R´ -latch
• S = 0 and R = 0 not allowed
° Symbol for SR-Latch with NAND gates
SR-Latches with Control Input
• SR-latch with NAND Gates and control input C
° C = 0, no change of state;
° C = 1, change is allowed
• If S = 1 and R = 1, Q and Q´ are indeterminate
D-Latches
• D-latch based on SR-Latch with NAND Gates and control input C
° C = 0, no change of state: Q(t + Δt) = Q(t)
° C = 1, change is allowed: Q(t + Δt) = D(t)
• No indeterminate output
Negative Edge-Triggered Master-Slave D-Flip-Flop
° Symbol for D-Flip Flop.
° Arrowhead (>) indicates an edge-triggered sequential circuit
° Bubble means that triggering is effective during the High-to-Low C transition
Clocking Methodology for the Entire Datapath
• [timing diagram: inputs must be stable during the Setup and Hold windows around each active Clk edge; Don’t Care elsewhere]
• Design/synthesis based on pulsed-sequential circuits
– All combinational inputs remain at constant levels and only the clock
signal appears as a pulse with a fixed period Tcc
• All storage elements are clocked by the same clock edge
• Cycle time Tcc = CLK-to-q + longest delay path + Setup time + clock skew
• (CLK-to-q + shortest delay path - clock skew) > hold time
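The two timing constraints above can be written directly as functions; the delay values below are hypothetical, in picoseconds.

```python
def min_cycle_time(clk_to_q, longest_path, setup, skew):
    """Minimum clock period per the constraint above:
    Tcc = CLK-to-q + longest delay path + Setup time + clock skew."""
    return clk_to_q + longest_path + setup + skew

def hold_ok(clk_to_q, shortest_path, skew, hold):
    """Hold-time check: (CLK-to-q + shortest delay path - skew) > hold."""
    return clk_to_q + shortest_path - skew > hold

# hypothetical delays, in picoseconds
print(min_cycle_time(50, 800, 30, 20))  # 900
print(hold_ok(50, 60, 20, 40))          # True
```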
Step 3: Assemble Data Path Meeting Requirements
• Register Transfer Requirements
Datapath “Assembly”
• Instruction Fetch
• Read Operands and Execute Operation
Stages of the Datapath (1/6)
Problem: a single, atomic block which “executes
an instruction” (performs all necessary
operations beginning with fetching the
instruction) would be too bulky and inefficient
Solution: break up the process of “executing an
instruction” into stages, and then connect the
stages to create the whole datapath
 Smaller stages are easier to design
 Easy to optimize (change) one stage without
touching the others
Stages of the Datapath (2/6)
There is a wide variety of MIPS instructions: so
what general steps do they have in common?
Stage 1: instruction fetch
 No matter what the instruction, the 32-bit
instruction word must first be fetched from
memory (the cache-memory hierarchy)
 Also, this is where we increment PC
(that is, PC = PC + 4, to point to the next
instruction: byte addressing so + 4)
Stages of the Datapath (3/6)
Stage 2: Instruction Decode
 upon fetching the instruction, we next gather
data from the fields (decode all necessary
instruction data)
 first, read the Opcode to determine instruction
type and field lengths
 second, read in data from all necessary
registers
-for add, read two registers
-for addi, read one register
-for jal, no reads necessary
Stages of the Datapath (4/6)
°Stage 3: ALU (Arithmetic-Logic Unit)
the real work of most instructions is done here:
arithmetic (+, -, *, /), shifting, logic (&, |),
comparisons (slt)
what about loads and stores?
-lw $t0, 40($t1)
-the address we are accessing in memory = the value in
$t1 + the value 40
-so we do this addition in this stage
Stages of the Datapath (5/6)
°Stage 4: Memory Access
 actually only the load and store instructions do
anything during this stage; the others remain idle
 since these instructions have a unique step, we
need this extra stage to account for them
 as a result of the cache system, this stage is
expected to be just as fast (on average) as the
others
Stages of the Datapath (6/6)
°Stage 5: Register Write
 most instructions write the result of some
computation into a register
 examples: arithmetic, logical, shifts, loads, slt
 what about stores, branches, jumps?
-don’t write anything into a register at the end
-these remain idle during this fifth stage
Generic Steps: Datapath
• [figure: PC → instruction memory (+4 back to PC) → register file (rs, rt, rd) and imm → ALU → data memory → register write]
– 1. Instruction Fetch; 2. Decode/Register Read; 3. Execute; 4. Memory; 5. Reg. Write
Datapath Walkthroughs for
Different Instructions
http://engineering.unt.edu/electrical/public/guturu/datapath.pdf
Datapath Walkthroughs (1/3)
add $r3, $r1, $r2   # r3 = r1+r2
 Stage 1: fetch this instruction, incr. PC ;
 Stage 2: decode to find it’s an add, then read
registers $r1 and $r2 ;
 Stage 3: add the two values retrieved in Stage 2 ;
 Stage 4: idle (nothing to write to memory) ;
 Stage 5: write result of Stage 3 into register $r3 ;
Example: add Instruction
• [figure: “add r3, r1, r2” is fetched and PC+4 computed; registers 1 and 2 are read; the ALU computes reg[1]+reg[2]; data memory is idle; the sum is written to register 3]
Datapath Walkthroughs (2/3)
slti $r3, $r1, 17
Stage 1: fetch this instruction, inc. PC
Stage 2: decode to find it’s an slti, then read
register $r1
Stage 3: compare value retrieved in Stage 2 with
the integer 17
Stage 4: go idle
Stage 5: write the result of Stage 3 in register $r3
Example: slti Instruction
• [figure: “slti r3, r1, 17” is fetched and PC+4 computed; register 1 is read and the imm field supplies 17; the ALU computes reg[1]-17 for the comparison; data memory is idle; the result is written to register 3]
Datapath Walkthroughs (3/3)
sw $r3, 17($r1)
 Stage 1: fetch this instruction, inc. PC
 Stage 2: decode to find it’s a sw, then read
registers $r1 and $r3
 Stage 3: add 17 to value in register $r1
(retrieved in Stage 2)
 Stage 4: write value in register $r3 (retrieved
in Stage 2) into memory address computed
in Stage 3
 Stage 5: go idle (nothing to write into a
register)
Example: sw Instruction
• [figure: “sw r3, 17(r1)” is fetched and PC+4 computed; registers 1 and 3 are read and the imm field supplies 17; the ALU computes reg[1]+17; data memory performs MEM[r1+17] <= r3; no register is written]
Why Five Stages? (1/2)
Could we have a different number of
stages?
Yes, and other architectures do
So why does MIPS have five if instructions
tend to go idle for at least one stage?
There is one instruction that uses all five
stages: the load
Why Five Stages? (2/2)
lw $r3, 17($r1)
 Stage 1: fetch this instruction, inc. PC
 Stage 2: decode to find it’s a lw, then read register
$r1
 Stage 3: add 17 to value in register $r1 (retrieved
in Stage 2)
 Stage 4: read value from memory address
computed in Stage 3
 Stage 5: write value found in Stage 4 into register
$r3
Example: lw Instruction
• [figure: “lw r3, 17(r1)” is fetched and PC+4 computed; register 1 is read and the imm field supplies 17; the ALU computes reg[1]+17; data memory supplies MEM[r1+17]; the loaded value is written to register 3]
Datapath Summary
° The datapath based on data transfers required to perform instructions
° A controller causes the right transfers to happen
  [figure: PC → instruction memory → register file (rd, rs, rt) and imm → ALU → data memory, with a Controller driven by the opcode and funct fields]
Overview of the Instruction Fetch Unit
• The common operations
– Fetch the Instruction: mem[PC]
– Update the program counter:
• Sequential code: PC ← PC + 4
• Branch and Jump: PC ← “something else”
  [figure: Next Address Logic drives the clocked PC register; PC supplies the Address to the Instruction Memory, which produces the 32-bit Instruction Word]
Add & Subtract
• addu rd, rs, rt: R[rd] ← R[rs] op R[rt]
– Ra, Rb, and Rw come from the instruction’s rs, rt, and rd fields
– ALUctr and RegWr: control logic after decoding the instruction
  [figure: R-type fields op | rs | rt | rd | shamt | funct; Rs and Rt select Ra and Rb, Rd selects Rw; busA and busB feed the ALU (controlled by ALUctr); the ALU Result returns on busW and is written when RegWr is asserted at the clock edge]
Register-Register Timing: One Complete Cycle
• [timing diagram: after the clock edge (Clk-to-Q), PC takes its new value; after the instruction memory access time, Rs, Rt, Rd, Op, and Func are valid; after the delay through the control logic, ALUctr and RegWr are valid; after the register file access time, busA and busB are valid; after the ALU delay, busW is valid; the register write occurs at the next clock edge]
Logical Operations With Immediate
• R[rt] ← R[rs] op ZeroExt[ imm16 ]
  [figure: I-type fields op | rs | rt | immediate; the RegDst mux selects Rt (not Rd) as the write register Rw; imm16 is zero-extended (upper 16 bits = 0000000000000000) and the ALUSrc mux selects it instead of busB as the second ALU input]
Load Operations
• R[rt] ← Mem[ R[rs] + SignExt[ imm16 ] ]; Example: lw rt, rs, imm16
  [figure: adds to the previous datapath a sign/zero Extender (controlled by ExtOp) for imm16, a Data Memory (Adr, Data In/Out, WrEn, MemWr), and a W_Src mux that selects between the ALU result and the memory data for busW]
Store Operations
• Mem[ R[rs] + SignExt[ imm16 ] ] ← R[rt]; Example: sw rt, rs, imm16
  [figure: busB (R[rt]) drives Data In of the Data Memory; the ALU computes the address R[rs] + SignExt(imm16); MemWr enables the memory write, and no register is written]
The Branch Instruction
• beq rs, rt, imm16   (I-type: op | rs | rt | immediate)
– mem[PC]: fetch the instruction from memory
– Equal ← (R[rs] == R[rt]): calculate the branch condition
– if (Equal): calculate the next instruction’s address as
• PC ← PC + 4 + ( SignExt(imm16) x 4 )
– else
• PC ← PC + 4
Datapath for Branch Operations
• beq rs, rt, imm16 — the datapath generates the condition (Equal)
  [figure: busA and busB are compared (Equal? → Cond logic → nPC_sel); one adder computes PC + 4, a second adds PC + 4 to SignExt(imm16) || 00; the nPC_sel mux selects the next PC, which is loaded into PC at the clock edge]
Summary: A Single Cycle Datapath
• [figure: the complete single-cycle datapath, combining instruction fetch (PC, the PC+4 and branch adders, nPC_sel mux), the register file (RegDst mux choosing Rt or Rd, RegWr), the Extender (ExtOp), the ALUSrc mux, the ALU (ALUctr), the Data Memory (MemWr), and the MemtoReg mux; control signals: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg]
An Abstract View of the Critical Path
• Register file and ideal memory:
– The CLK input is a factor ONLY during write operations
– During read operations, they behave as combinational logic:
  • Address valid ⇒ Output valid after the "access time."

[Abstract datapath diagram: PC, ideal instruction memory, register file, ALU, and ideal data memory]

Critical Path (Load Operation) =
  PC's Clk-to-Q +
  Instruction Memory's Access Time +
  Register File's Access Time +
  ALU Time to Perform a 32-bit Add +
  Data Memory Access Time +
  Setup Time for Register File Write +
  Clock Skew
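As a rough illustration, the cycle time is just the sum of those delays. The picosecond figures below are invented numbers for the example, not values from the slides.

```python
# Illustrative only: these component delays are made-up numbers.
delays_ps = {
    "PC Clk-to-Q":                30,
    "Instruction memory access": 250,
    "Register file access":      150,
    "ALU 32-bit add":            200,
    "Data memory access":        250,
    "Register file write setup":  50,
    "Clock skew":                 20,
}

# The load exercises every component, so its path bounds the clock period.
cycle_time_ps = sum(delays_ps.values())
max_freq_ghz = 1000 / cycle_time_ps   # a 1000 ps cycle corresponds to 1 GHz

print(f"cycle time = {cycle_time_ps} ps, max clock ~ {max_freq_ghz:.2f} GHz")
```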
An Abstract View of the Implementation
[Abstract implementation diagram: the datapath (Next Address logic, PC, ideal instruction memory, register file, ALU, ideal data memory) plus a Control block that decodes the instruction and drives the control signals, receiving condition bits back from the datapath]
Steps 4 & 5: Implement the control
In The Next Section
Summary: MIPS-lite Implementations
• Single-cycle: uses a single l-o-n-g clock cycle for each instruction executed
  – Easy to understand, but not practical
  – Slower than an implementation that allows instructions to take different numbers of clock cycles
    • Fast instructions (beq): fewer clock cycles
    • Slow instructions (mult?): more cycles
• Multicycle and pipelined implementations come later
• Next time: finish the single-cycle implementation
Summary
• 5 steps to design a processor
– 1. Analyze instruction set => datapath requirements
– 2. Select set of datapath components & establish clock methodology
– 3. Assemble datapath meeting the requirements
– 4. Analyze the implementation of each instruction to determine the setting of the control points that effect the register transfer.
– 5. Assemble the control logic
• MIPS makes it easier
  – Instructions same size
  – Source registers always in same place
  – Immediates same size, location
  – Operations always on registers/immediates
• Single-cycle datapath: CPI = 1, clock cycle time → long
• Next time: implementing control
Processor Design - 2
Adapted from notes by David A. Patterson, John Kubiatowicz, and others.
Copyright © 2001
University of California at Berkeley
Summary: A Single Cycle Datapath
[Complete single-cycle datapath diagram, repeated from the previous chapter: instruction fetch unit, register file, extender, ALU, and data memory with all control signals]

Chapter 5.2 - Processor Design 2
An Abstract View of the Critical Path
• Register file and ideal memory:
  – The CLK input is a factor ONLY during write operations
  – During read operations, they behave as combinational logic:
    • Address valid => Output valid after the "access time."

[Abstract datapath diagram: PC, ideal instruction memory, register file, ALU, and ideal data memory]

Critical Path (Load Operation) =
  PC's Clk-to-Q +
  Instruction Memory's Access Time +
  Register File's Access Time +
  ALU Time to Perform a 32-bit Add +
  Data Memory Access Time +
  Setup Time for Register File Write +
  Clock Skew
The Big Picture: Where are We Now?
• The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
• Next Topic: Designing the Control for the Single-Cycle Datapath
An Abstract View of the Implementation
[Abstract implementation diagram: the datapath plus a Control block; the control decodes the instruction and drives the control signals, and the datapath returns condition bits]
Recap: A Single Cycle Datapath
• Rs, Rt, Rd and Imm16 are hardwired into the datapath from the Fetch Unit
• We have everything except the control signals (underlined)
  – Today's lecture will show you how to generate them

[Single-cycle datapath diagram with the control signals nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg highlighted]
Recap: Meaning of the Control Signals

• nPC_sel:
  – 0 ⇒ PC ← PC + 4
  – 1 ⇒ PC ← PC + 4 + (SignExt(Im16) || 00)
• Later in lecture: higher-level connection between mux and branch condition

[Fetch-unit diagram: PC, two adders, nPC_sel mux, instruction memory]
Recap: Meaning of the Control Signals
° ExtOp:    "zero", "sign"
° ALUsrc:   0 ⇒ regB; 1 ⇒ immed
° ALUctr:   "add", "sub", "or"
° MemWr:    1 ⇒ write memory
° MemtoReg: 0 ⇒ ALU; 1 ⇒ Mem
° RegDst:   0 ⇒ "rt"; 1 ⇒ "rd"
° RegWr:    1 ⇒ write register

[Datapath diagram with each control signal labeled at the mux or write-enable it drives]
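The extender's two modes can be sketched as a small Python helper (hypothetical, for illustration only):

```python
def extend(imm16, ext_op):
    """16 -> 32-bit extender: ext_op selects "zero" or "sign" extension."""
    imm16 &= 0xFFFF
    if ext_op == "zero":
        return imm16                    # upper 16 bits forced to 0
    if imm16 & 0x8000:                  # sign bit set: replicate it upward
        return imm16 | 0xFFFF0000
    return imm16

assert extend(0x8000, "sign") == 0xFFFF8000   # -32768 keeps its sign
assert extend(0x8000, "zero") == 0x00008000   # ori treats imm as unsigned
```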
The add Instruction
R-format:  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

• add rd, rs, rt
  – Instruction ← mem[PC]       Fetch the instruction from memory
  – R[rd] ← R[rs] + R[rt]       The actual operation
  – PC ← PC + 4                 Calculate the next instruction's address
Fetch Unit at the Beginning of add
• Fetch the instruction from instruction memory: Instruction ← mem[PC]
  (This is the same for all instructions)

[Fetch-unit diagram: instruction memory addressed by the PC; the nPC_sel mux and adders prepare the next PC]
The Single Cycle Datapath during add
R-format:  op | rs | rt | rd | shamt | funct

• R[rd] ← R[rs] + R[rt]

Control settings:  nPC_sel = +4, RegDst = 1, RegWr = 1, ALUSrc = 0, ExtOp = x, ALUctr = Add, MemWr = 0, MemtoReg = 0

[Single-cycle datapath diagram annotated with these signal values]
Instruction Fetch Unit at the End of add
• PC ← PC + 4
  – This is the same for all instructions except: Branch and Jump

[Fetch-unit diagram: the nPC_sel mux selects the PC + 4 adder output]
The Single Cycle Datapath during Or Immediate
I-format:  op | rs | rt | immediate

• R[rt] ← R[rs] or ZeroExt(Imm16)

Fill in the control settings:  nPC_sel = ?, RegDst = ?, RegWr = ?, ALUSrc = ?, ExtOp = ?, ALUctr = ?, MemWr = ?, MemtoReg = ?

[Single-cycle datapath diagram with the signal values left blank]
The Single Cycle Datapath during Or Immediate
I-format:  op | rs | rt | immediate

• R[rt] ← R[rs] or ZeroExt(Imm16)

Control settings:  nPC_sel = +4, RegDst = 0, RegWr = 1, ALUSrc = 1, ExtOp = 0 (zero), ALUctr = Or, MemWr = 0, MemtoReg = 0

[Single-cycle datapath diagram annotated with these signal values]
The Single Cycle Datapath during Load
I-format:  op | rs | rt | immediate

• R[rt] ← Data Memory {R[rs] + SignExt(imm16)}

Control settings:  nPC_sel = +4, RegDst = 0, RegWr = 1, ALUSrc = 1, ExtOp = 1 (sign), ALUctr = Add, MemWr = 0, MemtoReg = 1

[Single-cycle datapath diagram annotated with these signal values]
The Single Cycle Datapath during Store
I-format:  op | rs | rt | immediate

• Data Memory {R[rs] + SignExt(imm16)} ← R[rt]

Fill in the control settings:  nPC_sel = ?, RegDst = ?, RegWr = ?, ALUSrc = ?, ExtOp = ?, ALUctr = ?, MemWr = ?, MemtoReg = ?

[Single-cycle datapath diagram with the signal values left blank]
The Single Cycle Datapath during Store
I-format:  op | rs | rt | immediate

• Data Memory {R[rs] + SignExt(imm16)} ← R[rt]

Control settings:  nPC_sel = +4, RegDst = x, RegWr = 0, ALUSrc = 1, ExtOp = 1 (sign), ALUctr = Add, MemWr = 1, MemtoReg = x

[Single-cycle datapath diagram annotated with these signal values]
Single Cycle Datapath during Branch
I-format:  op | rs | rt | immediate

• if (R[rs] – R[rt] == 0) then Zero ← 1 else Zero ← 0

Control settings:  nPC_sel = "Br", RegDst = x, RegWr = 0, ALUSrc = 0, ExtOp = x, ALUctr = Sub, MemWr = 0, MemtoReg = x

[Single-cycle datapath diagram annotated with these signal values; the ALU's Zero output feeds the fetch unit]
Instruction Fetch Unit at the End of Branch
I-format:  op | rs | rt | immediate

• if (Zero == 1) then PC ← PC + 4 + (SignExt(imm16) × 4)  else PC ← PC + 4

° What is the encoding of nPC_sel?
  • Direct MUX select?
  • Branch / not branch, combined with Zero
° Let's choose the second option

    nPC_sel   zero?   MUX select
       0        x         0
       1        0         0
       1        1         1

[Fetch-unit diagram: the branch-target adder output is selected only when nPC_sel AND Zero]
Step 4: Given Datapath: RTL ⇒ Control

[Control block diagram: the fetch unit supplies Instruction<31:0>; the fields Op, Fun, Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15> are extracted, and the Control block generates nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg for the DATA PATH, taking Zero back from it]
A Summary of Control Signals
inst    Register Transfer
ADD     R[rd] ← R[rs] + R[rt];  PC ← PC + 4
        ALUsrc = RegB, ALUctr = "add", RegDst = rd, RegWr, nPC_sel = "+4"
SUB     R[rd] ← R[rs] – R[rt];  PC ← PC + 4
        ALUsrc = RegB, ALUctr = "sub", RegDst = rd, RegWr, nPC_sel = "+4"
ORi     R[rt] ← R[rs] or zero_ext(Imm16);  PC ← PC + 4
        ALUsrc = Im, Extop = "Z", ALUctr = "or", RegDst = rt, RegWr, nPC_sel = "+4"
LOAD    R[rt] ← MEM[R[rs] + sign_ext(Imm16)];  PC ← PC + 4
        ALUsrc = Im, Extop = "Sn", ALUctr = "add", MemtoReg, RegDst = rt, RegWr, nPC_sel = "+4"
STORE   MEM[R[rs] + sign_ext(Imm16)] ← R[rt];  PC ← PC + 4
        ALUsrc = Im, Extop = "Sn", ALUctr = "add", MemWr, nPC_sel = "+4"
BEQ     if (R[rs] == R[rt]) then PC ← PC + 4 + (sign_ext(Imm16) || 00) else PC ← PC + 4
        nPC_sel = "Br", ALUctr = "sub"
A Summary of Control Signals
See Appendix A.  (x = We Don't Care :-)

              add      sub      ori      lw       sw       beq      jump
func        10 0000  10 0010
op          00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst         1        1        0        0        x        x        x
ALUSrc         0        0        1        1        1        0        x
MemtoReg       0        0        0        1        x        x        x
RegWrite       1        1        1        1        0        0        0
MemWrite       0        0        0        0        1        0        0
nPCsel         0        0        0        0        0        1        0
Jump           0        0        0        0        0        0        1
ExtOp          x        x        0        1        1        x        x
ALUctr<2:0>   Add   Subtract    Or      Add      Add   Subtract   xxx

Instruction formats:
R-type:  op | rs | rt | rd | shamt | funct      (add, sub)
I-type:  op | rs | rt | immediate               (ori, lw, sw, beq)
J-type:  op | target address                    (jump)
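The table above is essentially a ROM indexed by instruction. A minimal Python sketch (the `decode` helper is hypothetical, for illustration; don't-cares are kept as the string "x"):

```python
# Control word per instruction, transcribed from the table above.
MAIN_CONTROL = {
    #        RegDst ALUSrc MemtoReg RegWrite MemWrite nPCsel Jump ExtOp ALUctr
    "add":  (1,     0,     0,       1,       0,       0,     0,   "x", "Add"),
    "sub":  (1,     0,     0,       1,       0,       0,     0,   "x", "Subtract"),
    "ori":  (0,     1,     0,       1,       0,       0,     0,   0,   "Or"),
    "lw":   (0,     1,     1,       1,       0,       0,     0,   1,   "Add"),
    "sw":   ("x",   1,     "x",     0,       1,       0,     0,   1,   "Add"),
    "beq":  ("x",   0,     "x",     0,       0,       1,     0,   "x", "Subtract"),
    "jump": ("x",   "x",   "x",     0,       0,       0,     1,   "x", "xxx"),
}

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemWrite", "nPCsel", "Jump", "ExtOp", "ALUctr")

def decode(inst):
    """Look up the control word for one instruction (a ROM-style decoder)."""
    return dict(zip(SIGNALS, MAIN_CONTROL[inst]))

print(decode("lw"))
```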
The Concept of Local Decoding
op          00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
            R-type     ori      lw       sw      beq      jump
RegDst         1        0        0        x        x        x
ALUSrc         0        1        1        1        0        x
MemtoReg       0        0        1        x        x        x
RegWrite       1        1        1        0        0        0
MemWrite       0        0        0        1        0        0
Branch         0        0        0        0        1        0
Jump           0        0        0        0        0        1
ExtOp          x        0        1        1        x        x
ALUop<N:0>  "R-type"    Or      Add      Add   Subtract   xxx

[Two-level control diagram: the Main Control takes op (6 bits) and produces ALUop (N bits); the local ALU Control combines ALUop with func (6 bits) to produce ALUctr (3 bits) for the ALU]
The Encoding of ALUop
[Two-level control diagram: Main Control (op) → ALUop; ALU Control (Local) combines ALUop with func → ALUctr (3 bits)]

• In this exercise, ALUop has to be 2 bits wide to represent:
  – (1) "R-type" instructions
  – "I-type" instructions that require the ALU to perform:
    • (2) Or, (3) Add, and (4) Subtract
• To implement the full MIPS ISA, ALUop has to be 3 bits to represent:
  – (1) "R-type" instructions
  – "I-type" instructions that require the ALU to perform:
    • (2) Or, (3) Add, (4) Subtract, and (5) And (example: andi)

                  R-type    ori   lw    sw    beq       jump
ALUop (Symbolic) "R-type"   Or    Add   Add   Subtract  xxx
ALUop<2:0>        1 00     0 10  0 00  0 00  0 01       xxx
The Decoding of the “func” Field
[Two-level control diagram: Main Control (op) → ALUop; ALU Control (Local) combines ALUop with func → ALUctr → ALU]

                  R-type    ori   lw    sw    beq       jump
ALUop (Symbolic) "R-type"   Or    Add   Add   Subtract  xxx
ALUop<2:0>        1 00     0 10  0 00  0 00  0 01       xxx

R-format:  op | rs | rt | rd | shamt | funct

P. 286 text:

funct<5:0>   Instruction Operation        ALUctr<2:0>   ALU Operation
10 0000      add                          000           And
10 0010      subtract                     001           Or
10 0100      and                          010           Add
10 0101      or                           110           Subtract
10 1010      set-on-less-than             111           Set-on-less-than

(left pair: encoding of the funct field; right pair: encoding of ALUctr)
The Truth Table for ALUctr
ALUop (Symbolic):  R-type = "R-type" (1 00), ori = Or (0 10), lw = Add (0 00), sw = Add (0 00), beq = Subtract (0 01)
funct<3:0>:  0000 = add, 0010 = subtract, 0100 = and, 0101 = or, 1010 = set-on-less-than

      ALUop                  func                ALU          ALUctr
bit<2> bit<1> bit<0>  bit<3> bit<2> bit<1> bit<0>  Operation  bit<2> bit<1> bit<0>
  0      0      0       x      x      x      x     Add          0      1      0
  0      x      1       x      x      x      x     Subtract     1      1      0
  0      1      x       x      x      x      x     Or           0      0      1
  1      x      x       0      0      0      0     Add          0      1      0
  1      x      x       0      0      1      0     Subtract     1      1      0
  1      x      x       0      1      0      0     And          0      0      0
  1      x      x       0      1      0      1     Or           0      0      1
  1      x      x       1      0      1      0     Set on <     1      1      1
The Logic Equation for ALUctr<2>
      ALUop                  func
bit<2> bit<1> bit<0>  bit<3> bit<2> bit<1> bit<0>   ALUctr<2>
  0      x      1       x      x      x      x         1
  1      x      x       0      0      1      0         1
  1      x      x       1      0      1      0         1

This makes func<3> a don't care.

• ALUctr<2> = !ALUop<2> & ALUop<0>
            + ALUop<2> & !func<2> & func<1> & !func<0>
The Logic Equation for ALUctr<1>
      ALUop                  func
bit<2> bit<1> bit<0>  bit<3> bit<2> bit<1> bit<0>   ALUctr<1>
  0      0      0       x      x      x      x         1
  0      x      1       x      x      x      x         1
  1      x      x       0      0      0      0         1
  1      x      x       0      0      1      0         1
  1      x      x       1      0      1      0         1

• ALUctr<1> = !ALUop<2> & !ALUop<1>
            + ALUop<2> & !func<2> & !func<0>
The Logic Equation for ALUctr<0>
      ALUop                  func
bit<2> bit<1> bit<0>  bit<3> bit<2> bit<1> bit<0>   ALUctr<0>
  0      1      x       x      x      x      x         1
  1      x      x       0      1      0      1         1
  1      x      x       1      0      1      0         1

• ALUctr<0> = !ALUop<2> & ALUop<1>
            + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
            + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>
The ALU Control Block
[ALU control block: inputs ALUop (N bits) and func (6 bits), output ALUctr (3 bits)]

• ALUctr<2> = !ALUop<2> & ALUop<0>
            + ALUop<2> & !func<2> & func<1> & !func<0>
• ALUctr<1> = !ALUop<2> & !ALUop<1>
            + ALUop<2> & !func<2> & !func<0>
• ALUctr<0> = !ALUop<2> & ALUop<1>
            + ALUop<2> & !func<3> & func<2> & !func<1> & func<0>
            + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>
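The sum-of-products form can be checked against the truth table for ALUctr with a short Python sketch (arguments are plain 0/1 bits; this is an illustration, not production RTL):

```python
# Local ALU control implemented directly from the truth table above.
def alu_ctr(a2, a1, a0, f3, f2, f1, f0):
    """Return ALUctr<2:0> from ALUop bits a2..a0 and func bits f3..f0."""
    c2 = (not a2 and a0) or (a2 and not f2 and f1 and not f0)
    c1 = (not a2 and not a1) or (a2 and not f2 and not f0)
    c0 = ((not a2 and a1)
          or (a2 and not f3 and f2 and not f1 and f0)
          or (a2 and f3 and not f2 and f1 and not f0))
    return int(bool(c2)), int(bool(c1)), int(bool(c0))

# I-type rows: ALUop alone decides (func is a don't-care, shown as 0s here).
assert alu_ctr(0, 0, 0, 0, 0, 0, 0) == (0, 1, 0)   # lw/sw -> Add
assert alu_ctr(0, 0, 1, 0, 0, 0, 0) == (1, 1, 0)   # beq   -> Subtract
assert alu_ctr(0, 1, 0, 0, 0, 0, 0) == (0, 0, 1)   # ori   -> Or
# R-type rows: ALUop<2> = 1 and func<3:0> decides.
assert alu_ctr(1, 0, 0, 0, 0, 0, 0) == (0, 1, 0)   # add  -> Add
assert alu_ctr(1, 0, 0, 0, 0, 1, 0) == (1, 1, 0)   # sub  -> Subtract
assert alu_ctr(1, 0, 0, 0, 1, 0, 0) == (0, 0, 0)   # and  -> And
assert alu_ctr(1, 0, 0, 0, 1, 0, 1) == (0, 0, 1)   # or   -> Or
assert alu_ctr(1, 0, 0, 1, 0, 1, 0) == (1, 1, 1)   # slt  -> Set-on-less-than
```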
Step 5: Logic for Each Control Signal
• nPC_sel  <= if (OP == BEQ) then "Br" else "+4"
• ALUsrc   <= if (OP == "Rtype") then "regB" else "immed"
• ALUctr   <= if (OP == "Rtype") then funct
              elseif (OP == ORi) then "OR"
              elseif (OP == BEQ) then "sub"
              else "add"
• ExtOp    <= _____________
• MemWr    <= _____________
• MemtoReg <= _____________
• RegWr    <= _____________
• RegDst   <= _____________
Step 5: Logic for each control signal
• nPC_sel  <= if (OP == BEQ) then "Br" else "+4"
• ALUsrc   <= if (OP == "Rtype") then "regB" else "immed"
• ALUctr   <= if (OP == "Rtype") then funct
              elseif (OP == ORi) then "OR"
              elseif (OP == BEQ) then "sub"
              else "add"
• ExtOp    <= if (OP == ORi) then "zero" else "sign"
• MemWr    <= (OP == Store)
• MemtoReg <= (OP == Load)
• RegWr    <= if ((OP == Store) || (OP == BEQ)) then 0 else 1
• RegDst   <= if ((OP == Load) || (OP == ORi)) then 0 else 1
The “Truth Table” for the Main Control
[Two-level control diagram: Main Control (op) → RegDst, ALUSrc, …, ALUop; ALU Control (Local) combines ALUop with func → ALUctr]

op               00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
                 R-type     ori      lw       sw      beq      jump
RegDst              1        0        0        x        x        x
ALUSrc              0        1        1        1        0        x
MemtoReg            0        0        1        x        x        x
RegWrite            1        1        1        0        0        0
MemWrite            0        0        0        1        0        0
nPC_sel             0        0        0        0        1        0
Jump                0        0        0        0        0        1
ExtOp               x        0        1        1        x        x
ALUop (Symbolic) "R-type"    Or      Add      Add   Subtract   xxx
ALUop<2>            1        0        0        0        0        x
ALUop<1>            0        1        0        0        0        x
ALUop<0>            0        0        0        0        1        x
A Real MIPS Datapath (CNS T0)
Summary: A Single Cycle Processor
[Complete single-cycle processor diagram: the single-cycle datapath plus the Main Control (driven by Instr<31:26>) and the local ALU Control (driven by Instr<5:0> and ALUop), generating nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg]
Recap: An Abstract View of the Critical Path (Load)
• Register file and ideal memory:
  – The CLK input is a factor ONLY during write operations
  – During read operations, they behave as combinational logic:
    • Address valid ⇒ Output valid after the "access time."

[Abstract datapath diagram: PC, ideal instruction memory, register file, ALU, and ideal data memory]

Critical Path (Load Operation) =
  PC's Clk-to-Q +
  Instruction Memory's Access Time +
  Register File's Access Time +
  ALU Time to Perform a 32-bit Add +
  Data Memory Access Time +
  Setup Time for Register File Write +
  Clock Skew
Worst Case Timing (Load)
[Worst-case timing diagram for the load: at the clock edge, the PC changes after Clk-to-Q; Rs, Rt, Rd, Op, and Func become valid after the instruction memory access time; ALUctr, ExtOp, ALUSrc, MemtoReg, and RegWr settle after the delay through the control logic; busA and busB become valid after the register file access time; the address appears after the delay through the extender & mux plus the ALU delay; busW becomes valid after the data memory access time; the register write occurs at the end of the cycle]
Drawback of this Single Cycle Processor
• Long cycle time:
  – The cycle time must be long enough for the load instruction:
      PC's Clk-to-Q +
      Instruction Memory Access Time +
      Register File Access Time +
      ALU Delay (address calculation) +
      Data Memory Access Time +
      Register File Setup Time +
      Clock Skew
• The cycle time for load is much longer than needed for all other instructions
Summary
° Single-cycle datapath: CPI = 1, CCT → long
° 5 steps to design a processor
  • 1. Analyze instruction set => datapath requirements
  • 2. Select set of datapath components & establish clock methodology
  • 3. Assemble datapath meeting the requirements
  • 4. Analyze the implementation of each instruction to determine the setting of the control points that effect the register transfer.
  • 5. Assemble the control logic
° Control is the hard part
° MIPS makes control easier
  • Instructions same size
  • Source registers always in same place
  • Immediates same size, location
  • Operations always on registers/immediates

[Five classic components: Processor (Control + Datapath), Memory, Input, Output]
Designing a Multi-cycle Processor
Adapted from the lecture notes of John Kubiatowicz (UCB)
Recap: A Single Cycle Datapath
[Single-cycle datapath diagram, repeated for reference: instruction fetch unit, register file, extender, ALU, data memory, and the control signals nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, Equal, MemWr, MemtoReg]
Recap: The “Truth Table” for the Main Control
[Two-level control diagram: Main Control (op) → ALUop; ALU Control (Local) combines ALUop with func → ALUctr]

op               00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
                 R-type     ori      lw       sw      beq      jump
RegDst              1        0        0        x        x        x
ALUSrc              0        1        1        1        0        x
MemtoReg            0        0        1        x        x        x
RegWrite            1        1        1        0        0        0
MemWrite            0        0        0        1        0        0
Branch              0        0        0        0        1        0
Jump                0        0        0        0        0        1
ExtOp               x        0        1        1        x        x
ALUop (Symbolic) "R-type"    Or      Add      Add   Subtract   xxx
ALUop<2>            1        0        0        0        0        x
ALUop<1>            0        1        0        0        0        x
ALUop<0>            0        0        0        0        1        x
The Big Picture: Where are We Now?
• The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
• Today's Topic: Designing the Datapath for the Multiple Clock Cycle Datapath

Abstract View of our single cycle processor
• Looks like an FSM with the PC as state

[Abstract view: Instruction Fetch → Register Fetch → ALU/Exec → Mem Access → Result Store, with the Main Control and ALU control driving nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemtoReg and reading Equal]
What’s wrong with our CPI=1 processor?
[Per-instruction critical paths:
  Arithmetic & Logical:  PC → Inst Memory → Reg File → mux → ALU → mux → setup
  Load:                  PC → Inst Memory → mux → Reg File → ALU → Data Mem → mux → setup
  Store:                 PC → Inst Memory → Reg File → ALU → Data Mem
  Branch:                PC → Inst Memory → Reg File → cmp → mux]

• Long Cycle Time
• All instructions take as much time as the slowest
• Real memory is not as nice as our idealized memory
  – cannot always get the job done in one (short) cycle
Basic Limits on Cycle Time
• Next address logic
  – PC <= branch ? PC + offset : PC + 4
• Instruction Fetch
  – InstructionReg <= Mem[PC]
• Register Access
  – A <= R[rs]
• ALU operation
  – R <= A + B

[Stage diagram: Instruction Fetch → Operand Fetch → Exec → Mem Access → Result Store, with control signals nPC_sel, ALUctr, ALUSrc, ExtOp, RegDst, RegWr, MemRd, MemWr]

Partitioning the CPI=1 Datapath
• Add registers between smallest steps
• Place enables on all registers
• Critical Path ?
Example Multicycle Datapath

[Multicycle datapath diagram: registers IR, A, B, S, M, and E inserted between the stages — Instruction Fetch, Operand Fetch (Reg File), Exec (ALU + extender), Mem Access (Data Mem), Result Store (Reg File write) — with control signals nPC_sel, RegDst, RegWr, MemToReg, MemRd, MemWr, ALUctr, ALUSrc, ExtOp, Equal]
Recall: Step-by-step Processor Design
Step 1: ISA => Logical Register Transfers
Step 2: Components of the Datapath
Step 3: RTL + Components => Datapath
Step 4: Datapath + Logical RTs => Physical RTs
Step 5: Physical RTs => Control
Step 4: R-type (add, sub, . . .)
• Logical Register Transfer
    ADDU    R[rd] <– R[rs] + R[rt];  PC <– PC + 4
• Physical Register Transfers
    ADDU    IR <– MEM[pc]
            A <– R[rs];  B <– R[rt]
            S <– A + B
            R[rd] <– S;  PC <– PC + 4

[Multicycle datapath with the registers used by each step highlighted]
Step 4: Logical immed
• Logical Register Transfer
    ORI     R[rt] <– R[rs] OR ZExt(Im16);  PC <– PC + 4
• Physical Register Transfers
    ORI     IR <– MEM[pc]
            A <– R[rs];  B <– R[rt]
            S <– A or ZExt(Im16)
            R[rt] <– S;  PC <– PC + 4

[Multicycle datapath with the registers used by each step highlighted]
Step 4 : Load
• Logical Register Transfer
    LW      R[rt] <– MEM[R[rs] + SExt(Im16)];  PC <– PC + 4
• Physical Register Transfers
    LW      IR <– MEM[pc]
            A <– R[rs];  B <– R[rt]
            S <– A + SExt(Im16)
            M <– MEM[S]
            R[rt] <– M;  PC <– PC + 4

[Multicycle datapath with the registers used by each step highlighted]
Step 4 : Store
• Logical Register Transfer
    SW      MEM[R[rs] + SExt(Im16)] <– R[rt];  PC <– PC + 4
• Physical Register Transfers
    SW      IR <– MEM[pc]
            A <– R[rs];  B <– R[rt]
            S <– A + SExt(Im16)
            MEM[S] <– B;  PC <– PC + 4

[Multicycle datapath with the registers used by each step highlighted]
Step 4 : Branch
• Logical Register Transfer
    BEQ     if R[rs] == R[rt]
            then PC <= PC + 4 + (SExt(Im16) || 00)
            else PC <= PC + 4
• Physical Register Transfers
    BEQ     IR <– MEM[pc]
            E <– (R[rs] == R[rt])
            if !E then PC <– PC + 4
            else PC <– PC + 4 + (SExt(Im16) || 00)

[Multicycle datapath with the registers used by each step highlighted]
Multi-Cycle Data Path
http://www.ee.unt.edu/public/guturu/MultiCycleDesign.docx
www.ee.unt.edu/public/guturu/multicyclestatemachine.pdf
Alternative data-path (book): Multiple Cycle Datapath
• Minimizes hardware: 1 memory, 1 adder

[Book's multicycle datapath diagram: a single ideal memory (instructions and data, selected by IorD), Instruction Reg, Reg File, a single ALU with ALUSelA/ALUSelB input muxes, ALU Out and Target registers, and control signals PCWr, PCWrCond, IRWr, BrWr, RegWr, RegDst, MemWr, MemtoReg, ExtOp, ALUOp, PCSrc, Zero]
Our Control Model
• State specifies control points for Register Transfer
• Transfer occurs upon exiting state (same falling edge)

[FSM diagram: inputs (conditions) feed the Next State Logic, which updates the Control State; each State X performs a Register Transfer whose Control Points may depend on the input; the Output Logic produces the outputs (control points)]
Step 4 ⇒ Control Specification for multicycle proc

"instruction fetch":        IR <= MEM[PC]
"decode / operand fetch":   A <= R[rs];  B <= R[rt]

             R-type         ORi            LW             SW             BEQ
Execute      S <= A fun B   S <= A or ZX   S <= A + SX    S <= A + SX    PC <= Next(PC, Equal)
Memory                                     M <= MEM[S]    MEM[S] <= B
                                                          PC <= PC + 4
Write-back   R[rd] <= S     R[rt] <= S     R[rt] <= M
             PC <= PC + 4   PC <= PC + 4   PC <= PC + 4
Traditional FSM Controller
[FSM controller diagram: a Truth Table maps (state, op, Equal) — 11 input bits — to the next state (4 bits) and the datapath control points; the State register feeds back, and op is 6 bits]

Step 5 ⇒ (datapath + state diagram → control)
• Translate RTs into control points
• Assign states
• Then go build the controller
Mapping RTs to Control Points
"instruction fetch":  IR <= MEM[PC]              imem_rd, IRen
"decode":             A <= R[rs];  B <= R[rt]    Aen, Ben, Een

             R-type         ORi            LW             SW             BEQ
Execute      S <= A fun B   S <= A or ZX   S <= A + SX    S <= A + SX    PC <= Next(PC, Equal)
             (ALUfun, Sen)
Memory                                     M <= MEM[S]    MEM[S] <= B
                                                          PC <= PC + 4
Write-back   R[rd] <= S     R[rt] <= S     R[rt] <= M     (RegDst, RegWr, PCen)
             PC <= PC + 4   PC <= PC + 4   PC <= PC + 4
Assigning States

"instruction fetch":  IR <= MEM[PC]              state 0000
"decode":             A <= R[rs];  B <= R[rt]    state 0001

             R-type         ORi            LW             SW             BEQ
Execute      S <= A fun B   S <= A or ZX   S <= A + SX    S <= A + SX    PC <= Next(PC)
             0100           0110           1000           1011           0011
Memory                                     M <= MEM[S]    MEM[S] <= B
                                           1001           PC <= PC + 4
                                                          1100
Write-back   R[rd] <= S     R[rt] <= S     R[rt] <= M
             PC <= PC + 4   PC <= PC + 4   PC <= PC + 4
             0101           0111           1010
(Mostly) Detailed Control Specification (missing ⇒ 0)

State  Op field  Eq  Next state
0000   ??????    ?   0001        (fetch: IR <= MEM[PC])
0001   BEQ       x   0011        (decode: A, B, E enables — all the same in a Moore machine)
0001   R-type    x   0100
0001   ORI       x   0110
0001   LW        x   1000
0001   SW        x   1011
0011   xxxxxx    0   0000        (BEQ: PC <= PC + 4)
0011   xxxxxx    1   0000        (BEQ: PC <= branch target)
0100   xxxxxx    x   0101        (R: S <= A fun B)
0101   xxxxxx    x   0000        (R: R[rd] <= S; PC <= PC + 4)
0110   xxxxxx    x   0111        (ORi: S <= A or ZX)
0111   xxxxxx    x   0000        (ORi: R[rt] <= S; PC <= PC + 4)
1000   xxxxxx    x   1001        (LW: S <= A + SX)
1001   xxxxxx    x   1010        (LW: M <= MEM[S])
1010   xxxxxx    x   0000        (LW: R[rt] <= M; PC <= PC + 4)
1011   xxxxxx    x   1100        (SW: S <= A + SX)
1100   xxxxxx    x   0000        (SW: MEM[S] <= B; PC <= PC + 4)

[The remaining columns — IR and PC enables; register enables A, B, E; extender select; ALU function; S, R, W, M, M-to-R, Wr, Dst — carry the per-state control-point settings]
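The next-state portion of the specification can be sketched as a Python table (hypothetical helper names; the states are the 4-bit assignments from the diagram). Counting states from fetch back to fetch reproduces the per-instruction cycle counts used for the CPI calculation.

```python
FETCH, DECODE = 0b0000, 0b0001

# From decode, the opcode selects the execute state.
EXEC_STATE = {"R-type": 0b0100, "ORi": 0b0110, "LW": 0b1000,
              "SW": 0b1011, "BEQ": 0b0011}

# Remaining transitions; any state not listed returns to fetch.
# (BEQ's Equal input changes the PC source, not the next state.)
NEXT = {
    0b0100: 0b0101,  # R-type: Exec -> write-back
    0b0110: 0b0111,  # ORi:    Exec -> write-back
    0b1000: 0b1001,  # LW:     Exec -> Mem read
    0b1001: 0b1010,  # LW:     Mem  -> write-back
    0b1011: 0b1100,  # SW:     Exec -> Mem write
}

def next_state(state, op):
    if state == FETCH:
        return DECODE
    if state == DECODE:
        return EXEC_STATE[op]
    return NEXT.get(state, FETCH)

def cycles(op):
    """Count cycles from fetch back to fetch for one instruction."""
    n, s = 1, next_state(FETCH, op)
    while s != FETCH:
        n, s = n + 1, next_state(s, op)
    return n

assert [cycles(o) for o in ("R-type", "ORi", "LW", "SW", "BEQ")] == [4, 4, 5, 4, 3]
```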
Performance Evaluation
• What is the average CPI?
  – the state diagram gives the CPI for each instruction type
  – the workload gives the frequency of each type

Type          CPIi for type   Frequency   CPIi × freqi
Arith/Logic        4             40%         1.6
Load               5             30%         1.5
Store              4             10%         0.4
Branch             3             20%         0.6

Average CPI: 4.1
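The frequency-weighted sum is easy to check:

```python
# Average CPI as the frequency-weighted sum from the table above.
mix = {"Arith/Logic": (4, 0.40), "Load": (5, 0.30),
       "Store": (4, 0.10), "Branch": (3, 0.20)}

avg_cpi = sum(cpi * freq for cpi, freq in mix.values())
print(round(avg_cpi, 2))   # 4.1
```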
Introduction to Pipelining
Adapted from the lecture notes of Dr. John Kubiatowicz (UC
Berkeley)
Pipelining is Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
Sequential Laundry
[Gantt chart, 6 PM to midnight: tasks A–D each run wash (30), dry (40), fold (20) back-to-back, with each load starting only after the previous one finishes completely]

• Sequential laundry takes 6 hours for 4 loads
Pipelined Laundry: Start work ASAP
[Gantt chart, 6 PM to 9:30 PM: task A starts washing at 6:00; each later load starts as soon as the washer frees up, so the dryer (the slowest stage, 40 min) sets the rhythm]

• Pipelined laundry takes 3.5 hours for 4 loads
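Both schedules can be checked with a minimal Python sketch (greedy per-load scheduling with one washer, one dryer, one folder; variable names are ours, for illustration):

```python
durations = [30, 40, 20]          # wash, dry, fold (minutes)
loads = 4

# Sequential: nothing overlaps, so total time is just loads * (30+40+20).
sequential = loads * sum(durations)

# Pipelined: each stage starts a load as soon as both the load and the
# machine are free; loads pass through the stages in order.
free = [0, 0, 0]                  # next time each machine is available
done = 0
for _ in range(loads):
    t = 0                         # when this load is ready for its next stage
    for stage, d in enumerate(durations):
        start = max(t, free[stage])
        t = start + d
        free[stage] = t
    done = t

print(sequential, done)           # 360 min (6 h) vs 210 min (3.5 h)
```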
Pipelining Lessons
• Latency vs. Throughput
• Question
  – What is the latency in both cases?
  – What is the throughput in both cases?

Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
Pipelining Lessons [contd…]
• Question
  – What is the fastest operation in the example?
  – What is the slowest operation in the example?

Pipeline rate is limited by the slowest pipeline stage.
Pipelining Lessons [contd…]
Multiple tasks operate simultaneously using different resources.
Pipelining Lessons [contd…]
• Question
  – Would the speedup increase if we had more steps?

Potential speedup = number of pipe stages.
Pipelining Lessons [contd…]
• Washer takes 30 minutes
• Dryer takes 40 minutes
• "Folder" takes 20 minutes
• Question
  – Would it matter if the "Folder" also took 40 minutes?

Unbalanced lengths of pipe stages reduce speedup.
Pipelining Lessons [contd…]
Time to "fill" the pipeline and time to "drain" it reduce speedup.
Five Stages of an Instruction
Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Ifetch    Reg/Dec   Exec      Mem       Wr        (Load)

• Ifetch: Instruction Fetch
  – Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
Conventional Pipelined Execution Representation
[Staircase diagram over time: successive instructions in program flow each pass through IFetch, Dcd, Exec, Mem, WB, offset by one cycle]
Example
Program execution order (in instructions):
    lw $1, 100($0)
    lw $2, 200($0)
    lw $3, 300($0)

[Diagram: the three loads executed one after another — each Inst. Fetch → Reg → ALU → Mem → Reg completes before the next begins — versus pipelined, where a new load enters the pipeline every cycle]
Example [contd…]
• Time_pipelined = Time_non-pipelined / Number of pipe stages
  – Assumptions
    • Stages are perfectly balanced
    • Ideal conditions
Definitions
• Performance is in units of things per second
  – bigger is better
• If we are primarily concerned with response time:
    performance(X) = 1 / execution_time(X)
• "X is n times faster than Y" means:
    n = Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X)
Example [contd…]
• Speedup in this case = 24 / 14 = 1.7
• Let's add 1000 more instructions
  – Time (non-pipelined) = 1000 × 8 + 24 ns = 8024 ns
  – Time (pipelined) = 1000 × 2 + 14 ns = 2014 ns
  – Speedup = 8024 / 2014 = 3.98 ≈ 4 = 8/2

Instruction throughput is the important metric (as opposed to the latency of an individual instruction), since real programs execute billions of instructions in practice!
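The arithmetic, using the slide's assumed 8 ns per unpipelined instruction and 2 ns per pipelined cycle:

```python
# Speedup for 1000 extra instructions after the initial three loads
# (24 ns unpipelined, 14 ns pipelined for those first three).
t_nonpipe = 1000 * 8 + 24      # ns
t_pipe = 1000 * 2 + 14         # ns
speedup = t_nonpipe / t_pipe

print(t_nonpipe, t_pipe, round(speedup, 2))
# As the instruction count grows, speedup approaches the stage-time
# ratio 8 ns / 2 ns = 4, i.e. the number-of-stages limit.
```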
Pipeline Hazards
• Structural Hazard

[Staircase diagram: overlapping instructions contend for the same hardware resource in the same cycle]
Pipeline Hazard [contd…]
• Control Hazard
• Example
  – add  $4, $5, $6
  – beq  $1, $2, 40
  – lw   $3, 300($0)
Pipeline Hazards [contd…]
• Data Hazards
• Example
  – add  $s0, $t0, $t1
  – sub  $t2, $s0, $t3
Summary Pipelining Lessons
[Pipelined laundry Gantt chart, repeated]

• Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" pipeline and time to "drain" it reduce speedup
• Stall for dependences
Summary of Pipeline Hazards
• Structural Hazards
– Hardware design
• Control Hazard
– Decision based on results
• Data Hazard
– Data Dependency
Pipelining - II
Adapted from the CS 152C (UC Berkeley) lecture notes of Spring 2002
Revisiting Pipelining Lessons
[Pipelined laundry Gantt chart, repeated]

• Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" pipeline and time to "drain" it reduce speedup
• Stall for dependences
Revisiting Pipelining Hazards
• Structural Hazards
– Hardware design
• Control Hazard
– Decision based on results
• Data Hazard
– Data Dependency
Control Signals for existing Datapath
• IF: Instruction Fetch
• ID: Instruction Decode / register file read
• EX: Execute / address calculation
• MEM: Memory Access
• WB: Write back
[Diagram: the single-cycle MIPS datapath — PC, Instruction Memory, register
file (Read Reg1/Read Reg2/Write Reg), Sign Extend, ALU with Zero output,
Data Memory, and muxes — annotated with the five stages]
• The right-to-left control can lead to hazards
Place registers between each step
[Diagram: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM,
and MEM/WB inserted between the stages]
Example
10   lw   r1, r2(35)
14   addI r2, r2, 3
20   sub  r3, r4, r5
24   beq  r6, r7, 100
30   ori  r8, r9, 17
34   add  r10, r11, r12
100  and  r13, r14, 15
Start: Fetch 10
[Diagram sequence: the program above stepping through the pipeline, one
snapshot per clock cycle; each snapshot shows the datapath (Next PC,
Inst. Mem, IR, Decode, Reg File, Exec, Mem Access / Data Mem, register-file
write-back) and which instruction address is in each stage:]
• Fetch 10
• Fetch 14, Decode 10
• Fetch 20, Decode 14, Exec 10
• Fetch 24, Decode 20, Exec 14, Mem 10
• Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
• Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
(on these slides the beq at 24 is taken, so the next fetch is from
address 100: and r13, r14, 15)
Pipelining Load Instruction
[Diagram, cycles 1–7: three successive lw instructions, one starting per
cycle; each takes five stages — Ifetch, Reg/Dec, Exec, Mem, Wr]
The five independent functional units in the pipeline datapath are:
– Instruction Memory for the Ifetch stage
– Register File's Read ports (busA and busB) for the Reg/Dec stage
– ALU for the Exec stage
– Data Memory for the Mem stage
– Register File's Write port (busW) for the Wr stage
Pipelining the R Instruction
[Diagram, cycles 1–4: an R-type instruction uses only four stages — Ifetch,
Reg/Dec, Exec, Wr]
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Register fetch and Instruction Decode
• Exec:
– ALU operates on the two register operands
– Update PC
• Wr: Write the ALU output back to the register file
Pipelining Both L and R type
[Diagram, cycles 1–9: a mix of R-type (4-stage) and Load (5-stage)
instructions issued back to back — oops, we have a problem!]
We have a pipeline conflict, or structural hazard:
– Two instructions try to write to the register file at the same time!
– There is only one write port
Important Observations
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
– Load uses the Register File's Write Port during its 5th stage
(Ifetch, Reg/Dec, Exec, Mem, Wr)
– R-type uses the Register File's Write Port during its 4th stage
(Ifetch, Reg/Dec, Exec, Wr)
Solution
Delay the R-type's register write by one cycle:
– Now R-type instructions also use the Reg File's write port at Stage 5
– The Mem stage is a NOOP stage: nothing is being done
[Diagram, cycles 1–9: R-type now flows Ifetch, Reg/Dec, Exec, Mem (NOOP),
Wr — so interleaved R-type and Load instructions no longer collide at the
register file's write port]
Datapath (Without Pipeline)
IR <- Mem[PC]; PC <- PC+4
A <- R[rs]; B <- R[rt]
S <- A + B;  S <- A + SX;  S <- A or ZX  (per instruction type)
M <- Mem[S];  Mem[S] <- B;  if Cond PC <- PC+SX
R[rd] <- S;  R[rt] <- S;  R[rd] <- M  (write-back, per instruction type)
[Diagram: Next PC / PC, Inst. Mem, IR, Reg File, A and B operand latches,
Exec (with Equal test), S, Mem Access / Data Mem, M, and write-back to the
register file]
Datapath (With Pipeline)
IR <- Mem[PC]; PC <- PC+4
A <- R[rs]; B <- R[rt]
S <- A + B;  S <- A + SX;  S <- A or ZX;  M <- S  (per instruction type)
M <- Mem[S];  Mem[S] <- B;  if Cond PC <- PC+SX
R[rd] <- M;  R[rt] <- M  (write-back, per instruction type)
[Diagram: the same datapath with pipeline latches (A, B, S, M, D) separating
the stages, so each register transfer above happens in its own stage]
Structural Hazard and Solution
[Diagram, time in clock cycles: Load followed by Instr 1–4 in program order,
each flowing Mem, Reg, ALU, Mem, Reg; with separate instruction and data
memories, no two instructions contend for the same unit in the same cycle]
Control Hazard - #1 Stall
[Diagram, time in clock cycles: Add, Beq, then Load in program order; after
Beq, the pipeline sits idle ("lost potential") until the branch decision is
known]
• Stall: wait until the decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
Control Hazard – #2 Predict
[Diagram, time in clock cycles: Add, Beq, then Load in program order; the
instruction after the branch is fetched immediately on the predicted path]
• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong
(right 50% of the time)
• More dynamic scheme: keep a history per branch
Control Hazard - #3 Delayed Branch
[Diagram, time in clock cycles: Add, Beq, Misc, then Load in program order;
the instruction in the slot after the branch always executes]
• Delayed Branch: redefine branch behavior (the branch takes place after the
next instruction)
• Impact: 0 clock cycles per branch instruction if the compiler can find an
instruction to put in the "slot" (~50% of the time)
Data Hazards (RAW)
Dependencies backwards in time are hazards:
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11
[Diagram, time in clock cycles: each instruction flows IF (Im), ID/RF (Reg),
EX (ALU), MEM (Dm), WB (Reg); add writes r1 back only in its WB stage, but
sub, and, and or try to read r1 earlier — the dependence arrows point
backwards in time]
Data Hazards [contd…]
"Forward" the result from one stage to another:
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11
[Diagram, time in clock cycles: the same sequence, with add's ALU result
forwarded from the EX/MEM and MEM/WB latches directly to the ALU inputs of
the later instructions, removing the backwards-in-time dependences]
Data Hazards [contd…]
Dependencies backwards in time are hazards:
lw  r1, 0(r2)
sub r4, r1, r3
[Diagram, time in clock cycles: lw produces r1 only at the end of its MEM
stage, but sub needs it at the start of EX — sub must stall one cycle]
• Can't solve with forwarding:
• Must delay/stall the instruction dependent on loads
Hazard Detection
[Diagram: overlapped IF / DCD / EX / Mem / WB instruction streams annotated
with where each hazard arises:
• Structural Hazard — two instructions need the same unit, e.g. instruction
fetch vs. a store's memory-operand access
• Control Hazard — a jump/branch changes which instruction to fetch next
• RAW (read after write) Data Hazard — an operand is read before a
predecessor writes it
• WAW (write after write) and WAR (write after read) Data Hazards]
Forwarding Unit For Resolving Data Hazards
www.ee.unt.edu/public/guturu/MIPS-Pipeline-With-Forwarding-Unit.pdf
Hazard Detection and Forwarding Units
www.ee.unt.edu/public/guturu/MIPS_Pipeline_Hazard_Detection.jpg
Three Generic Data Hazards
• Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Data Dependence” (in compiler
nomenclature). This hazard results from an
actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more
complicated pipes
Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j
is in the instruction pipeline.
[Diagram: a window on execution — only pending instructions can cause
hazards]
• A RAW hazard exists on a register if Rregs(i) ∩ Wregs(j) ≠ ∅
• A WAW hazard exists on a register if Wregs(i) ∩ Wregs(j) ≠ ∅
• A WAR hazard exists on a register if Wregs(i) ∩ Rregs(j) ≠ ∅
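These set-intersection tests can be sketched directly. This is a minimal illustration; the parsing of a three-operand "op rd, rs, rt" format (destination first) is an assumption, not part of the slides.

```python
# Detect RAW/WAW/WAR hazards between an issuing instruction i and a pending
# predecessor j by intersecting their read/write register sets.

def regs(instr):
    """Split 'op rd, rs, rt' (destination first, an assumed format)
    into (writes, reads) register sets."""
    op, operands = instr.split(None, 1)
    rd, *rest = [r.strip() for r in operands.split(",")]
    return {rd}, set(rest)

def hazards(i, j):
    wi, ri = regs(i)
    wj, rj = regs(j)
    return {
        "RAW": bool(ri & wj),   # i reads a register j will write
        "WAW": bool(wi & wj),   # i writes a register j will write
        "WAR": bool(wi & rj),   # i writes a register j still has to read
    }

print(hazards("sub r4, r1, r3", "add r1, r2, r3"))
# RAW is True: sub reads r1, which the pending add writes
```

This mirrors the slide's example pair (add writes r1, sub reads it), where only the RAW condition fires.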
Computing CPI
• Start with base CPI
• Add stalls
CPI = CPI_base + CPI_stall
CPI_stall = STALL_type1 x freq_type1 + STALL_type2 x freq_type2 + …
• Suppose:
– CPI_base = 1
– freq_branch = 20%, freq_load = 30%
– Branches always cause a 1-cycle stall
– Loads cause a 2-cycle stall
• Then: CPI = 1 + (1 x 0.20) + (2 x 0.30) = 1.8
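The same computation as a short sketch, using the example's stall counts and frequencies:

```python
# Effective CPI = base CPI plus the stall contribution of each instruction
# class, weighted by how often that class occurs.

def effective_cpi(cpi_base, stalls):
    """stalls: list of (stall_cycles, frequency) pairs."""
    return cpi_base + sum(cycles * freq for cycles, freq in stalls)

# Branches: 1-cycle stall, 20% of instructions; loads: 2-cycle stall, 30%.
cpi = effective_cpi(1, [(1, 0.20), (2, 0.30)])
print(round(cpi, 2))   # 1.8
```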
Summary
• Control signals need to be propagated
• Insert registers between every stage to "remember" and "propagate" values
• Solutions to the Control Hazard are Stall, Predict, and Delayed Branch
• The solution to Data Hazards is "Forwarding" (plus a stall after loads)
• Effective CPI = CPI_ideal + CPI_stall
Memory Subsystem and Cache
Adapted from lecture notes of Dr. Patterson and Dr. Kubiatowicz of
UC Berkeley
The Big Picture
[Diagram: the five classic components — Processor (Control + Datapath),
Memory, Input, Output]
Technology Trends
         Capacity         Speed (latency)
Logic:   2x in 3 years    2x in 3 years
DRAM:    4x in 3 years    2x in 10 years
Disk:    4x in 3 years    2x in 10 years

DRAM generations (1000:1 in size, only 2:1 in cycle time):
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
Technology Trends [contd…]
Processor-DRAM Memory Gap (latency)
[Plot, 1980–2000, relative performance on a log scale: CPU ("Moore's Law")
improves ~60%/yr (2X/1.5 yr) while DRAM improves only ~9%/yr (2X/10 yrs) —
the processor-memory performance gap grows ~50% per year]
The Goal: Large, Fast, Cheap Memory!!!
• Fact
– Large memories are slow
– Fast memories are small
• How do we create a memory that is large,
cheap and fast (most of the time) ?
– Hierarchy
– Parallelism
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the
cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Diagram: Processor (Control, Datapath, Registers, On-Chip Cache), then
Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk),
and Tertiary Storage (Tape):
Level              Speed                        Size (bytes)
Registers          ~1 ns                        100s
L2 Cache (SRAM)    ~10 ns                       Ks
Main Memory (DRAM) ~100 ns                      Ms
Disk               10,000,000 ns (10 ms)        Gs
Tape               10,000,000,000 ns (10 sec)   Ts]
Today’s Situation
• Rely on caches to bridge the gap
• Microprocessor-DRAM performance gap
– time of a full cache miss, in instructions executed:
1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions
Memory Hierarchy (1/4)
• Processor
– executes programs
– runs on order of nanoseconds to
picoseconds
– needs to access code and data for
programs: where are these?
• Disk
– HUGE capacity (virtually limitless)
– VERY slow: runs on order of milliseconds
– so how do we account for this gap?
Memory Hierarchy (2/4)
• Memory (DRAM)
– smaller than disk (not limitless capacity)
– contains subset of data on disk: basically
portions of programs that are currently being
run
– much faster than disk: memory accesses
don’t slow down processor quite as much
– Problem: memory is still too slow
(hundreds of nanoseconds)
– Solution: add more layers (caches)
Memory Hierarchy (3/4)
[Diagram: Processor at the top, then Level 1, Level 2, Level 3, …, Level n;
moving to lower levels, the distance from the processor increases, the cost
per MB decreases, and the size of the memory at each level grows]
Memory Hierarchy (4/4)
• If a level is closer to the Processor, it must be:
– smaller
– faster
– a subset of all the levels below it (contains the most recently used data)
• Each lower level contains at least all the data in the levels above it
• The Lowest Level (usually disk) contains all available data
Analogy: Library
• You’re writing a term paper (Processor)
at a table in Evans
• Evans Library is equivalent to disk
– essentially limitless capacity
– very slow to retrieve a book
• Table is memory
– smaller capacity: means you must return
book when table fills up
– easier and faster to find a book there once
you’ve already retrieved it
Analogy : Library [contd…]
• Open books on table are cache
– smaller capacity: can have very few open
books fit on table; again, when table fills up,
you must close a book
– much, much faster to retrieve data
• Illusion created: whole library open on the
tabletop
– Keep as many recently used books open on
table as possible since likely to use again
– Also keep as many books on table as
possible, since faster than going to library
Memory Hierarchy Basics
• Disk contains everything.
• When Processor needs something, bring
it into to all lower levels of memory.
• Cache contains copies of data in
memory that are being used.
• Memory contains copies of data on disk
that are being used.
• Entire idea is based on Temporal
Locality: if we use it now, we’ll want to
use it again soon (a Big Idea)
Caches: Why do they Work?
[Plot: probability of reference vs. address space (0 to 2^n - 1) — accesses
cluster around recently used addresses]
• Temporal Locality (Locality in Time):
=> Keep most recently accessed data items closer to the processor
• Spatial Locality (Locality in Space):
=> Move blocks consisting of contiguous words to the upper levels
[Diagram: blocks Blk X and Blk Y moving between the Upper Level and Lower
Level memories, to and from the processor]
Cache Design Issues
• How do we organize cache?
• Where does each memory address map
to? (Remember that cache is subset of
memory, so multiple memory addresses
map to the same cache location.)
• How do we know which elements are in
cache?
• How do we quickly locate them?
Direct Mapped Cache
• In a direct-mapped cache, each memory
address is associated with one possible
block within the cache
– Therefore, we only need to look in a single
location in the cache for the data if it exists
in the cache
– Block is the unit of transfer between cache
and memory
Direct Mapped Cache [contd…]
[Diagram: 16 memory locations (0–F) mapping onto a 4-byte direct-mapped
cache with indices 0–3]
• Cache Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ...
– In general: any memory location that is a multiple of 4
Issues with Direct Mapped Cache
• Since multiple memory addresses map to the same cache index, how do we
tell which one is in there?
• What if we have a block size > 1 byte?
• Result: divide the memory address into three fields
ttttttttttttttttt iiiiiiiiii oooo
– tag: to check if you have the correct block
– index: to select the block
– offset: to select the byte within the block
Example of a direct mapped cache
• For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Diagram: a 32-bit block address split into Cache Tag (bits 31–9, ex: 0x50),
Cache Index (bits 8–4, ex: 0x01), and Byte Select (bits 3–0, ex: 0x00); the
tag and a Valid Bit are stored as part of the cache "state" alongside each
32-byte row of Cache Data (Byte 0 … Byte 31 in row 0, Byte 32 … Byte 63 in
row 1, …, up to Byte 992 … Byte 1023 in row 31)]
Terminology
• All fields are read as unsigned integers.
• Index: specifies the cache index (which
“row” of the cache we should look in)
• Offset: once we’ve found correct block,
specifies which byte within the block we
want
• Tag: the remaining bits after offset and
index are determined; these are used to
distinguish between all the memory
addresses that map to the same location
Terminology [contd…]
• Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of
RAM access time + time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level +
time to deliver the block to the processor
• Hit Time << Miss Penalty
[Diagram: blocks Blk X and Blk Y moving between the Upper Level and Lower
Level memories, to and from the processor]
How is the hierarchy managed ?
• Registers <-> Memory
– by compiler (programmer?)
• cache <-> memory
– by the hardware
• memory <-> disks
– by the hardware and operating system (virtual
memory)
– by the programmer (files)
Example
• Suppose we have 16KB of data in a direct-mapped cache with 4-word blocks
• Determine the size of the tag, index and offset fields if we’re using a
32-bit architecture
• Offset
– need to specify the correct byte within a block
– block contains 4 words = 16 bytes = 2^4 bytes
– need 4 bits to specify the correct byte
Example [contd…]
• Index: (~index into an “array of blocks”)
– need to specify the correct row in the cache
– cache contains 16 KB = 2^14 bytes
– block contains 2^4 bytes (4 words)
– # rows/cache = # blocks/cache (since there’s one block/row)
  = (bytes/cache) / (bytes/row)
  = (2^14 bytes/cache) / (2^4 bytes/row)
  = 2^10 rows/cache
– need 10 bits to specify this many rows
Example [contd…]
• Tag: use the remaining bits as the tag
– tag length = mem addr length - offset - index
  = 32 - 4 - 10 bits = 18 bits
– so the tag is the leftmost 18 bits of the memory address
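The field-width arithmetic above can be checked with a short sketch; the 16 KB cache and 16-byte (4-word) blocks are the example's parameters.

```python
# Split a 32-bit address into tag/index/offset widths for a direct-mapped
# cache, given total cache size and block size in bytes (powers of two).

def field_widths(addr_bits, cache_bytes, block_bytes):
    offset = (block_bytes - 1).bit_length()                # log2(block size)
    index = (cache_bytes // block_bytes - 1).bit_length()  # log2(# rows)
    tag = addr_bits - index - offset                       # whatever is left
    return tag, index, offset

print(field_widths(32, 16 * 1024, 16))   # (18, 10, 4)
```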
Accessing data in cache
• Ex.: 16KB of data, direct-mapped, 4-word blocks
• Read 4 addresses:
– 0x00000014, 0x0000001C, 0x00000034, 0x00008014
• Memory values (only the cache/memory level of the hierarchy is shown):
Address (hex)   Value of Word
00000010        a
00000014        b
00000018        c
0000001C        d
00000030        e
00000034        f
00000038        g
0000003C        h
00008010        i
00008014        j
00008018        k
0000801C        l
Accessing data in cache [contd…]
• 4 Addresses:
– 0x00000014, 0x0000001C, 0x00000034, 0x00008014
• 4 Addresses divided (for convenience) into Tag, Index, Byte Offset fields:
Tag                 Index       Offset
000000000000000000  0000000001  0100
000000000000000000  0000000001  1100
000000000000000000  0000000011  0100
000000000000000010  0000000001  0100
16 KB Direct Mapped Cache, 16B blocks
• Valid bit: determines whether anything is stored in that row (when the
computer is initially turned on, all entries are invalid)
[Diagram sequence: the 1024-row cache (Index, Valid, Tag, and the four words
0x0-3 / 0x4-7 / 0x8-b / 0xc-f of each block), stepped through the four
reads:]
• Read 0x00000014 = 0…00 0..001 0100
– Index field says read block 1 (0000000001)
– No valid data: a miss, so load that data into the cache, setting the tag
and valid bit (row 1 now holds a, b, c, d with tag 0)
– Read from the cache at the offset, return word b
• Read 0x0000001C = 0…00 0..001 1100
– Data valid, tag OK, so read at the offset and return word d (a hit)
• Read 0x00000034 = 0…00 0..011 0100
– Index field says read block 3; no valid data: another miss
– Load that cache block (row 3 now holds e, f, g, h), return word f
• Read 0x00008014 = 0…10 0..001 0100
– Index field says read Cache Block 1; the data is valid, but the Cache
Block 1 tag does not match (0 != 2)
– Miss, so replace block 1 with the new data & tag (row 1 now holds
i, j, k, l with tag 2), and return word j
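The walk-through above can be reproduced with a tiny simulator — a sketch only, with the memory contents a–l taken from the example's table.

```python
# Minimal direct-mapped cache: 1024 rows, 16-byte blocks, as in the example.
# Each row holds [valid, tag, block]; a read returns (hit, word).

OFFSET_BITS, INDEX_BITS = 4, 10
rows = [[False, 0, None] for _ in range(1 << INDEX_BITS)]

# Word values by address, from the example's memory table.
memory = {0x10: "a", 0x14: "b", 0x18: "c", 0x1C: "d",
          0x30: "e", 0x34: "f", 0x38: "g", 0x3C: "h",
          0x8010: "i", 0x8014: "j", 0x8018: "k", 0x801C: "l"}

def read(addr):
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    row = rows[index]
    hit = row[0] and row[1] == tag
    if not hit:  # miss: fetch the whole 16-byte block, set valid and tag
        base = addr & ~0xF
        row[:] = [True, tag, [memory.get(base + 4 * i) for i in range(4)]]
    return hit, row[2][(addr & 0xF) // 4]

for a in (0x14, 0x1C, 0x34, 0x8014):
    print(hex(a), read(a))
# 0x14 misses and returns b; 0x1C hits and returns d; 0x34 misses and
# returns f; 0x8014 misses (tag 0 != 2), replaces block 1, returns j
```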
Things to Remember
• We would like to have the capacity of disk
at the speed of the processor:
unfortunately this is not feasible.
• So we create a memory hierarchy:
– each successively lower level contains “most
used” data from next higher level
• Exploit temporal and spatial locality
Virtual Memory
Adapted from lecture notes of Dr. Patterson and
Dr. Kubiatowicz of UC Berkeley
View of Memory Hierarchies
[Diagram — Upper Level (faster) to Lower Level (larger):
Regs → Cache → L2 Cache, moving Instr. Operands and Blocks (covered thus
far); next, Virtual Memory: Memory → Disk → Tape, moving Pages and Files]
Memory Hierarchy: Some Facts
Level           Capacity    Access Time  Cost                 Staging / Xfer Unit
CPU Registers   100s Bytes  <10s ns      —                    prog./compiler, 1-8 bytes
Cache           K Bytes     10-100 ns    $.01-.001/bit        cache cntl, 8-128 bytes
Main Memory     M Bytes     100ns-1us    $.01-.001            OS, 512-4K byte pages
Disk            G Bytes     ms           10^-3 - 10^-4 cents  user/operator, Mbyte files
Tape            infinite    sec-min      10^-6 cents          —
(Upper levels are faster; lower levels are larger)
Virtual Memory: Motivation
• If the Principle of Locality allows caches to offer (usually) the speed of
cache memory with the size of DRAM memory, then why not apply it recursively
at the next level, to give the speed of DRAM memory with the size of disk?
• Treat Memory as a “cache” for Disk !!!
• Share memory between multiple processes
but still provide protection – don’t let one
program read/write memory of another
• Address space – give each program the illusion
that it has its own private memory
– Suppose code starts at addr 0x40000000. But
different processes have different code, both at
the same address! So each program has a
different view of memory
Advantages of Virtual Memory
• Translation:
– Program can be given consistent view of memory, even though
physical memory is scrambled
– Makes multithreading reasonable (now used a lot!)
– Only the most important part of program (“Working Set”) must be in
physical memory.
– Contiguous structures (like stacks) use only as much physical
memory as necessary yet still grow later.
• Protection:
– Different threads (or processes) protected from each other.
– Different pages can be given special behavior
• (Read Only, Invisible to user programs, etc).
– Kernel data protected from User programs
– Very important for protection from malicious programs
=> Far more “viruses” under Microsoft Windows
• Sharing:
– Can map same physical page to multiple users
(“Shared memory”)
Virtual to Physical Address Translation
[Diagram: the program operates in its virtual address space; an HW mapping
translates each virtual address (inst. fetch, load, store) into a physical
address into physical memory (incl. caches)]
• Each program operates in its own virtual address space, as if it were the
only program running
• Each is protected from the other
• The OS can decide where each goes in memory
• Hardware (HW) provides the virtual -> physical mapping
Mapping Virtual Memory to Physical Memory
[Diagram: a virtual address space (Code, Static, Heap, Stack) mapped onto
64 MB of Physical Memory]
• Divide memory into equal-sized chunks (about 4KB)
• Any chunk of Virtual Memory can be assigned to any chunk of Physical
Memory (a “page”)
Paging Organization (eg: 1KB Page)
[Diagram: the page is the unit of mapping — virtual addresses 0, 1024, 2048,
…, 31744 (virtual pages 0–31) are translated through the address trans MAP
onto physical addresses 0, 1024, …, 7168 (physical pages 0–7). The page is
also the unit of transfer from disk to physical memory]
Virtual Memory Mapping
[Diagram: Virtual Address = page no. + offset. The Page Table Base Reg plus
the page number index into the Page Table; each entry holds a Valid bit (V),
Access Rights (A.R.), and a Physical Page Address (P.P.A.). The physical
page address is concatenated with the offset to form the physical memory
address]
Page Table is located in physical memory
Issues in VM Design
• What is the size of the information blocks that are transferred from
secondary to main storage (M)? => page size
(Contrast with the physical block size on disk, i.e. sector size)
• Which region of M is to hold the new block => placement policy
• How do we find a page when we look for it? => block identification
• A block of information is brought into M, and M is full; then some region
of M must be released to make room for the new block => replacement policy
• What do we do on a write? => write policy
• A missing item is fetched from secondary memory only on the occurrence of
a fault => demand load policy
[Diagram: reg ↔ cache ↔ mem ↔ disk, with pages moving between memory frames
and disk]
Virtual Memory Problem # 1
• Mapping every address => 1 extra memory access for every memory access
• Observation: since there is locality in pages of data, there must be
locality in the virtual addresses of those pages
• Why not use a cache of virtual to physical address translations to make
translation fast? (small is fast)
• For historical reasons, this cache is called a
Translation Lookaside Buffer, or TLB
Memory Organization with TLB
• TLBs are usually small, typically 128 - 256 entries
• Like any other cache, the TLB can be fully associative, set associative,
or direct mapped
[Diagram: the Processor issues a VA to the TLB Lookup; on a hit, the PA goes
to the Cache (and on a cache miss, on to Main Memory); on a TLB miss, the
Translation is fetched from the page table, possibly missing to Main Memory]
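The lookup flow can be sketched as follows — a minimal illustration in which the page table is just a dictionary and the specific virtual-to-physical mappings (1 -> 7, 2 -> 9) are assumptions for the example, not from the slides.

```python
# A TLB is a small cache of (virtual page -> physical page) translations,
# consulted before the full page table.

PAGE_BITS = 12                       # 4 KB pages
tlb = {}                             # virtual page -> physical page
page_table = {1: 7, 2: 9}            # assumed mappings, for illustration

def translate(va):
    vpn, offset = va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                   # TLB hit: no extra memory access
        ppn = tlb[vpn]
    else:                            # TLB miss: walk the page table, refill
        ppn = page_table[vpn]        # (a real system would fault if absent)
        tlb[vpn] = ppn
    return (ppn << PAGE_BITS) | offset

print(hex(translate(0x2ABC)))   # vpn 2 -> ppn 9, giving 0x9abc
```

Note the page offset passes through untranslated; only the page number is looked up, which is what keeps the TLB small.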
Typical TLB Format
Virtual Address | Physical Address | Dirty | Ref | Valid | Access Rights
• The TLB is just a cache on the page table mappings
• TLB access time is comparable to cache access time
(much less than main memory access time)
• Ref: used to help calculate LRU on replacement
• Dirty: since we use write back, we need to know whether or not to write
the page to disk when it is replaced
What if not in TLB
• Option 1: Hardware checks page table and
loads new Page Table Entry into TLB
• Option 2: Hardware traps to OS, up to OS to
decide what to do
• MIPS follows Option 2: Hardware knows
nothing about page table format
TLB Miss
• If the address is not in the TLB, MIPS traps to the operating system
• The operating system knows which program caused the TLB fault or page
fault, and knows what virtual address was requested
valid  virtual  physical
1      2        9
TLB Miss: If data is in Memory
• We simply add the entry to the TLB, evicting an old entry from the TLB
valid  virtual  physical
1      7        32
1      2        9
What if data is on disk ?
• We load the page off the disk into a free
block of memory, using a DMA transfer
– Meantime we switch to some other process
waiting to be run
• When the DMA is complete, we get an
interrupt and update the process's page
table
– So when we switch back to the task, the
desired data will be in memory
What if the memory is full ?
• We load the page off the disk into a least
recently used block of memory, using a
DMA transfer
– Meantime we switch to some other process
waiting to be run
• When the DMA is complete, we get an
interrupt and update the process's page
table
– So when we switch back to the task, the
desired data will be in memory
Virtual Memory Problem # 2
• Page Table too big!
– 4GB Virtual Memory ÷ 4 KB page
=> ~1 million Page Table Entries
=> 4 MB just for the Page Table of 1 process;
25 processes => 100 MB for Page Tables!
• A variety of solutions trade off the memory size of the mapping function
for a slower TLB miss
– Make the TLB large enough, and highly associative, so that it rarely
misses on address translation
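The arithmetic behind "Page Table too big" as a sketch (the 4-byte entry size is a common assumption consistent with the slide's 4 MB figure):

```python
# Size of a flat (1-level) page table: one entry per virtual page.

def page_table_bytes(va_bytes, page_bytes, entry_bytes=4):
    entries = va_bytes // page_bytes
    return entries * entry_bytes

per_process = page_table_bytes(4 * 2**30, 4 * 2**10)     # 4 GB VA, 4 KB pages
print(per_process // 2**20, "MB per process")            # 4 MB per process
print(25 * per_process // 2**20, "MB for 25 processes")  # 100 MB
```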
Two Level Page Tables
[Diagram: a Super Page Table in physical memory points to 2nd Level Page
Tables, which in turn map the used regions of the virtual address space
(Code, Static, Heap, Stack) onto 64 MB of Physical Memory; unused regions of
the address space need no 2nd-level tables]
Summary
• Apply Principle of Locality Recursively
• Reduce Miss Penalty? add a (L2) cache
• Manage memory to disk? Treat as cache
– Included protection as bonus, now critical
– Use Page Table of mappings
vs. tag/data in cache
• Virtual memory to Physical Memory
Translation too slow?
– Add a cache of Virtual to Physical Address
Translations, called a TLB
Summary
• Virtual Memory allows protected sharing of
memory between processes with less swapping
to disk, less fragmentation than always swap or
base/bound
• Spatial Locality means Working Set of Pages is
all that must be in memory for process to run
fairly well
• TLB to reduce performance cost of VM
• Need a more compact representation to reduce the memory size cost of a
simple 1-level page table (especially as addresses go from 32- to 64-bit)