55:035 Computer Architecture and Organization
Virtual Memory
Basics
Address Translation
Cache vs VM
Paging
Replacement
TLBs
Segmentation
Page Tables
Figure: The memory hierarchy. Upper levels are faster; lower levels are larger.

Level           Capacity     Access time             Cost
CPU registers   100s bytes   <10s ns                 --
Cache           K bytes      10-100 ns               1-0.1 cents/bit
Main memory     M bytes      200-500 ns              $.0001-.00001 cents/bit
Disk            G bytes      10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit
Tape            infinite     sec-min                 10^-8 cents/bit

Staging/transfer units between levels:
registers <-> cache: instruction operands (1-8 bytes), managed by the program/compiler
cache <-> main memory: blocks (8-128 bytes), managed by the cache controller
main memory <-> disk: pages (4K-16K bytes), managed by the OS
disk <-> tape: files (Mbytes), managed by the user/operator
Some facts of computer life…
Computers run lots of processes simultaneously
There is not a full address space of memory for each process
We must share a smaller amount of physical memory among many processes
Virtual memory is the answer!
It divides physical memory into blocks and assigns them to different processes
Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory or on disk.
The compiler assigns data to a “virtual” address.
The VA is translated to a real/physical address somewhere in memory…
(this allows any program to run anywhere; where is determined by the particular machine and OS)
VM provides the following benefits:
Allows multiple programs to share the same physical memory
Allows programmers to write code as though they have a very large amount of main memory
Automatically handles bringing in data from disk
Programs reference “virtual” addresses in a non-existent memory
These are then translated into real “physical” addresses
Virtual address space may be bigger than physical address space
Divide physical memory into blocks, called pages
Anywhere from 512 bytes to 16 MB (4 KB is typical)
Virtual-to-physical translation by indexed table lookup
Add another cache for recent translations (the TLB)
Invisible to the programmer
Looks to your application like you have a lot of memory!
Figure: Process 1’s and Process 2’s virtual address spaces each map onto page frames of the shared physical memory; some virtual pages reside on disk.
Figure: Address translation. A 32-bit virtual address splits into a 20-bit virtual page number and a 12-bit page offset (log2 of the page size). The virtual page number indexes a per-process page table, located via a page table base register; each entry holds a valid bit, protection bits, a dirty bit, a reference bit, and the physical page number. The physical page number concatenated with the page offset goes to physical memory.
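As a concrete sketch of the entry fields named in the figure (a hypothetical layout; real formats vary by machine), a page-table entry for this 20-bit physical page number might pack into one 32-bit word:

```c
/* Hypothetical 32-bit page-table entry: field widths are illustrative. */
typedef struct {
    unsigned int valid  : 1;   /* page is present in physical memory  */
    unsigned int prot   : 2;   /* protection bits (e.g., read/write)  */
    unsigned int dirty  : 1;   /* set when the page has been written  */
    unsigned int ref    : 1;   /* set on access; used to estimate LRU */
    unsigned int ppn    : 20;  /* physical page number                */
    unsigned int unused : 7;
} pte_t;
```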
Relieves the problem of making a program that is too large to fit in physical memory... well... fit!
Allows a program to run in any location in physical memory
(called relocation)
Really useful, as you might want to run the same program on lots of machines…
Figure: A logical program is contiguous in virtual address space and here consists of 4 pages: A, B, C, D (at virtual addresses 0, 4K, 8K, 12K). Physically, 3 of the 4 pages (C, A, B) are scattered among the main memory frames (0K-28K), and 1 (D) is located on disk.
So, some definitions/“analogies”
A “page” or “segment” of memory is analogous to a “block” in a cache
A “page fault” or “address fault” is analogous to a cache miss: if we go to main memory and our data isn’t there, we need to get it from disk…
Main memory acts as the “real”/physical memory
These are more definitions than analogies…
With VM, the CPU produces “virtual addresses” that are translated by a combination of HW/SW to “physical addresses”
The “physical addresses” access main memory
The process described above is called “memory mapping” or “address translation”
Parameter           | First-level cache    | Virtual memory
Block (page) size   | 12-128 bytes         | 4096-65,536 bytes
Hit time            | 1-2 clock cycles     | 40-100 clock cycles
Miss penalty        | 8-100 clock cycles   | 700,000-6,000,000 clock cycles
  (Access time)     | (6-60 clock cycles)  | (500,000-4,000,000 clock cycles)
  (Transfer time)   | (2-40 clock cycles)  | (200,000-2,000,000 clock cycles)
Miss rate           | 0.5-10%              | 0.00001-0.001%
Data memory size    | 0.016-1 MB           | 4 MB-4 GB
Replacement policy:
Replacement on cache misses primarily controlled by hardware
Replacement with VM (i.e. which page do I replace?) usually controlled by OS
Because of bigger miss penalty, want to make the right choice
Sizes:
Size of processor address determines size of VM
Cache size independent of processor address size
Timing’s tough with virtual memory:
AMAT = T_mem + (1 - h) * T_disk
     = 100 ns + (1 - h) * 25,000,000 ns
h (the hit rate) has to be incredibly (almost unattainably) close to perfect for this to work out
So: VM is a “cache”, but an odd one.
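A quick numeric sketch of the AMAT equation above (the hit rates chosen are illustrative):

```c
#include <stdio.h>

int main(void) {
    const double t_mem  = 100.0;   /* ns: main-memory access time */
    const double t_disk = 25e6;    /* ns: disk access time (25 ms) */
    const double h[] = { 0.99, 0.999, 0.99999, 0.9999999 };

    for (int i = 0; i < 4; i++) {
        /* AMAT = T_mem + (1 - h) * T_disk */
        double amat = t_mem + (1.0 - h[i]) * t_disk;
        printf("h = %.7f  AMAT = %12.1f ns\n", h[i], amat);
    }
    return 0;
}
```

Even at h = 0.999 the AMAT is about 25,100 ns, roughly 250x slower than DRAM; only around h = 0.9999999 does it approach T_mem.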
How big is a page? How big is the page table?
Figure: The CPU issues a 32-bit virtual address, split into page number and offset; the page table maps the page to a frame, and the frame number plus the offset form the 32-bit address into physical memory.
Figure: Paging translation. The virtual address is split into page # and offset. A page table pointer register locates the current program’s page table in main memory; the page # added to this base selects the entry holding the frame #. The frame # combined with the offset addresses the page frame in main memory.
Suppose
a 32-bit architecture
a page size of 4 kilobytes
Therefore a 32-bit virtual address divides into a 20-bit page number (2^20 pages) and a 12-bit offset (2^12 = 4096 bytes within a page).
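A minimal sketch of this split in C; the example address 0x10020 is the one used in the worked example that follows:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                        /* log2(4096) */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)  /* 0xFFF      */

int main(void) {
    uint32_t va  = 0x10020;
    uint32_t vpn = va >> PAGE_SHIFT;         /* 0x10  (page #)  */
    uint32_t off = va &  PAGE_MASK;          /* 0x020 (offset)  */
    printf("VA 0x%05x -> VPN 0x%03x, offset 0x%03x\n", va, vpn, off);
    return 0;
}
```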
A processor asks for the contents of virtual memory address
0x10020. The paging scheme in use breaks this into a VPN of
0x10 and an offset of 0x020.
PTR (a CPU register that holds the address of the page table) has a value of 0x100 indicating that this process’s page table starts at location 0x100.
The machine uses word addressing and the page table entries are each one word long.
PTR = 0x100    Memory reference: VPN = 0x010, OFFSET = 0x020
ADDR     CONTENTS
0x00000  0x00000
0x00100  0x00010
0x00110  0x00022
0x00120  0x00045
0x00130  0x00078
0x00145  0x00010
0x10000  0x03333
0x10020  0x04444
0x22000  0x01111
0x22020  0x02222
0x45000  0x05555
0x45020  0x06666

PTR = 0x100    Memory reference: VPN = 0x010, OFFSET = 0x020
• What is the physical address calculated?
1. 10020
2. 22020
3. 45000
4. 45020
5. none of the above
(Same memory contents, PTR, and memory reference as above.)
• What is the physical address calculated?
• What is the contents of this address returned to the processor?
• How many memory accesses in total were required to obtain the contents of the desired address?
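A minimal sketch of the two-access lookup against the table above, assuming word addressing and one-word PTEs as stated (only the two locations this reference touches are modeled):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                        /* 4 KB pages: 3 hex digits of offset */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* Sparse stand-in for the memory-contents table. */
static uint32_t mem_read(uint32_t addr, int *accesses) {
    ++*accesses;
    switch (addr) {
    case 0x00110: return 0x00022;  /* page-table entry for VPN 0x010 */
    case 0x22020: return 0x02222;  /* the requested data word        */
    default:      return 0;
    }
}

int main(void) {
    const uint32_t ptr = 0x100;    /* page-table base register (PTR) */
    const uint32_t va  = 0x10020;
    int accesses = 0;

    uint32_t vpn   = va >> PAGE_SHIFT;                 /* 0x010 */
    uint32_t off   = va &  PAGE_MASK;                  /* 0x020 */
    uint32_t frame = mem_read(ptr + vpn, &accesses);   /* PTE at 0x110 -> 0x022 */
    uint32_t pa    = (frame << PAGE_SHIFT) | off;      /* 0x22020 */
    uint32_t data  = mem_read(pa, &accesses);          /* 0x02222 */

    printf("PA = 0x%05x, data = 0x%05x, accesses = %d\n", pa, data, accesses);
    return 0;
}
```

So the physical address is 0x22020 (answer 2 on the previous slide), the contents returned are 0x02222, and two memory accesses were required: one for the page-table entry, one for the data.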
Figure: Paging example with 4-byte pages and a 32-byte physical memory. Logical memory holds 16 bytes (a through p) in four pages; the page table maps page 0 → frame 5 (101), page 1 → frame 6 (110), page 2 → frame 1 (001), page 3 → frame 2 (010). Logical address 01 01 (page 1, offset 1) thus maps to physical address 110 01 (frame 6, offset 1).
Which block should be replaced on a virtual memory miss?
Again, we’ll stick with the strategy that eliminating page faults is a good thing
Therefore, we want to replace the LRU block
Many machines use a “use” or “reference” bit
Periodically reset by the OS
Gives the OS an estimate of which pages have been referenced (see the sketch below)
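A hedged sketch of how the OS can turn the periodically reset reference bits into an LRU estimate; this is the classic "aging" approximation, and the table size and 8-bit history are illustrative:

```c
#include <stdint.h>

#define NPAGES 1024

static uint8_t ref_bit[NPAGES];   /* set by hardware on each access to the page */
static uint8_t history[NPAGES];   /* software-maintained recency history        */

/* Called periodically (e.g., from a timer interrupt): shift each page's
   history right and fold in the current reference bit, then reset it. */
void sample_reference_bits(void) {
    for (int p = 0; p < NPAGES; p++) {
        history[p] = (uint8_t)((history[p] >> 1) | (ref_bit[p] << 7));
        ref_bit[p] = 0;
    }
}

/* On a miss, the victim is the page with the smallest history value,
   i.e., the one referenced least recently according to the samples. */
int choose_victim(void) {
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (history[p] < history[victim])
            victim = p;
    return victim;
}
```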
What happens on a write?
We don’t even want to think about a write-through policy!
The time for accesses through VM to the hard disk is so great that write-through is not practical
Instead, a write-back policy is used, with a dirty bit to tell whether a block has been written
Mechanism:
paging hardware
trap on page fault
Policy:
fetch policy: when should we bring in the pages of a process?
1. load all pages at the start of the process
2. load only on demand: “demand paging”
replacement policy: which page should we evict given a shortage of frames?
Given a full physical memory, which page should we evict??
What policy?
Random
FIFO: First-in-first-out
LRU: Least-Recently-Used
MRU: Most-Recently-Used
OPT: optimal (evict the page that will not be used for the longest time in the future)
Example sequence of page numbers: 0 1 2 3 42 2 37 1 2 3
How many page faults with FIFO? LRU? OPT?
How do you keep track of LRU info? (another data structure question; see the simulation sketch below)
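A small simulation sketch for this sequence (the slide doesn't fix a frame count, so 4 frames is assumed here; LRU recency is tracked with a per-frame timestamp, which is one answer to the data-structure question):

```c
#include <stdio.h>

#define NFRAMES 4
#define EMPTY   (-1)

/* policy: 0 = FIFO (evict oldest load), 1 = LRU (evict oldest use). */
static int count_faults(const int *refs, int n, int policy) {
    int frame[NFRAMES], stamp[NFRAMES], faults = 0, clock = 0;
    for (int f = 0; f < NFRAMES; f++) { frame[f] = EMPTY; stamp[f] = 0; }

    for (int i = 0; i < n; i++) {
        int hit = -1;
        for (int f = 0; f < NFRAMES; f++)
            if (frame[f] == refs[i]) { hit = f; break; }

        if (hit >= 0) {
            if (policy == 1) stamp[hit] = ++clock;   /* LRU: refresh on use */
        } else {
            int victim = 0;                          /* oldest stamp loses  */
            for (int f = 1; f < NFRAMES; f++)
                if (stamp[f] < stamp[victim]) victim = f;
            frame[victim] = refs[i];
            stamp[victim] = ++clock;                 /* both: stamp on load */
            faults++;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = { 0, 1, 2, 3, 42, 2, 37, 1, 2, 3 };
    int n = (int)(sizeof refs / sizeof refs[0]);
    printf("FIFO faults: %d\n", count_faults(refs, n, 0));
    printf("LRU  faults: %d\n", count_faults(refs, n, 1));
    return 0;
}
```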
1. It’s slow! We’ve turned every access to memory into two accesses to memory
solution: add a specialized “cache” called a “translation lookaside buffer (TLB)” inside the processor
2. It’s still huge!
Even worse: we’re ultimately going to have a page table for every process. Suppose 1024 processes at 4 MB per table; that’s 4 GB of page tables!
Figure: Basic address translation. The CPU issues virtual page 42; page table entry i maps it to physical frame 356, which is accessed in physical memory; the operating system manages the page table and the disk.
Figure: The same translation, with the page table itself placed in physical memory.
Place the page table in physical memory
However: this doubles the time per memory access!!
Figure: The same translation, with a cache added for the translations.
A special-purpose cache for translations
Historically called the TLB: Translation Lookaside Buffer
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits fully associative lookup on these machines.
Most mid-range machines use small n-way set associative organizations.
Note: 128-256 entries times the 4KB-16KB page mapped per entry is only 512KB-4MB; the L2 cache is often bigger than this “span” of the TLB.
Figure: Translation with a TLB. The CPU issues a VA; on a TLB hit the translation is used directly, and on a TLB miss the full translation (page table walk) is performed. The translated address accesses the cache; a cache hit delivers data, and a cache miss goes to main memory.
A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB
TLB entry: Virtual Page # | Physical Frame # | Dirty | Ref | Valid | Access bits (the virtual page # serves as the tag)
Really just a cache (a special-purpose cache) on the page table mappings
TLB access time comparable to cache access time
(much less than main memory access time)
Figure: A fully associative TLB producing a 34-bit physical address. The virtual address supplies a page frame address <30> and a page offset <13>. Each entry holds V <1>, R <2>, W <2> bits (read/write policies and permissions), a tag <30>, and a physical address <21>. Steps: (1) compare the page frame address against the tags; (2) check permissions; (3) a 32:1 mux selects the matching entry’s physical address <21>; (4) the high-order 21 bits of the address are concatenated with the low-order 13 bits (the offset) to form the 34-bit physical address.
Address translation is usually on the critical path…
…which determines the clock cycle time of the µP
Even in the simplest cache, TLB values must be read and compared
The TLB is usually smaller and faster than the cache’s address-tag memory
This way multiple TLB reads don’t increase the cache hit time
TLB accesses are usually pipelined because they are so important!
Figure: Flowchart of a memory access with a TLB and a cache. Start with the virtual address and a TLB access. On a TLB miss, stall and try to read the page table; if the reference page-faults, replace a page from disk. On a TLB hit, a read tries the cache (hit: deliver data to the CPU; miss: cache miss stall), while a write sets the dirty bit in the TLB and performs the cache/buffer memory write.
We can ask the same four questions we did about caches
Q1: Block placement
choice: a lower miss rate with complex placement, or vice versa
the miss penalty is huge
so choose a low miss rate ==> place a page anywhere in physical memory
similar to a fully associative cache model
Q2: Block addressing - use an additional data structure
fixed-size pages - use a page table
virtual page number ==> physical page number, then concatenate the offset
a tag bit indicates presence in main memory
Size is the number of virtual pages
Purpose is to hold the translation of VPN to PPN
Permits ease of page relocation
Make sure to keep tags to indicate the page is mapped
Potential problem:
Consider a 32-bit virtual address and 4 KB pages
4 GB / 4 KB = 1 M entries (words) required just for the page table!
Might have to page in the page table…
Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces!
Might have multi-level page tables (see the sketch below)
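A sketch of the multi-level idea for the 32-bit / 4 KB case: split the 20-bit VPN into two 10-bit indices, so second-level tables are only allocated for regions of the address space actually in use (the structure and names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define L1_BITS    10
#define L2_BITS    10
#define PAGE_SHIFT 12

typedef struct {
    uint32_t *l2[1u << L1_BITS];   /* NULL = 4 MB region not mapped */
} pagetable_t;

/* Translate va to a physical address, or return 0 to signal a
   (hypothetical) page fault. Bits 31..22 index the first level,
   bits 21..12 the second level, bits 11..0 are the offset. */
uint32_t translate(const pagetable_t *pt, uint32_t va) {
    uint32_t i1  = va >> (PAGE_SHIFT + L2_BITS);
    uint32_t i2  = (va >> PAGE_SHIFT) & ((1u << L2_BITS) - 1);
    uint32_t off = va & ((1u << PAGE_SHIFT) - 1);

    const uint32_t *l2tab = pt->l2[i1];
    if (l2tab == NULL || l2tab[i2] == 0)     /* unmapped region or page */
        return 0;
    return (l2tab[i2] << PAGE_SHIFT) | off;  /* frame # + offset        */
}
```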
Similar to a set-associative mechanism
Make the page table reflect the # of physical pages (not virtual)
Use a hash mechanism:
virtual page number ==> hashed page number, an index into the inverted page table
Compare the virtual page number with the tag to make sure it is the one you want
if it matches, check that the page is in memory; if it is not, page fault
If it does not match - miss
go to the full page table on disk to get the new entry
this implies 2 disk accesses in the worst case
trades an increased worst-case penalty for a decrease in the capacity-induced miss rate, since there is now more room for real pages with the smaller page table (a lookup sketch follows)
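A minimal sketch of this lookup (the hash function, table size, and absence of collision chaining are simplifying assumptions):

```c
#include <stdint.h>

#define IPT_SIZE 4096   /* on the order of the number of physical frames */

typedef struct {
    uint32_t tag;       /* virtual page number resident in this frame */
    uint32_t frame;     /* physical frame number                      */
    int      valid;
} ipt_entry_t;

static ipt_entry_t ipt[IPT_SIZE];

/* Hash the VPN to an index, then compare the tag. A mismatch or an
   invalid entry means the page is not resident, so we fall back to
   the full page table (potentially on disk). */
int ipt_lookup(uint32_t vpn, uint32_t *frame_out) {
    uint32_t idx = (vpn * 2654435761u) % IPT_SIZE;   /* multiplicative hash */
    if (ipt[idx].valid && ipt[idx].tag == vpn) {
        *frame_out = ipt[idx].frame;
        return 1;   /* hit: page is in physical memory */
    }
    return 0;       /* miss: consult the full page table */
}
```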
Figure: Inverted page table. The page # of the virtual address is hashed to index the table; each entry stores a tag (virtual page #), a page frame, and a valid bit. Only entries for pages in physical memory are stored. On a tag match (=, OK), the frame is combined with the offset to form the physical address.
The translation process using page tables takes too long!
Use a cache to hold recent translations
Translation Lookaside Buffer
Typically 8-1024 entries
Block size same as a page table entry (1 or 2 words)
Only holds translations for pages in memory
1 cycle hit time
Highly or fully associative
Miss rate < 1%
Miss goes to main memory (where the whole page table lives)
Must be purged on a process switch
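A sketch of a fully associative TLB lookup consistent with these parameters (entry count and field names are illustrative; real hardware compares all tags in parallel rather than looping):

```c
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1)

typedef struct {
    uint32_t vpn;    /* tag: virtual page number */
    uint32_t ppn;    /* physical page number     */
    int      valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Fully associative lookup: check the VPN against every entry. */
int tlb_translate(uint32_t va, uint32_t *pa_out) {
    uint32_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa_out = (tlb[i].ppn << PAGE_SHIFT) | (va & PAGE_MASK);
            return 1;   /* TLB hit */
        }
    }
    return 0;           /* TLB miss: walk the page table, then refill */
}
```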
Q3: Block Replacement (pages in physical memory)
LRU is best
So use it to minimize the horrible miss penalty
However, real LRU is expensive
Page table contains a use (reference) bit
On access the use tag is set
OS checks them every so often, records what it sees, and resets them all
On a miss, the OS decides who has been used the least
Basic strategy: Miss penalty is so huge, you can spend a few OS cycles to help reduce the miss rate
Q4: Write Policy
Always write-back
Due to the access time of the disk
So, you need to keep tags to show when pages are dirty and need to be written back to disk when they’re swapped out.
Anything else is pretty silly
Remember – the disk is SLOW!
An architectural choice
Large pages are good:
reduces page table size
amortizes the long disk access
if spatial locality is good, then the hit rate will improve
Large pages are bad:
more internal fragmentation
if everything is random, each structure’s last page is (on average) only half full
half of bigger is still bigger
if there are 3 structures per process (text, heap, and control stack), then 1.5 pages are wasted per process
process start-up takes longer
since at least 1 page of each type is required prior to start
the transfer-time penalty is higher
The TLB must be on chip
otherwise it is worthless
small TLBs are worthless anyway; large TLBs are expensive
high associativity is likely
==> the price of CPUs is going up!
OK as long as performance goes up faster
Reasons for a larger page size
Page table size is inversely proportional to the page size; memory is therefore saved
A fast cache hit time is easy when the cache size < page size (virtually addressed caches); a bigger page keeps this feasible as cache size grows
Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
Reasons for a smaller page size
Want to avoid internal fragmentation: don’t waste storage (data must be contiguous within a page)
Quicker process start-up for small processes: don’t need to bring in more memory than needed
With multiprogramming, a computer is shared by several programs or processes running concurrently
Need to provide protection
Need to allow sharing
Mechanisms for providing protection:
Provide base and bound registers: Base ≤ Address ≤ Bound (see the sketch below)
Provide both user and supervisor (operating system) modes
Provide CPU state that the user can read, but cannot write
base and bounds registers, user/supervisor bit, exception bits
Provide a method to go from user to supervisor mode and vice versa
system call: user to supervisor
system return: supervisor to user
Provide permissions for each page or segment in memory
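A sketch of the base-and-bound check named in the list above; hardware performs this on every reference, shown here in C only for clarity:

```c
#include <stdint.h>

/* An access is legal only if Base <= Address <= Bound;
   otherwise the hardware traps to the operating system. */
int access_ok(uint32_t base, uint32_t bound, uint32_t address) {
    return base <= address && address <= bound;
}
```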
One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address
address size limits the size of virtual memory
difficult to change, since many components depend on it (e.g., PC, registers, effective-address calculations)
As program sizes increase, larger and larger address sizes are needed
8 bit: Intel 8080 (1975)
16 bit: Intel 8086 (1978)
24 bit: Intel 80286 (1982)
32 bit: Intel 80386 (1985)
64 bit: Intel Merced (1998)
Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
The large miss penalty of virtual memory leads to different strategies from caches:
fully associative placement, TLB + page table, LRU replacement, write-back
Designed as
paged: fixed-size blocks
segmented: variable-size blocks
hybrid: segmented paging or multiple page sizes
Avoid a small address size
Option                | TLB                      | L1 cache        | L2 cache        | VM (page)
Block size            | 4-8 bytes (1 PTE)        | 4-32 bytes      | 32-256 bytes    | 4k-16k bytes
Hit time              | 1 cycle                  | 1-2 cycles      | 6-15 cycles     | 10-100 cycles
Miss penalty          | 10-30 cycles             | 8-66 cycles     | 30-200 cycles   | 700k-6M cycles
Local miss rate       | .1-2%                    | .5-20%          | 13-15%          | .00001-.001%
Size                  | 32B-8KB                  | 1-128 KB        | 256KB-16MB      | --
Backing store         | L1 cache                 | L2 cache        | DRAM            | Disks
Q1: Block placement   | Fully or set associative | DM              | DM or SA        | Fully associative
Q2: Block ID          | Tag/block                | Tag/block       | Tag/block       | Table
Q3: Block replacement | Random (not last)        | N.A. for DM     | Random (if SA)  | LRU/LFU
Q4: Writes            | Flush on PTE write       | Through or back | Write-back      | Write-back