Virtual Memory

Announcements
• Prelim coming up in one week:
– In 203 Thurston, Thursday October 16th, 10:10–11:25pm (75 minutes)
– Topics: Everything up to (and including) Thursday, October 9th
• Lectures 1-13, chapters 1-9, and 13 (8th ed)
• Review Session will be this Thursday, October 9th
– Time and Location TBD: Possibly 6:30pm – 7:30pm
• Nazrul’s office hours changed for today
– 12:30pm – 2:30pm in Upson 328
• Homework 3 due today, October 7th
• CS 4410 Homework 2 graded (solutions available via CMS)
– Mean 45 (stddev 5), High 50 out of 50
– Common problems
• Q1: did not satisfy bounded waiting (mutual exclusion was not violated)
Homework #2, Question #1
• Dekker’s Algorithm (1965)
#include <stdbool.h>

/* Shared state: both flags start false; turn is 0 or 1. */
bool inside[2] = { false, false };
int  turn = 0;

void CSEnter(int i)
{
    int j = 1 - i;                /* index of the other process */
    inside[i] = true;             /* announce intent to enter */
    while (inside[j]) {           /* other process also wants in */
        if (turn == j) {          /* its turn has priority: back off */
            inside[i] = false;
            while (turn == j)
                continue;         /* busy-wait until our turn */
            inside[i] = true;     /* re-announce intent */
        }
    }
}

void CSExit(int i)
{
    turn = 1 - i;                 /* hand priority to the other process */
    inside[i] = false;
}
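As a usage sketch (not part of the original homework): two threads exercising the protocol. Note that Dekker's algorithm assumes sequentially consistent memory; on a modern multiprocessor, inside and turn would need to be C11 atomics (or fences added) for this to be reliable, so treat the demo as illustrative only.

#include <pthread.h>
#include <stdio.h>

extern void CSEnter(int i);   /* from the slide above */
extern void CSExit(int i);

static long counter = 0;      /* shared data guarded by the critical section */

static void *worker(void *arg)
{
    int i = (int)(long)arg;   /* process id: 0 or 1 */
    for (int n = 0; n < 100000; n++) {
        CSEnter(i);
        counter++;            /* protected update */
        CSExit(i);
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %ld (expect 200000)\n", counter);
    return 0;
}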
Review: Multi-level Translation
• Illusion of a contiguous address space
• Physical reality
– address space broken into segments or fixed-size pages
– Segments or pages spread throughout physical memory
• Could have any number of levels. Example (top segment):
[Figure: two-level translation. Virtual address = 10-bit virtual seg # | 10-bit virtual page # | 12-bit offset. The seg # selects one of eight (Base, Limit, Valid) registers (Base0-Base7 / Limit0-Limit7, V/N bits); the base points at a page table whose entries map the page # to a physical frame # with V/R/W permission bits (e.g., frame #0 V,R ... frame #5 V,R,W; frame #4 N). The limit check and the permission check each raise an Access Error; otherwise physical frame # | offset forms the physical address.]
• What must be saved/restored on context switch?
– Contents of top-level segment registers (for this example)
– Pointer to top-level table (page table)
Review: Two-Level Page Table
[Figure: virtual address = 10-bit P1 index | 10-bit P2 index | 12-bit offset. PageTablePtr selects the top-level table; its 4-byte entries point to second-level tables (4KB each, 1024 entries of 4 bytes), whose entries give the physical frame #; physical address = frame # | offset.]
• Tree of Page Tables
• Tables fixed size (1024 entries)
– On context-switch: save single
PageTablePtr register
• Sometimes, top-level page tables
called “directories” (Intel)
• Each entry called a (surprise!) Page
Table Entry (PTE)
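A minimal sketch (names illustrative, valid and permission bits ignored) of how a walker would split a 32-bit virtual address under this 10/10/12 layout:

#include <stdint.h>

/* 10-bit P1 index | 10-bit P2 index | 12-bit offset */
static uint32_t p1_index(uint32_t va)  { return (va >> 22) & 0x3FF; }
static uint32_t p2_index(uint32_t va)  { return (va >> 12) & 0x3FF; }
static uint32_t pg_offset(uint32_t va) { return va & 0xFFF; }

/* Hypothetical walk: assumes each top-level entry holds the address of a
 * second-level table, and each second-level entry holds a frame base. */
uint32_t translate(uint32_t **dir, uint32_t va)
{
    uint32_t *pt   = dir[p1_index(va)];   /* pick second-level page table */
    uint32_t frame = pt[p2_index(va)];    /* pick physical frame base */
    return frame | pg_offset(va);         /* physical address */
}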
What is in a PTE?
• What is in a Page Table Entry (or PTE)?
– Pointer to next-level page table or to actual page
– Permission bits: valid, read-only, read-write, execute-only
• Example: Intel x86 architecture PTE:
– Address split in the same format as the previous slide (10, 10, 12-bit offset)
– Intermediate page tables called “Directories”
31-12                      11-9    8   7   6   5   4     3     2   1   0
Page Frame Number          Free    0   L   D   A   PCD   PWT   U   W   P
(Physical Page Number)     (OS)
P: Present (same as “valid” bit in other architectures)
W: Writeable
U: User accessible
PWT: Page write transparent: external cache write-through
PCD: Page cache disabled (page cannot be cached)
A: Accessed: page has been accessed recently
D: Dirty (PTE only): page has been modified recently
L: L=14MB page (directory only).
Bottom 22 bits of virtual address serve as offset
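As a hedged sketch, the fields above can be decoded from a raw 32-bit (non-PAE) PTE with masks matching the slide's bit positions; the macro names here are mine, not Intel's:

#include <stdint.h>
#include <stdbool.h>

#define PTE_P   (1u << 0)   /* Present          */
#define PTE_W   (1u << 1)   /* Writeable        */
#define PTE_U   (1u << 2)   /* User accessible  */
#define PTE_PWT (1u << 3)   /* Write-through    */
#define PTE_PCD (1u << 4)   /* Cache disabled   */
#define PTE_A   (1u << 5)   /* Accessed         */
#define PTE_D   (1u << 6)   /* Dirty (PTE only) */
#define PTE_L   (1u << 7)   /* 4MB page (directory entry only) */

static uint32_t pte_frame(uint32_t pte)   { return pte & 0xFFFFF000u; }   /* bits 31-12 */
static uint32_t pte_os_bits(uint32_t pte) { return (pte >> 9) & 0x7; }    /* bits 11-9, free for OS */
static bool     pte_present(uint32_t pte) { return (pte & PTE_P) != 0; }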
Examples of how to use a PTE
• How do we use the PTE?
– Invalid PTE can imply different things:
• Region of address space is actually invalid or
• Page/directory is just somewhere other than memory (e.g., on disk)
– Validity checked first
• OS can use other (say) 31 bits for location info
• Usage Example: Demand Paging
– Keep only active pages in memory
– Place others on disk and mark their PTEs invalid
• Usage Example: Copy on Write
– UNIX fork gives copy of parent address space to child
• Address spaces disconnected after child created
– How to do this cheaply?
• Make copy of parent’s page tables (point at same memory)
• Mark entries in both sets of page tables as read-only
• A page fault on write creates two separate copies (one per process)
• Usage Example: Zero Fill On Demand
– New data pages must carry no information (i.e., be zeroed)
– Mark PTEs as invalid; page fault on use gets zeroed page
– Often, OS creates zeroed pages in background
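Tying the three examples together, below is a sketch of the decision logic a fault handler might apply; every helper name is hypothetical, standing in for real kernel facilities:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical tag the OS keeps in the PTE's free bits while P=0. */
enum os_kind { INVALID, ON_DISK, ZERO_FILL };

extern uint32_t     lookup_pte(uint32_t va);
extern enum os_kind pte_os_kind(uint32_t pte);
extern bool         pte_present(uint32_t pte);
extern bool         pte_read_only(uint32_t pte);
extern bool         is_cow_page(uint32_t va);
extern void         read_from_swap(uint32_t va);       /* demand paging */
extern void         map_zeroed_frame(uint32_t va);     /* zero fill on demand */
extern void         copy_frame_and_remap(uint32_t va); /* copy on write */
extern void         kill_process(const char *msg);

void handle_page_fault(uint32_t va, bool is_write)
{
    uint32_t pte = lookup_pte(va);

    if (!pte_present(pte)) {
        switch (pte_os_kind(pte)) {
        case ON_DISK:   read_from_swap(va);   break;  /* fetch page, set valid */
        case ZERO_FILL: map_zeroed_frame(va); break;  /* hand out a zeroed frame */
        default:        kill_process("segfault");     /* truly invalid address */
        }
    } else if (is_write && pte_read_only(pte) && is_cow_page(va)) {
        copy_frame_and_remap(va);  /* give the writer its own writable copy */
    }
}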
How is the translation accomplished?
[Figure: the CPU emits virtual addresses; the MMU converts them to physical addresses.]
• What, exactly, happens inside the MMU?
• One possibility: Hardware Tree Traversal
– For each virtual address, takes page table base pointer and traverses
the page table in hardware
– Generates a “Page Fault” if it encounters invalid PTE
• Fault handler will decide what to do
• More on this next lecture
– Pros: Relatively fast (but still many memory accesses!)
– Cons: Inflexible, Complex hardware
• Another possibility: Software
– Each traversal done in software
– Pros: Very flexible
– Cons: Every translation that misses the TLB must take a fault!
• In fact, need way to cache translations for either case!
Caching Concept
• Cache: a repository for copies that can be accessed more
quickly than the original
– Make frequent case fast and infrequent case less dominant
• Caching underlies many of the techniques that are used today
to make computers fast
– Can cache: memory locations, address translations, pages, file blocks,
file names, network routes, etc…
• Only good if:
– Frequent case frequent enough and
– Infrequent case not too expensive
• Important measure: Average Access time =
(Hit Rate x Hit Time) + (Miss Rate x Miss Time)
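For instance, with purely illustrative numbers (not from the slide): a 98% hit rate, 1 ns hit time, and 100 ns miss time give an average access time of 0.98 × 1 + 0.02 × 100 = 2.98 ns; the rare misses triple the effective latency.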
Why Bother with Caching?
[Figure: Processor-DRAM memory gap (latency), 1980-2000, log-scale performance. Processor performance ("Moore's Law", really Joy's Law) improves ~60%/yr (2x / 1.5 yrs); DRAM improves ~9%/yr (2x / 10 yrs); the processor-memory performance gap grows ~50% / year.]
Another Major Reason to Deal with Caching
[Figure: the multi-level segment/page translation datapath from the earlier slide, repeated: virtual seg # indexes Base/Limit registers, virtual page # indexes the page table, the limit and permission checks raise Access Errors, and physical page # | offset forms the physical address.]
• Too expensive to translate on every access
– At least two DRAM accesses per actual DRAM access
– Or: perhaps I/O if page table partially on disk!
• Even worse problem: What if we are using caching to make
memory access faster than DRAM access???
• Solution? Cache translations!
– Translation Cache: TLB (“Translation Lookaside Buffer”)
Why Does Caching Help? Locality!
[Figure: probability of reference plotted across the address space (0 to 2^n - 1): references cluster around a few recently used regions.]
• Temporal Locality (Locality in Time):
– Keep recently accessed data items closer to processor
• Spatial Locality (Locality in Space):
– Move contiguous blocks to the upper levels
[Figure: upper-level memory sits between the processor and lower-level memory; block X is held in the upper level, block Y is brought up from the lower level on demand.]
Review: Memory Hierarchy of a Modern Computer System
• Take advantage of the principle of locality to:
– Present as much memory as in the cheapest technology
– Provide access at speed offered by the fastest technology
[Figure: the memory hierarchy. Processor (control + datapath with registers) backed by on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (tape).]

Level                          Speed (ns)                 Size (bytes)
Registers                      1s                         100s
On-chip / second-level cache   10s-100s                   Ks-Ms
Main memory (DRAM)             100s                       Ms
Secondary storage (disk)       10,000,000s (10s ms)       Gs
Tertiary storage (tape)        10,000,000,000s (10s sec)  Ts
A Summary on Sources of Cache Misses
• Compulsory (cold start): first reference to a block
– “Cold” fact of life: not a whole lot you can do about it
– Note: when running billions of instructions, compulsory misses are
insignificant
• Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
• Conflict (collision):
– Multiple memory locations mapped to same cache location
– Solutions: increase cache size, or increase associativity
• Two others:
– Coherence (Invalidation): other process (e.g., I/O) updates memory
– Policy: Due to non-optimal replacement policy
Review: Where does a Block Get Placed in a Cache?
• Example: Block 12 placed in 8 block cache
[Figure: a 32-block address space (blocks 0-31) mapped into an 8-block cache (blocks 0-7).]
– Direct mapped: block 12 can go only into cache block 4 (12 mod 8)
– Set associative (2-way, sets 0-3): block 12 can go anywhere in set 0 (12 mod 4)
– Fully associative: block 12 can go anywhere
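The placement arithmetic in the figure is just modular indexing; a tiny illustrative computation:

#include <stdio.h>

int main(void)
{
    int block = 12;
    int cache_blocks = 8;   /* 8-block cache */
    int num_sets = 4;       /* 2-way set associative: 8 blocks / 2 ways */

    printf("direct mapped:   cache block %d\n", block % cache_blocks); /* 4 */
    printf("set associative: set %d\n", block % num_sets);             /* 0 */
    /* fully associative: any block; there is no index computation */
    return 0;
}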
Other Caching Questions
• What line gets replaced on cache miss?
– Easy for Direct Mapped: Only one possibility
– Set Associative or Fully Associative:
• Random
• LRU (Least Recently Used)
• What happens on a write?
– Write through: The information is written to both the cache and to
the block in the lower-level memory
– Write back: The information is written only to the block in the cache
• Modified cache block is written to main memory only when it is
replaced
• Question: is the block clean or dirty?
Caching Applied to Address Translation
[Figure: the CPU sends a virtual address to the TLB. Cached? Yes: the physical address goes straight to physical memory. No: the MMU translates, the TLB is filled, and the access proceeds. The data read or write itself then travels untranslated to physical memory.]
• Question is one of page locality: does it exist?
– Instruction accesses spend a lot of time on the same page (since
accesses sequential)
– Stack accesses have definite locality of reference
– Data accesses have less page locality, but still some…
• Can we have a TLB hierarchy?
– Sure: multiple levels at different sizes/speeds
What Actually Happens on a TLB Miss?
• Hardware traversed page tables:
– On TLB miss, hardware in MMU looks at current page table to fill TLB
(may walk multiple levels)
• If PTE valid, hardware fills TLB and processor never knows
• If PTE marked as invalid, causes a Page Fault, after which the kernel
decides what to do
• Software traversed Page tables (like MIPS)
– On TLB miss, processor receives TLB fault
– Kernel traverses page table to find PTE
• If PTE valid, fills TLB and returns from fault
• If PTE marked as invalid, internally calls Page Fault handler
• Most chip sets provide hardware traversal
– Modern operating systems tend to have more TLB faults since they use
translation for many things
– Examples:
• shared segments
• user-level portions of an operating system
Goals for Today
• Virtual memory
• How does it work?
– Page faults
– Resuming after page faults
• When to fetch?
• What to replace?
– Page replacement algorithms
• FIFO, OPT, LRU (Clock)
– Page Buffering
– Allocating Pages to processes
What is virtual memory?
• Each process has illusion of a large address space
– 2^32 bytes for 32-bit addressing
• However, physical memory is much smaller
• How do we give this illusion to multiple processes?
– Virtual Memory: some addresses reside on disk
[Figure: virtual memory (page 0 ... page N) maps through the page table to physical memory for resident pages and to disk for the rest.]
Virtual Memory
• Separates user logical memory from physical memory
– Only part of the program needs to be in memory for
execution
– Logical address space can therefore be much larger than
physical address space
– Allows address spaces to be shared by several
processes
– Allows for more efficient process creation
Virtual Memory
• Load entire process in memory (swapping), run it, exit
– Is slow (for big processes)
– Wasteful (might not require everything)
• Solutions: partial residency
– Paging: bring in individual pages, not the whole process
– Demand paging: bring only pages that are required
• Where to fetch page from?
– Have a contiguous space in disk: swap file (pagefile.sys)
How does VM work?
• Modify Page Tables with another bit (“valid”)
– If page in memory, valid = 1, else valid = 0
– If page is in memory, translation works as before
– If page is not in memory, translation causes a page fault
Page Table (entry holds frame # if valid, disk block # if not):
entry 0 →   32 : V=1   (in Mem)
entry 1 → 4183 : V=0   (on Disk)
entry 2 →  177 : V=1   (in Mem)
entry 3 → 5721 : V=0   (on Disk)
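A minimal sketch of this valid-bit check (types illustrative; page_fault is a hypothetical routine that fetches the page from disk and sets the bit):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  0xFFFu

struct pte {
    uint32_t loc;    /* frame # if valid, else disk block # */
    bool     valid;  /* true = in memory, false = on disk */
};

extern void page_fault(struct pte *e);   /* hypothetical: loads page, sets valid */

uint32_t translate(struct pte *table, uint32_t va)
{
    struct pte *e = &table[va >> PAGE_SHIFT];
    if (!e->valid)
        page_fault(e);                   /* bring the page in, then continue */
    return (e->loc << PAGE_SHIFT) | (va & PAGE_MASK);
}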
Page Faults
• On a page fault:
– OS finds a free frame, or evicts one from memory (which one?)
• Want knowledge of the future?
– Issues disk request to fetch data for page (what to fetch?)
• Just the requested page, or more?
– Block current process, context switch to new process (how?)
• Process might be executing an instruction
– When disk I/O completes, set the valid bit to 1 and put the faulting
process back on the ready queue
Steps in Handling a Page Fault
Resuming after a page fault
• Should be able to restart the instruction
• For RISC processors this is simple:
– Instructions are idempotent until references are done
• More complicated for CISC:
– E.g. move 256 bytes from one location to another
– Possible Solutions:
• Ensure pages are in memory before the instruction executes
Page Fault (Cont.)
• Restart is tricky for instructions with side effects:
– block move (may be partially complete when the fault hits)
– auto increment/decrement addressing (registers already updated)
When to fetch?
• Just before the page is used!
– Need to know the future
• Demand paging:
– Fetch a page when it faults
• Prepaging:
– Get the page on fault + some of its neighbors, or
– Get all pages in use last time process was swapped
Performance of Demand Paging
• Page Fault Rate: 0 ≤ p ≤ 1.0
– if p = 0 no page faults
– if p = 1, every reference is a fault
• Effective Access Time (EAT)
EAT = (1 – p) x memory access
+ p (page fault overhead
+ swap page out
+ swap page in
+ restart overhead
)
Demand Paging Example
• Memory access time = 200 nanoseconds
• Average page-fault service time = 8 milliseconds
• EAT = (1 – p) x 200 + p (8 milliseconds)
= (1 – p) x 200 + p x 8,000,000
= 200 + p x 7,999,800
• If one access out of 1,000 causes a page fault
EAT = 8.2 microseconds.
This is a slowdown by a factor of 40!!
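A standard follow-up (not on the original slide): to keep the slowdown below 10%, we need 220 > 200 + p × 7,999,800, i.e., p < 0.0000025, fewer than one page fault per 400,000 memory accesses.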
What to replace?
• What happens if there is no free frame?
– find some page in memory, but not really in use, swap it
out
• Page Replacement
– When a process has used up all the frames it is allowed to use
– OS must select a page to eject from memory to allow new
page
– The page to eject is selected using the Page Replacement
Algorithm
• Goal: Select page that minimizes future page faults
Page Replacement
• Prevent over-allocation of memory by modifying the page-fault service
routine to include page replacement
• Use modify (dirty) bit to reduce overhead of page
transfers – only modified pages are written to disk
• Page replacement completes separation between logical
memory and physical memory – large virtual memory
can be provided on a smaller physical memory
Page Replacement Algorithms
• Random: Pick any page to eject at random
– Used mainly for comparison
• FIFO: The page brought in earliest is evicted
– Ignores usage
– Suffers from “Belady’s Anomaly”
• Fault rate could increase when the number of frames increases
• E.g. 0 1 2 3 0 1 4 0 1 2 3 4 with 3 frames vs. 4 frames
• OPT: Belady’s algorithm
– Select page not used for longest time
• LRU: Evict page that hasn’t been used the longest
– Past could be a good predictor of the future
First-In-First-Out (FIFO) Algorithm
• Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• 3 frames (3 pages can be in memory at a time per
process): 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
frame 1: 1 → 4 → 5
frame 2: 2 → 1 → 3
frame 3: 3 → 2 → 4
9 page faults
• 4 frames: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
frame 1: 1 → 5 → 4
frame 2: 2 → 1 → 5
frame 3: 3 → 2
frame 4: 4 → 3
10 page faults
FIFO Illustrating Belady’s Anomaly
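A small simulation (illustrative, not from the lecture) reproduces both counts above: with FIFO on this reference string, going from 3 to 4 frames increases the fault count from 9 to 10.

#include <stdio.h>
#include <string.h>

/* Count FIFO page faults for a reference string; supports up to 8 frames. */
static int fifo_faults(const int *refs, int n, int nframes)
{
    int frames[8], head = 0, used = 0, faults = 0;
    memset(frames, -1, sizeof frames);
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (hit) continue;
        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];               /* fill a free frame */
        } else {
            frames[head] = refs[i];                 /* evict oldest page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {1,2,3,4,1,2,5,1,2,3,4,5};
    int n = sizeof refs / sizeof refs[0];
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3)); /* 9  */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4)); /* 10 */
    return 0;
}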
Optimal Algorithm
• Replace page that will not be used for longest period of
time
• 4 frames example
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
frame 1: 1 → 4
frame 2: 2
frame 3: 3
frame 4: 4 → 5
6 page faults
• How do you know this?
• Used for measuring how well your algorithm performs
Least Recently Used (LRU) Algorithm
• Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
frame 1: 1   1   1   1   5
frame 2: 2   2   2   2   2
frame 3: 3   5   5   4   4
frame 4: 4   4   3   3   3
(frame contents after successive faults; 8 page faults in total)
Implementing Perfect LRU
• On reference: Time stamp each page
• On eviction: Scan for oldest frame
• Problems:
– Large page lists
– Timestamps are costly
• Approximate LRU
– LRU is already an approximation!
[Figure: two instructions (0xffdcd: add r1,r2,r3; 0xffdd0: ld r1, 0(sp)) referencing pages at times 13 and 14; each resident page carries a timestamp of last use (t=4, t=5, t=14, t=14), and eviction scans for the oldest.]
LRU: Clock Algorithm
• Each page has a reference bit
– Set on use, reset periodically by the OS
• Algorithm:
– FIFO + reference bit (keep pages in circular list)
• Scan: if ref bit is 1, set to 0, and proceed. If ref bit is 0, stop and evict.
• Problem:
– Low accuracy for large memory
[Figure: frames arranged in a circular list, each with a reference bit (R=0 or R=1); the clock hand sweeps the circle, clearing R=1 bits until it reaches a frame with R=0.]
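A minimal sketch of one clock sweep over a fixed table of frames (the structure is illustrative, not from the lecture):

#include <stdbool.h>

#define NFRAMES 64

struct frame {
    bool ref;    /* reference bit, set by hardware on use */
    int  page;   /* which page occupies this frame */
};

static struct frame frames[NFRAMES];
static int hand = 0;   /* clock hand position */

/* Advance the hand until a frame with ref == 0 is found; clear ref bits
 * along the way (second chance). Returns the victim frame index. */
int clock_evict(void)
{
    for (;;) {
        if (frames[hand].ref) {
            frames[hand].ref = false;        /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        } else {
            int victim = hand;               /* not used since last sweep */
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
    }
}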
LRU with large memory
• Solution: Add another hand
– Leading edge clears ref bits
– Trailing edge evicts pages with ref bit 0
• What if the angle between the hands is small?
• What if the angle is big?
[Figure: the same circular list of frames with reference bits, now with two hands: the leading hand clears reference bits; the trailing hand evicts frames whose bit is still 0 when it arrives.]
Clock Algorithm: Discussion
• Sensitive to sweeping interval
– Fast: lose usage information
– Slow: all pages look used
• Clock: add reference bits
– Could use (ref bit, modified bit) as ordered pair
– Might have to scan all pages
• LFU: Remove page with lowest count
– Does not track when the page was referenced
– Fix: use multiple bits per page, shifted right by 1 at regular intervals
(aging; see the sketch after this list)
• MFU: remove the most frequently used page
• LFU and MFU do not approximate OPT well
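The shift-right scheme above is usually called aging; a hedged sketch of one implementation:

#include <stdint.h>
#include <stdbool.h>

#define NPAGES 1024

static uint8_t age[NPAGES];   /* 8-bit aging counter per page */
static bool    ref[NPAGES];   /* reference bit, set by hardware on use */

/* Called at a regular interval: shift counters right and fold in the
 * current reference bit as the new high-order bit. */
void age_tick(void)
{
    for (int p = 0; p < NPAGES; p++) {
        age[p] = (uint8_t)((age[p] >> 1) | (ref[p] ? 0x80 : 0));
        ref[p] = false;
    }
}

/* Victim = page with the smallest counter (least recently used, roughly). */
int pick_victim(void)
{
    int victim = 0;
    for (int p = 1; p < NPAGES; p++)
        if (age[p] < age[victim]) victim = p;
    return victim;
}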
Page Buffering
• Cute, simple trick (Windows XP, Windows 2000, Mach, VMS):
– Keep a list of free pages
– Track which page the free page corresponds to
– Periodically write modified pages, and reset modified bit
[Figure: on eviction, used pages go either to a modified list (written to disk in batches, for speed) or directly to an unmodified free list; new frames are allocated from the free list, and a faulting page can be reclaimed from either list before it is reused.]
Allocating Pages to Processes
• Global replacement
– Single memory pool for entire system
– On page fault, evict oldest page in the system
– Problem: protection
• Local (per-process) replacement
– Have a separate pool of pages for each process
– Page fault in one process can only replace pages from its own
process
– Problem: might have idle resources
Allocation of Frames
• Each process needs a minimum number of pages
• Example: IBM 370 – 6 pages to handle SS MOVE
instruction:
– instruction is 6 bytes, might span 2 pages
– 2 pages to handle from
– 2 pages to handle to
• Two major allocation schemes
– fixed allocation
– priority allocation
Summary
• Demand Paging:
– Treat memory as a cache for disk
– Cache miss: get page from disk
• Transparent Level of Indirection
– User program is unaware of activities of OS behind scenes
– Data can be moved without affecting application correctness
• Replacement policies
– FIFO: Place pages on a queue; replace the oldest page
– OPT: replace page that will be used farthest in future
– LRU: Replace page that hasn't been used for the longest time
• Clock Algorithm: Approximation to LRU
– Arrange all pages in circular list
– Sweep through them, marking as not “in use”
– If a page was not "in use" for one full pass, it can be replaced