Virtual Memory

Announcements
• Prelim coming up in one week:
  – In 203 Thurston, Thursday October 16th, 10:10-11:25pm, 1½ hours
  – Topics: Everything up to (and including) Thursday, October 9th
    • Lectures 1-13, chapters 1-9, and 13 (8th ed)
• Review Session will be this Thursday, October 9th
  – Time and Location TBD: Possibly 6:30pm – 7:30pm
• Nazrul’s office hours changed for today
  – 12:30pm - 2:30pm in Upson 328
• Homework 3 due today, October 7th
• CS 4410 Homework 2 graded. (Solutions available via CMS.)
  – Mean 45 (stddev 5), High 50 out of 50
  – Common problems
    • Q1: did not satisfy bounded waiting; mutual exclusion was not violated

Homework #2, Question #1
• Dekker’s Algorithm (1965)

    // J is taken to be the index of the other process (J = 1 - i)
    CSEnter(int i) {
        inside[i] = true;
        while (inside[J]) {
            if (turn == J) {
                inside[i] = false;
                while (turn == J) continue;
                inside[i] = true;
            }
        }
    }

    CSExit(int i) {
        turn = J;
        inside[i] = false;
    }

Review: Multi-level Translation
• Illusion of a contiguous address space
• Physical reality
  – Address space broken into segments or fixed-size pages
  – Segments or pages spread throughout physical memory
• Could have any number of levels. Example (top segment):
  [Figure: the virtual address is split into a 10-bit virtual segment #, a 10-bit virtual page #, and a 12-bit offset. The segment # selects one of eight base/limit pairs (Base0-Base7, Limit0-Limit7), each marked valid (V) or not (N); exceeding the limit raises an access error. The selected page table maps the virtual page # to a frame with permission bits (V, R, W); a failed permission check raises an access error. The physical address is the physical frame # plus the offset.]
• What must be saved/restored on context switch?
  – Contents of top-level segment registers (for this example)
  – Pointer to top-level table (page table)

Review: Two-Level Page Table
  [Figure: the virtual address is split into a 10-bit P1 index, a 10-bit P2 index, and a 12-bit offset. PageTablePtr points to the top-level table; each entry is 4 bytes and each page is 4KB. The physical address is the frame # plus the offset.]
• Tree of Page Tables
• Tables fixed size (1024 entries)
  – On context-switch: save single PageTablePtr register
• Sometimes, top-level page tables called “directories” (Intel)
• Each entry called a (surprise!) Page Table Entry (PTE)

What is in a PTE?
• What is in a Page Table Entry (or PTE)?
  – Pointer to next-level page table or to actual page
  – Permission bits: valid, read-only, read-write, execute-only
• Example: Intel x86 architecture PTE:
  – Address has the same format as on the previous slide (10, 10, 12-bit offset)
  – Intermediate page tables called “Directories”

    Bits    Field
    31-12   Page Frame Number (Physical Page Number)
    11-9    Free (OS)
    8       0
    7       L
    6       D
    5       A
    4       PCD
    3       PWT
    2       U
    1       W
    0       P

    P:   Present (same as “valid” bit in other architectures)
    W:   Writeable
    U:   User accessible
    PWT: Page-level write-through: external cache write-through
    PCD: Page cache disabled (page cannot be cached)
    A:   Accessed: page has been accessed recently
    D:   Dirty (PTE only): page has been modified recently
    L:   L=1 means 4MB page (directory only). Bottom 22 bits of virtual address serve as offset

Examples of how to use a PTE
• How do we use the PTE?
  – Invalid PTE can imply different things:
    • Region of address space is actually invalid, or
    • Page/directory is just somewhere other than memory
  – Validity checked first
    • OS can use other (say) 31 bits for location info
• Usage Example: Demand Paging
  – Keep only active pages in memory
  – Place others on disk and mark their PTEs invalid
• Usage Example: Copy on Write
  – UNIX fork gives copy of parent address space to child
    • Address spaces disconnected after child created
  – How to do this cheaply?
    • Make copy of parent’s page tables (point at same memory)
    • Mark entries in both sets of page tables as read-only
    • Page fault on write creates two copies
• Usage Example: Zero Fill On Demand
  – New data pages must carry no information (say, be zeroed)
  – Mark PTEs as invalid; page fault on use gets zeroed page
  – Often, OS creates zeroed pages in background

How is the translation accomplished?
  [Figure: the CPU sends virtual addresses to the MMU, which emits physical addresses.]
• What exactly happens inside the MMU?
• One possibility: Hardware Tree Traversal
  – For each virtual address, takes page table base pointer and traverses the page table in hardware
  – Generates a “Page Fault” if it encounters invalid PTE
    • Fault handler will decide what to do
    • More on this next lecture
  – Pros: Relatively fast (but still many memory accesses!)
  – Cons: Inflexible, complex hardware
• Another possibility: Software
  – Each traversal done in software
  – Pros: Very flexible
  – Cons: Every translation must invoke Fault!
• In fact, need way to cache translations for either case!

Caching Concept
• Cache: a repository for copies that can be accessed more quickly than the original
  – Make frequent case fast and infrequent case less dominant
• Caching underlies many of the techniques that are used today to make computers fast
  – Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc.
• Only good if:
  – Frequent case frequent enough, and
  – Infrequent case not too expensive
• Important measure:
    Average Access Time = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)

Why Bother with Caching?
  [Figure: Processor-DRAM Memory Gap (latency). Performance (log scale, 1 to 1000) versus time, 1980-2000. CPU (µProc): 60%/yr (2X/1.5 yr), labeled “Moore’s Law” (really Joy’s Law). DRAM: 9%/yr (2X/10 yrs), labeled “Less’ Law?”. The processor-memory performance gap grows 50% per year.]

Another Major Reason to Deal with Caching
  [Figure: the segment-plus-page translation path from the Multi-level Translation review slide: virtual seg #, virtual page #, and offset pass through the base/limit table and the page table, with limit and permission checks, to produce physical page # plus offset.]
• Too expensive to translate on every access
  – At least two DRAM accesses per actual DRAM access
  – Or: perhaps I/O if page table partially on disk!
• Even worse problem: What if we are using caching to make memory access faster than DRAM access?
• Solution? Cache translations!
  – Translation Cache: TLB (“Translation Lookaside Buffer”)

Why Does Caching Help? Locality!
  [Figure: probability of reference plotted across the address space, from 0 to 2^n - 1.]
• Temporal Locality (Locality in Time):
  – Keep recently accessed data items closer to processor
• Spatial Locality (Locality in Space):
  – Move contiguous blocks to the upper levels
  [Figure: block X moves from upper-level memory to the processor; block Y moves up from lower-level memory.]

Review: Memory Hierarchy of a Modern Computer System
• Take advantage of the principle of locality to:
  – Present as much memory as is available in the cheapest technology
  – Provide access at the speed offered by the fastest technology

    Level                                        Speed (ns)                   Size (bytes)
    Registers (in processor, with datapath)      1s                           100s
    On-chip cache, second-level cache (SRAM)     10s-100s                     Ks-Ms
    Main memory (DRAM)                           100s                         Ms
    Secondary storage (disk)                     10,000,000s (10s ms)         Gs
    Tertiary storage (tape)                      10,000,000,000s (10s sec)    Ts

A Summary on Sources of Cache Misses
• Compulsory (cold start): first reference to a block
  – “Cold” fact of life: not a whole lot you can do about it
  – Note: When running “billions” of instructions, compulsory misses are insignificant
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Conflict (collision):
  – Multiple memory locations mapped to same cache location
  – Solutions: increase cache size, or increase associativity
• Two others:
  – Coherence (Invalidation): other process (e.g., I/O) updates memory
  – Policy: Due to non-optimal replacement policy

Review: Where does a Block Get Placed in a Cache?
• Example: Block 12 placed in an 8-block cache (32-block address space, blocks 0-31)
  – Direct mapped: block 12 can go only into block 4 (12 mod 8)
  – Set associative (4 sets, numbered 0-3): block 12 can go anywhere in set 0 (12 mod 4)
  – Fully associative: block 12 can go anywhere

Other Caching Questions
• What line gets replaced on cache miss?
  – Easy for Direct Mapped: Only one possibility
  – Set Associative or Fully Associative:
    • Random
    • LRU (Least Recently Used)
• What happens on a write?
  – Write through: The information is written to both the cache and to the block in the lower-level memory
  – Write back: The information is written only to the block in the cache
    • Modified cache block is written to main memory only when it is replaced
    • Question: is the block clean or dirty?

Caching Applied to Address Translation
  [Figure: the CPU presents a virtual address to the TLB. If the translation is cached, the physical address goes directly to physical memory; if not, the MMU translates first. Data reads and writes are untranslated.]
• Question is one of page locality: does it exist?
  – Instruction accesses spend a lot of time on the same page (since accesses are sequential)
  – Stack accesses have definite locality of reference
  – Data accesses have less page locality, but still some
• Can we have a TLB hierarchy?
  – Sure: multiple levels at different sizes/speeds

What Actually Happens on a TLB Miss?
• Hardware traversed page tables:
  – On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    • If PTE valid, hardware fills TLB and processor never knows
    • If PTE marked as invalid, causes Page Fault, after which kernel decides what to do
• Software traversed page tables (like MIPS)
  – On TLB miss, processor receives TLB fault
  – Kernel traverses page table to find PTE
    • If PTE valid, fills TLB and returns from fault
    • If PTE marked as invalid, internally calls Page Fault handler
• Most chip sets provide hardware traversal
  – Modern operating systems tend to have more TLB faults since they use translation for many things
  – Examples:
    • shared segments
    • user-level portions of an operating system

Goals for Today
• Virtual memory
• How does it work?
  – Page faults
  – Resuming after page faults
• When to fetch?
• What to replace?
  – Page replacement algorithms
    • FIFO, OPT, LRU (Clock)
  – Page Buffering
  – Allocating Pages to processes

What is virtual memory?
• Each process has illusion of large address space
  – 2^32 for 32-bit addressing
• However, physical memory is much smaller
• How do we give this illusion to multiple processes?
  – Virtual Memory: some addresses reside on disk
  [Figure: virtual memory pages 0 through N are mapped through the page table to physical memory or to disk.]

Virtual Memory
• Separates user logical memory from physical memory
  – Only part of the program needs to be in memory for execution
  – Logical address space can therefore be much larger than physical address space
  – Allows address spaces to be shared by several processes
  – Allows for more efficient process creation

Virtual Memory
• Load entire process in memory (swapping), run it, exit
  – Is slow (for big processes)
  – Wasteful (might not require everything)
• Solutions: partial residency
  – Paging: only bring in pages, not all pages of process
  – Demand paging: bring only pages that are required
• Where to fetch page from?
  – Have a contiguous space on disk: swap file (pagefile.sys)

How does VM work?
• Modify Page Tables with another bit (“valid”)
  – If page in memory, valid = 1, else valid = 0
  – If page is in memory, translation works as before
  – If page is not in memory, translation causes a page fault

    Page    Entry    Valid
    0       32       V=1
    1       4183     V=0
    2       177      V=1
    3       5721     V=0

  [Figure: valid page table entries point into memory; invalid entries refer to disk.]

Page Faults
• On a page fault:
  – OS finds a free frame, or evicts one from memory (which one?)
    • Want knowledge of the future?
  – Issues disk request to fetch data for page (what to fetch?)
    • Just the requested page, or more?
  – Block current process, context switch to new process (how?)
    • Process might be executing an instruction
  – When disk completes, set valid bit to 1, and put current process in ready queue

Steps in Handling a Page Fault
  [Figure: steps in handling a page fault.]

Resuming after a page fault
• Should be able to restart the instruction
• For RISC processors this is simple:
  – Instructions are idempotent until references are done
• More complicated for CISC:
  – E.g. move 256 bytes from one location to another
  – Possible Solutions:
    • Ensure pages are in memory before the instruction executes

Page Fault (Cont.)
• Restart instruction
  – block move
  – auto increment/decrement location

When to fetch?
• Just before the page is used!
  – Need to know the future
• Demand paging:
  – Fetch a page when it faults
• Prepaging:
  – Get the page on fault + some of its neighbors, or
  – Get all pages in use last time process was swapped

Performance of Demand Paging
• Page Fault Rate: 0 ≤ p ≤ 1.0
  – if p = 0, no page faults
  – if p = 1, every reference is a fault
• Effective Access Time (EAT)
    EAT = (1 – p) x memory access + p x (page fault overhead + swap page out + swap page in + restart overhead)

Demand Paging Example
• Memory access time = 200 nanoseconds
• Average page-fault service time = 8 milliseconds
• EAT (in nanoseconds) = (1 – p) x 200 + p x (8 milliseconds)
                       = (1 – p) x 200 + p x 8,000,000
                       = 200 + p x 7,999,800
• If one access out of 1,000 causes a page fault (p = 0.001), EAT = 200 + 7,999.8 ns, or about 8.2 microseconds. This is a slowdown by a factor of about 40!

What to replace?
• What happens if there is no free frame?
  – Find some page that is in memory but not really in use, and swap it out
• Page Replacement
  – When process has used up all frames it is allowed to use
  – OS must select a page to eject from memory to allow new page
  – The page to eject is selected using the Page Replacement Algorithm
• Goal: Select page that minimizes future page faults

Page Replacement
• Prevent over-allocation of memory by modifying page-fault service routine to include page replacement
• Use modify (dirty) bit to reduce overhead of page transfers
  – Only modified pages are written to disk
• Page replacement completes separation between logical memory and physical memory
  – Large virtual memory can be provided on a smaller physical memory

Page Replacement
  [Figure: page replacement.]

Page Replacement Algorithms
• Random: Pick any page to eject at random
  – Used mainly for comparison
• FIFO: The page brought in earliest is evicted
  – Ignores usage
  – Suffers from “Belady’s Anomaly”
    • Fault rate could increase on increasing the number of frames
    • E.g. 0 1 2 3 0 1 4 0 1 2 3 4 with 3 and 4 frames
• OPT: Belady’s algorithm
  – Select the page that will not be used for the longest time
• LRU: Evict page that hasn’t been used the longest
  – Past could be a good predictor of the future

First-In-First-Out (FIFO) Algorithm
• Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• 3 frames (3 pages can be in memory at a time per process): 9 page faults

    Frame    Pages held, in order
    1        1, 4, 5
    2        2, 1, 3
    3        3, 2, 4

• 4 frames: 10 page faults

    Frame    Pages held, in order
    1        1, 5, 4
    2        2, 1, 5
    3        3, 2
    4        4, 3

FIFO Illustrating Belady’s Anomaly
  [Figure: FIFO illustrating Belady’s anomaly.]

Optimal Algorithm
• Replace page that will not be used for longest period of time
• 4 frames example, reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5: 6 page faults

    Frame    Pages held, in order
    1        1, 4
    2        2
    3        3
    4        4, 5

• How do you know this?
• Used for measuring how well your algorithm performs

Least Recently Used (LRU) Algorithm
• Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5

    Frame    Contents after each replacement
    1        1, 1, 1, 1, 5
    2        2, 2, 2, 2, 2
    3        3, 5, 5, 4, 4
    4        4, 4, 3, 3, 3

Implementing Perfect LRU
• On reference: Time stamp each page
• On eviction: Scan for oldest frame
• Problems:
  – Large page lists
  – Timestamps are costly
• Approximate LRU
  – LRU is already an approximation!
  [Figure: instructions such as 0xffdcd: add r1,r2,r3 and 0xffdd0: ld r1, 0(sp) execute at times 13 and 14 and stamp the pages they touch; the pages shown carry timestamps t=4, t=14, t=14, t=5.]

LRU: Clock Algorithm
• Each page has a reference bit
  – Set on use, reset periodically by the OS
• Algorithm:
  – FIFO + reference bit (keep pages in circular list)
    • Scan: if ref bit is 1, set to 0, and proceed. If ref bit is 0, stop and evict.
• Problem:
  – Low accuracy for large memory
  [Figure: pages arranged in a circle, each labeled with its reference bit (R=1 or R=0), with a clock hand sweeping over them.]

LRU with large memory
• Solution: Add another hand
  – Leading edge clears ref bits
  – Trailing edge evicts pages with ref bit 0
• What if angle small?
• What if angle big?
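
The single-hand scan from the Clock Algorithm slide (if the ref bit is 1, set it to 0 and proceed; if it is 0, stop and evict) can be sketched as below. This is a minimal illustration under stated assumptions, not any particular kernel's implementation: the name `clock_faults`, the list-based frame table, and the choice to set the reference bit when a page is first loaded are all made up for the example. The periodic OS reset of reference bits and the second hand are left out.

```python
def clock_faults(refs, num_frames):
    """Count page faults under a single-hand clock (second-chance) policy."""
    frames = [None] * num_frames   # page currently held by each frame
    ref_bit = [0] * num_frames     # one reference bit per frame
    hand = 0                       # position of the clock hand
    faults = 0
    for page in refs:
        if page in frames:
            # Hit: real hardware would set the reference bit on use.
            ref_bit[frames.index(page)] = 1
            continue
        faults += 1
        # Sweep: clear set bits and move on; stop at an empty frame
        # or at the first frame whose reference bit is already 0.
        while frames[hand] is not None and ref_bit[hand] == 1:
            ref_bit[hand] = 0
            hand = (hand + 1) % num_frames
        frames[hand] = page        # evict (or fill) the frame under the hand
        ref_bit[hand] = 1          # assumption: a freshly loaded page counts as used
        hand = (hand + 1) % num_frames
    return faults


if __name__ == "__main__":
    refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print(clock_faults(refs, 3))
```

Under these assumptions the sketch reports 9 faults for the lecture's reference string with 3 frames. When every reference bit is set, the hand makes a full lap clearing bits and the policy degenerates to FIFO, which is the "might have to scan all pages" case.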
  [Figure: pages arranged in a circle, each labeled with its reference bit (R=1 or R=0), with two clock hands.]

Clock Algorithm: Discussion
• Sensitive to sweeping interval
  – Fast: lose usage information
  – Slow: all pages look used
• Clock: add reference bits
  – Could use (ref bit, modified bit) as ordered pair
  – Might have to scan all pages
• LFU: Remove page with lowest count
  – No track of when the page was referenced
  – Use multiple bits. Shift right by 1 at regular intervals.
• MFU: remove the most frequently used page
• LFU and MFU do not approximate OPT well

Page Buffering
• Cute simple trick: (XP, 2K, Mach, VMS)
  – Keep a list of free pages
  – Track which page the free page corresponds to
  – Periodically write modified pages, and reset modified bit
  [Figure: evicted pages are added to either the modified list (batch writes = speed) or the unmodified free list; used pages are taken from the free list.]

Allocating Pages to Processes
• Global replacement
  – Single memory pool for entire system
  – On page fault, evict oldest page in the system
  – Problem: protection
• Local (per-process) replacement
  – Have a separate pool of pages for each process
  – Page fault in one process can only replace pages from its own process
  – Problem: might have idle resources

Allocation of Frames
• Each process needs minimum number of pages
• Example: IBM 370
  – 6 pages to handle SS MOVE instruction:
    • instruction is 6 bytes, might span 2 pages
    • 2 pages to handle from
    • 2 pages to handle to
• Two major allocation schemes
  – fixed allocation
  – priority allocation

Summary
• Demand Paging:
  – Treat memory as cache for disk
  – On a cache miss, get page from disk
• Transparent Level of Indirection
  – User program is unaware of activities of OS behind scenes
  – Data can be moved without affecting application correctness
• Replacement policies
  – FIFO: Place pages on queue, replace page at end
  – OPT: Replace page that will be used farthest in future
  – LRU: Replace page that hasn’t been used for the longest time
• Clock Algorithm: Approximation to LRU
  – Arrange all pages in circular list
  – Sweep through them, marking as not “in use”
  – If page not “in use” for one pass, then can replace
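
The replacement policies in the summary can be compared with a small simulation. The sketch below counts page faults for FIFO, LRU, and OPT on the lecture's reference string. The function names and the use of Python's `deque` and `OrderedDict` are choices made for this illustration; a real kernel tracks frames very differently, and OPT in particular needs the whole future reference string, so it serves only as a yardstick.

```python
from collections import OrderedDict, deque


def fifo_faults(refs, num_frames):
    """FIFO: evict the page that was brought in earliest."""
    resident, queue, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue
        faults += 1
        if len(resident) == num_frames:
            resident.discard(queue.popleft())
        resident.add(page)
        queue.append(page)
    return faults


def lru_faults(refs, num_frames):
    """LRU: evict the page whose last use is furthest in the past."""
    resident, faults = OrderedDict(), 0   # ordered oldest use -> newest use
    for page in refs:
        if page in resident:
            resident.move_to_end(page)    # mark as most recently used
            continue
        faults += 1
        if len(resident) == num_frames:
            resident.popitem(last=False)  # drop least recently used
        resident[page] = True
    return faults


def opt_faults(refs, num_frames):
    """OPT: evict the page whose next use is furthest in the future."""
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == num_frames:
            future = refs[i + 1:]

            def next_use(p):
                # Pages never used again sort last, so they are evicted first.
                return future.index(p) if p in future else len(future)

            resident.discard(max(resident, key=next_use))
        resident.add(page)
    return faults


if __name__ == "__main__":
    refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    for frames in (3, 4):
        print(frames, fifo_faults(refs, frames),
              lru_faults(refs, frames), opt_faults(refs, frames))
```

With this reference string, FIFO gives 9 faults with 3 frames and 10 with 4 (Belady's anomaly), OPT gives 6 faults with 4 frames as on the Optimal Algorithm slide, and LRU gives 8 faults with 4 frames.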