Virtual Memory

Announcements
• TA swap:
  – Tyler Steele is a 415 TA
  – Rohit Doshi is a 414 TA
• My office hours today: 3:25-4:25

Review: Memory Hierarchy of a Modern Computer System
• Take advantage of the principle of locality to:
  – Present as much memory as is available in the cheapest technology
  – Provide access at the speed offered by the fastest technology
• The hierarchy (the processor itself also holds an on-chip cache between the registers/datapath and the second-level cache):

    Level                       Size (bytes)   Speed (ns)
    Registers                   100s           1s
    Second-level cache (SRAM)   Ks-Ms          10s-100s
    Main memory (DRAM)          Ms             100s
    Secondary storage (disk)    Gs             10,000,000s (10s ms)
    Tertiary storage (tape)     Ts             10,000,000,000s (10s sec)

Review: A Summary on Sources of Cache Misses
• Compulsory (cold start): first reference to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: when running billions of instructions, compulsory misses are insignificant
• Capacity:
  – The cache cannot contain all the blocks accessed by the program
  – Solution: increase the cache size
• Conflict (collision):
  – Multiple memory locations map to the same cache location
  – Solutions: increase the cache size, or increase associativity
• Two others:
  – Coherence (invalidation): another process (e.g., I/O) updates memory
  – Policy: due to a non-optimal replacement policy

Review: Where Does a Block Get Placed in a Cache?
• Example: block 12 of a 32-block address space placed in an 8-block cache
  – Direct mapped: block 12 can go only into block 4 (12 mod 8)
  – Set associative (2-way, sets 0-3): block 12 can go anywhere in set 0 (12 mod 4)
  – Fully associative: block 12 can go anywhere

Review: Other Caching Questions
• Which line gets replaced on a cache miss?
  – Easy for direct mapped: only one possibility
  – Set associative or fully associative:
    • Random
    • LRU (Least Recently Used)
• What happens on a write?
  – Write through: the information is written to both the cache and the block in lower-level memory
  – Write back: the information is written only to the block in the cache
    • A modified cache block is written to main memory only when it is replaced
    • Question: is the block clean or dirty?

Caching Applied to Address Translation
[Figure: the CPU issues a virtual address to the TLB; on a hit, the cached physical address goes straight to physical memory; on a miss, the MMU translates first. The data read or write itself is untranslated.]
• The question is one of page locality: does it exist?
  – Instruction accesses spend a lot of time on the same page (since accesses are sequential)
  – Stack accesses have definite locality of reference
  – Data accesses have less page locality, but still some
• Can we have a TLB hierarchy?
  – Sure: multiple levels at different sizes/speeds

What Actually Happens on a TLB Miss?
• Hardware-traversed page tables:
  – On a TLB miss, hardware in the MMU looks at the current page table to fill the TLB (it may walk multiple levels)
    • If the PTE is valid, the hardware fills the TLB and the processor never knows
    • If the PTE is marked invalid, a page fault is raised and the kernel decides what to do
• Software-traversed page tables (like MIPS; see the sketch below):
  – On a TLB miss, the processor receives a TLB fault
  – The kernel traverses the page table to find the PTE
    • If the PTE is valid, it fills the TLB and returns from the fault
    • If the PTE is marked invalid, it internally calls the page fault handler
• Most chipsets provide hardware traversal
  – Modern operating systems tend to have more TLB faults since they use translation for many things
  – Examples: shared segments, user-level portions of an operating system
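The software-traversal path above is easy to make concrete. Below is a minimal sketch of a MIPS-style software TLB miss handler, assuming a flat single-level page table; the names (pte_t, tlb_fill, page_fault_handler) and the PTE bit layout are hypothetical, not any particular kernel's API.

```c
/* Minimal sketch of a software-managed TLB miss handler, in the
 * style of a MIPS-like architecture.  All names (pte_t, tlb_fill,
 * page_fault_handler, the PTE bit layout) are assumptions; real
 * kernels differ in the details. */
#include <stdint.h>

#define PAGE_SHIFT 12                 /* assume 4 KB pages */
#define PTE_VALID  0x1u               /* assumed valid bit in the PTE */

typedef uint32_t pte_t;

extern pte_t *current_page_table;                /* one PTE per virtual page */
extern void tlb_fill(uint32_t vpn, pte_t pte);   /* hypothetical: load a TLB entry */
extern void page_fault_handler(uint32_t vaddr);  /* the slides' page-fault path */

/* Invoked by the TLB-fault trap: find the PTE and either refill the
 * TLB (valid) or fall through to the page fault handler (invalid). */
void tlb_miss_handler(uint32_t faulting_vaddr)
{
    uint32_t vpn = faulting_vaddr >> PAGE_SHIFT; /* virtual page number */
    pte_t pte = current_page_table[vpn];

    if (pte & PTE_VALID)
        tlb_fill(vpn, pte);                 /* refill and return from the fault */
    else
        page_fault_handler(faulting_vaddr); /* page not in memory */
}
```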
Goals for Today
• Virtual memory: how does it work?
  – Page faults
  – Resuming after page faults
• When to fetch?
• What to replace?
  – Page replacement algorithms: FIFO, OPT, LRU (Clock)
  – Page buffering
  – Allocating pages to processes

What Is Virtual Memory?
• Each process has the illusion of a large address space
  – 2^32 bytes for 32-bit addressing
• However, physical memory is much smaller
• How do we give this illusion to multiple processes?
  – Virtual memory: some addresses reside on disk
[Figure: a page table maps virtual pages 0 through N to physical frames; pages without a frame reside on disk.]

Virtual Memory
• Separates the user's logical memory from physical memory
  – Only part of the program needs to be in memory for execution
  – The logical address space can therefore be much larger than the physical address space
  – Allows address spaces to be shared by several processes
  – Allows for more efficient process creation

Virtual Memory
• Loading an entire process into memory (swapping), running it, and exiting is:
  – Slow (for big processes)
  – Wasteful (the process might not need everything)
• Solution: partial residency
  – Paging: bring in individual pages, not all pages of the process
  – Demand paging: bring in only the pages that are actually required
• Where to fetch a page from?
  – Keep a contiguous space on disk: the swap file (pagefile.sys)

How Does VM Work?
• Extend the page table entries with another bit ("valid")
  – If the page is in memory, valid = 1; otherwise valid = 0
  – If the page is in memory, translation works as before
  – If the page is not in memory, translation causes a page fault
[Figure: a four-entry page table; e.g., entry 0 maps to frame 32 with V=1 (in memory), entry 1 to disk block 4183 with V=0, entry 2 to frame 177 with V=1, entry 3 to disk block 5721 with V=0.]

Page Faults
• On a page fault, the OS:
  – Finds a free frame, or evicts a page from memory (which one? ideally we would want knowledge of the future)
  – Issues a disk request to fetch the data for the page (what to fetch? just the requested page, or more?)
  – Blocks the current process and context switches to another process (how? the process might be in the middle of an instruction)
  – When the disk completes, sets the valid bit to 1 and puts the faulting process back on the ready queue

Steps in Handling a Page Fault
[Figure: the standard sequence: the reference traps to the OS, the OS locates the page on the backing store, reads it into a free frame, updates the page table, and restarts the instruction.]

Resuming After a Page Fault
• Should be able to restart the instruction
• For RISC processors this is simple:
  – Instructions are idempotent until their memory references are done
• More complicated for CISC:
  – E.g., move 256 bytes from one location to another
  – Possible solution: ensure the pages are in memory before the instruction executes

Page Fault (Cont.)
• Other instructions that are hard to restart:
  – Block moves
  – Auto increment/decrement of a location

When to Fetch?
• Just before the page is used!
  – But that requires knowing the future
• Demand paging:
  – Fetch a page when it faults
• Prepaging:
  – Get the page on a fault plus some of its neighbors, or
  – Get all pages that were in use the last time the process was swapped out

Performance of Demand Paging
• Page fault rate p, with 0 <= p <= 1.0
  – If p = 0, no page faults
  – If p = 1, every reference is a fault
• Effective Access Time (EAT):
    EAT = (1 - p) x memory access time
        + p x (page fault overhead + swap page out + swap page in + restart overhead)

Demand Paging Example
• Memory access time = 200 nanoseconds
• Average page-fault service time = 8 milliseconds
• EAT = (1 - p) x 200 + p x 8,000,000
      = 200 + p x 7,999,800 nanoseconds
• If one access out of 1,000 causes a page fault, EAT = 8.2 microseconds
  – A slowdown by a factor of 40! (The sketch below works the numbers.)
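As a check on the arithmetic, here is a minimal sketch that evaluates the EAT formula with the example's values (200 ns memory access, 8 ms fault service, one fault per 1,000 accesses); the variable names are mine, not the slides'.

```c
/* Evaluate EAT = (1 - p) * t_mem + p * t_fault for the slides'
 * demand-paging example. */
#include <stdio.h>

int main(void)
{
    const double t_mem   = 200.0;       /* memory access time, ns */
    const double t_fault = 8000000.0;   /* page-fault service time, ns (8 ms) */

    double p   = 1.0 / 1000.0;          /* one fault per 1,000 accesses */
    double eat = (1.0 - p) * t_mem + p * t_fault;

    printf("EAT = %.1f ns (~%.1f us), slowdown = %.1fx\n",
           eat, eat / 1000.0, eat / t_mem);
    /* Prints: EAT = 8199.8 ns (~8.2 us), slowdown = 41.0x
     * (the slides round this to "a factor of 40"). */
    return 0;
}
```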
What to Replace?
• What happens if there is no free frame?
  – Find some page in memory that is not really in use and swap it out
• Page replacement:
  – Needed when a process has used up all the frames it is allowed to use
  – The OS must select a page to eject from memory to make room for the new page
  – The page to eject is chosen by the page replacement algorithm
• Goal: select the page that minimizes future page faults

Page Replacement
• Prevent over-allocation of memory by modifying the page-fault service routine to include page replacement
• Use the modify (dirty) bit to reduce the overhead of page transfers
  – Only modified pages are written back to disk
• Page replacement completes the separation between logical memory and physical memory
  – A large virtual memory can be provided on top of a smaller physical memory

Page Replacement
[Figure: replacement in action: swap the victim page out, mark its page table entry invalid, swap the desired page in, and update its entry.]

Page Replacement Algorithms
• Random: pick any page to eject at random
  – Used mainly for comparison
• FIFO: the page brought in earliest is evicted
  – Ignores usage
  – Suffers from Belady's anomaly: the fault rate can increase as the number of frames increases
  – E.g., the reference string 0 1 2 3 0 1 4 0 1 2 3 4 with 3 vs. 4 frames (see the FIFO simulation at the end of these notes)
• OPT (Belady's algorithm):
  – Evict the page that will not be used for the longest time
• LRU: evict the page that hasn't been used for the longest time
  – The past can be a good predictor of the future

First-In-First-Out (FIFO) Algorithm
• Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• 3 frames (3 pages can be in memory at a time per process): 9 page faults
• 4 frames: 10 page faults
  – One more frame, yet one more fault: Belady's anomaly

FIFO Illustrating Belady's Anomaly
[Figure: page faults plotted against the number of frames; under FIFO the curve is not monotonically decreasing.]

Optimal Algorithm
• Replace the page that will not be used for the longest period of time
• 4-frame example, reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5: 6 page faults
• How do you know the future? You can't
  – OPT is used as a yardstick for measuring how well your algorithm performs

Least Recently Used (LRU) Algorithm
• Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
• With 4 frames: 8 page faults, better than FIFO (10) but worse than OPT (6)

Implementing Perfect LRU
• On reference: time stamp each page
[Figure: each executed reference (add r1,r2,r3 at 0xffdcd, ld r1, 0(sp) at 0xffdd0) restamps its page; frames carry timestamps such as t=4, t=14, t=5.]
• On eviction: scan for the oldest frame
• Problems:
  – Large page lists
  – Timestamps are costly to maintain
• Approximate LRU instead
  – LRU is already an approximation!

LRU: Clock Algorithm
• Each page has a reference bit
  – Set on use, reset periodically by the OS
• Algorithm: FIFO + reference bit (keep pages in a circular list)
  – Scan: if the ref bit is 1, set it to 0 and proceed; if the ref bit is 0, stop and evict (see the sketch after the discussion below)
• Problem:
  – Low accuracy for large memories

LRU with Large Memory
• Solution: add another hand
  – The leading hand clears ref bits
  – The trailing hand evicts pages whose ref bit is still 0
• What if the angle between the hands is small?
• What if the angle is big?

Clock Algorithm: Discussion
• Sensitive to the sweeping interval
  – Too fast: lose usage information
  – Too slow: all pages look used
• Clock: add reference bits
  – Could use the (reference bit, modified bit) ordered pair
  – Might have to scan all pages
• LFU: remove the page with the lowest reference count
  – Keeps no track of when the page was referenced
  – Aging: use multiple bits and shift right by 1 at regular intervals
• MFU: remove the most frequently used page
• Neither LFU nor MFU approximates OPT well
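The one-handed clock described above fits in a few lines of code. This is a minimal sketch; the frame table layout and names (frame_t, clock_hand, clock_evict) are assumptions for illustration, not a real kernel's structures.

```c
/* Sketch of the one-handed clock (second-chance) algorithm: sweep a
 * circular list of frames, clearing ref bits until a frame with a
 * clear bit is found; that frame is the victim. */
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 64

typedef struct {
    int  page;        /* virtual page currently held by this frame */
    bool referenced;  /* ref bit: set (by hardware) on each use */
} frame_t;

static frame_t frames[NFRAMES];
static size_t  clock_hand = 0;   /* current position in the circular list */

/* Returns the index of the frame to evict. */
size_t clock_evict(void)
{
    for (;;) {
        frame_t *f = &frames[clock_hand];
        if (f->referenced) {
            f->referenced = false;               /* second chance: clear and move on */
            clock_hand = (clock_hand + 1) % NFRAMES;
        } else {
            size_t victim = clock_hand;          /* ref bit 0: evict this frame */
            clock_hand = (clock_hand + 1) % NFRAMES;
            return victim;
        }
    }
}
```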
Page Buffering
• Cute simple trick (XP, 2K, Mach, VMS):
  – Keep a list of free pages
  – Track which page each free frame still corresponds to
  – Periodically write modified pages out and reset the modified bit (batching the writes is what buys speed)
• On eviction, a dirty page goes onto the modified list and a clean page onto the free list; frames are reallocated from the free list

Allocating Pages to Processes
• Global replacement
  – A single memory pool for the entire system
  – On a page fault, evict the oldest page in the system
  – Problem: protection, since one process can take frames away from another
• Local (per-process) replacement
  – A separate pool of pages for each process
  – A page fault in one process can only replace pages of that same process
  – Problem: might leave resources idle

Allocation of Frames
• Each process needs a minimum number of frames
• Example: the IBM 370 needs 6 pages to handle the SS MOVE instruction:
  – The instruction is 6 bytes, so it might span 2 pages
  – 2 pages to handle the "from" operand
  – 2 pages to handle the "to" operand
• Two major allocation schemes:
  – Fixed allocation
  – Priority allocation

Summary
• Demand paging:
  – Treat memory as a cache for the disk
  – On a cache miss, get the page from disk
• Transparent level of indirection:
  – The user program is unaware of the OS activity behind the scenes
  – Data can be moved without affecting application correctness
• Replacement policies:
  – FIFO: place pages on a queue; replace the page at the end
  – OPT: replace the page that will be used farthest in the future
  – LRU: replace the page that hasn't been used for the longest time
• Clock algorithm: an approximation to LRU
  – Arrange all pages in a circular list
  – Sweep through them, marking them as not "in use"
  – If a page stays not "in use" for one full pass, it can be replaced
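To close, here is the FIFO simulation promised earlier: a small self-contained sketch (not part of the original slides) that replays the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 through FIFO replacement and reproduces Belady's anomaly, 9 faults with 3 frames but 10 with 4.

```c
/* Simulate FIFO page replacement on the slides' reference string
 * and count page faults for 3 and 4 frames. */
#include <stdio.h>

#define MAXFRAMES 8

/* Run FIFO over a reference string with `nframes` frames; return
 * the number of page faults. */
static int fifo_faults(const int *refs, int nrefs, int nframes)
{
    int frames[MAXFRAMES];
    int next = 0;      /* FIFO pointer: frame holding the oldest page */
    int used = 0;      /* frames filled so far */
    int faults = 0;

    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit) continue;            /* page already resident */

        faults++;
        if (used < nframes) {
            frames[used++] = refs[i]; /* a free frame is available */
        } else {
            frames[next] = refs[i];   /* evict the oldest page */
            next = (next + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    const int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    const int n = (int)(sizeof refs / sizeof refs[0]);

    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3)); /* prints 9  */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4)); /* prints 10 */
    return 0;
}
```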