Paging

Virtual Memory
• Segmentation – basic early approach
• Paging – modern approach
  – Advantages
    • Easy to allocate physical memory
    • Easy to "page out"/swap chunks of memory
  – Disadvantages
    • Overhead added to each memory reference
    • Additional memory is required to store page tables

Hardware and OS structures for paging
• Hardware
  – Page table base register
    • x86: CR3
  – TLB
• Software
  – Page table
    • Virtual->physical or virtual->disk mapping
  – Page frame database
    • One entry for each physical page
    • Information about the page – e.g. owning process, r/w permissions
  – Swap file / section list

Page Frame Database

    /* Each physical page in the system has a struct page associated with
     * it to keep track of whatever it is we are using the page for at the
     * moment. Note that we have no way to track which tasks are using
     * a page. */
    struct page {
        unsigned long flags;        // Atomic flags: locked, referenced, dirty, slab, disk
        atomic_t _count;            // Usage count
        atomic_t _mapcount;         // Count of PTEs mapping in this page
        struct {
            unsigned long private;  // Used for managing a page used in file I/O
            struct address_space *mapping; // Used to define the data this page is holding
        };
        pgoff_t index;              // Our offset within mapping
        struct list_head lru;       // Linked list node giving the LRU ordering of pages
        void *virtual;              // Kernel virtual address
    };

Multilevel page tables
• Page table with N levels
  – Virtual addresses split into N+1 parts
    • N indexes into the different levels
    • 1 offset into the page
• Example: 32-bit paging on x86
  – 4KB pages, 4 bytes/PTE
    • 12 bits in the offset: 2^12 = 4096
  – Want the page table entries for one level to fit in 1 page
    • 4KB / 4 bytes = 1024 PTEs per page
  – So each level index = 10 bits
    • 2^10 = 1024

2-level page tables
[Figure: a virtual address splits into a master page # (10 bits), a secondary page # (10 bits), and an offset (12 bits). The master page table entry selects a secondary page table; the secondary page table entry supplies the page frame number, which is combined with the offset to address physical memory. A C sketch of this walk appears at the end of this section.]

Inverted Page Table
• Previous examples: "forward page tables"
  – Page table size is relative to the size of virtual memory
  – Physical memory could be much less
    • Lots of wasted space
• Separate approach: use a hash table
  – Inverted page table
  – Size is independent of the virtual address space
  – Directly related to the size of physical memory
• Cons:
  – Have to manage a hash table (collisions, rebalancing, etc.)
[Figure: the virtual page # hashes into the table to produce a physical page #, which is combined with the offset. See the lookup sketch at the end of this section.]

Addressing Page Tables
• Where are page tables stored?
  – And in which address space?
• Possibility #1: physical memory
  – Easy to address, no translation required
  – But page tables must stay resident in memory
• Possibility #2: virtual memory (the OS VA space)
  – Cold (unused) page table pages can be swapped out
  – But page table addresses must be translated through page tables
    • Don't page out the outer page table page (called wiring)
• Question: can the kernel be paged?

Generic PTE
• PTE maps a virtual page to a physical page
• Includes page properties (shared with HW)
  – Valid? Writable? Dirty? Cacheable?
[Figure: a PTE = physical page number + property bits]
• Where is the virtual page number? (see the sketch at the end of this section)
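To make the 32-bit example concrete, here is a minimal C sketch of the two-level walk. The pde_t/pte_t types and field names are illustrative, not a real kernel API; in hardware the MMU performs this walk itself.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                        /* 4KB pages -> 12 offset bits  */

    typedef struct { uint32_t pfn; bool present; } pte_t;  /* illustrative PTE  */
    typedef struct { pte_t *next; bool present; } pde_t;   /* illustrative PDE  */

    /* Walk a two-level table: 10-bit master index, 10-bit secondary index,
     * 12-bit offset. Returns true and fills *pa on success. */
    static bool translate(pde_t *master, uint32_t va, uint32_t *pa)
    {
        uint32_t mi  = (va >> 22) & 0x3FF;       /* bits 31..22: master index    */
        uint32_t si  = (va >> 12) & 0x3FF;       /* bits 21..12: secondary index */
        uint32_t off = va & 0xFFF;               /* bits 11..0:  page offset     */

        if (!master[mi].present)                 /* no secondary table here      */
            return false;
        pte_t *secondary = master[mi].next;
        if (!secondary[si].present)              /* page not in physical memory  */
            return false;

        *pa = (secondary[si].pfn << PAGE_SHIFT) | off;
        return true;
    }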
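The inverted-table lookup can be sketched the same way: hash a (process, VPN) pair into a table with one entry per physical frame. The table size, hash function, and collision chaining below are all illustrative, not from any particular system.

    #include <stdint.h>

    #define NFRAMES 4096                /* one entry per physical frame     */

    struct ipt_entry {
        uint32_t pid, vpn;              /* who maps this frame, and where   */
        int      next;                  /* collision chain, -1 = end        */
    };

    static struct ipt_entry ipt[NFRAMES];
    static int bucket[NFRAMES];         /* hash -> first frame in chain     */
                                        /* (assume initialized to -1, and   */
                                        /* chains built when pages map in)  */

    /* Returns the physical frame number, or -1 on a miss (page fault). */
    static int ipt_lookup(uint32_t pid, uint32_t vpn)
    {
        uint32_t h = (pid ^ vpn) % NFRAMES;      /* toy hash function        */
        for (int f = bucket[h]; f != -1; f = ipt[f].next)
            if (ipt[f].pid == pid && ipt[f].vpn == vpn)
                return f;                        /* frame # is the PPN       */
        return -1;
    }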
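A generic PTE packs the physical page number and the property bits into a single word; the bit positions below are illustrative (the x86-specific ones appear in the next section). This also answers the closing question: the virtual page number is not stored in the PTE at all; the VPN is the index used to select the entry in the first place.

    #include <stdint.h>

    /* Illustrative generic 32-bit PTE: 20-bit PPN over 12 property bits. */
    #define PTE_VALID     (1u << 0)
    #define PTE_WRITABLE  (1u << 1)
    #define PTE_DIRTY     (1u << 2)
    #define PTE_CACHEABLE (1u << 3)

    static inline uint32_t pte_make(uint32_t ppn, uint32_t props)
    {
        return (ppn << 12) | props;     /* PPN lives in the high 20 bits */
    }

    static inline uint32_t pte_ppn(uint32_t pte)
    {
        return pte >> 12;               /* recover the physical page #   */
    }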
Acronyms
• PTE = Page Table Entry
• PDE = Page Directory Entry
• VA = Virtual Address
• PA = Physical Address
• VPN = Virtual Page Number
• PPN = Physical Page Number
• PFN = Page Frame Number (same as PPN)

x86 address translation (32 bit)
• Page tables organized as a 2-level tree
  – Efficiently handles sparse address spaces
• One set of page tables per process
  – Current page tables pointed to by CR3
• CPU "walks" the page tables to find translations
  – Accessed and Dirty bits updated by the CPU
• 32 bit: 4KB or 4MB pages
• 64 bit: 4 levels; 4KB or 2MB pages

x86 32-bit PDE/PTE details
• P: Present
• R/W: Read/Write
• U/S: User/System
• PWT: Write-through
• PCD: Cache disable
• AVL: Available for OS use
• A: Accessed
• D: Dirty
• PAT: Cache behavior definition
• G: Global
• If the page is not present (P = 0), then the other bits are available for the OS to use
  – Remember: useful for swapping
  – (the flag-bit positions are sketched at the end of this section)

Paging Translation: Making it efficient
• The original page table scheme doubled the cost of memory accesses
  – 1 page table access + 1 data access
• 2-level page tables triple the cost
  – 2 page table accesses + 1 data access
• 4-level page tables quintuple the cost
  – 4 page table accesses + 1 data access
• How to achieve efficiency
  – Goal: make virtual memory accesses as fast as physical memory accesses
  – Solution: use a hardware cache
    • Cache virtual-to-physical translations in hardware
    • Translation Lookaside Buffer (TLB)
• x86:
  – The TLB is managed by the CPU's MMU
  – 1 per CPU/core

TLBs
• Translation Lookaside Buffers
  – Translate virtual page #s into PTEs (NOT physical addresses)
  – Why? Can be done in a single machine cycle
• Implemented in hardware
  – Associative cache (many entries searched in parallel)
  – Cache tags are virtual page numbers
  – Cache values are PTEs
  – With PTE + offset, the MMU directly calculates the PA
  – (a software model of this lookup appears at the end of this section)
• TLBs rely on locality
  – Processes only use a handful of pages at a time
    • 16-48 entries in a TLB is typical (covering 64-192KB with 4KB pages)
    • Targets the "hot set" or "working set" of the process
  – TLB hit rates are critical for performance

Managing TLBs
• Address translations are mostly handled by the TLB
  – >99% hit rate, but there are occasional TLB misses
  – On a miss, who places translations into the TLB?
• Hardware (MMU)
  – Knows where the page tables are in memory (CR3)
    • The OS maintains them, the HW accesses them
  – Tables are set up in a HW-defined format
  – Example: x86
• Software-loaded TLB (OS)
  – A TLB miss faults to the OS; the OS finds the right PTE and loads it into the TLB
  – Must be fast
    • The CPU ISA has special TLB access instructions
    • The OS uses its own page table format
  – Examples: SPARC and IBM Power

Managing TLBs (2)
• The OS must ensure the TLB and page tables are consistent
  – If the OS changes a PTE, it must invalidate the cached PTE in the TLB
  – Explicit instruction to invalidate a PTE
    • x86: invlpg
• What happens on a context switch?
  – Each process has its own page table
  – The entire TLB must be invalidated (TLB flush)
  – x86: certain instructions automatically flush the entire TLB
    • Reloading CR3: asm volatile ("mov %0, %%cr3" :: "r"(page_table_addr));
• When the TLB misses, a new PTE is loaded and a cached PTE is evicted
  – Which PTE should be evicted?
    • TLB replacement policy
    • Defined and implemented in hardware (usually LRU)

x86 TLB
• TLB management is shared by the CPU and the OS
• CPU:
  – Fills the TLB on demand from the page tables
    • The OS is unaware of TLB misses
  – Evicts entries as needed
• OS:
  – Ensures the TLB and page tables are consistent
    • Flushes the entire TLB when page tables are switched (e.g. context switch)
      – asm volatile ("mov %0, %%cr3" :: "r"(page_table_addr));
    • Modifications to a single PTE are flushed explicitly
      – asm volatile ("invlpg (%0)" :: "r"(virtual_addr) : "memory");
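The PDE/PTE flags listed earlier sit in the low bits of each 32-bit entry, with the physical frame number in bits 12-31; the positions below are the architectural ones for a 4KB page entry. The swap-slot helper is just one hypothetical way an OS might use the bits freed up when P = 0.

    #include <stdint.h>

    /* x86 32-bit PTE flag bits (4KB page entry). Bits 9-11 are AVL,
     * free for the OS; bits 12-31 hold the physical frame number. */
    #define PTE_P    (1u << 0)   /* Present                        */
    #define PTE_RW   (1u << 1)   /* Read/Write                     */
    #define PTE_US   (1u << 2)   /* User/System                    */
    #define PTE_PWT  (1u << 3)   /* Write-through                  */
    #define PTE_PCD  (1u << 4)   /* Cache disable                  */
    #define PTE_A    (1u << 5)   /* Accessed (set by the CPU)      */
    #define PTE_D    (1u << 6)   /* Dirty (set by the CPU)         */
    #define PTE_PAT  (1u << 7)   /* Cache behavior definition      */
    #define PTE_G    (1u << 8)   /* Global                         */

    /* When P = 0 the CPU ignores bits 1-31, so the OS can stash a
     * swap location there -- the "useful for swapping" trick above.
     * This encoding is a hypothetical OS convention, not hardware. */
    static inline uint32_t pte_swap_slot(uint32_t pte)
    {
        return pte >> 1;
    }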
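The TLB lookup itself can be modeled in software, with the caveat that a real TLB compares all tags in parallel in a single cycle; the entry count and layout here are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 48              /* within the typical 16-48 range */

    struct tlb_entry {
        uint32_t vpn;                   /* cache tag: virtual page number */
        uint32_t pte;                   /* cache value: the PTE itself    */
        bool     valid;
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* On a hit the MMU combines the PTE's frame bits with the offset. */
    static bool tlb_lookup(uint32_t va, uint32_t *pa)
    {
        uint32_t vpn = va >> 12, off = va & 0xFFF;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *pa = (tlb[i].pte & ~0xFFFu) | off;  /* PPN bits | offset */
                return true;
            }
        }
        return false;                   /* miss: walk the page tables     */
    }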
Cool Paging Tricks
• Exploit the level of indirection between VA and PA
• Shared memory
  – Regions of two separate processes' address spaces map to the same physical memory
    • Read/write: access to shared data
    • Execute: shared libraries
  – Each process can have a separate PTE pointing to the same physical memory
    • Different access privileges for different processes
  – Does the shared region need to map to the same VA in each process?
• Copy-On-Write (COW)
  – Instead of copying physical memory on fork()
  – Just create a new set of identical page tables with writes disabled
    • When the child writes to a page, the OS gets a page fault
    • The OS copies the physical memory page and maps the new page into the child process
  – (a COW fault sketch appears at the end of this section)

Saving Memory to Disk
• On memory shortage:
  – The OS writes memory contents to disk and reuses the memory
  – Copying a whole process is called "swapping"
  – Copying a single page is called "paging"
• Where does the data go?
  – If it came from a file and was not modified: just deleted from memory
    • E.g. executable code
  – Unix swap partition
    • A partition (file, disk segment, or entire disk) reserved as a backing store
  – Windows swap file
    • A designated file stored in the regular file system
• When does data move?
  – Swapping: in advance of running a process
  – Paging: when a page of memory is accessed

Demand paging
• Moving pages between memory and disk
  – The OS uses main memory as a cache
  – Most of memory is used to store file data
    • Programs, libraries, data
    • File contents are cached in memory
  – Anonymous memory
    • Memory not used for file data
      – Heap, stack, globals, …
    • Backed by the swap file/partition
  – The OS manages the movement of pages to/from disk
    • Transparent to the application

Why is this "demand" paging?
• When a process first starts: fork()/exec()
  – Brand new page tables with no valid PTEs
  – No pages mapped to physical memory
  – As the process executes, memory is accessed
    • Instructions immediately fault on code and data pages
    • Faults stop once all the necessary code/data is in memory
      – Only the code/data that is needed
• The memory that is needed changes over time
  – Pages shift between disk and memory

Page faults
• What happens when a process references an evicted page?
  – When a page is evicted, the OS sets its PTE as invalid (present = 0)
    • Sets the rest of the PTE bits to indicate the page's location in the swap file
  – When the process accesses the page, the invalid PTE triggers a CPU exception (page fault)
  – The OS invokes the page fault handler
    • Checks the PTE and uses the high 31 bits to find the page on the swap disk
    • The handler reads the page from disk into an available frame
      – Possibly has to evict another page…
    • The handler restarts the process
• What if memory is full?
  – Another page must be evicted (page replacement algorithm)
  – (the handler is sketched at the end of this section)

Steps in Handling a page fault
[Figure: the fault sequence, from the trap on an invalid PTE through reading the page from the backing store, updating the page table, and restarting the instruction.]

Evicting the best page
• The OS must choose a victim page to be evicted
  – Goal: reduce the page fault rate
  – The best page to evict is one that will never be accessed again
    • Not really possible…
• Belady's proof: evicting the page that won't be used for the longest period of time minimizes the page fault rate

Belady's Algorithm
• Find the page that won't be used for the longest amount of time
  – Not possible
• So why is it here?
  – Provably optimal solution
  – Comparison point for practical algorithms
  – Upper bound on possible performance
• Lower bound?
  – Depends on the workload…
  – Random replacement is generally a bad idea
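The COW write fault described above might look like the following sketch. alloc_frame and frame_to_va are hypothetical helpers, and a real kernel would also handle reference counts and TLB invalidation.

    #include <stdint.h>
    #include <string.h>

    #define PTE_P  (1u << 0)
    #define PTE_RW (1u << 1)

    /* Hypothetical helpers: */
    uint32_t alloc_frame(void);         /* grab a free physical frame     */
    void    *frame_to_va(uint32_t f);   /* kernel mapping of that frame   */

    /* fork() left parent and child sharing one frame, writes disabled.
     * The first write traps here; the writer gets a private copy. */
    void handle_cow_fault(uint32_t *pte)
    {
        uint32_t old_frame = *pte >> 12;
        uint32_t new_frame = alloc_frame();

        memcpy(frame_to_va(new_frame), frame_to_va(old_frame), 4096);
        *pte = (new_frame << 12) | PTE_P | PTE_RW;
        /* Must also invlpg the VA so the stale read-only translation
         * leaves the TLB (see the x86 TLB slide above). */
    }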
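Putting the page-fault slide into code gives a sketch like this. Every helper below (pte_of, alloc_frame, evict_victim, swap_read, restart_faulting_instruction) is hypothetical, and the swap-slot encoding reuses the bits the CPU ignores when P = 0.

    #include <stdint.h>

    #define PTE_P    (1u << 0)
    #define PTE_RW   (1u << 1)
    #define NO_FRAME 0xFFFFFFFFu

    struct process;                     /* opaque here                     */

    /* Hypothetical helpers standing in for real kernel machinery: */
    uint32_t *pte_of(struct process *p, uint32_t va);
    uint32_t  alloc_frame(void);
    uint32_t  evict_victim(void);       /* runs the replacement algorithm  */
    void      swap_read(uint32_t slot, uint32_t frame);
    void      restart_faulting_instruction(struct process *p);

    void handle_page_fault(struct process *p, uint32_t fault_va)
    {
        uint32_t *pte  = pte_of(p, fault_va);
        uint32_t  slot = *pte >> 1;     /* swap location from the high     */
                                        /* 31 bits of the invalid PTE      */
        uint32_t frame = alloc_frame();
        if (frame == NO_FRAME)          /* memory full: evict another page */
            frame = evict_victim();

        swap_read(slot, frame);         /* read the page in from disk      */
        *pte = (frame << 12) | PTE_P | PTE_RW;
        restart_faulting_instruction(p);
    }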
FIFO
• Obvious and simple
  – When a page is brought in, it goes to the tail of the list
  – On eviction, take the head of the list
• Advantages
  – If it was brought in a while ago, it might not be used anymore...
• Disadvantages
  – ...or it's being used by everybody (glibc)
  – Does not measure access behavior at all
• FIFO suffers from Belady's Anomaly
  – The fault rate might increase when given more physical memory
    • A very bad property…
  – Exercise: develop a workload where this is true

Least Recently Used (LRU)
• Use access behavior during selection
  – Idea: use past behavior to predict future behavior
  – On replacement, evict the page that hasn't been used for the longest amount of time
    • LRU looks at the past; Belady's looks at the future
• Implementation
  – To be perfect, every access must be detected and timestamped (way too expensive)
    • So it must be approximated

Approximating LRU
• Many approximations, all using a PTE flag
  – x86: the Accessed bit, set by HW on every access
  – Each page has a counter (where?)
  – Periodically, scan the entire list of pages
    • If accessed = 0, increment the counter (not used)
    • If accessed = 1, clear the counter (used)
    • Clear the accessed flag in the PTE
  – The counter then holds the # of scans since the last reference
    • The page with the largest counter is the least recently used
  – (this scan is sketched at the end of this section)
• Some CPUs don't have PTE flags
  – Can simulate them by forcing page faults with invalid PTEs

LRU Clock
• Not Recently Used (NRU), or Second Chance
  – Replace a page that is "old enough"
• Arrange pages in a circular list (like a clock)
  – The clock hand (a pointer) sweeps through the list
    • If accessed = 0, the page was not used recently, so evict it
    • If accessed = 1, the page was recently used
      – Set accessed to 0, go to the next PTE
• Recommended for Project 3 (see the sketch at the end of this section)
• Problem:
  – If memory is large, the "accuracy" of the information degrades

N-Chance clock
• The basic clock has only two states – used or not used
• Can we add more?
  – Answer: embed a counter into the PTE
• The clock hand (a pointer) sweeps through the list
  – If accessed = 0, not used recently
    • If counter = 0, evict
    • Else, decrement the counter and go to the next PTE
  – If accessed = 1, recently used
    • Set accessed to 0
    • Increment the counter
    • Go to the next PTE
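The periodic LRU-approximation scan fits in a few lines of C. The frame_pte array mapping frames to their PTEs is an assumption (one possible answer to the slide's "where?" question), as is the frame count.

    #include <stdint.h>

    #define NFRAMES 4096
    #define PTE_A   (1u << 5)            /* x86 accessed bit, set by HW   */

    static uint32_t *frame_pte[NFRAMES]; /* PTE mapping each frame        */
    static uint32_t  age[NFRAMES];       /* scans since last reference    */

    /* Pages untouched since the last scan age by one; touched pages
     * reset. The largest age marks the least recently used page. */
    static void lru_scan(void)
    {
        for (int i = 0; i < NFRAMES; i++) {
            if (*frame_pte[i] & PTE_A) {
                age[i] = 0;              /* used since the last sweep     */
                *frame_pte[i] &= ~PTE_A; /* re-arm the accessed bit       */
            } else {
                age[i]++;                /* one more idle interval        */
            }
        }
    }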
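And the second-chance clock itself, in the same illustrative setup: a recently used page has its bit cleared and is skipped, while an untouched page becomes the victim.

    #include <stdint.h>

    #define NFRAMES 4096
    #define PTE_A   (1u << 5)

    static uint32_t *frame_pte[NFRAMES]; /* the circular list of pages    */
    static int hand;                     /* the clock hand                */

    /* Sweeps at most two full circles: after one pass every accessed
     * bit is clear, so the next page visited must be evictable. */
    static int clock_evict(void)
    {
        for (;;) {
            uint32_t *pte = frame_pte[hand];
            int       cur = hand;
            hand = (hand + 1) % NFRAMES;

            if (*pte & PTE_A)
                *pte &= ~PTE_A;          /* recently used: second chance  */
            else
                return cur;              /* not used since last sweep     */
        }
    }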