Virtual Memory and Address Translation Virtual Addressing virtual memory (big) User processes address memory through virtual addresses. text data physical memory (small) The kernel controls the virtual-physical translations in effect for each space. BSS The kernel and the machine collude to translate virtual addresses to physical addresses. user stack args/env kernel virtual-to-physical translations The machine does not allow a user process to access memory unless the kernel “says it’s OK”. The specific mechanisms for memory management and address translation are machine-dependent. What’s in an Object File or Executable? Header “magic number” indicates type of image. Section table an array of (offset, len, startVA) program sections Used by linker; may be removed after final link step and strip. header text program instructions p data idata immutable data (constants) “hello\n” wdata writable global/static data j, s symbol table relocation records j, s ,p,sbuf int j = 327; char* s = “hello\n”; char sbuf[512]; int p() { int k = 0; j = write(1, s, 6); return(j); } The Program and the Process VAS Process text segment is initialized directly from program text section. sections Process data segment(s) are initialized from idata and wdata sections. Text and idata segments may be write-protected. BSS “Block Started by Symbol” (uninitialized global data) e.g., heap and sbuf go here. header text text data idata data wdata symbol table relocation records program BSS user stack args/env kernel process VAS Process stack and BSS (e.g., heap) segment(s) are zero-filled. segments Process BSS segment may be expanded at runtime with a system call (e.g., Unix sbrk) called by the heap manager routines. Args/env strings copied in by kernel when the process is created. Virtual Address Translation 29 Example: typical 32-bit architecture with 8KB pages. 00 Virtual address translation maps a virtual page number (VPN) to a physical page frame number (PFN): the rest is easy. virtual address VPN 0 13 offset address translation Deliver exception to OS if translation is not valid and accessible in requested mode. physical address { PFN + offset Role of MMU Hardware and OS VM address translation must be very cheap (on average). • Every instruction includes one or two memory references. (including the reference to the instruction itself) VM translation is supported in hardware by a Memory Management Unit or MMU. • The addressing model is defined by the CPU architecture. • The MMU itself is an integral part of the CPU. The role of the OS is to install the virtual-physical mapping and intervene if the MMU reports that it cannot complete the translation. The OS Directs the MMU The OS controls the operation of the MMU to select: (1) the subset of possible virtual addresses that are valid for each process (the process virtual address space); (2) the physical translations for those virtual addresses; (3) the modes of permissible access to those virtual addresses; read/write/execute (4) the specific set of translations in effect at any instant. need rapid context switch from one address space to another MMU completes a reference only if the OS “says it’s OK”. MMU raises an exception if the reference is “not OK”. The Translation Lookaside Buffer (TLB) An on-chip hardware translation buffer (TB or TLB) caches recently used virtual-physical translations (ptes). Alpha 21164: 48-entry fully associative TLB. A CPU pipeline stage probes the TLB to complete over 99% of address translations in a single cycle. Like other memory system caches, replacement of TLB entries is simple and controlled by hardware. e.g., Not Last Used If a translation misses in the TLB, the entry must be fetched by accessing the page table(s) in memory. cost: 10-500 cycles Care and Feeding of TLBs The OS kernel carries out its memory management functions by issuing privileged operations on the MMU. Choice 1: OS maintains page tables examined by the MMU. • MMU loads TLB autonomously on each TLB miss • page table format is defined by the architecture • OS loads page table bases and lengths into privileged memory management registers on each context switch. Choice 2: OS controls the TLB directly. • MMU raises exception if the needed pte is not in the TLB. • Exception handler loads the missing pte by reading data structures in memory (software-loaded TLB). A Simple Page Table Each process/VAS has its own page table. Virtual addresses are translated relative to the current page table. process page table PFN 0 PFN 1 PFN i In this example, each VPN j maps to PFN j, but in practice any physical frame may be used for any virtual page. PFN i + offset page #i offset user virtual address physical memory page frames The page tables are themselves stored in memory; a protected register holds a pointer to the current page table. Page Tables (2) Second-level page tables Top-level page table 32 bit address with 2 page table fields Two-level page tables [from Tanenbaum] A Page Table Entry (PTE) This is (roughly) what a MIPS/Nachos page table entry (pte) looks like. valid bit: OS uses this bit to tell the MMU if the translation is valid. write-enable: OS touches this to enable or disable write access for this mapping. PFN dirty bit: MMU sets this when a store is completed to the page (page is modified). reference bit: MMU sets this when a reference is made through the mapping. Page Tables (3) Typical page table entry [from Tanenbaum] Completing a VM Reference start here probe page table load TLB probe TLB access valid? load TLB zero-fill fetch from disk page on disk? MMU access physical memory raise exception OS allocate frame page fault? signal process Demand Paging and Page Faults OS may leave some virtual-physical translations unspecified. mark the pte for a virtual page as invalid If an unmapped page is referenced, the machine passes control to the kernel exception handler (page fault). passes faulting virtual address and attempted access mode Handler initializes a page frame, updates pte, and restarts. If a disk access is required, the OS may switch to another process after initiating the I/O. Page faults are delivered at IPL 0, just like a system call trap. Fault handler executes in context of faulted process, blocks on a semaphore or condition variable awaiting I/O completion. Where Pages Come From file volume with executable programs text Modified (dirty) pages are pushed to backing store (swap) on eviction. data BSS user stack args/env Fetches for clean text or data are typically fill-from-file. kernel Initial references to user stack and BSS are satisfied by zero-fill on demand. Paged-out pages are fetched from backing store when needed. Processes and the Kernel processes in private virtual address spaces data The kernel sets up process execution contexts to “virtualize” the machine. data system call traps ...and upcalls (e.g., signals) shared kernel code and data in shared address space CPU and devices force entry to the kernel to handle exceptional events. Threads or processes enter the kernel for services. The Kernel • Today, all “real” operating systems have protected kernels. The kernel resides in a well-known file: the “machine” automatically loads it into memory (boots) on power-on/reset. Our “kernel” is called the executive in some systems (e.g., XP). • The kernel is (mostly) a library of service procedures shared by all user programs, but the kernel is protected: User code cannot access internal kernel data structures directly, and it can invoke the kernel only at well-defined entry points (system calls). • Kernel code is like user code, but the kernel is privileged: The kernel has direct access to all hardware functions, and defines the machine entry points for interrupts and exceptions. Protecting Entry to the Kernel Protected events and kernel mode are the architectural foundations of kernel-based OS (Unix, XP, etc). • The machine defines a small set of exceptional event types. • The machine defines what conditions raise each event. • The kernel installs handlers for each event at boot time. e.g., a table in kernel memory read by the machine The machine transitions to kernel mode only on an exceptional event. The kernel defines the event handlers. Therefore the kernel chooses what code will execute in kernel mode, and when. user trap/return kernel interrupt or exception Example: System Call Traps User code invokes kernel services by initiating system call traps. • Programs in C, C++, etc. invoke system calls by linking to a standard library of procedures written in assembly language. the library defines a stub or wrapper routine for each syscall stub executes a special trap instruction (e.g., chmk or callsys or int) syscall arguments/results passed in registers or user stack Alpha CPU architecture read() in Unix libc.a library (executes in user mode): #define SYSCALL_READ 27 # code for a read system call move arg0…argn, a0…an move SYSCALL_READ, v0 callsys move r1, _errno return # syscall args in registers A0..AN # syscall dispatch code in V0 # kernel trap # errno = return status Faults Faults are similar to system calls in some respects: • Faults occur as a result of a process executing an instruction. Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel. • The completed fault handler may return to the faulted context. But faults are different from syscall traps in other respects: • Syscalls are deliberate, but faults are “accidents”. divide-by-zero, dereference invalid pointer, memory page fault • Not every execution of the faulting instruction results in a fault. may depend on memory state or register contents The Role of Events A CPU event is an “unnatural” change in control flow. Like a procedure call, an event changes the PC. Also changes mode or context (current stack), or both. Events do not change the current space! The kernel defines a handler routine for each event type. Event handlers always execute in kernel mode. The specific types of events are defined by the machine. Once the system is booted, every entry to the kernel occurs as a result of an event. In some sense, the whole kernel is a big event handler. CPU Events: Interrupts and Exceptions An interrupt is caused by an external event. device requests attention, timer expires, etc. An exception is caused by an executing instruction. CPU requires software intervention to handle a fault or trap. unplanned deliberate sync fault syscall trap async interrupt AST control flow AST: Asynchronous System Trap Also called a software interrupt or an Asynchronous or Deferred Procedure Call (APC or DPC) Note: different “cultures” may use some of these terms (e.g., trap, fault, exception, event, interrupt) slightly differently. exception.cc event handler (e.g., ISR: Interrupt Service Routine) Mode, Space, and Context At any time, the state of each processor is defined by: 1. mode: given by the mode bit Is the CPU executing in the protected kernel or a user program? 2. space: defined by V->P translations currently in effect What address space is the CPU running in? Once the system is booted, it always runs in some virtual address space. 3. context: given by register state and execution stream Is the CPU executing a thread/process, or an interrupt handler? Where is the stack? These are important because the mode/space/context determines the meaning and validity of key operations. The Virtual Address Space 0 text 0x0 data BSS user stack • user regions in the lower half sbrk() jsr V->P mappings specific to each process accessible to user or kernel code args/env • kernel regions in upper half 2n-1 shared by all processes, but accessible only to kernel code kernel text and kernel data 2n-1 A typical process VAS space includes: 0xffffffff • NT (XP?) on x86 subdivides kernel region into an unpaged half and a (mostly) paged upper half at 0xC0000000 for page tables and I/O cache. • Win95/98 uses the lower half of system space as a system-wide shared region. A VAS for a private address space system (e.g., Unix, NT/XP) executing on a typical 32-bit system (e.g., x86). Process and Kernel Address Spaces 0x0 0 n-bit virtual address space 2n-1-1 data data 32-bit virtual address space 0x7FFFFFFF 2n-1 0x80000000 2n-1 0xFFFFFFFF Faults Faults are similar to system calls in some respects: • Faults occur as a result of a process executing an instruction. Fault handlers execute on the process kernel stack; the fault handler may block (sleep) in the kernel. • The completed fault handler may return to the faulted context. But faults are different from syscall traps in other respects: • Syscalls are deliberate, but faults are “accidents”. divide-by-zero, dereference invalid pointer, memory page fault • Not every execution of the faulting instruction results in a fault. may depend on memory state or register contents Options for Handling a Fault (1) 1. Some faults are handled by “patching things up” and returning to the faulted context. Example: the kernel may resolve an address fault (virtual memory fault) by installing a new virtual-physical translation. The fault handler may adjust the saved PC to re-execute the faulting instruction after returning from the fault. 2. Some faults are handled by notifying the process that the fault occurred, so it may recover in its own way. Fault handler munges the saved user context (PC, SP) to transfer control to a registered user-mode handler on return from the fault. Example: Unix signals or Microsoft NT user-mode Asynchronous Procedure Calls (APCs). Options for Handling a Fault (2) 3. The kernel may handle unrecoverable faults by killing the user process. Program fault with no registered user-mode handler? Destroy the process, release its resources, maybe write the memory image to a file, and find another ready process/thread to run. In Unix this is the default action for many signals (e.g., SEGV). 4. How to handle faults generated by the kernel itself? Kernel follows a bogus pointer? Divides by zero? Executes an instruction that is undefined or reserved to user mode? These are generally fatal operating system errors resulting in a system crash, e.g., panic()! Issues for Paged Memory Management The OS tries to minimize page fault costs incurred by all processes, balancing fairness, system throughput, etc. (1) fetch policy: When are pages brought into memory? prepaging: reduce page faults by bring pages in before needed clustering: reduce seeks on backing storage (2) replacement policy: How and when does the system select victim pages to be evicted/discarded from memory? (3) backing storage policy: Where does the system store evicted pages? When is the backing storage allocated? When does the system write modified pages to backing store? Virtual Memory as a Cache executable file virtual memory (big) header text text data idata data wdata symbol table, etc. BSS program sections physical memory (small) backing storage pageout/eviction user stack args/env page fetch kernel process segments virtual-to-physical translations physical page frames Rationale for I/O Cache Structure Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m). 1. What happens on the first access to each item? Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later. 2. How to determine if an item is resident in the cache? Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative). 3. How to find a slot for an item fetched into the cache? Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot. Mechanism for Cache Eviction/Replacement Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse. • Busy items in active use are not on the list. E.g., some in-memory data structure holds a pointer to the item. E.g., an I/O operation is in progress on the item. • The best candidates are slots that do not contain valid items. Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed). • Other slots are listed in order of value of the items they contain. These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later. Replacement Policy The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list. defines the replacement policy A typical cache replacement policy is Least Recently Used. • Assume hot items used recently are likely to be used again. • Move the item to the tail of the free list on every release. • The item at the front of the list is the coldest inactive item. Other alternatives: • FIFO: replace the oldest item. • MRU/LIFO: replace the most recently used item. The Page Caching Problem Each thread/process/job utters a stream of page references. • reference string: e.g., abcabcdabce.. The OS tries to minimize the number of faults incurred. • The set of pages (the working set) actively used by each job changes relatively slowly. • Try to arrange for the resident set of pages for each active job to closely approximate its working set. Replacement policy is the key. • On each page fault, select a victim page to evict from memory; read the new page into the victim’s frame. • Most systems try to approximate an LRU policy. VM Page Cache Internals HASH(memory object/segment, logical block) 1. Pages in active use are mapped through the page table of one or more processes. 2. On a fault, the global object/offset hash table in kernel finds pages brought into memory by other processes. 3. Several page queues wind through the set of active frames, keeping track of usage. 4. Pages selected for eviction are removed from all page tables first. Managing the VM Page Cache Managing a VM page cache is similar to a file block cache, but with some new twists. 1. Pages are typically referenced by page table (pmap) entries. Must invalidate mappings before reusing the frame. 2. Reads and writes are implicit; the TLB hides them from the OS. How can we tell if a page is dirty? How can we tell if a page is referenced? 3. Cache manager must run policies periodically, sampling page state. Continuously push dirty pages to disk to “launder” them. Continuously check references to judge how “hot” each page is. Balance accuracy with sampling overhead. The Paging Daemon Most OS have one or more system processes responsible for implementing the VM page cache replacement policy. • A daemon is an autonomous system process that periodically performs some housekeeping task. The paging daemon prepares for page eviction before the need arises. • Wake up when free memory becomes low. • Clean dirty pages by pushing to backing store. prewrite or pageout • Maintain ordered lists of eviction candidates. • Decide how much memory to allocate to file cache, VM, etc. LRU Approximations for Paging Pure LRU and LFU are prohibitively expensive to implement. • most references are hidden by the TLB • OS typically sees less than 10% of all references • can’t tweak your ordered page list on every reference Most systems rely on an approximation to LRU for paging. • periodically sample the reference bit on each page visit page and set reference bit to zero run the process for a while (the reference window) come back and check the bit again • reorder the list of eviction candidates based on sampling FIFO with Second Chance Idea: do simple FIFO replacement, but give pages a “second chance” to prove their value before they are replaced. • Every frame is on one of three FIFO lists: active, inactive and free • Page fault handler installs new pages on tail of active list. • “Old” pages are moved to the tail of the inactive list. Paging daemon moves pages from head of active list to tail of inactive list when demands for free frames is high. Clear the refbit and protect the inactive page to “monitor” it. • Pages on the inactive list get a “second chance”. If referenced while inactive, reactivate to the tail of active list. Illustrating FIFO-2C active list inactive list free list Paging daemon typically scans a few times per second, even if not needed to restock free list. I. Restock inactive list by pulling pages from the head of the active list: clear the ref bit and place on inactive list (deactivation). II. Inactive list scan from head: 1. Page has been referenced? Return to tail of active list (reactivation). 2. Page has not been referenced? pmap_page_protect and place on tail of free list. 3. Page is dirty? Push to backing store and return it to inactive list tail (clean). Consume frames from the head of the free list (free). If free shrinks below threshhold, kick the paging daemon to start a scan (I, II). FIFO-2C in Action (FreeBSD) What Do the Pretty Colors Mean? This is a plot of selected internal kernel events during a run of a process that randomly reads/writes its virtual memory. • x-axis: time in milliseconds (total run is about 3 seconds) • y-axis: each event pertains to a physical page frame, whose PFN is given on the y-axis The machine is an Alpha with 8000 8KB pages (64MB total) The process uses 48MB of virtual memory: force the paging daemon to do FIFO-2C bookkeeping, but little actual paging. • events: page allocate (yellow-green), page free (red), deactivation (duke blue), reactivation (lime green), page clean (carolina blue). What to Look For • Some low physical memory ranges are reserved to the kernel. • Process starts and soaks up memory that was initially free. • Paging daemon frees pages allocated to other processes, and the system reallocates them to the test process. • After an initial flurry of demand-loading activity, things settle down after most of the process memory is resident. • Paging daemon begins to run more frequently as memory becomes overcommitted (dark blue deactivation stripes). • Test process touches pages deactivated by the paging daemon, causing them to be reactivated. • Test process exits (heavy red bar). page alloc deactivate clean free activate Questions for Paged Virtual Memory 1. How do we prevent users from accessing protected data? 2. If a page is in memory, how do we find it? Address translation must be fast. 3. If a page is not in memory, how do we find it? 4. When is a page brought into memory? 5. If a page is brought into memory, where do we put it? 6. If a page is evicted from memory, where do we put it? 7. How do we decide which pages to evict from memory? Page replacement policy should minimize overall I/O. What You Should Know • Basics of paged memory management • Typical address space layout • Basics of address translation • Architectural mechanisms to support paged memory • Importance for kernel protection and process isolation • Why the simple page table is inadequate • Motivation for and structure of hierarchical tables • Optional: motivation for and structure of hashed (inverted) tables Background Be sure you understand why page-based memory allocation is more memory-efficient than the old way: allocating contiguous physical memory for each address space (partitioning). • Two partitioning strategies: fixed and variable • How to make partitioning transparent to programs • How to protect memory in a partitioned system • Fragmentation: internal and external • Fragmentation issues for each strategy • Relevance to heap managers today • Approaches to variable partitioning: First fit, best fit, etc., and the role of compaction. Memory Management 101 Once upon a time...memory was called “core”, and programs (“jobs”) were loaded and executed one by one. • load image in contiguous physical memory start execution at a known physical location allocate space in high memory for stack and data • address text and data using physical addresses prelink executables for known start address • run to completion Memory and Multiprogramming One day, IBM decided to load multiple jobs in memory at once. • improve utilization of that expensive CPU • improve system throughput Problem 1: how do programs address their memory space? load-time relocation? Problem 2: how does the OS protect memory from rogue programs? ??? Base and Bound Registers Goal: isolate jobs from one another, and from their placement in the machine memory. • addresses are offsets from the job’s base address stored in a machine base register machine computes effective address on each reference initialized by OS when job is loaded • machine checks each offset against job size placed by OS in a bound register Base and Bound: Pros and Cons Pro: • each job is physically contiguous • simple hardware and software • no need for load-time relocation of linked addresses • OS may swap or move jobs as it sees fit Con: • memory allocation is a royal pain • job size is limited by available memory Variable Partitioning Variable partitioning is the strategy of parking differently sized cars along a street with no marked parking space dividers. Wasted space from external fragmentation Fixed Partitioning Wasted space from internal fragmentation Alpha Page Tables (Forward Mapped) 21 seg 0/1 L1 10 L2 10 L3 PO 10 13 sparse 64-bit address space (43 bits in 21064 and 21164) base + + + three-level page table (forward-mapped) offset at each level is determined by specific bits in VA PFN