This lecture: virtual memory – page replacement – page allocation

Least Recently Used (LRU) (Review)
Replace the page that has not been used for the longest period of time.
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 – with 4 frames, 8 page faults.
This is the optimal page-replacement algorithm looking backward in time, rather than forward.

Implementing Perfect LRU (Review)
Timestamp the page on each reference
– At eviction time, scan for the oldest
– A clock counter can be kept in each page-table entry
Problems:
– large page lists
– no hardware support for timestamps

Implementing Perfect LRU (Review)
Keep a list of pages ordered by time of reference
– Keep a stack of page numbers in doubly linked form
– When a page is referenced:
» move it to the top of the stack
» the bottom is the LRU page
– requires 6 pointers to be changed
– No search for replacement

LRU Approximation
Hardware support in some systems is in the form of a reference bit.
Reference bit
– With each page in the page table, associate a bit, initially 0
– When the page is referenced, the bit is set to 1
– Replace a page whose bit is 0 (if one exists). We do not know the order, however.
Additional-Reference-Bits Algorithm
– Keep an 8-bit byte for each page in a table in memory
– At regular intervals, a timer interrupt transfers control to the OS
– The OS shifts the reference bit for each page into the high-order bit of its byte, shifting the other bits right 1 bit and discarding the low-order bit

Additional-Reference-Bits Algorithm
Shift register contains 00000000 – page not used for eight time periods.
Shift register contains 11111111 – page used at least once in each of the last eight time periods.
A page with history register 11000100 was more recently used than one with 01110111.
Interpret the 8-bit byte as an unsigned integer: the page with the lowest number is the LRU page.
If the numbers are not unique, then
– replace all pages with the smallest value, or
– use FIFO selection among them
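The shift-register bookkeeping above can be sketched as follows. This is a minimal illustration, not an OS implementation: the class and function names are invented, and a real kernel would perform the shift inside its timer-interrupt handler rather than over a Python list.

```python
class AgingPage:
    """One page's LRU-approximation state (names are illustrative)."""
    def __init__(self):
        self.ref_bit = 0   # set by "hardware" on each reference
        self.history = 0   # 8-bit shift register; high bit = most recent interval

def timer_tick(pages):
    """At each timer interval, shift the ref bit into the high-order bit,
    shifting the rest right and discarding the low-order bit."""
    for p in pages:
        p.history = ((p.ref_bit << 7) | (p.history >> 1)) & 0xFF
        p.ref_bit = 0

def lru_victim(pages):
    """The page whose history, read as an unsigned integer, is smallest
    is approximately the least recently used."""
    return min(range(len(pages)), key=lambda i: pages[i].history)
```

For example, a page with history 11000100 (196) compares as more recently used than one with 01110111 (119), matching the comparison on the slide.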
Second-Chance (Clock) Algorithm
Replace an old page, not the oldest page.
Clock algorithm: arrange physical pages in a circle, with a clock hand.
– Hardware keeps a reference bit per physical page frame
– Hardware sets the ref bit on each reference
– If the ref bit isn't set, the page has not been referenced in a long time
– Nachos: the ref bit lives in the TLB; copy it back to the page table when the TLB entry is replaced
On page fault:
– Advance the clock hand (not real time)
– Check the ref bit:
» 1 –> clear it, go on
» 0 –> replace the page

Second-Chance (Clock) Algorithm
Will it always find a page, or can it loop infinitely?
– Even if all ref bits are set, it eventually loops around, clearing all ref bits -> degenerates to FIFO
What if the hand is moving slowly?
– Not many page faults and/or pages are found quickly
What if the hand is moving quickly?
– Lots of page faults and/or lots of reference bits set
One way to view the clock algorithm: a crude partitioning of pages into two categories, young and old.
– Why not partition into more than 2 groups?

Nth-Chance Algorithm
Don't throw a page out until the hand has swept by it N times.
The OS keeps a counter per page: the # of sweeps.
On page fault, the OS checks the ref bit:
– 1 => clear the ref bit and the counter, go on
– 0 => increment the counter; if < N, go on, else replace the page

Nth-Chance Algorithm
How do we pick N?
– Why pick a large N? Better approximation to LRU.
– Why pick a small N? More efficient; otherwise the hand might have to sweep a long way to find a free page.
Dirty pages have to be written back to disk when replaced. It takes extra overhead to replace a dirty page, so give dirty pages an extra chance before replacing?
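The Nth-chance sweep described above can be sketched as a plain function. This is an assumption-laden sketch: `ref_bits` stands in for the hardware-maintained reference bits, and all names are invented for illustration.

```python
def nth_chance_victim(ref_bits, counters, hand, n):
    """Advance the clock hand until a victim frame is found.

    ref_bits[i] -- hardware-set reference bit for frame i
    counters[i] -- sweeps since frame i was last seen referenced
    Returns (victim_frame, new_hand_position).
    """
    num_frames = len(ref_bits)
    while True:
        if ref_bits[hand]:
            # Referenced since last sweep: clear bit and counter, move on.
            ref_bits[hand] = 0
            counters[hand] = 0
        else:
            counters[hand] += 1
            if counters[hand] >= n:
                # Swept past n times without a reference: replace it.
                return hand, (hand + 1) % num_frames
        hand = (hand + 1) % num_frames
```

With n = 1 this is the ordinary second-chance clock; even if every ref bit starts out set, the first pass clears them all, so the loop always terminates.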
Common approach:
– Clean pages: use N = 1
– Dirty pages: use N = 2 (and write back to disk when N = 1)

LRU Approximation: State per Page-Table Entry
To summarize, many machines maintain four bits per page-table entry:
– Ref: set when the page is referenced; cleared by the clock algorithm
– Modified: set when the page is modified; cleared when the page is written to disk
– Valid: OK for the program to reference this page
– Read-only: OK for the program to read the page, but not to modify it (for example, for catching modifications to code pages)

LRU Approximation
Do we really need a hardware-supported "modified" bit? No. We can emulate it (BSD UNIX).
Keep two sets of books:
i. Pages the user program can access without taking a fault
ii. Pages in memory
Set (i) is a subset of set (ii).
Initially, mark all pages as read-only, even data pages. On a write, trap to the OS:
– the OS sets the modified bit and marks the page as read-write.
When a page comes back in from disk, mark it read-only.

LRU Approximation
Do we really need a hardware-supported "ref" bit? No. We can emulate it, exactly as before:
i. Mark all pages as invalid, even if they are in memory.
ii. On a read to an invalid page, trap to the OS.
iii. The OS sets the ref bit and marks the page read-only.
iv. On a write, set the ref and modified bits, and mark the page read-write.
v. When the clock hand passes by, reset the ref bit and mark the page invalid.
But remember, clock is just an approximation of LRU. Can we do a better approximation, given that we have to take page faults on some reads and writes to collect ref information? We need to identify an old page, not the oldest page!

LRU Approximation
VAX/VMS didn't have a ref or modified bit, so it had to come up with some solution. The idea was to split memory into two parts, mapped and unmapped:
i. Directly accessible to the program (marked read-write, managed FIFO)
ii.
Second-chance list (marked invalid, but still in memory, managed pure LRU)

LRU Approximation
On a page reference:
– If the page is mapped, access it at full speed
– Otherwise, page fault:
» If the page is on the second-chance list, mark it read-write, and move the first page on the FIFO list onto the end of the second-chance list (marking it invalid)
» If the page is not on the second-chance list, bring it into memory; move the first page on the FIFO list onto the end of the second-chance list, and replace the first page on the second-chance list
[Figure: a reference to a page on the second-chance list moves it back among the mapped pages (FIFO); the second-chance list itself is LRU.]

LRU Approximation
How many pages for the second-chance list?
– If 0: pure FIFO
– If all: pure LRU, but a page fault on every page reference
Pick an intermediate value. The result:
– + Few disk accesses (a page only goes to disk if it is unused for a long time)
– – Increased overhead trapping to the OS (a software/hardware tradeoff)

LRU Approximation
Does a software-loaded TLB need a ref bit? What if we have a software-loaded TLB (as in Nachos)? Two options:
1. Hardware sets the ref bit in the TLB; when the TLB entry is replaced, software copies the ref bit back to the page table.
2. Software manages the TLB entries as a FIFO list; everything not in the TLB is on the second-chance list, managed as strict LRU.

LRU Approximation: Core Map
Page tables map virtual page # –> physical page #. Do we need the reverse map, physical page # –> virtual page #?
– Yes. The clock algorithm runs through page frames.
What if it ran through page tables instead?
i. Many more entries
ii. What about sharing?

A different take: page buffering
Cute, simple trick (VMS, Mach, Windows NT/2K/XP):
– keep a spare pool of free pages; recycle them in FIFO order
– but record which page each free frame last held: if that page is referenced again before the frame is overwritten, put it back in place.
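The page-buffering trick above can be sketched in a few lines. This is an illustrative model only: the class and method names are invented, and real systems track this state inside the core map rather than in dictionaries.

```python
from collections import deque

class FreePool:
    """Spare pool of free frames, recycled FIFO, that remembers which
    page each frame last held so a re-reference can rescue it."""
    def __init__(self):
        self.fifo = deque()   # frames in recycling (eviction) order
        self.contents = {}    # page -> frame still holding its data

    def evict(self, page, frame):
        """Page evicted: its frame joins the free pool, data intact."""
        self.fifo.append(frame)
        self.contents[page] = frame

    def reclaim(self, page):
        """Fault on a page whose old frame is still unused: take it back
        without a disk read. Returns None if a real read is needed."""
        frame = self.contents.pop(page, None)
        if frame is not None:
            self.fifo.remove(frame)
        return frame

    def take_frame(self):
        """Hand out the oldest free frame for a brand-new page."""
        frame = self.fifo.popleft()
        # The page that last lived here is now truly gone.
        self.contents = {p: f for p, f in self.contents.items() if f != frame}
        return frame
```

The point of the FIFO order is that a frame sits in the pool as long as possible before being overwritten, maximizing the chance that `reclaim` succeeds.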
VMS: an evicted page is added to the modified list (written back in batches, for speed) or to the unmodified free list; if it is used again before the frame is recycled, it is simply reset in place.

Allocation of Frames
Single-user system:
– Allocate some frames to the OS, and the rest to the user process
– If some OS frames are not used, allocate them to the user process
– But the OS keeps some frames (at least three) for itself in case of a page fault
Multiprogramming:
– Each process needs a minimum number of pages, defined by the instruction-set architecture
– Example: IBM 370 – 6 pages to handle the MVC instruction:
» the instruction is 6 bytes and might span 2 pages
» 2 pages to handle the from operand
» 2 pages to handle the to operand

Fixed Allocation
Equal allocation
– e.g., with 100 frames and 5 processes, give each process 20 frames.
Proportional allocation
– Allocate according to the size of the process:
» s_i = size of process p_i
» S = sum of all s_i
» m = total number of frames
» a_i = allocation for p_i = (s_i / S) × m
– Example: m = 64, s_1 = 10, s_2 = 127, so S = 137:
» a_1 = (10 / 137) × 64 ≈ 5
» a_2 = (127 / 137) × 64 ≈ 59

Priority Allocation
Use a proportional allocation scheme based on priorities, or on a combination of size and priority, rather than on size alone.
If process P_i generates a page fault:
– select for replacement one of its own frames, or
– select for replacement a frame from a process with a lower priority.

Global vs. Local Allocation
Global replacement
– So far, we've implicitly assumed memory comes from a single global pool: when process P faults and needs a page, take the oldest page in the entire system
– Good: shared caches are adaptable. For example, if P1 needs 20% of memory and P2 needs 70%, both can be happy.
– Bad: too adaptable; little "pig protection."
» What happens to P1 if P2 sequentially reads an array about the size of memory?
Global vs. Local Allocation
Local replacement
– each process has a separate pool of pages
– a page fault in one process can only replace one of that process's frames
– this isolates the process, relieving interference from other processes
– but isolation also prevents a process from using others' (comparatively) idle resources
– efficient memory usage requires a mechanism for (slowly) changing the allocation to each pool
– Questions: What is "slowly"? How big should a pool be? When should pages migrate?

Thrashing
Thrashing: memory is over-committed, and pages are tossed out while still needed. When thrashing occurs, the system becomes less able to do useful work because it spends too many resources swapping pages in and out of memory.
What we wanted: virtual memory the size of disk, with the access time of physical memory.
What we get: memory with an access time equal to disk access time.

Thrashing
Three different causes:
– the process doesn't reuse memory, so caching doesn't work (past != future access pattern)
– the process does reuse memory, but its working set does not fit
– individually, all processes fit and reuse memory, but there are too many of them for the system.

Thrashing
Example: one program touches 50 pages (each equally likely), but there are only 40 physical page frames.
– With enough pages: 200 ns/ref
– With too few pages, assume a page fault on every 5th reference:
» 4 refs × 200 ns + 1 page fault × 10 ms of disk I/O
» Result: 5 refs take 10 ms + 800 ns => about 2 ms/ref

Thrashing
Problem: the system doesn't know what it's getting into. Log more and more users into the system, and eventually:
– total # of pages needed > # of pages available
So what do you do about this?

Making the best of a bad situation
1. One process alone too big?
– Change the program so it needs less memory and has better locality.
– For example, split a matrix multiply into smaller submatrix multiplies that each fit in memory.
2. Several jobs?
– Figure out the needs of each process (working set)
– Run only groups that fit (balance sets)
– Kick other processes out of memory
Remember: the issue here is not the total size of a process, but the total number of pages it is using at the moment.
How do we figure out each process's needs? What is the methodology for solving this?
Approach 1: working set
– thrashing viewed from a caching perspective: given locality of reference, how big a cache does the process need?
– Or: how much memory does the process need in order to make "reasonable" progress (its working set)?
– Only run processes whose memory requirements can be satisfied.
Approach 2: page-fault frequency
– thrashing viewed as a poor ratio of fetching to work
– PFF = page faults / instructions executed
– if PFF rises above a threshold, the process needs more memory
» not enough memory on the system? Swap the process out.
– if PFF sinks below a threshold, memory can be taken away

Working set (Denning, 1968, MIT)
Informally: the collection of pages the process is using right now.
Formally: the set of pages the job has referenced in the last T seconds.
How do we pick T?
– 1 page fault = 10 ms
– 10 ms = 2 million instructions
So T needs to be a lot bigger than 1 million instructions.
How to implement the working set?
1. Modify the clock algorithm so that it sweeps at fixed intervals, and keep an idle time per page: how many seconds since the last reference.
2. With a second-chance list: how many seconds since the page was put on the second-chance list.
Now that you know how many pages each program needs, what do you do?

Scheduling details: the balance set
1. If the sum of the working sets of all runnable processes fits in memory, done!
2. If they do not fit, refuse to run some: divide processes into two groups
– active: working set loaded
– inactive: working set intentionally not loaded
– balance set: sum of the working sets of all active processes
Keep moving processes from active to inactive until the balance set is less than the memory size.
Must allow inactive processes to become active. (What if this changes too frequently?)
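The balance-set selection above can be sketched as a simple admission loop. The slides do not say in what order processes should be admitted, so the smallest-working-set-first policy below is an assumption for illustration, as are all the names.

```python
def pick_balance_set(working_sets, mem_frames):
    """Admit processes while the sum of their working-set sizes
    still fits in physical memory.

    working_sets -- dict: process name -> working-set size in frames
    Returns (active, inactive) sets of process names."""
    active, inactive = set(), set()
    used = 0
    # Policy choice (assumed, not from the slides): smallest first.
    for proc, wss in sorted(working_sets.items(), key=lambda kv: kv[1]):
        if used + wss <= mem_frames:
            active.add(proc)       # working set loaded
            used += wss
        else:
            inactive.add(proc)     # working set intentionally not loaded
    return active, inactive
```

A real long-term scheduler would also rotate processes between the two groups over time so that inactive processes eventually become active.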
Long-term scheduler:
– As working sets change, must update the balance set…

Some problems
What if T is too big?
– Waste memory; too few programs fit in memory
What if T is too small?
– Thrashing
Too big is better than too small!

Working sets of real programs
Typical programs have phases: transitions and stable periods
– the working set of one phase may have little to do with that of another
– the working-set size balloons during transitions…

Working set less important
The concept is a good perspective on system behavior.
– As an optimization trick, it's less important: early systems thrashed a lot; current systems, not so much.
Have OS designers gotten smarter? No. It's the hardware folks (cf. Moore's law):
– Obvious: memory is much larger (more to go around)
– Less obvious: CPUs are faster, so jobs exit more quickly and return memory to the free list sooner.
– Some applications can eat as much memory as you give them, but the percentage of programs that have "enough" seems to be increasing.
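The working-set estimate described earlier (a clock sweep at fixed intervals, keeping an idle time per page) can be sketched as follows. Field and function names are assumptions, and time is measured in abstract ticks rather than seconds.

```python
def sweep(pages, now):
    """Run at fixed intervals: record the last-reference time of any
    page whose hardware ref bit was set, then clear the bit.

    pages -- list of dicts with 'ref_bit' and 'last_ref' fields."""
    for p in pages:
        if p['ref_bit']:
            p['last_ref'] = now
            p['ref_bit'] = 0

def working_set_size(pages, now, window):
    """Number of pages referenced within the last `window` ticks,
    i.e. the working set with parameter T = window."""
    return sum(1 for p in pages if now - p['last_ref'] <= window)
```

The idle time per page (now minus `last_ref`) is exactly what the modified clock algorithm maintains; the working set is just the pages whose idle time is below T.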