Paging

Virtual Memory
• Segmentation
– Basic early approach
• Paging
– Modern approach
– Advantages
• Easy to allocate physical memory
• Easy to “page out”/swap chunks of memory
– Disadvantages
• Overhead added to each memory reference
• Additional memory is required to store page tables
Hardware and OS structures for paging
• Hardware
– Page table base register
• X86: CR3
– TLB
• Software
– Page table
• Virtual->physical or virtual->disk mapping
– Page frame database
• One entry for each physical page
• Information about page
– e.g. owning process, r/w permissions
– Swap file / Section list
Page Frame Database
/* Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page.
 */
struct page {
    unsigned long flags;               // Atomic flags: locked, referenced, dirty, slab, disk
    atomic_t _count;                   // Usage count
    atomic_t _mapcount;                // Count of PTEs mapping this page
    struct {
        unsigned long private;         // Used for managing a page used in file I/O
        struct address_space *mapping; // Defines the data this page is holding
    };
    pgoff_t index;                     // Our offset within mapping
    struct list_head lru;              // Linked list node for LRU ordering of pages
    void *virtual;                     // Kernel virtual address
};
Multilevel page tables
• Page table with N levels
– Virtual addresses split into N+1 parts
• N indexes to different levels
• 1 offset into page
• Example: 32 bit paging on x86
– 4KB pages, 4 bytes/PTE
• 12 bits in offset: 2^12 = 4096
– Want to fit page table entries into 1 page
• 4KB / 4 bytes = 1024 PTEs per page
– So level indexes = 10 bits each
• 2^10 = 1024
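As a concrete sketch, the 10/10/12 split can be written as C macros (macro names are mine, not from any real kernel):

// Split a 32-bit x86 virtual address: 10 + 10 + 12 bits.
#define VA_MASTER(va)    (((va) >> 22) & 0x3FF)  // top 10 bits: master index
#define VA_SECONDARY(va) (((va) >> 12) & 0x3FF)  // next 10 bits: secondary index
#define VA_OFFSET(va)    ((va) & 0xFFF)          // low 12 bits: offset in page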
2 level page tables
[Figure: two-level translation. The virtual address splits into a master page number (10 bits), a secondary page number (10 bits), and an offset (12 bits). The master page table entry selects a secondary page table; that table's entry supplies the page frame number, which is combined with the offset to address physical memory.]
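A software model of that walk, as a hedged sketch: it reuses the VA_* macros above and assumes a 32-bit, identity-mapped kernel that can dereference physical addresses directly, which real kernels have to arrange explicitly.

typedef unsigned int u32;

// Walk a two-level table; returns 0 if a level is not present
// (the caller would raise a page fault).
u32 translate(u32 *master_table, u32 va)
{
    u32 pde = master_table[VA_MASTER(va)];
    if (!(pde & 1))                              // present bit clear
        return 0;

    u32 *secondary = (u32 *)(pde & ~0xFFFu);     // PDE holds the table's frame
    u32 pte = secondary[VA_SECONDARY(va)];
    if (!(pte & 1))
        return 0;

    return (pte & ~0xFFFu) | VA_OFFSET(va);      // frame number + offset
}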
Inverted Page Table
• Previous examples: “Forward Page tables”
– Page table size is proportional to the size of virtual memory
– Physical memory could be much less
• Lots of wasted space
[Figure: the virtual page number is hashed into a table to produce the physical page number; the offset passes through unchanged.]
• Separate approach: Use a hash table
– Inverted page table
– Size is independent of virtual address space
– Directly related to size of physical memory
• Cons:
– Have to manage a hash table (collisions, rebalancing, etc)
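A sketch of the lookup, with a hypothetical entry layout (real inverted page tables handle collisions and sizing more carefully):

#define NBUCKETS 1024

struct ipt_entry {
    unsigned int pid;            // owning process (part of the tag)
    unsigned int vpn;            // virtual page number (part of the tag)
    unsigned int pfn;            // physical frame this entry maps
    struct ipt_entry *next;      // collision chain
};

static struct ipt_entry *buckets[NBUCKETS];

// Returns the PFN for (pid, vpn), or -1 on a miss (fault to the OS).
int ipt_lookup(unsigned int pid, unsigned int vpn)
{
    struct ipt_entry *e = buckets[(pid ^ vpn) % NBUCKETS];
    for (; e != NULL; e = e->next)
        if (e->pid == pid && e->vpn == vpn)
            return (int)e->pfn;
    return -1;
}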
Addressing Page Tables
• Where are page tables stored?
– And in which address space?
• Possibility #1: Physical memory
– Easy address, no translation required
– But page tables must stay resident in memory
• Possibility #2: Virtual Memory (OS VA space)
– Cold (unused) page table pages can be swapped out
– But page table addresses must be translated through page
tables
• Don’t page the outer page table page (called wiring)
• Question: Can the kernel be paged?
Generic PTE
• PTE maps virtual page to physical page
• Includes Page properties (shared with HW)
– Valid? Writable? Dirty? Cacheable?
[PTE layout: Physical Page Number | Property Bits]
Where is the virtual page number? (It isn't stored: the VPN is the index used to reach this PTE.)
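One way to model such a PTE in C (illustrative only: bitfield layout is compiler-dependent, so real kernels use masks and shifts instead):

typedef struct {
    unsigned int present   : 1;   // valid translation?
    unsigned int writable  : 1;
    unsigned int dirty     : 1;
    unsigned int cacheable : 1;
    unsigned int unused    : 8;   // other property/OS bits
    unsigned int ppn       : 20;  // physical page number (4KB pages)
} pte_t;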
• Acronyms
– PTE = Page Table Entry
– PDE = Page Directory Entry
– VA = Virtual Address
– PA = Physical Address
– VPN = Virtual Page Number
– PPN = Physical Page Number
– PFN = Page Frame Number (same as PPN)
X86 address translation (32 bit)
• Page Tables organized as a 2 level tree
– Efficiently handle sparse address space
• One set of page tables per process
– Current page tables pointed to by CR3
• CPU “walks” page tables to find translations
– Accessed and Dirty bits updated by CPU
• 32 bit: 4KB or 4MB pages
• 64 bit: 4 levels; 4KB or 2MB pages
X86 32 bit PDE/PTE details
PWT: Write through
PCD: Cache Disable
P: Present
R/W: Read/Write
U/S: User/System
AVL: Available for OS use
A: Accessed
D: Dirty
PAT: Cache behavior definition
G: Global
• If page is not present (P=0), then other bits are available for OS to use.
• Remember: Useful for swapping
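Those bits as mask macros, the way kernels usually test them (positions match the 32-bit x86 format; the macro names themselves are illustrative):

#define PTE_P   0x001   // Present
#define PTE_RW  0x002   // Read/Write
#define PTE_US  0x004   // User/System
#define PTE_PWT 0x008   // Write through
#define PTE_PCD 0x010   // Cache disable
#define PTE_A   0x020   // Accessed (set by CPU)
#define PTE_D   0x040   // Dirty (set by CPU)
#define PTE_G   0x100   // Global

// If the page is not present, the OS owns the other 31 bits,
// e.g. to record where the page lives in the swap file.
int pte_is_swapped(unsigned int pte) { return !(pte & PTE_P); }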
Paging Translation
Making it efficient
• Original page table scheme doubled cost of memory accesses
– 1 page table access, 1 data access
• 2-level page tables triple the cost
– 2 page table accesses + 1 data access
• 4-level page tables quintuple the cost
– 4 page table accesses + 1 data access
• How to achieve efficiency
– Goal: Make virtual memory accesses as fast as physical memory
accesses
– Solution: Use a hardware cache
• Cache virtual-to-physical translations in hardware
• Translation Lookaside Buffer (TLB)
• X86:
– TLB is managed by CPU’s MMU
– 1 per CPU/core
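A back-of-the-envelope model of why the hit rate matters (latencies are assumed for illustration, not measured):

// EAT = hit_rate * (tlb + mem) + miss_rate * (tlb + walk + mem)
double eat_ns(double hit_rate, int levels)
{
    const double mem = 100.0, tlb = 1.0;     // assumed latencies (ns)
    double hit  = tlb + mem;                 // translation found in TLB
    double miss = tlb + levels * mem + mem;  // full table walk, then data
    return hit_rate * hit + (1.0 - hit_rate) * miss;
}
// eat_ns(0.99, 2) is about 103 ns, vs. 300 ns per access with no TLB at all.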
TLBs
• Translation Lookaside buffers
– Translates Virtual page #s into PTEs (NOT physical addresses)
• Why?
– Can be done in single machine cycle
• Implemented in hardware
– Associative cache (many entries searched in parallel)
– Cache tags are virtual page numbers
– Cache values are PTEs
– With PTE + offset, MMU directly calculates PA
• TLBs rely on locality
– Processes only use a handful of pages at a time
• 16-48 entries in TLB is typical (64-192KB of coverage with 4KB pages)
• Targets “hot set” or “working set” of process
– TLB hit rates are critical for performance
Managing TLBs
• Address translations are mostly handled by TLB
– (>99%) hit rate, but there are occasional TLB misses
– On miss, who places translations into TLB?
• Hardware (MMU)
– Knows where page tables are in memory (CR3)
• OS maintains them, HW accesses them
– Tables set up in HW-defined format
– X86
• Software loaded TLB (OS)
– TLB miss faults to OS, OS finds right PTE and loads it into TLB
– Must be fast
• CPU ISA has special TLB access instructions
• OS uses its own page table format
– SPARC and IBM Power
Managing TLBs (2)
• OS must ensure TLB and page tables are consistent
– If OS changes PTE, it must invalidate cached PTE in TLB
– Explicit instruction to invalidate PTE
• X86: invlpg
• What happens on a context switch?
– Each process has its own page table
– Entire TLB must be invalidated (TLB flush)
– X86: Certain instructions automatically flush entire TLB
• Reloading CR3: asm("mov %0, %%cr3" :: "r"(page_table_addr));
• When TLB misses, a new PTE is loaded, and cached PTE is
evicted
– Which PTE should be evicted?
• TLB Replacement Policy
• Defined and implemented in hardware (usually LRU)
x86 TLB
• TLB management is shared by CPU and OS
• CPU:
– Fills TLB on demand from page tables
• OS is unaware of TLB misses
– Evicts entries as needed
• OS:
– Ensures TLB and page tables are consistent
• Flushes entire TLB when page tables are switched (e.g.
context switch)
– asm("mov %0, %%cr3" :: "r"(page_table_addr));
• Modifications to a single PTE are flushed explicitly
– asm("invlpg (%0)" :: "r"(virtual_addr) : "memory");
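Wrapped as helpers (GCC inline asm; a sketch of what an OS might define, not any particular kernel's API):

// Flush the whole TLB (non-global entries) by reloading CR3.
static inline void tlb_flush_all(unsigned long page_table_addr)
{
    asm volatile("mov %0, %%cr3" :: "r"(page_table_addr) : "memory");
}

// Flush only the translation covering one virtual address.
static inline void tlb_flush_one(void *virtual_addr)
{
    asm volatile("invlpg (%0)" :: "r"(virtual_addr) : "memory");
}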
Cool Paging Tricks
• Exploit level of indirection between VA and PA
– Shared memory
• Regions of two separate processes’ address spaces map to the
same physical memory
– Read/write: access to shared data
– Execute: shared libraries
• Each process can have a separate PTE pointing to same physical
memory
– Different access privileges for different processes
• Does the shared region need to map the same VA in each process?
– Copy-On-Write (COW)
• Instead of copying physical memory on fork()
• Just create a new set of identical page tables with writes disabled
– When a child writes to a page, OS gets page fault
– OS copies physical memory page, and maps new page into child process
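A sketch of that write-fault path (every helper here is a placeholder, not a real kernel API):

// Process write-faulted on a read-only COW page.
void cow_fault(pte_t *pte, void *faulting_va)
{
    void *old = frame_of(pte);          // shared physical page
    void *new = alloc_frame();          // grab a fresh frame
    memcpy(new, old, PAGE_SIZE);        // private copy for the writer

    remap(pte, new, /*writable=*/1);    // point PTE at the copy, enable writes
    tlb_flush_one(faulting_va);         // drop the stale read-only entry
    put_frame(old);                     // unref; freed once no sharers remain
}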
Saving Memory to Disk
• When memory runs short:
– OS writes memory contents to disk and reuses memory
• Copying a whole process is called “swapping”
• Copying a single page is called “paging”
• Where does data go?
– If it came from a file and was not modified: deleted from memory
• E.g. Executable code
– Unix Swap partition
• A partition (file, disk segment, or entire disk) reserved as a backing store
– Windows Swap file
• Designated file stored in regular file system
• When does data move?
– Swapping: in advance of running a process
– Paging: when a page of memory is accessed
Demand paging
• Moving pages between memory and disk
– OS uses main memory as a cache
• Most of memory is used to store file data
– Programs, libraries, data
– File contents cached in memory
• Anonymous memory
– Memory not used for file data
» Heap, stack, globals, …
– Backed to swap file/partition
– OS manages movement of pages to/from disk
• Transparent to application
Why is this “demand” paging?
• When a process first starts: fork()/exec()
– Brand new page tables with no valid PTEs
– No pages mapped to physical memory
– As process executes memory is accessed
• Instructions immediately fault on code and data pages
• Faults stop once all necessary code/data is in memory
– Only code/data that is needed
• Memory that is needed changes over time
– Pages shift between disk and memory
Page faults
• What happens when a process references an evicted page?
– When page is evicted, OS sets PTE as invalid (present = 0)
• Sets rest of PTE bits to indicate location in swap file
– When process accesses page, invalid PTE triggers a CPU
exception (page fault)
– OS invokes page fault handler
• Checks PTE, and uses the remaining (high 31) bits to find the page in the swap file
• Handler reads page from disk into available frame
– Possibly has to evict another page…
• Handler restarts process
• What if memory is full?
– Another page must be evicted (page replacement algorithm)
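Assembling those steps into a handler sketch (helpers hypothetical; permission checks and error paths omitted):

void page_fault_handler(void *faulting_va)
{
    pte_t *pte = walk_page_table(current_page_table(), faulting_va);

    unsigned int slot = pte_swap_slot(pte);  // location encoded in non-present PTE
    unsigned int pfn  = find_free_frame();   // may itself evict a page
    read_from_swap(slot, pfn);               // blocks; another process can run

    set_pte(pte, pfn, PTE_P | PTE_RW);       // mark the translation valid again
    // On return from the exception, the CPU re-executes the faulting instruction.
}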
Steps in Handling a page fault
Evicting the best page
• OS must choose victim page to be evicted
– Goal: Reduce the page fault rate
– The best page to evict is one that will never be
accessed again
• Not really possible…
• Belady’s proof: Evicting the page that won’t be used for
the longest period of time minimizes page fault rate
Belady’s Algorithm
• Find page that won’t be used for the longest
amount of time
– Not possible
• So why is it here?
– Provably optimal solution
– Comparison for other practical algorithms
– Upper bound on possible performance
• Lower bound?
– Depends on workload…
• Random replacement is generally a bad idea
FIFO
• Obvious and simple
– When a page is brought in, goes to tail of list
– On eviction take the head of the list
• Advantages
– If it was brought in a while ago, then it might not be used...
• Disadvantages
– Or it's being used by everybody (glibc)
– Does not measure access behavior at all
• FIFO suffers from Belady’s Anomaly
– Fault rate might increase when given more physical
memory
• Very bad property…
• Exercise: Develop a workload where this is true
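For the exercise, the classic reference string 1 2 3 4 1 2 5 1 2 3 4 5 shows the anomaly under FIFO; this small simulation reports 9 faults with 3 frames but 10 with 4:

#include <stdio.h>

// Count FIFO page faults for a reference string using nframes frames.
static int fifo_faults(const int *refs, int n, int nframes)
{
    int frames[8] = {0}, head = 0, used = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (!hit) {
            faults++;
            if (used < nframes) frames[used++] = refs[i];
            else { frames[head] = refs[i]; head = (head + 1) % nframes; }
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    printf("3 frames: %d faults\n", fifo_faults(refs, 12, 3)); // prints 9
    printf("4 frames: %d faults\n", fifo_faults(refs, 12, 4)); // prints 10
    return 0;
}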
Least Recently Used (LRU)
• Use access behavior during selection
– Idea: Use past behavior to predict future behavior
– On replacement, evict page that hasn’t been used for
the longest amount of time
• LRU looks at the past, Belady’s looks at future
• Implementation
– To be perfect, every access must be detected and
timestamped (way too expensive)
• So it must be approximated
Approximating LRU
• Many approximations, all use PTE flag
– x86: Accessed bit, set by HW on every access
– Each page has a counter (where?)
– Periodically, scan entire list of pages
• If accessed = 0, increment counter (not used)
• If accessed = 1, clear counter (used)
• Clear accessed flag in PTE
– Counter will contain # of iterations since last reference
• Page with largest counter is least recently used
• Some CPUs don’t have PTE flags
– Can simulate it by forcing page faults with invalid PTE
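A sketch of the periodic scan (struct frame is hypothetical; the counter lives in software, since the PTE only provides the accessed bit):

struct frame {
    pte_t *pte;          // hardware PTE mapping this physical page
    unsigned int age;    // scan intervals since last reference
};

// Run periodically over every physical page.
void lru_scan(struct frame *frames, int nframes)
{
    for (int i = 0; i < nframes; i++) {
        if (pte_accessed(frames[i].pte)) {
            frames[i].age = 0;                  // used since last scan
            pte_clear_accessed(frames[i].pte);
        } else {
            frames[i].age++;                    // idle one more interval
        }
    }
}
// Victim on eviction: the frame with the largest age.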
LRU Clock
• Not Recently Used (NRU) or Second Chance
– Replace page that is “old enough”
• Arrange page in circular list (like a clock)
– Clock hand (ptr value) sweeps through list
• If accessed = 0, not used recently so evict it
• If accessed = 1, recently used
– Set accessed to 0, go to next PTE
• Recommended for Project 3
• Problem:
– If memory is large, “accuracy” of information degrades
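A sketch of the sweep, reusing the hypothetical struct frame from the previous slide (hand persists across calls):

// Advance the clock hand until a victim is found; returns its index.
int clock_evict(struct frame *frames, int nframes, int *hand)
{
    for (;;) {
        int idx = *hand;
        *hand = (*hand + 1) % nframes;           // hand always advances
        if (!pte_accessed(frames[idx].pte))
            return idx;                          // old enough: evict
        pte_clear_accessed(frames[idx].pte);     // used: second chance
    }
}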
N-Chance clock
• Basic Clock only has two states
– used or not used
• Can we add more?
– Answer: Embed counter into PTE
• Clock hand (ptr value) sweeps through list
– If accessed = 0, not used recently
• if counter = 0, evict
• else, decrement counter and go to next PTE
– If accessed = 1, recently used
• Set accessed to 0
• Increment counter
• Go to next PTE
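The N-chance version only changes the body of that loop (counter here is a software field added to struct frame; it could also live in the PTE's AVL bits):

if (pte_accessed(frames[idx].pte)) {
    pte_clear_accessed(frames[idx].pte);
    frames[idx].counter++;           // recently used: earn another chance
} else if (frames[idx].counter > 0) {
    frames[idx].counter--;           // not used, but chances remain
} else {
    return idx;                      // accessed = 0 and counter = 0: evict
}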