Lecture 9-1: Virtual Memory
Original notes by Prof. Mike Schulte
Presented by Pradondet Nilagupta
Spring 2001

Virtual Memory
• Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
• VM address translation provides a mapping from the virtual address of the processor to the physical address in main memory or on disk.
• VM provides the following benefits
– Allows multiple programs to share the same physical memory
– Allows programmers to write code as though they have a very large amount of main memory
– Automatically handles bringing in data from disk
• Cache terms vs. VM terms
– Cache block => page or segment
– Cache miss => page fault or address fault

Virtual Memory Basics
• Programs reference "virtual" addresses in a nonexistent memory
– These are then translated into real "physical" addresses
– The virtual address space may be bigger than the physical address space
• Divide physical memory into blocks, called pages
– Anywhere from 512 bytes to 16 MB (4 KB is typical)
• Virtual-to-physical translation by indexed table lookup
– Add another cache for recent translations (the TLB)
• Invisible to the programmer
– Looks to your application like you have a lot of memory!
– Anyone remember overlays?

VM: Page Mapping
[Figure: pages of Process 1's and Process 2's virtual address spaces map to page frames in physical memory or to disk]

VM: Address Translation
• A virtual address splits into a 20-bit virtual page number and a 12-bit page offset (the offset width is log2 of the page size).
• The virtual page number indexes a per-process page table, located via the page table base register.
• Each page table entry holds a valid bit, protection bits, a dirty bit, a reference bit, and the physical page number.
• The physical page number is concatenated with the page offset to form the address sent to physical memory.
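To make the translation above concrete, here is a minimal C sketch of the lookup, assuming 32-bit virtual addresses, 4 KB pages, and a flat per-process page table with the PTE fields from the slide; the type and function names are illustrative, not part of the original notes.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   12                  /* log2(4 KB page size)            */
#define PAGE_SIZE   (1u << PAGE_BITS)
#define NUM_VPAGES  (1u << 20)          /* 32-bit VA -> 2^20 virtual pages */

typedef struct {
    uint32_t ppn   : 20;  /* physical page number                */
    uint32_t valid : 1;   /* page is resident in DRAM            */
    uint32_t dirty : 1;   /* page modified since it was loaded   */
    uint32_t ref   : 1;   /* set on access, read by the OS       */
    uint32_t prot  : 3;   /* protection (read/write/execute)     */
} pte_t;

/* Per-process page table, located by the page-table base register. */
extern pte_t page_table[NUM_VPAGES];

/* Translate a virtual address; returns false on a page fault. */
bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;        /* 20-bit virtual page no. */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* 12-bit page offset      */

    pte_t *pte = &page_table[vpn];
    if (!pte->valid)
        return false;           /* page fault: OS brings page in from disk */

    pte->ref = 1;               /* record the use for replacement          */
    *paddr = ((uint32_t)pte->ppn << PAGE_BITS) | offset;
    return true;
}
```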
Typical Page Parameters

Parameter                            Value
Page size                            4 KB – 64 KB
L1 cache hit time                    1–2 clock cycles
Virtual hit (i.e., mapped to DRAM)   50–400 clock cycles
Miss penalty (all the way to disk)   700K–6M clock cycles
  Disk access time                   500K–4M clock cycles
  Page transfer time                 200K–2M clock cycles
Page fault rate                      0.00001% – 0.001%
Main memory size                     4 MB – 4 GB

• It's a lot like what happens in a cache
– But everything (except the miss rate) is a LOT worse

Paging vs. Segmentation
• Pages are fixed-size blocks
• Segments vary from 1 byte to 2^32 bytes (for 32-bit addresses)

Aspect               Page                                Segment
Words per address    One – contains page and offset      Two – the maximum segment size is large, so segment and offset words are needed
Programmer visible?  No                                  Sometimes
Replacement          Trivial – because of fixed size     Hard – need to find contiguous space; use garbage collection
Memory efficiency    Internal fragmentation              External fragmentation
Disk efficiency      Yes – adjust page size to balance   Not always – segment size varies
                     access and transfer time

Cache and VM Parameters

Parameter           L1 Cache         Virtual Memory
Block (page) size   16–128 bytes     4,096–65,536 bytes
Hit time            1–2 cycles       40–100 cycles
Miss penalty        8–100 cycles     1M–6M cycles
Miss rate           0.5–10%          0.00001–0.001%
Memory size         16 KB – 1 MB     16 MB – 8 GB

• How is virtual memory different from caches?
– Software controls replacement – why?
– The size of virtual memory is determined by the size of the processor address
– Disk is also used to store the file system – nonvolatile

Paged and Segmented VM (Figure 5.38, pg. 442)
• Virtual memories can be categorized into two main classes
– Paged memory: fixed-size blocks
– Segmented memory: variable-size blocks

Paged vs. Segmented VM
• Paged memory
– Fixed-size blocks (4 KB to 64 KB)
– One word per address (page number + page offset)
– Easy to replace pages (all the same size)
– Internal fragmentation (not all of a page is used)
– Efficient disk traffic (optimized for the page size)
• Segmented memory
– Variable-size blocks (up to 64 KB or 4 GB)
– Two words per address (segment + offset)
– Difficult to replace segments (must find where a segment fits)
– External fragmentation (unused portions of memory)
– Inefficient disk traffic (may have small or large transfers)
• Hybrid approaches
– Paged segments: segments are a multiple of the page size
– Multiple page sizes (e.g., 8 KB, 64 KB, 512 KB, 4096 KB)

Pages are Cached in a Virtual Memory System
Can ask the same four questions we did about caches
• Q1: Block placement
– Choice: lower miss rate with complex placement, or vice versa
• The miss penalty is huge
• So choose a low miss rate ==> place a page anywhere in physical memory
• Similar to the fully associative cache model
• Q2: Block addressing – use an additional data structure
– Fixed-size pages – use a page table
• virtual page number ==> physical page number, then concatenate the offset
• A tag bit indicates presence in main memory

Normal Page Tables
• Size is the number of virtual pages
• Purpose is to hold the translation of VPN to PPN
– Permits ease of page relocation
– Make sure to keep tags to indicate that a page is mapped
• Potential problem:
– Consider a 32-bit virtual address and 4 KB pages
– 4 GB / 4 KB = 1M entries required just for the page table!
– Might have to page in the page table…
• Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces!
• The Alpha has a 43-bit virtual address with 8 KB pages…
– Might have multi-level page tables

Inverted Page Tables
Similar to a set-associative mechanism
• Make the page table reflect the number of physical pages (not virtual ones)
• Use a hash mechanism
– virtual page number ==> hash ==> index into the inverted page table
– Compare the virtual page number with the tag to make sure it is the one you want
– If yes
• Check that the page is in memory – OK if yes; if not, page fault
– If not – miss
• Go to the full page table on disk to get the new entry
• This implies 2 disk accesses in the worst case
• Trades an increased worst-case penalty for a decrease in the capacity-induced miss rate, since there is now more room for real pages with a smaller page table

Inverted Page Table
[Figure: the virtual page number is hashed to index the inverted page table, which only stores entries for pages in physical memory; on a tag match with the valid bit set, the frame number is combined with the page offset]
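A minimal C sketch of the inverted-table lookup just described, assuming 4 KB pages, a toy multiplicative hash, and exactly one entry per physical frame; real designs chain colliding entries, which is omitted here, and all names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS  12
#define NUM_FRAMES 4096     /* table size tracks # of physical frames */

typedef struct {
    uint32_t vpn_tag;       /* full virtual page number, used as the tag */
    uint32_t valid;         /* entry maps a page resident in memory      */
} ipte_t;

extern ipte_t inv_table[NUM_FRAMES];

/* Toy hash of the virtual page number into the inverted table. */
static uint32_t hash_vpn(uint32_t vpn)
{
    return (vpn * 2654435761u) % NUM_FRAMES;
}

/* Look up a VPN; on a hit, the table index itself is the frame number. */
bool inv_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    uint32_t idx    = hash_vpn(vpn);

    if (inv_table[idx].valid && inv_table[idx].vpn_tag == vpn) {
        *paddr = (idx << PAGE_BITS) | offset;  /* frame no. = table index */
        return true;
    }
    /* Miss: consult the full page table on disk (2 disk accesses worst case). */
    return false;
}
```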
Address Translation Reality
• The translation process using page tables takes too long!
• Use a cache to hold recent translations – the Translation Lookaside Buffer (TLB)
– Typically 8–1024 entries
– Block size the same as a page table entry (1 or 2 words)
– Only holds translations for pages in memory
– 1-cycle hit time
– Highly or fully associative
– Miss rate < 1%
– A miss goes to main memory (where the whole page table lives)
– Must be purged on a process switch

Back to the 4 Questions
• Q3: Block replacement (pages in physical memory)
– LRU is best
• So use it to minimize the horrible miss penalty
– However, real LRU is expensive
• The page table contains a use tag
• On an access, the use tag is set
• The OS checks the tags every so often, records what it sees, and resets them all
• On a miss, the OS decides which page has been used the least
– Basic strategy: the miss penalty is so huge that you can spend a few OS cycles to help reduce the miss rate

Last Question
• Q4: Write policy
– Always write-back
• Due to the access time of the disk
• So you need to keep tags to show when pages are dirty and need to be written back to disk when they're swapped out
– Anything else is pretty silly
– Remember – the disk is SLOW!

Page Sizes
An architectural choice
• Large pages are good:
– Reduce the page table size
– Amortize the long disk access
– If spatial locality is good, then the hit rate will improve
• Large pages are bad:
– More internal fragmentation
• If everything is random, each structure's last page is only half full
• Half of bigger is still bigger
• If there are 3 structures per process (text, heap, and control stack), then 1.5 pages are wasted per process
– Process start-up takes longer
• Since at least 1 page of each type is required prior to start
• The transfer-time penalty is higher

More on TLBs
• The TLB must be on chip
– Otherwise it is worthless
– Small TLBs are worthless anyway
– Large TLBs are expensive
• High associativity is likely
• ==> The price of CPUs is going up!
– OK as long as performance goes up faster

Address Translation with Page Table (Figure 5.40, pg. 444)
• A page table translates a virtual page number into a physical page number
• The page offset remains unchanged
• Page tables are large
– 32-bit virtual address
– 4 KB page size
– 2^20 four-byte table entries = 4 MB
• Page tables are stored in main memory => slow
• Cache table entries in a translation buffer

Fast Address Translation with a Translation Buffer (TB) (Figure 5.41, pg. 446)
• Cache translated addresses in a TB
• Alpha 21064 data TB
– 32 entries
– Fully associative
– 30-bit tag
– 21-bit physical address
– Valid and read/write bits
– Separate TB for instructions
• Steps in translation
– Compare the page number to the tags
– Check for a memory-access violation
– Send the physical page number of the matching tag
– Combine the physical page number and the page offset
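A sketch of the fully associative lookup that Figure 5.41 describes, using the Alpha 21064 data TB parameters from the slide (32 entries, 8 KB pages, valid and read/write bits); hardware compares all tags in parallel, which the sequential loop below only models, and the names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 13        /* 8 KB pages, as on the Alpha 21064  */
#define TLB_SIZE  32        /* 32-entry fully associative data TB */

typedef struct {
    uint64_t vpn;           /* virtual page number (the 30-bit tag) */
    uint32_t ppn;           /* 21-bit physical page number          */
    bool     valid, write;  /* valid and read/write bits            */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SIZE];

/* Returns false on a TB miss (fall back to the page table in
 * memory) or on a memory-access violation.                     */
bool tlb_translate(uint64_t vaddr, bool is_write, uint64_t *paddr)
{
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            if (is_write && !tlb[i].write)
                return false;          /* memory-access violation */
            /* combine physical page number with the page offset */
            *paddr = ((uint64_t)tlb[i].ppn << PAGE_BITS) | offset;
            return true;
        }
    }
    return false;                      /* TB miss */
}
```

On a process switch, all entries would be invalidated, matching the "must be purged" bullet above.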
Selecting a Page Size
• Reasons for a larger page size
– The page table size is inversely proportional to the page size; memory is therefore saved
– A fast cache hit time is easy when cache size < page size (virtually addressed caches); a bigger page makes this feasible as the cache size grows
– Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
– The number of TLB entries is restricted by the clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
– Want to avoid internal fragmentation: don't waste storage; data must be contiguous within a page
– Quicker process start-up for small processes – don't need to bring in more memory than needed

Memory Protection
• With multiprogramming, a computer is shared by several programs or processes running concurrently
– Need to provide protection
– Need to allow sharing
• Mechanisms for providing protection
– Provide base and bound registers: Base ≤ Address ≤ Bound
– Provide both user and supervisor (operating system) modes
– Provide CPU state that the user can read, but cannot write
• Base and bound registers, the user/supervisor bit, exception bits
– Provide a method to go from user to supervisor mode and vice versa
• system call: user to supervisor
• system return: supervisor to user
– Provide permissions for each page or segment in memory

Alpha VM Mapping (Figure 5.43, pg. 451)
• The "64-bit" address is divided into 3 segments
– seg0 (bit 63 = 0): user code
– seg1 (bit 63 = 1, bit 62 = 1): user stack
– kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
• Three-level page table, each level one page
– Reduces the page table size
– Increases translation time
• PTE bits: valid, kernel and user read and write enable

Alpha 21064 Memory Hierarchy
• The Alpha 21064 memory hierarchy includes
– A 32-entry, fully associative data TB
– A 12-entry, fully associative instruction TB
– An 8 KB direct-mapped, physically addressed data cache
– An 8 KB direct-mapped, physically addressed instruction cache
– A 4-entry by 64-bit instruction prefetch stream buffer
– A 4-entry by 256-bit write buffer
– A 2 MB direct-mapped second-level unified cache
• The virtual memory
– Maps a 43-bit virtual address to a 34-bit physical address
– Has a page size of 8 KB
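As a worked check of the three-level scheme on the Alpha VM Mapping slide (assuming 8-byte page table entries, which the slides do not state): one 8 KB page holds

  8 KB / 8 B per PTE = 2^10 entries,

so each level resolves 10 bits of the virtual page number, and

  10 (level 1) + 10 (level 2) + 10 (level 3) + 13 (page offset) = 43 bits,

exactly the virtual address width given above. The cost is up to three memory accesses per translation on a TB miss, which is why the slide notes that translation time increases.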
Alpha Memory Performance: Miss Rates
[Figure: miss rates (0.01%–100%, log scale) of the I$ (8K), D$ (8K), and L2 (2M) caches across SPEC92 benchmarks, TPC-B (db1, db2), and AlphaSort]

Alpha CPI Components
• The largest increases in CPI are due to
– I stall: instruction stalls from branch mispredictions
– Other: data hazards and structural hazards
[Figure: CPI (0.00–4.50) broken into L2, I$, D$, I stall, and other components across SPEC92 benchmarks, TPC-B (db1, db2), and AlphaSort]

Pitfall: Address space too small
• One of the biggest mistakes that can be made when designing an architecture is to devote too few bits to the address
– The address size limits the size of virtual memory
– It is difficult to change, since many components depend on it (e.g., the PC, registers, effective-address calculations)
• As program sizes increase, larger and larger address sizes are needed
– 8 bits: Intel 8080 (1975)
– 16 bits: Intel 8086 (1978)
– 24 bits: Intel 80286 (1982)
– 32 bits: Intel 80386 (1985)
– 64 bits: Intel Merced (1998)

Pitfall: Predicting Cache Performance of One Program from Another Program
• Is the 4 KB data cache miss rate 8%, 12%, or 28%?
• Is the 1 KB instruction cache miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%
[Figure: instruction and data cache miss rates (0%–35%) vs. cache size (1–128 KB) for gcc, espresso, and tomcatv]

Pitfall: Simulating Too Small an Address Trace
[Figure: cumulative average memory access time (1–4.5) vs. instructions executed (0–12 billion)]

Virtual Memory Summary
• Virtual memory (VM) allows main memory (DRAM) to act like a cache for secondary storage (magnetic disk).
• The large miss penalty of virtual memory leads to different strategies than for caches
– Fully associative placement, TB + page table, LRU, write-back
• Designed as
– Paged: fixed-size blocks
– Segmented: variable-size blocks
– Hybrid: segmented paging or multiple page sizes
• Avoid a small address size

Summary 2: Typical Choices

Option                 TLB                        L1 Cache         L2 Cache          VM (page)
Block size             4–8 bytes (1 PTE)          4–32 bytes       32–256 bytes      4K–16K bytes
Hit time               1 cycle                    1–2 cycles       6–15 cycles       10–100 cycles
Miss penalty           10–30 cycles               8–66 cycles      30–200 cycles     700K–6M cycles
Local miss rate        0.1–2%                     0.5–20%          13–15%            0.00001–0.001%
Size                   32 B – 8 KB                1–128 KB         256 KB – 16 MB    —
Backing store          L1 cache                   L2 cache         DRAM              Disks
Q1: block placement    Fully or set associative   DM               DM or SA          Fully associative
Q2: block ID           Tag/block                  Tag/block        Tag/block         Table
Q3: block replacement  Random (not last)          N.A. (for DM)    Random (if SA)    LRU/LFU