Today's Menu: Virtual Memory II

Remaining virtual memory issues:
- Accelerating address translation: the Translation Lookaside Buffer (TLB)
- Page table size and multi-level page tables
- Cache and TLB interactions
- Physical caches vs. virtual caches

Memory Hierarchy of a Modern Computer System

By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

[Figure: processor (control, datapath, registers, local cache), shared cache (SRAM), main memory (DRAM), secondary storage (disk), with roughly these speeds and sizes:]

  Level                       Speed (ns)             Size (bytes)
  Registers                   1s                     100s
  Local cache                 5-10                   Ks
  Shared cache (SRAM)         10-50                  Ms
  Main memory (DRAM)          100s                   Gs
  Secondary storage (disk)    10,000,000s (10s ms)   Ts

TLB Structure & Performance

- TLB structure: 32 to 1024 entries (slots); can be direct mapped, set associative, or fully associative. It is easier to be fully associative here, since TLBs are often pretty small.
- TLB miss cost: 20 to 500 cycles (much longer if the page is on disk); misses can be handled in hardware or in software.
- TLB miss rates: previously determined to be 4% to 8% for typical Unix workloads, and can be much higher for large applications. The operating system and the page size can significantly influence the TLB miss rate.

Next Issue: Page Table Size

- Page size: 4K or 8K bytes is typical; processors often support multiple page sizes, up to 2GB(!). 2MB is becoming more widely used ("large pages").
- Example: with 4K pages and a 32-bit virtual address space, 20 bits are left for the page number --> 2^20 virtual pages.
- 2^20 == 1,048,576 PTEs, and each PTE is at least 4 bytes.
- 1 million PTEs * 4 bytes/PTE = 4 megabytes of page table.
- Hey: you said each process has its own page table! 100 processes = 400 MB of memory for page tables!!!!
- Worse: if the page table is not in memory, you can't find it, so the page table cannot be swapped out. What's the deal?

Solution: Multi-level Page Tables

Example from Linux: have two levels of page tables.

[Figure: a level 1 table (PTE 0 .. PTE 1023) whose entries point to level 2 page tables; the first level 2 table (PTE 0 .. PTE 1023) maps VP 0 .. VP 1023, the next maps VP 1024 .. VP 2047, and so on. Unused level 1 entries (e.g., PTE 2, PTE 3) are null.]

- Each level 2 table is itself exactly one page (4K) in size.
- Only the level 1 table needs to stay in memory at all times.

Multi-level Page Tables

The virtual address is split into three fields: VPN 1 | VPN 2 | Offset. VPN 1 indexes the Level 1 Table (the page directory); the selected entry points to a Level 2 Table (a page table), which VPN 2 indexes to obtain the PPN. The physical address is then PPN | Offset.

Multi-level advantages:
- Minimum page table size: 4KB + 4KB; maximum: 4KB + 4MB. Page table size scales with memory usage.
- Only level 1 MUST be in real memory; level 2 tables can be swapped just like any other page.
- Multi-level and TLB integration: the TLB only stores level 2 PTEs, so we don't have to do two TLB accesses to do a VA -> PA translation.

A minimal code walk of this two-level structure appears below, after the reverse-page-table alternative.

Alternative Implementation: Reverse Page Tables

- Keep a table of physical page indices and hash the VPN into a PPN index in the reverse table.
- Effectiveness depends on how well the hash works.
- Such a table is already implemented as part of the OS data structures used to identify physical pages for eviction/replacement.
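As promised above, here is a minimal sketch in C of a software walk of the two-level structure, assuming the 32-bit layout discussed earlier (10-bit VPN 1, 10-bit VPN 2, 12-bit offset). The structure and macro names, and the PTE layout (a valid bit plus the PPN), are illustrative assumptions rather than any particular OS's or architecture's format.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative 32-bit split: 10-bit VPN1 | 10-bit VPN2 | 12-bit offset. */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define LEVEL_BITS   10
#define LEVEL_SIZE   (1u << LEVEL_BITS)          /* 1024 entries per table */

/* Hypothetical PTE: bit 0 = valid, upper 20 bits = PPN. */
typedef uint32_t pte_t;
#define PTE_VALID(p)  ((p) & 1u)
#define PTE_PPN(p)    ((p) >> PAGE_SHIFT)

/* A level 1 entry points to a level 2 table; a level 2 entry holds the PPN. */
typedef struct { pte_t entries[LEVEL_SIZE]; } page_table_t;
typedef struct { page_table_t *tables[LEVEL_SIZE]; } page_directory_t;

/* Walk the two levels; return 0 on success, -1 if any level is missing
 * (which would trigger a page fault on real hardware). */
int translate(const page_directory_t *dir, uint32_t va, uint32_t *pa)
{
    uint32_t vpn1   = (va >> (PAGE_SHIFT + LEVEL_BITS)) & (LEVEL_SIZE - 1);
    uint32_t vpn2   = (va >> PAGE_SHIFT) & (LEVEL_SIZE - 1);
    uint32_t offset = va & (PAGE_SIZE - 1);

    page_table_t *l2 = dir->tables[vpn1];        /* level 1 lookup */
    if (l2 == NULL)
        return -1;                               /* no level 2 table: fault */

    pte_t pte = l2->entries[vpn2];               /* level 2 lookup */
    if (!PTE_VALID(pte))
        return -1;                               /* page not present: fault */

    *pa = (PTE_PPN(pte) << PAGE_SHIFT) | offset; /* PPN || offset */
    return 0;
}
```

The sketch makes the space savings concrete: a page directory plus a single level 2 table is 4KB + 4KB, and level 2 tables only need to exist for virtual regions that are actually in use.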
Next Issue: Cache & TLBs, How They Interact

We build memory hierarchies to hide unpleasant facts:
- We don't have as much fast memory as we would ideally like. Solution? The cache hierarchy gives the illusion of speed--most of the time. Occasionally it's slow.
- We don't have as much memory as we would like. Solution? The VM hierarchy gives the illusion of size--most of the time. Occasionally it's slow.
- Roughly put: we do cache for speed; we do VM for size.

So, we have a regular cache for fast access, and we have a TLB for fast VA->PA translation. How do they interact? They must interact somehow...
- Do we wait for the VA->PA translation before looking in the cache?
- Is the cache full of virtual or physical addresses?

Address Translation/Cache Lookup: Simplest Scheme is Sequential (TLB then Cache)

1. The CPU sends out a virtual address.
2. The TLB translates it to a physical address, or page faults and we wait for the page to load.
3. On a TLB hit, the translated physical address is sent to the cache.
4. The cache lookup gives data access fast, or...
5. ...a cache miss goes to main memory.

This is a physical cache (physical tags). Simple! But also the slowest scheme: you have to wait for the translation before you can check the cache.

[Figure: CPU -> virtual address -> TLB -> physical address -> Cache -> Memory.]

Note that the same address is interpreted two different ways: the virtual address is VPN | PO, and the TLB maps the VPN to a PPN; the resulting physical address PPN | PO is then interpreted by the cache as TAG | IDX | BO, with the tag compare (=?) producing Data and Hit/Miss.

TLBs and Caches: Basic Flow for Access -- Real Example: DECstation 3100

[Figure: access flowchart. The 32-bit virtual address (20-bit virtual page number, 12-bit page offset) is presented to a 64-entry, fully associative TLB; each entry holds valid and dirty bits, a tag, and a physical page number. A TLB miss raises an exception; a TLB hit yields the 20-bit physical page number, which is concatenated with the page offset to form the physical address. The 64 KB direct-mapped cache splits that physical address into a 16-bit tag, a 14-bit cache index, and a 2-bit byte offset, and compares the stored tag (=?). On a read, a cache hit delivers the data to the CPU and a miss stalls. On a write, if the write access bit is off, a write protection exception is raised; otherwise the data is written into the cache, the tag is updated, and the data and address are put into the write buffer.]

Protection and the TLB

- A process presents a virtual address to the TLB for translation.
- There are two separate VM address spaces here, one per process; for example, either process could present virtual address 0x2000.
- But there is only 1 TLB, NOT one per process. (Why not one per process?) So how does the TLB know which process's VA->PA mapping it holds?

[Figure: a fully associative TLB with entries TAG 0x00002 -> PPN 0x105 and TAG 0x00004 -> PPN 0x094; virtual address 0x00002 000 maps to physical address 0x105 000 (physical page frame 0x105).]

Protection and the TLB (cont.)

- One solution is to flush the TLB on a context switch; flush means empty it.
- Many machines instead append a PID (process ID) to each TLB entry. The OS maintains a Process ID Register, updated during the switch between processes (called a context switch). The example entries become (PID 1, TAG 0x00002) -> PPN 0x105 and (PID 2, TAG 0x00004) -> PPN 0x094. Why is this better/worse? (A code sketch of a PID-tagged TLB used in the sequential scheme appears after the Resource Usage discussion below.)

Speed & Timing Impacts

- You have to do a lookup in the TLB... then you have to do a lookup in the cache. That involves a lot of memory access time.
- If we do these accesses sequentially, there is a big impact on speed.
- One solution: pipelining. Ex: the MIPS R3000 instruction pipeline: Inst Fetch (TLB, I-Cache) | Decode/Reg. Read (RF) | ALU / E.A. | Memory Operation (TLB, D-Cache) | Write Reg (WB).

Resource Usage

- Spread the accesses across stages of the pipeline.
- The TLB and cache are just like any other resource in the pipeline: you have to know how long they take (it impacts the pipe cycle time), and you have to know who is trying to use the resource in which pipe stage.
- You can have hazards, need stalls, need forwarding paths, etc.
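As promised, here is a minimal C sketch of the sequential scheme with a PID-tagged, fully associative TLB: translation completes first, and only then is the physical address split into the cache's tag/index/offset fields. The entry count and cache geometry echo the DECstation 3100 numbers above, but the structures, field widths, and names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define TLB_ENTRIES 64              /* small, fully associative TLB        */

typedef struct {
    bool     valid;
    uint8_t  pid;                   /* process ID appended to the entry    */
    uint32_t vpn;                   /* virtual page number (the TLB tag)   */
    uint32_t ppn;                   /* physical page number                */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static uint8_t process_id_register; /* updated by the OS on a context switch */

/* Step 1: TLB lookup. Hardware compares every entry in parallel;
 * the loop stands in for that associative search. */
bool tlb_translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid &&
            tlb[i].pid == process_id_register &&   /* no flush needed on switch */
            tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;                           /* TLB hit */
        }
    }
    return false;                                  /* TLB miss: walk the page table */
}

/* Step 2: only after translation do we split the *physical* address into the
 * cache fields (TAG | IDX | BO) -- this serialization is what makes the
 * simple scheme slow. */
#define BLOCK_BITS  2               /* 4-byte blocks            */
#define INDEX_BITS  14              /* 64 KB direct-mapped      */

void split_physical_address(uint32_t pa, uint32_t *tag, uint32_t *idx, uint32_t *bo)
{
    *bo  = pa & ((1u << BLOCK_BITS) - 1);
    *idx = (pa >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    *tag = pa >> (BLOCK_BITS + INDEX_BITS);
}
```

The per-entry pid comparison against the Process ID Register is what lets the TLB keep mappings for several processes resident at once instead of being flushed on every context switch.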
Speeding it Up

TLB and then cache... why? What else could we do? Two options:
1. Overlapped cache & TLB access (do them in parallel).
2. A virtually addressed cache. Why does our cache use physical addresses at all? Could it store virtual addresses? What are the problems/considerations? What are the limitations?

Overlapped Cache & TLB Access

[Figure: the 20-bit page number of the virtual address feeds a 32-entry associative TLB lookup while the 12-bit page offset feeds a 1K direct-mapped cache with 4-byte blocks (10-bit index, 2-bit byte offset); the PA from the TLB is compared against the cache tag to give Hit/Miss.]

IF (cache hit) AND (cache tag = PA) THEN deliver data to the CPU
ELSE IF [cache miss OR (cache tag != PA)] AND (TLB hit) THEN access memory with the PA from the TLB
ELSE do the standard VA translation.

How Overlapping Reduces Translation Time

Basic plumbing: the high-order bits of the VA are used to look in the TLB, while the low-order bits are used as the index into the cache.
- Remember: the low-order bits are the page offset--which byte address on the page. The high-order bits are what really changes from virtual to physical address.
- Remember: the lowest bits are the cache block offset--which byte on the cache line. The intermediate bits are the cache index: which line in the cache should we check to see if the address we want is actually loaded into the cache. The highest-order bits are the cache tag: when we look at a line of the cache, we compare these bits with the stored tag to see whether what's in the cache is really the address we want, or just another line of memory that happens to map to this same place in the cache.

For the overlap to work, the index and block offset must fit inside the page offset: IDX + BO <= PO, which means we must satisfy Cache size / Associativity <= Page size. The action here is on the cache tag high-order bits...
- Simple for small caches. Assume 4K pages and 2-way set-associative caches: what is the max cache size allowed for parallel address translation to work? (With this constraint, 2 ways x 4KB = 8KB.) What happens for large caches?

[Figure: virtual address VPN | PO, with IDX | BO contained entirely within PO; the TLB produces the PPN, which serves as the TAG compared (=?) against the cache tag to produce Data and Hit/Miss.]

Cache vs. TLB Access: Two Cache Architectures

Remember: the cache index is used to look up data in the cache, and the cache tag is used to verify what data is in the cache. Should we use the VA instead? Should we use only VA bits to index?
1. Virtually-indexed, virtually-tagged caches (also known as virtually-addressed or virtual address caches): the VPN bits are used for both the tag and the index.
2. Virtually-indexed, physically-tagged caches: the VPN bits only contribute to the index; the tag is physical and requires a TLB lookup, but that lookup can be done in parallel.

Virtual Address Cache

- Lookup uses the VA directly: the VPN and page offset are interpreted as TAG | IDX | BO, so we don't need to wait for the TLB.
- The TLB is accessed only on a miss, and the resulting PA is used to access the next level (L2).

Multiple Virtual Address Spaces (Multiprogramming)

[Figure: P1 and P2 each have their own VM space (pages 0x00 through 0x04, up to top 0xfffff000), both mapped into one physical memory (frames 0x00 through 0x0B).]

With a virtual address cache, a load/store to VA page 0x02 finds a data block tagged "VPN 0x02"--but is it the VA from Process 1 or Process 2? Is it a hit? A miss?

Multiple Address Space Solution

1. Keep a process id with the cache block tags: upon cache lookup, check both the address tag and the process id.
2. Flush the cache on a context switch: expensive, especially for larger caches (you lose the contents of the cache on every switch).
3. Use a single-address-space OS: not practical today (requires too much compiler analysis).

Or, Only Use Virtual Bits to Index the Cache

A virtually-indexed, physically-tagged cache: the cache index comes from the virtual address while the TLB access proceeds in parallel, so we don't need to wait for the TLB, and because the tags are physical we can distinguish addresses from different processes. But what if multiple processes share memory?

[Figure: the virtual address VPN | PO supplies IDX | BO to the cache while the TLB translates the VPN to a PPN; the PPN is used as the TAG in the compare (=?) that produces Hit/Miss.]
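Here is a minimal C sketch of that parallel lookup: the set is selected from virtual-address bits while the TLB translates the VPN, and the physical tag comparison happens last. The geometry constants are illustrative, the tlb_translate declaration is the hypothetical helper from the earlier sketch, and the assert encodes the condition from above (cache size / associativity <= page size) under which the index bits fall entirely within the page offset.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Illustrative geometry: 8 KB, 2-way set-associative, 32-byte blocks, 4 KB pages. */
#define PAGE_SIZE    4096u
#define CACHE_SIZE   8192u
#define WAYS         2u
#define BLOCK_SIZE   32u
#define SETS         (CACHE_SIZE / (WAYS * BLOCK_SIZE))   /* 128 sets */

typedef struct {
    bool     valid;
    uint32_t ptag;                  /* physical tag = physical page number */
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[SETS][WAYS];

/* Assumed to exist elsewhere -- the TLB helper from the earlier sketch. */
bool tlb_translate(uint32_t va, uint32_t *pa);

bool cache_lookup(uint32_t va, uint8_t **block_out)
{
    /* Safe-overlap condition: index + block offset fit in the page offset. */
    assert(CACHE_SIZE / WAYS <= PAGE_SIZE);

    /* These bits lie inside the page offset, so they are identical in the
     * VA and the PA -- we can select the set before translation finishes. */
    uint32_t set = (va / BLOCK_SIZE) % SETS;

    /* Conceptually in parallel: the TLB turns the VPN into a PPN. */
    uint32_t pa;
    if (!tlb_translate(va, &pa))
        return false;                         /* TLB miss: do the full walk */

    uint32_t ptag = pa / PAGE_SIZE;           /* physical page number as the tag */

    /* Final step: compare the physical tag against each way of the set. */
    for (uint32_t w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].ptag == ptag) {
            *block_out = cache[set][w].data;  /* hit: deliver data */
            return true;
        }
    }
    return false;                             /* miss: fetch using pa */
}
```

If the geometry violated that assertion (say a 64 KB, 2-way cache with 4 KB pages), some index bits would come from the VPN, so the same physical block could live in more than one set--which is exactly the synonym problem discussed next.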
Virtual Address Synonyms

[Figure: P1 makes a reference to data on its page 0x04. The data is shared with P2 and is already in the cache, but it was brought in under P2's virtual page number, so it sits in set X while P1's virtual address indexes set Y. Physical memory frames 0x00 through 0x0B hold the shared data.]

P1's lookup is a miss even though P2's copy of the block is sitting in the cache: P1 can't look it up!

Virtual Address Synonyms (cont.)

[Figure: P1 and P2 each have their own VM space, pages 0x00 through 0x04, up to top 0xfffff000.]

- Virtual addresses on page 0x04 of P1 are synonyms of those on page 0x00 of P2; synonyms are also referred to as aliases. The page is shared among the processes (example shared pages are kernel data structures).
- We must avoid allowing multiple synonyms to co-exist in the cache.
- Only memory that is read/written must be resolved; read-only memory (e.g., instructions) can exist in multiple cache locations.

Synonym Solutions

- Avoid: limit the cache size to the page size times the associativity, so the index comes entirely from the page offset.
- Avoid: eliminate synonyms by OS convention--a single virtual space, or a restrictive sharing model.
- Detect: search all possibly-aliased sets in parallel. For a 64K 4-way cache with 4K pages, that means searching 4 sets (16 entries); a small worked sketch of this arithmetic follows the summary below.
- Reduce the search space: restrict page placement in the OS so that index(VA) = index(PA) (page coloring).

Summary

Memory access is hard and complicated!
- The speed of the CPU core demands very fast memory access. We do the cache hierarchy to solve this one. It gives the illusion of speed--most of the time. Occasionally it's slow.
- The size of programs demands a large RAM. We do the VM hierarchy to solve this one. It gives the illusion of size--most of the time. Occasionally it's slow.
- The VM hierarchy is another form of cache, but now between RAM and disk. The atomic units of memory are pages, typically 4KB to 2MB. The page table serves as the translation mechanism from virtual to physical addresses; it lives in physical memory and is managed by the OS. For 64b addresses, multi-level tables are used, and some of the table is itself in VM.
- The TLB is yet another cache--it caches translated addresses (page table entries), saving us from having to go to physical memory to do a lookup on every access. It is usually very small and is managed by the OS.
- VM, the TLB, and the cache have interesting interactions: big impacts on speed and pipelining, and big impacts on exactly where the virtual-to-physical mapping takes place.
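As referenced under Synonym Solutions, here is a small worked sketch of the "search all possibly-aliased sets" arithmetic. The constants repeat the 64K 4-way, 4K-page example from that slide; everything else is illustrative.

```c
#include <stdio.h>
#include <stdint.h>

/* Example from the Synonym Solutions slide: 64 KB, 4-way cache, 4 KB pages. */
#define CACHE_SIZE  (64u * 1024u)
#define WAYS        4u
#define PAGE_SIZE   (4u * 1024u)

int main(void)
{
    uint32_t way_size = CACHE_SIZE / WAYS;          /* bytes indexed per way: 16 KB */

    /* If one way is larger than a page, some index bits come from the VPN.
     * Each such bit doubles the number of sets a physical block may alias into. */
    uint32_t aliased_sets = (way_size > PAGE_SIZE) ? way_size / PAGE_SIZE : 1u;
    uint32_t entries_to_search = aliased_sets * WAYS;

    printf("aliased sets to search: %u\n", aliased_sets);        /* 4     */
    printf("total entries to check: %u\n", entries_to_search);   /* 16    */
    printf("alias-free limit (page size * assoc): %u bytes\n",
           PAGE_SIZE * WAYS);                                    /* 16 KB */
    return 0;
}
```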