Virtual Memory: Topics
- Virtual memory access
- Page tables and the TLB
- Programming for locality
- Memory Mountain revisited

Memory Hierarchy
From smaller, faster, and costlier per byte to larger, slower, and cheaper per byte:
- registers
- on-chip L1 cache (SRAM)
- on-chip L2 cache (SRAM)
- main memory (DRAM)
- local secondary storage (local disks)
- remote secondary storage (tapes, distributed file systems, Web servers)

Why Caches Work
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses (e.g., in the same block) tend to be referenced close together in time.

Cache (L1 and L2) Performance Metrics
- Miss rate: the fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
  Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time: the time to deliver a block in the cache to the processor, including the time to determine whether the line is in the cache.
  Typical numbers: 1-3 clock cycles for L1; 5-20 clock cycles for L2.
- Miss penalty: the additional time required because of a miss; typically 50-400 cycles for main memory.

Let's Think About Those Numbers
- There is a huge difference between a hit and a miss: it could be 100x with just L1 and main memory.
- Would you believe that 99% hits is twice as good as 97%? Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles. Average access time:
  97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles

Types of Cache Misses
- Cold (compulsory) miss: occurs on the first access to a block. Spatial locality of access helps (so does prefetching; more on that later).
- Conflict miss: multiple data objects all map to the same slot (as in hashing). E.g., if block i must be placed in cache entry/slot i mod 8, replacing the block already in that slot, then referencing blocks 0, 8, 0, 8, ... would miss every time. Conflict misses are less of a problem these days: set-associative caches, with 8 or 16 lines per set, help.
- Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache. This is where to focus nowadays.

What About Writes?
Multiple copies of the data exist: L1, L2, main memory, disk.
- What to do on a write-hit?
  - Write-back (typical): defer the write to memory until the line is replaced. Needs a dirty bit recording whether the line differs from memory.
  - Write-through (rare): write immediately to memory; usually used for I/O.
- What to do on a write-miss?
  - Write-allocate (typical): load the block into the cache, then update the line in the cache.
The typical combination is write-back + write-allocate.

Main Memory Is Something Like a Cache (for Disk)
- Driven by the enormous miss penalty: disk is about 10,000x slower than DRAM.
- DRAM "cache" design: large page (block) size, typically 4KB.

Virtual Memory
- Programs refer to virtual memory addresses: conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64-bit), where each byte has its own address.
- The system provides an address space private to each process.
- Allocation: the compiler and run-time system do all allocation within this single virtual address space.

Virtual Addressing
[Figure: the CPU issues a virtual address (VA); the MMU translates it to a physical address (PA), which is used to access main memory and return the data word to the CPU.]
- MMU = Memory Management Unit.
- The MMU needs a table of translations: it keeps the mapping of VAs to PAs in a "page table".
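To make the translation step concrete, here is a minimal sketch of what the MMU does, written as C. It assumes a toy single-level page table with 4KB pages; the names (pte_t, page_table, translate) and the tiny 8-entry table are illustrative, not a real design, and an actual MMU performs this lookup in hardware.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096u   /* 4KB pages, as on IA32 */
#define PAGE_SHIFT 12      /* log2(PAGE_SIZE) */
#define NUM_PAGES  8       /* toy address space: 8 virtual pages */

/* One page-table entry (PTE): a valid bit plus a physical page number. */
typedef struct {
    int      valid;
    uint32_t ppn;          /* physical page number */
} pte_t;

/* Toy page table, indexed by virtual page number (VPN). */
static pte_t page_table[NUM_PAGES] = {
    {1, 3}, {1, 7}, {0, 0}, {1, 5},
    {0, 0}, {1, 0}, {0, 0}, {1, 1},
};

/* Translate a virtual address into a physical address.
   Returns 0 and sets *pa on success; returns -1 on a page fault. */
int translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> PAGE_SHIFT;      /* which virtual page     */
    uint32_t offset = va & (PAGE_SIZE - 1);  /* offset within the page */

    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return -1;                           /* page fault */

    /* Same offset, but in the physical page the PTE points to. */
    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return 0;
}

int main(void)
{
    uint32_t pa;
    if (translate(0x1ABC, &pa) == 0)         /* VPN 1, offset 0xABC */
        printf("VA 0x1ABC -> PA 0x%x\n", (unsigned)pa);  /* 0x7ABC */
    return 0;
}
```

Note that only the page number is translated; the offset within the page passes through unchanged, which is why the page size must be a power of two.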
Where Is the Page Table Kept?
- In main memory. Like ordinary data, it can be cached, e.g., in L2.

Speeding Up Translation with a TLB
- Translation Lookaside Buffer (TLB): a small hardware cache for the page table, inside the MMU.
- It caches page table entries (PTEs) for a number of pages (e.g., 256 entries).

TLB Hit
[Figure: the CPU sends the VA to the MMU; the MMU finds the PTE in the TLB, forms the PA, and makes a single memory access for the data.]
- A TLB hit saves you from accessing memory for the page table.

TLB Miss
[Figure: on a miss, the MMU must first fetch the PTE from the page table in main memory, and only then access memory again for the data.]
- A TLB miss incurs an additional memory access (to read the page table).

How to Program for Virtual Memory
- At any point in time, programs tend to access a set of active virtual pages called the working set. Programs with better temporal locality have smaller working sets.
- If (working set size) > (main memory size): thrashing, a performance meltdown where pages are swapped (copied) in and out continuously.
- If (# working-set pages) > (# TLB entries): the program will suffer TLB misses. Not as bad as page thrashing, but still worth avoiding.

More on TLBs
- Assume a 256-entry TLB and 4KB pages. Then there can only be TLB hits for 1MB of data (256 * 4KB = 1MB). This is called the "TLB reach": the amount of memory the TLB can cover.
- A typical L2 cache is 6MB, so you should perhaps consider TLB reach before L2 size when tiling.
- Real CPUs also have second-level TLBs (like an L2 cache for the TLB). This is getting complicated to reason about; you will likely have to experiment to find the best tile size.

Memory Optimization: Summary
- Caches:
  - Conflict misses: not much of a concern (set-associative caches).
  - Cache capacity: keep the working set within on-chip cache capacity; fit in L1 or L2 depending on working-set size.
- Virtual memory:
  - Page misses: keep the page-level working set within main memory capacity.
  - TLB misses: may want to keep the number of working-set pages below the number of TLB entries.

IA32 Linux Memory Layout
- Stack: the runtime stack (8MB limit).
- Data: statically allocated data, e.g., arrays and strings declared in code.
- Heap: dynamically allocated storage, obtained by calling malloc(), calloc(), or (in C++) new.
- Text: executable machine instructions; read-only.
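To see this layout from a running program, here is a small sketch that prints one address from each region. It assumes a typical Linux system; the variable names are illustrative, and the exact addresses vary from run to run (e.g., with address-space randomization), but the text segment should print lowest and the stack highest.

```c
#include <stdio.h>
#include <stdlib.h>

int global_var = 1;                          /* data: statically allocated */

int main(void)
{
    int  local_var = 2;                      /* stack: runtime stack       */
    int *heap_var  = malloc(sizeof(int));    /* heap: dynamic allocation   */

    /* Casting a function pointer to void* for %p is not strictly
       portable C, but works on Linux and shows where the code lives. */
    printf("text : %p (main)\n",       (void *)main);
    printf("data : %p (global_var)\n", (void *)&global_var);
    printf("heap : %p (heap_var)\n",   (void *)heap_var);
    printf("stack: %p (local_var)\n",  (void *)&local_var);

    free(heap_var);
    return 0;
}
```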