Virtual Memory: Topics
- Virtual memory access
- Page tables and the TLB
- Programming for locality
- Memory Mountain revisited

Memory Hierarchy
From smaller, faster, and costlier per byte to larger, slower, and cheaper per byte:
- registers
- on-chip L1 cache (SRAM)
- on-chip L2 cache (SRAM)
- main memory (DRAM)
- local secondary storage (local disks)
- remote secondary storage (tapes, distributed file systems, Web servers)

Why Caches Work
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses (e.g., in the same block) tend to be referenced close together in time.

Cache (L1 and L2) Performance Metrics
- Miss rate: the fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
  Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time: the time to deliver a block in the cache to the processor, including the time to determine whether the line is in the cache.
  Typical numbers: 1-3 clock cycles for L1; 5-20 clock cycles for L2.
- Miss penalty: the additional time required because of a miss; typically 50-400 cycles for main memory.

Let's Think About Those Numbers
- There is a huge difference between a hit and a miss: it could be 100x with just L1 and main memory.
- Would you believe that 99% hits is twice as good as 97%? Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles. Average access time:
  97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles

Types of Cache Misses
- Cold (compulsory) miss: occurs on the first access to a block. Spatial locality of access helps (so does prefetching; more on that later).
- Conflict miss: multiple data objects all map to the same slot (as in hashing). E.g., if block i must be placed in cache entry/slot i mod 8, replacing the block already in that slot, then referencing blocks 0, 8, 0, 8, ... would miss every time. Conflict misses are less of a problem these days: set-associative caches, with 8 or 16 lines per set, help.
- Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache. This is where to focus nowadays.

What About Writes?
Multiple copies of the data exist: L1, L2, main memory, disk.
- What to do on a write-hit?
  - Write-back (typical): defer the write to memory until the line is replaced. Needs a dirty bit recording whether the line differs from memory.
  - Write-through (rare): write immediately to memory; usually used for I/O.
- What to do on a write-miss?
  - Write-allocate (typical): load the block into the cache, then update the line in the cache.
The typical combination is write-back + write-allocate.

Main Memory Is Something Like a Cache (for Disk)
- Driven by the enormous miss penalty: disk is about 10,000x slower than DRAM.
- DRAM "cache" design: large page (block) size, typically 4KB.

Virtual Memory
- Programs refer to virtual memory addresses: conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64-bit), where each byte has its own address.
- The system provides an address space private to each process.
- Allocation: the compiler and run-time system do all allocation within this single virtual address space.

Virtual Addressing
[Figure: the CPU issues a virtual address (VA); the MMU translates it to a physical address (PA), which is used to access main memory and return the data word to the CPU.]
- MMU = Memory Management Unit.
- The MMU needs a table of translations: it keeps the mapping of VAs to PAs in a "page table".
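To make the translation step concrete, here is a minimal sketch of what the MMU does, written as C. It assumes a toy single-level page table with 4KB pages; the names (pte_t, page_table, translate) and the tiny 8-entry table are illustrative, not a real design, and an actual MMU performs this lookup in hardware.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096u   /* 4KB pages, as on IA32 */
#define PAGE_SHIFT 12      /* log2(PAGE_SIZE) */
#define NUM_PAGES  8       /* toy address space: 8 virtual pages */

/* One page-table entry (PTE): a valid bit plus a physical page number. */
typedef struct {
    int      valid;
    uint32_t ppn;          /* physical page number */
} pte_t;

/* Toy page table, indexed by virtual page number (VPN). */
static pte_t page_table[NUM_PAGES] = {
    {1, 3}, {1, 7}, {0, 0}, {1, 5},
    {0, 0}, {1, 0}, {0, 0}, {1, 1},
};

/* Translate a virtual address into a physical address.
   Returns 0 and sets *pa on success; returns -1 on a page fault. */
int translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> PAGE_SHIFT;      /* which virtual page     */
    uint32_t offset = va & (PAGE_SIZE - 1);  /* offset within the page */

    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return -1;                           /* page fault */

    /* Same offset, but in the physical page the PTE points to. */
    *pa = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return 0;
}

int main(void)
{
    uint32_t pa;
    if (translate(0x1ABC, &pa) == 0)         /* VPN 1, offset 0xABC */
        printf("VA 0x1ABC -> PA 0x%x\n", (unsigned)pa);  /* 0x7ABC */
    return 0;
}
```

Note that only the page number is translated; the offset within the page passes through unchanged, which is why the page size must be a power of two.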
Where Is the Page Table Kept?
- In main memory. Like ordinary data, it can be cached, e.g., in L2.

Speeding Up Translation with a TLB
- Translation Lookaside Buffer (TLB): a small hardware cache for the page table, inside the MMU.
- It caches page table entries (PTEs) for a number of pages (e.g., 256 entries).

TLB Hit
[Figure: the CPU sends the VA to the MMU; the MMU finds the PTE in the TLB, forms the PA, and makes a single memory access for the data.]
- A TLB hit saves you from accessing memory for the page table.

TLB Miss
[Figure: on a miss, the MMU must first fetch the PTE from the page table in main memory, and only then access memory again for the data.]
- A TLB miss incurs an additional memory access (to read the page table).

How to Program for Virtual Memory
- At any point in time, programs tend to access a set of active virtual pages called the working set. Programs with better temporal locality have smaller working sets.
- If (working set size) > (main memory size): thrashing, a performance meltdown where pages are swapped (copied) in and out continuously.
- If (# working-set pages) > (# TLB entries): the program will suffer TLB misses. Not as bad as page thrashing, but still worth avoiding.

More on TLBs
- Assume a 256-entry TLB and 4KB pages. Then there can only be TLB hits for 1MB of data (256 * 4KB = 1MB). This is called the "TLB reach": the amount of memory the TLB can cover.
- A typical L2 cache is 6MB, so you should perhaps consider TLB reach before L2 size when tiling.
- Real CPUs also have second-level TLBs (like an L2 cache for the TLB). This is getting complicated to reason about; you will likely have to experiment to find the best tile size.

Memory Optimization: Summary
- Caches:
  - Conflict misses: not much of a concern (set-associative caches).
  - Cache capacity: keep the working set within on-chip cache capacity; fit in L1 or L2 depending on working-set size.
- Virtual memory:
  - Page misses: keep the page-level working set within main memory capacity.
  - TLB misses: may want to keep the number of working-set pages below the number of TLB entries.

IA32 Linux Memory Layout
- Stack: the runtime stack (8MB limit).
- Data: statically allocated data, e.g., arrays and strings declared in code.
- Heap: dynamically allocated storage, obtained by calling malloc(), calloc(), or (in C++) new.
- Text: executable machine instructions; read-only.
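To see this layout from a running program, here is a small sketch that prints one address from each region. It assumes a typical Linux system; the variable names are illustrative, and the exact addresses vary from run to run (e.g., with address-space randomization), but the text segment should print lowest and the stack highest.

```c
#include <stdio.h>
#include <stdlib.h>

int global_var = 1;                          /* data: statically allocated */

int main(void)
{
    int  local_var = 2;                      /* stack: runtime stack       */
    int *heap_var  = malloc(sizeof(int));    /* heap: dynamic allocation   */

    /* Casting a function pointer to void* for %p is not strictly
       portable C, but works on Linux and shows where the code lives. */
    printf("text : %p (main)\n",       (void *)main);
    printf("data : %p (global_var)\n", (void *)&global_var);
    printf("heap : %p (heap_var)\n",   (void *)heap_var);
    printf("stack: %p (local_var)\n",  (void *)&local_var);

    free(heap_var);
    return 0;
}
```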