Virtual Memory
Topics
- Virtual Memory Access
- Page Table, TLB
- Programming for locality
- Memory Mountain Revisited
Memory Hierarchy
From smaller, faster, and costlier per byte (top) to larger, slower, and cheaper per byte (bottom):
- registers
- on-chip L1 cache (SRAM)
- on-chip L2 cache (SRAM)
- main memory (DRAM)
- local secondary storage (local disks)
- remote secondary storage (tapes, distributed file systems, Web servers)
Why Caches Work
Temporal locality:
- Recently referenced items are likely to be referenced again in the near future
Spatial locality:
- Items with nearby addresses (within the same block) tend to be referenced close together in time
Both show up in even the simplest loop, as in the sketch below.
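A minimal C sketch (the array and its size are illustrative):

#include <stdio.h>

#define N 1024

int main(void)
{
    int a[N];
    int sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Spatial locality: a[0], a[1], a[2], ... sit at adjacent addresses,
     * so every byte of each fetched cache block gets used.
     * Temporal locality: sum and i are touched on every iteration,
     * so they stay cached (or in registers). */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("%d\n", sum);
    return 0;
}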
Cache (L1 and L2) Performance Metrics
Miss Rate
- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
- Time to deliver a block in the cache to the processor (includes time to determine whether the line is in the cache)
- Typical numbers:
  - 1-3 clock cycles for L1
  - 5-20 clock cycles for L2
Miss Penalty
- Additional time required because of a miss
  - typically 50-400 cycles for main memory
Let's think about those numbers
Huge difference between a hit and a miss
- Could be 100x, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
- Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
- Average access time:
  - 97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  - 99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles
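The same arithmetic as a runnable check, using the slide's simplified formula (the helper name amat is ours):

#include <stdio.h>

/* Average access time per the slide:
 * hit_rate * hit_time + miss_rate * miss_penalty */
static double amat(double hit_rate, double hit_time, double miss_penalty)
{
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty;
}

int main(void)
{
    printf("97%% hits: %.2f cycles\n", amat(0.97, 1.0, 100.0)); /* 3.97 */
    printf("99%% hits: %.2f cycles\n", amat(0.99, 1.0, 100.0)); /* 1.99 */
    return 0;
}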
Types of Cache Misses
Cold (compulsory) miss
- Occurs on first access to a block
- Spatial locality of access helps (also prefetching---more later)
Conflict miss
- Multiple data objects all map to the same slot (like in hashing)
  - e.g., block i must be placed in cache entry/slot i mod 8, replacing the block already in that slot
  - referencing blocks 0, 8, 0, 8, ... would miss every time (see the sketch after this list)
- Conflict misses are less of a problem these days
  - set-associative caches with 8 or 16 lines per set help
Capacity miss
- Occurs when the set of active cache blocks (working set) is larger than the cache
- This is where to focus nowadays
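A minimal direct-mapped sketch of the 0, 8, 0, 8, ... pattern; the slot count and the "i mod 8" mapping come from the slide, everything else is illustrative:

#include <stdio.h>

#define NSLOTS 8   /* block i goes in slot i mod 8, as above */

int main(void)
{
    int slot_tag[NSLOTS];
    for (int i = 0; i < NSLOTS; i++)
        slot_tag[i] = -1;                    /* all slots start empty */

    int refs[] = {0, 8, 0, 8, 0, 8};         /* blocks 0 and 8 share slot 0 */
    int nrefs = sizeof refs / sizeof refs[0];
    int misses = 0;

    for (int i = 0; i < nrefs; i++) {
        int b = refs[i], s = b % NSLOTS;
        if (slot_tag[s] != b) {              /* miss: evict and replace */
            slot_tag[s] = b;
            misses++;
        }
    }
    printf("%d misses out of %d references\n", misses, nrefs); /* 6 of 6 */
    return 0;
}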
What about writes?
Multiple copies of data exist:
- L1, L2, main memory, disk
What to do on a write-hit?
- Write-back (defer write to memory until replacement of line)
  - needs a dirty bit (is the line different from memory or not?)
What to do on a write-miss?
- Write-allocate (load into cache, update line in cache)
Typical
- Write-back + write-allocate (sketched below)
Rare
- Write-through (write immediately to memory, usually for I/O)
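A sketch of write-back + write-allocate for a single cache line; the names, sizes, and toy memory are all illustrative:

#include <stdio.h>
#include <string.h>

#define LINE 64

static char memory[1024];                    /* stand-in for main memory */

typedef struct {
    int      valid, dirty;                   /* dirty: line differs from memory */
    unsigned tag;                            /* which block the line holds */
    char     data[LINE];
} cache_line;

static void write_byte(cache_line *l, unsigned addr, char value)
{
    unsigned tag = addr / LINE, offset = addr % LINE;

    if (!l->valid || l->tag != tag) {        /* write-miss */
        if (l->valid && l->dirty)            /* write-back the old line first */
            memcpy(&memory[l->tag * LINE], l->data, LINE);
        memcpy(l->data, &memory[tag * LINE], LINE);  /* write-allocate */
        l->valid = 1;
        l->tag   = tag;
        l->dirty = 0;
    }
    l->data[offset] = value;                 /* write-hit: update cache only */
    l->dirty = 1;                            /* memory update is deferred */
}

int main(void)
{
    cache_line l = {0};
    write_byte(&l, 5, 'x');     /* miss: allocate block 0, then write */
    write_byte(&l, 6, 'y');     /* hit: same block, cache-only write */
    printf("memory[5] before eviction: %d\n", memory[5]);  /* still 0 */
    write_byte(&l, 100, 'z');   /* miss on block 1: dirty block 0 written back */
    printf("memory[5] after eviction:  %c\n", memory[5]);  /* now 'x' */
    return 0;
}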
Main Memory is something like a Cache (for Disk)
Driven by enormous miss penalty:
- Disk is about 10,000x slower than DRAM
DRAM design:
- Large page (block) size: typically 4KB
Virtual Memory
Programs refer to virtual memory addresses
- Conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64 bits)
- Each byte has its own address
- System provides an address space private to each process
Allocation: compiler and run-time system
- All allocation within a single virtual address space
Virtual Addressing
[Figure: the CPU issues a virtual address (VA); the MMU on the CPU chip translates it to a physical address (PA) used to index main memory, which returns the data word.]
MMU = Memory Management Unit
MMU keeps mapping of VAs -> PAs in a “page table”
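Conceptually, translation splits the VA into a page number and an offset. A minimal sketch, assuming 4KB pages and a flat array standing in for the page table (real page tables are multi-level):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Flat page table: virtual page number -> physical page number. */
static uint32_t translate(const uint32_t *page_table, uint32_t va)
{
    uint32_t vpn    = va / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = va % PAGE_SIZE;   /* unchanged by translation */
    uint32_t ppn    = page_table[vpn];  /* the page table lookup */
    return ppn * PAGE_SIZE + offset;    /* physical address */
}

int main(void)
{
    uint32_t page_table[4] = {7, 3, 0, 5};   /* made-up mappings */
    uint32_t va = 1 * PAGE_SIZE + 42;        /* page 1, offset 42 */
    printf("VA 0x%x -> PA 0x%x\n", va, translate(page_table, va));
    return 0;
}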
MMU Needs Table of Translations
[Figure: as above, but with the page table drawn inside the MMU, holding the VA -> PA translations.]
MMU keeps mapping of VAs -> PAs in a “page table”
Where is the page table kept?
[Figure: as above, but with the page table now residing in main memory rather than inside the MMU.]
In main memory – it can be cached, e.g., in L2 (like data)
Speeding up Translation with a TLB
Translation Lookaside Buffer (TLB)
- Small hardware cache for the page table in the MMU
- Caches page table entries for a number of pages (e.g., 256 entries)
TLB Hit
[Figure: TLB hit. 1) CPU sends the VA to the MMU; 2) MMU looks up the VA in the TLB; 3) TLB returns the matching PTE; 4) MMU sends the translated PA to main memory; 5) memory returns the data to the CPU. The page table is never touched.]
A TLB hit saves you from accessing memory for the page table
TLB Miss
[Figure: TLB miss. 1) CPU sends the VA to the MMU; 2) MMU looks up the VA in the TLB and misses; 3) MMU issues a PTE request to the page table in main memory; 4) the PTE comes back and is installed in the TLB; 5) MMU sends the translated PA to memory; 6) memory returns the data to the CPU.]
A TLB miss incurs an additional memory access (for the page table); the sketch below simulates both cases.
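A minimal sketch of a fully associative TLB in front of a flat page table; all sizes, names, and the replacement policy are illustrative:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define TLB_SIZE  4        /* tiny, for illustration */
#define NPAGES    16

typedef struct { uint32_t vpn, ppn; int valid; } tlb_entry;

static tlb_entry tlb[TLB_SIZE];
static uint32_t  page_table[NPAGES];    /* stand-in for the PT in memory */

static uint32_t tlb_translate(uint32_t va, int *hit)
{
    uint32_t vpn = va / PAGE_SIZE, offset = va % PAGE_SIZE;

    for (int i = 0; i < TLB_SIZE; i++)           /* step 2: TLB lookup */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *hit = 1;                            /* hit: no memory access */
            return tlb[i].ppn * PAGE_SIZE + offset;
        }

    *hit = 0;                                    /* steps 3-4: PTE request */
    uint32_t ppn = page_table[vpn];              /* the extra memory access */
    tlb[vpn % TLB_SIZE] = (tlb_entry){vpn, ppn, 1};  /* install the PTE */
    return ppn * PAGE_SIZE + offset;
}

int main(void)
{
    for (uint32_t i = 0; i < NPAGES; i++)
        page_table[i] = NPAGES - 1 - i;          /* arbitrary mappings */

    int hit;
    tlb_translate(0x1234, &hit);
    printf("first access to page 1:  %s\n", hit ? "hit" : "miss"); /* miss */
    tlb_translate(0x1240, &hit);
    printf("second access to page 1: %s\n", hit ? "hit" : "miss"); /* hit */
    return 0;
}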
How to Program for Virtual Memory
At any point in time, programs tend to access a set of active virtual pages called the working set
- Programs with better temporal locality will have smaller working sets
If (working set size) > (main memory size)
- Thrashing: performance meltdown where pages are swapped (copied) in and out continuously
If (# working set pages) > (# TLB entries)
- Will suffer TLB misses
- Not as bad as page thrashing, but still worth avoiding (the sketch below contrasts two traversal orders)
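The same data can have very different working sets depending on access order; a sketch with illustrative sizes:

#include <stdlib.h>

#define N 2048   /* 2048 x 2048 doubles = 32MB */

/* Row-major traversal: walks memory sequentially, so consecutive
 * accesses stay on the same 4KB page; the TLB working set at any
 * moment is essentially one page. */
static double sum_rowwise(double (*a)[N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: consecutive accesses are N * 8 bytes apart,
 * so nearly every access touches a different page; the page-level
 * (and TLB) working set becomes the whole array. */
static double sum_colwise(double (*a)[N])
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    double (*a)[N] = calloc(N, sizeof *a);
    if (!a) return 1;
    double s = sum_rowwise(a) + sum_colwise(a);
    free(a);
    return (int)s;   /* keep the sums from being optimized away */
}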
More on TLBs
Assume a 256-entry TLB, and each page is 4KB
- Can only have TLB hits for 1MB of data (256 * 4KB = 1MB)
- This is called the "TLB reach"---the amount of memory the TLB can cover (computed in the sketch below)
Typical L2 cache is 6MB
- Hence should consider TLB size before L2 size when tiling?
Real CPUs have second-level TLBs (like an L2 for the TLB)
- This is getting complicated to reason about!
- Likely have to experiment to find the best tile size
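A sketch of the reach arithmetic; the TLB-reach formula matches the slide, while the tile bound at the end is our own illustration:

#include <stdio.h>

int main(void)
{
    unsigned entries   = 256;                  /* TLB entries, from the slide */
    unsigned page_size = 4096;                 /* 4KB pages */
    unsigned reach     = entries * page_size;  /* TLB reach, in bytes */

    printf("TLB reach: %u KB\n", reach / 1024);        /* 1024 KB = 1MB */

    /* A tile of 8-byte doubles that should stay within TLB reach
     * can hold at most reach/8 elements (here, 131072 doubles). */
    printf("max doubles in a tile: %u\n", reach / 8);
    return 0;
}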
Memory Optimization: Summary
Caches
- Conflict misses: not much of a concern (set-associative caches)
- Cache capacity: keep the working set within on-chip cache capacity
  - fit in L1 or L2 depending on working-set size
Virtual memory
- Page misses: keep the page-level working set within main memory capacity
- TLB misses: may want to keep (# working set pages) < (# TLB entries)
IA32 Linux Memory Layout
Stack
- Runtime stack (8MB limit)
Data
- Statically allocated data, e.g., arrays & strings declared in code
Heap
- Dynamically allocated storage: when you call malloc(), calloc(), new
Text
- Executable machine instructions
- Read-only
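One way to see the layout from a running process: a minimal sketch that prints one address from each region (exact addresses vary by system, and address-space randomization shuffles them between runs):

#include <stdio.h>
#include <stdlib.h>

int global_data = 1;                 /* lives in the data segment */

int main(void)                       /* code lives in the text segment */
{
    int local = 2;                            /* lives on the stack */
    int *dynamic = malloc(sizeof *dynamic);   /* lives on the heap */

    printf("text:  %p\n", (void *)main);
    printf("data:  %p\n", (void *)&global_data);
    printf("heap:  %p\n", (void *)dynamic);
    printf("stack: %p\n", (void *)&local);

    free(dynamic);
    return 0;
}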