Virtual Memory 2

Today's Menu:
Virtual Memory II
 
Remaining Virtual Memory Issues
 
Accelerating address translation:
  Translation Lookaside Buffer (TLB)
 
Page Table Size and Multi-level Page Tables
 
Cache and TLB interactions
  Physical Caches vs Virtual Caches
Memory Hierarchy of a Modern Computer System

By taking advantage of the principle of locality:
  Present the user with as much memory as is available in the cheapest technology.
  Provide access at the speed offered by the fastest technology.

  [Figure: the processor (control + datapath) sits above a hierarchy of registers, local cache,
   shared cache (SRAM), main memory (DRAM), and secondary storage (disk).]

  Level                      Speed (ns)             Size (bytes)
  Registers                  1s                     100s
  Local Cache                5-10                   Ks
  Shared Cache (SRAM)        10-50                  Ms
  Main Memory (DRAM)         100s                   Gs
  Secondary Storage (Disk)   10,000,000s (10s ms)   Ts

TLB Structure & Performance

TLB structure
  32 to 1024 entries (slots)
  Can be direct mapped, set associative, or fully associative
  Easier to be fully associative here, since TLBs are often pretty small
TLB miss cost
  20 to 500 cycles (much longer if the page is on disk)
  Miss handling can be hardware- or software-based
TLB miss rates
  4% to 8% for typical Unix workloads
  Can be much higher for large applications
  Operating systems can significantly influence the TLB miss rate
Page size
  4K or 8K Bytes
  Often support multiple page sizes
    up to 2GB(!), 2MB is becoming more widely used ("large pages")
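
To make the miss-cost and miss-rate ranges above concrete, here is a minimal sketch in C; the specific values (5% miss rate, 100-cycle miss penalty, 1-cycle hit) are illustrative picks from the ranges on this slide, not measurements.

  #include <stdio.h>

  /* Example values chosen from the ranges quoted above (not measured data). */
  #define TLB_HIT_CYCLES      1.0   /* cost of a TLB lookup that hits            */
  #define TLB_MISS_RATE       0.05  /* 5%, inside the 4%-8% range on the slide   */
  #define TLB_MISS_PENALTY  100.0   /* cycles, inside the 20-500 cycle range     */

  int main(void) {
      /* Average translation cost per memory access: every access pays the hit
         cost, and misses additionally pay the refill penalty. */
      double avg = TLB_HIT_CYCLES + TLB_MISS_RATE * TLB_MISS_PENALTY;
      printf("average translation cost: %.1f cycles per access\n", avg);
      return 0;
  }

With these example numbers the average cost is 6 cycles per access, which is why a low miss rate matters so much.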
Next Issue: Page Table Size

4K page, 32 bit virtual address space
  20 bits for the page number --> 2^20 virtual pages
  2^20 == 1,048,576 PTEs
Each PTE is at least 4 Bytes
  1 Million PTEs * 4 Bytes/PTE = 4 Megabytes
  100 processes = 400 MB of memory for page tables!!!!
If the page table is not in memory, you can't find it
  The page table cannot be swapped out!

Hey: you said each process has its own page table! What's the deal?
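
A quick sketch of the arithmetic on this slide: with 4K pages and a 32-bit virtual address, a flat table needs 2^20 four-byte PTEs, i.e. 4 MB per process and 400 MB for 100 processes.

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      const uint64_t va_bits   = 32;     /* 32-bit virtual address space        */
      const uint64_t page_size = 4096;   /* 4K pages                            */
      const uint64_t pte_size  = 4;      /* each PTE is at least 4 bytes        */
      const uint64_t processes = 100;

      uint64_t offset_bits = 0;          /* log2(page_size) = 12                */
      for (uint64_t s = page_size; s > 1; s >>= 1)
          offset_bits++;

      uint64_t num_ptes    = 1ULL << (va_bits - offset_bits);  /* 2^20 PTEs     */
      uint64_t table_bytes = num_ptes * pte_size;              /* 4 MB          */

      printf("PTEs per process   : %llu\n", (unsigned long long)num_ptes);
      printf("page table per proc: %llu MB\n", (unsigned long long)(table_bytes >> 20));
      printf("for %llu processes : %llu MB\n", (unsigned long long)processes,
             (unsigned long long)((table_bytes * processes) >> 20));
      return 0;
  }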
Solution: Multi-level Page Tables

Previously determined: a flat Linux page table would be 4 MB per process
Example from Linux: have two levels of page tables
  Each of these tables is actually one page (4K) in size!
  Only the level 1 table needs to stay in memory at all times.

  [Figure: the level 1 table holds PTE 0 through PTE 1023. Its PTE 0 points to a level 2 page table
   whose PTEs 0-1023 map VP 0 through VP 1023; its PTE 1 points to a level 2 table mapping VP 1024
   through VP 2047; PTE 2 and PTE 3 are null, so those level 2 tables do not have to exist; and so
   on up to PTE 1023.]
Multi-level Page Tables

  Virtual Address = | VPN 1 | VPN 2 | Offset |
  VPN 1 indexes the Level 1 Table (the Page Directory); its entry holds the PPN of a Level 2 Table
  (a Page Table).
  VPN 2 indexes that Level 2 Table; its entry holds the PPN of the data page.
  Physical Address = | PPN | Offset |

Multi-level Advantages

Page table size scales with memory usage
  Minimum: 4KB + 4KB of tables
  Maximum: 4KB + 4MB of tables
Only level 1 MUST be in real memory
  Level 2 tables can be swapped just like any other page
Multi-level and TLB integration
  The TLB only stores Level 2 PTEs
  Don't have to do two TLB accesses to do a VA -> PA translation
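
A minimal sketch of the two-level walk just described, assuming a 32-bit VA with a 10 + 10 + 12 bit split and an invented PTE layout (a present bit plus a 20-bit PPN); it illustrates the lookup path, not any particular machine's format.

  #include <stdint.h>
  #include <stdbool.h>

  #define PAGE_SHIFT  12u        /* 4 KB pages                                  */
  #define PT_ENTRIES  1024u      /* 10 index bits per level (32-bit VA)         */

  /* Illustrative PTE layout: a present bit plus a 20-bit physical page number. */
  typedef struct { unsigned present : 1, ppn : 20; } pte_t;

  typedef struct { pte_t entry[PT_ENTRIES]; } page_table_t;   /* one 4 KB page  */

  /* Walk the two levels: VPN 1 indexes the page directory (level 1), VPN 2
   * indexes the level-2 page table, and the offset passes through unchanged.
   * For simplicity the level-2 tables are passed in directly; real hardware
   * locates them through the PPN stored in the directory entry.
   * Returns false when a table or page is not present (page fault).           */
  bool translate(const page_table_t *page_directory,
                 const page_table_t *const level2[PT_ENTRIES],
                 uint32_t va, uint32_t *pa)
  {
      uint32_t vpn1   = (va >> 22) & 0x3FFu;          /* top 10 bits            */
      uint32_t vpn2   = (va >> PAGE_SHIFT) & 0x3FFu;  /* next 10 bits           */
      uint32_t offset =  va & 0xFFFu;                 /* low 12 bits            */

      if (!page_directory->entry[vpn1].present)
          return false;                               /* level-2 table missing  */

      pte_t pte = level2[vpn1]->entry[vpn2];
      if (!pte.present)
          return false;                               /* page not resident      */

      *pa = ((uint32_t)pte.ppn << PAGE_SHIFT) | offset;
      return true;
  }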
Alternative Implementation: Reverse Page Tables

Need a table of physical page indices
  Identify physical pages for eviction/replacement
  Already implemented as part of the OS's data structures
Hash the VPN into a PPN index in the reverse table
  Effectiveness depends on how well the hash works
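
A minimal sketch of the hashed lookup idea: one entry per physical frame, found by hashing the (PID, VPN) pair. The hash function, table sizes, and chaining scheme are invented for illustration; real inverted page tables differ in detail.

  #include <stdint.h>

  #define NUM_FRAMES 4096u            /* one entry per physical page frame       */

  /* One entry per physical frame: which (pid, vpn) currently occupies it, plus
   * a chain link for entries whose (pid, vpn) hash to the same bucket.          */
  typedef struct {
      uint32_t pid, vpn;
      int32_t  next;                  /* next frame in this hash chain, -1 = end */
  } inv_pte_t;

  static inv_pte_t frames[NUM_FRAMES];
  static int32_t   bucket[NUM_FRAMES];  /* hash value -> first frame, -1 = empty
                                           (must be initialized to -1 at boot)   */

  /* Invented hash function: the scheme's effectiveness depends on how well
   * this spreads (pid, vpn) pairs across the buckets.                           */
  static uint32_t hash(uint32_t pid, uint32_t vpn) {
      return (vpn * 2654435761u ^ pid * 40503u) % NUM_FRAMES;
  }

  /* Return the physical frame number holding (pid, vpn), or -1 if unmapped.     */
  int32_t reverse_lookup(uint32_t pid, uint32_t vpn)
  {
      for (int32_t f = bucket[hash(pid, vpn)]; f != -1; f = frames[f].next)
          if (frames[f].pid == pid && frames[f].vpn == vpn)
              return f;               /* the frame index itself is the PPN        */
      return -1;                      /* miss: fall back to the OS fault path     */
  }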
Next Issue: Cache & TLBs, How They Interact

We do memory hierarchies to hide unpleasant facts
  We don't have as much fast memory as we would ideally like. Solution?
    The cache hierarchy gives the illusion of speed--most of the time. Occasionally it's slow.
  We don't have as much memory as we would like. Solution?
    The VM hierarchy gives the illusion of size--most of the time. Occasionally it's slow.
  Roughly put: we do cache for speed; we do VM for size.

So, we have a regular cache for fast access, and we have a TLB for fast VA->PA translation.
How do they interact? They must interact somehow...
  Do we wait for the VA->PA translation before looking in the cache?
  Is the cache full of virtual or physical addresses?
Address Translation/Cache Lookup
Simplest Scheme is Sequential: TLB then Cache

  CPU --(virtual address)--> TLB --(physical address)--> Cache --> Memory
  (a physical cache: physical tags)

Simple, but slowest:
  1. CPU sends out a virtual address
  2. TLB translates it to a physical address, or page faults and we wait for the page to load
  3. On a TLB hit, the translated physical address is sent to the cache
  4. Cache lookup gives data access fast, or...
  5. ...a cache miss goes to main memory

Slow because you have to wait for the translation before you can check the cache.
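
The same flow in code, as a minimal sketch: the helper functions (tlb_lookup, handle_tlb_miss, cache_lookup, read_from_memory) are toy stand-ins invented here, but the ordering shows why the scheme is slow--the cache probe cannot start until the TLB has produced the physical address.

  #include <stdint.h>
  #include <stdbool.h>

  /* Toy stand-ins for the hardware/OS actions in steps 1-5 above; the control
   * flow, not the stub bodies, is the point of this sketch.                      */
  static bool tlb_lookup(uint32_t va, uint32_t *pa)   { *pa = va; return (va & 1u) == 0; }
  static uint32_t handle_tlb_miss(uint32_t va)        { return va; }   /* walk/fault */
  static bool cache_lookup(uint32_t pa, uint32_t *d)  { *d = pa;  return pa < 1024u; }
  static uint32_t read_from_memory(uint32_t pa)       { return pa; }   /* slow path  */

  /* Sequential scheme: finish the translation, then probe the physical cache.    */
  uint32_t load(uint32_t va)
  {
      uint32_t pa, data;

      if (!tlb_lookup(va, &pa))        /* 1-2: CPU presents the VA, TLB translates */
          pa = handle_tlb_miss(va);    /*      (page-table walk or page fault)     */

      if (cache_lookup(pa, &data))     /* 3-4: the physical address probes the cache */
          return data;                 /*      fast path: cache hit                */

      return read_from_memory(pa);     /* 5: a cache miss goes to main memory      */
  }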
The same sequential scheme, viewed as address fields (the same address interpreted two different ways):

  [Figure: the Virtual Address = | VPN | PO | is translated by the TLB into a Physical Address =
   | PPN | PO |; the cache then interprets that physical address a second way, as | TAG | IDX | BO |,
   indexes with IDX, compares the stored tag (=?), and returns Data and Hit/Miss.]
TLBs and Caches: Basic Flow for Access
Real Example: DECstation 3100

  [Figure: the TLB is 64-entry and fully associative; the cache is 64 KB and direct-mapped.
   Virtual address: bits 31-12 are the virtual page number, bits 11-0 the page offset.
   TLB access: on a miss, take a TLB miss exception; on a hit, the 20-bit physical page number plus
   the 12-bit page offset form the physical address, which the cache sees as a 16-bit physical
   address tag, a 14-bit cache index, and a 2-bit byte offset.
   Read: try to read the data from the cache; on a cache hit, deliver the data to the CPU;
   on a miss, stall (cache miss stall).
   Write: if the write access bit is not on, raise a write protection exception; otherwise write
   the data into the cache, update the tag, and put the data and the address into the write buffer.]
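
As a quick check on the bit widths recoverable from the figure, a small sketch that splits an address the way this machine does: 20-bit VPN and 12-bit page offset on the virtual side, then 16-bit tag, 14-bit index, and 2-bit byte offset on the physical side. The example address and translation are made up.

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint32_t va = 0x00403A7Cu;              /* arbitrary example address        */

      /* Virtual side: 20-bit virtual page number, 12-bit page offset.            */
      uint32_t vpn         = va >> 12;
      uint32_t page_offset = va & 0xFFFu;

      /* Pretend the 64-entry TLB produced this physical page number.             */
      uint32_t ppn = 0x12345u;                /* hypothetical translation          */
      uint32_t pa  = (ppn << 12) | page_offset;

      /* Physical side, as the 64 KB direct-mapped cache with 4-byte blocks sees
         it: 16-bit tag, 14-bit cache index, 2-bit byte offset.                    */
      uint32_t byte_offset = pa & 0x3u;
      uint32_t cache_index = (pa >> 2) & 0x3FFFu;
      uint32_t tag         = pa >> 16;

      printf("VPN=0x%05X  page offset=0x%03X  ->  PA=0x%08X\n",
             (unsigned)vpn, (unsigned)page_offset, (unsigned)pa);
      printf("cache tag=0x%04X  index=0x%04X  byte offset=%u\n",
             (unsigned)tag, (unsigned)cache_index, (unsigned)byte_offset);
      return 0;
  }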
Protection and the TLB

A process presents a virtual address to the TLB for translation
  Ex: either process could present virtual address 0x2000
There are 2 separate VM address spaces here, one per process
There is only 1 TLB, NOT one per process
How does the TLB know which process's VA->PA mapping is held in the TLB?

  [Figure: a fully associative TLB holding TAG 0x00002 -> PPN 0x105 and TAG 0x00004 -> PPN 0x094.
   Virtual address 0x00002 000 matches the first entry, so the physical page frame is 0x105 and
   the physical address is 0x105 000.]

Protection and the TLB (cont.)

Many machines append a PID (process ID) to each TLB entry
  The OS maintains a Process ID Register (updated during the switch between processes,
  called a context switch)
  A lookup hits only if both the tag and the PID match the current process

  [Figure: the TLB entries become PID | TAG | PPN, e.g. PID 1, TAG 0x00002 -> PPN 0x105 and
   PID 2, TAG 0x00004 -> PPN 0x094; virtual address 0x00002 000 from process 1 again yields
   physical address 0x105 000.]

Why is this better/worse?
Another solution is to flush the TLB on a context switch; flush means empty it
  Why not 1 TLB per process?
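
A minimal sketch of a fully associative, PID-tagged TLB like the one in the figure; the entry layout and sizes are illustrative, and the two preloaded entries mirror the slide's example. As a hypothetical usage, tlb_translate(1, 0x00002000, &pa) yields pa == 0x00105000, while the same VA presented by process 3 misses.

  #include <stdint.h>
  #include <stdbool.h>

  #define TLB_ENTRIES 64

  typedef struct {
      bool     valid;
      uint32_t pid;        /* process ID appended to the entry                    */
      uint32_t vpn;        /* virtual page number: the TLB tag                    */
      uint32_t ppn;        /* physical page number                                */
  } tlb_entry_t;

  /* Two entries mirroring the slide: PID 1 maps VPN 0x00002 -> PPN 0x105,
   * PID 2 maps VPN 0x00004 -> PPN 0x094.                                          */
  static tlb_entry_t tlb[TLB_ENTRIES] = {
      { true, 1, 0x00002, 0x105 },
      { true, 2, 0x00004, 0x094 },
  };

  /* Fully associative lookup: every entry is checked, and a hit requires the
   * stored VPN *and* the stored PID (vs. the Process ID Register) to match.       */
  bool tlb_translate(uint32_t cur_pid, uint32_t va, uint32_t *pa)
  {
      uint32_t vpn = va >> 12, offset = va & 0xFFFu;

      for (int i = 0; i < TLB_ENTRIES; i++) {
          if (tlb[i].valid && tlb[i].pid == cur_pid && tlb[i].vpn == vpn) {
              *pa = (tlb[i].ppn << 12) | offset;   /* e.g. 0x105 . 000            */
              return true;
          }
      }
      return false;        /* TLB miss: refill from the page table                 */
  }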
Speed & Timing Impacts

If we do these accesses sequentially, big impact on speed
  You have to do a lookup in the TLB...
  ...then you have to do a lookup in the cache
  That involves a lot of memory access time
One solution: pipelining
  Spread the accesses across stages of the pipeline

Ex: MIPS R3000 Instruction Pipeline

  [Figure: pipeline stages Inst Fetch | Decode / Reg. Read | ALU / E.A. | Memory | Write Reg.
   Resources used per stage: TLB and I-Cache for instruction fetch, RF for register read, ALU for
   the effective-address (E.A.) calculation, then the TLB again and the D-Cache for the memory
   operation, and WB for the register write-back.]

Resource Usage

The TLB and cache are just like any other resource in the pipeline
  You gotta be careful to know how long they take (impacts pipe cycle time)
  You gotta know who is trying to use the resource in what pipe stage
  You can have hazards, need stalls, need forwarding paths, etc.
Speeding it Up

TLB and then Cache... Why? What else could we do?
Two options:
  1. Overlapped cache & TLB access (in parallel)
  2. Put virtual addresses in the cache
Why does our cache use physical addresses?
  Could it store virtual addresses?
  What are the problems/considerations?

Overlapped Cache & TLB Access

  [Figure: the 32-bit virtual address splits into a 20-bit page number and a 12-bit page offset.
   While the TLB translates the page number into a PA, the page offset (a 10-bit index plus a
   2-bit "00" byte offset) indexes a 1K-entry cache with 4-byte lines (associative lookup).
   The cache tag that is read out is compared (=) against the PA from the TLB, giving Hit/Miss.]

  IF (cache hit) AND (cache tag = PA) THEN deliver data to CPU
  ELSE IF [cache miss OR (cache tag != PA)] AND (TLB hit) THEN
      access memory with the PA from the TLB
  ELSE do standard VA translation

What are the limitations?
How Overlapping Reduces Translation Time

Basic plumbing:
  High-order bits of the VA are used to look in the TLB...
    Remember: the high-order bits are what really change from virtual to physical address
  ...while low-order bits are used as the index into the cache
    Remember: the low-order bits are the page offset--which byte address on the page
    The lowest bits are the cache line offset--which byte on the cache line
    The intermediate bits are the cache index: which line in the cache we should check to see if
    the address we want is actually loaded into the cache
    The highest-order bits are the cache tag: when we look at a line of the cache, we compare these
    bits with the stored cache tag, to see if what's in the cache is really the address we want,
    or just another line of memory that happens to map to this same place in the cache

This works only if the index and block offset fit inside the page offset:
  IDX + BO <= PO
  Must satisfy: Cache size / Associativity <= Page size

  [Figure: Virtual Address = | VPN | PO |, with PO = | IDX | BO |. The TLB translates the VPN into
   a PPN while IDX selects a cache set; the PPN is then compared (=?) against the stored TAG.
   The action here is on the cache tag's high-order bits.]

Simple for small caches. What is the max size allowed for parallel address translation to work?
  Assume 4K pages & 2-way set-associative caches
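
Answering that question under its stated assumptions (4K pages, 2-way set-associative): since the index and block offset must come entirely from the page offset, the cache can be at most page size times associativity. A short calculation:

  #include <stdio.h>

  int main(void) {
      const unsigned page_size = 4096;   /* 4K pages -> 12 page-offset bits       */
      const unsigned assoc     = 2;      /* 2-way set-associative                 */

      /* Index + block-offset bits must fit inside the page offset, so each way
         can cover at most one page of data:
         cache size / associativity <= page size.                                 */
      unsigned max_cache = page_size * assoc;

      printf("max cache size for fully overlapped TLB access: %u KB\n",
             max_cache / 1024);          /* 8 KB for this example                 */
      return 0;
  }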
Cache vs. TLB Access

  [Figure: Virtual Address = | VPN | PO |, with PO = | IDX | BO |. The IDX bits index the cache
   while the VPN goes to the TLB for the PPN; the tag comparison (=?) produces Hit/Miss and Data.]

Remember:
  The cache index is used to look up data in the cache
  The cache tag is used to verify what data is in the cache
Should we use the VA instead?
Should we only use VA bits to index?
What happens for large caches?

Two Cache Architectures

1. Virtually-indexed, Virtually-tagged Caches
   Also known as Virtually-Addressed or Virtual Address Caches
   The VPN bits are used for both the tag and the index bits
2. Virtually-indexed, Physically-tagged Caches
   The VPN bits only contribute to the index
   The tag is physical and requires a TLB lookup, but it can be done in parallel
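
A minimal sketch of the second architecture: the set index comes from virtual-address bits that lie within the page offset, the TLB runs alongside the cache read, and the physical page number is compared against the stored physical tag. All sizes and the identity "TLB" stub are illustrative.

  #include <stdint.h>
  #include <stdbool.h>

  #define LINE_BYTES  32u
  #define NUM_SETS   128u          /* 128 sets * 32 B = 4 KB per way = one page   */

  typedef struct {
      bool     valid;
      uint32_t ptag;               /* PHYSICAL tag: the PPN of the line's page    */
      uint8_t  data[LINE_BYTES];
  } line_t;

  static line_t cache[NUM_SETS];   /* direct-mapped, virtually indexed            */

  /* Toy stand-in for the TLB running in parallel with the cache read.            */
  static bool tlb_translate(uint32_t vpn, uint32_t *ppn) { *ppn = vpn; return true; }

  /* Virtually-indexed, physically-tagged lookup: the index comes from VA bits
   * inside the page offset, and the tag comparison uses the PPN from the TLB.     */
  bool vipt_load(uint32_t va, uint8_t *out)
  {
      uint32_t index = (va / LINE_BYTES) % NUM_SETS;   /* from the VA (page offset)  */
      uint32_t vpn   = va >> 12;
      uint32_t ppn;

      line_t *line = &cache[index];                    /* cache read starts here...  */
      if (!tlb_translate(vpn, &ppn))                   /* ...while the TLB translates */
          return false;                                /* TLB miss handled elsewhere  */

      if (line->valid && line->ptag == ppn) {          /* physical tag comparison     */
          *out = line->data[va % LINE_BYTES];
          return true;                                 /* hit                         */
      }
      return false;                                    /* miss: go to the next level  */
  }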
Virtual Address Cache

Lookup using the VA
TLB access only on a miss
Use the PA to access the next level (L2)

  [Figure: Virtual Address = | TAG | IDX | BO |; the cache is indexed and tagged with
   virtual-address bits. It holds a block from VPN 0x02, so a load/store to VA (page 0x02) hits
   and returns Data without touching the TLB; only on a miss does the TLB produce | PPN | PO |
   for the next level.]
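
A minimal sketch of the virtual-address cache lookup described above: both the index and the tag are taken from the virtual address, so a hit never touches the TLB. The cache geometry here is an arbitrary illustration.

  #include <stdint.h>
  #include <stdbool.h>

  #define LINE_BYTES  32u
  #define NUM_SETS   512u

  typedef struct {
      bool     valid;
      uint32_t vtag;               /* VIRTUAL tag: upper bits of the VA           */
      uint8_t  data[LINE_BYTES];
  } vline_t;

  static vline_t vcache[NUM_SETS]; /* direct-mapped, virtually addressed          */

  /* Hit: data is returned using only the VA--no TLB involved. Miss: the caller
   * must translate (TLB / page table) and fetch from the next level (L2).        */
  bool va_cache_load(uint32_t va, uint8_t *out)
  {
      uint32_t index = (va / LINE_BYTES) % NUM_SETS;
      uint32_t vtag  =  va / (LINE_BYTES * NUM_SETS);

      vline_t *line = &vcache[index];
      if (line->valid && line->vtag == vtag) {
          *out = line->data[va % LINE_BYTES];
          return true;             /* no TLB access needed on a hit               */
      }
      return false;                /* TLB access only on a miss                   */
  }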
 
Multiple Virtual Address Spaces (Multiprogramming)

  [Figure: P1 and P2 each have their own VM space, pages 0x00-0x04 and so on up to the top at
   0xfffff000, both mapped onto the same physical memory (pages 0x00-0x0B).]

Now a load/store to VA (page 0x02) arrives: is it a VA from Process 1 or Process 2?
The virtual cache holds a block from VPN 0x02--is it a hit? A miss?
Multiple Address Space Solution

1. Keep a process id with the cache block tags
   Upon cache lookup, check both the address tag and the process id
2. Flush the cache on a context switch
   Expensive (you lose the contents of the cache on every switch)
3. Use single-address-space OS's
   Not practical today (requires too much compiler analysis)

Or, Only Use Virtual Bits to Index the Cache

Physically-tagged but Virtually-indexed Cache
  Don't need to wait for the TLB: parallel TLB access (e.g., for larger caches)
  Can distinguish addresses from different processes
  But, what if multiple processes share memory?

  [Figure: Virtual Address = | VPN | PO |, with | IDX | BO | taken from the low-order bits to index
   the cache, while the TLB supplies the PPN used as the physical TAG.]
Virtual Address Synonyms

  [Figure: P1 and P2 map a shared physical page ("Shared Data" in physical memory, pages 0x00-0x0B).
   P1 makes a reference to the data on its page 0x04; in the virtually-indexed cache this reference
   maps to set X, while the same data, brought in under P2's virtual address, sits in set Y.]

The P1 block is a miss
But the P2 block is in the cache--we just can't look it up!
Virtual Address Synonyms (Cont.)

  [Figure: P1 and P2 each in their own VM space, pages 0x00-0x04 and up to 0xfffff000 at the top;
   P1's page 0x04 and P2's page 0x00 map to the same shared physical page.]

Virtual addresses on page 0x04 of P1 are synonyms of those on page 0x00 of P2
  Synonyms are also referred to as aliases
  The page is shared among the processes
Must avoid allowing multiple synonyms to co-exist in the cache
  Example shared pages: kernel data structures
  Only memory that is read/written must be resolved
  Read-only memory (e.g., instructions) can exist in multiple locations
Synonym Solutions

Avoid: Limit the cache size to page size times associativity
  Get the index from the page offset
Avoid: Eliminate synonyms by OS convention
  Single virtual space
  Restrictive sharing model
Detect: Search all candidate sets in parallel
  64K 4-way cache, 4K pages: search 4 sets (16 entries)
Reduce search space: Restrict page placement in the OS
  Make sure index(VA) = index(PA)
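
A short sketch of the numbers in the "Detect" and "Reduce search space" bullets above: for a 64K, 4-way cache with 4K pages, a synonym can only live in one of 4 candidate sets (16 entries), and page coloring makes the VA and PA agree in the index bits above the page offset. The example addresses are hypothetical.

  #include <stdio.h>

  int main(void) {
      const unsigned cache_size = 64 * 1024;  /* 64K cache                        */
      const unsigned assoc      = 4;          /* 4-way                            */
      const unsigned page_size  = 4 * 1024;   /* 4K pages                         */

      /* "Detect": a synonym can only differ in the index bits above the page
         offset, so the number of candidate sets to search is
         (cache size / associativity) / page size.                                */
      unsigned way_bytes = cache_size / assoc;           /* 16 KB indexed per way */
      unsigned colors    = way_bytes / page_size;        /* 4 candidate sets      */
      printf("search %u sets (%u entries) for a possible synonym\n",
             colors, colors * assoc);                    /* 4 sets, 16 entries    */

      /* "Reduce search space" (page coloring): the OS places pages so that the
         index bits above the page offset agree, i.e. index(VA) == index(PA).     */
      unsigned idx_mask = (way_bytes - 1) & ~(page_size - 1);
      unsigned va = 0x00012340, pa = 0x0FF12340;         /* hypothetical pair     */
      printf("index(VA) == index(PA)? %s\n",
             ((va & idx_mask) == (pa & idx_mask)) ? "yes" : "no");
      return 0;
  }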
Summary

Memory access is hard and complicated!
  The speed of the CPU core demands very fast memory access. We do the cache hierarchy to solve this one.
    Gives the illusion of speed--most of the time. Occasionally slow.
  The size of programs demands large RAM. We do the VM hierarchy to solve this one.
    Gives the illusion of size--most of the time. Occasionally slow.
VM hierarchy
  Another form of cache, but now between RAM and disk.
  Atomic units of memory are pages, typically 4kB to 2MB.
  The page table serves as the translation mechanism from virtual to physical address
    The page table lives in physical memory, managed by the OS
    For 64b addresses, multi-level tables are used; some of the table is in VM
The TLB is yet another cache--it caches translated addresses (page table entries).
  Saves having to go to physical memory to do a lookup on each access
  Usually very small, managed by the OS
VM, TLB, and cache have interesting interactions.
  Big impacts on speed and pipelining. Big impacts on exactly where the virtual-to-physical
  mapping takes place.