18-447: Computer Architecture Lecture 20: Memory Scheduling and Virtual Memory

18-447: Computer Architecture
Lecture 20: Memory Scheduling and
Virtual Memory
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2012, 4/4/2012
Reminder: Homeworks

Homework 6
  Out today
  Due April 16
  Topics: Main memory, caches, virtual memory
2
Reminder: Lab Assignments

Lab Assignment 5
  Implementing caches and branch prediction in a high-level timing simulator of a pipelined processor
  Due April 6 (this Friday!)
  Extra credit: Cache exploration and high performance with optimized caches
3
Reminder: Midterm II

Next week
  April 11
  Everything covered in the course can be on the exam
  You can bring in two cheat sheets (8.5x11")
4
Review of Last Lecture

DRAM subsystem
  Page mode
  Cell, row/column, bank, chip, rank, module/DIMM, channel
  Address mapping
  Refresh
DRAM controller
  Scheduling, row buffer management, power management
Power management basics
  Static vs. dynamic power
  Principles
Bloom filters
  Compact and approximate way of representing set membership
  Allows easy testing of set membership
5
Bloom Filters

Bloom, “Space/time trade-offs in hash coding with allowable errors,” CACM 1970.
Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,” CMU Technical Report, 2012.
6
Hardware Implementation of Bloom Filters
7
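To make the review concrete, here is a minimal software sketch of a Bloom filter (insert, test, clear); the bit-vector size, number of hash functions, and hash family below are illustrative assumptions, not the hardware design discussed on this slide.

#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024      /* m: size of the bit vector (assumption) */
#define BLOOM_HASHES 3         /* k: number of hash functions (assumption) */

static uint8_t bloom[BLOOM_BITS / 8];

/* Simple illustrative hash family; real designs use better/cheaper hashes. */
static uint32_t hash_i(uint32_t key, uint32_t i) {
    uint32_t h = key * 2654435761u + i * 40503u;
    h ^= h >> 16;
    return h % BLOOM_BITS;
}

void bloom_insert(uint32_t key) {
    for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
        uint32_t b = hash_i(key, i);
        bloom[b / 8] |= (uint8_t)(1u << (b % 8));   /* set k bits */
    }
}

/* Returns 1 = "possibly in set" (may be a false positive), 0 = definitely not. */
int bloom_test(uint32_t key) {
    for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
        uint32_t b = hash_i(key, i);
        if (!(bloom[b / 8] & (1u << (b % 8))))
            return 0;
    }
    return 1;
}

/* Clearing the whole bit vector removes all elements at once. */
void bloom_clear(void) { memset(bloom, 0, sizeof(bloom)); }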
The DRAM Subsystem: A Quick Review

DRAM Subsystem Organization

  Channel
  DIMM
  Rank
  Chip
  Bank
  Row/Column
9
Interaction with Virtual→Physical Mapping

Operating System influences where an address maps to in DRAM

  VA: Virtual page number (52 bits) | Page offset (12 bits)
  PA: Physical frame number (19 bits) | Page offset (12 bits)
  PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

The operating system can control which bank a virtual page is mapped to; it can randomize the Page ↔ <Bank,Channel> mapping
The application cannot know/determine which bank it is accessing
10
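A small C sketch of how a controller might pull the DRAM fields out of a physical address, using the bit widths on this slide. The Row | Bank | Column | Byte-in-bus ordering is an illustrative assumption; real controllers (and the OS page-to-bank mapping) may interleave the bits differently.

#include <stdint.h>
#include <stdio.h>

/* Field widths from the slide: row 14b, bank 3b, column 11b, byte-in-bus 3b
   (14 + 3 + 11 + 3 = 31 bits = 19-bit frame number + 12-bit page offset). */
typedef struct {
    uint32_t row, bank, column, byte_in_bus;
} dram_addr_t;

dram_addr_t decode_pa(uint32_t pa) {
    dram_addr_t d;
    d.byte_in_bus = pa & 0x7;               /* bits [2:0]   */
    d.column      = (pa >> 3)  & 0x7FF;     /* bits [13:3]  */
    d.bank        = (pa >> 14) & 0x7;       /* bits [16:14] */
    d.row         = (pa >> 17) & 0x3FFF;    /* bits [30:17] */
    return d;
}

int main(void) {
    dram_addr_t d = decode_pa(0x12345678);
    printf("row=%u bank=%u col=%u byte=%u\n", d.row, d.bank, d.column, d.byte_in_bus);
    return 0;
}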
DRAM Controller

Purpose and functions
  Ensure correct operation of DRAM (refresh and timing)
  Service DRAM requests while obeying timing constraints of DRAM chips
    Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays
    Translate requests to DRAM command sequences
  Buffer and schedule requests to improve performance
    Reordering and row-buffer management
  Manage power consumption and thermals in DRAM
    Turn on/off DRAM chips, manage power modes
11
DRAM Scheduling Policies (I)

FCFS (first come first served)
  Oldest request first

FR-FCFS (first ready, first come first served)
  1. Row-hit first
  2. Oldest first
  Goal: Maximize row buffer hit rate → maximize DRAM throughput

Actually, scheduling is done at the command level
  Column commands (read/write) prioritized over row commands (activate/precharge)
  Within each group, older commands prioritized over younger ones
12
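A minimal request-level sketch of FR-FCFS as described above; command-level details (activate/precharge vs. read/write priority) and per-bank queues are simplified away, and the request_t fields are assumptions for illustration.

#include <stdint.h>
#include <stddef.h>

/* One outstanding DRAM request in the controller's request buffer (simplified). */
typedef struct {
    uint64_t arrival_time;   /* for oldest-first ordering */
    uint32_t row;            /* row this request needs    */
    int      valid;
} request_t;

/* FR-FCFS among requests to one bank:
   (1) prefer a request that hits the currently open row,
   (2) break ties (and the no-row-hit case) by oldest arrival. */
int frfcfs_pick(const request_t *q, size_t n, uint32_t open_row) {
    int best = -1, best_hit = 0;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        int hit = (q[i].row == open_row);
        if (best < 0 ||
            hit > best_hit ||
            (hit == best_hit && q[i].arrival_time < q[best].arrival_time)) {
            best = (int)i;
            best_hit = hit;
        }
    }
    return best;   /* index of the request to schedule, or -1 if none */
}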
Row Buffer Management Policies

Open row
  Keep the row open after an access
  + Next access might need the same row → row hit
  -- Next access might need a different row → row conflict, wasted energy

Closed row
  Close the row after an access (if no other requests already in the request buffer need the same row)
  + Next access might need a different row → avoid a row conflict
  -- Next access might need the same row → extra activate latency

Adaptive policies
  Predict whether or not the next access to the bank will be to the same row
13
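A tiny sketch of the open-row vs. closed-row decision after servicing a request; the bank_state_t type and the open_policy flag are assumptions for illustration, and an adaptive policy would replace the flag with a per-bank predictor.

#include <stdint.h>

typedef struct { uint32_t open_row; int row_open; } bank_state_t;

/* Row-buffer management after servicing a request to this bank. */
void manage_row_buffer(bank_state_t *bank, int open_policy,
                       int another_request_needs_row) {
    if (open_policy)
        return;                 /* open-row: leave the row in the row buffer */
    if (!another_request_needs_row) {
        /* closed-row: precharge now to avoid a likely row conflict later,
           at the cost of an extra activate if the next access was a row hit */
        bank->row_open = 0;     /* model "precharge issued" */
    }
}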
Why are DRAM Controllers Difficult to Design?

Need to obey DRAM timing constraints for correctness
  There are many (50+) timing constraints in DRAM
  tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued
  tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank
  …
Need to keep track of many resources to prevent conflicts
  Channels, banks, ranks, data bus, address bus, row buffers
Need to handle DRAM refresh
Need to optimize for performance (in the presence of constraints)
  Reordering is not simple
  Predicting the future?
14
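As a flavor of what "obeying timing constraints" looks like, here is a sketch that checks the two constraints named above before issuing a command; the timing_state_t bookkeeping, the 8-bank assumption, and the cycle-count parameters are illustrative, not datasheet values.

#include <stdint.h>

typedef struct {
    uint64_t last_write_cmd;     /* cycle of last WRITE on this channel */
    uint64_t last_activate[8];   /* cycle of last ACTIVATE per bank     */
} timing_state_t;

/* tWTR: wait after a write before issuing a read */
int can_issue_read(const timing_state_t *s, uint64_t now, uint64_t tWTR) {
    return now >= s->last_write_cmd + tWTR;
}

/* tRC: minimum gap between two ACTIVATEs to the same bank */
int can_issue_activate(const timing_state_t *s, int bank,
                       uint64_t now, uint64_t tRC) {
    return now >= s->last_activate[bank] + tRC;
}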
DRAM Request Scheduling (and Interference) in Multi-Core Systems

Scheduling Policy for Single-Core Systems

FR-FCFS (first ready, first come first served)
  1. Row-hit first
  2. Oldest first
  Goal: Maximize row buffer hit rate → maximize DRAM throughput

Is this a good policy in a multi-core system?
16
Uncontrolled Interference: An Example

[Figure: a multi-core chip with two cores, one running "stream" and one running "random", each with its own L2 cache; they share the interconnect, the DRAM memory controller, and DRAM Banks 0-3 (the shared DRAM memory system). The result is unfairness.]
17
A Memory Performance Hog

STREAM (streaming access):
// initialize large arrays A, B
for (j = 0; j < N; j++) {
  index = j * linesize;
  A[index] = B[index];
  …
}
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random access):
// initialize large arrays A, B
for (j = 0; j < N; j++) {
  index = rand();
  A[index] = B[index];
  …
}
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
18
What Does the Memory Hog Do?

[Figure: the memory request buffer fills with T0 (STREAM) requests to Row 0 interleaved with T1 (RANDOM) requests to other rows (5, 111, 16, …). With row-hit-first scheduling, the open Row 0 keeps being reused. Row size: 8KB, cache block size: 64B, so 128 (8KB/64B) row-hit requests of T0 are serviced before T1.]

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
19
Effect of the Memory Performance Hog

[Bar chart: slowdowns when the programs run together; bar labels include STREAM, gcc, and Virtual PC. One program suffers a 2.82X slowdown while another suffers only a 1.18X slowdown.]

Results on Intel Pentium D running Windows XP
(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)
Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.
20
Problems due to Uncontrolled Interference

[Figure: slowdowns on a system where main memory is the only shared resource; the low-priority memory hog barely slows down while high-priority cores make very slow progress.]

Unfair slowdown of different threads
Low system performance
Vulnerability to denial of service
Priority inversion: unable to enforce priorities/SLAs
Poor performance predictability (no performance isolation)
21
Inter-Thread Interference in DRAM

Memory controllers, pins, and memory banks are shared
Pin bandwidth is not increasing as fast as the number of cores
  Bandwidth per core reducing
Different threads executing on different cores interfere with each other in the main memory system
Threads delay each other by causing resource contention:
  Bank, bus, row-buffer conflicts → reduced DRAM throughput
Threads can also destroy each other’s DRAM bank parallelism
  Otherwise parallel requests can become serialized
23
Effects of Inter-Thread Interference in DRAM

Queueing/contention delays
  Bank conflict, bus conflict, channel conflict, …
Additional delays due to DRAM constraints
  Called “protocol overhead”
  Examples
    Row conflicts
    Read-to-write and write-to-read delays
Loss of intra-thread parallelism
24
Inter-Thread Interference in DRAM

Existing DRAM controllers are unaware of inter-thread interference in the DRAM system
They simply aim to maximize DRAM throughput
  Thread-unaware and thread-unfair
  No intent to service each thread’s requests in parallel
  FR-FCFS policy: 1) row-hit first, 2) oldest first
    Unfairly prioritizes threads with high row-buffer locality
25
QoS-Aware Memory Request Scheduling

[Figure: multiple cores send requests to a memory controller in front of memory; the controller resolves memory contention by scheduling requests.]

How to schedule requests to provide
  High system performance
  High fairness to applications
  Configurability to system software

Memory controller needs to be aware of threads
26
How Do We Solve the Problem?

Stall-time fair memory scheduling [Mutlu+ MICRO’07]
  Goal: Threads sharing main memory should experience similar slowdowns compared to when they are run alone → fair scheduling
    Also improves overall system performance by ensuring cores make “proportional” progress
  Idea: Memory controller estimates each thread’s slowdown due to interference and schedules requests in a way to balance the slowdowns

Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007.
27
Stall-Time Fairness in Shared DRAM Systems

A DRAM system is fair if it equalizes the slowdown of equal-priority threads relative to when each thread is run alone on the same system
DRAM-related stall-time: The time a thread spends waiting for DRAM memory
  ST_shared: DRAM-related stall-time when the thread runs with other threads
  ST_alone: DRAM-related stall-time when the thread runs alone
Memory-slowdown = ST_shared / ST_alone
  Relative increase in stall-time

Stall-Time Fair Memory scheduler (STFM) aims to equalize Memory-slowdown for interfering threads, without sacrificing performance
  Considers inherent DRAM performance of each thread
  Aims to allow proportional progress of threads
28
STFM Scheduling Algorithm [MICRO’07]

For each thread, the DRAM controller
  Tracks ST_shared
  Estimates ST_alone
Each cycle, the DRAM controller
  Computes Slowdown = ST_shared / ST_alone for threads with legal requests
  Computes unfairness = MAX Slowdown / MIN Slowdown
If unfairness < α
  Use DRAM-throughput-oriented scheduling policy
If unfairness ≥ α
  Use fairness-oriented scheduling policy
    (1) requests from thread with MAX Slowdown first
    (2) row-hit first, (3) oldest-first
29
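A minimal sketch of the per-cycle STFM decision above. ST_shared is measured, but ST_alone must be estimated by the controller, which is the hard part and is not modeled here; the thread_stats_t fields are assumptions for illustration.

typedef struct { double st_shared, st_alone_est; int has_request; } thread_stats_t;

/* Returns 1 if the fairness-oriented policy should be used this cycle,
   and writes the index of the most-slowed-down thread to *victim.
   alpha is the unfairness threshold from the slide. */
int stfm_use_fairness_policy(const thread_stats_t *t, int nthreads,
                             double alpha, int *victim) {
    double max_s = 0.0, min_s = 1e30;
    *victim = -1;
    for (int i = 0; i < nthreads; i++) {
        if (!t[i].has_request) continue;
        double s = t[i].st_shared / t[i].st_alone_est;   /* memory slowdown */
        if (s > max_s) { max_s = s; *victim = i; }
        if (s < min_s) min_s = s;
    }
    double unfairness = max_s / min_s;
    return unfairness >= alpha;   /* else: throughput-oriented FR-FCFS */
}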
How Does STFM Prevent Unfairness?

[Figure: animation over the request buffer and row buffer for T0 (repeatedly accessing Row 0) and T1 (accessing Rows 5, 111, 16). As requests are serviced, the controller updates the T0 and T1 slowdown estimates (values around 1.00-1.14) and the unfairness ratio (values around 1.00-1.06), switching to the fairness-oriented policy when unfairness crosses the threshold α.]
30
Another Problem due to Interference

Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests
  Memory-Level Parallelism (MLP)
  Out-of-order execution, non-blocking caches, runahead execution
Effective only if the DRAM controller actually services the multiple requests in parallel in DRAM banks

Multiple threads share the DRAM controller
DRAM controllers are not aware of a thread’s MLP
  Can service each thread’s outstanding requests serially, not in parallel
31
Bank Parallelism of a Thread

[Figure: a single thread A issues 2 DRAM requests: Bank 0, Row 1 and Bank 1, Row 1. The bank access latencies of the two requests are overlapped, so the thread computes, stalls for ~ONE bank access latency, then computes again.]
32
Bank Parallelism Interference in DRAM

[Figure: with the baseline scheduler, thread A (Bank 0 Row 1, Bank 1 Row 1) and thread B (Bank 1 Row 99, Bank 0 Row 99) each issue 2 DRAM requests, but the scheduler interleaves them so the bank access latencies of each thread are serialized. Each thread stalls for ~TWO bank access latencies.]
33
Parallelism-Aware Scheduler

[Figure: same two threads and requests as on the previous slide. Baseline scheduler: each thread stalls for ~two bank access latencies. Parallelism-aware scheduler: each thread’s two requests are serviced back to back in different banks, saving cycles; average stall-time drops to ~1.5 bank access latencies.]
34
Parallelism-Aware Batch Scheduling (PAR-BS)

Principle 1: Parallelism-awareness
  Schedule requests from a thread (to different banks) back to back
  Preserves each thread’s bank parallelism
  But, this can cause starvation…

Principle 2: Request Batching
  Group a fixed number of oldest requests from each thread into a “batch”
  Service the batch before all other requests
  Form a new batch when the current one is done
  Eliminates starvation, provides fairness
  Allows parallelism-awareness within a batch

[Figure: request queues for Bank 0 and Bank 1 holding requests from threads T0-T3; the oldest requests form the current batch.]

Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.
35
Request Batching

Each memory request has a bit (marked) associated with it

Batch formation:
  Mark up to Marking-Cap oldest requests per bank for each thread
  Marked requests constitute the batch
  Form a new batch when no marked requests are left

Marked requests are prioritized over unmarked ones
  No reordering of requests across batches: no starvation, high fairness

How to prioritize requests within a batch?
36
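A brute-force sketch of batch formation as described above; the req_t layout and the Marking-Cap value are assumptions for illustration, and a real controller would keep per-bank, per-thread queues instead of scanning one flat buffer.

#include <stdint.h>

#define MARKING_CAP 5   /* illustrative value; the real cap is a design parameter */

typedef struct { int thread, bank; uint64_t arrival; int valid, marked; } req_t;

/* Mark up to MARKING_CAP oldest requests per bank for each thread.
   Called only when no marked requests remain (i.e., the batch is done). */
void form_batch(req_t *q, int n, int nthreads, int nbanks) {
    for (int t = 0; t < nthreads; t++)
        for (int b = 0; b < nbanks; b++) {
            int marked = 0;
            while (marked < MARKING_CAP) {
                int oldest = -1;
                for (int i = 0; i < n; i++)
                    if (q[i].valid && !q[i].marked &&
                        q[i].thread == t && q[i].bank == b &&
                        (oldest < 0 || q[i].arrival < q[oldest].arrival))
                        oldest = i;
                if (oldest < 0) break;
                q[oldest].marked = 1;    /* request joins the current batch */
                marked++;
            }
        }
}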
Within-Batch Scheduling

Can use any existing DRAM scheduling policy
  FR-FCFS (row-hit first, then oldest-first) exploits row-buffer locality
But, we also want to preserve intra-thread bank parallelism
  Service each thread’s requests back to back

HOW?
  Scheduler computes a ranking of threads when the batch is formed
    Higher-ranked threads are prioritized over lower-ranked ones
    Improves the likelihood that requests from a thread are serviced in parallel by different banks
    Different threads prioritized in the same order across ALL banks
37
How to Rank Threads within a Batch

Ranking scheme affects system throughput and fairness

Maximize system throughput
  Minimize average stall-time of threads within the batch
Minimize unfairness (equalize the slowdown of threads)
  Service threads with inherently low stall-time early in the batch
  Insight: delaying memory non-intensive threads results in high slowdown

Shortest stall-time first (shortest job first) ranking
  Provides optimal system throughput [Smith, 1956]*
  Controller estimates each thread’s stall-time within the batch
  Ranks threads with shorter stall-time higher

* W.E. Smith, “Various optimizers for single stage production,” Naval Research Logistics Quarterly, 1956.
38
Shortest Stall-Time First Ranking

max-bank-load: maximum number of marked requests to any bank
  Rank the thread with lower max-bank-load higher (~ low stall-time)
total-load: total number of marked requests
  Breaks ties: rank the thread with lower total-load higher

[Figure: example batch of marked requests from T0-T3 spread across Banks 0-3.]

Thread | max-bank-load | total-load
T0     | 1             | 3
T1     | 2             | 4
T2     | 2             | 6
T3     | 5             | 9

Ranking: T0 > T1 > T2 > T3
39
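A sketch of computing the two ranking metrics above and comparing two threads, reusing the req_t type from the batching sketch; the 16-bank bound is an assumption for illustration.

typedef struct { int max_bank_load, total_load; } load_t;

/* Count this thread's marked requests overall and per bank. */
load_t thread_load(const req_t *q, int n, int thread) {
    int per_bank[16] = {0};        /* assumes at most 16 banks (illustration) */
    load_t l = {0, 0};
    for (int i = 0; i < n; i++)
        if (q[i].valid && q[i].marked && q[i].thread == thread) {
            l.total_load++;
            if (++per_bank[q[i].bank] > l.max_bank_load)
                l.max_bank_load = per_bank[q[i].bank];
        }
    return l;
}

/* Thread a ranks higher than thread b if it has lower max-bank-load,
   or the same max-bank-load and lower total-load. */
int ranks_higher(load_t a, load_t b) {
    if (a.max_bank_load != b.max_bank_load)
        return a.max_bank_load < b.max_bank_load;
    return a.total_load < b.total_load;
}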
Example Within-Batch Scheduling Order

[Figure: the same batch of requests serviced across Banks 0-3 over time, under the baseline scheduling order (arrival order) and under the PAR-BS scheduling order with ranking T0 > T1 > T2 > T3.]

Stall times (in bank access latencies):

           T0  T1  T2  T3   AVG
Baseline    4   4   5   7   5
PAR-BS      1   2   4   7   3.5
40
Putting It Together: PAR-BS Scheduling Policy

PAR-BS Scheduling Policy
  (1) Marked requests first  [batching]
  (2) Row-hit requests first
  (3) Higher-rank thread first (shortest stall-time first)  [parallelism-aware within-batch scheduling]
  (4) Oldest first

Three properties:
  Exploits row-buffer locality and intra-thread bank parallelism
  Work-conserving
    Services unmarked requests to banks without marked requests
  Marking-Cap is important
    Too small a cap: destroys row-buffer locality
    Too large a cap: penalizes memory non-intensive threads

Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling,” ISCA 2008.
41
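The four rules above can be folded into one request comparator, sketched here reusing the req_t type from the batching sketch; the row-hit flags and the rank[] array are assumed to be supplied by the surrounding controller logic.

/* Returns 1 if request a should be scheduled before request b.
   rank[] maps thread id -> rank position (0 = highest rank). */
int parbs_before(const req_t *a, const req_t *b,
                 int a_row_hit, int b_row_hit, const int *rank) {
    if (a->marked != b->marked) return a->marked;       /* (1) marked first   */
    if (a_row_hit != b_row_hit) return a_row_hit;       /* (2) row-hit first  */
    if (rank[a->thread] != rank[b->thread])
        return rank[a->thread] < rank[b->thread];       /* (3) higher rank    */
    return a->arrival < b->arrival;                     /* (4) oldest first   */
}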
Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown [MICRO 2007]

[Bar chart: unfairness (lower is better, y-axis 1 to 5) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems.]
42
System Performance

[Bar chart: normalized harmonic-mean speedup (y-axis 0 to 1.4) of FR-FCFS, FCFS, NFQ, STFM, and PAR-BS on 4-core, 8-core, and 16-core systems.]
43
Another Way of Reducing Interference

Memory Channel Partitioning
  Idea: Map badly-interfering applications’ pages to different channels [Muralidhara+, MICRO’11]

[Figure: Core 0 (App A) and Core 1 (App B) issuing requests over five time units. With conventional page mapping, both applications’ requests contend for the same channels and banks; with channel partitioning, App A’s pages map to Channel 0 and App B’s pages to Channel 1, removing the interference.]

Separate data of low/high intensity and low/high row-locality applications
Especially effective in reducing interference of threads with “medium” and “heavy” memory intensity
44
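A rough sketch of the idea only (not the MICRO'11 mechanism): the OS picks the channel for a newly allocated page frame based on an application class it has profiled, keeping badly interfering classes on different channels. The class names and the channel assignments are illustrative assumptions.

enum app_class {
    LOW_INTENSITY,
    HIGH_INTENSITY_HIGH_LOCALITY,
    HIGH_INTENSITY_LOW_LOCALITY
};

/* Choose the channel in which to allocate this application's next page frame. */
int pick_channel(enum app_class c, int nchannels) {
    switch (c) {
    case LOW_INTENSITY:                return 0;              /* share one channel  */
    case HIGH_INTENSITY_HIGH_LOCALITY: return 1 % nchannels;  /* isolate streaming  */
    default:                           return nchannels - 1;  /* isolate random     */
    }
}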
Yet Another Way: Core/Request Throttling

Idea: Estimate the slowdown due to (DRAM) interference and throttle down threads that slow down others
  Ebrahimi et al., “Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems,” ASPLOS 2010.

Advantages
  + Core/request throttling is easy to implement: no need to change the scheduling algorithm
  + Can be a general way of handling shared resource contention

Disadvantages
  - Requires interference/slowdown estimations
  - Thresholds can become difficult to optimize
45
Handling Interference in Parallel Applications

Threads in a multithreaded application are inter-dependent
Some threads can be on the critical path of execution due to synchronization; some threads are not
How do we schedule requests of inter-dependent threads to maximize multithreaded application performance?

Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11]
Hardware/software cooperative limiter thread estimation:
  Thread executing the most contended critical section
  Thread that is falling behind the most in a parallel for loop
46
Memory System is the Major Shared Resource

[Figure: many cores’ threads send requests into the shared memory system, where the threads’ requests interfere.]
47
Inter-Thread/Application Interference

Problem: Threads share the memory system, but the memory system does not distinguish between threads’ requests

Existing memory systems
  Free-for-all, shared based on demand
  Control algorithms are thread-unaware and thread-unfair
  Aggressive threads can deny service to others
  Do not try to reduce or control inter-thread interference
48
How Do We Solve The Problem?

Inter-thread interference is uncontrolled in all memory resources
  Memory controller
  Interconnect
  Caches

We need to control it
  i.e., design an interference-aware (QoS-aware) memory system
49
Virtual Memory

Roadmap

Virtual Memory
  Purpose: illusion of a large memory and protection
  Simplified memory management for multiple processes
  Demand paging, page faults
  Address translation
  TLB
Integrating Caches and Virtual Memory
  Physically indexed caches
  Virtually indexed caches
  Virtually indexed, physically tagged caches
  Synonym/aliasing problem and solutions
51
Readings

Section 5.4 in P&H
Optional: Section 8.8 in Hamacher et al.
52
Ideal Memory

Zero access time (latency)
Infinite capacity
Zero cost
Infinite bandwidth (to support multiple accesses in parallel)
53
A Modern Memory Hierarchy

Register File: 32 words, sub-nsec (manual/compiler register spilling)
L1 cache: ~32 KB, ~nsec (automatic HW cache management)
L2 cache: 512 KB ~ 1 MB, many nsec
L3 cache, .....
Main memory (DRAM): GB, ~100 nsec (automatic demand paging)
Swap Disk: 100 GB, ~10 msec

The levels below the register file together form the memory abstraction seen by the program.
54
The Problem

Physical memory is of limited size (cost)
  What if you need more?
  Should the programmer be concerned about the size of code/data blocks fitting physical memory? (overlay programming, programming in some embedded systems)
  Should the programmer manage data movement from disk to physical memory?

Also, the ISA can have an address space greater than the physical memory size
  E.g., a 64-bit address space with byte addressability
  What if you do not have enough physical memory?
55
Virtual Memory

Idea: Give the programmer the illusion of a large address space
  Programmer can assume he/she has an “infinite” amount of physical memory
    So that he/she does not worry about running out of memory
  Really, it is the amount specified by the address space for a program

Hardware and software cooperatively provide the illusion even though physical memory is not infinite
  Illusion is maintained for each independent process
56
Basic Mechanism

Indirection

The address generated by each instruction in a program is a “virtual address”
  i.e., it is not the physical address used to address main memory
  Called “linear address” in x86

An “address translation” mechanism maps this address to a “physical address”
  Called “real address” in x86
  The address translation mechanism is implemented in hardware and software together
57
Virtual Pages, Physical Frames

Virtual address space divided into pages
Physical address space divided into frames

A virtual page is mapped to a physical frame
  Assuming the page is in memory
If an accessed virtual page is not in memory, but on disk
  Virtual memory system brings the page into a physical frame and adjusts the mapping → demand paging

Page table is the table that stores the mapping of virtual pages to physical frames
58
A System with Physical Memory Only

Examples: most Cray machines, early PCs, nearly all embedded systems

[Figure: the CPU’s load or store addresses (physical addresses 0 to N-1) are used directly to access memory.]
59
A System with Virtual Memory (page-based)

Examples: laptops, servers, modern PCs

[Figure: the CPU issues virtual addresses (0 to N-1); a page table maps them to physical addresses (0 to P-1) in memory, with some virtual pages residing on disk.]

Address Translation: The hardware converts virtual addresses into physical addresses via an OS-managed lookup table (page table)
60
Page Fault (“A miss in physical memory”)

What if the object is on disk rather than in memory?
  Page table entry indicates the virtual page is not in memory → page fault exception
  OS trap handler invoked to move data from disk into memory
    Current process suspends, others can resume
    OS has full control over placement

[Figure: before the fault, the page table entry points to disk; after the fault, the page is in memory and the page table entry points to the physical frame.]
61
Servicing a Page Fault

(1) Processor signals the I/O controller
  Read block of length P starting at disk address X and store starting at memory address Y
(2) Read occurs
  Direct Memory Access (DMA)
  Under control of I/O controller
(3) Controller signals completion
  Interrupts processor
  OS resumes suspended process

[Figure: the processor initiates the block read over the memory-I/O bus; the disk’s I/O controller performs a DMA transfer into memory, then signals completion with an interrupt.]
62
Page Table is Per Process

Each process has its own virtual address space
  Full address space for each program
  Simplifies memory allocation, sharing, linking and loading

[Figure: the virtual address spaces of Process 1 and Process 2 (virtual pages VP 1, VP 2, ...) are translated to a shared physical address space (DRAM); both processes map a page to PP 7 (e.g., read-only library code), while other pages map to distinct frames such as PP 2 and PP 10.]
63
Address Translation

Page size specified by the ISA
  VAX: 512 bytes
  Today: 4KB, 8KB, 2GB, … (small and large pages mixed together)

Page Table contains an entry for each virtual page
  Called Page Table Entry (PTE)
  What is in a PTE?
64
Address Translation
65
Page Table Entry
66
We did not cover the following slides in lecture.
These are for your preparation for the next lecture.
VM Address Translation

Parameters
  P = 2^p = page size (bytes)
  N = 2^n = virtual-address limit
  M = 2^m = physical-address limit

Virtual address:  bits [n-1:p] = virtual page number, bits [p-1:0] = page offset
Physical address: bits [m-1:p] = physical page number, bits [p-1:0] = page offset

Page offset bits don’t change as a result of translation
68
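A minimal sketch of the VPN/offset split for P = 2^p = 4 KB pages (p = 12, an assumed page size); the offset bits pass through translation unchanged.

#include <stdint.h>

#define PAGE_SHIFT 12u
#define PAGE_SIZE  (1u << PAGE_SHIFT)    /* P = 4096 bytes */
#define PAGE_MASK  (PAGE_SIZE - 1)

static inline uint64_t vpn_of(uint64_t va)    { return va >> PAGE_SHIFT; }
static inline uint64_t offset_of(uint64_t va) { return va & PAGE_MASK;   }

/* Recombine a translated physical page number with the unchanged offset. */
static inline uint64_t make_pa(uint64_t ppn, uint64_t offset) {
    return (ppn << PAGE_SHIFT) | offset;
}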
VM Address Translation

Separate (set of) page table(s) per process
VPN forms the index into the page table (points to a page table entry)
Page Table Entry (PTE) provides information about the page

[Figure: the page table base register points to the page table; the VPN field of the virtual address acts as the table index; each PTE holds a valid bit, access bits, and the physical page number (PPN). If valid = 0, the page is not in memory (page fault). The PPN is concatenated with the unchanged page offset to form the physical address.]
69
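A single-level page-table walk in C, reusing the vpn_of/offset_of/make_pa helpers from the sketch above; the PTE field layout is an assumption that just mirrors the valid/access/PPN fields named on this slide.

#include <stdint.h>

typedef struct { unsigned valid : 1; unsigned access : 3; uint64_t ppn; } pte_t;

typedef enum { XLATE_OK, XLATE_PAGE_FAULT } xlate_result_t;

xlate_result_t translate(const pte_t *page_table,   /* pointed to by the PTBR */
                         uint64_t va, uint64_t *pa) {
    uint64_t vpn    = vpn_of(va);           /* VPN indexes the page table */
    uint64_t offset = offset_of(va);
    pte_t pte = page_table[vpn];
    if (!pte.valid)
        return XLATE_PAGE_FAULT;            /* page not in memory: OS handles it */
    *pa = make_pa(pte.ppn, offset);         /* offset unchanged by translation   */
    return XLATE_OK;
}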
VM Address Translation: Page Hit
70
VM Address Translation: Page Fault
71
Page-Level Access Control (Protection)

Not every process is allowed to access every page
  E.g., may need supervisor-level privilege to access system pages

Idea: Store access control information on a page basis in the process’s page table
Enforce access control at the same time as translation

→ Virtual memory system serves two functions today
  Address translation (for illusion of large physical memory)
  Access control (protection)
72
Issues (I)

How large is the page table?

Where do we store it?
  In hardware?
  In physical memory? (Where is the PTBR?)
  In virtual memory? (Where is the PTBR?)

How can we store it efficiently without requiring physical memory that can store all page tables?
  Idea: multi-level page tables
  Only the first-level page table has to be in physical memory
  Remaining levels are in virtual memory (but get cached in physical memory when accessed)
73
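A sketch of how a two-level page table splits the VPN into two indices, reusing PAGE_SHIFT from the earlier sketch; the 10/10/12 bit split assumes a 32-bit virtual address with 4 KB pages and is only an example of the idea.

#include <stdint.h>

#define L1_BITS 10u   /* index into the first-level (always-resident) table */
#define L2_BITS 10u   /* index into a second-level table (may be paged out) */

static inline uint32_t l1_index(uint32_t va) {
    return va >> (PAGE_SHIFT + L2_BITS);
}
static inline uint32_t l2_index(uint32_t va) {
    return (va >> PAGE_SHIFT) & ((1u << L2_BITS) - 1);
}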
Page Table Access

How do we access the Page Table?
  Page Table Base Register (PTBR)
  Page Table Limit Register (PTLR)

If the VPN is out of bounds (exceeds the PTLR), then the process did not allocate the virtual page → access control exception
74
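A tiny sketch of the base/limit check described above, reusing the pte_t type from the page-table walk sketch; the register struct is an assumption for illustration.

typedef struct { pte_t *ptbr; uint64_t ptlr; } pt_regs_t;

/* Fetch the PTE for a VPN, or fail if the VPN exceeds the limit register. */
int pte_fetch(const pt_regs_t *r, uint64_t vpn, pte_t *out) {
    if (vpn >= r->ptlr)
        return -1;           /* access control exception */
    *out = r->ptbr[vpn];
    return 0;
}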
Issues (II)

How fast is the address translation?
  How can we make it fast?
  Idea: Use a hardware structure that caches PTEs → Translation Lookaside Buffer (TLB)

What should be done on a TLB miss?
  What TLB entry to replace?
  Who handles the TLB miss? HW vs. SW?

What should be done on a page fault?
  What virtual page to replace from physical memory?
  Who handles the page fault? HW vs. SW?
75
Issues (III)

When do we do the address translation?
  Before or after accessing the L1 cache?

In other words, is the cache virtually addressed or physically addressed?
  Virtual versus physical cache

What are the issues with a virtually addressed cache?

Synonym problem:
  Two different virtual addresses can map to the same physical address → the same physical address can be present in multiple locations in the cache → can lead to inconsistency in data
76
Physical Cache
77
Virtual Cache
78
Virtual-Physical Cache
79