Resource Management
Policy and Mechanism
Jeff Chase
Duke University
The kernel
[Figure: kernel structure. A system call layer (files, processes, IPC, thread syscalls) and a fault entry (VM page faults, signals, etc.) sit atop thread/CPU/core management (sleep and ready queues) and memory management (block/page cache). I/O completions move threads from the sleep queue to the ready queue; timer ticks arrive via interrupt/return. Policy callouts mark the ready queue (CPU scheduling) and the block/page cache (memory management) as the components where policy decisions are made.]
Separation of policy and mechanism
• Every OS platform has mechanisms that enable it to
mediate access to machine resources.
– Gain control of core by timer interrupts
– Fault on access to non-resident virtual memory
– I/O through system call traps
– Internal code and data structures to track resource usage and
allocate resources
• The mechanisms enable resource management policy.
• But the mechanisms do not, and should not, determine
the policy.
• We might want to change the policy!
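To make the separation concrete, here is a minimal C sketch (all names hypothetical): the dispatch mechanism below is fixed, but the policy that chooses the next thread is a pluggable function, so we can change the policy without touching the mechanism.

#include <stdio.h>

typedef struct thread {
    int id;
    int priority;               /* larger = more important */
    struct thread *next;        /* ready-list link */
} thread_t;

/* Policy: choose one thread from the ready list. */
typedef thread_t *(*sched_policy_t)(thread_t *ready);

/* One possible policy: first arrival wins (FCFS). */
static thread_t *pick_fcfs(thread_t *ready) { return ready; }

/* Another: highest priority wins. */
static thread_t *pick_priority(thread_t *ready) {
    thread_t *best = ready;
    for (thread_t *t = ready; t != NULL; t = t->next)
        if (t->priority > best->priority)
            best = t;
    return best;
}

/* Mechanism: dispatch whatever the current policy chooses. */
static sched_policy_t current_policy = pick_fcfs;

static void dispatch(thread_t *ready) {
    thread_t *next = current_policy(ready);
    printf("dispatching thread %d\n", next->id);  /* stand-in for a context switch */
}

int main(void) {
    thread_t b = { 2, 7, NULL };
    thread_t a = { 1, 3, &b };
    dispatch(&a);                     /* FCFS picks thread 1 */
    current_policy = pick_priority;   /* change the policy... */
    dispatch(&a);                     /* ...same mechanism picks thread 2 */
    return 0;
}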
Goals of policy
• Share resources fairly.
• Use machine resources efficiently.
• Be responsive to user interaction.
But what do these things mean?
How do we know if a policy is good or not?
What are the metrics?
What do we assume about the workload?
Example: Processor Allocation
The key issue is: how should an OS allocate its CPU
resources among contending demands?
– We are concerned with resource allocation policy: how the OS
uses underlying mechanisms to meet design goals.
– Focus on the OS kernel: user code can decide how to use the
processor time it is given.
– Which thread to run on a free core?
– For how long? When to take the core back and give it to some
other thread? (timeslice or quantum)
– What are the policy goals?
CPU dispatch and ready queues
In a typical OS, each thread has a priority, which may change over
time. When a core is idle, pick a thread with the highest priority. If a
higher-priority thread becomes ready, then preempt the thread
currently running on the core and switch to it. If the quantum expires
(timer), then preempt, select a new thread, and switch.
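A sketch of the preemption rule just described, assuming a larger number means higher priority (names hypothetical; a real kernel applies this per core, under a scheduler lock):

#include <stddef.h>

struct thread { int priority; };    /* larger value = higher priority */

/* Called when a thread becomes ready: preempt iff the newcomer beats
   the thread currently running on the core (or the core is idle). */
static int should_preempt(const struct thread *running,
                          const struct thread *newly_ready)
{
    return running == NULL || newly_ready->priority > running->priority;
}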
Priority
Most modern OS schedulers use priority scheduling.
– Each thread in the ready pool has a priority value.
– The scheduler favors higher-priority threads.
– Threads inherit a base priority from the associated
application/process.
– User-settable relative importance within application
– Internal priority adjustments as an implementation
technique within the scheduler.
– How to set the priority of a thread?
How many priority levels? 32 (Windows) to 128 (OS X)
Scheduler Policy Goals
• Response time or latency, responsiveness
How long does it take to do what I asked? (R)
• Throughput
How many operations complete per unit of time? (X)
Utilization: what percentage of time does the CPU (and each
device) spend doing useful work? (U)
• Fairness
What does this mean? Divide the pie evenly? Guarantee low
variance in response times? Freedom from starvation?
Serve the clients who pay the most?
• Meet deadlines and reduce jitter for periodic tasks
A Simple Policy: FCFS
The most basic scheduling policy is first-come-first-served,
also called first-in-first-out (FIFO).
– FCFS is just like the checkout line at the QuickiMart.
Maintain a queue ordered by time of arrival.
GetNextToRun selects from the front of the queue.
– FCFS with preemptive timeslicing is called round robin.
[Figure: the ready list. Wakeup or ReadyToRun does List::Append at the
tail; GetNextToRun() does RemoveFromHead and dispatches that thread to
the CPU.]
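A minimal sketch of that ready list in C, using the names from the figure (the code itself is hypothetical): appending at the tail preserves arrival order, and GetNextToRun takes from the head.

#include <stddef.h>

struct thread { struct thread *next; };
struct readylist { struct thread *head, *tail; };

/* Wakeup or ReadyToRun: List::Append at the tail (arrival order). */
void ready_append(struct readylist *q, struct thread *t)
{
    t->next = NULL;
    if (q->tail)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* GetNextToRun: RemoveFromHead; NULL means the core should idle. */
struct thread *get_next_to_run(struct readylist *q)
{
    struct thread *t = q->head;
    if (t) {
        q->head = t->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return t;
}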
Evaluating FCFS
How well does FCFS achieve the goals of a scheduler?
– throughput. FCFS is as good as any non-preemptive policy.
…if the CPU is the only schedulable resource in the system.
– fairness. FCFS is intuitively fair…sort of.
“The early bird gets the worm”…and everyone else is fed
eventually.
– response time. Long jobs keep everyone else waiting.
[Gantt chart: three jobs with demands D=3, D=2, D=1 run in arrival
order, completing at times 3, 5, and 6.]
R = (3 + 5 + 6)/3 = 4.67
Preemptive FCFS: Round Robin
Preemptive timeslicing is one way to improve fairness of FCFS.
If a job does not block or exit, force an involuntary context switch
after each quantum Q of CPU time.
Preempted job goes back to the tail of the ready list.
With infinitesimal Q, round robin is called processor sharing.
[Gantt chart: the same jobs (D=3, D=2, D=1) under FCFS-RTC and under
round robin with quantum Q=1 and preemption overhead ε; round-robin
completions fall at times 3+ε, 5, and 6.]
R = (3 + 5 + 6 + ε)/3 = 4.67 + ε
In this case, R is unchanged by timeslicing.
Is this always true?
Evaluating Round Robin
[Gantt chart: two jobs, D=5 and D=1, arriving together.
FCFS: R = (5 + 6)/2 = 5.5
Round robin (Q=1): R = (2 + 6 + ε)/2 = 4 + ε]
– Response time. RR reduces response time for short jobs.
For a given load, wait time is proportional to the job’s total service
demand D.
– Fairness. RR reduces variance in wait times.
But: RR forces jobs to wait for other jobs that arrived later.
– Throughput. RR imposes extra context switch overhead.
CPU is only Q/(Q+ε) as fast as it was before.
Degrades to FCFS-RTC (run to completion) with large Q.
Minimizing Response Time: SJF (STCF)
Shortest Job First (SJF) is provably optimal if the goal is
to minimize average-case R.
Also called Shortest Time to Completion First (STCF) or
Shortest Remaining Processing Time (SRPT).
Example: express lanes at the MegaMart
Idea: get short jobs out of the way quickly to minimize
the number of jobs waiting while a long job runs.
Intuition: longest jobs do the least possible damage to the wait
times of their competitors.
[Gantt chart: the same jobs run shortest-first (D=1, then D=2, then
D=3), completing at times 1, 3, and 6.]
R = (1 + 3 + 6)/3 = 3.33
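The numbers in these Gantt-chart examples are easy to check with a toy simulator. This sketch (hypothetical code, not a real scheduler) assumes all jobs arrive at time 0 with known demands and zero switch overhead, and computes mean response time R under FCFS, round robin with quantum Q=1, and SJF:

#include <stdio.h>
#include <string.h>

#define NJOBS 3

static double mean_response_fcfs(const int d[], int n) {
    double t = 0, sum = 0;
    for (int i = 0; i < n; i++) { t += d[i]; sum += t; }  /* run in arrival order */
    return sum / n;
}

static double mean_response_sjf(int d[], int n) {
    /* sort demands ascending; SJF is then FCFS on the sorted order */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (d[j] < d[i]) { int tmp = d[i]; d[i] = d[j]; d[j] = tmp; }
    return mean_response_fcfs(d, n);
}

static double mean_response_rr(const int d[], int n, int q) {
    int rem[NJOBS];                       /* remaining demand per job (n <= NJOBS) */
    int done = 0, t = 0;
    double sum = 0;
    memcpy(rem, d, n * sizeof rem[0]);
    while (done < n)                      /* cycle over unfinished jobs */
        for (int i = 0; i < n; i++) {
            if (rem[i] == 0) continue;
            int run = rem[i] < q ? rem[i] : q;
            t += run; rem[i] -= run;
            if (rem[i] == 0) { sum += t; done++; }   /* record completion time */
        }
    return sum / n;
}

int main(void) {
    int jobs[NJOBS] = { 3, 2, 1 };        /* demands from the Gantt examples */
    printf("FCFS R = %.2f\n", mean_response_fcfs(jobs, NJOBS));
    printf("RR   R = %.2f\n", mean_response_rr(jobs, NJOBS, 1));
    printf("SJF  R = %.2f\n", mean_response_sjf(jobs, NJOBS));
    return 0;
}

For the demands {3, 2, 1} it prints 4.67, 4.67, and 3.33, matching the charts above (with ε = 0).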
Two Schedules for CPU/Disk
1. Naive Round Robin
[Timeline: CPU and disk bursts interleave naively; the schedule takes
37 time units.]
CPU busy 25/37: U = 67%
Disk busy 15/37: U = 40%
2. Round Robin with internal boost for I/O completion
[Timeline: the same work completes in 25 time units.]
CPU busy 25/25: U = 100%
Disk busy 15/25: U = 60%
33% performance improvement
Estimating Time-to-Yield
How to predict which job/task/thread will have the shortest
demand on the CPU?
– If you don’t know, then guess.
Weather report strategy: predict future D from the recent past.
Don’t have to guess exactly: we can do well by using
adaptive internal priority.
– Common technique: multi-level feedback queue.
– Set N priority levels, with a timeslice quantum for each.
– If quantum expires, drop priority down one level.
– If a job yields or blocks, bump priority up one level.
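The "weather report" guess above can be as simple as the classic textbook estimate tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n): an exponentially weighted average of a thread's past CPU bursts. A minimal sketch (names hypothetical):

struct burst_estimate { double tau; };   /* predicted next CPU demand */

/* Update the prediction after observing an actual burst of length t.
   alpha near 1 chases the recent past; alpha near 0 damps noise. */
void burst_observe(struct burst_estimate *e, double t, double alpha)
{
    e->tau = alpha * t + (1.0 - alpha) * e->tau;
}

A scheduler can then favor the thread with the smallest tau, approximating SJF without knowing demands in advance.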
Example: a recent Linux rev
Tasks are determined to be I/O-bound or CPU-bound based on an
interactivity heuristic. A task's interactivity metric is calculated
from how much time the task executes compared to how much time it
sleeps. Because I/O-bound tasks schedule I/O and then wait, they spend
more time sleeping and waiting for I/O completion, which raises their
interactivity metric.
Internal Priority Adjustment
Continuous, dynamic priority adjustment in response to
observed conditions and events.
– Adjust priority according to recent usage.
• Decay with usage, rise with time
– Boost threads that already hold resources that are in demand.
e.g., internal sleep primitive in Unix kernels
– Boost threads that have starved in the recent past.
– May be visible/controllable to other parts of the kernel
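A decay-usage sketch in the style of classic BSD Unix schedulers (the constants and names here are illustrative assumptions, not any particular kernel): each tick charges the running thread, and a periodic recomputation decays usage, so priority decays with usage and rises with time.

struct sched_info {
    int cpu_usage;   /* recent CPU ticks charged to this thread */
    int nice;        /* user-settable relative importance */
    int priority;    /* smaller = better, in this sketch */
};

/* Charge the running thread on every timer tick. */
void charge_tick(struct sched_info *s) { s->cpu_usage++; }

/* Periodically decay usage and recompute priority. Heavier system
   load makes usage decay more slowly, so CPU hogs stay penalized. */
void recompute_priority(struct sched_info *s, int load)
{
    s->cpu_usage = (2 * load * s->cpu_usage) / (2 * load + 1);
    s->priority = s->cpu_usage / 4 + s->nice;
}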
Multilevel Feedback Queue
Many systems (e.g., Unix variants) implement priority
and incorporate SJF by using a multilevel feedback
queue.
– multilevel. Separate queue for each of N priority levels.
Use RR on each queue; look at queue i-1 only if queue i is empty.
– feedback. Factor previous behavior into new job priority.
[Figure: ready queues indexed by priority, high to low. Near the top:
I/O-bound jobs waiting for CPU, jobs holding resources, and jobs with
high external priority. Near the bottom: CPU-bound jobs, whose priority
decays with system load and service received. GetNextToRun selects the
job at the head of the highest-priority nonempty queue: constant time,
no sorting.]
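A sketch of the structure in the figure (hypothetical names): one FIFO queue per level, cheap selection from the highest nonempty level, and feedback rules that demote quantum-expirers and promote threads that yield or block.

#include <stddef.h>

#define NLEVELS 8                       /* N priority levels */

struct thread { int level; struct thread *next; };
struct mlfq { struct thread *head[NLEVELS], *tail[NLEVELS]; };

static void enqueue(struct mlfq *q, struct thread *t)
{
    t->next = NULL;
    if (q->tail[t->level])
        q->tail[t->level]->next = t;
    else
        q->head[t->level] = t;
    q->tail[t->level] = t;
}

/* GetNextToRun: head of the highest-priority nonempty queue. */
struct thread *get_next_to_run(struct mlfq *q)
{
    for (int lv = NLEVELS - 1; lv >= 0; lv--) {
        struct thread *t = q->head[lv];
        if (t) {
            q->head[lv] = t->next;
            if (q->head[lv] == NULL)
                q->tail[lv] = NULL;
            return t;
        }
    }
    return NULL;                        /* all queues empty: idle */
}

/* Feedback: quantum expired, so drop one level and requeue. */
void on_quantum_expired(struct mlfq *q, struct thread *t)
{
    if (t->level > 0) t->level--;
    enqueue(q, t);
}

/* Feedback: thread yielded or blocked, so bump it up one level. */
void on_yield_or_block(struct mlfq *q, struct thread *t)
{
    if (t->level < NLEVELS - 1) t->level++;
    enqueue(q, t);
}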
Thread priority in other queues
• The scheduling problem applies to sleep queues as well.
• Which thread should get a mutex next? Which thread
should wake up on a signal?
• Should priority matter?
• What if a high-priority thread is waiting for a mutex held
by a low-priority thread? This is called priority inversion.
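A common fix is priority inheritance: the mutex holder temporarily borrows the priority of the highest-priority waiter, so a medium-priority thread cannot preempt it. A minimal sketch (hypothetical names; sleep-queue handling elided), and essentially the fix later applied to Pathfinder, discussed next:

struct thread { int base_pri, eff_pri; };   /* larger = higher priority */
struct mutex  { struct thread *holder; };

/* A high-priority thread is about to block on m: lend its priority. */
void mutex_block(struct mutex *m, struct thread *waiter)
{
    if (m->holder && waiter->eff_pri > m->holder->eff_pri)
        m->holder->eff_pri = waiter->eff_pri;   /* inherit */
    /* ... enqueue waiter on the mutex's sleep queue ... */
}

/* On release, the holder drops any inherited boost. */
void mutex_release(struct mutex *m, struct thread *holder)
{
    holder->eff_pri = holder->base_pri;
    m->holder = NULL;
    /* ... wake the highest-priority waiter ... */
}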
Mars Pathfinder
Mission:
– Demonstrate new landing techniques: parachute and airbags
– Take pictures
– Analyze soil samples
– Demonstrate mobile robot technology: Sojourner
Major success on all fronts:
– Returned 2.3 billion bits of information
– 16,500 images from the Lander; 550 images from the Rover
– 15 chemical analyses of rocks & soil
– Lots of weather data
– Both Lander and Rover outlived their design life
– Broke all records for number of hits on a website!
© 2001, Steve Easterbrook
Pictures from an early Mars rover
© 2001, Steve Easterbrook
Pathfinder had Software Errors
Symptoms: the software did total system resets, and some data was lost
each time. The symptoms were noticed soon after Pathfinder started
collecting meteorological data.
Cause: three threads shared the information bus, with access mediated
by mutual exclusion locks (mutexes):
– High priority: Information Bus Manager
– Medium priority: Communications Task
– Low priority: Meteorological Data Gathering Task
Priority inversion:
– The low-priority task acquires the mutex to transfer data to the bus.
– The high-priority task blocks until the mutex is released.
– The medium-priority task preempts the low-priority task.
– Eventually a watchdog timer notices that the Bus Manager hasn't run
for some time, and it resets the system.
Factors:
– Very hard to diagnose and hard to reproduce
– Need full tracing switched on to analyze what happened
– Was experienced a couple of times in pre-flight testing, but never
reproduced or explained, so testers assumed it was a hardware glitch
© 2001, Steve Easterbrook
Real Time/Media
Real-time schedulers must support regular, periodic
execution of tasks (e.g., continuous media).
E.g., OS X has four user-settable parameters per thread:
– Period (y)
– Computation (x)
– Preemptible (boolean)
– Constraint (<y)
• Can the application adapt if the scheduler cannot meet
its requirements?
– Admission control and reflection
Provided for completeness
Memory Allocation
How should an OS allocate its memory resources
among contending demands?
– Virtual address spaces: fork, exec, sbrk, page fault.
– The kernel controls how many machine memory frames back
the pages of each virtual address space.
– The kernel can take memory away from a VAS at any time.
– The kernel always gets control if a VAS (or rather a thread
running within a VAS) asks for more.
– The kernel controls how much machine memory to use as a
cache for data blocks whose home is on slow storage.
– Policy choices: which pages or blocks to keep in memory? And
which ones to evict from memory to make room for others?
What is a Virtual Address Space?
• Protection domain
– A “sandbox” for threads that limits what memory they can
access for read/write/execute.
– Each thread is in exactly one sandbox, but many threads
may play in the same sandbox.
• Uniform name space
– Threads access their code and data items without caring
where they are in physical memory, or even if they are
resident in memory at all.
• A set of VP translations
– A level of indirection from virtual pages to physical frames.
– The OS kernel controls the translations in effect at any time.
Virtual Memory as a Cache
[Figure: the program sections of an executable file (header, text,
idata, wdata, symbol table, etc.) map into a big virtual memory of
process segments (text, data, BSS, user stack, args/env, kernel).
Virtual-to-physical translations map resident pages onto the page
frames of a small physical memory; pages move in by page fetch from
backing storage and out by pageout/eviction.]
Introduction to Virtual Addressing
[Figure: a virtual memory (big?) with text, data, BSS, user stack,
args/env, and kernel segments, mapped by virtual-to-physical
translations onto a physical memory (small?).]
• Code addresses memory through virtual addresses.
• The kernel controls the virtual-to-physical translations in effect
(space).
• The kernel and the machine collude to translate virtual addresses to
physical addresses.
• The machine does not allow a user process to access memory unless the
kernel "says it's OK".
• The specific mechanisms for implementing virtual address translation
are machine-dependent.
Names and layers
User view: notes in notebook file
Application: notefile, fd, byte range*
File System: device, block #
Disk Subsystem: surface, cylinder, sector
Each layer translates the names used above it (fd and byte offset to
block #, block # to disk address). Add more layers as needed.
Page/block maps
Idea: use a level of indirection through a map to assemble a storage
object from "scraps" of storage in different locations.
The "scraps" can be fixed-size slots: that makes allocation easy
because they are interchangeable.
Example: page tables that implement a VAS.
Cartoon View
[Figure: a user virtual address splits into page #i and an offset. The
process page table (map) translates VPN i to PFN i; PFN i + offset then
selects a location among the physical memory page frames.]
• Each process/VAS has its own page table. Virtual addresses are
translated relative to the current page table.
• In this example, each VPN j maps to PFN j, but in practice any
physical frame may be used for any virtual page.
• The maps are themselves stored in memory; a protected register holds
a pointer to the current map.
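A sketch of the translation in the figure, assuming 4 KB pages and a flat one-level table (all names hypothetical): split the virtual address into page number and offset, look the page number up in the current map, and splice the offset back on.

#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4 KB pages assumed */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

typedef struct { uint32_t pfn; int valid; } pte_t;   /* one map entry */

/* Returns the physical address, or -1 if the page is not resident
   (which raises a page fault in a real system). */
int64_t translate(const pte_t *page_table, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;        /* page #i */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* offset  */
    if (!page_table[vpn].valid)
        return -1;
    return ((int64_t)page_table[vpn].pfn << PAGE_SHIFT) | offset;
}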
Locality
Principle of Locality: programs tend to reuse data and instructions
near those they have used recently, or that were recently referenced
themselves.
• Temporal locality: recently referenced items are likely to be
referenced in the near future.
• Spatial locality: items with nearby addresses tend to be referenced
close together in time.
Locality Example:

int sum_array(const int a[], int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
• Data
– Reference array elements in succession
(stride-1 reference pattern): Spatial locality
– Reference sum each iteration: Temporal locality
• Instructions
– Reference instructions in sequence: Spatial locality
– Cycle through loop repeatedly: Temporal locality
[This and the following slides adapted from CMU 15-213, F'02.]
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
• Fast storage technologies cost more per byte and have less capacity.
• The gap between CPU and main memory speed is widening.
• Well-written programs tend to exhibit good locality.
These fundamental properties complement each other beautifully. They
suggest an approach for organizing memory and storage systems known as
a memory hierarchy.
An Example Memory Hierarchy
[Figure: storage pyramid. Smaller, faster, and costlier (per byte)
devices at the top; larger, slower, and cheaper (per byte) devices at
the bottom.
L0: registers. CPU registers hold words retrieved from L1 cache.
L1: on-chip L1 cache (SRAM). Holds cache lines retrieved from the L2
cache.
L2: off-chip L2 cache (SRAM). Holds cache lines retrieved from main
memory.
L3: main memory (DRAM). Holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Holds files retrieved from
disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).]
Caches
Cache: a smaller, faster storage device that acts as a staging area for
a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy:
• For each k, the faster, smaller device at level k serves as a cache
for the larger, slower device at level k+1.
Why do memory hierarchies work?
• Programs tend to access the data at level k more often than they
access the data at level k+1.
• Thus, the storage at level k+1 can be slower, and thus larger and
cheaper per bit.
• Net effect: a large pool of memory that costs as much as the cheap
storage near the bottom, but that serves data to programs at the rate
of the fast storage near the top.
Caching in a Memory Hierarchy
[Figure: the smaller, faster, more expensive device at level k caches a
subset of the blocks from level k+1. The larger, slower, cheaper
storage device at level k+1 is partitioned into blocks 0 through 15.
Data is copied between levels in block-sized transfer units.]
General Caching Concepts
[Figure: a program requests blocks from the level-k cache, which holds
a few of the blocks from level k+1 (blocks 0 through 15).]
Program needs object d, which is stored in some block b.
Cache hit
• Program finds b in the cache at level k. E.g., block 14.
Cache miss
• b is not at level k, so the level k cache must fetch it from level
k+1. E.g., block 12.
• If the level k cache is full, then some current block must be
replaced (evicted). Which one is the "victim"?
– Placement policy: where can the new block go? E.g., b mod 4.
– Replacement policy: which block should be evicted? E.g., LRU.
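To make the two policies concrete, here is a sketch of a tiny four-slot cache (hypothetical code): direct-mapped placement by b mod 4, where the victim is forced, versus fully associative placement with LRU replacement.

#include <stdint.h>

#define SLOTS 4

struct cache {
    int block[SLOTS];          /* which block each slot holds (-1 = empty) */
    uint64_t last_use[SLOTS];  /* recency stamps, for LRU */
    uint64_t clock;
};

/* Placement policy "b mod 4": block b may live only in slot b % SLOTS,
   so on a miss the victim is whatever that slot currently holds. */
int lookup_direct(struct cache *c, int b)
{
    int slot = b % SLOTS;
    if (c->block[slot] == b)
        return 1;                    /* hit */
    c->block[slot] = b;              /* miss: forced eviction */
    return 0;
}

/* Fully associative placement with LRU replacement: any slot may hold
   b; on a miss, evict the least recently used slot. */
int lookup_lru(struct cache *c, int b)
{
    int victim = 0;
    c->clock++;
    for (int i = 0; i < SLOTS; i++) {
        if (c->block[i] == b) {
            c->last_use[i] = c->clock;
            return 1;                /* hit */
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;
    }
    c->block[victim] = b;            /* miss: replace the LRU block */
    c->last_use[victim] = c->clock;
    return 0;
}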
A System with Virtual Memory
Examples: workstations, servers, modern PCs, etc.
[Figure: the CPU issues virtual addresses 0 through N-1; a page table
maps each one to a physical address 0 through P-1 in memory, or to a
location on disk.]
Address translation: hardware converts virtual addresses to physical
addresses via an OS-managed lookup table (the page table).
Page Faults (like "Cache Misses")
What if an object is on disk rather than in memory?
• The page table entry indicates that the virtual address is not in
memory.
• An OS exception handler is invoked to move the data from disk into
memory:
– the current process suspends; others can resume
– the OS has full control over placement, etc.
[Figure: before the fault, the page table entry for the virtual address
points to disk; after the fault, it points to a physical address in
memory.]
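A sketch of the fault path in the figure (all names hypothetical; the helper functions are assumptions, not a real API): the handler fetches the page, updates the page table entry, and resumes the process, so the retried access now hits.

typedef struct {
    unsigned pfn;        /* physical frame, if valid */
    int valid;           /* resident in memory? */
    unsigned disk_addr;  /* where the page lives on disk */
} pte_t;

extern unsigned frame_alloc(void);     /* assumed: may evict another page */
extern void disk_read(unsigned disk_addr, unsigned pfn);   /* assumed */
extern void resume_faulting_process(void);                 /* assumed */

void handle_page_fault(pte_t *pte)
{
    unsigned pfn = frame_alloc();      /* OS has full control of placement */
    disk_read(pte->disk_addr, pfn);    /* faulting process sleeps meanwhile */
    pte->pfn = pfn;
    pte->valid = 1;                    /* retry of the access will now hit */
    resume_faulting_process();
}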
Dynamic address translation
[Figure: a user process issues a virtual address; the translator (MMU)
converts it to a physical address that indexes physical memory.]
Will this allow us to provide protection?
Sure, as long as the translation is correct.