Resource Management Policy and Mechanism
Jeff Chase
Duke University

The kernel
[Figure: kernel structure. A system call layer (files, processes, IPC, thread syscalls) and fault entry (VM page faults, signals, etc.) sit above thread/CPU/core management (sleep and ready queues) and memory management (block/page cache). Timer-tick interrupts drive the ready queue; I/O completion interrupts drive the sleep queue. A second view of the same figure marks the scheduler's queues and the block/page cache as the points where policy applies.]

Separation of policy and mechanism
• Every OS platform has mechanisms that enable it to mediate access to machine resources:
– gain control of a core via timer interrupts;
– fault on access to non-resident virtual memory;
– I/O through system call traps;
– internal code and data structures to track resource usage and allocate resources.
• The mechanisms enable resource management policy.
• But the mechanisms do not, and should not, determine the policy.
• We might want to change the policy!

Goals of policy
• Share resources fairly.
• Use machine resources efficiently.
• Be responsive to user interaction.
But what do these things mean? How do we know if a policy is good or not? What are the metrics? What do we assume about the workload?

Example: Processor Allocation
The key issue: how should an OS allocate its CPU resources among contending demands?
– We are concerned with resource allocation policy: how the OS uses underlying mechanisms to meet design goals.
– Focus on the OS kernel: user code can decide how to use the processor time it is given.
– Which thread to run on a free core?
– For how long? When to take the core back and give it to some other thread (timeslice or quantum)?
– What are the policy goals?

CPU dispatch and ready queues
In a typical OS, each thread has a priority, which may change over time. When a core is idle, pick a thread with the highest priority. If a higher-priority thread becomes ready, preempt the thread currently running on the core and switch to the new thread. If the quantum expires (timer tick), preempt, select a new thread, and switch.

Priority
Most modern OS schedulers use priority scheduling.
– Each thread in the ready pool has a priority value.
– The scheduler favors higher-priority threads.
– Threads inherit a base priority from the associated application/process.
– User-settable relative importance within an application.
– Internal priority adjustments as an implementation technique within the scheduler.
– How to set the priority of a thread? How many priority levels? 32 (Windows) to 128 (OS X).

Scheduler Policy Goals
• Response time or latency, responsiveness: how long does it take to do what I asked? (R)
• Throughput: how many operations complete per unit of time? (X) Utilization: what percentage of time does the CPU (and each device) spend doing useful work? (U)
• Fairness: what does this mean? Divide the pie evenly? Guarantee low variance in response times? Freedom from starvation? Serve the clients who pay the most?
• Meet deadlines and reduce jitter for periodic tasks.

A Simple Policy: FCFS
The most basic scheduling policy is first-come-first-served (FCFS), also called first-in-first-out (FIFO).
– FCFS is just like the checkout line at the QuickiMart: maintain a queue ordered by time of arrival. GetNextToRun selects from the front of the queue.
– FCFS with preemptive timeslicing is called round robin.
[Figure: threads enter the ready list at the tail on Wakeup or ReadyToRun (List::Append); GetNextToRun() removes the thread at the head (RemoveFromHead) and dispatches it to the CPU.]
A minimal sketch of this ready-list discipline follows.
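The sketch below shows the FCFS ready list in C. It is illustrative only: the names (thread_t, ready_to_run, get_next_to_run) are stand-ins for whatever a real kernel uses, and a real kernel would add locking and per-core queues.

    #include <stddef.h>

    /* A minimal FCFS ready list: a thread is appended at the tail when
     * it becomes runnable; the dispatcher always runs the head thread. */
    typedef struct thread {
        struct thread *next;
        /* ... saved registers, stack pointer, state, ... */
    } thread_t;

    static thread_t *ready_head = NULL;
    static thread_t *ready_tail = NULL;

    /* Wakeup or ReadyToRun: List::Append in the figure. */
    void ready_to_run(thread_t *t) {
        t->next = NULL;
        if (ready_tail)
            ready_tail->next = t;
        else
            ready_head = t;
        ready_tail = t;
    }

    /* GetNextToRun: RemoveFromHead in the figure; NULL if none ready. */
    thread_t *get_next_to_run(void) {
        thread_t *t = ready_head;
        if (t) {
            ready_head = t->next;
            if (ready_head == NULL)
                ready_tail = NULL;
        }
        return t;
    }

Round robin needs only one more step: on a quantum-expiry tick, the kernel calls ready_to_run on the preempted thread (sending it to the tail) and dispatches get_next_to_run.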
Evaluating FCFS
How well does FCFS achieve the goals of a scheduler?
– Throughput. FCFS is as good as any non-preemptive policy... if the CPU is the only schedulable resource in the system.
– Fairness. FCFS is intuitively fair... sort of. "The early bird gets the worm"... and everyone else is fed eventually.
– Response time. Long jobs keep everyone else waiting.
[Gantt chart: three jobs with service demands D=3, D=2, and D=1 arrive together and run to completion in arrival order, finishing at times 3, 5, and 6. Mean response time R = (3 + 5 + 6)/3 = 4.67.]

Preemptive FCFS: Round Robin
Preemptive timeslicing is one way to improve the fairness of FCFS. If a job does not block or exit, force an involuntary context switch after each quantum Q of CPU time. The preempted job goes back to the tail of the ready list. With infinitesimal Q, round robin is called processor sharing.
[Gantt chart: the same three jobs (D=3, D=2, D=1). Under FCFS run-to-completion they finish at 3, 5, and 6. Under round robin with quantum Q=1 they finish at about 3+ε, 5, and 6, where ε is the preemption overhead, so R = (3 + 5 + 6 + ε)/3 = 4.67 + ε.]
In this case, R is unchanged by timeslicing. Is this always true?

Evaluating Round Robin
[Gantt chart: two jobs with demands D=5 and D=1 arrive together. Under FCFS, R = (5 + 6)/2 = 5.5. Under round robin, the short job finishes at about time 2, so R = (2 + 6 + ε)/2 = 4 + ε.]
– Response time. RR reduces response time for short jobs. For a given load, wait time is proportional to the job's total service demand D.
– Fairness. RR reduces variance in wait times. But: RR forces jobs to wait for other jobs that arrived later.
– Throughput. RR imposes extra context switch overhead. The CPU is only Q/(Q+ε) as fast as it was before. RR degrades to FCFS run-to-completion with large Q.

Minimizing Response Time: SJF (STCF)
Shortest Job First (SJF) is provably optimal if the goal is to minimize average-case R. Also called Shortest Time to Completion First (STCF) or Shortest Remaining Processing Time (SRPT). Example: express lanes at the MegaMart.
Idea: get short jobs out of the way quickly, to minimize the number of jobs waiting while a long job runs. Intuition: the longest jobs do the least possible damage to the wait times of their competitors.
[Gantt chart: the same three jobs run shortest-first, finishing at times 1, 3, and 6. R = (1 + 3 + 6)/3 = 3.33.]

Two Schedules for CPU/Disk
[Figure: a workload alternates CPU bursts and disk I/O.
1. Naive round robin: CPU busy 25/37 (U = 67%), disk busy 15/37 (U = 40%).
2. Round robin with an internal priority boost on I/O completion: CPU busy 25/25 (U = 100%), disk busy 15/25 (U = 60%): a 33% performance improvement.]

Estimating Time-to-Yield
How can we predict which job/task/thread will have the shortest demand on the CPU?
– If you don't know, then guess. Weather report strategy: predict future D from the recent past.
We don't have to guess exactly: we can do well by using adaptive internal priority.
– Common technique: the multi-level feedback queue.
– Set N priority levels, with a timeslice quantum for each.
– If the quantum expires, drop the job's priority down one level.
– If a job yields or blocks, bump its priority up one level.

Example: a recent Linux rev
Tasks are determined to be I/O-bound or CPU-bound based on an interactivity heuristic. A task's interactivity metric is calculated from how much time the task executes compared to how much time it sleeps. Note that because I/O tasks schedule I/O and then wait, an I/O-bound task spends more time sleeping and waiting for I/O completion. This raises its interactivity metric.

Internal Priority Adjustment
Continuous, dynamic priority adjustment in response to observed conditions and events:
– Adjust priority according to recent usage: decay with usage, rise with time.
– Boost threads that already hold resources that are in demand (e.g., the internal sleep primitive in Unix kernels).
– Boost threads that have starved in the recent past.
– May be visible/controllable to other parts of the kernel.

Multilevel Feedback Queue
Many systems (e.g., Unix variants) implement priority and incorporate SJF by using a multilevel feedback queue.
– multilevel. A separate queue for each of N priority levels. Use RR within each queue; look at queue i-1 only if queue i is empty.
– feedback. Factor previous behavior into the new job priority.
[Figure: ready queues indexed by priority, from high to low. Near the top: I/O-bound jobs waiting for the CPU, jobs holding resources, and jobs with high external priority. CPU-bound jobs sink toward the bottom: their priority decays with system load and service received. GetNextToRun selects the job at the head of the highest-priority nonempty queue: constant time, no sorting.]
A sketch of both rules follows.
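Here is a minimal sketch of a multilevel feedback queue in C. Everything is illustrative (NLEVELS, job_t, the one-level-per-event adjustment); real schedulers add per-level quanta, priority decay, and anti-starvation boosts.

    #include <stddef.h>

    #define NLEVELS 8   /* assumed number of priority levels; 0 is highest */

    typedef struct job {
        struct job *next;
        int level;                /* current priority level, 0..NLEVELS-1 */
    } job_t;

    /* One FIFO ready queue per priority level. */
    static job_t *head[NLEVELS], *tail[NLEVELS];

    static void enqueue(job_t *j) {          /* append at tail of its level */
        j->next = NULL;
        if (tail[j->level]) tail[j->level]->next = j;
        else head[j->level] = j;
        tail[j->level] = j;
    }

    static job_t *dequeue(int lvl) {         /* remove from head of a level */
        job_t *j = head[lvl];
        if (j) {
            head[lvl] = j->next;
            if (!head[lvl]) tail[lvl] = NULL;
        }
        return j;
    }

    /* feedback: quantum expired -> the job looks CPU-bound, so it sinks. */
    void on_quantum_expired(job_t *j) {
        if (j->level < NLEVELS - 1) j->level++;
        enqueue(j);                          /* back of its (new) queue */
    }

    /* feedback: job yielded or blocked -> it looks I/O-bound, so it rises.
     * It is enqueued at the new level when it becomes ready again. */
    void on_yield_or_block(job_t *j) {
        if (j->level > 0) j->level--;
    }

    /* multilevel: take the head of the highest-priority nonempty queue.
     * Constant time in the number of levels, no sorting. */
    job_t *get_next_to_run(void) {
        for (int lvl = 0; lvl < NLEVELS; lvl++)
            if (head[lvl]) return dequeue(lvl);
        return NULL;
    }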
Thread priority in other queues
• The scheduling problem applies to sleep queues as well.
• Which thread should get a mutex next? Which thread should wake up on a signal?
• Should priority matter?
• What if a high-priority thread is waiting for a mutex held by a low-priority thread? This is called priority inversion.

Mars Pathfinder Mission
Goals: demonstrate new landing techniques (parachute and airbags); take pictures; analyze soil samples; demonstrate mobile robot technology (Sojourner).
A major success on all fronts:
– Returned 2.3 billion bits of information: 16,500 images from the Lander, 550 images from the Rover, 15 chemical analyses of rocks and soil, and lots of weather data.
– Both the Lander and the Rover outlived their design life.
– Broke all records for number of hits on a website!
[Slide: pictures from an early Mars rover.]

Pathfinder had Software Errors
Symptoms: the software did total system resets, and some data was lost each time. The symptoms were noticed soon after Pathfinder started collecting meteorological data.
Cause: three process threads, with bus access via mutual exclusion locks (mutexes):
– High priority: Information Bus Manager.
– Medium priority: Communications Task.
– Low priority: Meteorological Data Gathering Task.
Priority inversion:
– The low-priority task acquires the mutex to transfer data to the bus.
– The high-priority task blocks until the mutex is released.
– The medium-priority task preempts the low-priority task.
– Eventually a watchdog timer notices that the Bus Manager hasn't run for some time...
Factors: very hard to diagnose and hard to reproduce; full tracing had to be switched on to analyze what happened. It was experienced a couple of times in pre-flight testing but never reproduced or explained, so the testers assumed it was a hardware glitch.
(Pathfinder slides © 2001, Steve Easterbrook.)
A sketch of the standard fix follows.
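The standard remedy is priority inheritance: while a high-priority thread waits on a mutex, the holder temporarily runs at the waiter's priority, so a medium-priority task can no longer preempt it. (Famously, the Pathfinder fix was to enable exactly this option on the shared mutex.) The C sketch below uses illustrative types and names, not a real kernel API; lower number = higher priority.

    /* Minimal priority inheritance sketch. */
    typedef struct thread {
        int priority;        /* current (possibly boosted) priority */
        int base_priority;   /* priority before any inheritance boost */
    } thread_t;

    typedef struct mutex {
        thread_t *holder;    /* NULL if the mutex is free */
    } mutex_t;

    /* Called when 'waiter' blocks on a mutex held by another thread. */
    void inherit_priority(mutex_t *m, thread_t *waiter) {
        thread_t *h = m->holder;
        if (h != NULL && waiter->priority < h->priority) {
            /* Boost the holder: the low-priority data-gathering task
             * now runs at Bus Manager priority, so the Communications
             * Task cannot preempt it while the mutex is held. */
            h->priority = waiter->priority;
        }
    }

    /* Called when the holder releases the mutex. */
    void release_and_restore(mutex_t *m) {
        m->holder->priority = m->holder->base_priority;
        m->holder = NULL;    /* next, wake the highest-priority waiter */
    }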
Real Time/Media
Real-time schedulers must support regular, periodic execution of tasks (e.g., continuous media). For example, OS X has four user-settable parameters per thread:
– Period (y)
– Computation (x)
– Preemptible (boolean)
– Constraint (<y)
• Can the application adapt if the scheduler cannot meet its requirements?
– Admission control and reflection.
(Provided for completeness.)

Memory Allocation
How should an OS allocate its memory resources among contending demands?
– Virtual address spaces: fork, exec, sbrk, page fault.
– The kernel controls how many machine memory frames back the pages of each virtual address space.
– The kernel can take memory away from a VAS at any time.
– The kernel always gets control if a VAS (or rather a thread running within a VAS) asks for more.
– The kernel controls how much machine memory to use as a cache for data blocks whose home is on slow storage.
– Policy choices: which pages or blocks to keep in memory? And which ones to evict from memory to make room for others?

What is a Virtual Address Space?
• Protection domain
– A "sandbox" for threads that limits what memory they can access for read/write/execute.
– Each thread is in exactly one sandbox, but many threads may play in the same sandbox.
• Uniform name space
– Threads access their code and data items without caring where they are in physical memory, or even whether they are resident in memory at all.
• A set of V→P translations
– A level of indirection from virtual pages to physical frames.
– The OS kernel controls the translations in effect at any time.

Virtual Memory as a Cache
[Figure: the program sections of an executable file (header, text, idata, wdata, symbol table, etc.) map into the segments of a big virtual memory (text, data, BSS, user stack, args/env, kernel); a small physical memory holds page frames backing some of those pages via virtual-to-physical translations, with page fetch from, and pageout/eviction to, backing storage.]

Introduction to Virtual Addressing
Code addresses memory through virtual addresses. The kernel and the machine collude to translate virtual addresses to physical addresses. The kernel controls the virtual-to-physical translations in effect (space). The machine does not allow a user process to access memory unless the kernel "says it's OK". The specific mechanisms for implementing virtual address translation are machine-dependent.
[Figure: a big(?) virtual memory (text, data, BSS, user stack, args/env, kernel) maps through virtual-to-physical translations onto a small(?) physical memory.]

Names and layers
Each layer names storage in its own terms:
– User view: notes in a notebook file.
– Application: a notefile (fd, byte range*).
– File System: fd and byte offsets → device, block #.
– Disk Subsystem: device, block # → surface, cylinder, sector.
Add more layers as needed.

Page/block maps
Idea: use a level of indirection through a map to assemble a storage object from "scraps" of storage in different locations. The "scraps" can be fixed-size slots: that makes allocation easy, because the slots are interchangeable. Example: page tables that implement a VAS.

Cartoon View
Each process/VAS has its own page table. Virtual addresses are translated relative to the current page table.
[Figure: a user virtual address splits into page # i and an offset. The process page table (the map) translates VPN i to PFN i; the physical address is PFN i plus the offset into the physical memory page frames. In this example each VPN j maps to PFN j, but in practice any physical frame may be used for any virtual page.]
The maps are themselves stored in memory; a protected register holds a pointer to the current map.

(The memory-hierarchy slides that follow are from CMU's 15-213, F'02.)

Locality
Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
– Temporal locality: recently referenced items are likely to be referenced in the near future.
– Spatial locality: items with nearby addresses tend to be referenced close together in time.

Locality Example:
    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
• Data
– References array elements in succession (a stride-1 reference pattern): spatial locality.
– References sum on each iteration: temporal locality.
• Instructions
– References instructions in sequence: spatial locality.
– Cycles through the loop repeatedly: temporal locality.

Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
– Fast storage technologies cost more per byte and have less capacity.
– The gap between CPU and main memory speed is widening.
– Well-written programs tend to exhibit good locality.
These fundamental properties complement each other beautifully. They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

An Example Memory Hierarchy
[Figure: a pyramid of smaller, faster, costlier (per byte) devices at the top and larger, slower, cheaper (per byte) devices at the bottom. L0: CPU registers hold words retrieved from the L1 cache. L1: on-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache. L2: off-chip L2 cache (SRAM) holds cache lines retrieved from main memory. L3: main memory (DRAM) holds disk blocks retrieved from local disks. L4: local secondary storage (local disks) holds files retrieved from disks on remote network servers. L5: remote secondary storage (distributed file systems, Web servers).]

Caches
Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
Fundamental idea of a memory hierarchy: for each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Why do memory hierarchies work? Programs tend to access the data at level k more often than they access the data at level k+1, so the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Net effect: a large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

Caching in a Memory Hierarchy
[Figure: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (e.g., blocks 4, 9, 10, and 14). The larger, slower, cheaper device at level k+1 is partitioned into blocks 0-15. Data is copied between levels in block-sized transfer units.]

General Caching Concepts
The program needs object d, which is stored in some block b.
– Cache hit: the program finds b in the cache at level k (e.g., block 14).
– Cache miss: b is not at level k, so the level-k cache must fetch it from level k+1 (e.g., block 12). If the level-k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
• Placement policy: where can the new block go? E.g., b mod 4.
• Replacement policy: which block should be evicted? E.g., LRU.
A sketch of a tiny cache with LRU replacement follows.
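Here is a tiny illustrative cache in C with an LRU replacement policy; the names and the 4-slot size are assumptions for the sketch. For simplicity, placement is fully associative (any block may occupy any slot); a placement rule like "b mod 4" would instead restrict each block to a fixed slot or slot set.

    #include <stdint.h>

    #define NSLOTS 4          /* tiny level-k cache: 4 block-sized slots */

    typedef struct {
        int block;            /* which level-(k+1) block is held; -1 = empty */
        uint64_t last_used;   /* logical clock value of the last reference */
    } slot_t;

    static slot_t cache[NSLOTS] = {
        {-1, 0}, {-1, 0}, {-1, 0}, {-1, 0}   /* all slots start empty */
    };
    static uint64_t now;      /* advances on every reference */

    /* Reference block b (assumed non-negative). Returns 1 on a hit;
     * on a miss, "fetches" b from level k+1, evicting the LRU victim. */
    int reference(int b) {
        now++;
        int victim = 0;
        for (int i = 0; i < NSLOTS; i++) {
            if (cache[i].block == b) {       /* cache hit */
                cache[i].last_used = now;
                return 1;
            }
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;                  /* least recently used so far */
        }
        /* Cache miss: empty slots have last_used == 0, so they are
         * chosen as victims before any occupied slot. */
        cache[victim].block = b;
        cache[victim].last_used = now;
        return 0;
    }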
A System with Virtual Memory
Examples: workstations, servers, modern PCs, etc.
[Figure: the CPU issues virtual addresses 0 through N-1; a page table maps each one either to a physical address 0 through P-1 in memory or to a location on disk.]
Address translation: hardware converts virtual addresses to physical addresses via an OS-managed lookup table (the page table).

Page Faults (like "Cache Misses")
What if an object is on disk rather than in memory?
– The page table entry indicates that the virtual address is not in memory.
– An OS exception handler is invoked to move the data from disk into memory: the current process suspends, and others can resume.
– The OS has full control over placement, etc.
[Figure: before the fault, the page table entry for the virtual address points to disk; after the fault, it points to a physical address in memory.]

Dynamic address translation
[Figure: a user process issues a virtual address; the translator (MMU) converts it to a physical address in physical memory.]
Will this allow us to provide protection? Sure, as long as the translation is correct.
A sketch that ties translation and page faults together follows.
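To close the loop, here is a minimal single-level translation sketch in C. It assumes 4 KB pages and invents the names (pte_t, page_table, handle_page_fault) for illustration; a real MMU walks multi-level tables in hardware, and the OS fault handler chooses the frame and does the disk transfer.

    #include <stdint.h>

    #define PAGE_SHIFT 12                    /* assume 4 KB pages */
    #define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

    typedef struct {
        uint32_t pfn;         /* physical frame number */
        unsigned valid : 1;   /* is the page resident in memory? */
    } pte_t;

    /* The current map; in a real machine a protected register points at
     * it, and only the kernel may change it -- that is what makes the
     * "says it's OK" protection check work. */
    static pte_t *page_table;

    /* Kernel policy hook (illustrative): fetch the page from disk into
     * some free or evicted frame and return that frame's number. */
    extern uint32_t handle_page_fault(uint64_t vpn);

    /* Translate a virtual address; on a miss, the fault handler runs and
     * the translation is installed, then the access proceeds. */
    uint64_t translate(uint64_t va) {
        uint64_t vpn = va >> PAGE_SHIFT;     /* page # */
        uint64_t off = va & OFFSET_MASK;     /* offset within the page */
        if (!page_table[vpn].valid) {
            /* Page fault: the OS moves the page in from disk and chooses
             * the frame (placement/replacement policy lives here). */
            page_table[vpn].pfn = handle_page_fault(vpn);
            page_table[vpn].valid = 1;
        }
        return ((uint64_t)page_table[vpn].pfn << PAGE_SHIFT) | off;
    }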