I/O Management and Disk Scheduling
Chapter 11

I/O Driver
• OS module which controls an I/O device
• hides the device specifics from the layers above it in the OS/kernel
• translates logical I/O into device I/O (e.g., logical disk blocks into {track, head, sector})
• performs data buffering and scheduling of I/O operations
• structure: several synchronous entry points (device initialization, queueing of I/O requests, state control, read/write) and one asynchronous entry point (to handle interrupts)

Typical driver structure

    /* synchronous part: called to queue a read/write request */
    driver_strategy(request) {
        if (empty(request_queue))
            driver_start(request);         /* device idle: start at once */
        else
            add(request, request_queue);   /* device busy: enqueue */
        block_current_process();           /* sleep until the interrupt handler wakes us */
        reschedule();
    }

    driver_start(request) {
        current_request = request;
        start_dma(request);                /* program the device; returns immediately */
    }

    driver_ioctl(request) { }              /* state control */
    driver_init() { }                      /* device initialization */

    /* asynchronous part: runs when the device interrupts */
    driver_interrupt(state) {
        if ((state == ERROR) && (retries++ < MAX)) {
            driver_start(current_request); /* retry the failed request */
            return;
        }
        add_current_process_to_active_queue();    /* wake the blocked process */
        if (!empty(request_queue))
            driver_start(get_next(request_queue)); /* start the next queued request */
    }

User to Driver Control Flow
[Figure: read/write/ioctl system calls cross from user to kernel; ordinary files are handled by the file system, special files name devices directly; character devices are served through the character queue by driver_read/driver_write, block devices through the buffer cache by driver_strategy]

I/O Buffering
• before an I/O request is placed, the source/destination of the I/O transfer must be locked in memory
• I/O buffering: data is copied from user space to kernel buffers which are pinned in memory
• works for character devices (terminals), network and disks
• buffer cache: a buffer in main memory for disk sectors
• character queue: follows the producer/consumer model (characters in the queue are read once)
• unbuffered I/O to/from disk (block device): used by VM paging, for instance

Buffer Cache
• when an I/O request is made for a sector, the buffer cache is checked first
• if the sector is missing from the cache, it is read into the buffer cache from the disk
• exploits locality of reference, as any other cache does
• replacements are usually done in chunks (a whole track can be written back at once to minimize seek time)
• replacement policies are global and controlled by the kernel

Replacement policies
• buffer cache organized like a stack: replace from the bottom
• LRU: replace the block that has been in the cache longest with no reference to it (on a reference, the block is moved to the top of the stack)
• LFU: replace the block with the fewest references (counters are incremented on each reference and blocks move accordingly)
• frequency-based replacement: define a new section on top of the stack; a block's counter stays unchanged while the block is in the new section

Least Recently Used
• The block that has been in the cache the longest with no reference to it is replaced
• The cache consists of a stack of blocks
• The most recently referenced block is on the top of the stack
• When a block is referenced or brought into the cache, it is placed on the top of the stack (see the code sketch below)

Least Frequently Used
• The block that has experienced the fewest references is replaced
• A counter is associated with each block
• The counter is incremented each time the block is accessed
• The block with the smallest count is selected for replacement
• Some blocks may be referenced many times in a short period of time and then not be needed any more
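To make the LRU stack organization concrete, here is a minimal, self-contained C sketch, not taken from the slides: the cache size, block numbers, linear lookup, and printed trace are illustrative assumptions (a real buffer cache would locate blocks through a hash table). Buffers form a doubly linked stack; every reference moves a buffer to the top, and on a miss with a full cache the victim is taken from the bottom.

    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_SIZE 4          /* illustrative, tiny on purpose */

    struct buf {
        int blockno;              /* disk block cached here */
        struct buf *up, *down;
    };

    static struct buf *top, *bottom;
    static int nbufs;

    /* Unlink b and push it on top of the stack (MRU position). */
    static void move_to_top(struct buf *b)
    {
        if (b == top) return;
        if (b->up)   b->up->down = b->down;
        if (b->down) b->down->up = b->up;
        if (b == bottom) bottom = b->up;
        b->up = NULL;
        b->down = top;
        if (top) top->up = b;
        top = b;
        if (!bottom) bottom = b;
    }

    /* Reference a block: on a hit, move it to the top; on a miss,
     * load it, evicting the bottom (LRU) block if the cache is full. */
    static void reference(int blockno)
    {
        for (struct buf *b = top; b; b = b->down)
            if (b->blockno == blockno) {          /* cache hit */
                move_to_top(b);
                printf("hit   %d\n", blockno);
                return;
            }

        struct buf *b;
        if (nbufs < CACHE_SIZE) {
            b = calloc(1, sizeof *b);
            nbufs++;
        } else {                                   /* evict from the bottom */
            b = bottom;
            printf("evict %d\n", b->blockno);
            bottom = b->up;
            if (bottom) bottom->down = NULL; else top = NULL;
            b->up = b->down = NULL;
        }
        b->blockno = blockno;                      /* a real cache reads the block here */
        move_to_top(b);
        printf("miss  %d\n", blockno);
    }

    int main(void)
    {
        int refs[] = { 1, 2, 3, 4, 1, 5, 2 };     /* 5 evicts 2 (the LRU block), then 2 evicts 3 */
        for (int i = 0; i < 7; i++)
            reference(refs[i]);
        return 0;
    }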
Application-controlled File Caching
• two-level block replacement: responsibility is split between the kernel and the user level
– a global allocation policy, performed by the kernel, decides which process will give up a block
– a block replacement policy, decided by the user: the kernel provides the candidate block as a hint to the process; the process can overrule the kernel's choice by suggesting an alternative block; the suggested block is then replaced by the kernel
• example of an alternative replacement policy: most-recently-used (MRU)

Sound kernel-user cooperation
• oblivious processes should do no worse than under LRU
• foolish processes should not hurt other processes
• smart processes should perform better than LRU whenever possible, and should never perform worse
– if the kernel selects block A and the user chooses B instead, the kernel swaps the positions of A and B in the LRU list and places B in a "placeholder" which points to A (the kernel's choice)
– if the user process misses on B (i.e., it made a bad choice) and B is found in the placeholder, then the block pointed to by the placeholder is chosen (this prevents hurting other processes)

Disk Performance Parameters
• To read or write, the disk head must be positioned at the desired track and at the beginning of the desired sector
• Seek time – the time it takes to position the head at the desired track
• Rotational delay (rotational latency) – the time it takes for the beginning of the sector to reach the head
• Access time – the sum of seek time and rotational delay, i.e., the time it takes to get into position to read or write
• Data transfer then occurs as the sector moves under the head

Disk I/O Performance
• disks are at least four orders of magnitude slower than main memory
• the performance of disk I/O is vital for the performance of the computer system as a whole
• typical disk performance parameters:
– seek time (to position the head at the track): 20 ms
– rotational delay (to reach the sector): 8.3 ms
– transfer rate: 1-2 MB/sec
• access time (seek time + rotational delay) >> transfer time for a sector; therefore the order in which sectors are read matters a lot

Disk Scheduling Policies
• Seek time is the reason for differences in performance
• For a single disk there will be a number of outstanding I/O requests
• If requests are selected randomly, we will get the worst possible performance
• Scheduling is therefore usually based on the position of the requested sector rather than on the priority of the requesting process
• First-in, first-out (FIFO)
– processes requests sequentially, in arrival order
– fair to all processes
– approaches random scheduling in performance if there are many processes
• Priority
– the goal is not to optimize disk use but to meet other objectives
– short batch jobs may have higher priority
– provides good interactive response time
• Last-in, first-out (LIFO)
– good for transaction processing systems: the device is given to the most recent user, so there should be little arm movement
– possibility of starvation, since a job may never regain the head of the line
• Shortest Service Time First (SSTF)
– select the disk I/O request that requires the least movement of the disk arm from its current position, i.e., always choose the minimum seek time
• SCAN
– the arm moves in one direction only (back and forth over the disk), satisfying all outstanding requests until it reaches the last track in that direction; then the direction is reversed
– gives a good distribution of service
• C-SCAN
– restricts scanning to one direction only: when the last track has been visited in one direction, the arm is returned to the opposite end of the disk and the scan begins again
– lower service variability than SCAN, but the head may not be moved for a considerable period of time
• N-step-SCAN
– segments the disk request queue into subqueues of size at most N
– subqueues are processed one at a time, using SCAN
– while a subqueue is being processed, new requests are added to another subqueue
• FSCAN
– uses two subqueues: during a scan, one queue is consumed while the other one fills with new requests
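The seek-distance differences between these policies are easy to see in a small simulation. Below is a minimal C sketch, not from the slides, that services the same request queue under SSTF and under SCAN and reports the total head movement; the request tracks and starting head position are arbitrary example values, and SCAN here reverses at the last pending request in each direction rather than at the physical disk edge (the LOOK variant).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NREQ 8

    /* SSTF: repeatedly service the pending request nearest the head. */
    static int sstf(const int req[], int n, int head)
    {
        int done[NREQ] = {0}, moved = 0;
        for (int served = 0; served < n; served++) {
            int best = -1, bestdist = 1 << 30;
            for (int i = 0; i < n; i++)
                if (!done[i] && abs(req[i] - head) < bestdist) {
                    best = i;
                    bestdist = abs(req[i] - head);
                }
            done[best] = 1;
            moved += bestdist;
            head = req[best];
            printf(" %d", head);
        }
        return moved;
    }

    static int cmp(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* SCAN: sweep upward from the head, then reverse direction. */
    static int scan(const int req[], int n, int head)
    {
        int sorted[NREQ];
        memcpy(sorted, req, n * sizeof req[0]);
        qsort(sorted, n, sizeof sorted[0], cmp);

        for (int i = 0; i < n; i++)        /* upward sweep */
            if (sorted[i] >= head) printf(" %d", sorted[i]);
        for (int i = n - 1; i >= 0; i--)   /* downward sweep */
            if (sorted[i] < head) printf(" %d", sorted[i]);

        int hi = sorted[n - 1] > head ? sorted[n - 1] : head;
        int moved = hi - head;             /* distance of the upward sweep */
        if (sorted[0] < head)
            moved += hi - sorted[0];       /* plus the sweep back down */
        return moved;
    }

    int main(void)
    {
        int req[NREQ] = { 55, 58, 39, 18, 90, 160, 150, 38 };
        int head = 100;                    /* example starting track */

        printf("SSTF:");
        printf("  (%d tracks)\n", sstf(req, NREQ, head));
        printf("SCAN:");
        printf("  (%d tracks)\n", scan(req, NREQ, head));
        return 0;
    }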
RAID
• Redundant Array of Independent Disks
• idea: replace large-capacity disks with multiple smaller-capacity drives to improve I/O performance
• a RAID is a set of physical disk drives viewed by the OS as a single logical drive
• data are distributed across the physical drives in a way that enables simultaneous access to data from multiple drives
• redundant disk capacity is used to compensate for the increase in the probability of failure that comes with multiple drives
• six RAID levels (design architectures)

RAID Level 0
• does not include redundancy
• data is striped across the available disks
– the disk is divided into strips
– strips are mapped round-robin to consecutive disks
– a set of consecutive strips that maps exactly one strip to each array member is called a stripe
[Figure: four disks; strips 0-3 form the first stripe, strips 4-7 the second]

RAID Level 1
• redundancy achieved by duplicating all the data
• each logical strip is mapped to two separate physical disks, so that every disk has a mirror disk containing the same data
– a read can be serviced by either of the two disks which contain the requested data (improved performance over RAID 0 if reads dominate)
– a write request must be performed on both disks, but the two writes can be done in parallel
– recovery is simple, but cost is high
[Figure: four disks; the strips on disks 0-1 are mirrored on disks 2-3]

RAID Levels 2 and 3
• parallel access: all disks participate in every I/O request
• small strips (byte or word size)
• RAID 2: an error-correcting code (Hamming) is calculated across corresponding bits on each data disk and stored on a logarithmic number of parity disks; necessary only if the error rate is high
• RAID 3: a single redundant disk keeps the parity, computed bitwise across the data disks (⊕ denotes XOR, i.e., modulo-2 addition):
    P(i) = X2(i) ⊕ X1(i) ⊕ X0(i)
• in the event of a failure, data can be reconstructed from the surviving disks, e.g. for a failed X2:
    X2(i) = P(i) ⊕ X1(i) ⊕ X0(i)
  but only one request at a time can be satisfied
[Figure: bits b0, b1, b2 on the data disks, parity P(b) on the parity disk]

RAID Levels 4 and 5
• independent access: each disk operates independently, so multiple I/O requests can be satisfied in parallel
• large strips
• RAID 4: parity strips are kept on a dedicated parity disk; a small write costs 2 reads + 2 writes
– example: if a write changes only strip X0 (old contents X0, new contents X0'), the new parity follows from the old data and old parity alone, since X0(i) ⊕ X0(i) = 0:
    P'(i) = X2(i) ⊕ X1(i) ⊕ X0'(i)
          = X2(i) ⊕ X1(i) ⊕ X0'(i) ⊕ X0(i) ⊕ X0(i)
          = P(i) ⊕ X0'(i) ⊕ X0(i)
– so it suffices to read the old X0 and the old P, and to write the new X0' and the new P' (see the sketch below)
[Figure: data strips with parity strips P(0-2), P(3-5) on a dedicated parity disk]
• RAID 5: like RAID 4, but the parity strips are distributed across all disks, avoiding the parity-disk bottleneck
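To make the parity arithmetic concrete, here is a minimal C sketch of a single stripe with three data strips X0..X2 and one parity strip P; the strip size and contents are arbitrary example values, not from the slides. It computes P, applies a RAID 4/5-style small write to X0 using only the old X0 and the old P, and then rebuilds a lost X2 from the survivors.

    #include <stdio.h>
    #include <string.h>

    #define STRIP 8                      /* bytes per strip, tiny for illustration */

    static void xor_into(unsigned char *dst, const unsigned char *src)
    {
        for (int i = 0; i < STRIP; i++)
            dst[i] ^= src[i];            /* bitwise modulo-2 addition */
    }

    int main(void)
    {
        unsigned char x0[STRIP] = "strip-0", x1[STRIP] = "strip-1",
                      x2[STRIP] = "strip-2", p[STRIP] = {0};

        /* initial parity: P(i) = X2(i) ^ X1(i) ^ X0(i) */
        xor_into(p, x0); xor_into(p, x1); xor_into(p, x2);

        /* small write to X0 only: on a real array, read old X0 and old P,
         * write new X0' and new P' -- X1 and X2 are never touched */
        unsigned char x0new[STRIP] = "STRIP-0";
        xor_into(p, x0);                 /* cancel the old X0 out of P */
        xor_into(p, x0new);              /* fold the new X0' into P    */
        memcpy(x0, x0new, STRIP);

        /* the disk holding X2 fails: rebuild it from the survivors */
        unsigned char rebuilt[STRIP] = {0};
        xor_into(rebuilt, p); xor_into(rebuilt, x1); xor_into(rebuilt, x0);

        printf("rebuilt X2 = \"%.8s\" (%s)\n", rebuilt,
               memcmp(rebuilt, x2, STRIP) == 0 ? "matches" : "mismatch");
        return 0;
    }

UNIX SVR4 I/O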