Lecture 07: Loading File into Memory
(DMA, buffer replacement, LRU)

Buffer
• each buffer -- holds one disk block (sector)
• kernel has N buffers -- shared by all
  – OS needs information about each buffer
    • user: Clinton, Bob, ... (who's using this buffer now)
    • hw: device, sector number
    • state: free/used (empty/waiting/reading/full/locked/writing/dirty)
• "buffer header" (struct)
  • stores all information about each buffer
  • points to the actual buffer
• buffer header has link fields (doubly linked)
  – device_list, free_buffer_list, I/O_wait_list
[Figure: cpu, Memory (buffer), DMA, disk sector]

"Buffer Cache"
• Managed like a CPU cache
  – read ahead (reada)
  – delayed write (dwrite)
• dwrite
  – just set the "dirty* bit" in the buffer cache (on update)
  – write to disk later (when the buffer is being replaced)
• reada
  – prefetch if the offset moves sequentially
* dirty: the data came from disk; later the memory copy was modified, so now the disk copy and the memory copy are different.

Delayed Write ---- Pros & cons
• Good performance
  – much disk traffic can be saved
• Complex reliability
  – logically a single piece of information
  – physically many copies (disk, buffer) -- inconsistency
  – If the system crashes ...

[Figure: timeline t with (1) problem detected, (2) computer full stop]

[Figure: timeline t from "problem detected & interrupt" to "computer full stop"; emergency action takes place during this period]
How many disk blocks can you save during this interval?

Crash ...
• Only a few blocks can be saved
• What happens if they cannot be saved…? If lost, the following goes wrong:
  – superblock: which block is free/occupied?
  – inode: pointer to file data block
  – data block: if a directory -- subtree structure; if a regular file -- just the file content
• metadata are more important
  – superblock, directory, inode

[Figure: disk layout with super block, root directory, inode, occupied blocks, holes, data]
Damage --- what if this block becomes a bad block?

Crash ...
• In a program: the sync(2) system call
  – sync(2) flushes (writes to disk) dirty buffers
    • it doesn't finish the disk I/O on return (it just queues the writes)
    • so call sync(2) twice … the 2nd return guarantees the flush
• At the keyboard
  – the update daemon calls sync(2) every 30 seconds -- periodic
  – halt(8), shutdown(8) call sync(2) -- by super user
    try man 8 intro …. (before logoff)
• Caution
  – Do not power down without sync(2) or halt(8)
  – Otherwise it amounts to a crash: dirty buffers never reach the disk.
What if it crashes?

fsck(8)
• file system check -- check & repair the file system
  – performed at system bootup time
  – start from the root inode -- mark all occupied blocks
  – start from the superblock -- mark all free blocks
  – something is wrong if:
    • some block has no incoming arc (unreachable)
    • some block has many incoming arcs (reached many times)
    • lost+found -- where fsck puts the unreferenced files it recovers
  – Very time-consuming
    10 ms * (1 GB / 1 KB) = 10 ms * 1 M blocks = 10 mega ms = 10,000 sec !!!

Design Goal
• Original UNIX file system design was
  – cheap, good performance
  – adequate reliability for School, SW house
    • on power failure
      – max. 30 seconds' amount of work is gone
      – most important metadata are saved
  – timesharing market (school, sw house)
• UNIX for a bank?
  – Need to solve these problems
[Figure: timeline with a flush every 30 sec; on power down, some contents are lost]

Modern systems
• System V
  – To reduce boot time (minimize downtime)
    • On successful return from sync(2), make a /fastboot file
    • If /fastboot exists, the system was shut down cleanly (don't fsck)
    • After a successful boot, remove the /fastboot file
    • If /fastboot doesn't exist, do fsck (only for file systems in /etc/fstab)
• Log Structured File System
  – collect dirty nodes in one big segment (~track size)
  – periodically write this log to disk
    • fast -- no seek/rotational delay
  – recovery is fast & complete

"remove b" Issues
• Transactional guarantee
  – Write all, or no write at all
  – "Account A to Account B (transfer $ 100)"
  – Atomic transaction
  – Write both or cancel both
• Ordering guarantee
  – "Delete file A"
    1. Modify parent directory's data block (file name A)
    2. Release file A's inode (address of data block sectors, …)
    3. Release file A's data block
  – Suggested order: (3 2 1)
  – Otherwise, A's inode exists, pointer exists, wrong data …
  – Write the next block to disk only if the previous write is complete: synchronous write
[Figure: directory with entries a, b, dev, bin and i-numbers 7, 9, 11, 45; inode of b with pointers[ ] to data blocks]
** Reference: Vahalia, 11.7.2

Back to buffer cache

Some buffers are linked to the free buffer pool
[Figure: free buffers 22, 14, 74, 23, 25, 37, 88, 45, 11, 83, 32, 19]

Some buffers are allocated to a device
[Figure: Disk 3 with buffers 11, 18, 64, 43, 15, 97, 23, 44, 10, 33, 54, 99]

Allocate buffers to whom?
[Figure: Process 1, user, offset, inode, dev, buffer cache, CPU (Linux)]

Buffer header has a flag
Among the buffers allocated to a device ...
• some are waiting to do DMA
• some are currently doing DMA
• others have done DMA
(I/O wait queue) within (dev)
[Figure: Disk 3 with buffers 11, 18, 64, 43, 15, 97, 23, 44, 10, 33, 54, 99]

Some buffers are waiting for disk I/O
[Figure: I/O wait queue on Disk 3 with buffers 11, 18, 43, 15, 23, 44, 33, 54; some are waiting to do DMA, some have done DMA]

    struct buf {
        int         b_flags;    /* see defines below */
        struct buf *b_forw;     /* headed by devtab of b_dev */
        struct buf *b_back;     /*  "  */
        struct buf *av_forw;    /* position on free list, */
        struct buf *av_back;    /*     if not BUSY */
        int         b_dev;      /* major+minor device name */
        char       *b_blkno;    /* block # on device */
        int         b_wcount;   /* transfer count (usu. words) */
        char        b_error;    /* returned after I/O */
        char       *b_addr;     /* low order core address */
        char       *b_xmem;     /* high order core address */
    } buf[NBUF];

    struct buf bfreelist;

    struct devtab {
        char        d_active;   /* busy flag */
        char        d_errcnt;   /* error count (for recovery) */
        struct buf *b_forw;     /* first buffer for this dev */
        struct buf *b_back;     /* last buffer for this dev */
        struct buf *d_actf;     /* head of I/O queue */
        struct buf *d_actl;     /* tail of I/O queue */
    };

[Figure: struct devtab (d_active, b_forw, b_back, d_actf, d_actl) heading the device's buffers 11, 18, 64, 43, 15, 97, 23, 44, 10, 33, 54, 99; d_actf/d_actl head the I/O waiting buffers]

Remember .. OS Kernel (plain C program with variables and functions)
[Figure: Process 1, 2, 3, each with a PCB; CPU, mem, disk, tty. Legend: Table (data structure), Object (hardware or software)]

Kernel Data Structure
[Figure: Process 1, user, offset, inode, devswtab, disk_read( ), devtab, buffer cache, CPU; on disk: superblock, inode, data; tree / with bin, cc, date, etc, sh, getty, passwd]

– Each buffer header has 4 link fields
  – a buf can belong to two doubly linked lists at a time
– read(fd) system call
  • get offset
  • get inode
    – check access permission (rwx rwx rwx)
    – mapping: offset to sector address
    – get major/minor device number
  • search buffer cache (buffer header has disk & sector #)
    – start from the device table, traverse the links
    – compare each buffer with the sector address
  • if already in buffer cache, done
  • if miss, then arrange to read from disk
[Figure: user file table: fd, offset, inode, dev]

– read() system call
    { fd -> offset -> inode -> device -> search buffer list }
    if (hit) then
        done                            /* return data from buffer cache */
    else                                /* buffer cache miss – must read disk */
        if (free buf available?) then   /* using this free buffer, read disk */
            get buf -> read disk -> fill buf -> done
        else                            /* need replacement first */
            { get the least recently used (LRU) buffer
              if (dirty?) { write old content first -- delayed write }
              { read disk -> fill buf -> done } }

mounting
System can have many file systems
Compare with Windows {C: D: E: ...}

<Logically>
At bootup time, specify which F.S.
to boot as the "root file system"
[Figure: FS 1, FS 2, FS 3, each laid out as Bootblock, Superblock, Inode list, Data block]

<Logically>
FS 1 is the "root file system"
[Figure: FS 1 (dsk1) holds the tree / with bin, date, etc, sh, getty, usr, passwd; FS 2 (dsk2) and FS 3 (dsk3) have the same Bootblock, Superblock, Inode list, Data block layout]
Now all files under the root file system can be accessed
But how do we access files in the other file systems?
(Windows: C: D: E:)

<Logically>
Mount it!
[Figure: FS 1 (dsk1) tree / with bin, date, etc, sh, usr, getty, passwd; FS 3 (/dev/dsk3) tree / with bin, include, src, banner, yacc, stdio.h, uts]

System call
mount (path1, path2, option)
• path1 (which): dev special file, e.g. /dev/dsk3
• path2 (where): mount point, e.g. /usr
• option (how): e.g. read-only
After mounting, /dev/dsk3 is accessed as /usr
[Figure: i-numbers in disk-1: root, superblock, / with date, etc, bin, sh, getty, usr, passwd; i-numbers in disk-2: root, superblock, bin, banner, include, yacc, stdio.h, src, uts]

Mount Table Entry
Purpose:
- resolve pathname
- locate superblock
[Figure: mount table entry with device number and pointers to the superblock, inode (/usr) and inode (root); trees / with date, usr, etc, bin, sh, getty, passwd and the mounted tree with bin, banner, include, yacc, stdio.h, src, uts]

Relationship between Tables
[Figure: Inode table (inode of /usr, inode of dsk 3 root), Mount table (Superblock, Mounted-on inode, Root inode), Buffer Cache (buf)]

Disk File System
• Boot block
• Superblock: pointers to free space on disk
• inode list: pointers to data blocks
• data block
• mounting file system
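
The read() path from the buffer cache slides (hit, free buffer, or LRU replacement with a delayed write) can be sketched in C. This is a minimal simulation, not the kernel's code: struct sbuf, getblk(), bwrite_delayed(), the in-memory disk array, and the sizes NBUF, BLKSZ and NBLK are all hypothetical names chosen for illustration. It models a single device and uses a use-counter in place of the linked free list.

```c
#include <string.h>

#define NBUF  3    /* tiny cache so replacement is easy to trigger (assumed) */
#define BLKSZ 16   /* hypothetical block size in bytes */
#define NBLK  8    /* blocks on the simulated disk; blkno must be < NBLK */

struct sbuf {                 /* simplified stand-in for a buffer header */
    int dev, blkno;           /* which disk block this buffer holds */
    int valid, dirty;         /* state flags */
    unsigned long used;       /* last-use tick, stands in for LRU list order */
    char data[BLKSZ];
};

static struct sbuf cache[NBUF];
static char disk[NBLK][BLKSZ];      /* simulated disk (one device only) */
static unsigned long tick;
static int disk_reads, disk_writes; /* counters to observe disk traffic */

/* Return the buffer holding (dev, blkno), reading from disk on a miss. */
struct sbuf *getblk(int dev, int blkno)
{
    struct sbuf *victim = NULL;

    for (int i = 0; i < NBUF; i++) {
        struct sbuf *b = &cache[i];
        if (b->valid && b->dev == dev && b->blkno == blkno) {
            b->used = ++tick;               /* hit: no disk I/O needed */
            return b;
        }
        if (!b->valid) {                    /* a free buffer beats any used one */
            if (victim == NULL || victim->valid)
                victim = b;
        } else if (victim == NULL ||
                   (victim->valid && b->used < victim->used)) {
            victim = b;                     /* least recently used so far */
        }
    }

    if (victim->valid && victim->dirty) {   /* the delayed write happens now */
        memcpy(disk[victim->blkno], victim->data, BLKSZ);
        disk_writes++;
    }
    memcpy(victim->data, disk[blkno], BLKSZ);   /* read disk, fill buf */
    disk_reads++;
    victim->dev = dev;
    victim->blkno = blkno;
    victim->valid = 1;
    victim->dirty = 0;
    victim->used = ++tick;
    return victim;
}

/* Delayed write: update the memory copy and only set the dirty bit. */
void bwrite_delayed(struct sbuf *b, const char *src)
{
    strncpy(b->data, src, BLKSZ - 1);
    b->data[BLKSZ - 1] = '\0';
    b->dirty = 1;
}
```

In this sketch a modified block reaches the simulated disk only when its buffer is chosen for replacement, which is the window in which a crash would lose the update.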
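
The mount table's role in pathname resolution can also be illustrated with a short C sketch. It assumes a fixed-size table and identifies an inode by a (device, i-number) pair; struct inode_id, struct mount_entry, do_mount() and cross_mount() are hypothetical names, not the kernel's, and the device and i-numbers used with them are made up for illustration.

```c
#define NMOUNT 4   /* hypothetical size of the mount table */

struct inode_id {
    int dev;       /* device number */
    int ino;       /* i-number within that device */
};

struct mount_entry {
    int in_use;
    int dev;                     /* device holding the mounted file system */
    struct inode_id mounted_on;  /* mount point inode, e.g. /usr on the root FS */
    int root_ino;                /* i-number of the mounted file system's root */
};

struct mount_entry mount_table[NMOUNT];

/* Record that `dev` is mounted on `point`. Returns 0, or -1 if the table is full. */
int do_mount(int dev, struct inode_id point, int root_ino)
{
    for (int i = 0; i < NMOUNT; i++) {
        if (!mount_table[i].in_use) {
            mount_table[i].in_use = 1;
            mount_table[i].dev = dev;
            mount_table[i].mounted_on = point;
            mount_table[i].root_ino = root_ino;
            return 0;
        }
    }
    return -1;
}

/*
 * During pathname resolution: if `id` is a mount point, continue from the
 * root inode of the mounted file system; otherwise keep `id` unchanged.
 */
struct inode_id cross_mount(struct inode_id id)
{
    for (int i = 0; i < NMOUNT; i++) {
        const struct mount_entry *m = &mount_table[i];
        if (m->in_use && m->mounted_on.dev == id.dev &&
            m->mounted_on.ino == id.ino) {
            struct inode_id root = { m->dev, m->root_ino };
            return root;
        }
    }
    return id;
}
```

For example, after do_mount(3, usr, 2), where usr is the inode of /usr on device 1, a lookup that reaches /usr is redirected by cross_mount() to i-number 2 on device 3, so the remaining path components are resolved inside the mounted file system.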