CS 5460: Operating Systems
Lecture 19: File System Implementation (Ch. 11-12)

File Allocation Strategies
Contiguous allocation
  – Files allocated (only) in contiguous blocks on disk
  – Analogous to base-and-bounds memory management
Linked file allocation
  – Maintain a linked list of the blocks that contain the file
  – At the end of each block, add a (hidden) pointer to the next block
Indexed file allocation
  – Maintain an array of block numbers in the inode
Multi-level indexed file allocation
  – Maintain an array of block numbers in the inode
  – Maintain pointers in the inode to blocks full of more block numbers (indirect blocks, double-indirect blocks, …)
Extents, and multi-level extents

Multi-level Indexed File Allocation
Inode contains:
  – Fixed-size array of direct blocks
  – Small array of indirect blocks
  – (Optional) double/triple indirect blocks
Good points
  – Simple offset → block computation for sequential or random access
  – Allows incremental growth/shrinkage
  – Fixed-size (small) inodes
  – Very fast access to (common) small files
Bad points
  – Indirection adds overhead to random access to large files
  – Blocks can be spread all over the disk → more seeks

Multi-level Indexed File Allocation
Example: 4.3 BSD file system
  – Inode contains 12 direct block addresses
  – Inode contains 1 indirect block address
  – Inode contains 1 double-indirect block address
If block addresses are 4 bytes and blocks are 1024 bytes, what is the maximum file size?
  – Number of block addresses per block = 1024 / 4 = 256
  – Number of blocks mapped by direct blocks → 12
  – Number of blocks mapped by indirect block → 256
  – Number of blocks mapped by double-indirect block → 256^2 = 65,536
  – Max file size: (12 + 256 + 65,536) * 1024 = 67,383,296 bytes (≈ 64 MiB)
Modern file systems have triple-indirect blocks

Links
Links let us have multiple names for the same file
Hard links: ln /foo/bar /tmp/moo
  [Figure: /foo directory entry "bar" and /tmp directory entry "moo" both hold the same inode#; the inode's link count is 2]
  – Two entries point to the same inode
  – Link count tracks connections
    » Decrement link count on delete
    » Only delete the file when the last connection is deleted
  – Problems: loops, unreachable directories, unreachable files
Soft links: ln -s /foo/bar /tmp/moo
  [Figure: /foo directory entry "bar" holds the inode#; /tmp directory entry "moo" holds the path /foo/bar; the inode's link count is 1]
  – Adds a symbolic pointer to the file
  – Special flag in directory entry
  – Only one real link to the file
    » File goes away when it's deleted
  – Problems: infinite loops

Mounting a Filesystem
Locate superblock(s)
Read file system format information
Initialize inode cache
Initialize buffer cache
Initialize name cache
Optional: perform sanity checks (more detail later)
  – UNIX / Linux / Mac OS X: fsck
  – Windows: ScanDisk / CHKDSK

File System Optimizations
Modern and historic techniques:
  Technique               Effect
  Disk buffer cache       Eliminates problem
  Aggregated disk I/O     Reduces seeks
  Prefetching             Overlap/hide disk access
  Disk head scheduling    Reduces seeks
  Disk interleaving       Reduces rotational latency
Goal: Reduce or hide expensive disk operations

Buffer/Page Cache
Idea: Keep recently used disk blocks in kernel memory
Process reads from a file:
  – If blocks are not in the buffer cache
    » Allocate space in the buffer cache (Q: What do we purge and how?)
    » Initiate a disk read
    » Block the process until disk operations complete
  – Copy data from the buffer cache to process memory
  – Finally, the system call returns
Usually, a process does not see the buffer cache directly
mmap() maps buffer cache pages into process RAM

Buffer/Page Cache
Process writes to a file:
  – If blocks are not in the buffer cache
    » Allocate pages
    » Initiate disk read
    » Block process until disk operations complete
  – Copy written data from process RAM to the buffer cache
Default: writes create dirty pages in the cache, then the system call returns
  – Data gets written to the device in the background
  – What if the file is unlinked before it goes to disk?
Optional: synchronous writes, which go to disk before the system call returns
  – Really slow!

Performing Large File I/Os
Idea: Try to allocate contiguous chunks of a file in large contiguous regions of the disk
  – Disks have excellent bandwidth, but poor latency
  – Amortize expensive seeks over many block reads/writes
Question: How?
  – Maintain a free block bitmap (cache parts in memory)
  – When you allocate blocks, use a modified best-fit algorithm rather than allocating a block at a time (pre-allocate, even)
Problem: Hard to do this when the disk is full/fragmented
  – Solution A: Keep a reserve (e.g., 10%) available at all times
  – Solution B: Run a disk defragger occasionally

Prefetching
Idea: Read blocks from disk ahead of user request
Goal: Reduce number of seeks visible to user
  – If a block is read before it is requested → hits in file buffer cache
[Figure: user requests Read 0, Read 1, Read 2 shown against the file system's Read 0, Read 1, Read 2]
Problem: What blocks should we prefetch?
  – Easy: Detect sequential access and prefetch ahead N blocks
  – Harder: Detect periodic/predictable random accesses

Disk Scheduling
Idea: Permute order of disk requests to reduce seeks
Some policies:
  – First-come, first-served (FCFS)
  – Shortest seek time first (SSTF)
  – SCAN (0 → 100, 100 → 0, 0 → 100, …)
  – C-SCAN (0 → 100, 0 → 100, …)
Example: head @ 30, requests 61, 40, 18, 78
  – FCFS: 30 → 61 → 40 → 18 → 78 = 31 + 21 + 22 + 60 → 134 tracks
    » Discussion: Lots of unnecessary seeks under load
  – SSTF: 30 → 40 → 61 → 78 → 18 = 10 + 21 + 17 + 60 → 108 tracks
    » Discussion: Starvation (How?), high variance
  – SCAN: 30 → 18 → 40 → 61 → 78 = 12 + 22 + 21 + 17 → 72 tracks
    » Discussion: Handles heavy load well, but is not fair: it favors requests near the middle of the disk
  – C-SCAN: 30 → 18 → 78 → 61 → 40 = 12 + 60 + 17 + 21 → 110 tracks
    » Discussion: Elevator-like, similar to SCAN, but more fair

Disk scheduling used to be done by the OS
  – Disks lacked the onboard processing power to do this
  – There were relatively few disk models, so it wasn't too hard for OSes to understand disk geometry
Now, disk scheduling is done on the disk
  – Disk hardware and firmware have gotten quite complicated
  – OS gives the disk a batch of requests, which complete in an order chosen by the disk, unless the OS forces sequential accesses

Example Operation: Open /tmp/foo
Open /tmp/foo
 1. Check whether we already know the inode for:
      » /tmp/foo (goto [B])
      » /tmp (goto [A])
 2. Check whether the root inode is in the inode cache (else read the root inode)
 3. Check permissions for user on the root directory
 4. Determine location of blocks containing the root directory
 5. Check whether each block is in the buffer cache
 6. Load the ones that are not and place them in the cache
 7. Search the root directory for an entry matching 'tmp' and extract its inode number
 8. [A] Check whether the inode is in the inode cache (else read it)
 9. Check permissions for user on directory /tmp
10. Determine location of blocks containing the /tmp directory
11. Check whether each block is in the buffer cache
12. Load the ones that are not and place them in the cache
13. Search the directory for an entry matching 'foo' and extract its inode number

Example Operation: Open /tmp/foo (cont'd)
14. [B] Check whether the inode is in the inode cache (else read it)
15. Check permissions for user on file /tmp/foo
16. Use the inode number to determine whether /tmp/foo is already open (i.e., has an entry in the open file table):
      » If not, allocate an entry in the open file table, mark it as being for that inode number, and add a link to the now in-core inode for /tmp/foo
17. Find a free slot in the per-process open file table
      » Return error if no space
      » Else, initialize entry, add link to the appropriate entry in the system-wide open file table
18. Initialize entry
19. Return index of the entry in the per-process open file table to the user as fd

Example: Seek to offset 10,000
fseek(fd, 10000, …)
1. Check that fd is a valid open file (return error if not)
2. Update seek_offset in the process open file table
3. Return
  – Optimization: Initiate prefetch at new file offset → why?

Example: Read 1000 bytes
fread(fd, buffer, 1000, …)
1. Check that fd is a valid open file (return error if not)
2. Check that [buffer, buffer+1000) is a valid user buffer
3. Determine which disk block(s) are already in the buffer cache
4. If any blocks are not in the buffer cache:
     1. Determine disk addresses of block(s) that need to be read → how?
     2. Initiate disk read operations to read necessary block(s)
     3. Put process on disk queue awaiting block read completion
5. Copy requested data from the buffer cache to the user buffer
6. Return

Exercises
fwrite(fd, buffer, 4096, …)
fclose(fd)
rename(oldname, newname) → tricky!
unlink("/tmp/foo") → delete file
link(existing, new) → create hard link
symlink(existing, new) → create soft link

Questions?
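
Worked check for the 4.3 BSD example (12 direct addresses, 1 indirect, 1 double-indirect, 4-byte addresses, 1024-byte blocks). The sketch below recomputes the maximum file size and shows how a byte offset maps to a direct, indirect, or double-indirect slot. The names BLOCK_SIZE, max_file_size, and locate are illustrative only and are not taken from any BSD source.

```python
BLOCK_SIZE = 1024                            # bytes per block (from the example)
ADDR_SIZE = 4                                # bytes per block address (from the example)
N_DIRECT = 12                                # direct block addresses in the inode
ADDRS_PER_BLOCK = BLOCK_SIZE // ADDR_SIZE    # 256 addresses fit in one block


def max_file_size():
    """Bytes addressable through direct, indirect, and double-indirect blocks."""
    blocks = N_DIRECT + ADDRS_PER_BLOCK + ADDRS_PER_BLOCK ** 2
    return blocks * BLOCK_SIZE


def locate(offset):
    """Map a byte offset to (level, indices) within the inode's block map.

    For "double-indirect", indices is (slot in the double-indirect block,
    slot in the indirect block that slot points to).
    """
    block = offset // BLOCK_SIZE
    if block < N_DIRECT:
        return ("direct", (block,))
    block -= N_DIRECT
    if block < ADDRS_PER_BLOCK:
        return ("indirect", (block,))
    block -= ADDRS_PER_BLOCK
    if block < ADDRS_PER_BLOCK ** 2:
        return ("double-indirect", divmod(block, ADDRS_PER_BLOCK))
    raise ValueError("offset beyond maximum file size")
```

For instance, max_file_size() gives 67,383,296, and locate(10000) lands in direct block 9, so a seek to offset 10,000 needs no indirect block at all.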
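
The hard-link and soft-link behavior from the Links slide can be observed from user space. This is a minimal sketch, assuming a POSIX system and a file system where os.link and os.symlink are permitted. The file names echo the slide's bar and moo but live in a temporary directory, and link_demo is a hypothetical helper written for this illustration.

```python
import os
import tempfile


def link_demo():
    """Create one file, a hard link, and a soft link; report what happens on delete."""
    with tempfile.TemporaryDirectory() as d:
        bar = os.path.join(d, "bar")
        hard = os.path.join(d, "moo_hard")
        soft = os.path.join(d, "moo_soft")
        with open(bar, "w") as f:
            f.write("hello")

        os.link(bar, hard)       # second directory entry for the same inode
        os.symlink(bar, soft)    # new file whose content is the path to bar

        result = {
            "same_inode": os.stat(bar).st_ino == os.stat(hard).st_ino,
            "nlink_after_link": os.stat(bar).st_nlink,
            "symlink_points_to_bar": os.readlink(soft) == bar,
        }

        os.unlink(bar)           # removes one name and decrements the link count

        result["nlink_after_unlink"] = os.stat(hard).st_nlink
        with open(hard) as f:
            result["hard_link_data"] = f.read()
        # The soft link still exists as a directory entry, but its target is gone.
        result["soft_link_dangling"] = os.path.islink(soft) and not os.path.exists(soft)
        return result
```

The hard link keeps the data alive because the inode's link count only drops to 1, while the soft link is left dangling because it stores a path rather than an inode number.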
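
The buffer cache read path and the "easy" prefetching case can be sketched together. The slides leave "What do we purge and how?" open, so the least-recently-used eviction here is only one possible answer and is an assumption of this sketch. The read_block callback is a hypothetical stand-in for a disk read, and prefetching is done synchronously for simplicity, whereas a real kernel would issue it in the background.

```python
from collections import OrderedDict


class BufferCache:
    """Toy block cache: LRU eviction plus sequential readahead (both assumptions)."""

    def __init__(self, capacity, read_block, readahead=2):
        self.capacity = capacity
        self.read_block = read_block   # hypothetical disk-read callback
        self.readahead = readahead     # the N in "prefetch ahead N blocks"
        self.blocks = OrderedDict()    # block number -> data, least recent first
        self.last = None               # last block number requested by the user
        self.disk_reads = 0

    def _load(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)        # mark as most recently used
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # purge the least recently used block
        self.blocks[n] = self.read_block(n)
        self.disk_reads += 1

    def read(self, n):
        """Return (data, hit) where hit says whether block n was already cached."""
        hit = n in self.blocks
        self._load(n)
        data = self.blocks[n]
        # Sequential access detected: fetch the next blocks before they are asked for.
        if self.last is not None and n == self.last + 1:
            for k in range(n + 1, n + 1 + self.readahead):
                self._load(k)
        self.last = n
        return data, hit
```

Reading blocks 0, 1, 2, 3 in order misses on the first two reads and hits on the rest, since the readahead triggered by block 1 has already brought in blocks 2 and 3.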
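
The disk scheduling example (head at 30, requests 61, 40, 18, 78) can be replayed with a short sketch. It follows the orderings used on the slide: SCAN first sweeps toward lower tracks and reverses at the last pending request rather than at the disk edge (a variant often called LOOK), and C-SCAN counts the jump back to the far end as seek distance. The function names are illustrative.

```python
def total_seek(start, order):
    """Sum of track distances travelled when serving requests in the given order."""
    pos, total = start, 0
    for track in order:
        total += abs(track - pos)
        pos = track
    return total


def fcfs(start, requests):
    """First-come, first-served: serve in arrival order."""
    return list(requests)


def sstf(start, requests):
    """Shortest seek time first: always serve the closest pending request."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order


def scan(start, requests):
    """Sweep down to the lowest pending request, then back up (as on the slide)."""
    down = sorted((t for t in requests if t <= start), reverse=True)
    up = sorted(t for t in requests if t > start)
    return down + up


def cscan(start, requests):
    """Serve only while moving down; after the lowest request, jump to the highest."""
    down = sorted((t for t in requests if t <= start), reverse=True)
    wrapped = sorted((t for t in requests if t > start), reverse=True)
    return down + wrapped
```

With start 30 and requests [61, 40, 18, 78], total_seek over the four orderings gives 134 (FCFS), 108 (SSTF), 72 (SCAN), and 110 (C-SCAN) tracks, matching the example.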