File System Internals
Sunny Gleason
COM S 414
November 29, 2001

In this Lecture
• The Hard Disk
  – Architecture
  – Performance
• File System Structures
• Local File Systems
  – Like FAT, UFS, Ext2, Ext3

Where to Find More Info
• Hard Drive Manufacturers
  – http://www.storage.ibm.com/hdd/index.htm
  – http://www.seagate.com/newsinfo/technology/
  – http://www.westerndigital.com/library/
• Windows File Systems
  – http://www.microsoft.com/hwdev/download/hardware/fatgen103.pdf
  – http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fsys_10ku.asp

Where to Find More Info
• Unix File Systems
  – http://www-106.ibm.com/developerworks/library/l-fs.html
• NFS Version 4
  – http://nfsv4.org/
• The Actual Code
  – Linux Kernel Source
    • http://www.kernel.org/
    • (look in the “fs” directory of any 2.4 kernel)
  – BSD Kernel Source
    • http://www.openbsd.org/
    • (look in the “sys/ufs” directory)

Where to Find More Info
• The Book
  – Chapters 11, 12, …
• CS414 Spring 2001 Web Site
  – http://www.cs.cornell.edu/Courses/cs414/2001sp/
  – (from which these slides are mostly stolen…)
• CS414 Fall 2000 Web Site
  – http://www.cs.cornell.edu/Courses/cs414/2000fa/
  – (other useful slide sets available)

The Memory Hierarchy
• Memory is arranged as a hierarchy:
  – Close to CPU:
    • Registers, L1 cache
    • L2 cache
  – RAM (primary memory)
  – Disk storage (secondary memory)
  – Tape or optical storage (tertiary memory)
• Higher in the hierarchy = higher speed, higher cost per byte

Hard Disk: Architecture
• A disk drive has several physical components
  – Spindle
  – Surface (one side of a platter in the pack)
  – Read/write arm and head
  – Track (a cylinder is the vertical set of tracks)
  – Sector

Physical Disk Access
• Delays associated with accessing a sector on the disk:
  – Seek delay (biggest)
    • Moving the read/write head
  – Rotational delay
    • Waiting for the sector to spin under the head
  – Transfer delay (smallest)
    • Transferring the bits from the disk

Physical Disks
• O/S goal: provide a file system API
• Problems with disks:
  – Read errors
  – Bad blocks
  – Missed seeks
• The O/S disk API may have many levels:
  – Physical disk block <surf#, cyl#, sec#>
  – Disk (volume) logical block <block#>
  – File logical <file block, record, or byte#>

Logical Disks
• A single hard disk may contain multiple file systems

Making the HD Usable
• The hard disk must be partitioned
• Partitions are formatted with specific filesystems
• In some cases, a “quick format” can be used instead of a full reformat
• Multiple partitions are useful
  – (Limited) protection against crashes
  – If one partition fills up, the rest are still usable
  – “Dual-booting”: in general, the ability to load multiple operating systems

Some Typical Numbers
• Sector size: 512 bytes
• Cylinders per disk: 6962
• Platters: 3 - 12
• Rotational speed: 10,000 rpm
• Storage size: 12 - 120 GB
• Seek time: 5 - 12 ms
• Latency: 3 ms
• Transfer rate: 14 - 20 MB/sec

Disk Structure
• Bare disk interface: cylinders, sectors
• O/S imposes structure on disks
• Disk contents:
  – Data: user files
  – Metadata: structural / administrative info
    • Any ideas?
• Free list: structure indicating which blocks are unused
• Typically maintained as a bitmap: an array of bits, one per block

Dealing with Mechanical Latencies
• Caches
  – Locality in file access
• RAM disk
  – Reserve RAM as a [fast!] filesystem
• RAID
  – Exploiting parallelism
• Clever layouts and scheduling algorithms
  – Head scheduling
  – Meta-information layout

Bad Blocks
• All disks have some bad blocks
• Blocks go bad as time goes on
• O/S removes these blocks from the allocation map
• On some disks, some cylinders have reserve blocks that can be remapped to replace bad blocks

The File System
• The file system supports the abstraction of file objects
  – Create, delete, read, write, rename
• File: a named collection of data
• Typical abstraction: a vector of bytes
• O/S knows about special file types:
  – Directories, symlinks, executable files
• For data files, applications decide the internal file structure (data file format)

Accessing Files
• Files can be accessed in different ways:
  – Sequential access
    • Read bytes one at a time, in order
  – Direct access
    • Random access, given block/byte number
  – Record access
    • Some higher-level structure, instead of bytes
  – Indexed access
    • Uses a map from an index field to the corresponding file record

Storing Files
• Files can be allocated in different ways:
  – Contiguous allocation
    • All bytes together, in order
  – Linked structure
    • Each block points to the next block
  – Indexed structure
    • An index block contains pointers to many other blocks
  – Rhetorical questions: which is best?
    • For sequential access? Random access?
    • Large files? Small files? Mixed?

Linked-List Allocation
• Each data block contains a pointer to the next data block
• Advantages?
• Disadvantages?
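To make the questions on linked-list allocation concrete, here is a minimal sketch. It assumes a hypothetical in-memory "disk" modeled as a Python dictionary; the block numbers, payloads, and function names are illustrative and do not come from any real file system. It counts how many blocks must be read to reach a given position in a file.

```python
# Hypothetical "disk": block number -> (payload, next block number or None).
# The layout is an illustrative assumption, not a real on-disk format.
disk = {
    7: (b"AAAA", 2),
    2: (b"BBBB", 9),
    9: (b"CCCC", 4),
    4: (b"DD", None),
}


def read_block_of_file(disk, first_block, index):
    """Return (payload, blocks_read) for the index-th block of a linked file."""
    current = first_block
    blocks_read = 0
    for _ in range(index):
        # Each block must be read just to learn where its successor lives.
        _, current = disk[current]
        blocks_read += 1
        if current is None:
            raise IndexError("file has fewer blocks than requested")
    payload, _ = disk[current]
    blocks_read += 1
    return payload, blocks_read


def read_whole_file(disk, first_block):
    """Follow the chain from the first block and concatenate the payloads."""
    data = b""
    current = first_block
    while current is not None:
        payload, current = disk[current]
        data += payload
    return data


if __name__ == "__main__":
    print(read_block_of_file(disk, 7, 0))  # first block: one read
    print(read_block_of_file(disk, 7, 3))  # last block: four reads
    print(read_whole_file(disk, 7))
```

Sequential reads cost one block read per block, while reaching block k directly costs k + 1 reads, so the cost of a random access grows with the position in the file.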
Linked-List Allocation
• A single pointer is sufficient to locate all the blocks of the file
• Seeking takes O(n) time, where n is the number of blocks in the file
• A single corrupt pointer can cause the rest of the file to be lost

MS-DOS Filesystem
• MS-DOS uses a File Allocation Table (FAT)
• Like a linked structure, except pointers are kept in a separate table
  – For every block, the FAT keeps track of whether or not it is allocated, and if so, which block it points to
  – Two copies of the FAT on disk

Indexed Allocation
• Index block contains pointers to each data block
• Pros?
• Cons?

Combined Scheme: UFS
• Unix File System
• An inode contains the metadata for UNIX files
  – Contains control and allocation information
  – Each inode contains 15 block pointers
    • 12 direct
    • 1 single, 1 double, 1 triple indirect
  – Kind of tricky: see the diagram!

UNIX Inode
[Diagram: UNIX inode]

UNIX Inode
• If data blocks are 4K (and block pointers are 4 bytes, so an indirect block holds 1024 pointers) …
  – First 48K reachable from the inode
  – Next 4MB available from the single-indirect block
  – Next 4GB available from the double-indirect block
  – Next 4TB (!) available through the triple-indirect block
• Any block can be located with at most 3 extra disk accesses (the indirect blocks), once the inode is in memory

UNIX Directories
• Directories are just like regular files
  – They contain <filename, inode#> tuples
  – The filename is usually stored as the name plus its length
• Example (from the figure): a directory containing usr → 3, home → 4, etc → 5; the directory stored at inode 4 contains ken → 7, hopkik → 9, gleason → 12

UNIX Disk Layout
Boot Block | Superblock | Inodes | Data Blocks | …
• Boot block provides information on how to boot the computer (tiny “bootstrap” program)
• Superblock contains the file system layout: # of inodes, block size, location of the free list

File System Problems
• Fragmentation
  – When the blocks of a file are located all over the physical disk
  – Causes undesirable seeking
  – Use a defragmentation utility to compact the filesystem and consolidate free space
  – See the pictures!
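As a rough illustration of the fragmentation problem above, the sketch below counts how many contiguous runs (extents) each file occupies, then computes a compacted layout. The file names, block lists, and helper names are hypothetical, and a real defragmenter must also move the data and update metadata safely; this only computes where the blocks would end up.

```python
def count_extents(blocks):
    """Number of contiguous runs in a file's ordered list of block numbers."""
    if not blocks:
        return 0
    extents = 1
    for prev, cur in zip(blocks, blocks[1:]):
        if cur != prev + 1:
            # A jump in block numbers means the head has to move elsewhere.
            extents += 1
    return extents


def defragment(files):
    """Assign each file a contiguous run of blocks, packing from block 0."""
    next_free = 0
    layout = {}
    for name, blocks in files.items():
        layout[name] = list(range(next_free, next_free + len(blocks)))
        next_free += len(blocks)
    return layout


if __name__ == "__main__":
    # Hypothetical files and their (fragmented) block lists.
    files = {"a.txt": [3, 17, 4, 42], "b.txt": [10, 11, 30]}
    for name, blocks in files.items():
        print(name, "extents before:", count_extents(blocks))
    for name, blocks in defragment(files).items():
        print(name, "extents after:", count_extents(blocks), blocks)
```

Fewer extents per file means fewer jumps between non-adjacent blocks during a sequential read, which is what the defragmentation utility is trying to achieve.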
[Figures: Fragmentation; Defragmentation]

File System Problems
• Unreliability
  – Historically, disks have been among the most unreliable components
    • They develop “bad blocks”
    • Modern disks detect such faults, and have replacement blocks that can be remapped to replace bad blocks
    • Filesystems still need to track bad blocks and avoid using them
    • Inode 1 is a special inode that keeps track of where all the bad blocks are

File System Problems
• System crashes or power failures can occur at any time
  – Any disk operation can be interrupted at any time
  – Need to ensure that the filesystem is consistent throughout updates
    • Data that is being modified may be lost, but that should not compromise the entire file system

File System Problems
• Crashes can occur at any time
  – A write in UNIX involves:
    • Writing the new data
    • Updating the inode
    • Updating the free list
  – Is there a correct order? What can go wrong if the FS does not respect the order?

Disk Scheduling
• To minimize mechanical delays, the O/S looks at multiple pending disk requests
  – FCFS (first come, first served)
    • OK when load is low
    • Long waiting times for long request queues
  – SSTF (shortest seek time first)
    • Always minimizes arm movement for the next request, which improves throughput
    • Favors middle blocks
  – SCAN (elevator)
    • Continue in the same direction until done, then reverse direction and service requests in that order
  – C-SCAN: like SCAN, but return to 0 at the end

Disk Scheduling
• In general, scheduling only matters when there are request queues
• The O/S may locate files strategically for performance reasons
  – The Organ Pipe distribution locates heavily used files towards the center of the disk
  – The Ext2 filesystem places groups of inodes around the disk, closer to the data blocks that they reference

Conclusion
• Hard disks provide vast amounts of slow, cheap storage
• Operating systems layer file system services on top of the raw disk API
• The O/S must find ways to work around the slow performance and unreliability of disk storage

Thanks!
• Any questions?
• Review session - Tuesday, 12/04 5:30 - 7:30pm
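To compare the policies from the Disk Scheduling slides, here is a minimal sketch that computes the service order and total head movement for FCFS, SSTF, and SCAN. The request queue and starting head position are made-up illustrative values, the function names are hypothetical, and SCAN is modeled as on the slide: it reverses once no requests remain in the current direction rather than travelling to the edge of the disk.

```python
def fcfs(head, requests):
    """First come, first served: service requests in arrival order."""
    return list(requests)


def sstf(head, requests):
    """Shortest seek time first: always pick the closest pending cylinder."""
    pending = list(requests)
    order = []
    while pending:
        nearest = min(pending, key=lambda cyl: abs(cyl - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest
    return order


def scan(head, requests, upward=True):
    """Elevator: sweep one way until done, then reverse and service the rest."""
    higher = sorted(cyl for cyl in requests if cyl >= head)
    lower = sorted((cyl for cyl in requests if cyl < head), reverse=True)
    return higher + lower if upward else lower + higher


def total_movement(head, order):
    """Total number of cylinders the arm travels for a given service order."""
    total = 0
    for cyl in order:
        total += abs(cyl - head)
        head = cyl
    return total


if __name__ == "__main__":
    start = 53                                    # assumed starting cylinder
    queue = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed pending requests
    for name, policy in (("FCFS", fcfs), ("SSTF", sstf), ("SCAN", scan)):
        order = policy(start, queue)
        print(name, order, total_movement(start, order))
```

For this particular queue, FCFS moves the arm 640 cylinders, SSTF 236, and an upward SCAN 299; the numbers depend entirely on the assumed queue and head position.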