Connecting with Computer Science, 2e
Chapter 10
File Structures

Objectives
• In this chapter you will:
  – Learn what a file system does
  – Understand the FAT file system and its advantages and disadvantages
  – Understand the NTFS file system and its advantages and disadvantages
  – Compare common file systems
  – Learn how sequential and random file access work
  – See how hashing is used
  – Understand how hashing algorithms are created

Why You Need to Know About...File Structures
• Knowledge of how an operating system stores and maintains data in a computer
  – Allows better comprehension of how a computer handles and manipulates files
  – Helps you keep the computer running as efficiently as possible

What Does a File System Do?
• Responsibilities
  – Creating, manipulating, renaming, copying, and removing files on a storage device
  – Organizing files into common storage units
    • Called directories
  – Keeping track of file and directory locations
  – Assisting users
    • Relates files and folders to the physical structure of the storage medium

What Does a File System Do? (cont’d.)
• Files used by operating systems and applications
  – Word-processing documents
  – Source code for programs you have written
  – Music files
  – Movie files
  – Spreadsheets
  – Photos
• Operating systems use a file folder icon to represent a directory

What Does a File System Do? (cont’d.)
Figure 10-1, Files and directories in a file system are similar to documents and folders in a filing cabinet

What Does a File System Do? (cont’d.)
Figure 10-2, Folders and files in Windows

What Does a File System Do? (cont’d.)
• Hard disk
  – Most common storage medium for a file system
  – Physically organized into tracks and sectors
  – Read/write heads move over specified areas of the hard disk to store (write) or retrieve (read) data
  – Random access device
    • Reads or writes data directly at any location on the disk
    • Faster than sequential access, which reads and writes from beginning to end
  – Makes use of the file system to organize files

File Systems and Operating Systems
• File management system
  – Dependent on the operating system
• FAT (File Allocation Table)
  – Used from MS-DOS to Windows ME
• NTFS (New Technology File System)
  – Default for Windows
• Unix and Linux support several file systems
  – XFS, JFS, ReiserFS, ext3, others
• Mac OS X file system
  – HFS and HFS+

FAT
• Groups hard drive sectors into clusters
  – Increases performance by organizing blocks of sectors contiguously
• Maintains a relationship between files and clusters
  – Clusters have two entries in the FAT
    • Current cluster information
    • Link to next cluster or special code indicating the last cluster
• Keeps track of writable clusters and bad clusters

FAT (cont’d.)
Figure 10-3, Sectors are grouped into clusters on a hard disk

FAT (cont’d.)
• Hard drive organization
  – Partition boot sector
    • Contains information on how to access volumes
  – Main and backup FAT
    • If an error occurs in reading the main FAT, the backup is copied to the main to ensure stability
  – Root directory
    • Contains entries for every file and folder in the directory
  – Data area
    • Measured in clusters

FAT (cont’d.)
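
The link from each cluster to the next one can be pictured with a minimal sketch. It assumes a toy table held in a Python dict; the cluster numbers, the END_OF_CHAIN marker, and the cluster_chain function are hypothetical illustrations, not the on-disk FAT format.

```python
# Toy model of a file allocation table: each entry maps a cluster number
# to the next cluster of the same file. END_OF_CHAIN is a hypothetical
# stand-in for the special code that marks the last cluster.
END_OF_CHAIN = -1


def cluster_chain(fat, first_cluster):
    """Return the clusters a file occupies, in the order they are linked."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            # A damaged table could link back to an earlier cluster.
            raise ValueError("loop in cluster chain")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


# A fragmented file: it starts at cluster 2, continues at 5, and ends at 3.
fat = {2: 5, 5: 3, 3: END_OF_CHAIN}
print(cluster_chain(fat, 2))
```

Because the clusters are linked rather than required to sit side by side, a file can be stored in whatever clusters are free, which is also why files can become fragmented.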
Figure 10-4, Typical FAT file system

Disk Fragmentation
• File clusters scattered in different locations on the storage medium
  – Windows provides the Disk Defragmenter utility
    • Reorganizes clusters contiguously
    • Improves performance
    • Minimizes movement of the read/write heads
    • Use regularly to ensure the system runs at peak performance

Disk Fragmentation (cont’d.)
Figure 10-5, Files become fragmented as they’re stored in noncontiguous clusters; a defragmenting utility moves files to contiguous clusters and improves disk performance

Advantages of FAT
• Efficient use of disk space
  – Does not have to use contiguous space for large files
• File names up to 255 characters (FAT32)
• Easy to recover files after deletion
  – System places E5h in the first position of the filename
    • File remains on drive
    • Replace E5h with the original first letter of the filename to restore the file

Disadvantages of FAT
• Performance slows down as more files are stored on the partition
• Hard drive fragments easily
• Lack of security
  – NTFS provides access rights to files and directories
• File integrity problems
  – Lost clusters
  – Invalid files and directories
  – Allocation errors

NTFS
• Overcomes FAT system limitations
• “Journaling” file system
  – Keeps track of transactions performed
  – “Rolls back” transactions if errors are found
• Uses a Master File Table (MFT)
  – Stores data about all files and directories
  – Similar to a database table with records
• Uses clusters
• Reserves blocks of space to allow the MFT to grow

Advantages of NTFS
• File access is very fast and reliable
• MFT allows system recovery from problems without losing significant amounts of data
• Security is greatly increased over FAT
• File encryption with EFS (Encrypting File System)
• File compression reduces file size
  – Saves disk space

Disadvantages of NTFS
• Large overhead
  – Not recommended for volumes less than 4 GB
• Cannot access NTFS volumes from:
  – MS-DOS
  – Windows 95
  – Windows 98
  – Linux

Comparing File Systems
• Choosing the correct file system
  – Operating system dependent
  – Rarely depends on hardware
• NTFS: Windows XP or Vista
  – Supports drive sizes up to 16 TB (about 16,000 GB)
• FAT: Windows 9x
  – Older small hard drives, small removable devices
• UNIX/Linux
  – Many file system choices

Comparing File Systems (cont’d.)
Table 10-1, FAT16, FAT32, and NTFS compared

Comparing File Systems (cont’d.)
Table 10-2, Some UNIX/Linux file systems

File Organization
• Topics covered:
  – File characteristics
  – How files are stored on disks and other media

Binary or Text
• Text files
  – Consist of ASCII or Unicode characters
  – Typically read with word-processing programs or text editors
    • Easy to view and modify
• Binary files
  – Computer readable (not human readable)
  – Coded and numeric information
  – More compact than text files
    • Examples: executable programs, applications, sound and image files

Sequential or Random Access
• Sequential storage
  – Data accessed one chunk after the other in order
• Random storage
  – Data accessed in any order
  – Also called direct or relative access

Sequential or Random Access (cont’d.)
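
The difference between the two access styles can be sketched in a few lines. This is only an illustration under stated assumptions: the file is an in-memory buffer, every record is padded to a hypothetical fixed length of 16 bytes, and the function names are made up for the example.

```python
import io

RECORD_SIZE = 16  # hypothetical fixed record length in bytes


def make_file(names):
    """Build an in-memory file of fixed-length, space-padded records."""
    data = b"".join(name.encode("ascii").ljust(RECORD_SIZE) for name in names)
    return io.BytesIO(data)


def read_sequential(f, target):
    """Read every record from the start until the target record is reached."""
    f.seek(0)
    record = b""
    for _ in range(target + 1):
        record = f.read(RECORD_SIZE)
    return record.decode("ascii").rstrip()


def read_random(f, target):
    """Jump straight to the target record by computing its byte offset."""
    f.seek(target * RECORD_SIZE)
    return f.read(RECORD_SIZE).decode("ascii").rstrip()


f = make_file(["Ada", "Grace", "Linus", "Dennis"])
print(read_sequential(f, 2), read_random(f, 2))
```

Both calls return the same record, but the sequential version has to read records 0 and 1 first, while the random version performs a single seek.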
Figure 10-6, Sequential versus random access

Sequential Access
• Starts at the beginning and processes to the end of the file
  – Writing process is very fast
    • New data added to the end of a file
  – Retrieving, inserting, deleting, modifying data
    • Very slow
  – Stores data in rows like a database record
    • Field delimiters or specific fixed sizes for each field

Sequential Access (cont’d.)
Figure 10-7, A comma can be used as a field delimiter

Sequential Access (cont’d.)
Figure 10-8, Data can also be in fixed-length format

Random Access
• Provides faster access to large amounts of data
• Stores fixed-length records (relative records)
  – Ability to mathematically calculate the record’s position on the disk surface and go right to it
  – Ability to update records in place
• May waste disk space
  – Partial record or no data
• Works well when a sequential record number can easily identify records

Random Access (cont’d.)
Figure 10-9, Record organization and file access

Hashing
• Used for accessing relative record files
  – Uses a unique value called a hash key
• Widely used in database management systems
• Involves a hashing algorithm to generate hash keys for each record
  – Combining hash keys establishes an index to rows or records of information

Why Hash?
• Allows a key field number not suited for relative file access to be converted into a relative record number
  – Example: phone numbers as keys in a customer information table
    • Divide the highest possible phone number by the expected number of customers to get the divisor used for hashing
    • 9999999999 / 2000 (estimated number of customers) = approximately 5,000,000
    • Phone number 7025551234 / 5,000,000 gives the record number 1405 (fraction discarded)

Why Hash? (cont’d.)
• Hashing may result in collisions
  – Same relative key is generated for more than one original key value
• One solution:
  – Expand algorithm to add the sum of the digits of the phone number to the relative key
    • Sum of the digits in phone number 7025551234 is 34
    • Original key 1405 + 34 = 1439
    • Lessens collisions but does not eliminate them

Dealing with Collisions
• Even the best hashing algorithms have collisions
• One solution: create overflow area
  – Records with duplicate record numbers are placed in the overflow area at the end of the file
  – Record retrieval
    • Hash key is calculated, and the record at the calculated position is retrieved
    • If the record at that location isn’t the correct one, the overflow area is searched sequentially

Dealing with Collisions (cont’d.)
Figure 10-10, An overflow area helps resolve collisions

Hashing and Computing
• Efficient hashing algorithm
  – Important to companies producing database management systems
• Many different hashing algorithms are used in computing
  – Encryption and decryption
  – Indexing
  – Many programming languages have specialized libraries of built-in hashing routines

One Last Thought
• Determining a computer system’s worth
  – Often measured in terms of data stored on hard drives
  – Data can be difficult to replace
• Data storage dependent on file systems
• Strong understanding of file systems allows more data availability and protection

Summary
• Hard drive
  – Random access device
  – Stores information in tracks and sectors
  – Accesses data through read/write heads
• File system
  – Responsible for creating, manipulating, renaming, copying, and removing files from a storage device
• Windows uses either FAT or NTFS

Summary (cont’d.)
• FAT keeps track of which files are using specific clusters
  – Vulnerable to disk fragmentation
• NTFS uses MFT to keep track of files and directories
  – Used with Windows
• NTFS advantages over FAT
  – Better reliability and security, journaling, file encryption, and file compression

Summary (cont’d.)
• Linux can be used with many file systems
• Files contain binary or text (ASCII) data
• Data is usually stored and accessed either sequentially or randomly (relative access)
• Hashing
  – Common method for accessing a relative file
  – Collisions occur when the hash key is duplicated for more than one relative record location
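
The hashing and overflow ideas from this chapter can be put together in a minimal sketch. It assumes the phone-number example (a divisor of about 5,000,000 for an estimated 2000 customers) and models the file as a dict of record slots plus a list for the overflow area; the HashedFile class, its method names, and the customer names are hypothetical. With integer division, 7025551234 divided by 5,000,000 yields record number 1405.

```python
DIVISOR = 5_000_000  # 9,999,999,999 / 2000 estimated customers, rounded


def hash_key(phone):
    """Map a 10-digit phone number to a relative record number."""
    return phone // DIVISOR


class HashedFile:
    """Toy relative file: one record per slot, plus an overflow area."""

    def __init__(self):
        self.slots = {}      # record number -> (phone, name)
        self.overflow = []   # records whose slot was already taken

    def store(self, phone, name):
        slot = hash_key(phone)
        if slot in self.slots and self.slots[slot][0] != phone:
            # Collision: a different phone number already owns this slot.
            self.overflow.append((phone, name))
        else:
            self.slots[slot] = (phone, name)

    def find(self, phone):
        record = self.slots.get(hash_key(phone))
        if record is not None and record[0] == phone:
            return record[1]
        # Not at the calculated position: search the overflow area in order.
        for key, name in self.overflow:
            if key == phone:
                return name
        return None


customers = HashedFile()
customers.store(7025551234, "Alice")
customers.store(7025559999, "Bob")   # hashes to the same record number
print(hash_key(7025551234), customers.find(7025559999))
```

The second phone number collides with the first, so it is placed in the overflow list and found there by a sequential search, which is the trade-off the overflow approach accepts.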