Native LINUX Filesystems

Extended filesystems (Ext, Ext2, Ext3, Ext4)

The extended filesystem (ext fs), second extended filesystem (ext2fs) and third extended filesystem (ext3fs) were designed and implemented on Linux by Rémy Card (Laboratoire MASI, Institut Blaise Pascal), Theodore Ts'o (Massachusetts Institute of Technology), and Stephen Tweedie (University of Edinburgh).

Non-Journaled Filesystems

Extended filesystem (Ext FS)

This is the original filesystem used in early Linux systems.

Second Extended Filesystem (Ext2 FS)

The standard filesystem for Linux, ext2, is a high-performance, non-journaled filesystem. Although ext2 lacks journaling features, many users choose it because of its high speed and reliability.

The Second Extended File System provides standard Unix file semantics and advanced features. The Ext2 filesystem format forms the basis for the later native LINUX filesystem versions. The Ext2fs design also leaves room for extensions to the current filesystem: access control lists conforming to POSIX semantics, undelete, and on-the-fly file compression.

Ext2 features:

- Long file names (255 characters, extensible to 1012) and variable-length directory entries.
- Filesystems of up to 4 TB, the limit of the VFS layer.
- Reserves 5% of the blocks for the superuser (root), so that the administrator can recover when user processes fill up a filesystem.
- Synchronous writes of filesystem metadata (inodes, bitmap blocks, indirect blocks and directory blocks) can be requested.
- Choice of logical block size when creating the filesystem, typically 1024, 2048 or 4096 bytes. Larger blocks speed up I/O, since fewer I/O requests, and thus fewer disk head seeks, are needed.
- Fast symbolic links that do not use any data block on the filesystem: the target name is stored in the inode itself rather than in a data block.
- Filesystem state tracking: a special field in the superblock indicates the status of the filesystem. When a filesystem is mounted in read/write mode, its state is set to ``Not Clean''. When it is unmounted or remounted in read-only mode, its state is reset to ``Clean''.
  At boot time, the filesystem checker (fsck) uses this information to decide whether a filesystem must be checked... The filesystem checker also tests this field to force a check of the filesystem regardless of its apparently clean state.
- Filesystem checks are forced at regular intervals. A mount counter, a last check time and a maximal check interval are maintained in the superblock. Each time the filesystem is mounted in read/write mode, the counter and timestamps are checked. When the mount counter reaches its maximal value (also recorded in the superblock), or when the maximal check interval has elapsed since the last check, the filesystem checker forces the check even if the filesystem is ``Clean''.
- Ext2 provides an attribute that allows users to request secure deletion of files. When such a file is deleted, random data is written to the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor.

Ext2 Physical Structure

An ext2 filesystem is made up of block groups, which play the role that cylinder groups play in FFS. Unlike cylinder groups, block groups are not tied to the physical layout of the blocks on the disk, since modern ("smart") drives such as SAN, SCSI and SATA devices tend to be optimized for sequential access and hide their physical geometry from the operating system.

Ext2 filesystem layout

,---------+---------+---------+---------+---------,
| Boot    | Block   | Block   |   ...   | Block   |
| sector  | group 1 | group 2 |         | group n |
`---------+---------+---------+---------+---------'

Each block group contains a redundant copy of crucial filesystem control information (the superblock and the filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a piece of the inode table, and data blocks). The structure of a block group is represented in this table:

Ext2 block group layout

,---------+---------+---------+---------+---------+---------,
| Super   | FS      | Block   | Inode   | Inode   | Data    |
| block   | desc.   | bitmap  | bitmap  | table   | blocks  |
`---------+---------+---------+---------+---------+---------'

Using block groups improves reliability: since the control structures are replicated in each block group, it is easy to recover a filesystem whose superblock has been corrupted. This structure also helps performance: by reducing the distance between the inode table and the data blocks, it reduces the disk head seeks during I/O on files.

- In Ext2fs, directories are managed as linked lists of variable-length entries containing the inode number, the entry length, the file name and its length. Variable-length entries permit long file names without wasting disk space in directories.
- Ext2fs buffer cache management performs readaheads: when a block is read, the contiguous data blocks that follow it are requested as well. This way, it tries to ensure that the next block to read will already be loaded into the buffer cache. Readaheads are also extended to directory reads.
- Ext2fs performs allocation optimizations. Block groups are used to cluster together related inodes and data, to reduce the disk head seeks made when the kernel reads an inode and its data blocks.
- Ext2fs preallocates up to 8 adjacent blocks when allocating a new block for writing data. Preallocation hit rates are around 75% even on very full filesystems, which gives good write performance under heavy load. It also allows contiguous blocks to be allocated to files, which speeds up future sequential reads.

Journaled Filesystems

Journaled filesystems include additional record keeping that increases the ability of the filesystem to recover from a crash.

· ext3 - the ext2 filesystem with journaling extensions.
· jfs - Journaled File System, a filesystem contributed to Linux by IBM.
· xfs - a filesystem contributed to open source by SGI.
· reiserfs - developed by Namesys with DARPA sponsorship; the default filesystem for SUSE Linux.

• Third Extended Filesystem (Ext3 FS)

Ext3 supports the same features as Ext2, but also includes journaling.
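To make the idea of journaling concrete, here is a minimal sketch in Python of the "additional record keeping" described above. It is a toy in-memory model written for these notes, not the ext3 on-disk journal format: the class name ToyJournaledStore and its methods are hypothetical. Updates are first appended to a journal and followed by a commit record; after a crash, only transactions that have a commit record are replayed.

```python
class ToyJournaledStore:
    """Toy model of a journaled store (hypothetical, not ext3's real format)."""

    def __init__(self):
        self.blocks = {}    # block number -> bytes: the main filesystem area
        self.journal = []   # ordered journal records

    def log_transaction(self, txid, updates, commit=True):
        """Append the updated blocks to the journal, then the commit record.

        commit=False simulates a crash before the commit record is written.
        """
        for blockno, data in updates.items():
            self.journal.append(("data", txid, blockno, data))
        if commit:
            self.journal.append(("commit", txid))

    def replay(self):
        """Crash recovery: apply committed transactions, discard the rest."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for rec in self.journal:
            if rec[0] == "data" and rec[1] in committed:
                _, _, blockno, data = rec
                self.blocks[blockno] = data
        self.journal.clear()


if __name__ == "__main__":
    store = ToyJournaledStore()
    store.log_transaction(1, {10: b"inode", 11: b"bitmap"})
    store.log_transaction(2, {10: b"half-written"}, commit=False)  # "crash"
    store.replay()
    print(store.blocks)  # only transaction 1 has been applied
```

Because the incomplete transaction is simply ignored, recovery never leaves block 10 in a half-updated state, which is the property a journal is meant to provide.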
• Fourth Extended Filesystem (Ext4 FS)

Compatibility

Any existing Ext3 filesystem can be migrated to Ext4 with an easy procedure which consists in running a couple of commands in read-only mode.

Migrate existing Ext3 filesystems to Ext4

You need to use the tune2fs and fsck tools on the filesystem, and that filesystem needs to be unmounted. Run:

    tune2fs -O extents,uninit_bg,dir_index /dev/yourfilesystem

After running this command you MUST run fsck. If you don't, Ext4 WILL NOT MOUNT your filesystem. This fsck run is needed to return the filesystem to a consistent state. It WILL tell you that it finds checksum errors in the group descriptors. This is expected: those descriptors are exactly what needs to be rebuilt before the filesystem can be mounted as Ext4, so don't be surprised by the errors. Each time fsck finds one of those errors it asks you what to do; always say YES. If you don't want to be asked, add the "-p" parameter to the fsck command, which means "automatic repair":

    (e2)fsck -pfDC0 /dev/yourfilesystem

Bigger filesystem/file sizes

Currently, Ext3 supports a maximum filesystem size of 16 TB and a maximum file size of 2 TB. Ext4 adds 48-bit block addressing, so it will have a maximum filesystem size of 1 EB and a maximum file size of 16 TB. 1 EB = 1,048,576 TB (1 EB = 1024 PB, 1 PB = 1024 TB, 1 TB = 1024 GB).

Sub directory scalability

Right now the maximum possible number of sub directories contained in a single directory in Ext3 is 32000. Ext4 breaks that limit and allows an unlimited number of sub directories.

Extents

Traditional Unix-derived filesystems like Ext3 use an indirect block mapping scheme to keep track of each block that holds the data of a file. This is inefficient for large files, especially on large file delete and truncate operations, because the mapping keeps an entry for every single block, and big files have many blocks, which means huge mappings that are slow to handle. Modern filesystems use a different approach called "extents". An extent is basically a bunch of contiguous physical blocks.

Multiblock allocation

When Ext3 needs to write new data to the disk, a block allocator decides which free blocks will be used to write the data. But the Ext3 block allocator only allocates one block (4 KB) at a time. That means that if the system needs to write, say, 100 MB of data, it will need to call the block allocator 25600 times (and it was just 100 MB!). Ext4 uses a "multiblock allocator" (mballoc) which allocates many blocks in a single call, instead of a single block per call, avoiding a lot of overhead. This improves performance, and it is particularly useful with delayed allocation and extents. This feature doesn't affect the disk format. Also, note that the Ext4 block/inode allocator has other improvements, described in detail in this paper.

Delayed allocation

Delayed allocation is a performance feature (it doesn't change the disk format) found in a few modern filesystems such as XFS, ZFS, btrfs or Reiser 4. It consists in delaying the allocation of blocks as much as possible, contrary to what traditional filesystems (such as Ext3, reiser3, etc.) do: allocate the blocks as soon as possible. With delayed allocation, Ext4 does not allocate the blocks immediately when the process write()s; rather, it delays the allocation while the file is kept in cache, until the data is really going to be written to the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't.

Fast fsck

Fsck is a very slow operation, especially the first step: checking all the inodes in the filesystem. In Ext4, a list of unused inodes will be stored at the end of each group's inode table (with a checksum, for safety), so fsck will not check those inodes. The result is that total fsck time improves by a factor of 2 to 20, depending on the number of used inodes.
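The extent idea and the 100 MB figure used in the multiblock allocation discussion can be illustrated with a short Python sketch. It is a simplification written for these notes, not the Ext4 on-disk extent format: the function blocks_to_extents is hypothetical, and it only coalesces a file's physical block numbers (given in file order) into (start, length) pairs.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB blocks, as assumed in the text


def blocks_to_extents(blocks):
    """Coalesce a file's physical block numbers into (start, length) extents."""
    extents = []
    for b in blocks:
        if extents and extents[-1][0] + extents[-1][1] == b:
            # b directly follows the last extent: grow it by one block
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # gap or first block: start a new extent
            extents.append((b, 1))
    return extents


if __name__ == "__main__":
    nblocks = (100 * 1024 * 1024) // BLOCK_SIZE      # blocks in a 100 MB file
    contiguous = list(range(5000, 5000 + nblocks))   # a contiguous placement
    print(nblocks, len(blocks_to_extents(contiguous)))  # prints "25600 1"
    print(blocks_to_extents([1, 2, 3, 7, 8, 20]))    # a fragmented file
    print(2**48 * BLOCK_SIZE == 2**60)               # 48-bit addressing: 1 EB
```

A per-block mapping needs 25600 entries for the contiguous 100 MB file, while one extent describes it completely; the more fragmented the file, the smaller the saving.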
Journal checksumming

The journal is the most used part of the disk, making the blocks that form part of it more prone to hardware failure. And recovering from a corrupted journal can lead to massive corruption. Ext4 checksums the journal data to know if the journal blocks are failing or corrupted. But journal checksumming has a bonus: it allows one to convert the two-phase commit system of Ext3's journaling to a single phase, speeding up filesystem operation by up to 20% in some cases, so reliability and performance are improved at the same time.

"No Journaling" mode

Journaling ensures the integrity of the filesystem by keeping a log of the ongoing disk changes. However, it is known to have a small overhead. Some people with special requirements and workloads can run without a journal and its integrity advantages. In Ext4 the journaling feature can be disabled, which provides a small performance improvement.

Online defragmentation

(This feature is being developed and will be included in future releases.)

Inode-related features

Larger inodes, nanosecond timestamps, fast extended attributes, inode reservation...

Larger inodes: Ext3 supports configurable inode sizes (via the -I mkfs parameter), but the default inode size is 128 bytes. Ext4 will default to 256 bytes. This is needed to accommodate some extra fields (like nanosecond timestamps or inode versioning), and the remaining space of the inode will be used to store extended attributes that are small enough to fit in that space. This will make access to those attributes much faster, and improves the performance of applications that use extended attributes by a factor of 3 to 7.

Inode reservation consists in reserving several inodes when a directory is created, expecting that they will be used in the future. This improves performance, because when new files are created in that directory they will be able to use the reserved inodes. File creation and deletion are hence more efficient.
Nanosecond timestamps mean that inode fields like "modified time" will be able to use nanosecond resolution instead of the one-second resolution of Ext3.

Persistent preallocation

This feature, available in Ext3 in the latest kernel versions, and emulated by glibc on the filesystems that don't support it, allows applications to preallocate disk space. Applications tell the filesystem to preallocate the space, and the filesystem preallocates the necessary blocks and data structures, but there is no data in them until the application really needs to write the data in the future.

Barriers on by default

This is an option that improves the integrity of the filesystem at the cost of some performance (you can disable it with "mount -o barrier=0", which is worth trying if you are benchmarking).

The filesystem code must, before writing the [journaling] commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted.

The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times.
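As an illustration of the persistent preallocation feature described above, the following Python sketch reserves space for a file from user space. It assumes a Unix system where os.posix_fallocate is available (Python's wrapper around the C library's posix_fallocate(), the call that glibc emulates on filesystems without native support); the helper name preallocate and the file name are hypothetical.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path` without writing any application data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate the byte range [0, size).
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return os.stat(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        st = preallocate(os.path.join(tmp, "prealloc.bin"), 1024 * 1024)
        print("logical size:", st.st_size)
        print("bytes backed on disk:", st.st_blocks * 512)
```

The file reads back as zeros until the application writes real data. How the reserved space is represented on disk depends on the filesystem, so the second figure may differ between systems.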