Native LINUX Filesystems

Extended filesystems (Ext, Ext2, Ext3, Ext4)
The extended filesystem (ext fs), second extended filesystem (ext2fs) and third extended filesystem (ext3fs) were
designed and implemented on Linux by Rémy Card (Laboratoire MASI, Institut Blaise Pascal), Theodore Ts'o
(Massachusetts Institute of Technology) and Stephen Tweedie (University of Edinburgh).
Non-Journaled Filesystems
• Extended filesystem (Ext FS)
This is the original filesystem used in early Linux systems.
Ext2, which replaced it as the standard filesystem for Linux, is a high-performance, non-journaled filesystem.
Although ext2 lacks journaling features, many users choose it because of its high speed and reliability.
• Second Extended Filesystem (Ext2 FS)
The Second Extended File System provides standard Unix file semantics and advanced features. The Ext2
filesystem format forms the basis for the later native Linux filesystem versions. The kernel code includes
performance optimizations, and Ext2fs was designed with extensions to the current filesystem in mind: access
control lists conforming to POSIX semantics, undelete, and on-the-fly file compression.
Ext2 features:
- Long file names (255 characters, extensible to 1012) and variable-length directory entries.
- Filesystem sizes up to 4 TB, the limit imposed by the VFS layer.
- 5% of the blocks are reserved for the super user (root), so that the administrator can recover when user
  processes fill up a filesystem.
- Filesystem metadata (inodes, bitmap blocks, indirect blocks and directory blocks) can be written
  synchronously.
- Choice of logical block size when creating the filesystem, typically 1024, 2048 or 4096 bytes. Larger
  blocks speed up I/O, since fewer I/O requests, and thus fewer disk head seeks, are needed to access a file.
- Fast symbolic links that do not use any data block on the filesystem: the target name is stored in the
  inode itself.
- Filesystem state tracking: a special field in the superblock indicates the status of the filesystem. When a
  filesystem is mounted in read/write mode, its state is set to "Not Clean". When it is unmounted or
  remounted in read-only mode, its state is reset to "Clean". At boot time, the filesystem checker (fsck)
  uses this information to decide whether a filesystem must be checked. The kernel also records detected
  errors in this field, and the checker tests it to force a check of the filesystem regardless of its
  apparently clean state.
- Forced checks at regular intervals: a mount counter is maintained in the superblock, and each time the
  filesystem is mounted in read/write mode the counter is incremented. When it reaches a maximal value
  (also recorded in the superblock), the filesystem checker forces the check even if the filesystem is
  "Clean". A last check time and a maximal check interval are also maintained in the superblock, so a check
  is forced in the same way once the maximal interval between two checks has elapsed.
- Secure deletion: an attribute allows users to request secure deletion of files. When such a file is deleted,
  random data is written to the disk blocks previously allocated to the file. This prevents malicious people
  from gaining access to the previous content of the file by using a disk editor.
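The state and forced-check rules described above can be sketched as a small decision function. This is only an illustration of the logic, not the e2fsck implementation: the Superblock class and its field names are hypothetical, chosen to mirror the superblock fields mentioned in the text, and treating an interval of 0 as "disabled" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """Hypothetical subset of the superblock fields described in the text."""
    clean: bool            # False while mounted read/write ("Not Clean")
    mount_count: int       # read/write mounts since the last check
    max_mount_count: int   # mounts allowed before a check is forced
    last_check: int        # time of the last check, seconds since the epoch
    check_interval: int    # maximal seconds between checks (assumed: 0 disables)


def needs_check(sb: Superblock, now: int) -> bool:
    """Return True if the checker should examine the filesystem at boot."""
    if not sb.clean:
        return True        # not unmounted cleanly
    if sb.mount_count >= sb.max_mount_count:
        return True        # mount counter reached its maximal value
    if sb.check_interval > 0 and now - sb.last_check >= sb.check_interval:
        return True        # maximal check interval has elapsed
    return False


# A clean filesystem is still checked once its mount counter hits the limit.
print(needs_check(Superblock(True, 20, 20, last_check=0, check_interval=0), now=100))
```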
Ext2 Physical Structure
Unlike FFS, which is organized into cylinder groups, an ext2 filesystem is made up of block groups. Block groups
are not tied to the physical layout of the blocks on the disk, since modern drives tend to be optimized for
sequential access ("smart" drives: SAN, SCSI, SATA) and hide their physical geometry from the operating system.
Ext2 filesystem layout
,---------+---------+---------+---------+---------,
| Boot    | Block   | Block   |   ...   | Block   |
| sector  | group 1 | group 2 |         | group n |
`---------+---------+---------+---------+---------'
Each block group contains a redundant copy of crucial filesystem control information (the superblock and the
filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a piece of the
inode table, and data blocks). The structure of a block group is represented in this table:
Ext2 blockgroup layout
,---------+---------+---------+---------+---------+---------,
| Super   | FS      | Block   | Inode   | Inode   | Data    |
| block   | desc.   | bitmap  | bitmap  | table   | blocks  |
`---------+---------+---------+---------+---------+---------'
Using block groups improves reliability: since the control structures are replicated in each block group, it is
easy to recover a filesystem whose superblock has been corrupted. This structure also helps performance: by
reducing the distance between the inode table and the data blocks, it reduces the disk head seeks during I/O on
files.
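Because every block group has the same size, finding the group that holds a given block or inode is simple integer arithmetic. The sketch below illustrates this; the function names are hypothetical, and the numbers in the usage lines (1 KB blocks, 8192 blocks and 2048 inodes per group) are example values, not properties of any particular filesystem.

```python
def block_group_of_block(block, first_data_block, blocks_per_group):
    """Index of the block group containing a given block number."""
    return (block - first_data_block) // blocks_per_group


def locate_inode(inode, inodes_per_group):
    """Return (block group, index inside that group's inode table).

    Inode numbers start at 1, hence the subtraction.
    """
    index = inode - 1
    return index // inodes_per_group, index % inodes_per_group


# Example values (assumed): with 1 KB blocks the first data block is block 1.
print(block_group_of_block(8193, first_data_block=1, blocks_per_group=8192))
print(locate_inode(2049, inodes_per_group=2048))
```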
- In Ext2fs, directories are managed as linked lists of variable-length entries containing the inode number,
  the entry length, the file name and its length. Variable-length entries permit long file names without
  wasting disk space in directories.
- Ext2fs takes advantage of the buffer cache by performing readaheads: when a block has to be read, several
  contiguous blocks are requested. This way, it tries to ensure that the next block to read will already be
  loaded into the buffer cache. Readaheads are normally performed during sequential file reads, and Ext2fs
  extends them to directory reads.
- Ext2fs performs allocation optimizations. Block groups are used to cluster together related inodes and
  data, to reduce the disk head seeks made when the kernel reads an inode and its data blocks.
- Ext2fs preallocates up to 8 adjacent blocks when allocating a new block for writing data. Preallocation hit
  rates are around 75% even on very full filesystems, which gives good write performance under heavy load.
  It also allows contiguous blocks to be allocated to files, which speeds up future sequential reads.
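The variable-length directory entries described in the list above can be sketched as follows. This is a simplified illustration, assuming a header of a 32-bit inode number, a 16-bit entry length and a 16-bit name length, with entries padded to a multiple of 4 bytes; the helper names are hypothetical and later on-disk revisions lay out the header slightly differently.

```python
import struct

# Assumed header layout: inode number, entry length, name length (little-endian).
HEADER = struct.Struct("<IHH")


def pack_entry(inode, name):
    """Build one directory entry, padded to a multiple of 4 bytes."""
    raw = name.encode("utf-8")
    rec_len = (HEADER.size + len(raw) + 3) & ~3
    return (HEADER.pack(inode, rec_len, len(raw)) + raw).ljust(rec_len, b"\0")


def walk_directory(block):
    """Follow the entry lengths through a directory block, like a linked list."""
    entries, offset = [], 0
    while offset + HEADER.size <= len(block):
        inode, rec_len, name_len = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:
            break                      # corrupt entry: stop instead of looping
        if inode != 0:                 # in this sketch, inode 0 marks an unused entry
            start = offset + HEADER.size
            entries.append((inode, block[start:start + name_len].decode("utf-8")))
        offset += rec_len              # the entry length points at the next entry
    return entries


block = pack_entry(2, ".") + pack_entry(2, "..") + pack_entry(12, "notes.txt")
print(walk_directory(block))
```

A short name such as "." costs 12 bytes here while "notes.txt" costs 20, which is the space saving that variable-length entries provide.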
Journaled Filesystems
Journaled filesystems include additional record keeping that increases the ability of the filesystem to recover
from a crash.
· ext3 - the ext2 filesystem with journaling extensions.
· jfs - Journaled File System - a filesystem contributed to Linux by IBM.
· xfs - a filesystem contributed to open source by SGI.
· reiserfs - a filesystem developed by Namesys with sponsorship from DARPA; for years the default filesystem
  for SUSE Linux.
• Third Extended Filesystem (Ext3 FS)
Ext3 supports the same features as Ext2, but also includes journaling.
• Fourth Extended Filesystem (Ext4 FS)
Compatibility
Any existing Ext3 filesystem can be migrated to Ext4 with an easy procedure that consists of running a
couple of commands in read-only mode.
Migrate existing Ext3 filesystems to Ext4
You need to run the tune2fs and fsck tools on the filesystem, and the filesystem needs to be unmounted. Run:
tune2fs -O extents,uninit_bg,dir_index /dev/yourfilesystem
After running this command you MUST run fsck. If you don't, Ext4 WILL NOT MOUNT your filesystem.
This fsck run is needed to return the filesystem to a consistent state. It WILL report checksum errors in the
group descriptors. This is expected: those checksums are exactly what needs to be rebuilt before the filesystem
can be mounted as Ext4, so don't be surprised by them. Each time fsck finds one of these errors it asks what to
do; always answer YES. If you don't want to be asked, add the "-p" parameter to the fsck command, which means
"automatic repair":
e2fsck -pfDC0 /dev/yourfilesystem
Bigger filesystem/file sizes
Currently, Ext3 supports a maximum filesystem size of 16 TB and a maximum file size of 2 TB. Ext4 adds 48-bit
block addressing, which gives it a maximum filesystem size of 1 EB and a maximum file size of 16 TB. 1 EB =
1,048,576 TB (1 EB = 1024 PB, 1 PB = 1024 TB, 1 TB = 1024 GB).
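The filesystem size limits follow directly from the width of a block number multiplied by the block size. The short calculation below reproduces them, assuming 4 KB blocks and 32-bit block numbers for Ext3; the function name is hypothetical.

```python
TB = 1024 ** 4
EB = 1024 ** 6


def max_filesystem_bytes(address_bits, block_size=4096):
    """Largest filesystem addressable with the given block-number width."""
    return (2 ** address_bits) * block_size


print(max_filesystem_bytes(32) // TB)   # Ext3, 32-bit block numbers: 16 (TB)
print(max_filesystem_bytes(48) // EB)   # Ext4, 48-bit block numbers: 1 (EB)
print(EB // TB)                         # 1048576 TB in one EB
```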
Sub directory scalability
Right now the maximum possible number of subdirectories contained in a single directory in Ext3 is 32000.
Ext4 breaks that limit and allows an unlimited number of subdirectories.
Extents
Traditional Unix-derived filesystems like Ext3 use an indirect block mapping scheme to keep track of each block
that holds the data of a file. This is inefficient for large files, especially on large file delete and truncate
operations, because the mapping keeps an entry for every single block, and big files have many blocks, which means
huge mappings that are slow to handle. Modern filesystems use a different approach called "extents". An extent
is basically a run of contiguous physical blocks. For example, a 100 MB file stored in 4 KB blocks needs 25600
entries in a per-block mapping, but it can be described by a single extent if its blocks are contiguous.
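The difference between the two mappings can be illustrated by collapsing a per-block list into (start, length) pairs. This is a toy sketch, not the Ext4 on-disk extent format: the function name is hypothetical, and it ignores any limit on how long a single extent may be.

```python
def to_extents(blocks):
    """Collapse a list of physical block numbers into (start, length) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # block continues the current run
        else:
            extents.append((block, 1))          # block starts a new run
    return extents


# A contiguous 100 MB file (25600 blocks of 4 KB) needs one entry, not 25600.
contiguous = list(range(1000, 1000 + 25600))
print(len(contiguous), len(to_extents(contiguous)))
print(to_extents([10, 11, 12, 50, 51]))
```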
Multiblock allocation
When Ext3 needs to write new data to the disk, a block allocator decides which free blocks will be used to
write the data. But the Ext3 block allocator only allocates one block (4 KB) at a time. That means that if
the system needs to write 100 MB of data, it will need to call the block allocator 25600 times (and that is
just 100 MB!).
Ext4 uses a "multiblock allocator" (mballoc) which allocates many blocks in a single call, instead of a single
block per call, avoiding a lot of overhead. This improves performance, and it is particularly useful with
delayed allocation and extents. This feature doesn't affect the disk format. Note also that the Ext4 block/inode
allocator has other improvements that are not covered here.
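The call count quoted above is a simple ceiling division, which the sketch below makes explicit. The function name is hypothetical, and the figure of 2048 blocks per call is an arbitrary value chosen for illustration, not a property of mballoc.

```python
def allocator_calls(data_bytes, block_size=4096, blocks_per_call=1):
    """Number of allocator calls needed to place data_bytes on disk."""
    blocks = -(-data_bytes // block_size)        # ceiling division
    return -(-blocks // blocks_per_call)


hundred_mb = 100 * 1024 * 1024
print(allocator_calls(hundred_mb))                        # one block per call
print(allocator_calls(hundred_mb, blocks_per_call=2048))  # assumed batch size
```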
Delayed allocation
Delayed allocation is a performance feature (it doesn't change the disk format) found in a few modern
filesystems such as XFS, ZFS, btrfs or Reiser 4. It consists of delaying the allocation of blocks as much as
possible, contrary to what traditional filesystems (such as Ext3, reiser3, etc.) do: allocate the blocks as soon
as possible.
With delayed allocation, Ext4 does not allocate the blocks immediately when the process calls write(). Instead,
it delays the allocation while the file data is kept in cache, until the data is really going to be written to
the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old
system couldn't.
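A toy model can show why waiting helps. In the sketch below, two files are written in interleaved one-block chunks: allocating at write() time scatters each file's blocks, while allocating at flush time lets each file receive one contiguous run. The ToyDisk class and both functions are hypothetical simplifications and do not reflect the real Ext4 allocator.

```python
class ToyDisk:
    """Hands out consecutive block numbers; a stand-in for a block allocator."""

    def __init__(self):
        self.next_free = 0

    def allocate(self, count):
        start = self.next_free
        self.next_free += count
        return list(range(start, start + count))


def immediate_allocation(writes):
    """Allocate blocks as soon as each write arrives."""
    disk, layout = ToyDisk(), {}
    for name, nblocks in writes:
        layout.setdefault(name, []).extend(disk.allocate(nblocks))
    return layout


def delayed_allocation(writes):
    """Only count dirty blocks per file, then allocate once at flush time."""
    disk, pending = ToyDisk(), {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks
    return {name: disk.allocate(count) for name, count in pending.items()}


writes = [("a", 1), ("b", 1), ("a", 1), ("b", 1)]
print(immediate_allocation(writes))   # each file's blocks are interleaved
print(delayed_allocation(writes))     # each file gets one contiguous run
```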
Fast fsck
Fsck is a very slow operation, especially the first step: checking all the inodes in the filesystem. In Ext4, a
list of unused inodes is stored at the end of each group's inode table (with a checksum, for safety), so fsck
does not need to check those inodes. The result is that total fsck time improves by a factor of 2 to 20,
depending on the number of used inodes.
Journal checksumming
The journal is the most used part of the disk, which makes the blocks that form part of it more prone to hardware
failure. Moreover, recovering from a corrupted journal can lead to massive corruption. Ext4 checksums the journal
data to know whether the journal blocks are failing or corrupted. Journal checksumming also has a bonus: it allows
the two-phase commit system of Ext3's journaling to be converted to a single phase, speeding up filesystem
operation by up to 20% in some cases, so reliability and performance are improved at the same time.
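The idea behind checksummed commits can be sketched as follows: the commit record carries a checksum of the transaction's blocks, so at recovery time a transaction whose blocks do not match its recorded checksum is treated as incomplete or corrupted and is not replayed. This is a simplified illustration using CRC32 from Python's zlib module; the function names are hypothetical and this is not the actual journal format.

```python
import zlib


def commit_checksum(blocks):
    """Running CRC32 over all blocks of a transaction."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def transaction_is_valid(blocks, recorded_checksum):
    """At recovery time, replay the transaction only if the checksum matches."""
    return commit_checksum(blocks) == recorded_checksum


blocks = [b"a" * 4096, b"b" * 4096]
recorded = commit_checksum(blocks)           # stored in the commit record

damaged = [blocks[0], b"b" * 4095 + b"X"]    # one byte differs on disk
print(transaction_is_valid(blocks, recorded))
print(transaction_is_valid(damaged, recorded))
```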
"No Journaling" mode
Journaling ensures the integrity of the filesystem by keeping a log of the ongoing disk changes. However, it is
known to have a small overhead. Some people with special requirements and workloads can run without a
journal and its integrity advantages. In Ext4 the journaling feature can be disabled, which provides a small
performance improvement.
Online defragmentation
(This feature is being developed and will be included in future releases).
Inode-related features
Larger inodes, nanosecond timestamps, fast extended attributes, inode reservation...
Larger inodes: Ext3 supports configurable inode sizes (via the -I mkfs parameter), but the default inode size is
128 bytes. Ext4 defaults to 256 bytes. This is needed to accommodate some extra fields (like nanosecond
timestamps or inode versioning), and the remaining space of the inode is used to store extended attributes
that are small enough to fit in that space. This makes access to those attributes much faster, and improves
the performance of applications that use extended attributes by a factor of 3 to 7.
Inode reservation consists of reserving several inodes when a directory is created, expecting that they will be
used in the future. This improves performance, because when new files are created in that directory they are
able to use the reserved inodes. File creation and deletion are hence more efficient.
Nanosecond timestamps mean that inode fields like "modified time" are able to use nanosecond resolution
instead of the one-second resolution of Ext3.
Persistent preallocation
This feature, available in Ext3 in the latest kernel versions, and emulated by glibc in the filesystems that don't
support it, allows applications to preallocate disk space: applications tell the filesystem to preallocate the
space, and the filesystem preallocates the necessary blocks and data structures, but there is no data in them
until the application really needs to write the data in the future.
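From an application's point of view, preallocation is a single call. The sketch below uses Python's os.posix_fallocate, which wraps the POSIX posix_fallocate interface on Unix systems; the helper name and the file name are hypothetical, and whether the space is reserved natively or through emulation depends on the underlying filesystem.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for the file at `path` and return its new size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)   # offset 0, length `size`
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "prealloc.bin")   # hypothetical file name
    print(preallocate(target, 1024 * 1024))      # size in bytes after the call
```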
Barriers on by default
This is an option that improves the integrity of the filesystem at the cost of some performance (you can disable
it with "mount -o barrier=0", which is worth trying if you're benchmarking). The filesystem code must, before
writing the [journaling] commit record, be absolutely sure that all of the transaction's information has made it to
the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal
caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to
get all of the journal data onto the media before writing the commit record; if the commit record gets written
first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the
use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written
before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk
structures remain consistent at all times.