File Systems

FILE SYSTEM TOPICS
Lei Xu
Agenda

 Introduction
 VFS
 Optimizations
 Examples
 FAQ
Introduction

“A file system is a means to organize data expected
to be retained after a program terminates by
providing procedures to store, retrieve and update
data, as well as manage the available space on the
device(s) which contain it.” – from Wikipedia
 Store data
 Organize data
 Access data
 Manage storage resources (e.g. hard drive)
Relationship to Architecture Course
Acknowledgement: slides from the 830 course

A file system sits between memory and secondary storage (or remote servers)
 One of the most complex parts of an operating system
 Main R&D focuses:
   Performance: throughput, latency, scalability
   Reliability and availability
   Management: snapshots, etc.
Different types of file systems

Local file systems
 Store data on local hard drives, SSDs, floppy drives, optical disks, etc.
 Examples: NTFS, EXT4, HFS+, ZFS

Network/distributed file systems
 Store data on remote file server(s)
 Examples: NFS, CIFS/Samba, AFP, Hadoop DFS, Ceph

Pseudo file systems
 Examples: procfs, devfs, tmpfs

“List of file systems”: http://en.wikipedia.org/wiki/List_of_file_systems
Overall Architecture of Linux file system components
Acknowledgement: “Anatomy of the Linux file system”, IBM developerWorks.
Virtual File System (VFS)

VFS is the essential concept in UNIX-like file systems
 Specifies an interface between the kernel and a concrete file system
 Passes system calls to the underlying file systems
   E.g. passes sys_write() to Ext4 (i.e. ext4_write())
 Introduced by Sun in 1985
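The dispatch idea (one generic call routed to whichever concrete file system owns the path) can be sketched in a few lines of Python. This is a toy model, not kernel code: the class names `FileSystem`, `ToyExt4` and `VFS`, and the longest-prefix mount lookup, are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical interface that every concrete file system implements."""

    @abstractmethod
    def write(self, path, data):
        """Store data under path and return the number of bytes written."""


class ToyExt4(FileSystem):
    """Stand-in for a concrete file system; keeps file contents in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = self.files.get(path, b"") + data
        return len(data)


class VFS:
    """Routes a generic sys_write() to the file system mounted on the path."""

    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point] = fs

    @staticmethod
    def _covers(mount_point, path):
        prefix = mount_point.rstrip("/") + "/"
        return path == mount_point or path.startswith(prefix)

    def sys_write(self, path, data):
        # Choose the longest mount point that is a prefix of the path.
        candidates = [m for m in self.mounts if self._covers(m, path)]
        if not candidates:
            raise FileNotFoundError(path)
        return self.mounts[max(candidates, key=len)].write(path, data)


vfs = VFS()
root_fs, data_fs = ToyExt4(), ToyExt4()
vfs.mount("/", root_fs)
vfs.mount("/mnt/data", data_fs)
written = vfs.sys_write("/mnt/data/log.txt", b"hello")
```

The caller never names the concrete file system; adding another one only requires a new `FileSystem` subclass and a `mount()` call.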
Three major kinds of metadata in VFS
Metadata: the data about data (Wikipedia)
 Super block, dentry and inode
 OO design
   Each component defines a set of data members and the functions to access them
Super block

A segment of metadata that describes a file system
 Constructed when a file system is mounted
 Usually, a persistent copy of the super block is stored at the beginning of the storage device
 Describes:
   File system type, size, status (e.g. dirty bit, read-only bit)
   Block size, maximum file size in bytes, device size, ...
   How to find other metadata and data
   How to manipulate these data (i.e. sb_ops)
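As a concrete illustration of a persistent super block, the sketch below parses a few fields using the ext2/3/4 on-disk layout: the primary super block starts at byte offset 1024, and within it the inode count, block count, block-size exponent and magic number (0xEF53) sit at offsets 0, 4, 24 and 56. The helper names and the synthetic in-memory image are assumptions for the example; no real device is read.

```python
import struct

EXT_SUPERBLOCK_OFFSET = 1024  # ext2/3/4 keep the primary super block here
EXT_MAGIC = 0xEF53


def parse_ext_superblock(image):
    """Extract a few descriptive fields from an ext2/3/4 image (bytes)."""
    sb = image[EXT_SUPERBLOCK_OFFSET:EXT_SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 1024:
        raise ValueError("image too small to hold a super block")
    inodes_count, blocks_count = struct.unpack_from("<II", sb, 0)
    (log_block_size,) = struct.unpack_from("<I", sb, 24)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/3/4 super block")
    return {
        "inodes_count": inodes_count,
        "blocks_count": blocks_count,          # low 32 bits on ext4
        "block_size": 1024 << log_block_size,  # stored as a power-of-two exponent
    }


def make_fake_image(inodes, blocks, log_block_size):
    """Build a synthetic image holding only the fields parsed above."""
    sb = bytearray(1024)
    struct.pack_into("<II", sb, 0, inodes, blocks)
    struct.pack_into("<I", sb, 24, log_block_size)
    struct.pack_into("<H", sb, 56, EXT_MAGIC)
    return bytes(EXT_SUPERBLOCK_OFFSET) + bytes(sb)


info = parse_ext_superblock(make_fake_image(128, 1024, 2))
```

A mount routine would do something similar first: read the super block, check the magic number, and only then trust the remaining fields.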
Inode

“Index node” in Unix-style file systems
 All information about one file (or directory), except its name
   In UNIX-like systems, file names are stored in the directory file: its content is an “array” of file names
 E.g. owner, access rights, mode, size, timestamps, etc.
 Pointers to data
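That the name lives in the directory rather than in the inode can be observed from user space. The sketch below gives one file a second name with a hard link and checks that both names resolve to the same inode. It assumes a file system that supports hard links; the file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "report.txt")
    alias = os.path.join(d, "same-report.txt")
    with open(original, "w") as f:
        f.write("hello")
    os.link(original, alias)  # a second directory entry for the same inode

    a, b = os.stat(original), os.stat(alias)
    same_inode = a.st_ino == b.st_ino  # both names point at one inode
    link_count = a.st_nlink            # number of names referring to it
    size = b.st_size                   # metadata is shared, whichever name is used
    names = sorted(os.listdir(d))      # the names themselves come from the directory
```

`os.stat()` returns owner, mode, size and timestamps, but nothing in the result records which name was used to reach the inode.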
Directory Entry (dentry)

A dentry conceptually maps a file name to its corresponding inode
 Each file/directory has a dentry representing it
 File systems use dentries to look up a file in the hierarchical namespace
   Each dentry has a pointer to the dentry of its parent directory
   Each dentry of a directory has a list of the dentries of its subdirectories and files
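A minimal sketch of this structure, assuming a hypothetical `Dentry` class with a parent pointer and a child table, and made-up inode numbers:

```python
class Dentry:
    """Toy dentry: a name, an inode number, a parent pointer and children."""

    def __init__(self, name, inode_no, parent=None):
        self.name = name
        self.inode_no = inode_no
        # The root dentry is its own parent.
        self.parent = parent if parent is not None else self
        self.children = {}  # child name -> Dentry
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Rebuild the full path by following parent pointers up to the root."""
        parts = []
        node = self
        while node.parent is not node:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def lookup(root, path):
    """Walk from the root dentry, one path component at a time."""
    node = root
    for component in path.strip("/").split("/"):
        if component:
            node = node.children[component]  # KeyError if the name is missing
    return node


root = Dentry("/", 2)
users = Dentry("Users", 11, root)
john = Dentry("john", 12, users)
docs = Dentry("Documents", 13, john)
pdf = Dentry("course.pdf", 14, docs)

found = lookup(root, "/Users/john/Documents/course.pdf")
```

The child table serves lookups going down the tree; the parent pointer serves the opposite direction, such as rebuilding a full path from a dentry.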
Optimizations

Most file system optimizations are designed around the characteristics of the memory hierarchy and storage devices.
 Recall:
   RAM: 50-100 ns
   Disks: 5-10 ms
   About 5 orders of magnitude difference
 Almost all widely used local file systems are designed for hard disk drives, which have their own unique characteristics
Hard Disk Drive (HDD)

Stores data on one or more rotating disks, coated with magnetic material
 Introduced by IBM in 1956
 Uses a magnetic head to read and write data
The very early HDD…..
Acknowledge to:
HDD (Cont’d)

The essential structure of the HDD has not changed much…
 Consists of several disks (platters)
 Each disk is divided into tracks, each of which is then divided into sectors

The single most significant performance factor: seek time
Why seek time matters

To access a piece of data (a sector), the HDD head must first move to the right track (seek time), and then the disk must rotate until the sector is under the head (rotational time)
 Seek time: 3 ms on high-end server disks, 12 ms on desktop-level disks [1]
 Rotational time: 5.56 ms on 5400 RPM HDDs, 4.17 ms on 7200 RPM HDDs [1]

As a result, sequential IO is much faster than random IO, because consecutive sectors can be read with little or no additional seek or rotational delay
[1], http://en.wikipedia.org/wiki/Disk-drive_performance_characteristics
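The rotational figures follow from the spindle speed: on average the head waits half a revolution. The sketch below reproduces them and gives a rough upper bound on random reads per second. It ignores transfer time and command overhead, and the function names are illustrative.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational delay: half of one revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def random_reads_per_second(seek_ms, rpm):
    """Rough upper bound on random IOs per second (seek + rotation only)."""
    return 1000 / (seek_ms + avg_rotational_latency_ms(rpm))


latency_5400 = round(avg_rotational_latency_ms(5400), 2)  # 5.56
latency_7200 = round(avg_rotational_latency_ms(7200), 2)  # 4.17
desktop_iops = random_reads_per_second(12, 7200)
```

With a 12 ms seek on a 7200 RPM disk, this bound is only a few dozen random reads per second, which is why file systems try so hard to turn random IO into sequential IO.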
General Optimizations

Based on two principles:
 RAM access is much faster than disk access
 Sequential IO is much faster than random IO on disk

So we design file systems that
 Heavily use CPU/RAM to reduce IO to disks (various caches/write buffers)
 Prefer sequential IO
 Compute the disk layout so that related data is placed sequentially on disk
Dcache

Dentry cache (dcache)
 Directories are stored as files on disk.
 For each file lookup, we want to obtain the inode from the given full file path
   The OS looks up the dentries of every directory in the path, starting from the root.
   E.g. to look up the file “/Users/john/Documents/course.pdf”, the OS needs to traverse the dentries that represent “/”, “Users”, “john”, “Documents”, and “course.pdf”
 To accelerate this:
   We use a global hash table (dcache) to map “file path” -> dentry
   A two-list solution: one list for active dentries, and one for “recently unused dentries” (LRU).
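The two-list idea can be sketched as below, assuming a hypothetical `DentryCache` class with reference counting. Keying by full path follows the description above; the Linux dcache itself hashes on the parent dentry plus the component name, so treat this as a simplified model.

```python
from collections import OrderedDict


class DentryCache:
    """Toy dcache: an 'active' table plus an LRU list of unused dentries."""

    def __init__(self, max_unused):
        self.max_unused = max_unused
        self.active = {}             # path -> [dentry, reference count]
        self.unused = OrderedDict()  # path -> dentry, least recently used first

    def insert(self, path, dentry):
        """Add a freshly looked-up dentry with one reference held."""
        self.active[path] = [dentry, 1]

    def get(self, path):
        """Return the cached dentry and take a reference, or None on a miss."""
        if path in self.active:
            self.active[path][1] += 1
            return self.active[path][0]
        if path in self.unused:
            dentry = self.unused.pop(path)  # revive it from the LRU list
            self.active[path] = [dentry, 1]
            return dentry
        return None

    def put(self, path):
        """Drop one reference; at zero, park the dentry on the LRU list."""
        entry = self.active[path]
        entry[1] -= 1
        if entry[1] == 0:
            del self.active[path]
            self.unused[path] = entry[0]
            if len(self.unused) > self.max_unused:
                self.unused.popitem(last=False)  # evict the oldest unused dentry


cache = DentryCache(max_unused=2)
for name in ("a", "b", "c"):
    cache.insert("/" + name, "dentry-" + name)
for name in ("a", "b", "c"):
    cache.put("/" + name)

miss = cache.get("/a")  # evicted: it was the oldest of three unused entries
hit = cache.get("/b")   # still cached; moves back to the active table
```

Dentries that are in use can never be evicted; only those nobody references compete for the limited LRU space.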

Inode cache

Similar to the dcache, the OS maintains a cache for inode objects.
 Each inode object has a 1-to-1 relation to a dentry
 If the dentry object is evicted, its inode is evicted too

[Figure: processes (P1, P2, … P10) hold file objects (f0-f3); within the VFS, the dentry cache (hash table: Dentry 0, 10, 20), the inode cache (Inode 0, 10, 20), and the page cache (radix tree: Page Cache 0, 10, 20)]
Page Cache

…a “transparent” buffer for disk-backed pages kept in RAM for fast access… [Wikipedia]
 A write-back cache
 Main purpose: reducing the number of IOs to disks
 Access is page-based (usually 4 KB).

The page cache is per-file.
 A radix tree in the inode object.
 Prefetches pages to serve future reads
 Absorbs writes to reduce the number of IOs
   Dirty (modified) pages are flushed to disk 1) periodically (every 30 s or 5 s), or 2) when the OS wants to reclaim RAM
   A flush can also be forced by calling the “fsync()” system call
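A toy write-back cache shows how writes are absorbed. The `PageCache` class, its plain dict in place of the radix tree, and the dict standing in for the disk are all assumptions for illustration; periodic flushing and prefetching are left out.

```python
PAGE_SIZE = 4096


class PageCache:
    """Toy per-file write-back cache; a dict stands in for the disk."""

    def __init__(self, backing):
        self.backing = backing  # page index -> bytes (the "disk")
        self.pages = {}         # page index -> bytearray held in RAM
        self.dirty = set()      # indexes of modified, unflushed pages
        self.disk_reads = 0
        self.disk_writes = 0

    def _load(self, index):
        if index not in self.pages:
            self.disk_reads += 1  # cache miss: fetch the page once
            data = self.backing.get(index, bytes(PAGE_SIZE))
            self.pages[index] = bytearray(data)
        return self.pages[index]

    def write(self, offset, data):
        for i, byte in enumerate(data):
            index, within = divmod(offset + i, PAGE_SIZE)
            self._load(index)[within] = byte
            self.dirty.add(index)  # changed in RAM only, for now

    def read(self, offset, length):
        out = bytearray()
        for pos in range(offset, offset + length):
            index, within = divmod(pos, PAGE_SIZE)
            out.append(self._load(index)[within])
        return bytes(out)

    def fsync(self):
        """Write every dirty page back, one IO per page."""
        for index in sorted(self.dirty):
            self.backing[index] = bytes(self.pages[index])
            self.disk_writes += 1
        self.dirty.clear()


disk = {}
cache = PageCache(disk)
for i in range(100):
    cache.write(i * 10, b"x" * 10)  # 100 small writes, all inside page 0
writes_before_sync = cache.disk_writes
cache.fsync()
writes_after_sync = cache.disk_writes
```

One hundred small writes cost a single disk write at `fsync()` time. The price is that data written but not yet flushed is lost if the machine crashes first.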
Examples

Several concrete file system designs
 Ext4, classic UNIX-like file system concepts
 NTFS, advanced Windows file system
 ZFS, “the last word in file systems”
 NFS, a standard network file system
 Google File System, a special distributed file system for special requirements
Ext4

The latest version of the “extended file system” (Ext2/3/4)
 The standard Linux file system for a long time
 Inspired by UFS from BSD/Solaris
 Groups files into block groups
   Keeps file data close to its inodes
Ack: http://bit.ly/tjipWY
NTFS

“New Technology File System” (NTFS)
 The standard file system in the Windows world.
 A Master File Table (MFT) contains all metadata.
 A directory is also a file
ZFS

ZFS: “the last word in file systems”
 The most advanced local file system in production
 128-bit address space (2^128 bytes in theory)
   Larger than the number of grains of sand on Earth…
 A lot of advanced features:
   E.g. transactional commits, end-to-end integrity, snapshots, volume management and much more…
 Designed to never lose data and to always be consistent.
 Every OS community wants to clone or copy its features…
   Btrfs on Linux, ReFS on Windows, ZFS on FreeBSD
NFS

“Network File System” (NFS)
 A protocol developed by Sun in 1984
 A set of RPC calls
 IETF standard
 Supported by all major OSs
 Simple and efficient
Google File System (GFS)

A large distributed file system specially designed for the MapReduce framework
 High throughput
 High availability
 Specially designed; not compatible with the VFS/POSIX API.
   Requires clients to be linked against the GFS library.
 Hadoop DFS clones the concepts of GFS
More File Systems

Interesting file systems that are worth exploring
 Btrfs (B-tree FS) from Oracle, expected to be the next standard Linux file system. Many concepts are shared with ZFS.
 ReFS: the file system for Windows 8 (from Microsoft). Many concepts are shared with ZFS (too!).
 WAFL (Write Anywhere File Layout) file system from NetApp.
 FUSE (Filesystem in Userspace): a cross-platform library that allows developers to write file systems that run in user mode
FAQ?
Thanks