Implementation and Performance of a DBMS-Based
Filesystem Using Size-Varying Storage Heuristics
by
Eamon Francis Walsh
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degrees of
Bachelor of Science in Computer Science
and
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2003
Copyright 2003 Eamon F. Walsh. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and
distribute publicly paper and electronic copies of this thesis and to
grant others the right to do so.
Author
Department of Electrical Engineering and Computer Science
May 15, 2003

Certified by
Michael Stonebraker
Adjunct Professor, EECS
Thesis Supervisor

Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Theses
Implementation and Performance of a DBMS-Based
Filesystem Using Size-Varying Storage Heuristics
by
Eamon Francis Walsh
Submitted to the Department of Electrical Engineering and Computer Science
on May 21, 2003, in partial fulfillment of the
requirements for the degrees of
Bachelor of Science in Computer Science
and
Master of Engineering in Electrical Engineering and Computer Science
Abstract
This thesis project implemented a file system using a transactional database management system (DBMS) as the storage layer. As a result, the file system provides
atomicity and recoverability guarantees that conventional file systems do not. Current attempts at providing these guarantees have focused on modifying conventional
file systems, but this project has used an existing DBMS to achieve the same ends.
As such, the project is seen as bridging a historical divide between database and
operating system research. Previous work in this area and the motivation for the
project are discussed. The file system implementation is presented, including its external interface, database schema, metadata and data storage methods, and performance enhancements. The DBMS used in the project is the open-source Berkeley DB,
which is provided as an application library rather than a separate, monolithic daemon
program. The implications of this choice are discussed, as well as the data storage
method, which employs geometrically increasing allocation to maximize performance
on files of different sizes. Performance results, which compare the implementation
to the standard Sun UFS filesystem, are presented and analyzed. Finally, future directions are discussed. It is the hope of the author that this project will eventually be published as a kernel module or as an otherwise mountable, fully functional file system.
Thesis Supervisor: Michael Stonebraker
Title: Adjunct Professor, EECS
Acknowledgments
To Mike Stonebraker, for taking me on as his student and providing much-needed
and much-appreciated guidance and direction as the project evolved. The author is
glad to have been a part of database research under Mike and thanks him for the
opportunity.
To Margo Seltzer, for technical support and advice on Berkeley DB, now the
author's favorite DBMS.
To David Karger, for 4 years of academic advising in the Electrical Engineering
and Computer Science department.
Contents

1 Introduction
  1.1 Database Background
    1.1.1 What is a DBMS?
    1.1.2 Transactions, Recovery, and Locking
  1.2 Filesystem Background
    1.2.1 What is a Filesystem?
    1.2.2 Interacting with a Filesystem
    1.2.3 Filesystem Behavior after a Crash
    1.2.4 Example Filesystem: Berkeley FFS
  1.3 Addressing Filesystem Recovery
    1.3.1 Interdisciplinary Research
    1.3.2 Current DBMS-Based Work
    1.3.3 Current Filesystem-Based Work
    1.3.4 Project Goal
    1.3.5 Project Proposal

2 Design
  2.1 Application Considerations
  2.2 Unsuitability of Conventional DBMS
  2.3 The Berkeley Database
  2.4 Proposed Architecture
  2.5 File Data Storage Model
  2.6 File Metadata Storage Model
  2.7 Storage Layer

3 Implementation
  3.1 System Calls Provided
  3.2 Consistency and Durability
  3.3 Configurable Parameters
  3.4 Per-Process State
  3.5 Threads, Concurrency, and Code Re-entrance

4 Performance Testing
  4.1 Test Methodology
  4.2 Baseline Storage Performance
  4.3 Performance with Logging
  4.4 Performance with Locking
  4.5 Analysis

5 Future Directions
  5.1 Moving to the Kernel
  5.2 Use of Multiple Disks
List of Figures
1-1 Filesystem disk allocations
1-2 Proposed DBMS-based filesystem architecture
2-1 Conventional filesystem process architecture
2-2 Conventional DBMS process architecture
2-3 Berkeley DB process architecture
2-4 Proposed SCFS process architecture
2-5 Conventional filesystem data storage model
2-6 SCFS data storage schema
2-7 Original storage layer disk allocation
2-8 Revised storage layer disk allocation
4-1 Baseline performance: 100% read operations
4-2 Baseline performance: 50/50 read/write operations
4-3 Baseline performance: 100% write operations
4-4 Performance with logging: 100% read operations
4-5 Performance with logging: 50/50 read/write operations
4-6 Performance with logging: 100% write operations
4-7 Performance with locking: 100% read operations
4-8 Performance with locking: 50/50 read/write operations
4-9 Performance with locking: 100% write operations
List of Tables
2.1 SCFS metadata fields
3.1 SCFS user-visible API
3.2 SCFS specifications
3.3 Translation layer per-process state
4.1 Average SCFS performance versus filesystem performance
Chapter 1
Introduction
1.1 Database Background
1.1.1 What is a DBMS?
A database management system (DBMS) is software designed to assist in maintaining
and utilizing large collections of data [15]. The first DBMS's stored data using networks of pointers between related data items; this "network" model was application-specific and unwieldy [6]. Most database management systems in use today are "relational", meaning that their data is stored in relations or tables. Some DBMS's are
"object relational:" they allow new data types and operations on those types to be
defined, but otherwise behave like standard relational databases. Still other DBMS's
are "object oriented:" they desert the relational model in favor of close integration
with application programming languages [15]. OODBMS's, as they are called, are
not treated in this thesis.
Relational DBMS's are primarily used for storing highly structured data. Unstructured data, such as text or binary strings, may also be stored through the use of
"binary large objects" (BLOB's) [20], though these types are more awkward to use.
"Semi-structured" data, which can be divided roughly into records but contains large
amounts of unstructured text, is a focus of current DBMS research and development;
many DBMS's have support for text processing and XML, which is a common format
for presenting semi-structured data [21]. Part of the task of this thesis project was to
find a good way of storing (unstructured) file data in a database.
The tables of a relational database are composed of records, which are themselves
composed of fields, each of which stores a piece of data of a fixed type. The records
of a table may be accessed through a sequential scan, but to boost performance,
DBMS's allow the construction of "indices" on tables. Through the use of secondary
structures such as B-trees, indices allow fast access to the records of a table based upon
the value of a field. Beneath the table and index abstractions, a DBMS stores data
in blocks or "pages," and has several subsystems for managing them. One subsystem
is the storage layer, which transfers pages to and from the permanent storage device,
the disk. Another is the "buffer cache," which stores page copies in main memory
to prevent the performance bottleneck associated with the disk, which is generally
orders of magnitude slower than memory.
1.1.2 Transactions, Recovery, and Locking
When a database record is changed, the change is usually written to the record's
page in the buffer cache, but not to the disk. If this were not the case, every change
would require a costly disk operation. However, unlike disk storage, main memory is
"volatile," which means that after a system has crashed, its contents are not recoverable. System crashes are a fact of life; they occur as a result of software or operating
system bugs, hardware failure, power outages, and natural disasters, among other
things. As a result, database changes can be and are lost from the buffer cache as a
result of crashes. A primary function of DBMS's (and a primary focus of database research) is providing atomicity and durability in the face of system crashes and volatile
buffer caches.
Database operations are grouped into "transactions" whose execution in a DBMS is guaranteed to be atomic. "Atomicity" requires that transactions
"commit" to make their changes visible in the database. Interrupted or "aborted"
transactions' are completely undone by the DBMS so that none of their changes are
'Transaction abort may occur as a result of a DBMS condition such as a lock conflict, or as a
result of system crash. Transactions may also abort themselves under normal operation.
reflected. Atomicity ensures database "consistency" even if transactions internally
move the database to an inconsistent state. The rules of consistency are defined by
the user, who must ensure that all transactions preserve those rules when fully applied. The "durability" property guarantees that, once committed, a transaction's
effects are permanent and will not be lost in the event of a system crash [15].
The IBM System R project, an early DBMS implementation, achieved atomicity
through the use of duplicate page copies on disk. This method, called "shadowing,"
updated one page copy at a time, allowing reversion to the other copy if a crash
occurred [12]. However, this method was wasteful and caused disks to become fragmented over time, degrading performance [3]. Presently, the widely accepted method
of providing these guarantees is "logging," which makes use of a small log area on
disk. Records of all changed pages, as well as transaction commits, are appended
to the log immediately, making them durable. The changed pages themselves may
remain in volatile memory. In the event of their loss, the log must be scanned from
the beginning to reconstruct, or "redo" the changes. To prevent the task of recovery
from growing too large, periodic "checkpoints" are performed, which flush the buffer
cache to disk and make an entry in the log. Recovery may then proceed from the
most recent checkpoint entry rather than the beginning of the log. However, because
some transactions may be running during the checkpoint, the (partial) changes they
made, along with any changes flushed from the buffer cache during normal operation,
must be "undone" if the system later crashes before the transactions have committed.
The changes to be undone may be reconstructed from the log by associating changes
with their committed transactions [9].
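The redo/undo mechanism can be made concrete with a small sketch. The code below is a deliberate simplification, not Berkeley DB's actual recovery algorithm (which uses log sequence numbers, checkpoint records, and idempotent page updates); it only shows how a write-ahead log carrying before- and after-images lets committed changes be reapplied and uncommitted ones be rolled back after a crash.

    #include <stdbool.h>
    #include <stddef.h>

    enum rec_type { REC_UPDATE, REC_COMMIT };

    struct log_rec {
        enum rec_type type;
        int  txn;       /* transaction that wrote the record  */
        int  page;      /* page changed (REC_UPDATE only)     */
        long old_val;   /* before-image, used to undo         */
        long new_val;   /* after-image, used to redo          */
    };

    #define MAX_TXN 128

    /* Rebuild page contents after a crash from the surviving log. */
    void recover(const struct log_rec *log, size_t nrec, long *pages)
    {
        bool committed[MAX_TXN] = { false };
        size_t i;

        /* Pass 1: learn which transactions committed before the crash. */
        for (i = 0; i < nrec; i++)
            if (log[i].type == REC_COMMIT)
                committed[log[i].txn] = true;

        /* Pass 2 (redo): reapply committed changes in log order; their
         * pages may never have left the volatile buffer cache. */
        for (i = 0; i < nrec; i++)
            if (log[i].type == REC_UPDATE && committed[log[i].txn])
                pages[log[i].page] = log[i].new_val;

        /* Pass 3 (undo): roll back, in reverse order, changes of transactions
         * that never committed but whose pages may have reached the disk. */
        for (i = nrec; i-- > 0; )
            if (log[i].type == REC_UPDATE && !committed[log[i].txn])
                pages[log[i].page] = log[i].old_val;
    }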
Still another guarantee often required of a DBMS is "isolation," which requires
that concurrently running transactions never interfere with one another [15]. Isolation
would be violated if, for example, one transaction changed a table record which
another transaction was in the process of reading.
To achieve isolation, a "lock
manager" is used to control record (or often, page) access. The lock manager grants
locks to running transactions and enforces their use. Locks come in different types
which can allow for exclusive access (in the case of writes), or shared access (in the
case of reads). Modern lock managers have other types as well to provide maximum
concurrency while maintaining isolation [5].
1.2 Filesystem Background
1.2.1 What is a Filesystem?
Conceptually, a filesystem is an abstraction which presents users of a computer system
with a unified, human-friendly way of viewing and organizing the storage areas (e.g.
disks) available to that computer system. A file is a piece of data which resides on
disk; the filesystem abstraction allows files to be assigned human-readable names and
grouped together in directories. Directories themselves are arranged in a tree with a
root directory at its base. Modern operating systems have a single, universal directory
tree encompassing all storage devices in use; the root directories of each device simply
become branches in the universal tree.
The filesystem abstraction has taken on even more significance as a huge variety
of peripherals and kernel structures, not just disks, have been given file interfaces and
made a part of the directory tree [18]. Modern UNIX operating systems present kernel
information, memory usage tables, and even individual process state as readable files
mounted beneath a special directory such as /proc.
Non-storage devices such as
terminals and line printers are available through character- or block-special file entries
in another directory such as /dev. Even local sockets and pipes, kernel constructions
used for inter-process communication, have place-holder entries in the filesystem.
Operations on these "files", save for a lack of random access, behave exactly as they
would for a normal file.
1.2.2 Interacting with a Filesystem
Files are manipulated using a set of functions provided with the operating system,
referred to as "system calls." Government and industry groups such as POSIX have
endeavored to standardize the form of these calls among operating systems provided
by different vendors [7]. The result has been a core set of system calls for file manipulation which are largely universal. These include open, close, seek, read, and
write for operating on files themselves; as well as calls for reading and manipulating
file and directory metadata, such as stat, unlink, chmod, and mkdir. Use of this
core set nearly always guarantees application portability among standard operating
systems.
In many system calls (notably open), files are identified by their pathname, a
human-readable string identifying the directories (themselves files) whose entries form
a chain leading from the root of the directory tree to the desired file. When a file is
opened, the kernel returns a number called the "file descriptor" which is simply the
index to a table of open files (allowing the application to operate on more than one file
at a time). In other system calls, notably read and write, files are identified by file
descriptor. Both file descriptor and pathname, however, are merely abstractions. The
real mark of a file's existence is its "i-node," a small structure stored on disk which
contains all of the file's metadata (except for its name), and the locations on disk
where the bytes which make up the file are stored. A directory is simply a file which
matches names to i-nodes. Applications, however, need never be aware of them.
At the level of the system calls described above, a plain file appears as an unstructured sequence of bytes. No delimiters are present within or adjacent to the file data.²
The byte sequence is accessed through a cursor point which represents a byte number
in the file and which can be changed using the seek system call. Read and write
operations on the file occur starting at the byte indicated. Read operations do not
advance past the last byte in the sequence, but write operations may replace existing
bytes or extend the sequence with additional bytes3 . It is impossible to reduce the
length of a file, other than by truncating its length to zero.
²The "end-of-file" marker is in fact a construction of the higher-level "Standard I/O" library built atop the core system calls.
³Interestingly, the POSIX standard permits advancement of the cursor point past the end of the file. Even more interestingly, a write may then be performed, leaving a "hole" in the file which is required to appear as zero bytes if subsequently read.
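As a short illustration of the calls named above, the fragment below uses the POSIX interface (lseek is the name under which the "seek" call appears there); the file name is arbitrary. It treats the file as an unstructured byte sequence addressed through the cursor point.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[16];
        ssize_t n;
        int fd = open("/tmp/example", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        write(fd, "hello, world\n", 13);       /* extends the file               */
        lseek(fd, 0, SEEK_SET);                /* move the cursor back to byte 0 */
        n = read(fd, buf, sizeof buf);         /* reads at most the 13 bytes     */
        lseek(fd, 7, SEEK_SET);                /* reposition the cursor          */
        write(fd, "earth\n", 6);               /* overwrites bytes 7 through 12  */

        close(fd);
        return n < 0;
    }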
1.2.3 Filesystem Behavior after a Crash
The most significant aspect of modern filesystems is their highly aggressive caching
behavior [18]. In fact, under normal circumstances, operations on files rarely touch
the physical disk where the files are stored.⁴ Rather, file contents are divided into
pages and brought into main memory by the kernel on the first access; subsequent
operations act on the data pages in memory until they are paged out or flushed back
to disk. This behavior is similar to virtual memory paging; so similar, in fact, that
many operating systems combine the filesystem cache and the virtual memory system
into a single "unified" buffer cache [13]. Not only are file contents cached in memory,
but also their i-nodes. Some kernels even cache commonly used pathnames in memory
in order to avoid the task of searching a directory chain [13].
Main memory is volatile, and its contents are irrecoverably lost after a system
crash. Such events thus have severe implications for filesystems which cache file data
and metadata in memory: any filesystem operation not committed to disk is lost after
a crash. Applications which cannot tolerate such data loss use the fsync system call,
which forces all changes for a given file to disk. Frequent recourse to this system
call ensures consistency, but slows performance. In any case, many applications can
tolerate small amounts of data loss; the amount lost will depend on how frequently
the operating system flushes dirty pages to the disk.
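A minimal sketch of the pattern such applications follow (the function name is illustrative): each write is followed by fsync, which returns only once the file's dirty pages have been forced to disk, at the cost of a disk operation per call.

    #include <sys/types.h>
    #include <unistd.h>

    /* Append a record and force it to stable storage before returning. */
    int append_durably(int fd, const void *rec, size_t len)
    {
        if (write(fd, rec, len) != (ssize_t)len)
            return -1;
        return fsync(fd);   /* 0 only once the data (and metadata) are on disk */
    }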
The true problem with filesystems and system crashes is not one of data loss, but
one of atomicity. The problem arises from the fact that nearly all operating system
buffer caches have no mechanism for flushing a given file's data to the disk in a
coherent manner. In these cases, the replacement algorithm used by the buffer cache has
no understanding of filesystem consistency. For example, following a write operation
to a file, the memory page containing its i-node may be flushed to disk some time
before the page containing its data. If a crash were to occur between these points,
the file's metadata would reflect the write operation, but the data written would be
lost. In the worst case, the write operation could have allocated free disk blocks
⁴Non-random file access, producing locality of reference, is what constitutes "normal circumstances."
to the file, anticipating the arrival of the data. After the crash, the blocks would
remain uninitialized but would be marked as part of the file. The file would then
have undefined contents.
A worse scenario may arise if a filesystem superblock becomes inconsistent after a
crash. A "superblock" is a small region on disk which provides a map of the remaining
space; it records which disk blocks are allocated to i-nodes, which contain file data,
and which are free. A corruption of this map may thus result in significant damage
to the filesystem, including possible total loss. For this reason, filesystems usually
make many copies of the superblock, scattered throughout the storage device [13].
Each backup copy requires updating, however, and occupies disk space which could
otherwise be used for storing files.
1.2.4 Example Filesystem: Berkeley FFS
The original UNIX file system, referred to as s5, was flexible and featureful for its
time, but was known for poor performance and fragility [13].
The Berkeley Fast
Filesystem, referred to as FFS (sometimes ufs as well), was designed to address
the shortcomings of s5 and features several performance optimizations as well as
hardening against on-disk filesystem corruption resulting from a system crash. A
variant of this filesystem type was used as a performance benchmark in this thesis;
this section describes its major features.
FFS employs intelligent allocation of disk areas to reduce seek times on filesystem
operations. To read or write a block from a disk, the disk head must be positioned
over the proper track of the disk. The delay which occurs while waiting for this
positioning is the "seek time". In s5, i-nodes were all placed at the beginning of
the disk, after the superblock. This caused a significant seek delay to occur when
accessing a file on disk, since the disk head always had to be moved to the i-node
region to read the file's data block allocation, then back out to access the data blocks
themselves. FFS addresses this issue by dividing the disk into "cylinder groups", each
of which contains its own superblock copy, i-node area, and data blocks (figure 1-1).
By keeping the i-node and data of a file within a single cylinder group, seek times are
reduced, since cylinder groups are small compared to the entire disk. However, this
method is not perfect. If a cylinder group becomes full, file data must be stored in
other groups, leading to fragmentation and increasing seek time. FFS deals with this
problem in a rather inelegant manner, by simply reducing the amount of disk space
available for use to ensure free space in the cylinder groups. The amount removed
from use is 10% of the disk, by default⁵ [13]. In the days when disks were less than
1GB, setting aside this space was not viewed as grossly wasteful, but today, wasting
10GB of a 100GB disk is a significant loss.
[Figure 1-1: Filesystem disk allocations. The figure compares the s5 and FFS on-disk layouts (relative area sizes not to scale); the legend distinguishes the boot block, superblock, backup superblocks, i-node storage areas, data storage areas, and cylinder groups.]
FFS attempts to prevent loss of consistency following a system crash by flushing sensitive metadata changes directly to disk instead of placing them in the buffer
cache [13]. These changes include operations on directory entries and i-nodes. FFS
also maintains multiple backup copies of the superblock on disk as described in the
previous section. While these features mostly eliminate the risk of filesystem corruption, they make metadata-intensive operations such as removing directory trees or
renaming files very slow. Additionally, these features by themselves do not solve the
atomicity problem; a companion program called fsck is required to check the filesystem following a system crash and undo any partially-completed operations. The fsck
⁵On these systems, the df command is even modified to read 100% usage when the disk is only 90% full!
program must check the entire filesystem since there is no way of deducing beforehand
which files, if any, were being changed during the system crash. File system checks
are thus extremely slow; minutes long on many systems, with the time proportional
to filesystem size. Compounding the problem is the fact that by default, fsck will not
change the filesystem without human confirmation of each individual change. Boot
processes which are supposed to be automatic often drop to a shell prompt because of
this behavior, making the kind of quick recovery and uptime percentages associated
with database systems all but impossible.
1.3 Addressing Filesystem Recovery
1.3.1 Interdisciplinary Research
Since database researchers have long known of methods for attaining atomicity, consistency, isolation, and durability guarantees, one would think that filesystems would
have incorporated these advances as well. However, the transfer of knowledge and
methods between the fields of database research and operating systems research has
proceeded at a very slow pace for decades, hampering the evolution of recoverable
filesystems.
The first DBMS implementations made little use of operating system services,
finding these unoptimized, unconfigurable, or otherwise unsuitable for use. Database
researchers implemented DBMS's as single, monolithic processes containing their own
buffer caches, storage layers, and schedulers to manage threads of operation. This
tradition has persisted even as operating system services have improved, so that even
today DBMS's use operating systems for little more than booting the machine and
communicating through sockets over the network [25]. At the same time, DBMS's
have gained a reputation of poor performance when used for file storage due to their
logging overhead, network or IPC-based interfaces, and highly structured data organization (which is unused by files). These characteristics reflect design decisions made
in implementing database management systems, not the fundamental database theory and methods which underlie them; nevertheless, they have retarded the adoption
of DBMS technology within operating systems or filesystems.
1.3.2 Current DBMS-Based Work
Several projects, both academic and commercial, have implemented file storage using
conventional DBMS instances. However, these implementations are not full filesystems in that they are not part of an operating system kernel. Rather, they are implemented as NFS servers, which run as user-level daemon processes. The Network
File System (NFS) protocol was developed by Sun Microsystems atop its Remote
Procedure Call (RPC) and eXternal Data Representation (XDR) protocols, which
themselves use UDP/IP. NFS has been widely adopted and its third version is an
IETF standard [2]. Operating systems which have client support for NFS can mount
a remotely served filesystem in their directory tree, making it appear local and hiding
all use of network communication. NFS servers, on the other hand, need only implement the 14 RPC calls which comprise the NFS protocol. For this reason, NFS has
been very popular among budding filesystem writers; it eliminates the need to create
kernel modules or otherwise leave the realm of user space in order to produce mountable directories and files. Since conventional DBMS implementations interface over
the network anyway, commercial vendors have also found it convenient to produce
NFS servers to go with their DBMS distributions.
The PostGres File System (PGFS) is an NFS server designed to work with the
PostgreSQL DBMS [1, 8]. It accepts NFS requests and translates them into SQL
queries. PGFS was designed for filesystem version control applications, though it
may be used as a general-purpose filesystem. The DBFS project is another NFS-based filesystem designed for use with the Berkeley DB [24, 22].
Net::FTPServer is an FTP server with support for a DBMS back-end [10]. To a
connected FTP client, the remote system appears as a standard filesystem. On the
server side, Net::FTPServer makes SQL queries into a separately provided DBMS
instance to retrieve file data. The Net::FTPServer database schema provides for a
directory structure and file names, as well as file data. However, the FTP protocol
[14] lacks most filesystem features and in general, FTP services cannot be mounted
as filesystems.
Oracle Files is Oracle's filesystem implementation designed for use with the Oracle
DBMS [16].
In addition to an NFS service, Oracle Files supports other network
protocols such as HTTP, FTP, and Microsoft Windows file sharing. The concept of
Oracle Files, however, is identical to that of PGFS: a middleware layer capable of
translating filesystem requests into SQL queries.
Though NFS client support allows these implementations to be mounted alongside "real" local filesystems, the two cannot be compared. NFS servers are stateless,
maintaining no information about "open" files [2]. Network latency reduces performance considerably and requires client-side caching which makes it possible to read
stale data. For this reason, NFS is not recommended for use in write-intensive applications [13].
1.3.3 Current Filesystem-Based Work
Operating systems researchers have finally begun to incorporate database techniques
in filesystem implementations. The most visible result has been the introduction of
"logging" or "journaled" filesystems. These filesystems maintain a log area similar
to that of a DBMS, where filesystem changes are written sequentially, then slowly
applied to the filesystem itself [13]. The fsck program for these filesystems only has to
scan the log, not the entire filesystem, following a crash. Many journaled filesystems
are simply modified conventional filesystems, such as the Linux third extended file
system ext3, which is built on ext2 [26], and the Sun UFS with Transaction Logging,
which is built atop regular ufs [13]. Both ufs and ext2 are modeled after FFS.
A more comprehensive journaled filesystem project is ReiserFS [17]. Unlike ext3,
ReiserFS is a completely new filesystem written from scratch (ReiserFS is an open-source project; an accompanying venture, Namesys, offers services and support). The authors of ReiserFS are versed in database techniques; their filesystem
implementation uses B-tree indices and dispenses with the traditional i-node allocation and superblock models of conventional filesystems. Version 4 of this project is
still under development as of this writing.
The authors of ReiserFS have determined that database techniques such as logging and B-tree indexing are capable of matching the performance of conventional
filesystems [17]. However, these authors have rejected existing DBMS implementations in favor of their own ground-up reimplementation; this work has been ongoing
for several years. Is not a single one of the many DBMS's already available suitable
for storing file data?
1.3.4 Project Goal
The goal of this project was to find a way to apply existing DBMS technology to the
construction of a local filesystem. NFS-based projects use existing DBMS technology
but rely on the network, act in a stateless manner, and are not full local filesystems.
Filesystem projects such as ReiserFS incorporate DBMS ideas into full local filesystems, but ignore existing DBMS technology while consuming years of programming
effort in starting from scratch.
Database management systems have intelligent buffer caches and storage layers,
which should be comparable to those of operating systems, provided that an intelligent database schema (storage model) is chosen.⁶ Other performance issues, such as the
IPC- or network-based interface normally associated with DBMS's, can be addressed
simply through choice of an appropriate DBMS (not all DBMS's, as will be shown in
the next chapter, are cut from the same mold). With an appropriate choice of DBMS
architecture, the main task of the project is simply the construction of a "translation"
layer to hide the relational and transactional nature of the DBMS from the end user,
while providing instead a standard system call interface which will allow existing
applications to run essentially unchanged atop the DBMS-based filesystem.
The eventual goal of any filesystem is full integration with the operating system
kernel, allowing the filesystem to be locally mounted. This project is no exception,
but in order to focus on the design and implementation of the filesystem architecture
⁶The storage model was originally the primary focus of this thesis, a fact which is reflected in the title.
(rather than the inner workings of the kernel), the scope of this thesis is constrained
to user space. Eventually, a kernel module may be produced, either as a full filesystem
or perhaps simply as a stub module which provides mounting functionality while the
DBMS itself remains in user space.
1.3.5 Project Proposal
Implementing the project goal calls for the construction of a translation layer to
convert filesystem requests into DBMS queries. However, unlike the network-based
filesystems described in section 1.3.3, this translation layer will be local, converting
application system calls, rather than network packets obeying some protocol, into
DBMS queries. As a local filesystem, the DBMS and translation layer will reside
on the same machine as the application processes which make use of them. Finally,
the project will store data on its own raw partition, without relying on an existing
filesystem for storage of any kind.
[Figure 1-2: Proposed DBMS-based filesystem architecture. The application issues standard system calls to a translation layer, which issues database queries to a DBMS providing crash recovery; the DBMS performs block transfers to raw disk storage.]
Figure 1-2 depicts the proposed project architecture at a high level. A DBMS
instance provides ACID (atomicity, consistency, isolation, and durability) guarantees
if necessary, while storing data on a raw disk partition. Application system calls are
received by the translation layer, which implements them through database queries to
the DBMS. The design details of this architecture are the subject of the next chapter.
Chapter 2
Design
2.1 Application Considerations
The foremost concern of the project is to present applications with the same interface
used for "normal" filesystems. On all varieties of the UNIX operating system, which
is the platform used for this project, system calls are implemented as C functions.
They are included in application code through a number of system header files and
linked into the application at compile-time through the standard C library. Since
the C library and system header files are not subject to modification by users, the
project must provide its own header file and library, mirroring the standard system
calls. Applications may then be modified to use the system call replacements and
recompiled with the translation layer library.
Conventional filesystems do not use separate threads of execution. Rather, system calls execute a "trap" which transfers execution from the application to the
kernel (figure 2-1) [13]. After the kernel has performed the requested filesystem operation, execution returns to the application program. This method of control transfer
eliminates the costly overhead of inter-process communication and does not require
thread-safe, synchronized code in either the application or the kernel. However, its
implications for a DBMS-based filesystem project are important: no separate threads
of execution may be maintained apart from the application. Since most popular
DBMS's are implemented as daemon processes, this requirement narrows the field of
available DBMS solutions significantly.
[Figure 2-1: Conventional filesystem process architecture. User processes issue system calls (traps) through the standard C library into kernel space, where the filesystem interface, buffer cache, and storage layer perform block transfers to the raw partition.]
Several pieces of per-process state are associated with file operations. The most
important of these is the file descriptor table, which maps open files to integer tags
used by the application. Besides the tag, file descriptors also include a small structure
which contains, among other things, the offset of the cursor point into the file and
whether the file is open for reading, writing, or both. In a conventional filesystem, the
operating system provides space from its own memory, associating a few pages with
each process to store state information. Fortunately, the use of a linked library allows
for the duplication of this behavior. Library code may simply allocate space for state
information, which will reside in user space. Unfortunately, the presence of this state
information cannot be completely hidden from the application, as is the case when
the operation system manages it. In the conventional case, the operating system
initializes the state information during process creation.
Since our linked library
runs in user space, initialization must be performed by the application process itself,
through a file_init()-type call which has no analog in the standard system call
collection. This requirement is not seen as a significant obstacle, however, and is
tolerated within the scope of this project.
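A sketch of the per-process state the linked library would keep in user space follows. The structure layout and names are illustrative (the fields actually used are summarized later, in table 3.3); the point is that the file descriptor table, cursor offsets, and open modes live in memory the library allocates for itself when the application calls the initialization routine.

    #include <stdbool.h>
    #include <sys/types.h>

    #define SCFS_MAX_OPEN 64            /* illustrative table size */

    struct scfs_fd {                    /* one slot per open file          */
        bool   in_use;
        char  *pathname;                /* key into the metadata table     */
        off_t  cursor;                  /* current cursor point (byte no.) */
        int    open_flags;              /* read, write, or both            */
    };

    struct scfs_process_state {
        struct scfs_fd fd_table[SCFS_MAX_OPEN];
        /* ... handles to the DB environment and databases ... */
    };

    /* Allocated and filled in by the library's init routine, since the
     * operating system will not do it for us at process creation. */
    static struct scfs_process_state scfs_state;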
2.2 Unsuitability of Conventional DBMS
The task now becomes choosing a DBMS which can be used for the project while
adhering to the single-threaded model described above.
Unfortunately, the most
commonly recognized commercial and open-source DBMS's, such as Oracle and PostgreSQL, are not suitable for use. They are implemented as daemon processes which
run independently of application software, often on dedicated machines. The only
way to obtain a single-threaded architecture with such a DBMS would be to somehow build the application code directly into the DBMS, a task which is impossible
for proprietary DBMS's such as Oracle and unwieldy at best for open-source analogs,
which are designed to communicate through sockets or IPC, not through programmatic interfaces (figure 2-2). Furthermore, even if such a construction were possible,
the resulting filesystem would be isolated from use by other applications. The DBMS
buffer cache and storage layer would reside in the private space of the application,
preventing other applications from using them concurrently. Since filesystems were
designed to facilitate data sharing between applications, this restriction is unacceptable. Finally, the inclusion of a complete DBMS instance in each application program
would waste memory and degrade performance.
Even if it were deemed acceptable to run a conventional DBMS instance as a separate process, the overhead associated with inter-process communication (IPC) would
make the project incomparable with conventional filesystems, which suffer no such
overhead. Sockets, the method by which connections are made to most conventional
DBMS's, are kernel-buffered FIFO queues which must be properly synchronized (especially in multi-threaded applications). Sockets also incur performance costs as a
result of the need for a kernel trap to both send and receive data.
[Figure 2-2: Conventional DBMS process architecture. User processes connect over socket IPC to a separate DBMS process whose buffer cache, lock manager, logger, and storage layer manage the raw partition through system calls.]
2.3 The Berkeley Database
The Berkeley Database is an open-source project with an associated venture, Sleepycat Software [22].
"DB," as it is commonly referred to, is a highly configurable
DBMS which is provided as a linked library. DB has a number of characteristics
which set it apart from conventional DBMS implementations and which make it the
DBMS of choice for this project. Pervasive and widely used, DB provides atomicity,
consistency, isolation, and durability guarantees on demand, while taking recourse
to UNIX operating system services in order to avoid the monolithic daemon process
model commonly associated with DBMS instances [23]. Unfortunately, DB lacks an
important subsystem, the storage layer, which requires us to provide one in order to
use a raw device for block storage.
Unlike conventional DBMS's, which are queried through SQL statements sent over
socket IPC, DB has a programmatic interface: a C API. To use DB, applications include a header file and are compiled with a linked library. The DBMS code thus
becomes part of the application program itself and runs together with the application
code in a single thread. However, this does not mean that the DBMS is restricted to
use by one application at a time. The various subsystems of the DBMS (lock manager, logging subsystem, and buffer cache) use "shared memory" to store their state.
Shared memory is another form of IPC, which simply maps a single region of memory
into the address spaces of multiple programs, allowing concurrent access. This form
of "IPC" involves simple memory copying, which is much faster than socket IPC and
comparable to conventional filesystem operations, which copy memory between kernel and user space. The use of shared memory structures by DB allows concurrent
database access and reduces the footprint of each individual application.
[Figure 2-3: Berkeley DB process architecture. The Berkeley DB library (buffer cache, lock manager, logger) is linked into each user process; subsystem state such as the buffer and lock table lives in shared memory, and the filesystem, reached through system calls, is used as the storage layer.]
DB is a relational DBMS, but uses a far simpler relational model than most
DBMS's. DB's lack of SQL support is one reason for this; another is philosophical,
asserting that the application, not the DBMS, should keep track of record structure.
In DB, records consist of only two fields: a "key" and "value." Both are untyped,
and may be of fixed or variable length (everything in DB is, in effect, a binary large
object). Any further record structure is left to the application, which may divide the
value field into appropriate subfields. DB builds indices on the key field for each table,
and allows one or more secondary indices to be defined over a function of the key and
value. DB provides four table types: the btree and hash table types use B-tree and
hash indexing, respectively, and allow variable-length keys and values. The queue
and recno types use fixed-length records and are keyed by record number.
Database queries are performed using a simple function call API. A get function
is used to obtain the value associated with a given key. A put function sets or
adds a key/value pair. A delete function removes a key/value pair. More complex
versions of these functions allow multiple values under a single key; cursor operations
are supported. Transactions are initiated using a start function which returns a
transaction handle to the application. This handle may be passed to get, set, and
delete operations to make them part of the transaction. After these operations, the
application passes the handle to a commit or abort call, terminating its use.
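The call pattern just described looks roughly as follows in the Berkeley DB 4.x C API (the key and value contents are illustrative, and exact method signatures vary slightly between DB releases):

    #include <string.h>
    #include <db.h>

    /* Store one key/value pair inside a transaction. */
    int store_pair(DB_ENV *env, DB *db, const char *path,
                   void *val, size_t len)
    {
        DB_TXN *txn;
        DBT key, data;
        int ret;

        memset(&key, 0, sizeof key);
        memset(&data, 0, sizeof data);
        key.data = (void *)path;         /* key: e.g. an absolute pathname */
        key.size = strlen(path) + 1;
        data.data = val;                 /* value: an untyped byte string  */
        data.size = len;

        if ((ret = env->txn_begin(env, NULL, &txn, 0)) != 0)
            return ret;
        if ((ret = db->put(db, txn, &key, &data, 0)) != 0) {
            txn->abort(txn);             /* undo the partial transaction   */
            return ret;
        }
        return txn->commit(txn, 0);      /* make the change visible        */
    }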
DB provides four subsystems: a lock manager, logging subsystem, buffer cache,
and transaction subsystem. All four subsystems may be enabled or disabled independently (with the exception of the transaction subsystem, which requires logging).
DB defines an "environment," which encompasses a DBMS instance. All applications
which use a given environment must be configured identically (e.g. use the same subsystems), since they all use the same shared memory region. Each subsystem has
configurable parameters, such as the memory pool size (buffer cache) and log file size
(logging subsystem). A large number of boolean flags may be changed to define DB's
behavior, particularly in the transaction subsystem, which may be set to buffer log
entries in memory for a time (guaranteeing consistency), or to flush them immediately
to disk (guaranteeing consistency and durability). The former behavior was employed
in the project, since it more closely mirrors filesystem behavior.
DB relies on the filesystem as its storage layer. Databases and log records are
stored in files, and the fsync system call is used to keep them current on disk. Unfortunately, this behavior is unsuitable for our project; we seek to create a filesystem
alternative, not a construction atop an existing one. We thus require the DBMS to
use a raw storage device. Luckily, DB provides a collection of hooks which can be
used to replace the standard system calls (used for filesystem interaction) with callback functions of our own design. Intended for debugging, this feature allows us to effectively "hijack" the storage layer beneath DB. The project task then becomes the implementation of not one but two translation layers: one to translate application-level system calls to DBMS queries; the second to translate storage-level system calls
to block transfers to and from a raw partition.
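Conceptually, the replacement callbacks look like the sketch below. The callback body and the helpers scfs_raw_fd and scfs_place_block() are hypothetical, standing in for the raw-partition machinery of section 2.7. DB's replacement hooks belong to its db_env_set_func_* family, but the registration call is shown only as a comment because the exact hook signatures differ between DB releases.

    #include <sys/types.h>
    #include <unistd.h>

    extern int scfs_raw_fd;                     /* open raw partition (hypothetical)  */
    extern off_t scfs_place_block(int dbfile);  /* journal-head allocator (sec. 2.7)  */

    /* Handed to DB in place of write(2): DB believes it is writing a page to an
     * ordinary file, but the storage layer places the page on the raw device. */
    ssize_t scfs_store_write(int dbfile, const void *buf, size_t len)
    {
        off_t where = scfs_place_block(dbfile);
        return pwrite(scfs_raw_fd, buf, len, where);
    }

    /* Registration, conceptually:
     *     db_env_set_func_write(scfs_store_write);
     * with similar hooks for open, read, seek, fsync, and close. */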
2.4 Proposed Architecture
The use of Berkeley DB allows the DBMS-based filesystem, hereafter referred to as "SCFS¹" for "Sleepycat file system," to be implemented as a group of C libraries.
When linked with application code, these libraries form a vertical stack, with each
layer using the function call interface of the one below.
Applications use the standard system call interface, with access to the SCFS
versions of those calls obtained by including an SCFS header file. Other than including
this header file, linking with the SCFS library, and calling an initialization routine as
discussed in section 2.1, the application need make no other special considerations to
use the SCFS system instead of a regular filesystem.
The translation layer library is responsible for turning filesystem requests into
DBMS queries as described in section 1.3.5. This layer also contains the per-process
state information described in section 2.1. System call implementations in this layer
make changes to the per-process state as necessary and/or make database queries
through the DB API. The translation layer must adhere to a database schema for
storing file data and metadata. This schema is the subject of sections 2.5 and 2.6.
The storage layer library consists of callback functions which replace the standard
system calls in DB. It is responsible for the layout of blocks on the raw partition and
must emulate standard filesystem behavior to DB. The storage layer is the subject of
section 2.7.
¹The name "DBFS" is already taken (see section 1.3.2).
[Figure 2-4: Proposed SCFS process architecture. Application code calls the translation layer library through the standard syscall interface; the translation layer uses DB's transactional interface; the DB library calls the storage layer library through the standard syscall interface; and the storage layer performs block transfers to the raw partition in kernel space. Shared memory holds the buffer, lock table, and other subsystem state.]
2.5 File Data Storage Model
An intelligent database schema, or choice of tables, record structures, and the relationships between them, is vital to ensure efficient storage of file data. Complicating
the task of choosing a schema for file storage is the fact that modern applications
produce a wide range of file sizes, quantities, and usage patterns. E-mail processing
applications, for example, use thousands of small files which are rapidly created and
removed [11]. World Wide Web browser caches produce the same type of behavior;
the average file size on the World Wide Web is 16KB [4]. At the same time, multimedia applications such as video editing are resulting in larger and larger files, fueling
a continued need for more hard disk capacity. Any file storage model must perform
adequately under both situations, handling small files with minimal overhead while
at the same time supporting files many orders of magnitude larger.
Most conventional filesystems, including FFS, use increasing levels of indirection
in their block allocation scheme.
In FFS, file i-nodes contain space for 12 block
addresses [13]. On a system with an 8KB block size, the first 96KB of a file are thus
accessible through these "direct" blocks; a single lookup is required to find them.
Following the 12 direct block references in a file i-node is an indirect block reference:
34
the address of a block which itself contains addresses of data blocks. File access
beyond the first 96KB thus involves two lookups, though much more data can be
stored before recourse to a "double-indirect" block is required; and so on (figure 2-5).
Note that the amount of space available at each level of indirection is a geometrically
increasing sequence.
[Figure 2-5: Conventional filesystem data storage model. The file i-node holds a small number of direct block address pointers, followed by an indirect block and a double-indirect block, each of which reaches a progressively larger set of data blocks.]
SCFS seeks to duplicate the geometrically increasing storage size model of FFS,
which is seen as desirable because it minimizes overhead at each stage of file growth.
Small files are allocated small amounts of space without significant waste. As files
grow larger, the number of indirections required grows simply as the logarithm of
the file size, allowing very large files to be stored without large amounts of growth in
the bookkeeping mechanism. However, SCFS seeks to eliminate indirect lookups by
storing file data in tables at known offsets.
For data storage, SCFS employs a sequence of n tables with fixed-length records
of size B bytes. The tables are indexed by record number. Each file in the system
is allocated a single row from the first table; the number of that row may thus serve
as a unique identifier for the file.² Once a file has grown larger than B bytes, more
²The analog in conventional filesystems is the i-node number.
space must be allocated from the next table in the sequence. An exponent parameter
k is introduced: k records from the second table are allocated to the file, providing
it with kB additional bytes of space. Once this space has been filled, k²B records
are allocated from the third table, and so forth. Like FFS, this scheme provides
geometrically increasing storage space, but by storing in a file's metadata the record
number in each table where the file's data begins, indirect lookups can be avoided no
matter how large the file grows. Figure 2-6 illustrates the concept with k = 2.
[Figure 2-6: SCFS data storage schema, shown with k = 2. Each record number in Table 1 serves as a file ID and holds one data block; the file with ID 1 continues with k = 2 records in Table 2 and k² = 4 records in Table 3.]
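Because each per-file starting record number is kept in the metadata, locating the table and record that hold a given byte offset is pure arithmetic. The sketch below (not the thesis code; names and types are illustrative) walks the geometrically growing table capacities, assuming the offset lies within the file and that enough data tables exist:

    #include <stdint.h>

    struct scfs_block_addr {
        int      table;    /* which data table (0-based)       */
        uint32_t recno;    /* record number within that table  */
        uint32_t offset;   /* byte offset within that record   */
    };

    /* Map a byte offset to its location, given block size B, exponent k > 1,
     * and recno[], the file's starting record number in each data table. */
    static struct scfs_block_addr
    scfs_locate(uint64_t off, uint32_t B, uint32_t k, const uint32_t recno[])
    {
        struct scfs_block_addr a;
        uint64_t table_bytes = B;      /* table t holds k^t records = k^t * B bytes */
        int      t = 0;

        while (off >= table_bytes) {   /* walk tables until the offset fits */
            off -= table_bytes;
            table_bytes *= k;
            t++;
        }
        a.table  = t;
        a.recno  = recno[t] + (uint32_t)(off / B);  /* no indirect block lookups */
        a.offset = (uint32_t)(off % B);
        return a;
    }

For example, with B = 8 KB and k = 4, ten data tables cover files of several gigabytes, and any offset is resolved with a single record fetch.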
One problem with the SCFS storage model is that as files are deleted, blocks
of records in each data table will be freed, causing fragmentation over time. To
avoid fragmentation, the SCFS schema includes a "free list" table which records the
locations of unused records in each data table. When allocating records from a table,
SCFS will first check the free list to see if any blocks are available. Only if none are
available will additional records be appended to the data table.
2.6 File Metadata Storage Model
The file metadata table is perhaps the most important of the SCFS schema. This table
is keyed by absolute file pathname, so that given the pathname to a file, its metadata
information may be directly looked up in the table.
In conventional filesystems,
pathnames must be broken into individual directories, and each directory must be
scanned to find the entry for the next. While pathname and i-node caching may
ameliorate this task, SCFS eliminates it altogether. Keying by pathname does require,
however, that pathnames be resolved to a normal form, starting from the absolute
root directory (not the current root or working directories) and without any instances
of the placeholder "." or ".." directories.
The value field of the metadata table is a structure containing the usual i-node
fields: file type and size, user and group ownership, permissions, access times, and
preferred block size. In addition, the metadata contains a sequence of record numbers,
one for each data table, which refer to the locations in each table where the file's data
is stored. The full metadata structure is shown in table 2.1.
Table 2.1: SCFS metadata fields

Type         Name      Bytes  Interpretation
db_recno_t   recno[n]  4n     first record number of data (one per data table)
scfs_size    size      8      file size (bytes)
mode_t       mode      4      file mode (see mknod(2))
uid_t        uid       4      user ID of the file's owner
gid_t        gid       4      group ID of the file's group
time_t       atime     4      time of last access
time_t       mtime     4      time of last data modification
time_t       ctime     4      time of last file status change
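Expressed as a C structure, the metadata value stored under each absolute pathname would look roughly like the sketch below (SCFS_NTABLES stands for the schema parameter n, and the typedef for the 8-byte size field is an assumption):

    #include <sys/types.h>
    #include <stdint.h>
    #include <time.h>
    #include <db.h>                     /* db_recno_t */

    #define SCFS_NTABLES 8              /* illustrative value of n */

    typedef uint64_t scfs_size;         /* 8-byte file size, per Table 2.1 */

    struct scfs_metadata {
        db_recno_t recno[SCFS_NTABLES]; /* first record number in each data table */
        scfs_size  size;                /* file size in bytes                     */
        mode_t     mode;                /* file mode, see mknod(2)                */
        uid_t      uid;                 /* owner's user ID                        */
        gid_t      gid;                 /* owner's group ID                       */
        time_t     atime;               /* time of last access                    */
        time_t     mtime;               /* time of last data modification         */
        time_t     ctime;               /* time of last file status change        */
    };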
Directory files are a vital part of conventional filesystem structure since they map
file names to i-node numbers. In SCFS, however, this mapping resides not in directory
files but in the metadata table. Directories in SCFS are maintained to allow traversal
of the directory tree, but lost or corrupted directory files could easily be restored
by scanning the metadata table. For this reason, and for convenience, directories in
SCFS are implemented as regular files. The S_IFDIR bit of the file mode may be
used in the future to restrict random access to directory files, but the consequences
of directory file loss or corruption in SCFS are far less severe than in conventional
filesystems.
The content of an SCFS directory is a sequence of record numbers. Each number
is the reference number of a file, referring to the record number of its entry in the
first table as discussed in the previous section. A secondary index on the metadata
table allows lookups by reference number, allowing directory entries to be mapped to
their pathname and metadata information. Through this index, directories may be
"scanned" to determine their contents, allowing exploration and traversal of directory
trees in SCFS.
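A directory scan under this scheme reduces to reading the directory file as an array of reference numbers and resolving each one through the secondary index. In the sketch below, scfs_read_file() and scfs_lookup_by_refno() are hypothetical helpers standing in for the translation layer's internals, and struct scfs_metadata is the sketch given after Table 2.1:

    #include <stdio.h>
    #include <sys/types.h>
    #include <db.h>

    /* Hypothetical helpers provided by the translation layer. */
    ssize_t scfs_read_file(const char *path, void *buf, size_t len);
    int     scfs_lookup_by_refno(db_recno_t refno, char *path, size_t pathlen,
                                 struct scfs_metadata *md);

    void scfs_list_dir(const char *dirpath)
    {
        db_recno_t refs[256];
        char       pathname[1024];
        struct scfs_metadata md;
        ssize_t    nbytes = scfs_read_file(dirpath, refs, sizeof refs);
        size_t     i, count = nbytes > 0 ? (size_t)nbytes / sizeof refs[0] : 0;

        for (i = 0; i < count; i++) {
            /* Secondary-index lookup: reference number -> pathname + metadata. */
            if (scfs_lookup_by_refno(refs[i], pathname, sizeof pathname, &md) == 0)
                printf("%s  %llu bytes\n", pathname, (unsigned long long)md.size);
        }
    }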
The metadata storage model of SCFS removes much of the indirection associated
with conventional filesystem metadata, but makes the implementation of "hard links"
very difficult.
A hard link is a synonym for a directory entry; in FFS, multiple
directory entries may refer to the same i-node, in effect placing a file in multiple
directories. Removing a file becomes more difficult in the presence of hard links: the
i-node nlinks field becomes necessary to record how many directories reference the
file (the SCFS metadata has no such field). More importantly, keying by pathname
becomes impossible when files may have more than one unique absolute pathname,
which is why hard links are unsupported in SCFS.³ Symbolic links, however, are
possible in SCFS, though not implemented as part of this project.
2.7 Storage Layer
The storage layer receives system call requests from DB; its job is to simulate a
filesystem to DB while actually using a raw device for storage. At first glance, this
layer seems complicated, but the way in which DB uses the filesystem makes its task
³Hard links are generally regarded in filesystems as goto statements are in programming languages, making their absence tolerable.
easier. Databases and log entries are stored in files of known names, which can be
hard-coded into the storage layer, rather than having to provide support for arbitrary
files and directories. In addition, DB performs data reads and writes one page at a
time, so that the I/O operation size is constant. Log writes, however, are arbitrarily
sized.
[Figure 2-7: Original storage layer disk allocation. The raw partition was divided into a log area and a fixed-size area for each data table.]
The original storage layer design divided the raw partition into sections; one for
log entries and one for each database file (figure 2-7). However, this approach was abandoned because the relative sizes of the database files are impossible to predict.
Under small-file-intensive applications, space for the first few data tables would fill
quickly while upper-level data tables sat unused. In contrast, large-file-intensive applications would result in relatively small tables at the lower end of the sequence.
Choosing fixed-size areas for database tables would thus result in large amounts of
wasted space or, worse yet, overflow.⁴
The storage layer was thus moved to a purely journaled format in which blocks
from all database files mingle together on disk. Data writes are performed at the
journal "head," which moves forward until it reaches the end of the disk and wraps
back to the beginning (once this has happened, the head must be advanced to the
next free block, rather than simply writing in place) [19]. Writes which change a data
block rather than appending a new one also invalidate and free the old block. The log, meanwhile, is written to a separate circular buffer. Periodic checkpoints prevent the log buffer from overflowing. Figure 2-8 shows the disk layout under this design. In the example, a block from data table 2 has just been written. The next write that occurs must first advance the insertion head one block to avoid overwriting a block from data table 3.

⁴Overflows were in fact encountered while testing large files, which forced the reevaluation of the storage layer.

[Figure 2-8: Revised storage layer disk allocation. Blocks from the data tables and free space are interleaved on the raw partition, with an insertion head marking the next write position; the log occupies its own area.]
The journaled format makes data writes especially fast because no seeking of
the disk head is generally required; each write follows the next [13, 19]. However,
reading a block in this system requires finding it on disk, which would require a
linear scan in the design so far discussed. Thus, the storage layer includes some in-memory structures which speed the location of blocks on disk. Two "maps" of the
disk area are maintained. One is a simple bitmap which records whether or not a
block is free; this information is used when advancing the head of the journal to the
next free block, and also when invalidating the previous block after the write has
been performed. A second, more detailed map records for each block on disk which
database file it belongs to and at what offset into the file it sits. At 8 bytes to the
block, the overhead associated with these maps is minimal. The maps may also be
stored on disk, preventing the need to reconstruct them on filesystem initialization.
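The C sketch below illustrates one plausible shape for these structures and for the head-advance step described above. The names, sizes, and layout are assumptions made for illustration; they are not taken from the SCFS source.

    /*
     * Hypothetical sketch of the storage layer's in-memory maps and the
     * journal-head advance: a one-bit-per-block free bitmap plus an
     * 8-byte-per-block map recording owner file and page offset.
     */
    #include <stdint.h>

    #define NBLOCKS (1 << 20)            /* blocks in the raw partition */

    struct block_map_entry {
        uint32_t file_id;                /* which database file          */
        uint32_t page_offset;            /* offset of this page in file  */
    };

    static uint8_t                free_map[NBLOCKS / 8];  /* 1 bit/block */
    static struct block_map_entry block_map[NBLOCKS];
    static uint32_t               journal_head;           /* next write  */

    static int block_is_free(uint32_t b)
    {
        return (free_map[b / 8] >> (b % 8)) & 1;
    }

    static void mark_block(uint32_t b, int is_free)
    {
        if (is_free)
            free_map[b / 8] |=  (uint8_t)(1 << (b % 8));
        else
            free_map[b / 8] &= (uint8_t)~(1 << (b % 8));
    }

    /* Advance the insertion head past allocated blocks, wrapping at the
     * end of the partition, so the next write lands on a free block. */
    static uint32_t advance_head(void)
    {
        while (!block_is_free(journal_head))
            journal_head = (journal_head + 1) % NBLOCKS;
        return journal_head;
    }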
Storage layer recoverability is important because DB makes assumptions about
the storage layer (normally a filesystem) which sits beneath it; if those assumptions
are violated, DB will lose its recoverability as well. Raw partitions are not cached in
any way so that reads and writes to them are permanent [13]. However, care must
be taken to ensure that information about which block belongs to which database file
is not lost. The journaled structure of the storage layer causes blocks on disk to be
freed and overwritten as newer blocks are written at the journal head. The disk maps
described earlier keep track of the blocks on disk, but since the maps are stored in
volatile memory, they are subject to loss.
To achieve recoverability, each block on disk is stamped with the identity of its
database file and its offset into the file, as well as the time at which it was written.
This information duplicates that of the disk maps and allows them to be reconstructed
if lost. After a system crash, the storage layer must scan the disk to reconstruct the
maps; after this has happened, DB recovery itself may be run, which will make use
of the database files and the log. If the file, offset, and time information were not
associated with each page by the storage layer, recovery of the storage layer would
be impossible following a crash.
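As a concrete illustration of the per-block stamp and the recovery scan, consider the sketch below. Only the three stamped fields (file identity, offset, write time) come from the text above; the field layout, block size, and raw-device read routine are assumptions, and the read routine is a stub standing in for real device I/O.

    /*
     * Illustrative per-block stamp and crash-recovery scan that rebuilds
     * the in-memory maps before DB recovery is run.
     */
    #include <stdint.h>

    #define SCFS_BLOCK_SIZE 4096

    struct block_stamp {
        uint32_t file_id;       /* which database file the block belongs to */
        uint32_t page_offset;   /* offset of the page within that file      */
        int64_t  write_time;    /* when the block was written               */
    };

    /* Placeholder for the real raw-device read; returns nonzero on failure. */
    static int raw_read_block(uint32_t blkno, struct block_stamp *stamp,
                              char page[SCFS_BLOCK_SIZE])
    {
        (void)blkno; (void)stamp; (void)page;
        return -1;
    }

    /* Scan every block after a crash.  When two blocks claim the same
     * (file, offset), the stamp with the later write_time wins, since the
     * older copy was superseded at the journal head. */
    static void rebuild_maps(uint32_t nblocks)
    {
        struct block_stamp stamp;
        char page[SCFS_BLOCK_SIZE];
        uint32_t b;

        for (b = 0; b < nblocks; b++) {
            if (raw_read_block(b, &stamp, page) != 0)
                continue;              /* unreadable or never written */
            /* record b -> (stamp.file_id, stamp.page_offset) in the block
             * map and mark b allocated in the free bitmap, keeping the
             * newest write_time on conflict */
        }
    }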
Chapter 3
Implementation
3.1
System Calls Provided
Table 3.1 summarizes the user-visible functions which were implemented as part of
the translation layer. Each call mimics the parameters and return value of its
respective system call. Even error behavior is duplicated: an scfs_errno variable is
provided which mimics the standard errno variable (DB supports negative errno
values which indicate DBMS errors; this is the reason why errno was not used itself).
The scfs_init() routine has no analog among the standard system calls since the
operating system takes care of state initialization under a conventional filesystem.
The set of implemented functions reflects the basic set necessary to create and
browse a directory tree, perform file operations, and conduct performance tests. The
lack of an analog for a common system call at this point does not mean that it cannot
be implemented. For example, the chmod, chown, and chgrp family of system calls
was left out simply to save time, and the opendir, readdir, etc. family of calls
may easily be constructed atop getdents. SCFS is restricted in some areas, however.
The lack of support for hard links was mentioned in section 2.6; another example is
the lack of support for pipes, sockets, or special device file types. These file types are
simply kernel placeholders and do not affect the data storage aspect of SCFS, which
is the primary focus.
Table 3.1: SCFS user-visible API

Name            Analog     Purpose
scfs_init       (none)     process state initialization
scfs_mknod      mknod      file creation
scfs_mkdir      mkdir      directory creation
scfs_creat      creat      file creation and open
scfs_open       open       file creation and open
scfs_close      close      close filehandle
scfs_read       read       read from file
scfs_write      write      write to file
scfs_stat       stat       get file metadata
scfs_seek       seek       set file cursor position
scfs_tell       tell       get file cursor position
scfs_unlink     unlink     remove file
scfs_sync       sync       flush buffer cache (checkpoint)
scfs_strerror   strerror   get error information
scfs_perror     perror     print error information
Several other minor aspects of SCFS differ from the UNIX standards, but in
general the project has focused simply on providing file-like storage, not on
conforming to every quirk of POSIX and other standards.
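To make the interface of table 3.1 concrete, the fragment below sketches how an application might drive the translation layer. The header name "scfs.h", the exact prototypes, and the error values returned are assumptions; the calls are only described in this chapter as mimicking their standard counterparts.

    /* Illustrative use of the SCFS translation layer API (assumed header
     * and prototypes). */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include "scfs.h"

    int main(void)
    {
        const char msg[] = "hello from SCFS\n";
        int fd;

        if (scfs_init() < 0) {                 /* per-process state, DB handles */
            scfs_perror("scfs_init");
            return 1;
        }
        if (scfs_mkdir("/demo", 0755) < 0 && scfs_errno != EEXIST)
            scfs_perror("scfs_mkdir");

        fd = scfs_creat("/demo/hello", 0644);  /* create and open */
        if (fd < 0) {
            scfs_perror("scfs_creat");
            return 1;
        }
        if (scfs_write(fd, msg, strlen(msg)) < 0)
            scfs_perror("scfs_write");

        scfs_close(fd);
        scfs_sync();                           /* DB checkpoint: flush to disk */
        return 0;
    }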
3.2
Consistency and Durability
SCFS is intended to ensure filesystem consistency after a crash, but durability, which
incurs a large performance penalty as a result of log flushes on writes, may be disabled.
Use of the DB_TXN_NOSYNC flag in DB causes log writes to be buffered in a memory
area of configurable size; only when the buffer fills are the log entries flushed to disk.
This behavior significantly increases performance and is enabled by default to better
approximate filesystem behavior. A system crash in this case will result in loss of
the log buffer and hence possibly cause the loss of recent filesystem updates, but
the atomicity of log flushes ensures consistency of the filesystem on disk. Regular
checkpointing of the system will minimize this data loss, as filesystems do through
the use of a disk synchronization thread running in the operating system.
An analog to the sync() system call is provided in SCFS. The scfs_sync()
function simply performs a DB checkpoint, ensuring that all updates in memory are
flushed to disk.¹ Of course, SCFS may simply be rebuilt without the DB_TXN_NOSYNC
flag, which would cause all of its filesystem operations to flush immediately.
Applications which make repeated use of fsync() and sync() to achieve the same goal on
regular filesystems would no doubt find this feature of SCFS convenient.

¹ In fact, the system sync() merely schedules the flush, frustrating attempts to compare filesystem syncs with database checkpoints.
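The following sketch shows how this behavior might be configured with the Berkeley DB C API and how scfs_sync() could map onto a checkpoint. The environment home directory, the helper names, and the surrounding SCFS error handling are assumptions; only the DB calls themselves are standard.

    /* Sketch of DB_TXN_NOSYNC configuration and a checkpoint-based sync. */
    #include <db.h>

    static DB_ENV *dbenv;

    int scfs_env_open_sketch(const char *home)
    {
        int ret;

        if ((ret = db_env_create(&dbenv, 0)) != 0)
            return ret;

        /* Buffer log records in memory; flush only when the buffer fills. */
        dbenv->set_flags(dbenv, DB_TXN_NOSYNC, 1);
        dbenv->set_lg_bsize(dbenv, 4 * 1024 * 1024);   /* 4MB log buffer */

        return dbenv->open(dbenv, home,
                           DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG |
                           DB_INIT_MPOOL | DB_INIT_TXN | DB_THREAD, 0);
    }

    /* scfs_sync(): force all buffered updates and log records to disk. */
    int scfs_sync_sketch(void)
    {
        return dbenv->txn_checkpoint(dbenv, 0, 0, 0);
    }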
3.3
Configurable Parameters
This section presents numerical specifications for SCFS, which are summarized in table 3.2. Compared to conventional filesystems, SCFS is highly flexible and supportive
of large files. Care has been taken to ensure 64-bit sizes and offsets internally, though
the standard system calls almost always take 32-bit values for these parameters. The
parameters n, B and k, discussed in section 2.5, are configurable compile-time parameters, currently set to 5, 4000 bytes and 16, respectively.
Table 3.2: SCFS specifications

Parameter                        Value                  Configurable?
Maximum number of files          at most 2³² - 1        No
Maximum number of directories    at most 2³² - 1        No
Maximum filesystem size          nB(2³² - 1)            Yes
Maximum file size                B(kⁿ - 1)/(k - 1)      Yes
Maximum entries per directory    B(kⁿ - 1)/4(k - 1)     Yes
Maximum subdirs per directory    B(kⁿ - 1)/4(k - 1)     Yes
Maximum pathname length          256                    Yes
Maximum open files per process   256                    Yes
Size of buffer cache             24MB                   Yes
Synchronized log flush           not enabled            Yes
Size of log buffer               4MB                    Yes
File size in SCFS is essentially limited by the number of data tables present in
the schema.² If n tables are present, the maximum file size (in bytes) is

\[ \sum_{i=0}^{n-1} B k^{i} \;=\; B \, \frac{k^{n} - 1}{k - 1} . \]
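For illustration (a plain substitution, not an additional specification), plugging the compile-time settings quoted in section 3.3, n = 5, B = 4000 bytes, and k = 16, into this formula gives a per-file maximum of roughly 267 MB:

\[ \sum_{i=0}^{4} 4000 \cdot 16^{i}
   \;=\; 4000 \cdot \frac{16^{5} - 1}{16 - 1}
   \;=\; 4000 \cdot 69{,}905
   \;=\; 279{,}620{,}000 \ \text{bytes} \;\approx\; 267\ \mathrm{MB}. \]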
The number of data tables n is a compile-time parameter which is easily changed.
In addition, the storage layer does not depend on this parameter, so that a raw
partition initially used by an SCFS system with a certain number of tables may be
used unchanged after the number has been modified. Recompilation of all application
code is currently necessary in order to effect this change, however; a move to the kernel
would eliminate this requirement.
Since DB tables support 2³² - 1 records, the maximum number of files which can be
present in SCFS at any one time is at most 2³² - 1. The maximum number decreases,
however, as average file size increases. For example, at most (2³² - 1)/k
(B + 1)-byte files may be present in the system at any one time, since a (B + 1)-byte
file is allocated k records from the second data table. The total size of all files is
more important than the number of files in this respect: an instance of SCFS has
n(2³² - 1) total table records available for file data.
Since directories are implemented as regular files, the maximum number of directory entries per directory is simply one fourth of the maximum file size, since each
directory entry consists of a four-byte file number. SCFS makes no distinction between entries of regular files and entries of subdirectories. The maximum length of
path names is a configurable compile-time parameter, and is only restricted to avoid
complicated pathname parsing code which must dynamically allocate space for temporary buffers. This parameter could be made arbitrary, since DB has no restriction
on key length.
² The absolute limit is imposed by the number of records allowed in a DB table. With B = 4KB, files may grow to approximately 16TB, though only one file may be this big; others may only reach 16/k TB.
3.4
Per-Process State
As described in section 2.1, several pieces of per-process state are maintained by
SCFS. These lie outside of the database itself and are stored in user space, declared
within the code of the translation layer library. Table 3.3 summarizes the per-process
state variables maintained by the translation layer (the value m indicates the size of the
file descriptor table, which is a configurable parameter). These can be divided into
pointers used for database access (referred to by DB as "handles"), the file descriptor
table, and some miscellaneous variables holding the current time and error value. A
separate variable for the current time is kept because of the frequency with which the
time is needed during filesystem metadata operations (e.g. for updating the access,
modification, and status change times). It was found during testing that calls to the
system time routine were slowing performance to a measurable degree.
Table 3.3: Translation layer per-process state

Type         Name              Bytes   Purpose
DB_ENV*      dbenv             4       DB environment handle
DB*          scfs_datadb[n]    4n      DB data table handles
DB*          scfs_statdb       4       DB metadata table handle
DB*          scfs_secdb        4       DB metadata secondary index handle
DB*          scfs_freedb       4       DB free list table handle
db_recno_t   scfs_fd[m]        4m      First record number of open file
scfs_size    scfs_fo[m]        8m      Current byte offset into open file
int          scfs_fm[m]        4m      Mode of open file
char*        scfs_fn[m]        4m      Pathname of open file
time_t       scfs_curtime      4       Current time
int          scfs_errno        4       Error number
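The declarations below sketch how this per-process state might look inside the translation layer library. Identifier spellings beyond those in table 3.3, the scfs_size typedef, and the parameter values are assumptions made for illustration.

    /* Hypothetical per-process state of the translation layer (table 3.3). */
    #include <time.h>
    #include <db.h>

    #define SCFS_NTABLES 5          /* n: number of data tables           */
    #define SCFS_MAXFD   256        /* m: size of file descriptor table   */

    typedef long long scfs_size;    /* 64-bit file sizes and offsets      */

    static DB_ENV *dbenv;                        /* DB environment handle   */
    static DB     *scfs_datadb[SCFS_NTABLES];    /* data table handles      */
    static DB     *scfs_statdb;                  /* metadata table handle   */
    static DB     *scfs_secdb;                   /* secondary index handle  */
    static DB     *scfs_freedb;                  /* free list table handle  */

    static db_recno_t scfs_fd[SCFS_MAXFD];   /* first record no. of open file */
    static scfs_size  scfs_fo[SCFS_MAXFD];   /* current byte offset           */
    static int        scfs_fm[SCFS_MAXFD];   /* open mode                     */
    static char      *scfs_fn[SCFS_MAXFD];   /* pathname of open file         */

    static time_t     scfs_curtime;          /* cached current time           */
    static int        scfs_errno;            /* last SCFS error               */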
3.5
Threads, Concurrency, and Code Re-entrance
SCFS code is designed to support concurrency between processes as well as
multi-threaded processes themselves. The DB_THREAD flag is passed to DB, enabling support
for database access from multiple process threads. Internally, DB's shared-memory
subsystems are thread-safe and support use by multiple concurrent processes. The
more difficult task is ensuring that the translation and storage layer libraries maintain
and preserve this behavior. The translation layer library is inherently process-safe
since each of its instances is associated with a single process. However, it has not
been shown that the per-process state of the translation layer is immune from
corruption resulting from multiple concurrent system calls performed by separate process
threads. Further work must focus on ensuring translation layer robustness in the face of
re-entrant system calls, which may require the use of semaphores or a simple locking
mechanism.³
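One simple form such a locking mechanism could take, sketched here under the assumption that POSIX threads are used, is a single process-wide mutex guarding each translation layer entry point. Nothing like this exists in the current code; the stub stands in for the existing, unsynchronized call body.

    /* Illustrative re-entrancy guard for a translation layer entry point. */
    #include <pthread.h>

    static pthread_mutex_t scfs_state_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stub standing in for the real, unsynchronized scfs_close() body. */
    static int scfs_close_unlocked(int fd)
    {
        (void)fd;
        return 0;
    }

    int scfs_close_guarded(int fd)
    {
        int ret;

        pthread_mutex_lock(&scfs_state_lock);    /* protect per-process state */
        ret = scfs_close_unlocked(fd);
        pthread_mutex_unlock(&scfs_state_lock);
        return ret;
    }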
A more urgent area of concern is the storage layer, which is currently also a per-process entity and which thus prevents the use of the SCFS storage area by more
than one application at a time. Under normal operation, DB uses the filesystem for
storage. Database pages are stored in a file and visible to all DB applications making
use of the shared buffer cache and lock manager. The SCFS storage layer, however,
is not shared between DB applications since each process has its own instance. Since
the storage layer state includes the disk map and journal head, only one running
process may use the raw partition at a time without fear of disk corruption. To allow
concurrent access, the storage layer state should reside in shared memory just as the
DB buffer cache, lock manager, and transaction subsystem do. Until this change
is effected, SCFS is restricted to single-process use. Two methods for sharing the
storage layer are available: using user-space mapped memory as DB does, or moving
the storage layer into kernel space. One or the other must be the focus of future work.
³ Such work is only necessary, of course, if the standard filesystem interface offers a thread-safe guarantee, which may not be the case.
Chapter 4
Performance Testing
4.1
Test Methodology
To gauge the performance of SCFS, it was compared to a conventional filesystem by
constructing and running several test programs. SCFS was developed on a SunBlade
100 workstation with 512MB of memory, running Solaris 8. The ufs filesystem on
this machine was compared to SCFS, which was set up to use a partition on the same
disk as the regular filesystem. Other than the use of forced file synchronization as
will be described, the filesystem was used as-is, with no special configuration (e.g. to
the buffer cache).
Every effort was made to ensure parity between the two systems being compared
in terms of caching. Tests would be meaningless if one system was permitted to cache
file operations in memory indefinitely, while the opposing system was forced to flush
changes to disk periodically. Thus, care was taken to ensure that both SCFS and the
ufs filesystem flushed file data to disk at the same rate while tests, which consisted
of repeated file reads and writes, were being performed. The disk usage of SCFS
is predictable to a good degree of accuracy, since it has a log buffer of fixed size.
The filesystem buffer cache, on the other hand, is part of virtual memory itself and
cannot be "set" to a fixed size, nor can the operating system replacement algorithm
which governs page flushes be modified. Thus, recourse to sync() was taken to force
filesystem flushes during testing. A call to sync() was made each time the number
of bytes written to the filesystem reached the amount which would cause a log flush
in SCFS.
Several test programs were written to exercise SCFS using its file interface. The
same test programs were also run using the ufs filesystem and the results were tabulated and plotted side-by-side. The first test, referred to as the "single file" test,
creates a single file and performs repeated read and write operations on it. The size
of the file is a variable parameter, as well as the ratio of read operations to write operations (the entire file is read or written during each operation). The single file test
was run over a range of file sizes. The number of operations performed was adjusted
as the file size varied so that an equal number of bytes were read/written during each
iteration. Thus, the ideal performance result is a straight line when plotted versus
file size, as is done in subsequent sections.
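A minimal sketch of the single file test's inner loop, as it might be run against ufs with the standard calls, appears below; the SCFS run would substitute the scfs_* analogs of table 3.1. The flush interval, the function name, and the write_every_nth convention (0 = all reads, 1 = all writes, 2 = a 50/50 mix) are illustrative assumptions, not the actual test program.

    /* Hypothetical core of the "single file" test: each operation reads or
     * writes the entire file; sync() is forced whenever the bytes written
     * reach the SCFS log-buffer size, so both systems flush at the same rate. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define FLUSH_INTERVAL (4 * 1024 * 1024)   /* bytes: SCFS log-buffer size */

    void single_file_test(const char *path, size_t file_size,
                          int iterations, int write_every_nth)
    {
        char  *buf = malloc(file_size);
        size_t bytes_since_sync = 0;
        int    fd, i;

        memset(buf, 'x', file_size);
        fd = open(path, O_RDWR | O_CREAT, 0644);

        for (i = 0; i < iterations; i++) {
            lseek(fd, 0, SEEK_SET);
            if (write_every_nth != 0 && i % write_every_nth == 0) {
                write(fd, buf, file_size);            /* whole-file write */
                bytes_since_sync += file_size;
                if (bytes_since_sync >= FLUSH_INTERVAL) {
                    sync();                  /* mirror the SCFS log flush */
                    bytes_since_sync = 0;
                }
            } else {
                read(fd, buf, file_size);             /* whole-file read */
            }
        }
        close(fd);
        free(buf);
    }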
The second test, referred to as the "multiple file" test, creates a set of files and
performs repeated read and write operations, choosing a file at random for each
operation. The size and number of files are both variable parameters, as well as the
proportion of read operations to write operations.
Unlike the single file test, the
multiple file test does not read or write the entire file, but operates on a fixed-size block
within the file, so that the operation size does not change with file size. The multiple
file test was run over a range of file sizes, but the number of files was adjusted to
keep the total size of the set of files constant. A constant number of operations was
performed during each iteration, so that again a straight line is the ideal performance
result when plotted.
Under both tests, the proportion of reads to writes was also varied so that three
sets of results were obtained: results for 100% reads, 100% writes, and a 50/50 mixture
of reads and writes. In subsequent sections, the results are presented as follows: a
separate graph for each of the three read/write proportions, with each graph having
four plots corresponding to the SCFS results (single- and multiple-file) and ufs results
(single- and multiple-file). In addition to the single file and multiple file tests, a multithreaded test was written, but since the SCFS storage layer at present is not in shared
memory as described in section 3.5, the test could not be run using SCFS.
4.2
Baseline Storage Performance
The baseline tests were the first tests run using SCFS with its storage layer on a
raw partition. Neither logging nor locking were enabled in SCFS, approximating
"standard" filesystem behavior by offering neither locking nor consistency guarantees.
The baseline testing results are plotted in figures 4-1, 4-2, and 4-3.
Figure 4-1: Baseline performance: 100% read operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)
In 100% read testing, SCFS performance lags that of the filesystem by 20%-30%
in both tests. Write testing, however, strongly favors SCFS. While SCFS write performance remains relatively flat across the range of file sizes, filesystem performance
decreases dramatically as file size increases, yielding to SCFS about halfway through
the range of file sizes. In addition, SCFS write performance for small files is relatively
close to that of the filesystem, especially in the single-file test where the gap is
extremely small. In general, the baseline results demonstrate the effectiveness of the
storage layer through significantly improved SCFS write performance.
Figure 4-2: Baseline performance: 50/50 read/write operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)
Figure 4-3: Baseline performance: 100% write operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)
The filesystem's performance on the 1MB multi-file test is sharply out of line
with the overall trend. This unexpectedly fast performance is most likely the result
of an internal filesystem threshold. Recall that the multiple-file test operates on a
fixed 8KB record within the file, rather than the entire file as in the single-file test.
The middle 8KB of the file is the record used, and for 1MB files, this record may be
aligned by chance so that operations on it proceed at above average speed. The block
address of the record (in the indirect block), the data block itself, or the page
in memory may be subject to favorable alignment as hypothesized. Alternately, the
increase in file size to 1MB may trigger large-file functionality within the filesystem.
More knowledge of the filesystem's internal behavior is necessary to pin down the
exact cause of this performance deviation.
4.3
Performance with Logging
Following the baseline tests, logging was enabled in SCFS. The use of logging in
SCFS results in a log flush whenever the log buffer fills; this behavior was duplicated
in the filesystem test application by forcing a sync at the same interval, as described
in section 4.1. However, the presence of both log and data operations in SCFS was
expected to reduce its performance margin somewhat from the baseline tests; this is
the behavior which was observed. The testing results with logging enabled are plotted
in figures 4-4, 4-5, and 4-6.
Figure 4-4: Performance with logging: 100% read operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)

Figure 4-5: Performance with logging: 50/50 read/write operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)

Figure 4-6: Performance with logging: 100% write operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)

For reads, the most significant observation is the radical jump in filesystem read
time at a file size of 4MB. The size of the deviation indicates that disk writes are being
performed as a result of the periodic calls to sync(). The most likely explanation for
this behavior is that the operating system buffer cache begins to flush pages to disk
in order to make room for the large number of pages being read. Another possibility
could be the file access time, which is modified after a read operation. It is possible
that for large reads the access time is flushed immediately, though why only large
reads would be treated this way is not known. The access time operation definitely
affects SCFS since a log write is performed as a result of it. SCFS read times fall to
about 1.5 times filesystem read times as a result, except for the filesystem deviation
at 4MB.
As expected, the use of logging causes SCFS performance to suffer on write operations. In the single file test especially, the filesystem gradually overtakes SCFS,
in a reversal of previous behavior. Since the filesystem simply overwrites the file
repeatedly in the cache while log entries still pile up in the SCFS log cache, this
result is not a surprise. For the multiple-file test, SCFS exhibited a much better
performance response, with a flat performance curve compared to the filesystem's
degrading performance. Operating on multiple files mitigates the effect of the log,
since the filesystem cannot cache the entire data set as in the single file test.
4.4
Performance with Locking
The final round of testing enabled both logging and locking in SCFS. In the filesystem, the periodic sync calls were retained to duplicate logging, and file locking was
employed via the flock system call to duplicate locking. The results are plotted in
figures 4-7, 4-8, and 4-9.
Figure 4-7: Performance with locking: 100% read operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)
SCFS read performance degraded slightly in the single-file test as a result of
locking, though the filesystem retained its strange behavior for files 4MB and larger
in the multiple-file test. Write performance remained essentially unchanged from the
previous test.
Figure 4-8: Performance with locking: 50/50 read/write operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)

Figure 4-9: Performance with locking: 100% write operations (elapsed time versus file size, 4KB to 16MB; curves for FS and DB, single- and multi-file tests)

4.5
Analysis

Table 4.1 shows relative SCFS performance versus filesystem performance, averaged
over all the file sizes for each test run. Numbers indicate SCFS speed relative to
filesystem speed; "2.0" indicates SCFS ran twice as fast (averaged over all file sizes)
while "0.8" indicates SCFS ran at only 0.8 times the filesystem's speed (i.e., slower).
Table 4.1: Average SCFS performance versus filesystem performance

Test          Log?   Lock?   100% W   100% R   50/50 R/W   Total Avg.
multi-file    no     no      1.7      0.8      1.1         1.2
single-file   no     no      1.9      0.7      1.4         1.3
multi-file    yes    no      1.4      0.8      1.3         1.2
single-file   yes    no      0.9      1.4      1.0         1.1
multi-file    yes    yes     1.3      0.6      1.2         1.0
single-file   yes    yes     0.8      1.0      0.9         0.9
Several trends are visible from the data. SCFS is generally slower on reads; the
only tests where SCFS read performance matched or exceeded that of the filesystem
were the two tests where the filesystem exhibited a significant performance deviation
on the 4MB and 8MB file sizes. These two data points alone are responsible for the
favorable average; removing them would produce averages similar to the other tests,
as the graphs indicate. In general, the poor read performance can be attributed to
the access time metadata variable, which must be changed on every read and thus
causes a log write in SCFS which is unmatched in the filesystem.
For writes, on the other hand, SCFS performs very well compared to the filesystem.
The SCFS database schema and the SCFS storage layer were designed to achieve
precisely this result by keeping file data as clustered as possible in DB tables and
on disk, respectively. This is accomplished through a lack of indirect blocks and a
journal-structured storage layout. The use of logging does have a significant effect on
SCFS performance, however. This is seen in the performance drop from the baseline
tests, which is about 25% for the multi-file test and nearly 50% for the single-file
test (where logging is more of an encumbrance).
The observed performance drop
represents the cost of consistency in SCFS.
SCFS with logging and locking enabled is the most interesting case in terms of
DBMS functionality. The scores of 1.0 (equal performance) and 0.9 (SCFS 10%
slower) in table 4.1 indicate that SCFS performance in this case slightly trails that
of ufs. In reality, however, the average performance would depend on the ratio of
reads to writes. Read-intensive applications would experience the observed SCFS
performance deficit of 20%-30% on read operations. Write-intensive applications,
on the other hand, would benefit from the good write performance of SCFS. In the
50/50 case, with logging and locking enabled, SCFS and ufs have nearly equal
performance. This result shows that, when used with an intelligent schema and storage
layer, existing DBMS technology is capable of supporting a filesystem abstraction
with minimal performance loss in exchange for the added value of DBMS locking and
consistency.
Chapter 5
Future Directions
5.1
Moving to the Kernel
Real filesystems are accessed through the standard system interface and may be
mounted in the directory tree. In its current form, SCFS satisfies neither of these
criteria, since it cannot be mounted and its interface, though a duplication of the
standard, resides in a separate user-level library. This does not prevent application
programs from using SCFS with minimal modification to their source code. However,
the ultimate goal should be to achieve a seamless mount, which will require some
degree of cooperation between SCFS and the operating system kernel. At the minimum, the storage layer portion of SCFS must be moved to the kernel to allow sharing
among user-space applications. Currently, a separate storage layer is created for each
process as described in section 3.5. Though sharing could be achieved by moving
the storage layer into shared memory alongside the DBMS buffer cache and other
subsystems, the same ends could be achieved simply by moving it into the kernel.
Conventional filesystem buffer caches are part of operating system virtual memory.
Three choices are consequently available for SCFS: forego the DB buffer cache
in favor of the operating system's; somehow combine the two; or maintain the DB
buffer cache as a separate entity devoted entirely to the filesystem. The first two
options may be impossible because of the relationship between the DB buffer pool
and its transaction and locking subsystems. The third option is more appealing and
suggests a possible course for the integration of SCFS with the kernel: rather than
attempting to move everything into kernel space, the DBMS itself could be maintained
as a separate process running in user space. The kernel portion of the project, then,
would consist simply of mounting functionality and integration with the standard
libraries. The difficulty of implementing such an architecture is currently unknown; more understanding of kernel module development is necessary.
5.2
Use of Multiple Disks
In most DBMS installations (and in some of the newer journaling filesystems), increased performance is achieved by storing the log on a separate disk from the data.
This provides more locality of reference to both systems by eliminating the need for
constant disk seeks between data and logging areas [13]. In SCFS as tested in this
project, both log and data were kept on the same disk partition. Future testing may
explore the performance gain, if any, resulting from use of separate log and data
disks. The use of separate disks may speed logging performance to the point where
buffering of log entries in memory is unnecessary. The resulting SCFS instance would
provide durability in addition to atomicity and isolation. Testing of performance
under durability was not performed as part of this project.
Bibliography

[1] B. Bartholomew. PGFS: The Postgres file system. Linux Journal, 42, October 1997. Specialized Systems Consultants.

[2] B. Callaghan, B. Pawlowski, and B. Staubach. NFS version 3 protocol specification. RFC 1813, IETF Network Working Group, June 1995.

[3] D. Chamberlin et al. A history and evaluation of System R. Communications of the ACM, 24(10):632-646, 1981.

[4] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul. Rate of change and other metrics: A live study of the world wide web. USENIX Symposium on Internet Technologies and Systems, 1997.

[5] J. Gray et al. Granularity of locks and degrees of consistency in a shared data base. IFIP Working Conference on Modelling of Data Base Management Systems, pages 1-29, 1977. AFIPS Press.

[6] Database Task Group. April 1971 report. ACM, 1971.

[7] Open Group. The Single UNIX Specification, version 3. URL: http://www.unix-systems.org/version3, January 2002.

[8] PostgreSQL Global Development Group. PostgreSQL: The most advanced open-source database system in the world. URL: http://advocacy.postgresql.org, October 2002.

[9] T. Haerder and A. Reuter. Principles of transaction-oriented database recovery. Computing Surveys, 15(4):287-317, 1983.

[10] Richard Jones. Net::FTPServer. Freshmeat Projects Library, December 2002.

[11] Jeffrey Katcher. PostMark: A new file system benchmark. Technical Report 3022, Network Appliance Library, 1997.

[12] M. Astrahan et al. System R: Relational approach to database management. ACM Transactions on Database Systems, 1(2):97-137, 1976.

[13] Amir H. Majidimehr. Optimizing UNIX for Performance. Prentice Hall, Upper Saddle River, NJ, 1st edition, 1996.

[14] J. Postel and J. Reynolds. File transfer protocol (FTP). RFC 959, IETF Network Working Group, October 1985.

[15] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw Hill, 2nd edition, 2000.

[16] H. R. Rasmussen. Oracle files: File server consolidation. White paper, Oracle Corporation, Redwood Shores, California, June 2002.

[17] Hans Reiser. Reiser 4. Technical report, Naming System Venture, June 2003.

[18] D. M. Ritchie and K. Thompson. The UNIX time-sharing system. Communications of the ACM, 17(7):365-375, July 1974.

[19] M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):36-52, February 1992.

[20] A. Silberschatz, M. Stonebraker, and J. Ullman. Database systems: Achievements and opportunities. Communications of the ACM, 34(10):110-120, 1991.

[21] A. Silberschatz, S. Zdonik, et al. Strategic directions in database systems: breaking out of the box. ACM Computing Surveys, 28(4):764-788, 1996.

[22] Sleepycat Software. Berkeley DB products. URL: http://www.sleepycat.com/products, 2003.

[23] Sleepycat Software. Diverse needs, database choices - meeting specific data management tasks. Technical article, Sleepycat Software, Inc., Lincoln, MA, April 2003.

[24] Lex Stein. The bdbfs NFS3 fileserver. URL: http://eecs.harvard.edu/~stein/bdbfs, 2002.

[25] M. Stonebraker. Operating system support for database management. Communications of the ACM, 24(7):412-418, 1981.

[26] Stephen C. Tweedie. Journaling the Linux ext2fs filesystem. Proc. Linux Expo, 1998.