Implementation and Performance of a DBMS-Based Filesystem Using Size-Varying Storage Heuristics

by Eamon Francis Walsh

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degrees of Bachelor of Science in Computer Science and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, May 2003.

Copyright 2003 Eamon F. Walsh. All rights reserved. The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 15, 2003
Certified by: Michael Stonebraker, Adjunct Professor, EECS, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Implementation and Performance of a DBMS-Based Filesystem Using Size-Varying Storage Heuristics

by Eamon Francis Walsh

Submitted to the Department of Electrical Engineering and Computer Science on May 21, 2003, in partial fulfillment of the requirements for the degrees of Bachelor of Science in Computer Science and Master of Engineering in Electrical Engineering and Computer Science

Abstract

This thesis project implemented a file system using a transactional database management system (DBMS) as the storage layer. As a result, the file system provides atomicity and recoverability guarantees that conventional file systems do not. Current attempts at providing these guarantees have focused on modifying conventional file systems, but this project has used an existing DBMS to achieve the same ends. As such, the project is seen as bridging a historical divide between database and operating system research. Previous work in this area and the motivation for the project are discussed. The file system implementation is presented, including its external interface, database schema, metadata and data storage methods, and performance enhancements. The DBMS used in the project is the open-source Berkeley DB, which is provided as an application library rather than a separate, monolithic daemon program. The implications of this choice are discussed, as well as the data storage method, which employs geometrically increasing allocation to maximize performance on files of different sizes. Performance results, which compare the implementation to the standard Sun UFS filesystem, are presented and analyzed. Finally, future directions are discussed. It is the hope of the author that this project is eventually published as a kernel module or otherwise mountable and fully functional file system.

Thesis Supervisor: Michael Stonebraker
Title: Adjunct Professor, EECS

Acknowledgments

To Mike Stonebraker, for taking me on as his student and providing much-needed and much-appreciated guidance and direction as the project evolved. The author is glad to have been a part of database research under Mike and thanks him for the opportunity.

To Margo Seltzer, for technical support and advice on Berkeley DB, now the author's favorite DBMS.

To David Karger, for 4 years of academic advising in the Electrical Engineering and Computer Science department.

Contents

1 Introduction
  1.1 Database Background
    1.1.1 What is a DBMS?
    1.1.2 Transactions, Recovery, and Locking
  1.2 Filesystem Background
    1.2.1 What is a Filesystem?
    1.2.2 Interacting with a Filesystem
    1.2.3 Filesystem Behavior after a Crash
    1.2.4 Example Filesystem: Berkeley FFS
  1.3 Addressing Filesystem Recovery
    1.3.1 Interdisciplinary Research
    1.3.2 Current DBMS-Based Work
    1.3.3 Current Filesystem-Based Work
    1.3.4 Project Goal
    1.3.5 Project Proposal

2 Design
  2.1 Application Considerations
  2.2 Unsuitability of Conventional DBMS
  2.3 The Berkeley Database
  2.4 Proposed Architecture
  2.5 File Data Storage Model
  2.6 File Metadata Storage Model
  2.7 Storage Layer

3 Implementation
  3.1 System Calls Provided
  3.2 Consistency and Durability
  3.3 Configurable Parameters
  3.4 Per-Process State
  3.5 Threads, Concurrency, and Code Re-entrance

4 Performance Testing
  4.1 Test Methodology
  4.2 Baseline Storage Performance
  4.3 Performance with Logging
  4.4 Performance with Locking
  4.5 Analysis

5 Future Directions
  5.1 Moving to the Kernel
  5.2 Use of Multiple Disks

List of Figures

1-1 Filesystem disk allocations
1-2 Proposed DBMS-based filesystem architecture
2-1 Conventional filesystem process architecture
2-2 Conventional DBMS process architecture
2-3 Berkeley DB process architecture
2-4 Proposed SCFS process architecture
2-5 Conventional filesystem data storage model
2-6 SCFS data storage schema
2-7 Original storage layer disk allocation
2-8 Revised storage layer disk allocation
4-1 Baseline performance: 100% read operations
4-2 Baseline performance: 50/50 read/write operations
4-3 Baseline performance: 100% write operations
4-4 Performance with logging: 100% read operations
4-5 Performance with logging: 50/50 read/write operations
4-6 Performance with logging: 100% write operations
4-7 Performance with locking: 100% read operations
4-8 Performance with locking: 50/50 read/write operations
4-9 Performance with locking: 100% write operations

List of Tables

2.1 SCFS metadata fields
3.1 SCFS user-visible API
3.2 SCFS specifications
3.3 Translation layer per-process state
4.1 Average SCFS performance versus filesystem performance

Chapter 1

Introduction

1.1 Database Background

1.1.1 What is a DBMS?

A database management system (DBMS) is software designed to assist in maintaining and utilizing large collections of data [15]. The first DBMS's stored data using networks of pointers between related data items; this "network" model was application-specific and unwieldy [6]. Most database management systems in use today are "relational," meaning that their data is stored in relations or tables. Some DBMS's are "object relational:" they allow new data types and operations on those types to be defined, but otherwise behave like standard relational databases. Still other DBMS's are "object oriented:" they desert the relational model in favor of close integration with application programming languages [15]. OODBMS's, as they are called, are not treated in this thesis.

Relational DBMS's are primarily used for storing highly structured data. Unstructured data, such as text or binary strings, may also be stored through the use of "binary large objects" (BLOB's) [20], though these types are more awkward to use. "Semi-structured" data, which can be divided roughly into records but contains large amounts of unstructured text, is a focus of current DBMS research and development; many DBMS's have support for text processing and XML, which is a common format for presenting semi-structured data [21]. Part of the task of this thesis project was to find a good way of storing (unstructured) file data in a database.

The tables of a relational database are composed of records, which are themselves composed of fields, each of which stores a piece of data of a fixed type. The records of a table may be accessed through a sequential scan, but to boost performance, DBMS's allow the construction of "indices" on tables. Through the use of secondary structures such as B-trees, indices allow fast access to the records of a table based upon the value of a field. Beneath the table and index abstractions, a DBMS stores data in blocks or "pages," and has several subsystems for managing them. One subsystem is the storage layer, which transfers pages to and from the permanent storage device, the disk. Another is the "buffer cache," which stores page copies in main memory to prevent the performance bottleneck associated with the disk, which is generally orders of magnitude slower than memory.

1.1.2 Transactions, Recovery, and Locking

When a database record is changed, the change is usually written to the record's page in the buffer cache, but not to the disk. If this were not the case, every change would require a costly disk operation. However, unlike disk storage, main memory is "volatile," which means that after a system has crashed, its contents are not recoverable. System crashes are a fact of life; they occur as a result of software or operating system bugs, hardware failure, power outages, and natural disasters, among other things. As a result, database changes can be and are lost from the buffer cache as a result of crashes.

A primary function of DBMS's (and a primary focus of database research) is providing atomicity and durability in the face of system crashes and volatile buffer caches. Database operations are grouped into "transactions" whose execution in a DBMS is guaranteed to be atomic. "Atomicity" requires that transactions "commit" to make their changes visible in the database. Interrupted or "aborted" transactions are completely undone by the DBMS so that none of their changes are reflected. (Transaction abort may occur as a result of a DBMS condition such as a lock conflict, or as a result of a system crash. Transactions may also abort themselves under normal operation.) Atomicity ensures database "consistency" even if transactions internally move the database to an inconsistent state. The rules of consistency are defined by the user, who must ensure that all transactions preserve those rules when fully applied. The "durability" property guarantees that, once committed, a transaction's effects are permanent and will not be lost in the event of a system crash [15].

The IBM System R project, an early DBMS implementation, achieved atomicity through the use of duplicate page copies on disk. This method, called "shadowing," updated one page copy at a time, allowing reversion to the other copy if a crash occurred [12]. However, this method was wasteful and caused disks to become fragmented over time, degrading performance [3]. Presently, the widely accepted method of providing these guarantees is "logging," which makes use of a small log area on disk. Records of all changed pages, as well as transaction commits, are appended to the log immediately, making them durable. The changed pages themselves may remain in volatile memory. In the event of their loss, the log must be scanned from the beginning to reconstruct, or "redo," the changes. To prevent the task of recovery from growing too large, periodic "checkpoints" are performed, which flush the buffer cache to disk and make an entry in the log. Recovery may then proceed from the most recent checkpoint entry rather than the beginning of the log. However, because some transactions may be running during the checkpoint, the (partial) changes they made, along with any changes flushed from the buffer cache during normal operation, must be "undone" if the system later crashes before the transactions have committed. The changes to be undone may be reconstructed from the log by associating changes with their committed transactions [9].

Still another guarantee often required of a DBMS is "isolation," which requires that concurrently running transactions never interfere with one another [15]. Isolation would be violated if, for example, one transaction changed a table record which another transaction was in the process of reading. To achieve isolation, a "lock manager" is used to control record (or often, page) access. The lock manager grants locks to running transactions and enforces their use. Locks come in different types which can allow for exclusive access (in the case of writes), or shared access (in the case of reads). Modern lock managers have other types as well to provide maximum concurrency while maintaining isolation [5].

1.2 Filesystem Background

1.2.1 What is a Filesystem?

Conceptually, a filesystem is an abstraction which presents users of a computer system with a unified, human-friendly way of viewing and organizing the storage areas (e.g. disks) available to that computer system.
A file is a piece of data which resides on disk; the filesystem abstraction allows files to be assigned human-readable names and grouped together in directories. Directories themselves are arranged in a tree with a root directory at its base. Modern operating systems have a single, universal directory tree encompassing all storage devices in use; the root directories of each device simply become branches in the universal tree.

The filesystem abstraction has taken on even more significance as a huge variety of peripherals and kernel structures, not just disks, have been given file interfaces and made a part of the directory tree [18]. Modern UNIX operating systems present kernel information, memory usage tables, and even individual process state as readable files mounted beneath a special directory such as /proc. Non-storage devices such as terminals and line printers are available through character- or block-special file entries in another directory such as /dev. Even local sockets and pipes, kernel constructions used for inter-process communication, have place-holder entries in the filesystem. Operations on these "files", save for a lack of random access, behave exactly as they would for a normal file.

1.2.2 Interacting with a Filesystem

Files are manipulated using a set of functions provided with the operating system, referred to as "system calls." Government and industry groups, through standards efforts such as POSIX, have endeavored to standardize the form of these calls among operating systems provided by different vendors [7]. The result has been a core set of system calls for file manipulation which are largely universal. These include open, close, seek, read, and write for operating on files themselves, as well as calls for reading and manipulating file and directory metadata, such as stat, unlink, chmod, and mkdir. Use of this core set nearly always guarantees application portability among standard operating systems.

In many system calls (notably open), files are identified by their pathname, a human-readable string identifying the directories (themselves files) whose entries form a chain leading from the root of the directory tree to the desired file. When a file is opened, the kernel returns a number called the "file descriptor" which is simply the index to a table of open files (allowing the application to operate on more than one file at a time). In other system calls, notably read and write, files are identified by file descriptor. Both file descriptor and pathname, however, are merely abstractions. The real mark of a file's existence is its "i-node," a small structure stored on disk which contains all of the file's metadata (except for its name), and the locations on disk where the bytes which make up the file are stored. A directory is simply a file which matches names to i-nodes. Applications, however, need never be aware of them.

At the level of the system calls described above, a plain file appears as an unstructured sequence of bytes. No delimiters are present within or adjacent to the file data. (The "end-of-file" marker is in fact a construction of the higher-level "Standard I/O" library built atop the core system calls.) The byte sequence is accessed through a cursor point which represents a byte number in the file and which can be changed using the seek system call. Read and write operations on the file occur starting at the byte indicated. Read operations do not advance past the last byte in the sequence, but write operations may replace existing bytes or extend the sequence with additional bytes. (Interestingly, the POSIX standard permits advancement of the cursor point past the end of the file. Even more interestingly, a write may then be performed, leaving a "hole" in the file which is required to appear as zero bytes if subsequently read.) It is impossible to reduce the length of a file, other than by truncating its length to zero.

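The fragment below illustrates this core interface. It is a generic POSIX sketch, not code from this project: a file is created and opened, written, repositioned with the seek call, and read back through its file descriptor.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int example(void)
    {
        char buf[6];

        /* open (creating if necessary) returns a file descriptor */
        int fd = open("/tmp/example", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return -1;

        /* write six bytes starting at the cursor position (offset 0) */
        if (write(fd, "hello\n", 6) != 6) {
            close(fd);
            return -1;
        }

        /* move the cursor back to the beginning of the file ...      */
        lseek(fd, 0, SEEK_SET);

        /* ... and read the same bytes back through the descriptor    */
        if (read(fd, buf, sizeof(buf)) != 6) {
            close(fd);
            return -1;
        }

        close(fd);
        return memcmp(buf, "hello\n", 6) == 0 ? 0 : -1;
    }
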
1.2.3 Filesystem Behavior after a Crash

The most significant aspect of modern filesystems is their highly aggressive caching behavior [18]. In fact, under normal circumstances, operations on files rarely touch the physical disk where the files are stored. (Non-random file access, producing locality of reference, is what constitutes "normal circumstances.") Rather, file contents are divided into pages and brought into main memory by the kernel on the first access; subsequent operations act on the data pages in memory until they are paged out or flushed back to disk. This behavior is similar to virtual memory paging; so similar, in fact, that many operating systems combine the filesystem cache and the virtual memory system into a single "unified" buffer cache [13]. Not only are file contents cached in memory, but also their i-nodes. Some kernels even cache commonly used pathnames in memory in order to avoid the task of searching a directory chain [13].

Main memory is volatile, and its contents are irrecoverably lost after a system crash. Such events thus have severe implications for filesystems which cache file data and metadata in memory: any filesystem operation not committed to disk is lost after a crash. Applications which cannot tolerate such data loss use the fsync system call, which forces all changes for a given file to disk. Frequent recourse to this system call ensures consistency, but slows performance. In any case, many applications can tolerate small amounts of data loss; the amount lost will depend on how frequently the operating system flushes dirty pages to the disk.

The true problem with filesystems and system crashes is not one of data loss, but one of atomicity. The problem arises from the fact that nearly all operating system buffer caches have no mechanism for flushing a given file's data to the disk in a coherent manner. In these cases, the replacement algorithm used by the buffer cache has no understanding of filesystem consistency. For example, following a write operation to a file, the memory page containing its i-node may be flushed to disk some time before the page containing its data. If a crash were to occur between these points, the file's metadata would reflect the write operation, but the data written would be lost. In the worst case, the write operation could have allocated free disk blocks to the file, anticipating the arrival of the data. After the crash, the blocks would remain uninitialized but would be marked as part of the file. The file would then have undefined contents.

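As a generic illustration of the fsync pattern mentioned above (standard POSIX usage, not project code), an application that cannot afford to lose an update forces it to disk before proceeding:

    #include <sys/types.h>
    #include <unistd.h>

    /* Append a record and do not return until it has reached the disk.
       Without the fsync call, the data could sit in the buffer cache
       indefinitely and be lost in a crash.  (Partial writes are treated
       as errors here for brevity.) */
    int append_durably(int fd, const void *rec, size_t len)
    {
        if (write(fd, rec, len) != (ssize_t)len)
            return -1;
        return fsync(fd);   /* flush this file's dirty pages to disk */
    }
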
A worse scenario may arise if a filesystem superblock becomes inconsistent after a crash. A "superblock" is a small region on disk which provides a map of the remaining space; it records which disk blocks are allocated to i-nodes, which contain file data, and which are free. A corruption of this map may thus result in significant damage to the filesystem, including possible total loss. For this reason, filesystems usually make many copies of the superblock, scattered throughout the storage device [13]. Each backup copy requires updating, however, and occupies disk space which could otherwise be used for storing files.

1.2.4 Example Filesystem: Berkeley FFS

The original UNIX file system, referred to as s5, was flexible and featureful for its time, but was known for poor performance and fragility [13]. The Berkeley Fast Filesystem, referred to as FFS (sometimes ufs as well), was designed to address the shortcomings of s5 and features several performance optimizations as well as hardening against on-disk filesystem corruption resulting from a system crash. A variant of this filesystem type was used as a performance benchmark in this thesis; this section describes its major features.

FFS employs intelligent allocation of disk areas to reduce seek times on filesystem operations. To read or write a block from a disk, the disk head must be positioned over the proper track of the disk. The delay which occurs while waiting for this positioning is the "seek time." In s5, i-nodes were all placed at the beginning of the disk, after the superblock. This caused a significant seek delay to occur when accessing a file on disk, since the disk head always had to be moved to the i-node region to read the file's data block allocation, then back out to access the data blocks themselves. FFS addresses this issue by dividing the disk into "cylinder groups," each of which contains its own superblock copy, i-node area, and data blocks (figure 1-1). By keeping the i-node and data of a file within a single cylinder group, seek times are reduced, since cylinder groups are small compared to the entire disk. However, this method is not perfect. If a cylinder group becomes full, file data must be stored in other groups, leading to fragmentation and increasing seek time. FFS deals with this problem in a rather inelegant manner, by simply reducing the amount of disk space available for use to ensure free space in the cylinder groups. The amount removed from use is 10% of the disk, by default [13]. (On these systems, the df command is even modified to read 100% usage when the disk is only 90% full!) In the days when disks were less than 1GB, setting aside this space was not viewed as grossly wasteful, but today, wasting 10GB of a 100GB disk is a significant loss.

Figure 1-1: Filesystem disk allocations (s5 layout versus FFS cylinder groups; key: boot block, superblock, backup superblock, i-node storage area, data storage area; relative area sizes not to scale)

FFS attempts to prevent loss of consistency following a system crash by flushing sensitive metadata changes directly to disk instead of placing them in the buffer cache [13]. These changes include operations on directory entries and i-nodes. FFS also maintains multiple backup copies of the superblock on disk as described in the previous section. While these features mostly eliminate the risk of filesystem corruption, they make metadata-intensive operations such as removing directory trees or renaming files very slow. Additionally, these features by themselves do not solve the atomicity problem; a companion program called fsck is required to check the filesystem following a system crash and undo any partially-completed operations. The fsck program must check the entire filesystem since there is no way of deducing beforehand which files, if any, were being changed during the system crash. File system checks are thus extremely slow; minutes long on many systems, with the time proportional to filesystem size.
Compounding the problem is the fact that by default, fsck will not change the filesystem without human confirmation of each individual change. Boot processes which are supposed to be automatic often drop to a shell prompt because of this behavior, making the kind of quick recovery and uptime percentages associated with database systems all but impossible.

1.3 Addressing Filesystem Recovery

1.3.1 Interdisciplinary Research

Since database researchers have long known of methods for attaining atomicity, consistency, isolation, and durability guarantees, one would think that filesystems would have incorporated these advances as well. However, the transfer of knowledge and methods between the fields of database research and operating systems research has proceeded at a very slow pace for decades, hampering the evolution of recoverable filesystems. The first DBMS implementations made little use of operating system services, finding these unoptimized, unconfigurable, or otherwise unsuitable for use. Database researchers implemented DBMS's as single, monolithic processes containing their own buffer caches, storage layers, and schedulers to manage threads of operation. This tradition has persisted even as operating system services have improved, so that even today DBMS's use operating systems for little more than booting the machine and communicating through sockets over the network [25]. At the same time, DBMS's have gained a reputation of poor performance when used for file storage due to their logging overhead, network or IPC-based interfaces, and highly structured data organization (which is unused by files). These characteristics reflect design decisions made in implementing database management systems, not the fundamental database theory and methods which underlie them; nevertheless, they have retarded the adoption of DBMS technology within operating systems or filesystems.

1.3.2 Current DBMS-Based Work

Several projects, both academic and commercial, have implemented file storage using conventional DBMS instances. However, these implementations are not full filesystems in that they are not part of an operating system kernel. Rather, they are implemented as NFS servers, which run as user-level daemon processes. The Network File System (NFS) protocol was developed by Sun Microsystems atop its Remote Procedure Call (RPC) and eXternal Data Representation (XDR) protocols, which themselves use UDP/IP. NFS has been widely adopted and its third version is an IETF standard [2]. Operating systems which have client support for NFS can mount a remotely served filesystem in their directory tree, making it appear local and hiding all use of network communication. NFS servers, on the other hand, need only implement the 14 RPC calls which comprise the NFS protocol. For this reason, NFS has been very popular among budding filesystem writers; it eliminates the need to create kernel modules or otherwise leave the realm of user space in order to produce mountable directories and files. Since conventional DBMS implementations interface over the network anyway, commercial vendors have also found it convenient to produce NFS servers to go with their DBMS distributions.

The PostGres File System (PGFS) is an NFS server designed to work with the PostgreSQL DBMS [1, 8]. It accepts NFS requests and translates them into SQL queries. PGFS was designed for filesystem version control applications, though it may be used as a general-purpose filesystem.
The DBFS project is another NFS-based filesystem designed for use with the Berkeley DB [24, 22].

Net::FTPServer is an FTP server with support for a DBMS back-end [10]. To a connected FTP client, the remote system appears as a standard filesystem. On the server side, Net::FTPServer makes SQL queries into a separately provided DBMS instance to retrieve file data. The Net::FTPServer database schema provides for a directory structure and file names, as well as file data. However, the FTP protocol [14] lacks most filesystem features and in general, FTP services cannot be mounted as filesystems.

Oracle Files is Oracle's filesystem implementation designed for use with the Oracle DBMS [16]. In addition to an NFS service, Oracle Files supports other network protocols such as HTTP, FTP, and Microsoft Windows file sharing. The concept of Oracle Files, however, is identical to that of PGFS: a middleware layer capable of translating filesystem requests into SQL queries.

Though NFS client support allows these implementations to be mounted alongside "real" local filesystems, the two cannot be compared. NFS servers are stateless, maintaining no information about "open" files [2]. Network latency reduces performance considerably and requires client-side caching which makes it possible to read stale data. For this reason, NFS is not recommended for use in write-intensive applications [13].

1.3.3 Current Filesystem-Based Work

Operating systems researchers have finally begun to incorporate database techniques in filesystem implementations. The most visible result has been the introduction of "logging" or "journaled" filesystems. These filesystems maintain a log area similar to that of a DBMS, where filesystem changes are written sequentially, then slowly applied to the filesystem itself [13]. The fsck program for these filesystems only has to scan the log, not the entire filesystem, following a crash. Many journaled filesystems are simply modified conventional filesystems, such as the Linux third extended file system ext3, which is built on ext2 [26], and the Sun UFS with Transaction Logging, which is built atop regular ufs [13]. Both ufs and ext2 are modeled after FFS.

A more comprehensive journaled filesystem project is ReiserFS [17]. Unlike ext3, ReiserFS is a completely new filesystem, written from scratch. (ReiserFS is an open source project; an accompanying venture, Namesys, offers services and support.) The authors of ReiserFS are versed in database techniques; their filesystem implementation uses B-tree indices and dispenses with the traditional i-node allocation and superblock models of conventional filesystems. Version 4 of this project is still under development as of this writing. The authors of ReiserFS have determined that database techniques such as logging and B-tree indexing are capable of matching the performance of conventional filesystems [17]. However, these authors have rejected existing DBMS implementations in favor of their own ground-up reimplementation; this work has been ongoing for several years. Is not a single one of the many DBMS's already available suitable for storing file data?

1.3.4 Project Goal

The goal of this project was to find a way to apply existing DBMS technology to the construction of a local filesystem. NFS-based projects use existing DBMS technology but rely on the network, act in a stateless manner, and are not full local filesystems.
Filesystem projects such as ReiserFS incorporate DBMS ideas into full local filesystems, but ignore existing DBMS technology while consuming years of programming effort in starting from scratch. Database management systems have intelligent buffer caches and storage layers, which should be comparable to those of operating systems provided that an intelligent database schema (storage model) is chosen. (The storage model was originally the primary focus of this thesis, a fact which is reflected in the title.) Other performance issues, such as the IPC- or network-based interface normally associated with DBMS's, can be addressed simply through choice of an appropriate DBMS (not all DBMS's, as will be shown in the next chapter, are cut from the same mold). With an appropriate choice of DBMS architecture, the main task of the project is simply the construction of a "translation" layer to hide the relational and transactional nature of the DBMS from the end user, while providing instead a standard system call interface which will allow existing applications to run essentially unchanged atop the DBMS-based filesystem.

The eventual goal of any filesystem is full integration with the operating system kernel, allowing the filesystem to be locally mounted. This project is no exception, but in order to focus on the design and implementation of the filesystem architecture (rather than the inner workings of the kernel), the scope of this thesis is constrained to user space. Eventually, a kernel module may be produced, either as a full filesystem or perhaps simply as a stub module which provides mounting functionality while the DBMS itself remains in user space.

1.3.5 Project Proposal

Implementing the project goal calls for the construction of a translation layer to convert filesystem requests into DBMS queries. However, unlike the network-based filesystems described in section 1.3.2, this translation layer will be local, converting application system calls, rather than network packets obeying some protocol, into DBMS queries. As a local filesystem, the DBMS and translation layer will reside on the same machine as the application processes which make use of them. Finally, the project will store data on its own raw partition, without relying on an existing filesystem for storage of any kind.

Figure 1-2: Proposed DBMS-based filesystem architecture (an application issues standard system calls to a translation layer, which issues database queries to a DBMS providing crash recovery; the DBMS performs block transfers to raw disk storage)

Figure 1-2 depicts the proposed project architecture at a high level. A DBMS instance provides ACID (atomicity, consistency, isolation, and durability) guarantees if necessary, while storing data on a raw disk partition. Application system calls are received by the translation layer, which implements them through database queries to the DBMS. The design details of this architecture are the subject of the next chapter.

Chapter 2

Design

2.1 Application Considerations

The foremost concern of the project is to present applications with the same interface used for "normal" filesystems. On all varieties of the UNIX operating system, which is the platform used for this project, system calls are implemented as C functions. They are included in application code through a number of system header files and linked into the application at compile-time through the standard C library.
Since the C library and system header files are not subject to modification by users, the project must provide its own header file and library, mirroring the standard system calls. Applications may then be modified to use the system call replacements and recompiled with the translation layer library.

Conventional filesystems do not use separate threads of execution. Rather, system calls execute a "trap" which transfers execution from the application to the kernel (figure 2-1) [13]. After the kernel has performed the requested filesystem operation, execution returns to the application program. This method of control transfer eliminates the costly overhead of inter-process communication and does not require thread-safe, synchronized code in either the application or the kernel. However, its implications for a DBMS-based filesystem project are important: no separate threads of execution may be maintained apart from the application. Since most popular DBMS's are implemented as daemon processes, this requirement narrows the field of available DBMS solutions significantly.

Figure 2-1: Conventional filesystem process architecture (user processes call into the standard C library and trap into the kernel, where the filesystem interface, buffer cache, and storage layer perform raw block transfers to the partition)

Several pieces of per-process state are associated with file operations. The most important of these is the file descriptor table, which maps open files to integer tags used by the application. Besides the tag, file descriptors also include a small structure which contains, among other things, the offset of the cursor point into the file and whether the file is open for reading, writing, or both. In a conventional filesystem, the operating system provides space from its own memory, associating a few pages with each process to store state information. Fortunately, the use of a linked library allows for the duplication of this behavior. Library code may simply allocate space for state information, which will reside in user space. Unfortunately, the presence of this state information cannot be completely hidden from the application, as is the case when the operating system manages it. In the conventional case, the operating system initializes the state information during process creation. Since our linked library runs in user space, initialization must be performed by the application process itself, through a file_init()-type call which has no analog in the standard system call collection. This requirement is not seen as a significant obstacle, however, and is tolerated within the scope of this project.

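A minimal sketch of such per-process state is shown below. The structure layout, field names, and header name are illustrative assumptions, not the project's actual definitions; the text only specifies that each descriptor records a cursor offset and the access mode, and table 3.2 later fixes the per-process open-file limit at 256.

    #include <sys/types.h>   /* off_t */

    #define SCFS_MAX_OPEN 256            /* per-process open-file limit */

    /* Hypothetical translation-layer descriptor entry. */
    struct scfs_fd {
        int   in_use;                    /* is this slot allocated?       */
        int   mode;                      /* open for read, write, or both */
        off_t cursor;                    /* current cursor byte offset    */
        char  path[256];                 /* normalized absolute pathname  */
    };

    /* The integer descriptor returned to the application would simply be
       an index into this table, mirroring the kernel's descriptor table. */
    static struct scfs_fd fd_table[SCFS_MAX_OPEN];
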
2.2 Unsuitability of Conventional DBMS

The task now becomes choosing a DBMS which can be used for the project while adhering to the single-threaded model described above. Unfortunately, the most commonly recognized commercial and open-source DBMS's, such as Oracle and PostgreSQL, are not suitable for use. They are implemented as daemon processes which run independently of application software, often on dedicated machines. The only way to obtain a single-threaded architecture with such a DBMS would be to somehow build the application code directly into the DBMS, a task which is impossible for proprietary DBMS's such as Oracle and unwieldy at best for open-source analogs, which are designed to communicate through sockets or IPC, not through programmatic interfaces (figure 2-2). Furthermore, even if such a construction were possible, the resulting filesystem would be isolated from use by other applications. The DBMS buffer cache and storage layer would reside in the private space of the application, preventing other applications from using them concurrently. Since filesystems were designed to facilitate data sharing between applications, this restriction is unacceptable. Finally, the inclusion of a complete DBMS instance in each application program would waste memory and degrade performance.

Even if it were deemed acceptable to run a conventional DBMS instance as a separate process, the overhead associated with inter-process communication (IPC) would make the project incomparable with conventional filesystems, which suffer no such overhead. Sockets, the method by which connections are made to most conventional DBMS's, are kernel-buffered FIFO queues which must be properly synchronized (especially in multi-threaded applications). Sockets also incur performance costs as a result of the need for a kernel trap to both send and receive data.

Figure 2-2: Conventional DBMS process architecture (user processes communicate over socket IPC with a separate DBMS process containing its own buffer cache, lock manager, logger, and storage layer, which performs block transfers to the raw partition)

2.3 The Berkeley Database

The Berkeley Database is an open-source project with an associated venture, Sleepycat Software [22]. "DB," as it is commonly referred to, is a highly configurable DBMS which is provided as a linked library. DB has a number of characteristics which set it apart from conventional DBMS implementations and which make it the DBMS of choice for this project. Pervasive and widely used, DB provides atomicity, consistency, isolation, and durability guarantees on demand, while taking recourse to UNIX operating system services in order to avoid the monolithic daemon process model commonly associated with DBMS instances [23]. Unfortunately, DB lacks an important subsystem, the storage layer, which requires us to provide one in order to use a raw device for block storage.

Unlike conventional DBMS's, which are queried through SQL statements sent over socket IPC, DB has a programmatic interface: a C API. To use DB, applications include a header file and are compiled with a linked library. The DBMS code thus becomes part of the application program itself and runs together with the application code in a single thread. However, this does not mean that the DBMS is restricted to use by one application at a time. The various subsystems of the DBMS (lock manager, logging subsystem, and buffer cache) use "shared memory" to store their state. Shared memory is another form of IPC, which simply maps a single region of memory into the address spaces of multiple programs, allowing concurrent access. This form of "IPC" involves simple memory copying, which is much faster than socket IPC and comparable to conventional filesystem operations, which copy memory between kernel and user space. The use of shared memory structures by DB allows concurrent database access and reduces the footprint of each individual application.

Figure 2-3: Berkeley DB process architecture (each user process links the Berkeley DB library, whose buffer cache, lock manager, and logger share state through a common shared-memory region; the filesystem is used as the storage layer)

DB is a relational DBMS, but uses a far simpler relational model than most DBMS's. DB's lack of SQL support is one reason for this; another is philosophical, asserting that the application, not the DBMS, should keep track of record structure.
In DB, records consist of only two fields: a "key" and a "value." Both are untyped, and may be of fixed or variable length (everything in DB is, in effect, a binary large object). Any further record structure is left to the application, which may divide the value field into appropriate subfields. DB builds indices on the key field for each table, and allows one or more secondary indices to be defined over a function of the key and value. DB provides four table types: the btree and hash table types use B-tree and hash indexing, respectively, and allow variable-length keys and values. The queue and recno types use fixed-length records and are keyed by record number.

Database queries are performed using a simple function call API. A get function is used to obtain the value associated with a given key. A put function sets or adds a key/value pair. A delete function removes a key/value pair. More complex versions of these functions allow multiple values under a single key; cursor operations are supported. Transactions are initiated using a start function which returns a transaction handle to the application. This handle may be passed to get, put, and delete operations to make them part of the transaction. After these operations, the application passes the handle to a commit or abort call, terminating its use.

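The fragment below sketches this usage pattern with the Berkeley DB 4.x C API (error handling trimmed for brevity); the database name and the key/value contents are illustrative and are not part of the SCFS schema.

    #include <string.h>
    #include <db.h>

    /* Store one key/value pair inside a transaction, then read it back. */
    void db_example(DB_ENV *env)
    {
        DB *db;
        DB_TXN *txn;
        DBT key, val;

        db_create(&db, env, 0);
        db->open(db, NULL, "example.db", NULL, DB_BTREE, DB_CREATE, 0644);

        env->txn_begin(env, NULL, &txn, 0);        /* "start"  */

        memset(&key, 0, sizeof(key));
        memset(&val, 0, sizeof(val));
        key.data = "greeting";  key.size = 8;
        val.data = "hello";     val.size = 5;

        db->put(db, txn, &key, &val, 0);           /* "put"    */

        memset(&val, 0, sizeof(val));
        db->get(db, txn, &key, &val, 0);           /* "get"    */

        txn->commit(txn, 0);                       /* "commit" */
        db->close(db, 0);
    }
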
DB provides four subsystems: a lock manager, logging subsystem, buffer cache, and transaction subsystem. All four subsystems may be enabled or disabled independently (with the exception of the transaction subsystem, which requires logging). DB defines an "environment," which encompasses a DBMS instance. All applications which use a given environment must be configured identically (e.g. use the same subsystems), since they all use the same shared memory region. Each subsystem has configurable parameters, such as the memory pool size (buffer cache) and log file size (logging subsystem). A large number of boolean flags may be changed to define DB's behavior, particularly in the transaction subsystem, which may be set to buffer log entries in memory for a time (guaranteeing consistency), or to flush them immediately to disk (guaranteeing consistency and durability). The former behavior was employed in the project, since it more closely mirrors filesystem behavior.

DB relies on the filesystem as its storage layer. Databases and log records are stored in files, and the fsync system call is used to keep them current on disk. Unfortunately, this behavior is unsuitable for our project; we seek to create a filesystem alternative, not a construction atop an existing one. We thus require the DBMS to use a raw storage device. Luckily, DB provides a collection of hooks which can be used to replace the standard system calls (used for filesystem interaction) with callback functions of our own design. Intended for debugging, this feature allows us to effectively "hijack" the storage layer beneath DB. The project task then becomes the implementation of not one but two translation layers: one to translate application-level system calls to DBMS queries; the second to translate storage-level system calls to block transfers to and from a raw partition.

2.4 Proposed Architecture

The use of Berkeley DB allows the DBMS-based filesystem, hereafter referred to as "SCFS" for "Sleepycat file system" (the name "DBFS" being already taken; see section 1.3.2), to be implemented as a group of C libraries. When linked with application code, these libraries form a vertical stack, with each layer using the function call interface of the one below.

Applications use the standard system call interface, with access to the SCFS versions of those calls obtained by including an SCFS header file. Other than including this header file, linking with the SCFS library, and calling an initialization routine as discussed in section 2.1, the application need make no other special considerations to use the SCFS system instead of a regular filesystem.

The translation layer library is responsible for turning filesystem requests into DBMS queries as described in section 1.3.5. This layer also contains the per-process state information described in section 2.1. System call implementations in this layer make changes to the per-process state as necessary and/or make database queries through the DB API. The translation layer must adhere to a database schema for storing file data and metadata. This schema is the subject of sections 2.5 and 2.6.

The storage layer library consists of callback functions which replace the standard system calls in DB. It is responsible for the layout of blocks on the raw partition and must emulate standard filesystem behavior to DB. The storage layer is the subject of section 2.7.

Figure 2-4: Proposed SCFS process architecture (application code calls the translation layer library through the standard syscall interface; the translation layer uses the DB transactional interface; the DB library calls the storage layer library through the standard syscall interface, which performs block transfers to the raw partition; DB's shared memory holds the buffer cache, lock table, etc.)

2.5 File Data Storage Model

An intelligent database schema, or choice of tables, record structures, and the relationships between them, is vital to ensure efficient storage of file data. Complicating the task of choosing a schema for file storage is the fact that modern applications produce a wide range of file sizes, quantities, and usage patterns. E-mail processing applications, for example, use thousands of small files which are rapidly created and removed [11]. World Wide Web browser caches produce the same type of behavior; the average file size on the World Wide Web is 16KB [4]. At the same time, multimedia applications such as video editing are resulting in larger and larger files, fueling a continued need for more hard disk capacity. Any file storage model must perform adequately under both situations, handling small files with minimal overhead while at the same time supporting files many orders of magnitude larger.

Most conventional filesystems, including FFS, use increasing levels of indirection in their block allocation scheme. In FFS, file i-nodes contain space for 12 block addresses [13]. On a system with an 8KB block size, the first 96KB of a file are thus accessible through these "direct" blocks; a single lookup is required to find them. Following the 12 direct block references in a file i-node is an indirect block reference: the address of a block which itself contains addresses of data blocks. File access beyond the first 96KB thus involves two lookups, though much more data can be stored before recourse to a "double-indirect" block is required; and so on (figure 2-5). Note that the amount of space available at each level of indirection is a geometrically increasing sequence.

Figure 2-5: Conventional filesystem data storage model (a file i-node holds a small number of direct block addresses, then an indirect block pointer, then a double-indirect block pointer, each level reaching geometrically more data blocks)

SCFS seeks to duplicate the geometrically increasing storage size model of FFS, which is seen as desirable because it minimizes overhead at each stage of file growth. Small files are allocated small amounts of space without significant waste. As files grow larger, the number of indirections required grows simply as the logarithm of the file size, allowing very large files to be stored without large amounts of growth in the bookkeeping mechanism. However, SCFS seeks to eliminate indirect lookups by storing file data in tables at known offsets.

For data storage, SCFS employs a sequence of n tables with fixed-length records of size B bytes. The tables are indexed by record number. Each file in the system is allocated a single row from the first table; the number of that row may thus serve as a unique identifier for the file. (The analog in conventional filesystems is the i-node number.) Once a file has grown larger than B bytes, more space must be allocated from the next table in the sequence. An exponent parameter k is introduced: k records from the second table are allocated to the file, providing it with kB additional bytes of space. Once this space has been filled, k^2 records, providing k^2 B additional bytes, are allocated from the third table, and so forth. Like FFS, this scheme provides geometrically increasing storage space, but by storing in a file's metadata the record number in each table where the file's data begins, indirect lookups can be avoided no matter how large the file grows. Figure 2-6 illustrates the concept with k = 2.

Figure 2-6: SCFS data storage schema (with k = 2, a file's ID selects one record in table 1, k contiguous records in table 2, and k^2 contiguous records in table 3; the shaded records indicate the blocks of the file with ID 1)

One problem with the SCFS storage model is that as files are deleted, blocks of records in each data table will be freed, causing fragmentation over time. To avoid fragmentation, the SCFS schema includes a "free list" table which records the locations of unused records in each data table. When allocating records from a table, SCFS will first check the free list to see if any blocks are available. Only if none are available will additional records be appended to the data table.

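As an illustration of this addressing scheme (a sketch based on the description above, not code from the project), the data table holding a given byte offset, and the record offset within that table's allocation, can be computed directly from B and k, with no indirect blocks to follow:

    /* For a byte offset within a file, return which data table (0-based)
       holds it and which record within that table's allocation, assuming
       table i contributes k^i records of B bytes each.  Illustrative only. */
    void locate_block(unsigned long long offset, unsigned B, unsigned k,
                      unsigned *table, unsigned long long *record)
    {
        unsigned long long block = offset / B;  /* block index within the file   */
        unsigned long long span  = 1;           /* records contributed by table  */

        *table = 0;
        while (block >= span) {                 /* skip tables already filled    */
            block -= span;
            span  *= k;
            (*table)++;
        }
        *record = block;  /* offset from the file's first record in this table */
    }

The file's metadata (section 2.6) then supplies the starting record number in that table, so a single table lookup retrieves the block.
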
2.6 File Metadata Storage Model

The file metadata table is perhaps the most important of the SCFS schema. This table is keyed by absolute file pathname, so that given the pathname to a file, its metadata information may be directly looked up in the table. In conventional filesystems, pathnames must be broken into individual directories, and each directory must be scanned to find the entry for the next. While pathname and i-node caching may ameliorate this task, SCFS eliminates it altogether. Keying by pathname does require, however, that pathnames be resolved to a normal form, starting from the absolute root directory (not the current root or working directories) and without any instances of the placeholder "." or ".." directories.

The value field of the metadata table is a structure containing the usual i-node fields: file type and size, user and group ownership, permissions, access times, and preferred block size. In addition, the metadata contains a sequence of record numbers, one for each data table, which refer to the locations in each table where the file's data is stored. The full metadata structure is shown in table 2.1.

Table 2.1: SCFS metadata fields

    Type        Name      Bytes  Interpretation
    db_recno_t  recno[n]  4n     first record number of data (one per data table)
    scfs_size   size      8      file size (bytes)
    mode_t      mode      4      file mode (see mknod(2))
    uid_t       uid       4      user ID of the file's owner
    gid_t       gid       4      group ID of the file's group
    time_t      atime     4      time of last access
    time_t      mtime     4      time of last data modification
    time_t      ctime     4      time of last file status change

Directory files are a vital part of conventional filesystem structure since they map file names to i-node numbers. In SCFS, however, this mapping resides not in directory files but in the metadata table. Directories in SCFS are maintained to allow traversal of the directory tree, but lost or corrupted directory files could easily be restored by scanning the metadata table. For this reason, and for convenience, directories in SCFS are implemented as regular files. The S_IFDIR bit of the file mode may be used in the future to restrict random access to directory files, but the consequences of directory file loss or corruption in SCFS are far less severe than in conventional filesystems.

The content of an SCFS directory is a sequence of record numbers. Each number is the reference number of a file, referring to the record number of its entry in the first table as discussed in the previous section. A secondary index on the metadata table allows lookups by reference number, allowing directory entries to be mapped to their pathname and metadata information. Through this index, directories may be "scanned" to determine their contents, allowing exploration and traversal of directory trees in SCFS.

The metadata storage model of SCFS removes much of the indirection associated with conventional filesystem metadata, but makes the implementation of "hard links" very difficult. A hard link is a synonym for a directory entry; in FFS, multiple directory entries may refer to the same i-node, in effect placing a file in multiple directories. Removing a file becomes more difficult in the presence of hard links: the i-node nlinks field becomes necessary to record how many directories reference the file (the SCFS metadata has no such field). More importantly, keying by pathname becomes impossible when files may have more than one unique absolute pathname, which is why hard links are unsupported in SCFS. (Hard links are generally regarded in filesystems as goto statements are in programming languages, making their absence tolerable.) Symbolic links, however, are possible in SCFS, though not implemented as part of this project.

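Expressed as a C structure, the value field of a metadata record described in table 2.1 might look like the sketch below. The struct layout, the SCFS_NTABLES name, and the scfs_size typedef are reconstructed from the table for illustration; they are not taken from the project source.

    #include <stdint.h>
    #include <sys/types.h>   /* mode_t, uid_t, gid_t, time_t */
    #include <db.h>          /* db_recno_t                   */

    #define SCFS_NTABLES 5            /* the parameter n; 5 by default (section 3.3) */

    typedef uint64_t scfs_size;       /* 8-byte file size, per table 2.1 */

    /* Value field of a metadata table record (keyed by absolute pathname).
       Field names follow table 2.1; the exact layout is an assumption. */
    struct scfs_metadata {
        db_recno_t recno[SCFS_NTABLES]; /* first record number in each data table */
        scfs_size  size;                /* file size in bytes                     */
        mode_t     mode;                /* file type and permission bits          */
        uid_t      uid;                 /* owner                                  */
        gid_t      gid;                 /* group                                  */
        time_t     atime;               /* last access                            */
        time_t     mtime;               /* last data modification                 */
        time_t     ctime;               /* last status change                     */
    };
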
2.7 Storage Layer

The storage layer receives system call requests from DB; its job is to simulate a filesystem to DB while actually using a raw device for storage. At first glance, this layer seems complicated, but the way in which DB uses the filesystem makes its task easier. Databases and log entries are stored in files of known names, which can be hard-coded into the storage layer, rather than having to provide support for arbitrary files and directories. In addition, DB performs data reads and writes one page at a time, so that the I/O operation size is constant. Log writes, however, are arbitrarily sized.

The original storage layer design divided the raw partition into sections: one for log entries and one for each database file (figure 2-7). However, this approach was abandoned because the relative sizes of the database files are impossible to predict. Under small-file-intensive applications, space for the first few data tables would fill quickly while upper-level data tables sat unused. In contrast, large-file-intensive applications would result in relatively small tables at the lower end of the sequence. Choosing fixed-size areas for database tables would thus result in large amounts of wasted space, or worse yet, overflow. (Overflows were in fact encountered while testing large files, which forced the reevaluation of the storage layer.)

Figure 2-7: Original storage layer disk allocation (the raw partition divided into fixed-size data table areas and a log area)

The storage layer was thus moved to a purely journaled format in which blocks from all database files mingle together on disk. Data writes are performed at the journal "head," which moves forward until it reaches the end of the disk and wraps back to the beginning (once this has happened, the head must be advanced to the next free block, rather than simply writing in place) [19]. Writes which change a data block rather than appending a new one also invalidate and free the old block. The log, meanwhile, is written to a separate circular buffer. Periodic checkpoints prevent the log buffer from overflowing. Figure 2-8 shows the disk layout under this design. In the example, a block from data table 2 has just been written. The next write that occurs must first advance the insertion head one block to avoid overwriting a block from data table 3.

Figure 2-8: Revised storage layer disk allocation (blocks from data tables 1, 2, and 3 mingle with free space ahead of the insertion head; the log occupies a separate area of the raw partition)

The journaled format makes data writes especially fast because no seeking of the disk head is generally required; each write follows the next [13, 19]. However, reading a block in this system requires finding it on disk, which would require a linear scan in the design so far discussed. Thus, the storage layer includes some in-memory structures which speed the location of blocks on disk. Two "maps" of the disk area are maintained. One is a simple bitmap which records whether or not a block is free; this information is used when advancing the head of the journal to the next free block, and also when invalidating the previous block after the write has been performed. A second, more detailed map records for each block on disk which database file it belongs to and at what offset into the file it sits. At 8 bytes to the block, the overhead associated with these maps is minimal. The maps may also be stored on disk, preventing the need to reconstruct them on filesystem initialization.

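A minimal sketch of these two maps appears below; the representation is an assumption based on the description above (a free-block bitmap plus an 8-byte descriptor per block naming its database file and file offset), not the project's code.

    #include <stdint.h>

    /* One descriptor per disk block: which database file the block belongs
       to and its page offset within that file -- eight bytes per block, as
       described in the text.  Field widths are an assumption. */
    struct block_desc {
        uint32_t file_id;     /* index into the table of known database files */
        uint32_t pgno;        /* page offset of this block within that file   */
    };

    struct disk_map {
        uint8_t           *used_bitmap; /* one bit per block; set = in use   */
        struct block_desc *blocks;      /* one descriptor per block on disk  */
        uint64_t           nblocks;     /* number of blocks on the partition */
        uint64_t           head;        /* current journal insertion head    */
    };

    /* Advance the journal head to the next free block, wrapping at the end
       of the partition (assumes at least one block is free). */
    static uint64_t advance_head(struct disk_map *m)
    {
        do {
            m->head = (m->head + 1) % m->nblocks;
        } while (m->used_bitmap[m->head / 8] & (1u << (m->head % 8)));
        return m->head;
    }
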
Chapter 3

Implementation

3.1 System Calls Provided

Table 3.1 summarizes the user-visible functions which were implemented as part of the translation layer. Each call mimics the parameters and return value of its respective system call. Even error behavior is duplicated: an scfs_errno variable is provided which mimics the standard errno variable (DB uses negative errno values to indicate DBMS errors, which is the reason errno itself was not used). The scfs_init() routine has no analog among the standard system calls, since the operating system takes care of state initialization under a conventional filesystem.

Table 3.1: SCFS user-visible API

    Name            Analog     Purpose
    scfs_init       (none)     process state initialization
    scfs_mknod      mknod      file creation
    scfs_mkdir      mkdir      directory creation
    scfs_creat      creat      file creation and open
    scfs_open       open       file creation and open
    scfs_close      close      close filehandle
    scfs_read       read       read from file
    scfs_write      write      write to file
    scfs_stat       stat       get file metadata
    scfs_seek       seek       set file cursor position
    scfs_tell       tell       get file cursor position
    scfs_unlink     unlink     remove file
    scfs_sync       sync       flush buffer cache (checkpoint)
    scfs_strerror   strerror   get error information
    scfs_perror     perror     print error information

The set of implemented functions reflects the basic set necessary to create and browse a directory tree, perform file operations, and conduct performance tests. The lack of an analog for a common system call at this point does not mean that it cannot be implemented. For example, the chmod, chown, and chgrp family of system calls was left out simply to save time, and the opendir, readdir, etc. family of calls may easily be constructed atop getdents. SCFS is restricted in some areas, however. The lack of support for hard links was mentioned in section 2.6; another example is the lack of support for pipes, sockets, and special device file types. These file types are simply kernel placeholders and do not affect the data storage aspect of SCFS, which is the primary focus. Several other minor aspects of SCFS differ from the UNIX standards, but in general the project has focused simply on providing file-like storage, not on conforming to every quirk of POSIX and other standards.

3.2 Consistency and Durability

SCFS is intended to ensure filesystem consistency after a crash, but durability, which incurs a large performance penalty as a result of log flushes on writes, may be disabled. Use of the DB_TXN_NOSYNC flag in DB causes log writes to be buffered in a memory area of configurable size; only when the buffer fills are the log entries flushed to disk. This behavior significantly increases performance and is enabled by default to better approximate filesystem behavior. A system crash in this case will result in loss of the log buffer and hence possibly the loss of recent filesystem updates, but the atomicity of log flushes ensures consistency of the filesystem on disk. Regular checkpointing of the system will minimize this data loss, much as conventional filesystems do through a disk synchronization thread running in the operating system.

An analog to the sync() system call is provided in SCFS. The scfs_sync() function simply performs a DB checkpoint, ensuring that all updates in memory are flushed to disk. (In fact, the system sync() merely schedules the flush, frustrating attempts to compare filesystem syncs with database checkpoints.) Of course, SCFS may simply be rebuilt without the DB_TXN_NOSYNC flag, which would cause all of its filesystem operations to flush immediately. Applications which make repeated use of fsync() and sync() to achieve the same goal on regular filesystems would no doubt find this feature of SCFS convenient.
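For concreteness, the sketch below shows how a Berkeley DB environment might be configured for this buffered-log mode and how scfs_sync() could map onto a checkpoint. It uses the standard Berkeley DB C API, with the 24MB buffer cache and 4MB log buffer sizes of table 3.2; the environment home string, helper names, and error handling are simplified assumptions rather than the project's actual code.

    #include <db.h>
    #include <stddef.h>

    /* Open the DB environment with transactions but without synchronous log
     * flushes (DB_TXN_NOSYNC): log records accumulate in a memory buffer and
     * are written only when it fills, so a crash may lose recent updates but
     * leaves the databases consistent. */
    static DB_ENV *scfs_open_env(const char *home)
    {
        DB_ENV *dbenv;

        if (db_env_create(&dbenv, 0) != 0)
            return NULL;

        dbenv->set_cachesize(dbenv, 0, 24 * 1024 * 1024, 1);  /* 24MB buffer cache */
        dbenv->set_lg_bsize(dbenv, 4 * 1024 * 1024);          /* 4MB log buffer   */
        dbenv->set_flags(dbenv, DB_TXN_NOSYNC, 1);            /* buffer log writes */

        if (dbenv->open(dbenv, home,
                        DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
                        DB_INIT_LOG | DB_INIT_TXN | DB_THREAD, 0) != 0) {
            dbenv->close(dbenv, 0);
            return NULL;
        }
        return dbenv;
    }

    /* scfs_sync(): flush everything in memory to stable storage by forcing a
     * checkpoint, the DB analog of sync(). */
    static int scfs_sync_sketch(DB_ENV *dbenv)
    {
        return dbenv->txn_checkpoint(dbenv, 0, 0, DB_FORCE);
    }

Rebuilding without the set_flags call would restore synchronous log flushes and hence full durability.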
3.3 Configurable Parameters

This section presents numerical specifications for SCFS, which are summarized in table 3.2. Compared to conventional filesystems, SCFS is highly flexible and supportive of large files. Care has been taken to ensure 64-bit sizes and offsets internally, though the standard system calls almost always take 32-bit values for these parameters. The parameters n, B, and k, discussed in section 2.5, are configurable compile-time parameters, currently set to 5, 4000 bytes, and 16, respectively.

Table 3.2: SCFS specifications

    Parameter                        Value                   Configurable?
    Maximum number of files          at most 2^32 - 1        No
    Maximum number of directories    at most 2^32 - 1        No
    Maximum filesystem size          nB(2^32 - 1)            Yes
    Maximum file size                B(k^n - 1)/(k - 1)      Yes
    Maximum entries per directory    B(k^n - 1)/4(k - 1)     Yes
    Maximum subdirs per directory    B(k^n - 1)/4(k - 1)     Yes
    Maximum pathname length          256                     Yes
    Maximum open files per process   256                     Yes
    Size of buffer cache             24MB                    Yes
    Synchronized log flush           not enabled             Yes
    Size of log buffer               4MB                     Yes

File size in SCFS is essentially limited by the number of data tables present in the schema. (The absolute limit is imposed by the number of records allowed in a DB table: with B = 4KB, a file may grow to approximately 16TB, though only one file may be this big; others may only reach 16/k TB.) If n tables are present, the maximum file size (in bytes) is

    Σ_{i=0}^{n-1} B·k^i = B(k^n - 1)/(k - 1).

The number of data tables n is a compile-time parameter which is easily changed. In addition, the storage layer does not depend on this parameter, so that a raw partition initially used by an SCFS system with a certain number of tables may be used unchanged after the number has been modified. Recompilation of all application code is currently necessary in order to effect this change, however; a move to the kernel would eliminate this requirement.

Since DB tables support 2^32 - 1 records, the maximum number of files which can be present in SCFS at any one time is at most 2^32 - 1. The maximum number decreases, however, as average file size increases. For example, at most (2^32 - 1)/k files of B + 1 bytes may be present in the system at any one time, since a (B + 1)-byte file is allocated k records from the second data table. The total size of all files is more important than the number of files in this respect: an instance of SCFS has n(2^32 - 1) total table records available for file data.

Since directories are implemented as regular files, the maximum number of directory entries per directory is simply one fourth of the maximum file size, since each directory entry consists of a four-byte file number. SCFS makes no distinction between entries of regular files and entries of subdirectories. The maximum length of pathnames is a configurable compile-time parameter, and is only restricted to avoid complicated pathname parsing code which must dynamically allocate space for temporary buffers. This parameter could be made arbitrary, since DB has no restriction on key length.
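As a worked example of the size formula above, the following sketch computes the schema-imposed maximum file size from the compile-time parameters. The function and macro names are hypothetical, and the per-table limit of 2^32 - 1 records noted above caps the result independently of the formula.

    #include <stdint.h>
    #include <stdio.h>

    /* Compile-time parameters as given in section 3.3. */
    #define SCFS_N 5          /* number of data tables      */
    #define SCFS_B 4000ULL    /* base record size, in bytes */
    #define SCFS_K 16ULL      /* geometric growth factor    */

    /* Schema-imposed maximum file size:
     *   sum_{i=0}^{n-1} B*k^i = B(k^n - 1)/(k - 1). */
    static uint64_t scfs_max_file_size(unsigned n, uint64_t B, uint64_t k)
    {
        uint64_t total = 0, term = B;
        for (unsigned i = 0; i < n; i++) {
            total += term;   /* data table i contributes B * k^i bytes */
            term *= k;
        }
        return total;
    }

    int main(void)
    {
        printf("maximum file size: %llu bytes\n",
               (unsigned long long)scfs_max_file_size(SCFS_N, SCFS_B, SCFS_K));
        return 0;
    }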
3.4 Per-Process State

As described in section 2.1, several pieces of per-process state are maintained by SCFS. These lie outside of the database itself and are stored in user space, declared within the code of the translation layer library. Table 3.3 summarizes the per-process state variables maintained by the translation layer (the value m indicates the size of the file descriptor table, which is a configurable parameter). These can be divided into pointers used for database access (referred to by DB as "handles"), the file descriptor table, and some miscellaneous variables holding the current time and error value. A separate variable for the current time is kept because of the frequency with which the time is needed during filesystem metadata operations (e.g. for updating the access, modification, and status change times); it was found during testing that calls to the system time routine were slowing performance to a measurable degree.

Table 3.3: Translation layer per-process state

    Type         Name              Bytes   Purpose
    DB_ENV*      dbenv             4       DB environment handle
    DB*          scfs_datadb[n]    4n      DB data table handles
    DB*          scfs_statdb       4       DB metadata table handle
    DB*          scfs_secdb        4       DB metadata secondary index handle
    DB*          scfs_freedb       4       DB free list table handle
    db_recno_t   scfs_fd[m]        4m      first record number of open file
    scfs_size    scfs_fo[m]        8m      current byte offset into open file
    int          scfs_fm[m]        4m      mode of open file
    char*        scfs_fn[m]        4m      pathname of open file
    time_t       scfs_curtime      4       current time
    int          scfs_errno        4       error number

3.5 Threads, Concurrency, and Code Re-entrance

SCFS code is designed to support concurrency between processes as well as multithreaded processes themselves. The DB_THREAD flag is passed to DB, enabling support for database access from multiple process threads. Internally, DB's shared-memory subsystems are thread-safe and support use by multiple concurrent processes. The more difficult task is ensuring that the translation and storage layer libraries maintain and preserve this behavior.

The translation layer library is inherently process-safe since each of its instances is associated with a single process. However, it has not been shown that the per-process state of the translation layer is immune from corruption resulting from multiple concurrent system calls performed by separate process threads. Further work must focus on ensuring translation layer safety in the face of re-entrant system calls, which may require the use of semaphores or a simple locking mechanism. (Such work is only necessary, of course, if the standard filesystem interface offers a thread-safety guarantee, which may not be the case.)

A more urgent area of concern is the storage layer, which is currently also a per-process entity and which thus prevents the use of the SCFS storage area by more than one application at a time. Under normal operation, DB uses the filesystem for storage: database pages are stored in a file and visible to all DB applications making use of the shared buffer cache and lock manager. The SCFS storage layer, however, is not shared between DB applications, since each process has its own instance. Since the storage layer state includes the disk map and journal head, only one running process may use the raw partition at a time without fear of disk corruption. To allow concurrent access, the storage layer state should reside in shared memory just as the DB buffer cache, lock manager, and transaction subsystem do. Until this change is effected, SCFS is restricted to single-process use. Two methods for sharing the storage layer are available: using user-space mapped memory as DB does, or moving the storage layer into kernel space. One or the other must be the focus of future work.
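One simple form the locking mechanism suggested above could take is a single mutex serializing access to the translation layer's per-process state. The sketch below is purely illustrative: the mutex, the helper name, and the descriptor-allocation logic shown are assumptions, not part of the project's code.

    #include <pthread.h>
    #include <stddef.h>
    #include <db.h>

    #define SCFS_MAX_FD 256   /* maximum open files per process (table 3.2) */

    /* A subset of the per-process translation-layer state of table 3.3. */
    static db_recno_t scfs_fd[SCFS_MAX_FD];   /* first record number of open file */
    static char      *scfs_fn[SCFS_MAX_FD];   /* pathname of open file            */

    /* A single lock so that re-entrant calls from separate threads cannot
     * corrupt the state above. */
    static pthread_mutex_t scfs_state_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Allocate a free slot in the file descriptor table; returns -1 if full. */
    static int scfs_alloc_fd(db_recno_t recno, char *path)
    {
        int slot = -1;

        pthread_mutex_lock(&scfs_state_lock);
        for (int i = 0; i < SCFS_MAX_FD; i++) {
            if (scfs_fn[i] == NULL) {   /* empty slot */
                scfs_fd[i] = recno;
                scfs_fn[i] = path;
                slot = i;
                break;
            }
        }
        pthread_mutex_unlock(&scfs_state_lock);
        return slot;
    }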
Chapter 4

Performance Testing

4.1 Test Methodology

To gauge the performance of SCFS, it was compared to a conventional filesystem by constructing and running several test programs. SCFS was developed on a SunBlade 100 workstation with 512MB of memory, running Solaris 8. The ufs filesystem on this machine was compared to SCFS, which was set up to use a partition on the same disk as the regular filesystem. Other than the use of forced file synchronization as described below, the filesystem was used as-is, with no special configuration (e.g. of the buffer cache).

Every effort was made to ensure parity between the two systems being compared in terms of caching. Tests would be meaningless if one system were permitted to cache file operations in memory indefinitely while the other was forced to flush changes to disk periodically. Thus, care was taken to ensure that both SCFS and the ufs filesystem flushed file data to disk at the same rate while the tests, which consisted of repeated file reads and writes, were being performed. The disk usage of SCFS is predictable to a good degree of accuracy, since it has a log buffer of fixed size. The filesystem buffer cache, on the other hand, is part of virtual memory itself and cannot be "set" to a fixed size, nor can the operating system replacement algorithm which governs page flushes be modified. Thus, recourse to sync() was taken to force filesystem flushes during testing: a call to sync() was made each time the number of bytes written to the filesystem reached the amount which would cause a log flush in SCFS.

Several test programs were written to exercise SCFS using its file interface. The same test programs were also run using the ufs filesystem, and the results were tabulated and plotted side-by-side. The first test, referred to as the "single file" test, creates a single file and performs repeated read and write operations on it. The size of the file is a variable parameter, as is the ratio of read operations to write operations (the entire file is read or written during each operation). The single file test was run over a range of file sizes. The number of operations performed was adjusted as the file size varied so that an equal number of bytes was read or written during each iteration. Thus, the ideal performance result is a straight line when plotted versus file size, as is done in subsequent sections.

The second test, referred to as the "multiple file" test, creates a set of files and performs repeated read and write operations, choosing a file at random for each operation. The size and number of files are both variable parameters, as is the proportion of read operations to write operations. Unlike the single file test, the multiple file test does not read or write the entire file, but operates on a fixed-size block within the file, so that the operation size does not change with file size. The multiple file test was run over a range of file sizes, but the number of files was adjusted to keep the total size of the set of files constant. A constant number of operations was performed during each iteration, so that again a straight line is the ideal performance result when plotted.

Under both tests, the proportion of reads to writes was also varied so that three sets of results were obtained: results for 100% reads, 100% writes, and a 50/50 mixture of reads and writes. In subsequent sections, the results are presented as follows: a separate graph for each of the three read/write proportions, with each graph having four plots corresponding to the SCFS results (single- and multiple-file) and ufs results (single- and multiple-file). In addition to the single file and multiple file tests, a multithreaded test was written, but since the SCFS storage layer at present is not in shared memory as described in section 3.5, the test could not be run using SCFS.
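The following rough sketch shows the kind of filesystem-side test driver described above; an SCFS run would substitute the scfs_* calls of table 3.1 for the standard ones. The 8KB operation size matches the record size mentioned in section 4.2, but the sync threshold, helper names, and structure are assumptions made for illustration.

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define OP_SIZE        8192                /* fixed record size for the multiple file test */
    #define SYNC_THRESHOLD (4 * 1024 * 1024)   /* force sync() after this many bytes written,
                                                  mirroring the 4MB SCFS log buffer */

    static size_t bytes_since_sync;
    static char   buf[OP_SIZE];

    /* One multiple-file-test operation: read or write the 8KB record in the
     * middle of the chosen file, forcing a filesystem flush at the same rate
     * at which SCFS flushes its log. */
    static void one_multi_file_op(const char *path, off_t file_size, int do_write)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return;

        off_t middle = file_size / 2 - OP_SIZE / 2;

        if (do_write) {
            pwrite(fd, buf, OP_SIZE, middle);
            bytes_since_sync += OP_SIZE;
            if (bytes_since_sync >= SYNC_THRESHOLD) {   /* parity with the SCFS log flush */
                sync();
                bytes_since_sync = 0;
            }
        } else {
            pread(fd, buf, OP_SIZE, middle);
        }
        close(fd);
    }

The driver would call this routine in a loop, choosing the file at random and choosing read or write according to the configured proportion.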
4.2 Baseline Storage Performance

The baseline tests were the first tests run using SCFS with its storage layer on a raw partition. Neither logging nor locking was enabled in SCFS, approximating "standard" filesystem behavior by offering neither locking nor consistency guarantees. The baseline testing results are plotted in figures 4-1, 4-2, and 4-3.

Figure 4-1: Baseline performance: 100% read operations (time versus file size in KB for the FS and DB single- and multiple-file tests)

Figure 4-2: Baseline performance: 50/50 read/write operations (same axes and plots)

Figure 4-3: Baseline performance: 100% write operations (same axes and plots)

In 100% read testing, SCFS performance lags that of the filesystem by 20%-30% in both tests. Write testing, however, strongly favors SCFS. While SCFS write performance remains relatively flat across the range of file sizes, filesystem performance decreases dramatically as file size increases, yielding to SCFS about halfway through the range. In addition, SCFS write performance for small files is relatively close to that of the filesystem, especially in the single-file test, where the gap is extremely small. In general, the baseline results demonstrate the effectiveness of the storage layer through significantly heightened SCFS write performance.

The filesystem's performance on the 1MB multi-file test is sharply out of line with the overall trend. This unexpectedly fast performance is most likely the result of an internal filesystem threshold. Recall that the multiple-file test operates on a fixed 8KB record within the file, rather than the entire file as in the single-file test. The middle 8KB of the file is the record used, and for 1MB files this record may be aligned by chance so that operations on it proceed at above-average speed. The block address of the record (in the indirect block), the data block itself, or the page in memory may be subject to favorable alignment as hypothesized.
Alternatively, the increase in file size to 1MB may trigger large-file functionality within the filesystem. More knowledge of the filesystem's internal behavior is necessary to pin down the exact cause of this performance deviation.

4.3 Performance with Logging

Following the baseline tests, logging was enabled in SCFS. The use of logging in SCFS results in a log flush whenever the log buffer fills; this behavior was duplicated in the filesystem test application by forcing a sync at the same interval, as described in section 4.1. However, the presence of both log and data operations in SCFS was expected to reduce its performance margin somewhat from the baseline tests; this is the behavior which was observed. The testing results with logging enabled are plotted in figures 4-4, 4-5, and 4-6.

Figure 4-4: Performance with logging: 100% read operations (time versus file size in KB for the FS and DB single- and multiple-file tests)

Figure 4-5: Performance with logging: 50/50 read/write operations (same axes and plots)

Figure 4-6: Performance with logging: 100% write operations (same axes and plots)

For reads, the most significant observation is the radical jump in filesystem read time at a file size of 4MB. The size of the deviation indicates that disk writes are being performed as a result of the periodic calls to sync(). The most likely explanation for this behavior is that the operating system buffer cache begins to flush pages to disk in order to make room for the large number of pages being read. Another possibility could be the file access time, which is modified after a read operation. It is possible that for large reads the access time is flushed immediately, though why only large reads would be treated this way is not known. The access time operation definitely affects SCFS, since a log write is performed as a result of it. SCFS read times come in at about 1.5 times filesystem read times as a result, except at the filesystem deviation at 4MB.

As expected, the use of logging causes SCFS performance to suffer on write operations. In the single file test especially, the filesystem gradually overtakes SCFS, in a reversal of previous behavior. Since the filesystem simply overwrites the file repeatedly in the cache while log entries still pile up in the SCFS log cache, this result is not a surprise. For the multiple-file test, SCFS exhibited a much better performance response, with a flat performance curve compared to the filesystem's degrading performance. Operating on multiple files moderates the effect of the log, since the filesystem cannot cache the entire data set as in the single file test.

4.4 Performance with Locking

The final round of testing enabled both logging and locking in SCFS. In the filesystem, the periodic sync calls were retained to duplicate logging, and file locking was employed via the flock system call to duplicate locking. The results are plotted in figures 4-7, 4-8, and 4-9.
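To mirror SCFS locking on the filesystem side, each test operation can be wrapped in flock-based advisory locking, as the text describes. The sketch below is one plausible form of that wrapping; the exclusive lock mode, per-operation granularity, and helper name are assumptions, since the exact usage in the test program is not specified.

    #include <sys/types.h>
    #include <sys/file.h>   /* flock(), LOCK_EX, LOCK_UN (BSD-style interface) */
    #include <unistd.h>

    /* Perform one read or write on an already-open descriptor while holding
     * an exclusive advisory lock, approximating the cost of DB locking. */
    static ssize_t locked_io(int fd, void *buf, size_t len, off_t off, int do_write)
    {
        ssize_t n;

        if (flock(fd, LOCK_EX) != 0)        /* acquire the lock for this operation */
            return -1;
        n = do_write ? pwrite(fd, buf, len, off)
                     : pread(fd, buf, len, off);
        flock(fd, LOCK_UN);                 /* release the lock */
        return n;
    }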
Figure 4-7: Performance with locking: 100% read operations (time versus file size in KB for the FS and DB single- and multiple-file tests)

Figure 4-8: Performance with locking: 50/50 read/write operations (same axes and plots)

Figure 4-9: Performance with locking: 100% write operations (same axes and plots)

SCFS read performance degraded slightly in the single-file test as a result of locking, though the filesystem retained its strange behavior for files 4MB and larger in the multiple-file test. Write performance remained essentially unchanged from the previous test.

4.5 Analysis

Table 4.1 shows relative SCFS performance versus filesystem performance, averaged over all the file sizes for each test run. Numbers indicate SCFS speed relative to filesystem speed: "2.0" indicates that SCFS ran twice as fast (averaged over all file sizes), while "0.8" indicates that SCFS ran at only 0.8 times the filesystem's speed (i.e. slower).

Table 4.1: Average SCFS performance versus filesystem performance

    Test          Log?   Lock?   100% W   100% R   50/50 R/W   Total Avg.
    multi-file    no     no      1.7      0.8      1.1         1.2
    single-file   no     no      1.9      0.7      1.4         1.3
    multi-file    yes    no      1.4      0.8      1.3         1.2
    single-file   yes    no      0.9      1.4      1.0         1.1
    multi-file    yes    yes     1.3      0.6      1.2         1.0
    single-file   yes    yes     0.8      1.0      0.9         0.9

Several trends are visible from the data. SCFS is generally slower on reads; the only tests where SCFS read performance matched or exceeded that of the filesystem were the two tests where the filesystem exhibited a significant performance deviation at the 4MB and 8MB file sizes. These two data points alone are responsible for the favorable average; removing them would produce averages similar to the other tests, as the graphs indicate. In general, the poor read performance can be attributed to the access time metadata variable, which must be changed on every read and thus causes a log write in SCFS which is unmatched in the filesystem.
This result shows that, when used with an intelligent schema and storage layer, existing DBMS technology is capable of supporting a filesystem abstraction with minimal performance loss in exchange for the added value of DBMS locking and consistency. 59 60 Chapter 5 Future Directions 5.1 Moving to the Kernel Real filesystems are accessed through the standard system interface and may be mounted in the directory tree. In its current form, SCFS satisfies neither of these criteria, since it cannot be mounted and its interface, though a duplication of the standard, resides in a separate user-level library. This does not prevent application programs from using SCFS with minimal modification to their source code. However, the ultimate goal should be to achieve a seamless mount, which will require some degree of cooperation between SCFS and the operating system kernel. At the minimum, the storage layer portion of SCFS must be moved to the kernel to allow sharing among user-space applications. Currently, a separate storage layer is created for each process as described in section 3.5. Though sharing could be achieved by moving the storage layer into shared memory alongside the DBMS buffer cache and other subsystems, the same ends could be achieved simply by moving it into the kernel. Conventional filesystem buffer caches are part of operating system virtual memory. Three choices are consequently available for SCFS: either forego the DB buffer cache in favor of the operating systems; somehow combine the two, or maintain the DB buffer cache as a separate entity devoted entirely to the filesystem. The first two options may be impossible because of the relationship between the DB buffer pool and its transaction and locking subsystems. The third option is more appealing and 61 suggests a possible course for the integration of SCFS with the kernel: rather than attempting to move everything into kernel space, the DBMS itself could be maintained as a separate process running in user space. The kernel portion of the project, then, would consist simply of mounting functionality and integration with the standard libraries. The difficulty of implementing such an architecture is currently unknown more understanding of kernel module development is necessary. 5.2 Use of Multiple Disks In most DBMS installations (and in some of the newer journaling filesystems), increased performance is achieved by storing the log on a separate disk from the data. This provides more locality of reference to both systems by eliminating the need for constant disk seeks between data and logging areas [13]. In SCFS as tested in this project, both log and data were kept on the same disk partition. Future testing may explore the performance gain, if any, resulting from use of separate log and data disks. The use of separate disks may speed logging performance to the point where buffering of log entries in memory is unnecessary. The resulting SCFS instance would provide durability in addition to atomicity and isolation. Testing of performance under durability was not performed as part of this project. 62 Bibliography [1] B. Bartholomew. Pgfs: The postgres file system. Linux Journal,42, October 1997. Specialized Systems Consultants. [2] B. Callaghan, B. Pawlowski, and B. Staubach. Nfs version 3 protocol specification. RFC 1813, IETF Network Working Group, June 1995. [3] D. Chamberlin and et al. A history and evaluation of system r. Communications of the ACM, 24(10):632-646, 1981. [4] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. 
[4] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul. Rate of change and other metrics: A live study of the World Wide Web. USENIX Symposium on Internet Technologies and Systems, 1997.

[5] J. Gray et al. Granularity of locks and degrees of consistency in a shared data base. IFIP Working Conference on Modelling of Data Base Management Systems, pages 1-29, 1977. AFIPS Press.

[6] Database Task Group. April 1971 report. ACM, 1971.

[7] Open Group. The Single UNIX Specification, version 3. URL: http://www.unix-systems.org/version3, January 2002.

[8] PostgreSQL Global Development Group. PostgreSQL: The most advanced open-source database system in the world. URL: http://advocacy.postgresql.org, October 2002.

[9] T. Haerder and A. Reuter. Principles of transaction-oriented database recovery. Computing Surveys, 15(4):287-317, 1983.

[10] Richard Jones. Net::FTPServer. Freshmeat Projects Library, December 2002.

[11] Jeffrey Katcher. PostMark: A new file system benchmark. Technical Report 3022, Network Appliance Library, 1997.

[12] M. Astrahan et al. System R: Relational approach to database management. ACM Transactions on Database Systems, 1(2):97-137, 1976.

[13] Amir H. Majidimehr. Optimizing UNIX for Performance. Prentice Hall, Upper Saddle River, NJ, 1st edition, 1996.

[14] J. Postel and J. Reynolds. File Transfer Protocol (FTP). RFC 959, IETF Network Working Group, October 1985.

[15] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw Hill, 2nd edition, 2000.

[16] H. R. Rasmussen. Oracle Files: File server consolidation. White paper, Oracle Corporation, Redwood Shores, California, June 2002.

[17] Hans Reiser. Reiser 4. Technical report, Naming System Venture, June 2003.

[18] D. M. Ritchie and K. Thompson. The UNIX time-sharing system. Communications of the ACM, 17(7):365-375, July 1974.

[19] M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):36-52, February 1992.

[20] A. Silberschatz, M. Stonebraker, and J. Ullman. Database systems: Achievements and opportunities. Communications of the ACM, 34(10):110-120, 1991.

[21] A. Silberschatz, S. Zdonik, et al. Strategic directions in database systems: Breaking out of the box. ACM Computing Surveys, 28(4):764-788, 1996.

[22] Sleepycat Software. Berkeley DB products. URL: http://www.sleepycat.com/products, 2003.

[23] Sleepycat Software. Diverse needs, database choices - meeting specific data management tasks. Technical article, Sleepycat Software, Inc., Lincoln, MA, April 2003.

[24] Lex Stein. The bdbfs NFS3 fileserver. URL: http://eecs.harvard.edu/~stein/bdbfs, 2002.

[25] M. Stonebraker. Operating system support for database management. Communications of the ACM, 24(7):412-418, 1981.

[26] Stephen C. Tweedie. Journaling the Linux ext2fs filesystem. Proc. Linux Expo, 1998.