File-System Development with Stackable Layers

JOHN S. HEIDEMANN and GERALD J. POPEK
University of California, Los Angeles

Filing services have experienced a number of innovations in recent years, but many of these promising ideas have failed to enter broad use. One reason is that current filing environments present several barriers to new development. For example, file systems today typically stand alone instead of building on the work of others, and support of new filing services often requires changes that invalidate existing work. This paper discusses the design and implementation of stackable filing, in which complex services are constructed from layer "building blocks," each of which may be provided by independent parties. There are no syntactic constraints on layer configuration, and layers can occupy different address spaces, allowing very flexible use. Independent layer evolution and development are supported by an extensible interface bounding each layer. The paper presents these facilities in detail and shows that the design exhibits very high performance. By lowering the barriers to new filing development, stackable layering offers the potential of broad third-party file-system development not feasible today.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management—maintenance; D.4.7 [Operating Systems]: Organization and Design—hierarchical design; D.4.8 [Operating Systems]: Performance—measurements; E.5 [Data]: Files—organization/structure

General Terms: Design, Performance

Additional Key Words and Phrases: Composability, file-system design, operating-system structure, reuse

1. INTRODUCTION

File systems represent one of the most important aspects of operating-system services. Traditionally, the file system has been tightly integrated with the operating system proper.
This work was sponsored by the Advanced Research Projects Agency under contracts F29601-87-C-0072 and N00174-91-C-0107. J. S. Heidemann was also sponsored by a USENIX scholarship for the 1990–1991 academic year. G. J. Popek is affiliated with Locus Computing Corporation.
Authors' address: Department of Computer Science, University of California, Los Angeles, CA 90024; email: johnh@cs.ucla.edu; popek@cs.ucla.edu.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0734-2071/94/0200-0058 $3.50
ACM Transactions on Computer Systems, Vol. 12, No. 1, February 1994, Pages 58–89.

As a result, the evolution of filing services has been relatively slow. The Berkeley Fast File System, for example, was constructed almost a decade ago, and the MVS catalog manager, still in use after a decade and a half, is basically similar. In the face of these observations, it is not surprising that once a filing system is constructed and deployed, it is not changed a great deal. It is not possible today to add file-system services to an existing system in a simple manner, analogous to the way applications are added at the user level. Filing service is provided by a monolithic system rather than by individually developed packages, many of which could exchange data with one another in a straightforward way. Contrast this with the personal-computer arena, where there is an explosion of services: a user may select "shrink-wrap" software at a local store and install it himself.

UNIX is a trademark of UNIX System Laboratories.
In the personal-computer arena, a user can acquire the services he wishes, when he wishes, at a far faster rate than when any solution must be obtained from a single operating-system vendor. There are several reasons why this state of affairs persists. First, the file system is a complex and key part of an operating system, and it is traditionally viewed in its entirety, as a single service. To introduce an additional service, a developer must operate on an intimate basis with core operating-system internals, such as virtual-memory management and the buffer pool, perhaps addressing the implementation in its entirety. Furthermore, an error in the filing system's logic can be quite devastating, since the file system is responsible for all of the persistent storage in the computer; once a service is installed and operational, its successful behavior is critical, and it can interact with other software in daunting ways. For these reasons, vendors are reluctant to let third parties alter the file service, and applications with demanding storage requirements, such as database systems, have long had to arrange their own storage in incompatible, ad hoc ways. Despite the fact that there is considerable motivation, and that a range of improvements is well known and has been demonstrated in one context or another, file systems have seen far less evolution than much other software. Even in UNIX, where a file-system interface exists (the virtual file system, or VFS), it is a coarse-grain interface; despite much generality, it gives developers far less architectural benefit than an extensible design would, and it is unappealing to employ for incremental services.
The file system has seen little architectural improvement in over a decade, and there is no well-defined interface by which anyone but the primary vendor can extend filing services; much innovation remains locked in proprietary systems. This inhibition of development, and the considerable expense and difficulty it causes, is stifling. The "shrink-wrap" software arena has provided users considerably more, for example, than the comparatively cumbersome mainframe arena.

We believe filing services would benefit a great deal from a similar situation. Suppose it were just as straightforward to construct, obtain, and install independently developed packages for such services as
—fast physical storage management,
—extended directory services,
—compression and decompression,
—encryption and decryption,
—cache performance enhancement,
—remote-access services,
—automatic and selective file replication,
—undo and undelete, and
—transactions,
with the assurance that they would work together effectively. Then, we contend, available services would be much richer, much more widely available, and evolution of particular solutions would be far more rapid. In addition, the wide variety of expensive and confusing ad hoc solutions employed today would not occur.¹

The goal of the research reported in this paper is to contribute toward making it as easy to add a function to a file system in practice as it is to install an application on a personal computer: to provide an environment in which filing functionality can be easily constructed, effectively assembled from useful fragments built in the past, and which makes no compromises in functionality or performance.
The environment must be extensible, so that the result has a successful operational lifetime which spans generations of computer systems and so that continued evolution of its modules is assured. Multiple third-party additions must be able to coexist in this environment, some of which may provide unforeseen services. The impact of such an environment can be considerable. For example, many recognize the utility of microkernels such as Mach or Chorus. Such systems, however, address the structure of about 15 percent of a typical operating system, compared to the filing system, which often represents 40 percent of the operating-system code.

This paper describes an interface and framework that support stackable file-system development. This framework allows new layers to build on the functionality of existing services, while allowing all parties to gradually grow and evolve the interface and the set of services. Unique characteristics of the framework include the ability of layers in different address spaces to interact and of layers to react gracefully in the face of change.

¹Consider, for example, the grouping facility in MS-Windows. It is, in effect, another directory system layered on top of, and independent of, the underlying MS-DOS file system, providing in an upward-compatible fashion directory services that would not have been easy to incorporate into the underlying file system.
Actual experience is an important test of the utility of these ideas, and this design is accompanied by experiences obtained with an implementation and its use. Our work draws on earlier research with file-system modularity [Kleiman 1986], stream protocols [Ritchie 1984], and object-oriented design, as well as on the body of work applying these principles in a variety of other environments (see Section 6.2). The application of these ideas to a complex interface controlling access to persistent data differentiates this work from that which has come before.

1.1 Organization of the Paper

This paper first examines the reasons for and the essential characteristics of direct support for stackable file-system development, with examples drawn from the UNIX system. We then discuss how stacking can enable new approaches to file-system design. An implementation of this framework is next examined, focusing on the elements of filing unique to stacking. We follow with an evaluation of this implementation, examining both performance and usability. Finally, we consider similar areas of operating-systems research to place this work in context.

2. FILE-SYSTEM DESIGN

Many new filing services have been suggested in recent years. The Introduction presented a certainly incomplete list of nine such services, each of which exists today in some form. This section compares alternative approaches to providing new services, examining implementation characteristics of each that hinder or simplify development.

2.1 Traditional Design

A first approach to providing new filing services might be to modify an existing file system. There are several widely available file systems which can frequently be modified to add new capabilities. Although beginning with an existing implementation can significantly speed development of new features, a "standard" file system may have evolved across a dozen widely different platforms since its initial implementation. Supporting new services on each of them may well mean a dozen separate sets of modifications, each nearly as difficult as the previous.
Furthermore, it is often difficult to localize changes; correctness verification of the modified file system will require examination of its entire implementation, not just the modifications. Finally, although file systems may be widely used, source code for a particular system may be expensive, difficult, or impossible to acquire. Even when source code is available, it can be expected to change frequently as vendors update their system.

Standard filing interfaces, such as the vnode interface [Kleiman 1986] and NFS [Sandberg et al. 1985], address some of these issues. By defining an "airtight" boundary for change, such interfaces avoid modification of existing services, preventing the introduction of bugs outside the new activity. Yet, by employing existing file systems for data storage, basic filing services do not need to be reimplemented. Use of standard interfaces also allows new file systems to be developed at the user level, simplifying development and reducing the impact of error.

The interface approach introduces implementation problems of its own. An interface is either evolving or static. If, like the vnode interface, it evolves to provide new services and functionality, compatibility problems are introduced: a change to the vnode interface requires corresponding changes to each existing filing implementation. Compatibility problems can be avoided by keeping the interface static, as with NFS, but then new services must either overload existing operations or be provided outside the interface entirely; the concept of modularity is not cleanly supported. A naive approach today might interpose new services as user-level servers. Such a choice improves protection, but the RPC and process-boundary crossings it requires burden each data access with repeated copies between the kernel and the user. Although performance can be improved by in-kernel caching [Steere et al.
1990], portability is reduced, and the result is not fully satisfactory. Finally, even with careful caching, the cost of a layer crossing is often orders of magnitude greater than that of a subroutine call, thus discouraging the use of layering for structuring and reuse.

2.2 Stackable Design

Stackable file systems construct complex filing services from a number of independently developed layers, each potentially available only in binary form. New services are provided as separate layers; the interlayer interface provides a clear division for modifications, so errors are isolated in the layer at fault. Figures 1–5 illustrate how stacking can be used to provide filing services in a variety of environments. The term "stack" is somewhat of a misnomer, since nonlinear stacks (discussed in the following sections) can be common, as Figures 3 and 4 illustrate; we retain the term for historic reasons.

Layers are joined by a symmetric interface, syntactically identical above and below. Because of this symmetry, there are no syntactic restrictions in the configuration of layers. New layers can be easily added or removed from a file-system stack, much as Streams modules can be configured onto a network interface. Such filing configuration is simple to do; no kernel changes are required, allowing easy experimentation with stack behavior.

Because new services often employ or export additional functionality, the interlayer interface is extensible. Any layer can add new operations; existing layers adapt automatically to support these operations. If a layer encounters an operation it does not recognize, the operation can be forwarded to a lower layer in the stack for processing.
Since new file systems can support operations not conceived of by the original operating-system developer, unconventional file services can now be supported as easily as standard file systems.

The goals of interface symmetry and extensibility may initially appear incompatible. Configuration is most flexible when all interfaces are literally syntactically identical, that is, when all layers can take advantage of the same mechanisms. But extensibility implies that layers may be semantically distinct, restricting how layers can be combined. Semantic constraints do limit layer configurations, but Sections 4.2 and 4.4 discuss how layers provide reasonable default semantics in the face of extensibility.

Address-space independence is another important characteristic of the design. Layers often execute in a single protection domain, but there is considerable advantage to layers also being able to run transparently in different address spaces. Whether or not layers execute in the same address space should require no changes to the layers and should not affect stack behavior. New layers can then be developed at the user level and later put into the kernel for maximum performance. A transport layer bridges the gap between address spaces, transferring all operations and results back and forth. Like the interlayer interface, a transport layer must be extensible, to support new operations automatically. In this way, distributed filing implementations fit smoothly into the same framework and are afforded the same advantages.
The transport layer isolates the "distributed" component of the stack. More importantly, stack calls between layers in a single address space can then operate in an extremely rapid manner, avoiding the costs of distributed protocols such as argument marshaling. The interface has been developed to enable crossing between layers in the local case at absolutely minimum cost, while still maintaining the structure necessary for address-space independence.

This stacking model of layers joined by a symmetric but extensible filing interface, constrained by the requirements of binary-only kernels and address-space independence and by the overriding demand of absolutely minimal performance impact, represents the primary contribution of this work. The next section discusses approaches to file-system design that are enabled or facilitated by the stacking model.

3. STACKABLE LAYERING TECHNIQUES

This section examines in detail a number of different file-system development techniques enabled or simplified by stackable layering.

3.1 Layer Composition

One goal of layered file-system design is the construction of complex filing services from a number of simple, independently developed layers. If file systems are to be constructed from multiple layers, one must decide how services should be decomposed to make individual components most reusable. Our experience shows that layers are most easily reusable and composable when each encompasses a single abstraction. This experience parallels those encountered in designing composable network protocols in the x-kernel [Hutchinson et al. 1989] and in tool development with the UNIX shells [Pike and Kernighan 1984].

Fig. 1. Compression service stacked over a UNIX file system.

As an example of this problem in the context of file-system layering, consider the stack presented in Figure 1.
A compression layer is stacked over a standard UNIX file system (UFS); the UFS handles file services, while the compression layer periodically compresses rarely used files. A compression service provided above the UNIX directory abstraction has difficulty efficiently handling files with multiple names (hard links). This is because the UFS was not designed as a stackable layer; it encompasses several separate abstractions. Examining the UFS in more detail, we see at least three basic abstractions: a disk partition, arbitrary-length files referenced by fixed names (inode-level access), and a hierarchical directory service. Instead of a single layer, the "UFS service" should be composed of directory, file, and disk layers. In this architecture the compression layer could be configured directly above the file layer. Multiply named files would no longer be a problem, because multiple names would be provided by a higher-level layer. One could also imagine reusing the directory layer over other low-level storage implementations. Stacks of this type are shown in Figure 2.

3.2 Layer Substitution

Figure 2 also demonstrates layer substitution. Because the log-structured file system and the UFS are semantically similar, the compression layer can stack equally well over either. Substitution of one for the other is possible, allowing selection of low-level storage to be independent of higher-level services. This ability to have "plug-compatible" layers not only supports higher-level services across a variety of vendor-customized storage facilities, but also supports the evolution and replacement of the lower layers as desired.

3.3 Nonlinear Stacking

File-system stacks are frequently linear; all access proceeds from the top down through each layer. However, there are also times when nonlinear stacks are desirable. Fan-out occurs when a layer references "out" to multiple layers beneath it.
Figure 3 illustrates how this structure is used to provide replication in Ficus [Guy et al. 1990].

Fig. 2. Compression layer configured with a modular physical-storage service. Each stack uses a different file storage layer (UFS and log-structured layouts).

Fig. 3. Cooperating Ficus layers. The Ficus logical layer uses fan-out to identify several replicas, while a remote-access layer is inserted between cooperating Ficus layers as necessary.

Fan-in allows multiple clients access to a particular layer. If each stack layer is separately named, it is possible for knowledgeable programs to choose to avoid upper stack layers. For example, one would prefer to back up compressed or encrypted data without uncompressing or decrypting it, both to conserve space and to preserve privacy (Figure 4). This could easily be done by direct access to the underlying storage layer.

Fig. 4. Fan-in allows users to access the uncompressed data transparently through the compression layer, while a backup program accesses the compressed version directly.

Network access to encrypted data could also be provided by direct transfer of the encrypted storage, avoiding clear-text exposure.

3.4 Cooperating Layers

Layered design encourages the separation of file systems into small, reusable layers. Sometimes services that might otherwise be reusable occur in the middle of a special-purpose file system. For example, a distributed file system may consist of a client and a server portion, with a remote-access service in between. One can envision building the remote-access portion as a reusable layer, but if its exact semantics are buried in the internals of a particular file system, it will be unavailable for reuse.

Cases such as these call for cooperating layers. The file system is split into two separate, cooperating layers: a "semantics-free" remote-access service that transfers operations over the network, and the layers providing the semantics of the specific file system. When the file-system stack is composed, the reusable remote-access layer is placed between the others. Because the reusable portion is encapsulated as a separate layer, it is available for use in other stacks. For example, a new secure remote filing service could be built by configuring encryption/decryption layers around the basic transport service.

An example of the use of cooperating layers in the Ficus replicated file system [Guy et al. 1990] is shown in Figure 3. The Ficus logical and physical layers correspond roughly to the client and server portions of a replicated service. A remote-access layer is placed between them when necessary.

3.5 Compatibility with Layers

The flexibility stacking provides promotes rapid interface and layer evolution. Unfortunately, rapid change often results in incompatibility. Interface change and incompatibility today often prevent the use of existing filing abstractions [Webber 1993]. A goal of our design is to provide approaches that cope with interface change in a binary-only environment.

File-system interface evolution takes a number of forms. Third parties wish to extend interfaces to provide new services. Operating-system vendors must change interfaces to evolve the operating system, but usually also wish to
Operating-system vendors must change ACM interfaces TransactIons on to evolve Computer the Systems, operating Vol 12, No system, 1, February but 1994 usually also wish to File-System Development maintain backward approaches Extensibility Any provides interface is the primary to the interface; services. evolution increased, . 67 a number of evolution. can add operations existing gradual operating-system of a filing layer is greatly layering of interface file-system party not invalidate Stackable the problems of the compatibility. need compatibility. to address with Stackable Layers Third-party tool to such development address additions is facilitated, becomes possible, and the useful lifetime protecting the investment in its construc- tion. Layer substitution (see Section incompatibilities. adaption to differences format tied on other of more have interface, service provide hardware A still more significant direct different layers example, system simple allows a low-level easy storage by an alternate may base layer boundaries, used transport and operation this limited role. Section Ficus replication available operating of the NFS this systems. radi- access to and of stackable can be thought ranging from provides and is not extensible, only it is still approach to machine benefits NFS a single with limited can bridge on platforms how between vendor systems many 5.3 describes on PCs. map different layers standard of their to map operating environment. available to parties sets is difficult, Transport Although restrictions, by layer. semantics operating-system changes. between extending layer, to mainframes. imparts or by an significant computing a compatibility of the be constructed is posed be possible. 
Although NFS provides only a single operation set and is not extensible, it is widely available and can serve as such a transport in this limited role; third parties could employ it to make core filing services easily available on computers ranging from personal machines to mainframes. Section 5.3 describes how NFS is used to make Ficus replication available on PCs.

3.6 User-Level Development

One advantage of microkernel design is the ability to move large portions of the operating system outside of the kernel. Stackable layering fits naturally with this approach. Each layer can be thought of as a server, and each operation as an RPC message between servers. In fact, the UCLA interface usually takes this form at layer boundaries. With a transport layer (such as NFS) moving operations out of the kernel, and another (the "u-to-k layer") moving user-level operations back into the kernel, new layers may be developed and executed as user-level file-system servers (Figure 5). Although inter-address-space RPC has real cost, caching may provide reasonable performance for an out-of-kernel file system [Steere et al. 1990] in some cases, particularly if other characteristics of the filing service have inherently high latency (for example, hierarchical storage management).
Stackable layering offers valuable Because file-system layers each interact only through of frequent flexibility the layer RPCS in this case. interface, the transport layers can be removed from this configuration without affecting layer’s implementation. An appropriately constructed layer can then run the kernel, kernel avoiding (or between all RPC different ing the concepts of modularity ing permits advantages an integrated the execution overhead. Layers user-level servers) from address-space of microkernel can be moved as usage development in an out of the requires. protection, a in By separat- stackable layer- and the efficiency of environment. 4. IMPLEMENTATION The UCLA stackable layers interface and its environment are the results of our efforts to tailor file-system development to the stackable model. Sun’s vnode interface is extended to provide extensibility, stacking, and addressspace independence. We describe this implementation summary of the vnode interface and then examining our stackable 4.1 Existing here, beginning with a important differences in interface. File System Interfaces Sun’s vnode interface is a good example of several “file-system switches” developed for the UNIX operating system [Kleiman 1986; Rodriguez 1986]. All have the same goal, to support multiple file-system types in the same operating system. The vnode interface has been quite successful in this respect, providing dozens of different filing services in several versions of UNIX. The vnode interface is a method of abstracting the details of a file-system implement ation from the majority of the kernel. The kernel views file access ACM Transactions on Computer Systems, Vol. 12, No. 1, February 1994. File-System Development with Stackable Layers . 69 A Fig. through two abstract data types. 6. Namespace A vnode identifies composed of two individual subtrees. files. 
A small set of file types is supported, including regular files, which provide an uninterpreted array of bytes for user data, and directories, which list other files. Directories include references to other directories, forming a hierarchy of files. For implementation reasons, the directory portion of this hierarchy is typically limited to a strict tree structure. The other major configuration referred data structure purposes, to as file is the sets of files systems or disk partitions), Subtrees are added to the file-system Mounting is the process of adding file-system namespace. Figure vfs, representing are grouped into groups of files. each corresponding to one vfs. namespace by mounting. new collections of files into 6 shows two For sub trees (traditionally subtrees: the root the global subtree and another attached under /usr. Once a subtree is mounted, name translation proceeds automatically across subtree boundaries, presenting the user with an apparently seamless namespace. All files within a subtree typically have al UNIX disk partitions correspond is employed, each collection of files a corresponding completely Data subtrees subtree separate requires be manipulated supported characteristics. local machine. Tradition- subtrees. machine Each When NFS is assigned subtree is allowed a implementation. encapsulation that these abstract only by a restricted by vnodes, implementation on the similar one-to-one with from a remote the (see Karels abstract and data data types set of operations. type McKusick for for “files,” vary and Kleiman [1986] files and The operations according to [1986] for semantics of typical operations). To allow this generic treatment of vnodes, binding of desired function to correct implementation is delayed until kernel initialization. This is implemented by associating with each vnode type an operations vector identifying the correct can then implementation be invoked of each operation on a given vnode for that by looking vnode type. 
up the correct Operations operation in this vector (this mechanism is analogous to typical implementations of C+ + virtual class method invocation). Limited file-system stacking is possible with the standard vnode interface using the mount mechanism. Sun Microsystems’ NFS [ Sandberg 1985], loopACM Transactions on Computer Systems, Vol. 12, No. 1, February 1994. 70 . J. S. Heidemann back and translucent and G. J. Popek [Hendricks 1990] Information associated with the mount layer and where the new layer should Extensibility 4.2 of interface Incompatible 1993] are serious filing services concerns without The vnode interface the addition file have The UCLA all interface cally operations information approach. initialization, additions.z invocation. of each operation During new operations, reconfiguration. operation may systems be added provides of these the operations This set dynamically, very rapid the correct each file by any layer. dynami- to each file system select can be added since change. of all kernel. vector to permit Thus, then a list by this customized vectors vnode. and operations sufficient these for a given and new file operations Vectors information after by maintaining beings supported the global of all prohibits execution, of extensibility the union of all operations used to define its imple- convention or during system of evolu- definition compatibility execution file with the formal time problem At kernel caching tion link until a requirement ease vendor of an operation This existing [Webber to add to the set of is greatly interface Each it to arbitrary tion this with problem The ability compilation. interface. the list is then constructed, New this problem practices by fixing at kernel addresses yielding time of ensuring interface it supports. adapting take release and would kernel definition of operations then before the today. 
Since binding of operation to implementation is delayed until kernel initialization, no compile-time association constrains the set of operations: development of new operations by diverse third parties can proceed without lock-step coordination, and existing file systems need not change when an operation is added elsewhere.²

Because the interface does not define a fixed set of operations, a new layer must expect "unsupported" operations and accommodate them consistently. The UCLA interface requires a default routine that will be invoked for all operations not otherwise provided by a file system. File systems may simply return an "unsupported operation" error, but layers are expected to pass unknown operations to a lower layer for processing.

Supporting dynamically constructed operations vectors also requires an extension of the old method of operation invocation: the static offset into the operations vector is replaced with a new, dynamically computed offset. These changes have very little performance impact, an important consideration for an interlayer interface that will be employed as frequently as this one. Section 5.1 discusses the performance of stackable layering in detail.

²For simplicity, we ignore here the problem of adding operations dynamically at run time, an approach that might be described as "fully dynamic." This extension is not part of the current implementation, but we expect it could be supported with very little additional code.

Fig. 7. Mounting a UFS layer. The new layer is instantiated at /layer/ufs/crypt.raw from a disk device /dev/dsk0g.
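The default-routine convention described above can be sketched as follows: a layer's default routine rewrites the vnode argument to name the next lower layer's vnode and re-invokes the same vector slot there. All identifiers are illustrative, and the error value is a stand-in, not the real kernel constant.

```c
#include <stddef.h>

typedef int (*lop_fn)(void *args);

struct lvnode {
    lop_fn *l_ops;          /* this layer's operations vector */
    struct lvnode *l_lower; /* next lower layer's vnode (NULL at bottom) */
};

/* Generic argument block: every operation's arguments begin with the
 * vnode being operated on and the operation's run-time vector offset. */
struct lop_args {
    struct lvnode *a_vp;
    int a_offset;
};

enum { EOPNOTSUPP_SKETCH = 45 };  /* stand-in error code */

/* Default routine: forward the unknown operation to the layer below. */
int layer_default(void *args) {
    struct lop_args *ap = args;
    struct lvnode *lower = ap->a_vp->l_lower;
    if (lower == NULL)
        return EOPNOTSUPP_SKETCH;   /* bottom of the stack: truly unsupported */
    ap->a_vp = lower;
    return lower->l_ops[ap->a_offset](ap);
}

/* A bottom-layer implementation of operation slot 0, for demonstration. */
int bottom_op0(void *args) { (void)args; return 7; }
```

A layer built this way need implement only the operations it cares about; everything else falls through to layer_default and, from there, down the stack.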
4.3 Stack Creation

This section discusses how stacks are formed. In the prototype interface, stacks are configured at the file-system granularity and constructed as required on a file-by-file basis.

4.3.1 Stack Configuration. Section 4.1 described how a UNIX file system is built from a number of individual subtrees by mounting. Subtrees are the basic unit of file-system configuration; each is either mounted, making all of its files accessible, or unmounted and unavailable. We employ this same mechanism for layer construction.

Fundamentally, the UNIX mount mechanism has two purposes: it creates a new "subtree object" of the requested type, and it attaches this object into the file-system namespace for use. An example of subtree creation is shown in Figure 7, where a new UFS layer is instantiated at /layer/ufs/crypt.raw from a disk device /dev/dsk0g.

Configuration of layers requires the same basic steps of layer creation and naming, so we employ the same mount mechanism for layer construction.³ Layers are built at the subtree granularity, a mount command creating each layer of a stack. Typically, stacks are built bottom up. After a layer is mounted to a name, the next higher layer's mount command uses that name to identify its "lower-layer neighbor" in initialization. Figure 8 continues the previous example by stacking an encryption layer over the UFS.

³Although mount is typically used today to provide "expensive" services, the mechanism is not inherently costly. Mount constructs an object and gives it a name; when object initialization is inexpensive, so is the corresponding mount.

Fig. 8. An encryption layer is stacked over a UFS. The encryption layer is instantiated at /usr/data from an existing UFS at /layer/ufs/crypt.raw.
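The layer-creation steps just described (create a subtree object, give it a name, and record the lower neighbor) can be sketched as follows. mount_layer and the in-memory mount table are hypothetical stand-ins for the real mount machinery; error handling is omitted.

```c
#include <stdlib.h>
#include <string.h>

struct subtree {
    char name[64];
    struct subtree *lower;      /* lower-layer neighbor; NULL for a base fs */
};

#define MAX_SUBTREES 8
static struct subtree *mounted[MAX_SUBTREES];
static int n_mounted;           /* bounds checking omitted in sketch */

/* Find an already-mounted subtree by the name it was mounted to. */
static struct subtree *lookup(const char *name) {
    for (int i = 0; i < n_mounted; i++)
        if (strcmp(mounted[i]->name, name) == 0)
            return mounted[i];
    return NULL;
}

/* "Mount" a layer at mount_point, stacked over lower_name
 * (or over nothing when lower_name is NULL). */
struct subtree *mount_layer(const char *mount_point, const char *lower_name) {
    struct subtree *st = calloc(1, sizeof *st);
    strncpy(st->name, mount_point, sizeof st->name - 1);
    st->lower = lower_name ? lookup(lower_name) : NULL;
    mounted[n_mounted++] = st;
    return st;
}
```

Used bottom up, two calls reproduce the configuration of Figures 7 and 8: first the base UFS at its name, then the encryption layer naming that mount as its lower neighbor.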
In this figure an encryption layer is created with a new name (/usr/data) after specifying the lower layer (/layer/ufs/crypt.raw). Alternatively, if no new name is necessary or desired, the new layer can be mounted to the same place in the namespace.⁴ Stacks with fan-out typically require that each lower layer be named when constructed.

Stack construction does not necessarily proceed from the bottom up. Sophisticated file systems may create lower layers on demand. The Ficus distributed file system takes this approach in its use of volumes. Each volume is a subtree storing related files. To ensure that all sites maintain a consistent view of the thousands of volumes in a large-scale distributed system, information about the location of each volume is maintained on disk at the point of the volume mount. When a volume mount point is encountered during path-name translation, the corresponding volume (and lower stack layers) is automatically located and mounted.

4.3.2 File-Level Stacking. While stacks are configured at the subtree level, most user actions take place on individual files. Files are represented by vnodes, with one vnode per layer. When a user opens a new file in a stack, a vnode is constructed to represent each layer of the stack. User actions begin in the top stack layer and are then forwarded down the stack as required. If an action requires creation of a new vnode (such as referencing a new file), then as the action proceeds down the stack, each layer will build the appropriate vnode and return its reference to the layer above. The higher layer will then store this reference in the private data of the vnode it constructs.

⁴Mounts to the same name are currently possible only in BSD 4.4-derived systems. If each layer is separately named, standard access control mechanisms can be used to mediate access to lower layers.
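The per-file construction described above can be sketched in C: opening a file builds one vnode per layer, each holding the reference to the vnode below in its private data. The names are illustrative, and cleanup is omitted.

```c
#include <stdlib.h>

struct fvnode {
    int layer_id;           /* which layer this vnode belongs to */
    struct fvnode *lower;   /* private data: the lower layer's vnode */
};

/* Build the stack of vnodes for a file covered by `depth` layers,
 * proceeding up from the bottom layer (id 0); returns the top vnode. */
struct fvnode *open_stacked(int depth) {
    struct fvnode *lower = NULL;
    for (int id = 0; id < depth; id++) {
        struct fvnode *v = calloc(1, sizeof *v);
        v->layer_id = id;
        v->lower = lower;   /* remember the reference returned from below */
        lower = v;
    }
    return lower;           /* top of the stack (NULL when depth is 0) */
}

/* Walk the lower-layer references to count the stack's depth. */
int stack_depth(const struct fvnode *v) {
    int n = 0;
    for (; v != NULL; v = v->lower) n++;
    return n;
}
```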
reference several Since vnode references the rest of the kernel, Should lower-level with Stackable Layers a layer employ vnodes similarly. . fan-out, each 73 of its are used both to bind layers and to access files from no special provision needs to be made to perform operations between layers. The same operations used by the general kernel can be used between layers; layers treat all incoming operations identically. Although the current implementation does not explicitly support stack configuration at a per-file prohibits finer configuration maintain proper 4.3.3 layer granularity, control. configuration layers when Stack Data details there is nothing A typed-file layer, for each file and automatically Caching. cause cache aliasing A cache manager Individual stack page other ferred. was not layer This often previously is called approach cached. upon If the to flush is analogous page its before is required this fact, 4.4 the replication A compression different sion layer Attribute currently Before of that page ownership if by another ownership or token passing to identify cached layer, is transin a dis- data pages different layers. The cache manager identifies pages by a pair: [stack file offset]. Stack layers that provide semantics that violate this to its clients. have to cache interface. is cached cache to callbacks policy wish acquire “ownership” immediately grants naming convention are required to conceal this difference For example, a replication service must coordinate stack its clients, itself, and its several lower layers, providing conceal the files. If each layer cached copies can problems and possible data 10SS.5 coordinates page caching in the UCLA tributed file system. A consistent page-naming between identifier, layers and to support memory mapped data pages, writes to different any stack layer may cache a page, it must from the cache manager. The cache manager the create the file is opened. 
Stack layers whose semantics violate this naming convention are required to conceal the difference from their clients. For example, a replication service must coordinate caching among its clients, itself, and its several lower layers providing replica storage; to present a consistent view, the replication layer will claim to be the bottom of the cache-naming stack. A compression layer presents another example of these problems, since file offsets above and below the compression layer have different meanings; such a layer must explicitly coordinate cache behavior between the layers above and below itself. We are investigating solutions in these areas.

4.4 Stacking and Extensibility

One of the most powerful features of a stackable interface is that layers can be stacked together, each adding functionality to the whole. Often, layers in the middle of a stack will modify only a few operations, passing most to the next lower layer unchanged. For example, although an encryption layer would encrypt and decrypt all data accessed by read and write requests, it may not need to modify operations for directory manipulation. Since the interlayer interface is extensible and therefore new operations may always be added, an intermediate layer must be prepared to forward arbitrary, new operations.

One way to pass operations down would be to implement, in each layer, a routine for each operation that explicitly invokes the same operation in the next lower layer. This approach would require modification of all existing layers when any layer adds a new operation; the creation of new stacks would be discouraged, and the use of unmodified third-party layers would be impossible. What is needed is a single bypass routine that forwards new operations to a lower level.

⁵Imagine modifying the first byte of a page in one stack layer and the last byte of the same page in another layer. Without some cache consistency policy, one of these modifications will be lost.
Default routines (discussed in Section 4.2) provide the capability to have a generic routine intercept unknown operations, but the standard vnode interface provides no way to process them generically. To handle multiple operations, a single bypass routine must be able to handle the variety of arguments used by different operations. It must also be possible to identify the operation taking place and to map any vnode arguments to their lower-level counterparts.⁶ Neither of these characteristics is possible with existing interfaces, where operations are implemented as standard function calls. And, of course, support for these characteristics must have minimal performance impact.

The UCLA interface provides these characteristics by managing operations' arguments explicitly. Invocations are expressed as collections of arguments, and metadata associated with each collection identifies the operation, the argument types, and other pertinent information. Together, explicit metadata and argument collections allow operations to be manipulated in a generic fashion and efficiently forwarded to lower layers, usually with only pointer manipulation. By convention, we expect most file-system layers to support such a bypass routine. More importantly, these changes to the interface have minimal impact on performance: passing metadata, for example, requires only one additional argument to each operation. See Section 5.1 for a detailed analysis of performance.
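A single generic bypass routine might be sketched as follows, assuming UCLA-style metadata: each operation description records which argument slots hold vnodes, so one routine can map every vnode argument to its lower-layer counterpart and re-invoke the operation below. All structures and names are illustrative reconstructions, not the paper's actual definitions.

```c
#include <stddef.h>

struct bvnode;
typedef int (*bop_fn)(void *args);

struct op_desc {                 /* per-operation metadata */
    int offset;                  /* slot in the operations vector */
    int vnode_args[4];           /* argument indices holding vnodes; -1 ends */
};

struct bvnode {
    bop_fn *ops;                 /* this layer's operations vector */
    struct bvnode *lower;        /* next lower layer's vnode */
};

struct bargs {                   /* generic argument collection */
    const struct op_desc *desc;
    void *arg[4];                /* the operation's arguments */
};

/* The bypass: replace each vnode argument by its lower-layer vnode,
 * then invoke the same operation in the lower layer's vector. */
int bypass(void *a) {
    struct bargs *ap = a;
    struct bvnode *first_lower = NULL;
    for (int i = 0; ap->desc->vnode_args[i] != -1; i++) {
        int k = ap->desc->vnode_args[i];
        struct bvnode *v = ap->arg[k];
        ap->arg[k] = v->lower;           /* pointer manipulation only */
        if (first_lower == NULL)
            first_lower = v->lower;
    }
    return first_lower->ops[ap->desc->offset](ap);
}

/* Bottom-layer operation used for demonstration. */
int bottom_getattr(void *a) { (void)a; return 42; }
```

Because the metadata, not the routine, knows each operation's shape, the same bypass serves every present and future operation.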
4.5 Intermachine Operation

A transport layer is a stackable layer that transfers operations from one address space to another. Because vnodes for both local and remote file systems accept the same operations, they may be used interchangeably, providing network transparency. Sections 3.5 and 3.6 describe some of the ways to configure the layers this transparency allows.

Providing a bridge between address spaces presents several potential problems. Different machines might have differently configured sets of operations. Heterogeneity can make basic data types incompatible. Finally, methods to support variable-length data and dynamically allocated data structures for traditional kernel interfaces do not always generalize when crossing address-space boundaries.

For two hosts to interoperate, it must be possible to identify each desired operation unambiguously. Well-defined RPC protocols, such as NFS, ensure compatibility by providing only a fixed set of operations. Since restricting the set of operations restricts extensibility and impedes innovation, each operation in the UCLA interface is instead assigned a universally unique identifier when it is defined.⁷ Intermachine communication uses these labels to identify operations, rejecting locally unknown operations.

Transparent forwarding of operations across address-space boundaries requires not only that operations be identified consistently, but also that arguments be communicated correctly in spite of machine heterogeneity. Part of the metadata associated with each operation includes a complete type description of all arguments. With this information, an RPC protocol can marshal operation arguments and results between heterogeneous machines. Thus, a transport layer may be thought of as a semantics-free RPC protocol with a stylized method of marshaling and delivering arguments. NFS provides a good prototype transport layer.

⁶Vnode arguments change as an operation proceeds down the stack, much as protocol headers are stripped off as network messages are processed. No other argument processing is required to bypass an operation between two layers in the same address space.

⁷Generation of operation identifiers requires precise information at operation creation time. Schemes based on host identifier and time of assignment suffice; we employ the NCS UUID mechanism.
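The identifier-based matching described above can be sketched as a lookup that maps an incoming operation's universal identifier to a local vector offset, rejecting anything this host cannot name. The identifier strings and table are made up for illustration; real identifiers would be NCS UUIDs.

```c
#include <string.h>

struct rop {
    const char *uuid;      /* universally unique operation identifier */
    int local_offset;      /* this host's operations-vector slot */
};

#define NOPS 2
static const struct rop local_ops[NOPS] = {
    { "0001-ucla-read",  0 },   /* illustrative identifiers */
    { "0002-ucla-write", 1 },
};

/* Map a remote operation's identifier to the local vector offset,
 * or return -1 to reject an operation unknown on this host. */
int map_remote_op(const char *uuid) {
    for (int i = 0; i < NOPS; i++)
        if (strcmp(local_ops[i].uuid, uuid) == 0)
            return local_ops[i].local_offset;
    return -1;
}
```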
NFS was not designed as a transport layer, however: it stacks on top of existing local file systems using the vnode interface, but its operations are not extensible, and its implementations define particular caching semantics. We extend NFS to support the UCLA interface, enabling it to bypass new operations automatically; we have also prototyped a cache consistency layer, providing a separate consistency policy above the transport layer.

In addition to the use of an NFS-like transport layer operating between the kernels of two machines, we employ a more efficient transport layer between the kernel and the user level. Such a transport layer provides a "system-call"-level interface to the UCLA interface, allowing user-level development of file-system layers and user-level access to new file-system functionality.

This transport layer places one additional restriction on the interface: traditional user-to-kernel system calls require the client to provide space for all returned data, a restriction that has not been universal in kernel interfaces. In practice, the client can often make a good estimate of storage requirements. If the client's first guess is wrong, an error and the required size are returned, allowing the client to repeat the operation correctly.
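The guess-and-retry convention can be sketched as below: the client guesses a buffer size, and on failure the call reports the size actually required so the client can retry. The API, the path used as sample data, and the allocation strategy are all illustrative; error handling is abbreviated.

```c
#include <stdlib.h>
#include <string.h>

/* A "server side" operation returning a variable-length result.
 * Returns 0 on success; otherwise the number of bytes required. */
int get_link_target(char *buf, int buflen, int *needed) {
    const char *target = "/layer/ufs/crypt.raw";  /* sample data */
    int len = (int)strlen(target) + 1;
    *needed = len;
    if (len > buflen)
        return len;                /* guess too small: report correct size */
    memcpy(buf, target, (size_t)len);
    return 0;
}

/* Client-side wrapper: guess small, then repeat with the reported size. */
char *read_link_alloc(void) {
    int needed = 0;
    char *buf = malloc(8);                       /* first guess */
    if (get_link_target(buf, 8, &needed) != 0) { /* guess was wrong */
        buf = realloc(buf, (size_t)needed);
        if (get_link_target(buf, needed, &needed) != 0) {
            free(buf);
            return NULL;
        }
    }
    return buf;
}
```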
4.6 Centralized Interface Definition

Several aspects of the UCLA interface require precise information about the characteristics of each operation: network transparency requires a complete definition of all operation types, and a bypass routine must be able to map vnodes from one layer to the next. The designer of a layer employing new operations must provide this information. Detailed interface information is needed at several different places throughout the layers. Rather than require that the interface designer keep this information consistent in several different places, operation definitions are combined into an interface definition. Similar to the data-description language used by RPC packages, this description lists each operation, its arguments, and the direction of data movement; an interface "compiler" translates it into forms convenient for automatic manipulation.

4.7 Framework Portability

Although portions of the interface are somewhat dependent on this data-description approach, the UCLA interface has proved to be quite portable. Initially implemented under SunOS 4.0.3, it has since been ported to SunOS 4.1.1. In addition, the in-kernel stacking and extensibility portions of the interface have been ported to BSD 4.4. Although BSD's namei approach to path-name translation required some change, we are largely pleased with our framework's portability. Section 5.3 discusses the portability of individual layers derived independently of the vnode interface.

While the UCLA interface itself has proved to be portable, portability of individual layers is more difficult. None of the implementations described have identical sets of vnode operations, and path-name translation approaches differ considerably between SunOS and BSD. Fortunately, several aspects of the UCLA interface provide approaches to address layer portability. Extensibility allows layers with different sets of operations to coexist; in fact, interface additions from SunOS 4.0.3 to 4.1.1 required no changes to existing layers. When interface differences are significantly greater, a compatibility layer (see Section 3.5) provides an opportunity for layers to run without change. Ultimately, adoption of a standard set of core operations (as well as other system services) is required for effortless layer portability.
5. PERFORMANCE AND EXPERIENCE

While a stackable file-system design offers numerous advantages, file-system layering will not be widely accepted if layer overhead is such that a monolithic file system performs significantly better than one formed from multiple layers. To verify layering performance, overhead was evaluated from several points of view. If stackable layering is to encourage rapid advance in filing, not only must it have good performance, but it also must facilitate the development of new filing abstractions. Here, we examine this aspect of "performance" as well, first by comparing the development of similar file systems with and without the UCLA interface, and then by examining the development of layers in the new system. Finally, compatibility problems are one of the primary barriers to the use of current filing abstractions; we conclude by describing our experiences in applying stacking to resolve file-system incompatibilities.

5.1 Layer Performance

To examine the performance of the UCLA interface, we consider several classes of benchmarks. First, we examine the costs of particular parts of the interface with "microbenchmarks."
new interface two while changes overhead, portions the in Section those in Section way every operation of the to 8 Mb of RAM measurements the new kernel, To minimize we discuss The with file calls interface: file- must be the method for calling an operation, and the bypass routine. Cost of operation invocation is the key to performance, since it is an unavoidable cost of stacking no matter how layers themselves are constructed. To evaluate the performance of these portions of the interface, we consider the number of assembly-language instructions generated in the implementation. Although this statistic is only a very rough indication of true cost, provides an order-of-magnitude comparison.8 We begin the UCLA quence the cost of invoking a Sun-3 translates sequence with by considering interfaces. into requires respect On four the assembly-language six instructions.g to most an operation platform, vnode instructions, We view file-system in the vnode original this overhead tem or local layers next layer pressed File offer. down, routine the a remote of these caching pass user’s must to about are many actions file into routine amounts disk layers the bypass in our design one-third These modifying file or to bring to be practical, About compression might new as not significant the cost of the bypass routine, layers, each adding new abilities stack. se- the operations. We are also interested in number of “filter” file-system such and calling while it examples operations only the local We envision a to the file-sys- disk in the a com- cache. For such layers be inexpensive. are not to the to uncompress A complete 54 assembly-language instructions of services directly bypass instructions.l main flow, being 0 used only for uncommon argument combinations, reducing the cost of forwarding simple vnode operations to 34 instructions. 
Although this cost is significantly more than a simple subroutine call, it is not significant with respect to the cost of an average file-system operation. To investigate the effects of layering further, we next examine the overall performance of a multilayered system.

⁸Factors such as machine architecture and the choice of compiler have a significant impact on these figures. Many architectures are significantly slower than others. We claim only a rough comparison from these statistics.

⁹We found a similar ratio on SPARC-based architectures, where the old calling sequence required five instructions and the new sequence eight. In both cases these figures do not include the code to pass arguments to the operation.

¹⁰These figures were produced by the Free Software Foundation's gcc compiler. The C compiler bundled with SunOS 4.0.3 produced 71 instructions.

Table I. Modified Andrew Benchmark Results Running on Kernels Using the Vnode and the UCLA Interfacesᵃ

Phase       Vnode interface     UCLA interface      Percent
            Time     %RSD       Time     %RSD       overhead
MakeDir     3.3      16.1       3.2      14.8       -3.03
Copy        18.8     4.7        19.1     5.0        1.60
ScanDir     17.3     5.1        17.8     7.9        2.89
ReadAll     28.2     1.8        28.8     2.0        2.13
Make        327.1    0.4        328.1    0.7        0.31
Overall     394.7    0.4        396.9    0.9        0.56
examines In addition, Coarse benchmarks limit their We attribute this more Computer the long compile second a recursive (a 4.8-Mb lower two /usr/include and since Systems, Vol. system 12, No. the of time by this Both phases impact tree) 1, February 1994. first to extend small of copy doing operate occurred instead of layering, is usually to the phase recursive the directory time The compile phases, remove. and the overhead we examined employs for the granularity exaggerates on differphases accuracy, done per unit strenuously, benchmark greatly is in the kernel Transactions of four I. Overhead timing elapsed This copy the effect of the first can be seen in Table operations interface This phases, Because we knew all overhead time (time spent in the kernel) time. a Andrew recursive we examine duration of the benchmark. kernel, we measured system overhead with the modified and ac- for evaluation. the UCLA In addition, 2 percent. overhead. of file-system the are useful, results. Nevertheless, taken as a whole, “normal use” better than a file-syscharacterizes such as a recursive copy/remove. a slight times. only et al. 1988] accuracy the benchmark times counts benchmark averages only performance in the new interface. this benchmark probably tem-intensive benchmark The overall are essential two benchmarks: 1990; benchmark dominates first supporting subdirectory layers the instruction measurements To do so, we consider benchmarks relative Although performance The first 5.1.3 examines system. Performance. implementation adding Section file standard ACM from ( ~X\W.Y ). overhead granularity. file-system and = 1 s) are the means relative a on the in the of total since all compared to File-System Development Table II. with Stackable Layers . 
Table II. Recursive Copy and Remove Benchmark Results Running on Kernels Using the Vnode and UCLA Interfacesᵃ

Phase               Vnode interface     UCLA interface      Percent
                    Time     %RSD       Time     %RSD       overhead
Recursive copy      51.57    1.28       52.55    1.11       1.90
Recursive remove    25.26    2.50       25.41    2.80       0.59
Overall             76.83    0.87       77.96    1.11       1.47

ᵃTime values (in seconds; timer granularity = 0.1 s) are the means of system time from 20 sample runs. %RSD indicates the percent relative standard deviation. Overhead is the percent overhead of the new interface.

Because all overhead is in the kernel, system-time comparisons indicate a higher relative overhead than the elapsed "wall-clock" time a user actually experiences. As can be seen in Table II, system-time overhead averages about 1.5 percent.

5.1.3 Multiple-Layer Performance. Since the stackable-layers design philosophy advocates using several layers to implement what has traditionally been provided by a monolithic module, the cost of layer transitions must be minimal if layering is to be used for serious file-system implementations. To examine the overall performance impact of a multilayer file system, we analyze the performance of a file-system stack as the number of layers changes.

To perform this experiment, we began with a kernel modified to support the UCLA interface within all file systems and the vnode interface throughout the rest of the kernel.¹¹ At the base of each stack we placed a Berkeley Fast File System modified to use the UCLA interface. Above this we mounted from zero to six null layers, each of which merely forwards all operations to the next layer of the stack. We ran the benchmarks described in the previous section upon those file-system stacks. This test is by far the worst possible case for layering, since each added layer incurs full overhead without providing any additional functionality.

Figure 9 shows the results of this study. Performance varies nearly linearly with the number of layers used. The modified Andrew benchmark shows about a 0.3 percent elapsed-time overhead per layer.
Alternate benchmarks, such as the recursive copy and remove phases, also show less than 0.25 percent overhead per layer.

Fig. 9. Elapsed time of recursive copy/remove and modified Andrew benchmarks as layers are added to a file-system stack. Each data point is the mean of four runs.

To get a better feel for the costs of layering, we also measured system time, that is, time spent in the kernel on behalf of the process. Figure 10 compares recursive copy and remove system times (the modified Andrew benchmark does not report system-time statistics). Because all overhead is in the kernel and the total time spent in the kernel is only one-tenth of elapsed time, comparisons of system time indicate a higher overhead: about 2 percent per layer for recursive copy and remove. Slightly better performance for the case of one layer in Figure 10 results from a slight caching effect of the null layer compared to the standard UFS.

Fig. 10. System time of recursive copy/remove benchmarks as layers are added to a file-system stack (the modified Andrew benchmark does not provide system time). Each data point is the mean of four runs. Measuring system time alone presents the worst possible layering overhead.

¹¹To improve portability, we desired to modify as little of the kernel as possible. Mapping between interfaces occurs automatically upon first entry of a file-system layer.
indicate that under normal load usage,, a layered be virtually file-system undetectable. Also, use a small overhead First, elapsed-time results file-system architecture will system-time costs imply that during heavy will be incurred when numerous layers are involved. 5.2 Layer Implementation Effort An important goal of stackable file job of new file-system development. saves a significant amount of time systems Importing and this interface functional with in new development, is to ease the existing layers but this be compared to the effort required to employ stackable layers. sections compare development with and without the UCLA examine how layering conclude that 5.2.1 Simple file-system than development tion? To evaluate implemented that the was of existing and layer was chosen vnode interface, A first concern would file without and small and large would we chose with large small process extensibility; complexity, both both Development. was faces do not support through traditional simplifies Layer layers the can be used for both layering when Most services. We developing other the the interface. size new complicated kernel complicate to examine for comparison: and the null filing to be more facility UCLA must three and projects. prove systems. this savings The next interface, inter- implementa- of similar layers A simple “pass- the loopback file system under the layer under the UCLA interface.lz We performed this comparison for both the SunOS 4.0.3 and the BSD 4.4 implementations, measuring complexity as numbers of lines of comment-free C code.13 Table III compares the code length of each service in the two operating systems. Closer examination reveals in the implementation of individual ments must the most operations explicitly services with forward provided that the majority vnode operations. a bypass routine, each operation. 
by the null while In spite layer are of code savings The null layer the loopback of a smaller also more file occurs implesystem implementation, general; the same implementation will support the addition of future operations. For the example of a pass-through layer, use of the UCLA interface enables improved functionality with a smaller implementation. Although the relative difference in size would be less for single layers providing multiple services, a goal of stackable layers is to provide sophisticated services through multiple, ‘2 In SunOS the null layer was augmented to reproduce the semantics of the loopback layer exactly. This was not necessary in BSD UNIX. 13Although well-commented code might be a better comparison, the null layer was quite heavily commented for pedagogical reasons, whereas the loopback layer had only sparse comments. We chose to eliminate this variable. ACM Transactions on Computer Systems, Vol. 12, No. 1, February 1994. 82 J. S. Heidemann . Table III. Number and G. J. Popek of Lines Layer of Comment-Free or File System Code Needed in SunOS to Implement 4.0.3 and BSD a Pass-Through 4.4 BSD SunOS Loopback-fs 743 lines Null 632 lines layer 578 lines 111 lines Difference reusable 1046 lines layers. This goal 468 lines (15%) requires that minimal layers be (45%) as simple as possible. We are layer currently code null-derived 5.2.2 Layer in We layers, generality and pursuing further. Development of a new students graduate to centralizing this invited class at UCLA. programming experience ranged from students were environment. each provided All projects succeeded in provided a file-versioning second-class replication the consistency providing ment its time become layer, as a layer, layer, ranged from with more a null prototype encryption layer, their to stack groups layers. Prototypes layer. UFS Self-estimates figure layer, Other over a standard This of a user-level a compression environment, of a kernel and consistency enhancement. 
Review of the development of these layers suggested three primary contributions of stacking to this experiment. First, by relying on a lower layer to provide basic filing services, the problems peripheral to the new service being provided were largely solved, making detailed understanding of these issues unnecessary. Second, beginning with a null layer focused development on the new service rather than on framework issues. Finally, the out-of-kernel implementation environment provided a convenient, familiar development platform compared to traditional kernel development.

We consider this experience a promising indication of the ease of development offered by stackable layers. Previously, new file-system functionality required in-kernel modification of multi-thousand-line file systems, requiring knowledge of current file systems and low-level kernel debugging tools. With stackable layers, students in the class were able to investigate significant new filing capabilities with knowledge only of the stackable interface and programming methodology.

5.2.3 Large-Scale Example. The previous sections discussed our experiences in developing several prototype layers. This section concludes with the results of stackable development of a production file system, suitable for daily use in terms of both size and function.
Ficus is a "real" system: it has seen extensive development over its three-year existence, its developers' computing environment (including Ficus development) is completely supported in Ficus, and it is now in use at various sites in the United States. It is comparable in code size to other file systems (12,000 lines of comment-free code for Ficus, compared to 7,000-8,000 lines for NFS or UFS).

Stacking has been a part of Ficus from its very early development. Ficus has provided both a fertile source of layered development techniques and a proving ground for what works and what does not. Ficus makes good use of stackable concepts such as extensibility, cooperating layers, an extensible transport layer, and out-of-kernel development. Extensibility is widely used in Ficus to provide replication-specific operations. The concept of cooperating layers is fundamental to the Ficus architecture, where some services must be "close" to the user whereas others must be provided close to data storage. Between the Ficus layers, the optional transport layer has provided easy access to any replica, leveraging location transparency as well. Finally, the out-of-kernel debugging environment has proved particularly important in early development, saving significant development time.

As a full-scale example of the use of stackable layering and the UCLA interface, Ficus illustrates the success of these tools for file-system development. Layered file systems can be robust enough for daily use, and the layered development process is suitable for long-term projects.

5.3 Compatibility Experiences
Extensibility and layering are powerful tools to address compatibility problems. Section 3.5 discusses several approaches to employing these tools; here we consider how effective they have proved to be in practice. Our experiences concern primarily the use and evolution of the Ficus layers, the user-id mapping and null layers, support for different versions of NFS and the stack-enabled UFS, and "third-party"-style change.

The file-system layers developed at UCLA evolve independently of each other and of standard filing services. Operations are frequently added to the Ficus layers with minimal consequences for the other layers. We have encountered some cache consistency problems resulting from extensibility and our transport layer. We are currently implementing cache coherence protocols as discussed in Section 4.3.3 to address this issue. Without extensibility, each interface change would require changes to all other layers, greatly slowing progress.

We have had mixed experiences with portability between different operating systems. On the positive side, Ficus is currently accessible from PCs running MS-DOS. A PC running MS-DOS can communicate with a UNIX host running Ficus (see Figure 11); the PC runs an NFS client. A Ficus implementation requires more information to identify files than will fit in an NFS file identifier, so we employ an additional "shrinkfid" layer to map over this difference.

[Fig. 11. Access from a PC running MS-DOS to UNIX-based Ficus. The shrinkfid layer addresses minor interface differences: the UNIX host stacks NFS (server), shrinkfid, Ficus logical, and Ficus physical layers over UFS, while the PC runs an NFS client under MS-DOS.]

Actual portability of layers between the SunOS and BSD stacking implementations is more difficult. Each operating system has a radically different set of core vnode operations and related services.
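The identifier-mapping idea behind a shrinkfid-style layer can be sketched in a few lines. This is a hypothetical user-level illustration, not the Ficus code: the sizes, the names `shrink_fid` and `expand_fid`, and the fixed-size table are all invented. The layer hands small handles upward and recovers the full lower-layer identifier on the way back down.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of a shrinkfid-style mapping layer.  All names and
 * sizes are invented; a real layer would need a persistent, collision-safe
 * table.  The lower layer's identifiers (BIG_ID_LEN bytes) are too large
 * for the protocol above, so the layer issues small integer handles. */

#define BIG_ID_LEN 32
#define TABLE_SIZE 64

static unsigned char fid_table[TABLE_SIZE][BIG_ID_LEN];
static int fid_used = 0;

/* Map a large identifier to a small handle, reusing an existing entry
 * when the same identifier is seen again. */
static int shrink_fid(const unsigned char big[BIG_ID_LEN])
{
    for (int i = 0; i < fid_used; i++)
        if (memcmp(fid_table[i], big, BIG_ID_LEN) == 0)
            return i;
    if (fid_used == TABLE_SIZE)
        return -1;                    /* out of handles */
    memcpy(fid_table[fid_used], big, BIG_ID_LEN);
    return fid_used++;
}

/* Recover the full identifier from a handle on the way back down. */
static const unsigned char *expand_fid(int handle)
{
    if (handle < 0 || handle >= fid_used)
        return NULL;
    return fid_table[handle];
}
```

The sketch only shows the translation in both directions; a production layer must also keep handles stable across client reconnection and server restarts.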
For this reason and because of licensing restrictions, we chose to reimplement the null and user-id mapping layers for the BSD port. Although we expect that a compatibility layer could mask interface differences, long-term interoperability requires not only a consistent stacking framework, but also a common set of core operations and related operating-system services.

Finally, we have been successful in employing simple compatibility layers to map over minor interface differences. The shrinkfid and umap layers each correct deficiencies in interface or administrative configuration. We have also constructed a simple layer that passes additional state information through extensible NFS as new operations.

6. RELATED WORK

Stackable filing environments build upon several bodies of existing work. UNIX shell programming, Streams, and the x-kernel present examples of stackable development, primarily applied to network protocols and terminal processing. There is also a significant relationship between stacking and object-oriented design. Sun's vnode interface provides a basis for modular file systems. Finally, Rosenthal has presented a prototype stackable filing interface independently descended from these examples. We consider each of these in turn.

6.1 Other Stackable Systems

The key characteristics of a stackable file system are its symmetric interface and a flexible method of joining layers. UNIX shell programming provides an early example of combining independently developed modules with a syntactically identical interface [Pike and Kernighan 1984]. Ritchie [1984] applied these principles to one kernel subsystem with the Streams device I/O system. Ritchie's system constructed terminal and network protocols by composing stackable modules that may be added and removed during operation.
Ritchie’s concluded that Streams significantly reduce complexity and improve maintainability of this portion of the kernel. Since their development, Streams have been widely adopted. The x-kernel is an operating-system nucleus designed to simplify protocol implementation [Hutchinson allowing et al. arbitrary by implementing 1989]. Key protocol all features are composition; protocols a uniform run-time protocol choice network environment, Streams, and the and x-kernel that layers interface, of protocol allowing selection based on efficiency; and very inexpensive The x-kernel demonstrates the effectiveness of layering development in the suffer. Shell programming, network as stackable stacks, layer transition. in new protocol performance need are all important not examples of stackable environments. They differ from our work in stackable file systems primarily in the richness of their services and in the level of performance demands. The pipe mechanism provides only a simple byte-stream of data, leaving x-kernel effectively it to the application also place very annotating few to impose constraints message streams structure. Both or requirements with control Streams on their information. and the interface, A stackable file system, on the other hand, must provide the complete suite of expected filing operations under reasonably extreme performance requirements. Caching of persistent data is another major difference between Streams-like approaches and stackable file systems. File systems store persistent may be repeatedly accessed, making caching of frequently accessed possible and necessary. Because of the performance differences cached and noncached data, file caching is mandatory in production Network protocols operate need not be addressed. 6.2 Object-Orientation strictly with transient data, late binding, and so caching issues and Stacking Strong parallels exist between “object-oriented design ing. 
Object-oriented design is frequently characterized sulation, data that data both between systems. and inheritance. ACM Transactions Each on Computer of these Systems, techniques by strong has and stackdata encap- a counterpart in Vol. 12, No. 1, February 1994. 86 J. S, Heldemann . stacking. Strong cannot manipulate time stack routine; and G. J. Popek data encapsulation layers as black configuration. operations is required; without encapsulation one boxes. Late binding is analogous to run- Inheritance inherited parallels a layer in an object-oriented through a stack to the implementing layer. Stacking differs from object-oriented design object-orientation is often associated with providing system in two broad a particular tiate) new areas. First, lan- stackable filing For example, and linkers) to class of objects and to instantiate individual objects. In a environment, however, far more people will configure (instan- stacks than will to simplify this process. A second difference easily be bypassed programming guage. Such languages are typically general purpose, while can be instead tuned for much more specific requirements. languages usually employ similar mechanisms (compilers define a new stackable filing a bypass would be described design concerns in new layers. As a result, inheritance. object-oriented Simple terms. For special stackable example, the tools exist layers can compression stack of Figure 3 can be thought of as a compression subclass of normal files; similarly, a remote-access layer could be described as a subclass of “files.” But with staking it is not uncommon to employ multiple remote-access layers. It is less clear how to express this characteristic in traditional object-oriented terms. 6.3 Modular File Systems Sun’s interface vnode stackable standard [Kleinman file-systems vnode work. interface. 1986] Section We build modularity to provide stackable The standard vnode interface has served 2 compares upon filing. 
Sun's loopback and translucent file systems [Hendricks 1990] and early versions of the Ficus file system were all built with a standard vnode interface. These implementations highlight the primary differences between the standard vnode interface and our stackable environment; with support for extensibility and explicit support for stacking, the UCLA interface is significantly easier to employ (see Section 5.2.1).

6.4 Rosenthal's Stackable Interface

Rosenthal [1990] also recently explored stackable filing. Although conceptually similar to our work, the approaches differ with regard to stack configuration, stack view consistency, and extensibility.

Stack configuration in Rosenthal's model is accomplished by two new operations, push and pop. Stacks are configured on a file-by-file basis with these operations, unlike our subtree-granularity configuration. Per-file configuration allows additional flexibility, since arbitrary files can be independently configured. However, this flexibility complicates the task of maintaining configuration information; it is not clear how current tools can be applied to this problem. A second concern is that these new operations are specialized
This new vnode pushed However, layers were it is not typically remain approach clear have unchanged that semantic during used to write be employed also over the mount this facility content, data. mounts This needed. will expect a compression the corresponding the to implement is widely a client use. Consider the file, to read can be used as a point. Bec~tise stack layer. decompression suggests that a more stack contents Clearly, to if it service needs global dynamic change may not be necessary and, to the extent that it adds complexity overhead, may be undesirable. In addition, ensuring that all stack clients agree on stack construction a number layers of drawbacks. is often access. useful Such allowed. for special diverse Ensuring As discussed in Section tasks a common stack prohibited debugging, if only top also requires very multiprocessor implementation, at some performance interface does not enforce atomic stack configuration, overhead. The most is that for multiple could careful of a single-stack For view address all of these with reasons, is in a cost. Since the UCLA it does not share this model is one service 3.3). Rosenthal’s does not make layers, it impossible our stack at different difference between concerns extensibility. users transport space, making to access the stack A final interface for all stack tops. Furthermore, be in another pointer. view locking significant problem with Rosenthal’s method of dynamic stacking many stacks there is no well-defined notion of “top-of-stack.” stack clients has stack and remote one stack Stacks with fan-in have multiple stack tops. Encryption requiring fan-in with multiple stack “views” (see Section guarantee and 3.3, access to different such as backup, access is explicitly to sense with the correct stack top to keep a top-of-stack explicitly permits different layers. 14 Rosenthal’s Rosenthal vnode interface and the UCLA discussed the use of versioning layers to map between different interfaces. 
While versioning layers work well to map between pairs of layers with conflicting semantics, the number of mappings required grows exponentially with the number of changes, making this approach unsuitable for wide-scale, third-party change. A more general solution to extensibility is preferable, in our view.

[14] While Rosenthal's model can be extended to support nonlinear stacking [UNIX International Stackable Files Working Group 1992], the result is, in effect, two different "stacking" methods.

7. CONCLUSION

The focus of this work is to improve the file-system development process. Stacking addresses this in several ways. Stacking provides a framework allowing reuse of existing filing services: higher-level services can be built by leveraging the body of existing facilities, and improved low-level services can immediately have a wide-reaching impact by replacing existing lower layers of a sophisticated stack. Formal mechanisms for extensibility provide a consistent approach to export new services. When these facilities are provided in an address-space-independent manner, this framework enables a number of new development approaches.

Widespread adoption of a framework such as that described in this paper will permit independent development of filing services by many parties, while individual developers can benefit from the ability to leverage others' work while still moving forward quickly and independently. By opening this field, previously largely restricted to major operating-systems vendors, it is hoped that the industry as a whole can progress more rapidly.

ACKNOWLEDGMENTS

The authors thank Tom Page and Richard Guy for their many discussions regarding stackable file systems. We would also particularly like to thank Kirk McKusick for working with us to bring stacking to BSD 4.4.
We would like to acknowledge Dieter Rothmeier, Wai Mak, David Ratner, Jeff Weidner, Peter Reiher, Steven Stovall, and Greg Skinner for their contributions to the Ficus file system, and Yu Guang Wu and Jan-Simon Pendry for their contributions to the null layer. Finally, we wish to thank the reviewers for their many helpful suggestions.

REFERENCES

GUY, R. G., HEIDEMANN, J. S., MAK, W., PAGE, T. W., JR., POPEK, G. J., AND ROTHMEIER, D. 1990. Implementation of the Ficus replicated file system. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 63-71.
HENDRICKS, D. 1990. A filesystem for software development. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 333-340.
HOWARD, J., KAZAR, M., MENEES, S., NICHOLS, D., SATYANARAYANAN, M., SIDEBOTHAM, R., AND WEST, M. 1988. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 1 (Feb.), 51-81.
HUTCHINSON, N. C., PETERSON, L. L., ABBOTT, M., AND O'MALLEY, S. 1989. RPC in the x-kernel: Evaluating new design techniques. In Proceedings of the 12th Symposium on Operating Systems Principles (Dec.). ACM, New York, 91-101.
KARELS, M. J., AND MCKUSICK, M. K. 1986. Toward a compatible filesystem interface. In Proceedings of the European UNIX User's Group Conference (Sept.). EUUG, Buntington, England, 15.
KLEIMAN, S. R. 1986. Vnodes: An architecture for multiple file system types in Sun UNIX. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 238-247.
OUSTERHOUT, J. K. 1990. Why aren't operating systems getting faster as fast as hardware? In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 247-256.
PIKE, R., AND KERNIGHAN, B. 1984. Program design in the UNIX environment. AT&T Bell Lab. Tech. J. 63, 8 (Oct.), 1595-1605.
RITCHIE, D. M. 1984. A stream input-output system. AT&T Bell Lab. Tech. J. 63, 8 (Oct.), 1897-1910.
RODRIGUEZ, R., KOEHLER, M., AND HYDE, R. 1986. The generic file system. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 260-269.
ROSENTHAL, D. S. H. 1990. Evolving the vnode interface. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 107-118.
SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., AND LYON, B. 1985. Design and implementation of the Sun Network File System. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 119-130.
STEERE, D. C., KISTLER, J. J., AND SATYANARAYANAN, M. 1990. Efficient user-level file cache management with the Sun vnode interface. In USENIX Conference Proceedings (June). USENIX, Berkeley, Calif., 325-332.
UNIX INTERNATIONAL STACKABLE FILES WORKING GROUP. 1992. Requirements for stackable files. Internal memo. UNIX International, Parsippany, N.J.
WEBBER, N. 1993. Operating system support for portable filesystem extensions. In USENIX Conference Proceedings (Jan.). USENIX, Berkeley, Calif., 219-228.

Received August 1992; revised April 1993; accepted June 1993