Serverless Network File Systems THOMAS DAVID E, ANDERSON, MICHAEL A, PATTERSON, University of California DREW D, DAHLIN, S. ROSELLI, and JEANNA M. NEEFE, RANDOLPH Y. WANG at Berkeley \V{ pr{~p{)M, :{ new paradigm for network fil(, systt,m rfestgn w,r(v,rfe.w n(,fldrjrlc /{/,, sv.sl(,nl.s \Vhil<, trari]t]onal network tile syst[, ms rely (In ii central server machln(., :( w.rv(,rl{+ ~yst{.m ut)li?.r, s w,ork statinns ctmper;itirrg :is peers t{) provide all tllc system st.r~lccs Any n]ach[nc, 111th[. ~y<t{,m c,In ~tt)r(~, cache, or r{)ntrol tiny block t]f’ datii. our appr(mch UW,S this I{)catl[)n ]nd[,pen [l[,nc~.. 1[] c{)n]hlnatlon \vlth Iiist local area nrtv,,(~rks. tn provide tx,ttcr p(,rfi)rm:]nc(, and wxilability tb:ln trviditit}n:{l filr systems 1}11,r[,>p{~msihilit ics of a fhilrd \ I;l r(xl(lnd:int data storage’. Furthrrmorc. hrc:Iuse any machin( in the, system can a>,um( ctlmponcnt. our srrvrrlcss clrsi~n also prc,\id{w high availability TII demonstrate our apprnach, w(, huv(, Implrrr)(, nt(d a prototype” +{,r.\(,rl(~s> t)(.t~v{)rk fllc svst[, m C;IIled xFS. I’rc,liminary p{,rformance mc:]surem{, nts suggest thiit ,Iiir, :irchttt,c(~]rt~ achievt>k Its L,oul of scalability. [<”orinstance, rn a ‘12-n(xJcJ xFS systt,m with :j2 ,1(! I\I, cl!c,nts. (,;ich clrcn( r(,ct,ivt+ n(,arly ;[s nruch rciid or write thr[)ugh put :IS [t w{iulrl ..f.{, if It Jvt,r(, th{, I)nl\ iicti v{, client. ;iddrt],]n;]l Kty W,)rds and I’hrast,..: Log-based s! rip]ng. RA1 l). rc[iundani duta <t oragc,, scalable pcrf(,rmancc log clcuning, logKi ng. Inx >t rurt urcd. ‘1’brs M[Irk IS supported rn par-t I)y thr Advanced Research Projects Agency (NO060(J-93-(’.2481, F:;[)(i02-95-(’ -(Jolt).” t bt, Natlona[ Science Foundation (CDA 040”1 156), (’allfhrnla LTI(lR(), the ,Y~& 1 1,’~~undati(ln, I)igltal Equipment (k)rporation, Exabytt,, }+c\vl{,tt-Pzlck ~ird, 11)}1, Siemen~ (’,,rporti(l{)m Sun Microsyst{,rns, tind Xerux Corporation, Andcrsnn \v:~s also supp{~rtt,d hy a N:ltr<~r~;il Sci[, nce Foundkitlon Pr(sidcrrtial Faculty Fellowship, Nerfc hy ii Natir]rral Sc]enct F,,und:iti(,n (;raduatc Rcscarc h l’cllowship, and Rosrlli by a [)epar-tmcnt [If f+klucatmn (;AANN ftllfl\\+hrp ,JI’ (’;~iif{)rn]t] at Berkeley, :J&7 SOdU Sclc,ncc l)ivisi~)n, [Jn]ver+rty ~luth[]r>’ addr(,.ss: Computtr 1{:111, }](,rk{,l[,y, ~.~ :1.17z().177(; , (mail: {tea: dnhlin; “neefc; pattrrson: drew: r.wvanf+d(~~c+ It(rkt,l(,> .cdu. :]dditi(,n,~l l[lfi)rrn:iti(~n ab{)ut xFS c~in hr found at http:,, n[~w (’S %rkclcy. cdu Xl’s xfk,htrnl. IJt,rl]lisslon to make diglta’hard cupy of part or all of this wnrk for personal nr classroom usc is ~r:il~tt,d without fce pr{]vidcd that copies are not made or dls!rihutcd for profit or c{]mmcrciaf :Idviint:lg(,, th~, cupyr-lght notice, the title of tbe publicati(~n, and its dat[, :ipprur, and n{lticc. }~ ~iv~,rr thtlt c(jpyin~r is I]y pcrmiss inn of A(’M, Inc. TO copy {Itherwi se, to republish. to post [In .si,r\, (,r,s. or t[] r[>di,stribl]t[, to li.st,s, rrquirc,s prior .sprcif]c pcrmi.s.s ion :]nd tor ;I fc[, c 1996 A(’M 07:)4-2071 96 /0200-0041 $03.50 42 . Thomas E. Anderson et al. 1. INTRODUCTION A serverless network file system distributes storage, cache, and control over cooperating workstations. This approach contrasts with traditional file systems such as Netware [Major et al. 1994], NFS [Sandberg et al. 19851, Andrew [Howard et al. 1988], and Sprite [Nelson et al. 1988] where a central server machine stores all data and satisfies all client cache misses. Such a central server is both a performance and reliability bottleneck. A serverless system, on the other hand, distributes control processing and data storage to achieve scalable high performance, migrates the responsibilities of failed components to the remaining machines to provide high availability, and scales gracefully to simplify system management. Three factors motivate our work on serverless network file systems: the opportunity provided by fast switched LANs, the expanding demands of users, and the fundamental limitations of central server systems. The recent introduction of switched local area networks such as ATM or Myrinet [Boden et al. 1995] enables serverlessness by providing aggregate bandwidth that scales with the number of machines on the network. In contrast, shared-media networks such as Ethernet or FDDI allow only one client or server to transmit at a time. In addition, the move toward low-latency network interfaces [Basu et al. 1995; von Eicken 1992] enables closer cooperation between machines than has been possible in the past. The result is that a LAN can be used as an 1/0 backplane, harnessing physically distributed processors, memory, and disks into a single system. Next-generation networks not only enable serverlessness; they require it by allowing applications to place increasing demands on the file system. The 1/0 demands of traditional applications have been increasing over time [Baker et al. 1991]; new applications enabled by fast networks—such as multimedia, process migration, and parallel processing—will further pressure file systems to provide increased performance. For instance, continuousmedia workloads will increase file system demands; even a few workstations simultaneously running video applications would swamp a traditional central server [Rashid 1994]. Coordinated Networks of Workstations (NOWS) allow users to migrate jobs among many machines and permit networked workstations to run parallel jobs [Anderson et al. 1995; Douglas and Ousterhout 199 1; Litzkow and Solomon 1992]. By increasing the peak processing power available to users, NOWS increase peak demands on the file system [Cypher et al. 1993]. Unfortunately, current centralized file system designs fundamentally limit performance and availability since all read misses and all disk writes go through the central server. To address such performance limitations, users resort to costly schemes to try to scale these fundamentally unscalable file systems. Some installations rely on specialized server machines configured with multiple processors, 1/0 channels, and 1/0 processors. Alas, such machines cost significantly more than desktop workstations for a given amount of computing or 1/0 capacity. Many installations also attempt to achieve scalability by distributing a file system among multiple servers by ACM Transactionson ComputerSystems, Vol. 14, No. 1, February 1996. Serverless Network File Systems partitioning the directm-y tree across multiple mount points, This . 43 approach scalability because its coarse distribution often results in hot spots when the partitioning allocates heavily used files and directory trees to a single server [Wolf 1989]. It is also expensive, since it requires the (human) system manager to effectively become part of the file system moving users, volumes, and disks among servers to balance load. Finally, Andrew [Howard et al. 1988] attempts to improve scalability by caching data on client disks. Although this made sense on an Ethernet, on today’s fast LANs fetching data from local disk can be an order of magnitude slower than from server memory or remote striped disk. Similarly, a central server represents a single point of failure, requiring server replication [Birrell et al. 1993; Kazar 1989; Kistler and Satyanarayanan 1992; Liskov et al, 1991; Popek et al. 1990; Walker et al, 1983] for high availability. Replication increases the cost and complexity of central servers and can increase latency on writes since the system must replicate data al multiple servers. In contrast t.o central server designs, our objective is to build a truly distributed network file system—one with no central bottleneck. We have designed and implemented xFS, a prototype serverless network file system, to invc’stigate this goal. xFS illustrates serverless design principles in three ways. First, xFS dynamically distributes control processing across the system on a per-file granularity by utilizing a new serverless management scheme. Second. xFS distributes its data storage across storage server disks by implemtlnting a software RAID [Chen et al. 1994; Patterson et al. 1988] using log-based network striping similar to Zebra’s [Hartman and Ousterhout 1995], Finally, xFS eliminates central server caching by taking advantage of cooperative caching [Dahlin et al. 1994b; Leff et al. 1991] to harvest portions of client memory as a large, global file cache. This article makes two sets of contributions. First, xFS synthesizes a number of recent innovations that, taken together, provide a basis for serverIess file system design. xFS relies on previous work in areas such as scalable cache consistency (DASH [Lenoski et al. 1990] and Alewife [Chaiken et al. 1991 i ), cooperative caching, disk striping (RAID and Zebra), and log-structured file systems (Sprite LFS [Rosenblum and Ousterhout 1992] and BSD I,FS [Seltzer et al. 1993]). Second, in addition to borrowing techniques developed in other projects, we have refined them to work well in our serverless system. For instance, we have transformed DASH’s scalable cache consistency approach into a more general, distributed control system that is also fault tolerant. We have also improved upon Zebra to eliminate bottlenecks in its design hy using distributed management, parallel cleaning, and subsets of storage servers called stripe groups. Finally, we have actually implemented cooperative caching, building on prior simulation results. The primary limitation of our serverless approach is that it is only appropriate in a restricted environment-–among machines that communicate over a fast network and that trust one another’s kernels to enforce security. However, we expect such environments to be common in the future. For instance, NOW systems already provide high-speed networking and tmst to only moderately improves A(’M Transactmns on (k,mputer Systems. Vo[ 14 No 1, February 1996 44. Thomas E. Anderson et al run parallel and distributed jobs. Similarly, xFS could be used within a group or department where fast LANs connect machines and where uniform system administration and physical building security allow machines to trust one another. A file system based on serverless principles would also be appropriate for “scalable server” architectures currently being researched [Kubiatowicz and Agarwal 1993; Kuskin et al. 1994]. Untrusted clients can also benefit from the scalable, reliable, and cost-effective file service provided by a core of xFS machines via a more restrictive protocol such as NFS. We have built a prototype that demonstrates most of xFS’ key features, including distributed management, cooperative caching, and network disk striping with parity and multiple groups. As Section 7 details, however, several pieces of implementation remain to be done; most notably, we must still implement the cleaner and much of the recovery and dynamic reconfiguration code. The results in this article should thus be viewed as evidence that the serverless approach is promising, not as “proof’ that it will succeed. We present both simulation results of the xFS design and a few preliminary measurements of the prototype. Because the prototype is largely untuned, a single xFS client’s performance is slightly worse than that of a single NFS client; we are currently working to improve single-client performance to allow one xFS client to significantly outperform one NFS client by reading from or writing to the network-striped disks at its full network bandwidth. Nonetheless, the prototype does demonstrate remarkable scalability. For instance, in a 32-node xFS system with 32 clients, each client receives nearly as much read or write bandwidth as it would see if it were the only active client. The rest of this article discusses these issues in more detail. Section 2 provides an overview of recent research results exploited in the xFS design. Section 3 explains how xFS distributes its data, metadata, and control. Section 4 describes xFS distributed log cleaner. Section 5 outlines xFS’ approach to high availability, and Section 6 addresses the issue of security and describes how xFS could be used in a mixed security environment. We describe our prototype in Section 7, including initial performance measurements. Section 8 describes related work, and Section 9 summarizes our conclusions. 2. BACKGROUND xFS builds upon several recent and ongoing research efforts to achieve our goal of distributing all aspects of file service across the network. xFS’ network disk storage exploits the high performance and availability of Redundant Arrays of Inexpensive Disks (RAIDs). We organize this storage in a log structure as in the Sprite and BSD Log-structured File Systems (LFS), largely because Zebra demonstrated how to exploit the synergy between RAID and LFS to provide high-performance, reliable writes to disks that are distributed across a network. To distribute control across the network, xFS draws inspiration from several multiprocessor cache consistency designs. Finally, since xFS has evolved from our initial proposal [Wang and Anderson ACM Transactions on Computer Systems. Vol. 14, No. 1, February 1996. Serverless Network File Systems 1993], we describe the relationship versions of the xFS design. 2.1 of the design presented 45 here to previous RAID xFS exploits RAID-style disk striping to provide high performance and highly available disk storage [Chen et al. 1994; Patterson et al. 1988]. A RAID partitions a stripe of data into N – 1 data blocks and a parity block—the exclusive-OR of the corresponding bits of the data blocks. It stores each data and parity block on a different disk. The parallelism of a RAID’s multiple disks provides high bandwidth, while its parity provides fault tolerance-it can reconstruct the contents of a failed disk by taking the exclusive-OR of the remaining data blocks and the parity blocks. xFS uses single-parity disk striping to achieve the same benefits; in the future we plan to cope with multiple workstation or disk failures using multiple-parity blocks [Blaum et al. 1994], RAIDs suffer from two limitations. First, the overhead of parity management can hurt performance for small writes; if the system does not simultaneously overwrite all N – 1 blocks of a stripe, it must first read the old parity and some of the old data from the disks to compute the new parity. Unfortunately, small writes are common in many environments [Baker et aI. 1991], and larger caches increase the percentage of writes in disk workload mixes over time. We expect cooperative caching—using workstation memory as a global cache—to further this workload trend. A second drawback of commercially available hardware RAID systems is that they are significantly more expensive than non-RAID commodity disks because the commercial RAIDs add special-purpose hardware to compute parity. 2.2 LFS xFS implements log-structured storage based on the Sprite and BSD LFS prototypes [Rosenblum and Ousterhout 1992; Seltzer et al. 1993] because this approach provides high-performance writes, simple recovery, and a flexible method to locate file data stored on disk. LFS addresses the RAID small-write problem by buffering writes in memory and then committing them to disk in large, contiguous, fixed-sized groups called log segments; it threads these segments on disk to create a logical append-only log of file system modifications. When used with a RAID, each segment of the log spans a RAID stripe and is committed as a unit to avoid the need to recompute parity. LFS also simplifies failure recovery because all recent modifications are located near the end of the log. Although log-based storage simplifies writes, it potentially complicates reads because any block could be located anywhere in the log, depending on when it was written. LFS’ solution to this problem provides a general mechanism to handle location-independent data storage. LFS uses per-file inodes, similar to those of the Fast File System (FFS) [McKusick et al. 1984], to store pointers to the system’s data blocks. However, where FFS’ inodes reside in fixed locations, LFS’ inodes move to the end of the log each time ACM Transactions on Computer Systems, Vol 14, No. 1, February 1996 46 . Thomas E. Anderson et al, they are modified. When LFS writes a file’s data block, moving it to the end of the log, it updates the file’s inode to point to the new location of the data block; it then writes the modified inode to the end of the log as well. LFS locates the mobile inodes by adding a level of indirection, called an irnap. The imap contains the current log pointers to the system’s inodes; LFS stores the imap in memory and periodically checkpoints it to disk. These checkpoints form a basis for LFS’ efllcient recovery procedure. After a crash, LFS reads the last checkpoint in the log and then rolls forward, reading the later segments in the log to find the new location of inodes that were written since the last checkpoint. When recovery completes, the imap contains pointers to all of the system’s inodes, and the inodes contain pointers to all of the data blocks. Another important aspect of LFS is its log cleaner that creates free disk space for new log segments using a form of generational garbage collection. When the system overwrites a block, it adds the new version of the block to the newest log segment, creating a “hole” in the segment where the data used to reside. The cleaner coalesces old, partially empty segments into a smaller number of full segments to create contiguous space in which to store new segments. The overhead associated with log cleaning is the primary drawback of LFS. Although Rosenblum’s original measurements found relatively low cleaner overheads, even a small overhead can make the cleaner a bottleneck in a distributed environment. Furthermore, some workloads, such as transaction processing, incur larger cleaning overheads [Seltzer et al. 1993; 1995]. 2.3 Zebra Zebra [Hartman and Ousterhout 1995] provides a way to combine LFS and RAID so that both work well in a distributed environment. Zebra uses a software RAID on commodity hardware (workstation, disks, and networks) to address RAID’s cost disadvantage, and LFS batched writes provide efficient access to a network RAID. Furthermore, the reliability of both LFS and RAID makes it feasible to distribute data storage across a network. LFS’ solution to the small-write problem is particularly important for Zebra’s network striping since reading old data to recalculate RAID parity would be a network operation for Zebra. As Figure 1 illustrates, each Zebra client coalesces its writes into a private per-client log. It commits the log to the disks using fixed-sized log segments, each made up of several log fragments that it sends to different storage server disks over the LAN. Log-based striping allows clients to efficiently calculate parity fragments entirely as a local operation and then store them on an additional storage server to provide high data availability. Zebra’s log-structured architecture significantly simplifies its failure recovery. Like LFS, Zebra provides efllcient recovery using checkpoint and roll forward. To roll the log forward, Zebra relies on deltas stored in the log. Each delta describes a modification to a file system block, including the ID of the modified block and pointers to the old and new versions of the block, to allow the system to replay the modification during recovery. Deltas greatly simplifi ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems . 47 Client Memories - One Client’sWriteLo= OneClient’sWriteh~ Log Segment Log Segment I I 1 I ... AB C... HEEI?%A H+imiii!%~2~~ k Network >$b mmm MUM B BBBB Storage Server Disks Fig. 1, Log-based striping used by Zebra and xFS. Each client writes its new file data into a private append-only log and stripes this log across the storage servers. Clients compute parity for segments, not for individual files. recovery by providing an atomic commit for actions that modify state located on multiple machines: each delta encapsulates a set of changes to file system state that must occur as a unit. Although Zebra points the way toward serverlessness, several factors limit Zebra’s scalability. First, a single file manager tracks where clients store data blocks in the log; the manager also handles cache consistency operations. Second, Zebra, like LFS, relies on a single cleaner to create empty segments. Finally, Zebra stripes each segment to all of the system’s storage servers, limiting the maximum number of storage servers that Zebra can use efficiently. 2.4 Multiprocessor Cache Consistency Network file systems resemble multiprocessors in that both provide a uniform view of storage across the system, requiring both to track where blocks are cached. This information allows them to maintain cache consistency by invalidating stale cached copies. Multiprocessors such as DASH [Lenoski et al. 1990] and Alewife [Chaiken et al. 1991] scalably distribute this task by dividing the system’s physical memory evenly among processors; each processor manages the cache consistency state for its own physical memory locations.l 1In the context of scalable multiprocessor consistency, this state is referred to as a directory. We avoid this terminology to prevent confhsion with file system directories that provide a hierarchical organization of file names. ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. 48 . Thomas E. Anderson et al. Unfortunately, the fixed mapping from physical memory addresses to consistency managers makes this approach unsuitable for file systems. Our goal is graceful recovery and load rebalancing whenever the number of machines in xFS changes; such reconfiguration occurs when a machine crashes or when a new machine joins xFS. Furthermore, as we show in Section 3.2.4, by directly controlling which machines manage which data, we can improve locality and reduce network communication. 2.5 Previous xFS Work The design of xFS has evolved considerably since our original proposal [Dahlin et al. 1994a; Wang and Anderson 1993]. The original design stored all system data in client disk caches and managed cache consistency using a hierarchy of metadata servers rooted at a central server. Our new implementation eliminates client disk caching in favor of network striping to take advantage of high-speed, switched LANs. We still believe that the aggressive caching of the earlier design would work well under different technology assumptions; in particular, its efficient use of the network makes it well suited for both wireless and wide-area network use. Moreover, our new design eliminates the central management server in favor of a distributed metadata manager to provide better scalability, locality, and availability. We have also previously examined cooperative caching—using client memory as a global file cache—via simulation [Dahlin et al. 1994b] and therefore focus only on the issues raised by integrating cooperative caching with the rest of the serverless system. 3. SERVERLESS FILE SERVICE The RAID, LFS, Zebra, and multiprocessor cache consistency work discussed in the previous section leaves three basic problems unsolved. First, we need scalable, distributed metadata and cache consistency management, along with enough flexibility to reconfigure responsibilities dynamically after failures. Second, the system must provide a scalable way to subset storage servers into groups to provide efficient storage. Finally, a log-based system must provide scalable log cleaning. This section describes the xFS design as it relates to the first two problems. Section 3.1 provides an overview of how xFS distributes its key data structures. Section 3.2 then provides examples of how the system as a whole functions for several important operations. This entire section disregards several important details necessary to make the design practical; in particular, we defer discussion of log cleaning, recovery from failures, and security until Sections 4 through 6. 3.1 Metadata and Data Distribution The xFS design philosophy can be summed up with the phrase, “anything, anywhere.” All data, metadata, and control can be located anywhere in the system and can be dynamically migrated during operation. We exploit this location independence to improve performance by taking advantage of all of ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996 Serverless Network File Systems . 49 the system’s resources—CPUs, DRAM, and disks—to distribute load and increase locality. Furthermore, we use location independence to provide high availability by allowing any machine to take over the responsibilities of a failed component after recovering its state from the redundant log-structured storage system. In a typical centralized system, the central server has four main tasks: (1) The server stores all of the system’s data blocks on its local disks. (2) The server manages disk location metadata the system has stored each data block. that indicates (3) The server maintains a central cache of data blocks satisfy some client misses without accessing its disks. where on disk in its memory to (4) The server manages cache consistency metadata that lists which clients in the system are caching each block. It uses this metadata to invalidate stale data in client caches.2 The xFS system performs the same tasks, but it builds on the ideas discussed in Section 2 to distribute this work over all of the machines in the system. To provide scalable control of disk metadata and cache consistency state, xFS splits management among metadata managers similar to multiprocessor consistency managers. Unlike multiprocessor managers, xFS managers can alter the mapping from files to managers. Similarly, to provide scalable disk storage, xFS uses log-based network striping inspired by Zebra, but it dynamically clusters disks into stripe groups to allow the system to scale to large numbers of storage servers. Finally, xFS replaces the server cache with cooperative caching that forwards data among client caches under the control of the managers. In xFS, four types of entities—the clients, storage servers, and managers already mentioned, and the cleaners discussed in Section 4—cooperate to provide file service as Figure 2 illustrates. The key challenge for xFS is locating data and metadata in this dynamically changing, completely distributed system. The rest of this subsection examines four key maps used for this purpose: the manager map, the imap, file directories, and the stripe group map. The manager map allows clients to determine which manager to contact for a file, and the imap allows each manager to locate where its files are stored in the on-disk log. File directories serve the same purpose in xFS as in a standard UNIX file system, providing a mapping from a human-readable name to a metadata locator called an index number. Finally, the stripe group map provides mappings from segment identifiers embedded in disk log addresses to the set of physical machines storing the segments. The rest of this subsection discusses these four data structures before giving an example of their use in file reads and writes. For reference, Table I provides a summary of these and other key xFS z Note that the NFS server does not keep caches consistent. Instead NFS relies on clients tQ verify that a block is current before using it. We rejected that approach because it sometimes allows clients to observe stale data when a client tries to read what another client recently wrote. ACM Transactions on Computer Systems, Vol. 14, No 1, February 1996 50. Thomas E. Anderson et al. n \ 1 1 Network Fig. 2. Two simple xFS installations. In the first, each machine acts as a client, storage server, cleaner, and manager, while in the second each node only performs some of those roles. The freedom to configure the system is not complete; managers and cleaners access storage using the client interface, so all machines acting as managers or cleaners must also be clients. data structures. work together. Figure 3 in Section 3.2.1. illustrates how these components 3.1.1 The Manager Map. xFS distributes management responsibilities according to a globally replicated manager map. A client uses this mapping to locate a file’s manager from the file’s index number by extracting some of the index number’s bits and using them as an index into the manager map. The map itself is simply a table that indicates which physical machines manage which groups of index numbers at any given time. This indirection allows xFS to adapt when managers enter or leave the system; the map can also act as a coarse-grained load-balancing mechanism to split the work of overloaded managers. Where distributed multiprocessor cache consistency relies on a fixed mapping from physical addresses to managers, xFS can change the mapping from index number to manager by changing the manager map. To support reconfiguration, the manager map should have at least an order of magnitude more entries than there are managers. This rule of thumb allows the system to balance load by assigning roughly equal portions of the map to each manager. When a new machine joins the system, xFS can modi~ the manager map to assign some of the index number space to the new manager by having the original managers send the corresponding parts of their manager state to the new manager. Section 5 describes how the system reconfigures manager maps. Note that the prototype has not yet implemented this dynamic reconfiguration of manager maps. xFS globally replicates the manager map to all of the managers and all of the clients in the system. This replication allows managers to know their responsibilities, and it allows clients to contact the correct manager directly —with the same number of network hops as a system with a centralized manager. We feel it is reasonable to distribute the manager map globally because it is relatively small (even with hundreds of machines, the map would be only tens of kilobytes in size) and because it changes only to correct a load imbalance or when a machine enters or leaves the system. ACM Transactionson ComputerSystems,Vol. 14, No. 1, February 1996 Table I. Summary of Key xFS Data Structures Purpose Maps file’s index number+ manager. Lacation Globally repliuted. Imap Maps file’s index number+ disk log addressof file’s index node. Split among managers. 3.1.2 Index Node Maps file offset - disk log address of data block. In ondisk log at storage servers. 3.1.2 Index Number Key used to locate metadatafor a file. File directory. 3.1.3 File Directory Maps file’s name - file’s index number. In ondisk log at storage servers. 3.1.3 Disk hg Key used to locate blocks on storage server disks. Includesa stripe group identifier, segment ID, and offset within segment. Index nodes and the imap. 3.1.4 Stripe Group Map Maps disk log address+ list of storage servers. Globally replicated. 3.1.4 Cache Consistency State Lists clients caching or holding the write token of each block. Split among managers. 3.2.1 3.2.3 Segment Utilization State Utilization, modification time of segments. Split among clients. 4 S-Files Ondisk cleaner statefor cleaner communication and recovery. In ondisk log at storage servers. 4 I-File Gn disk copy of imap used for recovery. In ondisk log at storage servers. 5 Deltas Log modifications for recovery roll-forward. In ondisk log at storage servers. 5 Manager Checkpoints Record manager statefor recovery. In ondisk log at storage servers. 5 DataStructure Manager Map Address Section 3.1,1 This table summarizesthe purposeof the key xFS datastructures. The location column indicateswhere thesestructuresare located in xFS, and the Section column indicateswhere in thisarticle the structureis described. 52 Thomas E. Anderson ● The manager of a file et al. controls two sets of information about it: cache consistency state and disk location metadata. Together, these structures allow the manager to locate all copies of the file’s blocks. The manager can thus forward client read requests to where the block is stored, and it can invalidate stale data when clients write a block. For each block, the cache consistency state lists the clients caching the block or the client that has write ownership of it. The next subsection describes the disk metadata. 3.1.2 The Imap. also where in the Managers on-disk track not only where file blocks log they are stored. xFS uses are cached the LFS but imap to encapsulate disk location metadata; each file’s index number has an entry in the imap that points to that file’s disk metadata in the log. To make LFS’ imap scale, xFS distributes the imap among managers according to the manager map so that managers handle the imap entries and cache consistency state of the same tiles. The disk stirage for each file can be thought of as a tree whose root is the imap entry for the file’s index number and whose leaves are the data blocks. A file’s imap entry contains the log address of the file’s index node. xFS index nodes, like those of LFS and FFS, contain the disk addresses of the file’s data blocks; for large files the index node can also contain blocks that contain more data block addresses, contain addresses of indirect blocks, and so on. log addresses double indirect of indirect blocks that 3.1.3 File Directories and Index Numbers. xFS uses the data structures described above to locate a file’s manager given the file’s index number. To determine the file’s index number, xFS, like FFS and LFS, uses file directories that contain mappings directories in regular reading a directory. files, from file allowing names to index numbers. xFS a client to learn an index number stores by In xFS, the index number listed in a directory determines a file’s manager. When a file is created, we choose its index number so that the file’s manager is on the same machine as the client that creates the file. Section 3.2.4 describes simulation results of the effectiveness of this policy in reducing network communication. 3.1.4 The Stripe Group Map. Like Zebra, xFS bases its storage subsystem on simple storage servers to which clients write log fkagments. To improve performance and availability when using large numbers of storage servers, rather than stripe each segment over all storage servers in the system, xFS implements stripe groups as have been proposed for large RAIDs [Chen et al. 1994]. Each stripe group includes a separate subset of the system’s storage servers, and clients write each segment across a stripe group rather than across all of the system’s storage servers. xFS uses a globally replicated stripe group map to direct reads and writes to the appropriate storage servers as the system configuration changes. Like the manager map, xFS globally replicates the stripe group map because it is small and seldom changes. The current version of the prototype implements reads and writes from multiple stripe groups, but it does not dynamically modi& the group map. ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996 Serverless Network File Systems . 53 Stripe groups are essential to support large numbers of’ storage servers for at least four reasons. First, without stripe groups, clients would stripe each of their segments over all of the disks in the system. This organization would require clients to send small, inefficient fragments to each of the many storage servers or to buffer enormous amounts of data per segment so that they could write large fragments to each storage server. Second, stripe groups match the aggregate bandwidth of the groups’ disks to the network bandwidth of a client, using both resources efficiently; while one client writes at its full network bandwidth to one stripe group, another client can do the same with a different group. Third, by limiting segment size, stripe groups make cleaning more efficient, This eftlciency arises because when cleaners extract segments’ live data, they can skip completely empty segments, but they must read partially full segments in their entirety; large segments linger in the partially full state longer than small segments, significantly increasing cleaning costs. Finally, stripe groups greatly improve availability. Because each group stores its own parity, the system can survive multiple server failures if they happen to strike different groups; in a large system with random failures this is the most likely case. The cost for this improved availability is a marginal reduction in disk storage and effective bandwidth because the system dedicates one parity server per group rather than one for the entire system. The stripe group map provides several pieces of information about each group: the group’s ID, the members of the group, and whether the group is current or obsolete; we describe the distinction between current and obsolete groups below. When a client writes a segment to a group, it includes the stripe group’s ID in the segment’s identifier and uses the map’s list of storage servers to send the data to the correct machines. Later, when it or another client wants to read that segment, it uses the identifier and the stripe group map to locate the storage servers to contact for the data or parity. xFS distinguishes between current and obsolete groups to support reconfiguration. When a storage server enters or leaves the system, xFS changes the map so that each active storage server belongs to exactly one current stripe group. If this reconfiguration changes the membership of a particular group, xFS does not delete the group’s old map entry. Instead, it marks that entry as “obsolete.” Clients write only to current stripe groups, but they may read from either current or obsolete stripe groups. By leaving the obsolete entries in the map, xFS allows clients to read data previously written to the groups without first transferring the data from obsolete groups to current groups. Over time, the cleaner will move data from obsolete groups to current groups [Hartman and Ousterhout 1995]. When the cleaner removes the last block of live data from an obsolete group, xFS deletes its entry from the stripe group map. 3.2 System Operation This section describes how xFS uses the various maps we described in the previous section. We first describe how reads, writes, and cache consistency ACM Transactions cm Computer Systems, Vol. 14, No. 1, Februsry 1996 54. Thomas E. Anderson et al. work and then present simulation results examining the assignment of files to managers. the issue of locality in 3.2.1 Reads and Caching. Figure 3 illustrates how xFS reads a block given a file name and an offset within that file. Although the figure is complex, the complexity in the architecture is designed to provide good performance with fast LANs. On today’s fast LANs, fetching a block out of local memory is much faster than fetching it from remote memory, which, in turn, is much faster than fetching it from disk. To open a file, the client first reads the file’s parent directory (labeled 1 in the diagram) to determine its index number. Note that the parent directory is, itself, a data file that must be read using the procedure described here. As with FFS, xFS breaks this recursion at the rooti, the kernel learns the index number of the root when it mounts the file system. As the top left path in the figure indicates, the client first checks its local UNIX block cache for the block (2a); if the block is present, the request is done. Otherwise it follows the lower path to fetch the data over the network. xFS first uses the manager map to locate the correct manager for the index number (2b) and then sends the request to the manager. If the manager is not colocated with the client, this message requires a network hop. The manager then tries to satisfy the request by fetching the data fi-om some other client’s cache. The manager checks its cache consistency state (3a), and, if possible, forwards the request to a client caching the data. That client reads the block from its UNIX block cache and forwards the data directly to the client that originated the request. The manager also adds the new client to its list of clients caching the block. If no other client can supply the data from memory, the manager routes the read request to disk by first examining the imap to locate the block’s index node (3b). The manager may find the index node in its local cache (4a), or it may have to read the index node from disk. If the manager has to read the index node from disk, it uses the index node’s disk log address and the stripe group map (4b) to determine which storage server to contact. The manager then requests the index block from the storage server, who then reads the block from its disk and sends it back to the manager (5). The manager then uses the index node (6) to identi~ the log address of the data block. (We have not shown a detail: if the file is large, the manager may have to read several levels of indirect blocks to find the data block’s address; the manager follows the same procedure in reading indirect blocks as in reading the index node.) The manager uses the data block’s log address and the stripe group map (7) ta send the request to the storage server keeping the block. The storage server reads the data from its disk (8) and sends the data directly to the client that originally asked for it. One important design decision was to cache index nodes at managers but not at clients. Although caching index nodes at clients would allow them to read many blocks from storage servers without sending a request through the ACM Transactionson ComputerSystems,Vol. 14, No. 1, February 1996. Directo Name, Offset * 1 I@&?l 0 stripe Y Access LQcal Data Structure_ Possible NetworkHopData or MetadataBlock (or Cache) Globally Replicated Data SWheal Portion of GIobal Data E Group L Data Mgr. ‘%a Block Addr. 7 * %%!zd~[-1[ **** n ~ m 56. Thomas E. Anderson et al. manager for each block, doing so has three significant drawbacks. First, by reading blocks from disk without first contacting the manager, clients would lose the opportunity to use cooperative caching to avoid disk accesses. Second, although clients could sometimes read a data block directly, they would still need to notify the manager of the fact that they now cache the block so that the manager knows to invalidate the block if it is modified. Finally, our approach simplifies the design by eliminating client caching and cache consistency for index nodes—only the manager handling an index number directly accesses its index node. 3.2.2 Writts. Clients buffer writes in their local memory until committed to a stripe group of storage servers. Because xFS uses a log-based file system, every write changes the disk address of the modified block. Therefore, after a client commits a segment to a storage server, the client notifies the modified blocks’ managers; the managers then update their index nodes and imaps and periodically log these changes to stable storage. As with Zebra, xFS does not need to commit both index nodes and their data blocks “simultaneously” because the client’s log includes deltas that allows reconstruction of the manager’s data structures in the event of a client or manager crash. We discuss deltas in more detail in Section 5.1. As in BSD LFS [Seltzer et al. 1993], each manager caches its portion of the imap in memory and stores it on disk in a special file called the ifile. The system treats the ifile like any other file with one exception: the ifile has no index nodes. Instead, the system locates the blocks of the ifile using manager checkpoints described in Section 5.1. xFS utilizes a token-based cache consistency 3.2.3 Cache Consistency. scheme similar to Sprite [Nelson et al. 1988] and Andrew [Howard et al. 1988] except that xFS manages consistency on a per-block rather than per-file basis. Before a client modifies a block, it must acquire write ownership of that block. The client sends a message to the block’s manager. The manager then invalidates any other cached copies of the block, updates its cache consistency information tQ indicate the new owner, and replies to the client, giving permission to write. Once a client owns a block, the client may write the block repeatedly without having to ask the manager for ownership each time. The client maintains write ownership until some other client reads or writes the data, at which point the manager revokes ownership, forcing the client to stop writing the block, flush any changes to stable storage, and forward the data to the new client. xFS managers use the same state for both cache consistency and cooperative caching. The list of clients caching each block allows managers to invalidate stale cached copies in the first case and to forward read requests to clients with valid cached copies in the second. 3.2.4 Management Distribution a client to a manager colocated Policies. xFS tries to assign files used by on that machine. This section presents a ACMTransactions on Computer Systems, Vol. 14, No. 1, Februa~ 1996. Servertess Network File Systems . 57 simulation study that examines policies for assigning files to managers. We show that collocating a file’s management with the client that creates that file can significantly improve locality, reducing the number of network hops needed to satisfy client requests by over 40!Z0 compared to a centralized manager. The xFS prototype uses a policy we call First Writer. When a client creates a file, xFS chooses an index number that assigns the file’s management to the manager colocated with that client, For comparison, we also simulated a Centralized policy that uses a single, centralized manager that is not colocated with any of the clients. We examined management policies by simulating xFS’ behavior under a seven-day trace of 236 clients’ NFS accesses to an Auspex file server in the Berkeley Computer Science Division [Dahlin et al. 1994a]. We warmed the simulated caches through the first day of the trace and gathered statistics through the rest. Since we would expect other workloads to yield different results, evaluating a wider range of workloads remains important work. The simulator counts the network messages necessary to satisfy client requests, assuming that each client has 16MB of local cache and that there is a manager colocated with each client, but that storage servers are always remote. Two artifacts of the trace affect the simulation. First, because the trace was gathered by snooping the network, it does not include reads that resulted in local cache hits. By omitting requests that resulted in local hits, the trace inflates the average number of network hops needed to satisfy a read request. Because we simulate larger caches than those of the traced system, this factor does not alter the total number of network requests for each policy [Smith 19’771,which is the relative metric we use for comparing policies. The second limitation of the trace is that its finite length does not allow us to determine a file’s “First Writer” with certainty for references to tiles created before the beginning of the trace. For files that are read or deleted in the trace before being written, we assign management to random managers at the start of the trace; when and if such a file is written for the first time in the trace, we move its management to the first writer. Because write sharing is rare—960/o of all block overwrites or deletes are by the block’s previous writer—we believe this heuristic will yield results close to a true “First Writer” policy for writes, although it will give pessimistic locality results for “cold-start” read misses that it assigns to random managers. Figure 4 shows the impact of the policies on locality. The First Writer policy reduces the total number of network hops needed to satisfy client requests by 43$Z. Most of the difference comes from improving write locality; the algorithm does little to improve locality for reads, and deletes account for only a small fraction of the system’s network traffic. Figure 5 illustrates the average number of network messages to satisfy a read block request, write block request, or delete file request. The communication for a read block request includes all of the network hops indicated in Figure 3. Despite the large number of network hops that can be incurred by some requests, the average per request is quite low. Seventy-five percent of ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. . Thomas E. Anderson et al, 6,000,000 ~ 5,000,000 M ~ 4,000,000 2 & 3,0CQ,000 g 2,000,000 z I ,Ooo,ooo 0 Centralized FirstWriter Management Policy Fig. 4. Comparison of locality as measured by network traffic for the Centralized and First Writx3rmanagement policies. 41 Centralized = First Writer ~ L HODS Per write HopsPer Delete Fig. 5. Average number of network messages needed to satis~ a read block, write block, or delete file request under the Centralized and First Writer policies. The Hops Per Write column does not include a charge for writing the segment containing blosk writes to disk because the segment write is asynchronous to the block write request and because the large segment amortizes the per-block write cost. ‘Note that the number of hops per read would be even lower if the trace included all local hits in the traced system. read request.a in the trace were satisfied by the local cache; as noted earlier, the local hit rate would be even higher if the trace included local hits in the traced system. An average local read miss costs 2.9 hops under the First Writer policy; a local miss normally requires three hops (the client asks the manager; the manager forwards the request; and the storage server or client supplies the data), but 12% of the time it can avoid one hop because the manager is colocat.ed with the client making the request or the client supplying the data. Under both the Centralized and First Writer policies, a read ACMTransactions on Computer Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems miss will occasionally incur a few additional hops indirect block from a storage server. Writes benefit more dramatically from locality. quests that ownership, required the the manager client to contact was colocated the to read Of the manager with the client . an index 55% 59 node of write to establish or re- write 90Yc of the time. When a manager had to invalidate stale cached data, the cache being invalidated was local one-third of the time. Finally, when clients flushed data to disk, they informed the manager of the data’s new storage location, a local operation 90?l of the time. Deletes, though rare, also benefit from locality: 68% of file delete requests went to a local manager, and 89~o of the clients notified to stop caching deleted files were local to the manager. In the future, we plan to examine other policies for assigning managers. For instance, we plan to investigate modiffing directories to permit xFS to dynamically change a file’s index number and thus its manager after it has been created. This capability would allow fine-g-rained load balancing on a per-file rather than a per-manager map entry basis, and it would permit xFS to improve locality by switching managers when a different machine repeatedly accesses a file. Another optimization that we plan to investigate is to assign multiple managers to different portions of the same file to balance load and provide locality for parallel workloads. 4. CLEANING When a log-structured file system such as xFS writes data by appending complete segments to its logs, it often invalidates blocks in old segments, leaving “holes” that contain no data. LFS uses a log cleaner to coalesce live data from old segments into a smaller number of new segments, creating completely empty segments that can be used for future full segment writes. Since the cleaner must create empty segments at least as quickly as the system writes new segments, a single, sequential cleaner would be a bottleneck in a distributed system such as xFS. Our design therefore provides a distributed cleaner. An LFS cleaner, whether centralized or distributed, has three main tasks. First, the system must keep utilization status about old segments-how many “holes” they contain and how recently these holes appeared—to make wise decisions about which segments to clean [Rosenblum and Ousterhout 1992]. Second, the system must examine this bookkeeping information to select segments to clean. Third, the cleaner reads the live blocks from old log segments and writes those blocks to new segments. The rest of this section describes how xFS distributes cleaning. We first describe how xFS tracks segment utilizations, then how we identify subsets of segments to examine and clean, and finally, how we coordinate the parallel cleaners to keep the file system consistent. Because the prototype does not yet implement the distributed cleaner, this section includes the key simulation results motivating our design. ACM Transactions on Computer Systems, Vol 14, No. 1, February 1996. 60 4.1 . Thomas E. Anderson et al, Distributing Utilization Status xFS assigns the burden of maintaining each segment’s utilization status to the client that wrote the segment. This approach provides parallelism by distributing the bookkeeping, and it provides good locality; because clients seldom write-share data [Baker et al. 1991; Blaze 1993; Kistler and Satyanarayanan 1992] a client’s writes usually affect only local segments’ utilization status. We simulated this policy to examine how well it reduced the overhead of maintaining utilization information. For input to the simulator, we used the Auspex trace described in Section 3.2.4, but since caching is not an issue, we gather statistics for the full seven-day trace (rather than using some of that time to warm caches). Figure 6 shows the results of the simulation. The bars summarize the network communication necessary to monitor segment state under three policies: Centralized Pessimistic, Centralized Optimistic, and Distributed. Under the Centralized Pessimistic policy, clients notify a centralized, remote cleaner every time they modify an existing block. The Centralized Optimistic policy also uses a cleaner that is remote from the clients, but clients do not have to send messages when they modify blocks that are still in their local write buffers. The results for this policy are optimistic because the simulator assumes that blocks survive in clients’ write buffers for 30 seconds or until overwritten, whichever is sooner; this assumption allows the simulated system to avoid communication more often than a real system since it does not account for segments that are written to disk early due to syncs [Baker et al. 1992]. (Unfortunately, syncs are not visible in our Auspex traces.) Finally, under the Distributed policy, each client tracks the status of blocks that it writes so that it needs no network messages when modi&ing a block for which it was the last writer. During the seven days of the trace, of the one million blocks written by clients and then later overwritten or deleted, 33% were modified within 30 seconds by the same client and therefore required no network communication under the Centralized Optimistic policy. However, the Distributed scheme does much better, reducing communication by a factor of 18 for this workload compared to even the Centralized Optimistic policy. 4.2 Distributing Cleaning Clients store their segment utilization information in s-#iles. We implement s-files as normal xFS files to facilitate recovery and sharing of s-files by different machines in the system. Each s-file contains segment utilization information for segments written by one client to one stripe group: clients write their s-files into per-client directories, and they write separate s-files in their directories for segments stored to different stripe groups. A leader in each stripe group initiates cleaning when the number of free segments in that group falls below a low-water mark or when the group is ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. Setverless Network File Systems 100% [IN EL o% ~ Modified By Sanx Client m Modified By Different . 61 Client Fig. 6. Simulated network communication between clients and cleaner, Each bar shows the fraction of all blocks modified or deleted in the trace, based on the time and client that modified the block. Blocks can be modified by a different client than originally wrote the data, by the same client within 30 seconds of the previous write, or by the same client after more than 30 seconds have passed. The Centralized Pessimistic policy assumes every modification requires network traffic. The Centralized Optimistic scheme avoids network communication when the same client modifies a block it wrote within the previous 30 seconds, while the Distributed scheme avoids communication whenever a block is modified by its previous writer, idle. The group leader decides which cleaners should clean the stripe group’s segments. It sends each of those cleaners part of the list of s-files that contain utilization information for the group. By giving each cleaner a different subset of the s-files, xFS specifies subsets of segments that can be cleaned in parallel. A simple policy would be to assign each client to clean its own segments. An attractive alternative is to assign cleaning responsibilities to idle machines. xFS would do this by assigning s-files from active machines to the cleaners running on idle ones. 4.3 Coordinating Cleaners Like BSD LFS and Zebra, xFS uses optimistic concurrency control to resolve conflicts between cleaner updates and normal file system writes. Cleaners do not lock files that are being cleaned, nor do they invoke cache consistency actions. Instead, cleaners just copy the blocks from the blocks’ old segments to their new segments, optimistically assuming that the blocks are not in the process of being updated somewhere else. If there is a conflict because a client is writing a block as it is cleaned, the manager will ensure that the client update takes precedence over the cleaner’s update. Although our algorithm for distributing cleaning responsibilities never simultaneously asks multiple cleaners to clean the same segment, the same mechanism could be used to allow less strict (e.g., probabilistic) divisions of labor by resolving conflicts between cleaners. ACM Transactions on Computer Systems. Vol. 14. No. 1, February 1996 62 . Thomas E. Anderson et al. 5. RECOVERY AND RECONFIGURATION Availability is a key challenge for a distributed system such as xFS. Because xFS distributes the file system across many machines, it must be able to continue operation when some of the machines fail. To meet this challenge, xFS builds on Zebra’s recovery mechanisms, the keystone of which is redundant, log-stmctured storage. The redundancy provided by a software RAID makes the system’s logs highly available, and the log-structured storage allows the system to quickly recover a consistent view of its data and metadata through LFS checkpoint recovery and roll-forward. LFS provides fast recovery by scanning the end of its log to first read a checkpoint and then roll forward, and Zebra demonstrates how to extend LFS recovery to a distributed system in which multiple clients are logging data concurrently. xFS addresses two additional problems. First, xFS regenerates the manager map and stripe group map using a distributed consensus algorithm. Second, xFS recovers manager metadata from multiple managers’ logs, a process that xFS makes scalable by distributing checkpoint writes and recovery to the managers and by distributing roll-forward to the clients. The prototype implements only a limited subset of xFS recovery functionality—storage servers recover their local state after a crash; they automatically reconstruct data from parity when one storage server in a group fails; and clients write deltas into their logs to support manager recovery. However, we have not implemented manager checkpoint writes, checkpoint recovery reads, or delta reads for roll-forward. The current prototype also fails to recover cleaner state and cache consistency state, and it does not yet implement the consensus algorithm needed to dynamically reconfigure manager maps and stripe group maps. This section outlines our recovery design and explains why we expect it to provide scalable recovery for the system. However, given the complexity of the recovery problem and the early state of our implementation, continued research will be needed to filly understand scalable recovery. 5.1 Data Structure Recovery Table H lists the data structures that storage servers, managers, and cleaners recover after a crash. For a systemwide reboot or widespread crash, recovery proceeds from storage servers, to managers, and then to cleaners because later levels depend on earlier ones. Because recovery depends on the logs stored on the storage servers, xFS will be unable to continue if multiple storage servers from a single stripe group are unreachable due to machine or network failures. We plan to investigate using multiple parity fragments to allow recovery when there are multiple failures within a stripe group [Blaum et al. 1994]. Less widespread changes to xFS membership-such as when an authorized machine asks to join the system, when a machine notifies the system that it is withdrawing, or when a machine cannot be contacted because of a crash or network failure—trigger similar reconfiguration steps. For instance, if a single manager crashes, the system skips the steps to recover the storage servers, going directly to generate a new manager map ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996 Serverless Network File Systems Table H. Data Structures . 63 Restored During Recovery Data Structure Log Segments Stripe Group Map Recovered From Local DataStructures Consensus Manager Manager Map Disk Location Metadata Cache Consistency Metadata Consensus Checkpoint and Roll-Forward Poll Clients Cleaner Segment Utilization S-Files Storage Server Recovery occurs in the order listed from top to bottom because lower datastructures depend on higher ones. that assigns the failed manager’s then recovers the failed manager’s using checkpoint and roll-forward, by polling clients. duties to a new manager; the new manager disk metadata from the storage server logs and it recovers its cache consistency state 5.1.1 Storage Server Recovery. The segments stored on storage server disks contain the logs needed to recover the rest of xFS’ data structures, so the storage servers initiate recovery by restoring their internal data structures. When a storage server recovers, it regenerates its mapping of xFS fragment IDs to the fragments’ physical disk addresses, rebuilds its map of its local free disk space, and verifies checksums for fragments that it stored near the time of the crash. Each storage server recovers this information independently from a private checkpoint, so this stage can proceed in parallel across all storage servers. Storage servers next regenerate their stripe group map. First, the storage servers use a distributed consensus algorithm [Cristian 1991; Ricciardi and Birman 1991; Schroeder et al. 1991] to determine group membership and to elect a group leader. Each storage server then sends the leader a list of stripe groups for which it stores segments, and the leader combines these lists to form a list of groups where fragments are already stored (the obsolete stripe groups). The leader then assigns each active storage server ti a current stripe group and distributes the resulting stripe group map to the storage servers. 5.1.2 Manager Recovery. Once the storage servers have recovered, the managers can recover their manager map, disk location metadata, and cache consistency metadata. Manager map recovery uses a consensus algorithm as described above for stripe group recovery. Cache consistency recovery relies on server-driven polling [Baker 1994; Nelson et al. 1988]: a recovering manager contacts the clients, and each client returns a list of the blocks that it is caching or for which it has write ownership from that manager’s portion of the index number space. The remainder of this subsection describes how managers and clients work together to recover the managers’ disk location metadata—the distributed imap and index nodes that provide pointers to the data blocks on disk. Like ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996 Thomas E. Anderson et al 64. LFS and Zebra, xFS recovers this data using a checkpoint and roll-forward mechanism. xFS distributes this disk metadata recovery to managers and clients so that each manager and client log written before the crash is assigned to one manager or client to read during recovery. Each manager reads the log containing the checkpoint for its portion of the index number space, the and where crash. This possible, clients delegation read occurs as the same part logs that they of the consensus wrote before process that generates the manager map. The goal of manager checkpoints is to help managers recover their imaps from the logs. As Section 3.2.2 described, managers store copies of their imaps in files called ifiles. To help recover the imaps from the ifiles, managers periodically write checkpoints that contain lists of pointers to the disk storage locations of the ifiles’ state of the ifile when of the clients’ blocks. Because the checkpoint logs reflected each checkpoint is written, by the checkpoint. corresponds it also includes Thus, to the the positions once a manager reads a checkpoint during recovery, it knows the storage locations of the blocks of the ifile as they existed at the time of the checkpoint, and it knows where in the client logs to start reading to learn about more recent modifications. The main difference between xFS and Zebra or BSD LFS is that xFS has multiple managers, so each xFS manager writes its own checkpoint for its part of the index number space. During recovery, managers read their checkpoints independently and in parallel. Each manager locates its checkpoint by first querying storage servers to locate the newest segment written to its log before the crash and then reading backward in the log until it finds the segment with the most recent checkpoint. Next, managers use this checkpoint to recover their portions of the imap. Although the managers’ checkpoints were written at different times and therefore do not reflect a globally consistent view of the file system, the next phase of recovery, roll-forward, brings all of the managers’ disk location metadata to a consistent state corresponding to the end of the clients’ logs. To account for changes that had not reached the managers’ checkpoints, the system uses roll-forward, where clients use the deltas stored in their logs to replay actions that occurred later than the checkpoints. To initiate roll-forward, the managers use the log position information from their checkpoints to advise the clients of the earliest segments to scan. Each client locates the tail of its log by querying storage servers, and then it reads the log backward to locate the earliest segment needed by any manager. Each client then reads forward in its log, using the manager map to send the deltas to the appropriate managers. Managers use the deltas to update their imaps and index nodes as they do during normal operation; version numbers in the deltas allow managers to chronologically order different same files [Hartman and Ousterhout 1995]. clients’ modifications t.o the 5.1.3 Cleaner Recovery. Clients checkpoint the segment utilization information needed by cleaners in standard xFS files, called s-files. Because these checkpoints are stored in standard files, they are automatically recovered by the storage server and manager phases of recovery. However, the s-files may ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996, Serverless Network File Systems . 65 not reflect the most recent changes to segment utilizations at the time of the crash, so s-file recovery also includes a roll-forward phase. Each client rolls forward the utilization state of the segments tracked in its s-files by asking the other clients are more recent clients gather roll-forward 5.2 Scalability for summaries of their modifications than the s-file checkpoint. this phase segment for manager utilization To avoid summary to those scanning segments that their logs twice, information during the metadata. of Recovery Even with the parallelism provided by xFS’ approach to manager recovery, future work will be needed to evaluate its scalability. Our design is based on the observation that, while the procedures described above can require O(N z ) communications steps (where N refers to the number of clients, managers, or storage servers), each phase can proceed in parallel across N machines. For instance, to locate the tails of the systems logs, each manager and client queries each storage server group to locate the end of its log. While this can require a total of 0(l?2 ) messages, each manager or client only needs to contact N storage server groups, and all of the managers and clients can proceed in parallel, provided that they take steps to avoid having many machines simultaneously contact the same storage server [Baker 1994]; we plan to use randomization to accomplish this goal, Similar considerations apply to the phases where managers read their checkpoints, clients roll forward, and managers query clients for their cache consistency state. 6. SECURITY xFS, as described, is appropriate for a restricted environment-among machines that communicate over a fast network and that tmst one another’s kernels to enforce security. xFS managers, storage servers, clients, and cleaners must run on secure machines using the protocols we have described so far. However, xFS can support less trusted clients using different protocols that require no more trust than traditional client protocols, albeit at some cost to performance. Our current implementation allows unmodified UNIX clients to mount a remote xFS partition using the standard NFS protocol. Like other file systems, xFS trusts the kernel to enforce a firewall between untrusted user processes and kernel subsystems such as xFS. The xFS storage servers, managers, and clients can then enforce standard file system security semantics. For instance, xFS storage servers only store fragments supplied by authorized clients; xFS managers only grant read and write tokens to authorized clients; xFS clients only allow user processes with appropriate credentials and permissions to access file system data. We expect this level of trust to exist within many settings. For instance, xFS could be used within a group or department’s administrative domain, where all machines are administered the same way and therefore trust one another. Similarly, xFS would be appropriate within a NOW where users already trust remote nodes to run migrated processes on their behalf. Even in ACM Transactions on Computer Systems, Vol. 14, No 1. February 1996 66. Thomas E. Anderson et al. Fig. 7. Am xFS core acting as a scalable file server for unmodified NFS!clients. environment that do not trust all desktop machines, xFS could still be used within a trusted core of desktop machines and servers, among physically secure compute servers and file servers in a machine room, or within one of the parallel server architectures now being researched [Kubiatowicz and Agarwal 1993; Kuskin et al. 1994]. In these cases, the xFS core could still provide scalable, reliable, and cost-effective file service to less trusted fringe clients running more restrictive protocols. The downside is that the core system cannot exploit the untrusted CPUS, memories, and disks located in the fkinge. Client trust is a concern for xFS because xFS ties its clients more intimately to the rest of the system than do traditional protocols. This close association improves performance, but it may increase the opportunity for mischievous clients to interfere with the system. In either xFS or a traditional system, a compromised client can endanger data accessed by a user on that machine. However, a damaged xFS client can do wider harm by writing bad logs or by supplying incorrect data via cooperative caching. In the future we plan to examine techniques to guard against unauthorized log entries and ta use encryption-based techniques to safeguard cooperative caching. Our current prototype allows unmodified UNIX fi-inge clients to access xFS core machines using the NFS protocol as Figure 7 illustrates. To do this, any xFS client in the core exports the xFS file system via NFS, and an NFS client employs the same procedures it would use to mount a standard NFS partition from the xFS client. The xFS core client then acts as an NFS server for the NFS client, providing high pefiormance by employing the remaining xFS core machines to satisfi any requesti not satisfied by its local cache. Multiple NFS clients can utilize the xFS core as a scalable file server by having different NFS clients mount the xFS file system using different xFS clients to avoid bottlenecks. Because xFS provides single-machine sharing semantics, it appears to the NFS clients that they are mounting the same file system from the same server. The NFS clients also benefit from xFS high availability, since they can mount the file system using any available xFS client. Of course, a key to good NFS server performance is to efficiently implement synchronous writes; our prototype does not yet exploit the nonvolatile RAM ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems . 67 optimization found in most commercial NFS servers [Baker et al. 1992], so for best performance, NFS clients should mount these partitions using the “unsafe” option to allow xFS to buffer writes in memory. 7. xFS PROTOTYPE This section describes the state of the xFS prototype as of August, 1995, and presents preliminary performance results measured on a 32-node cluster of SPARCStation 10’s and 20’s. Although these results are preliminary and although we expect future tuning to significantly improve absolute performance, they suggest that xFS has achieved its goal of scalability. For instance, in one of our microbenchmarks 32 clients achieved an aggregate large-file write bandwidth of 13.9MB/second, close to a linear speedup compared to a single client’s 0.6MB/second bandwidth. Our other tests indicated similar speedups for reads and small-file writes. The prototype implementation consists of four main pieces. First, we implemented a small amount of code as a loadable module for the Solaris kernel, This code provides xFS’ interface to the Solaris v-node layer and accesses the kernel buffer cache. We implemented the remaining three pieces of xFS as daemons outside of the kernel address space to facilitate debugging [Howard et al. 1988]. If the xFS kernel module cannot satisfi a request using the buffer cache, then it sends the request to the client daemon. The client daemons provide the rest of xFS’ functionality by accessing the manager daemons and the storage server daemons over the network. The rest of this section summarizes the state of the prototype, describes our test environment, and presents our results. 7.1 Prototype Status The prototype implements most of xFS’ key features, including distributed management, cooperative caching, and network disk striping with single parity and multiple groups. We have not yet completed implementation of a number of other features. The most glaring deficiencies are in crash recovery and cleaning. Although we have implemented storage server recovery, including automatic reconstruction of data from panty, we have not completed implementation of manager state checkpoint and roll-forward; also, we have not implemented the consensus algorithms necessary to calculate and distribute new manager maps and stripe group maps; the system currently reads these mappings from a non-xFS file and cannot change them. Additionally, we have yet to implement the distributed cleaner. As a result, xFS is still best characterized as a research prototype, and the results in this article should thus be viewed as evidence that the serverless approach is promising, not as “proof’ that it will succeed. 7.2 Test Environment For our testbed, we use a total of 32 machines: eight dual-processor SPARCStation 20’s, and 24 single-processor SPARCStation 10’s. Each of our machines has 64MB of physical memory. Uniprocessor 50 MHz SS-20’s and ACM Transactions on Computer Systems, Vol. 14, No, 1, February 1996. 68 . Thomas E. Anderson et al, SS-10’S have SPECInt92 ratings of 74 and 65 and can copy large blocks of data from memory to memory at 27MB/second and 20MB/second, respectively. We use the same hardware to compare xFS with two central-server architectures: NFS [Sandberg et al. 1985] and AFS (a commercial version of the Andrew file system [Howard et al. 1988]). We use NFS as our baseline system for practical reasons—NFS is mature, widely available, and well tuned, allowing easy comparison and a good frame of reference—but its limitations with respect to scalability are well known [Howard et al. 1988]. Since many NFS installations have attacked NFS limitations by buying shared-memory multiprocessor servers, we would like to compare xFS running on workstations to NFS running on a large multiprocessor server, but such a machine was not available to us; our NFS server therefore runs on essentially the same platform as the clients. We also compare xFS to AFS, a more scalable central-server architecture. However, AFS achieves most of its scalability compared to NFS by improving cache petiormance; ita scalability is only modestly better compared to NFS for reads from server disk and for writes. For our NFS and AFS tests, we use one of the SS-20’s as the server and the remaining 31 machines as clients. For the xFS tests, all machines act as storage servers, managers, and clients unless otherwise noted. For experiments using fewer than 32 machines, we always include all of the SS-20’s before starting to use the less powerful SS-10’S. The xFS storage servers store data on a 256MB partition of a 1.lGB Seagate-STl 1200N disk. These disks have an advertised average seek time of 5.4ms and rotate at 541 lRPM. We measured a 2.7MB/second peak bandwidth to read from the raw disk device into memory. For all xFS tests, we use a log fragment size of 64KB, and unless otherwise noted we use storage server groups of eight machines—seven for data and one for parity; all xFS tests include the overhead of parity computation. The AFS clients use a 100MB partition of the same disks for local disk caches. The NFS and AFS central servers use a larger and somewhat faster disk than the xFS storage servers, a 2. lGB DEC RZ 28-VA with a peak bandwidth of 5MB/second from the raw partition into memory. These servers also use a Prestoserve NVRAM card that acts as a buffer for disk writes [Baker et al. 1992]. We did not use an NVR&l btier for the xFS machines, but xFS’ log buffer provides similar performance benefits. A high-speed, switched Myrinet network [Boden et al. 1995] connects the machines. Although each link of the physical network has a peak bandwidth of 80MB/second, RPC and TCP\IP protocol overheads place a much lower limit on the throughput actually achieved [Keeton et al. 1995]. The throughput for fast networks such as the Myrinet depends heavily on the version and patch level of the Solaris operating system used. For our xFS measurements, we used a kernel that we compiled from the Solaris 2.4 source release. We measured the TCP throughput to be 3.2MB/second for 8KB packets when using this source release. For the NFS and AFS measurements, we used the binary release of Solaris 2.4, augmented with the binary patches recommended by Sun as of June 1, 1995. This release provides better network ACMTransactions on Computer Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems . 69 performance; our TCP test achieved a throughput of 8.4 MB/second for this setup. Alas, we could not get sources for the patches, so our xFS measurements are penalized with a slower effective network than NFS and AFS. RPC overheads further reduce network performance for all three systems. 7.3 Performance Results This section presents a set of preliminary performance results for xFS under a set of microbenchmarks designed to stress file system scalability and under an application-level benchmark. These performance results are preliminary. As noted above, several significant pieces of the xFS system—manager checkpoints and cleaning—remain to be implemented. the results to ever We do not expect for the benchmarks limit performance. these presented However, additions to significantly impact here. We do not expect checkpoints thorough future investigation will be needed to evaluate the impact of distributed cleaning under a wide range workloads; other researchers have measured sequential cleaning overheads from a few percent [Blackwell et al. 1995; Rosenblum and Ousterhout 1992] to as much as 40”; [Seltzer et al. 1995], depending on the workload. Also, the current prototype implementation suffers from three inefficiencies, all of which we will attack in the future. (1) xFS is currently implemented as a set of user-level processes by redirecting vnode layer calls. This hurts performance because each user/kernel space crossing requires the kernel to schedule the user-level process and copy data to or from the user process’ address space. To fix this limitation, we are working to move xFS into the kernel. (Note that AFS shares this handicap.) (2) RPC and TCP/IP overheads severely limit xFS’ network performance. We are porting xFS’ communications layer to Active Messages [von Eicken et al. 1992] to address this issue. (3) We have done little profiling and flx inefficiencies. and tuning. As we do so, we expect to find As a result, the absolute performance is much less than we expect for the well-tuned xFS. As the implementation matures, we expect a single xFS client to significantly outperform an NFS or AFS client by benefiting from the bandwidth of multiple disks and from cooperative caching. Our eventual performance goal is for a single xFS client to achieve read and write bandwidths near that of its maximum network throughput and for multiple clients to realize an aggregate bandwidth approaching the system’s aggregate local disk bandwidth. The microbenchmark results presented here stress the scalability of xFS’ storage servers and managers. We examine read and write throughput for large files and write performance for small files, but we do not examine small-file read performance explicitly because the network is too slow to provide an interesting evaluation of cooperative caching; we leave this evaluation as future work. We also use Satyanarayanan’s Andrew benchmark ACM Transactions on Computer Systems, Vol. 14, No 1, February 1996 70 . Thomas E. Anderson et al. Clients Fig. 8. Aggregate disk write bandwidth. The x-axis indicates the number of clients simuk.aneously writing private 10MB files, and the y-axis indicates the total throughput across all of the active clients. xFS uses four groups of eight storage servers and 32 managers. NFS peak throughput is 1.9MB/second with two clients; AFS is 1.3MB/second with 32 clients; and xFS is 13.9MB/second with 32 clients. [Howard et al. 1988] assa simple evaluation of application-level In the fiture, we plan to compare the systems’ performance demanding applications, performance. under more Figures 8 through 10 illustrate the scalability of xFS 7.3.1 Scalability. performance for large writes, large reads, and small writes. For each of these tests, as the number of clients increases, so does xFS aggregate performance. In contrast, just a few clients saturate NFS or AFS’ single server, limiting peak throughput. Figure 8 illustrates the performance of our disk write throughput test, in which each client writes a large (1 OMB), private file and then invokes syne( ) to force the data to disk (some of the data stay in NVRAM in the case of NFS and AFS.) A single xFS client is limited h 0.6MB/second, about one-third of the 1.7MB/second throughput of a single NFS client; this difference is largely due to the extra kernel crossings and associated data copies in the user-level xFS implementation, as well aa high network protocol overheads. A single AFS client achieves a bandwidth of 0.7 MB/second, limited by AFS’ kernel crossings and the overhead of writing data to both the local disk cache and the server disk. Aa we increase the number of clients, NFS’ and AFS’ throughputs increase only modestly until the single, central server disk bottlenecks both systems. The xFS configuration, in contrast, scales up to a peak bandwidth of 13.9MB\second for 32 clients, and it appears that if we had more clients available for our experiments, they could achieve even more bandwidth from the 32 xFS storage servers and managers. Figure 9 illustrates the performance of NFS, AFS, and xFS for large reads from disk. For this test, each machine flushes its cache and then sequentially reads a per-client 10MB file. Again, a single NFS or AFS client outperforms a ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996 Serverless Network File Systems . 71 Fig. 9. Aggregate disk read bandwidth, The x-axis indicates the number of clients simultaneously reading private 10MB files and the y-axis the total throughput across all active clients. xFS uses four groups of eight storage servers and 32 managers. NFS peak throughput is 3. lMB/second with two clients; AFS’ is 1.9MB/second with 12 clients; and xFS is 13,8MB/second with 32 clients single xFS client. One NFS client can read at 2.8MB\second, and an AFS client can read at l. OMB/second, while the current xFS implementation limits one xFS client to 0.9MB\second. As is the case for writes, xFS exhibits good scalability; 32 clients achieve a read throughput of 13.8MB/second. In contrast, two clients saturate NFS at a peak throughput of 3. lMB\second, and 12 clients While saturate Figure AFS’ 9 shows central server disk at 1.9MB/second. disk read performance when data are not cached, all three file systems achieve much better scalability when clients can read data from their caches to avoid interacting with the server. All three systems allow clients to cache data in local memory, providing scalable bandwidths of 20MB\second to 30MB\second per client when clients access working sets of a few tens of megabytes. Furthermore, AFS provides a larger, though slower, local disk cache at each client that provides scalable disk read bandwidth for workloads whose working can achieve an aggregate sets do not fit in memory; our 32-node AFS cluster disk bandwidth of nearly 40MB/second for such workloads. This aggregate disk maximum disk bandwidth for largely untuned, and we expect bandwidth two reasons. is significantly First, the gap to shrink larger as noted in the future. than above, xFS’ xFS Second, is xFS transfers most of the data over the network, while AFS’ cache accesses are local. Thus, there will be some workloads for which AFS’ disk caches achieve a higher aggregate disk read bandwidth than xFS’ network storage. xFS’ network striping, however, provides better write performance and will, in the future, provide better read performance for individual clients via striping. Additionally, once we have ported cooperative caching to a faster network protocol, accessing remote memory will be much faster than going to local disk, and thus the clients’ large, aggregate memory cache will further reduce the potential benefit from local disk caching. ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. 72 . Thomas E. Anderson et al. 1200 tilesls xFS tilesls ~ tikxls filesls . filesis . $ 200 tilesls WY -------- .- .*----- ----- o AFS . .“- . :Nb.~ 35 Fig. 10. Aggregate small-write performance. The x-axis indicates the number of clients, each simultaneously creating 2048 lKB files. The y-axis is the average aggregate number of tile creates per second during the benchmark mn, xFS uses four groups of eight storage servers and 32 managers. NFS achieves its peak throughput of 91 files per second with four clients; AFS peaks at 87 files per second with four clients; and xFS scales up to 1122 files per second with 32 clients. Figure 10 illustrates the petiormance when each client creates 2048 files containing lKB of data per file. For this benchmark, xFS log-based architecture overcomes the current implementation limitations to achieve approximate parity with NFS and AFS for a single client: one NFS, AFS, or xFS client can create 51, 32, or 41 files per second, respectively. xFS also demonstrates good scalability for this benchmark. Thirty-two xFS clients generate a total of 1122 files per second, while NFS peak rate is 91 files per second with four clients, and AFS’ peak is 87 files per second with four clients. Figure 11 shows the average time for a client to complete the Andrew benchmark as the number of clients varies for each file system. This benchmark was designed as a simple yardstick for comparing application-level performance for common tasks such as copying, reading, and compiling files. When one client is running the benchmark, NFS takes 64 seconds to run, and AFS takes 61 seconds, while xFS requires somewhat more time—78 seconds. xFS scalability, however, allows xFS to outperform the other systems for larger numbers of clients. For instance, with 32 clients xFS takes 117 seconds to complete the benchmark, while increased 1/0 time, particularly in the copy phase of the benchmark, increases NFS time to 172 seconds and AFS time to 210 seconds. A surprising result is that NFS outperforms AFS when there are a large number of clients; this is because in-memory file caches have grown dramatically since this comparison was first made [Howard et al. 1988], and the working set of the benchmark now fita in the NFS clients’ in-memory caches, reducing the benefit of AFS on-disk caches. 7.3.2 Storage Server Scalability. In the above measurements, we used a 32-node xFS system where all machines acted as clients, managers, and ACM Transactions on Computer Systems, VOL 14, No. 1, February 1996 Serverless Network File Systems . 73 25(’ ‘r-----l ~ m) s ~ F 150s CJ &loos” : u make rewlAlt scanilr copy m:tkelllr 50 s () s () 5 10 15 x) Clicnls 250 T----l 25 30 ‘50’~ 2(M) s 200 s E p150s u ~I(M)s a. m = 50 s g150s -0 % loos _% LIJ 50s. () s () 5 10 15 20 25 30 () s () 5 10 Clicnls 15 20 25 30 Clients Fig. 11 Average time to complete the Andrew benchmark for NFS, AFS, and xFS as the number of clients simultaneously executing the benchmark varies. The total height of the shaded areas represents the total time to complete the benchmark; each shaded area represents the time for one of the five phases of the benchmark: makeDir, copy, scanDir, readAll, and make For all of the systems, the caches were flushed before running the benchmark. storage servers and found that both bandwidth and small-write performance scaled well. This section examines the impact of different storage server organizations on that scalability. Figure 12 shows the large-write performance as we vary the number of storage servers and as we change the stripe group size. Increasing the number of storage servers improves performance by spreading the systems’ requests across more CPUS bandwidth falls short of linear with the number because client overheads are also a significant and disks. The increase in of storage servers, however, limitation on system band- width. Reducing the system’s ments. We the stripe aggregate attribute group size from bandwidth most of this eight storage by 87G to 227c difference servers to four for the different to the increased reduces measure- overhead of group size from eight to four reduces the fraction of fragments that store data as opposed to parity. The additional overhead reduces the available disk bandwidth by 1690 for the system using groups of four servers. panty. Reducing 7.3.3 Manager ing management the stripe Scalability. Figure 13 shows the importance of distributamong multiple managers to achieve both parallelism and ACM Transactions on Computer Systems, Vol 14. No, 1, February 1996 74 . Thomas E. Anderson et al MBIs xFS (8 SS’s per Group) MBA MB/s MBIs MW . %4 m OIJ E? auOrl ‘o MBIs 5 10 [5 20 !$[ora~e Servers 27 - 70 - 5 - Fig. 12. Large-write throughput as a function of the number of stmage servers in the system. The x-axis indicates the total number of storage servers in the system and the y-axis the aggregate bandwidth when 32 clients each write a 10MB file to disk. The solid line indicates performance for stripe groups of eight storage servers (the default), and the dashed line shows performance for groups of four storage servers. I200 files/s First Writer Polic ----600 fibk 400 tilesh _- ----- Nonkscal ------- Manager I II 200 filesls I O filesls 5 10 24 Storage Server: 31 Cljenls 70 5 5 30 15 .Managers Fig. 13. Small-write performance as a function of the number of managers in the system and manager locality policy. The x-axis indicates the number of managers. The y-axis is the average aggregate number of file creates per second by 31 clients, each simultaneously creating 2048 small ( lKB) files. The two lines show the performance using the First Writer policy that colocatas a file’s manager with the client that creates the file, and a Nonlocal policy that assigns management b some other machine. Because of a hardware failure, we ran this experiment with three groups of eight storage servers and 31 clients. The maximum point on the x-axis is 31 managers. locality. It varies the number of managers handling metadata for 31 clients running the small-write benchmark .3 This graph indicates that a single manager is a significant bottleneck for this benchmark. Increasing the sys3 Due to a hardware and 31 clients. ACM l’r~sactions failure, we ran this experiment with three groups of eight storage servers on Computer Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems . 75 tern from one manager to two increases throughput by over 80’%, and a system with four managers more than doubles throughput compared to a single-manager system, Continuing to increase the number of managers in the system continues to improve performance under xFS’ First Writer policy. This policy assigns files to managers running on the same machine as the clients that create the files; Section 3.2.4 described this policy in more detail. The system with 31 managers can create 45% more files per second than the system with four managers under this policy. This improvement comes not from load distribution but from locality; when a larger fraction of the clients also host managers, the algorithm is able to successfully colocate managers with the clients accessing a file more often. The Nonlocal Manager line illustrates what would happen without locality. For this line, we altered the system’s management assignment policy to avoid assigning files created by a client to the local manager. When the system has four managers, throughput peaks for this algorithm because the managers are no longer a significant bottleneck for this benchmark; larger numbers of managers do not further improve performance. 8. RELATED WORK Section 2 discussed a number of projects that provide an important basis for xFS. This section describes several other efforts to build decentralized file systems and then discusses the dynamic management hierarchies used in some MPPs. Several file systems, such as CFS [Pierce 1989], Bridge [Dibble and Scott 1989], and Vesta [Corbett et al. 1993], distribute data over multiple storage servers to support parallel workloads; however, they lack mechanisms to provide availability across component failures. Other parallel systems have implemented redundant data storage intended for restricted workloads consisting entirely of large files, where per-tile striping is appropriate and where large-file accesses reduce stress on their centralized manager architectures. For instance, Swift [Cabrera and Long 1991] and SFS [Lo Verso et al. 1993] provide redundant distributed data storage for parallel environments, and Tiger [Rashid 1994] services multimedia workloads. TickerTAIP [Cao et al. 1993], SNS [Lee 1995], and AutoRAID [Wilkes et al. 1995] implement RAID-derived storage systems. These systems could provide services similar to xFS’ storage servers, but they would require serverless management to provide a scalable and highly available tile system interface to augment their simpler disk block interfaces. In contrast with the log-based striping approach taken by Zebra and xFS, TickerTAIPs RAID level 5 [Patterson et al. 1988] architecture makes calculating parity for small writes expensive when disks are distributed over the network. SNS combats this problem by using a RAID level 1 (mirrored) architecture, but this approach approximately doubles the space overhead for storing redundant data. AutoRAID addresses this dilemma by storing data that are actively being written to a RAID level 1 and migrating inactive data to a RAID level 5. ACM Transactions on Computer Systems, Vol. 14. No 1, February 1996 76 Thomas E. Anderson et al, . Several MPP designs have used dynamic hierarchies to avoid the fixed-home approach used in traditional directory-based MPPs. The KSR1 [Rosti et al. 1993] machine, based on the DDM proposal [Hagersten et al. 1992], avoids associating data with fixed-home nodes. Instead, data may be stared in any cache, and a hierarchy of directories allows any node to locate any data by searching successively higher and more globally complete directories. While an early data xFS study for file systems independence our root ager hop. using simulated [Dahlin the effect of a hierarchical et al. 1994a], a manager-map-based we now instead approach approach to meta- support for three reasons. location First, approach eliminates the “root” manager that must track all data; such a would bottleneck peflormance and reduce availability. Second, the manmap allows a client to locate a file’s manager with at most one network Finally, the manager map approach can be integrated more readily with the imap data structure that tracks disk location metadata. 9. CONCLUSIONS Serverless large file systems distribute numbers of cooperating file machines. system server This approach responsibilities across eliminates the central server bottleneck inherent in today’s file system designs to provide improved performance, scalability, and availability. Furthermore, serverless systems are cost effective because their scalable architecture eliminates the specialized server hardware and convoluted system administration necessary to achieve scalability strates the viability mance results under current file systems. The xFS prototype demonof building such scalable systems, and its initial perfor- illustrate the potential of this approach. ACKNOWLEDGMENTS We owe several members of the Berkeley Communications Abstraction Layer group-David Culler, Lok Liu, and Rich Martin-a large debt for helping us up. We also made use of a modified to get the 32-node Myrinet network version of Mendel Rosenblum’s LFS cleaner simulator. Eric Anderson, John Hartman, Frans Kaashoek, John Ousterhout, the SOSP program committee, and the anonymous SOSP and TOCS referees provided helpful comments on earlier drafls of this article; their comments greatly improved both the technical content and presentation of this work. REFERENCES ANDERSON, T., CULLER, D., PATTERSON, D., AND THE NOW TEAM. 1995. A case for NOW (Networks of Workstations). IEEE Micro 15, 1 (Feb.), 54-64. BARRR, M. 1994. Fast crash recovery in distributed tile systems. Ph.D. thesis, Univ. of California at Berkeley, Berkeley, Calif. BAKER,M., As*I, S., DEPRIT,E., Ousmwow, J., ANDSELTZER,M. 1992. Non-volatile memory for fast, reliable file systems. In ASPLOS-V (Sept.). ACM, New York, 10-22. BAXER, M., I-Lwm&m, J., KUPFER, M., SHIRRIFF,K., ~ Ous’rmuiowr, J. 1991. Measurement of a distributed file system. In Proceedings of the 13th Symposium on Opemting Systems Principk-s (Oct.). ACM, New York, 198-212. ACM Transactions on Computir Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems . 77 BASCI, A., BIIIH, V., V(WELS, W.. AND v(JiN EK’KEN, T. 1995. U-Net: A user-level network interface for parallel and distributed computing. In Proceedings of the 1.5thSymposium on Operajing SLvstem.sPrinr~ple.s ( Dec.), ACM, New York, 40-53. E31RRELI,,A., HISGEN, A., J~RIAN, C,, MANN, T., ANDSWART, G. system. Tech. Rep. 111, Digital Equipment BLACKWEX,L, T., HARRIS,J., MWISIH.TZP,R,M. tured file systems. Calif., 277-288. In Proceedings Corp., Systems 1995, Heuristic of the 199,5 Winter 1993. The Echo distributed tile Center, Palo Alto., Calif. Research cleaning algorithms USEN’IX. USENIX in log-struc- Assoc., Berkeley, BLAUM, M., BRADY,J., BR[)CK, J., ,mn M~NON, J, EVENODD: An optimal scheme for tolerating double disk failures in RAID architectures. In Proceedings of the 21st International Symposium on Computer Architecture (Apr.). IEEE Computer Society Press, Los Alamitos, Calif,, 245-254. BMW, M. 1993. Caching in large-scale distributed file systems. Ph, D. thesis, Princeton Univ., Princeton, NJ. Jan. BCH)EN, N., COHEN, D , FELI)ERMAN, R., KULAWIK, A., SEITZ, C,, SEIzovI(’, Myrirwt A gigabit-per-second local-area network, IEEE J,, AXI) S(;, W. 1995. Micro 1,5, 1 (Feb. ), 29-36. L. AN]) LOM;, D. 1991. Swift: A storage architecture for large objects. In Proceedings of the 11th Symposium on Mass Storage Systems (Oct.). IEEE Cornputer Society Press, Las Alamitos, Calif., 123-128. CAMWL4, P., LIM, S., VENKAIAMMAN. S,, AND WILKES, J. 1993. The TickerTAIP parallel RAID arch itecture, In Proceedings of the 20th International 1 Symposium on Computer Architect re (May). IEEE Computer Society Press, Las Alamitos, Calif., 52-63. (’A(), CHAIKEX, D., KLIIIIATOWI(Z, J., ANI) A~ARWAL, A. coherence scheme. In ASPL.OS-IV Proceedings LimitLESS directories: A scalable cache (Apr.), ACM, New York, 224-234. 1991. C}mix. P., Lix, E,, GIBSON, G., RAT”, R., AND PATIWRSCN,D. 1994, RAID: High-performance, reliable secondary storage. ACM Comput, Suru, 26, 2 (June), 145-188. CORWXT, (hmput. P., BAYLOR, S., ANI) FEJTE1.WN,D. Arch. Neus 21, 5 (Dec.), 7-14. 1993. Overview of the Vesta parallel file system. CswsTL\S, F. 1991. Reaching agreement on processor tributed systems. Distrib. Comput. 4, 175-187. group membership in synchronous dis- CYPHER, R., Ho, A., K(PWT.4NTINHXNT, S., ANII MIXWNA P. 1993. Architectural requirements of parallel scientific applications with explicit communication. [n Prm-eed~ngs of the 20th Interrsa tirma[ Symposium on Computer Architecture (May). IEEE Computer Society Press, Los Alamitos, (’alif., 2-13. DAHLIN. M., MATHIW, C,. W.4N(;, R., ANDERSON, T., A?W PA’ITERSO?J, D. 1994a. A quantitative analysis of cache policies for scalable network tile systems. In Proceedings of the 1994 ACM SIGM.ETRICS Conference (May). ACM, New York, 150-160. T., Am PAnERSON, D. 1994b, Cooperative caching: Using remote client memory to improve file system performance. In Proceedings of the Ist Symposium on Operating .Sy,stem.sDesign and Imp/emerrtation (Nov.), 276-280. DAHI.IN, M., WANG. R., AWERSO~, DIBBI,K, P. ANH Scwrr. 17, 5 (Sept.), M. 1989. The Bridge multiprocessor file system. Comput. Arch. News 32-39. Doum.M, F. mo OUSTERHW.IT, J. 1991. Transparent process migration: and the Sprite implementation. Srrfl/4. Pr-acf. 13xp. 21, 8 (July), 757-785 HA~I.XWWN, E., LANOJN,A., ANO HARIDI, S. IEEE Compu~. 25, 9 (Sept.), 45-54. HARrMAN, J. AND OrST~RH(XrT, J. 1995. 1992, DDM- A cache-only The Zebra striped network Design alternatives memory file system. architecture. ACM Trans. Comput. Syst. 13, 3 (Aug.), 274310. HOVVAfU),J., KMAR, M., MI+NEES, S., NICHOLS, D., SATYA~ARAYANA~, M., SIDEIIOTHAM, R., A?WI Wwr, W. 1988. Scale and performance in a distributed tile system, ACM Trans. Comput. Syst. 6, 1 (Feb.), 51 ,81. KAZAR, M. 1989. Ubik: Replicated servers made easy. In Proceedings of the 2nd Workshop on Workstation Operrrttng Systems (Sept.). IEEE Computer Society Press. LOS Alamitos, Calif., 60-67 ACM Transactions on Computer Systems, Vol 14, No 1, February 1996 78 . Thomas E. Anderson et al. KEETON, K., ANDERSON,T., AND PATrERSON, D. 1995. ImgP quantified: The case for low-overhead local area networks. In Proceedings of Hot Interconnects (Aug.). IEEE Computer Society Press, Los Alamitos, Calif. KISTLER, J. AND SATYANARAYANAN, M. 1992. Disconnected operation in the Coda file system. ACM Trans. Comput. Syst. 10, 1 (Feb.), 3-25. KUBIATOWICZ, J. ANDAGARWAL,A. 1993. Anatomy of a message in the Alewife multiprocessor. In Proceedings of the 7th International Conference on Supercomputing (July). ACM, New York. KUSKIN,J., OFELT, D., HEtNRICH, M., HEINLEIN, J., SIMONI, R., GHARACHORLCO,K., CHAPIN, J., N~HHLA, D., BAXTER,J., HOROWITZ,M., GUPTA, A., ROSENBLUM,M., ANDHENNESSY,J. 1994. The Stanford FLASH multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture (Apr.). IEEE Computer Society Press, Los Alamitos, Calif., 302-313. LEE, E. 1995. Highly-available, scalable network storage. In Proceedings of COMPCON 95. IEEE, New York. LEFF,A., Yu, P., AND WOLF, J. 1991. Policies for efficient memory utilization in a remote caching architecture. In Proceedings of the 1st International Conference on Parallel and Distributed Information Systems (Dec.). IEEE Computer Society Press, Las Alamitos, Calif., 198-207. LENOSKI, K., LAUDON, J., GtWR+GHORLOO, K., GumA, A., mm HENNESSY, J. 1990. The directory-based cache coherence protocol for the DASH multiprocessor. In Proceedings of the 17th International Symposium on Computer Architecture (May). IEEE Computer Society Press, Las Alamitos, Calif., 148-159. LISKOV, B., GHEMAWAT, S., GRUBER, R., JOHNSON, P., SHRIRA, L., AND WHJJAMS, M. 1991. Replication in the Harp file system. In Proceedings of the 13th Symposium on Operating Systems Principles (Oct.). ACM, New York, 226-238. LITZKOW,M. ANDSOLOMON,M. 1992. Supporting checkpainting and process migration outaide the UNIX kernel. In Proceedings of the Winter 1992 LWENIX (Jan.). USENIX Assoc., Berkeley, Calif., 283-290. h VEKRO,S., ISMAN,M., NANOPOULOS,A., NESHEIM,W,, MILNE, E., ANDWHEELER, R. 1993. sfs: A parallel tile system for the CM-5. In Proceedings of the Sumnwr 1993 USENIX. USENIX Assoc., Berkeley, Calif., 291-305. MAJOR, D., MINSHALL, G., AND POWELL, K. 1994. An overview of the NetWare operating system, In Proceedings of the 1994 Winter USENIX (Jan.). USENIX Assoc., Berkeley, Calif., 355-372. MCKUSICK, M., JoY, W., LEFFLER, S., AND FABRY, R. 1984. A fast file system for UNIX. ACM Trans. Comput. Syst. 2, 3 (Aug.), 181-197. NELSON,M., WELCH, B., ANDOUSTERHOUT,J, 1988, Caching in the Sprite network file system. ACM Trans. Comput. Syst. 6, 1 (Feb.), 134-154, PATTERSON,D., GIBSON, G., AND KATZ, R. 1988. A case for redundant arrays of inexpensive disks (RAID). [n the International Conference on Management of Data (June). ACM, New York, 109-116. PIERCE, P. 1989. A concurrent file system for a highly parallel mass storage subsystem. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers, and Applications. Golden Gate Enterprises, Los Altos, Calif, 155-160, POPEK, G,, GUY, R., PAGE) T., ANDHEtDEMANN,J. 1990. Replication in the Ficus distributed file system. In Proceedings of the Workshop on the Management of Replicated Data (Nov.). IEEE Computer Society Press, Los AlamitOs, Calif., 5-10. RASHID,R. 1994. Microsoft’s Tiger media server. In The Ist Networks of Workstations Workshop Record (Oct.). Presented at ASPLOS 1994 Conference (San Jose, Calif.). RtCCtARDI,A. AND B!RMAN, K. 1991. Using process groups to implement failure detection in asynchronous environments. In Proceedings of the Ioth Symposium on Principles of Distributed Computing (Aug.). ACM, New York, 341–353. ROSENBLUM,M. ANDOUSTERHOUT,J. 1992. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10, 1 (Feb.), 26-52. ROSTI, E., SMIRNI,E., WAGNER,T., APON, A., ANDDowDY, L. 1993. The KSR1: Experimentation and modeling of Poststnre. In Proceedings of 1993 SIGMETRICS (June). ACM, New York, 74-85. ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996. Serverless Network File Systems . 79 SANO~ERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., ANDLYON, B. mentation of the Sun network file system. In Proceedings (June). USENIX Aasoc, Berkeley, Calif., 119-130. 1985. Design and impleof the Summer 1985 USENIX SCHROEDER, M., BIRRELL, A., BURROWS, M., MURRAY, H., NEEDHAM, R., RODEHEPFRR, T., SATTERTH- 1991. Autonet: A high-speed, self-configuring local area network using point-to-Wint links, IEEE J. Set, Areas Cornmun. 9, 8 (Oct.), 1318– 1335. SELTZER, M., BOSTIC, K., M(!Kus]cK, M., AND STAELIN, C. 1993. An implementation of a log-structured file system for UNIX. In Proceedings of the 1993 Winter USENLY (Jan.), USENIX Assoc., Berkeley, Calif., 307-326. SELTZER, M., SMITH, K., BALAKRISHNAN, H., CHANG, J,, MCMAINS, S., AND PADMANABHAN, V, 1995. File system logging versus clustering: A performance comparison. In Prweedi~s of the 1995 Winter USENIX (Jan.), USENIX Assoc., Berkeley, Calif. SMITH, A. 1977. Two methods for the eflicient analysis of memory address trace data. IEEE Trans. Sofiw. Eng. SE-3, 1 (Jan,), 94-101. vet+ EJCK~N, T., CULLER, D., GOLDSTEIN,S., AND SCHAUSER,K. E. 1992. Active messages: A mechanism for integrated communication and computation. In Proceedings of the 19th International Symposium on Computer Architecture (May). IEEE Computer Society Press, Los WAITE, E,, AND THACKER, C. Alamitos, Calif., 256-266. WALKER,B., POPEK,G., ENGLISH, R., KLINE, C,, ANOTHtEL, G. 1983. The LOCUS distributed operating system. In Proceedings of the 5th Symposium on Operating Systems Principles (Oct.). ACM, New York, 49-69. WANG, R. ANO ANDERSON, T. 1993, xFS: A wide area mass storage tile system. In the 4th Workshop on Workstation Operating Systems (Oct.). IEEE Computer Society Press, Los Alamitos, Calif., 71-78. WILKES, .J., AND GOLDING, R., STAELIN, C., AND SULLIVAN, T. 1995. The HP AutoRAID hierarchical storage system. In Proceedings of the 15th Symposium on Operating Systems Principles (Dec.). ACM, New York, 96-108. WOLF, J. 1989. The placement optimization problem: A practical solution to the disk file assignment problem. In Proceedings of the 1989 SIGMETRICS (May). ACM, New York, 1– 10. Received September 1995; revised October 1995; accepted October 1995 ACM Transactions on Computer Systems, Vol. 14, No. 1, February 1996.