Disconnected System JAMES J. KISTLER Carnegie Mellon Operation in the Coda File and M. SATYANARAYANAN University Disconnected operation data during temporary is a mode of operation that enables a client to continue accessing critical failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability. Categories and Subject Descriptors: D,4.3 [Operating Systems]: Systems]: Reliability— Management: —distributed file systems; D.4.5 [Operating D.4.8 [Operating Systems]: Performance —nzeasurements General Terms: Design, Experimentation, Additional Key Words and Phrases: reintegration, second-class replication, Measurement, Performance, Disconnected operation, server emulation File fault Systems tolerance; Reliability hoarding, optimistic replication, 1. INTRODUCTION Every serious user of a distributed system has faced situations work has been impeded by a remote failure. His frustration acute when his workstation is powerful has been configured to be dependent inst ante of such dependence Placing users, growing data and is the use of data in a distributed allows popularity them enough to be used on remote resources. file to delegate of distributed file from standalone, but An important a distributed system simplifies the administration systems where critical is particularly collaboration such as NFS of that file syst em. between data. [161 and AFS The [191 This work was supported by the Defense Advanced Research Projects Agency (Avionics Lab, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, 45433-6543 under Contract F33615-90-C-1465, ARPA Order 7597), National Science Foundation (PYI Award and Grant ECD 8907068), IBM Corporation (Faculty Development Award, Graduate Fellowship, and Research Initiation Grant), Digital Equipment Corporation (External Research Project Grant), and Bellcore (Information Networking Research Grant). Authors’ address: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. @ 1992 ACM 0734-2071/92/0200-0003 $01.50 ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, Pages 3-25. J J. Kistler and M, Satyanarayanan 4. attests to the compelling users of these systems critical juncture How may nature of these considerations. Unfortunately, have to accept the fact that a remote failure seriously can we improve the benefits when of a shared that repository inconvenience this state data repository, them. of affairs? Ideally, but is inaccessible. We the at a we would like be able to continue call the latter to enjoy critical mode work of operation disconnected operation, because it represents a temporary deviation from normal operation as a client of a shared repository. In this paper we show that disconnected operation in a file system is indeed feasible, efficient and usable. The central idea behind our work is that caching of exploited now data, to enhance widely used availability. to improve have We performance, implemented can tion in the Coda File System at Carnegie Mellon University. Our initial experience with Coda confirms the viability operation. We have to two days. successfully operated For a disconnection disconnected of this duration, and propagating changes typically takes 100MB has been adequate for us during 2. DESIGN Coda is servers. Unix and for an 1 clients The design academic half lasting one of reconnecting A local disk of of disconnection. that size should be OVERVIEW designed untrusted the process of about workday. be opera- of disconnected for periods about a minute. these periods Trace-driven simulations indicate that a disk adequate for disconnections lasting a typical also disconnected environment and a much is optimized research consisting smaller of a large number for the access and sharing environments. It is collection of trusted patterns specifically not of Unix file typical of intended for applications that exhibit highly Each Coda client has a local concurrent, fine granularity disk and can communicate data access. with the servers over a high bandwidth network. ily unable to communicate with At certain times, a client may be temporarsome or all of the servers. This may be due to a server or network failure, or due to the detachment from the network. Clients view Coda as a single, location-transparent shared tem. servers The Coda namespace larity of subtrees called is mapped volumes. At to individual each of a portable file client, a cache The first mechanism, replicas at more than server one allows replication, server. The set sys- ( Venus) to achieve volumes of replication high to have sites for is protocol based 1Unix its volume on callbacks is a trademark ACM Transactions storage of AT&T group ( VSG). [9] to guarantee Bell Telephone that subset a of a VSG that is accessible VSG ( A VSG). The performance currently accessible is a client’s cost of server replication is kept low by caching on disks at clients and through the use of parallel access protocols. Venus uses a cache coherence volume The file at the granu- manager dynamically obtains and caches volume mappings. Coda uses two distinct, but complementary, mechanisms availability. read-write Unix client an open file Labs on Computer Systems, Vol. 10, No 1, February 1992, yields its latest Disconnected Operation m the Coda File System . Icopy in the AVSG. This guarantee is provided by servers notifying when their cached copies are no longer valid, each notification being to as a ‘callback break’. all AVSG sites, Disconnected Modifications and eventually operation, the mechanism While Venus cache. services file system requests Since cache misses cannot application Venus propagates depicts a typical and disconnected Earlier discuss cantly this 3. DESIGN At a high used by disconnected, by relying solely on the contents of its be serviced or masked, they appear and and users. reverts involving When to server transitions [18, 19] have restricts replication disconnection replication. between server ends, Figure 1 replication its only our design described server attention in those to areas for disconnected replication disconnected where in depth. In operation. its presence We has signifi- operation. RATIONALE level, we empty. to operation. influenced First, scenario paper server becomes programs modifications Coda papers contrast, the clients referred in parallel AVSG takes to when are propagated to missing VSG sites. second high availability Coda, as failures effect in Coda 5 two wanted our system. factors to Second, use influenced our conventional, we wished strategy off-the-shelf to preserve for high availability. hardware throughout by seamlessly transparency inte- grating the high availability mechanisms of Coda into a normal environment. At a more detailed level, other considerations influenced our design. the advent of portable workstations, include the need to scale gracefully, These the resource, integrity, and very different clients and servers, and the need to strike about and We examine consistency. 3.1 each of these security a balance issues assumptions made availability between in the following Unix sections. Scalability Successful Coda’s distributed systems tend ancestor, AFS, had impressed rather than treating a priori, it to grow upon in size. Our experience us the need to prepare as an afterthought [17]. with for growth We brought this experience to bear upon Coda in two ways. First, we adopted certain mechanisms that enhance scalability. Second, we drew upon a set of general principles to guide our design choices. An example of a mechanism we adopted for scalability is callback-based whole-file caching, offers the cache coherence. Another such mechanism added occur advantage of a much simpler failure on an open, never on a read, write, substantially partial-file [1] would model: a cache seek, or close. miss This, can only in turn, simplifies the implementation of disconnected operation. A caching scheme such as that of AFS-4 [22], Echo [8] or MFS have complicated our implementation and made disconnected operation less transparent. A scalability principle that has had considerable of functionality on clients rather than the placing influence on our design servers. Only if integrity ACM Transactions on Computer Systems, Vol is or 10, No 1, February 1992 6. J J. Kistler and M Satyanarayanan ‘m _.___—l ACM TransactIons on Computer Systems, Vol. 10, No 1, February 1992 Disconnected security Another rapid would been compromised scalability principle change. Consequently, or agreement algorithms consensus Powerful, by large have we have adopted we have rejected numbers such as that on the current Portable 3.2 have Operation in the Coda File System of nodes. this principle. is the avoidance ofsystem-wide strategies that require election For example, used in Locus [23] that depend partition state of the network. lightweight and compact to observe how laptop a person we have on nodes avoided achieving computers are commonplace with in a shared uses such a machine. Typically, he identifies them from the shared file system into the When he returns, he copies substantially simplify use a different name propagate today. file system files back manual that into the caching, shared with disconnected file write-back operation could the use of portable clients. Users would not have to space while isolated, nor would they have to man- changes champion application The use of portable data files of interest and downloads local name space for use while modified system. Such a user is effectively performing upon reconnection! Early in the design of Coda we realized ually violated 7 Workstations It is instructive isolated. we o upon reconnection. Thus portable for disconnected operation. machines also gave us another machines insight. The are fact a that people are able to operate for extended periods in isolation indicates that they are quite good at predicting their future file access needs. This, in turn, suggests that it is reasonable cache management policy involuntary Functionally, to seek user for disconnected disconnections uoluntary disconnections from Hence Coda provides a single assistance in augmenting operation. caused by failures caused by unplugging mechanism to cope with are no different portable computers. all disconnections. Of course, there may be qualitative differences: user expectations as well extent of user cooperation are likely to be different in the two cases. First- vs. Second-Class 3.3 If disconnected Clients unattended as the Replication operation all? The answer to this assumptions made about the why is server replication question depends clients and servers is feasible, critically in Coda. on the needed very at different are like appliances: they can be turned off at will and may be for long periods of time. They have limited disk storage capacity, their software and hardware may be tampered with, and their owners may not be diligent about backing up the local disks. Servers are like public utilities: they have much greater disk capacity, they are physically secure, and they It are carefully is therefore servers, and monitored appropriate second-class and administered to distinguish replicas (i.e., cache by professional between copies) first-class on clients. staff. replicas on First-class replicas are of higher quality: they are more persistent, widely known, secure, available, complete and accurate. Second-class replicas, in contrast, are inferior along all these dimensions. Only by periodic revalidation with respect to a first-class replica can a second-class replica be useful. ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992. J. J. Kistler and M. Satyanarayanan 8. The function and of a cache scalability first-class coherence advantages replica. When protocol is to combine of a second-class disconnected, replica the quality with in for degradation. the face ability. Hence quency Whereas of failures, and server duration server replication in operation, because contrast, quality because operation, which little. it of data for reduces fre- viewed additional Whether avail- the is properly it requires costs of a replica the quality forsakes is important quality which it is contingent is the greater the poten- preserves operation of disconnected a measure of last resort. Server replication is expensive Disconnected replication disconnected the of the second-class may be degraded because the first-class replica upon inaccessible. The longer the duration of disconnection, tial the performance hardware. to use server replication or not is thus a trade-off between quality and cost. Coda permit a volume to have a sole server replica. Therefore, an installation rely exclusively on disconnected operation if it so chooses. 3.4 Optimistic vs. Pessimistic By definition, a network replica and all replica control to the design ing strategies, danger occurrence. A pessimistic all reads control by towards a disconnected choice strategy and provides conflict- deals them operation would of a cached reconnection. preclude reads much and of central avoids resolving disconnected would families or by restricting everywhere, detecting client strategy writes and writes second-class two [51, is therefore A pessimistic partitioned exclusive control such control until by a disconnected between optimistic An optimistic of conflicts client to acquire shared or disconnection, and to retain exclusive The operation. partition. approach between and pessimistic by permitting attendant exists by disallowing does can Control associates. of disconnected to a single availability partition its first-class operations and writes Replica as higher with after the their require a object prior Possession to of reading or writing at all other replicas. Possession of shared control would allow reading at other replicas, but writes would still be forbidden everywhere. Acquiring control prior to voluntary disconnection is relatively simple. It is more difficult when have to arbitrate disconnection among needed to make system cannot a wise predict is involuntary, multiple requesters. decision is not which requesters because Unfortunately, readily would available. actually the system may the information For use the example, object, the when they would release control, or what the relative costs of denying them access would be. Retaining control until reconnection is acceptable in the case of brief disconnections. But it is unacceptable in the case of extended disconnections. A disconnected client with shared control of an object would force the rest of the system to defer all updates until it reconnected. With exclusive control, it would even prevent other users from making a copy of the object. Coercing the client to reconnect ACM Transactions may not be feasible, since its whereabouts on Computer Systems, Vol. 10, No 1, February 1992 may not be Disconnected Operation in the Coda File System . 9 known. Thus, an entire user community could be at the mercy of a single errant client for an unbounded amount of time. Placing a time bound on exclusive or shared control, as done in the case of leases [7], avoids this problem but introduces others. Once a lease expires, a disconnected client loses the ability to access a cached object, even if no one else in the system disconnected already is interested operation made while An optimistic disconnected which in it. is to provide disconnected approach client This, have in turn, high conflict the availability. purpose Worse, of updates to be discarded. has its own disadvantages. may defeats with An update an update at another made at one disconnected or connected client. For optimistic replication to be viable, the system has to be more sophisticated. There needs to be machinery in the system for detecting conflicts, age and manually for automating resolution when possible, and for confining dampreserving evidence for manual repair. Having to repair conflicts violates transparency, is an annoyance to users, and reduces the usability of the system. We chose optimistic replication because we felt that its strengths and weaknesses better matched our design goals. The dominant influence on our choice an was the low degree optimistic strategy of write-sharing was likely to typical lead to of Unix. This relatively few implied that conflicts. An optimistic strategy was also consistent with our overall goal of providing the highest possible availability of data. In principle, we could have chosen a pessimistic strategy for server replication even after choosing an optimistic strategy for disconnected But that would have reduced transparency, because a user would the anomaly of being able to update data when disconnected, unable to do so when the previous replication. Using system connected arguments an optimistic from strategy the user’s data in his accessible everyone else in that set of shrinks updates universe universe. contact. universe become 4. DETAILED throughout strategy presents At any time, Further, also apply a uniform visible model and his updates are immediately His accessible universe is usually of of the visible to the entire When failures occur, his accessible universe he can contact, and the set of clients that they, in throughout his now-enlarged disconnected, reconnection, accessible his his universe. AND IMPLEMENTATION In describing our implementation client since this is where much the physical structure of Venus, and Sections many to server he is able to read the latest In the limit, when he is operating consists of just his machine. Upon DESIGN tion of the server Section 4.5. of the servers. of an optimistic perspective. servers and clients. to the set of servers turn, can accessible to a subset in favor operation. have faced but being of disconnected of the complexity of a client, 4.3 to 4.5 support needed operation, we focus on the lies. Section 4.1 describes Section 4.2 introduces the major states discuss these states in detail. A descripfor disconnected ACM Transactions operation is contained in on Computer Systems, Vol. 10, No. 1, February 1992. 10 . J. J Kistler and M Satyanarayanan IT — Venus I 1 Fig.2. 4.1 Client Because than Structure ofa Coda client. Structure of the part but debug. Figure Venus complexity of Venus, of the kernel. mance, would The latter have been 2 illustrates intercepts Unix we made approach less portable the high-level file system it a user-level may have access, are handled A system disconnected MiniCache. If possible, the returned to the application. service returns operation entirely by Venus. call on a Coda object calls via or server is forwarded from our for good performance Venus Logically, reintegration. Venus always rather better perfor- more difficult to widely used Sun Vnode performance overhead MiniCache to filter contains no support replication; by the these Vnode on out for functions interface to the call is serviced by the MiniCache and control Otherwise, the MiniCache contacts Venus is to the call. This, in turn, may involve contacting Coda servers. Control from Venus via the MiniCache to the application program, updating Measurements 4.2 process of a Coda client. the MiniCache state as a side effect. MiniCache initiated by Venus on events such as callback critical yielded and considerably structure interface [10]. Since this interface imposes a heavy user-level cache managers, we use a tiny in-kernel many kernel-Venus interactions. The MiniCache remote to Coda servers implementation state changes breaks from confirm that the may Coda also be servers. MiniCache is [211, States Venus operates in one of three Figure 3 depicts these states is normally on the alert emulation, between and them. in the hoarding state, relying on server replication but for possible disconnection. Upon disconnection, it enters the emulation state and remains Upon reconnection, Venus enters ACM Transactions states: hoarding, and the transitions on Computer Systems, Vol there for the the reintegration 10, No 1, February duration state, 1992 of disconnection. desynchronizes its Disconnected Operation in the Coda File System . 11 r) Hoarding Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It transmits to reintegration upon successful reconnection to an AVSG member, and thence to hoarding, where it resumes connected operation. cache with its AVSG, and then all volumes may not be replicated be in different states with respect conditions in the system. 4.3 reverts to across the to different the hoarding state. same set of servers, volumes, depending Since Venus can on failure Hoarding The hoarding state state is to hoard not its only is so named useful data responsibility. because Rather, reference predicted with behavior, manage is its cache in a manner and disconnected operation. a certain set of files is critical the implementation especially and reconnection –The true cost of a cache hard to quantify. an object this must For inbut may in of hoarding: the distant future, cannot be certainty. –Disconnections —Activity in this However, files. To provide good performance, Venus must to be prepared for disconnection, it must also cache the former set of files. Many factors complicate —File of Venus of disconnection. Venus that balances the needs of connected stance, a user may have indicated that currently be using other cache the latter files. But a key responsibility in anticipation at other clients miss must are often while unpredictable. disconnected be accounted is highly for, so that variable the latest and version of is in the cache at disconnection. —Since cache space is finite, the availability of less critical to be sacrificed in favor of more critical objects. ACM Transactions objects may have on Computer Systems, Vol. 10, No. 1, February 1992. 12 J J. Kistler and M Satyanarayanan . # a a a Personal files /coda /usr/jjk d+ /coda /usr/jjk/papers /coda /usr/jjk/pape # # a a a a a a a a a a a 100:d+ rs/sosp 1000:d+ # a a a a a System files /usr/bin 100:d+ /usr/etc 100:d+ /usr/include 100:d+ /usr/lib 100:d+ /usr/local/gnu d+ a /usr/local/rcs d+ a /usr/ucb d+ XII files (from Xll maintainer) /usr/Xll/bin/X /usr/Xll/bin/Xvga /usr/Xll/bin/mwm /usr/Xl l/ bin/startx /usr/Xll /bin/x clock /usr/Xll/bin/xinlt /usr/Xll/bin/xtenn /usr/Xll/i nclude/Xll /bitmaps c+ /usr/Xll/ lib/app-defaults d+ /usr/Xll /lib/font s/mist c+ /usr/Xl l/lib/system .mwmrc (a) # # a a a Venus source files (shared among Coda developers) /coda /project /coda /src/venus 100:c+ /coda/project/coda/include 100:c+ /coda/project/coda/lib c+ (c) Fig. 4. Sample hoard profiles. These are typical hoard profiles provided hy a Coda user, an application maintainer, and a group of project developers. Each profile IS interpreted separately by the HDB front-end program The ‘a’ at the beginning of a line indicates an add-entry command. Other commands are delete an entry, clear all entries, and list entries. The modifiers following some pathnames specify nondefault priorities (the default is 10) and/or metaexpansion for the entry. ‘/coda’. To Note that address the pathnames these concerns, and periodically cache via a process known algorithm, 4.3.1 plicit Prioritized sources Cache manage Management. of information of interest front-end we with reevaluate which walking. as hoard rithm. The implicit information traditional caching algorithms. hoard database per-workstation fying objects A simple beginning in its ‘/usr’ the (HDB), cache objects Venus priority-based consists Explicit are actually symbolic using merit a links into prioritized retention in the combines implicit and excache management algo- of recent reference history, as in information takes the form of a whose entries are pathnames identi- to the user at the workstation. program allows a user to update the HDB using hoard profiles, such as those shown in Figure command scripts called Since hoard profiles are just files, it is simple for an application maintainer 4. to provide a common profile for his users. or for users collaborating on a project to maintain a common profile. A user can customize his HDB by specifying different combinations of profiles or by executing front-end commands interactively. To facilitate construction of hoard profiles, Venus can record all file references observed between a pair of start and stop events indicated by a user. To reduce the verbosity of hoard profiles and the effort needed to maintain them, Venus if the letter ACM Transactions meta-expansion of HDB supports ‘c’ (or ‘d’) follows a pathname, on Computer Systems, Vol entries. As shown in Figure the command also applies 10, No, 1, February 1992 4, to Disconnected immediate children (or all Operation in the Coda File System descendants). A ‘ +‘ following the . 13 ‘c’ or ‘d’ indi- cates that the command applies to all future as well descendants. A hoard entry may optionally indicate as present children or priority, with a hoard higher priorities indicating more critical The current priority of a cached object of its hoard well as a metric representing ously in response no longer in the victims when To resolve imperative usage. The latter priority is updated all ensure the ancestors that a cached of the object directory while of objects chosen as disconnected, also be cached. is not as continu- to new references, and serves to age the priority working set. Objects of the lowest priority are cache space has to be reclaimed. the pathname of a cached object that therefore recent objects. is a function purged it is Venus must any of its before hierarchical cache management is not needed in tradidescendants. This tional file caching schemes because cache misses during name translation can be serviced, albeit at a performance cost. Venus performs hierarchical cache management by assigning infinite priority to directories with children. This automatically forces replacement to occur bottom-up. 4.3.2 Hoard Walking. say that We a cache is in cached signifying equilibrium, that it meets user expectations about availability, when no uncached object has a higher priority than a cached object. Equilibrium maybe disturbed as a result the of normal cache activity. mentioned its priority equilibrium, as a hoard walk. implementation, bindings replacing an suppose object, B. an object, A, is brought Further suppose restores but one disconnection. of HDB equilibrium A hoard walk may The entries occurs be by performing every explicitly walk occurs are reevaluated Coda clients. For example, new tory whose pathname is specified required in an operation 10 minutes by in is First, to reflect update activity been created in the HDB. a key characteristic data is refetched only a strategy may result current a user phases. children may have with the ‘ +‘ option callback-based caching, break. But in Coda, such being unavailable should it. Refetching immediately ignores B known our two priorities of all entries in the cache and HDB are reevaluated, fetched or evicted as needed to restore equilibrium. Hoard walks also address a problem arising from callback traditional callback into that in the HDB, but A is not. Some time after activity on A ceases, will decay below the hoard priority of B. The cache is no longer in than the uncached since the cached object A has lower priority object B. Venus periodically voluntary For example, on demand, prior the to name by other in a direcSecond, the and objects breaks. on demand in a critical In after a object a disconnection occur before the next reference to upon callback break avoids this problem, but of Unix it is likely to be modified many interval [15, 6]. An immediate environments: once an object is modified, more times by the same user within a short refetch policy would increase client-server traffic considerably, thereby reducing scalability. Our strategy is a compromise that balances scalability. For files and symbolic links, Venus ACM Transactions availability, consistency and purges the object on callback on Computer Systems, Vol. 10, No. 1, February 1992 14 . break, and refetches occurs earlier. J. J. Klstler and M. Satyanarayanan it on demand If a disconnection or during were the next to occur hoard before walk, whichever refetching, the object would be unavailable. For directories, Venus does not purge on callback break, but marks the cache entry suspicious. A stale cache entry is thus available should a disconnection occur before the next hoard walk or reference. The callback entry acceptability semantics. y of stale A callback has been added to or deleted other directory entries fore, saving the stale nection causes directory break and copy consistency the and data follows on a directory from from typically the directory. its particular means It is often that objects they name are unchanged. using it in the event of untimely to suffer only a little, but increases actions normally an the case that Therediscon- availability considerably. 4.4 Emulation In the emulation servers. For semantic state, example, checks. Venus performs Venus now many assumes It is also responsible full responsibility for generating handled by for access and temporary file identi- (fids) for new objects, pending the assignment of permanent fids at updates reintegration. But although Venus is functioning as a pseudo-server, accepted by it have to be revalidated with respect to integrity and protection by real servers. This follows from the Coda policy of trusting only servers, not clients. To minimize unpleasant delayed surprises for a disconnected user, it fiers behooves Venus Cache algorithm management used during cache freed ity entries to be as faithful of the immediately, so that default request 4.4.1 they objects but are not version in its emulation. is involved. Cache of other modified those purged before done with operations entries the same priority directly update the of deleted objects assume reintegration. objects infinite On a cache are prior- miss, the behavior of Venus is to return an error code. A user may optionally Venus to block his processes until cache misses can be serviced. Logging. During emulation, to replay update activity when in a per-volume log of mutating contains as possible during emulation hoarding. Mutating a copy state of the corresponding of all objects Venus records sufficient information it reinteWates. It maintains this information log. Each log entry operations called a replay referenced system call arguments as well as the by the call. Venus uses a number of optimizations to reduce log, resulting in a log size that is typically a few the length of the replay percent of cache size. A small log conserves disk space, a critical resource during periods of disconnection. It also improves reintegration performance by reducing latency and server load. One important optimization to reduce log length pertains to write operations on files. Since Coda uses whole-file caching, the close after an open of a file for modification installs a completely new copy of the file. Rather than logging the open, close, and intervening write operations individually, Venus logs a single store record during the handling of a close, Another optimization for a file when a new consists of Venus discarding a previous store record one is appended to the log. This follows from the fact ACM TransactIons on Computer Systems, Vol. 10, No. 1, February 1992, Disconnected that a store record renders all Operation in the Coda File System previous does not contain versions of a file a copy of the file’s . superfluous. contents, but 15 The merely store points to the copy in the cache. We are length currently of the replay the previous earlier would implementing two log. generalizes paragraph The such that operations may be the canceling second optimization the inverting and own log record 4.4.2 chine further as well cancel the of a store as that to reduce optimization which the described in the effect overwrites of corresponding log records. An example by a subsequent unlink or truncate. The operations to cancel both of inverse For example, a rmdir may cancel its of the corresponding A disconnected user and continue where a shutdown optimizations the any operation exploits knowledge inverted log records. Persistence. after first mkdir. must be able he left to restart his ma- off. In case of a crash, the amount of data lost should be no greater than if the same failure occurred during connected operation. To provide these guarantees, Venus must keep its cache and related data structures in nonvolatile storage. Mets-data, consisting of cached directory and symbolic link contents, status blocks for cached objects of all types, replay logs, and the HDB, is mapped virtual memory (RVM). Transacinto Venus’ address space as recoverable tional access to this Venus. memory The actual contents local Unix files. The use of transactions is supported of cached to by the RVM files library are not in RVM, manipulate meta-data [13] linked but into are stored simplifies Venus’ as job enormously. To maintain its invariants Venus need only ensure that each transaction takes meta-data from one consistent state to another. It need not be concerned with crash recovery, since RVM handles this transparently. If we had files, chosen we synchronous RVM trol over the would writes supports the obvious have An alternative to follow and an ad-hoc local, basic serializability. had non-nested recovery RVM’S write-ahead bounded provides flushes. Venus exploits log from persistence, the allows commit latency a synchronous the application time to time. where the capabilities and of atomicity, reduce thereby avoiding commit as no-flush, persistence of no-flush transactions, of RVM in local accessible when emulating, When bound con- permanence and by the labelling write to disk. To ensure must explicitly flush used in this manner, RVM is the period between log to provide Venus is more the log more frequently. This lowers performance, data lost by a client crash within acceptable limits. 4.4.3 volatile Resource storage Exhaustion. during emulation. It is possible The timed independent good performance constant level of persistence. When hoarding, Venus initiates log infrequently, since a copy of the data is available on servers. Since are not Unix of carefully algorithm. properties can meta-data discipline transactions transactional application of placing a strict two for conservative but Venus significant keeps at a flushes servers and flushes the to exhaust instances amount its of this of nonare ACM ‘l!ransactions on Computer Systems, Vol. 10, No. 1, February 1992 16 . J. J. Kistler and M. Satyanarayanan the file allocated cache becoming filled to replay logs becoming with full. modified files, Our current implementation is not very graceful tions. When the file cache is full, space can be freed modified files. reintegrate always When ion log space is full, has been no further performed. and the in handling by truncating mutations Of course, RVM non-rout these situaor deleting are allowed at ing space until operations are allowed. We plan emulating. to explore at least three alternatives One possibility is to compress file to free up disk space while cache and RVM contents. Compression trades off computation time for space, and recent work [21 has shown it to be a promising tool for cache management. A second possibility is to allow users to selectively back out updates made while disconnected. A third approach is to allow out to removable 4.5 media portions of the such as floppy file cache and RVM to be written disks. Reintegration Reintegration is a transitory state through which Venus passes in changing roles from pseudo-server to cache manager. In this state, Venus propagates changes made during emulation, and updates its cache to reflect current server state. Reintegration is performed a volume at a time, with all update activity in the volume 4.5.1 Replay suspended in two for and This new objects step steps. in many independently executed. protection, to replace cases, of changes step, Venus temporary since Venus from obtains fids client in the obtains to AVSG permanent a small fids replay log. supply of fids in advance of need, while in the hoarding state. In the second replay log is shipped in parallel to the AVSG, and executed at each member. single transaction, which The replay algorithm parsed, locked. In the first uses them is avoided permanent step, the completion. The propagation Algorithm. is accomplished until Each server performs the is aborted if any error is detected. consists of four phases. In phase replay one within the log a is a transaction is begun, and all objects referenced in the log are In phase two, each operation in the log is validated and then The validation consists of conflict detection as well as integrity, and disk space checks. Except in the case of store operations, execution during replay is identical to execution in connected mode. For a file is created and meta-data is updated to reference store, an empty shadow it, but the data transfer is deferred. Phase three consists exclusively of performing these data transfers, a process known as bac?z-fetching. The final phase commits the transaction and releases all locks. If reinte~ation succeeds, Venus frees the replay log and resets the priority of cached objects referenced by the log. If reintegration fails, Venus writes file in a superset of the Unix tar format. out the replay log to a local replay The log and all corresponding cache entries are then purged, so that subsequent references will cause refetch of the current contents at the AVSG. A tool is provided which allows the user to inspect the contents of a replay file, compare entirety. it to the ACM Transactions state at the AVSG, and replay on Computer Systems, Vol. 10, No. 1, February it 1992, selectively or in its Disconnected Operation in the Coda File System Reintegration at finer granularity than perceived by clients, improve concurrency reduce user effort during manual replay. implementation to reintegrate dent within operations using the precedence Venus will pass them to servers along The tagged with with operations clients. conflicts. for conflicts a storeid of subsequences subsequences of Davidson precedence the replay The relies that only Read/ system model, since it of a single system call. check granularity graphs of depen- can be identified [4]. In the revised during emulation, imple - and log. Our use of optimistic replica control means that of one client may conflict with activity at servers Handling. or other disconnected with are write / write Unix file boundary maintain 17 a volume would reduce the latency and load balancing at servers, and To this end, we are revising our Dependent approach graph mentation ConfZict 4.5.2 the disconnected at the a volume. . class of conflicts we are concerned conflicts are not relevant to the write has no on the fact uniquely notion that identifies of atomicity each replica the last beyond the of an object update to it. is During phase two of replay, a server compares the storeid of every object mentioned in a log entry with the storeid of its own replica of the object. If the comparison the mutated indicates equality for all objects, the operation objects are tagged with a new storeid specified is performed and in the log entry. If a storeid comparison fails, the action taken depends on the operation being validated. In the case of a store of a file, the entire reintegration is aborted. But for directories, a conflict is declared only if a newly created name collides with an existing name, if an object updated at the client or the server has been deleted by the other, or if directory attributes have been modified at the directory updates server and the is consistent client. with This our strategy strategy and was originally suggested by Locus [23]. Our original design for disconnected operation replay files at servers rather than clients. This damage forcing to be manual confined repair, by marking conflicting as is currently of resolving in server Today, laptops [11], called for preservation of approach would also allow replicas inconsistent done in the case of server We are awaiting more usage experience to determine the correct approach for disconnected operation. 5. STATUS partitioned replication whether and replication. this is indeed AND EVALUATION Coda runs on IBM RTs, Decstation such as the Toshiba 5200. A small 3100s and 5000s, and 386-based user community has been using Coda on a daily basis as its primary data repository since April 1990. All development work on Coda is done in Coda itself. As of July 1991 there were nearly 350MB of triply replicated data in Coda, with plans to expand to 2GB in the next A version few months. of disconnected operation with strated in October 1990. A more complete 1991, and is now in regular use. We have for periods lasting one to two days. ACM Transactions Our minimal functionality version was functional successfully operated experience with on Computer Systems, Vol was demonin January disconnected the system has been 10, No 1, February 1992. 18 . J. J. Kistler and M. Satyanarayanan Table I. Elapssd Time (seconds) Andrew Make Reintegration Totaf Time AllocFid Size of Replay (xconds) Replay con Records Log Data Back-Fetched (Bytes) Bytes 288 (3) 43 (2) 4 (2) 29 (1) 10 (1) 223 65,010 1,141,315 3,271 (28) 52 (4) 1 (o) 40 (1) 10 (3) 193 65,919 2,990,120 Benchmark Venus Time for Reintegration Note: This data was obtained with a Toshiba T5200/100 client (12MB memory, 100MB disk) reintegrating over an Ethernet with an IBM RT-APC server (12MB memory, 400MB disk). The values shown above are the means of three trials, Figures m parentheses are standard deviations. quite positive, and we are confident that the refinements under will result in an even more usable system. In the following sections we provide qualitative and quantitative to three important questions pertaining to disconnected operation. (1) How long (2) How large (3) How likely 5.1 Duration does reintegration a local disk development answers These are: take? does one need? are conflicts? of Reintegration In our experience, development lasting To characterize typical disconnected a few hours required reintegration speed more sessions of editing and program about a minute for reintegration. precisely, we measured the reinte- gration times after disconnected execution of two well-defined tasks. The first task is the Andrew benchmark [91, now widely used as a basis for comparing file system performance. The second task is the compiling and linking of the current version of Venus. Table I presents the reintegration times for these tasks. The time for reintegration consists of three components: the time to allocate permanent fids, the time for the replay at the servers, and the time for the second phase of the update protocol used for server replication. The first component will be zero for many disconnections, due to the preallocation of fids during hoarding. We expect the time for the second component to fall, considerably in many cases, as we incorporate the last of the replay log optimizations described in Section 4.4.1. The third component can be avoided only if server replication is not used. One can make some interesting secondary observations from Table 1. First, the total time for reinte~ation is roughly the same for the two tasks even though the Andrew benchmark has a much smaller elapsed time. This is because the Andrew benchmark uses the file system more intensively. Second, reintegration for the Venus make takes longer, even though the number of entries in the replay log is smaller. This is because much more file data is ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992 Disconnected Operation in the Coda File System o 19 7J50$ — – – ----- 540 g -cl ~ m 30 s ~ Max Avg Min s 20 & .Q) z 10 - —--——— /“-- ......... . . . . . . . . . . ------0 4 2 ‘“ 12 10 Time (hours) 8 6 ,. . . ..-. -. Fig. 5. High-water mark of Cache usage. This graph is based on a total of 10 traces from 5 active Coda workstations. The curve labelled “Avg” corresponds to the values obtained by “Max” and “ Min” plot averaging the high-water marks of all workstations. The curves labeled the highest and lowest values of the high-water marks across all workstations. Note that the high-water mark does not include space needed for paging, the HDB or replay logs. back-fetched in the third phase of the replay. Finally, neither any think time. As a result, their reintegration times are task involves comparable to that session after a much longer, but more typical, disconnected in our environment. 5.2 Cache Size A local disk capacity of 100MB on our clients has proved initial sessions of disconnected operation. To obtain a better the cache size requirements for disconnected operation, reference menting traces from workstations adequate for understanding we analyzed our environment. The traces were obtained to record information on every file system by instruoperation, regardless of whether the file was in Coda, AFS, or the local file system. Our analysis is based on simulations driven by these traces. Writing validating Venus Venus driven a simulator that precisely models the complex caching our of file and behavior of would be quite difficult. To avoid this difficulty, we have modified to act as its own simulator. When running as a simulator, Venus is by traces rather than requests from the kernel. Code to communicate with the servers, as well as code to perform system are stubbed out during simulation. Figure 5 shows The actual disk both the explicit From our data operating the high-water size needed mark physical of cache for disconnected 1/0 usage operation on the as a function for a typical ACM Transactions workday. Of course, user file of time. has to be larger, and implicit sources of hoarding information it appears that a disk of 50-60MB should disconnected local since are imperfect. be adequate for activity that is on Computer Systems, Vol. 10, No. 1, February 1992. 20 J. J. Kkstler and M. Satyanarayanan ● drastically significantly We plan First, different from what different results. to extend our work we will investigate disconnection. Second, cache we will was recorded in on trace-driven our could simulations size requirements be sampling traces in three for much a broader longer range traces from many more machines in our environment. the effect of hoarding by simulating traces together profiles have 5.3 Likelihood been specified virtually different no network Second, server conflicts replication due to While partitions. perhaps grows will lead to many To obtain data mented 400 the AFS computer program cant conflicts will larger. Third, in Coda for nearly multiple gratifying servers science amount faculty, the same kind mate frequent of usage conflicts a user would modifies an at assumptions, file or we instru- for research, includes a signifi- is descended were user are used by over we can use this be if Coda users are system. Coda scale, profile Coda in is disconnections students usage Since AFS larger graduate Their as the servers we have an object observation voluntary These and activity. only extended of conflicts staff a year, updating us, this a problem perhaps and education. and makes users to is possible that our with an experimental in our environment. of collaborative environment. Every time become more conflicts. on the likelihood development, how Third, we with hoard by users. ex ante subject to at least three criticisms. First, it being cautious, knowing that they are dealing community of activity of Conflicts In our use of optimistic seen ways. periods of user by obtaining will evaluate that produce to replace directory, we from AFS data to esti- AFS in our compare his identity with that of the user who made the previous mutation. We also note the time interval between mutations. For a file, only the close after an open for update is counted as a mutation; individual write operations are not counted. For directories, as mutations. Table II presents is classified by all operations our observations volume type: user that modify over a period volumes a directory of twelve containing are counted months. private The data user data, volumes used for collaborative work, and system volumes containing program binaries, libraries, header files and other similar data. On average, project a project volume has about volume has about 1600 files 2600 files and 280 and 130 directories. directories, and a system User volumes tend to be smaller, averaging about 200 files and 18 directories, place much of their data in their project volumes. Table II shows that over 99% of all modifications because were users by the often previous writer, and that the chances of two different users modifying the same object less than a day apart is at most O.75%. We had expected to see the highest degree of write-sharing on project files or directories, and were surprised to see that it actually occurs on system files. We conjecture that a significant fraction of this sharing arises from modifications to system files by operators, who change shift periodically. If system files are excluded, the absence of ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992 Disconnected Operation in the Coda File System . 2I m “E! o v w ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992 22 J. J. Kistler and M. Satyanarayanan . write-sharing the previous is even more striking: more than 99.570 of all mutations are by writer, and the chances of two different users modifying the same object within ing from the point would not environment. be 6. RELATED a week are less than 0.4~o! This of view of optimistic replication. a serious problem if were replaced by Coda in our WORK Coda is unique in that it exploits availability while preserving a high no other AFS data is highly encouragIt suggests that conflicts system, published caching for both performance and high degree of transparency. We are aware of or unpublished, that duplicates this key aspect of Coda. By providing system [20] since this hoarding, tools to link provided local and rudimentary was not transparent its remote support name for spaces, disconnected the Cedar operation. file But primary goal, Cedar did not provide support reintegz-ation or conflict detection. Files were for ver- sioned and immutable, and a Cedar cache manager could substitute a cached version of a file on reference to an unqualified remote file whose server was inaccessible. However, the implementors of Cedar observe that this capability was not often exploited since remote files were normally referenced by specific Birrell version and availability more recent number. Schroeder pointed out the possibility of “stashing” data for in an early discussion of the Echo file system [141. However, a description of Echo [8] indicates that it uses stashing only for the highest levels of the naming hierarchy. The FACE file system [3] uses stashing but does not integrate it with caching. The lack of integration has at least three negative consequences. First, it reduces transparency because users and applications deal with two different name spaces, with different consistency properties. Second, utilization of local disk space is likely to be much worse. Third, recent information from cache management is not available to manage the The available literature integration detracted An application-specific the PCMAIL manipulate nize with on FACE from system at MIT [12]. existing mail messages a central does not report on how the usability of the system. form of disconnected operation repository much the usage stash. lack of was implemented in PCMAIL allowed clients to disconnect, and generate new ones, and desynchro- at reconnection. Besides relying heavily on the semantics of mail, PCMAIL was less transparent than Coda since it required manual resynchronization as well as preregistration of clients with servers. The use of optimistic replication in distributed file systems was pioneered by Locus [23]. Since Locus used a peer-to-peer model rather than a clientserver model, availability was achieved solely through server replication. There was no notion of caching, and hence of disconnected operation. Coda has transparency benefited in a general sense and performance in distributed from file the large body of work on systems. In particular, Coda owes much to AFS [191, from which it inherits its model of trust integrity, as well as its mechanisms and design philosophy for scalability. ACM TransactIons on Computer Systems, Vol. 10, No. 1, February 1992. and Disconnected 7. FUTURE section operation sections optimization, additional . 23 development. In WORK Disconnected earlier Operation m the Coda File System in of this granularity work is also we consider broadened. An excellent Coda paper Explicit dreds or thousands under work active in progress in the areas of log of reintegration, and evaluation of hoarding. being done at lower levels of the system. two ways opportunity Unix. is a facility we described in exists transactions which and scope in Coda for adding become of nodes, the more the work transactional desirable informal of our as systems concurrency In Much this may be support to scale control to hunof Unix becomes less effective. Many of the mechanisms supporting disconnected operation, such as operation logging, precedence graph maintenance, and conflict checking would transfer directly to a transactional system using optimistic concurrency control. Although transactional file systems are not a new idea, no such system with the scalability, availability and performance properties of Coda has been proposed A different opportunity exists in nected in operation, low bandwidth. lines, or that mask failures, connectivity. partial as provided But opportunities aggressive file where Coda to support connectivity weakly con- is intermittent or of Such conditions are found in networks that rely on voice-grade use wireless technologies such as packet radio. The ability to weak cation more environments or built. extended by disconnected techniques which operation, exploit is of value and adapt at hand write-back are also needed. Such techniques policies, compressed network and caching at intermediate levels. transfer even with to the communi may - include transmission, 8. CONCLUSION Disconnected operation preload cache one’s connection, log all reconnect ion. Implementing is a tantalizingly with critical changes made disconnected simple data, continue while operation idea. disconnected is not are dynamic Only in hindsight traditional caching of a lifeline choices to be made about one has to do is to operation and replay so simple. modifications and careful attention to detail in many agement. While hoarding, a surprisingly large volume lated state has to be maintained. When emulating, integrity of client data structures become critical. there All normal It until them involves disupon major aspects of cache manand variety of interrethe persistence and During reintegration, the granularity of reintegration. do we realize the extent to which implementations of schemes have been simplified by the guaranteed presence to a first-class replica. Purging and refetching on demand, a strategy often used to handle pathological situations in those implementations, is not viable when supporting disconnected operation. However, the obstacles to realizing disconnected operation are not insurmountable. Rather, the central message of this paper is that disconnected operation is indeed feasible, efficient and usable. One way to view our work is to regard it as an extension of the idea of write-back caching. Whereas write-back caching has hitherto been used for ACM TransactIons on Computer Systems, Vol. 10, No 1, February 1992. 24 J. J. Kistler and M. Satyanarayanan . performance, failures too. transitions system. remote we have A broader between shown that it can be extended to mask temporary view is that disconnected operation allows graceful states of autonomy and in a distributed interdependence Under favorable conditions, our approach provides all the benefits of data access; under unfavorable conditions, it provides continued access to critical data. We become increasingly important and vulnerability y. are certain that disconnected operation will as distributed systems grow in scale, diversity ACKNOWLEDGMENTS We wish to thank Lily and postprocessing Mummert the file for her reference traces invaluable assistance used in Section Varotsis, who helped instrument the AFS servers which yielded ments of Section 5.3. We also wish to express our appreciation present contributors Mashburn, Maria to the Okasaki, Coda project, and David especially in collecting 5.2, and Dimitris Puneet the measureto past and Kumar, Hank Steere. REFERENCES 1. BURROWS, M. Efficient data sharing, PhD thesis, Univ. of Cambridge, Computer Laboratory, Dec. 1988. 2. CAT~, V., AND GROSS,T. Combinmg the concepts of compression and caching for a two-level of the 4th ACM Symposium on Architectural Support for file system. In Proceedings Programm mg Languages and Operating Systems (Santa Clara, Calif., Apr. 1991), pp 200-211. 3. COVA, L. L. Resource management in federated computing environments. PhD thesis, Dept. of Computer Science, Princeton Univ., Oct. 1990. 4. DAVIDSON, S. B. Optimism and consistency in partitioned distributed database systems. ACM Trans. Database Syst. 9, 3 (Sept. 1984), 456-481. networks. 5. DAVIDSON, S. B., GARCIA-M• LINA, H., AND SKEEN, D, Consistency in partitioned ACM Comput. Suru. 17, 3 (Sept. 1985), 341-370. in distributed file systems. Tech. Rep. TR 272, Dept. of 6. FLOYD, R. A. Transparency Computer Science, Univ. of Rochester, 1989. fault-tolerant mechanism for 7. GRAY, C. G., AND CHERITON, D. R. Leases: An efficient of the 12th ACM Symposwm on Operating distributed file cache consistency. In Proceedings System Principles (Litchfield Park, Ariz., Dec. 1989), pp. 202-210. Availability and 8. HISGEN, A., BIRRELL, A., MANN, T., SCHROEDER, M., AND SWART, G of the Second consistency tradeoffs in the Echo distributed file system. In Proceedings Workshop on Workstation Operating Systems (Pacific Grove, Calif., Sept. 1989), pp. 49-54. M., 9. HOWARD, J. H., KAZAR, M. L.j MENEES, S, G., NICHOLS, D. A., SATYANARAYANAN, SIDEBOTHAM, R. N., AND WEST, M. J. Scale and performance in a distributed file system. 10. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 51-81. KLEIMAN, S. R. Vnodes: An architecture for multiple Summer 11. KUMAR, system, 1991. Usenix P., Conference Proceedings AND SATYANARAYANAN, Tech. Rep. CMU-CS-91-164, tile system types in Sun UNIX. In (Atlanta, Ga., June 1986), pp. 238-247, M. Log-based directory resolution in the Coda file School of Computer Science, Carnegie Mellon Univ., mail system for personal computers. DARPA-Internet 12. LAMBERT, M. PCMAIL: A distributed RFC 1056, 1988, memory user man13. MASHBURN, H., AND SATYANARAYANAN, M. RVM: Recoverable virtual ual. School of Computer Science, Carnegie Mellon Univ., 1991. ACM Transactions on Computer Systems, Vol. 10, No 1, February 1992 Disconnected Operation in the Coda File System . 25 R. M., AND HERBERT, A. J. Report on the Third European SIGOPS Workshop: “Autonomy or Interdependence in Distributed Systems.” SZGOPS Reu. 23, 2 (Apr. 1989), 3-19. OUSTERHOUT,J., DACOSTA, H., HARRISON, D., KUNZE, J., KUPFER, M., AND THOMPSON, J. A trace-driven analysis of the 4.2BSD file system. In Proceedings of the 10th ACM Symposium on Operating System FYinciples (Orcas Island, Wash., Dec. 1985), pp. 15-24. SANDBERG, R., GOLDBERG,D., KLEIMAN, S., WALSH, D., AND LYON, B. Design and implemenUsenix Conference Proceedings (Portland, tation of the Sun network filesystem. In Summer Ore., June 1985), pp. 119-130. of SATYANARAYANAN, M. On the influence of scale in a distributed system. In Proceedings the 10th International Conference on Software Engineering (Singapore, Apr. 1988), pp. 10-18. SATYANARAYANAN, M., KISTLER, J. J., KUMAR, P., OKASAKI, M. E., SIEGEL, E. H., AND STEERE, D. C. Coda: A highly available file system for a distributed workstation environment. IEEE Trans. Comput. 39, 4 (Apr. 1990), 447-459. SATYANARAYANAN, M. Scalable, secure, and highly available distributed file access. IEEE Comput. 23, 5 (May 1990), 9-21. 14. NEEDHAM, 15. 16. 17. 18. 19. D., GIFFORD, D. K., AND NEEDHAM, R. M. A caching file system for a of the 10th ACM Symposium on Operating workstation. In Proceedings System Principles (Orcas Island, Wash., Dec. 1985), pp. 25-34. 21. STEERE, D. C., KISTLER, J. J., AND SATYANARAYANAN, M. Efficient user-level cache file Usenix Conference Proceedings management on the Sun Vnode interface. In Summer (Anaheim, Calif., June, 1990), pp. 325-331. 20. SCHROEDER, M. programmer’s 22. Decorum File System, 23. WALKER, B., POPEK, operating system. In ples (Bretton Woods, Transarc Corporation, Jan. 1990. G., ENGLISH, R., KLINE, C., AND THIEL, G. The LOCUS distributed Proceedings of the 9th ACM Symposium on Operating System PrinciN. H., Oct. 1983), pp. 49-70. Received June 1991; revised August 1991; accepted September ACM Transactions 1991 on Computer Systems, Vol. 10, No. 1, February 1992.