Lightweight M. Satyanarayanan, Recoverable Henry H. Mashburn, School Virtual Puneet Kumar, of Computer Carnegie Mellon Recoverable virtual memory rekrs to regions C. Steere, James J. Kistler Science combination found of a virtual David University This Abstract Memory in of circumstances situations involving the meta-data transactional guarantees are repositories. Thus RVM offered. This paper deseribes RVM, an efficient, Pofiable, and easily used implementation of recoverable virtual applications from distributed memory for Unix environments. A unique characteristic of RVM is that it allows independent control over the RVM trensactionaJ properties of atotnicity, permanence, and serializability. This leads to considerable flexibility in the use of RVM, potentially enlarging the range of applications than can benefh from transactions. It also control over the basic transactional simplifies the layering of functionality such as nesting and It may often be tempting, distribution. The paper shows that RVM performs well over its intended rattge of usage even though it does not use a mechanism integrated benefit from specialized operating system support. It also has been that such sophistication address space on which demonstrates the importance transaction optitnizations. of intra- end object-oriented inter- this paper, programming is a library implemented code and no more rurttime library called RVkl, be, while remaining for input-output. is implemented with considerable flexibility in about 10K lines intrusive This transactional without specialized is intended for Unix applications system-level concerns The total size of those data structures of disk capacity, easily fit within manner. and their working main memory. direct commercial title of the that copying and/or specific without fee the copies advantage, publication Machinery. SIGOPS B 1993 that To copy all or part of are not made the ACM and Its date is by permission this copyright appear, material or distributed and of the Association otherwise, on our design. appropriate, 2. Lessons and as an lay not in what could RVM. our experience This experience, and requirements were The description architecture, the a discussion of dominant of RVM follows and implementation. we point out ways in which our design. with We conclude usage with an of its use as a building or to republish, notice notice IS Camelot is a transactional that simplify for and the distributed IS given general-purpose and encourage systems [33]. nested transactions, for Computing requires from 2.1. Overview the choice a fee permission. ‘93/12 /93/N. C., USA ACM 0-89791-632-81931001 RVM block, and a summary of related work. Permission copy of usability Our design challenge Venari [24, 37], of RVM, thesis to concerns of the fault-tolerance evaluation Camelot provided programming and performance, [10], a predecessor of RVM. influenced This work was sponsored by the Avinmcs Laboratory, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterscm AFB, Ohio, 45433-6543 under Contract F33615-90-C-1465, ARPA Order No. 7597. James KistIer is now affiliated with the DEC Systems Research Center, Palo Alto, CA. granted But our experience one can view crippling experience to or better comes at the cost of this paper by describing Wherever set size must unavoidable, in functionality of functionality engineering and have represents a batance between the in three parts: rationale, should be a small of atomicity, up features to add, but in determining influences structures that must be updated in a fault-tolerant persistent applications system. Alternatively, Coda [16, 30] tools. independent properties ease of use and more onerous our understanding with persistent data for allows and sometimes that is richer Thus RVM Camelot operating of in how they use transactions. constraints, We begin facility, support serializability, be omitted without than a typical range tools, and CASE mntime with the operating conjuring wide range of hardware from laptops to servers. fraction and of storage a wide Since RVM exercise in minimalism. minimal system support, and has been in use for over two years on a RVM permanence, mm”ntenance. Our answer, as elaborated user-level constraints, of mainline provide to be file systems and databases, to CAD languages. and the software facility a potent tool for fault-tolerance? in can also portability, 1. Introduction How simple can a transactional can benefit repositories, programming is most likely 2... $1 .~~ 146 facility built to validate transactional the support construction reliable It supports local and distributed and provides considerable of logging, of the would synchronization, flexibility in and transaction commitment strategies. communication which is Camelot relies management page external facilities binary heavily and of the Mach operating compatible with on the the Figure 1 shows the overall structure of a Camelot Each module Mach task and communication Mach’s inteqxocess is implemented between communication is via facilihy(IPC). recoverable Coda control on servers. as well and internal m 9“” system. The meta-data included housekeeping. memory contribute to Coda meta-data in as persistent Server recovery recoverable we of the Camelot thesis. memoryl directories Yet experience in and because it would Coda file was kept in a Unix m of an overkill. We placed data structures pertaining as a modules be something towards the validation Unix operating system [20]. node. would the use of transactions, system [2], 4.3BSD Camelot persisted, because it would give us first-hand interprocess data for fite on a server’s consisted which local file of Camelot to the last committed by a Coda salvager replica The contents of each ensured restoring state, followed mutual consistency between meta-data and data. . Recoverable Processes 2.3. Experience The most valuable lesson we learned by using Camelot was that recoverable I Camelot and practically system Co.pon.nt like . Mach K@r.el Coda. virtual memory was indeed a convenient useful programming Crash recovery structures were restored operations were merely for systems because data in situ by Camelot. manipulations The Coda salvager structures. abstraction was simplified Directory of in-memory was simple range of error states it had to handle was small. the encamulation . of messv. crash This figure shows the intemaf structure of Camelot as welt as its relatim-ship to application code. Camelot is composed of several Camelot considerably Mach tasks: Master Control, Camelot, and Node Server, as welt as the Recovery, Transaction, and Disk Managers. Camelot Unfortunately, provides recoverable vittuaf memory for Data Servers; that is, transactional operations am supported cm portions of the virtual address space of each Data Server. Application code can be split between Data Server and Application tasks (as in this figure), or these benefits problems we encountered scalability, programming maintenance. may be entirely linked into a Data Server’s address space. The latter approach was used in Coda. Camelot facilities are accessed via a library linked with application code. simplified consequences Overall, details into Coda server code. came at a high price. manifested themselves constraints, In spite of considerable able to circumvent these problems. and effort, The as poor d.ijjjcul~ of we were not Since they were ditect of the design of CameloL these problems in the following Figure 1: Structure of a Camelot Node recoverv. data because the we elaborate on paragraphs. A key design goal of Coda was to preserve the scalability 2.2. Usage Our interest in Camelot phase optimistic replication System. Although distributed commit, atomicity arose in the context the meta-data protocol does not require strategy for us would have an ad hoc fault tolerance mechanism for using some form of shadowing. that we found support virtual for recoverable enables regions address be endowed of atomicity, isolation properties space we did not find distributed to we realized responsible is its These experiments contributor server CPU controlled to loss of utilization, operation also showed scalability was and that Camelot for over a third of this increase. of Coda servers in operation experiments was Examination showed considerable paging overheads due to the fact that each involved interactions between many of processes shown in Figure 1. There was no obvious way to reduce this overhead, since it was inherent in the implementation structure of Camelot. virtual transactional and permanence. a need for features transactions, the increased Camelot This unique of a process’ with that the primary the component most useful memory [9]. feature of Camelot in an earlier paper [30]) showed that Coda was and context switching But we were curious to see what Camelot could do for us. The aspect of Camelot But a set of carefully (described less scalable than AFS. a of locat updates to meta-dafa in The simplest been to implement used by the Coda File it does require each server to ensure the and permanence the first phase. protocol of the two- of AFS. Since such as nested or that our use of lFor memo# 147 brevity, we often omit in the rest of this paper. “virtual” from “recoverable virtuaf A second obstacle programming to using constraints Camelot was it imposed. the count on the clean failure set of latter is only responsible These constraints came in a variety of guises. For example, Camelot required all processes Manager using it to be descendants task shown in Figure Coda servers required scripts It also made debugging starting a Coda server under example of a programming required us to use Mach Coda was capable expensive, thread constraint of using context we ended up paying and were threads. much third limitation complexity and combinations porting tight of Mach difficult. recoverable dependence features made size, rarely used maintenance memory, we were usually new bugs in Camelot. As the cumulative toll and hard to decide for ways to preserve the virtues of Camelot avoiding its drawbacks. Since recoverable was the only aspect of Camelot virtual RVM RVM If has to for coping unpleasant is implemented correctly with with concurrency to be multi-threaded and in the presence of true parallelism. changes no on threading coroutine C threads, and coroutine fiml RVM mechanisms: simplification thus user-level thread We have, in fact, used RVM different Mach kernel threads [8], LWP [29]. was to factor Standard techniques will adopts with three out resiliency such as mirroring Our expectation most likely to can is that be implemented in the disk. a layered approach to transactional support, as shown in Figure 2. This approach is simple and that enhances flexibility: That into an application does not have to buy those aspects of the transactional irrelevant concept that are to it. Rationale The central principle we adopted in designing RVM was to In building a tool that did one thing well, we were heeding Lampson’s sound advice value simplicity over generality. on interface design [19]. long Unix tradition We were also being faithful of keeping building blocks simple. take radically from Camelot in the areas positions operating functionality, system dependence, I The to simplicity different allowed Application to the change in focus from generality of above That layer is also responsible device driver of a mirrored quest led to RVM. 3. Design are supporting. a layer and other of their at a granularity starvation this functionality memory was cheap, easy-to-use and had few strings attached. they is out concurrency synchronization be used to achieve such resiliency. we while into a realization RVM insist on a to use a policy deadlocks, Our we relied on, we sought to distill the essence of this functionality to factor applications is required, media failure. mounted, we decided implementations. lay in Camelot or Mach. of these problems simplified enforce it. used the first to expose But it was often problem the But it does not depend on kernel thread support, and can be Since Coda was the sternest test case for whether a particular looked on its code have to the abstmctions to function was that we and to perform Internally, a hefty performance cost of Camelot while control problems. more with little to show for it. A of RVM, Rather than having RVM This allows serializability even where control. appropriate was that threads, area technique, choice, because user-level switches specific control. was complex. kernel second concurrency that complicated more difficult Camelot Since kernel procedure a debugger Another though A the Disk 1. This meant that starting a rather convoluted made our system administration fragile. of semantics for local, non-nested transactions. Nastkg Code I \ Distribution / SanalizabWy us to and structure. Operating Permananca.’ System media faihre 3.1. Functionality was to eliminate support for nesting Our first simplification and distribution. A cost-benefit analysis showed us that each could be better provided top of RVM2. efficient property than as an independent While a layered implementation a monolithic of keeping one, it each layer simple. maybe has the Figure 2: Layering layer on less attractive Upper layers can 3.2. Operating System To make RVM portable, of Functionality Dependence we decided Unix ~plemenwtia sketch is provided in Section 8 only on a small, widely supported, A consequence of this decision was that we VM 148 to rely call interface. subset of the Mach system could not count on tight coupling Zh in RVM subsystem. The Camelot between Disk Manager RVM and the module runs as an external pager [39] and takes full responsibility managing the backing process. in the The use of advisory Mach recoverable interface regions paged out until with Mach’s backing that This close alliance allows Camelot store or to avoid regions whose addressing of large recoverable 3.3. Structure The ability spaces dirty address space are not commit. subsystem of a and unpin) ensure and to support recoverable handling Camelot’s Camelot of a process’ VM approaches Efficient lets for regions VM calls (pin transaction double paging, size store for recoverable limits. regions is critical to to communicate allows sactilcing efficiently robustness good decomposition, to be fast [4], its experimental lags in far Our goals in building RVM were more modest. not trying to replace traditional We were forms of persistent storage, such as file systems and databases. Rather, we saw RVM 1, is predicated commercial behind that Even implementations. on Unix of the on Mach best 2.5, the measurements reported by Stout et al [34] indicate that IPC is about 600 times more expensive goals. modular it has been shown that IPC can be performance implementations without Camelot’s shown earlier in Figure Although address enhanced performance. fast PC. across than local procedwe ca113. To make matters worse, Ousterhout [26] reports that the context switching performance not improving with raw hardware performance. linearly of opemting systems is as a building block for meta-data in those systems, and in Given our desire to make RVM portable, higher-level compositions willing to make its design critically dependent on fast IPC. of them. could assume that the recoverable a machine storage. Consequently, we memory requirements would only be a small fraction on of its total disk This in turn meant that it was acceptable to waste some disk space by duplicating recoverable regions. Hence RVM’S recoverable region, completely called independent the backing store for backing its external of the region’s store for a data segment, is VM swap space. Crash recovery relies only on the state of the external data segment. external Since a VM pageout does not data segmen~ an uncommitted reclaimed by correctness. the VM modify the dirty page can be subsystem without Of course, good performance loss of also requires that such pageouts be rare. versus our strategy resource generous with memory usage use implemented of is to view it as a By tradeoff. being and disk space, we have been able to keep RVM simple and portable. optional external support Our design supports the pagers, for this but feature we have The yet. not most apparent impact on Coda has been slower startup because a process’ recoverable memory must be read in en masse rather than being paged in on demand. Insulating RVM sharing of spaces. But this is not a serious limitation. primary reason increase Sharing recoverable this purpose. virtual memory to use a separate robustness by avoiding memory across memory address After all, the address space is to memory corruption. across address spaces defeats In fact, it is worse than sharing virtual memory because damage may be persistent! our view is involved is that processes willing (volatile) Hence, to share recoverable as a library that is linked No external communication in the servicing implication of this is, of course, applications not to damage RVM of RVM of any calls. An that we have to trust data structures and vice versa. A less obvious implication a single write-ahead is common movement is a strong because determinant infeasible the Disk significant server cannot share Such sharing because disk head of performance, and disk per application In present. Camelot, is for serves as the multiplexing The inability limitation legitimate at Manager agent for the log. RVM. systems the use of a separate economically file is that applications log on a dedicated disk. in transactional to share one log is not a for Coda, because we run only one process on a machine. But concern for other applications Fortunately, there are two it may be a that wish to use potential alleviating factors on the horizon. First, independent there is of transaction considerable implementations from the VM subsystem also hinders the recoverable kind example, One way to characterize complexity Instead, we have structured RVM in with an application. we were not to place the RVM in log-structured of the Unix file system [28]. log for each application on such a system, one would head movement. processing considerations, interest benefit from No log multiplexer If one were in a separate file minimal would disk be needed, because that role would be played by the file system. Second, there is a trend toward using disks of small form factor, partly technology [27]. motivated by interest It has been predicted in disk array that the large disk capacity in the future will be achieved by using many small already trust each other enough to run as threads in a single address space. 3430 ficros~nds contemporary 149 versus 0.7 microseconds machine, rhe DECStarion 5000200 for a nutt c~ on a WPi~ disks. startup latency, as mentioned If this turns out to be true, there will be considerably less economic incentive to avoiding a dedicated we plan to provide an optional disk per In summary, each process using RVM The log cart be placed partition. in a Unix has a separate log. file RVM’S permanence implementation rely of this system call. on the correct more mappings aliasing. eliminate information follows the major program-visible the operations abstractions, memory and then describe for is managed in segments, which segments. to accommodate segments although current segment hardware and RVM file system on a machine is only are limited number of by its storage Since the distinction programs, we use the term “external RVM to cope with of page size, retains no after its regions and maintenance storage and takes the use of absolute memory care also of mapping in segments. layered of storage within provided by and segment to operation data segment” to external A on RVM, a segment. RVM mapping for initialization, are shown in Figure The log to be used by a process is specified initialization is invisible a This each time. pointers allocator, of a load map Primitives termination refer to either. RVM mappings the same base address operations 4(a). outstanding. A segment loader package, built on top of the creation into 4.2. RVM The The backing store for a segment may be a file or a raw disk partition. These memory. must be done in multiples supports heap management has been limitations The the need for transactions recoverable recoverable up to 2 64 bytes long, to 232 bytes. length allows simplifies to Mukics virtual about a segment’s are unmapped. RVM, segment designed resources. the rationale and Regions analogous segments from below, we first present supported on them. 4.1. Segments Recoverable logically In the description Also, the same process. in Regions can be unmapped at any time, as long as they have no uncommitted presented earlier. The most and regions must be page-aligned. 4. Architecture The design of RVM once by overlap Mapping are minimal. is that no region of a segment may lx than cannot resmictions on a dedicated Unix file system. on segment mapping restriction mapped onto disk. For best performance, the log should either be in a raw partition disk o; in a fide on a log-structured important uses the fs ync flush modifications guarantees Restrictions or on a raw disk When the log is on a file, RVM system call to synchronously restrict Mach external pager to copy data on demand. process. loosely in Section 3.2. In the future, via the opt i on.s_des c argument. at RVM The map is called once for each region to be mapped. data segment addresses for argument. the and the range of virtual mapping The unmap are identified operation time that a region is quiescent. in the can be invoked Once unmapped, The memory first at any a region can be remapped to some other part of the process’ address space. After a region has been mapped, memory it may be used in the transactional .Ssgment-2 address range specified Figure 3: Mapping a transaction set As shown in Figure record Regions of Segments 3, applications explicitly memory. A region typically collection of objeets, segment. In the current data from external begin_t corresponds transaction image of implementation, when a region is mapped. The limitation the copying memory act of of this method is 150 that is used in all that transaction. The know that a certain area This allows RVM to value of the area so that it can undo The rest ore_mode lets an application ion abort a transaction. is more efficient, regions require no RVM occurs with lets RVM copy data on a set -range. to a related and may be as large as the entire data segment to virtual rans operation t id, is about to be modified. will never explicitly guarantees that newly mapped data represents the committed associated operation the current shown in begin_transaction identifier, changes in case of an abort. map regions RVM The operations range of a;egion during mapping. of segments into their virtual the region. 4(b). returns further Each shaded area represents a region. The contents of a region art physically copied from its external data segment to the virtuat memoty Figure addresses within operations since RVM Such a no-resrore does not have to Read operations intervention. flag to indicate that it on mapped initialize (version, options_desc); map(region_desc, options_desc); unmap(region —desc); ~end_transaction (tid, abort_transaction terminate; (a) Imtialization & Mapping Operations commit_mode) (tid) (b) Transactional query EzJ Operations (options_desc, ~ create region_desc); log_len, log(options, (c)LogControlOperations ; ; (d) Miscellaneous mode); Operations Figure 4: RVM Primitives Atmnsaction aborted is committedly via successful commit madeina end_transacti.on guarantees transaction. willingness permanence But an application the commi–t_mode paratneter Such ano-@shor no-flush transactions flush RVM’S where the boundis Figure4(c)shows controlling operation, flush, transactions operation, blocks have been truncate, in the write-ahead segments. a operation, committed truncation is long-running we have provided and RVM for variety application identity to obtain information i ons changes performed of application internal buffers. can dynamically then use it in an initialize in a region. sets a variety Using of tuning a no-undo/redo the old-value space is available records only of from memory transactions. records to the log. of committed The format of a Upon commit, records are known to operations the set-range old-value issued during records are replaced ranges of memory. Note that each modified one new-value record even if that range has been updated many times in a transaction. an final step of transaction commitment new-value The a records that reflect the current contents of the range results in only a records The consists of forcing the to the log and writing out a commit record. knobs an in virtual uncommitted the new-value have to be written by new-value c reate_log, logging assumes transaction. and the value The implementation The bounds and contents of old-value log truncation create a write-ahead use to an external data segment. RVM for applications allows to typical log record is shown in Figure 5. such as the number and for triggering treatment techniques can be found in changes transactions resource-intensive operation transactions operation such as the threshold sizes able Consequently, second shown in Figure 4(d), perform The query of uncommitted set_opt is corresponding of functions. log manual [22] strategy [3] because it never reflects uncommitkd for to control its timing. The final set of primitives, implementation that adequate buffer first But since this is a mechanism The RVM 5.1.1. Log Format Note to external data usually by RVM. aspects of its implementation: 5.1. Log Management no-flush The disk. blocks until all committed in the background potentially all to for our discussion Gray and Reuter’s text [14]. When The log. log have been reflected Log transparently until techniques must explicitly providedbyRVM forced well-known systems, we restrict and optimization. of transactional independentofpermanence. the use of the write-ahead upon here to two important management bounded persistence, thetwooperations draws transactional offers many further details, and a comprehensive the period between log flushes. thatatomicityisguaranteed its guarantee via time to time. provides RVM building To ensure persistence log tim used in this manner, RVM Since changes has reduced commit the application write-ahead of a of end_transaction. “lazy’’transaction latency since a log force is avoided. ofits default, can indicate to accept a weaker permanence 5. Implementation and By abort_transaction. No-restore and no-flush transactions are more efficient. The former result in both time and space spacings since the log and contents of old-value operation. buffered. latency, records do not have to be copied or The latter result in considerably since new-value and commit spooled rather than forced to the log. 151 lower commit records can be RaveraeD!splacemerrts Ranw Tram Hdr Hdr 1 *I !1 1 1 Dkplacements FoIwwd ‘this log record has three medifimtimr ranges. The bidirectional displacements records aUow the log to be read either way. Figure 5: Format of a Typical Log Record Tail Displacements I l} Head Dlsplacenrents ‘rttis figure shows the organization of a log during epoch truncation. The current tail of the log is to the right of the area marked “current epoch. The log wraps around logically, and irmemal synchronization in RVM allows forward processing in the current epoch white truncation is in progress. When truncation is ccmplete, the area marked “truncation epoch” wilt be freed for new log records. Figure 6: Epoch Truncation 5.1.2. Crash Recovery and Log Truncation undo/redo Crash recovery modified consists of RVM tail to head, then constructing latest committed encountered modifications external data segment. empty log. delaying for The in the log. information an in-memory changes applying first reading the log from in each tree of the data segment trees are then traversed, them Finally, to the corresponding property internal locks not violate Certain may result have RVM maintains truncation situations, transactions or sustained in incremental truncation The idempotency those circumstances, of recovery all other is achieved by recovery actions does such as the blocked for so long that log space becomes critical. step until been high being Under RVM reverts to epoch truncation. are Page Vector Truncation is the process of reclaiming log entries by applying the recoverable because log whenever current total size. space is finite, implementation to procedure code as epoch described the log while truncation to implement correctly. the forward ,.. ,. ---,,- crash processing Although logically recovery This figure truncation. part of and results in bursty system performance. incremental truncation mechanism periodically obsolete by writing recoverable during renders normal the Now that RVM a mechanism operation. oldest out relevant pages directly data segment. To log for This entries from VM to preserve the no- shows the key data structures involved in incremental RI through R5 are log entries. The reserved bit in to the recoverable data segment. The log head dces not move, since P2 has the same log offset as PI. P2 is written next, and the log head is moved to P3’s log offset. Incremental truncation is now blocked until P3’s uncommitted reference count drops tn zero. Figure more than necessary, is stable and robust, we are implementing kg tail Log Records count of zero, it is the first page to be written exclusive reliance on epoch truncation is a correct strategy, it substantially increases log processing ●., : page vector entries is used as an intemaf leek. Since page P1 is at the head of the page queue and has an tmcornrrdtted reference occurs in the is in progress. degrades forward tail &jR4~ rest of the log. Figure 6 depicts the layout of a log while an epoch truncation PagaQueua ----- IOghead approach, to an initial : : To chose to reuse In this truncation, head : of its has proved to truncation. P 1 is log truncation above is applied concurrent to and is triggered effort, we initially for Unumm-med R8fCni B* Resewed in them to log size exceeds a preset fraction be the hardest part of RVM crash recovery Periodic segment. In our experience, minimize space allocated the changes contained data necessary the that cannot be written data segment. of long-running complete. traffic, pages to ensure that incremental this propwty. concurrency, log, transactions in the log status block is updated to reflect an this referred the out to the recoverable presence the head and tail location of by uncommitted 7: Incremental Truncation Figure 7 shows the two data structures used in incremental truncation. mapped The first data structure is a page vector for each region that region’s that maintains the modification status of pages. The page vector is loosely analogous to a VM page table: the entry for a page contains a dirty bit and an uncommitted reference 152 count. A page is marked dirty when it has committed reference count executed, and committed or aborted, marked dirty. is changes. incremented decremented the descriptors that specifies the log head. Each descriptor first record referencing descriptor incremental offset are platforms in which truncation specified by The queue contains no it could consists the descriptor, repeated until out in order to move only in the appear. of A step in selecting the first out the pages specified and moving the next This descriptor. the desired amount such as IBM RTs, DEC MIPS workstations, laptops and machines capacity and a variety workstations. ranges ranges experience of log space has been with RVM experience opportunities with Intra-transaction RVM optimization identical, typically duplicate because of when one and actions part then recoverable memory procedure Optimization of Forgetting been written Standard has only been on Mach 2.5 and 3.0. an procedures - range it modifies, is supposed code in RVM to be ignored, application and to issue issuing a are often a to Each of those calls for the areas of even if the caller or some causes duplicate and overlapping Inter-transaction A version so already. set-range transactions. of modifications command “cp many no-flush Temporal system to recoverable dl / * and adjacent The is being debugged. for has encouraged transparent resolution of directory server replicas also use RVM alsconnected operation updates is done using a log- based strategy [17]. The logs for resolution Clients But positive us to expand its use. are maintained now, [16]. particularly for The persistence of changes made while disconnected is achieved replay logs in RVM, and user advice for long-term by storing cache is stored in a hoard database in RVM. An unexpected use of RVM servers and clients [31]. hard-to-reproduce structures. All log has been in debugging As Coda matured, bugs involving corrupted persistent data We realized that the information excellent of reference in in RVM’S log clues to the source of these corruptions. we had to do was to save a copy of the log &fore truncation, and to build a post-mortem display the history of modifications The most common using RVM catl prior new-value recorded by the log. has been in forgetting record tool to search and source of programming to modifying reflect problems memory. because RVM memory. does not create a by the transaction The current solution, memory. defensive y. in to do a set-range an area of recoverable for this area upon transaction modifications language-based, For example, the Coda we ran into commit. will not to that area of as described in Section 5.2, A better solution would be as discussed in Section 8. d2” on a Coda client will cause as transactions check locality in uses intent was just to replace Camelot by RVM on is to program updating for d2 as there are children performed that Our original often translates into locality these updates needs to be forced flush. the truncation The result is disastrous, occur only in the context of input requests to an application RVM of Hence the restored state after a crash or shutdown optimization have incremental offered begins elsewhere to have done will using RVM Such records to be coalesced. no-flush applications in C or C++, but a few have been written ML. management This is particularly that transaction. set -range modularity Hence applications invokes within set bug, while disk personal Most supporting transaction. in applications. procedures may perform calls a single call is an insidious call is harmless. tmnsaction, other Our that ports to other Unix platforms in RVM. or adjacent memory to err on the side of caution. common distinct respectively. overlapping, occur defensive programming –range two the volume of data arise when addresses are issued within situations indicated reducing optimization calls specifying perform to 2.5GB. on these while be stmightforward. made to partitioned We refer to these as intra-transaction and inter-transaction written MB, But RVM has been ported to SunOS and SGI IRIX at MIT, For example, for substantially to the log. a set capacity 64 servers, in the role described in Section 2.2. Optimizations written to 60MB experience with RVM Early 12MB Sun 386/486-based step is reclaimed, 5.2. Memory from from of Intel and we are confident by the log head to the has been in daily use for over two years on hardware Spare workstations, the order in a page is mentioned in the queue, writing it, deleting RVM specifies the log offset of the that page. page references: descriptor changes are the affected pages are which dirty pages should be written earliest — ranges The second data structure is a FIFO queue of page modification duplicate set when On commit, 6. Status and Experience The uncommitted as the data structure of dl. 7. Evaluation A fair assessment to the log on a future issues. From a software engineering inter-transaction optimization at commit time. If the modifications committed subsume those transaction, the older log records are discarded. from in Only the last of an earlier is being unflushed to ask whether commensurate perspective, simplicity 153 of RVM RVM’S with must code consider distinct we need size and complexity From its functionality. we need to know two perspective, whether has resulted in unacceptable RVM’S are a systems focus on loss of performance. To address the first issue, we compared the source code of RVM and RVM’S Camelot. approximately 10K lines of C, while utilities, and other auxiliary code contribute Camelot has a mainline and auxiliary like PC and the to Camelot. is considerably This translates into a corresponding and porting. smaller reduction for RVM. of effort in What is being given up in return is support for nesting and distribution, in areas such as choice of logging paging The performance worst case distributed strategies — a fair trade case, the benchmark considerable referred as well locality. to as localized, accounts we used controlled as measurements from Coda servers and clients in actual use. The specific questions of are In this uniformly the average that exhibits access pattern, 70% of the transactions update on 5’%0 of the pages, 259i0 of the transactions 1590 of the pages, and the remaining 5’70 of the transactions remaining 80?Z0of the pages. uniformly distributed. update accounts on the Within each set, accesses are 7.1.2. Results goal in these experiments the throughput of RVM as discussed experimentation accesses To represent uses an access pattern temporal This corresponds of RVM when across all accounts. Our primary as well as flexibility by our reckoning. To evaluate the performance occurs when accesses are sequential. occurs update accounts on a different size of code that has to be understood, and tuned maintenance 10K lines. These numbers do for features external pager that are critical debugged, is code size of about 60K lines of C, code in Mach the total code test programs a further code of about 10K lines. not include Thus mainline paging over its intended to situations in Section observe performance becomes was to understand 3.2. A secondary degradation relative more significant. shed light on the importance domain of use. where paging rates are low, goal was to to Camelot We expected of RVM-VM as this to integration. interest to us were ● How serious is the lack of integration RVM and VM? . What is RVM’S impact on scalability? . How effective optimizations? are intra- arrays ranging This roughly and inter-transaction the VM component of an operating To quantify of RVM system from could hurt this effect, we designed a variant of the industry-standard TPC-A in a series of carefully conwolled benchmzuk [32] and used it experiments. bank witk branch, and transaction many 1 and Figure other relevant branches, customer multiple accounts updates a randomly tellers per branch. chosen account, per A updates branch and teller balances, and appends a history record to and Camelot memory. The number benchmark. The represented as arrays respectively. a transaction in recoverable of accounts is a parameter accounb of and the 128-byte audit of our trail and 64-byte are records Each of these data structures occupies close to half the total recoverable memory. The sizes of the data structures for teller and branch balances are insignificant. around. sequential, with wrap- The pattern of accesses to the account array is a second parameter of our benchmark. conditions virtually identical changes increases. The best case for for Hardware and are described in as the The average throughput. size time of This yields a This recoverable to perform a log is about 17.4 theoretical maximum of 57.4 transactions per second, which is within 15% of the observed best-case throughput for RVM and Camelot. account sequential the is access. effects throughput until access is random, throughput of initially Figure close As recoverable paging drops. become 8(a) shows that to memory more its value for size increases, significant, and But the drop does not become serious recoverable memory size exceeds about 70% of physical memory size. The random access case is precisely where one would expect Camelot’s to be most valuable. in Figure Access to the audit trail is always offer hardly throughput In our variant of this benchmark, by 8 present our results. experimental milliseconds. When accessed the experiment account access patterns. For sequential account access, Figure 8(a) shows that RVM RVM’S we represent all the data entries. Table 1. an audit trail. structures and localized force on the disks used in our experiments is stated in terms of a hypothetical one or more random, Table memory benchmark for account to about 450K memory size to total physical memory size. At throughput 7.1.1. The Benchmark The TFC-A 32K entries each account array size, we performed Integration As discussed in Section 3.2, the separation from corresponds to ratios of 10% to 175% of total recoverable sequential, 7.1. Lack of RVM-VM performance. To meet these goals, we conducted experiments between graceful recoverable with Mach of the curves degradation But even at the highest to physical memory better than Camelot’s. 154 Indeed, the convexities 8(a) show that Camelot’s than RVM’S. integration size, RVM’S is more ratio of throughput is No. of Rmem Accounts Pmem RVM Camelot (Trarts/See) Sequential Random Localized (Tkans/See) Sequential Random L&sstized 32768 12.5% 48.6 (0.0) 47.9 (0.0) 47.5 (0.0) 48.1 (0.0) 41.6 (0.4) 44.5 (0.2) 65536 25.0% 48.5 (0.2) 46.4 (0.1) 46.6 (0.0) 48.2 (0.0) 34.2 (0.3) 43.1 (0.6) 98304 37.5% 48.6 (0.0) 45.5 (0.0) 46.2 (0.0) 48.9 (0.1) 30.1 (0.2) 41.2 (0.2) 131072 50.0% 48.2 (0.0) 44.7 (0.2) 45.1 (0.0) 48.1 (0.0) 163840 62.5% 48.1 (0.0) 43.9 43.2 42.5 41.6 40.8 (0.0) (0.0) (0.0) (0.0) (0.5) 44.2 (0.1) 39.7 33.8 33.3 30.9 27.4 (0.0) (0.9) (1.4) (0.3) (0.2) 48.1 48.1 48.2 48.0 48.0 48.1 48.3 48.9 48.0 47.7 29.2 27.1 2S.8 23.9 21.7 20.8 19.1 18.6 18.7 41.3 40.3 39.5 37.9 35.9 35.2 33.7 33.3 32.4 32.3 31.6 196608 75.0% 47.7 (0.0) 229376 87.5% 47.2 (0.1) 262144 100.0% 112.5% 46.9 (0.0) 294912 327680 125.0% 46.9 (0.7) 360448 137.5% 393216 150.0% 425984 162.5% 458752 175.0% 48.6 46.9 46.5 46.4 46.3 (0.6) (0.0) (0.2) (0.4) (0.4) 43.4 43.8 41.1 39.0 39.0 40.0 39.4 38.7 35.4 (0.0) (0.1) (0.0) (0.6) (0.5) (0.0) (0.4) (0.2) (1.0) (0.0) (0.2) (1.2) (0.1) (0.0) (0.2) (0.0) (0.0) (0.1) 18.2 (0.0) 17.9 (0.1) (0.0) (0.4) (0.2) (0.0) (0.0) (0.1) (0.0) (0.0) (0.0) (0.0) (0.1) (0.2) (0.8) (0.2) (0.2) (0.1) (0.0) (0.1) (0.2) (02) (0.0) ‘M table presents the measured steady-state throughput, in transactions per second, of RVM and Camelot on the benchmark described in Section ‘7.1. 1. The cohmm labekd “Rmem/Pmem” ~ivw the ratio of recoverable to physi~ m~ow si= Ea* dam -t t$ves tie mea SSId StSSSda~ deviatiar (in parenthesis) of the three trials {ith most consistent results, chosen from a set of five to eight. ‘Dse experiments were conducted on a DEC 50W200 with 64MB of main memory and separate disks for the log, external data segment, and paging file. Only one thread was used to nm the benchmark. Only processes relevant to the benchmark mrr on the machine during the experiments. Transactions were required to be fully atomic and permanent. Inter- and intra-transaction optirnizations were enabled in the case of RVM, but not effective for this benchmark. This version of RVM only supported epoeh truncation; we expect incremental truncation Table 1: Transactional g 5P q -- -’ g a g -- n =<.. 40 “\ \ 2 ● -.. u c .: ~ . ,* \ \ ❑\ . ●. . ..m ‘- . *...*-... “’- s... Et”.. u... =.. -------mp u 40 w . > significantly. Throughput ~ 50q u) V.* to improve performance ● .. 6.- . . . . . . ● .:- ..?... “---a . . . . . . : ~-... L7-.. k b, J \ 30 - ● a’-. ‘*\ i? K .. ““-&.-... ❑ ● \ . %-------- q . ..*D \e ‘u \a - .0 — 20 “ _ Sequa+ml Camelot Sequential -m- _ 100 is% IWM -i - : E-n-+a EzizEzl RVM R.wrdom Cunotot Random 20 40 60 100 120 80 140 RmemfPmem 160 180 10 20 0 40 60 localized RVM’S the data in Table 1. For clarity, recoverable and memory an impediment size domain. 8: Transactional shows with acceptable approaches throughput also drops that increasing even physical performance A conservative 160 180 (percent) data in locality recoverable less than 10%. and is for its interpretation is not intended of the 1 indicates that data, while keeping Applications restrict active recoverable performance. larger, simplicity Table applications can use up to 40% of physical slow, memory linearly, Throughput when throughput. confirm ‘that RVM’S to good 8(b) linearly 140 the average case is presented separately from the best and worst eases. But the drop is relatively worse than RVM’S These measurements application size. remains memory Camelot’s consistently access, Figure drops almost performance recoverable size, account throughput 120 (b) Average Case Figure For lLXI RmeWPmem (a) Best and Worst Cases These plots illustrate 80 (percent) Inactive constrained memory limits only imposed throughput with with memory good for active degradation poor locality to have to data to less than 25% for similar recoverable by startup data can be much latency by the operating and virtual system. The comparison with Camelot is especially revealing. In spite of the fact that RVM is not integrated with VM, it is able to outperform 155 Camelot over a broad range of workloads. _ _ — . . Cmmelot Rrndom Cwldot Sequa?ntial RVM Random _ RVM Sequential I ❑ ‘a 8 “5 .!. ❑ ....... FI... ❑ ✎ ✎ ..s...m”””o ””a”””m”””am”””a ” .ti. ❑ U--U-’--I3 8“ o 20 60 40 ● --9--b-’ . r ---- --- * 80 1(X3 120 140 RmetrVPmem ●.. ... ..... ..... .... .... .... ...-9 ....?.... V.... V... W...V-. ”● 4 9 160 0 180 20 60 40 1(M 80 120 140 RmenVPmem (percent) 160 180 (per cent) (b) Average Case (a) Worst and Best Cases These plots depict the measured CPU usage of RVM and Camelot during tbe experiments described in Secticm 7.1.2. As in Figure 8, we have separated the average ease from the best and worst cases for visuat clarity. To save space, we have omitted the table of data (similar to Table 1) on which these plots are based. Figure Although we were puzzled by recoverable Camelot’s gratified behavior. to physical memory and RVM’S we had expected throughputs case, throughput is highly from lecalized 48.1 throughput faster We conjecture by an overly Manager. sustained to much that this increased paging activity aggressive log truncation During truncation, levels Disk running of is random, writing When truncation many out a dirty IOSL Less ffequent to amortize page across multiple truncation design of RVM. RVM has yielded It is therefore are ideal way to answer this question experiment of Camelot. mentioned would Unfortunately, such a direct comparison on the machine, which all CPU is attributable Dividing usage to the the total CPU usage gives the average CPU cost is our metric of scalability. Note and page fault servicing over like all of RVM and Camelot for CPU usage of Camelot. The actual values of CPU usage remain for RVM almost constant account access, Figure Camelot’s. handling is not range, RVM’S over all the 9(a) shows that both that even at the is more limit of CPU usage is less than In other words, the inefficiency in RVM lower inherent overhead. 156 systems CPU usage increase with recoverable size. But it is astonishing our experimental instead both memory sizes we examined. and Camelot’s memory The be to repeat the in Section 2.3, using RVM present 9 compares the sealability For random on the to ask whether gains in scalability. the total CPU Since no extraneous For sequential account access, RVM requires about half the heavy toll on the appropriate the anticipated is based of that set of experiments, of the benchmark. recoverable 2.3, Camelot’s sealability each of the three access patterns described in Section 7.1.1, or sequential account access of Coda servers was a key influence of in Section 7.1. truncation Figure the cost of 7.2. Scalability sealability possible, transactions. result in fewer such lost opportunities. As discussed in Section not to the exclusion that this metric amortizes the cost of sporadic activities log writes out transactions original deseribed in system or user mode) per transaction, is induced is frequent and account access opportunities changed the is also of RVM’S by the number of transactions Mamger. all dirty pages referenced by entries in the affected portion of the log. was (whether strategy in the Disk the Disk Manager hardware usage on the machine was recorded. activity higher by the Camelot current our evaluation For each trial in the of the raw data indicates that the drop is attributable activity Repeating 5000/200s. on the same set of experiments case, and to 41.6 in the random case. in throughput on Consequently, At per second case to 44.5 Decstation has hardware RTs we now use the much Camelot. even at the in transactions server Instead of IBM because Coda servers now use RVM But in Camelot’s to locality because experiment of The data shows memory ratio of 12.5%. in the sequential Closer examination paging sensitive to physical that ratio Camelot’s drops feasible considerably. both to be independent in the access pattern. that this is indeed the case for RVM. lowest recoverable CPU Cost per Transaction by these results, we were For low ratios of Camelot’s the degree of locality 9: Amortized of page fault than compensated for by its Machine Machine ‘IkansactiOms Bytes Written name type committed to Log T grieg server haydn server wagner server morart client ives verdi client bath client purcell client beriioz client 267,224 483,978 248,169 34,744 21,013 21,907 26,209 76,491 101,168 client Intra-’llansaction 289,215,032 661,612,324 264,557,372 9,039,008 6,842,648 5,789,6% 10,787,736 12,247,508 14,918,736 ‘this table presents the observed reduction in log traffic due to RVM size after both optimization were applied. ‘Rre columns labclled Total Inter-Transaction Savings Savings Savings 20.7% 21.5% 20.9% 41.6% 31.2% 28.1% 25.8% 41.3% 0.070 20.7’% 0,07. 21.5% 0.0% 20.9% 26.7% 68.3% 22.0% 53.2% 20.$%. 49.0% 47.7% 77.5% 81.6% 21.9% 36.2% 64.3% 17.3~o optimizatiens. The column Iahelled “Bytes Written “Intra-Transaction Savings” and “Inter-Transaction percentage of the originat log sire that was supresscd by each type of optimization. from Coda clients and servers. This data was obtained to Log” shows the log Saving s“ indicate the over a 4day period in March 1993 Table 2: Savings Due to RVM Optimization For localized both RVM those machines account access, Figure 9(b) shows that CPU usage increase linearly with recoverable and Camelot. memory For all sizes investigated, tend to be selected on the basis of size, weight, and power consumption size for RVM’S 7.4. CPU usage remains well below that of Camelot’s. Broader Analysis A fair criticism Overall, these considerably most of requires measurements the workloads half investigated, the CPU anticipate that refinements truncation will further improve lower decision CPU RVM it RVM usage of to RVM usage to structure collection that less of a CPU burden than Camelot. about RVM’S establish is research prototype, typically well-tuned We as a library of tasks communicating commercial not currently rather via PC. from than drawn in Sections 7.1 product our products supports would recoverable of with strengthen the does not come at the cost of Unfortunately, analysis with a comparison such a comparison possible because no widely performance as a A favorable simplicity good performance. its scalability. directly Camelot. claim that RVM’S such as incremental follows of the conclusions and 7.2 is that they are based solely on comparison Over Camelot. rather than performance. virtual broader memory. scope is used commercial will Hence have to a awah the future. As mentioned in Section 3.3, Mach IPC costs about 600 times as much as a procedure call experiments. Further are the substantially on the hardware contributing we used for The simplicity a versatile smaller path lengths in various RVM components due to their inherently as a Building 8. RVM our to reduced CPU usage base on which functionality. simpler functionality. Block of the abstraction offered by RVM to implement In principle, more any abstraction Effectiveness of Optimization To estimate the value of intra- optimizations, we instrumented total of log data eliminated volume can be built and RVM inter-transaction extensions of the RVM For example, by each technique. sample of Coda clients interface maybe necessary. nested transactions could be implemented using RVM as a substrate for bookkeeping for a undo logs of nested transactions. and servers in our cotnmi environment. t, and abort operations would committed state would be handled entirely benefit feasibility of this approach from 30%, though benefit exhibit optimizations to transactions. be on portable especially 20% and substantially this type of optimization, to no-flush proved performance optimization. between typically on clients by another 20-30%. from applicable have some machines Inter-transaction savings. log traffic intra-transaction is typically Support for distributed reduce Coda clients, optimizations for since the restoration by RVM. has been confirmed of The by the Venari project [37]. because it is only valuable begin, higher Servers do not RVM simple, top-level would be visible to RVM, Recovery significantly be Only state such as the The data in Table 2 shows that both servers and clients The savings in log traffic semantics In some cases, minor to keep track of the Table 2 presents the observed savings in log traffic representative on top of RVM. complex that requires persistent data structures with clean local failure 7.3. makes it good because disks on 157 transactions could also be provided by a library built on RVM. Such a library coordinator and subordinate routines two-phase commit, beginning a transaction as well transaction. Recovery involve recovery, RVM as for and after adding operations such as new to a coordinator followed would provide for each phase of a sites a crash would by appropriate termination of distributed transactions The crash. unspecified communication until runtime to perform extended in progress mechanism communications. commit could RVM a list if the coordinator of the old-value would library at each subordinate phase commit is clear. would be discarded. subordinate could compensating RVM could until act by the by the On a global use the saved records Avalon that recoverable abstraction support to construct was built virtual memory on Camelot, is indeed implementing Language – range calls: an code could these calls transparently. An approximation based be solution augmentation would to use a – range post-compilation the recent work of O’TooIe of RVM et al [25]. is provided the heap for a language of persistent some improvements data. While to RVM work establishes the suitability for the authors of RVM context from the one that motivated of simplicity previously suggest their for a very different past work therefore it is impossible that restrict contribution relationship to fully attribute has indirectly influenced explicit our discussion here to placing in proper perspective, The reliance of Birrell checkpointing makes which been focused enrichment identification of transactional on three areas. of the transactional such as distribution, We RVM’S its transactions into and hardware [6]. languages [21], operating are in the in- the entire memory to be the current version of occurs only during crash technique on full-database practical limits its domain without being substantially of use. RVM The absence abort is more versatile monitors (TPMs), Tuxedo [1, 36], Like centralized such are TPMs add distribution back-ends, and integrate as important and support heterogeneous database managers, TPM backin structure. They encapsulate properties interface. and provide data access via a query language contrast to RVM, supports only atomicity which for more complex. processing products. update rates. updates and for explicit further and only of recoverable aspect of permanence, This is in and the and which provides data as mapped virtual memory. properties A more has toolkit, One area has been the and longevity then reflected all three of the basic transactional concept along dimensions nesting [23], There is Updates manage small amounts all the [13, 18], attention second area has been the incorporation abort. et al’s technique the for multi-item process failure for their realization Each transaction to disk, the log file deleted, and the systems. and to clarifying Their design is is Periodically, truncation access to recoverable Since the original on disk, ends are usually monolithic to its closest relatives. and techniques transaction file renamed Log with databases have been et al [5]. checkpointing. In the RVM. comes and is based upon new-value database image. Transaction 9. Related Work space available, for so, it that have to update only a single data item. Encina [35, 40] is enormous. By doing that for small by Birrell in a log file of support it. processing transactional to applications baggage than RVM’S, services to OLTP of transaction seeks facilities. and full-database commercial The field to the RVM essential application?” data and which have moderate of garbage this application, of the The virtues applications In this work, RVM that supports concurrent a “back embellishing recovery, not during normal operation. by segments are used as the stable to-space and from-space at transactional the database. calls. Further evidence of the versatility balked sophisticated new checkpoint to a language- and poor represents than accessible image is checkpointed issue phase tQ test for accesses to mapped RVM regions and to generate set hitherto memory to issue volumes or its implementation, for the average recorded RVM realization transactions no support support wouId alIeviate the problem compiler-generated simplest is constrained local in OLTP It poses and answers the question “What makes logging that performance data Rather abstraction both. even simpler appropriate forgetting the extolled with confirms language-based in Section 6 of programmers collection a systems for Experience persistence. which for mentioned at each large to those efforts, properties RVM transaction. [38], persistence, is very movement. transactional the records On a global abort, the library with high [12]. to simplify of the two- commit, environments In contrast to generated the outcome for achieving basics” One ion be preserved can also be used as the basis of runtime languages set rans of techniques locality have to be decides to abort. raords These records transaction. left to undo the effects of a way to do this would be to extend end_t return be by using upcalls from the library to enable a subordinate first-phase at the time of the [11]. A of support for modular which functionality provided the recovery, logging, Transarc toolkit. Transarc toolkit systems [15], RVM A third area has been the development 158 approach is used in the Transarc is the back-end by RVM RVM is structured corresponds and physical differs components entirely TP for the Encina TPM. The primarily to stomge modules of the from the corresponding in two important as a library ways. that is linked First, with applications, separate while some processes. accessed as mapped Transarc toolkit of the toolkit’s Second, modules recoverable memory in RVM, are storage is whereas offers access via the conventional the buffered Acknowledgements Marvin designers and I/O model. Theimer discussions and Lily system. et al have recently enhance the Mach memory [7], providing kernel Their efforts recoverable carries idea of MttrnmerL memory a [1] system support, thereby debt to Camelot should be obvious Camelot taught us the value of recoverable and showed us the merits and pitfalls RVM operating system has restrained virtual support generality [3] memory was willing to achieve within limits us to thank especially understand shepherd, Peter and Bill the Stout use Weihl, their helped significantly. Andrsde, J.M., Carges, M. T., Kovach, K.R. Buifding a Transaction Processing System on UNIX Conference Proceedings. Baron, R.V., Black, D. L., Bolosky, W., Chew, J., Golub, D. B., Rashid, R. F., Tevarrisn, Jr, A., Young, M.W. [4] generality, University, Bernstein, P. A., Hadzilacos, V., Goodman, N. Concumency Controt and Recovery in Data&zre Addison to Wesley, that preserve [5] Bershad, B.N., Anderson, Birrell, 1987. Systems, 1987. T. E., Lar.owska, E. D., Levy, H.M. Lightweight Remote Procedure CZU. ACM Transaction on ComptUer Systems 8(l), operating system independence. Systems. San Francisco, CA, Mach Kernel Interface Manual School of Ccrrrsptrter Science, Carnegie Metlon now. of a specific approach Whereas Camelot to its implementation. require by helping in the early wish Febrtsaty, 19g9. [2] enhancing portability. RVM’S CameloL of our SOSP the presentation In UniForurn In contrast, RVM avoids the operating for of We References since their support is in the kernel rather than specialized implementors pmticipated of RVM. to virtual Camelot’s support for recoverable in a user-level Disk Manager. need for on their to support work system-level step further, reported Hagmarm to the design The comments us improve Chew rmd Robert leading A. B., Jcmes, M. B., Wobixr, Febmary, 1990. E.P. A Simple and Efficient Implementation for Small Databases. In Proceedings of ihe Elevenlh ACM Symposium on Operating 10. Conclusion In general, RVM encountered System Principles. has proved to be useful wherever we have a need to maintain with clean failure semantics. persistent The only constraints of disk capacity, and for the working size of accesses to them to be significantly set Second, resource usage. RVM dimensions. these essentially minimal is indeed impact upon lightweight A Unix programmer library, such as the st While the importance been hampered implementation. situation. by double-edged the lack of Our hope is that RVM University, in subroutine abstraction a [10] lightweight system may applications, [11] applications with the operating applications, represents close to the limit has remedy this sword, as this paper has shown. class of less demanding Carnegie Mellon 154, Department University, Santa Fe, of Computer Febmary, Eppirrger, J. L., Mtsmmert, of Computer June, 1988. Marragemenr for Transaction PhD thesis, Department will for very demanding Report CIVIWCS-88- Eppinger, J.L. Virtunl Memory Morgan of the transactional While integration be unavoidable along both of the USENIX Mach 1[1 Symposium. 1993. Processing Systemr. Science, Carnegie Mellon 1989. L. B., Spector, A.Z Camelot and Avalon. been known for many years, its use in low-end has [9] system package. dio 1988. Cooper, E.C., Drsves, R.P. C T~eaak. Scienm, and thinks of RVM the same way he thinks of a typical Febmary, Chew, K-M, Reddy, A.J., Rcsner, T. H., Silberschatz, A. Kernel Support for Recoverable-Persistent Virtstat Memory. Technical in the title of this paper connotes First, it implies ease of learning it signifies 1987. Chang, A., Mergen, M.F. 801 Storage: Architecture and Programming. ACM Transactions on Computer Sy.wems 6(l), In Proceedings NM, Apd, [8] two distinct qualities. use. [7] less than main memory. The term “lightweight” TX, November, upon its use have been the need for the size of the data structures to be a small fraction [6] data structures Austin, it can be a [12] 1991. Garcia-Motina, Sagas. H., Salem, K. In Proceedings of the ACIU Signrod Conference. Good, B., Homan, P.W., Gawlick, 1987. D. E., Sammer, H. One thousand transactions per second. In Proceedings of IEEE Compcon. San Francisco, [13] For a broad CA, 1985. Gray, J. Notes on Database Operating Systems. In Goos, G., Harttnanis, J. (editor), Operating we believe that RVM of what is attainable Kaufmann, Advanced Course. Springer Verlag, Sysiems: An 197g. without [14] hardware or operating system suppofi Gray, J., Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993. [15] 159 Haskin, R., Malachi, Y., Sawdon, W., Chan, G. Recovery Management in QuickSilver. ACM Transactions on Camputer Syrterns 6(l), Febma~, 1988. [16] ACM Tran.sactiom [17] [32] Kistler, J. J., Satyrmarayanan, M. Disconnected Opcratioo in the Coda File System. on Computer Kumar, P., Satyanarayanan, Systems 10(1), February, Information Larrtpsmt, B. W. 1981. Hints for Computer System Design. lrt Proceedings of tk Ninth ACM Sympositun on Operaling Systems Principles. Bretton Woods, NH, October, 1983. Leffler, S.L., McKusick, M.K., Tk Design and lrnplementation Programs. ACM Transactions [22] [37] [38] R.W. Linguistic Lunguages An Approach of HICSS-25. [39] to Reliable Hawaii, J., Nettles, S., Gifford, Syrtem Principles. University, 1992. Distributed [40] January, 1992. 1993. fn Proceedings of the USENIX Faster As Fast as Summer Conference. Anaheun, CA, June, 1990. [27] Patterson, D. A., Gibson, G., Katz, R. A Case for Redundant Arrays of inexpensive Disks (RAID). In Proceedings of tk ACM SIGMOD Conference. 1988. [28] Rosenblum, M., Ousterhont, J.K. The Design and Implementation of a Lmg-Stmctumd File System. ACM Transactions on Computer Systems 10(1), February, 1992. Satyanarayanan, M. RPC2 User Guide and Reference Manual School of Computer Science, Carnegie Mellon [30] Satyanarayanan, M., Kistler, Siegel, E. H., Stcere, D.C. Coda A Highly Workstation 1991. J.J., Kumar, P., Okasaki, M. E., Available File System for a Distributed Environment. IEEE Transactions [31] University, on Computers 39(4), April, 1990. Satyanarayanan, M., Steere, D. C., Kudo, M., Mashbum, H. Transparent Imgging as a Technique for Debugging Complex Distributed Systems. In Proceedings of lhe Fijih ACM SIGOPS European Worhhop. Mmt St. Michel, France, September, 1992. 160 San Wing, J.M. Young, Young, J-B., Spector, A.Z. (editors), Morgan Kaufmarm, 1991. M.W. M.W., to Memory Managemetifiom Operating System. of Computer a Science, Carnegie Mellon November, 1989. Thompson, D.S., Jaffe, E. A Modular Architecture for Distributed Transaction In Proceedings of the USENLY Winter Conference. January, 1991. NC, December, Systems Getting G., and Nettles, S.M. CA, June, 1992. PhD thesm, Department D. Ashewlle, M., Morrisen, Exporting a User Interface Communication-Oriented J.K. Why Aren’t Operating Hardware? [29] Wistg, J.M., Faehndrich, 1993. Camelot and Avalon. 5(3), July, 1983. M. Science, Carnegie Mellon Overview The Avalon Language. In Eppinger, J. L., Mummers, Support for Robust, Distributed Nettles, S.M., Wing, J.M. Persistence -+ Undoabdlty = Transactions. Ousterhout, System Product Extensions to Standard ML to Support Trarrsacticns. In ACM SIGPLAN Workshop on ML and its Applicatwns. pt’ess, 1985. O’Toole, TUXEDO Unix System Laboratories, Concurrent Compacting Garbage Collection of a Persistent Heap. In Proceedings of the Fourteenth ACM Symposium on Operating [26] Encina Product Overview Transarc Corporation, 1991. University, In Proceedings [25] [35] Moss, J.E.B. Nesied Transactions: Computing. [24] Stout, P.D., Jaffe, E. D., Spector, A.Z Performance of Select Camelot Functions. In Eppinger, J. L., Mummett, L. B., Spector, A.Z. (editors), Camelot and Avalon. Morgan Kaufmsnn, 1991. Francisco, Mashbum, H., Satyanarayattan, RVM User Manual ~ Spector, A.Z. [34] [36] Karels, M. J., Quartemtan, J.S. of the 4.3BSD Unix Operating on Programming School of Computer [23] Morgan The Design of CamelcX. 1989. Liskov, B. H., Scheifler, Guardians and Actions: Handbook In Eppinger, J. L., Mutnmert, L. B., Spector, A.Z. (editors), Camelot and Avulon, Morgan Kaufmann, 1991. [19] [21] [33] Systems. San Diego, CA, Larnpsmr, B.W. Atomic Transactions. frr Lampson, B.W., Paul, M,, Siegert, H.J. (editors), Dkributed Systems -- Architecture and Implementation. Springer Verlag, System. Addison Wesley, and the TPC. M. [18] [20] of Debi[Ctedit Irr Gray, J. (editors), The Benchmark Kaufman, 1991. 1992. Log-based Directory Resolution in the Coda File System. In Proceedings of the Second International Conference on Parallel and Distributed January, 1993. Serlin, O. The History Prccessirrg. Dallas, TX,