ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging

C. MOHAN
IBM Almaden Research Center
and
DON HADERLE
IBM Santa Teresa Laboratory
and
BRUCE LINDSAY, HAMID PIRAHESH and PETER SCHWARZ
IBM Almaden Research Center

In this paper we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an industrial-strength transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports objects of varying length efficiently. By enabling parallelism during restart, page-oriented redo, and logical undo, it enhances concurrency and performance. We show why some of the System R paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL.
We compare ARIES to the WAL-based recovery methods of DB2(TM), IMS, and Tandem(TM) systems. ARIES is applicable not only to database management systems but also to persistent object-oriented languages, recoverable file systems and transaction-based operating systems. ARIES has been implemented, to varying degrees, in IBM's OS/2(TM) Extended Edition Database Manager, DB2, Workstation Data Save Facility/VM, Starburst and QuickSilver, and in the University of Wisconsin's EXODUS and Gamma database machine.

Categories and Subject Descriptors: D.4.5 [Operating Systems]: Reliability - backup procedures, checkpoint/restart, fault tolerance; E.5 [Data]: Files - backup/recovery; H.2.2 [Database Management]: Physical Design - recovery and restart; H.2.4 [Database Management]: Systems - concurrency, transaction processing; H.2.7 [Database Management]: Database Administration - logging and recovery

General Terms: Algorithms, Design, Performance, Reliability

Additional Key Words and Phrases: Buffer management, latching, locking, space management, write-ahead logging

Authors' addresses: C. Mohan, Data Base Technology Institute, IBM Almaden Research Center, San Jose, CA 95120; D. Haderle, Data Base Technology Institute, IBM Santa Teresa Laboratory, San Jose, CA 95150; B. Lindsay, H. Pirahesh, and P. Schwarz, IBM Almaden Research Center, San Jose, CA 95120.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 0362-5915/92/0300-0094 $1.50
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992, Pages 94-162.

1. INTRODUCTION

In this section, first we introduce some basic concepts relating to recovery, concurrency control, and buffer management, and then we outline the organization of the rest of the paper.

1.1 Logging, Failures, and Recovery Methods

The transaction concept, which is well understood by now, has been around for a long time. It encapsulates the ACID (Atomicity, Consistency, Isolation and Durability) properties [36]. The application of the transaction concept is not limited to the database area [6, 17, 22, 23, 30, 39, 40, 51, 74, 88, 90, 101]. Guaranteeing the atomicity and durability of transactions, in the face of concurrent execution of multiple transactions and various failures, is a very important problem in transaction processing. While many methods have been developed in the past to deal with this problem, the assumptions, performance characteristics, and the complexity and ad hoc nature of such methods have not always been acceptable. Solutions to this problem may be judged using several metrics: degree of concurrency supported within a page and across pages, complexity of the resulting logic, space overhead on nonvolatile storage and in memory for data and the log, overhead in terms of the number of synchronous and asynchronous I/Os required during restart recovery and normal processing, kinds of functionality supported (partial transaction rollbacks, etc.), amount of processing performed during restart recovery, degree of concurrent processing supported during restart recovery, extent of system-induced transaction rollbacks caused by deadlocks, restrictions placed on stored data (e.g., requiring unique keys for all records, restricting maximum size of objects to the page size, etc.), ability to support novel lock modes which allow the concurrent execution, based on commutativity and other properties [2, 26, 38, 45, 88, 89], of operations like increment/decrement on the same data by different transactions, and so on.

(TM) AS/400, DB2, IBM, and OS/2 are trademarks of the International Business Machines Corp. Encompass, NonStop SQL and Tandem are trademarks of Tandem Computers, Inc. DEC, VAX DBMS, VAX and Rdb/VMS are trademarks of Digital Equipment Corp. Informix is a registered trademark of Informix Software, Inc.

In this paper we introduce a new recovery method, called ARIES1 (Algorithm for Recovery and Isolation Exploiting Semantics), which fares very well with respect to all these metrics. It also provides a great deal of flexibility to take advantage of some special characteristics of a class of applications for better performance (e.g., the kinds of applications that IMS Fast Path [28, 42] supports efficiently).

1 The choice of the name ARIES, besides its use as an acronym that describes certain features of our recovery method, is also supposed to convey the relationship of our work to the Starburst project at IBM, since Aries is the name of a constellation.

To meet transaction and data recovery guarantees, ARIES records in a log the progress of a transaction, and its actions which cause changes to recoverable data objects. The log becomes the source for ensuring either that the transaction's committed actions are reflected in the database despite various types of failures, or that its uncommitted actions are undone (i.e., rolled back). When the logged actions reflect data object content, then those log records also become the source for reconstruction of damaged or lost data (i.e., media recovery). Conceptually, the log can be thought of as an ever growing sequential file. In the actual implementation, multiple physical files may be used in a serial fashion to ease the job of archiving log records [15]. Every log record is assigned a unique log sequence number (LSN) when that record is appended to the log. The LSNs are assigned in ascending sequence. Typically, they are the logical addresses of the corresponding log records. At times, version numbers or timestamps are also used as LSNs [67]. If more than one log is used for storing the log records relating to different pieces of data, then a form of two-phase commit protocol (e.g., the current industry-standard Presumed Abort protocol [63, 64]) must be used.

The nonvolatile version of the log is stored on what is generally called stable storage. Stable storage means nonvolatile storage which remains intact and available across system failures. Disk is an example of nonvolatile storage and its stability is generally improved by maintaining synchronously two identical copies of the log on different devices. We would expect the online log records stored on direct access storage devices to be archived to a cheaper and slower medium like tape at regular intervals. The archived log records may be discarded once the appropriate image copies (archive dumps) of the database have been produced and those log records are no longer needed for media recovery.

Whenever log records are written, they are placed first only in the volatile storage (i.e., virtual storage) buffers of the log file. Only at certain times (e.g., at commit time) are the log records up to a certain point (LSN) written, in log page sequence, to stable storage. This is called forcing the log up to that LSN. Besides forces caused by transaction and buffer manager activities, a system process may, in the background, periodically force the log buffers as they fill up.

For ease of exposition, we assume that each log record describes the update performed to only a single page. This is not a requirement of ARIES. In fact, in the Starburst [87] implementation of ARIES, sometimes a single log record might be written to describe updates to two pages.
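The LSN assignment and log-forcing behavior described above can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the ARIES implementation: the class name `Log`, its fields, and the use of in-memory lists to stand in for volatile buffers and stable storage are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Toy log: a volatile tail buffer plus a 'stable' prefix (both hypothetical)."""
    stable: list = field(default_factory=list)   # records already forced
    buffer: list = field(default_factory=list)   # records only in volatile storage
    next_lsn: int = 1

    def append(self, payload):
        """Assign the next LSN (ascending) and place the record in the buffer only."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.buffer.append((lsn, payload))
        return lsn

    def force(self, up_to_lsn):
        """Move buffered records with LSN <= up_to_lsn to stable storage, in order."""
        while self.buffer and self.buffer[0][0] <= up_to_lsn:
            self.stable.append(self.buffer.pop(0))

    @property
    def flushed_lsn(self):
        """Highest LSN known to be on stable storage (0 if none)."""
        return self.stable[-1][0] if self.stable else 0


log = Log()
first = log.append("T1 update page 7")
commit = log.append("T1 commit")
later = log.append("T2 update page 9")
log.force(commit)   # commit processing forces the log up to the commit record's LSN
```

After the force, the two records of T1 are on the stable prefix, while the later record of T2 is still only in the volatile buffer and would be lost in a system failure.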
The undo (respectively, redo) portion of a log record provides information on how to undo (respectively, redo) changes performed by the transaction. A log record which contains both the undo and the redo information is called an undo-redo log record. Sometimes, a log record may be written to contain only the redo information or only the undo information. Such a record is called a redo-only log record or an undo-only log record, respectively. Depending on the action that is performed, the undo-redo information may be recorded physically (e.g., before the update and after the update images or values of specific fields within the object) or operationally (e.g., add 5 to field 3 of record 15, subtract 3 from field 4 of record 10). Operation logging permits the use of high concurrency lock modes, which exploit the semantics of the operations performed on the data. For example, with certain operations, the same field of a record could have uncommitted updates of many transactions. These permit more concurrency than what is permitted by the strict executions property of the model of [3], which essentially says that modified objects must be locked exclusively (X mode) for commit duration.

ARIES uses the widely accepted write ahead logging (WAL) protocol. Some of the commercial and prototype systems based on WAL are IBM's AS/400(TM) [9, 21], CMU's Camelot [23, 90], IBM's DB2(TM) [1, 10, 11, 12, 13, 14, 15, 19, 35, 96], Unisys's DMS/1100 [27], Tandem's Encompass(TM) [4, 37], IBM's IMS [42, 43, 53, 76, 80, 94], Informix's Informix-Turbo(TM) [16], Honeywell's MRDS [91], Tandem's NonStop SQL(TM) [95], MCC's ORION [29], IBM's OS/2 Extended Edition(TM) Database Manager [7], IBM's QuickSilver [40], IBM's Starburst [87], SYNAPSE [78], IBM's System/38 [99], and DEC's VAX DBMS(TM) and VAX Rdb/VMS(TM) [81]. In WAL-based systems, an updated page is written back to the same nonvolatile storage location from where it was read.
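To make the distinction between physical and operation logging concrete, the following sketch contrasts the two on the paper's own example of adding 5 to field 3 of record 15. It is an illustration under assumptions, not the ARIES log record format: the `LogRecord` class, the helper names `physical_update` and `add_to_field`, and the representation of a record as a Python dict are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LogRecord:
    """Hypothetical log record carrying optional redo and undo actions."""
    redo: Optional[Callable[[dict], None]] = None
    undo: Optional[Callable[[dict], None]] = None

    @property
    def kind(self):
        if self.redo is not None and self.undo is not None:
            return "undo-redo"
        if self.redo is not None:
            return "redo-only"
        if self.undo is not None:
            return "undo-only"
        return "empty"


def physical_update(fieldno, before, after):
    """Physical logging: keep the before and after images of one field."""
    return LogRecord(
        redo=lambda rec: rec.__setitem__(fieldno, after),
        undo=lambda rec: rec.__setitem__(fieldno, before),
    )


def add_to_field(fieldno, delta):
    """Operation logging: keep the operation ('add delta') and its inverse."""
    def apply(rec, d):
        rec[fieldno] += d
    return LogRecord(
        redo=lambda rec: apply(rec, delta),
        undo=lambda rec: apply(rec, -delta),
    )


# Operation logging: two transactions update the same field, then T1 rolls back.
record15 = {3: 100}
t1 = add_to_field(3, 5)     # "add 5 to field 3 of record 15"
t2 = add_to_field(3, 20)    # a second transaction's uncommitted update
t1.redo(record15)
t2.redo(record15)
t1.undo(record15)           # T2's increment survives T1's rollback

# Physical logging of the same history: T1's undo restores its before image.
other = {3: 100}
p1 = physical_update(3, before=100, after=105)
p2 = physical_update(3, before=105, after=125)
p1.redo(other)
p2.redo(other)
p1.undo(other)              # T2's change is overwritten
```

The contrast suggests why before/after images force the field to be locked exclusively until commit, whereas commutative operations logged operationally can tolerate interleaved uncommitted updates.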
In other words, WAL-based systems perform in-place updating on nonvolatile storage. Contrast this with what happens in the shadow page technique which is used in systems such as System R [31] and SQL/DS [5] and which is illustrated in Figure 1. There the updated version of the page is written to a different location on nonvolatile storage and the previous version of the page is used for performing database recovery if the system were to fail before the next checkpoint.

Fig. 1. Shadow page technique. Logical page LP1 is read from physical page P1 and after modification is written to physical page P1'. P1' is the current version and P1 is the shadow version. During a checkpoint, the shadow version is discarded and the current version becomes the shadow version also. On a failure, database recovery is performed using the log and the shadow version of the database.

The WAL protocol asserts that the log records representing changes to some data must already be on stable storage before the changed data is allowed to replace the previous version of that data on nonvolatile storage. That is, the system is not allowed to write an updated page to the nonvolatile storage version of the database until at least the undo portions of the log records which describe the updates to the page have been written to stable storage. To enable the enforcement of this protocol, systems using the WAL method of recovery store in every page the LSN of the log record that describes the most recent update performed on that page. The reader is referred to [31, 97] for discussions about why the WAL technique is considered to be better than the shadow page technique. [16, 78] discuss methods in which shadowing is performed using a separate log. While these avoid some of the problems of the original shadow page approach, they still retain some of the important drawbacks and they introduce some new ones.
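The WAL rule stated above, together with the per-page LSN that makes it enforceable, can be sketched as a check made just before a page is written in place. This is a simplified illustration: `StableLog`, `Page`, `apply_update`, and `write_page` are hypothetical names, the "disk" is a dict, and whole log records are forced, although the rule as stated requires only that the undo portions reach stable storage first.

```python
class StableLog:
    """Hypothetical stand-in for the log manager; tracks the highest forced LSN."""

    def __init__(self):
        self.flushed_lsn = 0
        self.force_calls = []

    def force(self, up_to_lsn):
        self.force_calls.append(up_to_lsn)
        if up_to_lsn > self.flushed_lsn:
            self.flushed_lsn = up_to_lsn


class Page:
    def __init__(self, page_id):
        self.page_id = page_id
        self.page_lsn = 0   # LSN of the log record for the latest update to this page
        self.data = {}


def apply_update(page, lsn, key, value):
    """Update the buffer copy and stamp it with the LSN of the describing log record."""
    page.data[key] = value
    page.page_lsn = lsn


def write_page(page, log, disk):
    """Write a dirty page in place, forcing the log first when WAL requires it."""
    if page.page_lsn > log.flushed_lsn:
        log.force(page.page_lsn)
    disk[page.page_id] = (page.page_lsn, dict(page.data))


log = StableLog()
disk = {}
page7 = Page(7)
apply_update(page7, lsn=12, key="balance", value=50)
write_page(page7, log, disk)   # log is forced up to LSN 12, then the page is written
write_page(page7, log, disk)   # no second force: the log already covers LSN 12
```

Comparing the page's LSN with the highest forced LSN is what lets the writer decide, without scanning the log, whether a force is needed.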
Comments similar to those made about the separate-log shadowing methods of [16, 78] apply to the methods suggested in [82, 88]. Later, in Section 10, we show why some of the recovery paradigms of System R, which were based on the shadow page technique, are inappropriate in the WAL context, when we need support for high levels of concurrency and various other features that are described in Section 2.

Transaction status is also stored in the log and no transaction can be considered complete until its committed status and all its log data are safely recorded on stable storage by forcing the log up to the transaction's commit log record's LSN. This allows a restart recovery procedure to recover any transactions that completed successfully but whose updated pages were not physically written to nonvolatile storage before the failure of the system. This means that a transaction is not permitted to complete its commit processing (see [63, 64]) until the redo portions of all log records of that transaction have been written to stable storage.

We deal with three types of failures: transaction or process, system, and media or device. When a transaction or process failure occurs, typically the transaction would be in such a state that its updates would have to be undone. It is possible that the transaction had corrupted some pages in the buffer pool if it was in the middle of performing some updates when the process disappeared. When a system failure occurs, typically the virtual storage contents would be lost and the transaction system would have to be restarted and recovery performed using the nonvolatile storage versions of the database and the log. When a media or device failure occurs, typically the contents of that media would be lost and the lost data would have to be recovered using an image copy (archive dump) version of the lost data and the log.

Forward processing refers to the updates performed when the system is in normal (i.e., not restart recovery) processing and the transaction is updating the database because of the data manipulation (e.g., SQL) calls issued by the user or the application program. That is, the transaction is not rolling back and using the log to generate the (undo) update calls. Partial rollback refers to the ability to set up savepoints during the execution of a transaction and later in the transaction request the rolling back of the changes performed by the transaction since the establishment of a previous savepoint [1, 31]. This is to be contrasted with total rollback in which all updates of the transaction are undone and the transaction is terminated. Whether or not the savepoint concept is exposed at the application level is immaterial to us since this paper deals only with database recovery. A nested rollback is said to have taken place if a partial rollback were to be later followed by a total rollback or another partial rollback whose point of termination is an earlier point in the transaction than the point of termination of the first rollback. Normal undo refers to total or partial transaction rollback when the system is in normal operation. A normal undo may be caused by a transaction request to rollback or it may be system initiated because of deadlocks or errors (e.g., integrity constraint violations). Restart undo refers to transaction rollback during restart recovery after a system failure. To make partial or total rollback efficient and also to make debugging easier, all the log records written by a transaction are linked via the PrevLSN field of the log records in reverse chronological order.
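The per-transaction backward chain formed by the PrevLSN field can be sketched as follows. The sketch only shows which records a partial or total rollback would visit; it performs no undo and writes no log records during rollback. The names `LogRecord`, `Log`, `last_lsn`, and `backward_chain`, and the idea of representing a savepoint by the LSN of the transaction's latest record at that moment, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    lsn: int
    txn: str
    prev_lsn: Optional[int]   # previous record of the same transaction, if any
    action: str


class Log:
    def __init__(self):
        self.records = {}     # lsn -> LogRecord
        self.last_lsn = {}    # txn -> LSN of its most recent record

    def append(self, txn, action):
        lsn = len(self.records) + 1
        self.records[lsn] = LogRecord(lsn, txn, self.last_lsn.get(txn), action)
        self.last_lsn[txn] = lsn
        return lsn

    def backward_chain(self, txn, stop_after_lsn=0):
        """Yield txn's records newest first, stopping at LSNs <= stop_after_lsn."""
        lsn = self.last_lsn.get(txn)
        while lsn is not None and lsn > stop_after_lsn:
            rec = self.records[lsn]
            yield rec
            lsn = rec.prev_lsn


log = Log()
log.append("T1", "a")
log.append("T2", "x")
savepoint = log.append("T1", "b")   # savepoint taken after this record
log.append("T1", "c")
log.append("T2", "y")
log.append("T1", "d")

partial = [r.action for r in log.backward_chain("T1", stop_after_lsn=savepoint)]
total = [r.action for r in log.backward_chain("T1")]
```

Because each record points directly at its predecessor, the walk skips the interleaved records of T2 instead of scanning the whole log backward.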
In other words, the most recently written log record of a transaction points to the previous most recent log record written by that transaction, if there is such a log record.2 In many WAL-based systems, the updates performed during a rollback are logged using what are called compensation log records (CLRs) [15]. Whether a CLR's update is undone, should that CLR be encountered during a rollback, depends on the particular system. As we will see later, in ARIES, a CLR's update is never undone and hence CLRs are viewed as redo-only log records.

2 The AS/400, Encompass and NonStop SQL do not explicitly link all the log records written by a transaction. This makes undo inefficient since a sequential backward scan of the log must be performed to retrieve all the desired log records of a transaction.

Page-oriented redo is said to occur if the log record whose update is being redone describes which page of the database was originally modified during normal processing and if the same page is modified during the redo processing. No internal descriptors of tables or indexes need to be accessed to redo the update. That is, no other page of the database needs to be examined. This is to be contrasted with logical redo which is required in System R, SQL/DS and AS/400 for indexes [21, 62]. In those systems, since index changes are not logged separately but are redone using the log records for the data pages, performing a redo requires accessing several descriptors and pages of the database. The index tree would have to be retraversed to determine the page(s) to be modified and, sometimes, the index page(s) modified because of this redo operation may be different from the index page(s) originally modified during normal processing. Being able to perform page-oriented redo allows the system to provide recovery independence amongst objects. That is, the recovery of one page's contents does not require accesses to any other (data or catalog) pages of the database. As we will describe later, this makes media recovery very simple.

In a similar fashion, we can define page-oriented undo and logical undo. Being able to perform logical undos allows the system to provide higher levels of concurrency than what would be possible if the system were to be restricted only to page-oriented undos. This is because the former, with appropriate concurrency control protocols, would permit uncommitted updates of one transaction to be moved to a different page by another transaction. If one were restricted to only page-oriented undos, then the latter transaction would have had to wait for the former to commit. Page-oriented redo and page-oriented undo permit faster recovery since pages of the database other than the pages mentioned in the log records are not accessed. In the interest of efficiency, ARIES supports page-oriented redo and, in the interest of high concurrency, it supports logical undos. In [62], we introduce the ARIES/IM method for concurrency control and recovery in B+-tree indexes and show the advantages of being able to perform logical undos by comparing ARIES/IM with other index methods.

1.2 Latches and Locks

Normally latches and locks are used to control access to shared information. Locking has been discussed to a great extent in the literature. Latches, on the other hand, have not been discussed that much. Latches are like semaphores. Usually, latches are used to guarantee physical consistency of data, while locks are used to assure logical consistency of data. We need to worry about physical consistency since we need to support a multiprocessor environment. Latches are usually held for a much shorter period than are locks. Also, the deadlock detector is not informed about latch waits. Latches are requested in such a manner so as to avoid deadlocks involving latches alone, or involving latches and locks.

Acquiring and releasing a latch is much cheaper than acquiring and releasing a lock. In the no-conflict case, the overhead amounts to 10s of instructions for the former versus 100s of instructions for the latter. Latches are cheaper because the latch control information is always in virtual memory in a fixed place, and direct addressability to the latch information is possible given the latch name. As the protocols presented later in this paper and those in [57, 62] show, each transaction holds at most two or three latches simultaneously. As a result, the latch request blocks can be permanently allocated to each transaction and initialized with transaction ID, etc. right at the start of that transaction. On the other hand, typically, storage for individual locks has to be acquired, formatted and released dynamically, causing more instructions to be executed to acquire and release locks. This is advisable because, in most systems, the number of lockable objects is many orders of magnitude greater than the number of latchable objects. Typically, all information relating to locks currently held or requested by all the transactions is stored in a single, central hash table. Addressability to a particular lock's information is gained by first hashing the lock name to get the address of the hash anchor and then, possibly, following a chain of pointers. Usually, in the process of trying to locate the lock control block, because multiple transactions may be simultaneously reading and modifying the contents of the lock table, one or more latches will be acquired and released: one latch on the hash anchor and, possibly, one on the specific lock's chain of holders and waiters.

Locks may be obtained in different modes such as S (Shared), X (eXclusive), IX (Intention eXclusive), IS (Intention Shared) and SIX (Shared Intention eXclusive), and at different granularities such as record (tuple), table (relation), and file (tablespace) [32]. The S and X locks are the most common ones. S provides the read privilege and X provides the read and write privileges. Locks on a given object can be held simultaneously by different transactions only if those locks' modes are compatible. The compatibility relationships amongst the above modes of locking are shown in Figure 2. A check mark (✓) indicates that the corresponding modes are compatible. With hierarchical locking, the intention locks (IX, IS, and SIX) are generally obtained on the higher levels of the hierarchy (e.g., table), and the S and X locks are obtained on the lower levels (e.g., record). The nonintention mode locks (S and X), when obtained on an object at a certain level of the hierarchy, implicitly grant locks of the corresponding mode on the lower level objects of that higher level object. The intention mode locks, on the other hand, only give the privilege of requesting the corresponding intention or nonintention mode locks on the lower level objects. For example, SIX on a table implicitly grants S on all the records of that table, and it allows X to be requested explicitly on the records. Additional, semantically rich lock modes have been defined in the literature [2, 38, 45, 55] and ARIES can accommodate them.

Lock requests may be made with the conditional or the unconditional option. A conditional request means that the requestor is not willing to wait if, when the request is processed, the lock is not grantable immediately. An unconditional request means that the requestor is willing to wait until the lock becomes grantable. Locks may be held for different durations.
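The lock modes and the conditional/unconditional request options described above can be sketched as a small table lookup. The matrix below is the standard compatibility matrix for the S, X, IS, IX, and SIX modes from the hierarchical locking literature, which we assume is what Figure 2 depicts; the names `COMPATIBLE`, `compatible`, and `try_lock`, and the string return values, are hypothetical.

```python
# For each requested mode, the set of held modes it is compatible with.
COMPATIBLE = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}


def compatible(requested, held):
    """True if a lock in mode `requested` can coexist with one in mode `held`."""
    return held in COMPATIBLE[requested]


def try_lock(held_modes, requested, conditional=True):
    """Decide a request against the modes other transactions hold on the object.

    Returns 'granted', 'denied' (conditional request that cannot be granted
    immediately), or 'wait' (unconditional request that must wait).
    """
    if all(compatible(requested, held) for held in held_modes):
        return "granted"
    return "denied" if conditional else "wait"


table_holders = ["IS", "IX"]                  # modes held by other transactions
r1 = try_lock(table_holders, "IX")            # intention modes coexist
r2 = try_lock(["S"], "X")                     # conditional request fails at once
r3 = try_lock(["S"], "X", conditional=False)  # unconditional request waits
r4 = try_lock(["SIX"], "IS")                  # only IS is compatible with SIX
```

A real lock manager would also queue waiters and track durations; the sketch covers only the grant decision.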
An request for an instant duration lock means that the lock is not to be actually granted, call with the success duration locks are released long before transaction when the transaction The above discussions durations, except 1.3 but the lock manager has to delay returning status until the lock becomes grantable. Fine-Granularity Fine-granularity database systems some time termination. terminates, concerning for commit after they are acquired the lock Manual and, typically, Commit duration locks are released only i.e., after commit or rollback is completed. conditional duration, apply requests, to latches different modes, and also. Locking (e.g., record) locking has been supported by nonrelational (e.g., IMS [53, 76, 801) for a long time. Surprisingly, only few of the commercially locking, even though a available relational systems provide fine-granularity IBM’s System R [321, S/38 [991 and SQL/DS [51, and Tandem’s Encompass [37] supported record and/or key the beginning. 3 Although many interesting problems relating locking from to providing 3 Encompass and S/38 had only X locks for records and no locks were acquired these systems for reads. automatically ACM Transactions by on Database SyStanS, Vol. 17, No 1, March 1992 102 C. Mohan . Fig. 2. matrix Lock et al. m mode comparability lx + Slx 4 .’ fine-granularity locking in the context of WAL remain to be solved, the research community has not been paying enough attention to this area [3, 75, 88]. Some of the System R solutions worked only because of the use of the shadow page recovery technique in combination with 10). Supporting fine-granularity locking and variable flexible fashion requires addressing some interesting issues which have never really been discussed in locking length storage the (see Section records in a management database literature. Unfortunately, some of the interesting techniques that were developed for System R and which are now part of SQL/DS did not get documented in the literature. 
At the expense of making this paper long, we will be discussing here some of those problems and their solutions.

As supporting high concurrency gains importance (see [79] for the description of an application requiring very high concurrency) and as object-oriented systems gain in popularity, it becomes necessary to invent concurrency control and recovery methods that take advantage of the semantics of the operations on the data [2, 26, 38, 88, 89], and that support fine-granularity locking efficiently. Object-oriented systems may tend to encourage users to define a large number of small objects and users may expect object instances to be the appropriate granularity of locking. In the object-oriented logical view of the database, the concept of a page, with its physical orientation as the container of objects, becomes unnatural to think about as the unit of locking during object accesses and modifications. Also, object-oriented system users may tend to have many terminal interactions during the course of a transaction, thereby increasing the lock hold times. If the unit of locking were to be a page, lock wait times and deadlock possibilities will be aggravated. Other discussions concerning transaction management in an object-oriented environment can be found in [22, 29].

As more and more customers adopt relational systems for production applications, it becomes ever more important to handle hot-spots [28, 34, 68, 77, 79, 83] and storage management without requiring too much tuning by the system users or administrators. Since relational systems have been welcomed to a great extent because of their ease of use, it is important that we pay greater attention to this area than what has been done in the context of the nonrelational systems.
Apart from the need for high concurrency for user data, the ease with which online data definition operations can be performed in relational systems by even ordinary users requires the support for high concurrency of access to, at least, the catalog data. Since a leaf page in an index typically describes data in hundreds of data pages, page-level locking of index data is just not acceptable. A flexible recovery method that allows the support of high levels of concurrency during index accesses is needed.

The above facts argue for supporting semantically rich modes of locking such as increment/decrement which allow multiple transactions to concurrently modify even the same piece of data. In funds-transfer applications, increment and decrement operations are frequently performed on the branch and teller balances by numerous transactions. If those transactions are forced to use only X locks, then they will be serialized, even though their operations commute.

1.4 Buffer Management

The buffer manager (BM) is the component of the transaction system that manages the buffer pool and does I/Os to read/write pages from/to the nonvolatile storage version of the database. The fix primitive of the BM may be used to request the buffer address of a logical page in the database. If the requested page is not in the buffer pool, BM allocates a buffer slot and reads the page in. There may be instances (e.g., during a B+-tree page split, when the new page is allocated) where the current contents of a page on nonvolatile storage are not of interest. In such a case, the fix-new primitive may be used to make the BM allocate a free slot and return the address of that slot, if BM does not find the page in the buffer pool. The fix-new invoker will then format the page as desired.
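The difference between the fix and fix-new primitives described above can be sketched with a toy buffer manager. This is an illustration under assumptions rather than the interface of any real system: the class `BufferManager`, the method names `fix` and `fix_new`, the `reads` counter, the fixed `page_size`, and the use of dicts for the buffer pool and for nonvolatile storage are all hypothetical, and page replacement is omitted.

```python
class BufferManager:
    """Toy buffer manager showing only the fix and fix-new primitives."""

    def __init__(self, disk):
        self.disk = disk    # page_id -> bytes; stands in for nonvolatile storage
        self.pool = {}      # page_id -> bytearray; the buffer slots
        self.reads = 0      # number of read I/Os issued

    def fix(self, page_id):
        """Return the buffer copy of the page, reading it in on a miss."""
        if page_id not in self.pool:
            self.reads += 1
            self.pool[page_id] = bytearray(self.disk[page_id])
        return self.pool[page_id]

    def fix_new(self, page_id, page_size=16):
        """Return a slot for the page without reading its old contents."""
        if page_id not in self.pool:
            self.pool[page_id] = bytearray(page_size)   # free slot, no read I/O
        return self.pool[page_id]


bm = BufferManager({1: b"old page 1 data."})
p1 = bm.fix(1)          # miss: one read I/O brings page 1 into the pool
p2 = bm.fix_new(2)      # new page: a slot is allocated, nothing is read
p2[:4] = b"leaf"        # the invoker formats the new page as desired
```

Only the first call issues a read; the new page never touches nonvolatile storage until it is eventually written out.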
Once a page is fixed in the buffer pool, the corresponding buffer slot is not available for page replacement until the unfix primitive is issued by the data manipulative component. Actually, for each page, BM keeps a fix count which is incremented by one during every fix operation and which is decremented by one during every unfix operation. A page in the buffer pool is said to be dirty if the buffer version of the page has some updates which are not yet reflected in the nonvolatile storage version of the same page. The fix primitive is also used to communicate the intention to modify the page. Dirty pages can be written back to nonvolatile the modification read accesses to the page while storage it is being of BM when no fix in writing with in the nonvolatile storage if a system failure were buffer pool pages in the other pages background, to reduce without is held, out. basis, of redo work that and also to keep a certain nondirty state synchronous write so that 1/0s they having thus allowing [96] discusses on a continuous the amount to occur intention written the role dirty would pages percentage may to be needed be replaced to be performed of the with at the time of replacement. While performing those writes, BM ensures that the WAL protocol is obeyed. As a consequence, BM may have to force the log up to the LSN of the dirty page before writing the page to nonvolatile storage. Given the large of this nature transactions buffer pools that to be very rare committing are common today, we would expect a force and most log forces to occur because of or entering the prepare state. BM also implements the support for latching pages. To provide direct addressability to page latches and to reduce the storage associated with those latches, the latch on a logical page is actually the latch on the corresponding buffer slot. This means that a logical page can be latched only after it is fixed ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. 104 C. 
in the buffer pool and the latch has to be released before the page is unfixed. These are highly acceptable conditions. The latch control information is stored in the buffer control block (BCB) for the corresponding buffer slot. The BCB also contains the identity of the logical page, what the fix count is, the dirty status of the page, etc.

Buffer management policies differ among the many systems in existence (see Section 11, "Other WAL-Based Methods"). If a page modified by a transaction is allowed to be written to the permanent database on nonvolatile storage before that transaction commits, then the steal policy is said to be followed by the buffer manager (see [36] for such terminologies). Otherwise, a no-steal policy is said to be in effect. Steal implies that during normal or restart rollback, some undo work might have to be performed on the nonvolatile storage version of the database. If a transaction is not allowed to commit until all pages modified by it are written to the permanent version of the database, then a force policy is said to be in effect. Otherwise, a no-force policy is said to be in effect. With a force policy, during restart recovery, no redo work will be necessary for committed transactions. Deferred updating is said to occur if, even in the virtual storage database buffers, the updates are not performed in-place when the transaction issues the corresponding database calls. The updates are kept in a pending list elsewhere and are performed in-place, using the pending list information, only after it is determined that the transaction is definitely committing. If the transaction needs to be rolled back, then the pending list is discarded or ignored. The deferred updating policy has implications on whether a transaction can "see" its own updates or not, and on whether partial rollbacks are possible or not.

For more discussions concerning buffer management, see [8, 15, 24, 96].

1.5 Organization

The rest of the paper is organized as follows. After stating our goals in Section 2 and giving an overview of the new recovery method ARIES in Section 3, we present, in Section 4, the important data structures used by ARIES during normal and restart recovery processing. Next, in Section 5, the protocols followed during normal processing are presented followed, in Section 6, by the description of the processing performed during restart recovery. The latter section also presents ways to exploit parallelism during recovery and methods for performing recovery selectively or postponing the recovery of some of the data. Then, in Section 7, algorithms are described for taking checkpoints during the different log passes of restart recovery to reduce the impact of failures during recovery. This is followed, in Section 8, by the description of how fuzzy image copying and media recovery are supported. Section 9 introduces the significant notion of nested top actions and presents a method for implementing them efficiently. Section 10 describes and critiques some of the existing recovery paradigms which originated in the context of the shadow page technique and System R. We discuss the problems caused by using those paradigms in the WAL context. Section 11 describes in detail the characteristics of many of the WAL-based recovery methods in use in different systems such as IMS, DB2, Encompass and NonStop SQL. Section 12 outlines the many different properties of ARIES. We conclude by summarizing, in Section 13, the features of ARIES which provide flexibility and efficiency, and by describing the extensions and the current status of the implementations of ARIES.

Besides presenting a new recovery method, by way of motivation for our work, we also describe some previously unpublished aspects of recovery in System R.
For comparison purposes, we also do a survey of the recovery methods used by other WAL-based systems and collect information appearing in several publications, many of which are not widely available. One of our aims in this paper is to show the intricate and unobvious interactions resulting from the different choices made for the recovery technique, the granularity of locking and the storage management scheme. One cannot make arbitrarily independent choices for these and still expect the combination to function together correctly and efficiently. This point needs to be emphasized as it is not always dealt with adequately in most papers and books on concurrency control and recovery. In this paper, we have tried to cover, as much as possible, all the interesting recovery-related problems that one encounters in building and operating an industrial-strength transaction processing system.

2. GOALS

This section lists the goals of our work and outlines the difficulties involved in designing a recovery method that supports the features that we aimed for. The goals relate to the metrics for comparison of recovery methods that we discussed earlier, in Section 1.1.

Simplicity. Concurrency and recovery are complex subjects to think about and program for, compared with other aspects of data management. The algorithms are bound to be error-prone, if they are complex. Hence, we strived for a simple, yet powerful and flexible, algorithm. Although this paper is long because of its comprehensive discussion of numerous problems that are mostly ignored in the literature, the main algorithm itself is quite simple. Hopefully, the overview presented in Section 3 gives the reader that feeling.

Operation logging. The recovery method had to permit operation logging (and value logging) so that semantically rich lock modes could be supported.
This would let one transaction modify the same data that was modified earlier by another transaction which has not yet committed, when the two transactions' actions are semantically compatible (e.g., increment/decrement operations; see [2, 26, 45, 88]). As should be clear, recovery methods which always perform value or state logging (i.e., logging before-images and after-images of modified data), cannot support operation logging. This includes systems that do very physical (byte-oriented) logging of all changes to a page [6, 76, 81]. The difficulty in supporting operation logging is that we need to track precisely, using a concept like the LSN, the exact state of a page with respect to logged actions relating to that page. An undo or a redo of an update should not be performed without being sure that the original update is present or is not present, respectively. This also means that, if one or more transactions that had previously modified a page start rolling back, then we need to know precisely how the page has been affected during the rollbacks and how much of each of the rollbacks had been accomplished so far. This requires that updates performed during rollbacks also be logged via the so-called compensation log records (CLRs). The LSN concept lets us avoid attempting to redo an operation when the operation's effect is already present in the page. It also lets us avoid attempting to undo an operation when the operation's effect is not present in the page. Operation logging lets us perform, if found desirable, logical logging, which means that not everything that was changed on a page needs to be logged explicitly, thereby saving log space. For example, changes of control information, like the amount of free space on the page, need not be logged. The redo and the undo operations can be performed logically. For a good discussion of operation and value logging, see [88].
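Why operation logging needs an LSN-like page state can be seen in a small sketch. An increment record carries only a delta, so reapplying it to a page that already reflects it would corrupt the value; comparing the page's LSN with the record's LSN avoids that. The class and field names below are illustrative assumptions, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_lsn: int   # LSN of the latest logged update reflected in the page
    balance: int    # the single field kept on this toy page


@dataclass
class IncrementRecord:
    lsn: int
    amount: int     # operation logging: only the delta, no before/after image


def redo(page, rec):
    """Reapply rec only if the page does not already reflect it."""
    if page.page_lsn < rec.lsn:
        page.balance += rec.amount
        page.page_lsn = rec.lsn


log = [IncrementRecord(10, 5), IncrementRecord(20, -2), IncrementRecord(30, 7)]

# Stored version of the page: the updates up to LSN 20 are already present
# (starting balance 100, so 100 + 5 - 2 = 103).
page = Page(page_lsn=20, balance=103)

for rec in log:      # only the record with LSN 30 is actually reapplied
    redo(page, rec)
for rec in log:      # a repeated scan (e.g., after another failure) is a no-op
    redo(page, rec)
```

Without the LSN comparison, the first two increments would be applied a second time and the balance would be wrong; with value logging (after-images), blind reapplication would have been harmless.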
like the The redo and the undo For a good discussion of operation and Efficient support for the storage and manipFlexible storage management. ulation of varying length data is important. In contrast to systems like IMS, the intent here is to be able to avoid the need for off-line reorganization of the data to garbage collect any space that might have been freed up because of deletions and updates that caused data shrinkage. It is desirable that the that the the data logging within moved this that data recovery method and locking a page for to be locked and the concurrency or the movements also means that one transaction must page currently has some uncommitted tion. This may lead to log; logical undos may a transaction that has space during its later permit this Partial in data rollbacks. control method be such is logical in nature so that movements garbage collection reasons do not cause to be logged. For an of the index, be able to split a leaf page even if data inserted by another transac- problems in performing page-oriented undos using the be necessary. Further, we would like to be able to let freed up some space be able to use, if necessary, that insert activity [50]. System R, for example, does not pages. It was essential that the new recovery method sup- port the concept of savepoints and rollbacks to savepoints (i.e., partial rollbacks). This is crucial for handling, in a user-friendly fashion (i. e., without requiring a total rollback of the transaction), integrity constraint violations information Flexible (see [1, 311), and (see [49]). buffer problems management. arising The recovery from using method should obsolete make cached the least number of restrictive assumptions about the buffer management policies (steal, force, etc.) in effect. At the same time, the method must be able to take advantage of the characteristics of any specific policy that is in effect (e.g., with a force policy there is no need to perform any redos for committed transactions.) 
This flexibility could result in increased concurrency, decreased I/Os and efficient usage of buffer storage. Depending on the policies, the work that needs to be performed during restart recovery after a system failure or during media recovery may be more or less complex. Even with large main memories, it must be noted that a steal policy is still very desirable. This is because, with a no-steal policy, a page may never get written to nonvolatile storage if the page always contains uncommitted updates due to fine-granularity locking and overlapping transactions' updates to that page. The situation would be further aggravated if there are long-running transactions. Under those conditions, the system would have either to frequently reduce concurrency by quiescing all activities on the page (i.e., by locking all the objects on the page) and then writing the page to nonvolatile storage, or to do nothing special and then pay a huge restart redo recovery cost if the system were to fail. Also, a no-steal policy incurs additional bookkeeping overhead to track whether a page contains any uncommitted updates. We believe that, given our goal of supporting semantically rich lock modes, partial rollbacks and varying length objects efficiently, in the general case, we need to perform undo logging and in-place updating. Hence, methods like the transaction workspace model of AIM [46] are not general enough for our purposes. Other problems relating to no-steal are discussed in Section 11 with reference to IMS Fast Path.

Recovery independence. It should be possible to image copy (archive dump), and perform media recovery or restart recovery at different granularities, rather than only at the entire database level. The recovery of one object should not force the concurrent or lock-step recovery of another object.
Contrast this with what happens in the shadow page technique as implemented in System R, where index and space management information are recovered lock-step with user and catalog table (relation) data by starting from an internally consistent state of the whole database and redoing changes to all the related objects of the database simultaneously, as in normal processing. Recovery independence means that, during the restart recovery of some object, catalog information in the database cannot be accessed for descriptors of that object and its related objects, since that information itself may be undergoing recovery in parallel with the object being recovered and the two may be out of synchronization [14]. During restart recovery, it should be possible to do selective recovery and defer recovery of some objects to a later point in time to speed up restart and also to accommodate some offline devices. Page-oriented recovery means that even if one page in the database is corrupted because of a process failure or a media problem, it should be possible to recover that page alone. To be able to do this efficiently, we need to log every page's change individually, even if the object being updated spans multiple pages and the update affects more than one page. This, in conjunction with the writing of CLRs for updates performed during rollbacks, will make media recovery very simple (see Section 8). This will also permit image copying of different objects to be performed independently and at different frequencies.

Logical undo. This relates to the ability, during undo, to affect a page that is different from the one modified during forward processing, as is needed in the earlier-mentioned context of the split by one transaction of an index page containing uncommitted data of another transaction. Being able to perform logical undos allows higher levels of concurrency to be supported, especially in search structures [57, 59, 62]. If logging is not performed during rollback processing, logical undos would be very difficult to support, if we also desired recovery independence and page-oriented recovery. System R and SQL/DS support logical undos, but at the expense of recovery independence.

Parallelism and fast recovery. With multiprocessors becoming very common and greater data availability becoming increasingly important, the recovery method has to be able to exploit parallelism during the different stages of restart recovery and during media recovery. It is also important that the recovery method be such that recovery can be very fast, if in fact a hot-standby approach is going to be used (a la IBM's IMS/VS XRF [43] and Tandem's NonStop [4, 37]). This means that redo processing and, whenever possible, undo processing should be page-oriented (cf. always logical redos and undos in System R and SQL/DS for indexes and space management). It should also be possible to let the backup system start processing new transactions, even before the undo processing for the interrupted transactions completes. This is necessary because undo processing may take a long time if there were long update transactions.

Minimal overhead. Our goal is to have good performance both during normal and restart recovery processing. The overhead (log data volume, storage consumption, etc.) imposed by the recovery method in virtual and nonvolatile storages for accomplishing the above goals should be minimal. Contrast this with the space overhead caused by the shadow page technique. This goal also implied that we should minimize the number of pages that are modified (dirtied) during restart. The idea is to reduce the number of pages that have to be written back to nonvolatile storage and also to reduce CPU overhead.
This rules out methods which, during restart recovery, first undo some committed changes that had already reached the nonvolatile storage before the failure and then redo them (see, e.g., [16, 21, 72, 78, 88]). It also rules out methods in which updates that are not present in a page on nonvolatile storage are undone unnecessarily (see, e.g., [41, 71, 88]). The method should not cause deadlocks involving transactions that are already rolling back. Further, the writing of CLRs should not result in an unbounded number of log records having to be written for a transaction because of the undoing of CLRs, if there were nested rollbacks or repeated system failures during rollbacks. It should also be possible to take checkpoints and image copies without quiescing significant activities in the system. The impact of these operations on other activities should be minimal. To contrast, checkpointing and image copying in System R cause major perturbations in the rest of the system [31].

As the reader will have realized by now, some of these goals are contradictory. Based on our knowledge of different developers' existing systems' features, experiences with IBM's existing transaction systems and contacts with customers, we made the necessary tradeoffs. We were keen on learning from the past successes and mistakes involving many prototypes and products.

3. OVERVIEW OF ARIES

The aim of this section is to provide a brief overview of the new recovery method ARIES, which satisfies quite reasonably the goals that we set forth in Section 2. Issues like deferred and selective restart, parallelism during restart recovery, and so on will be discussed in the later sections of the paper.

ARIES guarantees the atomicity and durability properties of transactions in the face of process, transaction, system and media failures.
For this purpose, ARIES keeps track of the changes made to the database by using a log and it does write-ahead logging (WAL). Besides logging, on a per-affected-page basis, update activities performed during forward processing of transactions, ARIES also logs, typically using compensation log records (CLRs), updates performed during partial or total rollbacks of transactions during both normal and restart processing. Figure 3 gives an example of a partial rollback in which a transaction, after performing three updates, rolls back two of them and then starts going forward again. Because of the undo of the two updates, two CLRs are written. In ARIES, CLRs have the property that they are redo-only log records. By appropriate chaining of the CLRs to log records written during forward processing, a bounded amount of logging is ensured during rollbacks, even in the face of repeated failures during restart or of nested rollbacks. This is to be contrasted with what happens in IMS, which may undo the same non-CLR multiple times, and in AS/400, DB2 and NonStop SQL, which, besides undoing the same non-CLR multiple times, may also undo CLRs one or more times (see Figure 4). These have caused severe problems in real-life customer situations.

In ARIES, as Figure 5 shows, when the undo of a log record causes a CLR to be written, the CLR, besides containing a description of the compensating action for redo purposes, is made to contain the UndoNxtLSN pointer which points to the predecessor of the just undone log record. The predecessor information is readily available since every log record, including a CLR, contains the PrevLSN pointer which points to the most recent preceding log record written by the same transaction. The UndoNxtLSN pointer allows us to determine precisely how much of the transaction has not been undone so far. In Figure 5, log record 3', which is the CLR for log record 3, points to log record 2, which is the predecessor of log record 3.
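The chaining just described can be sketched for a single transaction. The following is a minimal illustration under stated assumptions: LSNs are small consecutive integers, the log is an in-memory dictionary, and the names Rec, append and rollback are hypothetical. The actual page changes are omitted; only the log chaining is shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rec:
    lsn: int
    kind: str                    # 'update' or 'compensation'
    prev_lsn: int                # PrevLSN: previous record of this transaction
    undo_nxt_lsn: int = 0        # UndoNxtLSN: meaningful in CLRs only
    compensates: Optional[int] = None


log = {}        # lsn -> Rec
last_lsn = 0    # LSN of the transaction's most recent log record


def append(kind, undo_nxt_lsn=0, compensates=None):
    global last_lsn
    lsn = len(log) + 1
    log[lsn] = Rec(lsn, kind, last_lsn, undo_nxt_lsn, compensates)
    last_lsn = lsn
    return lsn


def rollback(save_lsn):
    """Undo the updates with LSN > save_lsn; return the LSNs undone."""
    undone = []
    nxt = last_lsn
    while nxt > save_lsn:
        rec = log[nxt]
        if rec.kind == 'compensation':
            nxt = rec.undo_nxt_lsn        # CLRs are never undone: jump over
        else:
            # (the compensating change to the page would be applied here)
            append('compensation', undo_nxt_lsn=rec.prev_lsn,
                   compensates=rec.lsn)
            undone.append(rec.lsn)
            nxt = rec.prev_lsn
    return undone


for _ in range(3):
    append('update')       # updates 1, 2, 3
first = rollback(1)        # partial rollback: undoes 3 and 2, writes two CLRs
for _ in range(2):
    append('update')       # forward again (LSNs 6 and 7 in this numbering)
second = rollback(0)       # total rollback: skips what was already undone
```

In the total rollback, the CLR written for update 2 sends the scan directly to update 1, so updates 2 and 3 are not compensated a second time and no CLR is itself undone. The transaction ends with five CLRs for its five updates.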
Thus, during rollback, the UndoNxtLSN field of the most recently written CLR keeps track of the progress of rollback. It tells the system from where to continue the rollback of the transaction, if a system failure were to interrupt the completion of the rollback or if a nested rollback were to be performed. It lets the system bypass those log records that had already been undone. Since CLRs are available to describe what actions are actually performed during the undo of an original action, the undo action need not be, in terms of which page(s) is affected, the exact inverse of the original action. That is, logical undo which allows very high concurrency to be supported is made possible. For example, a key inserted on page 10 of a B+-tree by one transaction may be moved to page 20 by another transaction before the key insertion is committed. Later, if the first transaction were to roll back, then the key will be located on page 20 by retraversing the tree and deleted from there. A CLR will be written to describe the key deletion on page 20. This permits page-oriented redo which is very efficient. [59, 62] describe ARIES/LHS and ARIES/IM which exploit this logical undo feature.

Fig. 3. Partial rollback example. After performing 3 actions, the transaction performs a partial rollback by undoing actions 3 and 2, writing the compensation log records 3' and 2', and then starts going forward again and performs actions 4 and 5 (log: 1 2 3 3' 2' 4 5).

Fig. 4. Problem of compensating compensations or duplicate compensations, or both. I' is the CLR for I and I'' is the CLR for I'. The figure contrasts the log before the failure with the records written during restart by DB2, S/38, Encompass and AS/400, and by IMS.

ARIES uses a single LSN on each page to track the page's state. Whenever a page is updated and a log record is written, the LSN of the log record is placed in the page_LSN field of the updated page. This
exploit Whenever of the log record is tagging of the page with the LSN allows ARIES to precisely track, for restartand mediarecovery purposes, the state of the page with respect to logged updates for that page. It allows ARIES to support novel lock modes! using which, before an update performed on a record’s field by one transaction is committed, another transaction may be permitted to modify the same data for specified operations. Periodically during checkpoint log records and the modified needed begin normal identify processing, ARIES takes checkpoints. the transactions that are active, their LSNS of their most recently written log records, data (dirty data) that is in the buffer pool. The latter to determine from where the redo pass of restart its processing. ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. The states, and also information recovery the is should ARIES: A Transaction Recovery Method Before 12 ‘,; \\ Log -% 111 Failure 3 3’ 2’ 1! ) F i-. ?% / / -=---/ / ------ --- During I . Restart ,, ----------------------------------------------+1 I’ is the Compensation Log Record for I I’ points to the predecessor, if any, of I Fig. 5. During from this ARIES’ restart the first analysis technique recovery record pass, for avoiding compensating compensations. (see Figure of the last information 6), ARIES checkpoint, about compensation dirty first and duplicate scans the log, starting up to the end of the pages log. and transactions During that were in progress at the time of the checkpoint is brought up to date as of the end of the log. The analysis pass uses the dirty pages information to determine the starting point ( li!edoLSIV) for the log scan of the immediately pass. The analysis pass also determines the list of transactions rolled back in the undo pass. For each in-progress transaction, most recently written log record will also be determined. 
following redo that are to be the LSN of the Then, during the redo pass, ARIES repeats history, with respect to those updates logged on stable storage, but whose effects on the database pages did not get reflected on nonvolatile storage before the failure of the system. This is done for the updates of all transactions, including the updates of those transactions that had neither committed nor reached the in-doubt state of two-phase commit by the time loser of the system transactions failure are (i.e., redone). even the missing This essentially updates reestablishes of the so-called the state of the database as of the time of the system failure. A log record’s update is redone if the affected page’s page-LSN is less than the log record’s LSN. No logging is performed when updates are redone. The redo pass obtains the locks needed to protect the uncommitted updates of those distributed transactions that will remain in the in-doubt (prepared) state [63, 64] at the end of restart The updates recovery. next log pass are rolled is the back, undo in reverse pass during chronological which order, all loser transactions’ in a single sweep of the log. This is done by continually taking the maximum of the LSNS of the next log record to be processed for each of the yet-to-be-completely-undone loser transactions, until no transaction remains to be undone. Unlike during the redo pass, performing undos is not a conditional operation during the undo pass (and during normal undo). That is, ARIES does not compare the page.LSN of the affected page to the LSN of the log record to decide ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. 112 C. Mohan et al . m Log @ Checkpoint r’ Follure i DB2 System Analysis I Undo Losers / *————— ——. R Redo Nonlosers —— — ————,& Redo Nonlosers . ------ IMS “––-––––––X* ..:-------- (FP Updates) 1 ------- ARIES Redo ALL Undo Losers .-:”--------- Fig. 6, whether or not transaction to undo during the Restart the processing update. 
When a non-CLR is encountered for a transaction during the undo pass, if it is an undo-redo or undo-only log record, then its update is undone. In any case, the next record to process for that transaction is determined by looking at the PrevLSN of that non-CLR. Since CLRs are never undone (i.e., CLRs are not compensated; see Figure 5), when a CLR is encountered during undo, it is used just to determine the next log record to process by looking at the UndoNxtLSN field of the CLR.

For those transactions which were already rolling back at the time of the system failure, ARIES will rollback only those actions that had not already been undone. This is possible since history is repeated for such transactions and since the last CLR written for each transaction points (directly or indirectly) to the next non-CLR record that is to be undone. The net result is that, if only page-oriented undos are involved or logical undos generate only CLRs, then, for rolled back transactions, the number of CLRs written will be exactly equal to the number of undoable log records written during forward processing of those transactions. This will be the case even if there are repeated failures during restart or if there are nested rollbacks.

4. DATA STRUCTURES

This section describes the major data structures that are used by ARIES.

4.1 Log Records

Below, we describe the important fields that may be present in different types of log records.

LSN. Address of the first byte of the log record in the ever-growing log address space. This is a monotonically increasing value. This is shown here as a field only to make it easier to describe ARIES. The LSN need not actually be stored in the record.

Type. Indicates whether this is a compensation record ('compensation'), a regular update record ('update'), a commit protocol-related record (e.g., 'prepare'), or a nontransaction-related record (e.g., 'OSfile_return').

TransID. Identifier of the transaction, if any, that wrote the log record.

PrevLSN. LSN of the preceding log record written by the same transaction. This field has a value of zero in nontransaction-related records and in the first log record of a transaction, thus avoiding the need for an explicit begin transaction log record.

PageID. Present only in records of type 'update' or 'compensation'. The identifier of the page to which the updates of this record were applied. This PageID will normally consist of two parts: an objectID (e.g., tablespaceID), and a page number within that object. ARIES can deal with a log record that contains updates for multiple pages. For ease of exposition, we assume that only one page is involved.

UndoNxtLSN. Present only in CLRs. It is the LSN of the next log record of this transaction that is to be processed during rollback. That is, UndoNxtLSN is the value of PrevLSN of the log record that the current log record is compensating. If there are no more log records to be undone, then this field contains a zero.

Data. This is the redo and/or undo data that describes the update that was performed. CLRs contain only redo information since they are never undone. Updates can be logged in a logical fashion. Changes to some fields (e.g., amount of free space) of that page need not be logged since they can be easily derived. The undo information and the redo information for the entire object need not be logged. It suffices if the changed fields alone are logged. For increment or decrement types of operations, before and after-images of the field are not needed. Information about the type of operation and the decrement or increment amount is enough. The information here would also be used to determine the appropriate action routine to be used to perform the redo and/or undo of this log record.
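The log record fields above, together with the undo-pass rules of Section 3, can be put together in a short sketch. It is illustrative only: the Data field and the page changes are omitted, LSNs are assumed to equal list positions plus one, and the starting points handed to undo_pass are, for illustration, each loser's most recent log record so that the CLR case is exercised.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int
    type: str               # 'update' or 'compensation'
    trans_id: str
    prev_lsn: int           # 0 in the first record of a transaction
    page_id: str
    undo_nxt_lsn: int = 0   # present in CLRs only


def undo_pass(log, undo_next):
    """Roll back all losers in one backward sweep.

    log: list of LogRecord in LSN order (assumption: LSN == index + 1).
    undo_next: trans_id -> LSN of the next record to process for that loser.
    Returns the LSNs of the records undone, in the order they were undone.
    """
    last_lsn = {}
    for rec in log:
        last_lsn[rec.trans_id] = rec.lsn
    undone = []
    todo = {t: lsn for t, lsn in undo_next.items() if lsn}
    while todo:
        # always process the record with the maximum LSN next
        trans_id, lsn = max(todo.items(), key=lambda item: item[1])
        rec = log[lsn - 1]
        if rec.type == 'compensation':
            nxt = rec.undo_nxt_lsn          # CLRs are never undone
        else:
            nxt = rec.prev_lsn
            clr = LogRecord(len(log) + 1, 'compensation', trans_id,
                            last_lsn[trans_id], rec.page_id, undo_nxt_lsn=nxt)
            log.append(clr)                 # undo is logged via a CLR
            last_lsn[trans_id] = clr.lsn
            undone.append(rec.lsn)
        if nxt:
            todo[trans_id] = nxt
        else:
            del todo[trans_id]              # transaction completely undone
    return undone


log = [
    LogRecord(1, 'update', 'T1', 0, 'P1'),
    LogRecord(2, 'update', 'T2', 0, 'P2'),
    LogRecord(3, 'update', 'T1', 1, 'P3'),
    LogRecord(4, 'update', 'T2', 2, 'P1'),
    # T1 was already rolling back at the failure: record 3 is compensated
    LogRecord(5, 'compensation', 'T1', 3, 'P3', undo_nxt_lsn=1),
]
undone = undo_pass(log, {'T1': 5, 'T2': 4})
```

Record 3 is not undone again, because the CLR at LSN 5 points past it to record 1; the three new CLRs carry UndoNxtLSN values 2, 0 and 0.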
4.2 Page Structure

One of the fields in every page of the database is the page_LSN field. It contains the LSN of the log record that describes the latest update to the page. This record may be a regular update record or a CLR. ARIES expects the buffer manager to enforce the WAL protocol. Except for this, ARIES does not place any restrictions on the buffer page replacement policy. The steal buffer management policy may be used. In-place updating is performed on nonvolatile storage. Updates are applied immediately and directly to the buffer version of the page containing the object. That is, no deferred updating as in INGRES [86] is performed. If it is found desirable, deferred updating and, consequently, deferred logging can be implemented. ARIES is flexible enough not to preclude those policies from being implemented.

4.3 Transaction Table

A table called the transaction table is used during restart recovery to track the state of active transactions. The table is initialized during the analysis pass from the most recent checkpoint's record(s) and is modified during the analysis of the log records written after the beginning of that checkpoint. During the undo pass, the entries of the table are also modified. If a checkpoint is taken during restart recovery, then the contents of the table will be included in the checkpoint record(s). The same table is also used during normal processing by the transaction manager. A description of the important fields of the transaction table follows:

TransID. Transaction ID.

State. Commit state of the transaction: prepared ('P', also called in-doubt) or unprepared ('U').

LastLSN. The LSN of the latest log record written by the transaction.

UndoNxtLSN. The LSN of the next record to be processed during rollback. If the most recent log record written or seen for this transaction is an undoable non-CLR log record, then this field's value will be set to LastLSN. If that most recent log record is a CLR, then this field's value is set to the UndoNxtLSN value from that CLR.

4.4 Dirty_Pages Table

A table called the dirty_pages table is used to represent information about dirty buffer pages during normal processing. This table is also used during restart recovery. The actual implementation of this table may be done using hashing or via the deferred-writes queue mechanism of [96]. Each entry in the table consists of two fields: PageID and RecLSN (recovery LSN). During normal processing, when a nondirty page is being fixed in the buffers with the intention to modify, the buffer manager records in the buffer pool (BP) dirty_pages table, as RecLSN, the current end-of-log LSN, which will be the LSN of the next log record to be written. The value of RecLSN indicates from what point in the log there may be updates which are, possibly, not yet in the nonvolatile storage version of the page. Whenever pages are written back to nonvolatile storage, the corresponding entries in the BP dirty_pages table are removed. The contents of this table are included in the checkpoint record(s) that is written during normal processing. The restart dirty_pages table is initialized from the latest checkpoint's record(s) and is modified during the analysis of the other records during the analysis pass. The minimum RecLSN value in the table gives the starting point for the redo pass during restart recovery.

5. NORMAL PROCESSING

This section discusses the actions that are performed as part of normal transaction processing. Section 6 discusses the actions that are performed as part of recovering from a system failure.

5.1 Updates

During normal processing, transactions may be in forward processing, partial rollback or total rollback. The rollbacks may be system- or application-initiated.
The causes of rollbacks may be deadlocks, error conditions, integrity constraint violations, unexpected database state, etc.

If the granularity of locking is a record, then, when an update is to be performed on a record in a page, after the record is locked, that page is fixed in the buffer and latched in the X mode, the update is performed, a log record is appended to the log, the LSN of the log record is placed in the page_LSN field of the page and in the transaction table, and the page is unlatched and unfixed. The page latch is held during the call to the logger. This is done to ensure that the order of logging of updates of a page is the same as the order in which those updates are performed on the page. This is very important if some of the redo information is going to be logged physically (e.g., the amount of free space in the page) and repetition of history has to be guaranteed for the physical redo to work correctly. The page latch must be held during read and update operations to ensure physical consistency of the page contents. This is necessary because inserters and updaters of records might move records around within a page to do garbage collection. When such garbage collection is going on, no other transaction should be allowed to look at the page since it might get confused. Readers of pages latch in the S mode and modifiers latch in the X mode.

The data page latch is not held while any necessary index operations are performed. At most two page latches are held simultaneously (also see [57, 62]). This means that two transactions, T1 and T2, that are modifying different pieces of data may modify a particular data page in one order (T1, T2) and a particular index page in another order (T2, T1).4 This scenario is impossible in System R and SQL/DS since in those systems, locks, instead of latches, are used for providing physical consistency. Typically, all the (physical) page locks are released only at the end of the RSS (data manager) call.
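The update sequence described in this section, together with the tables of Sections 4.3 and 4.4, can be illustrated with a minimal single-threaded sketch. It is not the paper's implementation: the class and field names (`System`, `LogRecord`, `Page`, the dictionary layouts, the `(key, value)` payloads) are assumptions made for illustration, LSNs are simply positions in an in-memory list, and latching is only indicated in comments.

```python
from dataclasses import dataclass, field

@dataclass
class LogRecord:
    lsn: int
    trans_id: str
    page_id: str
    prev_lsn: int   # previous log record of the same transaction (0 if none)
    redo: tuple     # (key, new value); hypothetical payload layout
    undo: tuple     # (key, old value)

@dataclass
class Page:
    page_id: str
    page_lsn: int = 0
    data: dict = field(default_factory=dict)

class System:
    def __init__(self):
        self.log = []          # append-only log; LSN = position + 1
        self.trans_table = {}  # TransID -> {"state", "last_lsn", "undo_nxt_lsn"}
        self.dirty_pages = {}  # PageID -> RecLSN
        self.buffers = {}      # PageID -> Page

    def end_of_log_lsn(self):
        """LSN that the next log record to be written will receive."""
        return len(self.log) + 1

    def update(self, trans_id, page_id, key, new_value):
        page = self.buffers.setdefault(page_id, Page(page_id))
        # (the page would be fixed and X-latched here)
        if page_id not in self.dirty_pages:
            # nondirty page about to be modified: remember RecLSN
            self.dirty_pages[page_id] = self.end_of_log_lsn()
        entry = self.trans_table.setdefault(
            trans_id, {"state": "U", "last_lsn": 0, "undo_nxt_lsn": 0})
        old_value = page.data.get(key)
        page.data[key] = new_value                      # perform the update
        rec = LogRecord(self.end_of_log_lsn(), trans_id, page_id,
                        entry["last_lsn"], (key, new_value), (key, old_value))
        self.log.append(rec)                            # log while "latched"
        page.page_lsn = rec.lsn                         # stamp the page
        entry["last_lsn"] = rec.lsn                     # and the transaction table
        entry["undo_nxt_lsn"] = rec.lsn                 # undoable non-CLR record
        # (the page would be unlatched and unfixed here)
        return rec.lsn
```

Because the log append and the page_LSN assignment happen inside the same (notional) latch section, two updates to one page are always logged in the order in which they were applied, which is the property the text relies on.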
A single RSS call deals with modifying the data and all relevant indexes. This may involve waiting for many I/Os and locks. This means that deadlocks involving (physical) page locks alone or (physical) page locks and (logical) record/key locks are possible. They have been a major problem in System R and SQL/DS.

4 The situation gets very complicated if operations like increment/decrement are supported with high concurrency lock modes and indexes are allowed to be defined on fields on which such operations are supported. We are currently studying those situations.

Figure 7 depicts a situation at the time of a system failure which followed the commit of two transactions. The dotted lines show how up to date the states of pages P1 and P2 are on nonvolatile storage with respect to logged updates of those pages. During restart recovery, it must be realized that the most recent log record written for P1, which was written by a transaction which later committed, needs to be redone, and that there is nothing to be redone for P2. This situation points to the need for having the LSN to relate the state of a page on nonvolatile storage to a particular position in the log, and the need for knowing where the restart redo pass should begin by noting some information in the checkpoint record (see Section 5.4). For the example scenario, the restart redo log scan should begin at least from the log record representing the most recent update of P1 by T2, since that update needs to be redone.

It is not assumed that a single log record can always accommodate all the information needed to redo or undo the update operation. There may be instances when more than one record needs to be written for this purpose. For example, one record may be written with the undo information and another one with the redo information.
In such cases, (1) the undo-only log record should be written before the redo-only log record is written, and (2) it is the LSN of the redo-only log record that should be placed in the page_LSN field. The first condition is enforced to make sure that we do not have a situation in which the redo-only record and not the undo-only record gets written to stable storage before a failure, and that during restart recovery, the redo of that redo-only log record is performed (because of the repeating history feature) only to realize later that there isn't an undo-only record to undo the effect of that operation. Given that the undo-only record is written before the redo-only record, the second condition ensures that we do not have a situation in which, even though the page in nonvolatile storage already contains the update of the redo-only record, that same update gets redone unnecessarily during restart recovery because the page contained the LSN of the undo-only record instead of that of the redo-only record. This unnecessary redo could cause integrity problems if operation logging is being performed.

There may be some log records written during forward processing that cannot or should not be undone (prepare, free space inventory update, etc. records). These are identified as redo-only log records. See Section 10.3 for a discussion of this kind of situation for free space inventory updates.

Sometimes, the identity of the (data) record to be modified or read may not be known before a (data) page is examined. For example, during an insert, the record ID is not determined until the page is examined to find an empty slot. In such cases, the record lock must be obtained after the page is latched. To avoid waiting for a lock while holding a latch, which could lead to an undetected deadlock, the lock is requested conditionally, and if it is not granted, then the latch is released and the lock is requested unconditionally.
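The conditional-lock protocol for inserts can be sketched as follows. This is only an illustration under stated assumptions: `threading.Lock` objects stand in for both the X-mode page latch and the record locks, the `Page` class and `insert_record` function are hypothetical names, the page_LSN counter stands in for real log sequence numbers, and the retry policy (give up the lock and start over when the slot was taken meanwhile) is one possible corrective action, not one prescribed by the paper.

```python
import threading

class Page:
    """A toy page: a latch, a page_LSN and a fixed number of record slots."""
    def __init__(self, n_slots=4):
        self.latch = threading.Lock()   # stands in for an X-mode page latch
        self.page_lsn = 0
        self.slots = [None] * n_slots

def insert_record(page, record_locks, value):
    """Insert value into the first free slot; returns (slot, waited).

    The record lock stays held on return, as it would until end of
    transaction. Raises ValueError if the page has no free slot.
    """
    waited = False
    while True:
        page.latch.acquire()
        slot = page.slots.index(None)       # record ID is known only now
        lock = record_locks[slot]
        if lock.acquire(blocking=False):    # conditional lock request
            break                           # granted; latch is still held
        lsn_at_unlatch = page.page_lsn      # remembered for a quick recheck
        page.latch.release()                # never wait for a lock under a latch
        lock.acquire()                      # unconditional request; may block
        waited = True
        page.latch.acquire()
        if page.page_lsn == lsn_at_unlatch or page.slots[slot] is None:
            break                           # previously verified conditions hold
        lock.release()                      # slot was taken meanwhile: retry
        page.latch.release()
    page.slots[slot] = value
    page.page_lsn += 1                      # stand-in for the new record's LSN
    page.latch.release()
    return slot, waited
```

An unchanged page_LSN on relatching proves that nothing on the page changed, so the slot check is needed only when the LSN differs.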
Once the unconditionally requested lock is granted, the page is latched again, and any previously verified conditions are rechecked. This rechecking is required because, after the page was unlatched, the conditions could have changed. The page_LSN value at the time of unlatching could be remembered to detect quickly, on relatching, if any changes could have possibly occurred. If the conditions are found to be still satisfied for performing the update, it is performed as described above. Otherwise, corrective actions are taken. If the conditionally requested lock is granted immediately, then the update can proceed as before.

Fig. 7. Database state as a failure. (Figure omitted: a log timeline showing the updates of pages P1 and P2 by T1 and T2, a checkpoint, the two commits, and the failure.)

If the granularity of locking is a page or something coarser than a page, then there is no need to latch the page since the lock on the page will be sufficient to isolate the executing transaction. Except for this change, the actions taken are the same as in the record-locking case. But, if the system is to support unlocked or dirty reads, then, even with page locking, a transaction that is updating a page should be made to hold the X latch on the page so that readers who are not acquiring locks are assured physical consistency if they hold an S latch while reading the page. Unlocked reads may also be performed by the image copy utility in the interest of causing the least amount of interference to normal transaction processing.

Applicability of ARIES is not restricted to only those systems in which locking is used as the concurrency control mechanism. Even other concurrency control schemes that are similar to locking, like the ones in [2], could be used with ARIES.

5.2 Total or Partial Rollbacks

To provide flexibility in limiting the extent of transaction rollbacks, the notion of a savepoint is supported [1, 31]. At any point during the execution of a transaction, a savepoint can be established. Any number of savepoints could be outstanding at a point in time. Typically, in a system like DB2, a savepoint is established before every SQL data manipulation command that might perform updates to the data. This is needed to support SQL statement-level atomicity. After executing for a while, the transaction or the system can request the undoing of all the updates performed after the establishment of a still outstanding savepoint. After such a partial rollback, the transaction can continue execution and start going forward again (see Figure 3). A particular savepoint is no longer outstanding if a rollback has been performed to that savepoint or to a preceding one. When a savepoint is established, the LSN of the latest log record written by the transaction, called SaveLSN, is remembered in virtual storage. If the savepoint is being established at the beginning of the transaction (i.e., when it has not yet written a log record), SaveLSN is set to zero. When the transaction desires to roll back to a savepoint, it supplies the remembered SaveLSN. If the savepoint concept were to be exposed at the user level, then we would expect the system not to expose the SaveLSNs to the user but to use some symbolic values or sequence numbers and do the mapping to LSNs internally, as is done in IMS [42] and INGRES [18].

Figure 8 describes the routine ROLLBACK which is used for rolling back to a savepoint. The input to the routine is the SaveLSN and the TransID. No locks are acquired during rollback, even though a latch is acquired during undo activity on a page. Since we have always ensured that latches do not get involved in deadlocks, a rolling back transaction cannot get involved in a deadlock, as in System R and R* [31, 64] and in the algorithms of [100]. During the rollback, the log records are undone in reverse chronological order and, for each log record that is undone, a CLR is written. For ease of exposition, assume that all the information about the undo action will fit in a single CLR. It is easy to extend ARIES to the case where multiple CLRs need to be written. It is possible that, when a logical undo is performed, some non-CLRs are sometimes written, as described in [59, 62]. As mentioned before, when a CLR is written, its UndoNxtLSN field is made to contain the PrevLSN value in the log record whose undo caused this CLR to be written. Since CLRs will never be undone, they don't have to contain undo information (e.g., before-images). Redo-only log records are ignored during rollback. When a non-CLR is encountered, after it is processed, the next record to process is determined by looking up its PrevLSN field. When a CLR is encountered during rollback, the UndoNxtLSN field of that record is looked up to determine the next log record to be processed. Thus, the UndoNxtLSN pointer helps us skip over already undone log records. This means that if a nested rollback were to occur, then, because of the UndoNxtLSN in CLRs, during the second rollback none of the log records that were undone during the first rollback would be processed again. Even though Figures 4, 5, and 13 describe partial rollback scenarios in conjunction with restart undos in the various recovery methods, it should be easy to see how nested rollbacks are handled efficiently by ARIES.

Being able to describe, via CLRs, the actions performed during undo gives us the flexibility of not having to force the undo actions to be the exact inverses of the original actions. In particular, the undo action could affect a page which was not involved in the original action. Such logical undo situations are possible in, for example, index management [62] and space management (see Section 10.3).

Fig. 8. Pseudocode for rollback. (Figure body not recoverable from the extracted text.)

ARIES' guarantee of a bounded amount of logging during undo allows us to deal safely with small computer systems situations in which a circular online log might be used and log space is at a premium. Knowing the bound, we can keep in reserve enough log space to be able to roll back all currently running transactions under critical conditions (e.g., log space shortage). The implementation of ARIES in the OS/2 Extended Edition Database Manager takes advantage of this.

When a transaction rolls back, the locks obtained after the establishment of the savepoint which is the target of the rollback may be released after the partial rollback is completed. In fact, systems like DB2 do not and cannot release any of the locks after a partial rollback because, after such a lock release, a later rollback may still cause the same updates to be undone again, thereby causing data inconsistencies. System R does release locks after a partial rollback completes. But, because ARIES never undoes CLRs nor ever undoes a particular non-CLR more than once, thanks to the chaining of the CLRs using the UndoNxtLSN field, during a (partial) rollback, when the transaction's very first update to a particular object is undone and a CLR is written for it, the system can release the lock on that object. This makes it possible to consider resolving deadlocks using partial rollbacks rather than always resorting to total rollbacks.

5.3 Transaction Termination

Assume that some form of two-phase commit protocol (e.g., Presumed Abort or Presumed Commit (see [63, 64])) is used to terminate transactions and that the prepare record which is synchronously written to the log as part of the protocol includes the list of update-type locks (IX, X, SIX, etc.) held by the transaction. The logging of the locks is done to ensure that if a system failure were to occur after a transaction enters the in-doubt state, then those locks could be reacquired, during restart recovery, to protect the uncommitted updates of the in-doubt transaction.5 When the prepare record is written, the read locks (e.g., S and IS) could be released, if no new locks would be acquired later as part of getting into the prepare state in some other part of the distributed transaction (at the same site or a different site). To deal with actions (such as the dropping of objects) which may cause files to be erased, for the sake of avoiding the logging of such objects' complete contents, we postpone performing actions like erasing files until we are sure that the transaction is definitely committing [19]. We need to log these pending actions in the prepare record.

Once a transaction enters the in-doubt state, it is committed by writing an end record and releasing its locks. Once the end record is written, if there are any pending actions, then they must be performed. For each pending action which involves erasing or returning a file to the operating system, we write an OSfile_return redo-only log record. For ease of exposition, we assume that this log record is not associated with any particular transaction and that this action does not take place when a checkpoint is in progress.
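The ROLLBACK logic of Section 5.2 (undo in reverse chronological order, one CLR per undone record, and UndoNxtLSN used to skip work that was already undone) can be illustrated with a small sketch. It assumes a single transaction with its own in-memory log and a key/value store in place of pages; `Rec`, `Txn` and their fields are hypothetical names, not the paper's routine.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rec:
    lsn: int
    kind: str                  # "update" or "compensation"
    prev_lsn: int              # previous record written by this transaction
    undo_nxt_lsn: int = 0      # meaningful for CLRs only
    key: Optional[str] = None
    old: Optional[int] = None  # before-image (updates only)
    new: Optional[int] = None  # after-image, or restored value for a CLR

class Txn:
    def __init__(self):
        self.log = []          # LSN = position + 1
        self.data = {}
        self.last_lsn = 0
        self.undo_nxt_lsn = 0

    def _append(self, **fields):
        rec = Rec(lsn=len(self.log) + 1, prev_lsn=self.last_lsn, **fields)
        self.log.append(rec)
        self.last_lsn = rec.lsn
        return rec

    def update(self, key, new):
        rec = self._append(kind="update", key=key, old=self.data.get(key), new=new)
        self.data[key] = new
        self.undo_nxt_lsn = rec.lsn
        return rec.lsn

    def savepoint(self):
        return self.last_lsn   # SaveLSN; zero at the start of the transaction

    def rollback(self, save_lsn):
        while self.undo_nxt_lsn > save_lsn:
            rec = self.log[self.undo_nxt_lsn - 1]
            if rec.kind == "update":
                if rec.old is None:
                    self.data.pop(rec.key, None)
                else:
                    self.data[rec.key] = rec.old
                # the CLR points past the record it compensates
                self._append(kind="compensation", undo_nxt_lsn=rec.prev_lsn,
                             key=rec.key, new=rec.old)
                self.undo_nxt_lsn = rec.prev_lsn
            else:
                # CLR: never undone, just follow its UndoNxtLSN
                self.undo_nxt_lsn = rec.undo_nxt_lsn
```

In a nested scenario (a partial rollback, more forward updates, then a total rollback) the second rollback reaches the CLRs of the first one and jumps over the records they compensated, so each update is matched by exactly one CLR.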
5 Another possibility is not to log the locks, but to regenerate the lock names during restart recovery by examining all the log records written by the in-doubt transaction; see Sections 6.1 and 6.4, and item 18 (Section 12) for further ramifications of this approach.

A transaction in the in-doubt state is rolled back by writing a rollback record, rolling back the transaction to its beginning, discarding the pending actions list, releasing its locks, and then writing the end record. Whether or not the rollback and end records are synchronously written to stable storage will depend on the type of two-phase commit protocol used. Also, the writing of the prepare record may be avoided if the transaction is not a distributed one or is read-only.

5.4 Checkpoints

Periodically, checkpoints are taken to reduce the amount of work that needs to be performed during restart recovery. The work may relate to the extent of the log that needs to be examined, the number of data pages that have to be read from nonvolatile storage, etc. Checkpoints can be taken asynchronously (i.e., while transaction processing, including updates, is going on). Such a fuzzy checkpoint is initiated by writing a begin_chkpt record. Then the end_chkpt record is constructed by including in it the contents of the normal transaction table, the BP dirty_pages table, and any file mapping information for the objects (like tablespace, indexspace, etc.) that are "open" (i.e., for which the BP dirty_pages table has entries). Only for simplicity of exposition, we assume that all the information can be accommodated in a single end_chkpt record. It is easy to deal with the case where multiple records are needed to log this information. Once the end_chkpt record is constructed, it is written to the log.
Once that record reaches stable storage, the LSN of the begin_chkpt record is stored in the master record which is in a well-known place on stable storage. If a failure were to occur before the end_chkpt record migrates to stable storage, but after the begin_chkpt record migrates to stable storage, then that checkpoint is considered an incomplete checkpoint. Between the begin_chkpt and end_chkpt log records, transactions might have written other log records. If one or more transactions are likely to remain in the in-doubt state for a long time because of prolonged loss of contact with the commit coordinator, then it is a good idea to include in the end_chkpt record information about the update-type locks (e.g., X, IX and SIX) held by those transactions. This way, if a failure were to occur, then, during restart recovery, those locks could be reacquired without having to access the prepare records of those transactions.

Since latches may need to be acquired to read the dirty_pages table correctly while gathering the needed information, it is a good idea to gather the information a little at a time to reduce contention on the tables. For example, if the dirty_pages table has 1000 rows, during each latch acquisition 100 entries can be examined. If the already examined entries change before the end of the checkpoint, the recovery algorithms remain correct (see Figure 10). This is because, in computing the restart redo point, besides taking into account the minimum of the RecLSNs of the dirty pages included in the end_chkpt record, ARIES also takes into account the log records that were written by transactions since the beginning of the checkpoint. This is important because the effect of some of the updates that were performed since the initiation of the checkpoint might not be reflected in the dirty page list that is recorded as part of the checkpoint.

ARIES does not require that any dirty pages be forced to nonvolatile storage during a checkpoint. The assumption is that the buffer manager is, on a continuous basis, writing out dirty pages in the background using system processes. The buffer manager can batch the writes and write multiple pages in one I/O operation. [96] gives details about how DB2 manages its buffer pools in this fashion. Even if there are some hot-spot pages which are frequently modified, the buffer manager has to ensure that those pages are written to nonvolatile storage reasonably often to reduce restart redo work, just in case a system failure were to occur. To avoid the prevention of updates to such hot-spot pages during an I/O operation, the buffer manager could make a copy of each of those pages and perform the I/O from the copy. This minimizes the data unavailability time for writes.

6. RESTART PROCESSING

When the transaction system restarts after a failure, recovery needs to be performed to bring the data to a consistent state and ensure the atomicity and durability properties of transactions. Figure 9 describes the RESTART routine that gets invoked at the beginning of the restart of a failed system. The input to this routine is the LSN of the master record which contains the pointer to the begin_chkpt record of the last complete checkpoint taken before site failure or shutdown. This routine invokes the routines for the analysis pass, the redo pass and the undo pass, in that order. The buffer pool dirty_pages table is updated appropriately. At the end of restart recovery, a checkpoint is taken.

For high availability, the duration of restart processing must be as short as possible. One way of accomplishing this is by exploiting parallelism during the redo and undo passes. Only if parallelism is going to be employed is it necessary to latch pages before they are modified during restart recovery. Ideas for improving data availability by allowing new transaction processing during recovery are explored in [60].

6.1 Analysis Pass

The first pass of the log that is made during restart recovery is the analysis pass. Figure 10 describes the RESTART_ANALYSIS routine that implements the analysis pass actions. The input to this routine is the LSN of the master record. The outputs of this routine are the transaction table, which contains the list of transactions which were in the in-doubt or unprepared state at the time of system failure or shutdown; the dirty_pages table, which contains the list of pages that were potentially dirty in the buffers when the system failed or was shut down; and the RedoLSN, which is the location on the log from which the redo pass must start processing the log. The only log records that may be written by this routine are end records for transactions that had totally rolled back before system failure, but for whom end records are missing.
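As a rough illustration of how the analysis pass produces its three outputs, the following sketch scans an in-memory log that starts at the begin_chkpt record. It is a simplification under stated assumptions: log records are plain dictionaries with hypothetical field names, every update is treated as redoable and undoable unless flagged otherwise, OSfile_return handling is left out, and the end records that ARIES would write for already rolled back transactions are not actually written.

```python
def restart_analysis(log, begin_chkpt_lsn):
    """Return (trans_table, dirty_pages, redo_lsn) for an in-memory log.

    log is a list of dicts ordered by "lsn"; begin_chkpt_lsn is the value
    the master record would supply.
    """
    trans_table, dirty_pages = {}, {}
    for rec in log:
        if rec["lsn"] <= begin_chkpt_lsn:
            continue                      # scan starts after begin_chkpt
        kind, tid = rec["type"], rec.get("trans_id")
        if tid is not None and tid not in trans_table:
            trans_table[tid] = {"state": "U", "last_lsn": rec["lsn"],
                                "undo_nxt_lsn": rec.get("prev_lsn", 0)}
        if kind in ("update", "compensation"):
            entry = trans_table[tid]
            entry["last_lsn"] = rec["lsn"]
            if kind == "update":
                if rec.get("undoable", True):
                    entry["undo_nxt_lsn"] = rec["lsn"]
            else:                         # CLR: next record to undo
                entry["undo_nxt_lsn"] = rec["undo_nxt_lsn"]
            if rec.get("redoable", True):
                dirty_pages.setdefault(rec["page_id"], rec["lsn"])
        elif kind == "end_chkpt":
            for t, e in rec["trans_table"].items():
                trans_table.setdefault(t, dict(e))
            for page_id, rec_lsn in rec["dirty_pages"].items():
                dirty_pages[page_id] = rec_lsn   # checkpoint's RecLSN wins
        elif kind in ("prepare", "rollback"):
            trans_table[tid]["state"] = "P" if kind == "prepare" else "U"
            trans_table[tid]["last_lsn"] = rec["lsn"]
        elif kind == "end":
            del trans_table[tid]
    # Transactions that had fully rolled back but lack an end record:
    # ARIES writes the end record here; this sketch only drops the entry.
    for tid in [t for t, e in trans_table.items()
                if e["state"] == "U" and e["undo_nxt_lsn"] == 0]:
        del trans_table[tid]
    redo_lsn = min(dirty_pages.values(), default=None)
    return trans_table, dirty_pages, redo_lsn
```

Note that the computed RedoLSN may precede the begin_chkpt record: a page listed in the checkpoint's dirty page list can carry a RecLSN that is older than the checkpoint itself.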
during to .chkpt or the table checkpoint writes manager writes. of transactions. routine the restarts data invoked to this pointer the properties that input for background reduce To avoid nonvolatile the hot-spot to perform to buffer ensure operation, time system bring durability The some to often and forced list page the about are has 1/0 pages the dirty the PROCESSING to routine batch to occur. an data transaction performed were during of those the the failure in details reasonably of each 6. RESTART When storage pages can manager be pages if there buffer in is that dirty [961 gives Even the a system hot-spot out manager fashion. pages assumption writing buffer dirty any The operation. to nonvolatile such and 1/0 modified, written to The one pools frequently just basis, processes. pages that a checkpoint. a continuous ple require not during be reflected on Database Systems, Vol. 17, No. 1, March 1992. the records for buffers is the log. for whom when location The only the on log transactions end records ARIES: ATransaction Recovery Method . 123 RE.STAR7(Master Addr); Restart_Analys~ Restart_ s(Master_Addr, Redo(RedoLSN, buffer pool remove entries Restart_ Dirty_Pages for table locks Dlrty_Pages, for RedoLSN); Dlrty_Pages); := Dirty_ Pages; non-buffer-resident Undo (Trans_Tabl reacquire Trans_Table, Trans_Table, pages from the buffer pool Dirty_ Pages table; e); transactions; prepared checkpoint; RETURN; Fig.9. During this does not the table transaction back. if a log record appear in the with the current table is modified also to note undone pass, already the LSN if it were file which of the _pages most recent log record sure that the redo later, original for a page table, then log record ultimately that whose an entry table identity is made causing would then in The and need to be had to be rolled any pages belonging are removed no page belonging pass. 
The same file operation that the transaction is encountered, are in the dirty-pages order to make accessed during once the is encountered dirty log record’s LSN as the page’s RecLSN. to track the state changes of transactions determined If an OSfile.return to that Pseudocode for restart. from the latter in to that version of that file is may be recreated and updated the file erasure is committed. In that case, some pages of the recreated file will reappear in the dirty-pages table later with RecLSN values greater than the end-of-log LSN when the file was erased. The RedoLSN is the minimum RecLSN from the dirty-pages table at the end of the analysis are no pages in the dirty _pages It is not necessary ARIES there is no analysis (see also missing logged Hence, tion. This that there implementation in pass. pass. table. The redo pass can be skipped be a separate the This 0S/2 analysis Extended is especially 6.2), in the redo pass, ARIES updates. That is, it redoes them irrespective redo transactions, does not need to know unlike the loser unconditionally System their update locks are reacquired computation to consider the Begin _LSNs turn requires that we know, before the start the in-doubt transactions. Without the analysis pass, the transaction and DB2. of a transac- the lock names as they are encountered locks forces the RedoLSN of in-doubt transactions which of the redo pass, the identities table all were only for the undo pass. in which for in-doubt by inferring from the log records of the in-doubt transactions, during the redo pass. This technique for reacquiring before they R, SQL/DS status in the Manager redoes of whether or nonloser That information is, strictly speaking, needed would not be true for a system (like DB2) transactions Database as we mentioned Section by loser or nonloser pass and, in fact, Edition because, if there could be constructed in of from the checkpoint record and the log records encountered during the redo pass. 
The RedoLSN would have to be the minimum(minimum(RecLSN from the dirty_pages table in the end_chkpt record), LSN(begin_chkpt record)). Suppression of the analysis pass would also require that other methods be used to avoid processing updates to files which have been returned to the operating system. Another consequence is that the dirty_pages table used during the redo pass cannot be used to filter update log records which occur after the begin_chkpt record.

RESTART_ANALYSIS(Master_Addr, Trans_Table, Dirty_Pages, RedoLSN);
  initialize the tables Trans_Table and Dirty_Pages to empty;
  Master_Rec := Read_Disk(Master_Addr);
  Open_Log_Scan(Master_Rec.ChkptLSN);            /* open log scan at Begin_Chkpt record */
  LogRec := Next_Log();                          /* read in the Begin_Chkpt record */
  LogRec := Next_Log();                          /* read log record following Begin_Chkpt */
  WHILE NOT(End_of_Log) DO;
    IF trans related record & LogRec.TransID NOT in Trans_Table THEN
                                                 /* not chkpt/OSfile_return log record */
      insert (LogRec.TransID, 'U', LogRec.LSN, LogRec.PrevLSN) into Trans_Table;
    SELECT(LogRec.Type)
      WHEN('update' | 'compensation') DO;
        Trans_Table[LogRec.TransID].LastLSN := LogRec.LSN;
        IF LogRec.Type = 'update' THEN
          IF LogRec is undoable THEN
            Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.LSN;
        ELSE Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.UndoNxtLSN;
                                                 /* next record to undo is the one pointed to by this CLR */
        IF LogRec is redoable & LogRec.PageID NOT IN Dirty_Pages THEN
          insert (LogRec.PageID, LogRec.LSN) into Dirty_Pages;
      END;                                       /* WHEN('update' | 'compensation') */
      WHEN('Begin_Chkpt') ;                      /* found an incomplete checkpoint's Begin_Chkpt record. ignore it */
      WHEN('End_Chkpt') DO;
        FOR each entry in LogRec.Tran_Table DO;
          IF TransID NOT IN Trans_Table THEN DO;
            insert entry (TransID, State, LastLSN, UndoNxtLSN) in Trans_Table;
          END;
        END;                                     /* FOR */
        FOR each entry in LogRec.Dirty_PagLst DO;
          IF PageID NOT IN Dirty_Pages THEN insert entry (PageID, RecLSN) in Dirty_Pages;
          ELSE set RecLSN of Dirty_Pages entry to RecLSN in Dirty_PagLst;
        END;                                     /* FOR */
      END;                                       /* WHEN('End_Chkpt') */
      WHEN('prepare' | 'rollback') DO;
        IF LogRec.Type = 'prepare' THEN Trans_Table[LogRec.TransID].State := 'P';
        ELSE Trans_Table[LogRec.TransID].State := 'U';
        Trans_Table[LogRec.TransID].LastLSN := LogRec.LSN;
      END;                                       /* WHEN('prepare' | 'rollback') */
      WHEN('end') delete Trans_Table entry for which TransID = LogRec.TransID;
      WHEN('OSfile_return') delete from Dirty_Pages all pages of returned file;
    END;                                         /* SELECT */
    LogRec := Next_Log();
  END;                                           /* WHILE */
  FOR EACH Trans_Table entry with (State = 'U') & (UndoNxtLSN = 0) DO;
                                                 /* rolled back trans with missing end record */
    write end record and remove entry from Trans_Table;
  END;                                           /* FOR */
  RedoLSN := minimum(Dirty_Pages.RecLSN);        /* return start position for redo */
RETURN;

Fig. 10. Pseudocode for restart analysis.

6.2 Redo Pass

The second pass of the log that is made during restart recovery is the redo pass. Figure 11 describes the RESTART_REDO routine that implements the redo pass actions. The inputs to this routine are the RedoLSN and the dirty_pages table supplied by the restart_analysis routine. No log records are written by this routine. The redo pass starts scanning the log records from the RedoLSN point. When a redoable log record is encountered, a check is made to see if the referenced page appears in the dirty_pages table. If it does and if the log record's LSN is greater than or equal to the RecLSN for the page in the table, then it is suspected that the page state might be such that the log record's update might have to be redone. To resolve this suspicion, the page is accessed. If the page's LSN is found to be less than the log record's LSN, then the update is redone. Thus, the RecLSN information serves to limit the number of pages which have to be examined. This routine reestablishes the database state as of the time of system failure. Even updates performed by loser transactions are redone. The rationale behind this repeating of history is explained in Section 10.1. It turns out that some of that redo of loser transactions' log records may be unnecessary. In [69] we have explored further the idea of restricting the repeating of history to possibly reduce the number of pages which get dirtied during this pass.

RESTART_REDO(RedoLSN, Dirty_Pages);
  Open_Log_Scan(RedoLSN);                        /* open log scan and position at restart pt */
  LogRec := Next_Log();                          /* read log record at restart redo point */
  WHILE NOT(End_of_Log) DO;                      /* look at all records till end of log */
    IF LogRec.Type = ('update' | 'compensation') & LogRec is redoable &
       LogRec.PageID IN Dirty_Pages &
       LogRec.LSN >= Dirty_Pages[LogRec.PageID].RecLSN
    THEN DO;                                     /* a redoable page update. updated page might not have made it to */
                                                 /* disk before sys failure. need to access page and check its LSN */
      Page := fix&latch(LogRec.PageID, 'X');
      IF Page.LSN < LogRec.LSN THEN DO;          /* update not on page. need to redo it */
        Redo_Update(Page, LogRec);               /* redo update */
        Page.LSN := LogRec.LSN;
      END;                                       /* redid update */
      ELSE Dirty_Pages[LogRec.PageID].RecLSN := Page.LSN+1;
                                                 /* update already on page */
                                                 /* update dirty page list with correct info. this will happen if this */
                                                 /* page was written to disk after the checkpt but before sys failure */
      unfix&unlatch(Page);
    END;                                         /* LSN on page has to be checked */
    LogRec := Next_Log();                        /* read next log record */
  END;                                           /* reading till end of log */
RETURN;

Fig. 11. Pseudocode for restart redo.

Since redo is page-oriented, only the pages with entries in the dirty_pages table may get modified during the redo pass. Only the pages listed in the dirty_pages table will be read and examined during this pass. Not all the pages that are read may require redo. This is because some of the pages that were dirty at the time of the last checkpoint, or which became dirty later, might have been written to nonvolatile storage before the system failure. Because of reasons like reducing log volume and saving some CPU overhead, we do not expect systems to write log records that identify the dirty pages that were written to nonvolatile storage, although that option is available and such log records can be used to eliminate the corresponding pages from the dirty_pages table when those log records are encountered during the analysis pass. Even if such records were always to be written after I/Os complete, a system failure in a narrow window could prevent them from being written. The corresponding pages will not get modified during this pass.

For brevity, we do not discuss here how, if a failure were to occur after the logging of the end record of a transaction, but before the execution of all the pending actions of that transaction, the remaining pending actions are redone during the redo pass.

For exploiting parallelism, the availability of the information in the dirty_pages table gives us the possibility of initiating asynchronous I/Os in parallel to read all these pages so that they may be available in the buffers possibly before the corresponding log records are encountered in the redo pass. Since updates performed during the redo pass are not logged, we can also perform sophisticated things like building in-memory queues of log records which potentially need to be reapplied to the dirty pages, on a per page or group of pages basis, and applying them as the asynchronously initiated I/Os complete and the pages come into the buffer pool. This requires that each log record queue be dealt with by only one process. Updates to different pages may then get applied in orders different from the order represented in the log. This does not violate any correctness properties since for a given page all the missing updates are reapplied in the same order as before. These parallelism ideas are also applicable to the context of supporting disaster recovery via remote backups [73].

6.3 Undo Pass

The third pass of the log that is made during restart recovery is the undo pass. Figure 12 describes the RESTART_UNDO routine that implements the undo pass actions. The input to this routine is the restart transaction table. The dirty_pages table is not consulted during the undo pass. Also, since history is repeated before the undo pass is initiated, the LSN on the page is not consulted to determine whether an undo operation should be performed or not. Contrast this with what we describe in Section 10.1 for systems like DB2 that do not repeat history but perform selective redo.

The restart_undo routine rolls back loser transactions, in reverse chronological order, in a single sweep of the log. This is done by continually taking the maximum of the LSNs of the next log record to be processed for each of the yet-to-be-completely-undone loser transactions, until no loser transaction remains to be undone. The next record to process for each transaction to be rolled back is determined by an entry in the transaction table for each of those transactions. The processing of the encountered records is exactly as we described before in Section 5.2. In the process of rolling back the transactions, this routine writes CLRs. The buffer manager follows the usual WAL protocol while writing dirty pages to nonvolatile storage during the undo pass.

RESTART_UNDO(Trans_Table);
  WHILE EXISTS (Trans with State = 'U' in Trans_Table) DO;
    UndoLSN := maximum(UndoNxtLSN) from Trans_Table entries with State = 'U';
                                                 /* pick up UndoNxtLSN of unprepared trans with maximum UndoNxtLSN */
    LogRec := Log_Read(UndoLSN);                 /* read log record to be undone or a CLR */
    SELECT(LogRec.Type)
      WHEN('update') DO;
        IF LogRec is undoable THEN DO;           /* record needs undoing (not redo-only record) */
          Page := fix&latch(LogRec.PageID, 'X');
          Undo_Update(Page, LogRec);
          Log_Write('compensation', LogRec.TransID, Trans_Table[LogRec.TransID].LastLSN,
                    LogRec.PageID, LogRec.PrevLSN, ..., LgLSN, Data);   /* write CLR */
          Page.LSN := LgLSN;                     /* store LSN of CLR in page */
          Trans_Table[LogRec.TransID].LastLSN := LgLSN;                 /* store LSN of CLR in table */
          unfix&unlatch(Page);
        END;                                     /* undoable record case */
        ELSE;                                    /* record cannot be undone - ignore it */
        Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.PrevLSN;
                                                 /* pick up addr of next record to examine */
        IF LogRec.PrevLSN = 0 THEN DO;           /* have undone completely - write end */
          Log_Write('end', LogRec.TransID, Trans_Table[LogRec.TransID].LastLSN, ...);
          delete Trans_Table entry where TransID = LogRec.TransID;      /* delete trans from table */
        END;                                     /* trans fully undone */
      END;                                       /* WHEN('update') */
      WHEN('compensation') Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.UndoNxtLSN;
                                                 /* pick up addr of next record to examine */
      WHEN('rollback' | 'prepare') Trans_Table[LogRec.TransID].UndoNxtLSN := LogRec.PrevLSN;
                                                 /* pick up addr of next record to examine */
    END;                                         /* SELECT */
  END;                                           /* WHILE */
RETURN;

Fig. 12. Pseudocode for restart undo.

To exploit parallelism, the undo pass can also be performed using multiple processes. It is important that each transaction be dealt with completely by a single process because of the UndoNxtLSN chaining in the CLRs. This still leaves open the possibility of writing the CLRs first, without applying the undos to the pages (see Section 6.4 for problems in accomplishing this for objects that may require logical undos), and then redoing the CLRs in parallel, as explained in Section 6.2. In this fashion, the undo work of actually applying the changes to the pages can be performed in parallel, even for a single transaction.

Figure 13 depicts an example restart recovery scenario using ARIES. Here, all the log records describe updates to the same page. Before the failure, the page was written to disk after the second update. After that disk write, a partial rollback was performed (undo of log records 4 and 3) and then the transaction went forward (updates 5 and 6). During restart recovery, the missing updates (3, 4, 4', 3', 5 and 6) are first redone and then the undos (of 6, 5, 2 and 1) are performed. Each update log record will be matched with at most one CLR, regardless of how many times restart recovery is performed.
continuation ARIES undo the write, recovery, then recovery Since in the and restart will Here, failure, disk 3) and record ARIES. the 4 and redone allowing could, using Before records times is completed. we page. update. of log 5 scenario same second (undo performed. transactions supports after recovery to the (3, 4, 4’, 3’, 5 and 1) are CLR, disk restart updates forward updates and With the 6.4 logical changes be performed be dealt transaction. 13 depicts (of 6, 5,2 also chaining of writing in the can transaction UndoNxtLSN Section require explained pass each of the (see may applying Figure undo possibility pages as at most I* *I *I Pseudocode for estart undo. that because that partial the It is important process the pick LastLSN, . . .) ; /* delete trans ‘/ SELECT “/ WHILE */ Fig. 12. the store Log_Wrlte( ’end’ ,LogRec .Trans IO, Trans_Tabl e[LogRec. Transit]. delete Trans_Table entry where TransID . LogRec. TransIO; ENO; ENO; /* /* END; RETURN; page *I Trans_Tabl e[LogRec. TransID] .LastLSN := LgLSN; /’ store LSN of CLR in table unfix&unl atch(Page); ENO; I* undoable record case ELSE; /* record cannot be undone - ignore it Trans_Tabl e[LogRec. Trans IO] .UndoNxt LSN := LogRec. PrevLSN; /x next record to process is J* the one preceding this record in its backward chain IF LogRec. PrevLSN = O THEN DO; /* have undone completely - write end WHEN(‘rollback’ all record) .LastLSN, . . . ,LgLSN, Data); ENO; /* WHEN(‘update’) */ WHEN(‘compensation’) Trans_Tabl e[LogRec. TransID] for redo-only pass, repeats roll of loser history back each ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. 128 C. Mohan et al. ● u Wrl te !bdated m * 12344’3’5 REDO 344356 UNDO 6521 Fig. 13. loser only to its transactions. tion at a savepoint (1) records before ever its after some recovery amount the by processing perform undo work offline the needs undo then transactions. 
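The Figure 13 scenario can be traced with a small in-memory model of the two passes. This is only an illustrative sketch, not the routines of Figures 11 and 12: the `LogRec` class, the dictionary-based transaction table, and the `name` labels are assumptions made for the example, pages carry nothing but an LSN, and no end record is written.

```python
from dataclasses import dataclass

@dataclass
class LogRec:
    lsn: int
    name: str              # label used in Figure 13 (hypothetical field)
    kind: str              # 'update' or 'compensation' (CLR)
    trans_id: str
    page_id: str
    prev_lsn: int          # previous record of the same transaction, 0 if none
    undo_nxt_lsn: int = 0  # meaningful only for CLRs

def restart_redo(log, redo_lsn, dirty_pages, page_lsn):
    """Repeat history: reapply every logged update missing from the page."""
    redone = []
    for rec in log:
        rec_lsn = dirty_pages.get(rec.page_id)
        if rec.lsn < redo_lsn or rec_lsn is None or rec.lsn < rec_lsn:
            continue                       # page cannot need this update
        if page_lsn[rec.page_id] < rec.lsn:
            redone.append(rec.name)        # update not on page: redo it
            page_lsn[rec.page_id] = rec.lsn
        else:                              # update already on page
            dirty_pages[rec.page_id] = page_lsn[rec.page_id] + 1
    return redone

def restart_undo(log, trans_table, page_lsn):
    """Single backward sweep: always process the largest UndoNxtLSN."""
    by_lsn = {r.lsn: r for r in log}
    next_lsn = max(by_lsn) + 1
    undone = []
    while trans_table:
        tid = max(trans_table, key=lambda t: trans_table[t]["undo_nxt"])
        entry = trans_table[tid]
        rec = by_lsn[entry["undo_nxt"]]
        if rec.kind == "update":
            undone.append(rec.name)
            # The CLR's UndoNxtLSN points at the predecessor of the undone record.
            clr = LogRec(next_lsn, rec.name + "'", "compensation", tid,
                         rec.page_id, entry["last"], rec.prev_lsn)
            log.append(clr)
            by_lsn[clr.lsn] = clr
            page_lsn[rec.page_id] = clr.lsn
            entry["last"] = clr.lsn
            next_lsn += 1
            entry["undo_nxt"] = rec.prev_lsn
        else:                              # CLR: skip what it already compensated
            entry["undo_nxt"] = rec.undo_nxt_lsn
        if entry["undo_nxt"] == 0:         # transaction fully undone
            del trans_table[tid]
    return undone

# Figure 13: one transaction, one page, page written to disk after update 2.
log = [
    LogRec(1, "1", "update", "T1", "P", 0),
    LogRec(2, "2", "update", "T1", "P", 1),
    LogRec(3, "3", "update", "T1", "P", 2),
    LogRec(4, "4", "update", "T1", "P", 3),
    LogRec(5, "4'", "compensation", "T1", "P", 4, undo_nxt_lsn=3),
    LogRec(6, "3'", "compensation", "T1", "P", 5, undo_nxt_lsn=2),
    LogRec(7, "5", "update", "T1", "P", 6),
    LogRec(8, "6", "update", "T1", "P", 7),
]
page_lsn = {"P": 2}
redone = restart_redo(log, 3, {"P": 3}, page_lsn)
undone = restart_undo(log, {"T1": {"undo_nxt": 8, "last": 8}}, page_lsn)
print(redone)  # ['3', '4', "4'", "3'", '5', '6']
print(undone)  # ['6', '5', '2', '1']
```

Because the CLR 3' carries an UndoNxtLSN pointing at record 2, the sweep never revisits records 3 and 4, which is why each update is compensated at most once.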
6.4 Selective or Deferred Restart

Sometimes, after a system failure, we may wish to restart the processing of new transactions as soon as possible. Hence, we may wish to defer doing some recovery work to a later point in time. This is usually done to reduce the amount of time during which some critical data is unavailable. It is accomplished by recovering such data first and then opening the system for the processing of new transactions. In DB2, for example, it is possible to perform restart recovery even when some of the objects for which redo and/or undo work needs to be performed are offline when the system is brought up. If some undo work needs to be performed for some loser transactions on those offline objects, then DB2 is able to write the CLRs alone and finish handling the transactions. This is possible because the CLRs can be generated based solely on the information in the non-CLR records written during the forward processing of the transactions [15]. Because page (or minipage, for indexes) is the smallest granularity of locking, the undo actions will be exact inverses of the original actions. That is, there are no logical undos in DB2. DB2 remembers, in an exceptions table (called the database allocation (DBA) table) that is maintained in virtual storage and in the log, the fact that those offline objects need to be recovered when they are brought online, before they are made accessible to other transactions [14]. The LSN ranges of log records to be applied are also remembered. Unless there are some in-doubt transactions with uncommitted updates to those objects, no locks need to be acquired to protect those objects since accesses to those objects will not be permitted until recovery is completed. When those objects are brought online, then recovery is performed efficiently by rolling forward using the log records in the remembered ranges. Even during normal rollbacks, CLRs may be written for offline objects.

In ARIES also, we can take similar actions, provided none of the loser transactions has modified one or more of the offline objects that may require logical undos. This is because logical undos are based on the current state of the object. Redos are not a problem, since they are always page-oriented. For logical undos involving space management (see Section 10.3), generally we can take a conservative approach and generate the appropriate CLRs. For example, during the undo of an insert record operation, we can write a CLR for the space-related update stating that the page is 0% full. But for the high concurrency, index management methods of [62] this is not possible, since the effect of the logical undo (e.g., retraversing the index tree to do a key deletion), in terms of which page may be affected, is unpredictable; in fact, we cannot even predict when page-oriented undo will not work and hence logical undo is necessary.

It is not possible to handle the undos of some of the records of a transaction during restart recovery and handle the undos (possibly, logical) of the rest of the records at a later point in time, if the two sets of records are interspersed. Remember that in all the recovery methods, undo of a transaction is done in reverse chronological order. Hence, it is enough to remember, for each transaction, the next record to be processed during the undo; from that record, the UndoNxtLSN chain leads us to all the other records to be processed.

Even under the circumstances where one or more of the loser transactions have to perform, potentially logical, undos on some offline objects, if deferred restart needs to be supported, then we suggest the following algorithm:

1. Perform the repeating of history for the online objects, as usual; postpone it for the offline objects and remember the log ranges.
2. Proceed with the undo pass as usual, but stop undoing a loser transaction when one of its log records is encountered for which a CLR cannot be generated for the above reasons. Call such a transaction a stopped transaction. But continue undoing the other, unstopped transactions.
3. For the stopped transactions, acquire locks to protect their updates which have not yet been undone. This could be done as part of the undo pass by continuing to follow the pointers, as usual, even for the stopped transactions and acquiring locks based on the encountered non-CLRs that were written by the stopped transactions.
4. When restart recovery is completed and later the previously offline objects are made online, first repeat history based on the remembered log ranges and then continue with the undoing of the stopped transactions. After each of the stopped transactions is totally rolled back, release its still held locks.
5. Whenever an offline object becomes online, when the repeating of history is completed for that object, new transactions can be allowed to access that object in parallel with the further undoing of all of the stopped transactions that can make progress.

The above requires the ability to generate lock names based on the information in the update (non-CLR) log records. DB2 is doing that already for in-doubt transactions.

Even if none of the objects to be recovered is offline, but it is desired that the processing of new transactions start before the rollbacks of the loser transactions are completed, then we can accommodate it by doing the following: (1) first repeat history and reacquire, based on their log records, the locks for the uncommitted updates of the loser transactions, and (2) then start processing new transactions. The locks acquired in step (1) are released as each loser transaction's rollback completes. Performing step (1) requires that the restart RedoLSN be adjusted appropriately to ensure that all the log records of the loser transactions are encountered during the redo pass. If a loser transaction was already rolling back at the time of the system failure, then, with the information obtained during the analysis pass for such a transaction, it will be known as to which log records remain to be undone. These are the log records whose LSNs are less than or equal to the UndoNxtLSN of the transaction's last CLR. Locks need to be obtained during the redo pass only for those updates that have not yet been undone.

If a long transaction is being rolled back and we would like to release some of its locks as soon as possible, then we can mark specially those log records which represent the first update by that transaction on the corresponding object (e.g., record, if record locking is in effect) and then release that object's lock as soon as the corresponding log record is undone. This works only because we do not undo the same non-CLR more than once; hence, it will not work in systems that undo a non-CLR more than once (e.g., AS/400, Encompass, DB2) or systems that perform partial rollbacks by undoing CLRs (e.g., IMS). This early release of locks can be performed in ARIES during normal transaction undo to possibly permit resolution of deadlocks using partial rollbacks.
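Steps 2 and 3 of the deferred restart algorithm can be sketched as a single backward sweep that stops a loser at its first record needing a logical undo on an offline object and, from then on, only collects lock names for that loser. This is a hedged illustration: the record layout (`lsn`, `object`, `logical_undo`, `lock_name`) and the function name are assumptions made for the example, and CLR writing is omitted.

```python
def deferred_undo(loser_chains, offline_objects):
    """loser_chains maps a transaction id to its not-yet-undone update records.

    Returns the LSNs undone now, the set of stopped transactions, and the
    lock names to acquire for each stopped transaction.
    """
    undone, stopped, locks = [], set(), {}
    pending = [(rec["lsn"], tid, rec)
               for tid, chain in loser_chains.items() for rec in chain]
    # One sweep of the log in reverse chronological order.
    for lsn, tid, rec in sorted(pending, key=lambda p: p[0], reverse=True):
        needs_offline_logical_undo = (rec["object"] in offline_objects
                                      and rec["logical_undo"])
        if needs_offline_logical_undo:
            stopped.add(tid)               # no CLR can be generated now
        if tid in stopped:
            # Keep following the chain, but only to protect the updates.
            locks.setdefault(tid, []).append(rec["lock_name"])
        else:
            undone.append(lsn)
    return undone, stopped, locks

chains = {
    "T1": [
        {"lsn": 50, "object": "A", "logical_undo": False, "lock_name": "A.r9"},
        {"lsn": 30, "object": "B", "logical_undo": True, "lock_name": "B.r7"},
        {"lsn": 10, "object": "A", "logical_undo": False, "lock_name": "A.r1"},
    ],
    "T2": [
        {"lsn": 40, "object": "A", "logical_undo": False, "lock_name": "A.r4"},
        {"lsn": 20, "object": "A", "logical_undo": False, "lock_name": "A.r2"},
    ],
}
undone_now, stopped_trans, held_locks = deferred_undo(chains, {"B"})
print(undone_now)     # [50, 40, 20]
print(stopped_trans)  # {'T1'}
print(held_locks)     # {'T1': ['B.r7', 'A.r1']}
```

T1's record 10 is on an online object, yet it is not undone: once T1 is stopped at record 30, undoing 10 first would violate reverse chronological order, so it is only locked.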
7. CHECKPOINTS DURING RESTART

In this section, we describe how the impact of failures on CPU processing and I/O can be reduced by, optionally, taking checkpoints during different stages of restart recovery processing.

Analysis pass. By taking a checkpoint at the end of the analysis pass, we can save some work if a failure were to occur during recovery. The entries of the transaction table of this checkpoint will be the same as the entries of the transaction table at the end of the analysis pass. The entries of the dirty_pages list of this checkpoint will be the same as the entries that the restart dirty_pages table contains at the end of the analysis pass. This is different from what happens during a normal checkpoint. For the latter, the dirty_pages list is obtained from the buffer pool (BP) dirty_pages table.

Redo pass. At the beginning of the redo pass, the buffer manager (BM) is notified so that, whenever it writes out a modified page to nonvolatile storage during the redo pass, it will change the restart dirty_pages table entry for that page by making the RecLSN be equal to the LSN of that log record such that all log records up to that log record had been processed. It is enough if BM manipulates the restart dirty_pages table in this fashion. BM does not have to maintain its own dirty_pages table as it does during normal processing. Of course, it should still be keeping track of what pages are currently in the buffers. The above allows checkpoints to be taken any time during the redo pass to reduce the amount of the log that would need to be redone if a failure were to occur before the end of the redo pass. The entries of the dirty_pages list of this checkpoint will be the same as the entries of the restart dirty_pages table at the time of the checkpoint. This checkpointing is not affected by whether or not parallelism is employed in the redo pass.

Undo pass. At the beginning of the undo pass, the restart dirty_pages table becomes the BP dirty_pages table. At this point, the table is cleaned up by removing those entries for which the corresponding pages are no longer in the buffers. From then onward, the BP dirty_pages table is maintained as during normal processing: removing entries when pages are written to nonvolatile storage, adding entries when pages are about to become dirty, etc. During the undo pass, the entries of the transaction table are modified as during normal undo. If a checkpoint is taken any time during the undo pass, then the entries of the dirty_pages list of that checkpoint are the same as the entries of the BP dirty_pages table at the time of the checkpoint. The entries of the transaction table are also the same as the entries of the transaction table at that time.

In System R, during restart recovery, sometimes it may be required that a checkpoint be taken to free up some physical pages (the shadow pages) for more undo or redo work to be performed. This is another consequence of the fact that history is not repeated in System R. This complicates the restart logic since the view depicted in Figure 17 would no longer be true after a restart checkpoint completes. The restart checkpoint logic and its effect on a restart following a system failure during an earlier restart were considered too complex to be describable in [31]. ARIES is able to easily accommodate checkpoints during restart. While these checkpoints are optional in our case, they may be forced to take place in System R.
8. MEDIA RECOVERY

We will assume that media recovery will be required at the level of a file or some such (like DBspace, tablespace, etc.) entity. A fuzzy image copy (also called a fuzzy archive dump) operation involving such an entity can be performed concurrently with modifications to the entity by other transactions. With such a high concurrency image copy method, the image copy might contain some uncommitted updates, in contrast to the method of [52]. Of course, if desired, we could also easily produce an image copy with no uncommitted updates. Let us assume that the image copying is performed directly from the nonvolatile storage version of the entity. This means that more recent versions of some of the copied pages may be present in the transaction system's buffers. Copying directly from the nonvolatile storage version of the object would usually be much more efficient since the device geometry can be exploited during such a copy operation and since the buffer manager overheads will be eliminated. Since the transaction system does not have to be up for the direct copying, it may also be more convenient than copying via the transaction system's buffers. If the latter is found desirable (e.g., to support incremental image copying, as described in [13]), then it is easy to modify the presented method to accommodate it. Of course, in that case, some minimal amount of synchronization will be needed. For example, latching at the page level, but no locking, will be needed.

When the fuzzy image copy operation is initiated, the location of the begin_chkpt record of the most recent complete checkpoint is noted and remembered along with the image copy data. Let us call this checkpoint the image copy checkpoint. The assertion that can be made based on this checkpoint information is that all updates that had been logged in log records with LSNs less than

  minimum(minimum(RecLSNs of dirty pages of the image-copied entity in the image copy checkpoint's end_chkpt record), LSN(begin_chkpt record of the image copy checkpoint))

would have been externalized to nonvolatile storage by the time the fuzzy image copy operation began. Hence, the image-copied version of the entity would be at least as up to date as of that point in the log. We call that point the media recovery redo point. The reason for taking into account the LSN of the begin_chkpt record in computing the media recovery redo point is the same as the one given in Section 5.4 while discussing the computation of the restart redo point.

When media recovery is required, the image-copied version of the entity is reloaded and then a redo scan is initiated starting from the media recovery redo point. During the redo scan, all the log records relating to the entity being recovered are processed and the corresponding updates are applied, unless the information in the image copy checkpoint record's dirty_pages list or the LSN on the page makes it unnecessary. Unlike during restart redo, if a log record refers to a page that is not in the dirty_pages list and the log record's LSN is greater than the LSN of the begin_chkpt log record of the image copy checkpoint, then that page must be accessed and its LSN compared to the log record's LSN to check if the update must be redone. Once the end of the log is reached, if there are any in-progress transactions, then those transactions that had made changes to the entity are undone, as in the undo pass of restart recovery. The information about the identities, etc. of such transactions may be kept separately somewhere (e.g., in an exceptions table such as the DBA table in DB2; see Section 6.4) or may be obtained by performing an analysis pass from the last complete checkpoint in the log until the end of the log.

Page-oriented logging provides recovery independence amongst objects. Since, in ARIES, every database page's update is logged separately, even if an arbitrary database page is damaged in the nonvolatile storage and the page needs recovery, the recovery can be accomplished easily by extracting an earlier copy of that page from an image copy and rolling forward that version of the page using the log as described above. This is to be contrasted with systems like System R in which, since for some pages' updates (e.g., index and space management pages') log records are not written, recovery from damage to such a page may require the expensive operation of reconstructing the entire object (e.g., rebuilding a complete index even when only one page of an index is damaged). Also, even for pages for which logging is performed explicitly (e.g., data pages in System R), if CLRs are not written when undo is performed, then bringing a page's state up to date by starting from the image copy state would require paying attention to the log records representing the transaction state (commit, partial or total rollback) to determine what actions, if any, should be undone. If any transaction had totally or partially rolled back, then backward scans of the log would be required to see if they made any changes to the page being recovered so that they are undone. These backward scans may result in useless work being performed, if it turns out that some rolled back transactions had not made any changes to the page being recovered. An alternative would be to preprocess the log and place forward pointers to skip over rolled back log records, as it is done in System R during the analysis pass of restart recovery (see Section 10.2 and Figure 18).

Individual pages of the database may be corrupted not only because of media problems but also because of an abnormal process termination while the process is actively making changes to a page in the buffer pool and before the process gets a chance to write a log record describing the changes. If the database code is executed by the application process itself, which is what performance-conscious systems like DB2 implement, such abnormal terminations may occur due to the user's interruption (e.g., by hitting the attention key) or due to the operating system's action on noting that the process had exhausted its CPU time limit. It is generally an expensive operation to put the process in an uninterruptable state before every page update. Given all these circumstances, an efficient way to recover the corrupted page is to read the uncorrupted version of the page from the nonvolatile storage and bring it up to date by rolling forward the page state using all the relevant log records for that page. The roll-forward redo scan of the log is started from the RecLSN remembered for the buffer by the buffer manager. DB2 does this kind of internal recovery operation automatically [15]. The corruption of a page is detected by using a bit in the page header. The bit is set to '1' after the page is fixed and X-latched. Once the update operation is complete (i.e., page updated, update logged and page LSN modified), the bit is reset to '0'. Given this, whenever a page is latched, for read or write, first this bit is tested to see if its value is equal to '1', in which case automatic page recovery is initiated. From an availability viewpoint, it is unacceptable to bring down the entire transaction system to recover from such a broken page situation by letting restart recovery redo all those logged updates that were in the corrupted page but were missing in the uncorrupted version of the page on nonvolatile storage. A related problem is to make sure that for those pages that were left in the fixed state by the abnormally terminating process, unfix calls are issued by the transaction system. By leaving enough footprints around before performing operations like fix, unfix and latch, the user process aids system processes in performing the necessary clean-ups.

For the variety of reasons mentioned in this section and elsewhere, writing CLRs is a very good idea even if the system is supporting only page locking. This is to be contrasted with the no-CLRs approach, suggested in [52], which supports only page locking.
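The media recovery redo point is a minimum over two quantities, which a few lines make concrete. This is a sketch under stated assumptions: the function name, its parameters, and the dictionary representation of the end_chkpt dirty_pages list are hypothetical.

```python
def media_recovery_redo_point(begin_chkpt_lsn, chkpt_dirty_pages, entity_pages):
    """Earliest LSN from which the log must be scanned to roll an image copy forward.

    chkpt_dirty_pages: page id -> RecLSN, from the image copy checkpoint's
    end_chkpt record. entity_pages: pages belonging to the image-copied entity.
    """
    rec_lsns = [rec_lsn for page, rec_lsn in chkpt_dirty_pages.items()
                if page in entity_pages]
    # Pages dirtied after the checkpoint began are covered by begin_chkpt_lsn.
    return min(rec_lsns + [begin_chkpt_lsn])

dirty = {"F1.p3": 120, "F1.p8": 95, "F2.p1": 40}   # F2 is a different entity
point = media_recovery_redo_point(150, dirty, {"F1.p3", "F1.p8"})
clean_point = media_recovery_redo_point(150, dirty, {"F3.p1"})
print(point)        # 95
print(clean_point)  # 150
```

The second call shows why the begin_chkpt LSN must take part in the minimum: an entity with no dirty pages at checkpoint time may still have been updated after the checkpoint began.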
9. NESTED TOP ACTIONS

There are times when we would like some updates of a transaction to be committed, irrespective of whether later on the transaction as a whole commits or not. We do need the atomicity property for these updates themselves. This is illustrated in the context of file extension. After a transaction extends a file, which causes updates to some system data in the database, other transactions may be allowed to use the extended area prior to the commit of the extending transaction. If the extending transaction were to roll back, then it would not be acceptable to undo the effects of the extension. Such an undo might very well lead to a loss of updates performed by the other committed transactions. On the other hand, if the extension-related updates to the system data in the database were themselves interrupted by a failure before their completion, then it is necessary to undo them. These kinds of actions have been traditionally performed by starting independent transactions, called top actions [51]. A transaction initiating such an independent transaction waits until that independent transaction commits before proceeding. The independent transaction mechanism is, of course, vulnerable to lock conflicts between the initiating transaction and the independent transaction, which would be unacceptable.

In ARIES, using the concept of a nested top action, we are able to support the above requirement very efficiently, without having to initiate independent transactions to perform the actions. A nested top action, for our purposes, is taken to mean any subsequence of actions of a transaction which should not be undone once the sequence is complete and some later action which is dependent on the nested top action is logged to stable storage, irrespective of the outcome of the enclosing transaction.

A transaction execution performing a sequence of actions which define a nested top action consists of the following steps:

(1) ascertaining the position of the current transaction's last log record;
(2) logging the redo and undo information associated with the actions of the nested top action; and
(3) on completion of the nested top action, writing a dummy CLR whose UndoNxtLSN points to the log record whose position was remembered in step (1).

We assume that the effects of any actions like creating a file and their associated updates to system data normally resident outside the database are externalized before the dummy CLR is written. When we discuss redo, we are referring to only the system data that is resident in the database itself.

Using this nested top action approach, if the enclosing transaction were to roll back after the completion of the nested top action, then the dummy CLR will ensure that the updates performed as part of the nested top action are not undone. If a system failure were to occur before the dummy CLR is written, then the incomplete nested top action will be undone since the nested top action's log records are written as undo-redo (as opposed to redo-only) log records. This provides the desired atomicity property for the nested top action itself. Unlike for the normal CLRs, there is nothing to redo when a dummy CLR is encountered during the redo pass. The dummy CLR in a sense can be thought of as the commit record for the nested top action. The advantage of our approach is that the enclosing transaction need not wait for this record to be forced to stable storage before proceeding with its subsequent actions.6 Also, we do not pay the price of starting a new transaction. Nor do we run into lock conflict problems. Contrast this approach with the costly independent-transaction approach.

[Figure 14: log records 1, 2, 3, 4, 5 followed by the dummy CLR 6'; records 3, 4 and 5 form the nested top action.]

Fig. 14. Nested top action example.

Figure 14 gives an example of a nested top action consisting of the actions 3, 4 and 5. Log record 6' acts as the dummy CLR. Even though the enclosing transaction's activity is interrupted by a failure and hence it needs to be rolled back, 6' ensures that the nested top action is not undone.

It should be emphasized that the nested top action implementation relies on repeating history. If the nested top action consists of only a single update, then we can log that update using a single redo-only log record and avoid writing the dummy CLR. Applications of the nested top action concept in the context of a hash-based storage method and index management can be found in [59, 62].

6 The dummy CLR may have to be forced if some unlogged updates may be performed later by other transactions which depended on the nested top action having completed.

10. RECOVERY PARADIGMS

This section describes some of the problems associated with providing fine-granularity (e.g., record) locking and handling transaction rollbacks. Some additional discussion can be found in [97]. Our aim
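The effect of the dummy CLR in Figure 14 can be traced by following the backward chain during rollback. The sketch below is illustrative only: the dictionary-based log and the `rollback` helper are assumptions, and the CLRs that a real rollback would write are omitted.

```python
def rollback(log, last_lsn):
    """Return the LSNs that get undone when rolling back from last_lsn."""
    undone = []
    lsn = last_lsn
    while lsn != 0:
        rec = log[lsn]
        if rec["type"] == "clr":
            lsn = rec["undo_nxt_lsn"]      # jump over already-settled work
        else:
            undone.append(lsn)
            lsn = rec["prev_lsn"]
    return undone

# Figure 14: updates 1 and 2, nested top action 3-5, dummy CLR 6'.
log = {
    1: {"type": "update", "prev_lsn": 0},
    2: {"type": "update", "prev_lsn": 1},  # position remembered in step (1)
    3: {"type": "update", "prev_lsn": 2},  # nested top action, undo-redo records
    4: {"type": "update", "prev_lsn": 3},
    5: {"type": "update", "prev_lsn": 4},
    6: {"type": "clr", "prev_lsn": 5, "undo_nxt_lsn": 2},  # dummy CLR 6'
}
after_completion = rollback(log, 6)   # failure after the dummy CLR was written
before_completion = rollback(log, 5)  # failure before the dummy CLR was written
print(after_completion)   # [2, 1]
print(before_completion)  # [5, 4, 3, 2, 1]
```

The same chain-following logic that handles ordinary CLRs gives both behaviors: the completed nested top action survives the enclosing transaction's rollback, while an incomplete one is undone like any other update.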
is to show us difficulties certain why with transaction caused we show developed associated handling some features of the context how certain in accomplishwhich recovery of fineSome the we had to paradigms shadow 6 The dummy CLR may have to be forced if some urdogged updates may be performed other transactions which depended on the nested top action having completed. page later by ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. 136 C. Mohan et al. . are inappropriate technique, high levels of paradigms have algorithms with The System — undo been In adopted redo work preceding of context or and of WAL, [3, 15, of interest of those leading 16, is a need there more 52, to the 71, for System 72, R design 78, of 82, 881. (i.e., no are: recovery. redo updates one errors are restart is to be used past, and/or that during logging WAL the in the limitations R paradigms — selective — no when concurrency. work during performed restart recovery. during transaction rollback CLRS). — no logging —no of index tracking no LSNS 10.1 goal has been subsection implemented in aim is to in supporting is to motivate why database recovery undo pass (see Figure 6). redo pass. As show is incorrect DB2, on the [311. We as we the describing record’s the log record’s is less than performed not ation The make undo the page (see to ACM Transactions the and and an then undo redo the preceding WAL-based pass, System in-doubt) selective approach redo transac- paradigm to take, it Writing the an If has of many the the CLR when also pass, then the not it an undo when page, is not on Database Systems, Vol. 17, No. 1, March 1992, is the actually a failure to LSN action is Whether a CLR rolled the set page undo being record record’s than page. of the are contain to handle handling less if the no as part log LSN on the actions does force is a before. 
of a log the page’s undo consider described LSN is performed on us LSN the performed page not in as page and transaction’s and and whether performed the LSN to be undone, been locking inconsistencies Let to the the During undo have to be necessary implemented. redone record page to data determine page. 15). when be only lead is compared to is actually simpler support will contains the log when DB2, page Figure be even out the perform pass The (i.e., it recovery. pass of During While to Otherwise, written, way. undo locking. prepared were page to would recovery turns redo that that generally a R paradigm efficient as update of the that is written, log: and redo WAL-based the opposite. LSN the the page. needs media page to L SN is always each the LSN updates CLR (i.e., problems they the System approach locking then the of performs redo. such be reapplied on the in a special the to with fine-granularity the This which pass, LSN, the history. to be the [151. update needs log or redo of selective show failures, the selectiue record in an update ing redo if concept after of committed systems, technique During updates below, WAL-based systems, WAL to logged to repeats passes and seems selective and R first does just this discuss 2 later, WAL actions call pitfalls, in System hand, the R intuitively such will with System perform changes. it locking restart updates other only the systems ARIES systems we redo Some to relate fine-granularity transaction tions information itself introduce many When R redoes management on page Redo of this introduces The space state on pages). Selective The and of page describ- undo oper- rolled back. update, just back updates performed of the to on system ARIES: A Transaction Recovery Method T1 Is a Nonloser Update 30 20 Fig. 15. during restart Selective recovery. which did not PI which had to be undone, Pi’s LSN were the being have completion the other hand, problem. 
Writing the CLR when an undo is not actually performed on the page turns out to be necessary also when handling a failure of the system during restart recovery. This will happen, if there was an update U2 for page P1 which did not have to be undone, but there was an earlier update U1 for P1 which had to be undone, resulting in U1' (CLR for U1) being written and P1's LSN being changed to the LSN of U1' (> LSN of U2). After that, if P1 were to be written to nonvolatile storage before a system failure interrupts the completion of this restart, then, during the next restart, it would appear as if P1 contains the update U2 and an attempt would be made to undo it. On the other hand, if U2' had been written then there would not be any problem. It should be emphasized that this problem arises even when only page locking is used, as is the case with DB2 [15].

Fig. 15. Selective redo with WAL: problem-free scenario. (T1 is a nonloser and T2 is a loser. REDO redoes update 30; UNDO undoes update 20.)

Given these properties of the selective redo WAL-based method under discussion, we would lose track of the state of a page with respect to a losing (in-progress or in-rollback) transaction in the situation where the page modified first by the losing transaction (say, update with LSN 20 by T2) was subsequently modified by a nonloser transaction's update (say, update with LSN 30 by T1) which had to be redone. The latter would have pushed the LSN of the page beyond the value established by the loser. So, when the time comes to undo the loser, we would not know if its update needs to be undone or not. Figures 15 and 16 illustrate this problem with selective redo and fine-granularity locking. In the latter scenario, not redoing the update with LSN 20 since it belongs to a loser transaction, but redoing the update with LSN 30 since it belongs to a nonloser transaction, causes the undo pass to perform the undo of the former update even though it is not present in the page. This is because the undo logic relies on the page_LSN value to determine whether or not an update should be undone (undo if page_LSN is greater than or equal to the log record's LSN). By not repeating history, the page_LSN is no longer a true indicator of the current state of the page.

Undoing an action even when its effect is not present in a page will be harmless only under certain conditions; for example, with physical/byte-oriented locking and logging, as they are implemented in IMS [76], VAX DBMS and VAX Rdb/VMS [81], and other systems [6], there is no automatic reuse of freed space, and unique keys for all records. With operation logging, data inconsistencies will be caused by undoing an original operation whose effect is not present in the page.

Fig. 16. Selective redo with WAL: problem scenario. (T1 is a nonloser and T2 is a loser; the log carries LSNs 10, 20 and 30, followed by a commit. REDO redoes update 30; UNDO will try to undo 20 even though the update is NOT on the page. ERROR?!)

Reversing the order of the selective redo and undo passes will not solve the problem either. This incorrect approach is suggested in [3]. If the undo pass were to precede the redo pass, then we might lose track of which actions need to be redone. In Figure 15, the undo of 20 would make the page LSN become greater than 30, because of the writing of a CLR and the assignment of that CLR's LSN to the page. Since, during the redo pass, a log record's update is redone only if the page_LSN is less than the log record's LSN, we would not redo 30 even though that update is not present on the page. Not redoing that update would violate the durability and atomicity properties of transactions.

The use of the shadow page technique by System R makes it unnecessary to have the concept of page_LSN in that system to determine what needs to be undone and what needs to be redone. With the shadow page technique, during a checkpoint, an action consistent version of the database, called the shadow version, is saved on nonvolatile storage. Updates between two checkpoints create a new version of the updated page, thus constituting the current version of the database (see Figure 1). During restart, recovery is performed from the shadow version, and shadowing is done even during restart recovery. As a result, there is no ambiguity about which updates are in the database and which are not. All updates logged after the last checkpoint are not in the database, and all updates logged before the checkpoint are in the database.⁷ This is one reason the System R recovery method functions correctly even with selective redo. The other reason is that index and space management changes are not logged, but are redone or undone logically.⁸

⁷ This simple view, as it is depicted in Figure 17, is not completely accurate; see Section 10.2.

⁸ In fact, if index changes had been logged, then selective redo would not have worked. The problem would have come from structure modifications (like page split) which were performed after the last checkpoint by loser transactions which were taken advantage of later by transactions which ultimately committed. Even if logical undo were performed (if necessary), if redo was page oriented, selective redo would have caused problems. To make it work, the structure modifications could have been performed using separate transactions. Of course, this would have been very expensive. For an alternate, efficient solution, see [62].

As was described before, ARIES does not perform selective redo, but repeats history. Apart from allowing us to support fine-granularity locking, repeating history has another beneficial side effect. It gives us the ability to commit some actions of a transaction irrespective of whether the transaction ultimately commits or not, as was described in Section 9.
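The selective redo problem of Section 10.1 can be replayed with a few lines of Python. This is a minimal sketch, not code from any of the systems discussed: the names `Page`, `redo_pass`, `undo_pass` and `wrongly_undone` are hypothetical, the initial on-disk page_LSN of 10 is an assumed value, and the `applied` set exists only so that the sketch can check what is really on the page. The redo and undo tests are the page_LSN comparisons stated in the text.

```python
from dataclasses import dataclass, field

# Log for a single page P1: (LSN, transaction). Values follow the Figure 16 scenario.
LOG = [(20, "T2"), (30, "T1")]
LOSERS = {"T2"}  # T2 was in progress at the failure; T1 committed


@dataclass
class Page:
    page_lsn: int
    applied: set = field(default_factory=set)  # LSNs whose effects are on the page


def redo_pass(page, selective):
    for lsn, txn in LOG:
        if selective and txn in LOSERS:
            continue  # selective redo ignores updates of loser transactions
        if page.page_lsn < lsn:  # update is missing from the page: reapply it
            page.applied.add(lsn)
            page.page_lsn = lsn


def undo_pass(page):
    """Return the loser LSNs that the page_LSN test says must be undone."""
    return [lsn for lsn, txn in reversed(LOG)
            if txn in LOSERS and page.page_lsn >= lsn]


def wrongly_undone(selective):
    """Loser updates that undo would reverse although they are not on the page."""
    page = Page(page_lsn=10)  # assumed on-disk state: neither update is present
    redo_pass(page, selective)
    return [lsn for lsn in undo_pass(page) if lsn not in page.applied]


print(wrongly_undone(selective=True))
print(wrongly_undone(selective=False))
```

With selective redo the sketch reports LSN 20 as an update that would be undone without being on the page; when history is repeated the list is empty, because the page_LSN again reflects what the page contains.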
10.2 Rollback State

The goal of this subsection is to discuss the difficulties introduced by rollbacks in tracking their progress and how writing CLRs that describe updates performed during rollbacks solves some of the problems. While the concept of writing CLRs has been implemented in many systems and has been around for a long time, there has not really been, in the literature, a significant discussion of CLRs, problems relating to them and the advantages of writing them. Their utility and the fundamental role that they play in recovery have not been well recognized by the research community. In fact, whether undone actions could be undone and what additional problems these would present were left as open questions in [56]. In this section and elsewhere in this paper, in the appropriate contexts, we try to note all the known advantages of writing CLRs. We summarize these advantages in Section 13.

A transaction may totally or partially roll back its actions for any number of reasons. For example, a unique key violation will cause only the rollback of the update statement causing the violation and not of the entire transaction. Figure 3 illustrates a partial rollback. Supporting partial rollback [1, 31], at least internally, if not also at the application level, is a very important requirement for present-day transaction systems.

Since a transaction may be rolling back when a failure occurs and since some of the effects of the updates performed during the rollback might have been written to nonvolatile storage, we need a way to keep track of the state of progress of transaction rollback. It is relatively easy to do this in System R. The only time we care about the transaction state in System R is at the time of a checkpoint. So, the checkpoint record in System R keeps track of the next record to be undone for each of the active transactions, some of which may already be rolling back. The rollback state of a transaction at the time of a system failure is unimportant since the database changes performed after the last checkpoint are not visible in the database during restart. That is, restart recovery starts from the state of the database as of the last checkpoint before the system failure; this is the shadow version of the database at the time of system failure. Despite this, since CLRs are never written, System R needs to do some special processing to handle those committed or in-doubt transactions which initiated and completed partial rollbacks after the last checkpoint. The special handling is to avoid the need for multiple passes over the log during the redo pass. The designers wanted to avoid redoing some actions only to have to undo them a little later with a backward scan, when the information about a partial rollback having occurred is encountered.

Fig. 17. Simple view of recovery processing in System R. (Relative to the last checkpoint: uncommitted changes need undo; committed or in-doubt changes need redo.)

Fig. 18. Partial rollback handling in System R. (Log records 1 to 9 of one transaction, with a checkpoint in the log.)

Figure 18 depicts an example of a restart recovery scenario for System R. All log records are written by the same transaction, say T1. In the checkpoint record, the information for T1 points to log record 2 since by the time the checkpoint was taken log record 3 had already been undone because of a partial rollback. System R does not write a separate log record to say that a partial rollback took place. Such information must be inferred from the breakage in the chaining of the log records of a transaction. Ordinarily, a log record written by a transaction points to the record that was most recently written by that transaction via the PrevLSN pointer. But the first forward processing log record written after the completion of a partial rollback does not follow this protocol. When we examine, as part of the restart analysis pass, log record 4 and notice that its PrevLSN pointer is pointing to 1, instead of the immediately preceding log record 3, we conclude that the partial rollback that started with the undo of 3 ended with the undo of 2. Since, during restart, the database state from which recovery needs to be performed is the state of the database as of the last checkpoint, log record 2 definitely needs to be undone. Whether 1 needs to be undone or not will depend on whether T1 is a losing transaction or not.

During the analysis pass it is determined that log record 9 points to log record 5 and hence it is concluded that a partial rollback had caused the undo of log records 6, 7, and 8. To ensure that the rolled back records are not redone during the redo pass, the log is patched by putting a forward pointer during the analysis pass in log record 5 to make it point to log record 9.
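The inference of partial rollbacks from breaks in the PrevLSN chain, as described for Figure 18, can be sketched as follows. This is an illustration only: the function name `infer_partial_rollbacks`, the record layout and the use of 0 as the PrevLSN of a transaction's first record are assumptions, and the sketch ignores the checkpoint, which in System R decides whether a rolled back record must be undone (like record 2) or merely must not be redone (like records 6 to 8).

```python
def infer_partial_rollbacks(records):
    """records: (lsn, prev_lsn) pairs of ONE transaction, in log order.

    prev_lsn is 0 for the transaction's first record (an assumed convention).
    Returns the set of LSNs that were rolled back and a map of forward
    pointers {lsn_to_patch: next_live_lsn} for skipping them.
    """
    undone = set()
    forward_patch = {}
    last_written = 0
    for lsn, prev_lsn in records:
        if prev_lsn != last_written:
            # Chain break: everything written after prev_lsn was rolled back
            # before this record was logged.
            undone.update(l for l, _ in records if prev_lsn < l <= last_written)
            forward_patch[prev_lsn] = lsn
        last_written = lsn
    return undone, forward_patch


# PrevLSN chain of T1 as described for Figure 18: record 4 points back to 1,
# and record 9 points back to 5.
T1_LOG = [(1, 0), (2, 1), (3, 2), (4, 1), (5, 4), (6, 5), (7, 6), (8, 7), (9, 5)]

undone, forward_patch = infer_partial_rollbacks(T1_LOG)
print(sorted(undone))
print(forward_patch)
```

For the chain above, the sketch identifies records 2, 3, 6, 7 and 8 as rolled back and produces a forward pointer from record 5 to record 9, which is the patch described in the text. It also reports a pointer from record 1 to record 4; that one is an artifact of ignoring the checkpoint.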
If log record 9 is a commit record then, during the undo pass log record 2 will be undone and during the redo pass log records 4 and 5 will be redone. Here, the same transaction is involved in both the undo and the redo passes. To see why the undo pass has to precede the redo pass in System R,⁹ consider the following scenario: Since a transaction that deleted a record is allowed to reuse that record's ID for a record inserted later by the same transaction, in the above case, a record might have been deleted because of the partial rollback, which had to be dealt with in the undo pass, and that record's ID might have been reused in the portion of the transaction that is dealt with in the redo pass. To repeat history with respect to the original sequence of actions before the failure, the undo must be performed before the redo is performed.

⁹ In the other systems, because of the fact that CLRs are written and that, sometimes, page LSNs are compared with log records' LSNs to determine whether redo needs to be performed or not, the redo pass precedes the undo pass; see Section 10.1, "Selective Redo", and Figure 6.

If 9 is neither a commit record nor a prepare record, then the transaction will be determined to be a loser and during the undo pass log records 2 and 1 will be undone. In the redo pass, none of the records will be redone.

Since CLRs are not written in System R and hence the exact way in which one transaction's undo operations were interspersed with other transactions' forward processing or undo actions is not known, the processing, during restart, may be quite different from what happened during normal processing (i.e., repeating history is impossible to guarantee). Not logging index changes in System R also further contributes to this (see footnote 8). These could potentially cause some space management problems such as a split that did not occur during normal processing being required during the restart redo or undo processing (see also Section 5.4). Not writing CLRs also prevents logging of redo information from being done physically (i.e., the operation performed on an object has to be logged, not the after-image created by the operation). Let us consider an example: A piece of data has value 0 after the last checkpoint. Then, transaction T1 adds 1, T2 adds 2, T1 rolls back, and T2 commits. If T1 and T2 had logged the after-image for redo and the operation for undo, then there will be a data integrity problem because after recovery the data will have the value 3 instead of 2. In this case, in System R, undo for T1 is being accomplished by not redoing its update. Of course, System R did not support the fancy lock mode which would be needed to support 2 concurrent updates by different transactions to the same object. Allowing the logging of redo information physically will let redo recovery be performed very efficiently using dumb logic. This does not necessarily mean byte-oriented logging; that will depend on whether or not flexible storage management is used (see Section 10.3). Allowing the logging of undo information logically will permit high concurrency to be supported (see [59, 62] for examples). ARIES supports these.

WAL-based systems handle this problem by logging actions performed during rollbacks using CLRs. So, as far as recovery is concerned, the state of the data is always "marching" forward, even if some original actions are being rolled back. Contrast this with the approach, suggested in [52], in which the state of the data, as denoted by the LSN, is "pushed" back during rollbacks. That method works only with page level (or coarser granularity) locking. The immediate consequence of writing CLRs is that, if a transaction were to be rolled back, then some of its original actions are undone more than once and, worse still, the compensating actions are also undone, possibly more than once. This is illustrated in Figure 4, in which a transaction had started rolling back even before the failure of the system. Then, during recovery, the previously written CLRs are undone and already undone non-CLRs are undone again. ARIES avoids such a situation, while still retaining the idea of writing CLRs. Not undoing CLRs has benefits relating to deadlock management and early release of locks on undone objects also (see item 22, Section 12, and Section 6.4). Additional benefits of CLRs are discussed in the next section and in [69]. Some were already discussed in Section 8.

Unfortunately, recovery methods like the one suggested in [92] do not support partial rollbacks. We feel that this is an important drawback of such methods.
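The after-image example above (value 0, T1 adds 1, T2 adds 2, T1 rolls back, T2 commits) can be checked with a short Python sketch. The function names and the log layout are hypothetical, and the CLR is modeled simply as one more after-image record written when T1 rolls back.

```python
# Each record is (transaction, after_image). The object starts at 0.
LOG_NO_CLR = [("T1", 1), ("T2", 3)]               # T1's rollback left no trace
LOG_WITH_CLR = [("T1", 1), ("T2", 3), ("T1", 2)]  # last record is T1's CLR


def selective_redo(log, committed, start=0):
    """Redo only the after-images written by committed transactions."""
    value = start
    for txn, after_image in log:
        if txn in committed:
            value = after_image
    return value


def repeat_history(log, start=0):
    """Redo every logged after-image, including those carried by CLRs."""
    value = start
    for _txn, after_image in log:
        value = after_image
    return value


print(selective_redo(LOG_NO_CLR, committed={"T2"}))
print(repeat_history(LOG_WITH_CLR))
```

Selective redo of physical after-images leaves the value 3, which still contains T1's rolled back increment. Replaying the whole log, with the rollback logged as a CLR, ends at 2, and since T1 had finished rolling back there is nothing left to undo in this sketch.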
10.3 Space Management

The goal of this subsection is to point out the problem involved in space management when finer than page level granularity of locking and varying length records are to be supported efficiently.

A problem to be dealt with in doing record locking with flexible storage management is to make sure that the space released by a transaction during record deletion or update on a data page is not consumed by another transaction until the space-releasing transaction is committed. This problem is discussed briefly in [76]. We do not deal with solutions to this problem in this paper. The interested reader is referred to [50]. For index updates, in the interest of increasing concurrency, we do not want to prevent the space released by one transaction from being consumed by another before the commit of the first transaction. The way undo is dealt with under such circumstances using a logical undo is described in [62].

Since flexible storage management was a goal, it was not desirable to do physical (i.e., byte-oriented) locking and logging of data within a page, as some systems do (see [6, 76, 81]). That is, we did not want to use the address of the first byte of a record as the lock name for the record. We did not want to identify the specific bytes that were changed on the page. The logging and locking have to be logical within a page. The record's lock name looks something like (page #, slot #) where the slot # identifies a location on the page which then points to the actual location of the record. The log record describes how the contents of the data record got changed. The consequence is that garbage collection that collects unused space on a page does not have to lock or log the records that are moved around within the page. This gives us the flexibility of being able to move records around within a page to store and modify variable length records efficiently. In systems like IMS, utilities have to be run quite frequently to deal with storage fragmentation. These reduce the availability of data to users.

Fig. 19. Wrong redo point causing problem with space for insert. (Log for one page: Delete R1 frees 200 bytes; Insert R2 consumes 200 bytes; Delete R2 frees 200 bytes; Insert R3 consumes 100 bytes; Commit. The page is full before Delete R1. Redo attempted from Insert R2 against the page state on disk fails due to lack of space.)

Figure 19 shows a scenario in which not keeping track of the actual page state (by, e.g., storing the LSN in the page) and attempting to perform redo from an earlier point in the log leads to problems when flexible storage management is used. Assuming that all updates in Figure 19 involve the same page and the same transaction, an insert requiring 200 bytes is attempted on a page which has only 100 bytes of free space left in it. This shows the need for exact tracking of page state using an LSN to avoid attempting to redo operations which are already applied to the page.
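The Figure 19 scenario can be reproduced with a small Python sketch. The LSN values 1 to 4, the names `Page`, `OutOfSpace` and `redo`, and the assumption that the page on disk already reflects all four logged operations are illustrative choices, not details given in the figure.

```python
from dataclasses import dataclass


class OutOfSpace(Exception):
    pass


@dataclass
class Page:
    free_bytes: int
    page_lsn: int


# (LSN, operation, change in free bytes); the page is full before LSN 1.
LOG = [
    (1, "delete R1", +200),
    (2, "insert R2", -200),
    (3, "delete R2", +200),
    (4, "insert R3", -100),
]


def redo(page, from_lsn, use_page_lsn):
    """Reapply logged operations starting at from_lsn."""
    for lsn, what, delta in LOG:
        if lsn < from_lsn:
            continue
        if use_page_lsn and lsn <= page.page_lsn:
            continue  # the page already contains this update
        if page.free_bytes + delta < 0:
            raise OutOfSpace(f"{what} needs {-delta} bytes, {page.free_bytes} free")
        page.free_bytes += delta
        page.page_lsn = lsn
    return page


# Assumed page state on disk: all four operations applied, 100 bytes free.
try:
    redo(Page(free_bytes=100, page_lsn=4), from_lsn=2, use_page_lsn=False)
except OutOfSpace as err:
    print("redo without page_LSN check fails:", err)

print(redo(Page(free_bytes=100, page_lsn=4), from_lsn=2, use_page_lsn=True))
```

Without the page_LSN comparison, the redo of the 200-byte insert is attempted on a page with 100 free bytes and fails. With the comparison, every record at or below the page_LSN is skipped and the page is left untouched.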
Typically, each file containing records of one or more relations has a few pages called free space inventory pages (FSIPs). They are called space map pages (SMPs) in DB2. Each FSIP describes the space information relating to many data pages. During a record insert operation, possibly based on information obtained from a clustering index about the location of other records with the same key (or closely related keys) as that of the new record, one or more FSIPs are consulted to identify a data page with enough free space in it for inserting the new record. The FSIP keeps only approximate information (e.g., information such as that at least 25% of the page is full, at least 50% is full, etc.) to make sure that not every space-releasing or -consuming operation to a data page requires an update to the space information in the corresponding FSIP. To avoid special handling of the recovery of the FSIPs during redo and undo, and also to provide recovery independence, updates to the FSIPs must also be logged.

Transaction T1 might cause the space on a data page to change from 23% full to 27% full, thereby requiring an update to the FSIP to change it from 0% full to 25% full. Later, T2 might cause the space to go to 35% full, which does not require an update to the FSIP. Now, if T1 were to roll back, then the space would change to 31% full and this should not cause an update to the FSIP. If T1 had written its FSIP change log record as a redo/undo record, then T1's rollback would cause the FSIP entry to say 0% full, which would be wrong, given the current state of the data page. This scenario points to the need for logging the changes to the FSIP as redo-only changes and for the need to do logical undos with respect to the free space inventory updates. That is, while undoing a data page update operation, the system has to determine whether that operation causes the free space information to change and if it does cause a change, then update the FSIP and write a CLR which describes the change to the FSIP. We can easily construct an example in which a transaction does not perform an update to the FSIP during forward processing, but needs to perform an update to the FSIP during rollback. We can also construct an example in which the update performed during forward processing is not the exact inverse of the update during the rollback.
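The FSIP scenario above can be traced with the following Python sketch. The 25% bucket scheme matches the example in the text, but the function names `bucket` and `run`, the log tuple layout and the second scenario's numbers are hypothetical. Undo is logical here: the FSIP value is recomputed from the data page's current state instead of restoring a logged before-image.

```python
def bucket(pct_full):
    """Coarse FSIP value: the largest of 0, 25, 50, 75 not exceeding pct_full."""
    return min(pct_full // 25, 3) * 25


def run(start_pct, changes):
    """Apply (kind, delta_pct) changes to one data page and its FSIP entry.

    kind is 'forward' or 'rollback'. The FSIP is updated, and a log record
    written, only when the page crosses into a different bucket.
    """
    pct, fsip, log = start_pct, bucket(start_pct), []
    for kind, delta in changes:
        pct += delta
        new = bucket(pct)
        if new != fsip:
            record_type = "redo-only" if kind == "forward" else "CLR"
            log.append((record_type, fsip, new))
            fsip = new
    return pct, fsip, log


# Scenario from the text: T1 takes the page from 23% to 27%, T2 to 35%, T1 rolls back.
print(run(23, [("forward", +4), ("forward", +8), ("rollback", -4)]))

# Assumed second scenario: no FSIP update going forward, but one during rollback.
print(run(27, [("forward", +2), ("forward", -3), ("rollback", -2)]))
```

In the first run the only FSIP log record is T1's redo-only change from 0 to 25, and the rollback leaves the entry at 25, which agrees with the page being 31% full. In the second run no FSIP record is written during forward processing, but the rollback moves the page from 26% to 24% and therefore writes a CLR that changes the entry from 25 to 0.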
10.4 Multiple LSNs

Noticing the problems caused by having one LSN per page when trying to support record locking, it may be tempting to suggest that we track each object's state precisely by assigning a separate LSN to each object. Next we explain why it is not a good idea.

DB2 already supports a granularity of locking that is less than a page. This happens in the case of indexes where the user has the option of requiring DB2 to physically divide up each leaf page of the index into 2 to 16 minipages and do locking at the granularity of a minipage [10, 12]. The way recovery is supported properly with such a locking granularity, despite not redoing the actions of loser transactions, is as follows. DB2 tracks each minipage's state separately by associating an LSN with each minipage, besides having an LSN for the leaf page as a whole. Whenever a minipage is updated, the corresponding log record's LSN is stored in the minipage LSN field. The page LSN is set equal to the maximum of the minipage LSNs. During undo, it is the minipage LSN and not the page LSN that is compared to the log record's LSN to determine if that log record's update needs to be actually undone on the minipage. This technique, besides incurring too much space overhead for storing the LSNs, tends to fragment (and therefore waste) space available for storing keys. Further, it does not carry over conveniently to the case of record and key locking, especially when varying length objects have to be supported efficiently. Maintaining LSNs for deleted objects is cumbersome at best. We desired to have a single state variable (LSN) for each page, even when minipage locking is being done, to make recovery, especially media recovery, very efficient. The simple technique of repeating history during restart recovery before performing the rollback of loser transactions turns out to be sufficient, as we have seen in ARIES. Since DB2 physically divides up a leaf page into a fixed number of minipages, no special technique is needed to handle the space reservation problem. Methods like the one proposed in [61] for fine-granularity locking do not support varying length objects (atoms in the terminology of that paper).
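A minimal Python sketch of the per-minipage LSN bookkeeping described above follows. It is not DB2 code: the class `LeafPage`, its method names and the LSN values are hypothetical, and the undo test (undo only if the stored LSN is greater than or equal to the log record's LSN) is the rule stated in Section 10.1, applied per minipage.

```python
from dataclasses import dataclass


@dataclass
class LeafPage:
    minipage_lsn: list  # one LSN per minipage

    @property
    def page_lsn(self):
        # The page LSN is kept equal to the maximum of the minipage LSNs.
        return max(self.minipage_lsn)

    def apply(self, minipage, lsn):
        """Record that the update with this LSN was applied to the minipage."""
        self.minipage_lsn[minipage] = lsn

    def undo_needed(self, minipage, record_lsn):
        """Per-minipage undo test: is the update actually on the minipage?"""
        return self.minipage_lsn[minipage] >= record_lsn


# Assumed scenario: a loser's update with LSN 20 to minipage 0 never reached
# the page, while a nonloser's update with LSN 30 to minipage 1 was redone.
page = LeafPage(minipage_lsn=[10, 10])
page.apply(1, 30)

print(page.page_lsn >= 20)       # page-level test would wrongly call for undo
print(page.undo_needed(0, 20))   # minipage-level test correctly says no
```

The sketch shows why the comparison has to use the minipage LSN when loser updates are not redone: the page LSN of 30 says nothing about whether the update with LSN 20 is present on minipage 0.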
11. OTHER WAL-BASED METHODS

In the following, we summarize the properties of some other significant recovery methods which also use the WAL protocol. Recovery methods based on the shadow page technique (like that of System R) are not considered here because of their well-known disadvantages, e.g., very costly checkpoints, extra nonvolatile storage overhead for the shadow copies of data, disturbing the physical clustering of data, and extra I/Os involving page map blocks (see the previous sections of this paper and [31] for additional discussions). First, we briefly introduce the different systems and recovery methods which we will be examining in this section. Next, we compare the different methods along various dimensions. We have been informed that the DB-cache recovery method of [25] has been implemented with significant modifications by Siemens. But, because of lack of information about the implementation, we are unable to include it here.

IBM's IMS/VS [41, 42, 43, 48, 53, 76, 80, 94], which is a hierarchical database system, consists of two parts: IMS Full Function (FF), which is relatively flexible, and IMS Fast Path (FP) [28, 42, 93], which is more efficient but has many restrictions (e.g., no support for secondary indexes). A single IMS transaction can access both FF and FP data. The recovery and buffering methods used by the two parts have many differences. In FF, depending on the database types and the operations, the granularity of the locked objects varies. FP supports two kinds of databases: main storage databases (MSDBs) and data entry databases (DEDBs). MSDBs support only fixed length records, but FP provides the mechanisms (i.e., field calls) to make the lock hold times be the minimum possible for MSDB records. Only page locking is supported for DEDBs. But, DEDBs have many high-availability and parallelism features and large database support. IMS, with XRF, provides hot-standby support [43]. IMS, via global locking, also supports data sharing across two different systems, each with its own buffer pools [80, 94].

DB2 is IBM's relational database system for the MVS operating system. Limited distributed data access functions are available in DB2. The DB2 recovery algorithm has been presented in [1, 13, 14, 15, 19]. It supports different locking granularities (tablespace, table, and page for data, and minipage and page for indexes) and consistency levels (cursor stability, repeatable read) [10, 11, 12]. DB2 allows logging to be turned off temporarily for tables and indexes only during utility operations like loading and reorganizing data. A single transaction can access both DB2 and IMS data with atomicity.

The Encompass recovery algorithm [4, 37] with some changes has been incorporated in Tandem's NonStop SQL [95]. With NonStop, Tandem provides hot-standby support for its products. Both Encompass and NonStop SQL support distributed data access. They allow multisite updates within a single transaction using the Presumed Abort two-phase commit protocol of [63, 64]. NonStop SQL supports different locking granularities (file, key prefix and record) and consistency levels (cursor stability, repeatable read, and unlocked or dirty read). Logging can be turned off temporarily or permanently even for nonutility operations on files.

Schwarz [88] presents two different recovery methods based on value logging (a la IMS) and operation logging. The two methods have several differences, as will be outlined below. The value logging method (VLM), which is much less complex than the operation logging method (OLM), has been implemented in CMU's Camelot [23, 90].
Buffer management. Encompass, NonStop SQL, OLM, VLM and DB2 have adopted the steal and no-force policies. During normal processing, VLM and OLM write a fetch record whenever a page is read from nonvolatile storage and an end-write record every time a dirty page is successfully written back to nonvolatile storage. These are written during restart processing alone in DB2. These records help in identifying the super set of dirty pages that might have been in the buffer pool at the time of system failure. DB2 has a sophisticated buffer manager [10, 96], and writes a log record whenever a tablespace or an indexspace is opened, and another record whenever such a space is closed. The close operation is performed only after all the dirty pages of the space have been written back to nonvolatile storage. DB2's analysis pass uses these log records to bring the dirty pages information up to date as of the failure.

For MSDBs, IMS FP does deferred updating. This means that a transaction does not see its own MSDB updates. For DEDBs, a no-steal policy is used. FP writes, at commit time, all the log records for a given transaction in a single call to the log manager. After placing the log records in the log buffers (not on stable storage), the MSDB updates are applied and the MSDB record locks are released. The MSDB locks are released even before the commit log record is placed on stable storage. This is how FP minimizes the amount of time locks are held on the MSDB records. The DEDB locks are transferred to system processes. The log manager is given time to let the log records be forced to stable storage ultimately (i.e., group commit logic is used; see [28]). After the logging is completed (i.e., after the transaction has been committed), all the pages of the DEDBs that were modified by the transaction are forced to nonvolatile storage using system processes which, on completion of the I/Os, release the DEDB locks. This does not result in any uncommitted updates being forced to nonvolatile storage since page locking with a no-steal policy is used for DEDBs. The use of separate processes for writing the DEDB pages to nonvolatile storage is intended to let the user process go ahead with the next transaction's processing as soon as possible and also to gain parallelism for the I/Os.

IMS FF follows the steal and force policies. Before committing a transaction, IMS FF forces to nonvolatile storage all the pages that were modified by that transaction. Since finer than page locking is supported by FF, this may result in some uncommitted data being placed on nonvolatile storage. Of course, all the recovery algorithms considered in this section force the log during commit processing.

Normal checkpointing. Normal checkpoints are the ones that are taken when the system is not in the restart recovery mode. OLM and VLM quiesce all activity in the system and take, similar to System R, an operation consistent (not necessarily transaction consistent) checkpoint. The contents of the checkpoint record are similar to those of ARIES. DB2, IMS, NonStop SQL and Encompass do take (fuzzy) checkpoints even when update and logging activities are going on concurrently. DB2's checkpoint actions are similar to what we described for ARIES. The major difference is that, instead of writing the dirty_pages table, it writes the dirty objects (table spaces, indexspaces, etc.) list with a RecLSN for each object [96]. For MSDBs alone, IMS writes their complete contents alternately on one of two files on nonvolatile storage during a checkpoint. Since deferred updating is performed for MSDBs, no uncommitted changes are present in their checkpointed version. Also, it is ensured that no partial committed changes of a transaction are present. Care is needed since the updates are applied after the commit record is written. For DEDBs, any committed updated pages which have not yet been written to nonvolatile storage are included in the checkpoint records. These together avoid the need for examining, during restart recovery, any log records written before the checkpoint for FP data recovery. Encompass and NonStop SQL might force some dirty pages to nonvolatile storage during a checkpoint. They enforce the policy that requires that a page once dirtied must be written to nonvolatile storage before the completion of the second checkpoint following the dirtying of the page. Because of this policy, the completion of a checkpoint may be delayed waiting for the completion of the writing of the old dirty pages.

Partial rollbacks. Encompass, NonStop SQL, OLM and VLM do not support partial rollbacks. From Version 2 Release 1, IMS supports partial rollbacks. In fact, the savepoint concept is exposed at the application program level. This support is available only to those applications that do not access FP data. The reason FP is excluded is because FP does not write undo data in its log and because deferred updating is performed for MSDBs. DB2 supports partial rollbacks for internal use by the system to provide statement-level atomicity [1].

Compensation log records. Encompass, NonStop SQL, DB2, VLM, OLM and IMS FF write CLRs during normal rollbacks. During a normal rollback, IMS FP does not write CLRs since it would not have written any log records for changes to such data until the decision to commit is made. This is because FP is always the coordinator in two-phase commit and hence it never needs to get into the prepared state. Since deferred updating is performed for MSDBs, the updates kept in pending (to-do) lists are simply discarded at rollback time. Since a no-steal policy is followed and page locking is done for DEDBs, the modified pages of DEDBs are simply purged from the buffer pool at rollback time. Encompass, NonStop SQL, DB2 and IMS (FF and FP) write CLRs during restart rollbacks also. During restart recovery, IMS FP might find some log records written by (at most) one in-progress transaction. This transaction must have been in commit processing (i.e., about to commit, with some of its log records already written to nonvolatile storage) when the system went down. Even though, because of the no-steal policy, none of the corresponding FP updates would have been written to nonvolatile storage and hence there would be nothing to be undone, IMS FP writes CLRs for such records to simplify media recovery [93]. Since the FP log records contain only redo information, just to write these CLRs, for which the undo information is needed, the corresponding unmodified data on nonvolatile storage is accessed during restart recovery. This should illustrate to the reader that even with a no-steal policy and without supporting partial rollbacks, there are still some problems to be dealt with at restart for FP. Too often, people assume that no-steal eliminates many problems. Actually, it has many shortcomings.

VLM does not write CLRs during restart rollbacks. As a result, a bounded amount of logging will occur for a rolled back transaction, even in the face of repeated failures during restart. In fact, CLRs are written only for normal rollbacks. Of course, this has some negative implications with respect to media recovery.
back fact, CLRS negative for undos As transaction, are written implications and redos a result, even a bounded in only with performed the for face of normal respect to during ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. 148 . restart C. Mohan undomodify (called done to deal modify with and rupt restart causing the CLRS for case, grows recovery, writing the The net result is up IMS need Log record of records) (or state) logs both undo by to redo the objects. XRF records for to reduce log NonStop OLM and logs DB2 of each or But map address work the before- and of the OLM OLM’S modify, only IMS does not and The and set of pages of a modified recovery CLRS and fields. of Encompass since their consistent records of the snap- contain corresponding parts VLM DB2 updated records the in updated and information undomodify where by the redomodify L SNS of records. of undo For information Encompass logs an operation the redomodify the FF Since or restart of updated and Ih’ls occupied updates. value [761). names takeover operation. redo undomodify the buffer does (see information. lock log policy, after-image IMS enough the of force (i.e., redo after-images periodically but specifies update the the record recovery. before, includes of DEDBs’ only both only of the information number IMS given of its locking a backup’s redo the information track and information OLM’S to case, media them. others, a information. IMS undo object. which redo have during of redo be undone. undo the is used to contain shot a page logs description redo to system the might need backup need CLRS records. log the the amount SQL and the IMS for for mentioned byte-range) problem. CLR during the failures CLRS Because redo As support, also complete only (i.e., CLRS only policy. hot-standby information the writes physical updates, FP FP no-steal worst linearly. updates information the This also and undo IMS page. 
IMS the grows this like same In restart write failures, the In OLM CLR’S of its logging CLRS’ log the contents. because providing and not thus identical processing. avoids does multiple times of CLRS, repeated ARIES inter- themselves. of multiple, or restart hence undo- failures CLRS writing during how of undo the forward processing. IMS changes and multiple if DB2 written pass record and 5 shows because forward written will that, update is This multiple for during undo writing during records Figure the write generated and records respectively). might are CLRS written of log during wind written its record number CLRS might Encompass for OLM a given CLRS of CLRS a given for No records, restart. records exponentially. ignores during processing. During redomodify and failures redomodify restart worst et al no modify also contain modified object reside.

Page overhead. Encompass and NonStop SQL use one LSN on each page to keep track of the state of the page. VLM uses no LSNs, but OLM uses one LSN. DB2 uses one LSN and IMS FF no LSN. Not having the LSN in IMS FF and VLM to know the exact state of a page does not cause any problems because of IMS' and VLM's value logging and physical locking attributes. It is acceptable to redo an already present update or undo an absent update. IMS FP uses a field in the pages of DEDBs as a version number to correctly handle redos after all the data sharing systems have failed [67]. When DB2 divides an index leaf page into minipages then it uses one LSN for each minipage, besides one LSN for the page as a whole.

Log passes during restart recovery. Encompass and NonStop SQL make two passes (redo and then undo), and DB2 makes three passes (analysis, redo, and then undo; see Figure 6). Encompass and NonStop SQL start their redo passes from the beginning of the penultimate successful checkpoint. This is sufficient because of the buffer management policy of writing to disk a dirty page within two checkpoints after the page became dirty. They also seem to repeat history before performing the undo pass. They do not seem to repeat history if a backup system takes over when a primary system fails [4]. In the case of a takeover by a hot-standby, locks are first reacquired for the losers' updates and then the rollbacks of the losers are performed in parallel with the processing of new transactions. Each loser transaction is rolled back using a separate process to gain parallelism. DB2 starts its redo scan from that point, which is determined using information recorded in the last successful checkpoint, as modified by the analysis pass. As mentioned before, DB2 does selective redo (see Section 10.1).

VLM makes one backward pass and OLM makes three passes (analysis, undo, and then redo). Many lists are maintained during OLM's and VLM's passes. The undomodify and redomodify log records of OLM are used only to modify these lists, unlike in the case of the CLRs written in the other systems. In VLM, the one backward pass is used to undo uncommitted changes on nonvolatile storage and also to redo missing committed changes. No log records are written during these operations. In OLM, during the undo pass, for each object to be recovered, if an operation consistent version of the object does not exist on nonvolatile storage, then it restores a snapshot of the object from the snapshot log record so that, starting from a consistent version of the object, (1) in the remainder of the undo pass any to-be-undone updates that precede the snapshot log record can be undone logically, and (2) in the redo pass any committed or in-doubt updates (modify records only) that follow the snapshot record can be redone logically. This is similar to the shadowing performed in [16, 78] using a separate log; the difference is that the database-wide checkpointing is replaced by object-level checkpointing and the use of a single log instead of two logs.

IMS first reloads MSDBs from the file that received their contents during the latest successful checkpoint before the failure. The dirty DEDB buffers that were included in the checkpoint records are also reloaded into the same buffers as before. This means that, during restart after a failure, the number of buffers cannot be altered. Then, it makes just one forward pass over the log (see Figure 6). During that pass, it accumulates log records in memory on a per-transaction basis and redoes, if necessary, completed transactions' FP updates. Multiple processes are used in parallel to redo the DEDB updates. As far as FP is concerned, only the updates starting from the last checkpoint before the failure are of interest. At the end of that one pass, in-progress transactions' FF updates are undone (using the log records in memory), in parallel, using one process per transaction. If the space allocated in memory for a transaction's log records is not enough, then a backward scan of the log will be performed to fetch the needed records during that transaction's rollback. In the XRF context, when a hot-standby IMS takes over, the handling of the loser transactions is similar to the way Tandem does it. That is, rollbacks are performed in parallel with new transaction processing.

Page forces during restart. OLM, VLM and DB2 force all dirty pages at the end of restart. Information on Encompass and NonStop SQL is not available.

Restart checkpoints. IMS, DB2, OLM and VLM take a checkpoint only at the end of restart recovery. Information on Encompass and NonStop SQL is not available.
Restrictions on data. Encompass and NonStop SQL require that every record have a unique key. This unique key is used to guarantee that if an attempt is made to undo a logged action which was never applied to the nonvolatile storage version of the data, then the latter is realized and the undo fails. In other words, idempotence of operations is achieved using the unique key. IMS in effect does byte-range locking and logging and hence does not allow records to be moved around freely within a page. This results in the fragmentation and the less efficient usage of free space. IMS imposes some additional constraints with respect to FP data. VLM requires that an object's representation be divided into fixed length (less than one page sized), unrelocatable quanta. The consequences of these restrictions are similar to those for IMS.

[2, 26, 56] do not discuss recovery from system failures, while the theory of [33] does not include semantically rich modes of locking (i.e., operation logging). In other sections of this paper, we have pointed out the problems with some of the other approaches that have been proposed in the literature.

12. ATTRIBUTES OF ARIES

ARIES makes few assumptions about the data or its model and has several advantages over other recovery methods. While ARIES is simple, it possesses several interesting and useful properties. Each of most of these properties has been demonstrated in one or more existing or proposed systems, as summarized in the last section. However, we know of no single system, proposed or real, which has all of these properties. Some of these properties of ARIES are:

(1) Support for finer than page-level concurrency control and multiple granularities of locking. ARIES supports page-level and record-level locking in a uniform fashion. Recovery is not affected by what the granularity of locking is. Depending on the expected contention for the data, the appropriate level of locking can be chosen. It also allows multiple granularities of locking (e.g., record, table, and tablespace-level) for the same object (e.g., tablespace). Concurrency control schemes other than locking (e.g., the schemes of [2]) can also be used.

(2) Flexible buffer management during restart and normal processing. As long as the write-ahead logging protocol is followed, the buffer manager is free to use any page replacement policy. In particular, dirty pages of incomplete transactions can be written to nonvolatile storage before those transactions commit (steal policy). Also, it is not required that all pages dirtied by a transaction be written back to nonvolatile storage before the transaction is allowed to commit (i.e., no-force policy). These properties lead to reduced demands for buffer storage and fewer I/Os involving frequently updated (hot-spot) pages. ARIES does not preclude the possibilities of using deferred-updating and force-at-commit policies and benefiting from them. ARIES is quite flexible in these respects.

(3) Minimal space overhead: only one LSN per page. The permanent (excluding log) space overhead of this scheme is limited to the storage required on each page to store the LSN of the last logged action performed on the page. The LSN of a page is a monotonically increasing value.

(4) No constraints on data to guarantee idempotence of redo or undo of logged actions. There are no restrictions on the data with respect to unique keys, etc. Records can be of variable length. Data can be moved around within a page for garbage collection. Idempotence of operations is ensured since the LSN on each page is used to determine whether an operation should be redone or not.

(5) Actions taken during the undo of an update need not necessarily be the exact inverses of the actions taken during the original update. Since CLRs are being written during undos, any differences between the inverses of the original actions and what actually had to be done during undo can be recorded in the former. An example of when the inverse might not be correct is the one that relates to the free space information (like at least 10% free, 20% free) about data pages that are maintained in space map pages. Because of finer than page-level granularity locking, while no free space information change takes place during the initial update of a page by a transaction, a free space information change might occur during the undo (from 20% free to 10% free) of that original change because of intervening update activities of other transactions (see Section 10.3). Other benefits of this attribute in the context of hash-based storage methods and index management can be found in [59, 62].

(6) Support for operation logging and novel lock modes. The changes made to a page can be logged in a logical fashion. The undo information and the redo information for the entire object need not be logged. It suffices if the changed fields alone are logged. Since history is repeated, for increment or decrement kinds of operations before- and after-images of the field are not needed. Information about the type of operation and the decrement or increment amount is enough. Garbage collection actions and changes to some fields (e.g., amount of free space) of that page need not be logged. Novel lock modes based on commutativity and other properties of operations can be supported [2, 26, 88].

(7) Even redo-only and undo-only records are accommodated. While it may be efficient (single call to the log component) sometimes to include the undo and redo information about an update in the same log record, at other
times it may be efficient (from the original data, the undo record can be constructed and, after the update is performed in-place in the data record, from the updated data, the redo record can be constructed) and/or necessary (because of log record size restrictions) to log the information in two different records. ARIES can handle both these conditions. Under these conditions, the undo record must be logged before the redo record.

(8) Support for partial and total transaction rollbacks. Besides allowing a transaction to be rolled back totally, ARIES allows the establishment of savepoints and the partial rollback of transactions to such savepoints. Without the support for partial rollbacks, even logically recoverable errors (e.g., unique key violation, out-of-date cached catalog information in a distributed database system) will require total rollbacks and result in wasted work.

(9) Support for objects spanning multiple pages. Objects can span multiple pages (e.g., an IMS "record" which consists of multiple segments may be scattered over many pages). When an object is modified, if log records are written for every page affected by that update, ARIES works fine. ARIES itself does not treat multipage objects in any special way.

(10) Allows files to be acquired or returned, any time, from or to the operating system. ARIES provides the flexibility of being able to return files dynamically and permanently to the operating system (see [19] for the detailed description of a technique to accomplish this). Such an action is considered to be one that cannot be undone. It does not prevent the same file from being reallocated to the database system. Mappings between objects (table spaces, etc.) and files are not required to be defined statically as in System R.

(11) Some actions of a transaction may be committed even if the transaction as a whole is rolled back. This refers to the technique of using a dummy CLR to implement the concept of nested top actions. File extension has been given as an example situation which could benefit from this. Other applications of this technique, in the context of hash-based storage methods and index management, can be found in [59, 62].

(12) Efficient checkpoints (including during restart recovery). By supporting fuzzy checkpointing, ARIES makes taking a checkpoint an efficient operation. Checkpoints can be taken even when update activities and logging are going on concurrently. Permitting checkpoints even during restart processing will help reduce the impact of failures during restart recovery. The dirty_pages information written during checkpointing helps reduce the number of pages which are read from nonvolatile storage during the redo pass.

(13) Simultaneous processing of multiple transactions in forward processing and/or in rollback accessing same page. Since many transactions could simultaneously be going forward or rolling back on a given page, the level of concurrent access supported could be quite high. Except for the short duration latching which has to be performed any time a page is being physically modified or examined, be it during forward processing or during rollback, rolling back transactions do not affect one another in any unusual fashion.

(14) No locking or deadlocks during transaction rollback. Since no locking is required during transaction rollback, no deadlocks will involve transactions that are rolling back. Avoiding locking during rollbacks simplifies not only the rollback logic, but also the deadlock detector logic. The deadlock detector need not worry about making the mistake of choosing a rolling back transaction as a victim in the event of a deadlock (cf.
System R and R* [31, 49, 64]).

(15) Bounded logging during restart in spite of repeated failures or of nested rollbacks. Even if repeated failures occur during restart, the number of CLRs written is unaffected. This is also true if partial rollbacks are nested. The number of log records written will be the same as that written at the time of transaction rollback during normal processing. The latter again is a fixed number and is, usually, equal to the number of undoable records written during the forward processing of the transaction. No log records are written during the redo pass of restart.

(16) Permits exploitation of parallelism and selective/deferred processing for faster restart. Restart can be made faster by not doing all the needed I/Os synchronously one at a time while processing the corresponding log record. ARIES permits the early identification of the pages needing recovery and the initiation of asynchronous parallel I/Os for the reading in of those pages. The pages can be processed concurrently as they are brought into memory during the redo pass. Undo parallelism requires complete handling of a given transaction by a single process. Some of the restart processing can be postponed to speed up restart or to accommodate offline devices. If desired, undo of loser transactions can be performed in parallel with new transaction processing.

(17) Fuzzy image copying (archive dumping) for media recovery. Media recovery and image copying of the data are supported very efficiently. To take advantage of device geometry, the actual act of copying can even be performed outside the transaction system (i.e., without going through the buffer pool). This can happen even while the latter is accessing and modifying the information being copied. During media recovery only one forward traversal of the log is made.

(18) Continuation of loser transactions after a system restart. Since ARIES repeats history and supports the savepoint concept, we could, in the undo pass, instead of totally rolling back the loser transactions, roll back each loser only to its latest savepoint. Locks must be acquired to protect the transaction's uncommitted, not undone updates. Later, we could resume the transaction by invoking its application at a special entry point and passing enough information about the savepoint from which execution is to be resumed.

(19) Only one backward traversal of log during restart or media recovery. Both during media recovery and restart recovery one backward traversal of the log is sufficient. This is especially important if any portion of the log is likely to be stored in a slow medium like tape.

(20) Need only redo information in compensation records. Since compensation records are never undone they need to contain only redo information. So, on the average, the amount of log space consumed during a transaction rollback will be half the space consumed during the forward processing of that transaction.

(21) Support for distributed transactions. ARIES accommodates distributed transactions. Whether a given site is a coordinator or a subordinate site does not affect ARIES.

(22) Early release of locks during transaction rollback and deadlock resolution using partial rollbacks. Because ARIES never undoes CLRs and because it never undoes a particular non-CLR more than once, during a (partial) rollback, when the transaction's very first update to a particular object is undone and a CLR is written for it, the system can release the lock on that object. This makes it possible to consider resolving deadlocks using partial rollbacks.

It should be noted that ARIES does not prevent the shadow page technique from being used for selected portions of the data to avoid logging of only undo information or both undo and redo information. This may be useful for dealing with long fields, as is the case in the OS/2 Extended Edition Database Manager. In such instances, for such data, the modified pages would have to be forced to nonvolatile storage before commit. Whether or not media recovery and partial rollbacks can be supported will depend on what is logged and for which updates shadowing is done.

13. SUMMARY

In this paper, we presented the ARIES recovery method and showed why some of the recovery paradigms of System R are inappropriate in the WAL context. We dealt with a variety of features that are very important in building and operating an industrial-strength transaction processing system. Several issues regarding operation logging, fine-granularity locking, space management, and flexible recovery were discussed. In brief, ARIES accomplishes the goals that we set out with by logging all updates on a per-page basis, using an LSN on every page for tracking page state, repeating history during restart recovery before undoing the loser transactions, and chaining the CLRs to the predecessors of the log records that they compensated. Use of ARIES is not restricted to the database area alone. It can also be used for implementing persistent object-oriented languages, recoverable file systems and transaction-based operating systems. In fact, it is being used in the QuickSilver distributed operating system [40] and in a system designed to aid the backing up of workstation data on a host [44].

In this section, we summarize as to which specific features of ARIES lead to which specific attributes that give us flexibility and efficiency.
Repeating history exactly, which in turn implies using LSNs and writing CLRs during undos, permits the following, irrespective of whether CLRs are chained using the UndoNxtLSN field or not:

(1) Record level locking to be supported and records to be moved around within a page to avoid storage fragmentation without the moved records having to be locked and without the movements having to be logged.

(2) Use only one state variable, a log sequence number, per page.

(3) Reuse of storage released by one transaction for the same transaction's later actions or for other transactions' actions once the former commits, thereby leading to the preservation of clustering of records and the efficient usage of storage.

(4) The inverse of an action originally performed during forward processing of a transaction to be different from the action(s) performed during the undo of that original action (e.g., class changes in the space map pages). That is, logical undo with recovery independence is made possible.

(5) Multiple transactions may undo on the same page concurrently with transactions going forward.

(6) Recovery of each page independently of other pages or of log records relating to transaction state, especially during media recovery.

(7) If necessary, the continuation of transactions which were in progress at the time of system failure.

(8) Selective or deferred restart, and undo of losers concurrently with new transaction processing to improve data availability.

(9) Partial rollback of transactions.

(10) Operation logging and logical logging of changes within a page. For example, decrement and increment operations may be logged, rather than the before- and after-images of modified data.

Chaining, using the UndoNxtLSN field, CLRs to log records written during forward processing permits the following, provided the protocol of repeating history is also followed:

(1) The avoidance of undoing CLRs' actions, thus avoiding writing CLRs for CLRs. This also makes it unnecessary to store undo information in CLRs.

(2) The avoidance of the undo of the same log record written during forward processing more than once.

(3) As a transaction is being rolled back, the ability to release the lock on an object when all the updates to that object had been undone. This may be important while rolling back a long transaction or while resolving a deadlock by partially rolling back the victim.

(4) Handling partial rollbacks without any special actions like patching the log, as in System R.

(5) Making permanent, if necessary via nested top actions, some of the changes made by a transaction, irrespective of whether the transaction itself subsequently rolls back or commits.

Performing the analysis pass before repeating history permits the following:

(1) Checkpoints to be taken any time during the redo and undo passes of recovery.

(2) Files to be returned to the operating system dynamically, thereby allowing dynamic binding between database objects and files.

(3) Recovery of file-related information concurrently with the recovery of user data, without requiring special treatment for the former.

(4) Identifying pages possibly requiring redo, so that asynchronous parallel I/Os could be initiated for them even before the redo pass starts.

(5) Exploiting opportunities to avoid redos on some pages by eliminating those pages from the dirty_pages table on noticing, e.g., that some empty pages have been freed.

(6) Exploiting opportunities to avoid reading some pages during redo, e.g., by writing end_write records after dirty pages have been written to nonvolatile storage and by eliminating those pages from the dirty_pages table when the end_write records are encountered.

(7) Identifying the transactions in the in-doubt and in-progress states so that locks could be reacquired for them during the redo pass to support selective or deferred restart, the continuation of loser transactions after restart, and undo of loser transactions in parallel with new transaction processing.

13.1 Implementations and Extensions

ARIES forms the basis of the recovery algorithms used in the IBM Research prototype systems Starburst [87] and QuickSilver [40], in the University of Wisconsin's EXODUS and Gamma database machine [20], and in the IBM program products OS/2 Extended Edition Database Manager [7] and Workstation Data Save Facility/VM [44]. One feature of ARIES, namely repeating history, has been implemented in DB2 Version 2 Release 1 to use the concept of nested top action for supporting segmented tablespaces.

A simulation study of the performance of ARIES is reported in [98]. The following conclusions from that study are worth noting: "Simulation results indicate the success of the ARIES recovery method in providing fast recovery from failures, caused by long intercheckpoint intervals, efficient use of page LSNs, log LSNs, and RecLSNs avoids redoing updates unnecessarily, and the actual recovery load is reduced skillfully. Besides, the overhead incurred by the concurrency control and recovery algorithms on transactions is very low, as indicated by the negligibly small difference between the mean transaction response time and the average duration of a transaction if it ran alone in a never failing system. This observation also emerges as evidence that the recovery method goes well with concurrency control through fine-granularity locking, an important virtue."
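The quoted conclusion credits page LSNs and RecLSNs with avoiding unnecessary redo work. A minimal sketch of that screening during the redo pass is shown below. It assumes a dirty_pages table that maps each page to its RecLSN, a dictionary standing in for nonvolatile storage, and log records that are simple increments. The function name `redo_pass`, the record layout, and the counters are hypothetical and introduced only for illustration.

```python
def redo_pass(log, dirty_pages, disk):
    """log: list of (lsn, page_id, amount) in LSN order.
    dirty_pages: page_id -> RecLSN, as produced by an analysis pass.
    disk: page_id -> {"lsn": int, "value": int}.
    Returns counters describing how each log record was handled."""
    stats = {"skipped_no_read": 0, "read_not_redone": 0, "redone": 0}
    for lsn, page_id, amount in log:
        rec_lsn = dirty_pages.get(page_id)
        if rec_lsn is None or lsn < rec_lsn:
            # Page not dirty, or dirtied only later: no page read needed.
            stats["skipped_no_read"] += 1
            continue
        page = disk[page_id]               # stands in for a page read
        if page["lsn"] >= lsn:
            # The update already reached nonvolatile storage.
            stats["read_not_redone"] += 1
            continue
        page["value"] += amount            # repeat history
        page["lsn"] = lsn
        stats["redone"] += 1
    return stats

disk = {
    "A": {"lsn": 20, "value": 7},   # written out after the update at LSN 20
    "B": {"lsn": 0, "value": 0},    # never written since it became dirty
    "C": {"lsn": 30, "value": 5},   # clean at the time of failure
}
dirty_pages = {"A": 20, "B": 10}
log = [(10, "B", 1), (20, "A", 2), (30, "C", 5), (40, "A", 3)]
stats = redo_pass(log, dirty_pages, disk)
print(stats, disk["A"]["value"], disk["B"]["value"])
```

In this run the record for page C is skipped without reading the page, the record at LSN 20 is recognized as already applied from the page LSN, and only the two genuinely missing updates are redone.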
of the nested developed /LHS, to new efficiently provide high concurrency and recovery for B ‘-tree indexes [57, 62] and for hash-based storage structures [59]. We have also extended ARIES to restrict the amount of repeating of history that takes place for the loser transactions [691. We have designed concurrency control and recovery algorithms, on ARIES, for the N-way data sharing (i. e., shared disks) environment 66,67, 68]. Commit.LSN, that exists reevaluation in [54, a method which takes in every page to reduce the overheads, and also to improve 58, processing, 60]. Although we did not messages discuss are message advantage based [65, of the page.LSN locking, latching and predicate concurrency, has been presented an important logging part and recovery of transaction in this paper. ACKNOWLEDGMENTS We have benefited immensely from the work that was System R project and in the DB2 and IMS product groups. valuable lessons by looking at the experiences with those the source code and internal documents of those systems The Starburst project gave us the opportunity the contributions also like to thank have adopted our Brian and Irv Oki, Erhard Traiger of the designers to begin of the other in the learned systems. Access to was very helpful. from design some of the fundamental algorithms of a transaction into account experiences with the prior systems. We would edge performed We have scratch and system, taking like to acknowl- systems. We would our colleagues in the research and product groups that research results. Our thanks also go to Klaus Kuespert, Rahm, for their Andreas detailed Reuter, comments Pat Selinger, Dennis Shasha, on the paper. REFERENCES 1. BAKER, J., CRUS, R., AND HADERLE, D. Method for assuring atomicity of multi-row update operations in a database system. U.S. Patent 4,498,145, IBM, Feb. 19S5. 2. BADRINATH, B. R., AND RAMAMRITHAM, K. Semantics-based concurrency control: Beyond 3rd IEEE International Conference on Data Engineering commutativity. 
In Proceedings (Feb. 1987). Concurrency Control and Recovery in 3. BERNSTEIN, P., HADZILACOS, V., AND GOODMAN, N. Database Systems. Addison-Wesley, Reading, Mass., 1987. 4. BORR, A. Robustness to crash in a distributed database: A non-shared-memory multi10th International Conference on Very Large Data Bases processor approach. In Proceedings (Singapore, Aug. 1984). 5. CHAMBERLAIN,D., GILBERT, A., AND YOST, R. A history of System R and SQL)Data System. 7th International Conference on Very Large Data Bases (Cannes, Sept. In Proceedings 1981). ACM Trans. 6. CHANG, A., AND MERGEN, M. 801 storage: Architecture and programming. Comput. Syst., 6, 1 (Feb. 1988), 28-50. 7. CHANG, P. Y., AND MYRE, W. W. 0S/2 EE database manager: Overview and technical ZBM Syst. J. 27, 2 (198S). highlights. schemes 8. COPELAND, G., KHOSHAFIAN, S., SMITH, M., AND VALDURIEZ, P. Buffering International Conference on Data Engineering for permanent data. In Proceedings (Los Angeles, Feb. 1986). ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. 158 . C. Mohan et al. 9. CLARK, B. E., AND CORRTGAN,M. J. Application System/400 performance characteristics. IBM S@. J. 28, 3 (1989). 10. CHENG, J., LOOSELY, C., SHIBAMIYA, A., AND WORTHINGTON, P. IBM Database 2 perforIBM Sy.st. J. 23, 2 (1984). mance: Design, implementation, and tuning. 11. CRUS, R , HADERLE, D., AND HERRON, H. Method for managing lock escalation in a multiprocessing, multiprogramming environment. U.S. Patent 4,716,528, IBM, Dec. 1987. IBM Tech. Disclosure 12. CRUS, R., MALKEMUS, T., AND PUTZOLU, G. R. Index mini-pages Bull. 26, 4 (April 1983), 5460-5463. 13. CRUS, R., PUTZOLU, F., AND MORTENSON, J. A Incremental data base log image copy IBM !l’ec~. Disclosure Bull. 25, 7B (Dec. 1982), 3730-3732. Bull. 25, 7B 14. CRUS, R., AND PUTZOLU, F. Data base allocation table. IBM Tech. Disclosure (Dec. 1982), 3722-2724. 15. CRUS, R. Data recovery in IBM Database2. IBM Syst. J. 23,2(1984). 
16. CURTIS, R. Informix-Turbo. In Proceedings IEEE Compcon Spring '88 (Feb.-March 1988).
17. DASGUPTA, P., LEBLANC, R., JR., AND APPELBE, W. The Clouds distributed operating system. In Proceedings 8th International Conference on Distributed Computing Systems (San Jose, Calif., June 1988).
18. DATE, C. A Guide to INGRES. Addison-Wesley, Reading, Mass., 1987.
19. DEY, R., SHAN, M., AND TRAIGER, I. Method for dropping data sets. IBM Tech. Disclosure Bull. 25, 11A (April 1983), 5453-5455.
20. DEWITT, D., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H.-I., AND RASMUSSEN, R. The Gamma database machine project. IEEE Trans. Knowledge Data Eng. 2, 1 (March 1990).
21. DELORME, D., HOLM, M., LEE, W., PASSE, P., RICARD, G., TIMMS, G., JR., AND YOUNGREN, L. Database index journaling for enhanced recovery. U.S. Patent 4,819,156, IBM, April 1989.
22. DIXON, G. N., PARRINGTON, G. D., SHRIVASTAVA, S., AND WHEATER, S. M. The treatment of persistent objects in Arjuna. Comput. J. 32, 4 (1989).
23. DUCHAMP, D. Transaction management. Ph.D. dissertation, Tech. Rep. CMU-CS-88-192, Carnegie-Mellon Univ., Dec. 1988.
24. EFFELSBERG, W., AND HAERDER, T. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec. 1984).
25. ELHARDT, K., AND BAYER, R. A database cache for high performance and fast restart in database systems. ACM Trans. Database Syst. 9, 4 (Dec. 1984).
26. FEKETE, A., LYNCH, N., MERRITT, M., AND WEIHL, W. Commutativity-based locking for nested transactions. Tech. Rep. MIT/LCS/TM-370.b, MIT, July 1989.
27. FOSSUM, B. Data base integrity as provided for by a particular data base management system. In Data Base Management, J. W. Klimbie and K. L. Koffeman, Eds., North-Holland, Amsterdam, 1974.
28. GAWLICK, D., AND KINKADE, D. Varieties of concurrency control in IMS/VS Fast Path. IEEE Database Eng. 8, 2 (June 1985).
29. GARZA, J., AND KIM, W. Transaction management in an object-oriented database system. In Proceedings ACM-SIGMOD International Conference on Management of Data (Chicago, June 1988).
30. GHEITH, A., AND SCHWAN, K. CHAOSart: Support for real-time atomic transactions. In Proceedings 19th International Symposium on Fault-Tolerant Computing (Chicago, June 1989).
31. GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June 1981).
32. GRAY, J. Notes on data base operating systems. In Operating Systems–An Advanced Course, R. Bayer, R. Graham, and G. Seegmuller, Eds., LNCS Vol. 60, Springer-Verlag, New York, 1978.
33. HADZILACOS, V. A theory of reliability in database systems. J. ACM 35, 1 (Jan. 1988), 121-145.
34. HAERDER, T. Handling hot spot data in DB-sharing systems. Inf. Syst. 13, 2 (1988), 155-166.
35. HADERLE, D., AND JACKSON, R. IBM Database 2 overview. IBM Syst. J. 23, 2 (1984).
36. HAERDER, T., AND REUTER, A. Principles of transaction oriented database recovery–A taxonomy. ACM Comput. Surv. 15, 4 (Dec. 1983).
37. HELLAND, P. The TMF application programming interface: Program to program communication, transactions, and concurrency in the Tandem NonStop system. Tandem Tech. Rep. TR89.3, Tandem Computers, Feb. 1989.
38. HERLIHY, M., AND WEIHL, W. Hybrid concurrency control for abstract data types. In Proceedings 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Austin, Tex., March 1988).
39. HERLIHY, M., AND WING, J. M. Avalon: Language support for reliable distributed systems. In Proceedings 17th International Symposium on Fault-Tolerant Computing (Pittsburgh, Pa., July 1987).
40. HASKIN, R., MALACHI, Y., SAWDON, W., AND CHAN, G. Recovery management in QuickSilver. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 82-108.
41. IMS/VS Version 1 Release 3 Recovery/Restart. Doc. GG24-1652, IBM, April 1984.
42. IMS/VS Version 2 Application Programming. Doc. SC26-4178, IBM, March 1986.
43. IMS/VS Extended Recovery Facility (XRF): Technical Reference. Doc. GG24-3153, IBM, April 1987.
44. IBM Workstation Data Save Facility/VM: General Information. Doc. GH24-5232, IBM, 1990.
45. KORTH, H. Locking primitives in a database system. JACM 30, 1 (Jan. 1983), 55-79.
46. LUM, V., DADAM, P., ERBE, R., GUENAUER, J., PISTOR, P., WALCH, G., WERNER, H., AND WOODFILL, J. Design of an integrated DBMS to support advanced applications. In Proceedings International Conference on Foundations of Data Organization (Kyoto, May 1985).
47. LEVINE, F., AND MOHAN, C. Method for concurrent record access, insertion, deletion and alteration using an index tree. U.S. Patent 4,914,569, IBM, April 1990.
48. LEWIS, R. Z. IMS Program Isolation Locking. Doc. GG66-3193, IBM Dallas Systems Center, Dec. 1990.
49. LINDSAY, B., HAAS, L., MOHAN, C., WILMS, P., AND YOST, R. Computation and communication in R*: A distributed database manager. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984). Also in Proceedings 9th ACM Symposium on Operating Systems Principles (Bretton Woods, Oct. 1983). Also available as IBM Res. Rep. RJ3740, San Jose, Calif., Jan. 1983.
50. LINDSAY, B., MOHAN, C., AND PIRAHESH, H. Method for reserving space needed for "rollback" actions. IBM Tech. Disclosure Bull. 29, 6 (Nov. 1986).
51. LISKOV, B., AND SCHEIFLER, R. Guardians and actions: Linguistic support for robust, distributed programs. ACM Trans. Program. Lang. Syst. 5, 3 (July 1983).
52. LINDSAY, B., SELINGER, P., GALTIERI, C., GRAY, J., LORIE, R., PUTZOLU, F., TRAIGER, I., AND WADE, B. Notes on distributed databases. IBM Res. Rep. RJ2571, San Jose, Calif., July 1979.
53. MCGEE, W. C. The information management system IMS/VS–Part II: Data base facilities; Part V: Transaction processing facilities. IBM Syst. J. 16, 2 (1977).
54. MOHAN, C., HADERLE, D., WANG, Y., AND CHENG, J.
Single table access using multiple indexes: Optimization, execution, and concurrency control techniques. In Proceedings International Conference on Extending Data Base Technology (Venice, March 1990). An expanded version of this paper is available as IBM Res. Rep. RJ7341, IBM Almaden Research Center, March 1990.
55. MOHAN, C., FUSSELL, D., AND SILBERSCHATZ, A. Compatibility and commutativity of lock modes. Inf. Control 61, 1 (April 1984). Also available as IBM Res. Rep. RJ3948, San Jose, Calif., July 1983.
56. MOSS, E., GRIFFETH, N., AND GRAHAM, M. Abstraction in recovery management. In Proceedings ACM SIGMOD International Conference on Management of Data (Washington, D.C., May 1986).
57. MOHAN, C. ARIES/KVL: A key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In Proceedings 16th International Conference on Very Large Data Bases (Brisbane, Aug. 1990). Another version of this paper is available as IBM Res. Rep. RJ7008, IBM Almaden Research Center, Sept. 1989.
58. MOHAN, C. Commit_LSN: A novel and simple method for reducing locking and latching in transaction processing systems. In Proceedings 16th International Conference on Very Large Data Bases (Brisbane, Aug. 1990). Also available as IBM Res. Rep. RJ7344, IBM Almaden Research Center, Feb. 1990.
59. MOHAN, C. ARIES/LHS: A concurrency control and recovery method using write-ahead logging for linear hashing with separators. IBM Res. Rep., IBM Almaden Research Center, Nov. 1990.
60. MOHAN, C. A cost-effective method for providing improved data availability during DBMS restart recovery after a failure. In Proceedings of the 4th International Workshop on High Performance Transaction Systems (Asilomar, Calif., Sept. 1991). Also available as IBM Res. Rep. RJ8114, IBM Almaden Research Center, April 1991.
61. MOSS, E., LEBAN, B., AND CHRYSANTHIS, P.
Fine grained concurrency for the database cache. In Proceedings 3rd IEEE International Conference on Data Engineering (Los Angeles, Feb. 1987).
62. MOHAN, C., AND LEVINE, F. ARIES/IM: An efficient and high concurrency index management method using write-ahead logging. IBM Res. Rep. RJ6846, IBM Almaden Research Center, Aug. 1989.
63. MOHAN, C., AND LINDSAY, B. Efficient commit protocols for the tree of processes model of distributed transactions. In Proceedings 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing (Montreal, Aug. 1983). Also available as IBM Res. Rep. RJ3881, IBM San Jose Research Laboratory, June 1983.
64. MOHAN, C., LINDSAY, B., AND OBERMARCK, R. Transaction management in the R* distributed database management system. ACM Trans. Database Syst. 11, 4 (Dec. 1986).
65. MOHAN, C., AND NARANG, I. Recovery and coherency-control protocols for fast intersystem page transfer and fine-granularity locking in a shared disks transaction environment. In Proceedings 17th International Conference on Very Large Data Bases (Barcelona, Sept. 1991). A longer version is available as IBM Res. Rep. RJ8017, IBM Almaden Research Center, March 1991.
66. MOHAN, C., AND NARANG, I. Efficient locking and caching of data in the multisystem shared disks transaction environment. In Proceedings of the International Conference on Extending Database Technology (Vienna, Mar. 1992). Also available as IBM Res. Rep. RJ8301, IBM Almaden Research Center, Aug. 1991.
67. MOHAN, C., NARANG, I., AND PALMER, J. A case study of problems in migrating to distributed computing: Page recovery using multiple logs in the shared disks environment. IBM Res. Rep. RJ7343, IBM Almaden Research Center, March 1990.
68. MOHAN, C., NARANG, I., AND SILEN, S. Solutions to hot spot problems in a shared disks transaction environment. In Proceedings of the 4th International Workshop on High Performance Transaction Systems (Asilomar, Calif., Sept. 1991). Also available as IBM Res. Rep.
RJ8281, IBM Almaden Research Center, Aug. 1991.
69. MOHAN, C., AND PIRAHESH, H. ARIES-RRH: Restricted repeating of history in the ARIES transaction recovery method. In Proceedings 7th International Conference on Data Engineering (Kobe, April 1991). Also available as IBM Res. Rep. RJ7342, IBM Almaden Research Center, Feb. 1990.
70. MOHAN, C., AND ROTHERMEL, K. Recovery protocol for nested transactions using write-ahead logging. IBM Tech. Disclosure Bull. 31, 4 (Sept. 1988).
71. MOSS, E. Checkpoint and restart in distributed transaction systems. In Proceedings 3rd Symposium on Reliability in Distributed Software and Database Systems (Clearwater Beach, Oct. 1983).
72. MOSS, E. Log-based recovery for nested transactions. In Proceedings 13th International Conference on Very Large Data Bases (Brighton, Sept. 1987).
73. MOHAN, C., TREIBER, K., AND OBERMARCK, R. Algorithms for the management of remote backup databases for disaster recovery. IBM Res. Rep. RJ7885, IBM Almaden Research Center, Nov. 1990.
74. NETT, E., KAISER, J., AND KROGER, R. Providing recoverability in a transaction oriented distributed operating system. In Proceedings 6th International Conference on Distributed Computing Systems (Cambridge, May 1986).
75. NOE, J., KAISER, J., KROGER, R., AND NETT, E. The commit/abort problem in type-specific locking. GMD Tech. Rep. 267, GMD mbH, Sankt Augustin, Sept. 1987.
76. OBERMARCK, R. IMS/VS program isolation feature. IBM Res. Rep. RJ2879, San Jose, Calif., July 1980.
77. O'NEILL, P. Escrow transaction method. ACM Trans. Database Syst. 11, 4 (Dec. 1986).
78. ONG, K. The SYNAPSE approach to database recovery. In Proceedings 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (Waterloo, April 1984).
79. PEINL, P., REUTER, A., AND SAMMER, H. High contention in a stock trading database: A case study. In Proceedings ACM SIGMOD International Conference on Management of Data (Chicago, June 1988).
80. PETERSON, R. J., AND STRICKLAND, J. P. Log write-ahead protocols and IMS/VS logging. In Proceedings 2nd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (Atlanta, Ga., March 1983).
81. RENGARAJAN, T. K., SPIRO, P., AND WRIGHT, W. High availability mechanisms of VAX DBMS software. Digital Tech. J. 8 (Feb. 1989).
82. REUTER, A. A fast transaction-oriented logging scheme for UNDO recovery. IEEE Trans. Softw. Eng. SE-6, 4 (July 1980).
83. REUTER, A. Concurrency on high-traffic data elements. In Proceedings ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (Los Angeles, March 1982).
84. REUTER, A. Performance analysis of recovery techniques. ACM Trans. Database Syst. 9, 4 (Dec. 1984), 526-559.
85. ROTHERMEL, K., AND MOHAN, C. ARIES/NT: A recovery method based on write-ahead logging for nested transactions. In Proceedings 15th International Conference on Very Large Data Bases (Amsterdam, Aug. 1989). A longer version of this paper is available as IBM Res. Rep. RJ6650, IBM Almaden Research Center, Jan. 1989.
86. ROWE, L., AND STONEBRAKER, M. The commercial INGRES epilogue. Ch. 3 in The INGRES Papers, Stonebraker, M., Ed., Addison-Wesley, Reading, Mass., 1986.
87. SCHWARZ, P., CHANG, W., FREYTAG, J., LOHMAN, G., MCPHERSON, J., MOHAN, C., AND PIRAHESH, H. Extensibility in the Starburst database system. In Proceedings Workshop on Object-Oriented Data Base Systems (Asilomar, Sept. 1986). Also available as IBM Res. Rep. RJ5311, San Jose, Calif., Sept. 1986.
88. SCHWARZ, P. Transactions on typed objects. Ph.D. dissertation, Tech. Rep. CMU-CS-84-166, Carnegie Mellon Univ., Dec. 1984.
89. SHASHA, D., AND GOODMAN, N. Concurrent search structure algorithms. ACM Trans. Database Syst. 13, 1 (March 1988).
90. SPECTOR, A., PAUSCH, R., AND BRUELL, G. Camelot: A flexible, distributed transaction processing system. In Proceedings IEEE Compcon Spring '88 (San Francisco, Calif., March 1988).
91. SPRATT, L.
The transaction resolution journal: Extending the before journal. ACM Oper. Syst. Rev. 19, 3 (July 1985).
92. STONEBRAKER, M. The design of the POSTGRES storage system. In Proceedings 13th International Conference on Very Large Data Bases (Brighton, Sept. 1987).
93. STILLWELL, J. W., AND RADER, P. M. IMS/VS Version 1 Release 3 Fast Path Notebook. Doc. G320-0149-0, IBM, Sept. 1984.
94. STRICKLAND, J., UHROWCZIK, P., AND WATTS, V. IMS/VS: An evolving system. IBM Syst. J. 21, 4 (1982).
95. THE TANDEM DATABASE GROUP. NonStop SQL: A distributed, high-performance, high-availability implementation of SQL. In Lecture Notes in Computer Science Vol. 359, D. Gawlick, M. Haynie, and A. Reuter, Eds., Springer-Verlag, New York, 1989.
96. TENG, J., AND GUMAER, R. Managing IBM Database 2 buffers to maximize performance. IBM Syst. J. 23, 2 (1984).
97. TRAIGER, I. Virtual memory management for database systems. ACM Oper. Syst. Rev. 16, 4 (Oct. 1982), 26-48.
98. VURAL, S. A simulation study for the performance analysis of the ARIES transaction recovery method. M.Sc. thesis, Middle East Technical Univ., Ankara, Feb. 1990.
99. WATSON, C. T., AND ABERLE, G. F. System/38 machine database support. In IBM System/38 Tech. Dev., Doc. G580-0237, IBM, July 1980.
100. WEIKUM, G. Principles and realization strategies of multi-level transaction management. ACM Trans. Database Syst. 16, 1 (Mar. 1991).
101. WEINSTEIN, M., PAGE, T., JR., LIVEZEY, B., AND POPEK, G. Transactions and synchronization in a distributed operating system. In Proceedings 10th ACM Symposium on Operating Systems Principles (Orcas Island, Dec. 1985).

Received January 1989; revised November 1990; accepted April 1991