Application Recovery: Advances Toward an Elusive Goal

David Lomet
Microsoft Research
Redmond, WA 98052
lomet@microsoft.com

Introduction

Persistent savepoints are the model that has usually dominated previous thinking on application recovery. "A persistent savepoint is a savepoint where the state of the transaction's resources is made stable ..., and enough control state is saved to stable storage so that on recovery the application can pick up from this point in its execution. ... If the system should fail and subsequently recover, it can restart any transaction at ... its last persistent savepoint operation. It doesn't have to run an exception handler because the transaction has its state and can simply pick up where it left off." {BeNe97}

The value of persistent savepoints is more than reducing lost work. Persistent savepoints simplify application programming compared to more explicit methods for coping with failures. "... the code is not only shorter than the [prior] solution... but simpler. ... everything related to the maintenance of persistent context is now taken care of by the Save_Work function, whereas the [prior] solution had to do the maintenance all by itself..." {GrRe93}

Traditionally, a savepoint has been viewed as the capture of an application's state in stable storage at the time a savepoint operation is executed. However, this is not necessary, any more than it is necessary for database recovery to materialize the final states of all pages of all committed transactions in stable storage. Instead, we recover via replay from the log. "Each resource manager participating in the transaction with the persistent savepoint is brought up in the state as of the most recent persistent savepoint. For that to work, the run-time system of the programming language has to be a resource manager, too; consequently, it also recovers its state to the most recent persistent savepoint.
Its state includes the local variables, the call stack, the program counter, the heap, the I/O streams, and so forth." {GrRe93}

Phoenix Goals

At Microsoft Research, we have started a project called Phoenix: "We suggest transactions using persistent savepoints be called Phoenix transactions, because they are reborn from the ashes." {GrRe93} One technology we pursue in Phoenix is application recovery, specifically recovery from system failures of applications that interact with a database. We call these Phoenix applications. Our intent is to exploit database redo recovery techniques and infrastructure to reestablish application state, e.g. for simple applications like stored procedures. Thus, the database system becomes the resource manager for the application. This has a number of benefits:

1. Application recovery increases application availability. Current recovery is limited to database state. Work can then resume on the database, but lack of "application consistency" can greatly delay application resumption. Application recovery decreases the lengths of these outages.

2. The application programmer does not need to deal directly with system failures. The application can be written as a simple sequential program that is unaware of the crash "... unless it keeps [a] timer... and finds out that [execution] took surprisingly long to complete." {GrRe93}

3. A user of the application, whether an end user or other software, may likewise be unaware that a crash has occurred, except for some delay. An end user need not re-enter input data. Client software can simply continue to execute.

4. Operations people, who are responsible for ensuring that applications execute correctly, need only initiate recovery, a process with which they are already familiar from database recovery, and the recovered system state will include the recoverable application state.

The Technology Challenges

There has been prior research directed toward achieving the Phoenix aims.
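The programming model that persistent savepoints enable, as in the Save_Work quotation above, can be sketched as follows. This is a toy illustration, not an actual API: `save_work` is a hypothetical stand-in for a persistent savepoint call, simulated here by snapshotting the loop index that a real system would force to stable storage along with the rest of the control state.

```python
# A sketch of the programming model persistent savepoints enable.
# save_work() is hypothetical; here it merely records the last "stable"
# loop index, standing in for forcing control state to stable storage.

class FakeDb:
    def __init__(self):
        self.applied = []

    def apply(self, rec):
        # Update (simulated) recoverable database state.
        self.applied.append(rec)

savepoints = []

def save_work(state):
    # In a real system this would make the application's control state
    # stable; after a crash, replay resumes from the last such point.
    savepoints.append(state)

def transfer_batch(records, db, start=0):
    # Written as a simple sequential program: no exception handler,
    # no explicit persistent-context maintenance by the application.
    for i in range(start, len(records)):
        db.apply(records[i])
        save_work(i + 1)

db = FakeDb()
transfer_batch(["a", "b", "c"], db)
print(savepoints[-1])  # 3: after a crash, recovery resumes past the batch
```

The point of the sketch is that failure handling lives entirely inside `save_work` and the recovery machinery; the application itself is ordinary sequential code.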
A problem has been that the technology has been expensive in its impact on normal execution. We have addressed how to enable application recovery without serious impact on normal application performance. That is, we reduce the amount of data and state that has to be captured and written to stable storage, either to the "database" or to the log. And we exploit database recovery infrastructure, which is well-tested, for this purpose.

We treat application program state in the same manner as a database page. Thus, an application's state is one of the objects in the database cache. Flushing it is controlled by the cache manager. As with database pages, we post to the log operations that change application state. Like any good redo scheme, we accumulate changes to an application's state in the cache, and only flush it to stable storage to control the length of the redo log and hence the cost of redo recovery. These flushes will be rare, and sometimes entirely unnecessary, as when the application terminates prior to the need to truncate the log. For example, we anticipate that most stored procedures can execute and complete within a single checkpoint interval, and hence many application states needn't be flushed to stable storage.

And how, exactly, does one log operations for applications without their help? Application state can be changed in only two ways: via interactions with the world outside of itself, and by its own internal execution. We capture both as loggable operations. Since our applications are unaware of our efforts to make them recoverable, we seize upon the times when the application interacts with the rest of the recoverable resource manager (e.g. a database system) to capture state transitions as log operations. This has the significant benefit of not requiring the separate logging of any internal state changes caused solely by application execution. There are a number of ways to log the effects of "normal" execution.
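A minimal sketch of this idea, with invented names (not the actual Phoenix code), shows application state living in the cache: interactions are posted to the redo log, while the state object itself is flushed only if the cache manager must bound redo recovery cost, and never if the application completes first.

```python
# Sketch: application state as a cached, logged object. Only interactions
# generate log records; state changes accumulate in the cache.

log = []       # shared redo log: (app_id, interaction_no, kind, detail)
flushed = []   # application states actually written to stable storage

class CachedAppState:
    def __init__(self, app_id):
        self.app_id = app_id
        self.interaction = 0   # which interaction this state is "as of"
        self.dirty = False

    def log_interaction(self, kind, detail):
        # Only interactions are logged; execution between interactions is
        # deterministic and needs no log records of its own.
        log.append((self.app_id, self.interaction, kind, detail))
        self.interaction += 1
        self.dirty = True

    def flush(self):
        # Invoked by the cache manager only to truncate the redo log.
        flushed.append((self.app_id, self.interaction))
        self.dirty = False

    def terminate(self):
        # A completed application's state need never reach stable storage.
        self.dirty = False

app = CachedAppState("stored_proc_1")
app.log_interaction("read", "X")
app.log_interaction("write", "O")
app.terminate()
print(len(log), len(flushed))  # 2 0: two logged interactions, no flush
```

As the final line illustrates, a stored procedure that completes within one checkpoint interval generates log records but never forces its state to stable storage.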
For database pages, we can log entire page state or partial page state, or we can log what {GrRe93} calls "physiological" operations, as in ARIES {MHLPS92}. Given the size of log records, we point ourselves toward the latter. But, in fact, in order to do really well at keeping log records small, we cannot restrict ourselves to physiological operations.

Cache management is also impacted by the inclusion of applications as objects being managed. Application state is much larger than a page, and that by itself poses a problem because flushing needs to be atomic. In addition, application operations execute for unknown periods, making it difficult to seize a suitable operation-consistent state to flush. Most dramatically, the very application operations that make logging efficient introduce flush order dependencies between objects in the cache.

Finally, in addition to application program state, a database will commonly hold state that is related to the application. This state is not a single entity, but can be both scattered and complicated. We need to capture and log changes for each of the several pieces involved. For example, temporary tables can be logged so as to make them stable. Other state is more intimately held in volatile database system data structures. Much of this state needs to be recoverable, e.g. database session state.

Phoenix Technology

This short overview cannot describe all the technology we have in mind. Phoenix is a research project, so some of the technology is in the process of being created. We sketch here our work that shows (i) how an application can be treated as a recoverable database object whose interactions with the system and database objects can be logged; (ii) how this can control the frequency of application checkpointing; and (iii) how logging costs can be greatly reduced by logging only the identities of objects read, and not their data values. This is more completely described in {Lo97, Lo97a}.
There are two kinds of application operation: one for an application's execution between interactions, and one for its interactions. Our intent is to replay the application from an earlier state to re-create the pre-crash state. Thus, we must:

Re-execute the application between interactions. When replaying from an application in state A1, we reproduce the same state A2 at the next application interaction as was produced originally. This requires deterministic execution. Non-deterministic behavior is captured by the sequence of log records for the application.

Reproduce the effect of each application interaction. As an example, if an application read object X, we could include in the log record its value x1 at the time it was originally read. During recovery, when the read is encountered, we use the logged value x1 to update the application's input buffers.

We intercept every application interaction, as only interactions change an application's execution trajectory. Our resource manager wraps itself around the application, trapping its external calls. At each call point, it logs the nature of the call and its effect on the application.

Figure 1. (A) Traditional view of application execution: the application issues Init A, Write O, Read O, and Term A interactions against the resource manager. (B) New view of application execution as logged operations: Init(A), Ex(A), WP(O), R(A,O), and Cmp(A) transform application states S0 through S5.

The resource manager logs the application execution between these calls so that the application itself can be re-invoked to replay the transition to the next interaction. The resource manager does this by returning to the application after its system call; the operation finishes via its next system call.
This stands the execution call tree on its head and permits the resource manager to orchestrate the replay of the application. This is illustrated in Figure 1. Executions between interactions, i.e. the Ex(A) operations, are thus logged as "physiological" operations. An Ex(A) reads application state Ai as of interaction i and transforms it, by re-execution of the application, into Ai+1 as of interaction i+1, which is "written" by Ex(A). Ex(A) can be logged very compactly, since we need only name the application that produces the change in state and whatever parameters were returned to it at interaction i. That is, we replay Ex(A) by this return of control to the application.

Logging interactions can be much more costly. The read above stored in its log record the entire value of the object. If many pages of a multi-megabyte file are read, we have a log space problem, a write bandwidth problem, and an instruction path problem. It is far better to replay the actual read operation during recovery, meaning we log the name of the object instead of its value. That requires, when the read is replayed, that the named object have its original value. The reduction in logged data is potentially enormous, so it is worth going to some effort to make this possible.

Here is the problem. After application A reads object X with value x1, X may be updated to x2. Should A's read of X be replayed later, x2 would be returned as the result. But recovery needs to re-create x1 so that it is available when A's read is replayed. This seeming contradiction can be overcome with sophisticated cache management. The important observation is that a resource manager has two versions of actively updated objects: a cached volatile current version and a stable earlier version. The stable version enables us to re-create the x1 read by A. To guarantee this, our cache manager ensures that x2 is not flushed until we no longer need to replay A's read of X.
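This flush-order rule can be sketched as follows, using simplified names of our own invention: after A's logged read names X, the new version of X acquires a flush dependency on A, and the cache manager refuses to flush X until a later state of A is stable or A has terminated.

```python
# Sketch of the flush-order dependency that makes identity-only logging of
# reads replayable. Names and structure are illustrative, not Phoenix's.

class CacheManager:
    def __init__(self):
        # object -> set of application states that must become stable
        # (or terminate) before this object may be flushed
        self.deps = {}

    def record_read(self, app, obj):
        # A's log record names obj only; replaying the read needs obj's
        # old stable version, so the new version must not overwrite it.
        self.deps.setdefault(obj, set()).add(app)

    def can_flush(self, obj):
        return not self.deps.get(obj)

    def app_flushed_or_done(self, app):
        # A later state of app is stable, or app terminated: its reads
        # need never be replayed, so the dependencies dissolve.
        for waiters in self.deps.values():
            waiters.discard(app)

cm = CacheManager()
cm.record_read("A", "X")
print(cm.can_flush("X"))     # False: flushing x2 would destroy x1
cm.app_flushed_or_done("A")
print(cm.can_flush("X"))     # True: A's read of X need not be replayed
```

This is the simple, reads-only case; as the text below notes, the general structure is the Write Graph of {LoTu95}.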
Replaying the read is no longer required when we have flushed a later state of A, or when A terminates. Recovery of A need continue only from the flushed later state, or need not be done at all (if A has terminated). The cache manager enforces flush order dependencies on cached objects so that the more powerful operations (e.g. a "read" that reads X and application state A and writes A, i.e., A's input buffers) can be replayed. This requires keeping a flush dependency graph, called a Write Graph in {LoTu95}. The cache manager data structures needed when reads are the only such "logical" operation are quite simple {Lo97}. When writes are also treated as logical operations, to avoid logging the data values written, we must cope with circular dependencies. Circular dependencies are not just a bookkeeping problem. Naively, all objects involved in a cycle must be flushed atomically together. Avoiding this, as before, requires that the right versions be available when needed. By clever sleight of hand, these versions can be on the log instead of being re-created from earlier stable versions in the database {Lo97a}. We can then "unwind" each circular dependency and flush one object at a time.

Discussion

We log reads and writes of objects local to the resource manager that manages (provides recovery for) an application as logical operations. This dramatically reduces logging cost for these operations: we needn't log the data values. Cache management is more complex, but the reduced logging cost outweighs this. This is how we "advance toward an elusive goal", i.e. recover applications with low normal execution cost. But there is surely more to do, both in understanding the issues and in augmenting the recovery infrastructure. Some of the issues are described below.

Not fully captured at the current level of abstraction is how to characterize all interactions in terms of reads and writes, and how to capture the perhaps diffuse state associated with the application, parts of which are traditionally held in resource manager volatile data structures. One needs to be very careful about the boundary for application state so that the log accurately captures all non-determinism. And one must decide exactly what to include in an application checkpoint.

Further, some applications are not easily "wrapped" by a resource manager. Client applications are at arm's length from server resource managers. Our read and write optimizations are not possible then. Indeed, some interactions are inherently non-replayable even when local, e.g. reading the real-time clock. Is there a substitute for logging the data values that have been read or written?

Distributed applications deal with multiple resource managers. This means partial failures are possible, which are more difficult to handle than monolithic failures. {StYe85} describes application recovery in a distributed system. They do substantial logging and subtle log management involving logs at each site. It is highly desirable to make this activity cheaper and simpler.

Bibliography

{BeNe97} P. Bernstein and E. Newcomer, Principles of Transaction Processing for the Systems Professional. Morgan Kaufmann (1997) San Mateo, CA.
{GrRe93} J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. Morgan Kaufmann (1993) San Mateo, CA.
{Lo97} D. Lomet, Persistent Applications Using Generalized Redo Recovery. Technical Report, March 1997.
{Lo97a} D. Lomet, Application Recovery with Logical Write Operations. Technical Report, April 1997.
{LoTu95} D. Lomet and M. Tuttle, Redo Recovery from System Crashes. VLDB Conf. (Zurich, Switzerland), Sept. 1995.
{LoTu97} D. Lomet and M. Tuttle, A Formal Treatment of Redo Recovery with Pragmatic Implications. DEC Cambridge Research Lab Technical Report (in preparation).
{MHLPS92} C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Trans. on Database Systems 17,1 (Mar. 1992) 94-162.
{StYe85} R. Strom and S. Yemini, Optimistic Recovery in Distributed Systems. ACM Trans. on Computer Systems 3,3 (Aug. 1985) 204-226.