Logging and Recovery in a Highly Concurrent Database

by John Sidney Keen

B.A.Sc., Electrical Engineering, University of Waterloo, 1986
M.A.Sc., Electrical Engineering, University of Waterloo, 1987

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Science at the Massachusetts Institute of Technology

May 1994

©1994 Massachusetts Institute of Technology. All rights reserved.

Certified by: William J. Dally, Associate Professor of Electrical Engineering and Computer Science, Thesis Advisor
Accepted by: Frederic R. Morgenthaler, Chairman, Departmental Committee on Graduate Students

Logging and Recovery in a Highly Concurrent Database

by John Sidney Keen

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Science at the Massachusetts Institute of Technology, May 1994

Abstract

This thesis addresses the problem of fault tolerance to system failures for database systems that are to run on highly concurrent computers. It assumes that, in general, an application may have a wide distribution in the lifetimes of its transactions. Logging remains the method of choice for ensuring fault tolerance, but this thesis proposes new ways of managing a database's log information that are better suited for the conditions described above.

The disk space reserved for log information is managed according to the extended ephemeral logging (XEL) method. XEL segments a log into a chain of fixed-size FIFO queues and performs generational garbage collection on records in the log. Log records that are no longer necessary for recovery purposes are "thrown away" when they reach the head of a queue; only records that are still needed for recovery are forwarded from the head of one queue to the tail of the next. XEL does not require checkpoints, permits fast recovery after a crash and is well suited for applications that have a wide distribution of transaction lifetimes. The cost of XEL is more main memory space and possibly a slight increase in the disk bandwidth required for log information. XEL can significantly reduce the disk bandwidth required for log information in a system that has been augmented with a non-volatile region of main memory.

When bandwidth requirements for log information demand an arbitrarily large collection of disks, they can be grouped into separate log streams. Each log stream consists of a small fixed number of disks and operates largely independently of the other streams. XEL manages the storage space of each log stream. Load balancing amongst the log streams is an important issue. This thesis evaluates and compares three different distribution policies for assigning log records to log streams.

Simulation results demonstrate the effectiveness of the implementation techniques proposed in this thesis for a highly concurrent database system in which transactions may have a wide distribution in lifetimes.

Thesis Advisor: William J. Dally
Associate Professor of Electrical Engineering and Computer Science

Acknowledgements

This thesis reflects a life-long learning process. Many people have guided and encouraged me along the way, and they have my deepest gratitude. From my earliest days of childhood, my parents, Jim and Carolyn Keen, fostered my creativity and encouraged me to learn and think.
When my love of technical engineering-style topics became apparent, they supported my further development in this direction. Thanks, Mom and Dad.

I am deeply indebted to my advisor here at MIT, Bill Dally. It has been a great privilege to work with Bill. His intellect, energy and enthusiasm are simply amazing. Inspirational! Bill graciously gave me considerable freedom to choose a research problem for my thesis, and then he helped me to follow it through to a finished thesis. Thanks for your encouragement, prodding and suggestions, Bill.

I am also very grateful to the readers on my thesis committee: David Gifford, Nancy Lynch and Bill Weihl. They all offered many helpful comments and suggestions on preliminary drafts of my thesis.

My colleagues in the CVA (Concurrent VLSI Architecture) group were a pleasure to work with and they helped me to refine my ideas. Lots of good criticisms and valuable suggestions during presentations at group meetings assisted me in my work. The friendship of all CVAers created a wonderful environment. I thank my current officemates, Kathy Knobe and Diana Wong, for their cheerful company during this last hectic year.

During my time at MIT, I always lived at Ashdown House (most of the time on the fifth floor). I was fortunate to live in such a vibrant community of terrific people. There's not enough space to list everyone's name, but you know who you are and I thank you all for your kind friendship during our years together at MIT. I must make special mention of Vernon and Beth Ingram, who have been the housemasters at Ashdown throughout my stay. They dedicated themselves to making Ashdown a truly special place to live and I shall be forever grateful to them.

I thank God that I have been blessed with the opportunity and ability to study at MIT and complete this thesis. May the results of my work be to the glory of God and the good of society.

The research described in this thesis was supported in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) 1967 Scholarship and in part by the Advanced Research Projects Agency under contracts N00014-88K-0738, N00014-91J-1698 and F19628-92-C-0045 and in part by a National Science Foundation Presidential Young Investigator Award, grant MIP-8657531, with matching funds from General Electric Corporation, IBM Corporation and AT&T Corporation. I thank all these sponsors for generously supporting my research.

Contents

1 Introduction
  1.1 Motivation
  1.2 Major Contributions
  1.3 Statement of Problem
  1.4 Review of Previous Research
    1.4.1 Disk Storage Management
    1.4.2 Parallel Logging and Recovery
    1.4.3 Management of Precommitted Transactions
  1.5 Commercial Systems
  1.6 Assumptions
  1.7 Summary of Remaining Chapters

2 Extended Ephemeral Logging (XEL)
  2.1 Preamble: Limitation of Ephemeral Logging
  2.2 Conceptual Design of XEL
  2.3 Types and Statuses of Log Records
  2.4 Management of Log Records
  2.5 Buffer Management for Disk Version of the Database
  2.6 Flow Control
  2.7 Buffering, Forwarding and Recirculation
  2.8 Management of the LOT and LTT
  2.9 Crash Recovery
3 Correctness Proof for XEL
  3.1 Simplifications
  3.2 Well-formedness Properties of Environment
  3.3 Specification of Correctness (Safety)
    3.3.1 I/O Automaton Model
    3.3.2 Invariants for Composition of SLM and ENV
    3.3.3 Correctness of SLM Module
  3.4 Implementation of Log Manager
    3.4.1 I/O Automata Model
    3.4.2 Invariants for Composition of LM and ENV
    3.4.3 Proof of Safety for LM
    3.4.4 Proof of Liveness

4 Parallel Logging
  4.1 Parallel XEL
  4.2 Three Different Distribution Policies
    4.2.1 Partitioned Distribution
    4.2.2 Random Distribution
    4.2.3 Cyclic Distribution

5 Management of Precommitted Transactions
  5.1 The Problem
  5.2 Shortcomings of Previous Approaches
  5.3 Logged Commit Dependencies (LCD)

6 Experimental Results
  6.1 Simulation Environment
    6.1.1 Input Parameters
    6.1.2 Fixed Parameters
    6.1.3 Data Structures
    6.1.4 Validity of Simulation Model
  6.2 Extended Ephemeral Logging for a Single Stream
    6.2.1 Effect of Transaction Mix
    6.2.2 Effect of Transaction Duration
    6.2.3 Effect of Size of Data Log Records
    6.2.4 Effect of Data Skew
  6.3 Parallel Logging
    6.3.1 No Skew
    6.3.2 Moderate Skew
    6.3.3 High Skew
    6.3.4 Discussion

7 Conclusions
  7.1 Lessons Learned
  7.2 Importance of Results
  7.3 Extensions
    7.3.1 Non-volatile Region of Main Memory
    7.3.2 Log-Only Disk Management
    7.3.3 Multiplexing of Log Streams
    7.3.4 Choice of Generation Sizes
    7.3.5 Fault Tolerance to Isolated Media Failures
    7.3.6 Fault Tolerance to Partial Failures
    7.3.7 Support for Transition Logging
    7.3.8 Adaptive Distribution Policies
    7.3.9 Quick Resumption of Service After a Crash
  7.4 The Future of High Performance Databases

A Theorems for Correctness Proof of XEL
  A.1 Proof of Possibilities Mapping
  A.2 Proof of Liveness

List of Figures

1.1 Disk Configuration for Concurrent Database System
1.2 Firewall Method of Disk Space Management for Log
2.1 Two Successive Updates to Object ob8
2.2 State of the Log After a Crash
2.3 Disk Space Management Using XEL
2.4 Typical Events During the Lifetime of a Transaction
2.5 State Transition Diagram for a REDO DLR
2.6 State Transition Diagram for an UNDO DLR
2.7 State Transition Diagram for a COMMIT TLR
2.8 Data Structures for XEL
2.9 Buffering of Incoming Log Records for Batched Write to Disk
3.1 Interface of Simplified Log Manager
3.2 Automaton to Model Well-formed Environment
3.3 Specification Automaton for LM
3.4 I/O Automata for an Object
3.5 States and Action Signatures of Automata in LM Module
4.1 Four Parallel Log Streams
4.2 Load Imbalance vs. Number of Objects
5.1 Deadlock in Dependency Graph for Buffers at Two Log Streams
5.2 Static Dependency Graph of Log Streams
6.1 Simulation Transaction Model
6.2 Performance Results for Varying Transaction Mix
6.3 Performance Results for Varying Long Transaction Duration
6.4 Performance Results for Varying Size of DLRs from Long Transaction Type
6.5 Performance Results for Varying Data Skew
6.6 Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.5)
6.7 Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.01)
6.8 Disk Bandwidth and Memory Requirements vs. Parallelism (x=2x10^-4)
7.1 Multiplexing of Older Generations

Chapter 1

Introduction

1.1 Motivation

This thesis re-examines the problem of fault tolerance within the context of highly concurrent databases whose applications may be characterized by a wide distribution in transaction lifetimes. The goal of this effort is to propose and evaluate new data structures and algorithms that constitute an efficient and scalable solution to the fault tolerance problem.

This thesis devotes most of its attention toward the management of information on disk and ignores other aspects of the problem. Current technical and economic trends justify this approach. Processors have made dramatic improvements, in terms of both cost and performance, compared to disk technology. Similarly, main memory storage space and interconnection network bandwidth are now relatively inexpensive (compared to disk) and abundant in most concurrent computer systems. Disk technology is becoming an increasingly crucial factor in the design of any concurrent database management system (DBMS) because it accounts for a significant fraction of the cost of a system and threatens to limit the system's performance [22, 7].

Highly concurrent computers offer the potential for powerful databases that can process many thousands of transactions per second. According to [18], a good database system typically requires 10^5 instructions per transaction for the debit-credit benchmark. Current microprocessors can process instructions at a rate of at least 10^8 instructions per second [29]. With current technology, therefore, it is reasonable to assume that each processor can support a rate of at least 1000 TPS if there are no limitations other than CPU speed. The overall performance of a system with hundreds of processors is expected to be several hundred thousand transactions per second for the debit-credit benchmark. A good DBMS design must eliminate bottlenecks that would otherwise prevent users from fully harnessing the computational potential of highly concurrent computers.
Data structures and algorithms that worked well in DBMS designs for serial computers are inappropriate for highly concurrent systems. Consider the specific problem of fault tolerance to system failures (also known as crashes), in which the contents of volatile main memory storage are corrupted. To provide support for atomic transactions, a DBMS must guarantee fault tolerance to crashes. Traditionally, DBMSs have kept a log of all modifications performed by transactions that are still in progress. Conceptually, the log is a FIFO queue to which records are added at the tail; the log should be sufficiently long that records become unnecessary before they reach the head. This abstraction of a single FIFO queue becomes awkward and inappropriate in a highly concurrent system; the tail of the queue is a potential serial bottleneck. Furthermore, if only a single disk drive is dedicated to hold the log, it may not provide sufficient bandwidth for large volumes of log information. For example, a system that generates 500 Bytes of log information per transaction and runs at 100,000 TPS needs at least 50 MBytes/sec of disk bandwidth. If current disk drive technology can provide at most 2 MBytes/sec bandwidth per drive, the system requires at least 25 disk drives just for log information to ensure that the log is not a bottleneck.

To add to these difficulties, applications that use databases are becoming more diverse. Some applications may have a wide distribution of transaction lifetimes. An application with a small proportion of transactions whose lifetimes are much longer than average poses problems for traditional logging algorithms. Most variations of logging retain all log records that have been written (by all transactions) since the beginning of the oldest transaction that is still in progress. Many of these log records may be unnecessary for the purposes of recovery, but their disk space cannot be reclaimed as long as some older record must be retained; this situation arises as a consequence of the FIFO policy which governs the management of a log's disk space. This constraint poses disk management problems. If a transaction lives too long, the log will run out of space to hold new records. An obvious solution is to simply allocate a large amount of disk space for the log, but this implies some unpleasant consequences. First, it may unnecessarily increase a system's cost. Second, the large size of the log may entail a much longer recovery time after a crash. These drawbacks prompt an investigation into better methods of fault tolerance for databases whose applications may have a wide distribution in transaction lifetimes.

1.2 Major Contributions

The following paragraphs summarize the major contributions of this thesis.

Extended Ephemeral Logging (XEL). XEL is a new technique for managing a log of database activity on disk. This thesis presents the data structures and algorithms which constitute XEL; it explains XEL's operation during both normal database activity and recovery from a crash. XEL does not require periodic checkpoints and does not abort lengthy transactions as frequently as traditional logging techniques which manage the log as a FIFO queue (assuming that XEL and the FIFO queue technique are both limited to the same amount of disk space). Therefore, XEL is well suited for highly concurrent databases and applications that have a wide distribution of transaction lifetimes.
XEL can offer significant savings in disk space, at the expense of slightly higher bandwidth for log information and more main memory. The reduced size of the log permits much faster recovery after a crash as well as cost savings. XEL can significantly reduce both disk space and disk bandwidth in a system that has at least some portion of main memory which is non-volatile.

Proof of Correctness for XEL. XEL's safety and liveness properties are formally proven. Apropos safety, this thesis proves that XEL never does anything wrong; therefore, the database can always be restored to a consistent state after a crash, regardless of when the crash occurs. The liveness property ensures that XEL always makes progress; every log record is eventually erased.

Evaluation of XEL. The benefits and costs of XEL, relative to logging techniques which manage log information in a FIFO queue, are quantitatively evaluated via event-driven simulation. This thesis presents these experimental results.

Evaluation of Parallel Logging Distribution Policies. The abstraction of multiple log streams for log information can be easily implemented in a highly concurrent DBMS which requires an arbitrarily large collection of disk drives to provide the necessary bandwidth for log information. A database system's distribution policy dictates the log stream(s) to which any particular log record is sent. This thesis evaluates and compares three different distribution policies for a DBMS that has multiple parallel streams of log records. The random policy, which randomly chooses a log stream for any log record, has good load balancing properties and is simple to implement.

Logged Commit Dependencies (LCD). The LCD technique permits very high throughput on "hot spot" objects in a highly concurrent database. It is a variant of the precommitted transaction technique [15] and is especially well suited for a DBMS that uses multiple parallel log streams. A transaction's dependencies on previous precommitted transactions are explicitly encoded in a special PRECOMMIT record that is generated as soon as the transaction requests to commit, thus eliminating potentially awkward synchronization requirements between the log stream to which the transaction's COMMIT record is eventually written and the log streams to which COMMIT records are written for the transactions on which it depends.

1.3 Statement of Problem

In any DBMS, the log manager (LM) manages log information during normal database operation, and the recovery manager (RM) is responsible for restoring the database to a consistent state after a crash. Together, these two components make up a DBMS's logging and recovery subsystem. The log holds records for only recent modifications to the database. A version of the database kept elsewhere on disk stores the state of all items of data in the database. At any given point in time, this disk version of the database does not necessarily incorporate all the updates that have been performed by committed transactions; some of these updates may be recorded only in the log. Another DBMS component called the cache manager (CM) is responsible for managing the contents of the disk version of the database and must work in collaboration with the LM. The CM chooses to flush (transfer) updated objects to the disk version of the database in a manner that uses I/O resources efficiently. (The term object is used broadly to denote any distinct item of data in a database. It may be a record in a hierarchical or network database, a tuple in a relational database or an object in an object-oriented database.) Figure 1.1 graphically represents the disk configuration of a concurrent database system.
At the top, a collection of disk drives provides the bandwidth and storage capacity required for log information generated by the DBMS; these are called the log disks. The LM manages these drives. The exact number of log disks is chosen to support the highest rate of transaction processing of which the rest of the system is capable. A small number of buffers, in main memory, are dedicated to each of these log disks. On the right hand side, some other collection of disk drives holds the disk version of the database. The exact number of drives required for the disk version of the database depends on the demands for disk space and bandwidth imposed by the rest of the system. A buffer pool is associated with each different disk drive of the disk version of the database. These buffer pools are quite large so that they serve as caches. They reduce the number of retrievals from disk and allow writes to disk to be ordered in a manner that permits higher transfer rates (due to mostly sequential I/O). The CM manages these buffer pools and their associated disk drives.

[Figure 1.1: Disk Configuration for Concurrent Database System]

A complete solution to the logging and recovery problem in a concurrent DBMS must answer all of the following questions, which apply to the management of log information during normal operation of the database:

1. What events are logged?
2. What information should a log record contain?
3. How does the LM decide the disk drive(s) to which it will write a log record?
4. At what time should the LM write a log record to disk?
5. Where on disk should the LM write a log record?
6. When can the LM overwrite a log record on disk with more recent log information?
7. When can the CM flush an updated object's new value to the disk version of the database?

Any proposed logging and recovery method must also respond to the following questions that concern recovery after a crash:

1. How should the RM schedule retrievals of blocks of log information from disk?
2. In what order does the RM process the log records contained in a block of log information?
3. Given a particular log record, what does the RM do with it?
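These questions suggest the shape of the interfaces among the LM, CM and RM. The following C sketch merely restates them as declarations, as a reading aid; it is hypothetical and not an API prescribed by this thesis (every type and function name here is invented).

```c
/* Hypothetical interface for the logging and recovery subsystem.  This
 * only restates the design questions above as declarations; the thesis
 * does not prescribe an API, and every name here is invented. */

#include <stddef.h>
#include <stdint.h>

typedef uint32_t oid_t;  /* unique object identifier (see Section 1.6) */
typedef uint32_t tid_t;  /* unique transaction identifier              */

typedef enum { LOG_REDO, LOG_UNDO, LOG_COMMIT } log_type_t;

typedef struct {
    log_type_t type;    /* question 1: what event is being logged        */
    tid_t      tid;     /* question 2: which transaction wrote it        */
    oid_t      oid;     /* question 2: which object a data record covers */
    size_t     len;     /* question 2: length of the state image         */
    uint8_t    body[];  /* question 2: pre- and/or post-state of object  */
} log_record_t;

/* Log manager (LM): its implementation answers questions 3 through 6
 * (which disk, when, where, and when a record may be overwritten). */
void lm_append(const log_record_t *rec);

/* Cache manager (CM): question 7 (when to flush an updated object). */
void cm_flush(oid_t oid);

/* Recovery manager (RM): the three recovery-time questions (how to
 * schedule block retrievals, in what order to process the records within
 * a block, and what to do with each record). */
void rm_recover(void);
```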
1.4 Review of Previous Research

Several good textbooks and articles have been published on the subject of fault tolerance in database systems. A reader who would like to become familiar with the basic techniques and terminology of the field is referred to [20, 24, 30, 5, 26]. The remaining subsections of this section review prior research that is specifically relevant to the material in this thesis.

1.4.1 Disk Storage Management

The Firewall Method of Disk Management

Traditionally, logging has been the method of choice for fault tolerance in most database systems. (Logging is not the only possible solution to the fault tolerance problem, but it has tended to be the most popular for reasons of performance and efficiency; refer to [37, 30, 2, 5, 39] for explanations of alternative methods of achieving fault tolerance, such as shadowing, and comparisons of the strengths and weaknesses of the different approaches.) A LM maintains a log of database activity as transactions execute and modify items of data (i.e., objects) in the database. Conceptually, the log is a FIFO queue. The LM adds log records to the tail immediately after they are created. Log records progress toward the head of the queue and ought to become unnecessary by the time they eventually reach the head. The DBMS allocates a fixed amount of disk space to hold log information. The LM manages this disk space as a circular array [3, 10]; the log's head and tail pointers rotate through the positions of the array so that records conceptually move from tail to head but physically remain in the same place on disk. System R [24] is a familiar example of this traditional logging technique.

The LM maintains a pointer to the oldest record in the log that must still be retained; this constitutes a "firewall" beyond which the head of the log cannot be advanced. Hence, this logging technique shall be referred to as the firewall (FW) method. The LM initiates periodic checkpoints. As soon as a checkpoint begins, the LM writes out a special beginning-of-checkpoint record to the log. During a checkpoint, the CM writes out all updated objects to the disk version of the database and then the LM writes out a special end-of-checkpoint record to the log. (Sophisticated fuzzy checkpoint methods allow the database to continue servicing requests from client transactions while the CM flushes out all updated objects, so that the checkpoint activity causes negligibly small disruption to normal operation of the database.) After the checkpoint has completed, the LM can be sure that all preceding log records for committed updates are no longer necessary to ensure correct recovery of the database after a crash. The LM keeps a pointer to the position within the log of the beginning-of-checkpoint record for the most recently completed checkpoint. The LM also maintains a pointer for each active transaction (one that is still in progress: it has neither committed nor been aborted) that identifies the position within the log of the oldest record written by the transaction. At any given time, the log's firewall is the oldest of the pointers for all active transactions and the pointer to the beginning of the most recent checkpoint. Figure 1.2 illustrates an example.

[Figure 1.2: Firewall Method of Disk Space Management for Log — between the tail (where new log records enter) and the head lie the firewall, the most recent checkpoint record, old log records no longer required for recovery, and unused disk space available for re-use]
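For concreteness, the following C fragment sketches the FW method's space accounting as just described. The representation (monotonically increasing logical byte offsets, mapped onto the circular array by offset modulo size) and all names are invented for illustration; this is not code from System R or from this thesis.

```c
/* Minimal sketch of FW space accounting.  Head, tail, checkpoint and
 * firewall are monotonically increasing logical byte offsets; a record's
 * physical position on disk is its offset % size. */

#include <stdint.h>

typedef struct {
    uint64_t tail;        /* offset at which the next record is appended  */
    uint64_t checkpoint;  /* offset of the beginning-of-checkpoint record
                             of the most recently completed checkpoint    */
    uint64_t size;        /* fixed amount of disk space reserved for log  */
} fw_log_t;

/* The firewall is the oldest offset that must still be retained: the
 * minimum over the most recent checkpoint record and the first log
 * record written by every active transaction. */
uint64_t fw_firewall(const fw_log_t *log,
                     const uint64_t *txn_first_rec, int n_active)
{
    uint64_t fw = log->checkpoint;
    for (int i = 0; i < n_active; i++)
        if (txn_first_rec[i] < fw)
            fw = txn_first_rec[i];
    return fw;
}

/* The head cannot advance past the firewall, so the log is out of space
 * when appending rec_len more bytes would wrap onto retained records;
 * the LM must then checkpoint or kill an old transaction. */
int fw_log_full(const fw_log_t *log, uint64_t firewall, uint64_t rec_len)
{
    return (log->tail + rec_len) - firewall > log->size;
}
```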
If the log starts to run short on space for new log records, the LM must free up some space by advancing the firewall. It must either kill an old active transaction or it must perform another checkpoint, depending on the exact nature of the current firewall. In general, it is bad to kill a transaction because this will likely annoy the client who originally initiated the transaction. Furthermore, all the resources consumed by the transaction have essentially been wasted, and the transaction's effort will be repeated if its client decides to try it again. This scenario is particularly irritating because many records in the log may be unnecessary for recovery purposes but the FIFO queue abstraction prevents the LM from reclaiming their space until all prior log records have been rendered unnecessary. For example, suppose that a transaction updates an object and continues to live for another 10 min while many short (several seconds) transactions each update a few objects and commit. The log records from these short transactions follow the long transaction's first log record. Even though most of the log records from these short transactions are no longer needed for recovery, their space cannot be reclaimed until the long-lived transaction finishes.

Checkpoints are not free. They become awkward in a concurrent system because they entail synchronization and coordination amongst an arbitrarily large number of participating parties. In general, if the LM wishes to perform a checkpoint, it must coordinate activity at all the log disks and all the disk drives on which the disk version of the database resides. A checkpoint operation requires communication bandwidth, processor cycles, storage space in main memory and disk bandwidth. Periodic checkpoints may interfere with the CM's operation by constraining it to schedule flushes to disk in an order that does not take full advantage of locality within the disk version of the database. Finally, the duration required to perform a checkpoint and the delays between consecutive checkpoints limit the speed with which the LM can reclaim space in the log.

As the number of log disks increases, the abstraction of a single queue becomes increasingly difficult to implement. The tail of the queue is a potential serial bottleneck and the LM must carefully manage the order in which it writes blocks to each log disk so that it preserves the log's FIFO property.

Therefore, the traditional "firewall" method of logging poses implementation problems for highly concurrent databases with wide variations in transaction lifetimes because it does not reclaim disk space sufficiently quickly, it suffers from the overhead of periodic checkpoint operations and it is difficult to implement on an arbitrarily large number of disk drives.

Recirculation of Log Records

Hagmann and Garcia-Molina [32] propose a solution to the disk management problem posed by long-lived transactions. If a record reaches the head of the log but must still be retained, the LM recirculates the record within the log by adding it to the tail. In contrast to the XEL method which this thesis will propose, they continue to implement the log as a single FIFO queue, rather than a chain of FIFO queues. A log record from a very long-lived transaction may therefore be recirculated numerous times within the log, thereby consuming more bandwidth than would be required by the XEL method. Furthermore, Hagmann and Garcia-Molina do not attempt to eliminate the need for checkpoint operations.

Log Compression

Log compression filters out unnecessary information so that less storage space is required to store the log. However, previous methods of log compression [34, 31] are intended for "batch mode" log compression; the LM cannot add new records to the existing log while it is being compressed. The XEL method proposed in this thesis essentially performs log compression, but its continuous and incremental nature distinguishes it from previous log compression methods.
Generational Garbage Collection

Previous work on generational garbage collection inspired XEL's essential idea: the log is segmented into a chain of FIFO queues. Lieberman and Hewitt [40] proposed the segmentation of a system's main memory storage space into several temporal generations; most of the system's garbage collection effort is limited to only its younger generations. Quantitative evaluation of several variations on generational garbage collection [62, 49, 63, 59] has demonstrated its effectiveness for automatic memory management.

However, previous work on generational garbage collection addressed the more general problem of automatic storage reclamation by a programming language's runtime system. The reference pattern amongst a program's data objects is usually more complicated than the dependencies that exist amongst log records, and so garbage collection methods that worked well for programming languages may be inappropriate for managing the disk storage reserved for a database's log. Less complicated yet more effective techniques may be possible for the simpler problem of managing a database's log.

Rosenblum and Ousterhout [56, 57, 58] adopt a similar strategy for the log-structured file system (LFS). The LFS adds all changes to data, directories and metadata to the end of an append-only log so that it can take advantage of sequential disk I/O. The log consists of several large segments. A segment is written and garbage collected ("cleaned") all at once. To reclaim disk space, the LFS merges non-garbage pieces from several segments into a new segment; it must read the contents of a segment from disk to decide what is garbage and what isn't. There is a separate checkpoint area and the LFS performs periodic checkpoints. The LFS takes advantage of the known hierarchical reference patterns amongst the blocks of the file system during logging and recovery; the inode map, segment summary blocks and segment usage array data structures describe the file system's structure and play an important role in the LFS's management of disk space.

1.4.2 Parallel Logging and Recovery

Distributed databases are similar to concurrent databases, but not identical. Like concurrent databases, a distributed database consists of an arbitrary number of processors linked via some communication network; the total set of processors is partitioned into disjoint sites, and the communication network connects the sites together. These sites can share data, so that a transaction executing at any particular site effectively sees one large database. However, the underlying technology distinguishes distributed databases from concurrent databases. In general, a distributed database's communication network has lower bandwidth, much longer latency and lower reliability than that of a concurrent database. Moreover, individual sites or links in a distributed database's network may fail (partial failures, in the terminology of [5]), yet other portions of the system remain intact; the system can continue to provide service to client transactions as long as these transactions do not require access to data that is unavailable due to failures elsewhere. For many concurrent systems, either the entire system is operational so that all data is available to client transactions or it is completely failed (a total failure [5]) so that no transaction can execute. This "all or nothing" characteristic simplifies some aspects of the DBMS implementation problem.
The limitations of a distributed database's communication network and concerns about availability of data lead to rigid partitioning of data objects within the system. Assume, for simplicity, that objects are not replicated at different sites. Each object has a home site [5] and objects do not migrate between sites. Any transaction that wants to access a particular object must do so at its home site. Whenever an object is updated, log information is generated and stored at its home site. This rigid partitioning is liable to load balancing problems. One site may hold a disproportionately large number of "hot spot" objects (objects that are accessed much more frequently than other objects in the database) and so it is overloaded while another site is relatively idle. However, the rigid partitioning prevents the system from taking advantage of available processing power and disk bandwidth elsewhere in the system. A concurrent system has more flexibility in balancing the loads amongst processors and disk drives.

The possibility of partial failures necessitates complicated algorithms for atomic commitment, such as the two phase commit (2PC) and three phase commit (3PC) protocols [5], for distributed databases. These sophisticated techniques are unnecessary for concurrent databases under the "all or nothing" assumption.

The SWALLOW distributed storage system [54] stores objects in a collection of autonomous repositories. Each object is represented as a sequence of versions, called a version history, which records successive updates to the object over time. SWALLOW creates a commit record at each repository at which a transaction updates objects and links new versions of updated objects to the transaction's commit record while the transaction executes. If the transaction eventually aborts, the system deletes the transaction's commit record and the versions for the updates which it performed on objects. Otherwise, the committed transaction's changes become permanent in the multiversion representation of the database. Hence, there are no distinct log records in SWALLOW because it retains versions for objects. Old versions can be garbage-collected when there are no longer any references to them.

Lomet [41] proposes a new method for redo logging and recovery that is intended for use in a data sharing system. Multiple nodes can access common data, yet each node can have its own log. In contrast to the partitioning of data which characterizes many distributed databases, Lomet's system allows data objects to migrate within the system. After modifying an object, a processing node records the operation in its own private log; each private log is a sequential file, apparently managed as a FIFO queue. Hence, log records for different updates to the same object may be distributed throughout a collection of private logs. This data sharing approach offers potentially better load balancing behavior. For each object, Lomet's system ensures that at most one node's log can record updates that have not yet been applied to the version of the object currently in the disk version of the database; hence, recovery activity for each object is still limited to only a single node even though an object's log records may be distributed throughout the private logs of numerous nodes.

Previous researchers have already broached the problem of parallel logging and recovery in a concurrent database but they often focussed on only isolated subproblems or declined to propose detailed solutions. The expedient of using several disk drives to increase the throughput for log information was suggested in [15]. Lehman and Carey [39] propose storing log information in a logically parallel manner that easily maps to a physically parallel implementation; every partition (as defined in [39], the unit of memory allocation for a system's underlying memory mapping hardware) in the database has its own separate log of activity.
Lehman and Carey [39] propose storing log information in a logically parallel manner that easily maps to a physically parallel implementation; every partition 6 in the database has its own separate 5 "Hot 6 spot" objects are accessed much more frequently than other objects in the database. A partition, as defined in [39], is the unit of memory allocation for a system's underlying memory 20 log of activity. Agrawal [1, 2] investigates some alternative solutions to the recovery problem for the multiprocessor-cache class of database machines, of which the DIRECT system [14, 8] is an example. He investigates four different distribution policies for determining the log disk to which to send a log record: cyclic, random, Query Processor Number mod Total Log Processors (QPNmTLP) and Transaction Number mod Total Log Processors (TNmTLP). Agrawal concluded that the TNmTLP policy suffers from significant load balancing problems; an exceptionally industrious transaction can generate a deluge of log information for one log disk while the other disks sit relatively idle. As processor speed and bandwidth continue to rise, relative to disk bandwidth, the QPNmTLP policy faces similar load balancing problems. Agrawal examined these four policies within the context of the multiprocessor-cache class of database machines and in conjunction with different solutions to some of the other subproblems associated with logging and recovery. His evaluation criteria focused on overall performance (throughput and response time) rather than on specific aspects of efficiency. He does not consider how much disk space the log requires or the extent to which the log disks' loads are balanced. This thesis will make different assumptions that more closely model today's concurrent computer technology and will choose different evaluation criteria. Apropos recovery, DeWitt et al. [15] propose merging several parallel logs into a single log so that familiar algorithms from the sequential world will be applicable. Agrawal's algorithm [1] does not require a merge step, but it processes each log sequentially one after the other and therefore forfeits the opportunity to exploit parallelism during recovery. Kumar [38] proposes a parallel recovery algorithm, but it requires processing at all log streams to proceed in a lock-step manner; frequent barrier synchronization limits the performance and scalability of this algorithm. To address these limitations, he proposes an improved algorithm that involves minimal synchronization between recovery activities at separate log streams, but this latter algorithm still requires two scans over each log stream. 1.4.3 Management of Precommitted Transactions Interactions between the LM and the concurrency control manager (CCM) of a DBMS can affect a system's overall performance. Strict two phase locking (2PL) [5] requires that the CCM release all write locks held by a transaction only after the transaction has committed or aborted. Hence, the CCM must hold all the write locks of a successful transaction for at least as long as the minimum response time for the LM to accept a DLR and process a request to commit from the transaction. Without non-volatile main memory, the lower bound for this response time is the minimum time required to write a block to disk. Slow response on the part of the LM may entail concurrency control bottlenecks on some objects that must be updated very frequently. mapping hardware. 
1.4.3 Management of Precommitted Transactions

Interactions between the LM and the concurrency control manager (CCM) of a DBMS can affect a system's overall performance. Strict two phase locking (2PL) [5] requires that the CCM release all write locks held by a transaction only after the transaction has committed or aborted. Hence, the CCM must hold all the write locks of a successful transaction for at least as long as the minimum response time for the LM to accept a DLR and process a request to commit from the transaction. Without non-volatile main memory, the lower bound for this response time is the minimum time required to write a block to disk. Slow response on the part of the LM may entail concurrency control bottlenecks on some objects that must be updated very frequently.

To alleviate these performance limitations, the precommitted transaction (PT) technique [15] enables a CCM to release a transaction's write locks as soon as the transaction requests to terminate. The transaction precommits when it requests to commit, and it commits when all its log records (including a COMMIT record) have been written to disk. It is in a precommitted state between these two times.

The PT technique alleviates the throughput bottleneck on "hot spot" objects due to I/O latency to disk. Without the PT technique, a transaction that has requested to commit cannot release its write locks until after it has committed, lest some other transaction see its effects before they have been made permanent in the database. In this case, the maximum rate at which independent transactions can update an object is limited by the rate at which successive blocks of log records can be written to disk. If the minimum time to write a block to disk is Tmin (for example, 10 ms), then each transaction must hold its write locks for at least Tmin and the maximum throughput for any object is 1/Tmin. If a database has hot spot objects, its entire throughput may therefore be limited to 1/Tmin. When the LM uses the PT technique, independent transactions can update hot spot objects at a much higher rate, limited by the time required to acquire and release locks.

Now suppose that a DBMS's LM supports precommitted transactions. A transaction can release its write locks after it has requested to commit but before it actually commits. While it waits in this precommitted state, the only thing that could cause it to fail is some failure on the part of the DBMS (e.g., a crash) which prevents the log records from being written to disk. Other transactions can see the updates from the precommitted transaction, but they become dependent on it to eventually commit. The LM sends an acknowledgement to the transaction in response to its commit request after the transaction commits. If a crash occurs before the precommitted transaction commits, the RM must ensure that the restored database does not include updates from the transaction or any of the subsequent transactions which depended on it. After a transaction commits, a COMMIT log record exists on disk to record the fact that the transaction committed and so its effects are guaranteed to survive a crash.
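The lifecycle just described can be summarized as a small state machine. The following C sketch is hypothetical (the state and function names are invented), intended only to make the two distinct "commit" moments explicit.

```c
/* Sketch of the transaction lifecycle under the PT technique; all names
 * here are hypothetical, for illustration only. */

typedef enum {
    TXN_ACTIVE,        /* executing; holds its write locks               */
    TXN_PRECOMMITTED,  /* has requested to commit; write locks released,
                          but its COMMIT record is not yet on disk       */
    TXN_COMMITTED      /* COMMIT record on disk; effects survive a crash */
} txn_state_t;

/* When the transaction requests to commit, the CCM may release its write
 * locks at once; later readers of its updates become dependent on this
 * transaction eventually committing. */
void txn_request_commit(txn_state_t *st)
{
    *st = TXN_PRECOMMITTED;
    /* enqueue the COMMIT record with the LM; do not wait for disk I/O */
}

/* Once the COMMIT record (and all earlier log records) are safely on
 * disk, the LM acknowledges the commit request to the client. */
void txn_commit_durable(txn_state_t *st)
{
    *st = TXN_COMMITTED;
}
```

With Tmin = 10 ms, the wait for durability alone would cap a hot spot object at 1/Tmin = 100 independent updates per second without PT; under PT, only lock acquisition and release stand between successive updaters.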
DB2 is IBM's implementation of SQL for its mainframe systems; it implements a -STEAL buffer management policy and the WAL (write ahead log) protocol [5]. DEC markets a database system called Rdb/VMS. It runs on a VAXcluster system ("shared disk" architecture) and supports a single global log. Rdb/VMS performs undo/redo logging, with separate journals for redo and undo log records; the After Image Journal (AIJ) holds redo information and the Run-Unit Journal (RUJ) keeps undo information. Periodic checkpoints bound the length of recovery time. Rdb/VMS exploits group commit to achieve efficient disk I/O. The Teradata DBC/1012 system [60, 46, 47, 55] is a highly parallel database system that is intended principally for decision support (i.e., mostly queries) but which provides some support for on-line transaction processing (OLTP). The most recent version, the DBC/1012 model 4, incorporates up to 1024 Intel 80486 microprocessors. The system's processors are divided into interface processors (IFPs) and access module processors (AMPs). The DBC/1012 can support multiple hosts; each IFP connects to one particular host. The AMPs manage the data. Tuples within a relation are partitioned across the AMPs so that the DBC/1012 can support parallelism both within and between independent requests. The AMPs and IFPs are interconnected by a proprietary Ynet "active logic" interconnection network. The Tandem NonStop SQL system [27, 28, 33, 16] is essentially a distributed relational database system. Data objects are partitioned across multiple processing nodes and transactions can access data at different sites. The system provides local autonomy so that a site can perform work despite failures at other sites or in the interconnection network. An implementation of the two-phase commit (2PC) protocol [5] ensures that distributed transactions commit atomically. NonStop SQL performs undo/redo logging and maintains a separate log at each site. The state of the database is periodically archived by performing a fuzzy dump. Oracle's Parallel Server [44] is intended to run on highly parallel systems (such as the KSR1 [61]). It can support very high transaction processing rates, compared to other available systems. Version 6.2 has been benchmarked at 1073 TPS (transactions per second) for the TPC-B benchmark [21], with a cost of only $2,480 per TPS; these results were obtained for Oracle V6.2 running on an nCube concurrent computer system with 64 processors. Parallel Server employs redo/undo servers for log information, performs periodic checkpoints and uses a partitioned distribution policy for distributing log records. 23 1.6 Assumptions Physical State Logging at the Access Path Level. This thesis limits its attention to physical state logging at the access path level [30]. According to this definition, any modification to an object in the database results in a log record that holds some representation of the state of the object; the log record may hold the pre-modification state of the object, the post-modification state, or both. Buffer Management: -FORCE, STEAL. This thesis assumes the most general policy for buffer management. The CM may flush an updated object to the disk version of the database whenever it chooses, regardless of whether or not the transaction that performed the update has yet committed. Restated in formal terminology, the buffer management policy is -FORCE and STEAL [30]. Concurrency Control: Two Phase Locking. 
The LM does not perform concurrency control, but the concurrency control manager that schedules requests to the LM on behalf of client transactions must respect certain restrictions if the RM is to be able to restore the database to a consistent state after a crash. This thesis assumes that the concurrency control manager performs two phase locking (2PL) [17, 5] so that all executions are serializable. All chapters except Chapter 5 will further assume that all executions are strict [5]: no transaction reads or updates an object that has been modified by another transaction which has not yet committed. Chapter 5 relaxes this assumption slightly; a transaction may release its write locks shortly after it requests to commit even though it has not actually committed yet.

Volatile Main Memory, Non-volatile Disk Storage. All main memory is volatile. In the event of a system failure, such as a power interruption, some or all of the contents of main memory may be lost. In contrast, disk storage (secondary memory) is non-volatile. Any information written to disk will remain on disk despite a system failure.

Distributed Memory Multiprocessor System. The data structures and algorithms presented in this thesis are intended for a fine-grain distributed memory multiprocessor system, such as the MIT J-Machine [11, 12, 48], in which each processor can directly address only its own local memory and all interprocessor communication must occur via explicit message passing. Nevertheless, the techniques presented in this thesis could be adapted to a shared memory multiprocessor system with little effort.

Disks are the Limiting Resource. This thesis addresses the problem of managing log information on disk, subject to limited storage capacity and bandwidth. Disk technology threatens to limit the performance of concurrent database systems, and is expected to account for a significant fraction of their cost [22, 7]. Existing concurrent computer systems provide abundant computational power, volatile main memory storage and interprocessor communication ability so that none of these resources constitutes a bottleneck. Hence the attention specifically to disk technology. Recent trends and expectations for future progress justify this assumption. Processor performance has increased dramatically over the past decade and will likely continue to improve at a fast pace in the near future. Similarly, DRAM (dynamic random access memory) capacities have soared and prices have fallen during the past decade, and these trends in main memory technology are expected to continue. Interconnection network technology has improved significantly so that high bandwidth, low latency interprocessor communication is now a reality. For example, the MIT J-Machine provides 288 Mbps communication bandwidth per channel [12, 50], and each processing node has 6 channels. In contrast, the capacity, bandwidth and cost of disk drives have not improved as dramatically.

Unique Identifiers for Objects and Transactions. Every object in the database must have some unique object identifier (oid). Similarly, each transaction must have a transaction identifier (tid) that distinguishes it from all other transactions.

1.7 Summary of Remaining Chapters

Chapter 2 explains the extended ephemeral logging (XEL) method. XEL is a new method for managing a log of database activity on disk. It is a more general variation of ephemeral logging (EL) [35]; XEL does not require a timestamp to be maintained with each object in the database.
XEL does not require periodic checkpoints and does not abort lengthy transactions as frequently as traditional firewall logging for the same amount of disk space. Therefore, it is well suited for highly concurrent databases and applications that have a wide distribution of transaction lifetimes.

Important safety and liveness properties for a simplified version of XEL are proven in Chapter 3. The log record from the most recently committed update to an object remains recoverable as long as log records from earlier updates to the same object can be recovered from the log. However, every log record is eventually erased so that its space on disk can be re-used for subsequent log information.

Chapter 4 considers how to manage log information in a highly concurrent database. The abstraction of a collection of log streams, all operating in parallel with one another, is suitable for applications with very high bandwidth requirements for log information. The LM uses the XEL method to manage the disk space within each log stream. With multiple log streams, the LM must have some distribution policy by which it decides the stream(s) to which it will send any particular log record. Chapter 4 analyzes three different distribution policies.

Chapter 5 points out the difficulties of implementing the PT technique, in its current form, in a LM that supports an arbitrarily large collection of log streams. The chapter proposes a new variation of the technique, called Logged Commit Dependencies (LCD), which alleviates these difficulties. It introduces a new type of record, called a PRECOMMIT record, which explicitly states all of a transaction's dependencies at the time that the transaction requests to commit.

Chapter 6 quantitatively evaluates XEL via event-driven simulation. XEL's complexity severely limits analytical attempts to evaluate its performance. Simulation provides an alternative means by which to study its behavior. Section 6.1 describes the implementation of a simulator for XEL. It explains each of the input parameters, documents the fixed parameters, presents the definitions of XEL's data structures as expressed in the C programming language [36] and justifies the validity of the simulation model. Section 6.2 evaluates XEL's performance for only a single log stream as various input parameters vary and compares XEL's performance to that of the FW method. The following section examines XEL's behavior for a collection of parallel log streams as the degree of parallelism increases. Disk space, disk bandwidth, main memory requirements and recovery time are the evaluation criteria throughout the chapter.

The last chapter of the thesis summarizes the important lessons that were learned, explains the importance of the results and discusses various extensions to XEL.

Chapter 2

Extended Ephemeral Logging (XEL)

This chapter proposes a new technique for managing the disk space allocated for log information. This new technique, called extended ephemeral logging (XEL), is a more general variation of the ephemeral logging (EL) technique that the author presented in an earlier publication [35]. Both EL and XEL break the abstraction of the single FIFO queue that was presented in Section 1.4.1. Rather than managing log information in a single FIFO queue, EL and XEL treat the log as a chain of fixed-size FIFO queues and perform garbage collection at the head of each queue.
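The following C sketch captures this organization; it mirrors the description above but every name is invented for illustration (the thesis's own data structure definitions appear in Chapter 6).

```c
/* Sketch of the log organization that EL and XEL share: a chain of N
 * fixed-size FIFO queues ("generations"), each a circular array on disk. */

#include <stdint.h>

#define N_GENERATIONS 3   /* e.g., a three-generation stream, as in Fig. 2.3 */

typedef struct {
    uint64_t head;        /* oldest record still held in this generation */
    uint64_t tail;        /* where the next record is appended           */
    uint64_t size;        /* fixed capacity on disk                      */
} generation_t;

typedef struct {
    /* gen[0] is the youngest generation; new log records always enter at
     * its tail.  Records that survive garbage collection at the head of
     * gen[i] move to the tail of gen[i+1], or recirculate within the
     * oldest generation. */
    generation_t gen[N_GENERATIONS];
} xel_log_t;
```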
This approach, inspired by previous research on generational garbage collection, mitigates the threat of the log running out of space for new log records because a transaction lives too long; EL and XEL can retain the records from long running transactions but can reclaim the space of chronologically subsequent log records that are no longer needed for recovery. Hence, a log manager that uses EL or XEL generally requires less disk space than one that treats the log as a single FIFO queue. Intuitively, this advantage is strongest if an application has only a small fraction of transactions that execute for a very long time and write only a small number of records to the log.

Another strong motivation for EL and XEL arises if a system can be augmented with a limited amount of non-volatile main memory. In such a system, EL and XEL can drastically reduce the amounts of disk space and bandwidth required for log information if most log records emanate from short-lived transactions. The benefits here are twofold. First, the system's cost may be substantially reduced since fewer disk drives are needed for log information. Second, recovery after a crash may be much faster since the amount of log information is considerably smaller.

Variations on EL and XEL can render a separate disk version of the database unnecessary. The most recently committed value for each object is always retained in the log.
Once familiar with the pitfalls of EL, a reader will be better able to appreciate the rationale that underlies the complexities of XEL. Subsequent sections each address specific problems that must be solved in order to implement XEL. To some extent, these subproblems are independent of one another. The structure of this chapter reflects the modular nature of these problems.

2.1 Preamble: Limitation of Ephemeral Logging

Ephemeral logging (EL), as originally proposed in [35], adopts a generational garbage collection strategy for managing a log's disk space. The LM manages the log as a chain of FIFO queues, each of which is called a generation. The LM adds new log records to the tail of the youngest generation. When a record that must still be kept in the log approaches the head of a generation, the LM forwards it to the tail of the next generation or recirculates it within the last generation. For any particular application, the generations' sizes are chosen so that only a small fraction of the records in any generation need to be forwarded or recirculated.

EL requires each object in the database to have a monotonically increasing timestamp. The simplest implementation that satisfies this constraint maintains an integer-valued counter with every object. The LM increments an object's counter each time a transaction updates the object. Whenever the CM flushes an updated object to the disk version of the database, the accompanying timestamp value is stored with the object. Likewise, each data log record (DLR) for an object holds the value and corresponding timestamp (as well as the object and transaction identifiers) from a particular update to the object. After a crash, the RM can determine if the disk version of the database holds the most recently committed value for any particular object. It finds the most recently committed DLR for the object that is still in the log and checks if this DLR has a more recent timestamp than the version of the object currently on disk. If the DLR is more recent, then the RM should update the object in the disk version of the database; otherwise, it should ignore the DLR.

Now suppose that timestamps are not kept with each object stored in the version of the database on disk. This case might arise because of a deliberate decision to conserve storage, or it could be a constraint inherited from an existing implementation. Without timestamps in the database, EL is not sufficient to guarantee a consistent state after recovery from a crash. The RM no longer has a standard by which to judge whether or not a DLR holds a more recent value than that which currently exists for the corresponding object on disk. Accordingly, the RM can no longer deduce which records are non-garbage and which are garbage. The following example illustrates what can go wrong.

Assume that the log has two generations. Suppose transaction tx3 assigns object ob8 a value of 12, writes a corresponding DLR to the log and then commits. Assume that the DLR for this update and the transaction's COMMIT TLR are both forwarded to generation 1 of the log.¹ After moving to generation 1, these two records soon become garbage. Now suppose that transaction tx6 subsequently assigns object ob8 a value of 9 and then commits. Figure 2.1 summarizes this chronology of events for transactions tx3 and tx6.
Figure 2.1: Two Successive Updates to Object ob8 (timeline: DLR for ob8<-12 by tx3; COMMIT for tx3; DLR for ob8<-9 by tx6; COMMIT for tx6)

¹The DLR may have been forwarded because tx3 had a relatively long lifetime, for example. The COMMIT TLR was forwarded because not all the transaction's updates had been flushed to the disk version of the database before it reached the head of generation 0.

Suppose further that both the DLR and the COMMIT TLR from tx6 become garbage before they reach the head of generation 0 and are overwritten by other log records, but the "stale" log records from tx3 are still lingering in generation 1, as shown in Figure 2.2. If a crash were to occur while the system is in such a state, the RM would find tx3's DLR to be the most recently committed update in the log when it attempts to recover object ob8 but it would not know whether the value in this DLR is more recent than the value currently stored for ob8 in the disk version of the database.

Figure 2.2: State of the Log After a Crash (the stale non-garbage records from tx3 linger in generation 1 while the disk version of the database already holds ob8=9)

In this example, the DLR from tx3 is garbage and ought to be ignored; the disk version of the database already holds the most recently committed value for object ob8. It is not difficult to construct a different example in which the RM finds a (non-garbage) DLR that is more recent than the value stored for an object in the disk version of the database.

2.2 Conceptual Design of XEL

Extended ephemeral logging (XEL) manages a log's disk space as a chain of fixed-size queues. Each queue is called a generation. If there are N generations, then generation 0 is the youngest generation and generation N-1 is the oldest generation. New log records are added to the tail of generation 0. A log record near the head of generation i, for i<N-1, is forwarded to the tail of generation i+1 if it must be retained in the log; otherwise, it is simply discarded (overwritten by more recent log information). In the special case of generation N-1, a log record near its head that must be retained is recirculated in it by adding the record to its tail. The disk space within each queue is managed as a circular array [10]; the head and tail pointers rotate through the positions of the array so that records conceptually move from tail to head but physically remain in the same place on disk. In tandem with the activity of the LM, the CM flushes (transfers) updates to the disk version of the database so that some log records become unnecessary for recovery. The LM no longer needs to retain these log records and so it can re-use their space on disk.

Figure 2.3 conveys the essence of XEL for the specific case of a log stream with three generations. Non-garbage log records are necessary for recovery and must be kept in the log; all other log records are garbage. The "garbage pail" does not actually exist, but is conceptually convenient to suggest that log records are "thrown away" after they are no longer needed. The arrows at the head of each generation portray the two possible fates for a log record near the head. If the record is garbage, it is ignored (conceptually thrown away in the garbage pail). If it is non-garbage, then it must be retained in the log and so it is either forwarded to the tail of the next generation or recirculated in the last generation.
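As a concrete illustration of this design, the following C fragment sketches one possible shape for the chain of generations and the decision taken at the head of each queue. It is only a sketch under assumed names (generation, log_chain, must_keep()), not the implementation studied in this thesis.

#include <stdbool.h>

#define NGEN 3                 /* number of generations in the chain */

typedef struct record record;  /* an opaque log record */

typedef struct {
    record **slot;             /* fixed-size circular array of records */
    int size, head, tail;      /* both indices wrap modulo size */
} generation;

typedef struct {
    generation gen[NGEN];      /* gen[0] youngest, gen[NGEN-1] oldest */
} log_chain;

/* Hypothetical predicate: must this record be retained for recovery? */
extern bool must_keep(const record *r);

/* Garbage collection at the head of generation i: discard a garbage
 * record, forward a non-garbage record to generation i+1, or
 * recirculate it within the last generation. */
static void collect_head(log_chain *lg, int i)
{
    generation *g = &lg->gen[i];
    record *r = g->slot[g->head];
    g->head = (g->head + 1) % g->size;        /* space is reclaimed */
    if (must_keep(r)) {
        generation *dst = (i < NGEN - 1) ? &lg->gen[i + 1] : g;
        dst->slot[dst->tail] = r;             /* forward or recirculate */
        dst->tail = (dst->tail + 1) % dst->size;
    }
}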
A stable version of the database resides elsewhere on disk. It does not necessarily incorporate the most recent changes to the database, but the log contains sufficient information to restore it to the most recent consistent state if a crash were to occur. The arrows underneath each generation illustrate the flushing activity that occurs in parallel with logging, and indicate that the log records whose updated objects are flushed may, in general, be anywhere in the log.

Figure 2.3: Disk Space Management Using XEL

This segmentation of the log is particularly effective if a large proportion of transactions finish execution and have their updates flushed before their log records near the head of generation 0. Many, if not all, of these records become garbage before the LM must decide whether or not to forward them and so the LM does not forward them to generation 1; their disk space can quickly be reclaimed for more incoming log records. Only a small proportion of log records, mostly from transactions with longer lives, are forwarded to subsequent generations.

Recirculation in the last generation means that the physical order of its records no longer necessarily corresponds to the temporal order in which they were originally generated. The LM includes timestamps in data log records to enable the RM to establish the temporal order of the records.

2.3 Types and Statuses of Log Records

The previous section described the segmentation of a log stream into a chain of fixed-size FIFO queues and mentioned that the LM performs garbage collection at the head of each queue. This section will explain the basis upon which the LM decides whether or not a particular log record is garbage. There are several types of log records. For each type of log record, a record may be in any one of several possible states at any given time. In response to ongoing activity in the database, the state of a record may change over time. The LM's decision about whether to retain or throw away a log record depends on the record's state.

XEL performs physical state logging on the access path level, according to the taxonomy of [30]. In short, XEL performs redo logging with lazy logging of undo records. It adheres to the write ahead log (WAL) protocol [5]: the disk version of the database cannot be modified before the LM has written a log record to disk which describes the modification.

There are two types of log records. Data log records (DLRs) chronicle changes to the contents of the database (creation, modification or deletion of objects). Transaction (tx) log records (TLRs) mark important milestones (e.g., begin, commit or abort) during the lives of transactions. Apropos TLRs, XEL logs only commit events; it does not bother to log even the commit of a transaction that did not update any objects in the database. When a transaction (that updated at least one object) successfully terminates, the LM adds a COMMIT record to the log to mark the occasion. The COMMIT record holds only the transaction's identifier. Previous logging and recovery methods also logged transactions' begin and abort events. XEL can incorporate these other types of TLRs, but they are superfluous. These anachronistic TLR types played an important role in previous recovery algorithms but are no longer relevant for XEL's recovery algorithm. Figure 2.4 illustrates the noteworthy events in the lifetime of a typical transaction.
The transaction begins, updates several objects and then requests to commit. Whenever the transaction updates an object, the LM writes a DLR to the log. The transaction can continue executing without needing to wait for the DLR to be written to disk. If the transaction eventually requests to commit, the LM generates a COMMIT record for it and writes the record to the log. After all the transaction's log records have been written to disk, the LM acknowledges the transaction's request to commit, thus bringing its course of execution to a close.

Figure 2.4: Typical Events During the Lifetime of a Transaction (tx begins; tx updates an object; tx updates another object; DLR; DLR; COMMIT; acknowledge commit)

There are two varieties of DLRs: REDO and UNDO DLRs. The LM generates a REDO DLR whenever a transaction modifies an object in the database. Each REDO DLR contains the following four pieces of information:

oid: identifier for the affected object
txid: identifier for the transaction that performed the update
timestamp: indication of when the update occurred
new-value: new value of the object

If the CM wants to flush an uncommitted update out to the disk version of the database, it must first inform the LM of its intentions and obtain permission from the LM. In response to such a request, the LM generates an UNDO DLR with the following pieces of information:

oid: identifier for the affected object
txid: identifier for the transaction that performed the update
timestamp: indication of when the update occurred
old-value: old value of the object (prior to start of transaction)

The LM grants permission to the CM to flush the uncommitted update only after the UNDO DLR has been written to disk. It is unnecessary for the LM to generate more than one UNDO DLR for a particular object and a particular transaction. The LM must write a REDO DLR to the log for every update, but it is expected that only a very few updates, mostly from exceptionally lengthy transactions that modify a large number of objects, will trigger UNDO DLRs as well. Therefore, UNDO DLRs ought to be quite rare, in general.

Each log record must be in a particular state at any given time. The LM maintains data structures in main memory that track the state of all log records; a record's state information is not kept in the log record itself. Figures 2.5, 2.6 and 2.7 graphically summarize the states and transitions for REDO DLRs, UNDO DLRs and COMMIT TLRs, respectively. Subsequent paragraphs will explain these state transition diagrams in detail.

Figure 2.5: State Transition Diagram for a REDO DLR. Transition events: 1) transaction updates same object again; 2) commit of more recent update to same object; 3) commit of more recent update to same object; 4) flush completed, older recoverable DLR still exists; 5) flush completed, no older recoverable DLR exists; 6) last older recoverable DLR becomes non-recoverable; 7) transaction's COMMIT record is overwritten.

Figure 2.6: State Transition Diagram for an UNDO DLR. Transition events: 1) transaction commits; 2) aborted update undone by cache manager.

A REDO DLR may have one of four different status values: unflushed, required, recoverable and non-recoverable.
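For concreteness, these record layouts might be rendered in C as follows. The type names and the fixed-size value fields are illustrative assumptions; a real implementation would presumably accommodate variable-length values.

#define VALUE_SIZE 8            /* assumed fixed object size, for illustration */

typedef long oid_t;             /* object identifier */
typedef long txid_t;            /* transaction identifier */

typedef struct {                /* REDO DLR */
    oid_t  oid;                 /* identifier for the affected object */
    txid_t txid;                /* transaction that performed the update */
    long   timestamp;           /* indication of when the update occurred */
    char   new_value[VALUE_SIZE];  /* new value of the object */
} redo_dlr;

typedef struct {                /* UNDO DLR */
    oid_t  oid;
    txid_t txid;
    long   timestamp;
    char   old_value[VALUE_SIZE];  /* value prior to start of transaction */
} undo_dlr;

typedef struct {                /* COMMIT TLR: only the transaction's identifier */
    txid_t txid;
} commit_tlr;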
If a REDO DLR holds a more recent value for its associated object than does the disk version of the database, the REDO DLR corresponds to the most recent modification to its associated object by the transaction that performed the update² and there is no REDO DLR in the log for a more recently committed update to the same object, then the DLR must have a status of unflushed. A required DLR must be retained in the log in order to ensure correct recovery after a crash. A REDO DLR with a status of recoverable can be recovered by the RM after a crash (because the COMMIT TLR from the corresponding transaction is also still in the log on disk), but is not required for correct recovery. A REDO DLR whose status is non-recoverable cannot be recovered after a crash because the COMMIT TLR from its transaction has already been overwritten on disk by more recent log records.

An UNDO DLR can have one of three status values: required, annulled and recoverable. An UNDO DLR initially has status required when the LM creates it. It remains required until the transaction that wrote it commits, at which time it becomes annulled; the UNDO DLR remains annulled until it is overwritten on disk. If a transaction writes an UNDO DLR and later aborts, the UNDO DLR retains its required status until the CM restores the corresponding object in the disk version of the database to the value that it held prior to the aborted transaction's update. Note that the CM need not immediately write out the object's original value to the disk version of the database; it can buffer the undo operation until a convenient opportunity, so as to achieve better disk I/O. After the CM has undone the aborted update (by restoring the object's representation in the disk version of the database to its original value), the LM changes the status of the corresponding UNDO DLR to recoverable. A recoverable UNDO DLR remains recoverable until it is eventually overwritten on disk.

There are two status values for a COMMIT TLR: required and recoverable. A COMMIT TLR has status required if at least one UNDO DLR (which may have a status of either required or annulled) that was written by the transaction still exists or if at least one REDO DLR that was written by the transaction has a status of unflushed or required; otherwise (i.e., no UNDO DLRs and any remaining REDO DLRs have recoverable status), the TLR has status recoverable.

Figure 2.7: State Transition Diagram for a COMMIT TLR. Transition event: 1) no UNDO DLRs and only recoverable REDO DLRs left.

²If a transaction modifies a particular object more than once, then the REDO DLRs for all updates except the most recent one have status recoverable.
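These status sets transcribe directly into C enumerations. The sketch below merely restates the state sets of Figures 2.5 through 2.7; the enumerator names are assumptions introduced here.

typedef enum {          /* REDO DLR statuses (Figure 2.5) */
    REDO_UNFLUSHED,
    REDO_REQUIRED,
    REDO_RECOVERABLE,
    REDO_NON_RECOVERABLE
} redo_status;

typedef enum {          /* UNDO DLR statuses (Figure 2.6) */
    UNDO_REQUIRED,      /* until the writing transaction commits */
    UNDO_ANNULLED,      /* after commit, until overwritten on disk */
    UNDO_RECOVERABLE    /* after an aborted update has been undone */
} undo_status;

typedef enum {          /* COMMIT TLR statuses (Figure 2.7) */
    COMMIT_REQUIRED,
    COMMIT_RECOVERABLE
} commit_status;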
Whenever an executing transaction updates an object, it writes a REDO DLR to the log. This DLR has unflushed status. The LM downgrades the status of any REDO DLRs from earlier updates to the same object by the same transaction to recoverable. If the transaction eventually commits, its COMMIT TLR has status required. Suppose that transaction t modifies an object x and commits. At the time that t commits, the LM checks if there is an unflushed or required REDO DLR for object x from an update by some earlier transaction. If such a DLR exists, the LM downgrades that DLR's status to recoverable.

The CM may flush an updated object to the disk version of the database whenever it chooses. In general, the CM attempts to flush all a transaction's updates after it has committed, so that no UNDO DLRs are needed; nevertheless, the CM may occasionally need to flush some of a transaction's updates to disk before the transaction commits. As soon as the CM has flushed an update to some object, the LM downgrades the status of any unflushed REDO DLR from an earlier transaction. The LM assigns a status of required to an earlier REDO DLR if it corresponds to the most recently committed update to the object and must be retained because of lingering recoverable DLRs from earlier updates to the same object; otherwise, it assigns a status of only recoverable to the earlier REDO DLR. After processing any unflushed previous DLR for the object, the LM then processes the REDO DLR for the update that was just flushed. If this DLR still has unflushed status at the time the flush operation completes³ and there exists a required or recoverable DLR from an earlier update to the object, then the LM downgrades the DLR's status to required; otherwise, the LM assigns a status of only recoverable to the DLR whose update was just flushed. The LM downgrades a required REDO DLR's status to recoverable after there is no longer a required or recoverable (REDO or UNDO) DLR from any earlier update to the corresponding object.

³A REDO DLR may have a status of unflushed at the time that the CM decides to initiate a flush operation for it, but the DLR may be rendered recoverable by a more recently committed update before the flush operation completes.

The LM downgrades a transaction's COMMIT TLR to status recoverable as soon as there is no longer any unflushed or required REDO DLR nor any UNDO DLR remaining from the transaction. When a transaction's COMMIT TLR is eventually overwritten on disk, the LM changes the status to non-recoverable for any remaining REDO DLRs that the transaction wrote. Note that this may trigger status changes for REDO DLRs and COMMIT TLRs from more recent transactions.

The following pseudocode expresses how the LM manages the states of records in the log.

create_new_record(log_record) {
    if (type of log_record is REDO DLR) {
        status of new record <- unflushed
        if (log_record's transaction previously updated same object) {
            status of REDO DLR from previous update <- recoverable
        }
    } else {
        status of new record <- required
    }
}

commit_transaction(txid) {
    for every object updated by transaction txid {
        if (unflushed or required REDO DLR remains from earlier transaction) {
            status of REDO DLR from earlier transaction <- recoverable
        }
        if (an UNDO DLR for the object was written out to the log) {
            status of UNDO DLR <- annulled
        }
    }
}

abort_transaction(txid) {
    for every object updated by transaction txid {
        for every update to the object {
            status of REDO DLR from the update <- non-recoverable
        }
        if (uncommitted update was flushed to disk version of database) {
            retrieve UNDO DLR from log
            request CM to restore object in disk version of database to original value
        }
    }
}

change_redo_to_recov(log_record) {
    status of log_record <- recoverable
    txid <- txid of transaction that wrote log_record
    if ((no more unflushed, required or annulled DLRs from txid)
            AND (txid has committed)) {
        status of COMMIT record from txid <- recoverable
    }
}

aborted_update_undone(object_id, txid) {
    status of UNDO DLR for update to object_id by txid <- recoverable
}

update_written_to_disk_version_of_db(object_id, txid, timestamp) {
    for every REDO DLR r from a previous update that has status unflushed {
        if ((r is for most recently committed update to object_id)
                AND (older recoverable DLRs for object_id still exist)) {
            status of r <- required
        } else {
            status of r <- recoverable
        }
    }
    if (status of DLR from flushed update is still unflushed) {
        if (recoverable DLRs from previous updates are in log) {
            status of REDO DLR for flushed update <- required
        } else {
            change_redo_to_recov(REDO DLR for flushed update)
        }
    }
}

record_erased(log_record) {
    case (type of log_record) {
        REDO DLR:
            if (status of oldest surviving REDO DLR for the object is required) {
                change_redo_to_recov(oldest surviving REDO DLR for object)
            }
        UNDO DLR:
            if (no unflushed, required or annulled DLRs from log_record's tx) {
                status of COMMIT record for log_record's transaction <- recoverable
            }
        COMMIT:
            for every remaining recoverable REDO DLR from log_record's tx {
                status of REDO DLR <- non-recoverable
                if (status of oldest surviving REDO DLR for the object is required) {
                    change_redo_to_recov(oldest surviving REDO DLR for object)
                }
            }
    }
}

2.4 Management of Log Records

The previous section explained the basis upon which the LM decides whether or not to keep a log record. This section introduces the data structures which enable the LM to make this decision.

The LM keeps a pointer to every noteworthy log record. A record's pointer indicates its location in the log and current status. The LM uses this information when it must decide a log record's fate. It is convenient to broadly classify log records as relevant or irrelevant. All records that can affect the recovered state of the database are collectively referred to as relevant log records; unflushed, required and recoverable REDO DLRs and all UNDO DLRs are relevant log records, as are all required COMMIT TLRs and recoverable COMMIT TLRs for which some corresponding recoverable REDO DLRs still remain. All other log records (non-recoverable DLRs and every recoverable COMMIT record for which no corresponding REDO DLRs remain) are irrelevant. The LM must keep track of the positions of all relevant records. It does not bother to keep track of irrelevant records.

A cell exists for every relevant record in any generation of the log. Each cell resides in main memory and points to the record's location on disk.
A record's location on disk, as pointed to by its cell, is indicated by an identifier of the block to which it belongs; finer granularity (e.g., position within the block) is not required by XEL. The cells corresponding to each generation are joined in a doubly linked list that "wraps around" in a circular manner; the cells nearest the head and tail have right and left pointers to each other, respectively. For generation i, pointer hi points to the cell for the relevant record nearest the head. There is no tail pointer for a generation, but the cell for the relevant record nearest to the tail can be found quickly by following the right pointer of the cell pointed to by hi.

The logged object table (LOT) has an entry for every object that has at least one relevant DLR somewhere in the log. An object's LOT entry keeps track of the positions within the log of its relevant DLRs. Cells for an object's relevant DLRs are accessible via its LOT entry. Likewise, the logged transaction table (LTT) has an entry for every transaction that has updated at least one object. A transaction's LTT entry keeps track of all objects that it updated and for which the corresponding REDO DLRs are still relevant. After a transaction commits, its LTT entry points to the cell that corresponds to its COMMIT TLR. The LM continually updates the LOT and LTT to reflect the current state of the system as transactions and log records come and go. At any given time, the cells associated with the LOT and LTT entries point to all relevant log records. Although cells belong to these two different tables, they may nonetheless simultaneously belong to the same doubly linked list.

An example of XEL with N=3 generations is shown in Figure 2.8. To relate Figure 2.8 to Figure 2.3, note that all irrelevant records are garbage records. Some relevant log records can affect recovery but are not required for recovery, and so they are also garbage records; they can be thrown away with impunity when convenient.

Figure 2.8: Data Structures for XEL

Figure 2.8 illustrates the most important aspects of XEL's data structures. The LOT and LTT, with their constituent cells, reside in main memory. Other internal details of the LOT and LTT have been omitted; the circular doubly linked lists of cells are the important aspect of the LOT and LTT in this figure. Each cell has a status field that indicates the status of its corresponding log record. At any given time, the LM can determine whether a record is non-garbage by checking the status field in the record's cell. When a record must be forwarded to the tail of generation i+1, the LM writes its contents to disk at the tail of generation i+1, updates its cell, c, to point to its new position in the log and transfers c from the circular linked list for generation i to the circular linked list for generation i+1. The LM updates pointer hi to point to the cell previously to the left of c, if such a cell exists for generation i; otherwise, it sets hi to NULL. If hi+1 was NULL immediately before the record was forwarded, then the LM updates it to point to c (and c's left and right pointers point to itself). Recirculation in the last generation is handled similarly.
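The cell and its per-generation circular list might be declared as follows in C. The exact layout is an assumption for illustration; in particular, the integer block identifier and the h[] array stand in for whatever bookkeeping a real LM would use.

#define NGEN 3                      /* assumed number of generations */

typedef struct cell {
    struct cell *left, *right;      /* circular doubly linked list links */
    int  block_id;                  /* disk block that holds the log record */
    int  status;                    /* current status of the log record */
    long tstamp;                    /* timestamp from the DLR, if any */
} cell;

/* h[i] points to the cell for the relevant record nearest the head of
 * generation i, or is NULL if the generation holds no relevant records. */
cell *h[NGEN];

/* The cell nearest the tail is found by following one right pointer. */
static cell *tail_cell(int i)
{
    return h[i] ? h[i]->right : NULL;
}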
2.5 Buffer Management for Disk Version of the Database

The LM relies on the CM to flush updated objects to the disk version of the database so that log records from these updates will become garbage. This section briefly discusses how the CM ought to schedule flush operations and elaborates on how the CM interacts with the LM.

In general, there is negligible locality of access between the updates of independent transactions. Flushing updates in the order that they are written to the log would lead to random disk I/O for the disk version of the database. Instead, the CM maintains a pool of objects waiting to have their committed updates flushed and schedules writes to disk so that it can take advantage of locality in the disk version of the database and thus improve I/O performance. Ideally, there should usually be a significantly large number of committed updates from which the CM can choose the next object to be flushed; too small a "pool" of updates leads to random I/O. Flushing can proceed continuously at as high a rate as possible.

Occasionally, the CM may need to flush uncommitted updates out to the disk version of the database (because its buffer pool is running dangerously low on free space, for example). In such an emergency situation, the CM must first obtain permission from the LM, as described in Section 2.3. After the LM has written the necessary UNDO DLRs to disk and granted the CM permission to flush some uncommitted updates to disk, the CM can schedule the writes to disk so as to exploit locality.

In the rare event that a transaction aborts after writing one or more UNDO DLRs, the CM must undo the transaction's updates that have already been propagated to the disk version of the database. The LM reads (from disk) every UNDO DLR that was written by the aborted transaction. For each such UNDO DLR, the LM communicates the oid and old value to the CM. In response, the CM restores the object (in the disk version of the database) to the value that it had prior to the aborted transaction's update and informs the LM after it has undone the transaction's modification to the object.
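One plausible way to realize the CM's locality-seeking flush pool is to keep pending flushes sorted by disk address and sweep across them in order. The C sketch below illustrates the idea; the pending_flush layout, its disk_addr field and the comparator are assumptions for illustration, not the thesis's CM.

#include <stdlib.h>

typedef struct {
    long disk_addr;  /* object's location in the disk version of the DB */
    long oid;        /* object waiting to have its update flushed */
} pending_flush;

/* Comparator for qsort: order pending flushes by disk address so that
 * a sweep across the pool produces mostly sequential I/O. */
static int by_disk_addr(const void *a, const void *b)
{
    const pending_flush *x = a, *y = b;
    return (x->disk_addr > y->disk_addr) - (x->disk_addr < y->disk_addr);
}

static void schedule_flushes(pending_flush *pool, size_t n)
{
    qsort(pool, n, sizeof pool[0], by_disk_addr);
    /* the CM would now flush pool[0..n-1] in this order */
}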
2.6 Flow Control

This section discusses how the LM regulates the flow of log records from one generation to the next in a log stream. Each generation is a FIFO queue of fixed size. If it begins to run out of space for new records, the LM must try to free up some space by throwing away or forwarding (or recirculating) log records from near the head of the queue. The LM can forward records from one generation only if the next generation is able to accept them. Hence, flow control between successive generations within a log stream must be regulated.

The LM attempts to keep at least Nfree blocks available in each generation to accept incoming (and recirculated, in the case of the oldest generation) log records. When the tail of generation i advances so as to violate this "low water mark", the LM attempts to forward (or recirculate) log records from generation i. However, the LM can forward log records to generation i+1 only if generation i+1 is able to accept them. The LM refuses to overwrite any block that holds an unflushed or required log record. If hi+1 points to a record in the block immediately after the current tail position of generation i+1, then generation i+1 cannot accept any forwarded log records. Similarly, the LM refuses to accept any new log records from client transactions if space is not available for them in generation 0. When space eventually becomes available in generation i+1, the LM will resume forwarding of records from generation i if there are fewer than Nfree blocks between the current tail position of generation i and the block to which hi indirectly points (hi points to a cell, and this cell points to a block position on disk).

This flow control policy is guaranteed to be free of deadlock. The LM never needs to keep a log record because of the lingering presence of some chronologically subsequent log record. This property ensures that the dependency graph amongst log records is acyclic, and therefore deadlock is impossible.

In summary, a producer-consumer protocol between adjacent generations regulates the flow of forwarded log records. Older generations that become full exert "backpressure" on younger generations. If all generations become full, then the LM does not accept log records from client transactions. This policy ensures that necessary log information is never lost.
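The backpressure rule can be expressed as a pair of small checks, sketched here in C. All three helper functions are hypothetical stand-ins for the LM's real bookkeeping, and the handling of the last generation (which recirculates rather than forwards) is omitted.

#include <stdbool.h>

#define NFREE 2   /* assumed low water mark, in blocks */

/* Hypothetical helpers standing in for the LM's bookkeeping. */
extern int  free_blocks_in(int i);         /* blocks between tail and oldest relevant record */
extern bool head_block_abuts_tail(int i);  /* does h(i) point into the block after the tail? */
extern void forward_head_block(int i);     /* forward one head block to generation i+1 */

/* Generation i can accept records unless its oldest relevant record
 * occupies the block immediately after its current tail. */
static bool can_accept(int i)
{
    return !head_block_abuts_tail(i);
}

/* Violating the low water mark triggers forwarding, but only while the
 * next generation exerts no backpressure. */
static void maybe_forward(int i)
{
    while (free_blocks_in(i) < NFREE && can_accept(i + 1))
        forward_head_block(i);
}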
2.7 Buffering, Forwarding and Recirculation

The previous section explained how the LM regulates the flow of records into and out of each generation of a log stream. This section elaborates on some important timing details which govern this movement of log records. Much of the complexity arises from the characteristics and limitations of current disk drive technology.

Two characteristics of current disk technology exert an important influence on the implementation of XEL. Information is written to disk in fixed-size blocks (with each block typically some multiple of 1024 bytes). Sequential disk I/O is faster than random disk I/O. XEL must accommodate the constraint of fixed-size disk blocks, and ought to take advantage of the performance benefits of sequential I/O.

The LM uses the group commit technique [15, 5]. Records are collected in a buffer and written to disk all at once. Figure 2.9 illustrates the group commit technique. The bottom of the buffer holds log records that have already arrived. The LM adds new incoming log records to the buffer by putting them in the unfilled portion shown at the top of the buffer. In Figure 2.9, the direction of growth is upward.

Figure 2.9: Buffering of Incoming Log Records for Batched Write to Disk

The LM should dedicate at least two buffers for log records that are to be written to generation 0 because a disk write generally requires a significant amount of time, such as 10 ms, during which other log records may arrive. While one buffer is being written to disk, the LM can add new records to a different buffer without risk of interference. The size of each buffer in the pool is exactly equal to the size of a disk block. At any given time, there is a current buffer for generation 0. The LM adds new log records to this buffer until it is full or a time limit runs out, at which time the LM writes it to disk; another buffer in the pool becomes the current buffer as soon as there are no unflushed or required records in the block to which the new current buffer will be written. Therefore, log records are not immediately written to disk. There is a delay while the current buffer fills, and some extra delay for the disk I/O.

Only one buffer is needed for each generation i>0 because the LM has the liberty of scheduling the movement of log records between generations. There can be only one outstanding write to the tail of a particular generation and the LM can quickly refill generation i's buffer as soon as the current write operation completes, so additional buffers would not help anyway.

The tail of generation i points to the location of a block on disk; finer granularity is unnecessary. When a new log record comes in to generation i, the LM will attempt to allocate it to the block indicated by the current position of the tail; if there is insufficient room remaining in the current buffer to accommodate the record, then the LM will attempt to advance the tail block position and allocate the record to the new tail block. The head of generation i is the block to which hi (indirectly) points, if hi is not NULL. If hi is NULL, then the head of generation i is, by default, the current tail block.

The movement of head and tail pointers in block-sized quanta has implications. When the LM decides to advance the head of generation i, it must deal with all log records in the head block. The LM attempts to forward all unflushed and required records in this head block to generation i+1. Suppose that the LM can forward these records to generation i+1. It adds the records to the buffer for generation i+1. In general, they are insufficient to completely fill the buffer, but the LM must ensure that the forwarded records are soon written to disk in generation i+1. Therefore, it attempts to fill the buffer as full as possible before writing it. After forwarding records from the block at the head of generation i, the LM works backward (i.e., to the left) from the head to gather enough other non-garbage log records to fill the buffer that is destined for the tail of generation i+1. In summary, the requirements of generation i dictate that records be removed from its head in quanta of size at least a block. The requirements associated with forwarding records to the tail of generation i+1 imply that records are usually forwarded as a group from the first several blocks at the head of generation i.

There are two details that complicate the operation of forwarding records from generation i to generation i+1. First, there is some delay between the time when the LM decides to forward a log record to the moment when the forwarded record is actually on disk in generation i+1. Second, two copies of a forwarded record may temporarily exist in the log. The forwarded copy of the record resides on disk in generation i+1, but there is also the "stale" copy left behind in generation i; this latter copy remains in the log until it is overwritten by newer log records. Because of these details, the LM manipulates two additional special pointers for each generation.

For each generation i, si is the scan pointer. Like hi, it points to a cell in the circular doubly linked list for generation i or is NULL. It indicates how far the LM has scanned to forward log records to generation i+1. If there is no forwarding operation in progress, then si coincides with hi. When the LM examines a log record and decides to forward it, it leaves the cell in generation i's list and advances si to the left; if this leftward movement causes si to "wrap around" so that it comes back to hi, then the LM sets it to NULL instead.
Immediately after a buffer of forwarded records has been written to disk in generation i+1, the LM keeps advancing hi leftward until it coincides with si or becomes NULL; until hi "catches up" to si, the LM transfers each cell over which it passes from generation i's circular list of cells to the tail of generation i+1's list. This cautious management of the hi and si pointers ensures that the LM never inadvertently overwrites a non-garbage log record in generation i before it has been forwarded to generation i+1 (and is on disk in generation i+1).

The LM maintains a second circular doubly linked list of cells for every generation. This other list is called the doomed list because it indicates relevant records that will (soon) be overwritten. The pointer di points to the cell at the head of generation i's doomed list. When the LM examines a log record's cell and decides that the record is garbage, it removes the cell from the regular list (after advancing si, of course, and possibly also hi if necessary) and adds it to the tail of the doomed list. If di was previously NULL, it will point to the cell that has just been transferred to the doomed list; otherwise, the transferred cell is the target of the right pointer of the cell to which di points. Now consider the case that the LM decides the log record to which si points is non-garbage. It creates a new cell that also points to the record and leaves the old cell intact in the regular list for generation i (where it waits to be forwarded to generation i+1 after the buffer has been written to disk in that generation). If the log record is a DLR, the LM adds the new cell to the LOT entry of the associated object; otherwise, it adds the new cell to the LTT entry of the corresponding transaction. Finally, the LM inserts this new cell into the doomed list at its tail.

Whenever a block of log records is written to disk in generation i, the LM examines di. If di (indirectly) points to the block that has just been written, then the LM concludes that the record associated with the cell to which di points is now gone and so it moves di leftward (or assigns di the value NULL, if appropriate) and deletes the cell to which di had pointed. The LM continues to advance di leftward until it becomes NULL or no longer (indirectly) points to the block that has just been written. The di pointer ensures that the LM does not forget about any stale copy of a relevant log record.

Recirculation is not as complicated. The LM recirculates records from only the block at the head of the last generation and places them in a buffer without immediately writing it to disk. The existing copies of these records will not be overwritten before the tail has advanced, but the recirculated copies will belong to the disk block written at the tail. There is no need for a scan pointer in the last generation; the LM immediately transfers the cells for the recirculated records from head to tail in the circular list (in practice, this is accomplished simply by advancing hi to the left). Similar to the case of forwarding, the LM transfers the cells for garbage log records to the doomed list and eventually deletes the cells after the log records have actually been overwritten. As new log records come in to the last generation, the LM adds them to the buffer after the recirculated records.
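The di bookkeeping after a block write might look as follows in C. This is a sketch under stated assumptions: the cell layout matches the earlier sketch, d[i] heads generation i's doomed list, and record_erased() stands in for the status routine of Section 2.3.

#include <stdlib.h>

typedef struct cell {                /* as sketched in Section 2.4 */
    struct cell *left, *right;
    int block_id;
} cell;

extern cell *d[];                    /* d[i]: head of generation i's doomed list */
extern void record_erased(cell *c);  /* stand-in for the routine of Section 2.3 */

/* After the block 'block_id' has been overwritten in generation i,
 * retire every doomed cell whose stale copy lived in that block. */
static void advance_doomed(int i, int block_id)
{
    while (d[i] != NULL && d[i]->block_id == block_id) {
        cell *victim = d[i];
        record_erased(victim);               /* may trigger status changes */
        if (victim->left == victim) {        /* last cell on the doomed list */
            d[i] = NULL;
        } else {
            d[i] = victim->left;             /* advance leftward */
            victim->left->right = victim->right;  /* unlink victim */
            victim->right->left = victim->left;
        }
        free(victim);
    }
}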
If the LM ever decides to delete the cell to which hi points, it must first adjust hi accordingly. If there are other cells in generation i's linked list, then the LM advances hi to the left; otherwise, it assigns NULL to hi. The LM behaves similarly if it deletes the cell to which si or di points.

In summary, the LM collects records in a buffer before writing them to any generation. It attempts to fill a buffer as full as possible before writing it to disk. When the LM decides to forward a log record, it does not transfer the record's cell from the circular list of generation i to the list of generation i+1 until after it is certain that the record is on disk in generation i+1. The LM keeps track of the positions of all copies of all relevant log records until they are actually overwritten on disk (or until they become irrelevant, if this happens first).

The following pseudocode routines succinctly state the LM's algorithms for forwarding and recirculating log records.

add_cell_to_generation(cell, i) {
    if (hi == NULL) {
        hi <- cell
        si <- cell
        cell->left <- cell
        cell->right <- cell
    } else {
        if (si == NULL) {
            si <- cell
        }
        cell->left <- hi
        cell->right <- hi->right
        cell->left->right <- cell
        cell->right->left <- cell
    }
}

delete_cell_from_generation(cell, i) {
    if (si == cell) {
        if (si->left == hi) {
            si <- NULL
        } else {
            si <- si->left
        }
    }
    if (hi == cell) {
        if (hi->left == hi) {
            hi <- NULL
        } else {
            hi <- hi->left
        }
    }
    if (di == cell) {
        if (di->left == di) {
            di <- NULL
        } else {
            di <- di->left
        }
    }
    if (cell->left != cell) {
        cell->left->right <- cell->right
        cell->right->left <- cell->left
    }
}

add_cell_to_doomed_list(cell, i) {
    if (di == NULL) {
        di <- cell
        cell->left <- cell
        cell->right <- cell
    } else {
        cell->left <- di
        cell->right <- di->right
        cell->left->right <- cell
        cell->right->left <- cell
    }
}

forward_records_from_generation(i) {
    while ((generation i+1 can accept records) AND (si != NULL)) {
        if (record pointed to by si must be kept) {
            copy record pointed to via si to buffer for generation i+1
            create new_cell
            copy contents of cell pointed to by si into new_cell
            add_cell_to_doomed_list(new_cell, i)
            if (si->left == hi) {
                si <- NULL
            } else {
                si <- si->left
            }
        } else {
            cell <- si
            delete_cell_from_generation(cell, i)
            add_cell_to_doomed_list(cell, i)
        }
    }
}

recirculate_records_within_generation(i) {
    while (fewer than Nfree blocks available in generation i) {
        if (record pointed to by hi must be kept) {
            copy record pointed to via hi to buffer for generation i
            hi <- hi->left
        } else {
            cell <- hi
            delete_cell_from_generation(cell, i)
            add_cell_to_doomed_list(cell, i)
        }
    }
}

buffer_written_to_generation(i) {
    if (i >= 1) {
        while ((h(i-1) != NULL) AND (h(i-1) != s(i-1))) {
            cell <- h(i-1)
            delete_cell_from_generation(cell, i-1)
            add_cell_to_generation(cell, i)
        }
    }
    while ((di != NULL) AND (di points to block position just overwritten)) {
        record_erased(record to which di points)
        if (di->left == di) {
            di <- NULL
        } else {
            di <- di->left
        }
    }
}

2.8 Management of the LOT and LTT

Section 2.4 introduced the LOT and LTT when it described the doubly linked lists of cells that track the positions of all relevant log records in the generations of a log stream. This section provides more details about the LOT and LTT and describes how the LM manages these data structures.

The LOT and LTT keep track of all relevant log records. The LM updates them on a continual basis as records enter the log and progress through it. The LM associatively accesses each object's LOT entry by using its object identifier (oid) as a key. A hash table implementation is therefore appropriate. The dynamic nature of the LOT strongly suggests that chaining [10] (rather than open addressing) is the most suitable technique for collision resolution.
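A minimal sketch of LOT lookup with chaining, in C, might look as follows; the bucket count, the trivial modulo hash and the entry layout are illustrative assumptions.

#define LOT_BUCKETS 1024             /* assumed table size */

struct cell;                         /* cells as sketched in Section 2.4 */

typedef struct lot_entry {
    long oid;                        /* key: object identifier */
    long timestamp;                  /* per-object counter (see below) */
    struct cell *cells;              /* linked list of cells for relevant DLRs */
    struct lot_entry *next;          /* chain link for collision resolution */
} lot_entry;

static lot_entry *lot[LOT_BUCKETS];

/* Associative access by oid: hash to a bucket, then walk the chain. */
static lot_entry *lot_lookup(long oid)
{
    for (lot_entry *e = lot[(unsigned long)oid % LOT_BUCKETS]; e; e = e->next)
        if (e->oid == oid)
            return e;
    return NULL;   /* object has no relevant DLRs in the log */
}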
An object's LOT entry has one or more cells, each of which points to the disk block of a relevant DLR for the object. The LM manages these cells as a linked list. Entries in the LTT are associatively accessed using transaction identifiers (tids) as keys. Like the LOT, the LTT is implemented as a hash table with chaining for collision resolution. Each transaction's LTT entry holds a set obj_ids of oids to keep track of which objects were updated by the transaction; this set is initially empty and grows as the transaction progresses and performs work.

The LM maintains a timestamp in each object's LOT entry, although no timestamp is necessarily stored with any object in the disk version of the database. A simple integer-valued counter suffices for the timestamp. When the LM creates a new LOT entry for an object, it initializes the timestamp to 0. Whenever a transaction updates the object, the LM increments the timestamp and then puts the new timestamp value in the resulting REDO DLR. An UNDO DLR for an object holds the current value of the timestamp in its LOT entry at the time that the UNDO DLR is created; the LM does not bother to increment the timestamp when it creates an UNDO DLR. The LM removes an object's LOT entry only after it has no more relevant DLRs remaining in the log (the LM detects this situation when the set of cells associated with the LOT entry becomes empty). Therefore, at any given time, all relevant REDO DLRs for an object have unique timestamps and these DLRs can be placed into chronological sequence by their timestamps. Likewise, all UNDO DLRs have timestamps that indicate their chronological ordering. The cell for each DLR has a tstamp field that stores the value of the timestamp contained in the DLR.

Whenever a transaction modifies an object in the database, it causes the LM to send a REDO DLR to the log. If an entry does not already exist for the object in the LOT, the LM creates one. The LM increments the timestamp in the object's LOT entry, formats the DLR, adds the DLR to the current buffer for the tail of generation 0, creates a cell to point to the DLR's position in the log, adds it to the set of cells maintained in the object's LOT entry, inserts the cell in the doubly linked list for generation 0, creates a new LTT entry for the transaction that performed the update if it did not already have one and then adds the object's oid to the obj_ids set in the transaction's LTT entry.
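Putting these steps together, the LM's work on each update could be sketched as below. Every helper is a hypothetical stand-in for one of the steps just enumerated, and the table shapes are those of the earlier sketches.

/* Hypothetical helpers standing in for the steps described in the text. */
struct cell; struct lot_entry; struct ltt_entry;

extern struct lot_entry *lot_find_or_create(long oid);
extern long next_timestamp(struct lot_entry *obj);     /* increments the counter */
extern struct cell *append_to_gen0_buffer(long oid, long txid, long ts);
extern void attach_cell_to_lot_entry(struct lot_entry *obj, struct cell *c);
extern void link_into_generation(struct cell *c, int gen);
extern struct ltt_entry *ltt_find_or_create(long txid);
extern void remember_updated_object(struct ltt_entry *tx, long oid);

static void lm_on_update(long oid, long txid)
{
    struct lot_entry *obj = lot_find_or_create(oid);        /* LOT entry, on demand */
    long ts = next_timestamp(obj);                          /* increment, then stamp */

    struct cell *c = append_to_gen0_buffer(oid, txid, ts);  /* DLR joins gen-0 buffer */
    attach_cell_to_lot_entry(obj, c);                       /* cell joins the LOT entry */
    link_into_generation(c, 0);                             /* and generation 0's list */

    remember_updated_object(ltt_find_or_create(txid), oid); /* obj_ids gains oid */
}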
Every transaction eventually commits or aborts. An abort is easy to handle. Because the log will never hold a COMMIT TLR from an aborted transaction, all the REDO DLRs from the transaction immediately become non-recoverable; the LM disposes the cells that pointed to these DLRs and deletes the transaction's LTT entry. However, any UNDO DLRs from the transaction retain their required status after it aborts until the CM has undone these updates to the disk version of the database, as described in Sections 2.3 and 2.5.

When a transaction (which updated at least one object) commits, the LM updates its LTT entry so that it points to the cell for its COMMIT record in the log. Then the LM processes the members of obj_ids in the transaction's LTT entry. For each oid in obj_ids, the LM retrieves the object's LOT entry, assigns a status of recoverable to any unflushed or required REDO DLR from an earlier committed update to the same object, assigns a status of annulled to any UNDO DLR from the transaction that just committed and informs the CM that the most recent update has now been committed. If the CM has not already flushed this most recent update to disk, then the CM enqueues it to be flushed.⁴

⁴If the CM already enqueued the object in response to an earlier update by another transaction but it has not yet flushed the object, then the object's oid remains unchanged in the set of objects waiting to be flushed.

The LTT entry for each transaction includes a counter to keep track of the number of UNDO DLRs and unflushed or required REDO DLRs that exist for the transaction. The LM initializes an LTT entry's counter to 0 and increments this counter every time the transaction writes another REDO or UNDO DLR to the log. The LM also increments this counter whenever it copies an existing UNDO DLR so that it can forward the DLR. The LM decrements a transaction's counter each time that it downgrades the status of one of the transaction's REDO DLRs from unflushed or required to recoverable, or each time that it overwrites an UNDO DLR from the transaction. When the counter reaches zero, the LM downgrades the transaction's COMMIT TLR to recoverable.

After a transaction commits, its obj_ids set can only shrink in size. Whenever the last copy of a relevant REDO DLR is overwritten and no other REDO DLRs from other updates to the same object by the same transaction remain, the LM removes the corresponding oid from the obj_ids set of the transaction that wrote the DLR. When the last copy of a transaction's COMMIT TLR is eventually overwritten, the LM examines its obj_ids set; for every object still represented in this set, the LM downgrades the status of all corresponding REDO DLRs to non-recoverable (and deletes the cells that pointed to these DLRs). Finally, the LM deletes the transaction's LTT entry. If the obj_ids set in a committed transaction's LTT entry becomes empty and no UNDO DLRs remain from the transaction (as indicated by a counter value of zero), all copies of the transaction's COMMIT record become irrelevant. The LM disposes the cells that point to them and removes the transaction's entry from the LTT.

To summarize, every object with relevant DLRs in the log has an entry in the LOT. An object's LOT entry keeps track of the positions within the log of its relevant DLRs. There is an LTT entry for every transaction currently in progress that has updated at least one object and every committed transaction that still has relevant DLRs. A transaction's LTT entry keeps track of all objects that it updated and the positions within the log of copies of its COMMIT record. The LM continually updates the LOT and LTT to reflect the current state of the system as transactions and log records come and go. At any given time, the cells associated with the LOT and LTT entries point to all relevant records in the log.

2.9 Crash Recovery

After a crash has happened, the database invokes the RM to restore the disk version of the database to a consistent state. The RM examines the records in the log and attempts to find the most recently committed value, if any, for each object that has one or more DLRs in the log. The RM propagates each object's most recently committed value, if any, to the disk version of the database, thus restoring the disk version of the database to a consistent state: it incorporates all the effects of transactions which committed prior to the crash and none of the effects of transactions which aborted or were interrupted.
The RM starts sequentially reading from the disk(s) where the log is stored. It does not need to begin at the tail of generation 0 (nor of any other generation). The RM processes the log in a single pass. This new recovery algorithm is suitable for systems in which the log is not larger than main memory. As each block is read from the log, the RM processes the DLRs and TLRs in the block. The Pending Object Table (POT) keeps track of all objects during the recovery process, while the Recovered Transaction Table (RTT) serves a similar purpose for transactions.

Each RTT entry belongs to a particular transaction. It holds two pieces of information about the transaction. The first is the status of the transaction. If the RM has found a COMMIT record for the transaction, it has a status of committed; otherwise, it has an unknown status. The RTT entry also contains a set of oids, called pending_objs. The contents of this set are meaningful only if the transaction's status is still unknown. Each member of pending_objs indicates an object for which a REDO DLR was already found in the log and this DLR was written by the transaction.

When the RM reads a REDO DLR from the log, it checks the POT entry for the associated object to see if a more recently committed REDO DLR or a more recent UNDO DLR has already been found (the timestamps within an object's REDO and UNDO DLRs indicate their relative temporal ordering). If not, it adds the new DLR to the POT. It also inspects the RTT to find out if the transaction that wrote the DLR is known to have committed. If the transaction did indeed commit, then the RM marks the new update as committed and deletes from the POT all earlier DLRs for the same object; otherwise, it leaves the update with a status of pending and adds the object's oid to the pending_objs set in the RTT entry of the corresponding transaction. If the RM reads an UNDO DLR from the log, it ignores it if a more recently committed REDO DLR or a more recent UNDO DLR has already been found for the associated object; otherwise, it adds the UNDO DLR to the object's POT entry and deletes all DLRs that were written by earlier transactions (as indicated by their txid and timestamp fields).

When the RM upgrades a transaction's status from unknown to committed (in response to the discovery of a COMMIT TLR), it processes each object represented in the pending_objs set kept with the transaction's RTT entry. For each such object, it marks the corresponding update in the POT (if it still exists) as committed and deletes all earlier updates to the object; it also deletes the object's oid from the pending_objs set.

After all log records have been processed, the RM restores the disk version of the database to the most recent consistent state that existed prior to the crash by examining each object's POT entry and taking appropriate action. If an object's POT entry holds an UNDO DLR and the transaction that wrote the record has a status of committed, then the RM does nothing further for the object. However, an UNDO DLR from an uncommitted transaction⁵ prompts the RM to propagate the object's value (indicated in the UNDO DLR) to the disk version of the database and thus undo the effect of the unsuccessful transaction. If an object's POT entry has a REDO DLR from a committed transaction, the RM flushes this updated value to the disk version of the database. The RM ignores any object whose POT entry holds neither an UNDO DLR from an uncommitted transaction nor a committed REDO DLR.

⁵The RM concludes that any transaction whose status is still unknown did not commit before the crash.
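The two recovery-time tables just described might be declared as follows in C; the layouts are illustrative assumptions that mirror the description above (in particular, the fixed MAX_PENDING bound exists purely for the sketch).

#include <stdbool.h>

#define MAX_PENDING 64              /* assumed bound, for illustration only */

typedef struct pot_dlr {            /* one DLR retained in a POT entry */
    long txid;
    long timestamp;
    bool is_undo;                   /* UNDO DLR rather than REDO DLR */
    bool committed;                 /* TRUE once its writer is known committed */
    struct pot_dlr *next;
} pot_dlr;

typedef struct {                    /* Pending Object Table entry, keyed by oid */
    long oid;
    pot_dlr *dlrs;                  /* candidate values found so far */
} pot_entry;

typedef struct {                    /* Recovered Transaction Table entry */
    long txid;
    bool committed;                 /* FALSE means status still unknown */
    long pending_objs[MAX_PENDING]; /* oids with pending REDO DLRs */
    int  n_pending;
} rtt_entry;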
The following pseudocode expresses the RM's algorithms.

should_keep_redo_dlr(redo_dlr) {
    pot_entry <- POT entry for object redo_dlr->oid
    redo_ts <- redo_dlr->timestamp
    if (pot_entry has a committed REDO DLR with timestamp > redo_ts) {
        return FALSE
    } else {
        if (pot_entry has an UNDO DLR with timestamp > redo_ts) {
            return FALSE
        } else {
            if (pot_entry has a REDO DLR with timestamp == redo_ts) {
                return FALSE
            } else {
                return TRUE
            }
        }
    }
}

known_to_have_committed(txid) {
    if (RTT entry of txid has committed status) {
        return TRUE
    } else {
        return FALSE
    }
}

recover_redo_dlr(redo_dlr) {
    if (object redo_dlr->oid has no POT entry) {
        create new POT entry for object redo_dlr->oid
    }
    if (should_keep_redo_dlr(redo_dlr)) {
        add redo_dlr to POT entry of redo_dlr->oid
        if (known_to_have_committed(redo_dlr->txid)) {
            mark redo_dlr as committed
            delete all DLRs with timestamps less than redo_dlr->timestamp
            delete any UNDO DLR with timestamp equal to redo_dlr->timestamp
        } else {
            add_to_pending_objs_in_rtt(redo_dlr->oid, redo_dlr->txid)
        }
    }
}

should_keep_undo_dlr(undo_dlr) {
    pot_entry <- POT entry for object undo_dlr->oid
    undo_ts <- undo_dlr->timestamp
    if (pot_entry has a committed REDO DLR with timestamp > undo_ts) {
        return FALSE
    } else {
        if (pot_entry has an UNDO DLR with timestamp > undo_ts) {
            return FALSE
        } else {
            return TRUE
        }
    }
}

recover_undo_dlr(undo_dlr) {
    if (object undo_dlr->oid has no POT entry) {
        create new POT entry for object undo_dlr->oid
    }
    if (should_keep_undo_dlr(undo_dlr)) {
        add undo_dlr to POT entry of undo_dlr->oid
        delete all other DLRs with timestamps less than undo_dlr->timestamp
    }
}

update_pot_after_tx_commit(oid, txid) {
    if (POT entry for oid still has a REDO DLR from transaction txid) {
        redo_dlr <- REDO DLR for oid and txid with greatest timestamp
        mark redo_dlr as committed
        delete all other DLRs with timestamps less than redo_dlr->timestamp
        delete any UNDO DLR with the same timestamp as redo_dlr->timestamp
    }
}

recover_commit(commit_record) {
    if (commit_record->txid has no RTT entry) {
        create new RTT entry for commit_record->txid
    }
    status of transaction commit_record->txid <- committed
    if (pending_objs is non-empty in RTT entry of commit_record->txid) {
        for every oid in pending_objs {
            update_pot_after_tx_commit(oid, commit_record->txid)
            remove oid from pending_objs
        }
    }
}

perform_recovery() {
    while (there are still some unread log records) {
        log_record <- read another record from the log
        case (type of log_record) {
            REDO DLR: recover_redo_dlr(log_record)
            UNDO DLR: recover_undo_dlr(log_record)
            COMMIT: recover_commit(log_record)
        }
    }
    for every object with an entry in the POT {
        if (object's POT entry has an UNDO DLR from an uncommitted tx) {
            write out value from UNDO DLR to object in disk version of DB
        } else {
            if (object's POT entry has a committed REDO DLR) {
                write out value from committed REDO DLR to disk version of DB
            }
        }
    }
}

The LM ensures that the log always contains sufficient information for the RM to restore the database to a consistent state if a crash were to ever occur at any time. Consider a series of updates, performed by different transactions, to a particular object. Suppose that the most recent update to the object has been committed. If the CM has already flushed this update to the disk version of the database, then either no recoverable (UNDO or REDO) DLRs from prior updates to the object remain in the log or the DLR from the most recent update is still in the log (and has a status of required) along with the COMMIT record for the transaction which performed this update.
The LM ensures that the log always contains sufficient information for the RM to restore the database to a consistent state if a crash were to ever occur at any time. Consider a series of updates, performed by different transactions, to a particular object. Suppose that the most recent update to the object has been committed. If the CM has already flushed this update to the disk version of the database, then either no recoverable (UNDO or REDO) DLRs from prior updates to the object remain in the log or the DLR from the most recent update is still in the log (and has a status of required) along with the COMMIT record for the transaction which performed this update. If the CM has not already flushed this update, then its DLR must have a status of unflushed and is still in the log.

Now suppose that the most recent update to a particular object was performed by a transaction that aborted or is still in progress. Therefore, the log does not contain any COMMIT record from this transaction. If the log still holds a REDO DLR from this update, then it is innocuous anyway (because the RM ultimately ignores any REDO DLR from a transaction that did not commit). If the disk version of the database already holds the new uncommitted value for the object, then the log must hold an UNDO DLR which records the object's original value (i.e., the value which it had immediately prior to the beginning of the current transaction). Since the RM finds this UNDO DLR but does not find a COMMIT record for the associated transaction, it restores the object in the disk version of the database to the value indicated in the UNDO DLR, thus undoing the uncommitted transaction's update. If the disk version of the database still holds the object's original value but the log holds an UNDO DLR for the uncommitted update (because the CM had requested permission from the LM to flush but had not yet performed the flush), then correct recovery is still ensured. Finally, if the log contains no UNDO DLR for the uncommitted update, then the disk version of the database must hold the object's original value and either the log holds a REDO DLR (with unflushed, required or recoverable status) from the most recently committed update or the log holds no unflushed, required or recoverable DLRs from prior updates to the object. Either way, the disk version of the database will hold the original value of the object after the RM finishes its work. Therefore, the LM and RM together guarantee that the most recently committed value for every object is restored to the database after a crash, and thus the consistency of the database is maintained.

Note that the LOT and LTT data structures which played a crucial role during normal logging operations are unnecessary for recovery. They enabled the LM to manage the log's records so that the database could always be restored to a consistent state if a crash were to ever occur. After a crash has actually happened, the RM must examine the information which the LM left on disk and use it to restore the disk version of the database to a consistent state. When the RM has finished its work, the database resumes normal processing. The LM initializes the LOT and LTT data structures (they are initially empty) and then begins accepting requests from client transactions.

Chapter 3

Correctness Proof for XEL

This chapter presents a theoretical model for a simplified version of XEL and proves important safety and liveness properties. Effectively, the model ignores the role of the cache manager and assumes that every update is flushed to the disk version of the database immediately after it is committed. Nevertheless, the log manager must ensure that the REDO DLR from the most recently committed update to each object is retained until all prior REDO DLRs for the same object have first been rendered non-recoverable (review Section 2.1 to understand this requirement). A manual correctness proof for a complete implementation of XEL, as presented in Chapter 2, would be overwhelming in terms of both size and effort; the proof itself would be prone to human error and its length would deter most readers from bothering to verify it.
Nevertheless, this chapter does prove the correctness of a simplified version of XEL. The proof focuses on only a single object, but it applies to all objects in the database. Therefore, this simplified version of XEL ensures that the log always holds sufficient information for the RM to restore the database to a consistent state after a crash. Section 2.9 explained how the RM actually does restore the database to a consistent state, given the information in the log. This chapter also proves that every committed update's REDO DLR is eventually erased (a liveness property) so that its disk space can be reused.

Although the proof considers only a simplified version of XEL, it has worth nonetheless. After someone understands this proof for simplified XEL, they can extend this understanding so that it applies to more realistic implementations of XEL. This approach of starting reasonably simple and then gradually adding in more detail has pedagogic value. Furthermore, the experience of proving the correctness of a simplified version of XEL can suggest approaches for automating the many "mechanical" parts of the proof effort so that much more sophisticated implementations of XEL can be proven automatically by computer (assuming that the program which assists in the proof process is itself correct); little human effort would be required and so the chance of human error would be significantly reduced.

This chapter uses I/O automata theory [42, 43] extensively. A reader unfamiliar with I/O automata theory is referred to [42, 43] for an explanation of it.

The remainder of the chapter is organized as follows. Section 3.1 states what aspects of XEL are simplified and explains the interface for this much-simplified version of the log manager. It also suggests how to gradually embellish this simplified model so that a more realistic version of XEL is obtained. To perform correctly, any variation of XEL can reasonably require client transactions to behave appropriately; Section 3.2 formally expresses these restrictions as four well-formedness properties that the log manager's surrounding environment must always satisfy. Section 3.3 presents a model of the log manager that is as simple as possible and proves that it is correct, as long as the environment satisfies the well-formedness properties; this very simple model for the log manager is referred to as SLM. Section 3.4 defines a more complex model for XEL that more closely resembles a real implementation and then proves safety and liveness properties for it. The I/O automata description for this implementation is presented in Section 3.4.1. This implementation is referred to as LM. Even though LM is fairly elaborate, it still has many simplifications and does not constitute an implementation of the complete XEL technique as presented in Chapter 2. Section 3.4.3 postulates a possibilities mapping f that maps each state in LM to a set of states in SLM and then proves that f is indeed a possibilities mapping, according to the definition in [42, 43]. This result inductively proves that LM is correct with respect to safety. For any possible execution of LM, a corresponding execution of SLM also exists which has exactly the same external behavior. Since the correctness (in terms of all possible external behaviors) of SLM has already been proven, the correctness of LM follows as a result. Finally, Section 3.4.4 states an important liveness property: every log record is eventually erased.
Appendix A provides the many lemmas and theorems which constitute the safety and liveness proofs.

3.1 Simplifications

This section describes the log manager's interface and explains how the log manager interacts with the world around it. Subsequent sections will build upon the introductory description which this section provides. Because this chapter considers a simplified version of XEL for the log manager, the interface is simpler than what was described in Chapter 2. This section explains and justifies these simplifications. It also discusses how some of these simplifications would be relaxed if one wanted to extend the techniques of this chapter to a more realistic version of XEL.

This chapter will prove that XEL (in its simplified manifestation) does "the right thing" for each object. Each object is characterized by a set of possible updates. These updates may be sequenced in any particular order; the "external world" chooses the order by issuing a series of commit commands.

Figure 3.1 illustrates the log manager's interface for the simplified version of XEL that is considered in this chapter. In this simple model, the log manager manages only a single object. The external world sends COMMIT_i and ERASE_i messages to the log manager, where the subscript i (which is a member of some index set) identifies a particular update, and the log manager sends ERASABLE_i messages to the external world. Here, the external world is everything outside the log manager.

[Figure 3.1: Interface of Simplified Log Manager]

When the external world (specifically, some client transaction) wants to atomically update an object, it sends a COMMIT_i request to the log manager; the subscript identifies the particular update which the external world wants to commit. The log manager keeps some (internal) record of this update but may eventually decide to delete it. When it decides that it no longer needs to retain a record from update i, it sends an ERASABLE_i message to the external world. In response to this message, the outside world chooses exactly when the record from update i is actually deleted; the external world sends an ERASE_i message to the log manager to inform it when the record from update i has been deleted.

The ERASABLE_i and ERASE_i messages model the operation of a typical disk drive. When the log manager decides to overwrite a particular log record on disk, it submits a request to the disk drive's controller to write a block of new information to the record's location on disk; this request corresponds to the ERASABLE_i message. After the disk drive has actually completed the write operation, it informs the log manager; this acknowledgement corresponds to the ERASE_i message.

The log manager must satisfy the following important property. Either the log manager still retains the record from the most recently committed update to the object (i.e., COMMIT_i has happened, no subsequent COMMIT_j has occurred yet, and the log manager has not issued an ERASABLE_i message) or the records from all previously committed updates have already been deleted (i.e., for every COMMIT_h which preceded COMMIT_i, a corresponding ERASE_h has already happened). This guarantees that no stale old update can jeopardize the state of the database if a crash were to occur.

The particular values of the updates to an object are irrelevant, so this simplified model makes no mention of them. It may help to think of update i's value being communicated in the COMMIT_i message.
A more elaborate model of XEL would choose to include a MODIFY_i action by which the external world assigns a value to the object for update i. If the external world later decides to make this modification permanent, it submits a separate COMMIT_i request; alternatively, an ABORT_i message would annul the update.

The model ignores the role of the cache manager. Effectively, it assumes that each COMMIT_i action simultaneously commits update i and flushes the object's new value (which was assigned by update i) to the disk version of the database. The log manager can grant permission to erase the REDO DLR from some update i as soon as the records from all chronologically preceding updates have been erased; it need not wait for the completion of a flush to the disk version of the database. Furthermore, UNDO DLRs do not play a role in this simplified model because a new value is (conceptually) assigned and committed at the same time. A more realistic model would include some additional FLUSH_i input action to the log manager to inform it that the value which update i assigned to the object has been flushed to the disk version of the database; as long as update i is the most recently committed update, the log manager cannot issue an ERASABLE_i message until FLUSH_i has happened.

This model applies to only a single object. The fact that a transaction can update an arbitrary number of other objects is irrelevant. A more realistic model would embed the above model within a larger model that provides transactional support. In this larger model, a transaction would send MODIFY_i messages to any number of objects. If the transaction later committed, the log manager would send a COMMIT_i message to each object which the transaction updated. Hence, the above model is a building block on which to construct a more realistic model.

3.2 Well-formedness Properties of Environment

The external world constitutes the environment in which the log manager must operate. By definition, the environment is constrained to respect certain conventions which govern its relationship with the log manager. These conventions are expressed in terms of the well-formedness properties presented in this section. Let β denote a behavior for the log manager module, and let π_i represent the i-th action of β (where i ∈ N and i ≥ 1). The behavior β is well-formed if and only if it satisfies the four properties expressed below.

    WF1: ∀x: π_i = COMMIT_x ⇒ ∄j, j ≠ i, such that π_j = COMMIT_x.
    WF2: ∀x: π_i = ERASE_x ⇒ ∃j, j < i, such that π_j = ERASABLE_x.
    WF3: ∀x: π_i = ERASE_x ⇒ ∄j, j ≠ i, such that π_j = ERASE_x.
    WF4: ∀x: π_i = ERASABLE_x ⇒ ∃j, j > i, such that π_j = ERASE_x.

Property WF1 states that the external world may commit a particular update, x, at most once. Property WF2 states that the external world cannot perform an ERASE_x action until after the log manager has given it permission to do so via the ERASABLE_x action, while WF3 constrains the external world to perform at most one ERASE_x action for any particular DLR x. Finally, WF4 insists that the external world must eventually perform an ERASE_x action in response to an ERASABLE_x action.
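As an aside (not part of the thesis), the four properties are easy to check mechanically on a finite trace. The following small Python checker is a sketch under the assumption that a behavior is encoded as a list of (action, x) pairs:

    def well_formed(trace):
        """Check a finite trace of ("COMMIT"|"ERASABLE"|"ERASE", x) events
        against properties WF1-WF4."""
        commits, erasables, erases = {}, {}, {}
        for i, (act, x) in enumerate(trace):
            if act == "COMMIT":
                if x in commits:              # WF1: at most one COMMIT_x
                    return False
                commits[x] = i
            elif act == "ERASE":
                if x not in erasables:        # WF2: ERASABLE_x must precede
                    return False
                if x in erases:               # WF3: at most one ERASE_x
                    return False
                erases[x] = i
            elif act == "ERASABLE":
                erasables[x] = i
        # WF4: every ERASABLE_x is eventually followed by ERASE_x
        # (only checkable on a complete trace).
        return all(x in erases for x in erasables)

    assert well_formed([("COMMIT", 1), ("ERASABLE", 1), ("ERASE", 1)])
    assert not well_formed([("ERASE", 1)])    # violates WF2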
These well-formedness properties which the environment must preserve can be represented in an automaton, called ENV, as shown in Figure 3.2.

[Figure 3.2: Automaton to Model Well-formed Environment]

The names, types and initial values of the three variables which constitute the state of ENV are¹:

    variable    type    initial value
    committed   2^N     ∅
    can_erase   2^N     ∅
    erased      2^N     ∅

¹N denotes the set of natural numbers {0,1,2,...}; for any set S, 2^S denotes the powerset of S.

The ENV module has the following transition relation:

    COMMIT_i
        Precondition: i ∉ committed
        Effect: committed ← committed ∪ {i}

    ERASABLE_i
        Effect: can_erase ← can_erase ∪ {i}

    ERASE_i
        Precondition: (i ∈ can_erase) ∧ (i ∉ erased)
        Effect: can_erase ← can_erase − {i}
                erased ← erased ∪ {i}

3.3 Specification of Correctness (Safety)

This section defines a very simple I/O automaton, called SLM, which embodies the safety property required of any implementation of XEL. Namely, it ensures that the record from the most recently committed update is not erased unless the records from all previously committed updates have already been erased. To prove this property, this section states a set of invariants which describe the composition of the ENV and SLM automata in all reachable states and then uses these invariants to prove that all behaviors of the SLM automaton satisfy the safety property.

3.3.1 I/O Automaton Model

Figure 3.3 illustrates the I/O automaton for the very simple version of the log manager. This automaton shall be referred to as SLM.

[Figure 3.3: Specification Automaton for LM]

Three variables comprise the state of SLM. Their names, types and initial values are²:

    variable     type    initial value
    keep         N_⊥     ⊥
    let_erase    2^N     ∅
    wait_erase   2^N     ∅

²For any set S, S_⊥ denotes the lifted domain S_⊥ = S ∪ {⊥}, where ⊥ is some unique bottom element that does not belong to S.

The SLM automaton has the following transition relation:

    COMMIT_i
        Effect: if ((let_erase = ∅) AND (wait_erase = ∅))
                    let_erase ← let_erase ∪ {i}
                else
                    if (keep ≠ ⊥)
                        let_erase ← let_erase ∪ {keep}
                    keep ← i

    ERASABLE_i
        Precondition: i ∈ let_erase
        Effect: let_erase ← let_erase − {i}
                wait_erase ← wait_erase ∪ {i}

    ERASE_i
        Effect: wait_erase ← wait_erase − {i}
                if ((keep ≠ ⊥) AND (let_erase = ∅) AND (wait_erase = ∅))
                    let_erase ← let_erase ∪ {keep}
                    keep ← ⊥

3.3.2 Invariants for Composition of SLM and ENV

The following invariants apply to the system composed from the SLM and ENV automata. It is easy to verify inductively that they are true in all reachable states of the system.

    Invariant 3.1: (keep = ⊥) ∨ (let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)
    Invariant 3.2: ∀x, x ∈ N: (keep ≠ x) ∨ (x ∈ committed)
    Invariant 3.3: ∀x, x ∈ N: (x ∉ let_erase) ∨ ((keep ≠ x) ∧ (x ∈ committed))
    Invariant 3.4: ∀x, x ∈ N: (x ∉ wait_erase) ∨ ((keep ≠ x) ∧ (x ∉ let_erase) ∧ (x ∈ committed))
    Invariant 3.5: (⊥ ∉ let_erase) ∧ (⊥ ∉ wait_erase)
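Before turning to the proofs, it may help to see the safety property exercised empirically. The following Python sketch (an illustration, not the thesis's model) drives the SLM transition relation with a randomly scheduled well-formed environment and asserts, at every ERASE, that the most recently committed update is erased only after all earlier committed updates; the scheduler and the encoding of ⊥ as None are assumptions.

    import random

    def simulate(steps=10000, seed=0):
        rng = random.Random(seed)
        keep, let_erase, wait_erase = None, set(), set()   # SLM state
        committed, erased = [], set()                      # ENV bookkeeping
        next_id = 0
        for _ in range(steps):
            choices = ["commit"]
            if let_erase:
                choices.append("erasable")
            if wait_erase:
                choices.append("erase")
            act = rng.choice(choices)
            if act == "commit":            # ENV outputs COMMIT_i; SLM effect
                i, next_id = next_id, next_id + 1
                committed.append(i)
                if not let_erase and not wait_erase:
                    let_erase.add(i)
                else:
                    if keep is not None:
                        let_erase.add(keep)
                    keep = i
            elif act == "erasable":        # SLM outputs ERASABLE_i
                i = rng.choice(sorted(let_erase))
                let_erase.discard(i)
                wait_erase.add(i)
            else:                          # ENV outputs ERASE_i; SLM effect
                i = rng.choice(sorted(wait_erase))
                # Safety check (Theorem 3.4): erasing the most recently
                # committed update requires all earlier commits erased.
                if committed and i == committed[-1]:
                    assert all(w in erased for w in committed[:-1])
                erased.add(i)
                wait_erase.discard(i)
                if keep is not None and not let_erase and not wait_erase:
                    let_erase.add(keep)
                    keep = None

    simulate()   # runs silently; an AssertionError would signal a violation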
3.3.3 Correctness of SLM Module

Theorem 3.4, at the end of this subsection, expresses XEL's important safety property: the log record from the most recently committed update is never erased before the records from all earlier committed updates have already been erased. Several supporting lemmas must first be proven. This subsection proves that the SLM automaton, when composed with ENV, satisfies this property. Throughout this section, α will represent an execution of the module composed of SLM and ENV, and π_i will represent the i-th action of α.

The following lemma states that after a particular update w has been committed, the SLM automaton keeps track of w in one of its three state variables at least until w is erased.

Lemma 3.1: (π_l = COMMIT_w) ∧ (∄m, l < m ≤ j, s.t. π_m = ERASE_w) ⇒ ∀h, l ≤ h ≤ j, ((keep = w) ∨ (w ∈ let_erase) ∨ (w ∈ wait_erase)) in state t_h

Proof:
• π_l = COMMIT_w ⇒ ((keep = w) ∨ (w ∈ let_erase)) in state t_l
• (((keep = w) ∨ (w ∈ let_erase) ∨ (w ∈ wait_erase)) in state t_{m−1}) ∧ (π_m ≠ ERASE_w) ⇒ ((keep = w) ∨ (w ∈ let_erase) ∨ (w ∈ wait_erase)) in state t_m
• Combining these two facts with (∄m, l < m ≤ j, s.t. π_m = ERASE_w) yields, by induction, ∀h, l ≤ h ≤ j, ((keep = w) ∨ (w ∈ let_erase) ∨ (w ∈ wait_erase)) in state t_h

and thus the lemma has been proven. □

The following lemma states that after a particular update x has been committed, then at least one of the let_erase and wait_erase state variables of the SLM automaton must be non-empty at least until x is erased (assuming that the execution is well-formed).

Lemma 3.2: (α is a well-formed execution) ∧ (π_i = COMMIT_x) ∧ (∃k, k > i, s.t. ∄j, i < j ≤ k, s.t. π_j = ERASE_x) ⇒ ∀l, i ≤ l ≤ k, ((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_l

Proof:
• (π_i = COMMIT_x) ∧ (∄j, i < j ≤ k, s.t. π_j = ERASE_x) ⇒ ∀l, i ≤ l ≤ k, ((keep = x) ∨ (x ∈ let_erase) ∨ (x ∈ wait_erase)) in state t_l, by Lemma 3.1
• ∀l, i ≤ l ≤ k, ((keep = ⊥) ∨ (let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_l, by Invariant 3.1
• (((keep = x) ∨ (x ∈ let_erase) ∨ (x ∈ wait_erase)) in state t_l) ∧ (((keep = ⊥) ∨ (let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_l) ⇒ ((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_l
• It therefore follows that ∀l, i ≤ l ≤ k, ((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_l

and thus the lemma has been proven. □

The following lemma proves that, in a well-formed execution, a particular update x cannot be erased before it has been committed.

Lemma 3.3: (α is a well-formed execution) ∧ (π_i = COMMIT_x) ∧ (π_j = ERASE_x) ⇒ i < j

Proof:
• π_i = COMMIT_x ⇒ ∄f, f ≠ i, s.t. π_f = COMMIT_x, by WF1
• π_j = ERASE_x ⇒ ∃h, h < j, s.t. π_h = ERASABLE_x, by WF2
• π_h = ERASABLE_x ⇒ x ∈ let_erase in state t_{h−1}
• x ∈ let_erase in state t_{h−1} ⇒ either
  (1) ∃f, f ≤ h−1, s.t. (π_f = COMMIT_x) ∧ (((let_erase = ∅) ∧ (wait_erase = ∅)) in state t_{f−1})
      • (π_f = COMMIT_x) ∧ (∄f, f ≠ i, s.t. π_f = COMMIT_x) ⇒ f = i
      • (f = i) ∧ (f ≤ h−1 < j) ⇒ i < j
  or
  (2) ∃g, g ≤ h−1, s.t. (π_g = ERASE_z for some z) ∧ (keep = x in state t_{g−1})
      • keep = x in state t_{g−1} ⇒ ∃f, f ≤ g−1, s.t. π_f = COMMIT_x
      • (π_f = COMMIT_x) ∧ (∄f, f ≠ i, s.t. π_f = COMMIT_x) ⇒ f = i
      • (f = i) ∧ (f ≤ g−1 < j) ⇒ i < j

Therefore, the desired result follows for both possible cases and thus the lemma has been proven. □

The following theorem proves that, in a well-formed execution, the most recently committed update, x, cannot be erased before all previously committed updates have been erased.

Theorem 3.4: (α is a well-formed execution) ∧ (π_i = COMMIT_x) ∧ (π_j = ERASE_x) ∧ (∄k, i < k ≤ j, s.t. π_k = COMMIT_y for any y) ∧ (π_l = COMMIT_w, l < i) ⇒ ∃m, m < j, s.t. π_m = ERASE_w

Proof: By contradiction. Assume ∄m, m < j, s.t. π_m = ERASE_w.
• π_i = COMMIT_x ⇒ ∄g, g ≠ i, s.t. π_g = COMMIT_x, by WF1
• (π_l = COMMIT_w) ∧ (l < i) ∧ (∄g, g ≠ i, s.t. π_g = COMMIT_x) ⇒ w ≠ x
• (π_l = COMMIT_w) ∧ (∄m, m < j, s.t. π_m = ERASE_w) ⇒ ∀h, l ≤ h ≤ j, ((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_h, by Lemma 3.2
• (π_i = COMMIT_x) ∧ (π_j = ERASE_x) ⇒ i < j, by Lemma 3.3
• (l < i) ∧ (i < j) ∧ (∀h, l ≤ h ≤ j, ((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_h) ⇒ ((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_{i−1}
• (π_i = COMMIT_x) ∧ (((let_erase ≠ ∅) ∨ (wait_erase ≠ ∅)) in state t_{i−1}) ⇒ keep = x in state t_i
• π_l = COMMIT_w ⇒ ∄n, n > l, s.t. π_n = COMMIT_w, by WF1
• (keep = x in state t_i) ∧ (x ≠ w) ∧ (∄n, n > l, s.t. π_n = COMMIT_w) ∧ (l < i) ⇒ ∀p, i ≤ p, keep ≠ w in state t_p
• π_j = ERASE_x ⇒ ∃q, q < j, s.t. π_q = ERASABLE_x, by WF2
• π_q = ERASABLE_x ⇒ x ∈ let_erase in state t_{q−1}
• x ∈ let_erase in state t_{q−1} ⇒ either
  (1) ∃r, r ≤ q−1, s.t. (π_r = COMMIT_x) ∧ (((let_erase = ∅) ∧ (wait_erase = ∅)) in state t_{r−1})
      • (π_r = COMMIT_x) ∧ (∄g, g ≠ i, s.t. π_g = COMMIT_x) ⇒ r = i
      • r = i ⇒ ((let_erase = ∅) ∧ (wait_erase = ∅)) in state t_{i−1}
      But this is a contradiction, so this case cannot be true.
  or
  (2) ∃r, r ≤ q−1, s.t. (π_r = ERASE_z for some z) ∧ (((keep = x) ∧ (let_erase = ∅) ∧ (wait_erase = {z})) in state t_{r−1})
      • (π_r = ERASE_z) ∧ (r < q < j) ∧ (∄m, m < j, s.t. π_m = ERASE_w) ⇒ z ≠ w
      • keep = x, x ∈ N, in state t_{r−1} ⇒ ∃f, f ≤ r−1, s.t. π_f = COMMIT_x
      • (π_f = COMMIT_x) ∧ (∄g, g ≠ i, s.t. π_g = COMMIT_x) ⇒ f = i
      • (f = i) ∧ (f ≤ r−1) ⇒ i ≤ r−1
      • (π_l = COMMIT_w) ∧ (∄m, m < j, s.t. π_m = ERASE_w) ⇒ ∀e, l ≤ e ≤ j, ((keep = w) ∨ (w ∈ let_erase) ∨ (w ∈ wait_erase)) in state t_e, by Lemma 3.1
      • (((keep = x) ∧ (let_erase = ∅) ∧ (wait_erase = {z})) in state t_{r−1}) ∧ (x ≠ w) ∧ (l < i ≤ r−1 < j) ∧ (∀e, l ≤ e ≤ j, ((keep = w) ∨ (w ∈ let_erase) ∨ (w ∈ wait_erase)) in state t_e) ⇒ z = w
      But this is a contradiction, and so this case cannot be true either.

Since both possible cases must be false, the original assumption must be false and thus the theorem has been proven. □

3.4 Implementation of Log Manager

This section describes an implementation of the log manager. It composes a set of constituent I/O automata to yield a module with the same external action signature as the SLM module. This log manager module shall be referred to as LM. To prove that LM satisfies XEL's safety property, this section states a set of invariants that characterize all reachable states when LM is composed with ENV, postulates a possibilities mapping f from the composition of LM and ENV to the composition of SLM and ENV and then proves that f is indeed a possibilities mapping. Given that SLM (when composed with ENV) is correct and that f is a possibilities mapping from LM to SLM, it immediately follows that LM (when composed with ENV) implements SLM and therefore satisfies XEL's safety property. Finally, this section proves a liveness property of LM: every record is eventually erased.

3.4.1 I/O Automata Model

Figure 3.4 depicts the composition of a collection of automata, all for the same object. The LOT automaton represents the object's LOT entry and the accompanying procedures which manage it. Each DLR_i automaton represents a different possible update to the object. Together, these automata compose the LM module which models the log manager's activity for the object. The external world issues a COMMIT_i command to instruct LM to commit update i. Some time later, LM sends an ERASABLE_i message to the external world to inform it that it is now allowed to erase the DLR from update i. In response, the outside world will eventually send an ERASE_i message back to LM to inform it that update i's DLR has been erased.

[Figure 3.4: I/O Automata for an Object]

This model ignores the problem of flow control between generations in a log stream. Management of the producer-consumer relationship between consecutive generations is an entirely different problem which is not considered here. The focus of this section is the management of each object's collection of REDO DLRs according to the state transition diagram that was represented in Figure 2.5. This state transition diagram implicitly takes into account the fact that each stream may have more than one generation, and that there may be more than one stream. It shall be proven that XEL guarantees a consistent state when DLRs are characterized by the state transition diagram of Figure 2.5. A more elaborate model that incorporates XEL's flow control activities would show that XEL preserves the state transition diagram for each object's DLRs.
By implication, this more elaborate version of XEL must also guarantee a consistent state.

Action Signatures of Automata

The LOT automaton has the following action signature:

    in:  COMMIT_i, ACK_ASSIGN_i, ACK_CS_RECV_i, <DLR_GONE,ts_i>
    out: <ASSIGN_i,ts_i>, CS_REQD_i, CS_RECV_i, ERASABLE_i
    int: none

In this action signature, as well as that given below for DLR_i, i ∈ N and ts_i ∈ N, where N denotes the set of natural numbers. Each DLR_i, i ∈ N, automaton has the following action signature:

    in:  <ASSIGN_i,ts_i>, CS_REQD_i, CS_RECV_i, ERASE_i
    out: ACK_ASSIGN_i, ACK_CS_RECV_i, <DLR_GONE,ts_i>
    int: none

After receiving a COMMIT_i message from the external world, the LOT automaton chooses a unique timestamp, ts_i, for the associated DLR and sends an <ASSIGN_i,ts_i> message to DLR_i. The DLR_i automaton receives the message and replies with an ACK_ASSIGN_i message to the LOT. The LOT automaton sends a CS_REQD_i message to DLR_i to instruct it to change its status from unflushed to required. Similarly, the CS_RECV_i message informs DLR_i that it should change its status to recoverable. DLR_i does not bother to acknowledge receipt of a CS_REQD_i message, but it does send an ACK_CS_RECV_i message to acknowledge a previous CS_RECV_i message. After DLR i has been erased, its DLR_i automaton sends a <DLR_GONE,ts_i> message to the LOT to inform it that the DLR whose timestamp was ts_i no longer exists. The subscripted messages from the LOT to DLR_i (namely, <ASSIGN_i,ts_i>, CS_REQD_i and CS_RECV_i) denote point-to-point communication. The fact that the DLR_GONE message is not subscripted (its only parameter is a timestamp) reflects the implementation of XEL, in which only the timestamp of an erased DLR is communicated to the LOT.

States of Automata

The following variables constitute the state of the LOT automaton³:

    variable           type
    pending_ts_assign  2^(N×N)
    recv_tss           2^N
    current_ts         N
    curr_reqd_ts       N_⊥
    curr_reqd_dlr      N_⊥
    curr_reqd_acked    B
    send_cs_reqd       N_⊥
    send_cs_recv       2^N
    pending_erasable   2^N

Every member of pending_ts_assign represents a committed DLR for which no <ASSIGN_i,ts_i> message has yet been generated. The recv_tss variable represents the set of timestamps which correspond to all DLRs whose status is merely recoverable. The LOT maintains a counter, current_ts, which it will assign as the timestamp value for the next committed DLR. The curr_reqd_ts and curr_reqd_dlr variables indicate the timestamp and identity, respectively, of the DLR which currently has status required, if there is such a DLR. In conjunction with these two variables, the curr_reqd_acked variable indicates if that DLR's automaton has acknowledged receipt of its timestamp. The send_cs_reqd variable represents the DLR to which a CS_REQD_i message should be sent, if any such DLR exists. Likewise, send_cs_recv identifies all DLRs to which CS_RECV_i messages should be sent. Finally, the pending_erasable set indicates all DLRs for which the LOT can issue an ERASABLE_i message.

The following variables constitute the state of each DLR_i automaton:

    variable       type
    status_i       {UNFL, REQD, RECV, NONR}
    pending_ack_i  B
    timestamp_i    N_⊥

The status_i variable of DLR_i indicates the current status of the DLR and may have one of the four values listed in the table above. If pending_ack_i is true, then DLR_i owes a message of response to the LOT; the value of status_i determines the particular type of the message. The timestamp_i variable represents a DLR's unique timestamp, and receives its value in response to an <ASSIGN_i,ts_i> message from the LOT. The variables which comprise the state of each of the constituent automata of the LM and the relationships amongst these automata are depicted in Figure 3.5.
³B denotes the set of boolean values {T,F}. For any sets S and T, S×T denotes the set which is their cartesian product.

[Figure 3.5: States and Action Signatures of Automata in LM Module]

Initial State of System

The LOT automaton has a unique initial state. The initial values of the LOT's variables are:

    variable           initial value
    pending_ts_assign  ∅
    recv_tss           ∅
    current_ts         0
    curr_reqd_ts       ⊥
    curr_reqd_dlr      ⊥
    curr_reqd_acked    F
    send_cs_reqd       ⊥
    send_cs_recv       ∅
    pending_erasable   ∅

Each DLR_i automaton also has a unique initial state in which its variables have the following values:

    variable       initial value
    status_i       UNFL
    pending_ack_i  F
    timestamp_i    ⊥

Transition Relations of Automata

The steps for the LOT automaton's input actions are as follows:

    COMMIT_i
        Effect: if (curr_reqd_ts ≠ ⊥)
                    recv_tss ← recv_tss ∪ {curr_reqd_ts}
                    if (curr_reqd_acked = T)
                        send_cs_recv ← send_cs_recv ∪ {curr_reqd_dlr}
                curr_reqd_ts ← current_ts
                curr_reqd_dlr ← i
                pending_ts_assign ← pending_ts_assign ∪ {<i,current_ts>}
                curr_reqd_acked ← F
                current_ts ← current_ts + 1

    ACK_ASSIGN_i
        Effect: if (i = curr_reqd_dlr)
                    curr_reqd_acked ← T
                    if (recv_tss ≠ ∅)
                        send_cs_reqd ← i
                    else
                        recv_tss ← recv_tss ∪ {curr_reqd_ts}
                        send_cs_recv ← send_cs_recv ∪ {i}
                        curr_reqd_dlr ← ⊥
                        curr_reqd_ts ← ⊥
                else
                    send_cs_recv ← send_cs_recv ∪ {i}

    ACK_CS_RECV_i
        Effect: pending_erasable ← pending_erasable ∪ {i}

    <DLR_GONE,ts_i>
        Effect: if ((recv_tss = {ts_i}) AND (curr_reqd_dlr ≠ ⊥) AND (curr_reqd_acked = T))
                    recv_tss ← recv_tss ∪ {curr_reqd_ts}
                    send_cs_recv ← send_cs_recv ∪ {curr_reqd_dlr}
                    curr_reqd_dlr ← ⊥
                    curr_reqd_ts ← ⊥
                recv_tss ← recv_tss − {ts_i}

The steps for the LOT automaton's output actions are specified below:

    ERASABLE_i
        Precondition: i ∈ pending_erasable
        Effect: pending_erasable ← pending_erasable − {i}

    <ASSIGN_i,ts_i>
        Precondition: <i,ts_i> ∈ pending_ts_assign
        Effect: pending_ts_assign ← pending_ts_assign − {<i,ts_i>}

    CS_REQD_i
        Precondition: i = send_cs_reqd
        Effect: send_cs_reqd ← ⊥

    CS_RECV_i
        Precondition: i ∈ send_cs_recv
        Effect: send_cs_recv ← send_cs_recv − {i}

The steps for the DLR_i automaton's input actions are:

    <ASSIGN_i,ts_i>
        Effect: timestamp_i ← ts_i
                pending_ack_i ← T

    CS_REQD_i
        Effect: if (status_i = UNFL)
                    status_i ← REQD

    CS_RECV_i
        Effect: status_i ← RECV
                pending_ack_i ← T

    ERASE_i
        Effect: status_i ← NONR
                pending_ack_i ← T

The steps for the DLR_i automaton's output actions are:

    ACK_ASSIGN_i
        Precondition: (status_i = UNFL) ∧ (pending_ack_i = T)
        Effect: pending_ack_i ← F

    ACK_CS_RECV_i
        Precondition: (status_i = RECV) ∧ (pending_ack_i = T)
        Effect: pending_ack_i ← F

    <DLR_GONE,ts_i>
        Precondition: (status_i = NONR) ∧ (pending_ack_i = T) ∧ (timestamp_i = ts_i)
        Effect: pending_ack_i ← F
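The status progression of a single DLR_i automaton can be made concrete with a small sketch. The Python class below is illustrative only (the method names are assumptions); it mirrors the input actions and the preconditions of the output actions given above, and the driver walks one record through unflushed, required, recoverable and non-recoverable:

    class DLRAutomaton:
        """Status automaton for one data log record."""
        def __init__(self, i):
            self.i = i
            self.status = "UNFL"      # unflushed
            self.pending_ack = False
            self.timestamp = None     # stands for ⊥ until ASSIGN arrives

        # --- input actions ---
        def assign(self, ts):         # <ASSIGN_i, ts_i> from the LOT
            self.timestamp = ts
            self.pending_ack = True
        def cs_reqd(self):            # CS_REQD_i: unflushed -> required
            if self.status == "UNFL":
                self.status = "REQD"
        def cs_recv(self):            # CS_RECV_i: -> recoverable
            self.status = "RECV"
            self.pending_ack = True
        def erase(self):              # ERASE_i: -> non-recoverable
            self.status = "NONR"
            self.pending_ack = True

        # --- output actions, enabled when their precondition holds ---
        def enabled_outputs(self):
            if self.pending_ack and self.status == "UNFL":
                return ["ACK_ASSIGN"]
            if self.pending_ack and self.status == "RECV":
                return ["ACK_CS_RECV"]
            if self.pending_ack and self.status == "NONR":
                return [("DLR_GONE", self.timestamp)]
            return []

    # Walk one DLR through the handshake (pending_ack is cleared by hand
    # here to simulate each output action actually firing).
    d = DLRAutomaton(0)
    d.assign(17); assert d.enabled_outputs() == ["ACK_ASSIGN"]; d.pending_ack = False
    d.cs_reqd();  assert d.status == "REQD"        # CS_REQD is not acknowledged
    d.cs_recv();  assert d.enabled_outputs() == ["ACK_CS_RECV"]; d.pending_ack = False
    d.erase();    assert d.enabled_outputs() == [("DLR_GONE", 17)]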
3.4.2 Invariants for Composition of LM and ENV

The following invariants will assist in the proof of LM's correctness. They apply to the system composed of the LM and ENV modules. It is straightforward to verify that the invariants are true in the initial state of the system. Likewise, the definitions for the automata's actions ensure that the invariants remain true in all reachable states of the system.

For convenience, define a predicate recvbl(x), x ∈ N, which characterizes a particular update, x, as recoverable or not, in a particular state of LM:

    recvbl(x) ≡ ((<x,u> ∈ pending_ts_assign for some u) ∨ ((timestamp_x ≠ ⊥) ∧ (status_x ≠ NONR)))

Similarly, it is notationally convenient to define three other predicates. These predicates apply to a state of the LM automaton, but they have an obvious correspondence to the variables which comprise the state of the SLM automaton. The lm_ prefix is a reminder of the fact that they apply to the LM automaton. These predicates will play an important role in defining and proving a possibilities mapping from the states of LM to the states of SLM.

    lm_keep(x) ≡ (curr_reqd_dlr = x, x ∈ N) ∧ (∃y, y ≠ x, s.t. recvbl(y))

    lm_let(x) ≡ ((curr_reqd_dlr = x, x ∈ N) ∧ (∄y, y ≠ x, s.t. recvbl(y)))
              ∨ ((curr_reqd_dlr ≠ x) ∧ (recvbl(x)) ∧ ((status_x ≠ RECV) ∨ (pending_ack_x ≠ F) ∨ (x ∈ pending_erasable)))

    lm_wait(x) ≡ (status_x = RECV) ∧ (pending_ack_x = F) ∧ (x ∉ pending_erasable)

The following invariants for the composition of LM and ENV are expressed in terms of the predicates defined above.

    Invariant 3.6: ∀x, x ∈ N: (status_x = UNFL) ∨ (status_x = REQD) ∨ (curr_reqd_dlr ≠ x)

    Invariant 3.7: ∀x, x ∈ N: (timestamp_x ∈ N) ∨ ((status_x = UNFL) ∧ (x ≠ send_cs_reqd) ∧ (x ∉ send_cs_recv) ∧ (x ∉ pending_erasable) ∧ (x ∉ can_erase) ∧ (pending_ack_x = F))

    Invariant 3.8: ∀x, x ∈ N: (curr_reqd_dlr ≠ x) ∨ (∃v, v ∈ N, s.t. ((curr_reqd_ts = v) ∧ ((timestamp_x = v) ∨ (<x,v> ∈ pending_ts_assign))))

    Invariant 3.9: ∀x, x ∈ N: (∄v s.t. <x,v> ∈ pending_ts_assign) ∨ (∃v, v ∈ N, s.t. ((<x,v> ∈ pending_ts_assign) ∧ (∄u, u ≠ v, s.t. <x,u> ∈ pending_ts_assign) ∧ (timestamp_x = ⊥) ∧ (status_x = UNFL)))

    Invariant 3.10: ∀x, x ∈ N: (x ∈ committed) ∨ ((curr_reqd_dlr ≠ x) ∧ (∄u s.t. <x,u> ∈ pending_ts_assign) ∧ (x ≠ send_cs_reqd) ∧ (x ∉ send_cs_recv) ∧ (x ∉ pending_erasable) ∧ (x ∉ can_erase) ∧ (pending_ack_x = F) ∧ (timestamp_x = ⊥))

    Invariant 3.11: ∀x, x ∈ N: (curr_reqd_dlr ≠ x) ∨ ((x ∉ send_cs_recv) ∧ (x ∉ pending_erasable) ∧ (x ∉ can_erase))

    Invariant 3.12: ∀x, x ∈ N: (x ∉ pending_erasable) ∨ ((pending_ack_x = F) ∧ (x ∉ can_erase))

    Invariant 3.13: ∀x, x ∈ N: (x ∉ can_erase) ∨ (pending_ack_x = F)

    Invariant 3.14: ∀x, x ∈ N: (status_x = RECV) ∨ ((x ∉ pending_erasable) ∧ (x ∉ can_erase))

    Invariant 3.15: ∀x, x ∈ N: (x ∉ send_cs_recv) ∨ (status_x = UNFL) ∨ (status_x = REQD)

    Invariant 3.16: ∀x, x ∈ N: ((∄v s.t. <x,v> ∈ pending_ts_assign) ∧ (timestamp_x = ⊥)) ∨ (∃v, v ∈ N, s.t. (((<x,v> ∈ pending_ts_assign) ∨ (timestamp_x = v)) ∧ (∀y, y ≠ x, (((∄u s.t. <y,u> ∈ pending_ts_assign) ∧ (timestamp_y = ⊥)) ∨ (∃u, u ∈ N, s.t. (((<y,u> ∈ pending_ts_assign) ∨ (timestamp_y = u)) ∧ (u ≠ v))))) ∧ (v < current_ts)))

    Invariant 3.17: ∀x, x ∈ N: (curr_reqd_dlr = x) ∨ (¬recvbl(x)) ∨ (∃v, v ∈ N, s.t. (((<x,v> ∈ pending_ts_assign) ∨ (timestamp_x = v)) ∧ (v ∈ recv_tss)))

3.4.3 Proof of Safety for LM

Let s be a state in the LM module (i.e., the implementation), let t be a state in the SLM module (i.e., the specification), and let f denote a possibilities mapping from the states of LM to the states of SLM. This possibilities mapping f is defined as follows.

Definition 3.1: t ∈ f(s) if and only if, ∀x, x ∈ N:

    ((lm_keep(x) in state s) ∧ (keep = x in state t))
    ∨ ((lm_let(x) in state s) ∧ (x ∈ let_erase in state t))
    ∨ ((lm_wait(x) in state s) ∧ (x ∈ wait_erase in state t))
    ∨ ((¬recvbl(x) in state s) ∧ (((keep ≠ x) ∧ (x ∉ let_erase) ∧ (x ∉ wait_erase)) in state t))

Refer to Section A.1 in Appendix A for all the lemmas and theorems which prove that f is a possibilities mapping from LM to SLM.

3.4.4 Proof of Liveness

Theorem A.69, which is found in Section A.2, states an important liveness property for LM: the DLR for every committed update is eventually erased. This property is formally expressed as:

    (α is a well-formed and fair execution) ∧ (π_i = COMMIT_x) ⇒ ∃j, j > i, s.t. π_j = ERASE_x

where α represents an execution of the module composed of LM and ENV, and π_i represents the i-th action of α. Refer to Section A.2 for the proof of this theorem and the many lemmas that support it.

Chapter 4

Parallel Logging

4.1 Parallel XEL

The XEL algorithm presented in Chapter 2 can be applied in a parallel system in which there are multiple log streams, each of which accepts incoming log records and operates independently of other log streams.
Collectively, the log streams provide the bandwidth required for an application's log information. The name for this practice is parallel XEL. Each log stream accepts incoming log records from client transactions and manages them according to the XEL algorithm. The streams operate independently of one another, except for dependencies introduced by the status values for log records. The LOT and LTT tables are distributed across numerous processors in the parallel system so that they can each provide the necessary throughput for operations on them.

Each log stream is segmented into the same number of generations. Generation i is the same size for each stream, but different generations within each stream may still be of different sizes. The positions of the head and tail for generation i in one stream are completely independent of the head and tail positions for another stream's generation i. That is, the LM performs forwarding and recirculation at each log stream independently of such activity at other streams. The abstraction of multiple log streams that operate independently of one another is well suited to a system which requires an arbitrarily large number of disk drives to provide the necessary bandwidth for log information. The LM can dedicate only one or some small fixed number of disk drives to each particular stream.

Figure 4.1 illustrates the situation for a LM that manages four log streams, each composed of two generations. Within each stream, non-garbage log records are forwarded or recirculated and garbage records are thrown away (no garbage pails are shown in this figure in order to reduce visual clutter). Updated objects whose DLRs are anywhere in any of the log streams may have their new values flushed to the disk version of the database.

[Figure 4.1: Four Parallel Log Streams]

When the LM must examine or modify an object's LOT entry, it first hashes the object's oid to a processor identifier within the concurrent system and then it hashes the oid to a particular address within the processor's memory space. The number of processors over which the LOT is distributed must be sufficient to satisfy the throughput requirements of client transactions. Similarly, the LTT is implemented as a distributed hash table with a two-step translation procedure.

Each log stream has one particular processor that is responsible for managing its records (this is not necessarily a one-to-one mapping; a processor may manage more than one log stream if it has sufficient processing power, memory capacity and communication bandwidth). The cells for a stream's relevant log records all reside in the memory space of the stream's processor. In general, the LM may send an object's DLRs to different log streams, and so the object's LOT entry cannot hold direct pointers to the cells for all the object's relevant DLRs (as was the case for only a single log stream in Chapter 2). Rather, an object's LOT entry keeps track of the streams at which relevant DLRs exist. A local LOT at the processor of each of these streams, which is also associatively accessed via oid, holds pointers to the cells for each object's DLRs at the stream. Similarly, a transaction's LTT entry indicates the stream to which its COMMIT TLR, if any, was sent; the transaction's entry in a local LTT at that stream points to the cells for all copies of the COMMIT record within the stream.
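The two-step translation described above can be sketched in a few lines of Python; the processor count, bucket count and CRC-based hash below are illustrative assumptions, not values from the thesis:

    import zlib

    NUM_PROCESSORS = 64      # assumed machine size
    TABLE_BUCKETS = 4096     # assumed buckets per processor

    def _h(tag, key):
        # Deterministic hash so every node computes the same placement.
        return zlib.crc32(f"{tag}:{key}".encode())

    def lot_home(oid):
        """Step 1: choose the processor that owns the LOT entry;
        step 2: choose the bucket inside that processor's table."""
        return _h("proc", oid) % NUM_PROCESSORS, _h("bucket", oid) % TABLE_BUCKETS

    def ltt_home(tid):
        """The LTT is placed the same way, keyed by transaction identifier."""
        return _h("proc", tid) % NUM_PROCESSORS, _h("bucket", tid) % TABLE_BUCKETS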
As a special case, the LM may insist that all an object's DLRs go to the same stream, and the identity of this stream is determined by a function whose domain is the set of all oids. In this case, the LM can place the object's LOT entry at the processor that manages the stream to which its DLRs are sent so that indirection via a local LOT is unnecessary. This placement of LOT entries at processors that are responsible for managing log streams assumes that each processor has ample resources to support both purposes. Similarly, a transaction's LTT entry can be placed at the processor that manages the stream to which its COMMIT record will be sent so that indirection via the local LTT is eliminated (this assumes that each transaction is statically mapped to some stream which will receive its COMMIT record).

When only a single log stream exists, the LM adds a COMMIT TLR to the log (i.e., adds it to the buffer in main memory that currently holds records at the tail of generation 0) as soon as a transaction requests to commit. In the more general case of more than one log stream, the LM waits until all a transaction's DLRs are on disk before it generates a COMMIT TLR for it; this delay is expected to be less than 100 ms. Therefore, a transaction's COMMIT record marks both its intention and its eligibility to successfully terminate.

The synchronization between a transaction's DLRs and its COMMIT record is accomplished as follows. For each buffer of each log stream, the LM keeps a list of the transaction identifiers for all transactions which wrote log records to the buffer. For each transaction, the LM keeps a list which identifies the buffers to which the transaction has written log records; this list is stored in the transaction's LTT entry. Immediately after a buffer of log records has been written to disk, the LM examines the buffer's list of transactions. For each transaction in the list, the LM removes the buffer identifier from the list kept in the transaction's LTT entry. If this list becomes empty and the transaction is waiting to commit, then the LM generates a COMMIT record for the transaction; otherwise, the LM does nothing further and leaves the transaction waiting for the rest of the buffers on which it depends to be written to disk.
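The following Python sketch (with assumed names, not the thesis's code) captures this bookkeeping: per-buffer transaction lists, per-transaction buffer lists in the LTT entry, and generation of the COMMIT record when the last buffer a waiting transaction depends on reaches disk:

    class CommitSynchronizer:
        def __init__(self):
            self.buffer_txs = {}    # buffer id -> txids with records in it
            self.tx_buffers = {}    # txid -> unwritten buffers (LTT field)
            self.wants_commit = set()

        def log_record_added(self, txid, buf):
            self.buffer_txs.setdefault(buf, set()).add(txid)
            self.tx_buffers.setdefault(txid, set()).add(buf)

        def commit_requested(self, txid):
            # With multiple streams, the COMMIT TLR waits for the DLRs.
            self.wants_commit.add(txid)
            self._maybe_commit(txid)

        def buffer_written(self, buf):
            # Called as soon as a buffer of log records reaches disk.
            for txid in self.buffer_txs.pop(buf, set()):
                self.tx_buffers[txid].discard(buf)
                self._maybe_commit(txid)

        def _maybe_commit(self, txid):
            if txid in self.wants_commit and not self.tx_buffers.get(txid):
                self.wants_commit.discard(txid)
                print(f"generate COMMIT TLR for {txid}")  # stand-in for the LM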
Crash recovery is almost the same as for the special case of only a single log stream, except that the POT and RTT data structures are distributed across processing nodes in a parallel machine so that they provide the necessary throughput. Like the LOT and LTT, they are implemented as distributed hash tables with a two-step translation procedure. The RM's work at each log stream proceeds completely independently of recovery activity at other log streams. The RM sends each DLR to the processor that manages the POT entry for the object indicated in the DLR. Likewise, after retrieving a transaction's COMMIT record from disk, the RM sends it to the processor that is responsible for its RTT entry.

Many of the messages used in parallel XEL are quite short, typically only 10 to 20 bytes in length. Low overhead interprocessor communication is therefore particularly important. For best performance, parallel XEL should be implemented on a fine-grain concurrent computer that provides low overhead, low latency communication primitives. The MIT J-Machine [11, 12, 48] is an existing example of such a machine. XEL will perform satisfactorily on other concurrent systems in which the overhead for interprocessor communication and synchronization is higher, as long as the added delays are still relatively short compared to the delays for writing blocks to disk, the interconnection network provides sufficient bandwidth and CPU cycles are plentiful.

4.2 Three Different Distribution Policies

When a client transaction submits a log record to the LM, the LM must choose the stream(s) to which it will assign the record. The LM's distribution policy governs its choice of log streams for records. In general, copies of a log record may be sent to any number of streams. All the policies examined in this thesis send a log record to only one log stream. This section proposes three distribution policies: partitioned, random and cyclic. These policies are all oblivious policies: they do not use information about current load imbalances to help choose the stream to which to send a log record (in these discussions, a log stream's load is defined to be the bandwidth demanded of it). More elaborate adaptive policies, which monitor load imbalances between streams and attempt to send records to streams so as to counteract current imbalances, are beyond the scope of this thesis. The three policies are summarized as record-to-stream functions in the sketch below.

The analyses in the following subsections consider only the bandwidth required for incoming log records to generation 0 of each log stream. The bandwidth for forwarded and recirculated records within each stream is ignored because it defies accurate analytical modelling. In practice, the bandwidth required for forwarded and recirculated records ought to be relatively small compared to that required for incoming new records.
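The sketch below is illustrative (the record interface and stream count are assumptions); it renders the three policies as Python functions that map a log record to a stream number:

    import random
    import zlib

    S = 8  # assumed number of log streams

    def partitioned(record):
        """Partitioned: every record for a given oid goes to a fixed stream."""
        return zlib.crc32(str(record["oid"]).encode()) % S

    def randomized(record, rng=random.Random(0)):
        """Random: each record independently picks a uniformly random stream."""
        return rng.randrange(S)

    _counter = 0
    def cyclic(record):
        """Cyclic: record j goes to stream j mod S (round robin)."""
        global _counter
        stream, _counter = _counter % S, _counter + 1
        return stream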
The probability distribution function for M is 0 if m< N NI Prob[M = m] = if N is even and m = 2)1 2w 2 (m 1 if m> N It follows that the expected value for M is given by N E[M] = mProb[M=m] E m=r-N /2 1 1 2NN (N rI 3 + L,2Z=1+Nm(m) } if N is even if N is odd %rnl(m) All random variables are typeset in boldface to emphasize their nature. 82 Define E[M]/N to be the load imbalance; it represents the expected fraction of the objects assigned to the stream with the most objects. Ideally, E[M]/N remains constant at 0.5 (i.e., each stream gets half the objects) for all N. Figure 4.2 plots E[M]/N as N increases. For small values of N, the imbalance between the two streams is quite pronounced. For example, E[M]/N=0.6 for N=16. That is, 60% of the objects go to one stream and the other 40% go to the other stream, on average; the first stream's bandwidth is 50% higher than that of the second stream. However, a significant imbalance remains even for fairly large N. For N=128, E[M]/N=0.535193 and so the busier stream's bandwidth is 15%higher than that of the other stream, on average. The total number of objects in a database may be quite high (several million, say) but a relatively small number of "hot" objects may receive a disproportionately large number of updates. If these hot objects are not evenly distributed across all the log streams, the static skew effect may lead to significant load imbalances. The smaller the set of hot objects and the higher their "temperature", the worse the threat of load imbalances becomes. 1.1 ·~ 1.0 8 0.9 Ce no 0.8 E *o 0.7 0.6 0.5 0.4 0.3 0.2 0.1 fin 0 20 40 60 80 100 120 140 Numberof objects, N Figure 4.2: Load Imbalance vs. Number of Objects Now consider what happens when the number of log streams increases in proportion to the number of objects. This is a reasonable exercise because each object may be characterized by a particular bandwidth (which depends on how often it is updated). If every object has the same characteristic bandwidth and this parameter remains constant, then a database's total demanded bandwidth is proportional to the number of objects in the database. To satisfy this demanded bandwidth, the LM must provide a number of log streams that is at least proportional to the number of objects. Let there be N objects and S log streams. Each object is randomly assigned to one stream; the probability that the object is assigned to stream i is 1, for <i<S. Let Peq(m,N,S) denote the probability that the maximum number of objects assigned to any stream is exactly m and Pg(m,N,S) denote the probability that the maximum 83 number of objects assigned to any stream is at least m. Pge(mr,N2 ,S)>Pge(mNl,S) *,N1 -<m<N -(N 2 >N1 ) A (([s] 2) Lemma 4.1 Proof: Partition the collection of N 2 objects into two disjoint sets of sizes N1 and NR-N 2 -N1. To assign the N 2 objects to the S streams, first assign the N1 objects of the first set to streams and then assign the remaining NR objects to streams. After assigning the first N 1 objects, let the random variable Bi denote the number of objects assigned to stream denote the set of streams that have a maximum number of objects i, for 1<i<S, and assigned to them. That is, VjE, k, l<k<S, such that Bk>Bj. The probability that any one of the remaining NR objects is assigned to one of the streams in IF is at least s (this minimum occurs for the case of NR=1 and I1'I=1). 
Therefore, Pge(m,N2,S) > Peq(m-1,N1,S)() + Pge(m,N,S) and Peq(m-l,N 1,S)>O for [N 1]<m-l<N rPeq(m-l,N 1, S)>O for FN]<m<Nl+l 1 so and hence Pge(mN2,S)>Pge(m,NS) for [N 1<m<Nl+1. For larger values of m, Pge(m,Nl,S)=O and Pge(m,N2,S)>O for Nl+l<m<N 2 and so Pge(m,N2,S)>Pe(m,N1,S >) for Ni+l<m<N2. Therefore, the general result follows: Pge(m,N2 ,S)>Pge(m,N,S) Theorem 4.2 for [N 1<m<N2 0 Vm, [N1<m<2N, Pge(m,2N,2S)>Pge(m,N,S) Proof: Divide the set of 2S log streams into two equal (and disjoint) sets, each of size S. Assign the 2N objects to streams in two steps. In step 1, randomly assign each object to one of the two sets. In step 2, assign each object to a particular stream within its set. In step 1, the two sets are equally likely to be chosen when assigning objects to sets; similarly, the streams within a set all have the same probability of being chosen in step 2. Therefore, this two step procedure implies that the probability that a particular object is assigned to a particular stream is the same for all streams, as required by the statement of the problem. After step 1, there are two possible outcomes: (a) Both sets have been assigned exactly N objects each. Denote this outcome by A. or 84 (b) One set has been assigned more than N objects. Denote this outcome by B, and let the random variable MB represent the number of objects assigned to the set with more objects. It must be true that MB>N. The probability of outcome A is PA= ( (1)2N and the probability of outcome B is PB = 1-PA = 1( N ) Therefore, Pge(m,2N,2S)-N+ ) 2 (1- , () Pge(m ,MB,S) for [N 1 <m<N. where aPge(m,N,S) By Lemma 4.1, Pge(m,MB,S)>a for VN <m<MB because MB>N. = a[1 + >a ( a) a)] - T herefore, -(1 becauseO<a<l This completes the proof that Pge(m,2N,2S) > Pge(m,N,S) for [rl <m<N. Turning now to larger values of m, Pge(m,2N,2S)>O but Pge(m,N,S)=O0 for N<m<2N and so it followsthat Pge(m,2N,2S) > Pge(m,N,S) foir N<m<2N. Combining these results yields the final conclusion: Pge(m,2N,2S) > Pge(m,N,S) for [N1 <m<2N. This completes the proof of the theorem. 0 Theorem 4.2 implies that the threat of load imbalances becomes increasingly severe as the number of log streams increases in proportion to the number of objects. Hence, a LM must increase the number of log streams superlinearly if it is to maintain the same probability of failure (i.e., overload) as the number of objects in the database increases (assuming that the average rate at which each object is updated remains constant). 4.2.2 Random Distribution The random distribution policy randomly chooses the stream to which each log record is sent; all log streams are equally likely to be chosen. Static skew is not a problem with the random distribution policy because it does not assign log records to streams on the basis of oids or tids; log records for different updates to the same object may go 85 to different streams. Nevertheless, different streams may receive different numbers of log records simply as a consequence of the LM's random decisions. Imbalances between log streams are now attributed to "dynamic skew" because they arise from decisions that the LM makes during operation rather than from a static assignment of objects and transactions to log streams. Suppose that L log records are to be distributed to S log streams. The probability and that the LM chooses to send a particular log record to stream i, for l<i<S, is Ri denote random variable Let the log records. 
is independent of its choices for other the number of log records that the LM sends to stream i, for 1<i<S; Ri has a binomial 1) . Define another distribution [6] with a mean E[Ri]=L and a variance V[Ri]=-l random variable EOiR to be the fraction of log records sent to log stream i. Since and its variance is Oi is a linearly scaled version of Ri, its mean is E[Ei] =E[= v[Oi]=VL =S2L-. By the Chebyshev Inequality 4 , Prob[IEilim Prob[10i- L oo - S ½I > e] < S and so > ] = 0. Under the assumption that all log records are the same size, this result proves that load imbalances between log streams are expected to diminish as the number of log records increases. Therefore, dynamic skew is not expected to be a serious problem in a system that operates on a continuous basis for long durations (so that many log records are written). 4.2.3 Cyclic Distribution The cyclic policy assigns log records to streams in a round robin manner: the LM assigns a total order to the log streams and directs successive records to successive streams in the order. The cyclic distribution policy does not suffer from either static or dynamic skew effects. If all log records have the same size, it guarantees an optimal load balance amongst the log streams. Again, assume that L log records are to be distributed across S streams. According to the cyclic distribution policy, the LM sends log record j to stream 1+(j mod S), for j>O, and so stream i receives li=[rL records if i<(L mod S) and li=LL] records otherwise, for 1<i<S. Define i- k to be the fraction of log records sent to stream i Note that OS<Oi<l1, for all i, 1<i<S. Under the assumption that all log records are the same size, i is proportional to the load on stream i and so the load of the most 4The Chebyshev Inequality [6] states that for any random variable, X, which is characterized by an expectation E[X]=p and a variance V[X]=a2, the following relation is true: Prob[lX where e is any arbitrarily chosen positive constant. 86 1 > e] < - heavily loaded stream cannot exceed that of the most lightly loaded stream by more than a factor of Os -- Is < - (1+ [LJ)/ S [LLbut this quantity monotonically approaches 1.0 as L increases. Therefore, load imbalances amongst the streams are expected to be negligible after a system has been running for a while and many log records have been generated. In practice, not all log records will be the same size so the cyclic policy no longer guarantees an optimal load balance amongst the set of log streams. Nevertheless, the cyclic policy is expected to yield reasonably good load balancing behavior for most applications if a system is allowed to run for a sufficiently long amount of time. The cyclic policy will serve as a touchstone against which to judge the other distribution policies. The cyclic policy poses implementation problems as a system's degree of concurrency increases. The most straightforward implementation employs a single variable that keeps track of the stream to which the most recent log record was written. This single variable may become a serial bottleneck at some point as the number of processing nodes increases, or it may introduce significant complexity (such as a combining tree [19] implementation, for example). A simpler approach is to divide processing nodes into disjoint sets and perform cyclic distribution amongst the members of each set; that is, each set adheres to its own cyclic distribution discipline independently of the other sets. 
In this latter approach, each set has a separate variable that identifies the stream to which the most recent log record was written by any processing node in the set. The set size is restricted to some manageable limit, and the number of sets increases as a system's degree of concurrency increases. The superimposed loads from the different sets will still yield even load balancing amongst the log streams. Buffering may still be a problem, though. If the separate sets inadvertently "synchronize" so that they all send their records to the same stream at approximately the same time, one stream may receive a flood of records while other streams are relatively idle. 87 Chapter 5 Management of Precommitted Transactions 5.1 The Problem The problem of managing precommitted transactions and the transactions which depend on them becomes much more complicated in a highly concurrent database that has a collection of parallel log streams. The following example illustrates the crux of the problem. Suppose a transaction txl acquires a write lock on some object Obj5 in a database, updates the object and then requests to commit. Assume that the REDO DLR from txl's update to Obj5 is already on disk when txl requests to commit. In response to record for txl and adds the record to some log txl's request, the LM generates a COMMIT stream's current buffer in main memory. Recall that the LM does not write the buffer to disk right away. Rather, it waits for the arrival of enough log records from other transactions to fill up the buffer (or for a time limit to expire) and then writes the buffer to disk. Now suppose that some other transaction, tx2, reads the updated value of Obj5 record has been written to disk. Thus tx2 becomes dependent on before txl's COMMIT record for tx2 and txl. When tx2 later wants to commit, the LM must create a COMMIT record for tx2 goes to the same add it to some log stream's current buffer. If the COMMIT log record did, then there are no problems. Either these two log stream as tl's COMMIT log records belong to the same block of records, or tx2's record is subsequently written to disk in another block. Either way, it is impossible for the RM to find tx2's COMMIT record on disk, but not that of txl. 88 Now suppose that tx2's COMMIT log record is directed to a different stream than txl's record. It is possible that the buffer holding t2's COMMIT log record will fill up before the buffer to which tl's COMMIT record belongs. Let bufl and buf2 denote the buffers that hold the COMMIT records from txl and t2, respectively. If buf2 is written to disk but then a crash occurs before bufl can be written, the RM is faced with a problem. It finds a COMMIT record for t2 but not for tzl. How is it to know that tx2 depended on tl, so that its changes must be undone, despite the fact that its COMMIT record was written to disk? The problem to be addressed concernsthe management of precommitted transactions in a highly concurrent system which has many parallel log streams. The LM must ensure that the log on disk always contains sufficient information so that the RM can restore the database to a consistent state after a crash. If a crash occurs before a precommitted transaction commits, then the effects of this transaction and all transactions which depend on it must not be present in the restored database. 
5.2 Shortcomings of Previous Approaches

Other researchers [15] have previously suggested that the LM ensure that buffers are written to disk in an order that will never jeopardize the consistency of the database. Transactions' COMMIT log records do not contain any explicit information about dependencies amongst transactions, so the LM must not allow the COMMIT log record of a transaction to be written to disk before any of the COMMIT log records for earlier transactions on which it depends.

Consider the application of this approach to the situation described in the previous section. The COMMIT record for transaction tx1 is waiting in buffer buf1 and the COMMIT record for tx2, which depends on tx1, is waiting in buffer buf2 at a different log stream. The log manager must enforce a topological ordering amongst these buffers so that buf2 is written to disk after buf1.

This approach becomes awkward as more complex situations arise. For example, suppose that another transaction tx3 becomes dependent on tx2 and requests to commit before either buf1 or buf2 has been written to disk. If the COMMIT record for tx3 is added to buf1, then a dependency cycle now exists in the topological ordering amongst buffers. Buffer buf2 must be written after buf1 because of tx2's dependency on tx1, but buf1 must be written after buf2 because of tx3's dependency on tx2. Neither buffer can be written to disk before the other without risking a possibly inconsistent state after recovery. This situation is illustrated in Figure 5.1. The interdependencies amongst buffers at different log streams can be represented as a dependency graph. An arc from node x to node y in the dependency graph indicates that buffer x must be written after buffer y. In Figure 5.1, buf2 must be written after buf1 because of the dependency introduced by tx2, but buf1 must be written after buf2 because of the dependency introduced by tx3.

Figure 5.1: Deadlock in Dependency Graph for Buffers at Two Log Streams

One solution to this problem is to keep track of existing dependencies so that a cycle never forms. This leads to difficulties. In a large system with many parallel log streams, the maintenance of a dynamic dependency graph would entail prohibitive overhead. Before adding a COMMIT record to a stream's current buffer, the LM must traverse the graph to check that a cycle will not be created.

A static approach may involve less overhead. A static graph, defined at system initialization, specifies upon which other log streams a particular stream's buffers may depend. The graph is constructed so that no cycles can possibly occur. When a transaction's COMMIT record must be written, it is written to the stream which has the smallest set of allowed dependencies that includes all of the transaction's current dependencies. For any set of log streams, there must be a log stream which can have dependencies on all of them. This implies that the graph be a partial order with some unique bottom element. For example, the graph in Figure 5.2 is a suitable static partial ordering amongst seven log streams. For any set of nodes in the graph, there exists some node which is below all of them. Log stream L7 is the unique bottom element.

Figure 5.2: Static Dependency Graph of Log Streams

If a transaction has dependencies on buffers at log streams L1 and L2, then its COMMIT record will be sent to log stream L5. A transaction with dependencies on L2 and L3 must have its COMMIT record directed to L7.
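As an illustration of this selection rule (the thesis gives no code for it), streams can be represented as bit positions and each stream's allowed dependency set as a precomputed mask approximating the partial order of Figure 5.2; all names and the exact encoding here are hypothetical:

    #include <stdint.h>

    #define NUM_STREAMS 7   /* L1..L7, numbered here as indices 0..6 */

    static const uint32_t allowed_deps[NUM_STREAMS] = {
        0x00, 0x00, 0x00, 0x00,   /* L1-L4 may depend on no other stream */
        0x03,                     /* L5 may depend on L1 and L2 */
        0x0C,                     /* L6 may depend on L3 and L4 */
        0x7F                      /* L7, the bottom element, may depend on all */
    };

    /* Return the first (smallest) stream whose allowed dependency set
     * covers the transaction's current dependencies (given as a bitmask). */
    static int choose_commit_stream(uint32_t current_deps)
    {
        for (int i = 0; i < NUM_STREAMS; i++) {
            uint32_t elsewhere = current_deps & ~(1u << i); /* same-stream deps are harmless */
            if ((elsewhere & ~allowed_deps[i]) == 0)
                return i;
        }
        return NUM_STREAMS - 1;   /* unreachable: L7 covers every dependency set */
    }

With this encoding, dependencies on L1 and L2 select L5, and dependencies on L2 and L3 fall through to L7, matching the two examples in the text.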
Although this static ordering reduces the overhead of maintaining a dependency graph to avoid cycles, it can lead to other problems. It restricts the LM's options for a distribution policy. Log streams at the bottom of the graph will tend to receive a higher load, at least in terms of COMMIT records, than streams near the top; this imbalance could persist indefinitely.

This section has explained the drawbacks of dynamic and static solutions to the problem of maintaining dependencies amongst buffers at different log streams so that transactions' COMMIT records are written to disk in an order that respects their dependencies. Dynamic approaches have significant run-time overhead, and static approaches are prone to load imbalances.

5.3 Logged Commit Dependencies (LCD)

This section presents a new solution, called Logged Commit Dependencies (LCD), that is an appealing alternative to the ones described in the previous section. All considerations about dependency graphs are banished. The choice of a log stream to which a record is written is no longer limited by synchronization constraints.

LCD introduces a new type of TLR called a PRECOMMIT record. When a transaction requests to commit, the LM immediately generates a PRECOMMIT record which explicitly identifies all the transaction's unsatisfied dependencies at the time of the request. The LM can send the PRECOMMIT record to any log stream and can write it to disk at any time.

Recovery becomes more complicated, however. If the LM wrote PRECOMMIT records but not COMMIT records (which are no longer absolutely necessary), the RM might be forced to unravel a deep "tree" of transaction dependencies before it can conclude that a recent transaction actually committed. To make the RM's job easier, the LM also generates a COMMIT record for each transaction after all the transaction's dependencies have been satisfied. This COMMIT record is simply an indication to the RM that the transaction did indeed commit before the crash occurred, and so it need not bother to check all the dependencies listed in the transaction's PRECOMMIT record.

The LM maintains a monotonically increasing Log Sequence Number (LSN) for each log stream and associates a unique LSN value with every block of records that it writes to disk at the stream. The LM places a block's LSN at the beginning of the buffer which has been allocated for it. When the LM decides to write the buffer to disk, it increments the stream's LSN and puts the new LSN at the beginning of the new current buffer.

Each transaction's LTT entry has a field, called DLR_deps, which keeps track of the transaction's dependencies on unwritten REDO DLRs. This DLR_deps field holds a set of pairs, where each pair has a stream identifier and an LSN. Whenever a transaction writes a REDO DLR to the log, the LM notes the stream to which it was sent and the LSN for the buffer to which it was added; denote the stream as s and the LSN as n. The LM uses this information to update the transaction's LTT entry. If the transaction's DLR_deps field currently has no pair for stream s, it adds the pair <s,n> to the set. Otherwise, it updates the existing pair for s so that it holds the new LSN n instead of its previous value. Similarly, the LM also maintains corresponding information for each buffer at each log stream. The LM keeps track of the current LSN for a buffer and the transactions that have had log records added to the buffer.
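Before turning to what happens when a buffer reaches disk, the bookkeeping on the write path can be sketched as follows; the list representation and names are illustrative only (they are not the thesis's code), and error handling is omitted:

    #include <stdlib.h>

    struct dlr_dep {            /* one <stream, LSN> pair in a DLR_deps set */
        int stream;
        int lsn;
        struct dlr_dep *next;
    };

    /* Record that a REDO DLR was added to the buffer with LSN n at stream s:
     * keep at most one pair per stream, always holding the most recent LSN. */
    static struct dlr_dep *note_dlr(struct dlr_dep *deps, int s, int n)
    {
        for (struct dlr_dep *d = deps; d != NULL; d = d->next)
            if (d->stream == s) {
                d->lsn = n;     /* existing pair for s: just bump the LSN */
                return deps;
            }
        struct dlr_dep *d = malloc(sizeof *d);   /* no pair for s yet: add <s,n> */
        d->stream = s;
        d->lsn = n;
        d->next = deps;
        return d;
    }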
After a buffer has been written to disk, the LM uses this information to update the appropriate LTT entries. Suppose that transaction t had a log record in a buffer at stream s whose LSN is n. After this buffer has been written to disk, the LM retrieves the current pair <s,m> for stream s from the DLR_deps field in t's LTT entry and compares m to n. If m=n, then the LM removes the pair from the DLR_deps field (because t has not written any records to subsequent buffers at stream s); otherwise (i.e., m>n), it just leaves the current <s,m> pair in t's DLR_deps field.

Each transaction's LTT entry also has three more fields, called depends_on, is_depended_on_by and dep_tx_ctr. A transaction t's depends_on field holds the set of transaction identifiers for all precommitted transactions on which transaction t depends, and t's is_depended_on_by field holds the set of transaction identifiers for all subsequent transactions that depend on t. The dep_tx_ctr field is an integer-valued counter that keeps track of the number of transactions on which t depends while t is in a precommitted state.

The LM must remember the identity of the precommitted transaction, if any, that most recently updated each object. This information is kept in each object's LOT entry. Whenever a transaction reads or updates an object, it becomes dependent on the precommitted transaction, if any, that previously updated the object. The LM adds this dependency information to the respective depends_on and is_depended_on_by fields in the LTT entries of both transactions.

Each PRECOMMIT record contains the following three fields:

    txid:         identifier for the transaction that has requested to commit
    dlr_streams:  <stream id, LSN> pairs for all streams with unwritten DLRs
    precomm_txs:  transactions on which this transaction depends
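One possible layout for the PRECOMMIT record, inferred from these three fields, is sketched below. The concrete types and count fields are assumptions, not the thesis's definition; TXNID stands in for the 8-Byte transaction identifier used elsewhere in this thesis:

    typedef long long TXNID;          /* assumed 8-Byte transaction identifier */

    struct stream_lsn {
        int stream_id;                /* a stream holding unwritten REDO DLRs */
        int lsn;                      /* LSN that must reach disk */
    };

    struct precommit_record {
        TXNID txid;                   /* transaction that requested to commit */
        int n_dlr_streams;            /* number of entries in dlr_streams[] */
        struct stream_lsn *dlr_streams;
        int n_precomm_txs;            /* number of entries in precomm_txs[] */
        TXNID *precomm_txs;           /* precommitted txs this tx depends on */
    };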
Suppose a transaction t requests to commit. The LM determines which REDO DLRs (for updates by t) are still waiting to be written to disk and which precommitted transactions (on which t depends) are still waiting to commit by examining the DLR_deps and depends_on fields, respectively, in transaction t's LTT entry. Unless both these fields are empty, the LM puts the contents of both fields into their corresponding fields of the PRECOMMIT record for t and sends the PRECOMMIT record to some log stream. For each object b that t modified (as indicated by the contents of the objids set in t's LTT entry), the LM updates b's LOT entry to record the fact that its current value depends on precommitted transaction t and then the LM releases t's write lock on b. Finally, the LM counts the number of transactions listed in the depends_on field of t's LTT entry and assigns this value to t's dep_tx_ctr counter. If DLR_deps and depends_on are both empty when t requests to commit, the LM just goes ahead and generates a COMMIT record for t.

Similar to before, a transaction t does not actually commit until all its DLRs have been written to disk, all the precommitted transactions on which it depends have committed, and its PRECOMMIT record (or its COMMIT record, as explained below) has been written to disk. Without the LCD technique and with no explicit dependency information in the COMMIT log record, transaction t committed at the instant that its COMMIT record was written to disk (since all dependencies had to be satisfied before its COMMIT record could be written to disk). Now, the explicit information in the PRECOMMIT record allows the PRECOMMIT record to be written to disk before all dependencies on DLRs and earlier transactions have been satisfied. There may be some delay between the time that t's PRECOMMIT record is written to disk and the time that it commits.

The LM detects that a precommitted transaction t's last dependency (on either an unwritten DLR or a precommitted transaction) has been satisfied when both DLR_deps=∅ and dep_tx_ctr=0 become true for the transaction. When this happens, the LM immediately generates a COMMIT record for t and sends it to any log stream; the LM can go ahead and generate a COMMIT record for t even before t's PRECOMMIT record has been written to disk. The transaction commits as soon as either its PRECOMMIT record or COMMIT record has been written to disk (and DLR_deps=∅ and dep_tx_ctr=0).

When transaction t actually does commit, the LM sends an acknowledgement to t in response to its commit request, updates the LOT entries of all objects which t had modified, and updates the LTT entries for all transactions listed in the is_depended_on_by set in t's LTT entry. For each object that t modified (as indicated by the contents of the objids set in t's LTT entry), the LM first checks to see if the object still depends on t. If so, it changes the LOT entry to indicate that the object no longer depends on any precommitted transaction. Otherwise, the object must now depend on some subsequent precommitted transaction, and so the LM does not change the LOT entry. The LM processes all the members of the is_depended_on_by field in t's LTT entry immediately after t commits. For each transaction u in this set, the LM decrements the dep_tx_ctr counter in u's LTT entry.

The pseudocode for the LM's management of precommitted transactions, using the LCD technique, is given below.

    read_object(txid, objectid) {
        lot_entry <- LOT entry for object objectid
        ltt_entry <- LTT entry for transaction txid
        ptx <- lot_entry->precomtx
        if (ptx ≠ NULL) {
            pretx_ltt_entry <- LTT entry for transaction ptx
            pretx_ltt_entry->is_depended_on_by <-
                pretx_ltt_entry->is_depended_on_by ∪ {txid}
            ltt_entry->depends_on <- ltt_entry->depends_on ∪ {ptx}
        }
    }

    update_object(txid, objectid) {
        lot_entry <- LOT entry for object objectid
        ltt_entry <- LTT entry for transaction txid
        ptx <- lot_entry->precomtx
        if (ptx ≠ NULL) {
            pretx_ltt_entry <- LTT entry for transaction ptx
            pretx_ltt_entry->is_depended_on_by <-
                pretx_ltt_entry->is_depended_on_by ∪ {txid}
            ltt_entry->depends_on <- ltt_entry->depends_on ∪ {ptx}
        }
        <stream,lsn> <- write_log_record(txid, objectid)
        if (∃x s.t. <stream,x> ∈ ltt_entry->DLR_deps) {
            ltt_entry->DLR_deps <- ltt_entry->DLR_deps − {<stream,x>}
        }
        ltt_entry->DLR_deps <- ltt_entry->DLR_deps ∪ {<stream,lsn>}
    }
    request_to_commit(txid) {
        ltt_entry <- LTT entry for transaction txid
        ltt_entry->tx_status <- precommitted
        if ((ltt_entry->DLR_deps = ∅) AND (ltt_entry->depends_on = ∅)) {
            generate_commit_rec(txid)
        } else {
            generate_precomm_rec(txid, ltt_entry->DLR_deps, ltt_entry->depends_on)
        }
        for (b ∈ ltt_entry->objids) {
            lot_entry <- LOT entry for object b
            lot_entry->precomtx <- txid
            release write lock on object b
        }
        ltt_entry->dep_tx_ctr <- |ltt_entry->depends_on|
    }

    tx_committed(txid) {
        send acknowledgement of commit to txid
        ltt_entry <- LTT entry for transaction txid
        ltt_entry->tx_status <- committed
        for (b ∈ ltt_entry->objids) {
            lot_entry <- LOT entry for object b
            if (lot_entry->precomtx = txid) {
                lot_entry->precomtx <- NULL
            }
        }
        for (u ∈ ltt_entry->is_depended_on_by) {
            pretx_committed(u)
        }
    }

    pretx_committed(txid) {
        ltt_entry <- LTT entry for transaction txid
        ltt_entry->dep_tx_ctr <- ltt_entry->dep_tx_ctr − 1
        if ((ltt_entry->tx_status = precommitted) AND
            (ltt_entry->DLR_deps = ∅) AND (ltt_entry->dep_tx_ctr = 0)) {
            generate_commit_rec(txid)
            if (PRECOMMIT record from txid is already on disk) {
                tx_committed(txid)
            }
        }
    }

    buffer_written_to_disk(txid, stream, lsn) {
        ltt_entry <- LTT entry for transaction txid
        no_commit_yet <- (ltt_entry->DLR_deps ≠ ∅)
        ltt_entry->DLR_deps <- ltt_entry->DLR_deps − {<stream,lsn>}
        if ((ltt_entry->tx_status = precommitted) AND (no_commit_yet) AND
            (ltt_entry->DLR_deps = ∅) AND (ltt_entry->dep_tx_ctr = 0)) {
            generate_commit_rec(txid)
            if (PRECOMMIT record from txid is already on disk) {
                tx_committed(txid)
            }
        }
    }

    precommit_or_commit_record_written_to_disk(txid) {
        ltt_entry <- LTT entry for transaction txid
        if ((ltt_entry->tx_status = precommitted) AND
            (ltt_entry->DLR_deps = ∅) AND (ltt_entry->dep_tx_ctr = 0)) {
            tx_committed(txid)
        }
    }

The RM now has greater responsibilities. It may discover transaction t's PRECOMMIT record on disk, but it must do some detective work to figure out if t really did commit before the crash occurred. For every log stream listed in t's PRECOMMIT record, the RM must check that all records in the stream up to and including the LSN indicated in the PRECOMMIT record were written to disk prior to the crash. This is easily accomplished by inspecting the LSN numbers in all the blocks found in the log on disk. Likewise, for every indicated precommitted transaction, it must verify that the transaction did indeed commit before the crash. The RM keeps track of these dependencies by using two new fields, called depends_on and is_depended_on_by, that belong to each transaction's RTT entry. These fields are analogous to their counterparts in the LTT.
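As an illustration, this check can be sketched as follows. It reuses the hypothetical precommit_record layout sketched earlier in this section; max_lsn_on_disk and tx_known_committed are assumed stand-ins for the results of the RM's scan of the recovered log, not functions defined in this thesis:

    #include <stdbool.h>

    /* Assumed helpers: highest LSN found on disk for a stream, and whether
     * a transaction has been resolved (possibly recursively) as committed. */
    extern int  max_lsn_on_disk(int stream_id);
    extern bool tx_known_committed(TXNID tid);

    /* Did a transaction with only a PRECOMMIT record on disk really commit
     * before the crash? All listed DLR streams must have reached the
     * recorded LSNs, and all listed precommitted txs must have committed. */
    static bool precommit_satisfied(const struct precommit_record *pc)
    {
        for (int i = 0; i < pc->n_dlr_streams; i++)
            if (max_lsn_on_disk(pc->dlr_streams[i].stream_id)
                    < pc->dlr_streams[i].lsn)
                return false;
        for (int i = 0; i < pc->n_precomm_txs; i++)
            if (!tx_known_committed(pc->precomm_txs[i]))
                return false;
        return true;
    }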
Let rmax denote the maximum time required to fill a buffer and write it to disk. A transaction will commit (and generate a COMMIT record) within time rmax after it submits its commit request. Therefore, a COMMIT record exists in the log on disk for every transaction which precommitted at least 2·rmax prior to a crash. At worst, the RM must deduce the fates of only those transactions which precommitted in the last 2·rmax seconds prior to the crash. It is expected that this number of transactions will be small, compared to the total size of the log.

A transaction t's PRECOMMIT record explicitly lists all the streams on which t depends for its REDO DLRs to be written to disk. To reduce the size of the PRECOMMIT record, and thus save disk space and bandwidth, the LM can postpone generating a PRECOMMIT record for t until all its REDO DLRs have been written to disk. Of course, transaction t can still release all its write locks (after the DBMS has updated the objects' LOT entries to record their dependency on t) as soon as t requests to commit. After all t's REDO DLRs have been written to disk, the LM generates a PRECOMMIT record for t that lists all the precommitted transactions on which t still depends at the time the PRECOMMIT record is generated. Now, the RM may need to deduce the fates of transactions that precommitted as early as 3·rmax prior to a crash.

Now consider how to integrate LCD into the XEL technique. The LM continues to manage DLRs exactly the same as before, but has more complexity for the handling of PRECOMMIT and COMMIT records. Each PRECOMMIT record has a status of either required or recoverable. A PRECOMMIT record is initially required, and becomes recoverable after the LM has written a COMMIT record (for the same transaction) to disk. The LM must keep a required PRECOMMIT record in the log, but can throw away a recoverable PRECOMMIT record at its earliest convenience.

The LM must retain a transaction t's COMMIT record until all subsequent transactions which depend on t have had COMMIT records written to the log. Suppose some transaction u, which depends on t, commits. As soon as u's COMMIT record is on disk, the LM retrieves the contents of the depends_on field from u's LTT entry. For each member v of this field, the LM removes u from the is_depended_on_by field in v's LTT entry. Since u depended on t, the LM will remove u from the is_depended_on_by field in t's LTT entry. As soon as the LM detects is_depended_on_by=∅ in t's LTT entry, it concludes that no subsequent transactions require t's COMMIT record any longer. The LM changes the status of a transaction's COMMIT record to recoverable as soon as is_depended_on_by=∅ for the transaction, no recoverable UNDO DLRs remain from the transaction, and any remaining REDO DLRs have only recoverable status (these latter two conditions are determined by maintaining a counter in each transaction's LTT entry, as described in Section 2.8).

LCD can increase the maximum throughput for any particular object in the database, but this enhanced performance comes at a cost. LCD requires more storage for extra fields in each LOT, LTT and RTT entry; these extra fields maintain dependency information. It also increases the complexity and computational requirements of the LM and RM. Finally, the PRECOMMIT records consume disk space and bandwidth.

Chapter 6

Experimental Results

This chapter evaluates XEL and the three proposed distribution policies for parallel logging. Disk space, disk bandwidth, main memory requirements and recovery time are the evaluation criteria throughout the chapter. Section 6.1 describes the event-driven simulator by which the experiments were performed. It explains each of the input parameters, documents the fixed parameters, presents the definitions of XEL's data structures as expressed in the C programming language [36] and justifies the validity of the simulation model. Section 6.2 quantitatively evaluates and compares the performances of XEL and the FW technique for a single log stream as various application characteristics vary. Four sets of experiments consider separately the effects of: (1) the probability of long transactions, (2) the duration of long transactions, (3) the size of DLRs from long transactions and (4) the "data skew" which characterizes access patterns to objects in a database.
The results of this section demonstrate that XEL can significantly reduce the amount of disk space required for log information, compared to FW, although XEL requires significantly more main memory and may entail increased disk bandwidth for log information. Recovery is I/O bound, so recovery time is less for XEL than for FW. Section 6.3 quantitatively evaluates and compares the load balancing properties of the three oblivious distribution policies as the number of parallel log streams increases. Three separate sets of experiments consider the cases of low, moderate and high data skew, respectively. In these experiments, the log streams are managed by only the XEL technique (FW is no longer of interest in this section). This section's results indicate that all three policies yield approximately equal load balancing behavior for low data skew. The partitioned policy performs slightly worse than random and cyclic for moderate data skew, and it is much worse for high data skew. However, the partitioned policy generally consumes the least amount of main memory because it does not require indirection via the LOT and LTT data structures.

6.1 Simulation Environment

To quantitatively evaluate XEL and the three distribution policies for parallel logging, the author implemented an event-driven simulator. The simulator is written in C and runs on SPARCstations.

6.1.1 Input Parameters

The user can specify the following input parameters:

    timestamps:    flag to indicate if disk version of database keeps timestamps
    arrival rate:  rate at which transactions are initiated
    tx pdf:        statistical mix of transaction types
    object pdf:    statistical access pattern to objects
    flush rate:    rate for flushing updates to disk version of database
    generations:   number and sizes of generations
    recirculation: flag to turn recirculation on or off
    num streams:   number of parallel log streams
    distn policy:  distribution policy for parallel logging
    runtime:       duration of simulated time span
    recovery:      flag to request recovery after normal logging activity ends

The timestamps parameter specifies whether or not timestamps are assumed to exist in the disk version of the database. If timestamps exist, then the simulator uses the EL [35] algorithm; otherwise, it uses the more complicated (but more general) XEL algorithm. The simulator initiates transactions at regular intervals, according to the specified arrival rate (transactions per second).

The user specifies an arbitrary number of different transaction types and their probability distribution function (pdf). For each type of transaction, the user states the probability of occurrence, the duration of execution, the number of REDO DLRs written and the size of each DLR. Figure 6.1 graphically represents this transaction model for a transaction that generates N=2 REDO DLRs in a system with only one log stream. Whenever a new transaction must be initiated, the simulator randomly (according to the pdf) selects its type. After choosing its type, the simulator schedules when its REDO DLRs will be written. The DLRs are written at equally spaced intervals, with the last being written only some short time ε (equal to t3 − t2) prior to completion. Suppose that the transaction's lifetime (specified as part of its type) is T. It will finish execution and request to commit (at time t3) T seconds after it started. Its last DLR is written (at time t2) T − ε before it finishes, and each DLR is written (T − ε)/N after the preceding one, where N is the number of REDO DLRs written by a transaction of this type.

Figure 6.1: Simulation Transaction Model
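As a small illustration of this schedule (my names, not the simulator's code), the time of the j-th DLR write can be computed directly:

    #define EPS 0.001   /* ε = 1 ms, the fixed delay from last DLR to commit request */

    /* Time at which a transaction started at time 'start', with lifetime
     * t_life and n REDO DLRs, writes its j-th DLR (1 <= j <= n). The n
     * writes are equally spaced and the last one occurs EPS before the
     * commit request at start + t_life. */
    static double dlr_write_time(double start, double t_life, int n, int j)
    {
        return start + j * (t_life - EPS) / n;
    }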
After the LM has generated a COMMIT TLR for the transaction and sent it to a particular log stream, the transaction continues to wait for acknowledgement (at time t4) from the LM before it actually commits; this delay occurs because the LM waits until a buffer is almost full before writing it to disk at the tail of generation 0, and then there is some delay τDiskWrite for transferring the contents to disk.

Whenever a transaction writes a REDO DLR, the simulator randomly picks some oid, according to the access probabilities specified by the user and subject to the constraint that the oid has not already been chosen for an update by a transaction which is still active. The set from which an oid can be chosen consists of all integers from 0 up to NUMOBJECTS−1, where NUMOBJECTS is the total number of objects (a fixed value). The user breaks up this set of objects into several classes. For each class, the user specifies the probability of occurrence and the size as a proportion of the total number of objects. When a DLR is to be written, the simulator first chooses a class according to the specified object pdf and then randomly selects an available oid from within this class.

To control the rate at which the CM can flush updates, the user specifies some number of disk drives and the time required to write a block to any of these drives. There can be at most one request at a time for any particular drive. The user can increase the maximum rate at which updates are flushed by increasing the number of drives or decreasing the time to write a block to any drive. The NUMOBJECTS objects are striped evenly over these drives. That is, for D drives, object i is mapped to drive i mod D. Striping ensures that the objects within each class are distributed as evenly as possible across the different drives, so no drive is relatively overloaded by a large number of "hot spot" objects¹. Each updated object requires a separate disk write (i.e., the simulator assumes that there is negligible locality of updates within a disk block). Each disk drive attempts to service pending flush requests in a manner that minimizes access time. The simulator assumes that the difference between two objects' oids corresponds to their locality on disk. For the purpose of calculating the difference between two oids, the simulator assumes that the sequence of integers assigned to their disk drive wraps around.

¹ A "hot spot" object is one which is updated very frequently.

The user specifies the number of generations and the size (number of disk blocks) of each generation. The size of each disk block is fixed in the simulator. In some experiments, it is worthwhile to examine the LM's behavior without recirculation in the last generation, just to see the effect of simply segmenting the log. There is an input flag to specify whether recirculation in the last generation is turned on or off.
If recirculation is disabled and the LM cannot advance the tail of the last generation because it would overwrite a non-garbage log record at the head, then it refuses to accept any more incoming log records to the last generation; this tends to exert "back pressure" on younger generations.

The user can specify the number of log streams that are to operate in parallel. If more than one log stream is specified, then the user must also specify one of the three distribution policies. If a log stream refuses to accept a log record, the simulator kills the client transaction that submitted the request. A more realistic simulator would stall, rather than kill, the transaction. The current version of the simulator suffices because the experimental objective will be to determine the LM's resource requirements to support a particular load without needing to kill or stall transactions.

After simulating normal logging activity for the specified runtime, the simulator will also simulate recovery if the recovery flag has been set. Recovery uses the state of the log on disk in the condition which exists immediately after the specified runtime has elapsed. The simulator models only the first phase of recovery, in which the contents of the log streams are retrieved from disk and the most recently committed value in the log, if any, is determined for each object that had a DLR in the log. The second phase of recovery, in which the disk version of the database is updated with these values from the log, is not considered. In practice, this work can be performed in background after normal processing has resumed, so it is reasonable to ignore it when simulating recovery.

The current version of the simulator does not incorporate the LCD technique; all transactions retain their write locks until they commit. No experimental data are available for the performance of an LM which employs the LCD technique.

6.1.2 Fixed Parameters

Several parameters are fixed in the simulator. The delay between the write for a transaction's last REDO DLR and its request to commit is fixed at 1 ms. The capacity of each disk block is 2000 Bytes². The LM attempts to keep Nfree ≥ 3 blocks available in each generation to hold incoming log records. Four disk block buffers (2048 Bytes each) are provided for generation 0 of each stream. Each COMMIT TLR is assumed to require 8 Bytes. The simulator conservatively assumes a fixed delay of τDiskWrite = 15 ms to transfer a buffer's contents to disk when writing out records to the log. For each log stream, at most one disk write operation (to any generation) can be outstanding at any time. If several generations have buffers waiting to be written to them, the simulator gives older generations priority over generation 0 when it must schedule the next buffer to be written to disk. The simulator uses the group commit technique [5]; a log record is not written to disk until its buffer is as full as possible. Therefore, the delay between the time a record is added to a buffer and the time it is written to disk is generally longer than τDiskWrite. The number of objects in the database is fixed at NUMOBJECTS = 10⁷. Disk I/O from each log stream is entirely sequential during recovery, so the simulator assumes that only 5 ms is required to retrieve a block from disk when reading the log. When recovering the contents of a block, each DLR requires 100 μs to process and a TLR requires 40 μs.

² A block size of 2048 Bytes is typical, but the simulator assumes 48 Bytes are reserved for bookkeeping purposes and so only the remaining 2000 Bytes are available to hold log records.
These fixed parameters are summarized in the following table:

    Parameter                                                  Value
    ε = delay from last REDO DLR to commit request             1 ms
    Capacity of each disk block in log                         2000 Bytes
    Nfree = threshold number of free blocks per generation     3 blocks
    Number of buffers (for generation 0) per log stream        4 buffers
    Size of each COMMIT record                                 8 Bytes
    τDiskWrite = delay to write a block to the log             15 ms
    Maximum number of outstanding disk writes per stream       1 write
    NUMOBJECTS = number of objects in the database             10⁷ objects
    Delay to read a block from log during recovery             5 ms
    Time required to process a DLR during recovery             100 μs
    Time required to process a COMMIT TLR during recovery      40 μs

6.1.3 Data Structures

The following declarations define XEL's principal data structures for the most general situation of numerous parallel log streams and any distribution policy. They are expressed according to the syntax of the C programming language [36], in which the simulator itself is written. These data structures are intended for a distributed memory message passing parallel system architecture, rather than a shared memory model.

    struct str_lsm {                     /* one log stream id in a list */
        int streamid;                    /* identifier of a log stream */
        int minrecovtstamp;              /* needed for parallel XEL */
        struct str_lsm *next;            /* pointer to next cell in the list */
    };

    struct str_oid {                     /* one object id in a list of oids */
        OBJID oid;                       /* object identifier */
        struct str_oid *next;            /* pointer to next cell in the list */
    };

    struct str_ltt_entry {               /* LTT entry for a transaction */
        TXNID tid;                       /* id of transaction */
        TX_STATUS status;                /* current status of the transaction */
        int num_rqd_dlrs;                /* number of required DLRs remaining */
        int rec_str;                     /* log stream to which COMMIT written */
        struct str_pcg *set_cgs;         /* cmt grps on which tx depends */
        struct str_oid *objids;          /* objects modified by this tx */
        struct str_ltt_entry *next;      /* other txs in same hash bucket */
    };

    struct str_lot_entry {               /* LOT entry for an object */
        OBJID oid;                       /* id of object with records in log */
        int uncmtstamp;                  /* timestamp for most recent DLR */
        int commtstamp;                  /* tstamp of most recently committed DLR */
        struct str_lsm *lstms;           /* log streams with DLRs for object */
        struct str_lot_entry *next;      /* other objects in same hash bucket */
    };

    struct str_rel_cell {                /* cell to point to a relevant log record */
        TXNID tid;                       /* id of associated transaction */
        int blocknum;                    /* index of block in log to which rec belongs */
        int reclength;                   /* size of the log record (in Bytes) */
        R_STAT recstatus;                /* current status of log record */
        int tstamp;                      /* timestamp of update, if cell is for DLR */
        struct str_llot_entry *pobj;     /* parent object, if cell is for DLR */
        struct str_rel_cell *next;       /* more cells for same obj or tx */
        struct str_rel_cell *left;       /* left neighbor in doubly lkd list */
        struct str_rel_cell *right;      /* right neighbor in doubly lkd list */
    };

    struct str_llot_entry {              /* local LOT entry for an object */
        OBJID oid;                       /* id of object with DLRs in stream */
        struct str_rel_cell *cells;      /* list of cells for object's DLRs */
        struct str_llot_entry *next;     /* other objects in same hash bucket */
    };

    struct str_lltt_entry {              /* local LTT entry for a transaction */
        TXNID tid;                       /* id of tx with TLRs in log */
        struct str_rel_cell *cells;      /* list of cells for the tx's TLRs */
        struct str_lltt_entry *next;     /* next tx in the hash bucket list */
    };

    static struct str_lot_entry *lot_tbl[LOT_TBL_SIZE];   /* LOT hash tbl */
    static struct str_ltt_entry *ltt_tbl[LTT_TBL_SIZE];   /* LTT hash tbl */
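As a usage illustration (this helper is not part of the declarations above), an object's LOT entry would be located by hashing its oid into lot_tbl and walking the bucket's chain; the sketch assumes OBJID is an integral type and that a simple modulo hash is acceptable:

    #include <stddef.h>

    static struct str_lot_entry *lookup_lot(OBJID oid)
    {
        struct str_lot_entry *e = lot_tbl[oid % LOT_TBL_SIZE];  /* hash bucket */
        while (e != NULL && e->oid != oid)   /* walk the bucket's chain */
            e = e->next;
        return e;   /* NULL if the object currently has no records in the log */
    }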
The str_rel_cell type definition does not include a field that indicates a log record's type (TLR, REDO DLR or UNDO DLR). Such an extra field is unnecessary. The contents of the recstatus field identify both the type and the status of a record. For example, a required REDO DLR and a required UNDO DLR have different values in the recstatus fields of their cells.

The str_lsm and str_lot_entry structures go together on processing nodes that administer portions of the LOT. The str_oid and str_ltt_entry structures belong together on nodes that manage the LTT. The str_rel_cell, str_llot_entry and str_lltt_entry structures are used at nodes that manage the log streams.

Assume that each oid and tid requires 8 Bytes. Integers and pointers consume 4 Bytes each. Each field of type TX_STATUS or R_STAT requires 4 Bytes (these types are typedef'ed to int). The amount of storage required for each of these structures is summarized below.

    Structure name      Storage required (Bytes)
    str_lsm             12
    str_oid             12
    str_ltt_entry       32
    str_lot_entry       24
    str_rel_cell        40
    str_llot_entry      16
    str_lltt_entry      16

For the specific case of a single log stream or multiple log streams with the partitioned distribution strategy, indirection via the local LTT and local LOT is no longer necessary. The cells field from the str_lltt_entry structure replaces the rec_str field in the str_ltt_entry structure. Similarly, the cells field from the str_llot_entry structure replaces the lstms field in the str_lot_entry structure. The pobj pointer in str_rel_cell now points to an instance of the str_lot_entry type. The declarations of the str_lsm, str_llot_entry and str_lltt_entry structure types can be omitted. The storage requirements for each of the str_oid, str_ltt_entry, str_lot_entry and str_rel_cell structure types remain unchanged, despite these modifications³.

³ For the case of only a single log stream, the set_cgs field can be removed from the str_ltt_entry declaration, thus saving another 4 Bytes per transaction.

The simulator can also report the storage requirements for FW logging. The user must specify only a single log stream with a single generation, of course. The FW method requires a simpler version of the LTT. As before, each entry in this LTT corresponds to a particular transaction. It has fields for the transaction's identifier (8 Bytes), the current status of the transaction (4 Bytes), the number of DLRs still waiting to be flushed (4 Bytes), the position within the log of the first record written by the transaction (4 Bytes) and a pointer to the next entry in the LTT (4 Bytes). The FW method keeps track of a transaction until after it has committed and none of its updates need to be flushed to the disk version of the database. When a committed transaction no longer has any updates waiting to be flushed, the FW method removes it from the LTT. Therefore, the user should set the simulator's timestamps flag to true so that the EL algorithm is used⁴, even though the disk version of the database may not actually keep a timestamp with every object in the database.

⁴ The EL algorithm removes a committed transaction's LTT entry as soon as all its DLRs have become garbage.
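For illustration, the simpler FW entry described in the preceding paragraph might be declared as follows; the field names are assumptions (the thesis does not give this declaration), but the field sizes match those listed above, 24 Bytes per entry, and the types reuse those of the declarations in this section:

    struct fw_ltt_entry {              /* sketch of an LTT entry for FW logging */
        TXNID tid;                     /* transaction identifier (8 Bytes) */
        TX_STATUS status;              /* current status of the transaction (4 Bytes) */
        int num_unflushed_dlrs;        /* DLRs still waiting to be flushed (4 Bytes) */
        int first_rec_pos;             /* log position of tx's first record (4 Bytes) */
        struct fw_ltt_entry *next;     /* next entry in the LTT (4 Bytes) */
    };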
6.1.4 Validity of Simulation Model

The simulator provides sufficient flexibility to realistically evaluate various LM configurations for many different applications. It does not permit a user to precisely model every possible application, but it does allow a user to succinctly specify the characteristics for a broad range of applications. The simulator's inherent technological assumptions only approximate reality, yet they capture the important characteristics of the underlying technology while abstracting out many details that are largely irrelevant. Therefore, the simulator provides sufficient power to evaluate XEL and its parallel variants as important parameters vary.

The probabilistic transaction model statistically describes an application's static mixture of transactions. It is worthwhile to examine XEL's behavior as the relative lifetimes of different transaction types vary, and so the simulator provides this capability. The number and size of each transaction type's log records affect XEL's performance, and so a user can also vary these parameters. The probabilistic transaction model does not provide sufficient power to specify every possible application. For example, an application in which exactly every eighth transaction is 10.0 s long and the remaining transactions have duration 1.0 s cannot be modelled, nor can an application whose transactions do not write REDO DLRs at equally spaced intervals. Despite these shortcomings, enough different application transaction mixes can be specified to provide meaningful results.

Likewise, the deterministic arrival rate enables a user to control the system's overall throughput, but limits the ability to control precisely when each transaction is initiated. A Markov arrival pattern [6], for example, cannot be accurately modelled with the current version of the simulator. However, variations in the arrival pattern are not an important issue for the evaluation of XEL; the overall throughput is the important parameter.

The simulator is an open system. It does not incorporate feedback in its scheduling of the times at which transactions are initiated, DLRs are generated and commits are requested. If a transaction manager refrains from initiating new transactions before it has received commit acknowledgements for enough previous transactions, then the open system assumption is unrealistic. However, the open system assumption can be justified for some other applications. As long as the LM continues to accept incoming log records, there may be little reason why the DBMS would change the rate at which it initiates new transactions. Likewise, a client transaction never needs to wait for a reply after requesting the LM to write a REDO DLR to the log; at the time that the request is made, the LM already knows whether or not it has space available on disk for the record and it can respond to the client transaction immediately. A transaction's lifetime is therefore largely unaffected by the performance of the LM, as long as the LM is able to accept its log records. After a transaction requests to commit, it must wait for acknowledgement from the LM. For the most general case of more than one log stream, the LM must make sure that all a transaction's DLRs are on disk before it generates a COMMIT record for the transaction, and then there is some additional delay before the COMMIT record arrives on disk. As the LM becomes busier, queues form and a buffer of records may need to wait longer before being written to disk. Therefore, the length of time that a transaction must wait for acknowledgement to its request to commit depends on the LM's load, but this delay does not affect the times at which the transaction writes its DLRs and requests to commit.
The experimental objective will be to evaluate the LM's resource requirements such that the LM can accept all requests without needing to kill (or stall) any client transactions, so the open system assumption does not diminish the significance of the experimental results.

The probabilistic specification of data access patterns enables a user to model different collections of data objects in a database. These collections are characterized by the frequency with which transactions update their member objects. A user can model a wide range of different "data skews". Again, the statistical nature of this modelling is concise and simple but there are some applications whose exact data access patterns cannot be expressed in terms of this model. This inability to accurately model all possible applications does not prevent one from using the simulator to conduct experiments which illustrate important aspects of XEL's performance.

XEL's behavior is largely influenced by what happens as records approach the head of a generation, because only then does XEL decide whether or not to forward or recirculate a record. The arrival of subsequent records pushes a record toward the head of its generation so that XEL must decide its fate, but the exact times of arrival of these records are largely irrelevant. This observation supports the claim that the simulator's acknowledged inability to accurately model all possible applications' characteristics does not seriously limit its worth for the purposes of studying XEL.

The statistical specifications of transaction types and data access patterns are static, as is the arrival rate. In reality, an application's characteristics may vary over time. The simulator does not permit specification of dynamically varying parameters. Despite this shortcoming, meaningful experimental results may be obtained for important cases in which an application's parameters remain static. These results can provide valuable insights into XEL's behavior.

The simulator uses a simple model for disk I/O. A flush to the disk version of the database always requires the same duration (a parameter which the user specifies). In reality, this duration may vary from one write operation to the next. However, small fluctuations in the actual duration to flush an update will have only a minor effect on XEL's performance. The simulator provides only a first order estimate of the bandwidth required for log information. It assumes that each block write operation to the log requires the same amount of time, τDiskWrite. In reality, the time required to write a block of log records to disk depends on whether the I/O is sequential or random in nature. Successive writes to the tail of generation 0 are sequential and so they will generally have a short duration (such as 5 to 10 ms each). When the LM must occasionally write a block to the tail of generation 1, for example, the resulting disk I/O is random; it will tend to take significantly longer (such as 20 ms) because of seek and rotational delays. The effects of the random disk I/O to all generations except generation 0 can be minimized by choosing generation 0 to be sufficiently large so that only a small fraction of log records need to be forwarded. Section 7.3.3 explains an optimization for a system with several log streams so that most disk I/O to log disks is sequential.

The simple model of the CM assumes that no uncommitted updates are ever written out to the disk version of the database. Hence, UNDO DLRs are never needed. Economic trends justify this simplification. For many applications, the savings in disk I/O bandwidth outweigh the price of the extra main memory needed to buffer uncommitted updates. For example, suppose that each transaction modifies X Bytes of state, is Tl seconds long and begins Td seconds after the preceding transaction. The extra memory required to buffer all uncommitted updates from a continuous sequence of transactions is no more than X·Tl/Td Bytes⁵. If an UNDO DLR were written for each uncommitted update, the extra disk bandwidth would be X/Td Bytes/sec. The technique introduced in [25] permits a comparison of the relative costs for these two options. Let Cm represent the cost (in dollars per Byte) of main memory and Cd represent the cost of disk bandwidth (in dollars per Byte/sec). The cost of buffering uncommitted updates is Cm·X·Tl/Td, and the cost of writing UNDO DLRs is Cd·X/Td. Therefore, it is less expensive to buffer uncommitted updates in main memory if Tl < Cd/Cm. A typical disk drive costs at least $200 and provides a maximum I/O bandwidth of 2 MBytes/sec, so Cd = 10⁻⁴ $·sec/Byte. On the other hand, 1 MByte of DRAM costs approximately $20, so Cm = 20×10⁻⁶ $/Byte. Hence, buffering of uncommitted updates is better economically if Tl < 5 sec. For many applications, transactions have lifetimes shorter than 5 sec. As the price of DRAM continues to fall, relative to the price of disk bandwidth, Tl will continue to increase.

⁵ To be more precise, the upper limit is actually X·⌈Tl/Td⌉ for a first order analysis, but the approximation X·Tl/Td suffices.
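The break-even lifetime can be checked numerically; the following throwaway program (mine, not part of the simulator) simply evaluates Cd/Cm with the prices quoted above, which are 1994 values and purely illustrative:

    #include <stdio.h>

    int main(void)
    {
        double c_d = 200.0 / 2e6;   /* disk bandwidth: $200 drive, 2 MBytes/sec */
        double c_m = 20.0 / 1e6;    /* main memory: $20 per MByte */
        /* Buffering costs Cm*X*Tl/Td; UNDO DLRs cost Cd*X/Td, so buffering
         * wins whenever Tl < Cd/Cm. */
        printf("break-even transaction lifetime: %.1f s\n", c_d / c_m);
        return 0;
    }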
The simulator concerns itself with the management of log information on disk and the associated data structures which must reside in main memory. It does not account for computational requirements nor for interprocessor communication. To some extent, the consumption and management of these latter two resources depend on the specific system upon which the logging and recovery system is implemented, and so it is difficult to accurately account for them. Furthermore, computation and communication are relatively cheap and abundant in concurrent systems, and so they do not deserve nearly as much serious attention as disk I/O, which is a limited and relatively expensive resource.

6.2 Extended Ephemeral Logging for a Single Stream

This section presents the results of many experiments which were conducted to observe the behavior of XEL as applied to a single log stream and to understand the effects of varying different parameters. For comparison purposes, the traditional FW technique was simulated by specifying a single generation with no recirculation. The FW simulation did not involve any checkpointing activity; the firewall was always the oldest log record from the oldest transaction in the system. This omission favors FW because it ignores the overhead (in terms of disk space and bandwidth) associated with checkpointing.

There are several evaluation criteria. Disk space, disk bandwidth (in terms of block writes per second) and main memory requirements (for the LOT and LTT) are the main criteria for normal logging activity. Elapsed time is the primary criterion for recovery.

The following parameters are specified for all experiments, unless otherwise stated. There are two types of transactions. The first is of 1.0 s duration and writes 2 DLRs, each of size 100 Bytes. The second lasts 10.0 s, in which time it writes 4 DLRs of size 100 Bytes each. Their probabilities of occurrence are 0.95 and 0.05, respectively.
The arrival rate is 100 TPS. There is no data skew. That is, all NUMOBJECTS objects are equally likely to be chosen whenever an update is to be performed. To provide sufficient bandwidth for flushing updates, each experiment specifies 10 disk drives with a transfer time of 25 ms. The conservative 25 ms time allows for some read operations to be interspersed with writes. All tests of XEL use two generations. The minimum possible sizes for these generations are determined experimentally. Recirculation is disabled, so that it is possible to assess the effect of simply segmenting the log. There is only one log stream. The simulation time is 500 s; these results reflect the minimum disk space required to support 500 s of logging activity such that no transaction is killed.

6.2.1 Effect of Transaction Mix

For the first set of experiments, the probabilities of occurrence for the two transaction types are varied. The probability of the long transaction type increases from 0 to 1.0, while the probability of the short type decreases accordingly. Figure 6.2(a) plots the disk space requirements (number of blocks) versus the transaction mix for both FW and XEL. The corresponding graphs of disk bandwidth (to only the log), main memory requirements and recovery time are shown in Figures 6.2(b) to 6.2(d), respectively.

Figure 6.2: Performance Results for Varying Transaction Mix ((a) Disk Space; (b) Disk Bandwidth; (c) Main Memory; (d) Recovery Time)

XEL's advantages are most apparent for the 5% mix. It reduces disk space by a factor of 3.2 with only a 9.1% increase in bandwidth. XEL requires 13 times as much main memory as FW for the 5% mix, but this requirement is still modest in absolute terms; XEL needs only 57.5 KBytes of main memory. The time required to read in the log from disk dominates recovery time for both XEL and FW, so XEL offers much faster recovery.

As the probability of 10 s transactions increases, XEL's relative advantage over FW diminishes. The reductions in disk space and recovery time are not as large, but the increase in bandwidth is greater. As the probability of the long transaction type approaches 1.0, the rate at which objects are updated approaches the maximum rate at which updates can be flushed (400 updates/s). The resultant queueing delay causes DLRs to tend to remain unflushed longer, and so the length of a single FW log increases accordingly. In the case of XEL, many of the DLRs have had their updates flushed by the time the LM must decide whether or not to forward them to generation 1; most of these DLRs have recoverable status and need not be forwarded. Only a fraction of all log records have unflushed or required status as they approach the head of generation 0, and so only these records advance into generation 1. By throwing away the garbage records at the head of generation 0, XEL manages to use 44% less disk space than FW.
6.2.2 Effect of Transaction Duration

Figures 6.3(a)-(d) show the results as the duration of transactions of the long type increases from 10.0 s to 60.0 s in increments of 10.0 s.

Figure 6.3: Performance Results for Varying Long Transaction Duration ((a) Disk Space; (b) Disk Bandwidth; (c) Main Memory; (d) Recovery Time)

XEL's advantage over FW increases as the duration of the long transaction type lengthens. For a 60.0 s duration, XEL reduces the size of the log by a factor of 7.9 with only a 6.9% increase in disk bandwidth. Regardless of transaction duration, the average rate of arrival of log records (in steady state) is the same for both FW and XEL. At any given moment, FW must retain all log records that have been written since the first record of the oldest transaction, so the disk space required for FW is roughly proportional to the duration of the longest transaction. However, XEL is able to filter out most log records from short transactions at the head of generation 0, so XEL is largely unaffected by the duration of a small fraction of long transactions.

6.2.3 Effect of Size of Data Log Records

This set of experiments examines the effect of the size of DLRs from the long (10.0 s) transaction type. Each long-lived transaction still writes 4 DLRs, as before, but now the size of each long transaction's DLRs varies from 100 Bytes up to 500 Bytes in increments of 100 Bytes. Figures 6.4(a)-(d) present the results.

Figure 6.4: Performance Results for Varying Size of DLRs from Long Transaction Type ((a) Disk Space; (b) Disk Bandwidth; (c) Main Memory; (d) Recovery Time)

As the size of DLRs from long-lived transactions increases, XEL suffers more than FW.
The proportion of log information, measured in Bytes, that must be forwarded to generation 1 increases with the size of DLRs from long transactions, so disk space and bandwidth for generation 1 both increase. Furthermore, the bandwidth for generation 0 increases, so records tend to move from tail to head faster. Most records from short transactions are thrown away at the head of generation 0, so their faster movement through generation 0 means that the LM does not need to keep track of them for as long a period of time. This tends to decrease the overall main memory requirements for XEL.

6.2.4 Effect of Data Skew

This section examines the effect of data skew on the performance of XEL for a single log stream. Suppose that there is some subset H of the set of all objects and that the members of H receive a disproportionately large number of updates; these objects are "hot spot" objects because they are updated much more frequently than other objects. Let x, 0<x<1, be the ratio of the size of H to the total number of objects. Suppose further that when a transaction must choose an object to update, the probability that it chooses a member of H is 1−x. This simple definition provides a single parameter, x, which represents an application's data skew. By varying x appropriately, it is possible to control the amount of skew in the pattern of updates to data by an application's transactions. In this section's experiments, x ranges from 5×10⁻⁵ up to 0.5. In the case of x=5×10⁻⁵, almost all the updates affect a set of only 500 objects. Such extremely skewed distributions characterize databases with "hot spot" objects. When x=0.5, all objects are updated equally often, on average. Figures 6.5(a)-(d) present the results.

Figure 6.5: Performance Results for Varying Data Skew ((a) Disk Space; (b) Disk Bandwidth; (c) Main Memory; (d) Recovery Time)

Data skew has only a minor effect on XEL. Even for the most highly skewed distribution, XEL requires only 48% more disk space and 5.5% more bandwidth than it does for a completely unskewed distribution.

6.3 Parallel Logging

This section compares the performances of the three distribution policies for three different data skew specifications as the number of log streams increases. All experiments in this section use the XEL method for disk space management within each log stream. Let l be the number of log streams. The experiments in this section examine l=2^k for 0≤k≤6. Refer to Section 6.2.4 for a definition of the data skew parameter, x. The experiments in this section consider three cases of data skew: x=0.5 (no skew), x=0.01 (moderate skew) and x=2×10⁻⁴ (high skew). In the case of x=2×10⁻⁴, 99.98% of the updates, on average, affect a set of only 2,000 objects.
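For concreteness, the update pattern behind these skew settings can be sketched as a two-step draw, following the model of Section 6.2.4. This is an illustration rather than the simulator's code; it assumes the POSIX drand48 generator and ignores the constraint that an oid may not be chosen twice by concurrently active transactions:

    #include <stdlib.h>

    #define NUM_OBJECTS 10000000L   /* NUMOBJECTS = 10^7, as fixed in the simulator */

    /* Pick an oid to update, given skew parameter x (0 < x <= 0.5): with
     * probability 1-x the update hits the hot set H, a fraction x of all
     * objects; otherwise it hits one of the remaining objects. */
    static long pick_oid(double x)
    {
        long hot = (long)(x * NUM_OBJECTS);   /* |H|, e.g. 2,000 for x = 2e-4 */
        if (drand48() < 1.0 - x)
            return (long)(drand48() * hot);                    /* a hot spot object */
        return hot + (long)(drand48() * (NUM_OBJECTS - hot));  /* any other object */
    }

At x=0.5 the two classes have equal size and equal probability, so the draw degenerates to a uniform choice, matching the "no skew" case.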
For the case of maximum data skew and the greatest number of log streams, the average time between consecutive updates to a hot spot object becomes so short that the transaction types defined for the previous sets of experiments (which involved only a single log stream) are no longer feasible. It is necessary to define new transaction types which hold their write locks on objects for much shorter durations. All experiments in this section specify the following two types of transactions. The first is of 0.1 s duration and writes 2 DLRs, each of size 250 Bytes. The second lasts 2.0 s, in which time it also writes 2 DLRs of size 250 Bytes each. Their relative probabilities of occurrence are 0.99 and 0.01, respectively.

To provide sufficient bandwidth for flushing updates, each experiment specifies 10×l disk drives with a transfer time of 25 ms each. The conservative 25 ms figure allows for some read operations to be interspersed with the writes. The arrival rate is 100×l TPS and the simulation time is 500 s. All tests use two generations. Recirculation is enabled for l≥2 so that race conditions will not stall any of the log streams.⁶ For each skew setting, the sizes of the two generations are chosen so that disk space is minimized for l=1 (with recirculation disabled), subject to the constraint that no transaction is killed. For l≥2 and x=0.5 or x=0.01, the sizes of generations 0 and 1 are both doubled. For l≥2 and x=2×10⁻⁴, the size of generation 0 is quadrupled and the size of generation 1 is doubled; the larger increase in generation 0 for x=2×10⁻⁴ was chosen because it was found to yield substantially lower bandwidth requirements for generation 1.

The evaluation criteria are disk bandwidth and main memory for normal logging activity, and elapsed time for recovery. When considering bandwidth, the results reflect the maximum total bandwidth (both generations) for any particular stream.

⁶ To appreciate the subtlety of potential race conditions, suppose that recirculation in the last generation is disabled, and imagine that two different transactions update the same object in quick succession and write their DLRs to different streams. Until the first REDO DLR becomes non-recoverable, the second DLR must remain either unflushed or required (assuming that no other transaction subsequently updates the object and commits). If the second DLR's stream is "faster" than the stream of the first DLR, it may reach the head of the last generation of its log stream before the first DLR gets to the end of its own stream. The fast stream will be forced to stall until the slow stream has caught up. Similar situations can also be imagined (such as for a DLR and the COMMIT TLR on which it depends). These interstream dependencies can cause "fast" streams to synchronize with "slow" streams, which is generally undesirable for performance reasons.

6.3.1 No Skew

The results for the case of no skew (x=0.5) are presented in Figure 6.6. The sizes of generations 0 and 1 are 10 and 5 blocks, respectively, for l=1; they are 20 and 10 blocks, respectively, for l≥2. Recovery time is dominated by the delay required to read in the contents of the log from disk. The elapsed time for recovery is 76 ms for l=1 and 151 ms for l≥2, regardless of the distribution policy.
[Figure 6.6: Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.5). Panels: (a) Maximum Disk Bandwidth and (b) Main Memory, each plotted against the number of log streams for the partitioned, random and cyclic distribution policies.]

All three distribution policies require approximately the same maximum bandwidth, and the maximum bandwidth remains practically constant as the number of log streams increases. Refer to Section 6.3.4 for an explanation of why the partitioned distribution policy requires less main memory than the other two policies in all experiments.

6.3.2 Moderate Skew

Figure 6.7 shows the results for the moderate skew (x=0.01) case. The sizes of generations 0 and 1 are 10 and 5 blocks, respectively, for l=1; they are 20 and 10 blocks, respectively, for l≥2. The elapsed time for recovery is 76 ms for l=1 and 151 ms for l≥2, regardless of the distribution policy.

[Figure 6.7: Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.01). Panels: (a) Maximum Disk Bandwidth and (b) Main Memory, each plotted against the number of log streams for the three distribution policies.]

For all three policies, the maximum required bandwidth increases slightly with the number of log streams. This behavior can be attributed to decreasing intervals between successive updates to each object. The transaction arrival rate increases in proportion to the number of log streams but the number of hot spot objects remains constant, so the average duration between successive updates to any particular object is inversely proportional to the number of log streams. At this skew setting, a REDO DLR becomes more likely to have a required status when the LM must decide whether or not to forward it from generation 0 to generation 1, because of the lingering presence of a recoverable DLR from a prior update to the same object. In terms of maximum bandwidth, the partitioned policy performs slightly worse than the other two because of static skew effects.

The non-linear slope for main memory requirements can be explained similarly. The COMMIT records from transactions are more likely to be forwarded into generation 1 because required REDO DLRs depend on them. Therefore, COMMIT records tend to survive longer and the LM must continue to pay attention to all associated DLRs; previously, the LM would have thrown away many of these COMMIT records at the head of generation 0 instead of forwarding them, and any remaining REDO DLRs would instantly have become non-recoverable.
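The forwarding decision that drives these bandwidth effects can be sketched as follows. This is a simplified, plausible reading of XEL's behavior: the record statuses are the thesis's terminology, but all method and field names here are assumed for illustration.

    def process_head_record(rec, gen, lm):
        # Called when `rec` reaches the head of generation `gen`'s FIFO queue.
        status = lm.status_of(rec)       # 'required', 'unflushed',
                                         # 'recoverable' or 'non-recoverable'
        if status in ('required', 'unflushed'):
            if gen.is_last:
                gen.recirculate(rec)     # re-append: cannot be dropped yet
            else:
                gen.next.append(rec)     # forward to the next generation's tail
        else:
            gen.discard(rec)             # thrown away; its space is reclaimed

Under moderate skew, more REDO DLRs are still required when they reach the head of generation 0, so the forwarding branch is taken more often and generation 1 consumes more bandwidth.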
6.3.3 High Skew

Figure 6.8 shows the results for the case of high skew (x=2×10⁻⁴). The sizes of generations 0 and 1 are 10 and 5 blocks, respectively, for l=1; they are 40 and 10 blocks, respectively, for l≥2. The elapsed time for recovery is 76 ms for l=1 and 252 ms for l≥2, regardless of the distribution policy.

[Figure 6.8: Disk Bandwidth and Memory Requirements vs. Parallelism (x=2×10⁻⁴). Panels: (a) Maximum Disk Bandwidth and (b) Main Memory, each plotted against the number of log streams for the three distribution policies.]

It is interesting to note that the bandwidth curves for the cyclic and random policies slope up and then down as the number of log streams increases. The initial increases have a similar explanation to the one given for the case of moderate skew: a REDO DLR becomes more likely to be forwarded to generation 1 as the mean time between updates decreases. As the throughput increases even further, the average time between consecutive updates to an object becomes so short that a DLR is more likely to be only recoverable by the time the LM must decide whether or not to forward it to generation 1, because another transaction has already updated the same object and committed.

6.3.4 Discussion

Regardless of skew, the partitioned distribution policy requires less main memory than either cyclic or random because it takes advantage of the fact that all DLRs for a particular object are directed to the same log stream; the object's LOT entry can be located at the same processor node as the one that manages the cells for its DLRs, and it can point directly to these cells. With the cyclic and random policies, an object's LOT entry no longer points directly to the cells for its DLRs. Instead, it indicates the streams where there are relevant DLRs for the object; local LOT tables at those streams point directly to the corresponding cells. This indirection entails higher main memory requirements. In terms of disk bandwidth, all three distribution policies are approximately the same for low data skew. For high data skew, partitioned suffers noticeably from static skew effects, so it requires significantly more disk bandwidth than the other two. The random and cyclic strategies both exhibit approximately the same behavior for all data skews.

Chapter 7

Conclusions

7.1 Lessons Learned

This thesis has proposed and evaluated a new variation of logging and recovery that is well suited to highly concurrent database systems. Extended ephemeral logging (XEL), a new disk management method that does not require periodic checkpoint operations, is the cornerstone upon which the rest of the logging and recovery system is built. An application's bandwidth requirements for log information may demand a collection of log streams which work in parallel. Each log stream resides on a single disk drive or possibly a small set of drives. XEL manages the disk space within each log stream. Chapter 3 proved important safety and liveness properties for a simplified version of XEL that was expressed in terms of the I/O automata model [42], thereby imparting confidence in the correctness of XEL. XEL's performance was experimentally evaluated in Chapter 6, using an event-driven simulator. XEL was compared to the traditional "firewall" (FW) disk management method for a single stream of log records, and the experimental results suggest that XEL's advantage over FW increases under any of the following conditions:

* The lifetime of an application's longest transaction type increases.
* The amount of log information written by an application's longest transaction type decreases.
* The probability of occurrence of an application's longest transaction type decreases (but remains non-zero).

When a system has multiple log streams, XEL can accommodate any distribution policy, but the partitioned policy generally requires less main memory space.
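For reference, the three oblivious policies compared in Chapter 6 can be sketched as follows. This is a minimal sketch: the class and method names are mine, and the hashing detail of the partitioned policy is an assumption (Chapter 4 defines the policies themselves).

    import random

    class Distributor:
        def __init__(self, num_streams):
            self.l = num_streams
            self._next = 0

        def partitioned(self, object_id):
            # Every record for a given object goes to the same stream.
            return object_id % self.l

        def cyclic(self, object_id):
            # Strict round-robin over streams, ignoring the object.
            s = self._next
            self._next = (self._next + 1) % self.l
            return s

        def random_choice(self, object_id):
            # A uniformly random stream, ignoring the object.
            return random.randrange(self.l)

All three are oblivious because none of them consults the current load of any stream before choosing.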
Unless an application updates a relatively small collection of objects much more frequently than other objects in the database, all three oblivious distribution policies which Chapter 4 considered yield approximately equal loads on all log streams. However, the random and cyclic distribution policies lead to better load balancing than the partitioned policy if an application has a small collection of "hot spot" objects.

Log streams need not be especially large, in terms of storage space, when a system's LM uses XEL to manage the disk space allocated for log information. Small log streams and large main memories enable much faster recovery after a crash; a database system's recovery manager (RM) can sequentially retrieve a log stream's contents from disk and process them in a single pass. When multiple log streams exist, the RM processes them all independently in parallel.

7.2 Importance of Results

This thesis widens the options available to DBMS designers. The firewall (FW) method's abstraction of a single FIFO queue for log information is inappropriate in some circumstances, and in such circumstances a DBMS designer may prefer to use the XEL method instead. If a small fraction of transactions have relatively long lifetimes, XEL retains the necessary log information from these long transactions but reclaims the disk space occupied by log records from much shorter transactions. Therefore, XEL can significantly reduce the size of the log for some applications. The primary benefit of a much smaller log is faster recovery after a crash. A smaller log may also decrease a system's cost.

Checkpoints are no longer a necessity with XEL. This eliminates the overhead (in terms of computation, communication, disk bandwidth and disk space) and complexity that accompany any disk management method which involves checkpoints (e.g., the FW method). This advantage is especially welcome in highly concurrent systems that have an arbitrarily large number of log streams and an arbitrarily large number of disk drives on which the disk version of the database is kept; coordination of periodic checkpoints in such a highly concurrent setting becomes cumbersome.

An arbitrarily large collection of disk drives provides the necessary bandwidth for log information. As the size of this collection grows, the single FIFO queue abstraction becomes increasingly awkward to implement. A more convenient abstraction is to view the log as a collection of log streams that operate largely independently of each other. This abstraction can be implemented efficiently if XEL is used to manage the disk space within each log stream. The simulation results presented in this thesis quantitatively demonstrate XEL's effectiveness for a wide variety of applications and illustrate its strengths and weaknesses compared to the traditional FW method.

7.3 Extensions

7.3.1 Non-volatile Region of Main Memory

Previous authors [15, 13, 39, 9, 53] have proposed system designs in which some (but not all) of main memory is non-volatile. Battery backup for some portions of RAM ensures that their contents will not be lost if the regular power supply is interrupted. In such a system, parallel XEL can greatly reduce the disk bandwidth required for log information, so that far fewer disk drives are needed for the log. The youngest generation (generation 0) of each log stream can be kept in non-volatile main memory; the LM writes log records to disk only when space in non-volatile main memory has been exhausted.
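A minimal sketch of this arrangement follows. The record and generation interfaces are assumed for illustration; the point is simply that only the spill path touches a disk.

    from collections import deque

    class NonVolatileGen0:
        # Generation 0 kept in battery-backed RAM (Section 7.3.1 sketch).
        def __init__(self, capacity_bytes, disk_gen1):
            self.capacity = capacity_bytes
            self.used = 0
            self.fifo = deque()          # log records resident in NVRAM
            self.disk_gen1 = disk_gen1   # next generation, on disk

        def append(self, record, lm):
            assert record.size <= self.capacity
            # Evict from the head until the new record fits; only records
            # still needed for recovery cost a disk write.
            while self.used + record.size > self.capacity:
                head = self.fifo.popleft()
                self.used -= head.size
                if lm.still_needed(head):
                    self.disk_gen1.append(head)   # spill to disk
                # otherwise the record is simply dropped from NVRAM
            self.fifo.append(record)
            self.used += record.size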
7.3.2 Log-Only Disk Management

Suppose a computer's main memory provides sufficient capacity to hold all the objects of a database, and that applications update most of these objects quite frequently. In such a setting, a separate disk version of the database is superfluous. The most recently committed value of each object can be kept only in the log. A few small changes to the XEL algorithm yield a variant that meets the needs of this log-only situation. Without a disk version of the database, UNDO DLRs are no longer needed. The status of each REDO DLR is either required or not-required; a REDO DLR has status required if the transaction which performed the update is still in progress or if the DLR is for the most recently committed update to the object. Although older REDO DLRs for the same object may still be recoverable, the LM does not need to keep track of them because it will never erase the DLR for the most recently committed update to the object; hence, these older REDO DLRs all have status not-required. Likewise, a COMMIT record can have a status of either required or not-required; a COMMIT record is required if and only if at least one DLR from the transaction is still required. The LM keeps track of only those COMMIT records which are required. (A small sketch of these status rules appears after Section 7.3.3 below.)

This log-only variant of XEL offers several advantages. First, it eliminates the expense and complexity that arise from managing a separate disk version of the database. Second, the LM no longer needs to keep track of old, stale records in the log, and this implies lower main memory storage requirements for the LOT and LTT. Furthermore, the fact that the LM keeps track of only required records means that the status field can be eliminated from each cell, thus yielding further reductions in main memory requirements.

7.3.3 Multiplexing of Log Streams

One possible drawback of XEL is that it may cause write operations to non-sequential positions on disk. Movement between generations within a stream introduces random disk I/O, which is generally much less efficient than sequential I/O. For example, suppose a log stream has two generations. When the LM must occasionally write a block of forwarded log records to the tail of generation 1, it must seek to this track's location on disk and wait for the block's location to rotate under the disk head. When the write to generation 1 has finished, the LM returns to the tail of generation 0, where it can resume writing blocks to generation 0 in sequential order. A LM with a sufficiently large collection of log disk drives can alleviate the need for occasional non-sequential accesses to the log by multiplexing older generations from different streams, as illustrated in Figure 7.1 for a LM whose streams have two generations.

[Figure 7.1: Multiplexing of Older Generations. New log records stream sequentially to a separate disk for each log stream's generation 0 (disks 1 to 3); records forwarded to generation 1 from all streams go to a physically separate shared disk (disk 4).]

With this configuration, the LM can exploit completely sequential disk I/O when writing log information to generation 0 of any particular log stream. When the LM must forward log records to generation 1, it writes them to a physically different disk drive (or set of drives), so no random I/O is required. This technique of multiplexing older generations from different streams permits quick reclamation of the disk space allocated to log records from short transactions, but does not suffer from any performance degradation due to random disk I/O to the log disks.
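Returning to the log-only variant of Section 7.3.2, its status rules reduce to two small predicates. In this sketch, the table arguments (the set of in-progress transactions and the map from each object to the DLR of its most recently committed update) are assumed bookkeeping structures, not the thesis's data structures:

    def redo_dlr_required(dlr, txns_in_progress, latest_committed_dlr):
        # Required iff the writing transaction is still in progress, or this
        # DLR records the most recently committed update to its object.
        if dlr.txn_id in txns_in_progress:
            return True
        return latest_committed_dlr.get(dlr.object_id) is dlr

    def commit_required(dlrs_of_txn, txns_in_progress, latest_committed_dlr):
        # A COMMIT record is required iff at least one of its transaction's
        # DLRs is still required.
        return any(redo_dlr_required(d, txns_in_progress, latest_committed_dlr)
                   for d in dlrs_of_txn)

Because the LM tracks only required records in this variant, every tracked cell's status is implicitly "required", which is why the status field can be dropped.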
7.3.4 Choice of Generation Sizes

For a particular application, how many generations should each log stream have, and what should be the sizes of these generations? Currently, no analytical methods are available to answer these questions. The experiments reported in Chapter 6 relied on simulation to determine the optimal configuration for each particular case. Therefore, formulation of an accurate analytic model to determine the optimal number of generations and their sizes for any particular application remains a challenging open problem. The characteristics of an application may vary over time, so the design of an enhanced version of XEL that can adaptively alter its parameters in response to changing conditions is another important open problem.

7.3.5 Fault Tolerance to Isolated Media Failures

This thesis has concentrated on fault tolerance to system failures (crashes), in which the contents of main memory are lost but all information on non-volatile disk storage remains intact. To tolerate media failures, in which information on disk is lost, a DBMS must exploit redundancy. RAID [52, 51] and Parity Striping [23] have both recently been proposed as solutions to the problem of isolated media failures. A parallel implementation of XEL can be combined with RAID with little difficulty. Disk I/O for log information is characterized by sequential transfers of large blocks of information, so level 3 RAID is most appropriate; a group of disk drives, one of which is used only for parity information, constitutes each log stream. The maximum bandwidth per stream is now higher (compared to the simplistic situation of only one disk drive per log stream), so fewer streams are required. In contrast, I/O for the disk version of the database is characterized by random requests for small pieces of data, so level 5 RAID or Parity Striping would be the best choice for it. The differing requirements of the LM and CM provide a good example of a situation in which it is advantageous to employ two different levels of RAID in the same system.

RAID systems can be designed to provide very high reliability. For example, a 1000-disk level 5 RAID with a group size of 10 and a few standby spares could have a calculated MTTF (mean time to failure) of over 45 years [52]. Isolated media failures pose very little threat under such circumstances; other components of the system are likely to fail before an irrecoverable media failure happens. Nevertheless, the mere threat of an irrecoverable media failure may warrant attempts at further fault tolerance. A familiar solution (see [5], for example) is to periodically dump the database's current state to an archive version and to maintain an archive log of all transactions which have executed since the archive version was dumped. The archive version of the database and the archive log may be kept on tape rather than on disk. This thesis has not addressed the problem of maintaining an archive log for a database which requires very high bandwidth for log information.

Another option, which is already practiced by some large commercial users of databases, is to run two duplicate database systems at geographically separate sites, so that a natural or man-made disaster at one site does not wipe out all data. This option may be expensive, because it requires full duplication, but it does provide good fault tolerance. In the event of a media failure at one site, the other site still offers an up-to-date version of the database. The likelihood of simultaneous failures at both sites ought to be very low. The provision of more efficient means to ensure fault tolerance to irrecoverable media failures remains a challenging open problem.
7.3.6 Fault Tolerance to Partial Failures

This thesis has conveniently assumed that the concurrent system on which the DBMS executes is either completely up or completely down. In practice, some machines may be able to continue operation at some processing nodes despite partial failures at other nodes elsewhere within the same machine. When the system hardware allows such "graceful degradation", a DBMS designer might wish to structure the DBMS software so that it, too, degrades gracefully. One obvious way to provide fault tolerance to partial failures is to duplicate all of an application's data structures and code. For example, all LOT entries at a particular node would be duplicated at some other node elsewhere in the machine; any update to one copy of these LOT entries must be applied to the other copy as well. Refer to [4] for an example of a system which exploits software redundancy to tolerate partial failures.

7.3.7 Support for Transition Logging

This thesis has always assumed physical state logging at the access path level. Some DBMS designers may prefer other styles of logging. For example, the ARIES method for logging and recovery [45] uses transition logging [30] and incorporates compensation log records (CLRs) to undo the effects of previous updates to objects. If a LM performs transition logging for an object and a transaction updates the object, the resulting log record indicates only the operation that transformed the object's old value into its new value; neither the old state nor the new state of the object is represented in the log. In this context, what is the definition of a garbage log record, and what is a relevant log record? How should the LM manage the LOT and LTT?

When a transaction performs an operation on an object, the resulting log record shall be called a FWD record (since it refers to an operation in the "forward" direction in time). When the LM undoes an operation, the resulting record is a CLR, as mentioned above. In general, an operation described in a log record is not idempotent. If the RM must undo an update by a transaction, it must first be sure that the version of the object in the disk version of the database actually incorporates the transaction's original update, lest an unwarranted undo action put the object into an incorrect state. One way to synchronize the log and the disk version of the database is to keep a timestamp with every object upon which the LM will perform transition logging. Suppose this timestamp is an integer-valued counter (initially 0) and the LM increments the counter every time it changes the object. Whenever the CM flushes an updated object to the disk version of the database, the object's current timestamp accompanies it and resides with it in the disk version of the database. Whenever the LM modifies an object (in either the "forward" or "backward" direction), it increments the timestamp and stores the new timestamp value in the associated log record. When the RM must restore an object to its most recently committed pre-crash state, it knows that the version of the object in the disk version of the database already incorporates the effects of all operations whose log records have timestamps less than or equal to the timestamp found with the object in the disk version of the database. The RM must redo only those operations which are described in subsequent log records from committed transactions. Similarly, the RM must undo any operations that were performed by transactions which aborted or were interrupted by the crash; some of these operations may temporally precede the current version of the object in the disk version of the database.

The LM must retain all FWD and CLR log records for operations that temporally follow the current version of the object in the disk version of the database. It must also retain all FWD log records from uncommitted transactions that do not have corresponding CLRs, regardless of whether these records temporally precede or follow the current version of the object in the disk version of the database. If a transaction commits, the LM must retain its COMMIT record until all FWD log records from the transaction have been overwritten; otherwise, the RM could find a FWD log record which appears to be from an uncommitted transaction, and since the log would contain no subsequent CLR, the RM would wrongly undo the operation described in the FWD record. If the LM writes a FWD log record and later writes a CLR to undo the operation, it must retain the CLR in the log until the FWD record has been overwritten; this ensures that the RM does not undo an operation which has already been undone. These considerations dictate when COMMIT, FWD and CLR records become garbage. The LOT and LTT can track the positions and status values of FWD and CLR records just as they did for the other types of log records.
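The timestamp test above can be captured in a short sketch. This is a simplification (CLR application ordering is elided, and every interface here is assumed rather than taken from the thesis):

    def recover_object(obj_id, disk_db, records, committed):
        # `records` holds this object's FWD/CLR log records; `committed` is
        # the set of transaction ids with a COMMIT record in the log.
        value, disk_ts = disk_db.read(obj_id)   # state plus its timestamp
        # Redo committed operations newer than the disk version, oldest first.
        for rec in sorted(records, key=lambda r: r.timestamp):
            if rec.timestamp > disk_ts and rec.txn_id in committed:
                value = rec.redo(value)
        # Undo uncompensated operations of losing transactions that the disk
        # version already incorporates, newest first.
        for rec in sorted(records, key=lambda r: r.timestamp, reverse=True):
            if (rec.timestamp <= disk_ts and rec.is_fwd
                    and rec.txn_id not in committed and not rec.has_clr):
                value = rec.undo(value)
        return value

The comparison against disk_ts is what makes non-idempotent operations safe: an operation is redone only if the disk version provably predates it, and undone only if the disk version provably includes it.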
7.3.8 Adaptive Distribution Policies

This thesis demonstrated that a simple oblivious distribution policy, such as random or cyclic, can ensure excellent load balancing behavior in a system with arbitrarily many log streams, where load balancing is defined in terms of the demanded bandwidth of each stream. However, there may still be reasons to pursue more sophisticated distribution policies. An adaptive distribution policy takes into account the current state of the system when deciding the stream to which to send a log record; any such strategy is clearly not oblivious. Although an adaptive strategy cannot achieve any improvement in terms of balancing the demanded bandwidths of log streams, it can offer other benefits. Each log stream has a fixed number of buffers. If one stream's buffers all happen to be full temporarily, an adaptive strategy ought to redirect log records to other streams which still have buffer capacity available. Furthermore, the random and cyclic policies make no attempt to exploit locality within a concurrent computer. As a system's degree of concurrency scales up, global network bandwidth may become more limited and locality may affect a database's overall performance. The partitioned policy, despite its drawbacks in terms of load balancing, can provide good locality. A modified variant of the partitioned policy may therefore yield the best solution in terms of both load balancing and locality: log records are sent to streams according to the partitioned policy, but an overloaded log stream does not refuse to accept records if it can redirect them to some other stream (preferably one that is nearby) that can accept more records.

Theoretical analysis of an adaptive distribution policy becomes quite complicated because the system incorporates feedback. Such problems have classically been the preserve of control systems theory. Theoretical analysis and experimental evaluation of an adaptive policy must take into account the system's dynamic behavior. For example, a system might exhibit oscillatory behavior under some conditions. These issues ought to be thoroughly understood before any adaptive policy is chosen for use within a database system.
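A sketch of the modified partitioned policy suggested above follows; all interfaces are hypothetical, and the notion of "distance" between streams stands in for whatever locality metric the machine provides:

    def choose_stream(record, streams, home_of):
        # Start from the record's home stream under the partitioned policy.
        home = streams[home_of(record.object_id)]
        if home.has_buffer_space():
            return home
        # Overloaded: redirect to the nearest stream with room, if any.
        for s in sorted(streams, key=home.distance_to):
            if s is not home and s.has_buffer_space():
                return s
        return home   # all streams full; wait on the home stream

Because the choice depends on current buffer occupancy, this policy is adaptive, and the feedback it introduces is exactly what makes its dynamic behavior hard to analyze.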
7.3.9 Quick Resumption of Service After a Crash

The description of the RM's operation in Section 2.9 stated that normal operation resumes after recovery activity has completely finished. For some applications, availability is very important and normal activity should resume as quickly as possible after a crash. More sophisticated recovery algorithms for XEL may permit a database to start servicing requests before recovery has entirely completed, so that normal operation and recovery are overlapped. This remains an interesting open problem.

7.4 The Future of High Performance Databases

In the future, information will be plentiful, available and inexpensive; but it won't be free. As computer and communications technology becomes pervasive in our society, many offices and homes will be able to retrieve information from large databases. In general, the corporations which provide these information services will charge the consumers appropriately. Fees will reflect the value of the information to the consumer and the cost to the producer of gathering, storing and distributing the information. Database technology will provide not only the ability to store and retrieve large amounts of diverse information; it will also provide a means to charge customers accordingly, so as to ensure economic efficiency.

Suppose an "information vendor" establishes a database which provides information services to many millions of customers. If 10,000,000 customers happen to be using the system at one time and each customer submits approximately one request every 100 seconds, say, this generates a load of 100,000 TPS (transactions per second) for the billing database. This is enormous, compared to the best demonstrated performance of today's systems. The cost of maintaining billing information is important. It ought to be relatively low, compared to the cost of the information itself, so that the price is not significantly inflated by the need to charge a fee for each request. These considerations suggest that high performance, low cost transaction processing systems will play an important role in our society as we become information consumers and many hundreds of millions of people and companies buy and sell information. The work in this thesis is a step toward realizing this dream. However, it addresses only a necessary condition (fault tolerance), not a sufficient condition. Much more work remains to be done in many other areas. The problems of concurrency control, query optimization and transaction management all deserve re-examination within the context of highly concurrent database systems that support very high rates of transaction processing.

Appendix A

Theorems for Correctness Proof of XEL

This appendix contains proofs for the safety and liveness properties of XEL. These proofs supplement Chapter 3 in the main body of the thesis.

A.1 Proof of Possibilities Mapping

The following lemmas and theorems prove that f, as defined in Section 3.4.3, satisfies the sufficient conditions (stated in [42, 43]) to be a possibilities mapping from LM to SLM.
(toEstart(SLM)) A (toEf(so)) A.1 Proof: * (( A (curr reqddlr=l) A (pendingts-assign=0) A A (Vx, xEA, timestamp=l) (Vx, xEAr, statusx=UNFL) ) in state so) (tEf(so)) ((keep=) * start(SLM)={to} =:- where ((keep=I) A (leterase=0) A (wait-erase=0)) in state t A (leterase=0) A (waiterase=0)) in state to * ((keep=I) A (leterase=0) A (wait-erase=0)) in state to -== t=to * (t=to) A (tEf(so)) =- toEf(so) and thus the theorem has been proven. 128 O Lemma A.2 (lmkeep(z) A in state s) (tEf(s)) keep=x in state t Proof: By contradiction. Assume keepox in state t. * (tEf(s)) A (keep/x in state t) == Either (1) (lmlet(x) in state s) A (xEleterase in state t) * lmlet(x) in state s ==: -lmkeep(x) in state s by definition of lm let(x) and Imkeep(x) But this is a contradiction, so this case cannot be true. or (2) (Imwait(x) in state s) A (xEwaiterase in state t) * lmIwait(x) in state s ==> statusx=RECV in state s by defin!ition of Imwait(x) * statusx=RECV in state s =- currreqddlr5x in state s * currreqddlr5x in state s ==~ -lmkeep(x) by Invariant 3.6 by definiition of lmkeep(x) in state s But this is a contradiction, so this case cannot be true. or (3) A (-recvbl(x) in state s) (((keepy4x) A (xfleterase) A (xwaiterase)) in state t) * Imkeep(x) in state s == curr-reqddlr=x in state s by definition of Imkeep(x) * currreqddlr=x in state s == recvbl(x) in state s by Invariants 3.6 and 3.8 and definition of recvbl(x) But this is a contradiction, so this case cannot be truite. * Therefore, every possible case leads to a contradiction and so the original assumption must be false. Thus the lemma has been proven. [ Lemma A.3 (Imlet(x) in state s) A (tEf(s)) xEleterase in state t Proof: By contradiction. Assume xleterase * (tEf(s)) A (xflet-erase in state t) == Either 129 in state t. (1) (Imkeep(x) in state s) A (keep=x in state t) * lmkeep(x) in state s ==- lmlet(x) in state s by definition of lmkeep(x) and Imlet(x) But this is a contradiction, so this case cannot be true. or (2) (lmwait(x) in state s) A (xEwait-erase in state t) * lmwait(x) in state s = statusx=RECV in state s by definition of lmwait(x) * statusx=RECV in state s ==- currreqddlrox by Invariant 3.6 in state s * ((curr-reqd dlrsx) A (lm-wait(x))) in state s by definition of lmlet(x) -lmlet(x) in state s = But this is a contradiction, so this case cannot be true. or (3) (-recvbl(x) in state s) A (((keep5x) A (xleterase) * lmlet(x) in state s A (xwaiterase)) V (recvbl(x))) == ((curr-reqddlr=x) in state t) in state s by definition of lmlet(x) V (recvbl(x))) in state s * ((currreqd.dlr=x) = recvbl(x) in state s by Invariants 3.6 and 3.8 and definition of recvbl(x) But this is a contradiction, so this case cannot be true. *Therefore, every possible case leads to a contradiction and so the original 0 assumption must be false. Thus the lemma has been proven. Lemma A.4 (lmwait(x) A in state s) (tEf(s)) xEwaiterase in state t Proof: By contradiction. Assume xowait-erase in state t. * (tEf(s)) A (xowaiterase in state t) -- Either (1) (lmkeep(x) in state s) A (keep=x in state t) * Imkeep(x) in state s == curr-reqd dlr=x in state s by definition of Imkeep(x) * curr-reqddlr=x in state s == status,RECV in state s by Invariant 3.6 * statusx$RECV in state s by definition of lmwait(x) ==- lmwait(x) in state s But this is a contradiction, so this case cannot be true. 
130 or (2) (lm-let(x) in state s) A (xEleterase in state t) * lmlet(x) in state s ((currreqddlr=x) == V (-,lmwait(x))) in state s by definition of lmlet(x) * ((curr_reqddlr=x) V (-lmwait(x))) in state s :==-lm.wait(x) in state s by Invariant 3.6 But this is a contradiction, so this case cannot be true. or (3) (-recvbl(x) in state s) A (((keep-x) A (xzlet-erase) A (xfwait-erase)) in state t) * lm.wait(x) in state s == status.=RECV in state s by definition of lmwait(x) * statusx=RECV in state s in state s by Invariant 3.7 ==- timestampEA * ((statusx=RECV) A (timestamp.EA/f)) in state s -=recvbl(x) in state s by definition of recvbl(x) But this is a contradiction, so this case cannot be true. *Therefore, every possible case leads to a contradiction and so the original assumption must be false. Thus the lemma has been proven. Lemma A.5 (-recvbl(x) A [] in state s) (tEf(s)) ((keephx) A (xfleterase) A (xEwait-erase)) in state t Proof: By contradiction. Assume ((keep=x) V (xElet-erase) V (xEwait-erase)) in state t. * (tEf(s)) A (((keep=x) V (xEleterase) V (xEwaiterase)) in state t) =. Either (1) (lmkeep(x) in state s) A (keep=x in state t) * lmkeep(x) in state s =# currreqddlr=x * currreqddlr=x in state s by definition of lmkeep(x) in state s =- recvbl(x) in state s by Invariants 3.6 and 3.8 and definition of recvbl(x) But this is a contradiction, so this case cannot be true. or (2) (lmlet(x) in state s) A (xEleterase in state t) * lmlet(x) in state s = ((curr eqd dlr=x) V (recvbl(x))) in state s by definition of lmlet(x) 131 V (recvbl(x))) in state s *((curr-reqd.dlr=x) - recvbl(x) in state s by Invariants 3.6 and 3.8 and definition of recvbl(x) But this is a contradiction, so this case cannot be true. or (3) (lmwait(x) in state s) A (xEwait-erase in state t) * lmwait(x) in state s ==- status,=RECV in state s by definition of lmwait(x) * status,=RECV in state s =- timestampEAf in state s by Invariant 3.7 * ((status,=RECV) A (timestampEAf)) in state s by definition of recvbl(x) ==- recvbl(x) in state s But this is a contradiction, so this case cannot be true. *Therefore, every possible case leads to a contradiction and so the original [ assumption must be false. Thus the lemma has been proven. Lemma A.6 A (recvbl(x), xEAr, in state s) A (tEf(s)) ((keep=x) V (xEleterase) V (xEwaiterase)) in state t Proof: By contradiction. Assume ((keepox) A (xleterase) A (xfwaiterase)) in state t. * (tEf(s)) A (((keepox) A (xleterase) A (xowaiterase)) in state t) =: by Definition 3.1 -recvbl(x) in state s But this contradicts the lemma's predicate. Therefore, the initial assumption must be false and thus the lemma has been proven. 0 (s is a reachable state of LM) Lemma A.7 A (recvbl(x), A (tEf(s)) xzEA, in state s) ((leterase#0) V (waiteraserS0)) in state t Proof: * (recvbl(x) in state s) A (tEf(s)) == ((keep=x) V (xElet-erase) V (xEwait-erase)) in state t by Lemma A.6 * Either (1) keep=x in state t 132 * t is a reachable state =- ((keep=l) V (leterase#O) V (waiteraseO)) in state t by Invariant 3.1 · (keep=x, EAV,in state t) A (((keep=I) V (let-erasej0) V (waiteraseo0)) in state t) :- ((let-eraseo0) V (waiterase$0)) in state t or (2) keepox in state t * (keepox, xEA(, in state t) A (((keep=x) V (xEleterase) V (xEwait-erase)) in state t) =- ((xEleterase) V (xEwaiterase)) in state t * ((xEleterase) V (xEwaiterase)) in state t == ((let-erase40) V (wait-erase#0)) in state t Therefore, the desired result is obtained for both possible cases and thus the lemma has been proven. 
[ Lemma A.8 (s is a reachable state of LM) (x, XEAE,s.t. recvbl(x), in state s) (t Ef(s)) A A ((let-erase=0) A (waiterase=0)) in state t Proof: By contradiction. Assume ((leterase$O) V (waiterase40)) in state t * (((leterase-0) V (waiterase$0)) in state t) A (((L~leterase) A (waiterase)) =: 3x, in state t) EAr, s.t. ((xElet-erase) V (xEwait-erase)) in state t by Invariant 3.5 * (((xEleterase) V (xEwaiterase)) in state t) A (tEf(s)) =- ((lm let(z)) V (Imwait(z))) in state s by Lemmas A.3 and A.4 * ((lmlet(x)) V (lmwait(x))) in state s =. recvbl(x) in state s by definitions of lmlet(x) and lmwait(x), and by Invariants 3.6, 3.7 and 3.8 But this contradicts the lemma's predicate, and so the original assumption must be false. Thus the lemma has been proven. Theorem A.9 (sil is a reachable state of LM) A (ri=COMMIT) A A ((s_- ,COMMIT,sj)Esteps(LM)) (t'Ef(si-)) 3t s.t. ((t',COMMIT,t)Esteps(SLM)) A (tEf(si)) 133 0 Proof: * 7ri=COMMIT, = currreqddlr=x * Either (1) 3y, yx, in state si s.t. recvbl(y) in state si_ * (recvbl(y) in state si_1 ) A (ri=COMMITx)=>- recvbl(y) in state si * ((currreqddlr=x) A (recvbl(y), y/x)) in state si = lmkeep(x) in state s* (recvbl(y) in state si-1) A (t'Ef(si_-)) ((leterase$0) V (waiterase/(O)) in state t' by Lemma A.7 * (((let-erase$O) V (waiterase0O)) in state t') A ((t',COMMIT,t)Esteps(SLM)) =: x=keep in state t * Either (i) 3w, w5x, s.t. currreqddlr=w in state si-, * curr-reqddlr=w in state si, ((recvbl(w)) A (statuso/RECV)) in state s_l by Invariants 3.6 and 3.8 * (((recvbl(w)) A (status A RECV)) in state si_l) (7ri=COMMITx) ==- ((recvbl(w)) A (statusRECV)) in state si * (ri=COMMITx) A (w/x) => currreqddlrow in state si * ( (currreqddlr/w) A (recvbl(w)) A (statusw#RECV) ) in state si lmlet(w) in state si * currreqddlr=w in state si=:v ((lmkeep(w)) V (lmlet(w))) in state si_1 * (((lmkeep(w)) V (lmlet(w))) in state si-l) A (t'Ef(si-l)) == ((keep=w) V (wEleterase)) in state t' by Lemmas A.2 and A.3 * (((keep=w) V (wEleterase)) in state t') A ((t',COMMITx,t)Esteps(SLM)) =- wEleterase in state t or (ii) A/w, w/x, s.t. currreqddlr-w in state si-1 * ,w, w/x, s.t. currreqddlr=w in state si-_ ==- /w, w/&x, s.t. lm-keep(w) in state si-1 * Vz, z/x, (lmlet(z) in state si_l) A (7ri=COMMITx) == lmlet(z) in state si · Vz, z5x, (Imlet(z) in state si-1) A (t'Ef(si_l)) =V zEleterase in state t' 134 by Lemma A.3 * Vz, zix, (zElet_erase in state t') A ((t',COMMIT,t)Esteps(SLM)) ==, zEleterase in state t * Vz, z5x, (lmwait(z) in state sil) A (ri=COMMITx) == lm_wait(z) in state si * Vz,zzx, (lmwait(z) in state sil) A (t'Ef(sil)) ==, zEwait-erase in state t' by Lemma A.4 * Vz, z5x, (zEwaiterase in state t') A ((t',COMMIT,t)Esteps(SLM)) == zEwaiterase in state t or (2) ]]y, yZ:, s.t. recvbl(y) in state si-_ (y, yx, s.t. recvbl(y) in state s_l) A (7r=COMMIT,) == y, yxz, s.t. recvbl(y) in state si * ((currreqd_dlr=x) A (y, = (y, y5x, s.t. recvbl(y))) in state si lm let(x) in state si yx, s.t. recvbl(y) in state si-1) A (t'Ef(sil)) ;- ((let_erase=0) A (wait_erase=0)) in state t' by Lemma A.8 * (((leterase=0) A (wait_erase=0)) in state t') A ((t',COMMIT,t)Esteps(SLM)) =- ((keep=l) A (xElet-erase)) in state t * Vz, z7x, (-,recvbl(z) in state si-1) A (ri=COMMITx)=> · Vz, zx, (-irecvbl(z) in state si-1) A (t'Ef(si-1)) . 
((keepAz) A (zleterase) recvbl(z) in state si A (ziwaiterase)) in state t' by Lemma A.5 * Vz, zix, (((keep5z) A (zfleterase) A (zfwaiterase)) in state t') A ((t',COMMIT,t) Esteps(SLM)) =- ((keep7z) A (zlet_erase) A (zfwait_erase)) in state t * Therefore, from the above deductions it follows that tEf(si) and thus the theorem has been proven. Theorem A.10 (sil by Definition 3.1 O is a reachable state of LM) A A (iri=ERASABLEx) ((sil,ERASABLEx,si)Esteps(LM)) A (t'Ef(si-1)) 3t s.t. ((t',ERASABLE,t)Esteps(SLM)) A (tEf(si)) Proof: * 7ri=ERASABLEx =•= xEpending_erasable in state si_ · xEpending_erasable in state si-1 :- ((statusx=RECV) A (pending_ackr=F)) in state si-1 by Invariants 3.14 and 3.12 135 * (((status,=RECV) A (pendingack=F)) in state si_1) A (ri=ERASABLE,) Im_wait(x) in state si * xEpending_erasable in state si-1 =- ((status,=RECV) A (currreqd_dlr x) A (timestampEAF)) in state si-1 by Invariants 3.14, 3.6 and 3.7 * ( A (statusx=RECV) (currreqd_dlrzx) A (timestampxEKf) in state si-I (xEpending_erasable)) by definition of lmlet(x) =lm let(x) in state si-1 == xElet_erase in state t' · (lmlet(x) in state si-1) A (t'Ef(sil)) by Lemma A.3 * (xElet_erase in state t') A ((t',ERASABLE,t)Esteps(SLM)) = XEwait_erasein state t * Vz, zx, (lmkeep(z) in state si-1) A (ri=ERASABLE,) A == lmkeep(z) in state si * Vz, z6x, (m_keep(z) in state si-1) A (t'Ef(si- 1 )) by Lemma A.3 => keep=z in state t' * Vz, zx, (keep=z in state t') A ((t',ERASABLE,t)Esteps(SLM)) == keep=z in state t =: lmlet(z) in state si * Vz, zox, (lmlet(z) in state Si_l) A (ri=ERASABLEx) * Vz, zx, (lmlet(z) in state si-1) A (t'Ef(sil)) by Lemma A.3 * Vz, zox, (zElet_erase in state t') A ((t',ERASABLEx,t)Esteps(SLM)) == zElet_erase in state t' => zElet_erasein state t 4 * Vz, zy x, (lm_wait(z) in state si-1) A (ri=ERASABLE,) ==: lm_wait(z) in state si * Vz, zy4x, (lm_wait(z) in state si-1) A (t'Ef(si-1)) by Lemma A.4 =-- zEwait_erase in state t' * Vz, zx, (zEwait_erase in state t') A ((t',ERASABLE,,t)Esteps(SLM)) -=> zEwait_erase in state t * Vz, zyx, (recvbl(z) in state si-1) A (ri=ERASABLEx) => -recvbl(z) in state si * Vz, zyx, (recvbl(z) in state si-l) A (t'Ef(si-1)) ==: ((keepoz) A (zlet_erase) A (zfwaiterase)) in state t' by Lemma A.5 (((keep4z) A (zleterase) · Vz, zx, A A (zowait_erase)) in state t') ((t',ERASABLE,t)Esteps(SLM)) = ((keepyLz) A (zolet-erase) A (zowaiterase)) in state t * Therefore, from the above deductions it follows that by Definition 3.1 tEf(si) and thus the theorem has been proven. 136 0 Theorem A.11 (si-_ is a reachable state of LM) A (7ri=ERASE,) A ((si_l,ERASE ,si) Esteps(LM)) A (t'Ef(sii)) 3t s.t. ((t',ERASE,t)Esteps(SLM)) A (tEf(si)) Prioof: * ri=ERASEx==* xEcanerase in state si-, * xEcan-erase in state si- 1 =- lm-wait(x) in state si-l by Invariants 3.14, 3.13 and 3.12 * Imwait(x) in state si-l ==- (Iv ]v s.t. <x,v>Ependingts-assign s.t. <x,v>Ependingts.assign in state si-_ by Invariant 3.9 in state Si-1) A (ri=ERASEx) =: -recvbl(x) in state si * (Imwait(x) in state si_1) A (t'Ef(sil)) ==- xEwaiterase in state t' by Lemma A.4 * xEwait-erase in state t' == ((keepox) A (xleterase)) in state t' by Invariant 3.4 * (((keepox) A (xVleterase)) in state t') A ((t',ERASE,,t)Esteps(SLM)) == ((keep5x) A (xlet-erase) A (xwaiterase)) in state t * Either (1) w, wEA, s.t. 
curr-reqddlr=w in state si-_ * (currsreqddlr=w in state si- 1 ) A (ri=ERASEx) = currsreqddlr=w in state si * ((curr-reqd.dlr=w) A (statusx=RECV)) in state si-l ==- w7x by definition of lmwait(x) and Invariant 3.6 * lm-wait(x) in state si-1 = recvbl(x) in state Si-l by Invariant 3.7 * ((curr-reqddlr=w) A (recvbl(x)) A (wkx)) in state si-_ . lmkeep(w) in state si-l * (lmkeep(w) in state si- 1) A (t'Ef(si_l)) ==- keep=w in state t' by Lemma A.2 * Either (i) 3y, yo{x,w}, s.t. recvbl(y) in state si-1 * (recvbl(y) in state si_ 1) A (ri=ERASE,) A (y5x) => recvbl(y) in state si * ((curr-reqddlr=w) A (y, yow, s.t. recvbl(y))) in state si == lmkeep(w) in state si by definition of lmkeep(w) 137 * (recvbl(y) in state si- 1 ) A (t'Ef(Si-l)) ((keep=y) V (yEleterase) V (yEwaiterase)) in state t' by Lemma A.6 (keep=y) (( · (keep= (yEleterase) V V (yEwaiterase)) in state t') A A (keep=w in state t') (y/w) => ((yEleterase) (( * V (yEwaiterase)) in state t' A ((yEleterase) V (yEwaiterase))) in state t') A ((t',ERASE,t)Esteps(SLM)) keep=w in state t =# or (ii) Ay, yo{x,w}, s.t. recvbl(y) in state si-1 * (/y, yff{x,w}, s.t. recvbl(y) in state sil) A (ri=ERASEx) ,Ay, y54w, s.t. recvbl(y) in state si A (Ay, y/w, s.t. recvbl(y))) in state si * ((currreqddlr=w) == lmlet(w) in state si * xEwaiterase in state t' == xleterase in state t' by Invariant 3.4 * keep=w in state t' => ((wleterase) A (w~waiterase)) in state t' by Invariants 3.4 and 3.3 * (Ay, y4{x,w}, s.t. recvbl(y) in state si_l) A (t'Ef(si_l)) (keep/y) Vy, yo{x,w}, ( (yleterase) A A (yfwaiterase) ) in state t' by Lemma A.5 * ((xleterase) A (xEwaiterase) in state t') A (((woleterase) V (wfwaiterase)) in state t') (keep/y) A (Vy, yo{x,w}, ( A (yoleterase) A (yfwaiterase) ) in state t') ((let-erase=0) A (waiterase={x})) in state t' (keep=w) (( * A (leterase=0) A (waiterase={x})) in state t') A ((t',ERASEx,t)Esteps(SLM)) - ((keep=l) A (wEleterase)) in state t or (2) w, wefV, s.t. currreqddlr=w 138 in state si-1 * w, WEAf, s.t. curr-reqd dlr=w in state Si_1 =- 8w, wEAf, s.t. lmkeep(z) in state si-1 * Vz, z7x, (lmlet(z) in state sil) A (7ri=ERASEx)== Ilmlet(z) in state si * Vz, z7x, (lmlet(z) in state si-1) A (t'Ef(sil)) - zEleterase in state t' * Vz, zx, by Lemma A.3 (zElet-erase in state t') A ((t',ERASE,t)Esteps(SLM)) == zEleterase in state t * Vz, z7x, (lmwait(z) in state si-1) A (7ri=ERASEx) = * Vz, zx, (lmwait(z) in state si-1) A (t'Ef(sil)) lmwait(z) in state si == zEwaiterase in state t' by Lemma A.4 * Vz, zxz, (zEwait-erase in state t') A ((t',ERASE,t)Esteps(SLM)) zEwaiterase in state t == * Vz, z4x, (recvbl(z) in state sil) A (ri=ERASE) == -irecvbl(z) in state si * Vz, zkx, (recvbl(z) in state si-i) A (t'Ef(s-il)) :. ((keep5z) A (zlet-erase) A (zowaiterase)) in state t' by Lemma A.5 * Vz, z5x, (((keep5z) A (zleterase) A A (zowaiterase)) in state t') ((t',ERASEx,t) Esteps(SLM)) :== ((keepoz) A (zleterase) A (zowaiterase)) in state t * Therefore, from the above deductions it follows that tEf(si) and thus the theorem has been proven. Lemma A.12 A (si_l is a reachable state of LM) (lmkeep(z) in state si-1) A (7ri=ACKASSIGNx) by Definition 3.1 E lmkeep(z) in state si Proof: * lmkeep(z) in state si-1 =- ((currreqd-dlr=z) A (3y, y5z, s.t. recvbl(y))) in state Si_* ((currreqddlr=z) A (3y, yhz, s.t. 
recvbl(y))) in state si- 1 ==- recvtss50 in state si-1 by Invariant 3.17 * (((curr-reqddlr=z) A (recvtss60)) in state Si-1 ) A (ri=ACKASSIGN,) = curr-reqddlr=z in state si * (recvbl(y) in state si-1) A (ri=ACKASSIGN,) ==~ recvbl(y) in state si * ((currreqddlr=z) A (recvbl(y), yz)) and thus the lemma has been proven. 139 in state si =- lmkeep(z) in state si E Lemma A.13 A (si-1 is a reachable state of LM) (lmlet(z) in state si-1) A (7ri=ACK-ASSIGN,) lmlet(z) in state si Proof: * lmlet(z) in state si-1 = Either (1) ((curr-reqddlr=z) A (y, y$z, s.t. recvbl(y))) in state si-1 * (y, yz, s.t. recvbl(y) in state si_l) A (7ri=ACKASSIGN,) = Ay, yoz, s.t. recvbl(y) in state si * currreqddlr=z = in state si-1 ((recvbl(z)) A (statuszRECV)) in state si-_ by Invariants 3.6 and 3.8 * (((recvbl(z)) A (status-zRECV)) in state si-1) A (ri=AcKASSIGN,) =- ((recvbl(z)) A (statusz#RECV)) * ( in state si (Ay, yoz, s.t. recvbl(y)) A (recvbl(z)) A (statusz#RECV) ) in state si =: lmlet(z) in state si or (2) ((currreqddlr7z) * (( A (recvbl(z)) A (lmwait(z))) in state si-1 (currreqddlrz) A (recvbl(z)) A (-lmwait(z)) ) in state si-1) A (7ri=ACKASSIGN,) :- ((curr-reqddlroz) A (recvbl(z)) A (-lmwait(z))) in state si * ((currreqddlr5z) A (recvbl(z)) A (-lmwait(z))) in state si = Ilmlet(z) in state si by definition of lmlet(z) * For both possible cases, the desired result is obtained and thus the lemma has been proven. [1 Theorem A.14 (si-1 is a reachable state of LM) A (7ri=ACKASSIGN,) A ((si_l,ACK ASSIGN.,si) steps(LM)) A (t'Ef(si-1)) t'Ef(si) Proof: 140 · Vz, zeAN, (lmkeep(z) in state si-1) A (t'Ef(sil)) - keep=z in state t' by Lemma A.2 * Vz, zEAr, (lmkeep(z) in state Si-1) A (ri=ACK-ASSIGN.) =: lmkeep(z) in state si by Lemma A.12 * Vz, zEAN, (lmlet(z) in state si-1) A (t'Ef(sil)) == zElet-erase in state t' by Lemma A.3 * Vz, zENA, (Imlet(z) in state sil) A (ri=ACKASSIGN,) ==, lmlet(z) in state si * Vz, zEAf, (lmwait(z) in state si-1) A (t'Ef(sil)) by Lemma A.13 =: zEwaiterase in state t' by Lemma A.4 * Vz, zENV, (lm.wait(z) in state si-1) A (ri=ACKASSIGN) ==r lmwait(z) in state si * Vz, zEAl, (-,recvbl(z) in state si-1) A (t'Ef(sil)) ==* ((keepoz) A (z4leterase) in state t' A (zEwait_erase)) by Lemma A.5 * Vz, zEAN, (recvbl(z) in state si-1) A (ri=ACKASSIGN,) -recvbl(z) in state si * Therefore, from the above deductions it followsthat == by Definition 3.1 t'Ef(si) 0 and thus the theorem has been proven. Theorem A.15 (si-1 is a reachable state of LM) A (r=ACK_CSRECV) A A ((si_-1,ACKCSRECV ,si) Esteps(LM)) (t'Ef(si-1)) t'Ef(si) Proof: * Vz, zENV, (lmkeep(z) in state si-1) A (t'ef(si- 1)) = keep=z in state t' by Lemma A.2 * Vz, zEAl, (lmkeep(z) in state si_l) A (ri=ACK_CSRECVx)== lmkeep(z) in state si * Vz, zE.A, (lm let(z) in state si-1) A (t'Ef(sil)) =; zElet-erase in state t' by Lemma A.3 · Vz, zEJ, (lmlet(z) in state si-1) A (ri=ACK_CS.RECV,) - lmlet(z) * Vz, zEAJ, (lm-wait(z) in state si-1) A (t'Ef(si- 1)) -- zEwaiterase in state t' in state si by Lemma A.4 * Vz, zENA, (lmwait(z) in state si-1) A (7ri=ACK_CSRECV) : lmwait(z) 141 in state si · Vz, zAf, (-recvbl(z) in state si-1) A (t'Ef(si-l)) :- ((keep5z) A (zflet-erase) A (zEwait-erase)) in state t' by Lemma A.5 * Vz, zEA, (-recvbl(z) in state si_l) A (ri=ACKCSRECV:)=: -recvbl(z) in state s * Therefore, from the above deductions it follows that by Definition 3.1 t'Ef(si) [ and thus the theorem has been proven. 
A (si-1 is a reachable state of LM) (lm-keep(z) in state si-l) A (ri=<DLRGONE,u>) Lemma A.16 lmkeep(z) in state si Proof: * lmkeep(z) in state si-1 ((curr.reqdadlr=z) A (3y, y#z, s.t. recvbl(y))) in state si-1 by definition of lmkeep(z) * ((currreqdidlr/y) A (recvbl(y))) in state si- 1 3v, :vEAN,s.t. ( ((<y,v>Ependingts.assign) V (timestampy=v)) A (vErecvtss)) in state si-1 by Invariant 3.17 * ri=<DLR_GONE,u> =- 3x s.t. ((status.=NONR) A (timestamp,=u)) in state si- 1 * statusc=NONR in state si- 1 =- ,Aw s.t. <x,w>Ependingtsassign in state si-_ by Invariant 3.9 * ((,]w s.t. <x,w>Ependingts-assign) A (statusx=NONR)) in state si- 1 == -recvbl(x) in state i-l * ((recvbl(y)) A (-irecvbl(x))) in state si-1 == xy * ((<y,v>Ependingtsassign) V (timestampy=v, vEAr)) ( A (timestampx=u) A . (xzy)) uv in state si-1 by Invariant 3.16 (((curr-reqddlr=z) A (vErecvtss)) in state Si-l) · A (ri=<DLR_GONE,u>) A (u#v) - currreqddlr=z in state si * (3y, yoz, s.t. recvbl(y) in state si-1) A (ri=<DLR_GONE,u>) 3y, yAz, s.t. recvbl(y) in state si * ((curr.reqd.dlr=z) A (3y, yoz, s.t. recvbl(y))) in state si Ikeep(z) in state si and thus the lemma has been proven. 142 0 Lemma A.17 A (si-1 is a reachable state of LM) (lmlet(z) in state si- 1 ) A (ri=<DLR_GONE,u>) lmlet(z) in state si Proof: * lmlet(z) in state si-1 =. Either (1) ((currreqddlr=z) A (y, yhz, s.t. recvbl(y))) in state si-1 * curr reqddlr=z in state si- 1 ((recvbl(z)) A (statuszRECV)) == * (( A (Ay, yoz, s.t. recvbl(y)) A (recvbl(z)) A (statusz#RECV)) in E;tate Si-i) (ri=<DLRGONE,u>) - ( (3y, yoz, s.t. recvbl(y)) A A · in state si-1 by Invariants 3.6 and 3.8 (recvbl(z)) (statusz7RECV)) (,y, yz, ( A in state si s.t. recvbl(y)) (recvbl(z)) A (statusz#RECV) ) in state si == lmlet(z) in state si or (2) ((currreqddlr5z) * (( A (recvbl(z)) A (lm by definition of lmlet(z) wait(z))) in state si-1 (curr.reqddlroz) A (recvbl(z)) A (-lmwait(z)) ) in state si-1) A (ri=<DLR_GONE,u>) =: ((curr-reqddlr7z) A (recvbl(z)) A (-lmwait(z))) in state si * ((currreqddlr z) A (recvbl(z)) A (lmwait(z))) in state si ==- lmlet(z) in state si by definition of lmlet(z) * For both possible cases, the desired result is obtained and thus the lemma has been proven. O Theorem A.18 (si_l is a reachable state of LM) A A (ri=<DLR_GONE,u>) ((si_l,<DLR-GONE,u>,si)Esteps(LM)) A (t'Ef(si-l)) t'Ef(si) 143 Proof: * Vz, zEAr, (lmkeep(z) in state Si-l) A (t'Ef(sil)) by Lemma A.2 =- keep=z in state t' * Vz, zEAf, (lmkeep(z) in state Si_l) A (ri=<DLR_GONE,u>) lmkeep(z) == in state si * Vz, zEA, (lmlet(z) in state si-1) A (t'Ef(si_l)) =- zEleterase in state t' * Vz, zEN/, (lmlet(z) in state si-1) A (ri=<DLR_GONE,u>) ==' lmlet(z) in state si * Vz, zEAr', (lmwait(z) in state si-1) A (t'Ef(sil)) =- zEwaiterase in state t' * Vz, zEAN, (lmwait(z) in state si-1) A (ri=<DLRGONE,u>) Iby Lemma A.16 by Lemma A.3 Iby Lemma A.17 by Lemma A.4 =- lmwait(z) in state si · Vz, zEJAr, (-,recvbl(z) in state sil) A (t'Ef(si-l)) ~=((keephz) A (zflet-erase) A (zEwait-erase)) in state t' by Lemma A.5 * Vz, zEJV, (-,recvbl(z) in state Si_l) A (7ri-=<DLRGONE,u>) - _recvbl(z) in state si * Therefore, from the above deductions it follows that t'Ef(si) and thus the theorem has been proven. Lemma A.19 A (si-1 is a reachable state of LM) (lmireep(z) in state si- 1) A (ri=<ASSIGNx,v>) b y Definition 3.1 E lmkeep(z) in state si Proof: * lmkeep(z) in state si-1 A (3y, yoz, s.t. 
recvbl(y))) in state si-1 == ((curr-reqd.dlr=z) * (currreqddlr=z in state si_1 ) A (ri=<ASSIGN,,v>) == currreqddlr=z in state si * ri=<ASSIGN,v> = <x,v>Ependingts-assign in state si-_ * <x,v>Ependingtsassign in state si-1 by Invariant 3.9 =- (vEA) A (statusx=UNFL in state si- 1 ) * (statusx=UNFL in state si_1 ) A (7ri=<ASSIGNx,v>) A (vEAr) == ((timestampxEAf) A (statusx=UNFL)) in state si * ((timestamp.EAf) A (status,=UNFL)) in state si by definition of recvbl(x) => recvbl(x) in state si * (recvbl(y) in state si- 1 ) A (7ri=<ASSIGNx,v>) == recvbl(y) in state si 144 * ((currreqddlr=z) == A (3y, y/z, Imkeep(z) s.t. recvbl(y))) in state si in state si and thus the lemma has been proven. Lemma A.20 (si_l is a reachable state of LM) A (lmiet(z) in state sil) A (ri=<ASSIGN,v>) lmlet(z) in state si Proof: * 7ri=<ASSIGNz,v> = <x,v>Ependingtsassign in state si · <x,v>cpendingtsassign in state si-1 =- (vEN) A (statusx=UNFL in state Sil) * (statusx=UNFL in state si- 1) A (i=<ASSIGNx,v>) by Invariant 3.9 A (vEx) ((timestampEJV) A (statusx=UNFL)) in state si * lmJet(z) in state si- 1 =- ==- Either (1) ((currreqddlr=z) * ( A (y, y/z, s.t. recvbl(y))) in state si-1 (<x,v>Ependingtsassign) A (fy, y/z, s.t. recvbl(y)) ) in state si-1 x=z * (curr-reqddlr=z in state si-1) A (ri=<ASSIGNz,v>) currreqddlr=z in state si (y, y/z, s.t. recvbl(y) in state si_l) A (7ri=<ASSIGNz,v>) =; Ay, y5z, s.t. recvbl(y) in state si -= * ((curr reqddlr=z) A (y, y5z, s.t. recvbl(y))) in state si =- IlmJet(z) in state si or (2) ((curr-reqddlr/z) A (recvbl(z)) A (-Ilmwait(z))) * (( in state si-1 (currreqddlr5z) A (recvbl(z)) A (-lm wait(z)) ) in state si-1) A A (ri=<ASSIGNx,v>) (vEA) ((curr reqddlrz) A (recvbl(z)) A (-lmwait(z))) == * ((curr-reqddlr/z) A (recvbl(z)) A (-lm-wait(z))) in state si in state si =- lmlet(z) in state si * For both possible cases, the desired result is obtained and thus the lemma has been proven. [ 145 Lemma A.21 A (si-_ is a reachable state of LM) (lmwait(z) in state si-1) A (ri=<ASSIGNz,v>) lmwait(z) in state si Proof: * ri=<ASSIGN,,v> == <x,v>Ependingtsassign · <x,v>Ependingtsassign in state si-_ in state si-_ statusx=UNFL in state si-l by Invariant 3.9 · Ilmwait(z) in state si-l =- statusz=RECV in state si-l by definition of lm-wait(z) · ((statusx=UNFL) A (status,=RECV)) in state si_l = xz =- · (lmwait(z) in state si-1) A (ri=<ASSIGNx,v>) A (xz) ,- lm-wait(z) in state si ] and thus the lemma has been proven. Lemma A.22 A (si-_ is a reachable state of LM) (-,recvbl(z) in state sil) A (7ri=<ASSIGNz,v>) -recvbl(z) in state si Proof: * 7ri=<ASSIGNz,v> ==4 <x,v>Ependingts-assign in state si-_ * <x,v>Ependingtsassign in state si-l = recvbl(x) in state si-l by d.efinition of recvbl(x) * ((-irecvbl(z)) A (recvbl(x))) in state si_l = x7z * (-_recvbl(z) in state si-1) A (i=<ASSIGNx,v>) A (x5z) ==, -recvbl(z) in state si and thus the lemma has been proven. 
Theorem A.23  (s_{i-1} is a reachable state of LM) ∧ (π_i = ⟨ASSIGN_x,v⟩) ∧ ((s_{i-1},⟨ASSIGN_x,v⟩,s_i) ∈ steps(LM)) ∧ (t′ ∈ f(s_{i-1})) ⟹ t′ ∈ f(s_i)

Proof:
* ∀z, z∈𝒩: (lm_keep(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ keep = z in state t′, by Lemma A.2
* ∀z, z∈𝒩: (lm_keep(z) in state s_{i-1}) ∧ (π_i = ⟨ASSIGN_x,v⟩) ⟹ lm_keep(z) in state s_i, by Lemma A.19
* ∀z, z∈𝒩: (lm_let(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ z ∈ let_erase in state t′, by Lemma A.3
* ∀z, z∈𝒩: (lm_let(z) in state s_{i-1}) ∧ (π_i = ⟨ASSIGN_x,v⟩) ⟹ lm_let(z) in state s_i, by Lemma A.20
* ∀z, z∈𝒩: (lm_wait(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ z ∈ wait_erase in state t′, by Lemma A.4
* ∀z, z∈𝒩: (lm_wait(z) in state s_{i-1}) ∧ (π_i = ⟨ASSIGN_x,v⟩) ⟹ lm_wait(z) in state s_i, by Lemma A.21
* ∀z, z∈𝒩: (¬recvbl(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ ((keep ≠ z) ∧ (z ∉ let_erase) ∧ (z ∈ wait_erase)) in state t′, by Lemma A.5
* ∀z, z∈𝒩: (¬recvbl(z) in state s_{i-1}) ∧ (π_i = ⟨ASSIGN_x,v⟩) ⟹ ¬recvbl(z) in state s_i, by Lemma A.22
* Therefore, from the above deductions it follows that t′ ∈ f(s_i), by Definition 3.1, and thus the theorem has been proven. □

Theorem A.24  (s_{i-1} is a reachable state of LM) ∧ (π_i = CS_REQD_x) ∧ ((s_{i-1},CS_REQD_x,s_i) ∈ steps(LM)) ∧ (t′ ∈ f(s_{i-1})) ⟹ t′ ∈ f(s_i)

Proof:
* ∀z, z∈𝒩: (lm_keep(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ keep = z in state t′, by Lemma A.2
* ∀z, z∈𝒩: (lm_keep(z) in state s_{i-1}) ∧ (π_i = CS_REQD_x) ⟹ lm_keep(z) in state s_i
* ∀z, z∈𝒩: (lm_let(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ z ∈ let_erase in state t′, by Lemma A.3
* ∀z, z∈𝒩: (lm_let(z) in state s_{i-1}) ∧ (π_i = CS_REQD_x) ⟹ lm_let(z) in state s_i
* ∀z, z∈𝒩: (lm_wait(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ z ∈ wait_erase in state t′, by Lemma A.4
* ∀z, z∈𝒩: (lm_wait(z) in state s_{i-1}) ∧ (π_i = CS_REQD_x) ⟹ lm_wait(z) in state s_i
* ∀z, z∈𝒩: (¬recvbl(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ ((keep ≠ z) ∧ (z ∉ let_erase) ∧ (z ∈ wait_erase)) in state t′, by Lemma A.5
* ∀z, z∈𝒩: (¬recvbl(z) in state s_{i-1}) ∧ (π_i = CS_REQD_x) ⟹ ¬recvbl(z) in state s_i
* Therefore, from the above deductions it follows that t′ ∈ f(s_i), by Definition 3.1, and thus the theorem has been proven. □

Lemma A.25  (s_{i-1} is a reachable state of LM) ∧ (lm_let(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ lm_let(z) in state s_i

Proof:
* π_i = CS_RECV_x ⟹ x ∈ send_cs_recv in state s_{i-1}
* x ∈ send_cs_recv in state s_{i-1} ⟹ ((curr_reqd_dlr ≠ x) ∧ (recvbl(x))) in state s_{i-1}, by Invariants 3.11, 3.7 and 3.15
* Either (1) x = z:
  - (((curr_reqd_dlr ≠ x) ∧ (recvbl(x))) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ∧ (x = z) ⟹ ((curr_reqd_dlr ≠ z) ∧ (recvbl(z))) in state s_i
  - (π_i = CS_RECV_x) ∧ (x = z) ⟹ pending_ack_z = T in state s_i
  - ((curr_reqd_dlr ≠ z) ∧ (recvbl(z)) ∧ (pending_ack_z = T)) in state s_i ⟹ lm_let(z) in state s_i
* or (2) x ≠ z:
  - (((lm_let(z)) ∧ (recvbl(x))) in state s_{i-1}) ∧ (x ≠ z) ⟹ ((curr_reqd_dlr ≠ z) ∧ (recvbl(z)) ∧ (¬lm_wait(z))) in state s_{i-1}
  - ((((curr_reqd_dlr ≠ z) ∧ (recvbl(z)) ∧ (¬lm_wait(z))) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ∧ (x ≠ z)) ⟹ ((curr_reqd_dlr ≠ z) ∧ (recvbl(z)) ∧ (¬lm_wait(z))) in state s_i
  - ((curr_reqd_dlr ≠ z) ∧ (recvbl(z)) ∧ (¬lm_wait(z))) in state s_i ⟹ lm_let(z) in state s_i, by definition of lm_let(z)
* For both possible cases, the desired result is obtained and thus the lemma has been proven. □

Lemma A.26  (s_{i-1} is a reachable state of LM) ∧ (lm_wait(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ lm_wait(z) in state s_i

Proof:
* π_i = CS_RECV_x ⟹ x ∈ send_cs_recv in state s_{i-1}
* x ∈ send_cs_recv in state s_{i-1} ⟹ status_x ≠ RECV in state s_{i-1}, by Invariant 3.15
* (lm_wait(z) in state s_{i-1}) ∧ (status_x ≠ RECV in state s_{i-1}) ⟹ z ≠ x, by definition of lm_wait(z)
* (lm_wait(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ∧ (z ≠ x) ⟹ lm_wait(z) in state s_i
and thus the lemma has been proven. □
Lemma A.27  (s_{i-1} is a reachable state of LM) ∧ (¬recvbl(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ ¬recvbl(z) in state s_i

Proof:
* π_i = CS_RECV_x ⟹ x ∈ send_cs_recv in state s_{i-1}
* x ∈ send_cs_recv in state s_{i-1} ⟹ recvbl(x) in state s_{i-1}, by Invariants 3.7 and 3.15
* ((¬recvbl(z)) ∧ (recvbl(x))) in state s_{i-1} ⟹ z ≠ x
* (¬recvbl(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ∧ (z ≠ x) ⟹ ¬recvbl(z) in state s_i
and thus the lemma has been proven. □

Theorem A.28  (s_{i-1} is a reachable state of LM) ∧ (π_i = CS_RECV_x) ∧ ((s_{i-1},CS_RECV_x,s_i) ∈ steps(LM)) ∧ (t′ ∈ f(s_{i-1})) ⟹ t′ ∈ f(s_i)

Proof:
* ∀z, z∈𝒩: (lm_keep(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ keep = z in state t′, by Lemma A.2
* ∀z, z∈𝒩: (lm_keep(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ lm_keep(z) in state s_i
* ∀z, z∈𝒩: (lm_let(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ z ∈ let_erase in state t′, by Lemma A.3
* ∀z, z∈𝒩: (lm_let(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ lm_let(z) in state s_i, by Lemma A.25
* ∀z, z∈𝒩: (lm_wait(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ z ∈ wait_erase in state t′, by Lemma A.4
* ∀z, z∈𝒩: (lm_wait(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ lm_wait(z) in state s_i, by Lemma A.26
* ∀z, z∈𝒩: (¬recvbl(z) in state s_{i-1}) ∧ (t′ ∈ f(s_{i-1})) ⟹ ((keep ≠ z) ∧ (z ∉ let_erase) ∧ (z ∈ wait_erase)) in state t′, by Lemma A.5
* ∀z, z∈𝒩: (¬recvbl(z) in state s_{i-1}) ∧ (π_i = CS_RECV_x) ⟹ ¬recvbl(z) in state s_i, by Lemma A.27
* Therefore, from the above deductions it follows that t′ ∈ f(s_i), by Definition 3.1, and thus the theorem has been proven. □

A.2 Proof of Liveness

Theorem A.69, which is found at the end of this section, states XEL's important liveness property: the DLR for every committed update is eventually erased. Some preliminary lemmas must first be proven which will ultimately contribute toward the proof of Theorem A.69. In all the following lemmas and theorems, let α denote an execution for the LM module, and let π_i represent the i-th action of α (where i∈𝒩 and i≥1).
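As a roadmap, the overall shape of the liveness argument can be previewed here (a schematic summary of results proven below, not an additional claim). For any well-formed and fair execution α:

    π_i = COMMIT_x  ⟹  ∃l, l>i, s.t. π_l = ACK_ASSIGN_x               (Lemma A.59)
                    ⟹  ∃r, r≥l, s.t. x ∈ send_cs_recv in state s_r     (Lemma A.68, as applied in Theorem A.69)
                    ⟹  ∃j, j>r, s.t. π_j = ERASE_x                     (Lemma A.55)

The lemmas that follow establish the ordering and uniqueness facts on which each of these links depends.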
Lemma A.29  π_h = ACK_ASSIGN_x ⟹ ∃g, g<h, s.t. π_g = ⟨ASSIGN_x,t⟩ for some t

Proof:
* π_h = ACK_ASSIGN_x ⟹ ((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_{h-1}
* ((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_{h-1} ⟹ ∃g, g≤h-1, s.t. π_g = ⟨ASSIGN_x,t⟩ for some t
and thus the lemma has been proven. □

Lemma A.30  (α is a well-formed execution) ∧ (π_i = ⟨ASSIGN_x,u⟩ for some u) ⟹ ∄h, h≠i, s.t. π_h = ⟨ASSIGN_x,v⟩ for any v

Proof: By contradiction. Without loss of generality, assume ∃h, h<i, s.t. π_h = ⟨ASSIGN_x,v⟩ for some v.
* Case 1: u = v.
  - π_h = ⟨ASSIGN_x,v⟩ ⟹ (⟨x,v⟩ ∈ pending_ts_assign in state s_{h-1}) ∧ (⟨x,v⟩ ∉ pending_ts_assign in state s_h)
  - ⟨x,v⟩ ∈ pending_ts_assign in state s_{h-1} ⟹ ∃q, q≤h-1, s.t. π_q = COMMIT_x
  - (π_i = ⟨ASSIGN_x,u⟩) ∧ (u = v) ⟹ ⟨x,v⟩ ∈ pending_ts_assign in state s_{i-1}
  - (⟨x,v⟩ ∉ pending_ts_assign in state s_h) ∧ (⟨x,v⟩ ∈ pending_ts_assign in state s_{i-1}) ∧ (h < i) ⟹ ∃r, h<r≤i-1, s.t. π_r = COMMIT_x
  - π_r = COMMIT_x ⟹ ∄q, q≠r, s.t. π_q = COMMIT_x, by WF1
  - But this contradicts the earlier deduction that ∃q, q≤h-1<r, s.t. π_q = COMMIT_x, and so this case is impossible.
* Case 2: u ≠ v.
  - π_i = ⟨ASSIGN_x,u⟩ ⟹ ⟨x,u⟩ ∈ pending_ts_assign in state s_{i-1}
  - ⟨x,u⟩ ∈ pending_ts_assign in state s_{i-1} ⟹ ∃r, r≤i-1, s.t. ((π_r = COMMIT_x) ∧ (u = current_ts in state s_{r-1}))
  - π_r = COMMIT_x ⟹ ∄q, q≠r, s.t. π_q = COMMIT_x, by WF1
  - π_h = ⟨ASSIGN_x,v⟩ ⟹ ⟨x,v⟩ ∈ pending_ts_assign in state s_{h-1}
  - ⟨x,v⟩ ∈ pending_ts_assign in state s_{h-1} ⟹ ∃q, q≤h-1, s.t. ((π_q = COMMIT_x) ∧ (v = current_ts in state s_{q-1}))
  - u ≠ v ⟹ s_{q-1} ≠ s_{r-1} ⟹ q ≠ r
  - But this contradicts the earlier deduction that ∄q, q≠r, s.t. π_q = COMMIT_x, and so this case is also impossible.
* Since both cases are impossible, the original assumption must be false and the lemma has been proven. □

Lemma A.31  (α is a well-formed execution) ∧ (((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_i) ⟹ ∄h, h<i, s.t. π_h = ACK_ASSIGN_x

Proof: By contradiction. Assume ∃h, h<i, s.t. π_h = ACK_ASSIGN_x.
* ((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_i ⟹ ∃g, g<i, s.t. (π_g = ⟨ASSIGN_x,t⟩ for some t) ∧ (∄f, g<f<i, s.t. π_f = ACK_ASSIGN_x)
* π_g = ⟨ASSIGN_x,t⟩ ⟹ ∄d, d≠g, s.t. π_d = ⟨ASSIGN_x,u⟩ for any u, by Lemma A.30
* (∃h, h<i, s.t. π_h = ACK_ASSIGN_x) ∧ (∄f, g<f<i, s.t. π_f = ACK_ASSIGN_x) ⟹ ∃e, e<g, s.t. π_e = ACK_ASSIGN_x
* π_e = ACK_ASSIGN_x ⟹ ∃d, d<e<g, s.t. π_d = ⟨ASSIGN_x,u⟩ for some u, by Lemma A.29
* But this is a contradiction and so the original assumption must be false. Thus the lemma has been proven. □

Lemma A.32  ((curr_reqd_dlr = x, x∈𝒩) ∧ (curr_reqd_acked = T)) in state s_i ⟹ ∃h, h<i, s.t. π_h = ACK_ASSIGN_x

Proof: By contradiction. Assume ∄h, h<i, s.t. π_h = ACK_ASSIGN_x.
* curr_reqd_dlr = x, x∈𝒩, in state s_i ⟹ ∃g, g<i, s.t. (π_g = COMMIT_x) ∧ (∄f, g<f<i, s.t. π_f = COMMIT_y for some y≠x)
* π_g = COMMIT_x ⟹ ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_g
* (((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_{m-1}) ∧ (π_m ≠ ACK_ASSIGN_x) ∧ (π_m ≠ COMMIT_y for y≠x) ⟹ ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_m
* (((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_g) ∧ (∄h, h<i, s.t. π_h = ACK_ASSIGN_x) ∧ (∄f, g<f<i, s.t. π_f = COMMIT_y for y≠x) ⟹ ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_i, by induction
* But this contradiction implies that the original assumption must be false and thus the lemma has been proven. □

Lemma A.33  x ∈ send_cs_recv in state s_i ⟹ ∃h, h<i, s.t. π_h = ACK_ASSIGN_x

Proof: By contradiction. Assume ∄h, h<i, s.t. π_h = ACK_ASSIGN_x.
* x ∈ send_cs_recv in state s_i ⟹ either
  (1) ∃g, g<i, s.t. (π_g = COMMIT_y for some y) ∧ (curr_reqd_dlr = x in state s_{g-1}) ∧ (curr_reqd_acked = T in state s_{g-1}):
    - ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = T)) in state s_{g-1} ⟹ ∃h, h<g-1, s.t. π_h = ACK_ASSIGN_x, by Lemma A.32. But this is a contradiction, so this case cannot be true.
  or (2) ∃h, h<i, s.t. π_h = ACK_ASSIGN_x:
    - But this is a contradiction, so this case cannot be true.
  or (3) ∃g, g<i, s.t. (π_g = ⟨DLR_GONE,t⟩ for some t) ∧ (curr_reqd_dlr = x in state s_{g-1}) ∧ (curr_reqd_acked = T in state s_{g-1}):
    - ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = T)) in state s_{g-1} ⟹ ∃h, h<g-1, s.t. π_h = ACK_ASSIGN_x, by Lemma A.32. But this is a contradiction, so this case cannot be true.
* Since all three possible cases lead to contradictions, the original assumption must be false and thus the lemma has been proven. □

Lemma A.34  (α is a well-formed execution) ∧ (π_j = ERASE_x) ⟹ ∃f, f<j, s.t. π_f = CS_RECV_x

Proof:
* π_j = ERASE_x ⟹ ∃h, h<j, s.t. π_h = ERASABLE_x, by WF2
* π_h = ERASABLE_x ⟹ x ∈ pending_erasable in state s_{h-1}
* x ∈ pending_erasable in state s_{h-1} ⟹ ∃g, g<h-1, s.t. π_g = ACK_CS_RECV_x
* π_g = ACK_CS_RECV_x ⟹ status_x = RECV in state s_{g-1}
* status_x = RECV in state s_{g-1} ⟹ ∃f, f<g-1, s.t. π_f = CS_RECV_x
and thus the lemma has been proven. □
Lemma A.35  (curr_reqd_dlr = x in state s_i, for some x∈𝒩) ∧ (∄h, h<i, s.t. π_h = ACK_ASSIGN_x) ⟹ curr_reqd_acked = F in state s_i

Proof: By contradiction. Assume curr_reqd_acked = T in state s_i.
* curr_reqd_acked = T in state s_i ⟹ ∃h, h<i, s.t. (π_h = ACK_ASSIGN_y) ∧ (curr_reqd_dlr = y, for some y∈𝒩, in state s_{h-1}) ∧ (∄j, h<j≤i, s.t. π_j = COMMIT_z for some z∈𝒩)
* (curr_reqd_dlr = y, y∈𝒩, in state s_{h-1}) ∧ (π_h = ACK_ASSIGN_y) ⟹ ((curr_reqd_dlr = y) ∨ (curr_reqd_dlr = ⊥)) in state s_h
* ((((curr_reqd_dlr = y, y∈𝒩) ∨ (curr_reqd_dlr = ⊥)) in state s_h) ∧ (∄j, h<j≤i, s.t. π_j = COMMIT_z for some z∈𝒩)) ⟹ ∀l, h≤l≤i, ((curr_reqd_dlr = y, y∈𝒩) ∨ (curr_reqd_dlr = ⊥)) in state s_l, by definition of steps(LOT)
* Either (1) curr_reqd_dlr = y, y∈𝒩, in state s_i:
  - By transitivity, x = y, so ∃h, h<i, s.t. π_h = ACK_ASSIGN_x. But this contradicts the lemma's predicate.
  or (2) curr_reqd_dlr = ⊥ in state s_i:
  - But this also contradicts the lemma's predicate.
* Since both possible cases lead to contradictions, the original assumption must be false and so the lemma has been proven. □

Lemma A.36  (α is a well-formed execution) ∧ (π_i = ERASE_x) ⟹ ∃h, h<i, s.t. π_h = ACK_ASSIGN_x

Proof: By contradiction. Assume ∄h, h<i, s.t. π_h = ACK_ASSIGN_x.
* π_i = ERASE_x ⟹ ∃e, e<i, s.t. π_e = CS_RECV_x, by Lemma A.34
* π_e = CS_RECV_x ⟹ x ∈ send_cs_recv in state s_{e-1}
* x ∈ send_cs_recv in state s_{e-1} ⟹ ∃d, d<e-1, s.t. either
  (1) (π_d = COMMIT_y for some y∈𝒩) ∧ (((curr_reqd_dlr = x) ∧ (curr_reqd_acked = T)) in state s_{d-1}):
    - (curr_reqd_dlr = x in state s_{d-1}) ∧ (∄h, h<d-1<i, s.t. π_h = ACK_ASSIGN_x) ⟹ curr_reqd_acked = F in state s_{d-1}, by Lemma A.35. But this is a contradiction, and so this case could not possibly occur.
  or (2) π_d = ACK_ASSIGN_x:
    - But this contradicts the assumption, and so this case can never occur.
  or (3) (π_d = ⟨DLR_GONE,ts_y⟩ for some ts_y) ∧ (curr_reqd_dlr = x in state s_{d-1}) ∧ (curr_reqd_acked = T in state s_{d-1}):
    - (curr_reqd_dlr = x in state s_{d-1}) ∧ (∄h, h<d-1<i, s.t. π_h = ACK_ASSIGN_x) ⟹ curr_reqd_acked = F in state s_{d-1}, by Lemma A.35. But this is a contradiction and so this case also cannot occur.
* Since none of these three cases can be true and there are no other possibilities, the original assumption must be false and so the lemma has been proven. □

Lemma A.37  (α is a well-formed execution) ∧ (((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_k) ∧ (∃q, q>k, s.t. ∄n, n<q, s.t. π_n = ACK_ASSIGN_x) ⟹ ∄p, p<q, s.t. (π_p = CS_REQD_x) ∨ (π_p = CS_RECV_x) ∨ (π_p = ERASE_x)

Proof: By contradiction. Assume ∃p, p<q, s.t. (π_p = CS_REQD_x) ∨ (π_p = CS_RECV_x) ∨ (π_p = ERASE_x).
* Either (1) ∃p, p<q, s.t. π_p = CS_REQD_x:
  - π_p = CS_REQD_x ⟹ x = send_cs_reqd in state s_{p-1}
  - x = send_cs_reqd in state s_{p-1} ⟹ ∃n, n≤p-1, s.t. π_n = ACK_ASSIGN_x
  - But this contradicts the lemma's predicate and so this case cannot occur.
  or (2) ∃p, p<q, s.t. π_p = CS_RECV_x:
  - π_p = CS_RECV_x ⟹ x ∈ send_cs_recv in state s_{p-1}
  - x ∈ send_cs_recv in state s_{p-1} ⟹ either
    (i) ∃m, m<p-1, s.t. (π_m = COMMIT_y for some y∈𝒩) ∧ (curr_reqd_dlr = x in state s_{m-1}) ∧ (curr_reqd_acked = T in state s_{m-1}):
      · ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = T)) in state s_{m-1} ⟹ ∃n, n<m-1, s.t. π_n = ACK_ASSIGN_x, by Lemma A.32. But this contradicts the lemma's predicate and so this subcase cannot occur.
    or (ii) ∃n, n<p-1, s.t. π_n = ACK_ASSIGN_x:
      · But this contradicts the lemma's predicate, and so this subcase cannot occur either.
    or (iii) ∃m, m<p-1, s.t. (π_m = ⟨DLR_GONE,t⟩ for some t) ∧ (curr_reqd_dlr = x in state s_{m-1}) ∧ (curr_reqd_acked = T in state s_{m-1}):
      · ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = T)) in state s_{m-1} ⟹ ∃n, n<m-1, s.t. π_n = ACK_ASSIGN_x, by Lemma A.32. But this contradicts the lemma's predicate and so this subcase cannot occur either.
  - Since all three subcases lead to contradictions and there are no other possible subcases, the entire case (2) must be impossible.
  or (3) ∃p, p<q, s.t. π_p = ERASE_x:
  - π_p = ERASE_x ⟹ ∃n, n<p, s.t. π_n = ACK_ASSIGN_x, by Lemma A.36. But this contradicts the lemma's predicate, and so also this case must be impossible.
* All possible cases lead to contradictions. Therefore, the original assumption must be false and thus the lemma has been proven. □

Lemma A.38  π_i = ACK_ASSIGN_x ⟹ ∃h, h<i, s.t. π_h = COMMIT_x

Proof:
* π_i = ACK_ASSIGN_x ⟹ ((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_{i-1}
* ((status_x = UNFL) ∧ (pending_ack_x = T)) in state s_{i-1} ⟹ ∃j, j<i-1, s.t. π_j = ⟨ASSIGN_x,ts_x⟩ for some ts_x
* π_j = ⟨ASSIGN_x,ts_x⟩ ⟹ ⟨x,ts_x⟩ ∈ pending_ts_assign in state s_{j-1}
* ⟨x,ts_x⟩ ∈ pending_ts_assign in state s_{j-1} ⟹ ∃h, h<j-1, s.t. π_h = COMMIT_x
* By transitivity, ∃h, h<i, s.t. π_h = COMMIT_x
and thus the lemma has been proven. □
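Taken together, the lemmas of this group pin down the only order in which the actions concerning a single record x can occur in a well-formed execution (a schematic summary; the strict precedences follow from Lemmas A.29, A.33, A.34 and A.38, and uniqueness of each action from WF1, Lemma A.30 and, below, Lemmas A.41 and A.42):

    COMMIT_x  <  ⟨ASSIGN_x,t⟩  <  ACK_ASSIGN_x  <  CS_RECV_x  <  ERASE_x

where "<" means "occurs strictly earlier in the execution".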
Lemma A.39  (α is a well-formed execution) ∧ (π_g = COMMIT_x) ⟹ ∀f, f<g, x ∉ send_cs_recv in state s_f

Proof: By contradiction. Assume ∃f, f<g, s.t. x ∈ send_cs_recv in state s_f.
* π_g = COMMIT_x ⟹ ∄d, d<g, s.t. π_d = COMMIT_x, by WF1
* x ∈ send_cs_recv in state s_f ⟹ ∃e, e<f, s.t. π_e = ACK_ASSIGN_x, by Lemma A.33
* π_e = ACK_ASSIGN_x ⟹ ∃d, d<e, s.t. π_d = COMMIT_x, by Lemma A.38
* (d<e<f<g) ∧ (∃d, d<e, s.t. π_d = COMMIT_x) ⟹ ∃d, d<g, s.t. π_d = COMMIT_x
* But this is a contradiction, and so the original assumption must be false. Thus the lemma has been proven. □

Lemma A.40  (α is a well-formed execution) ∧ (x ∈ send_cs_recv in state s_i) ⟹ curr_reqd_dlr ≠ x in state s_i

Proof: By induction.
* x ∈ send_cs_recv in state s_i ⟹ ∃h, h<i, s.t. π_h = ACK_ASSIGN_x, by Lemma A.33
* π_h = ACK_ASSIGN_x ⟹ ∃g, g<h, s.t. π_g = COMMIT_x, by Lemma A.38
* π_g = COMMIT_x ⟹ ∄m, m>g, s.t. π_m = COMMIT_x, by WF1
* π_g = COMMIT_x ⟹ curr_reqd_dlr = x in state s_g
* π_g = COMMIT_x ⟹ ∀f, f<g, x ∉ send_cs_recv in state s_f, by Lemma A.39
* ∀f, f<g, x ∉ send_cs_recv in state s_f ⟹ ∀f, f<g, ((curr_reqd_dlr ≠ x) ∨ (x ∉ send_cs_recv)) in state s_f
* (((curr_reqd_dlr ≠ x) ∨ (x ∉ send_cs_recv)) in state s_{m-1}) ∧ (π_m ≠ COMMIT_x) ⟹ ((curr_reqd_dlr ≠ x) ∨ (x ∉ send_cs_recv)) in state s_m, by definition of steps(LOT)
* Therefore, by induction, ∀p, p>g, ((curr_reqd_dlr ≠ x) ∨ (x ∉ send_cs_recv)) in state s_p
* Hence, ∀q, q≥0, ((curr_reqd_dlr ≠ x) ∨ (x ∉ send_cs_recv)) in state s_q
* ((x ∈ send_cs_recv) ∧ ((curr_reqd_dlr ≠ x) ∨ (x ∉ send_cs_recv))) in state s_i ⟹ curr_reqd_dlr ≠ x in state s_i
and thus the lemma has been proven. □
in state Sql) (currreqdacked=T in state sq_-1 in state sq_1 is a contradiction, so this case * Since all three cases lead to contradictions and there are no other possible cases besides these, the original assumption must be false and thus the lemma has been proven. 0 Lemma A.43 rg=CSRECVx 3f, f<g, s.t. rf =<ASSIGN,,t> for some t Proof: ·* rg=CSRECVx == sEsendcsrecv in state s_-1 * xEsendcs-recv in state sg-1 ==- 3e, e<g-1, s.t. re=ACK-ASSIGNx by Lemma A.33 * 7re=ACKASSIGNx ==: f, f<e, s.t. 7rf=<ASSIGNx,t> for some t and thus the lemma has been proven. Lemma. A.44 A (((status.=RECV) A (h<i) A (,Be, e>h, s.t. re=<ASSIGNx,u> for any u) A (pendingackx=T)) Proof: By contradiction. Assume Ad, h<d<i, s.t. rd=CSRECV, (((statusxRECV) A A V (pendingackx=F)) in state sml) (7rm$<ASSIGNx,u> for any u) (rmCSRECVx) =: [] (pendingackx=F in state sh) 3d, h<d<i, s.t. 'd=CSRECV. $ by Lemma A.29 ((statusx$RECV) V (pendingack,=F)) in state sm 159 in state si) * (pending_ack=F in state sh) A (h<i) A (/e, e>h, s.t. A (/3d, h<d<i, s.t. 7rd=CS-RECV,) re=<ASSIGNx,u> for any u) ((status,#RECV) V (pendingack=F)) in state si by induction But this contradicts the lemma's predicate, and so the original assumption must be false. Thus the lemma has been proven. 0 == Lemma A.45 A (a is a well-formed execution) (((status.=RECV) A (pending_ack=T)) in state si) /Bh, h<i, s.t. 7rh=ACK_CSRECV, Proof: By contradiction. Assume 3h, h<i, s.t. rh=ACK_CS_RECVx * rh=ACK_CSRECVx ~= (((statusr=RECV) A (pendingack,=T)) in state Sh-1) A (((statusz=RECV) A (pendingack,=F)) in state sh) * status.=RECV in state sh-1 ==~ 3g, g<h-1, s.t. rg=CSRECVx * rg=CSRECV,== 3f, f<g, s.t. · 7g=CSRECVx =# ,Ad, dog, s.t. f =<ASSIGN,,t> for some t rd=CSRECV, by Lemma A.43 by Lemma A.42 * rf=<ASSIGNx,t> = ] e, e>f, s.t. ire=<ASSIGNx,u> for any u * (f<g<h) A (e, e>f, s.t. re=<ASSIGNx,u> for any u) ==: Be, e>h, s.t. lre=<ASSIGNx,u> for any u * (pendingackz=F in state sh) A (((status.=RECV) A (pendingack=T)) in state sl) A by Lemma A.30 (h<i) A (,e, e>h, s.t. re=<ASSIGNx,u> for any u) ==- 3d, h<d<i, s.t. rd=CSRECV, by Lemma A.44 But this is a contradiction, and so the original assumption inust be false. Thus o the lemma has been proven. Lemma A.46 A (a is a well-formed execution) (((status.=RECV) A (pendingack,=T)) 7rm #ERASE, Proof: 160 in state Sm-_) By contradiction. Assume rm=ERASE,. * ((status.=RECV) A (pendingack,=T)) in state sm-1 ==, h, h<m-1, s.t. rh=ACKCSRECVX by Lemma A.45 3, g<m, s.t. rg=ERASABLEx * 7rm=ERASE, = * 7rg=ERASABLEx==- xEpendingerasable * xependingerasable by WF2 in state sg-_ in state sg_1 =: 3f, f <g-l<m-1, s.t. rf=ACKCSRECV, This contradiction implies that the original assumption must be false and thus the lemma has been proven. 0 Lemma A.47 (a is a well-formed and fair execution) A (((statusx=UNFL) A (pendingack,=T)) i state sk) 31, I>k, s.t. rl=ACKASSIGNx Proof: * ((status.=UNFL) A (pendingack,=T)) in state sk ==. 3q, q>k, s.t. A3n,n<q, s.t. rn=ACKASSIGN_ * (((status,=UNFL) A (pendingackx=T)) in state sk) A (3q, q>k, s.t. n, n<q, s.t. rn=ACKASSIGN) by Lemma A.31 , p, p<q, s.t. (rp=CSREQD,) V (rp=CS_RECVx)V (rp=ERASE,) by Lemma A.37 · (((status,=UNFL) A (pending_ackx=T)) in state s,m-l) A (7r#CS_REQDx) A (7m#CSRECV,) A (rm#ERASE,) =- ((statusx=UNFL) A (pendingack,=T)) in state sm * By induction and the definition of a fair execution, it therefore followsthat 31, I>k, s.t. ri=ACK.ASSIGNx and thus the lemma has been proven. 5 Lemma A.48 ((<x,t>Epending_tsassign) V (timestampx=t, tEN)) in state si 3h, h<i, s.t. 
(rh=COMMITx)A (currentts=t, Proof: * Either (1) <x,t>Ependingts-assign in state si 161 tEA, in state sh-1) * <x,t>Ependingts-assign in state si - 3h, h<i, s.t. (7rh=COMMIT,) A (currentts=t, tENV, in state sh-1) by definition of steps(LOT) or (2) timestampx=t, tEN, in state si · timestampx=t, tAf, in state s =- 3g, g<i, s.t. rg=<ASSIGNx,t> · rg=<ASSIGNx,t> ==- <x,t>Ependingtsassign * <x,t>Ependingtsassign in state sg-1 in state sg-1 3h, h<g-l<i, s.t. := (rh=COMMIT,) A (currentts=t, tEjA, in state Sh-l) and thus the lemma has been proven. Lemma A.49 O i<j (currentts in state s)<(currentts in state sj) Proof: By induction. * The basis case of i=j is trivial. * For any action rj, either (currentts in state sj_l)=(currentts in state sj) for rj$COMMITxfor any xEA/ or (currentts in state s_l)+l=(currentts in state sj) for rj=COMMIT,for some xAf Therefore, for any action rj, (currentts in state sj_)<(currentts in state sj) * By the inductive hypothesis, i<j-1 = (currentts in state si)<(current.ts in state sj-1) * Therefore, (i<j-1) A ((currentts in state sj_l)<(current_ts in state sj)) == ((currentts in state si)<(current-ts in state s)) and so the lemma has been proven. ] Lemma A.50 A A (a is a well-formed execution) (curr_reqd_ts=t, tEAr, in state si) (((<x,t>Epending_tsassign) V (timestampa=t)) curr-reqddlr=x in state si 162 in state si) Proof: * ((<x,t>Ependingts-assign) V (timestampa=t, tEN)) in state si = 3h, h<i, s.t. (rh=COMMIT,)A (currentts=t in state sh-1) by Lemma A.48 * (rh=COMMIT,)A (currentts=t in state sh_1) = ((currreqd dlr=x) A (curr reqdts=t) A (currentts=t+1)) in state sh * (currentts=t+1 in state Sh) =Vj, j>h, t<currents in state s by Lemma A.49 * (Vj, j>h, t<currentts in state sj) A (curr.reqdts=t, tEAN,in state si) A (h<i) =*- /k, h<k<i, s.t. Wk=COMMITyfor any y * (currreqddlr=x A A in state sh) (Ik, h<k<i, s.t. rk=COMMITy for any y) (currreqdts=t, tL, in state si) == currreqddlr=x in state si and thus the lemma has been proven. Lemma A.51 by definition of steps(LOT) O (a is a well-formed execution) A (rI=ERASE ) Vk, k>i, currreqddlrix in state sk Pr(oof: * 7ri=ERASEx =- 3h, h<i, s.t. 7rh=CSRECVx *·rh=CSRECVZ -=xEsendcsrecv in state sh-1 * xEsend-csrecv in state Sh-1 =- curr-reqddlrx b2y Lemma A.34 in state sh-1 b2y Lemma A.40 * 7h=CSRECVx -== 3g, g<h, s.t. rg=<ASSIGNx,t> for some t biy Lemma A.43 * 7rg=<ASSIGNx,t> = <x,t>Ependingts.assign in state s_ * <x,t>Ependingts-assign in state sg-_ = * rf=COMMITx== * (currreqd-dlrjx, 3f, f<g-1, s.t. rf=COMMITx j, j>f, s.t. rj =COMMITx by WF1 xEA, in state smil) A (rmiCOMMIT,) curr-reqddlrox in state sm (currreqddlr:x in state sh-1) == * A (j, j>f, s.t. 7rj=COMMITx) A (f<g-l<h<i) =- Vk, k>i, curr.reqddlr6x in state sk and thus the lemma has been proven. 163 by induction 0 Lemma A.52 A (a is a well-formed execution) (timestampr=t, tEA/, in state sj) /k, k>j, s.t. 7rk=<ASSIGNx,u>for any u Proof: By contradiction. Assume 3k, k>j, s.t. rk=<ASSIGNx,u>for some u * timestampz=t, tEAJ, in state s ==: 3i, i<j, s.t. ri=<ASSIGN,t> * (i<j) A (j<k) =- i<k by Lemma A.30 * rk=<ASSIGN,,u> ==, i, i<k, s.t. ri=<ASSIGN,,t> This contradiction implies that the original assumption must be false, and so O the lemma has been proven. Lemma A.53 A (a is a well-formed execution) (timestampa=t, tEAr, in state si) Vl, l>i, timestamp,=t in state sI Proof: * timestampa=t, tEA, in state si by Lemma A.52 :==-j, j>i, s.t. 
rj=<ASSIGN.,u> for any u * (timestampr=t in state sm-l) A (rm$<ASSIGN,,u> for any u) === timestamp,=t in state sm * (timestamp,=t in state si) A (j, j>i, s.t. rj=<ASSIGN,u> for any u) by induction ==- Vl, I>i, timestamp,=t in state sl and thus the lemma has been proven. Lemma A.54 O (a is a well-formed execution) A (timestampx=t, tEAn, in state si) A (tfrecv_tss in state si) A (3h, h<i, s.t. rh=ERASEx) Vj, j>i, tfrecvtss in state sj Proof: By contradiction. Assume 3j, j>i, tErecvtss in state sj * (tfrecvtss in state si) A (tErecvtss in state sj) A (i<j) -= 3k, i<k<j, s.t. (tfrecv tss in state sk) A (tErecv tss in state sk+l) * (trecvAtss in state sk) A (tErecvtss in state sk+1) by definition of steps(LOT) in state sk ==- currreqdts=t 164 * (timestampa=t, tEAN,in state si) =- VI, I>i, timestamp,=t in state sl by Lemma A.53 * (Vl, l>i, timestampx=t in state sl) A (k>i) =- timestamp-=t in state sk * ((curr_reqd.ts=t, tEN) A (timestampx=t)) in state sk in state sk currreqddlr=x currreqddlr5x * (Irh=ERASEx) A (h<k) = =: in state sk by Lemma A.50 by Lemma A.51 But this is a contradiction and so the original assumption must be false. Thus the lemma has been proven. Lemma A.55 A 0 (a is a well-formed and fair execution) (xEsendcs-recv in state si) 3j, j>i, s.t. rj=ERASEx Proof: * xEsend-csrecv in state si -=- 3k, k>i, s.t. rk=CS_RECVx * 7k=CS-RECVX = ((status.=RECV) A (pendingackx=T)) in state sk * ((statusx=RECV) A (pendingackx=T)) in state sm_ == by Lemma A.46 in state sml) A (rnERASE,) rmERASEx * (((statusx=RECV) A (pendingack=T)) = ((statusx=RECV) A (pendingackx=T)) in state sm * ((status,=RECV) A (pendingackx=T)) in state sk == 3n, n>k, s.t. rn=ACK_CSRECV, by induction and fairness · rn=ACK_CSRECVx - xEpendingerasable in state sn * xEpendingerasable in state sn,= 3p, p>n, s.t. 7rp=ERASABLEr *· p=ERASABLE -3j, j>p, s.t. 7rj=ERASEx by WF4 o * By transitivity, j>i and thus the lemma has been proven. Lemma A.56 (a is a well-formed and fair execution) A (xEsendcsrecv in state si) A (timestampa=t, tAf, in state s) 3j, j>i, s.t Vk, k>j, trecvtss in state sk Proof: * xEsendcsrecv in state si - 31, I>i, s.t. 7rl=ERASEx by Lemma A.55 * rl=ERASE == ((status=NONR) A (pending-ackx=T)) in state sl * timestampx=t, tJN, in state si ==- Vn, n>i, timestamp,=t in state sn * 7r=ERASE, = 3r,r<l,s.t. rr=CS-RECVx 165 by Lemma A.53 by Lemma A.34 * by Lemma A.42 ,]q, q>r, s.t. 7rq=CSRECVx * r=CSRECVx = (statusr=NONR in state sl) A (/q, q>l, s.t. 7rq=CSRECV,) A A in state s) (pendingack,=T (Vn, n>l, timestamp,=t in state s) ==- 3j, j>l, s.t. rj=<DLR_GONE,t> by definition of fairness in state sj * 7rj=<DLR_GONE,t>== trecvtss * (timestampa=t, tENJ, in state s) A (torecvtss in state sj) A A (7rl=ERASE,) (I<j) Vk, k>j, trecvtss by Lemma A.54 in state sk and thus the lemma has been proven. == Lemma A.57 [ (a is a well-formed execution) A (ri=ACKASSIGN,) A A (curr_reqddlr=x (i<j) in state sj) currreqdacked=T in state sj Proof: by Lemma A.38 * 7ri=ACK.ASSIGNx -==3h, h<i, s.t. 7rh=COMMITx * 7rh=COMMITx = ((curr_reqddlr=x) A (currreqd_acked=F)) in state sh * 7rh=COMMIT,== ,/k, k>h, s.t. rk=COMMITx by WF1 * (curr.reqddlr=x, xEA, in state Sh) A (curr_reqd_dlr=x, xEJA, in state sj) A (h<j) A (Ak, k>h, s.t. rk=COMMITx) for any y) :(,l, h<l<j, s.t. rl=COMMITy A (Vm, h<m<j, currreqd dlr=x in state sm) * (ir=ACKASSIGNx) A (Vm, h<m<j, currreqddlr=x A in state sm) (h<i<j) =- curr reqdacked=T in state si * (curr-reqdacked=T =- in state si) A (I, i<l<j, s.t. 
r=COMMITyfor any y) currreqd acked=T in state sj and thus the lemma has been proven. 166 [] Lemma A.58 A (a is a well-formed execution) (rk=<ASSIGN,,t> for some t) status,=UNFL in state sk Proof: in state k. By contradiction. Assume statusZUNFL * rk=<ASSIGNx,t> == h,h<k, s.t. rh=<ASSIGNz,v> for any v by Lemma A.30 * Either (1) status.=REQD in state sk * status.=REQD in state sk =Z 3j, j<k, s.t. rj=CSREQD, ·* j=CSREQD, ==>x=sendcsreqd in state sj_l * =sendcs-reqd in state sj- 1 ==- 3i, i<j-1, s.t. ri=ACKASSIGNx 3h, h<i<k, s.t. 7rh=<ASSIGN,,v> for some v by Lemma A.29 This contradiction implies that this case is impossible. * ri=ACKASSIGN. = or (2) status.=RECV in state sk · statusr=RECV in state k =- 3j, j<k, s.t. 7rj=CSRECVr *·rj=CSRECVX ==, xEsendcs-recv in state sj_l * xEsendcsrecv in state sj-1 == 3i, i<j-1, s.t. is=ACK.ASSIGN, by Lemma A.33 * ri=ACKASSIGNx== 3h, h<i<k, s.t. 7rh=<ASSIGN.,v> for some v by Lemma A.29 But this is a contradiction, so this case must be false. or (3) statusx=NONR in state sk * status.=NONR in state sk * rj=ERASEx = · - 3j, j<k, s.t. 7rj=ERASEx 3i, i<j, s.t. 7ri=ACKASSIGNx by Lemma A.36 * 7ri=ACKASSIGNz==,3h, h<i<k, s.t. 7rh=<ASSIGNx,v> for some v by Lemma A.29 But this is a contradiction and so also this case must be false. * All possible cases lead to contradictions, so the original assumption must be false and thus the lemma has been proven. O (a is a well-formed and fair execution) Lemma A.59 A (ri=COMMIT,) 31, I>i, s.t. rl=ACKASSIGNx 167 Proof: * (ri=COMMITx)A (currentts=t == for some tEAr) in state si <,t>Epending_tsassign * <x,t>Epending-tsassign in state si -- 3k, k>i, s.t. 7rk=<ASSIGN,,t> ·* rk=<ASSIGN,,t>=# pendingack,=T in state sk by Lemma A.58 * 7rk=<ASSIGN,,t> == statusx=UNFL in state sk * ((statusx=UNFL) A (pendingack,=T)) in state sk by Lemma A.47 =:- 31, I>k, s.t. rl=ACKASSIGNX 0 and thus the lemma has been proven. Lemma A.60 (a is a well-formed execution) (ri=COMMIT,) A A (currentts=t, tJAr, in state si-l) Vj, j>i, ((<x,t>Epending_tsassign) V (timestampx=t)) in state sj Proof: By induction. * (ri=COMMITx)A (currentts=t in state si_1 ) = <x,t>Ependingtsassign in state si * (<x,t>Ependingtsassign in state sj-1) A (rj#<ASSIGN,,t>) in state sj =# <x,t>Epending_tsassign * (<x,t>Ependingtsassign by definition of steps(LOT) in state sj_) A (7rj=<ASSIGNx,t>) by definition of steps(DLR,) = timestampx=t in state sj * (timestamp,=t in state sj_l) A (rjy<ASSIGNx,u> for any u) by definition of steps(DLRx) == timestampx=t in state sj k, k>j-1, s.t. rk=<ASSIGN,,u> for any u * timestampx=t in state sjl =: by Lemma A.52 * Therefore, ((<x,t> pending_ts_assign)V (timestampx=t)) in state sj-1 V (timestampx=t)) ((<x,t>Epending_tsassign) for any possible action 7rj in a well-formed execution. == * By induction, Vj, j>i, ((<x,t>Ependingtsassign) in state sj V (timestampx=t)) in state sj 0 and so the lemma has been proven. (a is a well-formed execution) Lemma A.61 A (tErecvtss in state si, tEAJ) (<x ,t> Epending-ts assign) Sx s.t. Vj, j>i, ( V (timestampx=t) 168 ) in state sj Proof: * tErecvtss in state s =* 3h, h<i, s.t. curr.reqd ts=t in state sh * currreqdts=t, tEAr, in state Sh ==* 3g, g<h, s.t. (rg=COMMITxfor some x) A (currentts=t * (rg=COMMITx)A (currentts=t in state sg-1) ==Vj, j>g, ((<x,t>Ependingtsassign) in state sg_l) V (timestampx=t)) in state sj by Lemma A.60 * (g<h<i) A (Vj, j>g, ((<x,t>Ependingtsassign) V (timestampa=t)) in state sj) ==. Vj, j>i, ((<x,t>Ependingts_assign) V (timestampa=t)) in state sj and thus the lemma has been proven. 
Lemma A.62 A O (a is a well-formed execution) (xEsendcsrecv in state si) <s,u>pending_tsassign in state si for any u Proof: By contradiction. Assume <x,u>Ependingts-assign in state si for some u * xEsendcs-recv in state si == 3h, h<i, s.t. rh=ACK.ASSIGNx by Lemma A.33 * · h=ACKASSIGNx 3 9 , g<h, s.t. rg=<ASSIGNx,t> for some t · rg=<ASSIGN.,t> = (<x,t>Ependingtsassign =- by Lemma A.29 in state sg_1) A (<x,t> pending_tsassign in state s) * <x,t>Ependingts-assign in state sg-1 == 3f, f<g-1, s.t. (rf=COMMIT,)A (current ts=t in state sf-1) * rf=COMMIT , j, jf, s.t. rj=COMMIT, by WF1 · (<x,t>pending_tsassign in state sg) A (,j, j>f, s.t. rj =COMMITx) A (f<g) => Vk, k>g, <x,t>pending_tsassign * in state sk (<x,u>Epending_tsassign in state si) A A (Vk, kg, <x,t>4pending_tsassign in state sk) (g<h<i) ==- uot · <x,u>Ependingtsassign in state si == 3e, e<i, s.t. (re=COMMITx)A (current-ts=u in state se-1) * (currentts=t in state sf-1) A (currentts=u in state sl) = ef * ef => 3e, ef, A (u7t) s.t. re=COMMITx But this is a contradiction and so the original assumption must be false. Thus the lemma has been proven. O 169 Lemma A.63 (a is a well-formed execution) A (curr_reqd_dlr=x, A (( xEAJ, in state si) (<x,t>Ependingts-assign) V curr-reqdts=t (timestampa=t, tENA) ) in state si) in state si Proof: * ((<x,t>Epending_tsassign) V (timestampa=t, tEN)) in state si == 3h, h<i, s.t. (rh=COMMITx) A (currentts=t, tEAN,in state sh-l) by Lemma A.48 * (rh=COMMITx)A (currentts=t in state sh-1) == ((curr_reqd_dlr=x) A (currreqdts=t) A (current_ts=t+l)) in state sh * (((curr_reqd_dlr=x, xEAJ) A (curr_reqdts=t, tEN)) in state sm-l) A (currenttsat in state s,_l-) A (7rmCOMMITx) V * (((curr_reqd_dlrix) A (currreqdtsot)) in state sm) (((curr_reqd_dlr=x) A (currreqd ts=t)) in state sm) (((curr-reqddlrox, xEJA) A (currreqdtsot, A (current ts7t, tENJ, in state sm-l) A tEN)) in state Sm-1) (WrmCOMMITx) = ((curr.reqddlrhx) A (curr.reqd.ts6t)) in state sm · rh=COMMIT =- ]3j, j>h, s.t. rj=COMMITx * current_ts=t+l in state sh == VI, I>h, current ts>t+l in state sl * V1, I>h, currentts>t+l * A in state s ==* V, I>lh, currenttsat (((currreqddlr=x, xEA) A (currreqdts=t, (Vl, I>h, currenttsot in state sl) by WF1 in state sl tN)) in state Sh) A (j, j>h, s.t. rj=COMMIT,) : Vk, k>h, V * (((currreqd_dlrx) (((curr_reqddlr=x) A (currreqdts5t)) A (currreqdts=t)) in state sk) in state sk) by induction (curr-reqddlr=x in state si) A A (h<i) (Vk, k>h, V (((curr_reqddlrsx) A (((curr_reqddlr=z) A (curr_reqd.ts5t)) in state sk) (currreqdts=t)) in state sk) ) curr-reqdts=t in state si and thus the lemma has been proven. == Lemma A.64 A (a is a well-formed execution) (currreqddlr=x, xeJn, in state si) currreqd tsIL in state si 170 O Proof: xEAr, in state si =- 3j, j<i, s.t. rj=COMMIT * currreqddlr=x, * (rj=COMMITx)A (current-ts=t, for some tEJA, in state sj-l) -Vk, k>j, ((<x,t>Epending_tsassign) V (timestampx=t)) in state sk by Lemma A.60 * (Vk, k>j, ((<x,t>Epending_tsassign) V (timestampx=t)) in state sk) A (j<i) =, ((<x,t>Epending_tsassign) V (timestamp,=t)) in state si (curr-reqd-dlr=x in state si) * A V (timestampx=t)) (((<x,t>Ependingtsassign) in state si =- currreqdts=t in state si) by Lemma A.63 in state si · teAf =- currreqdtsI and thus the lemma has been proven. Lemma A.65 A 0 (a is a well-formed execution) (tErecvtss in state si) tEJ Proof: * tErecvtss in state si ==* either (rh=COMMITxfor some x) (1) 3h, h<i, s.t. 
A (curr-reqdtsII A * ((currreqdts5i) (curr-reqdts=t in state h-_) in state sh-1) A (curr_reqdts=t)) in state Sh-1 tf or (2) 3h, h<i, s.t. (Wrh=ACK-ASSIGNx) A A * currreqddlr=x, (curr reqddlr=x in state sh_l) (curr-reqdts=t in state sh-1) xEJ/, in state h_- in state Sh-1 -- currreqdtslI A (curr_reqdts=t)) * ((currreqdts$I) by Lemma A.64 in state sh-1 = tEAr or (3) 3h, h<i, s.t. A (rh=<DLRGONE,u> for some u7t) (curr_reqd_dlrIl in state sh-1) A (currreqdts=t * curr.reqddlrI in state sh-1) in state sh-1 for some EA, in state sh-_ for some xEAf, in state sh-1 by Lemma A.64 => curr-reqdtsI in state sh-1 == currreqddlr=x, * currreqddlr=x, * ((curr.reqd.tsl) A (curr-reqd ts=t)) in state sh_- = 171 tAF * Therefore, for all possible cases, the desired result is obtained and thus the 0 lemma has been proven. Lemma A.66 A (a is a well-formed and fair execution) (tErecvtss in state si) 3j, j>i, s.t. Vk, k>j, trecv_tss in state sk Proof: * tErecvtss == tEN by Lemma A.65 in state so) == 3h, h<i, s.t. (tfrecvtss in state sh-1) A (tErecvtss in state sh) * tErecv tss in state sh * (tErecvtss in state si) A (recvtss=0 = 3x s.t. VI, I>h, ( (<x,t>Epending_tsassign) V (timestampx=t) * (trecvtss in state sh-1) A (tErecvtss in state ==y either ) in state sl by Lemma A.61 h) (1) (rh=ACKASSIGNy for some y) A (curr-reqdts=t (((<x,t>Epending_tsassign) V (timestampa=t)) in state sh) * A (7rh=ACKASSIGNy) ==y * in state sh-1) ( (<x,t>Ependingts-assign) V (timestampx=t) ) in state h-1 (rh=ACKASSIGNy) A (trecvtss in state Sh-_) A (tErecvtss in state sh) --y (currreqddlr=y in state sh-1) A (yEsendcs-recv in state sh) · (currreqdts=t, tEAN,in state h-1) A (( (<x,t> Epending_tsassign) in state sh-1) V (timestamp,=t) ) by Lemma A.50 in state sh-1 :. currreqddlr=x * (curr-reqddlr=y in state sh-1) A (currreqd-dlr=x in state sh-1) ==> x=y * (yEsendcsrecv in state h) A (x=y) =. xEsendcsrecv in state sh or for some ut) A (curr-reqd-ts=t in state sh-1) (2) (rh=<DLR_GONE,u> V (timestampa=t)) in state sh) · (((<x,t>pending_tsassign) A (rh=<DLR_GONE,u>for some u7t) (<x,t> Ependingts-assign) =• ( V (timestampx=t) ) in state sah172 · (curr-reqdts=t, tEA, in state A (( Sh-_) (<x,t>Ependingts-assign) V (timestampx-=t) ) in state sh-1) ==- currreqd-dlr=x in state sh-1 (rh=<DLR_GONE,u>) * A A (trecvtss (tErecvtss A (currreqddlr=x in state sh-1) =- xsendcsrecv in state sh in state sh-1) in state sh) or (3) (h=COMMITy for some y) A (currreqdts=t * (((<x,t>cpendingts-assign) A in state shl) V (timestampx=t)) in state sh) (rh=COMMITy for some y) =- ( (<x,t>Ependingts-assign) V * by Lemma A.50 (timestampx=t) ) in state sh-1 (currreqd-ts=t in state sh-1) A (( (<x,t>Ependingts-assign) V (timestampx=t) ) in state sh-1) -= currreqddlr=-x in state sh-1 by Lemma A.50 * Either (i) 3g, g<h-1, * s.t. 7rg=ACKASSIGNx (7rg=ACKASSIGNx) A A (currreqddlr=x in state Sh-1) (g<h-1) ==- currreqdacked=T in state Sh-1 by Lemma A.57 * (7rh=COMMITy) A (curr reqd-ts=t, tA, in state h-_l ) A (currreqdacked=T in state shl) A or (currreqddlr=x in state sh-1) == xEsendcsrecv in state sh (ii) ,fg, g<h-1, s.t. rg=ACKASSIGNx * currreqddlr=x, xA/, in state sh_1 =: 3f, f<h-1, s.t. rf=COMMITx · rf=COMMITz 3e, e>f, s.t. r=ACK-ASSIGNx (3e, e>f, s.t. re,=ACK-ASSIGNx) =, · by Lemma A.59 A (/g, g<h-1, s.t. rg=ACKASSIGNx) A (f h-1) A (rh$ACKASSIGNx) =? 3e, e>h, s.t. re=ACKASSIGNx * 7rf=COMMITx == j3d, d>f, s.t. 173 rd=COMMITz by WF1 * (rh=COMMITy) A (,ad, d>f, s.t. rd=COMMIT.) 
A (f<h) > xy * rh=COMMITy ==> currreqd_dlr=y * in state sh (currreqd_dlr=y in state sh) A A A * A (y) (,d, d>f, s.t. rd=COMMIT,) (f<h) ==: Vc, c>h, currreqd_dlrx in state s, (7re=ACKASSIGNx) (e>h) (Vc, c>h, curr_reqd_dlrZx in state s,) =, xEsendcsrecv in state se Therefore, for all possible cases, it must be true that 3e, e>h, s.t. xEsendcs-recv in state se A * xEsend_csrecv in state s, == * in state se <x,t>opendingts.assign by Lemma A.62 (<x,t>Opendingtsassign in state s,) A (, I>h, ((<x,t>Epending_tsassign) V (timestampx=t)) in state si) A (e>h) == timestampx=t in state s, * ((xEsendcs-recv in state s,) A (timestampx=t, tAf)) in state s, == 3j, j>e, s.t. Vk, k>j, trecvtss and thus the lemma has been proven. Lemma A.67 in state sk by Lemma A.56 O (a is a well-formed and fair execution) xEJA, in state si) A (curr_reqd_dlr=x, A (currreqd.acked=T A (recvtss#0 A (,3j, j>i, s.t. in state si) in state si) (rj=COMMITyfor some y) A (curr_reqd_dlr=xin state sj-1) ) 3k, k>i, s.t. xEsendcs.recv in state sk Proof: By contradiction. Assume /3k, k>i, s.t. xEsend-cs-recv in state sk * currreqddlr=x, xEA, in state si by Lemma A.64 in state si =-= currreqdtsI * ((currreqddlr=x, xEAr) A (currreqdacked=T)) in state si => 3h, h<i, s.t. rh=ACK.ASSIGNx * 7rh=ACKASSIGNz=-- ,g, g>h, s.t. rg=ACKASSIGNx 174 by Lemma A.32 by Lemma A.41 * (curr-reqd-dlr=x, xEA, in state s-l) A (currreqdtsI in state s,ml) A (curr-reqdacked=T A (,rm#COMMITy for any y) A (#7rmACKASSIGN in state Sm-l) ) A (xfsendcsrecv in state sm) = (u(A,uEJf, s.t. (7rm=<DLRGONE,u>) A (recvtss={u} in state Sm.-l) ) A (currreqd dlr=x in state sm) A (currreqdts$I in state sm) A (currreqd acked=T in state sm) · A A A (currreqddlr=x, xzEN, in state si) (currreqd_ts± in state si) (currreqd acked=T in state si) (Aj, j>i, s.t. (rj=COMMITyfor some y) A (Ag, g>i, s.t. rg=ACKASSIGN,) A (curr-reqddlr=x in state sj-1)) A (Ak, k>i, s.t. xEsendcsrecv in state sk) (Vc, c>i, currreqddlr=x in state s) A · A (Ad, d>i, s.t. (rd=<DLR_GONE,u>)A A (recvtss={u} in state sd-1) for any u) by induction (currreqddlr=x, xEJA, in state sm-l) (recvtss=A in state sm-1) A (rmyCOMMITy for any y) A (irm ACK-ASSIGNx) (Au, uEA/, s.t. (rm=<DLR_GONE,u>) A -- * recvtssCA A (recvtss={u} in state Sm in sltate sm-l)) by definition of steps(LOT) (recvtss=R in state si) A A A (R0) (Vc, c>i, currreqddlr=x in state s) (3j, j>i, s.t. (irj=COMMITyfor some y) A (currsreqddlr=x in state sj-_)) A (Bg, g>i, s.t. 7rg=ACKASSIGNx) A (Ad, d>i, s.t. (rd=<DLR_GONE,u>) A A (recvtss={u} in state Sd-1) for any u) recvtssCR in state sb by induction (Vb, b>i, recvtssCR in state sb) (RO) A (Ad, d>i, s.t. =: Vb, b>i, · (7rd=<DLR-GONE,u>) A (recvtss={u} in state sd-1) for any u) :- 3v, vER, s.t. Va, a>i, vErecv tss in state sa 175 * vER in state si == 3n, n>i, s.t. Vq, q>n, vorecvtss in state sq But this contradicts the earlier deduction that Va, a>i, vErecvtss in state sa by Lemma A.66 and so the original assumption must be false and the lemma has been proven. 5 Lemma A.68 (a is a well-formed and fair execution) A (currreqdacked=T A (currreqddlr=x, A (recv_tss$0 in state sl) xEN, in state sl) in state sl) 3r, r>l, s.t. xEsendcsrecv in state s, Proof: * Either (1) 3r, r>l, s.t. (rr,=COMMITy for some y) A (currreqddlr=x in state s,_l) * currreqddlr=x, xEJV, in state s,_== curr-reqdtsI in state s,_* (currreqddlr=x in state sl) A (curr.reqddlr=x in state Sr-l) A (r>l) 3q, l<q<r-1, ==- * by Lemma A.64 s.t. 7rq=COMMITzfor any z (currreqdacked=T in state sl) A * (;,q, I<q<r-1, s.t. 
Lemma A.68  (α is a well-formed and fair execution) ∧ (curr_reqd_acked = T in state s_l) ∧ (curr_reqd_dlr = x, x∈𝒩, in state s_l) ∧ (recv_tss ≠ ∅ in state s_l) ⟹ ∃r, r>l, s.t. x ∈ send_cs_recv in state s_r

Proof:
* Either (1) ∃r, r>l, s.t. (π_r = COMMIT_y for some y) ∧ (curr_reqd_dlr = x in state s_{r-1}):
  - curr_reqd_dlr = x, x∈𝒩, in state s_{r-1} ⟹ curr_reqd_ts ≠ ⊥ in state s_{r-1}, by Lemma A.64
  - (curr_reqd_dlr = x in state s_l) ∧ (curr_reqd_dlr = x in state s_{r-1}) ∧ (r > l) ⟹ ∄q, l<q≤r-1, s.t. π_q = COMMIT_z for any z
  - (curr_reqd_acked = T in state s_l) ∧ (∄q, l<q≤r-1, s.t. π_q = COMMIT_z for any z) ⟹ curr_reqd_acked = T in state s_{r-1}
  - (π_r = COMMIT_y for some y) ∧ (curr_reqd_dlr = x in state s_{r-1}) ∧ (curr_reqd_ts ≠ ⊥ in state s_{r-1}) ∧ (curr_reqd_acked = T in state s_{r-1}) ⟹ x ∈ send_cs_recv in state s_r
  or (2) ∄m, m>l, s.t. (π_m = COMMIT_y for some y) ∧ (curr_reqd_dlr = x in state s_{m-1}):
  - (curr_reqd_dlr = x, x∈𝒩, in state s_l) ∧ (curr_reqd_acked = T in state s_l) ∧ (recv_tss ≠ ∅ in state s_l) ∧ (∄m, m>l, s.t. (π_m = COMMIT_y for some y) ∧ (curr_reqd_dlr = x in state s_{m-1})) ⟹ ∃r, r>l, s.t. x ∈ send_cs_recv in state s_r, by Lemma A.67
* The desired result follows for both possible cases, and so the lemma has been proven. □

Theorem A.69  (α is a well-formed and fair execution) ∧ (π_i = COMMIT_x) ⟹ ∃j, j>i, s.t. π_j = ERASE_x

Proof:
* π_i = COMMIT_x ⟹ ∃l, l>i, s.t. π_l = ACK_ASSIGN_x, by Lemma A.59
* π_i = COMMIT_x ⟹ ∄k, k>i, s.t. π_k = COMMIT_x, by WF1
* Either (1) ∃m, i<m<l, s.t. π_m = COMMIT_y for some y≠x:
  - (π_m = COMMIT_y for some y≠x) ∧ (∄k, k>i, s.t. π_k = COMMIT_x) ⟹ curr_reqd_dlr ≠ x in state s_{l-1}
  - (curr_reqd_dlr ≠ x in state s_{l-1}) ∧ (π_l = ACK_ASSIGN_x) ⟹ x ∈ send_cs_recv in state s_l
  or (2) ∄m, i<m<l, s.t. π_m = COMMIT_y for any y:
  - π_i = COMMIT_x ⟹ ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_i
  - π_l = ACK_ASSIGN_x ⟹ ∄h, h<l, s.t. π_h = ACK_ASSIGN_x, by Lemma A.41
  - (((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_{n-1}) ∧ (π_n ≠ COMMIT_y for any y) ∧ (π_n ≠ ACK_ASSIGN_x) ⟹ ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_n
  - (((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_i) ∧ (∄m, i<m<l, s.t. π_m = COMMIT_y for any y) ∧ (∄h, h<l, s.t. π_h = ACK_ASSIGN_x) ⟹ ((curr_reqd_dlr = x) ∧ (curr_reqd_acked = F)) in state s_{l-1}, by induction
  - Either (i) recv_tss = ∅ in state s_{l-1}:
    · (curr_reqd_dlr = x in state s_{l-1}) ∧ (recv_tss = ∅ in state s_{l-1}) ∧ (π_l = ACK_ASSIGN_x) ⟹ x ∈ send_cs_recv in state s_l
  or (ii) recv_tss ≠ ∅ in state s_{l-1}:
    · (curr_reqd_dlr = x in state s_{l-1}) ∧ (recv_tss ≠ ∅ in state s_{l-1}) ∧ (π_l = ACK_ASSIGN_x) ⟹ ((curr_reqd_acked = T) ∧ (curr_reqd_dlr = x) ∧ (recv_tss ≠ ∅)) in state s_l
    · ((curr_reqd_acked = T) ∧ (curr_reqd_dlr = x) ∧ (recv_tss ≠ ∅)) in state s_l ⟹ ∃r, r>l, s.t. x ∈ send_cs_recv in state s_r, by Lemma A.68
* Therefore, for all possible cases, it is true that ∃r, r≥l, s.t. x ∈ send_cs_recv in state s_r
* x ∈ send_cs_recv in state s_r ⟹ ∃j, j>r, s.t. π_j = ERASE_x, by Lemma A.55
and thus the theorem has been proven. □
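To make the liveness theorem concrete, the following toy sketch in Python (emphatically not the I/O-automaton model used above; the names and the drastically simplified linear transition chain are illustrative assumptions) drives one record through the commit-to-erase pipeline whose ordering was established by Lemmas A.29, A.34, A.36, A.38 and A.43, under a randomized scheduler that approximates fairness, and checks that ERASE eventually occurs, echoing Theorem A.69:

    import random

    # Action chain for one record x, abstracted from Lemmas A.29, A.34,
    # A.36, A.38 and A.43:
    #   COMMIT_x < <ASSIGN_x,t> < ACK_ASSIGN_x < CS_RECV_x
    #            < ACK_CS_RECV_x < ERASABLE_x < ERASE_x
    CHAIN = ["COMMIT", "ASSIGN", "ACK_ASSIGN", "CS_RECV",
             "ACK_CS_RECV", "ERASABLE", "ERASE"]

    def run(seed, max_steps=100000):
        """Drive one record through CHAIN under a randomized scheduler.

        Fairness is approximated by a fair coin: at each step the
        scheduler either fires the record's next enabled action or some
        unrelated action ("OTHER"), so an enabled action is postponed
        forever only with probability 0.  max_steps merely bounds the
        demonstration."""
        rng = random.Random(seed)
        trace, nxt = [], 0
        for _ in range(max_steps):
            if nxt == len(CHAIN):
                break                    # ERASE has occurred: liveness holds
            if rng.random() < 0.5:
                trace.append(CHAIN[nxt])  # fire the record's next action
                nxt += 1
            else:
                trace.append("OTHER")     # an action of some other record
        assert trace.count("ERASE") == 1, "ERASE never occurred (unfair run)"
        return [a for a in trace if a != "OTHER"]

    if __name__ == "__main__":
        for s in range(3):
            print(run(s))  # each trace ends with 'ERASE', per Theorem A.69

The real proof, of course, needs no probabilistic argument: fairness of the execution, together with the preconditions established by the lemmas above, guarantees each step of the chain outright.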