Logging and Recovery in a ...

advertisement
Logging and Recovery in a Highly Concurrent Database
by
John Sidney Keen
B.A.Sc., Electrical Engineering
University of Waterloo, 1986
M.A.Sc., Electrical Engineering
University of Waterloo, 1987
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Science
at the
Massachusetts Institute of Technology
May 1994
@1994 Massachusetts Institute of Technology
All rights reserved
Signature
-
of Author
- ---
.
Certified by
William J. Dally
Associate Professor of ElectricA Engineering and Computer Science
\
A J,
1A
I
Thesi
Advisor
dvis
Accepted by
rederic R. Morgenthaler
OarmCranyepFetment Conmiftee on Graduate Students
WITHDRAWN
FROM
Ml
S2M11e RARIES
,
_
lra ok
~\·n;
.
t g1
, I
JUL 1.3 1994
LI1BRAR!F.
t.
2
Logging and Recovery in a Highly Concurrent Database
by John Sidney Keen
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of Doctor of Science
at the Massachusetts Institute of Technology,May 1994
Abstract
This thesis addresses the problem of fault tolerance to system failures for database
systems that are to run on highly concurrent computers. It assumes that, in general, an
application may have a wide distribution in the lifetimes of its transactions.
Logging remains the method of choice for ensuring fault tolerance, but this thesis
proposes new ways of managing a database's log information that are better suited for
the conditions described above. The disk space reserved for log information is managed
according to the extended ephemerallogging(XEL) method. XEL segments a log into a
chain of fixed-size FIFO queues and performs generational garbage collection on records
in the log. Log records that are no longer necessary for recovery purposes are "thrown
away" when they reach the head of a queue; only records that are still needed for recovery
are forwarded from the head of one queue to the tail of the next. XEL does not require
checkpoints, permits fast recovery after a crash and is well suited for applications that
have a wide distribution of transaction lifetimes. The cost of XEL is more main memory
space and possibly a slight increase in the disk bandwidth required for log information.
XEL can significantly reduce the disk bandwidth required for log information in a system
that has been augmented with a non-volatile region of main memory.
When bandwidth requirements for log information demand an arbitrarily large collection of disks, they can be grouped into separate log streams. Each log stream consists
of a small fixed number of disks and operates largely independently of the other streams.
XEL manages the storage space of each log stream. Load balancing amongst the log
streams is an important issue. This thesis evaluates and compares three different distribution policies for assigning log records to log streams.
Simulation results demonstrate the effectiveness of the implementation techniques
proposed in this thesis for a highly concurrent database system in which transactions
may have a wide distribution in lifetimes.
Thesis Advisor:
William J. Dally
Associate Professor of Electrical Engineering and Computer Science
3
4
Acknowledgements
This thesis reflects a life-long learning process. Many people have guided and encouraged me along the way, and they have my deepest gratitude. From my earliest days
of childhood, my parents, Jim and Carolyn Keen, fostered my creativity and encouraged me to learn and think. When my love of technical engineering-style topics became
apparent, they supported my further development in this direction. Thanks, Mom and
Dad.
I am deeply indebted to my advisor here at MIT, Bill Dally. It has been a great
privilege to work with Bill. His intellect, energy and enthusiasm are simply amazing.
Inspirational! Bill graciously gave me considerable freedom to choose a research problem
for my thesis, and then he helped me to follow it through to a finished thesis. Thanks
for your encouragement, prodding and suggestions, Bill. I am also very grateful to the
readers on my thesis committee: David Gifford, Nancy Lynch and Bill Weihl. They all
offered many helpful comments and suggestions on preliminary drafts of my thesis.
My colleagues in the CVA (Concurrent VLSI Architecture) group were a pleasure
to work with and they helped me to refine my ideas. Lots of good criticisms and
valuable suggestions during presentations at group meetings assisted me in my work.
The friendship of all CVAers created a wonderful environment. I thank my current
officemates, Kathy Knobe and Diana Wong, for their cheerful company during this last
hectic year.
During my time at MIT, I always lived at Ashdown House (most of the time on the
fifth floor). I was fortunate to live in such a vibrant community of terrific people. There's
not enough space to list everyone's name, but you know who you are and I thank you all
for your kind friendship during our years together at MIT. I must make special mention
of Vernon and Beth Ingram, who have been the housemasters at Ashdown throughout
my stay. They dedicated themselves to making Ashdown a truly special place to live
and I shall be forever grateful to them.
I thank God that I have been blessed with the opportunity and ability to study at
MIT and complete this thesis. May the results of my work be to the glory of God and
the good of society.
The research described in this thesis was supported in part by a Natural Sciences
and Engineering Research Council of Canada (NSERC) 1967 Scholarship and in part by
the Advanced Research Projects Agency under contracts N00014-88K-0738, N00014-91J-1698 and F19628-92-C-0045 and in part by a National Science Foundation Presidential
Young Investigator Award, grant MIP-8657531, with matching funds from General Electric Corporation, IBM Corporation and AT&T Corporation. I thank all these sponsors
for generously supporting my research.
5
6
Contents
1
Introduction
1.1
11
Motivation
..................................
11
1.2 Major Contributions.
1.3 Statement of Problem.
1.4 Review of Previous Research ........................
1.4.1 Disk Storage Management ......................
1.4.2 Parallel Logging and Recovery ...................
1.4.3 Management of Precommitted Transactions ............
1.5
Commercial
Systems
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.6 Assumptions .................................
1.7 Summary of Remaining Chapters ......................
2 Extended Ephemeral Logging (XEL)
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
Preamble: Limitation of Ephemeral Logging ................
Conceptual Design of XEL ..........................
Types and Statuses of Log Records .....................
Management of Log Records ..........................
Buffer Management for Disk Version of the Database ...........
Flow Control.
Buffering, Forwarding and Recirculation ..................
Management of the LOT and LTT .....................
Crash Recovery.
3 Correctness Proof for XEL
3.1
3.2
3.3
3.3.2 Invariants for Composition of SLMand ENV.............
3.3.3 Correctness of SLMModule .....................
Implementation of Log Manager ......................
3.4.1 I/O Automata Model.
3.4.2
3.4.3
3.4.4
27
28
30
32
39
41
41
42
48
51
57
Simplifications.
Well-formedness Properties of Environment ................
Specification of Correctness (Safety) ....................
3.3.1 I/O Automaton Model .
3.4
12
14
15
16
19
21
22
24
25
Invariants for Composition of LMand ENV .............
Proof of Safety for LM ........................
Proof of Liveness ...........................
7
58
60
62
62
63
64
67
68
74
76
77
4 Parallel Logging
4.1
4.2
78
Parallel XEL.
. . . . . . . . . ..
Three Different Distribution Policies . . . . . . . . . ..
4.2.1 Partitioned Distribution . . . . . . . . .
4.2.2
4.2.3
Random Distribution
....
Cyclic Distribution ......
. . . . . .
. . . . . .
5 Management of Precommitted Transactions
5.1
5.2
5.3
The Problem.
Shortcomings of Previous Approaches
Logged Commit Dependencies (LCD)
.....
.....
6.3
Conclusions
7.1 Lessons Learned.
7.2 Importance of Results
7.3 Extensions .
7.4
.
.
..
.
.
.
..............
...
88
89
91
,.............
.............
,...
98
..............
. . . . . . . .
..............
..............
. .
81
88
,..........
..............
............
78
81
85
86
.
..............
6.1.4 Validity of Simulation Model ......
Extended Ephemeral Logging for a Single Strea I
6.2.1 Effect of Transaction Mix ........
6.2.2 Effect of Transaction Duration .....
6.2.3 Effect of Size of Data Log Records . . .
. . . . . . . .
6.2.4 Effect of Data Skew
. . . . . . . .
Parallel Logging .....
. . . . . . . .
6.3.1 No Skew ......
. . . . . . . .
6.3.2 Moderate Skew ..
. . . . . . . .
6.3.3 High Skew.....
6.3.4 Discussion.....
7
.
Simulation Environment.
6.1.1 Input Parameters .
6.1.2 Fixed Parameters.
6.1.3 Data Structures .
6.2
.
........
6 Experimental Results
6.1
..
..
. .. . . . .. .
. . . . .. . .
.
..
.
.
.
.
.
.
.
.
.
..............
.............
..............
...........
e..
,~.............
..............
..............
..............
. . . . . . . . . . . .
99
99
102
102
105
108
109
111
112
113
113
116
116
117
118
119
.......
..
. ..
. ..
..
. ..
. ..
. ..
. . .
. ..
. ..
. ..
. ..
..
..
.emo
.............
7.3.1 Non-volatile Region of Main MIlemo:ry .
7.3.2 Log-Only Disk Management
7.3.3 Multiplexing of Log Streams
7.3.4 Choice of Generation Sizes ....
7.3.5 Fault Tolerance to Isolated Media IFailures
7.3.6 Fault Tolerance to Partial Failures
7.3.7 Support for Transition Logging . .
7.3.8 Adaptive Distribution Policies . . .aCrash.
7.3.9 Quick Resumption of Service After a Crash
The Future of High Performance Databases .
..
..
..
..
. .
..
. .
. .
. .
. .
. .
. .
. .
..
..
119
120
121
121
121
122
123
123
124
124
125
....
....
...
...
...
...
....
....
....
....
...
. . .
126
...
. . . .
. .. .
. . ..
. ..
. ..
. ..
. . ..
. .. .
. .. .
. .. .
.. .
............... 126
. ..
A Theorems for Correctness Proof of XEL
128
A.1 Proof of Possibilities Mapping ............
128
8
A.2 Proof of Liveness ...............................
9
150
List of Figures
1.1
1.2
Disk Configuration for Concurrent Database System .....
Firewall Method of Disk Space Management for Log .....
15
17
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
Two Successive Updates to Object ob8. ..................
State of the Log After a Crash .......................
Disk Space Management Using XEL ...................
Typical Events During the Lifetime of a Transaction ..........
State Transition Diagram for a REDO DLR ................
State Transition Diagram for an UNDO DLR ...............
TLR ................
State Transition Diagram for a COMMIT
Data Structures for XEL ...........................
Buffering of Incoming Log Records for Batched Write to Disk ......
29
30
31
33
34
34
35
40
43
3.1
3.2
3.3
3.4
3.5
Interface of Simplified Log Manager ...................
Automaton to Model Well-formed Environment ..............
Specification Automaton for LM ......................
I/O Automata for an Object ........................
States and Action Signatures of Automata in LM Module ........
4.1
4.2
Four Parallel Log Streams .......................
Load Imbalance vs. Number of Objects ...............
79
83
5.1
5.2
Deadlock in Dependency Graph for Buffers at Two Log Streams
Static Dependency Graph of Log Streams ..............
90
90
6.1
Simulation
Transaction
Model ........................
.
.
.
59
61
63
68
71
100
110
6.2 Performance Results for Varying Transaction Mix .............
111
6.3 Performance Results for Varying Long Transaction Duration ......
6.4 Performance Results for Varying Size of DLRs from Long Transaction Typell2
114
...............
6.5 Performance Results for Varying Data Skew .
6.6 Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.5) . . 116
6.7 Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.01). . 117
6.8 Disk Bandwidth and Memory Requirements vs. Parallelism (x=2x10 - 4 ) 118
7.1
.
Multiplexing of Older Generations
10
....................
122
Chapter 1
Introduction
1.1
Motivation
This thesis re-examines the problem of fault tolerance within the context of highly
concurrent databases whose applications may be characterized by a wide distribution
in transaction lifetimes. The goal of this effort is to propose and evaluate new data
structures and algorithms that constitute an efficient and scalable solution to the fault
tolerance problem. This thesis devotes most of its attention toward the management of
information on disk and ignores other aspects of the problem. Current technical and
economic trends justify this approach. Processors have made dramatic improvements,
in terms of both cost and performance, compared to disk technology. Similarly, main
memory storage space and interconnection network bandwidth are now relatively inexpensive (compared to disk) and abundant in most concurrent computer systems. Disk
technology is becoming an increasingly crucial factor in the design of any concurrent
database management system (DBMS) because it accounts for a significant fraction of
the cost of a system and threatens to limit the system's performance [22, 7].
Highly concurrent computers offer the potential for powerful databases that can
process many thousands of transactions per second. According to [18], a good database
system typically requires 105 instructions per transaction for the debit-credit benchmark.
Current microprocessors can process instructions at a rate of at least 108 instructions
per second [29]. With current technology, therefore, it is reasonable to assume that
each processor can support a rate of at least 1000 TPS if there are no limitations other
than CPU speed. The overall performance of a system with hundreds of processors is
expected to be several hundred thousand transactions per second for the debit-credit
benchmark. A good DBMS design must eliminate bottlenecks that would otherwise
prevent users from fully harnessing the computational potential of highly concurrent
computers.
11
Data structures and algorithms that worked well in DBMS designs for serial computers are inappropriate for highly concurrent systems. Consider the specific problem
of fault tolerance to system failures (also known as crashes), in which the contents of
volatile main memory storage are corrupted. To provide support for atomic transactions, a DBMS must guarantee fault tolerance to crashes. Traditionally, DBMSs have
kept a log of all modifications performed by transactions that are still in progress. Conceptually, the log is a FIFO queue to which records are added at the tail; the log should
be sufficiently long that records become unnecessary before they reach the head. This
abstraction of a single FIFO queue becomes awkward and inappropriate in a highly
concurrent system; the tail of the queue is a potential serial bottleneck. Furthermore,
if only a single disk drive is dedicated to hold the log, it may not provide sufficient
bandwidth for large volumes of log information. For example, a system that generates
500 Bytes of log information per transaction and runs at 100,000 TPS needs at least 50
MBytes/sec of disk bandwidth. If current disk drive technology can provide at most 2
MBytes/sec bandwidth per drive, the system requires at least 25 disk drives just for log
information to ensure that the log is not a bottleneck.
To add to these difficulties, applications that use databases are becoming more diverse. Some applications may have a wide distribution of transaction lifetimes. An application with a small proportion of transactions whose lifetimes are much longer than
average poses problems for traditional logging algorithms. Most variations of logging
retain all log records that have been written (by all transactions) since the beginning
of the oldest transaction that is still in progress. Many of these log records may be
unnecessary for the purposes of recovery, but their disk space cannot be reclaimed as
long as some older record must be retained; this situation arises as a consequence of
the FIFO policy which governs the management of a log's disk space. This constraint
poses disk management problems. If a transaction lives too long, the log will run out
of space to hold new records. An obvious solution is to simply allocate a large amount
of disk space for the log, but this implies some unpleasant consequences. First, it may
unnecessarily increase a system's cost. Second, the large size of the log may entail a
much longer recovery time after a crash. These drawbacks prompt an investigation into
better methods of fault tolerance for databases whose applications may have a wide
distribution in transaction lifetimes.
1.2 Major Contributions
The following paragraphs summarize the major contributions of this thesis.
Extended Ephemeral Logging (XEL). XEL is a new technique for managing a
log of database activity on disk. This thesis presents the data structures and algorithms
which constitute XEL; it explains XEL's operation during both normal database activity
12
and recovery from a crash. XEL does not require periodic checkpoints and does not abort
lengthy transactions as frequently as traditional logging techniques which manage the
log as a FIFO queue (assuming that XEL and the FIFO queue technique are both limited
to the same amount of disk space). Therefore, XEL is well suited for highly concurrent
databases and applications that have a wide distribution of transaction lifetimes. XEL
can offer significant savings in disk space, at the expense of slightly higher bandwidth for
log information and more main memory. The reduced size of the log permits much faster
recovery after a crash as well as cost savings. XEL can significantly reduce both disk
space and disk bandwidth in a system that has at least some portion of main memory
which is non-volatile.
Proof of Correctness for XEL.
XEL's safety and liveness properties are formally
proven. Apropos safety, this thesis proves that XEL never does anything wrong; therefore, the database can always be restored to a consistent state after a crash, regardless of
when the crash occurs. The liveness property ensures that XEL always makes progress;
every log record is eventually erased.
Evaluation of XEL. The benefits and costs of XEL, relative to logging techniques
which manage log information in a FIFO queue, are quantitatively evaluated via eventdriven simulation. This thesis presents these experimental results.
Evaluation of Parallel Logging Distribution Policies.
The abstraction of mul-
tiple log streams for log information can be easily implemented in a highly concurrent
DBMS which requires an arbitrarily large collection of disk drives to provide the neces-
sary bandwidth for log information. A database system's distribution policy dictates the
log stream(s) to which any particular log record is sent. This thesis evaluates and compares three different distribution policies for a DBMS that has multiple parallel streams
of log records. The random policy, which randomly chooses a log stream for any log
record, has good load balancing properties and is simple to implement.
Logged Commit Dependencies
(LCD).
The LCD technique permits very high
throughput on "hot spot" objects in a highly concurrent database. It is a variant of the
precommitted transaction technique [15] and is especially well suited for a DBMS that
uses multiple parallel log streams. A transaction's dependencies on previous precommitted transactions are explicitly encoded in a special PRECOMMIT
record that is generated
as soon as the transaction requests to commit, thus eliminating potentially awkward
synchronization requirements between the log stream to which the transaction's COMMIT
record is eventually written and the log streams to which COMMIT
records are written for
the transactions on which it depends.
13
1.3
Statement of Problem
In any DBMS, the log manager (LM) manages log information during normal database
operation, and the recoverymanager (RM) is responsible for restoring the database to
a consistent state after a crash. Together, these two components make up a DBMS's
logging and recovery subsystem. The log holds records for only recent modifications
to the database. A version of the database kept elsewhere on disk stores the state of
all items of data in the database. At any given point in time, this disk version of the
database does not necessarily incorporate all the updates that have been performed
by committed transactions; some of these updates may be recorded only in the log.
Another DBMS component called the cache manager (CM) is responsible for managing
the contents of the disk version of the database and must work in collaboration with
the LM. The CM choosesto flush (transfer) updated objects' to the disk version of the
database in a manner that uses I/O resources efficiently.
Figure 1.1 graphically represents the disk configuration of a concurrent database
system. At the top, a collection of disk drives provide the bandwidth and storage
capacity required for log information generated by the DBMS; these are called the log
disks. The LM manages these drives. The exact number of log disks is chosen to
support the highest rate of transaction processing of which the rest of the system is
capable. A small number of buffers, in main memory, are dedicated to each of these
log disks. On the right hand side, some other collection of disk drives hold the disk
version of the database. The exact number of drives required for the disk version of
the database depends on the demands for disk space and bandwidth imposed by the
rest of the system. A buffer pool is associated with each different disk drive of the disk
version of the database. These buffer pools are quite large so that they serve as caches.
They reduce the number of retrievals from disk and allow writes to disk to be ordered
in a manner that permits higher transfer rates (due to mostly sequential I/O). The CM
manages these buffer pools and their associated disk drives.
A complete solution to the logging and recovery problem in a concurrent DBMS must
answer all of the following questions, which apply to the management of log information
during normal operation of the database:
1. What events are logged?
2. What information should a log record contain?
3. How does the LM decide the disk drive(s) to which it will write a log record?
4. At what time should the LM write a log record to disk?
5. Where on disk should the LM write a log record?
1The term objectis used broadly to denote any distinct item of data in a database. It may be a record
in a hierarchical or network database, a tuple in a relational database or an object in an object-oriented
database.
14
information
information
_1
------------II
main
memory
I
-------------
..
I..-..G
,
disk version
^
.
.
.
----------.----------------------------------I
of database
IM
= log disk buffer
buffer pool
i
..............................................
I...................................
Figure 1.1: Disk Configuration for Concurrent Database System
6. When can the LM overwrite a log record on disk with more recent log information?
7. When can the CM flush an updated object's new value to the disk version of the
database?
Any proposed logging and recovery method must also respond to the followingquestions that concern recovery after a crash:
1. How should the RM schedule retrievals of blocks of log information from disk?
2. In what order does the RM process the log records contained in a block of log
information?
3. Given a particular log record, what does the RM do with it?
1.4
Review of Previous Research
Several good textbooks and articles have been published on the subject of fault tolerance in database systems. A reader who would like to become familiar with the basic
techniques and terminology of the field is referred to [20, 24, 30, 5, 26]. The remaining subsections of this section review prior research that is specifically relevant to the
material in this thesis.
15
1.4.1
Disk Storage Management
The Firewall Method of Disk Management
Traditionally, logging has been the method of choice for fault tolerance in most database
systems 2 . A LM maintains a log of database activity as transactions execute and modify
items of data (i.e., objects) in the database. Conceptually, the log is a FIFO queue.
The LM adds log records to the tail immediately after they are created. Log records
progress toward the head of the queue and ought to become unnecessary by the time
they eventually reach the head. The DBMS allocates a fixed amount of disk space to
hold log information. The LM manages this disk space as a circular array [3, 10]; the
log's head and tail pointers rotate through the positions of the array so that records
conceptually move from tail to head but physically remain in the same place on disk.
System R [24] is a familiar example of this traditional logging technique.
The LM maintains a pointer to the oldest record in the log that must still be retained;
this constitutes a "firewall" beyond which the head of the log cannot be advanced.
Hence, this logging technique shall be referred to as the firewall (FW) method. The
LM initiates periodic checkpoints. As soon as a checkpoint begins, the LM writes out
a special beginning-of-checkpoint record to the log. During a checkpoint, the CM writes
out all updated objects to the disk version of the database3 and then the LM writes out
a special end-of-checkpoint record to the log. After the checkpoint has completed, the
LM can be sure that all preceding log records for committed updates are no longer
necessary to ensure correct recovery of the database after a crash. The LM keeps
a pointer to the position within the log of the beginning-of-checkpoint record for the
most recently completed checkpoint. The LM also maintains a pointer for each active4
transaction that identifies the position within the log of the oldest record written by
the transaction. At any given time, the log's firewall is the oldest of the pointers for
all active transactions and the pointer to the beginning of the most recent checkpoint.
Figure 1.2 illustrates an example.
If the log starts to run short on space for new log records, the LM must free up
some space by advancing the firewall. It must either kill an old active transaction or it
must perform another checkpoint, depending on the exact nature of the current firewall.
In general, it is bad to kill a transaction because this will likely annoy the client who
originally initiated the transaction. Furthermore, all the resources consumed by the
transaction have essentially been wasted, and the transaction's effort will be repeated
2
Logging is not the only possible solution to the fault tolerance problem. However, it has tended to
be the most popular solution for reasons of performance and efficiency. Refer to [37, 30, 2, 5, 39] for
explanations of alternative methods of achieving fault tolerance (such as shadowing) and comparisons
of the strengths and weaknesses of the different approaches.
3Sophisticated fuzzy checkpoint methods allow the database to continue servicing requests from client
transactions while the CM flushes out all updated objects, so that the checkpoint activity causes negli-
gibly small disruption to normal operation of the database.
4
An active transaction is one that is still in progress (it has not committed nor been aborted).
16
oldest
record from
all active
transactions
1I
new
_I
_
log
records
V
tail
head
log
ftamost
recent
checkpoint
record
log
rewall
I
unused disk space (available for re-use)
~
old log records no longer required for recovery
Figure 1.2: Firewall Method of Disk Space Management for Log
if its client decides to try it again. This scenario is particularly irritating because
many records in the log may be unnecessary for recovery purposes but the FIFO queue
abstraction prevents the LM from reclaiming their space until all prior log records have
been rendered unnecessary. For example, suppose that a transaction updates an object
and continues to live for another 10 min while many short (several seconds) transactions
each update a few objects and commit. The log records from these short transactions
follow the long transaction's first log record. Even though most of the log records
from these short transactions are no longer needed for recovery, their space cannot be
reclaimed until the long-lived transaction finishes.
Checkpoints are not free. They become awkward in a concurrent system because
they entail synchronization and coordination amongst an arbitrarily large number of
participating parties. In general, if the LM wishes to perform a checkpoint, it must
coordinate activity at all the log disks and all the disk drives on which the disk version
of the database resides. A checkpoint operation requires communication bandwidth,
processor cycles, storage space in main memory and disk bandwidth. Periodic checkpoints may interfere with the CM's operation by constraining it to schedule flushes to
disk in an order that does not take full advantage of locality within the disk version
of the database. Finally, the duration required to perform a checkpoint and the delays
between consecutive checkpoints limit the speed with which the LM can reclaim space
in the log.
As the number of log disks increases, the abstraction of a single queue becomes
increasingly difficult to implement. The tail of the queue is a potential serial bottleneck
and the LM must carefully manage the order in which it writes blocks to each log disk
so that it preserves the log's FIFO property.
Therefore, the traditional "firewall" method of logging poses implementation prob-
17
lems for highly concurrent databases with wide variations in transaction lifetimes because it does not reclaim disk space sufficiently quickly, it suffers from the overhead of
periodic checkpoint operations and it is difficult to implement on an arbitrarily large
number of disk drives.
Recirculation of Log Records
Hagmann and Garcia-Molina [32] propose a solution to the disk management problem
posed by long lived transactions. If a record reaches the head of the log but must still
be retained, the LM recirculates the record within the log by adding it to the tail.
In contrast to the XEL method which this thesis will propose, they continue to
implement the log as a single FIFO queue, rather than a chain of FIFO queues. A log
record from a very long lived transaction may therefore be recirculated numerous times
within the log, thereby consuming more bandwidth than would be required by the XEL
method. Furthermore, Hagmann and Garcia-Molina do not attempt to eliminate the
need for checkpoint operations.
Log Compression
Log compression filters out unnecessary information so that less storage space is required
to store the log. However, previous methods of log compression [34, 31] are intended
for "batch mode" log compression; the LM cannot add new records to the existing log
while it is being compressed.
The XEL method proposed in this thesis essentially performs log compression, but
its continuous and incremental nature distinguishes it from previous log compression
methods.
Generational Garbage Collection
Previous work on generational garbage collection inspired XEL's essential idea: the log
is segmented into a chain of FIFO queues. Lieberman and Hewitt [40] proposed the segmentation of a system's main memory storage space into several temporal generations;
most of the system's garbage collection effort is limited to only its younger generations. Quantitative evaluation of several variations on generational garbage collection
[62, 49, 63, 59] have demonstrated its effectiveness for automatic memory management.
However, previous work on generational garbage collection addressed the more general problem of automatic storage reclamation by a programming language's runtime
18
system. The reference pattern amongst a program's data objects is usually more complicated than the dependencies that exist amongst log records, and so garbage collection
methods that worked well for programming languages may be inappropriate for managing the disk storage reserved for a database's log. Less complicated yet more effective
techniques may be possible for the simpler problem of managing a database's log.
Rosenblum and Ousterhout [56, 57, 58] adopt a similar strategy for the log-structured
file system (LFS). The LFS adds all changes to data, directories and metadata to the end
of an append-only log so that it can take advantage of sequential disk I/O. The log consists of several large segments. A segment is written and garbage collected ("cleaned")
all at once. To reclaim disk space, the LFS merges non-garbage pieces from several
segments into a new segment; it must read the contents of a segment from disk to decide what is garbage and what isn't. There is a separate checkpoint area and the LFS
performs periodic checkpoints. The LFS takes advantage of the known hierarchical reference patterns amongst the blocks of the file system during logging and recovery; the
inode map, segment summary blocksand segment usage array data structures describe
the file system's structure and play an important role in the LFS's management of disk
space.
1.4.2
Parallel Logging and Recovery
Distributed databases are similar to concurrent databases, but not identical. Like concurrent databases, a distributed database consists of an arbitrary number of processors
linked via some communication network; the total set of processors is partitioned into
disjoint sites, and the communication network connects the sites together. These sites
can share data, so that a transaction executing at any particular site effectively sees one
large database. However, the underlying technology distinguishes distributed databases
from concurrent databases. In general, a distributed database's communication network
has lower bandwidth, much longer latency and lower reliability than that of a concurrent
database. Moreover, individual sites or links in a distributed database's network may
fail (partial failures, in the terminology of [5]), yet other portions of the system remain
intact; the system can continue to provide service to client transactions as long as these
transactions do not require access to data that is unavailable due to failures elsewhere.
For many concurrent systems, either the entire system is operational so that all data is
available to client transactions or it is completely failed (a total failure [5]) so that no
transaction can execute. This "all or nothing" characteristic simplifies some aspects of
the DBMS implementation problem.
The limitations of a distributed database's communication network and concerns
about availability of data lead to rigid partitioning of data objects within the system.
Assume, for simplicity, that objects are not replicated at different sites. Each object has
a home site [5]and objects do not migrate between sites. Any transaction that wants to
access a particular object must do so at its home site. Whenever an object is updated,
19
log information is generated and stored at its home site. This rigid partitioning is liable
to load balancing problems. One site may hold a disproportionately large number of
"hot spot" objects 5 and so it is overloaded while another site is relatively idle. However,
the rigid partitioning prevents the system from taking advantage of available processing
power and disk bandwidth elsewhere in the system. A concurrent system has more
flexibility in balancing the loads amongst processors and disk drives. The possibility of
partial failures necessitates complicated algorithms for atomic commitment, such as the
two phase commmit (2PC) and three phase commit (3PC) protocols [5],for distributed
databases. These sophisticated techniques are unnecessary for concurrent databases
under the "all or nothing" assumption.
The SWALLOW distributed storage system [54] stores objects in a collection of
autonomous repositories. Each object is represented as a sequence of versions, called
a version history, which records successive updates to the object over time. SWAL-
LOW creates a commit record at each repositoryat which a transaction updates objects
and links new versions of updated objects to the transaction's commit record while the
transaction executes. If the transaction eventually aborts, the system deletes the transaction's commit record and the versions for the updates which it performed on objects.
Otherwise, the committed transaction's changes become permanent in the multiversion
representation of the database. Hence, there are no distinct log records in SWALLOW
because it retains versions for objects. Old versions can be garbage-collected when there
are no longer any references to them.
Lomet [41] proposes a new method for redo logging and recovery that is intended for
use in a data sharing system. Multiple nodes can access common data, yet each node
can have its own log. In contrast to the partitioning of data which characterizes many
distributed databases, Lomet's system allows data objects to migrate within the system.
After modifying an object, a processing node records the operation in its own private
log; each private log is a sequential file, apparently managed as a FIFO queue. Hence,
log records for different updates to the same object may be distributed throughout
a collection of private logs. This data sharing approach offers potentially better load
balancing behavior. For each object, Lomet's system ensures that at most one node's log
can record updates that have not yet been applied to the version of the object currently
in the disk version of the database; hence, recovery activity for each object is still limited
to only a single node even though an object's log records may be distributed throughout
the private logs of numerous nodes.
Previous researchers have already broached the problem of parallel logging and recovery in a concurrent database but they often focussed on only isolated subproblems
or declined to propose detailed solutions. The expedient of using several disk drives to
increase the throughput for log information was suggested in [15]. Lehman and Carey
[39] propose storing log information in a logically parallel manner that easily maps to a
physically parallel implementation; every partition 6 in the database has its own separate
5 "Hot
6
spot" objects are accessed much more frequently than other objects in the database.
A partition, as defined in [39], is the unit of memory allocation for a system's underlying memory
20
log of activity.
Agrawal [1, 2] investigates some alternative solutions to the recovery problem for the
multiprocessor-cache class of database machines, of which the DIRECT system [14, 8]
is an example. He investigates four different distribution policies for determining the
log disk to which to send a log record: cyclic, random, Query Processor Number mod
Total Log Processors (QPNmTLP) and Transaction Number mod Total Log Processors
(TNmTLP). Agrawal concluded that the TNmTLP policy suffers from significant load
balancing problems; an exceptionally industrious transaction can generate a deluge of
log information for one log disk while the other disks sit relatively idle. As processor
speed and bandwidth continue to rise, relative to disk bandwidth, the QPNmTLP policy
faces similar load balancing problems. Agrawal examined these four policies within the
context of the multiprocessor-cache class of database machines and in conjunction with
different solutions to some of the other subproblems associated with logging and recovery.
His evaluation criteria focused on overall performance (throughput and response time)
rather than on specific aspects of efficiency. He does not consider how much disk space
the log requires or the extent to which the log disks' loads are balanced. This thesis
will make different assumptions that more closely model today's concurrent computer
technology and will choose different evaluation criteria.
Apropos recovery, DeWitt et al. [15] propose merging several parallel logs into a single log so that familiar algorithms from the sequential world will be applicable. Agrawal's
algorithm [1] does not require a merge step, but it processes each log sequentially one
after the other and therefore forfeits the opportunity to exploit parallelism during recovery. Kumar [38] proposes a parallel recovery algorithm, but it requires processing at
all log streams to proceed in a lock-step manner; frequent barrier synchronization limits
the performance and scalability of this algorithm. To address these limitations, he proposes an improved algorithm that involves minimal synchronization between recovery
activities at separate log streams, but this latter algorithm still requires two scans over
each log stream.
1.4.3
Management of Precommitted Transactions
Interactions between the LM and the concurrency control manager (CCM) of a DBMS
can affect a system's overall performance. Strict two phase locking (2PL) [5] requires
that the CCM release all write locks held by a transaction only after the transaction has
committed or aborted. Hence, the CCM must hold all the write locks of a successful
transaction for at least as long as the minimum response time for the LM to accept a
DLR and process a request to commit from the transaction. Without non-volatile main
memory, the lower bound for this response time is the minimum time required to write
a block to disk. Slow response on the part of the LM may entail concurrency control
bottlenecks on some objects that must be updated very frequently.
mapping hardware.
21
To alleviate these performance limitations, the precommitted transaction (PT) technique [15] enables a CCM to release a transaction's write locks as soon as the transaction
requests to terminate. The transaction precommits when it requests to commit, and it
commits when all its log records (including a COMMIT record) have been written to
disk. It is in a precommitted state between these two times.
The PT technique alleviates the throughput bottleneck on "hot spot" objects due
to I/O latency to disk. Without the PT technique, a transaction that has requested
to commit cannot release its write locks until after it has committed, lest some other
transaction see its effects before they have been made permanent in the database. In
this case, the maximum rate at which independent transactions can update an object
is limited by the rate at which successive blocks of log records can be written to disk.
If the minimum time to write a block to disk is Tmi, (for example, 10 ms), then each
transaction must hold its write locks for at least
Tmi,
and the maximum throughput
for any object is 1/Tmin. If a database has hot spot objects, its entire throughput may
therefore be limited to 1/Tmi,. When the LM uses the PT technique, independent
transactions can update hot spot objects at a much higher rate, limited by the time
required to acquire and release locks.
Now suppose that a DBMS's LM supports precommitted transactions. A transaction can release its write locks after it has requested to commit but before it actually
commits. While it waits in this precommitted state, the only thing that could cause
it to fail is some failure on the part of the DBMS (e.g., a crash) which prevents the
log records from being written to disk. Other transactions can see the updates from
the precommitted transaction, but they become dependent on it to eventually commit.
The LM sends an acknowledgement to the transaction in response to its commit request
after the transaction commits. If a crash occurs before the precommitted transaction
commits, the RM must ensure that the restored database does not include updates from
the transaction or any of the subsequent transactions which depended on it. After a
transaction commits, a COMMIT log record exists on disk to record the fact that the
transaction committed and so its effects are guaranteed to survive a crash.
1.5
Commercial Systems
This section briefly reviews existing commercial database systems. Most commercial
systems incorporate variations of the techniques presented in the previous section.
IBM has had a long and influential presence in the database market. Details about
several of its most noteworthy products can be found in [26]. IMS, one of the industry's
earliest database systems, runs on the MVS operating system; MVS runs on computer
systems built around the IBM 370 processor family. Early versions of IMS maintained
separate redo and undo logs. Redo information was kept for restart recovery and undo
22
information (the dynamic log) was for transaction backout. More recently, IMS has
merged the redo and undo logs into one log to reduce I/O at commit. IMS supports
group commit [5] and performs fuzzy dumps for archive recovery. IBM's IMS FastPath
system introduced the notions of group commit and main-storage databases, among
other things. It is a pure deferred-update, redo-only system (this implies a STEAL
buffer management policy [5]). DB2 is IBM's implementation of SQL for its mainframe
systems; it implements a -STEAL buffer management policy and the WAL (write ahead
log) protocol [5].
DEC markets a database system called Rdb/VMS. It runs on a VAXcluster system ("shared disk" architecture) and supports a single global log. Rdb/VMS performs
undo/redo logging, with separate journals for redo and undo log records; the After Image
Journal (AIJ) holds redo information and the Run-Unit Journal (RUJ) keeps undo information. Periodic checkpoints bound the length of recovery time. Rdb/VMS exploits
group commit to achieve efficient disk I/O.
The Teradata DBC/1012 system [60, 46, 47, 55] is a highly parallel database system
that is intended principally for decision support (i.e., mostly queries) but which provides
some support for on-line transaction processing (OLTP). The most recent version, the
DBC/1012 model 4, incorporates up to 1024 Intel 80486 microprocessors. The system's
processors are divided into interface processors (IFPs) and access module processors
(AMPs). The DBC/1012 can support multiple hosts; each IFP connects to one particular host. The AMPs manage the data. Tuples within a relation are partitioned across
the AMPs so that the DBC/1012 can support parallelism both within and between
independent requests. The AMPs and IFPs are interconnected by a proprietary Ynet
"active logic" interconnection network.
The Tandem NonStop SQL system [27, 28, 33, 16] is essentially a distributed relational database system. Data objects are partitioned across multiple processing nodes
and transactions can access data at different sites. The system provides local autonomy
so that a site can perform work despite failures at other sites or in the interconnection
network. An implementation of the two-phase commit (2PC) protocol [5] ensures that
distributed transactions commit atomically. NonStop SQL performs undo/redo logging
and maintains a separate log at each site. The state of the database is periodically
archived by performing a fuzzy dump.
Oracle's Parallel Server [44] is intended to run on highly parallel systems (such as
the KSR1 [61]). It can support very high transaction processing rates, compared to
other available systems. Version 6.2 has been benchmarked at 1073 TPS (transactions
per second) for the TPC-B benchmark [21], with a cost of only $2,480 per TPS; these
results were obtained for Oracle V6.2 running on an nCube concurrent computer system with 64 processors. Parallel Server employs redo/undo servers for log information,
performs periodic checkpoints and uses a partitioned distribution policy for distributing
log records.
23
1.6
Assumptions
Physical State Logging at the Access Path Level.
This thesis limits its attention
to physical state logging at the access path level [30]. According to this definition,
any modification to an object in the database results in a log record that holds some
representation of the state of the object; the log record may hold the pre-modification
state of the object, the post-modification state, or both.
Buffer Management: -FORCE, STEAL. This thesis assumes the most general
policy for buffer management. The CM may flush an updated object to the disk version
of the database whenever it chooses, regardless of whether or not the transaction that
performed the update has yet committed. Restated in formal terminology, the buffer
management policy is -FORCE and STEAL [30].
Concurrency Control: Two Phase Locking.
The LM does not perform concur-
rency control, but the concurrency control manager that schedules requests to the LM
on behalf of client transactions must respect certain restrictions if the RM is to be able
to restore the database to a consistent state after a crash. This thesis assumes that
the concurrency control manager performs two phase locking (2PL) [17, 5] so that all
executions are serializable.
All chapters except Chapter 5 will further assume that all executions are strict [5]:
no transaction reads or updates an object that has been modified by another transaction
which has not yet committed. Chapter 5 relaxes this assumption slightly; a transaction
may release its write locks shortly after it requests to commit even though it has not
actually committed yet.
Volatile Main Memory, Non-volatile Disk Storage.
All main memory is volatile.
In the event of a system failure, such as a power interruption, some or all of the contents
of main memory may be lost. In contrast, disk storage (secondary memory) is nonvolatile. Any information written to disk will remain on disk despite a system failure.
Distributed Memory Multiprocessor System.
The data structures and algo-
rithms presented in this thesis are intended for a fine-grain distributed memory multiprocessor system, such as the MIT J-Machine [11, 12, 48], in which each processor can
directly address only its own local memory and all interprocessor communication must
occur via explicit message passing. Nevertheless, the techniques presented in this thesis
could be adapted to a shared memory multiprocessor system with little effort.
24
Disks are the Limiting Resource.
This thesis addresses the problem of manag-
ing log information on disk, subject to limited storage capacity and bandwidth. Disk
technology threatens to limit the performance of concurrent database systems, and is
expected to account for a significant fraction of their cost [22, 7]. Existing concurrent
computer systems provide abundant computational power, volatile main memory storage and interprocessor communication ability so that none of these resources constitutes
a bottleneck. Hence, the attention specifically to disk technology.
Recent trends and expectations for future progress justify this assumption. Processor performance has increased dramatically over the past decade and will likely continue
to improve at a fast pace in the near future. Similarly, DRAM (dynamic random access
memory) capacities have soared and prices have fallen during the past decade, and these
trends in main memory technology are expected to continue. Interconnection network
technology has improved significantly so that high bandwidth, low latency interprocessor communication is now a reality. For example, the MIT J-Machine provides 288
Mbps communication bandwidth per channel [12, 50], and each processing node has 6
channels. In contrast, the capacity, bandwidth and cost of disk drives have not improved
as dramatically.
Unique Identifiers for Objects and Transactions.
Every object in the database
must have some unique object identifier (oid). Similarly, each transaction must have a
transaction identifier (tid) that distinguishes it from all other transactions.
1.7
Summary of Remaining Chapters
Chapter 2 explains the extended ephemeral logging (XEL) method. XEL is a new
method for managing a log of database activity on disk. It is a more general variation
of ephemeral logging (EL) [35]; XEL does not require a timestamp to be maintained
with each object in the database. XEL does not require periodic checkpoints and does
not abort lengthy transactions as frequently as traditional firewall logging for the same
amount of disk space. Therefore, it is well suited for highly concurrent databases and
applications that have a wide distribution of transaction lifetimes.
Important safety and liveness properties for a simplified version of XEL are proven
in Chapter 3. The log record from the most recently committed update to an object
remains recoverable as long as log records from earlier updates to the same object can
be recovered from the log. However, every log record is eventually erased so that its
space on disk can be re-used for subsequent log information.
Chapter 4 considers how to manage log information in a highly concurrent database.
The abstraction of a collection of log streams, all operating in parallel with one another,
25
is suitable for applications with very high bandwidth requirements for log information.
The LM uses the XEL method to manage the disk space within each log stream. With
multiple log streams, the LM must have some distribution policy by which it decides
the stream(s) to which it will send any particular log record. Chapter 4 analyzes three
different distribution policies.
Chapter 5 points out the difficulties of implementing the PT technique, in its current
form, in a LM that supports an arbitrarily large collection of log streams. The chapter
proposes a new variation of the technique, called Logged Commit Dependencies (LCD),
which alleviates these difficulties. It introduces a new type of record, called a PRECOMMIT
record, which explicitly states all a transaction's dependencies at the time that the
transaction requests to commit.
Chapter 6 quantitatively evaluates XEL via event-driven simulation. XEL's complex-
ity severelylimits analytical attempts to evaluate its performance. Simulation provides
an alternative means by which to study its behavior. Section 6.1 describes the imple-
mentation of a simulator for XEL. It explains each of the input parameters, documents
the fixed parameters, presents the definitions of XEL's data structures as expressed in
the C programming languange [36] and justifies the validity of the simulation model.
Section 6.2 evaluates XEL's performance for only a single log stream as various input
parameters vary and compares XEL's performance to that of the FW method. The
following section examines XEL's behavior for a collection of parallel log streams as the
degree of parallelism increases. Disk space, disk bandwidth, main memory requirements
and recovery time are the evaluation criteria throughout the chapter.
The last chapter of the thesis summarizes the important lessons that were learned,
explains the importance of the results and discusses various extensions to XEL.
26
Chapter 2
Extended Ephemeral Logging
(XEL)
This chapter proposes a new technique for managing the disk space allocated for log
information. This new technique, called extended ephemeral logging (XEL), is a more
general variation of the ephemeral logging (EL) technique that the author presented in
an earlier publication [35]. Both EL and XEL break the abstraction of the single FIFO
queue that was presented in Section 1.4.1. Rather than managing log information in
a single FIFO queue, EL and XEL treat the log as a chain of fixed-size FIFO queues
and perform garbage collection at the head of each queue. This approach, inspired
by previous research on generational garbage collection, mitigates the threat of the log
running out of space for new log records because a transaction lives too long; EL and
XEL can retain the records from long running transactions but can reclaim the space of
chronologically subsequent log records that are no longer needed for recovery. Hence, a
log manager that uses EL or XEL generally requires less disk space than one that treats
the log as a single FIFO queue. Intuitively, this advantage is strongest if an application
has only a small fraction of transactions that execute for a very long time and write
only a small number of records to the log.
Another strong motivation for EL and XEL arises if a system can be augmented
with a limited amount of non-volatile main memory. In such a system, EL and XEL can
drastically reduce the amounts of disk space and bandwidth required for log information
if most log records emanate from short-lived transactions. The benefits here are twofold.
First, the system's cost may be substantially reduced since fewer disk drives are needed
for log information. Second, recovery after a crash may be much faster since the amount
of log information is considerably smaller.
Variations on EL and XEL can render a separate disk version of the database unnecessary. The most recently committed value for each object is always retained in the log.
27
This approach may significantly reduce a system's cost because it likely requires fewer
disk drives. It also simplifies the DBMS design because the CM is no longer needed.
This expedient pertains to main memory database systems as well as other systems
which hold most of their data in main memory and update them sufficiently often.
EL and XEL maintain pointers to all records in the log that are relevant for recovery
purposes. This entails significantly higher main memory requirements, compared to the
FW technique. However, the LM no longer needs to perform periodic checkpoints to
ensure that all updates prior to a particular point in time have been flushed to the disk
version of the database, as had been necessary for the FW method. Of course, the CM
should continue to flush committed updates to the disk version of the database at as fast
a rate as possible so as to reduce the amount of information that must be kept in the log.
This elimination of checkpoints is a benefit for highly concurrent systems which have
many processors and an arbitrary number of parallel log streams (as will be discussed
in Chapter 4). Checkpointing is more complicated in concurrent systems, compared to
sequential systems, so EL and XEL relieve concurrent DBMS designers from having to
design and implement efficient checkpointing algorithms that are provably correct.
The presence or absence of timestamps in the disk version of the database differentiates EL and XEL. EL assumes that each object's representation in the disk version
of the database has a timestamp kept with it. However, there are good reasons why
some databases may violate this assumption. The absence of timestamps complicates
the problem of managing log information in a manner that does not jeopardize the consistency of the database. The XEL technique presented in this chapter does not require
timestamps for objects in the disk version of the database.
Section 2.1 illustrates why EL cannot guarantee consistency after a crash for a DBMS
that does not maintain a timestamp with every object in the disk version of the database.
Once familiar with the pitfalls of EL, a reader will be better able to appreciate the rationale that underlies the complexities of XEL. Subsequent sections each address specific
problems that must be solved in order to implement XEL. To some extent, these subproblems are independent of one another. The structure of this chapter reflects the
modular nature of these problems.
2.1
Preamble: Limitation of Ephemeral Logging
Ephemeral logging (EL), as originally proposed in [35], adopts a generational garbage
collection strategy for managing a log's disk space. The LM manages the log as a chain
of FIFO queues, each of which is called a generation. The LM adds new log records
to the tail of the youngest generation. When a record that must still be kept in the
log approaches the head of a generation, the LM forwards it to the tail of the next
generation or recirculates it within the last generation. For any particular application,
28
the generations' sizes are chosen so that only a small fraction of the records in any
generation need to be forwarded or recirculated.
EL requires each object in the database to have a monotonically increasing times-
tamp. The simplest implementation that satisfies this constraint maintains an integervalued counter with every object. The LM increments an object's counter each time a
transaction updates the object. Whenever the CM flushes an updated object to the disk
version of the database, the accompanying timestamp value is stored with the object.
Likewise, each data log record (DLR) for an object holds the value and corresponding
timestamp (as well as the object and transaction identifiers) from a particular update
to the object. After a crash, the RM can determine if the disk version of the database
holds the most recently committed value for any particular object. It finds the most
recently committed DLR for the object that is still in the log and checks if this DLR
has a more recent timestamp than the version of the object currently on disk. If the
DLR is more recent, then the RM should update the object in the disk version of the
database; otherwise, it should ignore the DLR.
Now suppose that timestamps are not kept with each object stored in the version
of the database on disk. This case might arise because of a deliberate decision to
conserve storage, or it could be a constraint inherited from an existing implementation.
Without timestamps in the database, EL is not sufficient to guarantee a consistent
state after recovery from a crash. The RM no longer has a standard by which to judge
whether or not a DLR holds a more recent value than that which currently exists for the
corresponding object on disk. Accordingly, the RM can no longer deduce which records
are non-garbage and which are garbage.
The following example illustrates what can go wrong. Assume that the log has
two generations. Suppose transaction tx3 assigns object ob8 a value of 12, writes a
corresponding DLR to the log and then commits. Assume that the DLR for this update
and the transaction's COMMIT
TLR are both forwarded to generation 1 of the logl.
After moving to generation 1, these two records soon become garbage. Now suppose
that transaction tx6 subsequently assigns object ob8 a value of 9 and then commits.
Figure 2.1 summarizes this chronology of events for transactions tx3 and tx6.
DLR for
ob8-12
by tx3
I
COMMIT
for tx3
t
DLR for
ob8+-9
by tx6
T
COMMIT
for tx6
tme
Figure 2.1: Two Successive Updates to Object ob8
Suppose further that both the DLR and the COMMIT
TLR from tx6 become garbage
before they reach the head of generation 0 and are overwritten by other log records, but
1
The DLR may have been forwarded because t3 had a relatively long lifetime, for example. The
COMMIT
TLR was forwarded because not all the transaction's updates had been flushed to the disk version
of the database before it reached the head of generation 0.
29
the "stale" log records from tx3 are still lingering in generation 1, as shown in Figure 2.2.
If a crash were to occur while the system is in such a state, the RM would find tx3's
DLR to be the most recently committed update in the log when it attempts to recover
object ob8 but it would not know whether the value in this DLR is more recent than
the value currently stored for ob8 in the disk version of the database.
generation
ob8=9
disk version
A1
A
....
OI UaLaduse
:----------------legend - --------non-garbage log record
garbage log record
.......................................
[
Figure 2.2: State of the Log After a Crash
In this example, the DLR from tx3 is garbage and ought to be ignored; the disk
version of the database already holds the most recently committed value for object ob8.
It is not difficult to construct a different example in which the RM finds a (non-garbage)
DLR that is more recent than the value stored for an object in the disk version of the
database.
2.2
Conceptual Design of XEL
Extended ephemeral logging (XEL) manages a log's disk space as a chain of fixed-size
queues. Each queue is called a generation. If there are N generations, then generation
O is the youngest generation and generation N-1 is the oldest generation. New log
records are added to the tail of generation 0. A log record near the head of generation
i, for i<N-1, is forwarded to the tail of generation i+1 if it must be retained in the
log; otherwise, it is simply discarded (overwritten by more recent log information). In
the special case of generation N-1, a log record near its head that must be retained is
recirculated in it by adding the record to its tail. The disk space within each queue is
managed as a circular array [10]; the head and tail pointers rotate through the positions
of the array so that records conceptually move from tail to head but physically remain
30
in the same place on disk.
In tandem with the activity of the LM, the CM flushes (transfers) updates to the
disk version of the database so that some log records become unnecessary for recovery.
The LM no longer needs to retain these log records and so it can re-use their space on
disk.
Figure 2.3 conveys the essence of XEL for the specific case of a log stream with three
generations. Non-garbage log records are necessary for recovery and must be kept in the
log; all other log records are garbage. The "garbage pail" does not actually exist, but
is conceptually convenient to suggest that log records are "thrown away" after they are
no longer needed. The arrows at the head of each generation portray the two possible
fates for a log record near the head. If the record is garbage, it is ignored (conceptually
thrown away in the garbage pail). If it is non-garbage, then it must be retained in the
log and so it is either forwarded to the tail of the next generation or recirculated in the
last generation. A stable version of the database resides elsewhere on disk. It does not
necessarily incorporate the most recent changes to the database, but the log contains
sufficient information to restore it to the most recent consistent state if a crash were
to occur. The arrows underneath each generation illustrate the flushing activity that
occurs in parallel with logging, and indicate that the log records whose updated objects
are flushed may, in general, be anywhere in the log.
gen
new
_
log -records
]
I
\garbage
n
legend:pail
~
legend:
C] non-garbage log record
}1 garbage log record
~ disk
~~
~ ~~
~~~~
version
of database
Figure 2.3: Disk Space Management Using XEL
This segmentation of the log is particularly effective if a large proportion of transactions finish execution and have their updates flushed before their log records near the
head of generation 0. Many, if not all, of these records become garbage before the LM
must decide whether or not to forward them and so the LM does not forward them to
generation 1; their disk space can quickly be reclaimed for more incoming log records.
31
Only a small proportion of log records, mostly from transactions with longer lives, are
forwarded to subsequent generations.
Recirculation in the last generation means that the physical order of its records no
longer necessarily corresponds to the temporal order in which they were originally generated. The LM includes timestamps in data log records to enable the RM to establish
the temporal order of the records.
2.3
Types and Statuses of Log Records
The previous section described the segmentation of a log stream into a chain of fixed-size
FIFO queues and mentioned that the LM performs garbage collection at the head of
each queue. This section will explain the basis upon which the LM decides whether
or not a particular log record is garbage. There are several types of log records. For
each type of log record, a record may be in any one of several possible states at any
given time. In response to ongoing activity in the database, the state of a record may
change over time. The LM's decision about whether to retain or throw away a log record
depends on the record's state.
XEL performs physical state logging on the access path level, according to the taxonomy of [30]. In short, XEL performs redo logging with lazy logging of undo records.
It adheres to the write ahead log (WAL) protocol [5]: the disk version of the database
cannot be modified before the LM has written a log record to disk which describes the
modification.
There are two types of log records. Data log records (DLRs) chronicle changes to the
contents of the database (creation, modification or deletion of objects). Transaction (tx)
log records (TLRs) mark important milestones (e.g., begin, commit or abort) during the
lives of transactions.
Apropos TLRs, XEL logs only commit events; it does not bother to log even the
commit of a transaction that did not update any objects in the database. When a
transaction (that updated at least one object) successfully terminates, the LM adds
record holds only the
record to the log to mark the occasion. The COMMIT
a COMMIT
transaction's identifier. Previous logging and recovery methods also logged transactions'
begin and abort events. XEL can incorporate these other types of TLRs, but they
are superfluous. These anachronistic TLR types played an important role in previous
recovery algorithms but are no longer relevant for XEL's recovery algorithm.
Figure 2.4 illustrates the noteworthy events in the lifetime of a typical transaction.
The transaction begins, updates several objects and then requests to commit. Whenever
the transaction updates an object, the LM writes a DLR to the log. The transaction
32
can continue executing without needing to wait for the DLR to be written to disk. If the
transaction eventually requests to commit, the LM generates a COMMIT
record for it and
writes the record to the log. After all the transaction's log records have been written to
disk, the LM acknowledges the transaction's request to commit, thus bringing its course
of execution to a close.
tx
begins
tx updates
an object
tx updates
another
object
te
DLR
DLR COMMIT
acknowledge
commit
Figure 2.4: Typical Events During the Lifetime of a Transaction
There are two varieties of DLRs: REDO and UNDO DLRs. The LM generates a
REDO DLR whenever a transaction modifies an object in the database. Each REDO
DLR contains the following four pieces of information:
oid:
identifier for the affected object
txid:
identifier for the transaction that performed the update
timestamp:
indication of when the update occurred
new-value:
new value of the object
If the CM wants to flush an uncommitted update out to the disk version of the
database, it must first inform the LM of its intentions and obtain permission from the
LM. In response to such a request, the LM generates an UNDO DLR with the following
pieces of information:
oid:
identifier for the affected object
txid:
identifier for the transaction that performed the update
timestamp:
indication of when the update occurred
old-value:
old value of the object (prior to start of transaction)
The LM grants permission to the CM to flush the uncommitted update only after the
UNDO DLR has been written to disk. It is unnecessary for the LM to generate more
than one UNDO DLR for a particular object and a particular transaction.
The LM must write a REDO DLR to the log for every update, but it is expected that
only a very few updates, mostly from exceptionally lengthy transactions that modify a
large number of objects, will trigger UNDO DLRs as well. Therefore, UNDO DLRs
ought to be quite rare, in general.
Each log record must be in a particular state at any given time. The LM maintains
33
data structures in main memory that track of the state of all log records; a record's
state information is not kept in the log record itself. Figures 2.5, 2.6 and 2.7 graphically
TLRs,
summarize the states and transitions for REDO DLRs, UNDO DLRs and COMMIT
respectively. Subsequent paragraphs will explain these state transition diagrams in
detail.
5
1
Transition events:
1) Transaction updates same object again
2) Commit of more recent update to same object
3) Commit of more recent update to same object
4)
5)
6)
7)
Flush completed; older recoverable DLR still exists
Flush completed; no older recoverable DLR existsLast older recoverable DLR becomes non-recoverable
Transaction's COMMIT record is overwritten
Figure 2.5: State Transition Diagram for a REDO DLR
Transition events:
1) Transaction commits
2) Aborted update undone by cache manager
Figure 2.6: State Transition Diagram for an UNDO DLR
A REDO DLR may have one of four different status values: unflushed, required,
recoverable and non-recoverable. If a REDO DLR holds a more recent value for its asso34
required
recoverable
Transition event:
1) No UNDO DLRs and only recoverable REDO DLRs left
Figure 2.7: State Transition Diagram for a COMMIT
TLR
ciated object than does the disk version of the database, the REDO DLR corresponds to
the most recent modification to its associated object by the transaction that performed
the update 2 and there is no REDO DLR in the log for a more recently committed up-
date to the same object, then the DLR must have a status of unflushed. A requiredDLR
must be retained in the log in order to ensure correct recovery after a crash. A REDO
DLR with a status of recoverable can be recovered by the RM after a crash (because the
COMMIT
TLR from the corresponding transaction is also still in the log on disk), but is
not required for correct recovery. A REDO DLR whose status is non-recoverable cannot
be recovered after a crash because the COMMIT
TLR from its transaction has already
been overwritten on disk by more recent log records.
An UNDO DLR can have one of three status values: required, annulled and recoverable. An UNDO DLR initially has status required when the LM creates it. It remains
requireduntil the transaction that wrote it commits, at which time it becomes annulled;
the UNDO DLR remains annulled until it is overwritten on disk. If a transaction writes
an UNDO DLR and later aborts, the UNDO DLR retains its requiredstatus until the
CM restores the corresponding object in the disk version of the database to the value
that it held prior to the aborted transaction's update. Note that the CM need not
immediately write out the object's original value to the disk version of the database;
it can buffer the undo operation until a convenient opportunity, so as to achieve bet-
ter disk I/O. After the CM has undone the aborted update (by restoring the object's
representation in the disk version of the database to its original value), the LM changes
the status of the corresponding UNDO DLR to recoverable. A recoverable UNDO DLR
remains recoverable until it is eventually overwritten on disk.
There are two status values for a COMMIT
TLR: required and recoverable. A COMMIT
TLR has status required if at least one UNDO DLR (which may have a status of either
required or annulled) that was written by the transaction still exists or if at least one
REDO DLR that was written by the transaction has a status of unflushed or required;
otherwise (i.e., no UNDO DLRs and any remaining REDO DLRs have recoverable sta-
tus), the TLR has status recoverable.
2If a transaction modifies a particular object more than once, then the REDO DLRs for all updates
except the most recent one have status recoverable.
35
Whenever an executing transaction updates an object, it writes a REDO DLR to
the log. This DLR has unflushed status. The LM downgrades the status of any REDO
DLRs from earlier updates to the same object by the same transaction to recoverable. If
TLR has status required. Suppose that
the transaction eventually commits, its COMMIT
transaction t modifies an object x and commits. At the time that t commits, the LM
checks if there is an unflushed or required REDO DLR for object x from an update by
some earlier transaction. If such a DLR exists, the LM downgrades that DLR's status
to recoverable.
The CM may flush an updated object to the disk version of the database whenever
it chooses. In general, the CM attempts to flush all a transaction's updates after it has
committed, so that no UNDO DLRs are needed; nevertheless, the CM may occasionally
need to flush some of a transaction's updates to disk before the transaction commits. As
soon as the CM has flushed an update to some object, the LM downgrades the status
of any unflushed REDO DLR from an earlier transaction. The LM assigns a status
of required to an earlier REDO DLR if it corresponds to the most recently committed
update to the object and must be retained because of lingering recoverable DLRs from
earlier updates to the same object; otherwise, it assigns a status of only recoverable to
the earlier REDO DLR. After processing any unflushed previous DLR for the object,
the LM then processes the REDO DLR for the update that was just flushed. If this
DLR still has unflushed status at the time the flush operation completes 3 and there
exists a required or recoverable DLR from an earlier update to the object, then the LM
downgrades the DLR's status to required; otherwise, the LM assigns a status of only
recoverable to the DLR whose update was just flushed. The LM downgrades a required
REDO DLR's status to recoverable after there is no longer a required or recoverable
(REDO or UNDO) DLR from any earlier update to the corresponding object.
TLR to status recoverable as soon as
The LM downgrades a transaction's COMMIT
there is no longer any unflushed or required REDO DLR nor any UNDO DLR remaining
TLR is eventually overwritten on
from the transaction. When a transaction's COMMIT
disk, the LM changes the status to non-recoverable for any remaining REDO DLRs that
the transaction wrote. Note that this may trigger status changes for REDO DLRs and
TLRs from more recent transactions.
The following pseudocode expresses how the LM manages the states of records in
the log.
createnew record(logrecord)
{
if (type of log-record is REDO DLR) (
status of new record - unflushed
if (logrecord's transaction previously updated same object) {
status of REDO DLR from previous update - recoverable
3
A REDO DLR may have a status of unflushed at the time that the CM decides to initiate a flush
operation for it, but the DLR may be rendered recoverable by a more recently committed update before
the flush operation completes.
36
else {
status of new record -- required
}
commit transaction(txid) {
for every object updated by transaction txid {
if (unflushed or required REDO DLR remains from earlier transaction) {
status of REDO DLR from earlier transaction -- recoverable
}
if (an UNDO DLR for the object was written out to the log) {
status of UNDO DLR +- annulled
}
}
aborttransaction(txid)
{
for every object updated by transaction txid {
for every update to the object {
status of REDO DLR from the update +-- non-recoverable
}
if (uncommitted update was flushed to disk version of database) {
retrieve UNDO DLR from log
request CM to restore object in disk version of database to original value
}
}
changeredotorecov(logrecord)
{
status of logrecord +- recoverable
tid +- txid of transaction that write log-record
if (
(no more unflushed, required or annulled DLRs from txid)
AND (txid has committed)) {
status of COMMIT record from txid +- recoverable
}
abortedupdateundone(object
id, txid) {
status of UNDO DLR for update to objectid by txid +- recoverable
37
}
updatewrittento-diskversionofdb(objectid,
txid, timestamp) {
for every REDO DLR r from a previous update that has status unflushed {
(r is for most recently committed update to objectid)
if (
AND (older recoverable DLRs for objectid still exist)) {
status of r
-
required
-
recoverable
}
else {
status of r
}
if (status of DLR from flushed update is still unflushed) {
if (recoverable DLRs from previous updates are in log) {
status of REDO DLR for flushed update - required
}
else {
change-redotorecov(REDO
DLR for flushed update)
}
recorderased(logrecord)
{
case (type of logrecord) {
REDO DLR:
if (status of oldest surviving REDO DLR for the object is required) {
surviving REDO DLR for object)
changeredotorecov(oldest
}
UNDO DLR:
if (no unflushed, required or annulled DLRs from logsrecord's tx) {
status of COMMIT record for logrecord's transaction - recoverable
}
COMMIT:
for every remaining recoverable REDO DLR from logrecord's tx {
status of REDO DLR - non-recoverable
if (status of oldest surviving REDO DLR for the object is required) {
surviving REDO DLR for object)
changeredotorecov(oldest
}
38
2.4
Management of Log Records
The previous section explained the basis upon which the LM decides whether or not to
keep a log record. This section introduces the data structures which enable the LM to
make this decision. The LM keeps a pointer to every noteworthy log record. A record's
pointer indicates its location in the log and current status. The LM uses this information
when it must decide a log record's fate.
It is convenient to broadly classify log records as relevant or irrelevant. All records
that can affect the recovered state of the database are collectively referred to as relevant
log records; unflushed, required and recoverable REDO DLRs and all UNDO DLRs are
relevant log records, as are all required COMMIT
TLRs and recoverable COMMIT
TLRs for
which some corresponding recoverable REDO DLRs still remain. All other log records
(non-recoverable DLRs and every recoverable COMMIT
record for which no corresponding
REDO DLRs remain) are irrelevant. The LM must keep track of the positions of all
relevant records. It does not bother to keep track of irrelevant records.
A cell exists for every relevant record in any generation of the log. Each cell resides
in main memory and points to the record's location on disk. A record's location on disk,
as pointed to by its cell, is indicated by an identifier of the block to which it belongs;
finer granularity (e.g., position within the block) is not required by XEL. The cells
corresponding to each generation are joined in a doubly linked list that "wraps around"
in a circular manner; the cells nearest the head and tail have right and left pointers to
each other, respectively. For generation i, pointer hi points to the cell for the relevant
record nearest the head. There is no tail pointer for a generation, but the cell for the
relevant record nearest to the tail can be found quickly by following the right pointer of
the cell pointed to by hi.
The logged object table (LOT) has an entry for every object that has at least one
relevant DLR somewhere in the log. An object's LOT entry keeps track of the positions
within the log of its relevant DLRs. Cells for an object's relevant DLRs are accessible
via its LOT entry.
Likewise, the logged transaction table (LTT) has an entry for every transaction that
has updated at least one object. A transaction's LTT entry keeps track of all objects
that it updated and for which the corresponding REDO DLRs are still relevant. After
a transaction commits, its LTT entry points to the cell that corresponds to its COMMIT
TLR.
The LM continually updates the LOT and LTT to reflect the current state of the
system as transactions and log records come and go. At any given time, the cells
associated with the LOT and LTT entries point to all relevant log records. Although
cells belong to these two different tables, they may nonetheless simultaneously belong
to the same doubly linked list.
39
An example of XEL with N=3 generations is shown in Figure 2.8. To relate Figure 2.8 to Figure 2.3, note that all irrelevant records are garbage records. Some relevant
log records can affect recovery but are not required for recovery, and so they are also
garbage records; they can be thrown away with impunity when convenient.
'
:.
dlsI version
I legend:
of database
* cell
O relevant log record
irrelevant log record
Figure 2.8: Data Structures for XEL
Figure 2.8 illustrates the most important aspects of XEL's data structures. The LOT
and LTT, with their constituent cells, reside in main memory. Other internal details of
the LOT and LTT have been omitted; the circular doubly linked lists of cells are the
important aspect of the LOT and LTT in this figure.
Each cell has a status field that indicates the status of its corresponding log record.
At any given time, the LM can determine whether a record is non-garbage by checking
the status field in the record's cell. When a record must be forwarded to the tail of
generation i+1, the LM writes its contents to disk at the tail of generation i+1, updates
its cell, c, to point to its new position in the log and transfers c from the circular linked
list for generation i to the circular linked list for generation i+1. The LM updates pointer
hi to point to the cell previously to the left of c, if such a cell exists for generation i;
otherwise, it sets hi to NULL. If hi+l was NULL immediately before the record was
forwarded, then the LM updates it to point to c (and c's left and right pointers point
40
to itself). Recirculation in the last generation is handled similarly.
2.5
Buffer Management for Disk Version of the Database
The LM relies on the CM to flush updated objects to the disk version of the database so
that log records from these updates will become garbage. This section briefly discusses
how the CM ought to schedule flush operations and elaborates on how the CM interacts
with the LM.
In general, there is negligible locality of access between the updates of independent
transactions. Flushing updates in the order that they are written to the log would lead
to random disk I/O for the disk version of the database. Instead, the CM maintains a
pool of objects waiting to have their committed updates flushed and schedules writes to
disk so that it can take advantage of locality in the disk version of the database and thus
improve I/O performance. Ideally, there should usually be a significantly large number
of committed updates from which the CM can choose the next object to be flushed; too
small a "pool" of updates leads to random I/O. Flushing can proceed continuously at
as high a rate as possible.
Occasionally,the CM may need to flush uncommitted updates out to the disk version
of the database (because its buffer pool is running dangerously low on free space, for
example). In such an emergency situation, the CM must first obtain permission from
the LM, as described in section 2.3. After the LM has written the necessary UNDO
DLRs to disk and granted the CM permission to flush some uncommitted updates to
disk, the CM can schedule the writes to disk so as to exploit locality.
In the rare event that a transaction aborts after writing one or more UNDO DLRs,
the CM must undo the transaction's updates that have already been propagated to the
disk version of the database. The LM reads (from disk) every UNDO DLR that was
written by the aborted transaction. For each such UNDO DLR, the LM communicates
the oid and old value to the CM. In response, the CM restores the object (in the disk
version of the database) to the value that it had prior to the aborted transaction's update
and informs the LM after it has undone the transaction's modification to the object.
2.6
Flow Control
This section discusses how the LM regulates the flow of log records from one generation
to the next in a log stream. Each generation is a FIFO queue of fixed size. If it begins
to run out of space for new records, the LM must try to free up some space by throwing
away or forwarding (or recirculating) log records from near the head of the queue. The
41
LM can forward records from one generation only if the next generation is able to accept
them. Hence, flow control between successive generations within a log stream must be
regulated.
The LM attempts to keep at least Nfree blocks available in each generation to accept
incoming (and recirculated, in the case of the oldest generation) log records. When the
tail of generation i advances so as to violate this "low water mark", the LM attempts
to forward (or recirculate) log records from generation i. However, the LM can forward
log records to generation i+1 only if generation i+1 is able to accept them.
The LM refuses to overwrite any block that holds an unflushed or required log record.
If hi+l points to a record in the block immediately after the current tail position of
generation i+1, then generation i+1 cannot accept any forwarded log records. Similarly,
the LM refuses to accept any new log records from client transactions if space is not
available for them in generation 0.
When space eventually becomes available in generation i+1, the LM will resume
forwarding of records from generation i if there are fewer than Nf,,ree blocks between
the current tail position of generation i and the block to which hi indirectly points (hi
points to a cell, and this cell points to a block position on disk).
This flow control policy is guaranteed to be free of deadlock. The LM never needs to
keep a log record because of the lingering presence of some chronologically subsequent
log record. This property ensures that the dependency graph amongst log records is
acyclic, and therefore deadlock is impossible.
In summary, a producer-consumer protocol between adjacent generations regulates
the flow of forwarded log records. Older generations that become full exert "backpressure" on younger generations. If all generations become full, then the LM does not accept
log records from client transactions. This policy ensures that necessary log information
is never lost.
2.7
Buffering, Forwarding and Recirculation
The previous section explained how the LM regulated the flow of records into and out
of each generation of a log stream. This section elaborates on some important timing
details which govern this movement of log records. Much of the complexity arises from
the characteristics and limitations of current disk drive technology.
Two characteristics of current disk technology exert an important influence on the
implementation of XEL. Information is written to disk in fixed sized blocks (with each
block typically some multiple of 1024 bytes). Sequential disk I/O is faster than random
42
disk I/O. XEL must accommodate the constraint of fixed sized disk blocks, and ought
to take advantage of the performance benefits of sequential I/O.
The LM uses the group commit technique [15, 5]. Records are collected in a buffer
and written to disk all at once. Figure 2.9 illustrates the group commit technique. The
bottom of the buffer holds log records that have already arrived. The LM adds new
incoming log records to the buffer by putting them in the unfilled portion shown at the
top of the buffer. In Figure 2.9, the direction of growth is upward.
log stream buffer
fill up
remainder
of buffer
with new
log records
.. :
- .:
,..
.....
,,,.. .
'.-.
--
.
,
....
,' .
:..
''
.'
' " -: :-' ..
.' ':.:',:-..
1 block
records
already
in buffer
Figure 2.9: Buffering of Incoming Log Records for Batched Write to Disk
The LM should dedicate at least two buffers for log records that are to be written to
generation 0 because a disk write generally requires a significant amount of time, such
as 10 ms, during which other log records may arrive. While one buffer is being written
to disk, the LM can add new records to a different buffer without risk of interference.
The size of each buffer in the pool is exactly equal to the size of a disk block. At any
given time, there is a current buffer for generation 0. The LM adds new log records
to this buffer until it is full or a time limit runs out, at which time the LM writes it
to disk; another buffer in the pool becomes the current buffer as soon as there are no
unflushed or required records in the block to which the new current buffer will be written.
Therefore, log records are not immediately written to disk. There is a delay while the
current buffer fills, and some extra delay for the disk I/O.
Only one buffer is needed for each generation i>0 because the LM has the liberty
of scheduling the movement of log records between generations. There can be only one
outstanding write to the tail of a particular generation and the LM can quickly refill
generation i's buffer as soon as the current write operation completes, so additional
buffers would not help anyway.
The tail of generation i points to the location of a block on disk; finer granularity is
unnecessary. When a new log record comes in to generation i, the LM will attempt to
allocate it to the block indicated by the current position of the tail; if there is insufficient
room remaining in the current buffer to accommodate the record, then the LM will
attempt to advance the tail block position and allocate the record to the new tail block.
43
The head of generation i is the block to which hi (indirectly) points, if hi is not NULL.
If hi is NULL, then the head of generation i is, by default, the current tail block.
The movement of head and tail pointers in block sized quanta has implications.
When the LM decides to advance the head of generation i, it must deal with all log
records in the head block. The LM attempts to forward all unflushed and required
records in this head block to generation i+1. Suppose that the LM can forward these
records to generation i+1. It adds the records to the buffer for generation i+1. In
general, they are insufficient to completely fill the buffer, but the LM must ensure that
the forwarded records are soon written to disk in generation i+1. Therefore, it attempts
to fill the buffer as full as possible before writing it. After forwarding records from the
block at the head of generation i, the LM works backward (i.e., to the left) from the
head to gather enough other non-garbage log records to fill the buffer that is destined
for the tail of generation i+1. In summary, the requirements of generation i dictate that
records be removed from its head in quanta of size at least a block. The requirements
associated with forwarding records to the tail of generation i+1 imply that records are
usually forwarded as a group from the first several blocks at the head of generation i.
There are two details that complicate the operation of forwarding records from generation i to generation i+1. First, there is some delay between the time when the LM
decides to forward a log record to the moment when the forwarded record is actually
on disk in generation i+1. Second, two copies of a forwarded record may temporarily
exist in the log. The forwarded copy of the record resides on disk in generation i+1,
but there is also the "stale" copy left behind in generation i; this latter copy remains
in the log until it is overwritten by newer log records. Because of these details, the LM
manipulates two additional special pointers for each generation.
For each generation i, si is the scan pointer. Like hi, it points to a cell in the circular
doubly linked list for generation i or is NULL. It indicates how far the LM has scanned
to forward log records to generation i+1. If there is no forwarding operation in progress,
then si coincides with hi. When the LM examines a log record and decides to forward
it, it leaves the cell in generation i's list and advances si to the left; if this leftward
movement causes si to "wrap around" so that it comes back to hi, then the LM sets
it to NULL instead. Immediately after a buffer of forwarded records has been written
to disk in generation i+1, the LM keeps advancing hi leftward until it coincides with si
or becomes NULL; until hi "catches up" to si, the LM transfers each cell over which it
passes from generation i's circular list of cells to the tail of generation i+1's list. This
cautious management of the hi and si pointers ensures that the LM never inadvertently
overwrites a non-garbage log record in generation i before it has been forwarded to
generation i+1 (and is on disk in generation i+1).
The LM maintains a second circular doubly linked list of cells for every generation.
This other list is called the doomed list because it indicates relevant records that will
(soon) be overwritten. The pointer di points to the cell at the head of generation i's
doomed list. When the LM examines a log record's cell and decides that the record
44
is garbage, it removes the cell from the regular list (after advancing s, of course, and
possibly also hi if necessary) and adds it to the tail of the doomed list. If di was
previously NULL, it will point to the cell that has just been transferred to the doomed
list; otherwise, the transferred cell is the target of the right pointer of the cell to which
di points. Now consider the case that the LM decides the log record to which si points
is non-garbage. It creates a new cell that also points to the record and leaves the old cell
intact in the regular list for generation i (where it waits to be forwarded to generation i+1
after the buffer has been written to disk in that generation). If the log record is a DLR,
the LM adds the new cell to the LOT entry of the associated object; otherwise, it adds
the new cell to the LTT entry of the corresponding transaction. Finally, the LM inserts
this new cell into the doomed list at its tail. Whenever a block of log records is written
to disk in generation i, the LM examines di. If di (indirectly) points to the block that
has just been written, then the LM concludes that the record associated with the cell to
which di points is now gone and so it moves di leftward (or assigns di the value NULL, if
appropriate) and deletes the cell to which di had pointed. The LM continues to advance
di leftward until it becomes NULL or no longer (indirectly) points to the block that has
just been written. The di pointer ensures that the LM does not forget about any stale
copy of a relevant log record.
Recirculation is not as complicated. The LM recirculates records from only the block
at the head of the last generation and places them in a buffer without immediately
writing it to disk. The existing copies of these records will not be overwritten before the
tail has advanced, but the recirculated copies will belong to the disk block written at
the tail. There is no need for a scan pointer in the last generation; the LM immediately
transfers the cells for the recirculated records from head to tail in the circular list (in
practice, this is accomplished simply by advancing hi to the left). Similar to the case
of forwarding, the LM transfers the cells for garbage log records to the doomed list and
eventually deletes the cells after the log records have actually been overwritten. As new
log records come in to the last generation, the LM adds them to the buffer after the
recirculated records.
If the LM ever decides to delete the cell to which hi points, it must first adjust hi
accordingly. If there are other cells in generation i's linked list, then the LM advances
hi to the left; otherwise, it assigns NULL to hi. The LM behaves similarly if it deletes
the cell to which si or di points.
In summary, the LM collects records in a buffer before writing them to any generation. It attempts to fill a buffer as full as possible before writing it to disk. When
the LM decides to forward a log record, it does not transfer the record's cell from the
circular list of generation i to the list of generation i+1 until after it is certain that the
record is on disk in generation i+1. The LM keeps track of the positions of all copies of
all relevant log records until they are actually overwritten on disk (or until they become
irrelevant, if this happens first).
The following pseudocode routines succinctly state the LM's algorithms for forward45
ing and recirculating log records.
i) {
addcelltogeneration(cell,
{
if (hi==NULL)
cell
cell
hi
si
cell->left - cell
cell->right - cell
}
else
{
if (si==NULL) {
cell
si}
cell->left - hi
cell->right -hi->right
cell- >left->right - cell
cell->right->left - cell
deletecellfrom-generation(cell, i) {
if (si==cell) {
if (si->left==hi)
{
si -NULL
}
else
{
si - si->left
if (hi==cell) {
{
if (hi->left==hi)
NULL
hi
}
else {
hi - hi->left
if (di==cell) {
{
if (di->left==di)
NULL
di
else
{
di e di->left
I
46
}
if (cell->leftAcell)
{
cell->left->right
-
cell- >right
cell->right->left cell->left
}
addcelltodoomedlist(cell,
if (di==NULL)
di
i) {
{
cell
cell->left - cell
cell->right - cell
}
else {
cell->left
di
cell->right di->right
cell->left->right
cell
cell->right->left
cell
-
-
}
forwardrecords-from-generation(i)
{
while ((generation i+l can accept records) AND (siyANULL)) {
if (record pointed to by si must be kept) {
copy record pointed to via s to buffer for generation i+l
create newcell
copy contents of cell pointed to by si into newcell
addcelltodoomedist(newcell,
i)
if (si->left==hi)
{
si -- NULL
}
else {
si
si->left
}
}
else {
cell
si
deletecellfrom-generation(cell,
i)
addcelltodoomedlist(cell,
i)
47
recirculaterecordswithingeneration(i)
{
while (fewer than Nfree blocks available in generation i) (
if (record pointed to by hi must be kept) {
copy record pointed to via hi to buffer for generation i
hi-
hi->left
}
else
{
cell
hi
deletecellfrom-generation(cell,
i)
addcellto-doomedlist(cell,
i)
}
{
bufferwrittento.generation(i)
if (i>l) {
while ((hi_iNULL)
AND (hi_ilsi_l))
{
cell - hi-_
deletecellfromgeneration(cell,
addcellto-generation(cell,
i)
i-1)
}
I
while ((di:NULL)
AND (di points to block position just overwritten)) {
recorderased(di)
if (di->left==di)
{
di
NULL
else {
di -- di ->left
}
2.8
Management of the LOT and LTT
Section 2.4 introduced the LOT and LTT when it described the doubly linked lists of
cells that track the positions of all relevant log records in the generations of a log stream.
This section provides more details about the LOT and LTT and describes how the LM
manages these data structures.
48
The LOT and LTT keep track of all relevant log records. The LM updates them on
a continual basis as records enter the log and progress through it.
The LM associatively accesses each object's LOT entry by using its object identifier
(oid) as a key. A hash table implementation is therefore appropriate. The dynamic
nature of the LOT strongly suggests that chaining [10] (rather than open addressing)is
the most suitable technique for collision resolution. An object's LOT entry has one or
more cells, each of which points to the disk block of a relevant DLR for the object. The
LM manages these cells as a linked list.
Entries in the LTT are associatively accessed using transaction identifiers (tids) as
keys. Like the LOT, the LTT is implemented as a hash table with chaining for collision
resolution. Each transaction's LTT entry holds a set obj_ids of oids to keep track of
which objects were updated by the transaction; this set is initially empty and grows as
the transaction progresses and performs work.
The LM maintains a timestamp in each object's LOT entry, although no timestamp
is necessarily stored with any object in the disk version of the database. A simple
integer-valued counter suffices for the timestamp. When the LM creates a new LOT
entry for an object, it initializes the timestamp to 0. Whenever a transaction updates
the object, the LM increments the timestamp and then puts the new timestamp value
in the resulting REDO DLR. An UNDO DLR for an object holds the current value of
the timestamp in its LOT entry at the time that the UNDO DLR is created; the LM
does not bother to increment the timestamp when it creates an UNDO DLR. The LM
removes an object's LOT entry only after it has no more relevant DLRs remaining in the
log (the LM detects this situation when the set of cells associated with the LOT entry
becomes empty). Therefore, at any given time, all relevant REDO DLRs for an object
have unique timestamps and these DLRs can be placed into chronological sequence
by their timestamps. Likewise, all UNDO DLRs have timestamps that indicate their
chronological ordering. The cell for each DLR has a tstamp field that stores the value
of the timestamp contained in the DLR.
Whenever a transaction modifies an object in the database, it causes the LM to send
a REDO DLR to the log. If an entry does not already exist for the object in the LOT, the
LM creates one. The LM increments the timestamp in the object's LOT entry, formats
the DLR, adds the DLR to the current buffer for the tail of generation 0, creates a cell
to point to the DLR's position in the log, adds it to the set of cells maintained in the
object's LOT entry, inserts the cell in the doubly linked list for generation 0, creates a
new LTT entry for the transaction that performed the update if it did not already have
one and then adds the object's oid to the obj_ids set in the transaction's LTT entry.
Every transaction eventually commits or aborts. An abort is easy to handle. Because
the log will never hold a COMMIT
TLR from an aborted transaction, all the REDO DLRs
from the transaction immediately become non-recoverable; the LM disposes the cells that
pointed to these DLRs and deletes the transaction's LTT entry. However, any UNDO
49
DLRs from the transaction retain their requiredstatus after it aborts until the CM has
undone these updates to the disk version of the database, as described in Sections 2.3
and 2.5.
When a transaction (which updated at least one object) commits, the LM updates
record in the log. Then the LM
its LTT entry so that it points to the cell for its COMMIT
processes the members of obj.ids in the transaction's LTT entry. For each oid in objids,
the LM retrieves the object's LOT entry, assigns a status of recoverable to any unflushed
or required REDO DLR from an earlier committed update to the same object, assigns
a status of annulled to any UNDO DLR from the transaction that just committed and
informs the CM that the most recent update has now been committed. If the CM has
not already flushed this most recent update to disk, then the CM enqueues it to be
flushed4 .
The LTT entry for each transaction includes a counter to keep track of the number of
UNDO DLRs and unflushed or required REDO DLRs that exist for the transaction. The
LM initializes an LTT entry's counter to 0 and increments this counter every time the
transaction writes another REDO or UNDO DLR to the log. The LM also increments
this counter whenever it copies an existing UNDO DLR so that it can forward the DLR.
The LM decrements a transaction's counter each time that it downgrades the status of
one of the transaction's REDO DLRs from unflushed or required to recoverable, or each
time that it overwrites an UNDO DLR from the transaction. When the counter reaches
TLR to recoverable.
zero, the LM downgrades the transaction's COMMIT
After a transaction commits, its obj ids set can only shrink in size. Whenever the
last copy of a relevant REDO DLR is overwritten and no other REDO DLRs from
other updates to the same object by the same transaction remain, the LM removesthe
corresponding oid from the obj-ids set of the transaction that wrote the DLR.
TLR is eventually overwritten, the LM
When the last copy of a transaction's COMMIT
examines its objids set; for every object still represented in this set, the LM downgrades
the status of all corresponding REDO DLRs to non-recoverable (and deletes the cells
that pointed to these DLRs). Finally, the LM deletes the transaction's LTT entry.
If the objids set in a committed transaction's LTT entry becomes empty and no
UNDO DLRs remain from the transaction (as indicated by a counter value of zero), all
record become irrelevant. The LM disposes the cells
copies of the transaction's COMMIT
that point to them and removesthe transaction's entry from the LTT.
To summarize, every object with relevant DLRs in the log has an entry in the LOT.
An object's LOT entry keeps track of the positions within the log of its relevant DLRs.
There is an LTT entry for every transaction currently in progress that has updated
4If the CM already enqueued the object in response to an earlier update by another transaction but
it has not yet flushed the object, then the object's oid remains unchanged in the set of objects waiting
to be flushed.
50
at least one object and every committed transaction that still has relevant DLRs. A
transaction's LTT entry keeps track of all objects that it updated and the positions
within the log of copies of its COMMIT
record. The LM continually updates the LOT and
LTT to reflect the current state of the system as transactions and log records come and
go. At any given time, the cells associated with the LOT and LTT entries point to all
relevant records in the log.
2.9
Crash Recovery
After a crash has happened, the database invokes the RM to restore the disk version
of the database to a consistent state. The RM examines the records in the log and
attempts to find the most recently committed value, if any, for each object that has one
or more DLRs in the log. The RM propagates each object's most recently committed
value, if any, to the disk version of the database, thus restoring the disk version of
the database to a consistent state: it incorporates all the effects of transactions which
committed prior to the crash and none of the effects of transactions which aborted or
were interrupted.
The RM starts sequentially reading from the disk(s) where the log is stored. It does
not need to begin at the tail of generation 0 (nor of any other generation). The RM
processes the log in a single pass. This new recovery algorithm is suitable for systems
in which the log is not larger than main memory.
As each block is read from the log, the RM processes the DLRs and TLRs in the
block. The Pending Object Table (POT) keeps track of all objects during the recovery process, while the Recovered Transaction Table (RTT) serves a similar purpose for
transactions.
Each RTT entry belongs to a particular transaction.
It holds two pieces of infor-
mation about the transaction. The first is the status of the transaction. If the RM has
found a COMMIT
record for the transaction, it has a status of committed; otherwise, it
has an unknown status. The RTT entry also contains a set of oids, called pendingobjs.
The contents of this set are meaningful only if the transaction's status is still unknown.
Each member of pendingobjs indicates an object for which a REDO DLR was already
found in the log and this DLR was written by the transaction.
When the RM reads a REDO DLR from the log, it checks the POT entry for the
associated object to see if a more recently committed REDO DLR or a more recent
UNDO DLR has already been found (the timestamps within an object's REDO and
UNDO DLRs indicate their relative temporal ordering). If not, it adds the new DLR to
the POT. It also inspects the RTT to find out if the transaction that wrote the DLR is
known to have committed. If the transaction did indeed commit, then the RM marks
51
the new update as committed and deletes from the POT all earlier DLRs for the same
object; otherwise, it leaves the update with a status of pending and adds the object's
oid to the pendingobjs set in the RTT entry of the correspondingtransaction.
If the RM reads an UNDO DLR from the log, it ignores it if a more recently committed REDO DLR or a more recent UNDO DLR has already been found for the associated
object; otherwise, it adds the UNDO DLR to the object's POT entry and deletes all
DLRs that were written by earlier transactions (as indicated by their txid and timestamp
fields).
When the RM upgrades a transaction's status from unknown to committed (in reTLR), it processes each object represented in the
sponse to the discovery of a COMMIT
pendingobjs set kept with the transaction's RTT entry. For each such object, it marks
the corresponding update in the POT (if it still exists) as committed and deletes all
earlier updates to the object; it also deletes the object's oid from the pendingobjs set.
After all log records have been processed, the RM restores the disk version of the
database to the most recent consistent state that existed prior to the crash by examining
each object's POT entry and taking appropriate action. If an object's POT entry holds
an UNDO DLR and the transaction that wrote the record has a status of committed,
then the RM does nothing further for the object. However, an UNDO DLR from an
uncommitted transaction 5 prompts the RM to propagate the object's value (indicated
in the UNDO DLR) to the disk version of the database and thus undo the effect of the
unsuccessful transaction. If an object's POT entry has a REDO DLR from a committed
transaction, the RM flushes this updated value to the disk version of the database.
The RM ignores any object whose POT entry holds neither an UNDO DLR from an
uncommitted transaction nor a committed REDO DLR.
The following pseudocode expresses the RM's algorithms.
{
shouldkeepredodlr(redo-dlr)
potentry
redos
-
-
POT entry for object redodlr->oid
redodlr->timestamp
if (potentry has a committed REDO DLR with timestamp > redo-ts) (
return FALSE
}
else (
if (potentry
has an UNDO DLR with timestamp > redots) (
return FALSE
}
else {
if (potentry
5
has a REDO DLR with timestamp == redots) {
The RM concludes that any transaction
crash.
whose status is still unknown did not commit before the
52
return FALSE
}
else (
return TRUE
}
knownitohavecommitted(txid)(
if (RTT entry of txid has committed status) {
return TRUE
else (
return FALSE
recoverredodlr(redodlr) (
if (object redo_dlr->oid has no POT entry) (
create new POT entry for object redodlr->oid
}
if (should keep_redodlr(redo
dlr)) (
add redo dlr to POT entry of redodlr->oid
if (knowntohave_committed(redo_dlr->txid))
{
redodlr->txid +- tx.committed
delete all DLRs with timestamps less than redodlr->timestamp
delete any UNDO DLR with timestamp equal redodlr->timestamp
else (
add topendingobjs inrtt(redodlr- >oid, redodlr- >txid)
shouldkeepundodlr(undodlr)(
potentry +- POT entry for undo.dlr->oid
undors -- undodlr->timestamp
if (potentry has a committed REDO DLR with timestamp > undots) (
return FALSE
else {
53
if (potentry has an UNDO DLR with timestamp > undots) {
return FALSE
}
else {
return TRUE
}
recoverundodlr(undodlr)
{
has no POT entry) {
if (object undodlr->oid
for object undodlr->oid
entry
POT
new
create
}
if (shouldkeep undodlr(undo-dlr)) {
add undodlr to POT entry of undodlr->oid
delete all other DLRs with timestamps less than undodlr->timestamp
}
updatepot-aftertxcommit(oid,
txid) {
if (POT entry for oid still has a REDO DLR from transaction txid) {
redodlr +- REDO DLR for oid and txid with greatest timestamp
redodlr->txid +- txcommitted
delete all other DLRs with timestamps less than redodlr->timestamp
delete any UNDO DLR with the same timestamp as redo.dlr->timestamp
}
recovercommit(commit-record)
{
if (commit-record->txid has no RTT entry) {
create new RTT entry for commit-record->txid
}
status of transaction commitrecord->txid
-- committed
if (pendingobjs0O in RTT entry of commitrecord->txid)
for every oid in pendingobjs {
updatepot aftertxcommit(oid,
remove oid from pending-objs
}
54
{
commit record- >txid)
perform-recovery()
while (there are still some unread log records) (
log-record - read another record from the log
case (type of logrecord)
REDO DLR: recoverredodlr(log
record)
UNDO DLR: recoverundodlr(logrecord)
COMMIT:
recover commit(log record)
for every object with an entry in the POT {
if (object's POT entry has an UNDO DLR from an uncommitted tx) (
write out value from UNDO DLR to object in disk version of DB
else {
if (object's POT entry has a committed REDO DLR) (
write out value from committed REDO DLR to disk version of DB
The LM ensures that the log always contains sufficient information for the RM to
restore the database to a consistent state if a crash were to ever occur at any time.
Consider a series of updates, performed by different transactions, to a particular object.
Suppose that the most recent update to the object has been committed. If the CM has
already flushed this update to the disk version of the database, then either no recoverable
(UNDO or REDO) DLRs from prior updates to the object remain in the log or the DLR
from the most recent update is still in the log (and has a status of required) along with
the COMMIT
record for the transaction which performed this update. If the CM has not
already flushed this update, then its DLR must have a status of unflushedand is still in
the log.
Now suppose that the most recent update to a particular object was performed by
a transaction that aborted or is still in progress. Therefore, the log does not contain
any COMMIT
record from this transaction. If the log still holds a REDO DLR from this
update, then it is innocuous anyway (because the RM ultimately ignores any REDO
DLR from a transaction that did not commit). If the disk version of the database already
holds the new uncommitted value for the object, then the log must hold an UNDO DLR
which records the object's original value (i.e., the value which it had immediately prior
to the beginning of the current transaction). Since the RM finds this UNDO DLR but
it does not find a COMMIT
record for the associated transaction, it restores the object
in the disk version of the database to the value indicated in the UNDO DLR, thus
undoing the uncommitted transaction's update. If the disk version of the database still
holds the object's original value but the log holds an UNDO DLR for the uncommitted
55
update (because the RM had requested the LM for permission to flush but had not yet
performed the flush), then correct recovery is still ensured. Finally, if the log contains
no UNDO DLR for the uncommitted update, then the disk version of the database must
hold the object's original value and either the log holds a REDO DLR (with unflushed,
required or recoverable status) from the most recently committed update or the log holds
no unflushed, requiredor recoverable DLRs from prior updates to the object. Either way,
the disk version of the database will hold the original value of the object after the RM
finishes its work.
Therefore, the LM and RM together guarantee that the most recently committed
value for every object is restored to the database after a crash, and thus the consistency
of the database is maintained.
Note that the LOT and LTT data structures which played a crucial role during
normal logging operations are unnecessary for recovery. They enabled the LM to manage
the log's records so that the database could always be restored to a consistent state if a
crash were to ever occur. After a crash has actually happened, the RM must examine
the information which the LM left on disk and use it to restore the disk version of
the database to a consistent state. When the RM has finished its work, the database
resumes normal processing. The LM initializes the LOT and LTT data structures (they
are initially empty) and then begins accepting requests from client transactions.
56
Chapter 3
Correctness Proof for XEL
This chapter presents a theoretical model for a simplified version of XEL and proves
important safety and liveness properties. Effectively, the model ignores the role of the
cache manager and assumes that every update is flushed to the disk version of the
database immediately after it is committed. Nevertheless, the log manager must ensure
that the REDO DLR from the most recently committed update to each object is retained
until all prior REDO DLRs for the same object have first been rendered non-recoverable
(review Section 2.1 to understand this requirement).
A manual correctness proof for a complete implementation of XEL, as presented
in Chapter 2, would be overwhelming in terms of both size and effort; the proof itself
would be prone to human error and its length would deter most readers from bothering
to verify it. Nevertheless, this chapter does prove the correctness of a simplified version
of XEL. The proof focuses on only a single object, but it applies to all objects in the
database. Therefore, this simplified version of XEL ensures that the log always holds
sufficient information for the RM to restore the database to a consistent state after
a crash. Section 2.9 explained how the RM actually does restore the database to a
consistent state, given the information in the log. This chapter also proves that every
committed update's REDO DLR is eventually erased (a liveness property) so that its
disk space can be reused.
Although the proof considers only a simplified version of XEL, it has worth nonetheless. After someone understands this proof for simplified XEL, they can extend this understanding so that it applies to more realistic implementations of XEL. This approach
of starting reasonably simple and then gradually adding in more detail has pedagogic
value. Furthermore, the experience of proving the correctness for a simplied version of
XEL can suggest approaches for automating the many "mechanical" parts of the proof
effort so that much more sophisticated implementations of XEL can be proven automatically by computer (assuming that the program which assists in the proof process is
itself correct); little human effort would be required and so the chance of human error
57
would be significantly reduced.
This chapter uses I/O automata theory [42, 43] extensively. A reader unfamiliar
with I/O automata theory is referred to [42, 43] for an explanation of it.
The remainder of the chapter is organized as follows. Section 3.1 states what aspects
of XEL are simplified and explains the interface for this much-simplified version of the
log manager. It also suggests how to gradually embellish this simplified model so that
a more realistic version of XEL is obtained. To perform correctly, any variation of XEL
can reasonably require client transactions to behave appropriately; Section 3.2 formally
expresses these restrictions as four well-formedness properties that the log manager's
surrounding environment must always satisfy. Section 3.3 presents a model of the log
manager that is as simple as possible and proves that it is correct, as long as the
environment satisfies the well-formedness properties; this very simple model for the
log manager is referred to as SLM.Section 3.4 defines a more complex model for XEL
that more closely resembles a real implementation and then proves safety and liveness
properties for it. The I/O automata description for this implementation is presented
in Section 3.4.1. This implementation is referred to as LM.Even though LMis fairly
elaborate, it still has many simplifications and does not constitute an implementation
of the complete XEL technique as presented in Chapter 2. Section 3.4.3 postulates a
possibilities mapping f that maps each state in LMto a set of states in SLMand then proves
that f is indeed a possibilities mapping, according to the definition in [42, 43]. This
result inductively proves that LMis correct apropos safety. For any possible execution
of LM,a corresponding execution of SLMalso exists which has exactly the same external
behavior. Since the correctness (in terms of all possible external behaviors) of SLM
has already been proven, the correctness of LMfollows as a result. Finally, Section 3.4.4
states an important liveness property: every log record is eventually erased. Appendix A
provides the many lemmas and theorems which constitute the safety and liveness proofs.
3.1
Simplifications
This section describes the log manager's interface and explains how the log manager
interacts with the world around it. Subsequent sections will build upon the introductory
description which this section provides. Because this chapter considers a simplified
version of XEL for the log manager, the interface is simpler than what was described in
Chapter 2. This section explains and justifies these simplifications. It also discusses how
some of these simplifications would be relaxed if one wanted to extend the techniques
of this chapter to a more realistic version of XEL.
This chapter will prove that XEL (in its simplified manifestation) does "the right
thing" for each object. Each object is characterized by a set of possible updates. These
updates may be sequenced in any particular order; the "external world" chooses the
58
order by issuing a series of commit commands. Figure 3.1 illustrates the log manager's
interface for the simplified version of XEL that is considered in this chapter. In this
simple model, the log manager manages only a single object. The external world sends
COMMITiand ERASE,messages to the log manager, where the subscript i (which is a
member of some index set) identifies a particular update, and the log manager sends
ERASABLEi
messages to the external world. Here, the external world is everything outside
the log manager.
COMMIT
1
ERASABLE1
ERASE
1
COMMIT
ERASABLE
ERASEi
Figure 3.1: Interface of Simplified Log Manager
When the external world (specifically, some client transaction) wants to atomically
update an object, it sends a COMMITi
request to the log manager; the subscript identifies
the particular update which the external world wants to commit. The log manager keeps
some (internal) record of this update but may eventually decide to delete it. When it
decides that it no longer needs to retain a record from update i, it sends an ERASABLE,
message to the external world. In response to this message, the outside world chooses
exactly when the record from update i is actually deleted; the external world sends an
ERASEimessage to the log manager to inform it when the record from update i has
been deleted. The ERASABLEi
and ERASEimessages model the operation of a typical disk
drive. When the log manager decides to overwrite a particular log record on disk, it
submits a request to the disk drive's controller to write a block of new information to
the record's location on disk; this request corresponds to the ERASABLEi
message. After
the disk drive has actually completed the write operation, it informs the log manager;
this acknowledgement corresponds to the ERASEimessage.
The log manager must satisfy the followingimportant property. Either the log manager still retains the record from the most recently committed update to the object (i.e.,
COMMITi
has happened, no subsequent COMMITj
has occurred yet, and the log manager
has not issued an ERASABLEimessage) or the records from all previously committed
updates have already been deleted (i.e., for every COMMITh
which preceded COMMITi,
a
corresponding ERASEhhas already happened). This guarantees that no stale old update
59
can jeopardize the state of the database if a crash were to occur.
The particular values of the updates to an object are irrelevant, so this simplified
model makes no mention of them. It may help to think of update i's value being
communicated in the COMMITS
message. A more elaborate model of XEL would choose
to include a MODIFYiaction by which the external world assigns a value to the object
for update i. If the external world later decides to make this modificationpermanent, it
submits a separate COMMITi
request; alternatively, an ABORTimessage would annul the
update.
The model ignores the role of the cache manager. Effectively, it assumes that each
COMMIT,action simultaneously commits update i and flushes the object's new value
(which was assigned by update i) to the disk version of the database. The log manager
can grant permission to erase the REDO DLR from some update i as soon as the records
from all chronologically preceding updates have been erased; it need not wait for the
completion of a flush to the disk version of the database. Furthermore, UNDO DLRs do
not play a role in this simplified model because a new value is (conceptually) assigned
and committed at the same time. A more realistic model would include some additional
FLUSHiinput action to the log manager to inform it that the value which update i
assigned to the object has been flushed to the disk version of the database; as long
as update i is the most recently committed update, the log manager cannot issue an
ERASABLEi
message until FLUSHihas happened.
This model applies to only a single object. The fact that a transaction can update
an arbitrary number of other objects is irrelevant. A more realistic model would embed
the above model within a larger model that provides transactional support. In this
larger model, a transaction would send MODIFY/messages to any number of objects. If
message to each
the transaction later committed, the log manager would send a COMMITi
object which the transaction updated. Hence, the above model is a building block on
which to construct a more realistic model.
3.2
Well-formedness Properties of Environment
The external world constitutes the environment in which the log manager must operate.
By definition, the environment is constrained to respect certain conventions which govern
its relationship with the log manager. These conventions are expressed in terms of the
well-formedness properties presented in this section.
Let
denote a behavior for the log manager module, and let ri represent the ith
action of 3 (where iEJf and i_1). The behavior 3 is well-formed if and only if it satisfies
the four properties expressed below.
60
WF1: Vx: ri=COMMIT,
Bj,
] jAi, such that rj=COMMIT,.
WF2: Vx: ri=ERASE, =
3j, j<i, such that rj=ERASABLEx.
WF3: Vx: ri=ERASEx => ,Bj, joi, such that 7rj=ERASE,.
WF4: Vx: ri=ERASABLEx =
3j, j>i, such that rj=ERASE,.
Property WF1 states that the external world may commit a particular update, x, at
most once. Property WF2 states that the external world cannot perform an ERASEx
action until after the log manager has given it permission to do so via the ERASABLEx
action, whileWF3 constrains the external world to perform at most one ERASEx action
for any particular DLR x. Finally, WF4 insists that the external world must eventually
perform an ERASEx action in response to an ERASABLEX action.
These well-formedness properties which the environment must preserve can be represented in an automaton, called ENV,as shown in Figure 3.2.
COMMIT
1
ERASABLE1
ERASE 1
COMMIT.
ERASABLE/
ERASE
Figure 3.2: Automaton to Model Well-formed Environment
The names, types and initial values of the three variables which constitute the stat
of ENVare 1 :
variable
type
-
committed
can-erase
erased
initial value
-
2N
2Af
0
2Nr
0
The ENVmodule has the following transition relation:
1'A denotes the set of natural numbers {0,1,2,...).
61
For any set S, 2 denotes the powerset of
ERASABLE
Effect:
can-erase
- can-erase U {i}
COMMIT
Precondition:
Effect:
committed
Precondition:
Effect:
(iEcan-erase) A (ioerased)
canerase - can-erase - {i)
icommitted
- committed U {i}
ERASEi
erased
3.3
+-
erased U {i}
Specification of Correctness (Safety)
This section defines a very simple I/O automaton, called SLM,which embodies the safety
property required of any implementation of XEL. Namely, it ensures that the record from
the most recently committed update is not erased unless the records from all previously
committed updates have already been erased. To prove this property, this section states
a set of invariants which describe the composition of the ENVand SLMautomata in all
reachable states and then uses these invariants to prove that all behaviors of the SLM
automaton satisfy the safety property.
3.3.1
I/O Automaton Model
Figure 3.3 illustrates the I/O automaton for the very simple version of the log manager.
This automaton shall be referred to as SLM.
Three variables comprise the state of SLM.Their names, types and initial values are 2:
variable
keep
let erase
waiterase
type
Ar.
2r
2Ar
initial value
I
0
0
The SLMautomaton has the following transition relation:
2
For any set S, S denotes the lifted domain S=S U I, where
that does not belong to S.
62
is some unique bottom element
COMMIT
1
ERASABLE
ERASE 1
COMMIT
ERASABLE.
ERASEi
Figure 3.3: Specification Automaton for LM
COMMIT
Effect:
if ((let-erase=0) AND (wait.erase=0))
leterase - let-erase U {i}
else
if (keepI)
leterase
keep
-
leterase U {keep}
i
ERASEi
Effect:
wait erase
if ((keepi)
-
leterase
wait erase - {i}
AND (leterase=0)
keep- I
-
AND (wait erase=0))
leterase U {keep}
ERASABLEi
Precondition:
Effect:
iEleterase
leterase - leterase - {i}
wait-erase
3.3.2
- wait-erase U i)
Invariants for Composition of SLMand ENV
The following invariants apply to the system composed from the SLMand ENVautomata.
It is easy to verify inductively that they are true in all reachable states of the system.
Invariant 3.1
(keep=I) V (leteraseA0) V (waiterase$0)
63
Invariant 3.2
Vx, xEJ, (keephx) V (xEcommitted)
Invariant 3.3
Vx, xEAr, (xleterase)
V ((keepox) A (xEcommitted))
Invariant 3.4
Vx, zxEA,
(xowaiterase)
V ((keepox)
A (xoleterase)
A (Ecommitted))
Invariant 3.5
(Ilet-erase) A (IOwaiterase)
3.3.3
Correctness of SLMModule
Theorem 3.4, at the end of this subsection, expresses XEL's important safety property:
the log record from the most recently committed update is never erased before the
records from all earlier committed updates have already been erased. Several supporting
lemmas must first be proven. This subsection proves that the SLMautomaton, when
composed with ENV,satisfies this property. Throughout this section, a will represent an
execution of the module composed of SLMand ENV,and ri will represent the it h action
of a.
The following lemma states that after a particular update w has been committed,
the SLMautomaton keeps track of w in one of its three state variables at least until w is
erased.
Lemma 3.1
A
(r=COMMIT, )
A (Am, <m<j, s.t. 7rm=ERASE,)
Vh, l<h<j, ((keep=w) V (wEleterase) V (wEwaiterase)) in state th
Proof:
*· 7r=COMMIT,
==, ((keep=w)
V (wEleterase))
in state
tj
* (((keep=w) V (wElet-erase) V (wEwaiterase)) in state t-_l)
=
·
((keep=w) V (wElet-erase) V (wEwaiterase))
(((keep=w) V (wElet-erase)) in state t)
A (m#ERASE.)
in state tm
A (Am, I<m<j, s.t. rm=ERASE,)
--
Vh, l<h<j, ((keep=w) V (wEleterase)
and thus the lemma has been proven.
64
V (wEwait-erase)) in state th
by induction
O
The following lemma states that after a particular update x has been committed,
then at least one of the let-erase and wait-erase state variables of the SLMautomaton
must be non-empty at least until z is erased (assuming that the execution is well-formed).
(a is a well-formed execution)
Lemma 3.2
A
('ri=COMMITx)
A (3k, k>i, s.t.
j, i<j<k, s.t. 7rj=ERASEx)
Vl, i<l<k, ((leterase#O) V (waiterase0))
in state tj
Proof:
* (i=COMMITx) A (j, i<j<k, s.t. rj=ERASE,)
-- > Vl, il<k, ((keep=x) V (xEleterase) V (xEwait-erase)) in state tj
by Lemma 3.1
* V1, i<l<k, ((keep=l) V (leterasej0) V (waiterase#0)) in state tj
by Invariant 3.1
(((keep=x) V (xElet-erase) V (xEwait-erase)) in state t)
*
A (((keep=I) V (leteraseo0) V (wait_erasej0)) in state t)
==
((let_erase0)
V (waiterase40))
in state t1
* It therefore follows that
Vl, i<l<k, ((leterase$0) V (waiterase#0)) in state tj
and thus the lemma has been proven.
D
The following lemma proves that, in a well-formed execution, a particular update x
cannot be erased before it has been committed.
(a is a well-formed execution)
Lemma 3.3
A
(ri=COMMITx)
A
(rj=ERASEx)
i<j
Proof:
*· ri=COMMIT,
· rj=ERASEx -
- ,3f, fTi,
s.t. rf=COMMIT,
3h, h<j, s.t. rh=ERASABLEx
==> xElet-erase in state th-1
* 7rh=ERASABLE,
by WF1
by WF2
* xEleterase in state th-1
==- Either
(1) 3f, f<h-1, s.t.
(rf=COMMITx)
A (((leterase=0) A (waiterase=0)) in state tf-1)
65
* (rf=COMMIT,)
A (f,
ffi, s.t. rf=COMMIT,)
=- f=i
*(f=i)A (f<h-l<j) =4 i<j
or
(2) 39g, g<h-1,
s.t.
A
(rg=ERASEzfor some z)
(keep=x in state tg-1)
* keep=x in state tg-1
==
3f, f<g-1, s.t. 7rf=COMMITx
* (rf=COMMITx)
A (/f, f#i, s.t. rf=COMMITx)
=* f=i
· (f=i) A (f<g-l<j) =: i<j
Therefore, the desired result follows for both possible cases and thus the
lemma has been proven.
°
The following theorem proves that, in a well-formed execution, the most recently
committed update, x, cannot be erased before all previously committed updates have
been erased.
Theorem 3.4
(a is a well-formed execution)
A
(ri=COMMITx)
A
A
(rj=ERASE,)
(Ak, i<k<j, s.t. 7rk=COMMITy
for any y)
A
(r=COMMITw, l<i)
3m, m<j, s.t. rm=ERASE,
Proof:
By contradiction. Assume /3m, m<j, s.t. rm=ERASE,
* 7ri=COMMITx
=#
g, g7i, s.t. rg=COMMITx
* (rl=COMMITw)
A (l<i) A (g,
* (r=COMMIT,)
A (m,
=
g<i, s.t.
rg=COMMITx)
by WF1
:-wgx
m<j, s.t. ,m=ERASEw)
Vh, Il<h<j, ((let_erase-0) V (waiterase&0)) in state th
b,y Lemma 3.2
* (i=COMMIT.) A (j=ERASEx)
i<j
by Lemma 3.3
(<i) A (i<j) A (h, l<h<j, ((let_erase4 0) V (waiterasef0)) in state th)
=-- ((leterase 4 0) V (wait-erase$0)) in state ti-1
* (ri=COMMITx)A (((let_erase$0) V (wait_erase$0)) in state ti-l)
=, keep=x in state ti
* 7r=COMMIT,
=* ,An, n>l, s.t. 7rn,=COMMIT
by W'F1
w
* (keep=x in state ti) A (xyw) A (n, n>l, s.t. 7r=COMMIT,)A (<i)
==- Vp, i<p, keepyw in state tp
* rj=ERASE2, ==- 3q, q<j, s.t. q=ERASABLEx
*· q=ERASABLEx =
xEleterase
in state tql
* xElet-erase in state tq-1 == 3r, r<q-1, s.t. either
66
by W rF2
· (rr=COMMITx)
A (g,
· r=i =', ((leterase0)
gfi,
in state t-
A (waiterase=0))
(1) (rr=COMMITx) A (((leterase=0)
s.t.7rg=COMMITx) =
1
)
r=i
V (waiterase#40)) in state tr-1
But this is a contradiction, so this case cannot be true.
or
(2)
(rr=ERASEz for some z)
(((keep=x) A (leterase=0)
A
A (waiterase={z}))
* (7rr=ERASEz)
A (r<q<j) A (m,
m<j, s.t. 7rm=ERASE,) - zw
* keep=x, xEA, in state tr,_l == 3f, f<r-1,
· (rf=COMMITx) A (g,
goi,
s.t.
in state tr-i)
s.t. rf=COMMITx
g=COMMIT)
9
==' f=i
* (f=i) A (f<r-1) = i<r-1
* (7r-=COMMITw) A (m,
m<j,
s.t.
7rm--ERASEw)
Ve,I<ej,
((keep=w) V (weleterase)
V (wEwaiterase)) in state te
by Lemma 3.1
(((keep=x) A (leterase=0) A (waiterase={z})) in state tr_1)
·
A
(x5w)
A
(l<i<r-l<j)
A
(Ve, I<e<j,
((keep=w) V (wEleterase)
V (wEwaiterase))
in state t)
Z=W
But this is a contradiction, and so this case cannot be true either.
Since both possible cases must be false the original assumption must be
[
false and thus the theorem has been proven.
3.4
Implementation of Log Manager
This section describes an implementation of the log manager. It composes a set of
constituent I/O automata to yield a module with the same external action signature as
the SLMmodule. This log manager module shall be referred to as LM.To prove that LM
satisfies XEL's safety property, this section states a set of invariants that characterize
all reachable states when LMis composed with ENV,postulates a possibilities mapping f
from the composition of LMand ENVto the composition of SLMand ENVand then proves
that f is indeed a possibilities mapping. Given that SLM(when composed with ENV)
is correct and that f is a possibilities mapping from LMto SLM,it immediately follows
that LM(when composed with ENV)implements SLMand therefore satisfies XEL's safety
property. Finally, this section proves a liveness property of LM:every record is eventually
erased.
67
3.4.1
I/O Automata Model
Figure 3.4 depicts the composition of a collection of automata, all for the same object.
The LOT automaton represents the object's LOT entry and the accompanying procedures which manage it. Each DLRiautomaton represents a different possible update to
the object. Together, these automata compose the LMmodule which models the log
manager's activity for the object. The external world issues a COMMITicommand to
instruct LMto commit update i. Some time later, LMsends an ERASABLE,
message to the
external world to inform it that it is now allowed to erase the DLR from update i. In
response, the outside world will eventually send an ERASE,message back to LMto inform
it that update i's DLR has been erased.
ERASE
ERASE 2
ERASE
Figure 3.4: I/O Automata for an Object
This model ignores the problem of flow control between generations in a log stream.
Management of the producer-consumer relationship between consecutive generations is
an entirely different problem which is not considered here. The focus of this section is
the management of each object's collection of REDO DLRs according to the state transition diagram that was represented in Figure 2.5. This state transition diagram implicitly
takes into account the fact that each stream may have more than one generation, and
that there may be more than one stream. It shall be proven that XEL guarantees a consistent state when DLRs are characterized by the state transition diagram of Figure 2.5.
A more elaborate model that incorporates XEL's flow control activities would show that
XEL preserves the state transition diagram for each object's DLRs. By implication, this
more elaborate version of XEL must also guarantee a consistent state.
68
Action Signatures of Automata
The LOTautomaton has the following action signature:
in
out
COMMITi
<ASSIGNi,tsi>
CSREQDi
ACKASSIGNi
ACK_CSRECVi
<DLR_GONE,tsi>
int
none
CSRECVi
ERASABLEi
In this action signature, as well as that given below for DLR, iEA and tsiEAr, where A
denotes the set of natural numbers.
Each DLRi,iEA,
automaton has the following action signature:
.
in
out
<ASSIGN,tsi>
CSREQDi
CSRECVi
ACK_CSRECVi
ACK-ASSIGNi
int
none
<DLR_GONE,tsi>
ERASEi
After receiving a COMMITimessage from the external world, the LOT automaton
chooses a unique timestamp, tsi, for the associated DLR and sends an <ASSIGNi,tsi>
message to DLRi. The DLRi automaton receives the message and replies with an
ACKASSIGNimessage to the LOT.
The LOT automaton sends a CS_REQDimessage to DLRito instruct it to change its
status from unflushed to required. Similarly, the CS_RECVi
message informs DLRjthat it
should change its status to recoverable. DLRidoes not bother to acknowledge receipt of a
CS_REQDi
message, but it does send an ACK_CSRECVi
message to acknowledge a previous
CSRECVi message.
After DLR i has been erased, its DLRiautomaton sends a <DLR_GONE,tsi>message
to the LOTto inform it that the DLR whose timestamp was tsi no longer exists.
The subscripted messages from the LOTto DLRP(namely, <ASSIGNi,tsi>, CSREQDi
and CSRECVi) denote point-to-point communication. The fact that the <DLR_GONE,tsi>
message is not subscripted reflects the implementation of XEL, in which only the timestamp of an erased DLR is communicated to the LOT.
69
States of Automata
The following variables constitute the state of the LOTautomaton 3 :
variable
type
pendingtsassign
recvtss
2NxN
current ts
ArI
Af
currreqdts
currreqddlr
currreqd acked
BN
send cs reqd
send cs recv
nAr
2Af
pendingerasable
2f
represents a committed DLR for which no
Every member of pendingtsassign
<ASSIGNi,tsi> message has yet been generated. The recv.tss variable represents the
set of timestamps which correspond to all DLRs whose status is merely recoverable.
The LOT maintains a counter, current_ts, which it will assign as the timestamp value
for the next committed DLR. The curr-reqdts and currreqddlr variables indicate the
timestamp and identity, respectively, of the DLR which currently has status required, if
there is such a DLR. In conjunction with these two variables, the currreqdacked variable indicates if that DLR's automaton has acknowledged receipt of its timestamp. The
sendcsreqd variable represents the DLR to which a CS.REQDimessage should be sent,
if any such DLR exists. Likewise, sendcsrecv identifies all DLRs to which CSRECVi
messages should be sent. Finally, the pendingerasable set indicates all DLRs for which
the LOTcan issue an ERASABLE,
message.
The following variables constitute the state of each DLRiautomaton:
variable
statusi
type
{UNFL,REQD,RECV,NONR}
B
pendingacki
AiV
timestampi
The status/ variable of DLRiindicates the current status of the DLR and may have
one of the four values listed in the table above. If pendingacki is true, then DLRiowes
a message of response to the LOT; the value of statusi determines the particular type
of the message. The timestampi variable represents a DLR's unique timestamp, and
receives its value in response to an <ASSIGNi,tsi> message from the LOT.
The variables which comprise the state of each of the constituent automata of the
LMand the relationships amongst these automata are depicted in Figure 3.5.
3B denotes the set of boolean values {T,F}. For any sets S and T, S x T denotes the set which is
their cartesian product.
70
C£
ERASABLE,
Figure 3.5: States and Action Signatures of Automata in LM Module
Initial State of System
The LOTautomaton has a unique initial state. The initial values of the LOT's variables
are:
variable
initial value
pendingts.assign
0
recvtss
0
0
current ts
I
curr-reqdAts
curr-reqddlr
curr-reqdacked
sendcs-reqd
send cs recv
pending-erasable
71
I
F
I
0
0
Each DLRiautomaton also has a unique initial state in which its variables have the
following values:
variable
initial value
UNFL
statusi
pendingacki
timestamp i
F
I
Transition Relations of Automata
The steps for the LOTautomaton's input actions are as follows:
COMMITi
Effect:
if (currreqdtsyI)
recvtss 4- recvAss U {curr-reqd ts}
if (currreqdacked=T)
sendcs.recv - sendcs-recv U {currreqddlr}
currreqdts
-
currentts
currreqddlr - i
pendingts.assign
pendingts.assign U {<i,current-ts>}
+-
currreqdacked - F
currentts 4- currentts + 1
ACKASSIGNi
if (i=currreqddlr)
Effect:
currreqd.acked - T
if (recvtss$0)
sendcsreqd
4
i
else
recvtss
4-
recvtss U {currreqd ts}
sendcsrecv - sendcsrecv U {i}
currreqddlr 4-
currreqdts else
sendcsrecv 4- sendcs recv U {i}
ACK_CSRECVi
Effect:
pendingerasable
+- pendingerasable
U {i}
<DLR_GONE,tsi>
Effect:
if ((recv tss={tsi}) AND (currreqddlry_)
AND (curr-reqdacked=T))
recvtss 4- recvtss U {currreqdts}
sendcs-recv - send-cs-recv U {curr-reqd-dlr}
curr-reqd.dlr - 1
currreqdts
-I
recvtss 4- recvtss - {tsi}
72
The steps for the LOTautomaton's output actions are specified below:
ERASABLEi
Precondition:
iEpending-erasable
Effect:
pendingerasable
-
pendingerasable -
i}
<ASSIGNi,tsi>
Precondition:
Effect:
<i,tsi>Ependingtsassign
pendingtsassign -- pendingts.assign -
CSREQDi
Precondition: i=sendcsreqd
Effect:
send-cs-reqd 1-
CSRECVi
Precondition: iEsend cs recv
Effect:
sendcs-recv - sendcsrecv - i}
The steps for the DLR,automaton's input actions are:
<ASSIGNi,tsi>
Effect: timestamp/ + tsi
pendingacki -- T
CSREQDi
Effect:
if (statusi=UNFL)
statusi +- REQD
CSRECVi
Effect:
statusi - RECV
pendingacki - T
ERASEi
Effect: statusi - NONR
pendingacki - T
The steps for the DLR/automaton's output actions are:
ACKASSIGNi
Precondition: statusi=UNFL
pendingacki=T
Effect:
pendingacki - F
ACK_CSRECVi
Precondition: statusi=RECV
pendingacki=T
Effect:
pendingacki
F
73
<i,tsi>}
<DLR_GONE,tsi>
Precondition: statusi=NONR
pending-acki=T
timestampi=tsi
Effect:
pendingack i - F
3.4.2
Invariants for Composition of LMand ENV
The following invariants will assist in the proof of LM'scorrectness. They apply to
system composed of the LMand ENVmodules. It is straightforward to verify that
invariants are true in the initial state of the system. Likewise, the definitions for
automata's actions ensure that the invariants remain true in all reachable states of
system.
the
the
the
the
For convenience, define a predicate recvbl(x), xEVA,which characterizes a particular
update, x, as recoverable or not, in a particular state of LM:
recvbl(x) -
(
(<x,u>Ependingtsassign
V
((timestampZI)
for some u)
A (statusxZNONR)))
Similarly, it is notationally convenient to define three other predicates. These predicates apply to a state of the LMautomaton, but they have an obvious correspondence
to the variables which comprise the state of the SLMautomaton. The lm_ prefix is a
reminder of the fact that they apply to the LMautomaton. These predicates will play
an important role in defining and proving a possibilities mapping from the states of LM
to the states of SLM.
lmkeep(x)
lmlet(x)
(currreqddlr=x,
xJ>A) A (3y, yox, s.t. recvbl(y))
((curr-reqd-dlr=x,
V (
xEAf) A (y,
y5x, s.t. recvbl(y)))
(currreqd_dlrzx)
A
(recvbl(x))
A (
(statusxARECV)
V
(pendingackxZF)
V (xEpendingerasable)) )
lmwait(x)-
(statusx=RECV) A (pendingackx=F) A (x pendingerasable))
The following invariants for the composition of LMand ENVare expressed in terms of
the predicates defined above.
Invariant 3.6
Vx, xEJA, (statusx=UNFL)
V (status.=REQD)
74
V (currreqddlrox)
Invariant 3.7
(timestamp.EAf)
(statusx=UNFL)
V (
Vx, xEKA,
A
(x5sendxcs-reqd)
(pendingack.=F)
(x pendingerasable))
A
(xzsendcsrecv)
A
A
(x can-erase)
A
Invariant 3.8
(currreqddlr4x)
Vx, xEA,
V
(c urrreqdts=v)
(3v, vEJA, s.t.
A
(timestampx=v)
(
V
(<x,v>Ependingts-assign)))
Invariant 3.9
Vx, xENA,
(,/v s.t. <x,v>Ependingtsassign)
V
(3v, vENA, s.t.
(
(<x,v>Epending-ts-assign)
A
A
A
(Au, u7v, s.t. <x,u>pendingtsassign))
(timestamp=l)
(statusx=UNFL))
Invariant 3.10
Vx, xEA,
(xEcommitted)
(currreqddlrzx)
A (,]u s.t. <x,u>Ependingtsassign)
(x:sendcsreqd)
A
(x sendcsrecv)
A
(xopendingerasable)
A
(xfcan-erase)
A
(pendingackr=F)
A
(timestamp=lI))
V (
A
Invariant 3.11
Vx, xEAN, (curr-reqddlr5x)
V ((xzsendcs-recv)
Invariant 3.12
'Vx,xEA, (xopendingerasable)
A (xfcanerase))
V ((pending-ack,=F)
A (xcanerase))
Invariant 3.13
Vx, xEJ,
(xocan-erase) V (pendingackx=F)
Invariant 3.14
Vx, xEA, (statusx=RECV)
V ((xopendingerasable)
75
A (xocanerase))
Invariant 3.15
Vx, xEAf, (xfsendcs-recv)
Invariant 3.16
Vx, xEAf,
V (status,=UNFL)
V (statusx=REQD)
((,v s.t. <x,v>Epending.tsassign) A (timestamp=l±))
V
(3v,
vENJ, s.t.
((<x,v>Ependingtsassign) V (timestampr=v))
A (Vy,yx,
( (Bu s.t. <y,u>Ependingts-assign)
A
V
(timestampy=l))
(3u, uA,
(
A
A
s.t.
(<y,u>Ependingts-assign)
V (timestampy=u))
(v)))
(v<currentts))
Invariant 3.17
Vx,xEf,
(currreqddlr=x)
V
(-recvbl(x))
V
(3v, vEJA, s.t.
((<x,v>ependingtsassign)
A
3.4.3
V (timestamp,=v))
(vErecvtss))
Proof of Safety for LM
Let s be a state in the LMmodule (i.e., the implementation), t be a state in the SLM
module (i.e., the specification), and let f denote a possibilities mapping from the states
of LMto the states of SLM.This possibilities mapping f is defined as follows.
Definition 3.1
Vx, xE.,
((lmkeep(x) in state s)
V ((lmlet(x) in state s)
V ((lmwait(x) in state s)
V
(
A (keep=x in state t))
A (xEleterase in state t))
A (xEwaiterase in state t))
(-recvbl(x) in state s)
A
(((keep5x) A (x leterase) A (xowaiterase)) in state t) )
tEf(s)
Refer to Section A.1 in Appendix A for all the lemmas and theorems which prove
that f is a possibilities mapping from LMto SLM.
76
3.4.4
Proof of Liveness
Theorem A.69, which is found in Section A.2, states an important liveness property for
LM:the DLR for every committed update is eventually erased. This property is formally
expressed as:
(a is a well-formed and fair execution)
A
(i--COMMIT,)
3j, j>i, s.t. rj=ERASE.
represents
where acrepresents an execution of the module composed of LMand ENV,and 7ri
the ith action of a.
Refer to Section A.2 for the proof of this theorem and the many lemmas that support
it.
77
Chapter 4
Parallel Logging
4.1
Parallel XEL
The XEL algorithm presented in Chapter 2 can be applied in a parallel system in which
there are multiple log streams, each of which accepts incoming log records and operates
independently of other log streams. Collectively, the log streams provide the bandwidth
required for an application's log information. The name for this practice is parallel XEL.
Each log stream accepts incoming log records from client transactions and manages
them according to the XEL algorithm. The streams operate independently of one another, except for dependencies introduced by the status values for log records. The LOT
and LTT tables are distributed across numerous processors in the parallel system so that
they can each provide the necessary throughput for operations on them.
Each log stream is segmented into the same number of generations. Generation i is
the same size for each stream, but different generations within each stream may still be
of different sizes. The positions of the head and tail for generation i in one stream are
completely independent of the head and tail positions for another stream's generation i.
That is, the LM performs forwarding and recirculation at each log stream independently
of such activity at other streams.
The abstraction of multiple log streams that operate independently of one another
is well suited to a system which requires an arbitrarily large number of disk drives to
provide the necessary bandwidth for log information. The LM can dedicate only one or
some small fixed number of disk drives to each particular stream.
Figure 4.1 illustrates the situation for a LM that manages four log streams, each composed of two generations. Within each stream, non-garbage log records are forwarded
78
or recirculated and garbage records are thrown away (no garbage pails are shown in this
figure in order to reduce visual clutter). Updated objects whose DLRs are anywhere
in any of the log streams may have their new values flushed to the disk version of the
database.
gm
new
log
records
I
----- N.-
new
log
records
new
log
records
I
new
log
records
I
Figure 4.1: Four Parallel Log Streams
When the LM must examine or modify an object's LOT entry, it first hashes the
object's oid to a processor identifier within the concurrent system and then it hashes
the oid to a particular address within the processor's memory space. The number of
processors over which the LOT is distributed must be sufficient to satisfy the throughput
requirements of client transactions. Similarly, the LTT is implemented as a distributed
hash table with a two-step translation procedure.
Each log stream has one particular processor that is responsible for managing its
records1 . The cells for a stream's relevant log records all reside in the memory space of
the stream's processor. In general, the LM may send an object's DLRs to different log
streams, and so the object's LOT entry cannot hold direct pointers to the cells for all
1This is not necessarily a one-to-one mapping. A processor may manage more than one log stream
if it has sufficient processing power, memory capacity and communication bandwidth.
79
the object's relevant DLRs (as was the case for only a single log stream in Chapter 2).
Rather, an object's LOT entry keeps track of the streams at which relevant DLRs exist.
A local LOT at the processor of each of these streams, which is also associatively accessed
via oid, holds pointers to the cells for each object's DLRs at the stream. Similarly, a
TLR, if any, was sent;
transaction's LTT entry indicates the stream to which its COMMIT
to
the
cells for all copies of
LTT
at
that
stream
points
the transaction's entry in a local
record within the stream.
the COMMIT
As a special case, the LM may insist that all an object's DLRs go to the same
stream, and the identity of this stream is determined by a function whose domain is the
set of all oids. In this case, the LM can place the object's LOT entry at the processor
that manages the stream to which its DLRs are sent so that indirection via a local
LOT is unnecessary. This placement of LOT entries at processors that are responsible
for managing log streams assumes that each processor has ample resources to support
both purposes. Similarly, a transaction's LTT entry can be placed at the processor that
record will be sent so that indirection via the
manages the stream to which its COMMIT
local LTT is eliminated (this assumes that each transaction is statically mapped to some
record).
stream which will receive its COMMIT
TLR to the log (i.e., adds
When only a single log stream exists, the LM adds a COMMIT
it to the buffer in main memory that currently holds records at the tail of generation
0) as soon as a transaction requests to commit. In the more general case of more
than one log stream, the LM waits until all a transaction's DLRs are on disk before it
TLR for it; this delay is expected to be less than 100 ms. Therefore,
generates a COMMIT
record marks both its intention and its eligibility to successfully
a transaction's COMMIT
terminate.
record is acThe synchronization between a transaction's DLRs and its COMMIT
complished as follows. For each buffer of each log stream, the LM keeps a list of the
transaction identifiers for all transactions which wrote log records to the buffer. For each
transaction, the LM keeps a list which identifies the buffers to which the transaction
has written log records; this list is stored in the transaction's LTT entry. Immediately
after a buffer of log records has been written to disk, the LM examines the buffer's list
of transactions. For each transaction in the list, the LM removes the buffer identifier
from the list kept in the transaction's LTT entry. If this list becomes empty and the
record for the transtransaction is waiting to commit, then the LM generates a COMMIT
action; otherwise, the LM does nothing further and leaves the transaction waiting for
the rest of the buffers on which it depends to be written to disk.
Crash recovery is almost the same as for the special case of only a single log stream,
except that the POT and RTT data structures are distributed across processing nodes
in a parallel machine so that they provide the necessary throughput. Like the LOT
and LTT, they are implemented as distributed hash tables with a two step translation
procedure. The RM's work at each log stream proceeds completely independently of
recovery activity at other log streams. The RM sends each DLR to the processor that
80
manages the POT entry for the object indicated in the DLR. Likewise, after retrieving a transaction's COMMIT
record from disk, the RM sends it to the processor that is
responsible for its RTT entry.
Many of the messages used in parallel XEL are quite short, typically only 10 to 20
Bytes in length. Low overhead interprocessor communication is therefore particularly
important. For best performance, parallel XEL should be implemented on a fine-grain
concurrent computer that provides low overhead, low latency communication primitives.
The MIT J-Machine [11, 12, 48] is an existing example of such a machine. XEL will
perform satisfactorily on other concurrent systems in which the overhead for interprocessor communication and synchronization is higher as long as the added delays are still
relatively short compared to the delays for writing blocks to disk, the interconnection
network provides sufficient bandwidth and CPU cycles are plentiful.
4.2
Three Different Distribution Policies
When a client transaction submits a log record to the LM, the LM must choose the
stream(s) to which it will assign the record. The LM's distribution policy governs its
choice of log streams for records. In general, copies of a log record may be sent to any
number of streams. All the policies examined in this thesis send a log record to only
one log stream.
This section proposes three distribution policies: partitioned, random and cyclic.
These policies are all oblivious policies: they do not use information about current load2
imbalances to help choose the stream to which to send a log record. More elaborate
adaptive policies, which monitor load imbalances between streams and attempt to send
records to streams so as to counteract current imbalances, are beyond the scope of this
thesis.
The analyses in the following subsections consider only the bandwidth required for
incoming log records to generation 0 of each log stream. The bandwidth for forwarded
and recirculated records within each stream is ignored because it defies accurate analytical modelling. In practice, the bandwidth required for forwarded and recirculated
records ought to be relatively small compared to that required for incoming new records.
4.2.1 Partitioned Distribution
The partitioned policy assigns each object to a particular log stream. A function whose
only argument is an oid defines this mapping. For each object, the LM directs all its
2In these discussions, a log stream's load is defined to be the bandwidth demanded of it.
81
DLRs to the stream prescribed by this mapping. Similarly, another mapping (via some
other function) from tid to stream number determines the stream to which the LM sends
each transaction's COMMIT
record.
The partitioned distribution is susceptible to "static skew" effects. Even if all objects
are updated with the same frequency, some streams may be assigned more objects than
others, and so they must provide more bandwidth for log information. By definition, the
entire LM fails when it must refuse to accept a log record from a client transaction. The
failure of only one log stream condemns the entire system, according to this definition. A
log stream will fail when the bandwidth demanded of it exceeds its available bandwidth.
Therefore, the maximum demanded bandwidth, over all log streams, is an important
metric when evaluating a parallel logging system.
A quantitative analysis for a simple case can provide some insight into the static
skeweffect that threatens the partitioned distribution policy. Suppose that there are N
objects and 2 log streams; denote the log streams as A and B. Each of the N objects
is randomly assigned to a particular log stream; the probability that it is assigned to
stream A is PA=0.5 (and hence, PB=0.5 is the probability that it is assigned to stream
B instead) and is independent of the assignments for the other objects. Let M be
a random variable 3 that denotes the maximum number of objects assigned to either
stream. The probability distribution function for M is
0
if m< N
NI
Prob[M = m] =
if N is even and m =
2)1
2w
2 (m
1
if m>
N
It follows that the expected value for M is given by
N
E[M] =
mProb[M=m]
E
m=r-N
/2
1
1
2NN
(N
rI
3
+
L,2Z=1+Nm(m) }
if N is even
if N is odd
%rnl(m)
All random variables are typeset in boldface to emphasize their nature.
82
Define E[M]/N
to be the load imbalance; it represents the expected fraction of
the objects assigned to the stream with the most objects. Ideally, E[M]/N remains
constant at 0.5 (i.e., each stream gets half the objects) for all N. Figure 4.2 plots
E[M]/N as N increases. For small values of N, the imbalance between the two streams
is quite pronounced. For example, E[M]/N=0.6 for N=16. That is, 60% of the objects
go to one stream and the other 40% go to the other stream, on average; the first
stream's bandwidth is 50% higher than that of the second stream. However, a significant
imbalance remains even for fairly large N. For N=128, E[M]/N=0.535193 and so the
busier stream's bandwidth is 15%higher than that of the other stream, on average. The
total number of objects in a database may be quite high (several million, say) but a
relatively small number of "hot" objects may receive a disproportionately large number
of updates. If these hot objects are not evenly distributed across all the log streams,
the static skew effect may lead to significant load imbalances. The smaller the set of
hot objects and the higher their "temperature", the worse the threat of load imbalances
becomes.
1.1
·~
1.0
8 0.9
Ce
no 0.8
E
*o 0.7
0.6
0.5
0.4
0.3
0.2
0.1
fin
0
20
40
60
80
100
120
140
Numberof objects, N
Figure 4.2: Load Imbalance vs. Number of Objects
Now consider what happens when the number of log streams increases in proportion
to the number of objects. This is a reasonable exercise because each object may be
characterized by a particular bandwidth (which depends on how often it is updated). If
every object has the same characteristic bandwidth and this parameter remains constant,
then a database's total demanded bandwidth is proportional to the number of objects
in the database. To satisfy this demanded bandwidth, the LM must provide a number
of log streams that is at least proportional to the number of objects.
Let there be N objects and S log streams.
Each object is randomly assigned to
one stream; the probability that the object is assigned to stream i is 1, for <i<S.
Let Peq(m,N,S) denote the probability that the maximum number of objects assigned
to any stream is exactly m and Pg(m,N,S) denote the probability that the maximum
83
number of objects assigned to any stream is at least m.
Pge(mr,N2 ,S)>Pge(mNl,S)
*,N1
-<m<N -(N 2 >N1 ) A (([s]
2)
Lemma 4.1
Proof:
Partition the collection of N 2 objects into two disjoint sets of sizes N1 and NR-N 2 -N1.
To assign the N 2 objects to the S streams, first assign the N1 objects of the first set to
streams and then assign the remaining NR objects to streams. After assigning the first
N 1 objects, let the random variable Bi denote the number of objects assigned to stream
denote the set of streams that have a maximum number of objects
i, for 1<i<S, and
assigned to them. That is, VjE,
k, l<k<S, such that Bk>Bj. The probability that
any one of the remaining NR objects is assigned to one of the streams in IF is at least
s (this minimum occurs for the case of NR=1 and I1'I=1). Therefore,
Pge(m,N2,S) > Peq(m-1,N1,S)() + Pge(m,N,S)
and
Peq(m-l,N 1,S)>O
for [N 1]<m-l<N
rPeq(m-l,N 1, S)>O
for FN]<m<Nl+l
1
so
and hence
Pge(mN2,S)>Pge(m,NS)
for [N 1<m<Nl+1.
For larger values of m,
Pge(m,Nl,S)=O and Pge(m,N2,S)>O
for Nl+l<m<N
2
and so
Pge(m,N2,S)>Pe(m,N1,S >)
for Ni+l<m<N2.
Therefore, the general result follows:
Pge(m,N2 ,S)>Pge(m,N,S)
Theorem 4.2
for [N 1<m<N2
0
Vm, [N1<m<2N, Pge(m,2N,2S)>Pge(m,N,S)
Proof:
Divide the set of 2S log streams into two equal (and disjoint) sets, each of size S. Assign
the 2N objects to streams in two steps. In step 1, randomly assign each object to one of
the two sets. In step 2, assign each object to a particular stream within its set. In step 1,
the two sets are equally likely to be chosen when assigning objects to sets; similarly, the
streams within a set all have the same probability of being chosen in step 2. Therefore,
this two step procedure implies that the probability that a particular object is assigned
to a particular stream is the same for all streams, as required by the statement of the
problem. After step 1, there are two possible outcomes:
(a) Both sets have been assigned exactly N objects each. Denote this outcome
by A.
or
84
(b) One set has been assigned more than N objects. Denote this outcome by B,
and let the random variable MB represent the number of objects assigned to
the set with more objects. It must be true that MB>N.
The probability of outcome A is
PA= (
(1)2N
and the probability of outcome B is
PB = 1-PA = 1( N )
Therefore,
Pge(m,2N,2S)-N+ )
2
(1-
,
()
Pge(m
,MB,S)
for [N 1 <m<N.
where aPge(m,N,S)
By Lemma 4.1, Pge(m,MB,S)>a for VN <m<MB because MB>N.
= a[1 +
>a
(
a)
a)]
-
T herefore,
-(1
becauseO<a<l
This completes the proof that
Pge(m,2N,2S) > Pge(m,N,S)
for
[rl <m<N.
Turning now to larger values of m,
Pge(m,2N,2S)>O but Pge(m,N,S)=O0 for N<m<2N
and so it followsthat
Pge(m,2N,2S) > Pge(m,N,S)
foir N<m<2N.
Combining these results yields the final conclusion:
Pge(m,2N,2S) > Pge(m,N,S)
for
[N1
<m<2N.
This completes the proof of the theorem.
0
Theorem 4.2 implies that the threat of load imbalances becomes increasingly severe
as the number of log streams increases in proportion to the number of objects. Hence, a
LM must increase the number of log streams superlinearly if it is to maintain the same
probability of failure (i.e., overload) as the number of objects in the database increases
(assuming that the average rate at which each object is updated remains constant).
4.2.2
Random Distribution
The random distribution policy randomly chooses the stream to which each log record
is sent; all log streams are equally likely to be chosen. Static skew is not a problem
with the random distribution policy because it does not assign log records to streams
on the basis of oids or tids; log records for different updates to the same object may go
85
to different streams.
Nevertheless, different streams may receive different numbers of log records simply
as a consequence of the LM's random decisions. Imbalances between log streams are
now attributed to "dynamic skew" because they arise from decisions that the LM makes
during operation rather than from a static assignment of objects and transactions to log
streams.
Suppose that L log records are to be distributed to S log streams. The probability
and
that the LM chooses to send a particular log record to stream i, for l<i<S, is
Ri
denote
random
variable
Let
the
log
records.
is independent of its choices for other
the number of log records that the LM sends to stream i, for 1<i<S; Ri has a binomial
1) . Define another
distribution [6] with a mean E[Ri]=L and a variance V[Ri]=-l
random variable EOiR to be the fraction of log records sent to log stream i. Since
and its variance is
Oi is a linearly scaled version of Ri, its mean is E[Ei] =E[=
v[Oi]=VL
=S2L-. By the Chebyshev Inequality 4 , Prob[IEilim Prob[10i-
L
oo
-
S
½I > e] < S
and so
> ] = 0.
Under the assumption that all log records are the same size, this result proves that load
imbalances between log streams are expected to diminish as the number of log records
increases. Therefore, dynamic skew is not expected to be a serious problem in a system
that operates on a continuous basis for long durations (so that many log records are
written).
4.2.3
Cyclic Distribution
The cyclic policy assigns log records to streams in a round robin manner: the LM assigns
a total order to the log streams and directs successive records to successive streams in
the order. The cyclic distribution policy does not suffer from either static or dynamic
skew effects. If all log records have the same size, it guarantees an optimal load balance
amongst the log streams.
Again, assume that L log records are to be distributed across S streams. According
to the cyclic distribution policy, the LM sends log record j to stream 1+(j mod S),
for j>O, and so stream i receives li=[rL records if i<(L mod S) and li=LL] records
otherwise, for 1<i<S. Define i- k to be the fraction of log records sent to stream i
Note that OS<Oi<l1, for all i, 1<i<S. Under the assumption that all log records are
the same size, i is proportional to the load on stream i and so the load of the most
4The Chebyshev Inequality [6] states that for any random variable, X, which is characterized by an
expectation E[X]=p and a variance V[X]=a2, the following relation is true: Prob[lX where e is any arbitrarily chosen positive constant.
86
1 > e] <
-
heavily loaded stream cannot exceed that of the most lightly loaded stream by more
than a factor of Os -- Is <
- (1+ [LJ)/
S [LLbut this quantity monotonically approaches
1.0 as L increases. Therefore, load imbalances amongst the streams are expected to be
negligible after a system has been running for a while and many log records have been
generated.
In practice, not all log records will be the same size so the cyclic policy no longer
guarantees an optimal load balance amongst the set of log streams. Nevertheless, the
cyclic policy is expected to yield reasonably good load balancing behavior for most
applications if a system is allowed to run for a sufficiently long amount of time. The
cyclic policy will serve as a touchstone against which to judge the other distribution
policies.
The cyclic policy poses implementation problems as a system's degree of concurrency
increases. The most straightforward implementation employs a single variable that
keeps track of the stream to which the most recent log record was written. This single
variable may become a serial bottleneck at some point as the number of processing
nodes increases, or it may introduce significant complexity (such as a combining tree
[19] implementation, for example). A simpler approach is to divide processing nodes
into disjoint sets and perform cyclic distribution amongst the members of each set; that
is, each set adheres to its own cyclic distribution discipline independently of the other
sets. In this latter approach, each set has a separate variable that identifies the stream
to which the most recent log record was written by any processing node in the set. The
set size is restricted to some manageable limit, and the number of sets increases as a
system's degree of concurrency increases. The superimposed loads from the different
sets will still yield even load balancing amongst the log streams. Buffering may still
be a problem, though. If the separate sets inadvertently "synchronize" so that they all
send their records to the same stream at approximately the same time, one stream may
receive a flood of records while other streams are relatively idle.
87
Chapter 5
Management of Precommitted
Transactions
5.1
The Problem
The problem of managing precommitted transactions and the transactions which depend
on them becomes much more complicated in a highly concurrent database that has a
collection of parallel log streams. The following example illustrates the crux of the
problem.
Suppose a transaction txl acquires a write lock on some object Obj5 in a database,
updates the object and then requests to commit. Assume that the REDO DLR from
txl's update to Obj5 is already on disk when txl requests to commit. In response to
record for txl and adds the record to some log
txl's request, the LM generates a COMMIT
stream's current buffer in main memory. Recall that the LM does not write the buffer
to disk right away. Rather, it waits for the arrival of enough log records from other
transactions to fill up the buffer (or for a time limit to expire) and then writes the buffer
to disk.
Now suppose that some other transaction, tx2, reads the updated value of Obj5
record has been written to disk. Thus tx2 becomes dependent on
before txl's COMMIT
record for tx2 and
txl. When tx2 later wants to commit, the LM must create a COMMIT
record for tx2 goes to the same
add it to some log stream's current buffer. If the COMMIT
log record did, then there are no problems. Either these two
log stream as tl's COMMIT
log records belong to the same block of records, or tx2's record is subsequently written
to disk in another block. Either way, it is impossible for the RM to find tx2's COMMIT
record on disk, but not that of txl.
88
Now suppose that tx2's COMMIT
log record is directed to a different stream than txl's
record. It is possible that the buffer holding t2's COMMIT
log record will fill up before
the buffer to which tl's COMMIT
record belongs. Let bufl and buf2 denote the buffers
that hold the COMMIT
records from txl and t2, respectively. If buf2 is written to disk
but then a crash occurs before bufl can be written, the RM is faced with a problem.
It finds a COMMIT
record for t2 but not for tzl. How is it to know that tx2 depended
on tl, so that its changes must be undone, despite the fact that its COMMIT
record was
written to disk?
The problem to be addressed concernsthe management of precommitted transactions
in a highly concurrent system which has many parallel log streams. The LM must ensure
that the log on disk always contains sufficient information so that the RM can restore
the database to a consistent state after a crash. If a crash occurs before a precommitted
transaction commits, then the effects of this transaction and all transactions which
depend on it must not be present in the restored database.
5.2
Shortcomings of Previous Approaches
Other researchers [15] have previously suggested that the LM ensure that buffers are
written to disk in an order that will never jeopardize the consistency of the database.
Transactions' COMMIT
log records do not contain any explicit information about dependencies amongst transactions, so the LM must not allow the COMMIT
log record of a
transaction to be written to disk before any of the COMMIT
log records for earlier transactions on which it depends.
Consider the application of this approach to the situation described in the previous
section. The COMMIT
record for transaction tl is waiting in buffer bufl and the COMMIT
record for t2, which depends on tl, is waiting in buffer buf2 at a different log stream.
The log manager must enforce a topological ordering amongst these buffers so that buf2
is written to disk after bufl.
This approach becomes awkward as more complex situations arise. For example,
suppose that another transaction tx3 becomes dependent on t2 and requests to commit
before either bufl or buf2 has been written to disk. If the COMMIT
record for t3 is added
to bufl, then a dependency cycle now exists in the topological ordering amongst buffers.
Buffer buf2 must be written after bufl because of tx2's dependency on tl, but bufl must
be written after buf2 because of t3's dependency on t2. Neither buffer can be written
to disk before the other without risking a possibly inconsistent state after recovery. This
situation is illustrated in Figure 5.1. The interdependencies amongst buffers at different
log streams can be represented as a dependency graph. An arc from node x to node
y in the dependency graph indicates that buffer x must be written after buffer y. In
Figure 5.1, buf2 must be written after bufl because of the dependency introduced by
89
tx2, but bufl must be written after buf2 because of the dependency introduced by tx3.
tx3
tx2
Figure 5.1: Deadlock in Dependency Graph for Buffers at Two Log Streams
One solution to this problem is to keep track of existing dependencies so that a cycle
never forms. This leads to difficulties. In a large system with many parallel log streams,
the maintenance of a dynamic dependency graph would entail prohibitive overhead.
record to a stream's current buffer, the LM must traverse the
Before adding a COMMIT
graph to check that a cycle will not be created.
A static approach may involve less overhead. A static graph, defined at system
initialization, specifies upon which other log streams a particular stream's buffers may
depend. The graph is constructed so that no cycles can possibly occur. When a transac-
record must be written, it is written to the stream which has the smallest
tion's COMMIT
set of allowed dependencies that includes all of the transaction's current dependencies.
For any set of log streams, there must be a log stream which can have dependencies
on all of them. This implies that the graph be a partial order with some unique bottom element. For example, the graph in Figure 5.2 is a suitable static partial ordering
amongst seven log streams. For any set of nodes in the graph, there exists some node
which is below all of them. Log stream L7is the unique bottom element.
Figure 5.2: Static Dependency Graph of Log Streams
If a transaction has dependencies on buffers at log streams Li and L2, then its
record will be sent to log stream L5. A transaction with dependencies on L2
COMMIT
and L3 must have its COMMIT
record directed to L7.
Although this static ordering reduces the overhead of maintaining a dependency
graph to avoid cycles, it can lead to other problems. It restricts the LM's options for a
90
distribution policy. Log streams at the bottom of the graph will tend to receive a higher
load, at least in terms of COMMIT
records, than streams near the top; this imbalance
could persist indefinitely.
This section has explained the drawbacks of dynamic and static solutions to the
problem of maintaining dependencies amongst buffers at different log streams so that
records are written to disk in an order that respects their depentransactions' COMMIT
dencies. Dynamic approaches have significant run-time overhead, and static approaches
are prone to load imbalances.
5.3
Logged Commit Dependencies (LCD)
This section presents a new solution, called Logged Commit Dependencies (LCD), that is
an appealing alternative to the ones described in the previous section. All considerations
about dependency graphs are banished. The choice of a log stream to which a record is
written is no longer limited by synchronization constraints.
record. When a transaction
LCD introduces a new type of TLR called a PRECOMMIT
record which explicitly
requests to commit, the LM immediately generates a PRECOMMIT
identifies all the transaction's unsatisfied dependencies at the time of the request. The
record to any log stream and can write it to disk at any
LM can send the PRECOMMIT
time.
records
Recovery becomes more complicated, however. If the LM wrote PRECOMMIT
but not COMMIT
records (which are no longer absolutely necessary), the RM might be
forced to unravel a deep "tree" of transaction dependencies before it can conclude that
a recent transaction actually committed. To make the RM's job easier, the LM also
generates a COMMIT
record for each transaction after all the transaction's dependencies
have been satisfied. This COMMIT
record is simply an indication to the RM that the
transaction did indeed commit before the crash occurred, and so it need not bother to
record.
check all the dependencies listed in the transaction's PRECOMMIT
The LM maintains a monotonically increasing Log Sequence Number (LSN) for each
log stream and associates a unique LSN value with every block of records that it writes
to disk at the stream. The LM places a block's LSN at the beginning of the buffer which
has been allocated for it. When the LM decides to write the buffer to disk, it increments
the stream's LSN and puts the new LSN at the beginning of the new current buffer.
Each transaction's LTT entry has a field, called DLRdeps, which keeps track of the
transaction's dependencies on unwritten REDO DLRs. This DLR-deps field holds a set
of pairs, where each pair has a stream identifier and a LSN. Whenever a transaction
writes a REDO DLR to the log, the LM notes the stream to which it was sent and the
91
LSN for the buffer to which it was added; denote the stream as s and the LSN as n. The
LM uses this information to update the transaction's LTT entry. If the transaction's
DLRdeps field currently has no pair for stream s, it adds the pair <s,n> to the set.
Otherwise, it updates the existing pair for s so that it holds the new LSN n instead
of its previous value. Similarly, the LM also maintains corresponding information for
each buffer at each log stream. The LM keeps track of the current LSN for a buffer and
the transactions that have had log records added to the buffer. After a buffer has been
written to disk, the LM uses this information to update the appropriate LTT entries.
Suppose that transaction t had a log record in a buffer at stream s whose LSN is n. After
this buffer has been written to disk, the LM retrieves the current pair <s,m> for stream
s from the DLRdeps field in t's LTT entry and compares m to n. If m=n, then the
LM removes the pair from the DLRdeps field (because t has not written any records to
subsequent buffers at stream s); otherwise (i.e., m>n), it just leaves the current <s,m>
pair in t's DLRdeps field.
Each transaction's LTT entry also has three more fields, called dependson,
isdependedonby
and deptxctr.
A transaction t's depends on field holds the set of
transaction identifiers for all precommitted transctions on which transaction t depends,
and t's isdependedonby field holds the set of transaction identifiers for all subsequent
transactions that depend on t. The deptx._ctr field is an integer-valued counter that
keeps track of the number of transactions on which t depends while t is in a precommit-
ted state.
The LM must remember the identity of the precommitted transaction, if any, that
most recently updated each object. This information is kept in each object's LOT
entry. Whenever a transaction reads or updates an object, it becomes dependent on
the precommitted transaction, if any, that previously updated the object. The LM adds
this dependency information to the respective dependson and isdependedonby
in the LTT entries of both transactions.
fields
Each PRECOMMIT
record contains the following three fields:
txid:
dlr_streams:
identifier for the transaction that has requested to commit
<streamid,LSN> pairs for all streams with unwritten DLRs
precomm_txs:
transactions on which this transaction depends
Suppose a transaction t requests to commit. The LM determines which REDO
DLRs (for updates by t) are still waiting to be written to disk and which precommitted
transactions (on which t depends) are still waiting to commit by examining the DLRdeps
and dependson fields, respectively, in transaction t's LTT entry. Unless both these fields
are empty, the LM puts the contents of both fields into their corresponding fields of the
PRECOMMIT
record for t and sends the PRECOMMIT
record to some log stream. For each
object b that t modified (as indicated by the contents of the objids set in t's LTT
entry), the LM updates b's LOT entry to record the fact that its current value depends
on precommitted transaction t and then the LM releases t's write lock on b. Finally, the
92
LM counts the number of transactions listed in the dependson field of t's LTT entry and
assigns this value to t's deptxctr counter. If DLRdeps and dependson are both empty
when t requests to commit, the LM just goes ahead and generates a COMMIT
record for
t.
Similar to before, a transaction t does not actually commit until all its DLRs have
been written to disk, all the precommitted transactions on which it depends have committed, and its PRECOMMIT
record (or its COMMIT
record, as explained below) has been
written to disk. Without the LCD technique and with no explicit dependency informa-
tion in the COMMIT
log record, transaction t committed at the instant that its COMMIT
record was written to disk (since all dependencies had to be satisified before its COMMIT
record could be written to disk). Now, the explicit information in the PRECOMMIT
record
allows the PRECOMMIT
record to be written to disk before all dependencies on DLRs and
earlier transactions have been satisfied. There may be some delay between the time that
t's PRECOMMIT
record is written to disk and the time that it commits.
The LM detects that a precommitted transaction t's last dependency (on either an
unwritten DLR or a precommitted transaction) has been satisfied when both
DLR-deps=O and deptx_ctr=O become true for the transaction. When this happens,
the LM immediately generates a COMMIT
record for t and sends it to any log stream; the
LM can go ahead and generate a COMMIT
record for t even before t's PRECOMMIT
record
has been written to disk. The transaction commits as soon as either its PRECOMMIT
record or COMMIT
record has been written to disk (and DLR-deps=O and dep_txctr=O).
When transaction t actually does commit, the LM sends an acknowledgement to t
in response to its commit request, updates the LOT entries of all objects which t had
modified, and updates the LTT entries for all transactions listed in the isdepended-onby
set in t's LTT entry. For each object that t modified (as indicated by the contents of
the objids set in t's LTT entry), the LM first checks to see if the object still depends on
t. If so, it changes the LOT entry to indicate that the object no longer depends on any
precommitted transaction. Otherwise, the object must now depend on some subsequent
precommitted transaction, and so the LM does not change the LOT entry. The LM
processes all the members of the isdependedonby
field in t's LTT entry immediately
after t commits. For each transaction u in this set, the LM decrements the deptxctr
counter in u's LTT entry.
The pseudocode for the LM's management of precommitted transactions, using the
LCD technique, is given below.
readobject(txid, object id) {
lotentry - LOT entry for object objectid
lttentry - LTT entry for transaction txid
ptx
- lot entry->precomtx
if (ptx$4NULL) {
pretxlttentry
-- LTT entry for transaction ptx
93
pretxJtt entry- >isdependedonby
pretxlttentry->isdependedonby
ittentry- >dependson
-
U {txid}
Ittentry- >dependson U ptx}
updateobject(txid, object id)
lotentry - LOT entry for object objectid
ittentry - LTT entry for transaction txid
ptx - lotentry->precomtx
if (ptx#NULL)
{
pretxltt-entry
-
LTT entry for transaction ptx
pretxltt entry- >isdependedonby
pretxlttentry->isdependedonby
Ittentry- >dependson
U {txid}
- ttentry- >dependson U ptx}
}
<stream,lsn> -- writelogrecord(txid, objectid)
if (3x s.t. <stream,x>
E lttentry->DLRdeps)
{
Ittentry->DLRdeps - ttentry->DLRdeps - <stream,x>}
}
Ittentry->DLRdeps
- Ittentry->DLRdeps U {<stream,lsn>}
}
requesttocommit(txid)
ittentry - LTT
ittentry->tx
{
entry for transaction txid
status - precommitted
if ((lttentry->DLRdeps=0)
AND (ttentry->dependson=0))
{
generatecommit rec(txid)
else (
generateprecomm rec(txid, ttentry->DLR-deps,
Itt entry- >depends on)
for (b E lttentry->objids)
lotentry
{
LOT entry for object b
lotentry->precomtx - txid
release write lock on object b
lttentry->dep.txctr - ltt-entry->depends.onl
94
txcommitted(txid)
{
send acknowledgement of commit to txid
ittentry
- LTT entry for transaction txid
lttentry->txstatus - committed
for (b C lttentry->objids)
lotentry
{
- LOT entry for object b
if (lotentry->precomtx
= txid) {
lotentry->precomtx
-
NULL
for (u E lttentry->isdependedonby)
{
pretxcommitted(u)
}
pretxcommitted(txid)
ittentry
{
-- LTT entry for transaction txid
Ittentry->dep-txctr - ttentry->deptxctr - 1
if (
(lttentry->txstatus
= precommitted)
AND (lttentry->DLRdeps=0)
AND
(lttentry->deptxctr=O)
){
generatecommit rec(txid)
if (PRECOMMIT
record from txid is already on disk) {
txcommitted(txid)
bufferwrittentodisk(txid,
lttentry
stream, lsn) {
- LTT entry for transaction txid
nocommityet - (lttentry->DLRdeps540)
lttentry->DLRdeps
if (
-
lttentry->DLR-deps- {<stream,lsn>)
(ltt entry- >txstatus=precommitted)
AND (nocommityet)
AND (lttentry->DLRdeps=0)
AND
(lttentry->deptxctr=0)
){
generatecommit rec(txid)
if (PRECOMMIT
record from txid is already on disk) {
tx-committed(txid)
}
}
95
precommit orcommit-recordwrittento-disk(txid)
ittentry
if (
{
- LTT entry for transaction txid
(Ittentry->tx status=precommitted)
AND (lttentry->DLRdeps=0)
AND
(lttentry->deptxctr=O)
){
txcommitted(txid)
The RM now has greater responsibilities. It may discover transaction t's PRECOMMIT
record on disk, but it must do some detective work to figure out if t really did commit
before the crash occurred. For every log stream listed in t's PRECOMMIT
record, the RM
must check that all records in the stream up to and including the LSN indicated in the
PRECOMMIT
record were written to disk prior to the crash. This is easily accomplished,
by inspecting the LSN numbers in all the blocks found in the log on disk. Likewise, for
every indicated precommitted transaction, it must verifythat the transaction did indeed
commit before the crash. The RM keeps track of these dependencies by using two new
fields, called dependson and isdependedonby, that belong to each transaction's RTT
entry. These fields are analogous to their counterparts in the LTT.
Let rmax denote the maximum time required to fill a buffer and write it to disk.
A transaction will commit (and generate a COMMIT
record) within time rma, after it
submits its commit request. Therefore, a COMMIT
record exists in the log on disk for
every transaction which precommitted at least 2 rma, prior to a crash. At worst, the
RM must deduce the fates of only those transactions which precommitted in the last
2ra,, seconds prior to the crash. It is expected that this number of transactions will be
small, compared to the total size of the log.
A transaction t's PRECOMMIT
record explicitly lists all the streams on which t depends
for its REDO DLRs to be written to disk. To reduce the size of the PRECOMMIT
record,
and thus save disk space and bandwidth, the LM can postpone generating a PRECOMMIT
record for t until all its REDO DLRs have been written to disk. Of course, transaction t
can still release all its write locks (after the DBMS has updated the objects' LOT entries
to record their dependency on t) as soon as t requests to commit. After all t's REDO
DLRs have been written to disk, the LM generates a PRECOMMIT
record for t that lists
all the precommitted transactions on which t still depends at the time the PRECOMMIT
record is generated. Now, the RM may need to deduce the fates of transactions that
precommitted as early as 3 rm,, prior to a crash.
Now consider how to integrate LCD into the XEL technique. The LM continues to
manage DLRs exactly the same as before, but has more complexity for the handling of
PRECOMMIT
and COMMIT
records. Each PRECOMMIT
record has a status of either required or
recoverable. A PRECOMMIT
record is initially required, and becomes recoverable after the
LM has written a COMMIT
record (for the same transaction) to disk. The LM must keep
96
a required PRECOMMITrecord in the log, but can throw away a recoverable PRECOMMIT
record at its earliest convenience.
record until all subsequent transactions
The LM must retain a transaction t's COMMIT
records written to the log. Suppose some transaction
which depend on t have had COMMIT
record is on disk, the LM
u, which depends on t, commits. As soon as u's COMMIT
retrieves the contents of the depends-on field from u's LTT entry. For each member v of
this field, the LM removes u from the isdepended-onby field in v's LTT entry. Since u
depended on t, the LM will remove u from the isdepended-on.by field in t's LTT entry.
in t's LTT entry, it concludes that
As soon as the LM detects isdependedonby=0
record any longer. The LM changes the
no subsequent transactions require t's COMMIT
record to recoverable as soon as isdependedon-by=O
status of a transaction's COMMIT
for the transaction, no recoverable UNDO DLRs remain from the transaction, and any
remaining REDO DLRs have only recoverable status (these latter two conditions are
determined by maintaining a counter in each transaction's LTT entry, as described in
Section 2.8).
LCD can increase the maximum throughput for any particular object in the database,
but this enhanced performance comes at a cost. LCD requires more storage for extra
fields in each LOT, LTT and RTT entry; these extra fields maintain dependency infor-
mation. It also increases the complexity and computational requirements of the LM and
records consume disk space and bandwidth.
RM. Finally, the PRECOMMIT
97
Chapter 6
Experimental Results
This chapter evaluates XEL and the three proposed distribution policies for parallel
logging. Disk space, disk bandwidth, main memory requirements and recovery time are
the evaluation criteria throughout the chapter.
Section 6.1 describes the event-driven simulator by which the experiments were performed. It explains each of the input parameters, documents the fixed parameters,
presents the definitions of XEL's data structures as expressed in the C programming
languange [36] and justifies the validity of the simulation model.
Section 6.2 quantitatively evaluates and compares the performances of XEL and
the FW technique for a single log stream as various application characteristics vary.
Four sets of experiments consider separately the effects of: (1) the probability of long
transactions, (2) the duration of long transactions, (3) the size of DLRs from long
transactions and (4) the "data skew" which characterizes access patterns to objects in
a database. The results of this section demonstrate that XEL can significantly reduce
the amount of disk space required for log information, compared to FW, although XEL
requires significantly more main memory and may entail increased disk bandwidth for
log information. Recovery is I/O bound, so recovery time is less for XEL than for FW.
Section 6.3 quantitatively evaluates and compares the load balancing properties of
the three oblivious distribution properties as the number of parallel log streams increases.
Three separate sets of experiments consider the cases of low, moderate and high data
skew, respectively. In these experiments, the log streams are managed by only the XEL
technique (FW is no longer of interest in this section). This section's results indicate that
all three policies yield approximately equal load balancing behavior for low data skew.
The partitioned policy performs slightly worse than random and cyclic for moderate data
skew, and it is much worse for high data skew. However, the partitioned policy generally
consumes the least amount of main memory because it does not require indirection via
the LOT and LTT data structures.
98
6.1 Simulation Environment
To quantitatively evaluate XEL and the three distribution policies for parallel logging,
the author implemented an event-driven simulator. The simulator is written in C and
runs on SPARCstations.
6.1.1 Input Parameters
The user can specify the following input parameters:
timestamps:
arrival rate:
tx pdf:
object pdf:
flush rate:
generations:
flag to indicate if disk version of database keeps timestamps
rate at which transactions are initiated
statistical mix of transaction types
statistical access pattern to objects
rate for flushing updates to disk version of database
number and sizes of generations
recirculation: flag to turn recirculation on or off
num streams: number of parallel log streams
distribution policy for parallel logging
distn policy:
runtime:
duration of simulated time span
recovery:
flag to request recovery after normal logging activity ends
The timestamps parameter specifies whether or not timestamps are assumed to
exist in the disk version of the database. If timestamps exist, then the simulator uses
the EL [35] algorithm; otherwise, it uses the more complicated (but more general) XEL
algorithm.
The simulator initiates transactions at regular intervals, according to the specified
arrival rate (transactions per second).
The user specifies an arbitrary number of different transaction types and their prob-
ability distribution function (pdf). For each type of transaction, the user states the
probability of occurrence, the duration of execution, the number of REDO DLRs written and the size of each DLR. Figure 6.1 graphically represents this transaction model
for a transaction that generates N=2 REDO DLRs in a system with only one log stream.
Whenever a new transaction must be initiated, the simulator randomly (according
to the pdf) selects its type. After choosing its type, the simulator schedules when its
REDO DLRs will be written. The DLRs are written at equally spaced intervals, with
the last being written only some short time E (equal to t 3 - t2 ) prior to completion.
Suppose that the transaction's lifetime (specified as part of its type) is T. It will finish
execution and request to commit (at time t 3 ) T seconds after it started. Its last DLR
99
choose
tx type
T-£,
IF
V
to
write N th
data log
record
write first
data log
record
-
F
p
F
F
t2
ti
t3
t4
time
1
T
begin execution
write COMMIT
tx log record
tx commits
Figure 6.1: Simulation Transaction Model
is written (at time t2 ) T -
before it finishes, and each DLR is written (T - E)/N after
the preceding one, where N is the number of REDO DLRs written by a transaction of
this type. After the LM has generated a COMMIT
TLR for the transaction and sent it to
a particular log stream, the transaction continues to wait for acknowledgement (at time
t4 ) from the LM before it actually commits; this delay occurs because the LM waits
until a buffer is almost full before writing it to disk at the tail of generation 0, and then
there is some delay rDisk_Write for transferring the contents to disk.
Whenever a transaction writes a REDO DLR, the simulator randomly picks some oid,
according to the access probabilities specifiedby the user and subject to the constraint
that the oid has not already been chosen for an update by a transaction which is still
active. The set from which an oid can be chosen consists of all integers from 0 up
to NUMOBJECTS-1, where NUMOBJECTS is the total number of objects (a fixed
value). The user breaks up this set of objects into several classes. For each class, the
user specifies the probability of occurrence and the size as a proportion of the total
number of objects. When a DLR is to be written, the simulator first chooses a class
according to the specified object pdf and then randomly selects an available oid from
within this class.
To control the rate at which the CM can flush updates, the user specifies some
number of disk drives and the time required to write a block to any of these drives.
There can be at most one request at a time for any particular drive. The user can
increase the maximum rate at which updates are flushed by increasing the number of
drives or decreasing the time to write a block to any drive. The NUMOBJECTS objects
are striped evenly over these drives. That is, for D drives, object i is mapped to drive
i mod D. Striping ensures that the objects within each class are distributed as evenly
as possible across the different drives, so no drive is relatively overloaded by a large
number of "hot spot" objects'. Each updated object requires a separate disk write
(i.e., the simulator assumes that there is negligible locality of updates within a disk
1
A "hot spot" object is one which is updated very frequently.
100
block). Each disk drive attempts to service pending flush requests in a manner that
minimizes access time. The simulator assumes that the difference between two objects'
oids corresponds to their locality on disk. For the purpose of calculating the difference
between two oids, the simulator assumes that the sequence of integers assigned to their
disk drive wraps around.
The user specifies the number of generations and the size (number of disk blocks) of
each generation. The size of each disk block is fixed in the simulator.
In some experiments, it is worthwhile to examine the LM's behavior without recirculation in the last generation, just to see the effect of simply segmenting the log. There
is an input flag to specify whether recirculation in the last generation is turned on or
off. If recirculation is disabled and the LM cannot advance the tail of the last generation because it would overwrite a non-garbage log record at the head, then it refuses to
accept any more incoming log records to the last generation; this tends to exert "back
pressure" on younger generations.
The user can specify the number of log streams that are to operate in parallel. If
more than one log stream is specified, then the user must also specify one of the three
distribution policies.
If a log stream refuses to accept a log record, the simulator kills the client transaction
that submitted the request. A more realistic simulator would stall, rather than kill the
transaction. The current version of the simulator suffices because the experimental
objective will be to determine the LM's resource requirements to support a particular
load without needing to kill or stall transactions.
After simulating normal logging activity for the specified runtime, the simulator will
also simulate recovery if the recovery flag has been set. Recovery uses the state of the
log on disk in the condition which exists immediately after the specified runtime has
elapsed. The simulator models only the first phase of recovery, in which the contents of
the log streams are retrieved from disk and the most recently committed value in the
log, if any, is determined for each object that had a DLR in the log. The second phase
of recovery, in which the disk version of the database is updated with these values from
the log, is not considered. In practice, this work can be performed in background after
normal processing has resumed, so it is reasonable to ignore it when simulating recovery.
The current version of the simulator does not incorporate the LCD technique; all
transactions retain their write locks until they commit. No experimental data are available for the performance of a LM which employs the LCD technique.
101
Fixed Parameters
6.1.2
Several parameters are fixed in the simulator. The delay between the write for a
transaction's last REDO DLR and its request to commit is fixed at 1 ms. The capacity
of each disk block is 2000 Bytes 2 . The LM attempts to keep Nfree>3 blocks available
in each generation to hold incoming log records. Four disk block buffers (2048 Bytes
TLR is assumed to
each) are provided for generation 0 of each stream. Each COMMIT
require 8 Bytes. The simulator conservatively assumes a fixed delay of TDisk_Write=15
ms to transfer a buffer's contents to disk when writing out records to the log. For each
log stream, at most one disk write operation (to any generation) can be outstanding
at any time. If several generations have buffers waiting to be written to them, the
simulator gives older generations priority over generation 0 when it must schedule the
next buffer to be written to disk. The simulator uses the group commit technique [5];
a log record is not written to disk until its buffer is as full as possible. Therefore, the
delay between the time a record is added to a buffer and the time it is written to disk
The number of objects in the database is fixed
is generally longer than TDiskWrite.
at NUMOBJECTS=107. Disk I/O from each log stream is entirely sequential during
recovery, so the simulator assumes that only 5 ms is required to retrieve a block from
disk when reading the log. When recovering the contents of a block, each DLR requires
100 ps to process and a TLR requires 40 ps.
These fixed parameters are summarized in the following table:
Parameter
Value
E = delay from last REDO DLR to commit request
Capacity of each disk block in log
Nfree = threshold number of free blocks per generation
Number of buffers (for generation 0) per log stream
record
Size of each COMMIT
1 ms
-
TDisk_Write
=
delay to write a block to the log
2000 Bytes
3 blocks
4 buffers
8 Bytes
15 ms
Maximum number of outstanding disk writes per stream
1 write
NUMOBJECTS = number of objects in the database
107 objects
5 ms
Delay to read a block from log during recovery
Time required to process a DLR during recovery
TLR during recovery
Time required to process a COMMIT
100Is
40 /us
6.1.3 Data Structures
The following declarations define XEL's principal data structures for the most general situation of numerous parallel log streams and any distribution policy. They are
expressed according to the syntax of the C programming language [36], in which the
2
A block size of 2048 is typical, but the simulator assumes 48 Bytes are reserved for bookkeeping
purposes and so only the remaining 2000 Bytes are available to hold log records.
102
simulator itself is written. These data structures are intended for a distributed memory
message passing parallel system architecture, rather than a shared memory model.
struct strlsm
int
int
{
struct strlsm
struct stroid
OBJID
/* one log stream id in a list
streamid;
minrecovtstamp;
/* identifier of a log stream
/* needed for parallel XEL
/* pointer
*next;
/* one object
{
id in a list of oids
/* object identifier
oid;
struct stroid
to next cell in the list
/* pointer to next cell in the list
*next;
};
struct strlttentry
TXNID
/* LTT entry for a transaction
{
/*
tid;
TXSTATUS status;
int
numrqddlrs;
int
recstr;
struct
strpcg
struct
stroid
struct
strlttentry
struct strlotentry
oid;
int
int
struct
struct
uncmtstamp;
commtstamp;
strlsm
strlotentry
int
int
/* number of required DLRs remaining
/* log stream to which COMMIT written
*setcgs;
/* cmt grps on which tx depends
*objids;
/* objects modified by this tx
*next;
/* other txs in same hash buckt
/* LOT entry for an object
{
OBJID
struct strrelcell
TXNID
tid;
id of transaction
/* current status of the transaction
/* id of object with records in log
{
blocknum;
reclength;
/*
/*
*lstms; /*
*next;
/*
timestamp for most recent DLR
tstamp most recently committed DLR
log streams with DLRs for object
other objects in same hash bucket
/* cell to point to a relevant log record
/* id of associated transaction
/* index of block in log to which rec belongs
/* size of the log record (in Bytes)
/* current status of log record
int
tstamp;
/* timestamp of update, if cell is for DLR
struct strllotent: r3 r *pobj; /* parent object, if cell is for DLR
RSTAT recstatus;
struct strrelcell
struct strrelcell
struct str-relcell
struct strllotentry
*next;
*left;
/* more cells for same obj or tx
/* left neighbor in doubly lkd list
*right; /* right neighbor in doubly lkd list
/* local LOT entry for an object
103
OBJID
struct
struct
oid;
*cells; /*
strllotentry
struct strllttentry
TXNID
/* id of object with DLRsin stream
strrelcell
list of cells for object's DLRs
*next; /* other objects in samehash bucket
/* local LTTentry for a transaction
{
/* id of tx with TLRsin log
tid;
struct strrelcell
struct strllttentry
/* list of cells for the tx's TLRs
*next; /* next tx in the hash bucket list
*cells;
static struct str_lotentry *lot_tbl[LOT_TBL_SIZE];
static struct strlttentry
*ltt_tbl[LTT_TBL_SIZE];
/* LOT
hashtbl */
/* LTThashtbl */
definition does not include a field that indicates a log record's
The strrelcell
type (TLR, REDO DLR or UNDO DLR). Such an extra field is unnecessary. The
contents of the recstatus
field identify both the type and the status of a record. For
example, a required REDO DLR and a required UNDO DLR have different values in the
recstatus
fields of their cells.
The strlsm
and strilotentry
structures go together on processing nodes that
administer portions of the LOT. The stroid
and strlttentry
together on nodes that manage the LTT. The str
str relcell
lotentry,
structures belong
strillttentry
and
structures are used at nodes that manage the log streams.
Assume that each oid and tid requires 8 Bytes. Integers and pointers consume 4
Bytes each. Each field of type TXSTATUS or R.STAT requires 4 Bytes (these types
are typedef'ed to int). The amount of storage required for each of these structures is
summarized below.
Storage required (Bytes)
Structure name
str lsm
12
stroid
12
str lttentry
str otentry
str relcell
strllot-entry
strilltt-entry
32
24
40
16
16
For the specific case of a single log stream or multiple log streams with the partitioned
distribution strategy, indirection via the local LTT and local LOT is no longer neces-
sary. The cells field from the str lttentry structure replaces the rec str field
structure. Similarly, the cells field from the strllotentry
in the strilttentry
structure replaces the lstms field in the strlot-entry structure. The pobj pointer in
strrel_cell now points to an instance of the strlot-entry type. The declarations
104
of the strlsm,strillot_entry
and str lltt_entry
structure types can be omitted.
The storage requirements for each of the stroid,
strrel_cell
str lttentry,
str lotentry
and
structure types remains unchanged, despite these modifications 3 .
The simulator can also report the storage requirements for FW logging. The user
must specify only a single log stream with a single generation, of course. The FW method
requires a simpler version of the LTT. As before, each entry in this LTT corresponds to a
particular transaction. It has fields for the transaction's identifier (8 Bytes), the current
status of the transaction (4 Bytes), the number of DLRs still waiting to be flushed (4
Bytes), the position within the log of the first record written by the transaction (4 Bytes)
and a pointer to the next entry in the LTT (4 Bytes). The FW method keeps track of
a transaction until after it has committed and none of its updates need to be flushed
to the disk version of the database. When a committed transaction no longer has any
updates waiting to be flushed, the FW method removesit from the LTT. Therefore, the
user should set the simulator's timestamps flag to true so that the EL algorithm is
used4 , even though the disk version of the database may not actually keep a timestamp
with every object in the database.
6.1.4
Validity of Simulation Model
The simulator provides sufficient flexibility to realistically evaluate various LM configurations for many different applications. It does not permit a user to precisely model every
possible application, but it does allow a user to succinctly specify the characteristics for
a broad range of applications. The simulator's inherent technological assumptions only
approximate reality, yet they capture the important characteristics of the underlying
technology while abstracting out many details that are largely irrelevant. Therefore,
the simulator provides sufficient power to evaluate XEL and its parallel variants as
important parameters vary.
The probabilistic transaction model statistically describes an application's static
mixture of transactions. It is worthwhile to examine XEL's behavior as the relative
lifetimes of different transaction types vary, and so the simulator provides this capability. The number and size of each transaction type's log records affect XEL's performance, and so a user can also vary these parameters. The probabilistic transaction
model does not provide sufficient power to specify every possible application. For example, an application in which exactly every eighth transaction is 10.0 s long and the
remaining applications have duration 1.0 s cannot be modelled, nor can an application
whose transactions do not write REDO DLRs at equally spaced intervals. Despite these
shortcomings, enough different application transaction mixes can be specified to provide
3
For the case of only a single log stream, the setcgs
field can be removed from the str ltt
entry
declaration, thus saving another 4 Bytes per transaction.
4
The EL algorithm removes a committed transaction's LTT entry as soon as all its DLRs have become
garbage.
105
meaningful results.
Likewise, the deterministic arrival rate enables a user to control the system's overall
throughput, but limits the ability to control precisely when each transaction is initiated.
A Markov arrival pattern [6], for example, cannot be accurately modelled with the
current version of the simulator. However, variations in the arrival pattern are not
an important issue for the evaluation of XEL; the overall throughput is the important
parameter.
The simulator is an open system. It does not incorporate feedback in its scheduling
of the times at which transactions are initiated, DLRs are generated and commits are
requested. If a transaction manager refrains from initiating new transactions before
it has received commit acknowledgements for enough previous transactions, then the
open system assumption is unrealistic. However, the open system assumption can be
justified for some other applications. As long as the LM continues to accept incoming
log records, there may be little reason why the DBMS would change the rate at which
it initiates new transactions. Likewise, a client transaction never needs to wait for a
reply after requesting the LM to write a REDO DLR to the log; at the time that the
request is made, the LM already knows whether or not it has space available on disk
for the record and it can respond to the client transaction immediately. A transaction's
lifetime is therefore largely unaffected by the performance of the LM, as long as the
LM is able to accept its log records. After a transaction requests to commit, it must
wait for acknowledgement from the LM. For the most general case of more than one
log stream, the LM must make sure that all a transaction's DLRs are on disk before it
record for the transaction, and then there is some additional delay
generates a COMMIT
record arrives on disk. As the LM becomes busier, queues form and
before the COMMIT
a buffer of records may need to wait longer before being written to disk. Therefore,
the length of time that a transaction must wait for acknowledgement to its request to
commit depends on the LM's load, but this delay does not affect the times at which the
transaction writes its DLRs and requests to commit. The experimental objective will
be to evaluate the LM's resource requirements such that the LM can accept all requests
without needing to kill (or stall) any client transactions, so the open system assumption
does not diminish the significance of the experimental results.
The probabilistic specification of data access patterns enables a user to model different collections of data objects in a database. These collections are characterized by
the frequency with which transactions update their member objects. A user can model
a wide range of different "data skews". Again, the statistical nature of this modelling
is concise and simple but there are some applications whose exact data access patterns
cannot be expressed in terms of this model. This inability to accurately model all possible applications does not prevent one from using the simulator to conduct experiments
which illustrate important aspects of XEL's performance.
XEL's behavior is largely influenced by what happens as records approach the head
of a generation, because only then does XEL decide whether or not to forward or re106
circulate a record. The arrival of subsequent records push a record toward the head of
its generation so that XEL must decide its fate, but the exact times of arrival of these
records are largely irrelevant. This observation supports the claim that the simulator's
acknowledged inabilities to accurately model all possible applications' characteristics
does not seriously limit its worth for the purposes of studying XEL.
The statistical specificationsof transaction types and data access patterns are static,
as is the arrival rate. In reality, an application's characteristics may vary over time. The
simulator does not permit specification of dynamically varying parameters. Despite this
shortcoming, meaningful experimental results may be obtained for important cases in
which an application's parameters remain static. These results can provide valuable
insights into XEL's behavior.
The simulator uses a simple model for disk I/O. A flush to the disk version of the
database always requires the same duration (a parameter which the user specifies). In
reality, this duration may vary from one write operation to the next. However, small
fluctuations in the actual duration to flush an update will have only a minor effect on
XEL's performance.
The simulator provides only a first order estimate of the bandwidth required for log
information. It assumes that each block write operation to the log requires the same
amount of time, rDiskWrite In reality, the time required to write a block of log records
to disk depends on whether the I/O is sequential or random in nature. Successive writes
to the tail of generation 0 are sequential and so they will generally have a short duration
(such as 5 to 10 ms each). When the LM must occasionally write a block to the tail
of generation 1, for example, the resulting disk I/O is random;, it will tend to take
significantly longer (such as 20 ms) because of seek and rotational delays. The effects
of the random disk. I/O to all generations except generation 0 can be minimized by
choosing generation 0 to be sufficiently large so that only a small fraction of log records
need to be forwarded. Section 7.3.3 explains an optimization for a system with several
log streams so that most disk I/O to log disks is sequential.
The simple model of the CM assumes that no uncommitted updates are ever written
out to the disk version of the database. Hence, UNDO DLRs are never needed. Economic trends justify this simplification. For many applications, the savings in disk I/O
bandwidth outweigh the price of the extra main memory needed to buffer uncommitted
updates. For example, suppose that each transaction modifies X Bytes of state, is T1
seconds long and begins Td seconds after the preceding transaction. The extra memory
required to buffer all uncommitted updates from a continuous sequence of transactions
is no more than XTl/Td Bytes5 . If an UNDO DLR were written for each uncommitted
update, the extra disk bandwidth would be X/Td Bytes/sec. The technique introduced
in [25] permits a comparison of the relative costs for these two options. Let Cm represent the cost (in dollars per Byte) of main memory and Cd represent the cost of disk
5
To be more precise, the upper limit is actually X[rT/Td],
for a first order analysis.
107
but the approximation
XTL/Td suffices
bandwidth (in dollars per Byte/sec). The cost of buffering uncommitted updates is
CmXTi/Td, and the cost of writing UNDO DLRs is CdXITd. Therefore, it is less expensive to buffer uncommitted updates in main memory if Tl<CdICm. A typical disk
drive costs at least $200 and provides a maximum I/O bandwidth of 2 MBytes/sec, so
Cd = 10- 4 $.sec/Byte. On the other hand, 1 MByte of DRAM costs approximately $20,
so Cm = 20 x 10-6 $/Byte. Hence, buffering of uncommitted updates is better economically if T1<5 sec. For many applications, transactions have lifetimes shorter than 5 sec.
As the price of DRAM continues to fall, relative to the price of disk bandwidth, T1 will
continue to increase.
The simulator concerns itself with the management of log information on disk and
the associated data structures which must reside in main memory. It does not account
for computational requirements nor for interprocessor communication. To some extent,
the consumption and management of these latter two resources depend on the specific
system upon which the logging and recovery system is implemented and so it is difficult
to accurately account for them. Furthermore, computation and communication are relatively cheap and abundant in concurrent systems, and so they do not deserve nearly as
much serious attention as disk I/O, which is a limited and relatively expensive resource.
6.2
Extended Ephemeral Logging for a Single Stream
This section presents the results of many experiments which were conducted to observe
the behavior of XEL as applied to a single log stream and to understand the effects of
varying different parameters. For comparison purposes, the traditional FW technique
was simulated by specifying a single generation with no recirculation. The FW simulation did not involve any checkpointing activity; the firewall was always the oldest
log record from the oldest transaction in the system. This omission favors FW because it ignores the overhead (in terms of disk space and bandwidth) associated with
checkpointing.
There are several evaluation criteria. Disk space, disk bandwidth (in terms of block
writes per second) and main memory requirements (for the LOT and LTT) are the main
criteria for normal logging activity. Elapsed time is the primary criterion for recovery.
The following parameters are specified for all experiments, unless otherwise stated.
There are two types of transactions. The first is of 1.0 s duration and writes 2 DLRs,
each of size 100 Bytes. The second lasts 10.0 s, in which time it writes 4 DLRs of size
100 Bytes each. Their probabilities of occurrence are 0.95 and 0.05, respectively.
The arrival rate is 100 TPS.
108
There is no data skew. That is, all NUMOBJECTS
chosen whenever an update is to be performed.
objects are equally likely to be
To provide sufficient bandwidth for flushing updates, each experiment specifies 10
disk drives with a transfer time of 25 ms. The conservative 25 ms time allows for some
read operations to be interspersed with writes.
All tests of XEL use two generations. The minimum possible sizes for these generations are determined experimentally. Recirculation is disabled, so that it is possible to
assess the effect of simply segmenting the log. There is only one log stream.
The simulation time is 500 s; these results reflect the minimum disk space required
to support 500 s of logging activity such that no transaction is killed.
6.2.1
Effect of Transaction Mix
For the first set of experiments, the probabilities of occurrence for the two transaction
types are varied. The probability of the long transaction type increases from 0 to 1.0,
while the probability of the short type decreases accordingly.
Figure 6.2(a) plots the disk space requirements (number of blocks) versus the transaction mix for both FW and XEL. The corresponding graphs of disk bandwidth (to
only the log), main memory requirements and recovery time are shown in Figures 6.2(b)
to 6.2(d), respectively.
XEL's advantages are most apparent for the 5% mix. It reduces disk space by a
factor of 3.2 with only a 9.1% increase in bandwidth. XEL requires 13 times as much
main memory as FW for the 5% mix, but this requirement is still modest in absolute
terms; XEL needs only 57.5 KBytes of main memory. The time required to read in the
log from disk dominates recovery time for both XEL and FW, so XEL offers much faster
recovery. As the probability of 10 s transactions increases, XEL's relative advantage over
FW diminishes. The reductions in disk space and recovery time are not as large, but
the increase in bandwidth is greater.
As the probability of the long transaction type approaches 1.0, the rate at which objects are updated approaches the maximum rate at which updates can be flushed (400
updates/s).
The resultant queueing delay causes DLRs to tend to remain unflushed
longer, and so the length of a single FW log increases accordingly. In the case of XEL,
many of the DLRs have had their updates flushed by the time the LM must decide
whether or not to forward them to generation 1; most of these DLRs have recoverable
status and need not be forwarded. Only a fraction of all log records have unflushed or
required status as they approach the head of generation 0, and so only these records advance into generation 1. By throwing away the garbage records at the head of generation
109
450
0
C)
.O 400
Q
- 0 ...... 0 Firewall
A---A
350
Co
.Y
'a
XEL
u.2
0
0)
Z
_
35
-
...... 0 Firewall
A- - -A
300
,&
u0) 25
250
.
.Q!
150
~~ 1 ~~AA.'
100
20
15
0)
C'
._
b.00
I
0.20
I
0.40
I
0.60
I
o "-i
0.00
0.20
I ~ I
0.80
1.00
0
a,
I
0.40
I
0.60
I
0.80
1.00
(b) Disk Bandwidth
0
22550
500
LW
X.
A
@..©"
Probability of tx of 10 s duration
(a) Disk Space
250
,De
'6 '©' 'I"
-
Probability of tx of 10 s duration
.Y
o 350
E
300
"C
5
a
E 450
0)
E 400
>
_M ,~ .1A 3
Af '
4'
10
50
o
XEL(Total)
- - -FOXEL
(Generation
1)
a) 30
u0
200 _
E
40
E
- 0 ...... 0 Firewall
A- - -A XEL
I
I
I
I
I
A
_
_
- 0 ...... 0 Firewall
E 175;O
A---A
_
XEL
15CI0
125;O
0
_
cc 100I0
I
_
200l0
75;O
150
. /
A
_
'a. @_E'
-
.©.' '
"
50
100
25;a
50
63
q}
*(3
& q)
a (a.h.h...00
0.20
0.40
0.60
0.80
1.00
Probability of tx of 10 s duration
n
0.00
0.20
0.40
0.60
0.80
1.00
Probability of tx of 10 s duration
(d) Recovery Time
(c) Main Memory
Figure 6.2: Performance Results for Varying Transaction Mix
110
0, XEL manages to use 44% less disk space than FW.
6.2.2
Effect of Transaction Duration
Figures 6.3(a)-(d) show the results as the duration of transactions of the long type
increases from 10.0 s to 60.0 s in increments of 10.0 s.
500
.
D
0 0 Firewall
......
A- - -A XEL
450
C'
9 400
n 350
13
12
11
0
10
- - - - - AD.-- e
®.... @.... 0 .... E).... E)....
9
C
300
.0
O 250
-
X
8
0
7
-
A--- -A
C- - -I
.S:
.
) 200
5
.,
E 150
4
3
Z 100
2
50
ir
v
00
1
0
60
50
40
30
20
10
50
60
20
30
40
10
Duration of long txs (seconds)
I
I
I
-
0
10
. 2750
E 2500oo
0 2250
E
0_
0)
rn
A-'
40
o
.I
60
O......
0 Firewall
A---A
XEL
O
-
1750
Co 1250
1000
750
XEL
500
20
n Li
0
50
8 1500
0 ...... 0 Firewall
A - -A
40
E 2000
80
60
T
30
(b) Disk Bandwidth
0 120
E
-
Duration of long txs (seconds)
(a) Disk Space
E 100
20
Firewall
XEL(Total)
XEL(Generation
1)
_
250
i
w
10
20
30
40
50
60
Duration of long txs (seconds)
I
4
0
I
I
I
I
40
50
60
10
20
30
Duration of long txs (seconds)
(d) Recovery Time
(c) Main Memory
Figure 6.3: Performance Results for Varying Long Transaction Duration
XEL's advantage over FW increases as the duration of the long transaction type
lengthens. For a 60.0 s duration, XEL reduces the size of the log by a factor of 7.9
with only a 6.9% increase in disk bandwidth. Regardless of transaction duration, the
average rate of arrival of log records (in steady state) is the same for both FW and
XEL. At any given moment, FW must retain all log records that have been written
111
.
since the first record of the oldest transaction, so the disk space required for FW is
roughly proportional to the duration of the longest transaction. However, XEL is able
to filter out most log records from short transactions at the head of generation 0, so
XEL is largely unaffected by the duration of a small fraction of long transactions.
6.2.3
Effect of Size of Data Log Records
This set of experiments examines the effect of the size of DLRs from the long (10.0 s)
transaction type. Each long-lived transaction still writes 4 DLRs, as before, but now the
size of each long transaction's DLRs varies from 100 Bytes up to 500 Bytes in increments
of 100 Bytes. Figures 6.4(a)-(d) present the results.
0, 140
0
_
120
_
-o
100
_
80
_
60
_
40
_
a)
E
Z
.0
0 Firewall
XEL
A- - -A
C_
o
_Ce
O......
20
a)
.,
L.
.0
0.·
le
I
-
1 2
0
o
100
_
I
200
3
I
300
I
I
400
500
4
0G
10
0 ...... 0 Firewall
-E-
2
nV
i
0
100
a)
E
C
i
300
400
X' 700
_
500
0 600
_
.0
E
50
.o'
E 5oo
40
o
Firewall
XEL
400
0 ...... 0 Firewall
CC --JUU
_
20
200
_
10
100
A- - -A
10r
n
I
0
i
(b) Disk Bandwidth
A.
30
200
E. - -N
Size of DLRs of long txs (Bytes)
(a) Disk Space
E
a,
E
E
XEL(Total)
XEL(Generation1)
4
Size of DLRs of long txs (Bytes)
60
..0
.'
14
8 - A- - -A
6 -_0-- --O
_-A
20
n
16
12
.
I
18
c
0
XEL
-020
30
40
I.
Size of DLRs of long txs (Bytes)
500
200
300
400
100
Size of DLRs of long txs (Bytes)
(c) Main Memory
(d) Recovery Time
100
200
300
400
500
Figure 6.4: Performance Results for Varying Size of DLRs from Long Transaction Type
112
As the size of DLRs from long-lived transactions increases, XEL suffers more than
FW. The proportion of log information, measured in Bytes, that must be forwarded to
generation 1 increases with the size of DLRs from long transactions, so disk space and
bandwidth for generation 1 both increase. Furthermore, the bandwidth for generation 0
increases, so records tend to move from tail to head faster. Most records from short
transactions are thrown away at the head of generation 0, so their faster movement
through generation 0 means that the LM does not need to keep track of them for as
long a period of time. This tends to decrease the overall main memory requirements for
XEL.
6.2.4
Effect of Data Skew
This section examines the effect of data skew on the performance of XEL for a single
log stream. Suppose that there is some subset H of the set of all objects and that the
members of H receive a disproportionately large number of updates; these objects are
"hot spot" objects because they are updated much more frequently than other objects.
Let x, 0<x<1, be the ratio of the size of H to the total number of objects. Suppose
further that when a transaction must choose an object to update, the probability that
it chooses a member of H is 1-x. This simple definition provides a single parameter, x,
which represents an application's data skew. By varying x appropriately, it is possible
to control the amount of skew in the pattern of updates to data by an application's
transactions.
In this section's experiments, x ranges from 5x10 - 5 up to 0.5. In the case of
x=5x10-5, almost all the updates affect a set of only 500 objects. Such extremely
skewed distributions characterize databases with "hot spot" objects. When x=0.5, all
objects are updated equally often, on average. Figures 6.5(a)-(d) present the results.
Data skew has only a minor effect on XEL. Even for the most highly skewed distribution, XEL requires only 48% more disk space and 5.5% more bandwidth than it does
for a completely unskewed distribution.
6.3
Parallel Logging
This section compares the performances of the three distribution policies for three different data skew specifications as the number of log streams increases. All experiments
in this section use the XEL method for disk space management within each log stream.
Let I be the number of log streams. The experiments in this section examine 1=2 k
for O<k<6.
113
O)
0O 90
80
._8
14
_
12
0)
0 10
_
_
0 ...... 0 Firewall
70
A- - -A
XEL
®.-. ..
..-.
. .....
..-.
CD
. 60
._)50 _
. 40
u)
CD
0.
8
0 .'.'.0
Firewall
6
0- - -_
XEL(Generation
1)
A- - -A XEL(Total)
u)
E
z
30
4
20
2
10
j
mI
11111
JJ1I
md JJ1
I I m,
fill
IIi l
16-04 16-03 16-02 1-01
Data skew (fraction of objects)
0
0
E
Q,
E
70
.o
60
E
CO
a,
450
E
a, 400
E
350
8
0r
0)
0 ...... 0 Firewall
A- - -A XEL
40
30
250
200
150
_
-A
- A--A
100
10
0
0 ...... 0 Firewall
- A---A XEL
300
50
20
E
(b) Disk Bandwidth
4
80
B - -
il J111
11111111I1 11111
111
JIlJ 1 J11111!
le-04 le-03 le-02 le-01
Data skew (fraction of objects)
(a) Disk Space
0
E13.- EfE I
50
ip ;''-Ai ) i i ili li'','ll
le-04 1e-03 1e-02 le-01
Data skew (fraction of objects)
II
0
1 le-04 Ile-03 III le- 02I
11111111
e- IIIIII
le-04
-03 le-02 le-01
Data skew (fraction of objects)
(d) Recovery Time
(c) Main Memory
Figure 6.5: Performance Results for Varying Data Skew
114
Refer to section 6.2.4 for a definition of the data skew parameter, x. The experiments
in this section consider three cases of data skew: x=0.5 (no skew), x=0.01 (moderate
skew) and x=2x10- 4 (high skew). In the case of x=2x10 - 4 , 99.98% of the updates, on
average, affect a set of only 2,000 objects.
For the case of maximum data skew and the greatest number of log streams, the
average time between consecutive updates for a hot spot object becomes so short that
the transaction types defined for the previous sets of experiments (which involved only a
single log stream) are no longer feasbile. It is necessary to define new transaction types
which would hold their write locks on objects for much shorter durations.
All experiments in this section specify the following two types of transactions. The
first is of 0.1 s duration and writes 2 DLRs, each of size 250 Bytes. The second lasts 2.0
s, in which time it writes 2 DLRs of size 250 Bytes each. Their relative probabilities of
occurrence are 0.99 and 0.01, respectively.
To provide sufficient bandwidth for flushing updates, each experiment specifies 10xl
disk drives with a transfer time of 25 ms each. The conservative 25 ms time allows for
some read operations to be interspersed with writes.
The arrival rate is 100x1 TPS and the simulation time is 500 s. All tests use two
generations. Recirculation is enabled for 1>2 so that race conditions will not stall any
of the log streams 6. For each skew setting, the sizes of the two generations are found
such that disk space is minimized for 1=1 (with recirculation disabled), subject to the
constraint that no transaction is killed. For 1>2 and x=0.5 or x=0.01, the sizes of
generations 0 and 1 are both doubled. For 1>2 and x=2x10 - 4 , the size of generation
0 is quadrupled and double the size of generation 1 is doubled; the larger increase in
generation 0 for x=2x 10- 4 was chosen because it was found to yield substantially lower
bandwidth requirements for generation 1.
The evaluation criteria are disk bandwidth and main memory for normal logging
activity and elapsed time for recovery. When considering bandwidth, the results reflect
the maximum total bandwidth (both generations) for any particular stream.
6
To appreciate the subtlety of potential race conditions, suppose that recirculation in the last generation is disabled and imagine that two different transactions update the same object in quick succession
and write DLRs to different streams. Until the first REDO DLR becomes non-recoverable, the second
DLR must be either unflushed or required(assuming that no other transaction subsequently updates the
object and commits). If the second DLR's stream is "faster" than the stream of the first DLR, it may
reach the head of the last generation of its log stream before the first DLR gets to the end of its stream.
The fast stream will be forced to stall until the slow stream has caught up. Similar situations can also
be imagined (such as for a DLR and the COMMIT
TLR on which it depends). These interstream dependencies can cause "fast" streams to synchronize with "slow" streams, which is generally undesirable for
performance reasons.
115
6.3.1
No Skew
The results for the case of no skew (x=0.5) are presented in Figure 6.6. The sizes of
generations 0 and 1 are 10 and 5 blocks, respectively, for 1=1; they are 20 and 10 blocks,
respectively, for 1>2. Recovery time is dominated by the delay required to read in the
contents of the log from disk. The elapsed time for recovery is 76 ms for 1=1 and 151
ms for 1>2, regardless of the distribution policy.
o
0
" 1500
0
30
oc,0
25
- 0...... Partitioned
A
0.
0
0)
......
E 1350
--- A
[ L
0-_
20
E 1200
Random
Partitioned
A- - -A Random
] -
Cyclic
*' 1050
E 900
Cyclic
.0
0O 750
15
4 600
10
450
300
5
150
0
o
I
8
I
I
I
I
I
I
I
16 24 32 40 48 56 64
Number of log streams
0
(a) Maximum Disk Bandwidth
0
8
16
24 32 40 48 56 64
Number of log streams
(b) Main Memory
Figure 6.6: Disk Bandwidth and Memory Requirements vs. Parallelism (=0.5)
All three distribution policies require approximately the same maximum bandwidth,
and the maximum bandwidth remains practically constant as the number of log streams
increases. Refer to Section 6.3.4 for an explanation of why the partitioned distribution
policy requires less main memory, compared to the other two policies, for all experiments.
6.3.2
Moderate Skew
Figure 6.7 shows the results for the moderate skew (x=0.01) case. The sizes of generations 0 and 1 are 10 and 5 blocks, respectively, for 1=1; they are 20 and 10 blocks,
respectively, for 1>2. The elapsed time for recovery is 76 ms for 1=1 and 151 ms for
1>2, regardless of the distribution policy.
For all three policies, the maximum required bandwidth increases slightly with the
number of log streams. This behavior can be attributed to decreasing intervals between
successive updates to each object. The transaction arrival rate increases in proportion
to the number of log streams but the number of hot spot objects remains constant, so
the average duration between successive updates to any particular object is inversely
proportional to the number of log streams. At this skew setting, a REDO DLR becomes
116
-o
......
1800
I
0=
16000A
E
3
co
a0
25
.
20
0 Partitioned
0.-
A- - -A Random
_
-
'
1400
-
E 1200
Cyclic
[
-
1000
t
15
800
Y
10
600
400
5
0 .
200
0
I
I
8
16
I
I
I
I
I
I
n
24 32 40 48 56 64
Number of log streams
0
8
16
24
32
40
48
56
64
Number of log streams
(a) Maximum Disk Bandwidth
(b) Main Memory
Figure 6.7: Disk Bandwidth and Memory Requirements vs. Parallelism (x=0.01)
more likely to have a required status, because of the lingering presence of a recoverable
DLR from a prior update to the same object, when the LM must decide whether or not
to forward the DLR from generation 0 to generation 1. In terms of maximum bandwidth,
the partitioned policy performs slightly worse than the other two because of static skew
effects.
The non-linear slope for main memory requirements can be explained similarly. The
COMMIT
records from transactions are more likely to be forwarded into generation 1
because required REDO DLRs depend on them. Therefore, COMMIT
records tend to
survive longer and the LM must continue to pay attention to all associated DLRs;
previously, the LM would have thrown away many of these COMMIT
records at the head
of generation 0 instead of forwarding them, and any remaining REDO DLRs would
instantly become non-recoverable.
6.3.3
High Skew
Figure 6.8 shows the results for the case of high skew (x=2x10-4). The sizes of generations 0 and 1 are 10 and 5 blocks, respectively, for 1=1; they are 40 and 10 blocks,
respectively, for 1>2. The elapsed time for recovery is 76 ms for 1=1 and 252 ms for
1>2, regardless of the distribution policy.
It is interesting to note that the bandwidth curves for the cyclic and random policies
slope up and then down, as the number of log streams increases. The initial increases
have a similar explanation as was given for the case of moderate skew. Namely, a
REDO DLR becomes more likely to be forwarded to generation 1 as the mean time
117
_...) o> 2400
o 50
o
45
ou
40
z5
35
w
30
25
0.
*-
-
..
0 ......
=
20 _ A---A
El
C
15
"' ,.
.
C 1800
E
am1200
0 Partitioned
[
Cyclic
... 2000
1600
~ ~ 1Et~
_
Random
0
."E 2200
'
m
]
10
5
n
1000
800
600
400
200
I
0
8
16
24
32
40 48 56 64
Number of log streams
0
8
16
24 32 40 48 56 64
Number of log streams
(b) Main Memory
(a) Maximum Disk Bandwidth
Figure 6.8: Disk Bandwidth and Memory Requirements vs. Parallelism (=2xl0
-4 )
between updates decreases. As the throughput increases even further, the average time
between consecutive updates to an object becomes so short that a DLR is more likely
to become only recoverable by the time the LM must decide whether or not to forward
it to generation 1 because another transaction has already updated the same object and
committed.
6.3.4
Discussion
Regardless of skew, the partitioned distribution policy requires less main memory than
either cyclic or random because it takes advantage of the fact that all DLRs for a
particular object are directed to the same log stream; the object's LOT entry can be
located at the same processor node as the one that manages the cells for its DLRs
and can directly point to these cells. With the cyclic and random policies, an object's
LOT entry no longer points directly to the cells for its DLRs. Instead, it indicates the
streams where there are relevant DLRs for the object; local LOT tables at those streams
point directly to the corresponding cells. This indirection entails higher main memory
requirements.
In terms of disk bandwidth, all three distribution policies are approximately the
same for low data skew. For high data skew, partitioned suffers noticeably from static
skew effects and so it requires significantly more disk bandwidth than the other two.
The random and cyclic strategies both exhibit approximately the same behavior for all
data skews.
118
Chapter 7
Conclusions
7.1 Lessons Learned
This thesis has proposed and evaluated a new variation of logging and recovery that is
well suited to highly concurrent database systems. Extended ephemeral logging (XEL),
a new disk management method that does not require periodic checkpoint operations,
is the cornerstone upon which the rest of the logging and recovery system is built. An
application's bandwidth requirements for log information may demand a collection of
log streams which work in parallel. Each log stream resides on a single disk drive or
possibly a small set of drives. XEL manages the disk space within each log stream.
Chapter 3 proved important safety and liveness properties for a simplified version of
XEL that was expressed in terms of the I/O automata model [42], thereby imparting
confidence in the correctness of XEL.
XEL's performance was experimentally evaluated in Chapter 6, using an event-driven
simulator. XEL was compared to the traditional "firewall" (FW) disk management
method for a single stream of log records and the experimental results suggest that
XEL's advantage over FW increases under any of the following conditions:
* The lifetime of an application's longest transaction type increases.
* The amount of log information written by an application's longest transaction type
decreases.
* The probability of occurrence of an application's longest transaction type decreases
(but remains non-zero).
119
When a system has multiple log streams, XEL can accommodate any distribution
policy but the partitioned policy generally requires less main memory space. Unless an
application updates a relatively small collection of objects much more frequently than
other objects in the database, all three oblivious distribution policies which Chapter 4
considered yield approximately equal loads on all log streams. However, the random and
cyclic distribution policies lead to better load balancing, compared to the partitioned
policy, if an application has a small collection of "hot spot" objects.
Log streams need not be especially large, in terms of storage space, when a system's
LM uses XEL to manage the disk space allocated for log information. Small log streams
and large main memories enable much faster recovery after a crash; a database system's
recovery manager (RM) can sequentially retrieve a log stream's contents from disk and
process them in a single pass. When multiple log streams exist, the RM processes them
all independently in parallel.
7.2
Importance of Results
This thesis widens the options available to DBMS designers. The firewall (FW) method's
abstraction of a single FIFO queue for log information is inappropriate in some circumstances. In such circumstances, a DBMS designer may prefer to use the XEL method
instead.
If a small fraction of transactions have relatively long lifetimes, XEL retains the necessary log information from these long transactions but reclaims the disk space occupied
by log records from much shorter transactions. Therefore, XEL can significantly reduce
the size of the log for some applications. The primary benefit of a much smaller log is
faster recovery after a crash. A smaller log may also decrease a system's cost.
Checkpoints are no longer a necessity with XEL. This eliminates the overhead (in
terms of computation, communication, disk bandwidth and disk space) and complexity
that accompany any disk management method which involves checkpoints (e.g., the FW
method). This advantage is especially welcome in highly concurrent systems that have
an arbitrarily large number of log streams and an arbitrarily large number of disk drives
on which the disk version of the database is kept; coordination for periodic checkpoints
in such a highly concurrent setting becomes cumbersome.
An arbitrarily large collection of disk drives provide the necessary bandwidth for
log information. As the size of this collection grows, the single FIFO queue abstraction
becomes increasingly awkward to implement. A more convenient abstraction is to view
the log as a collection of log streams that operate largely independently of each other.
This abstraction can be implemented efficiently if XEL is used to manage the disk space
within each log stream.
120
The simulation results presented in this thesis quantitatively demonstrate XEL's
effectiveness for a wide variety of applications and illustrate its strengths and weaknesses
compared to the traditional FW method.
7.3
7.3.1
Extensions
Non-volatile Region of Main Memory
Previous authors [15, 13, 39, 9, 53] have proposed system designs in which some (but
not all) of main memory is non-volatile. Battery backup to some portions of RAM
ensures that the contents will not be lost if the regular power supply is interrupted.
In such a system, parallel XEL can greatly reduce the disk bandwidth required for
log information so that much fewer disk drives are needed for the log. The youngest
generation (generation 0) for each log stream can be kept in non-volatile main memory;
the LM writes log records to disk only when space in non-volatile main memory has
been exhausted.
7.3.2
Log-Only Disk Management
Suppose a computer's main memory provides sufficient capacity to hold all the objects
of a database and that applications update most of these objects quite frequently. In
such a setting, a separate disk version of the database is superfluous. The most recently
committed value for each object can be kept in only the log. A few small changes to the
XEL algorithm yield a variant that meets the needs of this log-only situation. Without
a disk version of the database, UNDO DLRs are no longer needed. The status of each
REDO DLR is either required or not-required; a REDO DLR has status required if the
transaction which performed the update is still in progress or if the DLR is for the most
recently committed update to the object. Although older REDO DLRs for the same
object may still be recoverable, the LM doesn't need to keep track of them because it
will never erase the DLR for the most recently committed update to the object; hence,
these older REDO DLRs all have status not-required. Likewise, a COMMIT
record can
have a status of either required or not-required; a COMMIT
record is required if any only
if at least one DLR from the transaction is still required. The LM keeps track of only
COMMIT
records which are required.
This log-only variant of XEL offers several advantages. First, it eliminates the expense and complexity that arise from managing a separate disk version of the database.
Second, the LM no longer needs to keep track of old stale records in the log, and this
implies lower main memory storage requirements for the LOT and LTT. Furthermore,
the fact that the LM keeps track of only required records means that the status field can
121
be eliminated from each cell, thus yielding further reductions in main memory requirements.
7.3.3
Multiplexing of Log Streams
One possible drawback to XEL is that it may cause write operations to non-sequential
positions on disk. Movement between generations within a stream introduces random
disk I/O, which is generally much less efficient than sequential I/O. For example, suppose
a log stream has two generations. When the LM must occasionally write a block of
forwarded log records to the tail of generation 1, it must seek to this track's location on
disk and wait for the block's location to rotate under the disk head. When the write
to generation 1 has finished, the LM returns to the tail of generation 0, where it can
resume writing blocks to generation 0 in sequential order.
A LM with a sufficiently large collection of log disk drives can alleviate the need for
occasional non-sequential accesses to the log by multiplexing older generations from different streams, as illustrated in Figure 7.1 for a LM whose streams have two generations.
disk
1
new
log -records
disk 2
new
log -records
disk 4
disk 3
new
log ---
J1||I
;
Lm![mt
records
Figure 7.1: Multiplexing of Older Generations
With this configuration, the LM can exploit completely sequential disk I/O when
writing log information to generation 0 of any particular log stream. When the LM
must forward log records to generation 1, it writes them to a physically different disk
drive (or set of drives) so that no random I/O is required.
This technique of multiplexing older generations from different streams permits quick
reclamation of the disk space allocated to log records from short transactions but doesn't
122
suffer from any performance degradations due to random disk I/O to the log disks.
7.3.4
Choice of Generation Sizes
For a particular application, how many generations should each log stream have, and
what should be the sizes of these generations? Currently, no analytical methods are
available to answer these questions. The experiments reported in Chapter 6 relied on
simulation to determine the optimal configuration for each particular case. Therefore,
formulation of an accurate analytic model to determine the optimal number of generations and their sizes for any particular application remains a challenging open problem.
The characteristics of an application may vary over time, and so the design of an
enhanced version of XEL that can adaptively alter its parameters in response to changing
conditions is another important open problem.
7.3.5
Fault Tolerance to Isolated Media Failures
This thesis has concentrated on fault tolerance to system failures (crashes) in which
the contents of main memory are lost but all information on non-volatile disk storage
remains intact. To tolerate media failures, in which information on disk is lost, a DBMS
must exploit redundancy. RAID [52, 51] and Parity Striping [23] have both recently
been proposed as solutions to the problem of isolated media failures. A parallel implementation of XEL and RAID can be combined with little difficulty. Disk I/O for
log information is characterized by sequential transfers of large blocks of information,
and so level 3 RAID is most appropriate. A group of disk drives, one of which is used
only for parity information, constitute each log stream. The maximum bandwidth per
stream is now higher (compared to the simplistic situation of only one disk drive per
log stream), so fewer streams are required. In contrast, I/O for the disk version of the
database is characterized by random requests for small pieces of data, so level 5 RAID
or Parity Striping would be the best choice for it. The differing requirements of the LM
and CM provide a good example of a situation in which it is advantageous to employ
two different levels of RAID in the same system.
RAID systems can be designed to provide very high reliability. For example, a 1000
disk level 5 RAID with a group size of 10 and a few standby spares could have a calculated
MTTF (mean time to failure) of over 45 years [52]. Isolated media failures pose very
little threat under such circumstances. Other components of the system may likely
fail before an irrecoverable media failure happens. Nevertheless, the mere threat of an
irrecoverable media failure may warrant attempts for further fault tolerance. A familiar
solution (see [5], for example) is to periodically dump the database's current state to an
archive version and maintain an archive log of all transactions which have executed since
123
the archive version was dumped. The archive version of the database and the archive
log may be kept on tape rather than on disk. This thesis has not addressed the problem
of maintaining an archive log for a database which requires very high bandwidth for
log information. Another option, which is already practiced by some large commercial
users of databases, is to have two duplicate database systems running at geographically
separate sites to that a natural or man-made disaster at one site does not wipe out all
data. This option may be expensive, because it requires full duplication, but it does
provide good fault tolerance. In the event of a media failure at one site, the other site
still offers an up-to-date version of the database. The likelihood of simultaneous failures
at both sites ought to be very low. The provision of more efficient means to ensure fault
tolerance to irrecoverable media failures remains a challenging open problem.
7.3.6
Fault Tolerance to Partial Failures
This thesis has conveniently assumed that the concurrent system on which the DBMS
executes is either completely up or completely down. In practice, some machines may
be able to continue operation at some processing nodes despite partial failures at other
nodes elsewhere within the same machine. When the system hardware allows such
"graceful degradataion", a DBMS designer might wish to structure the DBMS software
so that it too allows graceful degradation.
One obvious way to provide fault tolerance to partial failures is to duplicate all an
application's data structures and code. For example, all LOT entries at a particular
node would be duplicated at some other node elsewhere in the machine. Any update
to one copy of the these LOT entries must be applied to the other copy as well. Refer
to [4] for an example of a system which exploits software redundancy to tolerate partial
failures.
7.3.7
Support for Transition Logging
This thesis has always assumed physical state logging at the access path level. Some
DBMS designers may prefer other styles of logging. For example, the ARIES method
for logging and recovery [45] uses transition logging [30] and incorporates compensation
log records (CLRs) to undo the effects of previous updates to objects. If a LM performs
transition logging for an object and a transaction updates the object, the resulting log
record indicates only the operation that transformed the object's old value into its new
value; neither the old state nor the new state of the object is represented in the log. In
this context, what is the definition of a garbage log record, and what is a relevant log
record? How should the LM manage the LOT and LTT?
When a transaction performs an operation on an object, the resulting log record shall
124
be called a FWD record (since it refers to an operation in the "forward" direction in
time). When the LM undos an operation, the resulting record is a CLR, as mentioned
above. In general, an operation described in a log record is not idempotent. If the
RM must undo an update by a transaction, it must first be sure that the version of the
object in the disk version of the database actually incorporates the transaction's original
update, lest an unwarranted undo action put the object into an incorrect state. One
way to synchronize the log and the disk version of the database is to keep a timestamp
with every object upon which the LM will perform transition logging. Suppose this
timestamp is an integer-valued counter (initially 0) and the LM increments the counter
every time it changes the object. Whenever the CM flushes an updated object to the disk
version of the database, the object's current timestamp accompaniesit and resides with
it in the disk version of the database. Whenever the LM modifies an object (in either
the "forward" or "backward" direction), it increments the timestamp and stores the new
timestamp value in the associated log record. When the RM must restore an object to
its most recently committed pre-crash state, it knows that the version of the object in
the disk version of the database already incorporates the effects of all operations whose
log records have timestamps less than or equal to that found with the object in the disk
version of the database. The RM must redo only those operations which are described
in subsequent log records from committed transactions. Similarly, the RM must undo
any operations that were performed by transactions which aborted or were interrupted
by the crash; some of these operations may temporally precede the current version of
the object in the disk version of the database.
The LM must retain all FWD and CLR log records for operations that temporally
follow the current version of the object in the disk version of the database. It must also
retain all FWD log records from uncommitted transactions that do not have corresponding CLRs, regardless of whether these records temporally precede or follow the current
version of the object in the disk version of the database. If a transaction commits, the
LM must retain its COMMIT
record until all FWD log records from the transaction have
been overwritten; otherwise, the RM could find a FWD log record which appears to be
from an uncommitted transaction but the log contains no subsequent CLR so the RM
ought to undo the operation described in the FWD record. If the LM writes a FWD log
record and later writes a CLR to undo the operation, it must retain the CLR in the log
until the FWD record has been overwritten; this ensures that the RM does not undo an
operation which has already been undone. These considerations dictate when COMMIT,
FWD and CLR records become garbage. The LOT and LTT can track the positions
and status values for FWD and CLR records just as it did for other types of log records.
7.3.8
Adaptive Distribution Policies
This thesis demonstrated that a simple oblivious distribution policy, such as random or
cyclic, can ensure excellent load balancing behavior in a system with arbitrarily many
log streams, where load balancing is defined in terms of the demanded bandwidth of each
125
stream. However, there may still be reasons to pursue more sophisticated distribution
policies.
An adaptive distribution policy takes into account the current state of the system
when deciding the stream to which to send a log record; any such strategy is clearly not
oblivious. Although an adaptive strategy cannot achieve any improvements in terms of
balancing the demanded bandwidths of log streams, it can offer other benefits. Each
log stream has a fixed number of buffers. If one stream's buffers are all completely
full temporarily, an adaptive strategy ought to redirect log records to other streams
which still have buffer capacity available. Furthermore, the random and cyclic policies
make no attempt to exploit locality within a concurrent computer. As a system's degree
of concurrency scales up, global network bandwidth may become more limited and
locality may affect a database's overall performance. The partitioned policy, despite its
drawbacks in terms of load balancing, can provide good locality; a modified variant of
the partitioned policy may yield the best solution, in terms of both load balancing and
locality. Log records are sent to streams according to the partitioned policy. However,
an overloaded log stream does not refuse to accept records if it can redirect them to
some other stream (preferably one that is nearby) that can accept more records.
Theoretical analysis of an adaptive distribution policy becomes quite complicated
because the system incorporates feedback. Such problems have classically been the
preserve of control systems theory. Theoretical analysis and experimental evaluation
of an adapative policy must take into the system's dynamic behavior. For example, a
system might exhibit oscillatory behavior under some conditions. These issues ought to
be thoroughly understood before any adaptive policy is chosen for use within a database
system.
7.3.9
Quick Resumption of Service After a Crash
The description of the RM's operation in Section 2.9 stated that normal operation resumes after recovery activity has completely finished. For some applications, availability
is very important and normal activity should resume as quickly as possible after a crash.
More sophisticated recovery algorithms for XEL may permit a database to start servicing requests before recovery has entirely completed; normal operation and recovery are
overlapped. This remains an interesting open problem.
7.4
The Future of High Performance Databases
In the future, information will be plentiful, available and inexpensive; but it won't be
free. As computer and communications technology becomes pervasive in our society,
many offices and homes will be able to retrieve information from large databases. In
126
general, the corporations which provide these information services will charge the consumers appropriately. Fees will reflect the value of the information to the consumer
and the cost to the producer for gathering, storing and distributing the information.
Database technology will provide not only the ability to store and retrieve large amounts
of diverse information. It will also provide a means to charge customers accordingly, so
as to ensure economic efficiency.
Suppose an "information vendor" establishes a database which provides information
services to many millions of customers. If 10,000,000 customers happen to be using the
system at one time and each customer submits approximately one request every 100
seconds, say, this generates a load of 100,000 TPS (transactions per second) for the
billing database. This is enormous, compared to the best demonstrated performance of
today's systems.
The cost for maintaining billing information is important. It ought to be relatively
low, compared to the cost of the information itself, so that the price is not significantly
inflated by the need to charge a fee for each request.
These considerations suggest that high performance, low cost transaction processing
systems will play an important role in our society as we become information consumers
and many hundreds of millions of people and companies buy and sell information.
The work in this thesis is a step toward realizing this dream. However, it addresses
only a necessary condition (fault tolerance), not a sufficient condition. Much more
work remains to be done in many other areas. The problems of concurrency control,
query optimization and transaction management all deserve re-examination within the
context of highly concurrent database systems that support very high rates of transaction
processing.
127
Appendix A
Theorems for Correctness Proof
of XEL
This appendix contains proofs for the safety and liveness properties of XEL. These proofs
supplements Chapter 3 in the main body of the thesis.
Proof of Possibilities Mapping
A.1
The following lemmas and theorems will prove that f, as defined in Section 3.4.3, satisfies
the sufficient conditions (stated in [42, 43]) to be a possibilities mapping from LMto SLM.
Theorem
Vso, soEstart(LM), 3to s.t. (toEstart(SLM)) A (toEf(so))
A.1
Proof:
*
((
A
(curr reqddlr=l)
A
(pendingts-assign=0)
A
A
(Vx, xEA, timestamp=l)
(Vx, xEAr, statusx=UNFL)
)
in state so)
(tEf(so))
((keep=)
* start(SLM)={to}
=:-
where ((keep=I)
A (leterase=0) A (wait-erase=0)) in state t
A (leterase=0)
A (waiterase=0))
in state to
* ((keep=I) A (leterase=0) A (wait-erase=0)) in state to -== t=to
* (t=to) A (tEf(so))
=- toEf(so)
and thus the theorem has been proven.
128
O
Lemma A.2
(lmkeep(z)
A
in state s)
(tEf(s))
keep=x in state t
Proof:
By contradiction. Assume keepox in state t.
* (tEf(s)) A (keep/x in state t)
== Either
(1) (lmlet(x) in state s) A (xEleterase in state t)
* lmlet(x) in state s ==: -lmkeep(x) in state s
by definition of lm let(x) and Imkeep(x)
But this is a contradiction, so this case cannot be true.
or
(2) (Imwait(x) in state s) A (xEwaiterase in state t)
* lmIwait(x) in state s
==> statusx=RECV
in state s
by defin!ition of Imwait(x)
* statusx=RECV in state s
=- currreqddlr5x in state s
* currreqddlr5x in state s
==~ -lmkeep(x)
by Invariant 3.6
by definiition of lmkeep(x)
in state s
But this is a contradiction, so this case cannot be true.
or
(3)
A
(-recvbl(x) in state s)
(((keepy4x) A (xfleterase)
A (xwaiterase))
in state t)
* Imkeep(x) in state s
== curr-reqddlr=x in state s
by definition of Imkeep(x)
* currreqddlr=x
in state s == recvbl(x) in state s
by Invariants 3.6 and 3.8 and definition of recvbl(x)
But this is a contradiction, so this case cannot be truite.
* Therefore, every possible case leads to a contradiction and so the original
assumption must be false. Thus the lemma has been proven.
[
Lemma A.3
(Imlet(x) in state s)
A
(tEf(s))
xEleterase in state t
Proof:
By contradiction. Assume xleterase
* (tEf(s)) A (xflet-erase in state t)
== Either
129
in state t.
(1) (Imkeep(x) in state s) A (keep=x in state t)
* lmkeep(x) in state s ==- lmlet(x) in state s
by definition of lmkeep(x) and Imlet(x)
But this is a contradiction, so this case cannot be true.
or
(2) (lmwait(x) in state s) A (xEwait-erase in state t)
* lmwait(x) in state s
=
statusx=RECV in state s
by definition of lmwait(x)
* statusx=RECV in state s
==-
currreqddlrox
by Invariant 3.6
in state s
* ((curr-reqd dlrsx) A (lm-wait(x))) in state s
by definition of lmlet(x)
-lmlet(x) in state s
=
But this is a contradiction, so this case cannot be true.
or
(3)
(-recvbl(x) in state s)
A (((keep5x) A (xleterase)
* lmlet(x) in state s
A (xwaiterase))
V (recvbl(x)))
== ((curr-reqddlr=x)
in state t)
in state s
by definition of lmlet(x)
V (recvbl(x))) in state s
* ((currreqd.dlr=x)
= recvbl(x) in state s
by Invariants 3.6 and 3.8 and definition of recvbl(x)
But this is a contradiction, so this case cannot be true.
*Therefore, every possible case leads to a contradiction and so the original
0
assumption must be false. Thus the lemma has been proven.
Lemma A.4
(lmwait(x)
A
in state s)
(tEf(s))
xEwaiterase in state t
Proof:
By contradiction. Assume xowait-erase in state t.
* (tEf(s)) A (xowaiterase in state t)
-- Either
(1) (lmkeep(x)
in state s) A (keep=x in state t)
* Imkeep(x) in state s
==
curr-reqd dlr=x in state s
by definition of Imkeep(x)
* curr-reqddlr=x in state s
==
status,RECV
in state s
by Invariant 3.6
* statusx$RECV in state s
by definition of lmwait(x)
==- lmwait(x) in state s
But this is a contradiction, so this case cannot be true.
130
or
(2) (lm-let(x) in state s) A (xEleterase
in state t)
* lmlet(x) in state s
((currreqddlr=x)
==
V (-,lmwait(x))) in state s
by definition of lmlet(x)
* ((curr_reqddlr=x) V (-lmwait(x))) in state s
:==-lm.wait(x) in state s
by Invariant 3.6
But this is a contradiction, so this case cannot be true.
or
(3)
(-recvbl(x) in state s)
A (((keep-x) A (xzlet-erase) A (xfwait-erase)) in state t)
* lm.wait(x) in state s
==
status.=RECV
in state s
by definition of lmwait(x)
* statusx=RECV in state s
in state s
by Invariant 3.7
==- timestampEA
* ((statusx=RECV) A (timestamp.EA/f)) in state s
-=recvbl(x) in state s
by definition of recvbl(x)
But this is a contradiction, so this case cannot be true.
*Therefore,
every possible case leads to a contradiction and so the original
assumption must be false. Thus the lemma has been proven.
Lemma A.5
(-recvbl(x)
A
[]
in state s)
(tEf(s))
((keephx) A (xfleterase) A (xEwait-erase)) in state t
Proof:
By contradiction.
Assume ((keep=x) V (xElet-erase) V (xEwait-erase)) in state t.
* (tEf(s)) A (((keep=x) V (xEleterase) V (xEwaiterase)) in state t)
=. Either
(1) (lmkeep(x)
in state s) A (keep=x in state t)
* lmkeep(x) in state s
=# currreqddlr=x
* currreqddlr=x
in state s
by definition of lmkeep(x)
in state s =- recvbl(x) in state s
by Invariants 3.6 and 3.8 and definition of recvbl(x)
But this is a contradiction, so this case cannot be true.
or
(2) (lmlet(x) in state s) A (xEleterase in state t)
* lmlet(x) in state s
= ((curr eqd dlr=x) V (recvbl(x))) in state s
by definition of lmlet(x)
131
V (recvbl(x))) in state s
*((curr-reqd.dlr=x)
- recvbl(x) in state s
by Invariants 3.6 and 3.8 and definition of recvbl(x)
But this is a contradiction, so this case cannot be true.
or
(3) (lmwait(x) in state s) A (xEwait-erase in state t)
* lmwait(x) in state s ==- status,=RECV in state s
by definition of lmwait(x)
* status,=RECV in state s
=-
timestampEAf
in state s
by Invariant 3.7
* ((status,=RECV) A (timestampEAf)) in state s
by definition of recvbl(x)
==- recvbl(x) in state s
But this is a contradiction, so this case cannot be true.
*Therefore, every possible case leads to a contradiction and so the original
[
assumption must be false. Thus the lemma has been proven.
Lemma A.6
A
(recvbl(x), xEAr, in state s)
A
(tEf(s))
((keep=x) V (xEleterase) V (xEwaiterase)) in state t
Proof:
By contradiction.
Assume ((keepox) A (xleterase) A (xfwaiterase)) in state t.
* (tEf(s)) A (((keepox) A (xleterase) A (xowaiterase)) in state t)
=:
by Definition 3.1
-recvbl(x) in state s
But this contradicts the lemma's predicate. Therefore, the initial assumption
must be false and thus the lemma has been proven.
0
(s is a reachable state of LM)
Lemma A.7
A
(recvbl(x),
A
(tEf(s))
xzEA, in state s)
((leterase#0) V (waiteraserS0)) in state t
Proof:
* (recvbl(x) in state s) A (tEf(s))
== ((keep=x) V (xElet-erase) V (xEwait-erase)) in state t
by Lemma A.6
* Either
(1) keep=x in state t
132
* t is a reachable state
=- ((keep=l) V (leterase#O) V (waiteraseO))
in state t
by Invariant 3.1
·
(keep=x, EAV,in state t)
A (((keep=I) V (let-erasej0) V (waiteraseo0)) in state t)
:- ((let-eraseo0) V (waiterase$0)) in state t
or
(2) keepox in state t
*
(keepox, xEA(, in state t)
A (((keep=x) V (xEleterase) V (xEwait-erase)) in state t)
=- ((xEleterase) V (xEwaiterase)) in state t
* ((xEleterase) V (xEwaiterase)) in state t
== ((let-erase40) V (wait-erase#0)) in state t
Therefore, the desired result is obtained for both possible cases and thus
the lemma has been proven.
[
Lemma A.8
(s is a reachable state of LM)
(x, XEAE,s.t. recvbl(x), in state s)
(t Ef(s))
A
A
((let-erase=0) A (waiterase=0)) in state t
Proof:
By contradiction. Assume ((leterase$O) V (waiterase40)) in state t
*
(((leterase-0) V (waiterase$0)) in state t)
A (((L~leterase) A (waiterase))
=: 3x,
in state t)
EAr, s.t. ((xElet-erase) V (xEwait-erase)) in state t
by Invariant 3.5
* (((xEleterase) V (xEwaiterase)) in state t) A (tEf(s))
=-
((lm let(z)) V (Imwait(z)))
in state s
by Lemmas A.3 and A.4
* ((lmlet(x)) V (lmwait(x))) in state s =. recvbl(x) in state s
by definitions of lmlet(x) and lmwait(x),
and by Invariants 3.6, 3.7 and 3.8
But this contradicts the lemma's predicate, and so the original assumption
must be false. Thus the lemma has been proven.
Theorem A.9
(sil
is a reachable state of LM)
A
(ri=COMMIT)
A
A
((s_- ,COMMIT,sj)Esteps(LM))
(t'Ef(si-))
3t s.t. ((t',COMMIT,t)Esteps(SLM)) A (tEf(si))
133
0
Proof:
* 7ri=COMMIT,
= currreqddlr=x
* Either
(1) 3y, yx,
in state si
s.t. recvbl(y) in state si_
* (recvbl(y) in state si_1 ) A (ri=COMMITx)=>- recvbl(y) in state si
* ((currreqddlr=x) A (recvbl(y), y/x)) in state si
= lmkeep(x) in state s* (recvbl(y) in state si-1) A (t'Ef(si_-))
((leterase$0) V (waiterase/(O)) in state t'
by Lemma A.7
*
(((let-erase$O) V (waiterase0O)) in state t')
A
((t',COMMIT,t)Esteps(SLM))
=: x=keep in state t
* Either
(i) 3w, w5x, s.t. currreqddlr=w in state si-,
* curr-reqddlr=w in state si, ((recvbl(w)) A (statuso/RECV)) in state s_l
by Invariants 3.6 and 3.8
*
(((recvbl(w)) A (status
A
RECV)) in state si_l)
(7ri=COMMITx)
==- ((recvbl(w))
A (statusRECV))
in state si
* (ri=COMMITx)
A (w/x) => currreqddlrow in state si
*
(
(currreqddlr/w)
A
(recvbl(w))
A
(statusw#RECV) ) in state si
lmlet(w) in state si
* currreqddlr=w in state si=:v ((lmkeep(w)) V (lmlet(w))) in state si_1
* (((lmkeep(w)) V (lmlet(w))) in state si-l) A (t'Ef(si-l))
== ((keep=w) V (wEleterase)) in state t'
by Lemmas A.2 and A.3
*
(((keep=w) V (wEleterase)) in state t')
A
((t',COMMITx,t)Esteps(SLM))
=- wEleterase in state t
or
(ii) A/w, w/x, s.t. currreqddlr-w
in state si-1
* ,w, w/x, s.t. currreqddlr=w in state si-_
==-
/w, w/&x, s.t. lm-keep(w) in state si-1
* Vz, z/x, (lmlet(z) in state si_l) A (7ri=COMMITx)
== lmlet(z) in state si
· Vz, z5x, (Imlet(z) in state si-1) A (t'Ef(si_l))
=V zEleterase in state t'
134
by Lemma A.3
* Vz, zix, (zElet_erase in state t') A ((t',COMMIT,t)Esteps(SLM))
==, zEleterase in state t
* Vz, z5x, (lmwait(z) in state sil) A (ri=COMMITx)
==
lm_wait(z) in state si
* Vz,zzx, (lmwait(z) in state sil) A (t'Ef(sil))
==, zEwait-erase in state t'
by Lemma A.4
* Vz, z5x, (zEwaiterase in state t') A ((t',COMMIT,t)Esteps(SLM))
== zEwaiterase in state t
or
(2) ]]y, yZ:, s.t. recvbl(y) in state si-_
(y, yx, s.t. recvbl(y) in state s_l) A (7r=COMMIT,)
==
y, yxz, s.t. recvbl(y) in state si
* ((currreqd_dlr=x) A (y,
=
(y,
y5x, s.t. recvbl(y))) in state si
lm let(x) in state si
yx, s.t. recvbl(y) in state si-1) A (t'Ef(sil))
;- ((let_erase=0) A (wait_erase=0)) in state t'
by Lemma A.8
*
(((leterase=0) A (wait_erase=0)) in state t')
A
((t',COMMIT,t)Esteps(SLM))
=- ((keep=l) A (xElet-erase)) in state t
* Vz, z7x, (-,recvbl(z) in state si-1) A (ri=COMMITx)=>
· Vz, zx, (-irecvbl(z) in state si-1) A (t'Ef(si-1))
. ((keepAz) A (zleterase)
recvbl(z) in state si
A (ziwaiterase)) in state t'
by Lemma A.5
* Vz, zix,
(((keep5z) A (zfleterase) A (zfwaiterase)) in state t')
A
((t',COMMIT,t) Esteps(SLM))
=-
((keep7z) A (zlet_erase) A (zfwait_erase)) in state t
* Therefore, from the above deductions it follows that
tEf(si)
and thus the theorem has been proven.
Theorem
A.10
(sil
by Definition 3.1
O
is a reachable state of LM)
A
A
(iri=ERASABLEx)
((sil,ERASABLEx,si)Esteps(LM))
A
(t'Ef(si-1))
3t s.t. ((t',ERASABLE,t)Esteps(SLM)) A (tEf(si))
Proof:
* 7ri=ERASABLEx
=•= xEpending_erasable in state si_
· xEpending_erasable in state si-1
:- ((statusx=RECV) A (pending_ackr=F)) in state si-1
by Invariants 3.14 and 3.12
135
* (((status,=RECV) A (pendingack=F))
in state si_1) A (ri=ERASABLE,)
Im_wait(x) in state si
* xEpending_erasable in state si-1
=-
((status,=RECV)
A (currreqd_dlr
x) A (timestampEAF))
in state si-1
by Invariants 3.14, 3.6 and 3.7
*
(
A
(statusx=RECV)
(currreqd_dlrzx)
A
(timestampxEKf)
in state si-I
(xEpending_erasable))
by definition of lmlet(x)
=lm let(x) in state si-1
== xElet_erase in state t'
· (lmlet(x) in state si-1) A (t'Ef(sil))
by Lemma A.3
* (xElet_erase in state t') A ((t',ERASABLE,t)Esteps(SLM))
=
XEwait_erasein state t
* Vz, zx, (lmkeep(z) in state si-1) A (ri=ERASABLE,)
A
==
lmkeep(z)
in state si
* Vz, z6x, (m_keep(z) in state si-1) A (t'Ef(si- 1 ))
by Lemma A.3
=> keep=z in state t'
* Vz, zx, (keep=z in state t') A ((t',ERASABLE,t)Esteps(SLM))
== keep=z in state t
=: lmlet(z) in state si
* Vz, zox, (lmlet(z) in state Si_l) A (ri=ERASABLEx)
* Vz, zx, (lmlet(z) in state si-1) A (t'Ef(sil))
by Lemma A.3
* Vz, zox, (zElet_erase in state t') A ((t',ERASABLEx,t)Esteps(SLM))
== zElet_erase in state t'
=> zElet_erasein state t
4
* Vz, zy x, (lm_wait(z) in state si-1) A (ri=ERASABLE,)
==: lm_wait(z) in state si
* Vz, zy4x, (lm_wait(z) in state si-1) A (t'Ef(si-1))
by Lemma A.4
=-- zEwait_erase in state t'
* Vz, zx, (zEwait_erase in state t') A ((t',ERASABLE,,t)Esteps(SLM))
-=> zEwait_erase in state t
* Vz, zyx, (recvbl(z) in state si-1) A (ri=ERASABLEx)
=> -recvbl(z) in state si
* Vz, zyx, (recvbl(z) in state si-l) A (t'Ef(si-1))
==: ((keepoz) A (zlet_erase) A (zfwaiterase)) in state t'
by Lemma A.5
(((keep4z) A (zleterase)
· Vz, zx,
A
A (zowait_erase)) in state t')
((t',ERASABLE,t)Esteps(SLM))
=
((keepyLz) A (zolet-erase) A (zowaiterase))
in state t
* Therefore, from the above deductions it follows that
by Definition 3.1
tEf(si)
and thus the theorem has been proven.
136
0
Theorem A.11
(si-_ is a reachable state of LM)
A
(7ri=ERASE,)
A
((si_l,ERASE ,si) Esteps(LM))
A
(t'Ef(sii))
3t s.t. ((t',ERASE,t)Esteps(SLM)) A (tEf(si))
Prioof:
* ri=ERASEx==* xEcanerase in state si-,
* xEcan-erase in state si- 1
=- lm-wait(x) in state si-l
by Invariants 3.14, 3.13 and 3.12
* Imwait(x) in state si-l
==-
(Iv
]v s.t. <x,v>Ependingts-assign
s.t. <x,v>Ependingts.assign
in state si-_
by Invariant 3.9
in state Si-1) A (ri=ERASEx)
=: -recvbl(x) in state si
* (Imwait(x) in state si_1) A (t'Ef(sil))
==-
xEwaiterase
in state t'
by Lemma A.4
* xEwait-erase in state t'
==
((keepox) A (xleterase))
in state t'
by Invariant 3.4
* (((keepox) A (xVleterase)) in state t') A ((t',ERASE,,t)Esteps(SLM))
== ((keep5x) A (xlet-erase) A (xwaiterase)) in state t
* Either
(1)
w, wEA, s.t. curr-reqddlr=w in state si-_
* (currsreqddlr=w in state si- 1 ) A (ri=ERASEx)
=
currsreqddlr=w in state si
* ((curr-reqd.dlr=w) A (statusx=RECV)) in state si-l
==-
w7x
by definition of lmwait(x)
and Invariant 3.6
* lm-wait(x) in state si-1
=
recvbl(x) in state Si-l
by Invariant 3.7
* ((curr-reqddlr=w)
A (recvbl(x)) A (wkx)) in state si-_
. lmkeep(w) in state si-l
* (lmkeep(w) in state si- 1) A (t'Ef(si_l))
==- keep=w in state t'
by Lemma A.2
* Either
(i) 3y, yo{x,w},
s.t. recvbl(y) in state si-1
* (recvbl(y) in state si_ 1) A (ri=ERASE,) A (y5x)
=> recvbl(y) in state si
* ((curr-reqddlr=w) A (y, yow, s.t. recvbl(y))) in state si
== lmkeep(w) in state si
by definition of lmkeep(w)
137
* (recvbl(y) in state si- 1 ) A (t'Ef(Si-l))
((keep=y) V (yEleterase) V (yEwaiterase))
in state t'
by Lemma A.6
(keep=y)
((
·
(keep=
(yEleterase)
V
V (yEwaiterase)) in state t')
A
A
(keep=w in state t')
(y/w)
=> ((yEleterase)
((
*
V (yEwaiterase)) in state t'
A ((yEleterase) V (yEwaiterase))) in state t')
A
((t',ERASE,t)Esteps(SLM))
keep=w in state t
=#
or
(ii)
Ay, yo{x,w},
s.t. recvbl(y) in state si-1
* (/y, yff{x,w}, s.t. recvbl(y) in state sil) A (ri=ERASEx)
,Ay, y54w, s.t. recvbl(y) in state si
A (Ay, y/w, s.t. recvbl(y))) in state si
* ((currreqddlr=w)
== lmlet(w) in state si
* xEwaiterase in state t'
== xleterase
in state t'
by Invariant 3.4
* keep=w in state t'
=> ((wleterase) A (w~waiterase)) in state t'
by Invariants 3.4 and 3.3
* (Ay, y4{x,w}, s.t. recvbl(y) in state si_l) A (t'Ef(si_l))
(keep/y)
Vy, yo{x,w}, (
(yleterase)
A
A (yfwaiterase) )
in state t'
by Lemma A.5
*
((xleterase) A (xEwaiterase) in state t')
A (((woleterase) V (wfwaiterase)) in state t')
(keep/y)
A (Vy, yo{x,w}, (
A
(yoleterase)
A (yfwaiterase) ) in state t')
((let-erase=0) A (waiterase={x})) in state t'
(keep=w)
((
*
A
(leterase=0)
A (waiterase={x})) in state t')
A
((t',ERASEx,t)Esteps(SLM))
- ((keep=l) A (wEleterase)) in state t
or
(2)
w, wefV, s.t. currreqddlr=w
138
in state si-1
* w, WEAf, s.t. curr-reqd dlr=w in state Si_1
=- 8w, wEAf, s.t. lmkeep(z) in state si-1
* Vz, z7x, (lmlet(z) in state sil)
A (7ri=ERASEx)== Ilmlet(z) in state si
* Vz, z7x, (lmlet(z) in state si-1) A (t'Ef(sil))
- zEleterase in state t'
* Vz, zx,
by Lemma A.3
(zElet-erase in state t') A ((t',ERASE,t)Esteps(SLM))
== zEleterase in state t
* Vz, z7x, (lmwait(z) in state si-1) A (7ri=ERASEx)
=
* Vz, zx, (lmwait(z) in state si-1) A (t'Ef(sil))
lmwait(z) in state si
== zEwaiterase in state t'
by Lemma A.4
* Vz, zxz, (zEwait-erase in state t') A ((t',ERASE,t)Esteps(SLM))
zEwaiterase in state t
==
* Vz, z4x, (recvbl(z)
in state sil)
A (ri=ERASE) == -irecvbl(z) in state si
* Vz, zkx, (recvbl(z) in state si-i) A (t'Ef(s-il))
:. ((keep5z) A (zlet-erase) A (zowaiterase)) in state t'
by Lemma A.5
* Vz, z5x,
(((keep5z) A (zleterase)
A
A (zowaiterase)) in state t')
((t',ERASEx,t) Esteps(SLM))
:==
((keepoz) A (zleterase)
A (zowaiterase)) in state t
* Therefore, from the above deductions it follows that
tEf(si)
and thus the theorem has been proven.
Lemma A.12
A
(si_l is a reachable state of LM)
(lmkeep(z) in state si-1)
A
(7ri=ACKASSIGNx)
by Definition 3.1
E
lmkeep(z) in state si
Proof:
* lmkeep(z) in state si-1
=- ((currreqd-dlr=z) A (3y, y5z, s.t. recvbl(y))) in state Si_* ((currreqddlr=z)
A (3y, yhz, s.t. recvbl(y))) in state si- 1
==- recvtss50 in state si-1
by Invariant 3.17
* (((curr-reqddlr=z) A (recvtss60)) in state Si-1 ) A (ri=ACKASSIGN,)
= curr-reqddlr=z
in state si
* (recvbl(y) in state si-1) A (ri=ACKASSIGN,) ==~ recvbl(y) in state si
* ((currreqddlr=z)
A (recvbl(y), yz))
and thus the lemma has been proven.
139
in state si =- lmkeep(z) in state si
E
Lemma A.13
A
(si-1 is a reachable state of LM)
(lmlet(z) in state si-1)
A
(7ri=ACK-ASSIGN,)
lmlet(z)
in state si
Proof:
* lmlet(z) in state si-1
= Either
(1) ((curr-reqddlr=z)
A (y, y$z, s.t. recvbl(y))) in state si-1
* (y, yz, s.t. recvbl(y) in state si_l) A (7ri=ACKASSIGN,)
=
Ay, yoz, s.t. recvbl(y) in state si
* currreqddlr=z
=
in state si-1
((recvbl(z))
A (statuszRECV))
in state si-_
by Invariants 3.6 and 3.8
*
(((recvbl(z)) A (status-zRECV)) in state si-1)
A
(ri=AcKASSIGN,)
=- ((recvbl(z)) A (statusz#RECV))
* (
in state si
(Ay, yoz, s.t. recvbl(y))
A
(recvbl(z))
A (statusz#RECV) ) in state si
=: lmlet(z) in state si
or
(2) ((currreqddlr7z)
*
((
A (recvbl(z)) A (lmwait(z)))
in state si-1
(currreqddlrz)
A
(recvbl(z))
A (-lmwait(z)) ) in state si-1)
A
(7ri=ACKASSIGN,)
:- ((curr-reqddlroz)
A (recvbl(z))
A (-lmwait(z)))
in state si
* ((currreqddlr5z)
A (recvbl(z)) A (-lmwait(z))) in state si
= Ilmlet(z) in state si
by definition of lmlet(z)
* For both possible cases, the desired result is obtained and thus the lemma has
been proven.
[1
Theorem A.14
(si-1 is a reachable state of LM)
A
(7ri=ACKASSIGN,)
A
((si_l,ACK ASSIGN.,si) steps(LM))
A
(t'Ef(si-1))
t'Ef(si)
Proof:
140
· Vz, zeAN, (lmkeep(z) in state si-1) A (t'Ef(sil))
- keep=z in state t'
by Lemma A.2
* Vz, zEAr, (lmkeep(z) in state Si-1) A (ri=ACK-ASSIGN.)
=: lmkeep(z) in state si
by Lemma A.12
* Vz, zEAN, (lmlet(z) in state si-1) A (t'Ef(sil))
== zElet-erase in state t'
by Lemma A.3
* Vz, zENA, (Imlet(z) in state sil) A (ri=ACKASSIGN,)
==, lmlet(z) in state si
* Vz, zEAf, (lmwait(z) in state si-1) A (t'Ef(sil))
by Lemma A.13
=: zEwaiterase in state t'
by Lemma A.4
* Vz, zENV, (lm.wait(z) in state si-1) A (ri=ACKASSIGN)
==r
lmwait(z) in state si
* Vz, zEAl, (-,recvbl(z) in state si-1) A (t'Ef(sil))
==* ((keepoz)
A (z4leterase)
in state t'
A (zEwait_erase))
by Lemma A.5
* Vz, zEAN, (recvbl(z)
in state si-1) A (ri=ACKASSIGN,)
-recvbl(z) in state si
* Therefore, from the above deductions it followsthat
==
by Definition 3.1
t'Ef(si)
0
and thus the theorem has been proven.
Theorem A.15
(si-1 is a reachable state of LM)
A
(r=ACK_CSRECV)
A
A
((si_-1,ACKCSRECV ,si) Esteps(LM))
(t'Ef(si-1))
t'Ef(si)
Proof:
* Vz, zENV, (lmkeep(z) in state si-1) A (t'ef(si- 1))
=
keep=z in state t'
by Lemma A.2
* Vz, zEAl,
(lmkeep(z) in state si_l) A (ri=ACK_CSRECVx)== lmkeep(z) in state si
* Vz, zE.A, (lm let(z) in state si-1) A (t'Ef(sil))
=; zElet-erase in state t'
by Lemma A.3
· Vz, zEJ,
(lmlet(z) in state si-1) A (ri=ACK_CS.RECV,) - lmlet(z)
* Vz, zEAJ, (lm-wait(z) in state si-1) A (t'Ef(si- 1))
--
zEwaiterase in state t'
in state si
by Lemma A.4
* Vz, zENA,
(lmwait(z)
in state si-1) A (7ri=ACK_CSRECV) : lmwait(z)
141
in state si
· Vz, zAf, (-recvbl(z) in state si-1) A (t'Ef(si-l))
:- ((keep5z) A (zflet-erase) A (zEwait-erase)) in state t'
by Lemma A.5
* Vz, zEA,
(-recvbl(z) in state si_l) A (ri=ACKCSRECV:)=: -recvbl(z) in state s
* Therefore, from the above deductions it follows that
by Definition 3.1
t'Ef(si)
[
and thus the theorem has been proven.
A
(si-1 is a reachable state of LM)
(lm-keep(z) in state si-l)
A
(ri=<DLRGONE,u>)
Lemma A.16
lmkeep(z) in state si
Proof:
* lmkeep(z) in state si-1
((curr.reqdadlr=z) A (3y, y#z, s.t. recvbl(y))) in state si-1
by definition of lmkeep(z)
* ((currreqdidlr/y)
A (recvbl(y))) in state si- 1
3v,
:vEAN,s.t.
(
((<y,v>Ependingts.assign) V (timestampy=v))
A
(vErecvtss))
in state si-1
by Invariant 3.17
* ri=<DLR_GONE,u>
=- 3x s.t. ((status.=NONR) A (timestamp,=u)) in state si- 1
* statusc=NONR in state si- 1
=-
,Aw s.t. <x,w>Ependingtsassign
in state si-_
by Invariant 3.9
* ((,]w s.t. <x,w>Ependingts-assign) A (statusx=NONR)) in state si- 1
== -recvbl(x) in state i-l
* ((recvbl(y)) A (-irecvbl(x))) in state si-1 == xy
*
((<y,v>Ependingtsassign) V (timestampy=v, vEAr))
(
A
(timestampx=u)
A
.
(xzy))
uv
in state si-1
by Invariant 3.16
(((curr-reqddlr=z) A (vErecvtss)) in state Si-l)
·
A
(ri=<DLR_GONE,u>)
A
(u#v)
- currreqddlr=z
in state si
* (3y, yoz, s.t. recvbl(y) in state si-1) A (ri=<DLR_GONE,u>)
3y, yAz, s.t. recvbl(y) in state si
* ((curr.reqd.dlr=z) A (3y, yoz, s.t. recvbl(y))) in state si
Ikeep(z)
in state si
and thus the lemma has been proven.
142
0
Lemma A.17
A
(si-1 is a reachable state of LM)
(lmlet(z) in state si- 1 )
A
(ri=<DLR_GONE,u>)
lmlet(z) in state si
Proof:
* lmlet(z) in state si-1
=. Either
(1) ((currreqddlr=z) A (y, yhz, s.t. recvbl(y))) in state si-1
* curr reqddlr=z in state si- 1
((recvbl(z)) A (statuszRECV))
==
*
((
A
(Ay, yoz, s.t. recvbl(y))
A
(recvbl(z))
A
(statusz#RECV))
in E;tate
Si-i)
(ri=<DLRGONE,u>)
-
(
(3y, yoz, s.t. recvbl(y))
A
A
·
in state si-1
by Invariants 3.6 and 3.8
(recvbl(z))
(statusz7RECV))
(,y, yz,
(
A
in state si
s.t. recvbl(y))
(recvbl(z))
A
(statusz#RECV) ) in state si
== lmlet(z) in state si
or
(2) ((currreqddlr5z)
*
((
A (recvbl(z)) A (lm
by definition of lmlet(z)
wait(z))) in state si-1
(curr.reqddlroz)
A
(recvbl(z))
A (-lmwait(z)) ) in state si-1)
A
(ri=<DLR_GONE,u>)
=: ((curr-reqddlr7z)
A (recvbl(z))
A (-lmwait(z)))
in state si
* ((currreqddlr
z) A (recvbl(z)) A (lmwait(z)))
in state si
==- lmlet(z) in state si
by definition of lmlet(z)
* For both possible cases, the desired result is obtained and thus the lemma has
been proven.
O
Theorem A.18
(si_l is a reachable state of LM)
A
A
(ri=<DLR_GONE,u>)
((si_l,<DLR-GONE,u>,si)Esteps(LM))
A
(t'Ef(si-l))
t'Ef(si)
143
Proof:
* Vz, zEAr, (lmkeep(z) in state Si-l) A (t'Ef(sil))
by Lemma A.2
=- keep=z in state t'
* Vz, zEAf, (lmkeep(z) in state Si_l) A (ri=<DLR_GONE,u>)
lmkeep(z)
==
in state si
* Vz, zEA, (lmlet(z) in state si-1) A (t'Ef(si_l))
=- zEleterase in state t'
* Vz, zEN/, (lmlet(z) in state si-1) A (ri=<DLR_GONE,u>)
==' lmlet(z) in state si
* Vz, zEAr', (lmwait(z) in state si-1) A (t'Ef(sil))
=- zEwaiterase in state t'
* Vz, zEAN, (lmwait(z) in state si-1) A (ri=<DLRGONE,u>)
Iby Lemma A.16
by Lemma A.3
Iby Lemma A.17
by Lemma A.4
=- lmwait(z) in state si
· Vz, zEJAr, (-,recvbl(z)
in state sil)
A (t'Ef(si-l))
~=((keephz) A (zflet-erase) A (zEwait-erase)) in state t'
by Lemma A.5
* Vz, zEJV, (-,recvbl(z) in state Si_l) A (7ri-=<DLRGONE,u>)
- _recvbl(z) in state si
* Therefore, from the above deductions it follows that
t'Ef(si)
and thus the theorem has been proven.
Lemma A.19
A
(si-1 is a reachable state of LM)
(lmireep(z) in state si- 1)
A
(ri=<ASSIGNx,v>)
b y Definition 3.1
E
lmkeep(z) in state si
Proof:
* lmkeep(z) in state si-1
A (3y, yoz, s.t. recvbl(y))) in state si-1
== ((curr-reqd.dlr=z)
* (currreqddlr=z
in state si_1 ) A (ri=<ASSIGN,,v>)
== currreqddlr=z
in state si
* ri=<ASSIGN,v> = <x,v>Ependingts-assign in state si-_
* <x,v>Ependingtsassign in state si-1
by Invariant 3.9
=- (vEA) A (statusx=UNFL in state si- 1 )
* (statusx=UNFL in state si_1 ) A (7ri=<ASSIGNx,v>) A (vEAr)
== ((timestampxEAf) A (statusx=UNFL)) in state si
* ((timestamp.EAf) A (status,=UNFL)) in state si
by definition of recvbl(x)
=> recvbl(x) in state si
* (recvbl(y) in state si- 1 ) A (7ri=<ASSIGNx,v>) == recvbl(y) in state si
144
* ((currreqddlr=z)
==
A (3y, y/z,
Imkeep(z)
s.t. recvbl(y))) in state si
in state si
and thus the lemma has been proven.
Lemma A.20
(si_l is a reachable state of LM)
A (lmiet(z) in state sil)
A
(ri=<ASSIGN,v>)
lmlet(z)
in state si
Proof:
* 7ri=<ASSIGNz,v>
= <x,v>Ependingtsassign in state si
· <x,v>cpendingtsassign
in state si-1
=- (vEN) A (statusx=UNFL in state Sil)
* (statusx=UNFL
in state si- 1) A (i=<ASSIGNx,v>)
by Invariant 3.9
A (vEx)
((timestampEJV) A (statusx=UNFL)) in state si
* lmJet(z) in state si- 1
=-
==- Either
(1) ((currreqddlr=z)
*
(
A (y,
y/z,
s.t. recvbl(y))) in state si-1
(<x,v>Ependingtsassign)
A (fy, y/z, s.t. recvbl(y)) ) in state si-1
x=z
* (curr-reqddlr=z
in state si-1) A (ri=<ASSIGNz,v>)
currreqddlr=z
in state si
(y, y/z, s.t. recvbl(y) in state si_l) A (7ri=<ASSIGNz,v>)
=; Ay, y5z, s.t. recvbl(y) in state si
-=
* ((curr reqddlr=z)
A (y,
y5z, s.t. recvbl(y))) in state si
=- IlmJet(z) in state si
or
(2) ((curr-reqddlr/z)
A (recvbl(z)) A (-Ilmwait(z)))
*
((
in state si-1
(currreqddlr5z)
A
(recvbl(z))
A (-lm wait(z)) ) in state si-1)
A
A
(ri=<ASSIGNx,v>)
(vEA)
((curr reqddlrz)
A (recvbl(z)) A (-lmwait(z)))
==
* ((curr-reqddlr/z)
A (recvbl(z)) A (-lm-wait(z)))
in state si
in state si
=- lmlet(z) in state si
* For both possible cases, the desired result is obtained and thus the lemma has
been proven.
[
145
Lemma A.21
A
(si-_ is a reachable state of LM)
(lmwait(z) in state si-1)
A
(ri=<ASSIGNz,v>)
lmwait(z)
in state si
Proof:
* ri=<ASSIGN,,v> == <x,v>Ependingtsassign
· <x,v>Ependingtsassign
in state si-_
in state si-_
statusx=UNFL in state si-l
by Invariant 3.9
· Ilmwait(z) in state si-l
=- statusz=RECV in state si-l
by definition of lm-wait(z)
· ((statusx=UNFL) A (status,=RECV)) in state si_l = xz
=-
· (lmwait(z)
in state si-1) A (ri=<ASSIGNx,v>) A (xz)
,- lm-wait(z) in state si
]
and thus the lemma has been proven.
Lemma A.22
A
(si-_ is a reachable state of LM)
(-,recvbl(z) in state sil)
A
(7ri=<ASSIGNz,v>)
-recvbl(z) in state si
Proof:
* 7ri=<ASSIGNz,v> ==4 <x,v>Ependingts-assign
in state si-_
* <x,v>Ependingtsassign in state si-l
= recvbl(x) in state si-l
by d.efinition of recvbl(x)
* ((-irecvbl(z)) A (recvbl(x))) in state si_l = x7z
* (-_recvbl(z) in state si-1) A (i=<ASSIGNx,v>) A (x5z)
==, -recvbl(z) in state si
and thus the lemma has been proven.
Theorrem A.23
(si_l is a reachable state of LM)
A
A
(wr=<ASSIGNx,v>)
((si_l,<ASSIGNx,v>,si)Esteps(LM))
A
(t'Ef(sil))
t'Gf(si)
Proof:
146
· Vz, zEA/, (lm-keep(z) in state si-1) A (t'Ef(sil))
keep=z in state t'
* Vz, zEJr, (lmkeep(z) in state si_l) A (ri=<ASSIGN,v>)
=- lmkeep(z) in state si
==
· Vz, zEAr, (lmlet(z)
by Lemma A.19
in state si-1) A (t'Ef(sil))
zEleterase in state t'
* Vz, zEA/r,(lmlet(z) in state sil) A (ri=<ASSIGN,v>)
=-
==
by Lemma A.2
lmlet(z)
by Lemma A.20
in state si
* Vz, zEA, (lmwait(z) in state si- 1) A (t'Ef(sil))
= zEwaiterase in state t'
* Vz, zEAr, (lmwait(z) in state si-1) A (ri=<ASSIGN,,v>)
=- Imwait(z) in state si
* Vz, zEJAl, (recvbl(z) in state si-1) A (t'Ef(si-l))
:- ((keepz)
by Lemma A.3
A (zfleterase)
A (zEwaiterase
by Lemma A.4
by Lemma A.21
)) in state t'
by Lemma A.5
* Vz, zEAl, (-,recvbl(z) in state si-1) A (ri=<ASSIGN,,v>)
:=-recvbl(z) in state si
* Therefore, from the above deductions it follows that
t'Ef(si)
and thus the theorem has been proven.
Theorem A.24
(sil
by Lemma A.22
by Definition 3.1
E
is a reachable state of LM)
A
(7ri=CSREQD,)
A
A
((sil,CS-REQD
(t'Ef(sil 1 ))
,si)Esteps(LM))
t'Ef(i)
Proof:
· Vz, zEJ', (lmkeep(z) in state si_ 1) A (t'Ef(sil))
=-
keep=z in state t'
by Lemma A.2
* Vz, zA,
(lmkeep(z) in state si- 1 ) A (ri=CSREQDx)
== lmeep(z)
in state si
· Vz, zEJA, (lmlet(z) in state si- 1 ) A (t'Ef(si_l))
= zEleterase in state t'
* Vz, zEAl', (lmlet(z)
in state si_ 1 ) A (ri=CSREQDx)
by Lemma A.3
=: lmlet(z) in state si
· Vz, zEA/, (lmwait(z)
in state si_1 ) A (t'Ef(si_l))
='. zEwaiterase in state t'
* Vz, zEJA, (lm.wait(z) in state si-1) A (rj=CSREQDx)
=
lmwait(z) in state si
147
by Lemma A.4
· Vz, zEA/, (recvbl(z)
in state si-1) A (t'Ef(si-1))
((keep5z) A (zleterase)
==
A (zEwaiterase)) in state t'
by Lemma A.5
* Vz, zEN/,
(-recvbl(z) in state si-1) A (ri=CS.REQD,) == -irecvbl(z) in state si
* Therefore, from the above deductions it follows that
t'Ef(si)
by Definition 3.1
and thus the theorem has been proven.
0
Lemma A.25
(sil
is a reachable state of LM)
A (lmlet(z) in state sil)
A
(7ri=CSRECV,)
lmlet(z) in state si
Proof:
* ri=CSRECVx =:
xEsendcsrecv
in state si-1
* xEsendcsrecv in state si-1
==
((curr-reqddlr4x)
A (recvbl(x))) in state si- 1
by Invariants 3.11, 3.7 and 3.15
* Either
(1) x=z
*
(((currreqddlrzx)
A
A
A (recvbl(x))) in state si-1)
(7ri=CSRECV,)
(x=z)
((curr-reqd-dlr#z) A (recvbl(z))) in state si
* (ri=CS-RECV,) A (x=z) ==Y pendingackz=T in state si
* ((curr-reqddlroz)
A (recvbl(z)) A (pendingackz=T)) in state si
== lmlet(z) in state si
=
or
(2) zxZ
* (((lmlet(z)) A (recvbl(x))) in state si-1) A (xhz)
((currreqddlroz)
A (recvbl(z)) A (-lm.wait(z)))
in state si-1
*
((
(currreqddlrhz)
A
(recvbl(z))
A (-lmwait(z)) ) in state si-1)
A
(7ri=CSRECVx)
A
(x5z)
=:
((curr-reqd.dlrz)
A (recvbl(z))
A (-lm.wait(z)))
in state si
* ((curr-reqd.dlrhz) A (recvbl(z)) A (-lm wait(z))) in state si
=- Ilmlet(z) in state si
by definition of lmiet(z)
148
* For both possible cases, the desired result is obtained and thus the lemma has
been proven.
l
Lemma A.26
(si-1 is a reachable state of LM)
A (lmwait(z) in state si-l)
A
(7r=CS-RECV,)
lmwait(z)
in state si
Proof:
* 7ri=CSRECVz,
=
xEsendcsrecv in state si_1
* xEsendcsrecv in state si-1
=> statusxZRECV in state si-_
by Invariant 3.15
* (lmwait(z) in state si- 1 ) A (statusxZRECV in state si_1 )
== z54x
by definition of lmwait(z)
* (lmwait(z) in state si-l) A (ri=CSRECVx)A (z#x)
Ilmwait(z) in state si
and thus the lemma has been proven.
==
Lemma A.27
O
A
(si_1 is a reachable state of LM)
(-irecvbl(z) in state si- 1 )
A
(ri=CSRECVx)
-recvbl(z) in state si
Proof:
* 7ri=CSRECVx=> xEsendcsrecv
* xsendcsrecv
==
in state si-_
in state si-1
recvbl(x) in state Si-_
b:y Invariants
* ((-irecvbl(z)) A (recvbl(x))) in state si-1 =
* (recvbl(z)
z5x
in state si-1 ) A (ri=CSRECV,) A (z5x)
=== -recvbl(z)
in state si
and thus the lemma has been proven.
Theorrem A.28
(si_1 is a reachable state of LM)
A
(7ri=CSRECVx)
A
A
((si_,CS-RECV,sj)Esteps(LM))
(t'Ef(si- 1 ))
t'ef(si)
149
3.7 and 3.15
Proof:
0
VZ,
ZEN/, (lmkeep(z) in state si-1) A (t'Ef(si-1))
=- keep=z in state t'
by Lemma A.2
· Vz, ZEN,
(lmkeep(z) in state s-1) A (rj=CS_RECV) =: Imkeep(z) in state si
· Vz, zEAr, (Imlet(z) in state sil) A (t'Ef(si-1))
-= zEleterase in state t'
by Lemma A.3
* Vz, zEN/, (Imlet(z) in state si-1) A (i=CS_RECV,)
=
lmlet(z) in state si
· Vz, zEAf, (Imwait(z) in state si-1) A (t'Ef(si_l))
by Lemma A.25
= zEwaiterase in state t'
* Vz, zEJ, (lmwait(z) in state si-1) A (ri=CS_RECV)
==-
lmwait(z) in state si
by Lemma A.4
by Lemma A.26
· Vz, zENJ, (-irecvbl(z) in state si-1) A (t'Ef(si_l))
=
((keephz) A (zflet_erase) A (zEwait-erase))
in state t'
by Lemma A.5
· Vz, zEJ, (-recvbl(z) in state si_l) A (ri=CS_RECV,)
= -recvbl(z) in state si
by Lemma A.27
* Therefore, from the above deductions it follows that
by Definition 3.1
t'Ef(si)
and thus the theorem has been proven.
A.2
[O
Proof of Liveness
Theorem A.69, which is found at the end of this section, states XEL's important liveness
property: the DLR for every committed update is eventually erased. Some preliminary
lemmas must first be proven which will ultimately contribute toward the proof of Theorem A.69. In all the following lemmas and theorems, let a denote an execution for the
LMmodule, and let ri represent the it h action of a (where EAf and i>1).
Lemma A.29
rh=ACKASSIGNx
3g, g<h, s.t. rg=<ASSIGNx,t> for some t
Proof:
*· h=ACKASSIGN, ==- ((statusx=UNFL)
A (pendingackx=T))
in state Sh_1
* ((status.=UNFL) A (pendingackx=T)) in state Sh_1
== 3g, g<h-1, s.t. rg=<ASSIGN,,t> for some t
and thus the lemma has been proven.
150
0
Lemma A.30
A
(a is a well-formed execution)
(7ri=<ASSIGN,,u> for some u)
,ah, hoi, s.t. rh=<ASSIGN,,v>for any v
Proof:
By contradiction. Without loss of generality, assume
3h, h<i, s.t. rh=<ASSIGNz,v> for some v.
Case 1: u=v.
* rh=<ASSIGN.,v> ---
(<x,v>Ependingtsassign in state sh-1)
A
* <x,v>Ependingtsassign
(<x,v> pendingts_assign in state sh)
in state Sh-1 ==~ 3q, q<h-1, s.t. 7rq=COMMITx
* (ri=<ASSIGNr,u>) A (u=v) == <x,v>Ependingts.assign in state si_l
*
(<x,v>Opending_tsassign in state sh)
A (<x,v>Epending_ts_assign in state si- 1 )
A
(h<i)
=
3r, h<ri-1,
s.t. rr=COMMITx
by WF1
=-- 4q, qir, s.t. rq=COMMITx
* 7r=COMMIT,
But this contradicts the earlier deduction that
3q, q<h-l<r, s.t. 7rq=COMMITx
and so this case is impossible.
Case 2: uIv.
· 7ri=<ASSIGNz,u>== <x,u>Epending-tsassign in state si-1
* <x,u>Epending_tsassign in state s-_l
3r, r<i-1, s.t. ((r,=COMMITx)A (u=current-ts in state sr-1))
*· r=COMMITx == /q, qr,
s.t. rq=COMMITx
* 7rh=<ASSIGN,,v>== <x,v>Ependingts.assign in state
* <x,v>Ependingts-assign
in state sh-1
s.t. ((7rq=COMMITx)
A (v=currentts
=-: 3q, qh-1,
* Uv =* sq-1liSr-1
by WF1
h-1
in state sq-1))
== qr
But this contradicts the earlier deduction that
* Sq-iSr-1
/Eq, qr, s.t. 7rq=COMMIT
and so this case is also impossible.
Since both cases are impossible, the original assumption must be false and the
lemma has been proven.
0
Lemma A.31
(acis a well-formed execution)
A
(((statusx=UNFL)
A (pendingack,=T))
Ah, h<i, s.t. 7rh=ACKASSIGN,
151
in state si)
Proof:
By contradiction. Assume 3h, h<i, s.t. 7rh=ACKASSIGN
* ((status,=UNFL) A (pending.ack=T)) in state si
==: 3g, g<i, s.t.
(irg=<ASSIGN,,t>
for some t)
A
(]f,
g<f <i, s.t. 7rf=ACKASSIGN,)
*·r=<ASSIGN,,t>
==* d, d:g, s.t. 7rd=<ASSIGNx,u> for any u
by Lemma A.30
A (,Bf, g<f<i, s.t. rf=ACKASSIGNr)
* (3h, h<i, s.t. 7rh=ACKASSIGNx)
==~ 3e, e<g, s.t. 7re=ACKASSIGNx
* 7re=ACKASSIGNx
by Lemma A.29
==. 3d, d<e<g, s.t. 7rd=<ASSIGN,,u> for some u
But this is a contradiction and so the original assumption must be false. Thus
0
the lemma has been proven.
Lemma A.32
((currreqddlr=x,
xEJ/) A (currreqdacked=T)) in state si
3h, h<i, s.t. 7h=ACKASSIGN.
Proof:
By contradiction. Assume
h, h<i, s.t. rh=ACKASSIGNx
* currreqddlr=x, xEJf, in state si
(7rg=COMMIT,)
=- 3g, g<i, s.t.
A
(Bf, g<f<i, s.t.
rf=COMMIT,for some ysx)
= ((currreqddlr=x) A (currreqd acked=F)) in state sg
* rg=COMMITx
(((curr-reqddlr=x) A (curr-reqdacked=F)) in state sml)
·
A
(rm#ACKASSIGNx)
A
(rmCOMMITy for ysx)
=> ((currreqdadlr=x) A (currreqdacked=F)) in state sm
*
(((curr-reqd-dlr=x) A (curr.reqd-acked=F)) in state sg)
A (,h, h<i, s.t. rh=ACKASSIGNx)
for yox)
A (,]f, g<f<i, s.t. rf=COMMITy
== ((curr reqd-dlr=x) A (currreqd acked=F)) in state si
by induction.
But this contradiction implies that the original assumption must be false and
0
thus the lemma has been proven.
Lemma A.33
xEsend-cs-recvin state si
3h, h<i, s.t. rh=ACKASSIGNx
Proof:
152
By contradiction. Assume
]h, h<i, s.t. 7rh=ACKASSIGN,
* xEsend-csrecv in state si =- either
(1) 39g, g<i, s.t.
(7rg=COMMITyfor some y)
A (currreqddlr=x in state sg-l)
A (currreqd acked=T in state sg-l)
* ((currreqddlr=x) A (curr-reqd acked=T)) in state sg-_
.' 3h, h<g-1, s.t. 7rh=ACKASSIGN.
by Lemma A.32
But this is a contradiction, so this case cannot be true.
or
(2) 3h, h<i, s.t. 7rh=ACKASSIGNx
* But this is a contradiction, so this case cannot be true.
or
(3) 3g, g<i, s.t.
(7rg=<DLR_GONE,t>
for some t)
A (currreqddlr=x
A
in state s_l)
(currreqdacked=T)
* ((currreqddlr=x)
in state sg_l)
A (currreqd acked=T)) in state s_
3h, h<g-1, s.t. 7rh=ACKASSIGNx
=.
by Lemma A.32
But this is a contradiction, so this case cannot be true.
Since all three possible cases lead to contradictions, the original assumption
must be false and thus the lemma has been proven.
o
Lemma A.34
(a is a well-formed execution)
A
(j=ERASE)
3f, f<j, s.t. 7rf=CSRECVx
Proof:
* 7rj=ERASE,=
3h, h<j, s.t. rh=ERASABLEZ
* rh=ERASABLEx==
by WF2
sEXpendingerasable in state sh-1
* xEpendingerasable in state
h-I
=- 3g, g<h-1, s.t. rg=ACKCSRECVx
* 7rg=ACK_CSRECVx
==~ statusx=RECV in state sg-1
* statusx=RECV in state s_l
=
3f, f<g-1, s.t. rf=CSRECVx
and thus the lemma has been proven.
Lemma A.35
A
(currreqddlr=x
(h, h<i, s.t.
in state si for some xjN)
h=ACKASSIGNx)
curr-reqd acked=F in state si
Proof:
153
El
By contradiction. Assume curr-reqd acked=T in state si
* curr-reqdacked=T in state si
(7rh=ACK_ASSIGNy)
= 3h, h<i, s.t.
A (currreqddlr=y,
for some yEAr, in state sh-1)
A (,Bj, h<j<i, s.t. rj=COMMITzfor some zEJ)
* (currreqddlr=y,
yeJa, in state sh-1) A (rh=ACKASSIGNy)
==> ((currreqddlr=y)
V (currreqddlr=l))
in state sh
·
(((currreqddlr=y,
yA/') V (currreqddlr=))
in state sh)
(Aj, h<j<i, s.t. rj=COMMITzfor some zEJN)
A
, V1, h<l<i, ((currreqddlr=y, yEAr) V (currreqddlr=))
in state sl
by definition of steps(LOT)
* Either
(1) currreqddlr=y,
yEAr, in state si
* By transitivity, x=y so 3h, h<i, s.t. rh=ACKASSIGN,.
But this contradicts the lemma's predicate.
or
(2) curr-reqddlr=l
in state si
* But this also contradicts the lemma's predicate.
Since both possible cases lead to contradictions, the original assumption
O
must be false and so the lemma has been proven.
(a is a well-formed execution)
Lemma A.36
A
(ri=ERASEx)
3h, h<i, s.t. rh=ACKASSIGNx
Proof:
By contradiction. Assume h, h<i, s.t. 7rh=ACKASSIGN,
*·ri=ERASEx==- 3e, e<i, s.t. re=CS-RECVx
* re,=CSRECVx=- xEsend-cs-recv in state se-1
by Lemma A.34
· xEsend-csrecv in state s_l == 3d, d<e-1, s.t. either
(1)
(7d=COMMITy for some yEAJ)
(
A
·
(((curr-reqddlr=x)
(curr-reqdAdlr=x
A (curr_reqd_acked=T)) in state sd-1) )
in state sd-1)
A (,h, h<d-l<i, s.t rh=ACKASSIGN,)
=-- curr-reqd acked=F in state sd-1
or
(2)
by Lemma A.35
But this is a contradiction, and so this case could not possibly occur.
d=ACKASSIGN.
* But this contradicts the assumption, and so this case can never
or
occur.
154
(3)
(
(7rd=<DLR_GONE,tsy>for some tsy)
A (currsreqd dlr=x in state sd-1)
A
(currreqdacked=T
·
in state sd-1) )
(currreqddlr=x
A (h,
in state sd-1)
h<d-l<i, s.t. rh=ACKASSIGNx)
by Lemma A.35
=- curr-reqd.acked=F in state sd-1
But this is a contradiction and so this case also cannot occur.
Since none of these three cases can be true and there are no other possibilities, the original assumption must be false and so the lemma has been
proven.
o
Lemma A.37
A
(a is a well-formed execution)
(((statusx=UNFL) A (pending_ack=T))
A (3q, q>k, s.t.
in state sk)
n, n<q, s.t. 7rn=ACK-ASSIGN,)
Ap, p<q, s.t. (rp=CSREQDx) V (rp=CS_RECVx) V (rp=ERASE,)
Proof:
By contradiction.
Assume
3p, p<q, s.t. (rp=CSREQDr)
V (rp=CSRECVr)
V (rp=ERASE,)
* Either
(1) 3p, p<q, s.t. rp=CSREQD,
* 7rp=CS..REQDx
=:
x=sendcssreqd
in state sp-1
* x=sendcs-reqd in state sp-_ = 3n, np-1, s.t. rn=ACKJASSIGNx
But this contradicts the lemma's predicate and so this case cannot
or
occur.
(2) 3p, p<q, s.t. 7rp=CSRECVx
* 7rp=CSRECVx
== xEsendcsrecv in state sp-1
* xEsendcs-recv in state sp-1l=- either
(i) 3m, m<p-1, s.t.
(rm=COMMITyfor some yEJ)
A (currreqddlr=x in state sm-l)
A (currreqd acked=T in state sm-l)
* ((curr-reqddlr=x) A (currreqdacked=T)) in state Sm-1
=- 3n, n<m-1, s.t. 7rn=ACKASSIGNx
by Lemma A.32
But this contradicts the lemma's predicate and so this subcase cannot occur.
or
(ii) 3n, n<p-1, s.t. n=ACKASSIGN,
* But this contradicts the lemma's predicate, and so this subcase cannot occur either.
155
or
(iii) 3m, m<p-1,
for some t)
(7rm=<DLR_GONE,t>
s.t.
A (currreqddlr=x in state s,,-l)
A (curr_reqdacked=T in state s,_l)
* ((curr-reqddlr=x) A (curr_reqd_acked=T))in state s,_l
- 3n,n<m-1, s.t. rn=ACKASSIGNx
by Lemma A.32
But this contradicts the lemma's predicate and so this subcase cannot occur either.
Sinceall three subcases lead to contradictions and there are no other
possible subcases, the entire case (2) must be impossible.
or
(3) 3p, p<q, s.t. rp=ERASEx
* 7rp=ERASEx
==
3n, n<p, s.t. r,=ACKASSIGNz
by Lemma A.36
But this contradicts the lemma's predicate, and so also this case
must be impossible.
* All possible cases lead to contradictions. Therefore, the original assumption
0
must be false and thus the lemma has been proven.
Lemma A.38
ri=ACKASSIGNz
3h, h<i, s.t. 7rh=COMMITx
Proof:
=
* 7ri=ACKASSIGN,
· ((statusx=UNFL)
((statusx=UNFL) A (pendingack,=T)) in state si-
A (pendingackr=T))
in state si-1
3j, j<i-1, s.t. rj=<ASSIGNx,tsx>for some tsx
* rj=<ASSIGNx,tsx> == <,tsx>Ependingts_assign in state sj-1
· <x,ts,>Epending_tsassign in state sj-1 == 3h, h<j-1, s.t. rh=COMMITx
* By transitivity,
3h, h<i, s.t. 7rh=COMMITx
=>
and thus the lemma has been proven.
D
(a is a well-formed execution)
Lemma A.39
A
(rg=COMMIT,)
Vf, f <g, xzsend-cs-recv in state sf
Proof:
By contradiction. Assume 3f,f<g, s.t.
156
Esend-cs-recv in state sf
,/d, d<g, s.t. 7rd=COMMIT.
* xEsendcs-recv in state sf == 3e,e<f, s.t. re=ACKASSIGNx
* rg=COMMITx
==
* re=ACK-ASSIGNx==- 3d, d<e, s.t. 7rd=COMMITx
by WF1
by Lemma A.33
by Lemma A.38
* (d<e<f<g) A (3d, d<e, s.t. 7rd=COMMITx)- 3d, d<g, s.t. wd=COMMITx
But this is a contradiction, and so the original assumption must be false. Thus
the lemma has been proven.
Lemma A.40
A
O
(a is a well-formed execution)
(xEsendcsrecv in state si)
curr-reqddlrox
in state si
Proof:
By induction.
* xEsend-cs.recv in state s == 3h, h<i, s.t. rh=ACKASSIGNx
*· h=ACK-ASSIGNX
-- 39, g<h, s.t. rg=COMMITx
* rg=COMMITx
=-
by Lemma A.33
by Lemma A.38
m, m>g, s.t. rm=COMMITx
*·w=COMMIT == curr.reqddlr=x in state sg
*· 7r=COMMITx,
= Vf, f <g, xosendcs-recv in state sf
by WF1
by Lemma A.39
* Vf, f<g, x sendcsrecv in state sf
== Vf, f<g, ((curr_reqd_dlrzx) V (xsend-cs-recv)) in state sf
* (((currreqd_dlrx) V (xsend-csrecv)) in state s,ml) A (rmCOMMIT,)
==. ((currreqddlr7x)
V (xsend-cs-recv))
in state sm
by definition of steps(LOT)
* Therefore, by induction,
Vp, p>g, ((curr.reqddlr5x)
V (xsend_csrecv)) in state sp
* Hence,
Vq, q>O, ((currreqd dlr5x) V (xfsend_cs_recv)) in state sq
* ((xEsendcsrecv)
A ((curr_reqd_dlr6x) V (xzsend-cs-recv))) in state si
== curr_reqd_dlr$x in state si
and thus the lemma has been proven.
Lemma A.41
(a is a well-formed execution)
A
(ri=ACK
ASSIGNr)
,4h, hi, s.t. rh=ACKASSIGNx
Proof:
157
[
By contradiction.
Without loss of generality, assume 3h, h<i, s.t. rh=ACKASSIGN,
*· rh=ACKASSIGN, ==- ((statusr=UNFL) A (pendingack,=T))
* ((status,=UNFL) A (pendingack,=T))
in state sh-1
==y 3g, g<h-1, s.t r9g=<ASSIGN,,ts,> for some tsx
·* h=ACKASSIGN, - ((status.=UNFL) A (pendingack,=F))
·* r=ACKASSIGN, =- ((status.=UNFL) A (pendingack,=T))
*
in state
Sh-1
in state sh
in state si-1
(((status,=UNFL) A (pendingack,=F)) in state sh)
A
(((status,=UNFL)
A
(h<i)
A (pendingack,=T))
=- 3f, h<f<i-1,
in state si-1)
s.t. 7rf=<ASSIGN,,u> for some u
* 7rf=<ASSIGN,,u> =:-
g, g<f, s.t. rg=<ASSIGNr,ts,> for any ts,
by Lemma A.30
But this is a contradiction, and so the assumption must be false. Thus the
o
lemma has been proven.
Lemma A.42
(a is a well-formed execution)
A
(7rg=CSRECV,)
,d, dig, s.t. 7rd=CSRECV.
Proof:
By contradiction. Without loss of generality, assume 3d, d<g, s.t. 7rd=CSRECVx
* 7d=CSRECVx
=- (xEsendcsrecv
in state sd-l) A (xfsend_cs_recv in state d)
* xEsend-cs.recv in state d-1
== 3c, c<d-1, s.t. rc=ACK-ASSIGN.
by Lemma A.33
*·rc=ACKASSIGN =
j3q, q>c, s.t. rq=ACKASSIGNx
A
by Lemma A.41
by Lemma A.38
*·r=ACKASSIGN, = 3b, b<c, s.t. 7rb=COMMIT.
=,
* rb=COMMIT,
a, a>b, s.t.
* xEsend-csrecv in state Sd-l =
by WF1
ra=COMMITx
currreqddlr$x
in state d_l
by Lemma A.40
* (curr_reqddlr-x in state sd-1) A (a, a>b, s.t. r-=COMMITx)
A (b<c<d)
=# Vp, p>d-1, curr_reqd_dlrx in state sp
=- xEsendcs-recv in state sg-1
* 7rg=CSRECVx
*
(xfsendcsrecv in state Sd)
A
A
(xEsendcsrecv
in state sg-1 )
(d<g)
either
(1) 3q, d<q<g-,
=-
(7rq=COMMITyfor some y)
s.t.
A
(curr-reqddlr=x
A
(currsreqdacked=T
158
in state sq-1)
in state sq-1)
* But curr-reqd-dlr=x in state sql is a contradiction, so this case
cannot occur.
or
(2) 3q, d<q<g-1,
s.t. rq=ACKASSIGNx
* But this is a contradiction, so this case cannot occur.
or
(3) 3q, d<q<g-1,
s.t.
(rq=<DLR_GONE,t>for some t)
A (currreqddlr=x
A
* But currreqddlr=x
cannot occur.
in state Sql)
(currreqdacked=T
in state sq_-1
in state sq_1 is a contradiction, so this case
* Since all three cases lead to contradictions and there are no other possible
cases besides these, the original assumption must be false and thus the lemma
has been proven.
0
Lemma A.43
rg=CSRECVx
3f, f<g, s.t. rf =<ASSIGN,,t> for some t
Proof:
·* rg=CSRECVx == sEsendcsrecv
in state s_-1
* xEsendcs-recv in state sg-1
==- 3e, e<g-1, s.t. re=ACK-ASSIGNx
by Lemma A.33
* 7re=ACKASSIGNx
==: f, f<e, s.t. 7rf=<ASSIGNx,t> for some t
and thus the lemma has been proven.
Lemma. A.44
A
(((status.=RECV)
A
(h<i)
A
(,Be, e>h, s.t. re=<ASSIGNx,u> for any u)
A (pendingackx=T))
Proof:
By contradiction. Assume Ad, h<d<i, s.t. rd=CSRECV,
(((statusxRECV)
A
A
V (pendingackx=F)) in state sml)
(7rm$<ASSIGNx,u> for any u)
(rmCSRECVx)
=:
[]
(pendingackx=F in state sh)
3d, h<d<i, s.t. 'd=CSRECV.
$
by Lemma A.29
((statusx$RECV) V (pendingack,=F)) in state sm
159
in state si)
*
(pending_ack=F in state sh)
A
(h<i)
A
(/e, e>h, s.t.
A
(/3d, h<d<i, s.t. 7rd=CS-RECV,)
re=<ASSIGNx,u> for any u)
((status,#RECV) V (pendingack=F))
in state si
by induction
But this contradicts the lemma's predicate, and so the original assumption
must be false. Thus the lemma has been proven.
0
==
Lemma A.45
A
(a is a well-formed execution)
(((status.=RECV) A (pending_ack=T))
in state si)
/Bh, h<i, s.t. 7rh=ACK_CSRECV,
Proof:
By contradiction. Assume 3h, h<i, s.t. rh=ACK_CS_RECVx
* rh=ACK_CSRECVx
~=
(((statusr=RECV) A (pendingack,=T)) in state Sh-1)
A
(((statusz=RECV)
A (pendingack,=F))
in state sh)
* status.=RECV in state sh-1
==~ 3g, g<h-1, s.t. rg=CSRECVx
* rg=CSRECV,== 3f, f<g, s.t.
· 7g=CSRECVx =#
,Ad, dog, s.t.
f =<ASSIGN,,t> for some t
rd=CSRECV,
by Lemma A.43
by Lemma A.42
* rf=<ASSIGNx,t>
= ] e, e>f, s.t. ire=<ASSIGNx,u> for any u
* (f<g<h) A (e, e>f, s.t. re=<ASSIGNx,u> for any u)
==: Be, e>h, s.t. lre=<ASSIGNx,u>
for any u
*
(pendingackz=F in state sh)
A (((status.=RECV)
A (pendingack=T))
in state sl)
A
by Lemma A.30
(h<i)
A
(,e, e>h, s.t. re=<ASSIGNx,u> for any u)
==- 3d, h<d<i, s.t. rd=CSRECV,
by Lemma A.44
But this is a contradiction, and so the original assumption inust be false. Thus
o
the lemma has been proven.
Lemma A.46
A
(a is a well-formed execution)
(((status.=RECV)
A (pendingack,=T))
7rm
#ERASE,
Proof:
160
in state Sm-_)
By contradiction. Assume rm=ERASE,.
* ((status.=RECV) A (pendingack,=T)) in state sm-1
==,
h, h<m-1,
s.t. rh=ACKCSRECVX
by Lemma A.45
3, g<m, s.t. rg=ERASABLEx
* 7rm=ERASE,
=
* 7rg=ERASABLEx==- xEpendingerasable
* xependingerasable
by WF2
in state sg-_
in state sg_1
=: 3f, f <g-l<m-1, s.t. rf=ACKCSRECV,
This contradiction implies that the original assumption must be false and thus
the lemma has been proven.
0
Lemma A.47
(a is a well-formed and fair execution)
A (((statusx=UNFL) A (pendingack,=T)) i state sk)
31, I>k, s.t.
rl=ACKASSIGNx
Proof:
* ((status.=UNFL) A (pendingack,=T)) in state sk
==. 3q, q>k, s.t. A3n,n<q, s.t. rn=ACKASSIGN_
*
(((status,=UNFL) A (pendingackx=T)) in state sk)
A (3q, q>k, s.t. n, n<q, s.t. rn=ACKASSIGN)
by Lemma A.31
, p, p<q, s.t. (rp=CSREQD,) V (rp=CS_RECVx)V (rp=ERASE,)
by Lemma A.37
·
(((status,=UNFL) A (pending_ackx=T)) in state s,m-l)
A
(7r#CS_REQDx)
A
(7m#CSRECV,)
A
(rm#ERASE,)
=- ((statusx=UNFL) A (pendingack,=T)) in state sm
* By induction and the definition of a fair execution, it therefore followsthat
31, I>k, s.t. ri=ACK.ASSIGNx
and thus the lemma has been proven.
5
Lemma A.48
((<x,t>Epending_tsassign) V (timestampx=t, tEN)) in state si
3h, h<i, s.t. (rh=COMMITx)A (currentts=t,
Proof:
* Either
(1) <x,t>Ependingts-assign
in state si
161
tEA, in state sh-1)
* <x,t>Ependingts-assign in state si
- 3h, h<i, s.t.
(7rh=COMMIT,)
A
(currentts=t,
tENV, in state sh-1)
by definition of steps(LOT)
or
(2) timestampx=t, tEN, in state si
· timestampx=t, tAf, in state s =- 3g, g<i, s.t. rg=<ASSIGNx,t>
· rg=<ASSIGNx,t> ==- <x,t>Ependingtsassign
* <x,t>Ependingtsassign
in state sg-1
in state sg-1
3h, h<g-l<i, s.t.
:=
(rh=COMMIT,)
A (currentts=t, tEjA, in state Sh-l)
and thus the lemma has been proven.
Lemma A.49
O
i<j
(currentts in state s)<(currentts
in state sj)
Proof:
By induction.
* The basis case of i=j is trivial.
* For any action rj, either
(currentts in state sj_l)=(currentts in state sj)
for rj$COMMITxfor any xEA/
or
(currentts in state s_l)+l=(currentts
in state sj)
for rj=COMMIT,for some xAf
Therefore, for any action rj,
(currentts in state sj_)<(currentts
in state sj)
* By the inductive hypothesis,
i<j-1 = (currentts in state si)<(current.ts in state sj-1)
* Therefore,
(i<j-1) A ((currentts in state sj_l)<(current_ts in state sj))
== ((currentts in state si)<(current-ts in state s))
and so the lemma has been proven.
]
Lemma A.50
A
A
(a is a well-formed execution)
(curr_reqd_ts=t, tEAr, in state si)
(((<x,t>Epending_tsassign)
V (timestampa=t))
curr-reqddlr=x in state si
162
in state si)
Proof:
* ((<x,t>Ependingts-assign) V (timestampa=t, tEN)) in state si
= 3h, h<i, s.t. (rh=COMMIT,)A (currentts=t in state sh-1)
by Lemma A.48
* (rh=COMMIT,)A (currentts=t in state sh_1)
= ((currreqd dlr=x) A (curr reqdts=t) A (currentts=t+1)) in state sh
* (currentts=t+1 in state Sh)
=Vj, j>h, t<currents in state s
by Lemma A.49
*
(Vj, j>h, t<currentts in state sj)
A (curr.reqdts=t, tEAN,in state si)
A
(h<i)
=*- /k, h<k<i, s.t. Wk=COMMITyfor any y
*
(currreqddlr=x
A
A
in state sh)
(Ik, h<k<i, s.t. rk=COMMITy
for any y)
(currreqdts=t,
tL, in state si)
== currreqddlr=x
in state si
and thus the lemma has been proven.
Lemma A.51
by definition of steps(LOT)
O
(a is a well-formed execution)
A
(rI=ERASE
)
Vk, k>i, currreqddlrix
in state sk
Pr(oof:
* 7ri=ERASEx
=- 3h, h<i, s.t. 7rh=CSRECVx
*·rh=CSRECVZ -=xEsendcsrecv in state sh-1
* xEsend-csrecv in state Sh-1 =- curr-reqddlrx
b2y Lemma A.34
in state sh-1
b2y
Lemma A.40
* 7h=CSRECVx -== 3g, g<h, s.t. rg=<ASSIGNx,t> for some t
biy Lemma
A.43
* 7rg=<ASSIGNx,t>
= <x,t>Ependingts.assign in state s_
* <x,t>Ependingts-assign in state sg-_ =
* rf=COMMITx==
* (currreqd-dlrjx,
3f, f<g-1, s.t. rf=COMMITx
j, j>f, s.t. rj =COMMITx
by WF1
xEA, in state smil) A (rmiCOMMIT,)
curr-reqddlrox in state sm
(currreqddlr:x
in state sh-1)
==
*
A (j, j>f, s.t. 7rj=COMMITx)
A
(f<g-l<h<i)
=- Vk, k>i, curr.reqddlr6x in state sk
and thus the lemma has been proven.
163
by induction
0
Lemma A.52
A
(a is a well-formed execution)
(timestampr=t, tEA/, in state sj)
/k, k>j, s.t. 7rk=<ASSIGNx,u>for any u
Proof:
By contradiction. Assume 3k, k>j, s.t. rk=<ASSIGNx,u>for some u
* timestampz=t, tEAJ, in state s ==: 3i, i<j, s.t. ri=<ASSIGN,t>
* (i<j) A (j<k) =- i<k
by Lemma A.30
* rk=<ASSIGN,,u> ==, i, i<k, s.t. ri=<ASSIGN,,t>
This contradiction implies that the original assumption must be false, and so
O
the lemma has been proven.
Lemma A.53
A
(a is a well-formed execution)
(timestampa=t, tEAr, in state si)
Vl, l>i, timestamp,=t in state sI
Proof:
* timestampa=t, tEA, in state si
by Lemma A.52
:==-j, j>i, s.t. rj=<ASSIGN.,u> for any u
* (timestampr=t in state sm-l) A (rm$<ASSIGN,,u> for any u)
=== timestamp,=t
in state sm
* (timestamp,=t in state si) A (j, j>i, s.t. rj=<ASSIGN,u> for any u)
by induction
==- Vl, I>i, timestamp,=t in state sl
and thus the lemma has been proven.
Lemma A.54
O
(a is a well-formed execution)
A (timestampx=t, tEAn, in state si)
A (tfrecv_tss in state si)
A (3h, h<i, s.t. rh=ERASEx)
Vj, j>i, tfrecvtss in state sj
Proof:
By contradiction. Assume 3j, j>i, tErecvtss in state sj
* (tfrecvtss in state si) A (tErecvtss in state sj) A (i<j)
-= 3k, i<k<j, s.t. (tfrecv tss in state sk) A (tErecv tss in state sk+l)
* (trecvAtss in state sk) A (tErecvtss in state sk+1)
by definition of steps(LOT)
in state sk
==- currreqdts=t
164
* (timestampa=t, tEAN,in state si)
=- VI, I>i, timestamp,=t
in state sl
by Lemma A.53
* (Vl, l>i, timestampx=t in state sl) A (k>i) =- timestamp-=t in state sk
* ((curr_reqd.ts=t, tEN) A (timestampx=t)) in state sk
in state sk
currreqddlr=x
currreqddlr5x
* (Irh=ERASEx) A (h<k) =
=:
in state sk
by Lemma A.50
by Lemma A.51
But this is a contradiction and so the original assumption must be false. Thus
the lemma has been proven.
Lemma A.55
A
0
(a is a well-formed and fair execution)
(xEsendcs-recv in state si)
3j, j>i, s.t. rj=ERASEx
Proof:
* xEsend-csrecv in state si -=- 3k, k>i, s.t. rk=CS_RECVx
* 7k=CS-RECVX
= ((status.=RECV) A (pendingackx=T)) in state sk
* ((statusx=RECV) A (pendingackx=T)) in state sm_
==
by Lemma A.46
in state sml) A (rnERASE,)
rmERASEx
* (((statusx=RECV) A (pendingack=T))
= ((statusx=RECV) A (pendingackx=T)) in state sm
* ((status,=RECV) A (pendingackx=T)) in state sk
== 3n, n>k, s.t. rn=ACK_CSRECV,
by induction and fairness
· rn=ACK_CSRECVx - xEpendingerasable
in state sn
* xEpendingerasable in state sn,=
3p, p>n, s.t. 7rp=ERASABLEr
*· p=ERASABLE -3j, j>p, s.t. 7rj=ERASEx
by WF4
o
* By transitivity, j>i and thus the lemma has been proven.
Lemma A.56
(a is a well-formed and fair execution)
A
(xEsendcsrecv
in state si)
A (timestampa=t, tAf, in state s)
3j, j>i, s.t Vk, k>j, trecvtss
in state sk
Proof:
* xEsendcsrecv in state si
- 31, I>i, s.t. 7rl=ERASEx
by Lemma A.55
* rl=ERASE == ((status=NONR)
A (pending-ackx=T)) in state sl
* timestampx=t, tJN, in state si
==- Vn, n>i, timestamp,=t in state sn
* 7r=ERASE,
= 3r,r<l,s.t. rr=CS-RECVx
165
by Lemma A.53
by Lemma A.34
*
by Lemma A.42
,]q, q>r, s.t. 7rq=CSRECVx
* r=CSRECVx =
(statusr=NONR in state sl)
A (/q, q>l, s.t. 7rq=CSRECV,)
A
A
in state s)
(pendingack,=T
(Vn, n>l, timestamp,=t in state s)
==-
3j, j>l, s.t. rj=<DLR_GONE,t>
by definition of fairness
in state sj
* 7rj=<DLR_GONE,t>== trecvtss
*
(timestampa=t, tENJ, in state s)
A (torecvtss in state sj)
A
A
(7rl=ERASE,)
(I<j)
Vk, k>j, trecvtss
by Lemma A.54
in state sk
and thus the lemma has been proven.
==
Lemma A.57
[
(a is a well-formed execution)
A
(ri=ACKASSIGN,)
A
A
(curr_reqddlr=x
(i<j)
in state sj)
currreqdacked=T in state sj
Proof:
by Lemma A.38
* 7ri=ACK.ASSIGNx
-==3h, h<i, s.t. 7rh=COMMITx
* 7rh=COMMITx
= ((curr_reqddlr=x) A (currreqd_acked=F)) in state sh
* 7rh=COMMIT,== ,/k, k>h, s.t. rk=COMMITx
by WF1
*
(curr.reqddlr=x,
xEA, in state Sh)
A (curr_reqd_dlr=x, xEJA, in state sj)
A
(h<j)
A
(Ak, k>h, s.t. rk=COMMITx)
for any y)
:(,l, h<l<j, s.t. rl=COMMITy
A (Vm, h<m<j, currreqd dlr=x in state sm)
*
(ir=ACKASSIGNx)
A (Vm, h<m<j, currreqddlr=x
A
in state sm)
(h<i<j)
=-
curr reqdacked=T in state si
* (curr-reqdacked=T
=-
in state si) A (I,
i<l<j, s.t.
r=COMMITyfor any y)
currreqd acked=T in state sj
and thus the lemma has been proven.
166
[]
Lemma A.58
A
(a is a well-formed execution)
(rk=<ASSIGN,,t> for some t)
status,=UNFL in state sk
Proof:
in state k.
By contradiction. Assume statusZUNFL
* rk=<ASSIGNx,t>
==
h,h<k, s.t. rh=<ASSIGNz,v> for any v
by Lemma A.30
* Either
(1) status.=REQD
in state sk
* status.=REQD in state sk =Z 3j, j<k, s.t. rj=CSREQD,
·* j=CSREQD, ==>x=sendcsreqd in state sj_l
* =sendcs-reqd in state sj- 1 ==- 3i, i<j-1, s.t. ri=ACKASSIGNx
3h, h<i<k, s.t. 7rh=<ASSIGN,,v> for some v
by Lemma A.29
This contradiction implies that this case is impossible.
* ri=ACKASSIGN. =
or
(2) status.=RECV
in state sk
· statusr=RECV in state k =- 3j, j<k, s.t. 7rj=CSRECVr
*·rj=CSRECVX ==, xEsendcs-recv in state sj_l
* xEsendcsrecv in state sj-1 == 3i, i<j-1, s.t. is=ACK.ASSIGN,
by Lemma A.33
* ri=ACKASSIGNx== 3h, h<i<k, s.t. 7rh=<ASSIGN.,v> for some v
by Lemma A.29
But this is a contradiction, so this case must be false.
or
(3) statusx=NONR
in state sk
* status.=NONR in state sk
* rj=ERASEx =
·
-
3j, j<k, s.t. 7rj=ERASEx
3i, i<j, s.t. 7ri=ACKASSIGNx
by Lemma A.36
* 7ri=ACKASSIGNz==,3h, h<i<k, s.t. 7rh=<ASSIGNx,v> for some v
by Lemma A.29
But this is a contradiction and so also this case must be false.
* All possible cases lead to contradictions, so the original assumption must be
false and thus the lemma has been proven.
O
(a is a well-formed and fair execution)
Lemma A.59
A
(ri=COMMIT,)
31, I>i, s.t. rl=ACKASSIGNx
167
Proof:
* (ri=COMMITx)A (currentts=t
==
for some tEAr)
in state si
<,t>Epending_tsassign
* <x,t>Epending-tsassign in state si -- 3k, k>i, s.t. 7rk=<ASSIGN,,t>
·* rk=<ASSIGN,,t>=# pendingack,=T in state sk
by Lemma A.58
* 7rk=<ASSIGN,,t> == statusx=UNFL in state sk
* ((statusx=UNFL) A (pendingack,=T)) in state sk
by Lemma A.47
=:- 31, I>k, s.t. rl=ACKASSIGNX
0
and thus the lemma has been proven.
Lemma A.60
(a is a well-formed execution)
(ri=COMMIT,)
A
A (currentts=t, tJAr, in state si-l)
Vj, j>i, ((<x,t>Epending_tsassign)
V (timestampx=t))
in state sj
Proof:
By induction.
* (ri=COMMITx)A (currentts=t
in state si_1 )
= <x,t>Ependingtsassign in state si
* (<x,t>Ependingtsassign in state sj-1) A (rj#<ASSIGN,,t>)
in state sj
=# <x,t>Epending_tsassign
* (<x,t>Ependingtsassign
by definition of steps(LOT)
in state sj_) A (7rj=<ASSIGNx,t>)
by definition of steps(DLR,)
= timestampx=t in state sj
* (timestamp,=t in state sj_l) A (rjy<ASSIGNx,u> for any u)
by definition of steps(DLRx)
== timestampx=t in state sj
k, k>j-1, s.t. rk=<ASSIGN,,u> for any u
* timestampx=t in state sjl =:
by Lemma A.52
* Therefore,
((<x,t> pending_ts_assign)V (timestampx=t)) in state sj-1
V (timestampx=t))
((<x,t>Epending_tsassign)
for any possible action 7rj in a well-formed execution.
==
* By induction,
Vj, j>i, ((<x,t>Ependingtsassign)
in state sj
V (timestampx=t)) in state sj
0
and so the lemma has been proven.
(a is a well-formed execution)
Lemma A.61
A
(tErecvtss
in state si, tEAJ)
(<x ,t> Epending-ts assign)
Sx s.t. Vj, j>i, (
V (timestampx=t)
168
)
in state sj
Proof:
* tErecvtss in state s =* 3h, h<i, s.t. curr.reqd ts=t in state sh
* currreqdts=t, tEAr, in state Sh
==* 3g, g<h, s.t. (rg=COMMITxfor some x) A (currentts=t
* (rg=COMMITx)A (currentts=t in state sg-1)
==Vj,
j>g, ((<x,t>Ependingtsassign)
in state sg_l)
V (timestampx=t)) in state sj
by Lemma A.60
*
(g<h<i)
A (Vj, j>g, ((<x,t>Ependingtsassign)
V (timestampa=t)) in state sj)
==. Vj, j>i, ((<x,t>Ependingts_assign)
V (timestampa=t))
in state sj
and thus the lemma has been proven.
Lemma A.62
A
O
(a is a well-formed execution)
(xEsendcsrecv in state si)
<s,u>pending_tsassign
in state si for any u
Proof:
By contradiction. Assume <x,u>Ependingts-assign
in state si for some u
* xEsendcs-recv in state si == 3h, h<i, s.t. rh=ACK.ASSIGNx
by Lemma A.33
*
·
h=ACKASSIGNx
3 9 , g<h, s.t. rg=<ASSIGNx,t> for some t
· rg=<ASSIGN.,t>
=
(<x,t>Ependingtsassign
=-
by Lemma A.29
in state sg_1)
A (<x,t> pending_tsassign in state s)
* <x,t>Ependingts-assign in state sg-1
== 3f, f<g-1, s.t. (rf=COMMIT,)A (current ts=t in state sf-1)
* rf=COMMIT
, j, jf, s.t. rj=COMMIT,
by WF1
·
(<x,t>pending_tsassign
in state sg)
A (,j, j>f, s.t. rj =COMMITx)
A
(f<g)
=> Vk, k>g, <x,t>pending_tsassign
*
in state sk
(<x,u>Epending_tsassign in state si)
A
A
(Vk, kg,
<x,t>4pending_tsassign
in state sk)
(g<h<i)
==- uot
· <x,u>Ependingtsassign
in state si
== 3e, e<i, s.t. (re=COMMITx)A (current-ts=u in state se-1)
* (currentts=t in state sf-1) A (currentts=u in state sl)
= ef
* ef
=> 3e, ef,
A (u7t)
s.t. re=COMMITx
But this is a contradiction and so the original assumption must be false. Thus
the lemma has been proven.
O
169
Lemma A.63
(a is a well-formed execution)
A
(curr_reqd_dlr=x,
A ((
xEAJ, in state si)
(<x,t>Ependingts-assign)
V
curr-reqdts=t
(timestampa=t,
tENA) )
in state si)
in state si
Proof:
* ((<x,t>Epending_tsassign) V (timestampa=t, tEN)) in state si
== 3h, h<i, s.t. (rh=COMMITx)
A (currentts=t, tEAN,in state sh-l)
by Lemma A.48
* (rh=COMMITx)A (currentts=t in state sh-1)
== ((curr_reqd_dlr=x) A (currreqdts=t)
A (current_ts=t+l)) in state sh
*
(((curr_reqd_dlr=x, xEAJ) A (curr_reqdts=t, tEN)) in state sm-l)
A (currenttsat in state s,_l-)
A
(7rmCOMMITx)
V
*
(((curr_reqd_dlrix) A (currreqdtsot))
in state sm)
(((curr_reqd_dlr=x) A (currreqd ts=t)) in state sm)
(((curr-reqddlrox, xEJA) A (currreqdtsot,
A (current ts7t, tENJ, in state sm-l)
A
tEN)) in state Sm-1)
(WrmCOMMITx)
=
((curr.reqddlrhx) A (curr.reqd.ts6t)) in state sm
· rh=COMMIT =- ]3j, j>h, s.t. rj=COMMITx
* current_ts=t+l in state sh == VI, I>h, current ts>t+l in state sl
* V1, I>h, currentts>t+l
*
A
in state s ==* V, I>lh, currenttsat
(((currreqddlr=x,
xEA) A (currreqdts=t,
(Vl, I>h, currenttsot in state sl)
by WF1
in state sl
tN)) in state Sh)
A (j, j>h, s.t. rj=COMMIT,)
: Vk, k>h,
V
*
(((currreqd_dlrx)
(((curr_reqddlr=x)
A (currreqdts5t))
A (currreqdts=t))
in state sk)
in state sk)
by induction
(curr-reqddlr=x in state si)
A
A
(h<i)
(Vk, k>h,
V
(((curr_reqddlrsx)
A
(((curr_reqddlr=z)
A
(curr_reqd.ts5t)) in state sk)
(currreqdts=t)) in state sk) )
curr-reqdts=t in state si
and thus the lemma has been proven.
==
Lemma A.64
A
(a is a well-formed execution)
(currreqddlr=x,
xeJn, in state si)
currreqd tsIL in state si
170
O
Proof:
xEAr, in state si =- 3j, j<i, s.t. rj=COMMIT
* currreqddlr=x,
* (rj=COMMITx)A (current-ts=t,
for some tEJA, in state sj-l)
-Vk, k>j, ((<x,t>Epending_tsassign) V (timestampx=t)) in state sk
by Lemma A.60
* (Vk, k>j, ((<x,t>Epending_tsassign) V (timestampx=t)) in state sk) A (j<i)
=, ((<x,t>Epending_tsassign) V (timestamp,=t)) in state si
(curr-reqd-dlr=x in state si)
*
A
V (timestampx=t))
(((<x,t>Ependingtsassign)
in state si
=- currreqdts=t
in state si)
by Lemma A.63
in state si
· teAf =- currreqdtsI
and thus the lemma has been proven.
Lemma A.65
A
0
(a is a well-formed execution)
(tErecvtss in state si)
tEJ
Proof:
* tErecvtss
in state si ==* either
(rh=COMMITxfor some x)
(1) 3h, h<i, s.t.
A (curr-reqdtsII
A
* ((currreqdts5i)
(curr-reqdts=t
in state h-_)
in state sh-1)
A (curr_reqdts=t)) in state
Sh-1
tf
or
(2) 3h, h<i, s.t.
(Wrh=ACK-ASSIGNx)
A
A
* currreqddlr=x,
(curr reqddlr=x in state sh_l)
(curr-reqdts=t in state sh-1)
xEJ/, in state h_-
in state Sh-1
-- currreqdtslI
A (curr_reqdts=t))
* ((currreqdts$I)
by Lemma A.64
in state sh-1 = tEAr
or
(3) 3h, h<i, s.t.
A
(rh=<DLRGONE,u> for some u7t)
(curr_reqd_dlrIl in state sh-1)
A
(currreqdts=t
* curr.reqddlrI
in state sh-1)
in state sh-1
for some EA, in state sh-_
for some xEAf, in state sh-1
by Lemma A.64
=> curr-reqdtsI
in state sh-1
== currreqddlr=x,
* currreqddlr=x,
* ((curr.reqd.tsl)
A (curr-reqd ts=t)) in state sh_- =
171
tAF
* Therefore, for all possible cases, the desired result is obtained and thus the
0
lemma has been proven.
Lemma A.66
A
(a is a well-formed and fair execution)
(tErecvtss in state si)
3j, j>i, s.t. Vk, k>j, trecv_tss in state sk
Proof:
* tErecvtss
== tEN
by Lemma A.65
in state so)
== 3h, h<i, s.t. (tfrecvtss in state sh-1) A (tErecvtss in state sh)
* tErecv tss in state sh
* (tErecvtss
in state si) A (recvtss=0
= 3x s.t. VI, I>h, (
(<x,t>Epending_tsassign)
V
(timestampx=t)
* (trecvtss
in state sh-1) A (tErecvtss in state
==y either
)
in state sl
by Lemma A.61
h)
(1) (rh=ACKASSIGNy for some y) A (curr-reqdts=t
(((<x,t>Epending_tsassign) V (timestampa=t)) in state sh)
*
A
(7rh=ACKASSIGNy)
==y
*
in state sh-1)
(
(<x,t>Ependingts-assign)
V (timestampx=t) )
in state h-1
(rh=ACKASSIGNy)
A (trecvtss in state Sh-_)
A (tErecvtss in state sh)
--y
(currreqddlr=y
in state sh-1)
A (yEsendcs-recv in state sh)
·
(currreqdts=t, tEAN,in state h-1)
A ((
(<x,t> Epending_tsassign)
in state sh-1)
V (timestamp,=t) )
by Lemma A.50
in state sh-1
:. currreqddlr=x
* (curr-reqddlr=y in state sh-1) A (currreqd-dlr=x in state sh-1)
==> x=y
* (yEsendcsrecv in state h) A (x=y)
=. xEsendcsrecv in state sh
or
for some ut) A (curr-reqd-ts=t in state sh-1)
(2) (rh=<DLR_GONE,u>
V (timestampa=t)) in state sh)
·
(((<x,t>pending_tsassign)
A (rh=<DLR_GONE,u>for some u7t)
(<x,t> Ependingts-assign)
=• (
V (timestampx=t) )
in state sah172
·
(curr-reqdts=t, tEA, in state
A ((
Sh-_)
(<x,t>Ependingts-assign)
V (timestampx-=t) )
in state sh-1)
==- currreqd-dlr=x
in state sh-1
(rh=<DLR_GONE,u>)
*
A
A
(trecvtss
(tErecvtss
A
(currreqddlr=x
in state sh-1)
=- xsendcsrecv
in state sh
in state sh-1)
in state sh)
or
(3) (h=COMMITy for some y) A (currreqdts=t
*
(((<x,t>cpendingts-assign)
A
in state shl)
V (timestampx=t)) in state sh)
(rh=COMMITy for some y)
=- (
(<x,t>Ependingts-assign)
V
*
by Lemma A.50
(timestampx=t)
)
in state sh-1
(currreqd-ts=t in state sh-1)
A ((
(<x,t>Ependingts-assign)
V (timestampx=t) )
in state sh-1)
-= currreqddlr=-x in state sh-1
by Lemma A.50
* Either
(i) 3g, g<h-1,
*
s.t. 7rg=ACKASSIGNx
(7rg=ACKASSIGNx)
A
A
(currreqddlr=x
in state Sh-1)
(g<h-1)
==-
currreqdacked=T
in state Sh-1
by Lemma A.57
*
(7rh=COMMITy)
A (curr reqd-ts=t, tA, in state h-_l )
A (currreqdacked=T in state shl)
A
or
(currreqddlr=x
in state sh-1)
== xEsendcsrecv in state sh
(ii) ,fg, g<h-1, s.t. rg=ACKASSIGNx
* currreqddlr=x, xA/, in state sh_1
=: 3f, f<h-1, s.t. rf=COMMITx
· rf=COMMITz
3e, e>f, s.t. r=ACK-ASSIGNx
(3e, e>f, s.t. re,=ACK-ASSIGNx)
=,
·
by Lemma A.59
A (/g, g<h-1, s.t. rg=ACKASSIGNx)
A (f h-1)
A
(rh$ACKASSIGNx)
=? 3e, e>h, s.t. re=ACKASSIGNx
* 7rf=COMMITx ==
j3d, d>f, s.t.
173
rd=COMMITz
by WF1
* (rh=COMMITy)
A (,ad, d>f, s.t. rd=COMMIT.)
A (f<h)
> xy
* rh=COMMITy
==> currreqd_dlr=y
*
in state sh
(currreqd_dlr=y in state sh)
A
A
A
*
A
(y)
(,d, d>f, s.t. rd=COMMIT,)
(f<h)
==: Vc, c>h, currreqd_dlrx
in state s,
(7re=ACKASSIGNx)
(e>h)
(Vc, c>h, curr_reqd_dlrZx in state s,)
=, xEsendcsrecv in state se
Therefore, for all possible cases, it must be true that
3e, e>h, s.t. xEsendcs-recv in state se
A
* xEsend_csrecv in state s,
==
*
in state se
<x,t>opendingts.assign
by Lemma A.62
(<x,t>Opendingtsassign in state s,)
A (, I>h, ((<x,t>Epending_tsassign) V (timestampx=t)) in state si)
A
(e>h)
== timestampx=t
in state s,
* ((xEsendcs-recv in state s,) A (timestampx=t, tAf)) in state s,
== 3j, j>e, s.t. Vk, k>j, trecvtss
and thus the lemma has been proven.
Lemma A.67
in state sk
by Lemma A.56
O
(a is a well-formed and fair execution)
xEJA, in state si)
A
(curr_reqd_dlr=x,
A
(currreqd.acked=T
A
(recvtss#0
A
(,3j, j>i, s.t.
in state si)
in state si)
(rj=COMMITyfor some y)
A (curr_reqd_dlr=xin state sj-1) )
3k, k>i, s.t. xEsendcs.recv in state sk
Proof:
By contradiction. Assume /3k, k>i, s.t. xEsend-cs-recv in state sk
* currreqddlr=x,
xEA, in state si
by Lemma A.64
in state si
=-= currreqdtsI
* ((currreqddlr=x,
xEAr) A (currreqdacked=T)) in state si
=> 3h, h<i, s.t. rh=ACK.ASSIGNx
* 7rh=ACKASSIGNz=-- ,g, g>h, s.t. rg=ACKASSIGNx
174
by Lemma A.32
by Lemma A.41
*
(curr-reqd-dlr=x, xEA, in state s-l)
A (currreqdtsI
in state s,ml)
A
(curr-reqdacked=T
A
(,rm#COMMITy for any y)
A
(#7rmACKASSIGN
in state Sm-l)
)
A (xfsendcsrecv in state sm)
=
(u(A,uEJf, s.t.
(7rm=<DLRGONE,u>)
A (recvtss={u} in state Sm.-l) )
A (currreqd dlr=x in state sm)
A (currreqdts$I
in state sm)
A (currreqd acked=T in state sm)
·
A
A
A
(currreqddlr=x,
xzEN, in state si)
(currreqd_ts±
in state si)
(currreqd acked=T in state si)
(Aj, j>i, s.t.
(rj=COMMITyfor some y)
A
(Ag, g>i, s.t. rg=ACKASSIGN,)
A (curr-reqddlr=x in state sj-1))
A (Ak, k>i, s.t. xEsendcsrecv in state sk)
(Vc, c>i, currreqddlr=x in state s)
A
·
A
(Ad, d>i, s.t.
(rd=<DLR_GONE,u>)A
A (recvtss={u} in state sd-1) for any u)
by induction
(currreqddlr=x,
xEJA, in state sm-l)
(recvtss=A in state sm-1)
A
(rmyCOMMITy for any y)
A
(irm
ACK-ASSIGNx)
(Au,
uEA/, s.t. (rm=<DLR_GONE,u>)
A
--
*
recvtssCA
A (recvtss={u}
in state Sm
in sltate
sm-l))
by definition of steps(LOT)
(recvtss=R in state si)
A
A
A
(R0)
(Vc, c>i, currreqddlr=x
in state s)
(3j, j>i, s.t.
(irj=COMMITyfor some y)
A (currsreqddlr=x in state sj-_))
A (Bg, g>i, s.t. 7rg=ACKASSIGNx)
A (Ad, d>i, s.t.
(rd=<DLR_GONE,u>)
A
A (recvtss={u} in state Sd-1) for any u)
recvtssCR in state sb
by induction
(Vb, b>i, recvtssCR in state sb)
(RO)
A
(Ad, d>i, s.t.
=: Vb, b>i,
·
(7rd=<DLR-GONE,u>)
A (recvtss={u} in state sd-1) for any u)
:- 3v, vER, s.t. Va, a>i, vErecv tss in state sa
175
* vER in state si
== 3n, n>i, s.t. Vq, q>n, vorecvtss in state sq
But this contradicts the earlier deduction that
Va, a>i, vErecvtss in state sa
by Lemma A.66
and so the original assumption must be false and the lemma has been proven.
5
Lemma A.68
(a is a well-formed and fair execution)
A
(currreqdacked=T
A
(currreqddlr=x,
A
(recv_tss$0
in state sl)
xEN, in state sl)
in state sl)
3r, r>l, s.t. xEsendcsrecv in state s,
Proof:
* Either
(1) 3r, r>l, s.t.
(rr,=COMMITy
for some y)
A (currreqddlr=x in state s,_l)
* currreqddlr=x, xEJV, in state s,_== curr-reqdtsI
in state s,_*
(currreqddlr=x in state sl)
A (curr.reqddlr=x in state Sr-l)
A
(r>l)
3q, l<q<r-1,
==-
*
by Lemma A.64
s.t. 7rq=COMMITzfor any z
(currreqdacked=T in state sl)
A
*
(;,q, I<q<r-1, s.t. rq=COMMITz
for any z)
=- curr.reqd acked=T in state sr(r=COMMITYfor some y)
A (curr-reqd-dlr=x in state s,-l)
A (currreqdtsI
in state sr-l)
A (currreqdacked=T in state Sr-l)
=- xEsend-csrecv
in state s,
or
(2)
(7rm=COMMITyfor some y)
m, m>l, s.t.
A
*
(curr-reqd dlr=x in state Sm-l)
(currreqddlr=x,
A
A
A
xEJA,in state sl)
(currreqdacked=T
in state s)
(recvtss$0 in state s)
(pm, m>l, s.t.
(rm=COMMITyfor some y)
A (currreqd dlr=x in state sm-l) )
3r, r>l, s.t. xEsendcs-recv in state sr
by Lemma A.67
* The desired result follows for both possible cases, and so the lemma has been
==
proven.
0
176
Theorem A.69
(a is a well-formed and fair execution)
A
(ri=COMMIT,)
3j, j>i, s.t. rj=ERASE,
Proof:
-= 31, I>i, s.t. ri=ACKASSIGN.
* ri=COMMITx,
* ri=COMMIT,
==Y,Ak, k>i, s.t. ark=COMMITx
* Either
for some yx
(1) 3m, i<m<l, s.t. Irm=COMMITy
A (k,
* (rm=COMMITyfor some yx)
by Lemma A.59
by WF1
k>i, s.t. rk=COMMIT,)
curr-reqd dlrAx in state sl1
==
* (curr-reqd_dlrox in state sl_1) A (rt=ACK_ASSIGN)
== xEsend-csrecv in state sl
or
(2)
for any y
m, i<m<l, s.t. 7rm=COMMITy
* ri=COMMITx
=
A (currreqd_acked=F))
((currreqddlr=z)
in state si
* 7rl=ACKASSIGN,
:- ,h, h<l, s.t. 7rh=ACKASSIGNx
by Lemma A.41
(((currreqd_dlr=x) A (currreqd_acked=F)) in state sn-l)
*
(rnCOMMITy for any y)
A
A
(7,7ACK.ASSIGNr)
((currreqdadlr=x) A (currreqd_acked=F)) in state s.
=
*
A
A
A (curr_reqd_acked=F)) in state si)
(((currreqddlr=x)
for any y)
(m, i<m<l, s.t. rm=COMMITy
(Ah, h<l, s.t. 7rh=ACKASSIGNx)
((curr-reqddlr=x) A (currreqdacked=F)) in state si-_
=-
by induction
* Either
(i) recvtss=O in state s-_l
*
(curr_reqd_dlr=x in state si-1)
A (recvtss=0 in state s-1)
A
(ri=ACKASSIGNx)
== xsend-cs-recv in state st
or
(ii) recvtss60~ in state s1_l
*
(currreqd-dlr=x in state sj_l )
A (recvtss0 in state s-1i)
A
( r=ACKASSIGN )
~=
(currreqdacked=T in state sl)
A
A
in state sl)
(currreqddlr=x
(recvtss$0 in state Sl)
177
A
(curr-reqdacked=T in state s)
(currreqddlr=x
in state sl)
A (recvtss$0 in state sl)
=, 3r, r>l, s.t. xEsendcsrecv in state sr
by Lemma A.68
* Therefore, for all possible cases, it is true that
3r, r>l, s.t. xEsendcsrecv in state sr
* xEsendcsrecv in state Sr
== 3j, j>r, s.t. rj=ERASEx
and thus the theorem has been proven.
178
by Lemma A.55
[
Bibliography
[1] Rakesh Agrawal. A Parallel Logging Algorithm for Multiprocessor Database Ma-
chines. In Proc 4th Int'l Workshop on Database Machines, pages 256-276, Grand
Bahama Island, March 1985.
[2] Rakesh Agrawal and David J. DeWitt. Recovery Architectures for Multiprocessor
Database Machines. In Proc SIGMOD '85 Conf, pages 131-145, Austin, Texas,
May 1985.
[3] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and
Algorithms. Addison-Wesley, Reading, Massachusetts, 1983.
[4] Joel F. Bartlett. A NonStop Kernel. In Proc 8th Symp. on Operating System
Principles, pages 22-29, December 1981.
[5] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Con-
trol and Recovery in Database Systems. Addison-Wesley,Reading, Massachusetts,
1987.
[6] Ian F. Blake. An Introduction to Applied Probability. John Wiley & Sons, New
York, New York, 1979.
[7] Haran Boral. Design Considerations for 1990 Database Machines. In Proc IEEE
Compcon '86 Conf, pages 370-373, San Francisco, California, March 1986.
[8] Haran Boran, David J. DeWitt, Dina Friedland, Nancy F. Jarrell, and W. Kevin
Wilkinson. Implementation of the Database Machine DIRECT. IEEE Trans on
Software Engineering, SE-8(6):533-543, November 1982.
[9] George Copeland, Tom Keller, Ravi Krishnamurthy, and Marc Smith. The Case
for Safe RAM. In Proc 15th Int'l Conf on Very Large Databases, pages 327-335,
Amsterdam, The Netherlands, August 1989.
[10] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to
Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.
[11] William Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael
Larivee, Rich Lethin, Peter Nuth, Scott Wills, Paul Carrick, and Greg Fyler. The JMachine: A Fine-Grain Concurrent Computer. In Proc IFIP 11th World Computer
Congress, pages 1147-1153, San Francisco, California, August 1989.
179
[12] William J. Dally, J.A. Stuart Fiske, John S. Keen, Richard A. Lethin, Michael D.
Noakes, Peter R. Nuth, Roy E. Davison, and Gregory A. Fyler. The MessageDriven Processor: A Multicomputer Processing Node with Efficient Mechanisms.
IEEE Micro, 12(2):23-39, April 1992.
[13] Dean S. Daniels, Alfred Z. Spector, and Dean S. Thompson. Distributed Logging
for Transaction Processing. In Proc SIGMOD '87 Conf, pages 82-96, San Francisco,
California, May 1987.
[14] David J. DeWitt. DIRECT - A Multiprocessor Organization for Supporting Rela-
tional Database Management Systems. IEEE Trans on Computers, C-28(6):395406, June 1979.
[15] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R.
Stonebraker, and David Wood. Implementation Techniques for Main Memory
Database Systems. In Proc SIGMOD '84 Conf, pages 1-8, Boston, Massachusetts,
June 1984.
[16] Susanne Englert, Jim Gray, Terrye Kocher, and Praful Shah. A Benchmark of
NonStop SQL Release 2 Demonstrating Near-Linear Speedup and Scaleup on Large
Databases. Technical Report 89.4, Tandem Computers Inc., May 1989.
[17] K.P. Eswaran, J.N. Gray, R.A. Lorie, and I.L. Traiger. The Notions of Consistency
and Predicate Locks in a Database System. Comm. of the ACM, 19(11):624-633,
November 1976.
[18] Anon. et al. A Measure of Transaction Processing Power. Datamation, pages 112118, April 1985.
[19] Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry
Rudolph, and Marc Snir. The NYU Ultracomputer - Designing an MIMD Shared
Memory Parallel Computer. IEEE Trans on Computers, C-32(2):175-189, February
1983.
[20] Jim Gray. Notes on Data Base Operating Systems. Set of course notes, 1977.
[21] Jim Gray, editor. The Benchmark Handbookfor Database and Transaction Processing Systems. Morgan Kaufmann, San Mateo, California, 1991.
[22] Jim Gray, Bob Good, Dieter Gawlick, Pete Homan, and Harald Sammer. One
Thousand Transactions per Second. In Proc IEEE Compcon '85 Conf, pages 96101, San Francisco, California, February 1985.
[23] Jim Gray, Bob Horst, and Mark Walker. Parity Striping of Disc Arrays: Low-Cost
Reliable Storage with Acceptable Throughput. In Proc 16th Int'l Conf on Very
Large Databases, pages 148-161, Brisbane, Australia, August 1990.
[24] Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom
Price, Franco Putzolu, and Irving Traiger. The Recovery Manager of the System
R Database Manager. ACM Computing Surveys, 13(2):223-242, June 1981.
180
[25] Jim Gray and Frank Putzolu. The 5 Minute Rule for Trading Memory for Disc
Accesses and The 10 Byte Rule for Trading Memory for CPU Time. In Proc
SIGMOD '87 Conf, pages 395-398, San Francisco, California, May 1987.
[26] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques.
Morgan Kaufmann, San Mateo, California, 1993.
[27] The Tandem Database Group. NonStop SQL: A Distributed, High-Performance
High-Availability Implementation of SQL. Technical Report 87.4, Tandem Computers Inc., April 1987.
[28] The Tandem Performance Group. A Benchmark of NonStop SQL on the Debit
Credit Transaction. In Proc SIGMOD '88 Conf, pages 337-341, Chicago, Illinois,
June 1988.
[29] Linley Gwennap. 1992 in Review: The Top RISC Processors.
Report, 6(17), 1992.
Microprocessor
[30] Theo Haerder and Andreas Reuter. Principles of Transaction-Oriented
Recovery. ACM Computing Surveys, 15(4):287-317, December 1983.
Database
[31] Robert B. Hagmann. A Crash Recovery Scheme for a Memory-Resident Database
System. IEEE Trans on Computers, C-35(9):839-843, September 1986.
[32] Robert B. Hagmann and Hector Garcia-Molina. Implementing Long Lived Transactions Using Log Record Forwarding. Technical Report CSL-91-2, Xerox Palo Alto
Research Center, February 1991.
[33] Robert Holbrook. NonStop SQL - A Distributed Relational DBMS for OLTP. In
Proc IEEE Compcon '88 Conf, pages 418-421, San Francisco, California, February
1988.
[34] John Kaunitz and Louis Van Ekert. Audit Trail Compaction for Database Recovery.
Comm. of the ACM, 27(7):678-683, July 1984.
[35] John S. Keen and William J. Daily. Performance Evaluation of Ephemeral Logging.
In Proc 1993 ACM SIGMOD Int'l Conf on Management of Data, pages 187-196,
Washington, D.C., May 1993.
[36] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language.
Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978.
[37] Walter H. Kohler. A Survey of Techniques for Synchronization and Recovery in
Decentralized Computer Systems. ACM Computing Surveys, 13(2):149-183, June
1981.
[38] Akhil Kumar. A Crash Recovery Algorithm Based on Multiple Logs that Exploits
Parallelism. In Proc 2nd IEEE Symp on Parallel and Distributed Processing,pages
156-159, Dallas, Texas, December 1990.
181
[39] Tobin J. Lehman and Michael J. Carey. A Recovery Algorithm for A HighPerformance Memory-Resident Database System. In Proc SIGMOD '87 Conf, pages
104-117, San Francisco, California, May 1987.
[40] Henry Lieberman and Carl Hewitt. A Real-Time Garbage Collector Based on the
Lifetime of Objects. Comm. of the ACM, 26(6):419-429, June 1983.
[41] David B. Lomet. Recovery for Shared Disk Systems Using Multiple Redo Logs.
Technical Report CRL 90/4, DEC, October 1990.
[42] Nancy Lynch. I/O Automata: A Model for Discrete Event Systems. Technical
Report MIT/LCS/TM-351, MIT, March 1988.
[43] Nancy A. Lynch and Mark R. Tuttle. An Introduction to Input/Output Automata. CWI-Quarterly, 2(3):219-246, September 1989. Centrum voor Wiskunde
en Informatica, Amsterdam, The Netherlands. Also, appeared as Technical Memo
MIT/LCS/TM-373.
[44] Jeff Moad. Relational Takes on OLTP. Datamation, pages 36-39, May 1991.
[45] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwartz.
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and
Partial Rollbacks Using Write-Ahead Logging. ACM Trans on Database Systems,
17(1):94-162, March 1992.
[46] Philip A. Neches. The Anatomy of a Data Base Computer System. In Proc IEEE
Compcon '85 Conf, pages 252-254, San Francisco, California, February 1985.
[47] Philip A. Neches. The Anatomy of a Data Base Computer System - Revisited. In
Proc IEEE Compcon '86 Conf, pages 374-377, San Francisco, California, March
1986.
[48] Michael D. Noakes, Deborah A. Wallach, and William J. Dally. The J-Machine
Multicomputer: An Architectural Evaluation. In Proceedings of the 20th International Symposium on Computer Architecture, pages 224-235, San Diego, California,
May 1993. IEEE.
[49] S.C. North and J.H. Reppy. Concurrent Garbage Collection on Stock Hardware.
In Proc Functional ProgrammingLanguagesand Computer Architecture (SpringerVerlag LNCS 274), pages 113-133, Portland, Oregon, September 1987.
[50] Peter R. Nuth and William J. Dally. The J-Machine Network. In Proceedings of the
International Conferenceon Computer Design: VLSI in Computersand Processors,
pages 420-423. IEEE, October 1992.
[51] David Patterson, Peter Chen, Garth Gibson, and Randy Katz. Introduction to
Redundant Arrays of Inexpensive Disks (RAID). In Proc IEEE Compcon '89 Conf,
pages 112-117, San Francisco, California, February 1989.
182
[52] David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant
Arrays of Inexpensive Disks (RAID). In Proc SIGMOD '88 Conf, pages 109-116,
Chicago, Illinois, June 1988.
[53] Don Pierce, Marty Czekalski, and Charles Cassidy. Solid State Technology Answers
The Need For High Performance I/O. Computer Technology Review, pages 40-45,
Summer 1993.
[54] David Reed and Liba Svobodova. SWALLOW: A Distributed Data Storage System
for a Local Network. In Local Networks for Computer Communications: Proceedings
of the IFIP Working Group 6.4 International Workshop on Local Networks, pages
355-373, Zurich, Switzerland, August 1980. North-Holland.
[55] Mike Ricciuti. Teradata Adds Power for OLTP. Datamation, pages 89-90, May
1991.
[56] Mendel Rosenblum and John K. Ousterhout.
The LFS Storage Manager. In Proc
Summer '90 USENIX Technical Conf, Anaheim, California, June 1990.
[57] Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a
Log-Structured File System. In Proc of the 13th ACM Symp on OperatingSystems
Principles (SOSP), pages 1-15, Pacific Grove, CA, October 1991.
[58] Mendel Rosenblum and John K. Ousterhout.
The Design and Implementation of
a Log-Structured File System. ACM Trans on Computer Systems, 10(1):26-52,
February 1992.
[59] Ravi Sharma and Mary Lou Soffa. Parallel Generational Garbage Collection. In
Proc Object-Oriented Programming Systems, Languages, and Applications Conf,
pages 16-32, Phoenix, Arizona, October 1991.
[60] Jack Shemer and Phil Neches. The Genesis of a Database Computer. IEEE Computer Magazine, 17(11):42-56, November 1984.
[61] Emy Tseng and David Reiner. Parallel Database Processing on the KSR1 Com-
puter. In Proc 1993 ACM SIGMOD Int'l Conf on Management of Data, pages
453-455, Washington, D.C., May 1993.
[62] David Ungar. Generation Scavenging: A Non-disruptive High Performance Storage Reclamation Algorithm. In Proc ACM SIGSOFT/SIGPLAN
Software Engi-
neering Symp on Practical Software Development Environments, pages 157-167,
Pittsburgh, Pennsylvania, April 1984.
[63] Benjamin Zorn. Comparing Mark-and-sweep and Stop-and-copy Garbage Collec-
tion. In Proc ACM Symp on Lisp and Functional Programming,pages 87-98, Nice,
France, June 1990.
183
Download