ARIES: A Transaction Recovery Method Supporting

advertisement
ARIES: A Transaction Recovery Method
Supporting Fine-Granularity Locking
and Partial Rollbacks Using
Write-Ahead Logging
C. MOHAN
IBM Almaden
Research
Center
and
DON HADERLE
IBM Santa Teresa
Laboratory
and
BRUCE
LINDSAY,
IBM Almaden
HAMID
Research
In this paper we present
PIRAHESH
and PETER SCHWARZ
Center
and efficient method, called ARIES ( Algorithm
for Recouery
which supports partial
rollbacks
of transactions,
finegranularity
(e. g., record) locking and recovery using write-ahead logging (WAL). We introduce
history
to redo all missing updates before performing
the rollbacks of
the paradigm of repeating
the loser transactions
during restart after a system failure. ARIES uses a log sequence number
in each page to correlate the state of a page with respect to logged updates of that page. All
updates of a transaction
are logged, including those performed during rollbacks. By appropriate
chaining of the log records written during rollbacks to those written during forward progress, a
bounded amount of logging is ensured during rollbacks even in the face of repeated failures
during restart or of nested rollbacks
We deal with a variety of features that are very Important
transaction processing system ARIES supports
in building and operating an industrial-strength
fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high
concurrency lock modes (e. g., increment /decrement) which exploit the semantics of the operations and require the ability
to perform operation
logging. ARIES is flexible
with respect
to the kinds of buffer management
policies that can be implemented.
It supports objects of
varying
length efficiently.
By enabling
parallelism
during restart, page-oriented
redo, and
logical undo, it enhances concurrency and performance.
We show why some of the System R
paradigms for logging and recovery, which were based on the shadow page technique, need to be
changed in the context of WAL. We compare ARIES to the WAL-based
recovery methods of
and
Isolation
Exploiting
a simple
Semantics),
Authors’ addresses: C Mohan, Data Base Technology Institute,
IBM Almaden Research Center,
San Jose, CA 95120; D. Haderle, Data Base Technology Institute,
IBM Santa Teresa Laboratory, San Jose, CA 95150; B. Lindsay, H. Pirahesh, and P. Schwarz, IBM Almaden Research
Center, San Jose, CA 95120.
Permission to copy without fee all or part of this material is granted provided that the copies are
not made or distributed
for direct commercial advantage, the ACM copyright notice and the title
of the publication
and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery.
To copy otherwise, or to republish, requires a fee and/or
specific permission.
@ 1992 0362-5915/92/0300-0094
$1.50
ACM Transactions on Database Systems, Vol
17, No. 1, March 1992, Pages 94-162
ARIES: A Transaction Recovery Method
.
95
DB2TM, IMS, and TandemTM systems. ARIES is applicable not only to database management
systems but also to persistent
object-oriented
languages,
recoverable
file systems and
transaction-based
operating
systems. ARIES has been implemented,
to varying
degrees, in
IBM’s OS/2TM Extended Edition Database Manager, DB2, Workstation
Data Save Facility/VM,
Starburst and QuickSilver,
and in the University
of Wisconsin’s EXODUS and Gamma database
machine.
Categories
dures,
and Subject
checkpoint/
Management]:
fault
Physical
tems—concurrency,
and
D.4.5
[Operating
Systems]:
Reliability–backup
proce-
[Data]: Files– backup/ recouery; H.2.2 [Database
and restart;
H.2.4
[Database Management]: Sys-
tolerance;
E.5.
Design–reco~ery
transaction
tration—logging
General
Descriptors:
restart,
H.2.7 [Database
processing;
Management]:
Database
Adminis-
recovery
Terms: Algorithms,
Designj
Performance,
Additional
Key Words and Phrases: Buffer
write-ahead logging
Reliability
management,
latching,
locking,
space management,
1. INTRODUCTION
In
this
section,
first
we
introduce
some
ery, concurrency
control,
and buffer
organization
of the rest of the paper.
1.1
Logging,
Failures,
and Recovery
The transaction
concept,
for a long
It encapsulates
time.
which
and Durability)
properties
not limited
to the database
Guaranteeing
concurrent
important
been
the
execution
problem
developed
performance
methods
judged
have
using
in
concepts
relating
and then
to
recov-
we outline
the
Methods
is well
the
understood
ACID
by now,
(Atomicity,
has been
Consistency,
around
Isolation
[361. The application
of the transaction
concept is
area [6, 17, 22, 23, 30, 39, 40, 51, 74, 88, 90, 1011.
atomicity
and
durability
of transactions,
in
the
face
of
of multiple
transactions
and various
failures,
is a very
in transaction
processing.
While
many
methods
have
the
past
characteristics,
not always
several
basic
management,
to
deal
and
the
been
metrics:
with
acceptable.
degree
this
complexity
problem,
and
Solutions
of concurrency
the
assumptions,
ad hoc nature
to this
supported
problem
within
of such
may
be
a page
and across pages, complexity
of the resulting
logic, space overhead
on nonvolatile
storage and in memory for data and the log, overhead
in terms of the
number
of synchronous
and asynchronous
1/0s required
during restart
recovery and normal
processing,
kinds of functionality
supported
tion rollbacks,
etc.), amount
of processing
performed
during
degree of concurrent
processing
supported
during
restart
system-induced
transaction
rollbacks
caused by deadlocks,
(partial
restart
transacrecovery,
recovery,
extent of
restrictions
placed
‘M AS/400, DB2, IBM, and 0S/2 are trademarks
of the International
Business Machines Corp.
Encompass, NonStop SQL and Tandem are trademarks
of Tandem Computers, Inc. DEC, VAX
DBMS, VAX and Rdb/VMS are trademarks of Digital Equipment
Corp. Informix is a registered
trademark of Informix Software, Inc.
ACM Transactions on Database Systems, Vol. 17, No 1, March 1992.
96
C. Mohan et al
.
on stored data (e. g., requiring
unique
keys for all records,
mum size of objects to the page size, etc.), ability
to support
which
allow
the
concurrent
execution,
based
restricting
maxinovel lock modes
on commutativity
and
other
properties
[2, 26, 38, 45, 88, 891, of operations
like increment/decrement
on
the same data by different
transactions,
and so on.
In this
paper
we introduce
a new recovery
method,
called
ARL?LSl
(Algorithm
very well
flexibility
for Recovery
and Isolation
Exploiting
Semantics),
which
fares
with respect to all these metrics.
It also provides
a great deal of
to take
advantage
of some special
characteristics
of a class
of applications
that
of applications
for better
performance
(e. g., the kinds
IMS Fast Path [28, 421 supports
efficiently).
To meet transaction
and data recovery
guarantees,
ARIES records in a log
the progress
able
data
of a transaction,
objects.
transaction’s
types
back).
records
and its actions
log becomes
committed
of failures,
When the
also
The
actions
the
which
source
are reflected
for
cause changes
ensuring
in the database
or that its uncommitted
actions
logged actions
reflect
data object
become
the
source
for
reconstruction
to recover-
either
that
despite
the
various
are undone
(i.e., rolled
content,
then those log
of damaged
or lost
data
(i.e., media recovery).
Conceptually,
the log can be thought
of as an ever
growing
sequential
file. In the actual implementation,
multiple
physical
files
may be used in a serial fashion
to ease the job of archiving
log records [151.
Every
record
log record is assigned a unique
log sequence number
(LSN)
is appended to the log. The LSNS are assigned in ascending
when that
sequence.
Typically,
they are the logical addresses of the corresponding
log records. At
[6’71. If more
times, version
numbers
or timestamps
are also used as LSNS
than one log is used for storing
the log records relating
to different
pieces of
data, then a form of two-phase
commit
protocol
(e. g., the current
industrystandard
Presumed
Abort protocol
[63, 641) must be used.
The nonvolatile
version
of the log is stored on what is generally
called
stable storage. Stable storage means nonvolatile
storage which remains
intact
Disk is an example
of nonvolatile
and available
across system
failures.
storage and its stability
is generally
improved
by maintaining
synchronously
two identical
copies of the log on different
devices.
We would
expect
online log records stored on direct access storage devices to be archived
cheaper and slower medium
like tape at regular
intervals.
The archived
records
may
be discarded
once the appropriate
image
copies
(archive
the
to a
log
dumps)
of the database
have been produced
and those log records
are no longer
needed for media recovery.
Whenever
log records are written,
they are placed first only in the volatile
storage
(i.e., virtual
storage)
buffers
of the log file. Only at certain
times
(e.g., at commit time) are the log records up to a certain point (LSN) written,
in log page sequence, to stable storage.
This is called
forcing
the log up to
that LSN. Besides forces caused by transaction
and buffer
manager
activi -
1 The choice of the name ARIES, besides its use as an acronym that describes certain features of
our recovery method, is also supposed to convey the relationship
of our work to the Starburst
project at IBM, since Aries is the name of a constellation.
ACM TransactIons on Database Systems, Vol. 17, No 1, March 1992
ARIES: A Transaction Recovery Method
ties, a system
buffers as they
process
fill up.
may,
For ease of exposition,
in
the
we assume
background,
that
periodically
each log record
.
force
describes
performed
to only a single page. This is not a requirement
in the Starburst
[87] implementation
of ARIES,
sometimes
97
the
log
the update
of ARIES.
In fact,
a single log record
might
be written
to describe updates to two pages. The undo (respectively,
redo) portion
of a log record provides
information
on how to undo (respectively,
redo) changes
performed
by the transaction.
A log record
which
contains
both
record.
information
(e.g.,
fields
undo
and the
a log
or only
log record
that
the
Sometimes,
the undo
or an undo-only
is performed,
before
within
subtract
3 from
high concurrency
performed
the
the update
the object)
redo
record
information
may
information.
log record,
undo-redo
information
For
example,
with
of the model
ARIES
of [3], which
exclusively
(X mode)
uses the widely
of the commercial
is called
the
log
redo
a redo-only
on the action
be recorded
physically
images
or values
of specific
add 5 to field 3 of record 15,
logging
permits
semantics
of the
certain
operations,
the
the use of
operations
same field
updates
of many transactions.
These
is permitted
by the strict executions
essentially
for commit
accepted
and prototype
undo-redo
only
Depending
may
update
(e.g.,
field 4 of record 10). Operation
lock modes, which exploit
the
on the data.
property
Such a record
and after the
or operationally
an
to contain
respectively.
of a record could have uncommitted
permit
more concurrency
than what
be locked
is called
be written
write
says that
ahead
systems
modified
objects
must
duration.
logging
(WAL)
based on WAL
protocol.
are IBM’s
Some
AS/400TM
[9, 211, CMU’S Camelot
961, Unisys’s
DMS/1100
[23, 901, IBM’s DB2TM [1, 10,11,12,13,14,15,19,
35,
[271, Tandem’s
EncompassTM
[4, 371, IBM’s IMS [42,
m [161, Honeywell’s
MRDS [911,
43, 53, 76, 80, 941, Informix’s
Informix-Turbo
[29], IBM’s
0S/2
Extended
Tandem’s
NonStop
SQL ‘M [95], MCC’S ORION
EditionTM
Database
Manager
[71, IBM’s
QuickSilver
[40], IBM’s
Starburst
[871, SYNAPSE
[781, IBM’s
System/38
[99], and DEC’S VAX DBMSTM
and
VAX Rdb/VMSTM
[811. In WAL-based
systems, an updated
page is written
back to the same nonvolatile
storage location
from where it was read. That
is, in-place
what
updating
happens
is performed
in the shadow
on nonvolatile
page technique
which
storage.
Contrast
is used in systems
this
with
such as
System R [311 and SQL/DS
[51 and which is illustrated
in Figure
1. There the
updated
version
of the page is written
to a different
location
on nonvolatile
storage and the previous
version of the page is used for performing
database
recovery
if the system were to fail before the next checkpoint.
The WAL protocol
asserts that the
some data must already
be on stable
allowed
to replace the previous
version
That
is, the system
storage
records
storage.
is not allowed
version
of the
which describe
To enable the
method
of recovery
describes
the most
log records representing
changes
to
storage
before the changed
data is
of that data on nonvolatile
storage.
to write
an updated
page to the nonvolatile
database
until
at least the undo portions
of the log
the updates to the page have been written
to stable
enforcement
of this protocol,
systems using the WAL
store
recent
in every page the LSN
of the log record
that
update
performed
on that
page. The reader
is
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
98
.
C Mohan et al.
Map
Page
~
Fig. 1.
Shadow page technique.
Logical page LPI IS read from physical page PI and after
modlflcat!on IS wr!tten to physical page PI’ P1’ IS the current
vers!on and PI IS the shadow version
During
a checkpoint,
the
shadow
shadow
verson
recovety
IS performed
us!ng
to [31, 971 for discussions
ered to be better
which
of the
than
shadowing
problems
of the
some of the important
data
why
the
original
the
On
a failure,
log
and
current
the
version
data
shadow
page
and they
technique
base
version
is consid-
[16, 781 discuss
a separate
shadow
drawbacks
the
and
WAL
page technique.
using
also
base
about
the shadow
is performed
d]scarded
IS
the
of the
referred
version
becomes
methods
log. While
these
avoid
approach,
they
still
introduce
in
some
retain
some new ones. Similar
comments
apply to the methods suggested in [82, 881. Later, in Section 10, we
show why some of the recovery
paradigms
of System R, which were based on
the shadow page technique,
are inappropriate
in the WAL context,
when we
need support
are described
for high levels
in Section 2.
Transaction
status
of concurrency
is also
stored
in
and various
the
log
and
other
features
no transaction
considered
complete
until its committed
status and all its log data
recorded
on stable storage by forcing
the log up to the transaction’s
log record’s
LSN.
This
allows
a restart
recovery
procedure
that
can
be
are safely
commit
to recover
any
transactions
that completed
successfully
but whose updated
pages were not
physically
written
to nonvolatile
storage before the failure
of the system.
This means that
a transaction
is not permitted
to complete
its commit
processing
(see [63, 64]) until
the redo portions
of all log records
of that
transaction
have been written
to stable storage.
We deal with three types of failures:
transaction
or process, system, and
media or device. When a transaction
or process failure
occurs, typically
the
transaction
would
be in such a state that
its updates
would
have to be
undone.
It is possible
that
the
buffer
pool if it was
the process disappeared.
in the
When
storage
contents
be lost
restarted
and
the
database
contents
recovered
the
would
recovery
and
of
that
using
the
log.
image
When
would
copy
had
corrupted
some pages
in the
middle
of performing
some updates
when
the virtual
a system failure
occurs, typically
and
performed
media
an
transaction
the
using
a media
be
lost
(archive
transaction
the
or device
and
system
nonvolatile
the
dump)
failure
lost
would
have
storage
data
version
occurs,
would
of the
to be
versions
of
typically
have
lost
data
the
to
be
and
log.
Forward
processing
refers to the updates performed
when the system is in
normal
(i. e., not restart
recovery)
processing
and the transaction
is updating
ACM TransactIons on Database Systems, Vol
17, No. 1, March 1992.
ARIES: A Transaction Recovery Method
the database
because
of the data
user or the application
and using
the log to generate
to the ability
later
manipulation
program.
That
the (undo)
to set up savepoints
in the transaction
the transaction
request
(e.g.,
update
during
the
the rolling
since the establishment
concept
is exposed
only
place
if a partial
another
with
at the application
database
partial
rollback
were
rollback
whose
A
to be later
point
Partial
issued
by the
back
rollback
refers
of a transaction
of the changes
savepoint
and
performed
by
[1, 31]. This
is
all updates
of the transaction
Whether
or not the savepoint
is immaterial
nested
calls
99
is not rolling
execution
of a previous
level
recovery.
calls.
back
to be contrasted
with
total rollback
in which
are undone and the transaction
is terminated.
deals
SQL)
is, the transaction
.
rollback
followed
to us since this
paper
is said to have
taken
by a total
of termination
rollback
is an earlier
point
or
in the
transaction
than the point of termination
of the first rollback.
Normal
undo
refers to total or partial
transaction
rollback
when the system is in normal
operation.
A normal
or it may
constraint
be system
violations).
restart
recovery
undo
may be caused
by a transaction
request
to rollback
initiated
because of deadlocks
or errors (e. g., integrity
Restart
undo refers to transaction
rollback
during
after
a system
failure.
To make
partial
or total
rollback
efficient
and also to make debugging
easier, all the log records written
by a
transaction
are linked
via the PreuLSN
field of the log records in reverse
chronological
order.
That
transaction
would
point
that transaction,
if there
the
updates
performed
is,
the
most
recently
written
log
record
of the
to the previous
most recent log record written
by
is such a log record.2 In many WAL-based
systems,
during
a rollback
are logged
using
what
are
called
compensation
log records (CLRS)
[151. Whether
a CLR’S update
is undone,
should that CLR be encountered
during
a rollback,
depends on the particular
system.
As we will
see later,
in ARIES,
a CLR’S
update
is never
undone
and
hence CLRS are viewed as redo-only
log records.
Page-oriented
redo is said to occur if the log record whose update is being
redone describes which page of the database
was originally
modified
during
normal
processing
and if the same page is modified
during
the redo processing. No internal
descriptors
of tables or indexes need to be accessed to redo
the update.
That
is to be contrasted
is, no other
with
page of the database
logical
redo
which
and AS/400
for indexes
[21, 621. In those
not logged separately
but are redone using
needs to be examined.
is required
in System
systems, since
the log records
This
R, SQL/DS
index changes are
for the data pages,
performing
a redo requires
accessing
several
descriptors
and pages of the
database.
The index
tree
would
have
to be retraversed
to determine
the page(s) to be modified
and, sometimes,
the index page(s) modified
because
of this redo operation
may be different
from the index page(s) originally
modified
during
normal
processing.
Being able to perform
page-oriented
redo
allows
the
the system
recovery
to provide
of one page’s
recovery
contents
independence
does not
require
amongst
objects.
accesses
That
to any
is,
other
2 The AS/400, Encompass and NonStop SQL do not explicitly
link all the log records written by
backward scan of the log must be
a transaction.
This makes undo inefficient
since a sequential
performed to retrieve all the desired log records of a transaction.
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992
100
.
C. Mohan et al
(data or catalog) pages of the database.
media recovery very simple.
In a similar
Being
levels
fashion,
As we will
we can define
describe
page-oriented
undo
and
able to perform
logical
undos allows
the system
of concurrency
than what would be possible if the
restricted
only
to
page-oriented
undos.
appropriate
concurrency
control
of one transaction
to be moved
one were
restricted
to only
This
later,
this
makes
logical
undo.
to provide
higher
system were to be
is because
the
former,
with
protocols,
would permit uncommitted
updates
to a different
page by another
transaction.
If
page-oriented
undos,
then
the
latter
transaction
would have had to wait for the former
to commit.
Page-oriented
redo and
page-oriented
undo permit
faster recovery
since pages of the database
other
than the pages mentioned
in the log records are not accessed. In the interest
of efficiency,
interest
of
ARIES
supports
high
concurrency,
ARIES/IM
method
for
page-oriented
redo and its supports,
in
logical
undos.
In [62], we introduce
concurrency
control
and
recovery
and show the advantages
of being able to perform
ARIES/IM
with other index methods.
1.2
Latches
Normally
latches
and locks
has been
discussed
to a great
have
not
been
latches
are
other
hand,
semaphores.
data,
Usually,
locks
while
worry about
environment.
locks.
Also,
consistency
are usually
the deadlock
in such
or involving
Acquiring
and
are used to control
indexes
by comparing
detector
a manner
access to shared
extent
discussed
used
are used to assure
physical
Latches
are requested
alone,
undos
and Locks
Locking
the
in B ‘-tree
logical
the
the
to
in the
that
much.
guarantee
logical
physical
consistency
is not informed
about
deadlocks
releasing
a latch
is
much
on
like
of
We need
to
a multiprocessor
period than are
latch
cheaper
are
consistency
of data.
so as to avoid
and locks.
Latches,
Latches
since we need to support
held for a much shorter
latches
information.
literature.
waits.
Latches
involving
than
acquiring
latches
and
releasing
a lock. In the no-conflict
case, the overhead
amounts
to 10s of
instructions
for the former
versus 100s of instructions
for the latter.
Latches
are cheaper because the latch control information
is always in virtual
memory in a fixed place, and direct
addressability
to the latch information
is
possible given the latch name. As the protocols
presented
later in this paper
and those in [57, 621 show, each transaction
holds at most two or three
latches simultaneously.
As a result,
the latch request blocks can be permanently allocated
to each transaction
and initialized
with transaction
ID, etc.
right at the start of that transaction.
On the other hand, typically,
storage for
individual
locks has to be acquired,
formatted
and released
dynamically,
causing more instructions
to be executed to acquire and release locks. This is
advisable
because, in most systems, the number
of lockable
objects is many
orders of magnitude
greater
than the number
of latchable
objects. Typically,
all information
relating
to locks currently
held or requested
by all the
transactions
is stored in a single,
central
hash table.
Addressability
to a
particular
lock’s information
is gained
the address
of the hash anchor
and
pointers.
Usually,
ACM Transactions
in the
process
on Database Systems, Vol
by first hashing
then,
possibly,
of trying
to locate
17, No 1, March 1992
the lock
following
the
lock
name to get
a chain
of
control
block,
ARIES: A Transaction Recovery
because multiple
transactions
may be simultaneously
the contents
of the lock table,
one or more latches
released—one
latch
on the
hash
anchor
lock’s chain of holders and waiters.
Locks may be obtained
in different
IX
(Intention
exclusive),
and,
Method
reading
and modifying
will
be acquired
and
possibly,
one on the
modes such as S (Shared),
IS (Intention
Shared)
and
101
.
SIX
specific
X (exclusive),
(Shared
Intention
exclusive),
and at different
granularities
such as record (tuple),
table
tion), and file (tablespace)
[321. The S and X locks are the most common
(relaones.
S provides
the read privilege
and X provides
the read and write privileges.
Locks on a given object can be held simultaneously
by different
transactions
only if those locks’ modes are compatible.
The compatibility
relationships
amongst
the
above
modes
of locking
are shown
in Figure
2. A check
mark
(’<) indicates
that the corresponding
modes are compatible.
With
hierarchical locking,
the intention
locks (IX, IS, and SIX) are generally
obtained
on
the higher
levels of the hierarchy
(e.g., table),
and the S and X locks are
obtained
and X),
on the lower levels (e. g., record).
The nonintention
mode locks (S
when obtained
on an object at a certain
level of the hierarchy,
implicitly
grant locks of the corresponding
mode on the lower level objects of
that higher
level object. The intention
mode locks, on the other hand, only
give the privilege
of requesting
the corresponding
mode locks on the lower level objects. For example,
grants
S on all
the
records
explicitly
on the records.
defined in the literature
of that
table,
and
it
Additional,
semantically
[2, 38, 45, 551 and ARIES
intention
or nonintention
SIX on a table implicitly
allows
X to be requested
rich lock modes have been
can accommodate
them.
Lock requests
may be made with
the conditional
or the unconditional
option. A conditional
request means that the requestor
is not willing
to wait
if, when the request
is processed, the lock is not grantable
immediately.
An
unconditional
lock becomes
unconditional
request
means that the requestor
is willing
to wait until
the
grantable.
Locks
may be held for different
durations.
An
request for an instant
duration
lock means that the lock is not
to be actually
granted,
call with
the success
duration
locks
are released
long before transaction
when the transaction
The above
discussions
durations,
except
1.3
but the lock manager
has to delay returning
status
until
the lock becomes
grantable.
Fine-Granularity
Fine-granularity
database systems
some time
termination.
terminates,
concerning
for commit
after
they
are acquired
the lock
Manual
and, typically,
Commit
duration
locks are released only
i.e., after commit
or rollback
is completed.
conditional
duration,
apply
requests,
to latches
different
modes,
and
also.
Locking
(e.g., record) locking
has been supported
by nonrelational
(e.g., IMS [53, 76, 801) for a long time. Surprisingly,
only
few of the commercially
locking,
even though
a
available
relational
systems provide fine-granularity
IBM’s
System R [321, S/38 [991 and SQL/DS
[51, and
Tandem’s
Encompass
[37]
supported
record
and/or
key
the beginning.
3 Although
many interesting
problems
relating
locking
from
to providing
3 Encompass and S/38 had only X locks for records and no locks were acquired
these systems for reads.
automatically
ACM Transactions
by
on Database SyStanS, Vol. 17, No 1, March 1992
102
C. Mohan
.
Fig. 2.
matrix
Lock
et al.
m
mode comparability
lx
+
Slx
4
.’
fine-granularity
locking
in the context
of WAL
remain
to be solved, the
research community
has not been paying enough attention
to this area [3, 75,
88]. Some of the System R solutions
worked
only because of the use of the
shadow page recovery
technique
in combination
with
10). Supporting
fine-granularity
locking
and variable
flexible
fashion
requires
addressing
some interesting
issues
which
have
never
really
been
discussed
in
locking
length
storage
the
(see Section
records
in a
management
database
literature.
Unfortunately,
some of the interesting
techniques
that were developed
for
System R and which are now part of SQL/DS
did not get documented
in the
literature.
here
At
the
expense
some of those
As supporting
of making
problems
high
this
and their
concurrency
tion of an application
requiring
systems
gain in popularity,
paper
long,
we will
be discussing
solutions.
gains
very high
it becomes
importance
(see [79] for the descrip-
concurrency)
necessary
to
and as object-oriented
invent
concurrency
control
and recovery
methods
that take advantage
of the semantics
of the
operations
on the data [2, 26, 38, 88, 891, and that support
fine-granularity
locking
efficiently.
Object-oriented
systems may tend to encourage
users to
define
a large
to be the
view
of the
number
appropriate
database,
the container
locking
during
users
may
tend
of small
objects
granularity
the
and users
of locking.
concept
of a page,
may
In
with
the
many
terminal
object
instances
object-oriented
logical
its physical
of objects, becomes unnatural
to think
object accesses and modifications.
Also,
to have
expect
interactions
orientation
about as the
object-oriented
during
the
as
unit
of
system
course
of a
transaction,
thereby
increasing
the lock hold times.
If the
were to be a page, lock wait times and deadlock
possibilities
unit
will
of locking
be aggra-
vated.
Other discussions
concerning
transaction
management
oriented
environment
can be found in [22, 29].
As more and more customers
adopt
relational
systems
in
an object-
applications,
it becomes ever more important
77, 79, 83] and storage management
without
the system
users or administrators.
Since
to handle
requiring
relational
for
production
hot-spots [28, 34, 68,
too much tuning
by
systems
have been
welcomed
to a great extent because of their ease of use, it is important
that
we pay greater
attention
to this area than what has been done in the context
of the nonrelational
systems.
Apart
from the need for high concurrency
for
user data, the ease with
which
online
data definition
operations
can be
performed
in relational
systems by even ordinary
users requires
the support
for high concurrency
of access to, at least, the catalog data. Since a leaf page
in an index typically
describes
data in hundreds
of data pages, page-level
locking
of index data is just not acceptable.
A flexible
recovery
method that
ACM TransactIons on Database Systems, Vol
17, No. 1, March 1992.
ARIES: A Transaction Recovery Method
allows the
needed.
The
support
above
of high
facts
argue
levels
for
supporting
such as increment/decrement
rently
modify
even the same
increment
and decrement
of concurrency
during
semantically
.
index
rich
103
accesses
modes
is
of locking
which
allow multiple
transactions
to concurpiece of data. In funds-transfer
applications,
operations
are frequently
performed
on the branch
and teller balances by numerous
transactions.
If those transactions
to use only X locks, then they will be serialized,
even though their
are forced
operations
commute.
1.4
Buffer
The
buffer
manages
Management
manager
the
nonvolatile
(BM)
buffer
is the
pool
storage
and
version
component
does
1/0s
of the
to
of the database.
transaction
read/write
The
fix
pages
primitive
system
that
from/to
the
of the BM may
be used to request
the buffer
address of a logical
page in the database.
If
the requested
page is not in the buffer pool, BM allocates
a buffer
slot and
reads
when
the p~ge in. There may be instances
(e. g., during
a B ‘-tree page split,
the new page is allocated)
where the current
contents
of a page on
nonvolatile
storage
are not of interest.
In such a case, the
fix– new primitive
may be used to make the BM allocate
a ji-ee slot and return
the address of
that slot, if BM does not find the page in the buffer pool. The fix-new
invoker
will then format
the page as desired.
Once a page is fixed in the buffer pool,
the corresponding
buffer slot is not available
for page replacement
until
the
unfix primitive
is issued by the data manipulative
component.
Actually,
for
each page, BM keeps a fix count which is incremented
by one during
every
fix operation
and which is decremented
by one during
every unfix operation.
A page in the buffer pool is said to be dirty if the buffer version of the page
has
some
updates
which
are
not
yet
reflected
in
the
nonvolatile
storage
version of the same page. The fix primitive
is also used to communicate
the
intention
to modify the page. Dirty pages can be written
back to nonvolatile
the
modification
read accesses to the page while
storage
it is being
of BM
when
no fix
in writing
with
in the
nonvolatile
storage
if a system
failure
were
buffer
pool
pages
in the
other
pages
background,
to reduce
without
is held,
out.
basis,
of redo work
that
and also to keep a certain
nondirty
state
synchronous
write
so that
1/0s
they
having
thus
allowing
[96] discusses
on a continuous
the amount
to occur
intention
written
the role
dirty
would
pages
percentage
may
to
be needed
be replaced
to be performed
of the
with
at the
time of replacement.
While
performing
those writes,
BM ensures that the
WAL protocol
is obeyed. As a consequence,
BM may have to force the log up
to the LSN of the dirty page before writing
the page to nonvolatile
storage.
Given the large
of this nature
transactions
buffer pools that
to be very rare
committing
are common today, we would expect a force
and most log forces to occur because
of
or entering
the prepare
state.
BM also implements
the support
for latching
pages. To provide
direct
addressability
to page latches and to reduce the storage associated with those
latches, the latch on a logical page is actually
the latch on the corresponding
buffer
slot. This
means
that
a logical
page can be latched
only
after
it is fixed
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
104
C. Mohan et al
.
in the buffer
pool and the latch
These
are
stored
in the buffer
BCB
dirty
highly
has to be released
acceptable
control
conditions.
block
also contains
the identity
status of the page, etc.
(BCB)
The
before
latch
the page is unfixed.
control
information
for the corresponding
of the logical
page,
what
buffer
the fix
is
slot.
count
The
is, the
Buffer
management
policies
differ
among the many systems in existence
WAL-Based
Methods”).
If a page modified
by a
(see Section
11, “Other
transaction
is allowed to be written
to the permanent
database on nonvolatile
storage before that transaction
commits,
then the steal policy is said to be
Otherwise,
by the buffer manager
(see [361 for such terminologies).
policy is said to be in effect. Steal implies
that during
normal
followed
no-steal
restart
rollback,
some
volatile
storage
version
commit until
the database,
undo
work
of the
might
have
database.
said to occur
if, even in the virtual
not performed
database
calls.
updating
policy, during
transactions.
storage
database
buffers,
using
the pending
list
the transaction
is definitely
back, then the pending
list
policy
has
implications
information,
on whether
Organization
The
rest
of the
paper
is organized
is
are
the corresponding
elsewhere
and are
after
a transaction
as follows.
no
it is deter-
If the transaction
needs
or ignored.
The deferred
own updates or not, and on whether
partial
rollbacks
For more discussions
concerning
buffer management,
1.5
to
recovery,
updating
the updates
only
committing.
is discarded
non-
allowed
version of
a no-force
restart
Deferred
in-place
when
the transaction
issues
The updates
are kept in a pending
list
in-place,
mined that
to be rolled
on the
is not
all pages modified
by it are written
to the permanent
then a force policy is said to be in effect. Otherwise,
policy is said to be in effect. With a force
redo work will be necessary for committed
performed
to be performed
If a transaction
a
or
can
“see”
its
are possible or not.
see [8, 15, 24, 961.
After
stating
our
goals
in
Section
2 and giving
an overview
of the new recovery
method
ARIES
in Section 3, we present,
in Section 4, the important
data structures
used by
ARIES during normal
and restart
recovery processing.
Next, in Section 5, the
protocols followed
during normal processing
are presented
followed,
in Section
6, by the description
of the processing
performed
during
latter
section also presents
ways to exploit
parallelism
methods
for
performing
recovery
selectively
restart
during
or postponing
recovery.
recovery
the
The
and
recovery
of
some of the data.
checkpoints
during
impact
of failures
Then, in Section
7, algorithms
are described
for taking
the different
log passes of restart
recovery
to reduce the
during
recovery.
This is followed,
in Section
8, by the
description
of how
Section 9 introduces
fuzzy image copying
and media
the significant
notion of nested
a method
tiques
context
caused
detail
in
for
some
implementing
of
of the
shadow
by using
the
the
those
characteristics
different
ACM Transactions
systems
them
existing
page
efficiently.
recovery
technique
paradigms
in the
of many
such
as
of the
IMS,
on Database Systems, Vol
Section
paradigms
and
System
WAL
recovery
are supported.
top actions and presents
R. We
context.
WAL-based
DB2,
10
which
Encompass
17, No. 1, March 1992
describes
and
cri-
originated
in
the
discuss
Section
recovery
and
the
problems
11 describes
in
methods
in use
NonStop
SQL.
ARIES: A Transaction Recovery Method
.
105
Section 12 outlines
the many different
properties
of ARIES.
We conclude by
summarizing,
in Section 13, the features
of ARIES which provide
flexibility
and efficiency,
and by describing
the extensions
and the current
status of the
implementations
of ARIES.
Besides presenting
a new recovery
method,
by way of motivation
for our
work,
we also
describe
some
previously
unpublished
aspects
of recovery
in
System R. For comparison
purposes,
we also do a survey
of the recovery
methods used by other WAL-based
systems and collect information
appearing
in several
publications,
aims in
resulting
many
of which
are not widely
this paper
is to show the intricate
from the different
choices made for
available.
One of our
and unobvious
interactions
the recovery
technique,
the
granularity
of locking
and the storage
management
scheme.
One cannot
make arbitrarily
independent
choices for these and still expect the combination to function
together
correctly
and efficiently.
This point
needs to be
emphasized
books
cover,
as it
is not
always
dealt
with
adequately
on concurrency
control
and recovery.
as much as possible, all the interesting
one encounters
processing
in building
and operating
in
most
papers
and
In this paper, we have tried to
recovery-related
problems
that
an
industrial-strength
transaction
system.
2. GOALS
This
section
lists
the goals
of our work
in designing
a recovery
method
The goals relate to the metrics
discussed
earlier,
Simplicity.
and program
algorithms
for
paper
is long
that
Concurrency
for, compared
are
strived
bound
a simple,
the
the difficulties
involved
1.1.
and recovery
with
other
be error-prone,
powerful
and
of its comprehensive
ignored
Hopefully,
to
yet
because
are mostly
simple.
feeling.
in Section
and outlines
that supports the features
that we aimed for.
for comparison
of recovery
methods
that we
in the
literature,
overview
presented
are complex subjects to think
aspects of data management.
if
they
flexible,
are
discussion
the
complex.
algorithm.
main
in Section
Hence,
Although
of numerous
algorithm
3 gives
about
The
itself
we
this
problems
is quite
the reader
that
Operation
logging.
The recovery
method had to permit
operation
logging
(and value logging)
so that semantically
rich lock modes could be supported.
This would
let one transaction
modify
the same data that
was modified
earlier
by another
transaction
which
transaction:’
actions are semantically
has not yet committed,
when the
compatible
(e.g., increment/decrement
operations;
see [2, 26, 45, 881). As should be clear,
always perform
value or state logging
(i. e., logging
images
systems
of modified
that
data),
do very
cannot
physical
support
recovery
methods
which
before-images
and after-
operation
—byte-oriented—
two
logging.
logging
of all
This
includes
changes
to a
page [6, 76, 811. The difficulty
in supporting
operation
logging
is that we need
to track precisely,
using a concept like the LSN, the exact state of a page
with respect to logged actions relating
to that page. An undo or a redo of an
update
should
not be performed
without
being
sure that
the original
update
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992
.
C. Mohan et al
is present
or is not present,
106
transactions
that
need to know
respectively.
had previously
precisely
how
This
modified
the page
also means
a page
has been
start
affected
that,
if one or more
rolling
back,
during
then
we
the rollbacks
and how much of each of the rollbacks
had been accomplished
so far. This
requires
that
updates
performed
during
rollbacks
also be logged
via the
so-called
compensation
log records (CLRS).
The LSN concept lets us avoid
attempting
to redo
present
in the page.
an operation
when
the operation’s
effect
is already
It also lets us avoid attempting
to undo an operation
when
effect
the operation’s
is not present
in the page.
Operation
logging
lets
us perform,
thing
that
if found desirable,
logical
logging,
which means that not everywas changed
on a page needs to be logged
explicitly,
thereby
saving
log
space.
amount
of free space on the page,
For
example,
operations
can be performed
value logging,
see [881.
changes
of control
information,
need not be logged.
logically.
like
the
The redo and the undo
For a good discussion
of operation
and
Efficient
support for the storage and manipFlexible storage management.
ulation
of varying
length
data is important.
In contrast
to systems
like
IMS, the intent
here is to be able to avoid the need for off-line reorganization
of the data to garbage
collect
any space that
might
have been freed up
because of deletions
and updates
that caused data shrinkage.
It is desirable
that
the
that
the
the
data
logging
within
moved
this
that
data
recovery
method
and locking
a page for
to be locked
and the
concurrency
or the
movements
also means that one transaction
must
page currently
has some uncommitted
tion. This may lead to
log; logical undos may
a transaction
that has
space during
its later
permit
this
Partial
in data
rollbacks.
control
method
be such
is logical
in nature
so that
movements
garbage
collection
reasons
do not cause
to be logged.
For
an
of
the
index,
be able to split a leaf page even if
data inserted
by another
transac-
problems
in performing
page-oriented
undos using the
be necessary.
Further,
we would like to be able to let
freed up some space be able to use, if necessary,
that
insert
activity
[50]. System R, for example,
does not
pages.
It
was
essential
that
the
new
recovery
method
sup-
port the concept
of savepoints
and rollbacks
to savepoints
(i.e.,
partial
rollbacks).
This
is crucial
for handling,
in a user-friendly
fashion
(i. e.,
without
requiring
a total
rollback
of the transaction),
integrity
constraint
violations
information
Flexible
(see [1, 311), and
(see [49]).
buffer
problems
management.
arising
The recovery
from
using
method
should
obsolete
make
cached
the
least
number
of restrictive
assumptions
about the buffer
management
policies
(steal,
force, etc.) in effect. At the same time, the method
must be able to
take advantage
of the characteristics
of any specific policy that is in effect
(e.g., with a force policy there is no need to perform
any redos for committed
transactions.)
This flexibility
could result in increased
concurrency,
decreased
1/0s and efficient
usage of buffer
storage.
Depending
on the policies,
the
work
that
needs
ACM Transactions
to be performed
during
restart
on Database Systems, Vol. 17, No. 1, March 1992
recovery
after
a system
ARIES: A Transaction Recovery Method
failure
large
or during
media recovery
maybe
main
memories,
it must be noted
more
that
.
107
or less complex.
Even
a steal policy
is still
with
very
desirable.
This is because,
with
a no-steal
policy,
a page may never get
written
to nonvolatile
storage
if the page always
contains
uncommitted
updates due to fine-~anularity
locking
and overlapping
transactions’
updates
to that
page.
running
The
situation
transactions.
to frequently
by locking
reduce
all
would
be further
aggravated
those
conditions,
either
Under
concurrency
the
objects
by quiescing
on the
page)
if there
the system
all activities
and
then
are
would
additional
bookkeeping
in the general
Hence,
general
writing
the
page
to non-
11 with
reference
It should
independence.
perform
track
whether
a page
restart
incurs
contains
any
given our goal of supporting
semantiand varying
length objects efficiently,
undo
logging
and in-place
updating.
like the transaction
workspace
model of AIM [46] are not
for our purposes.
Other
problems
relating
to no-steal
are
in Section
Recovery
and
to
case, we need to perform
methods
enough
discussed
overhead
We believe that,
partial
rollbacks
have
on the page (i.e.,
volatile
storage, or by doing nothing
special and then paying
a huge
redo recovery
cost if the system were to fail. Also, a no-steal
policy
uncommitted
updates.
cally rich lock modes,
long-
media
recovery
to IMS
Fast
be possible
or restart
Path.
to image
recovery
copy (archive
at different
dump),
granularities,
rather
than only at the entire
database
level. The recovery
of one object
should
not force the concurrent
or lock-step
recovery
of another
object.
Contrast
this with what happens
in the shadow page technique
as implemented
in System R, where index and space management
information
are
recovered
lock-step
with user and catalog
table (relation)
data by starting
from
an internally
to all the
processing.
some
consistent
state
related
objects
of the
Recovery independence
object,
catalog
descriptors
of that
may be undergoing
information
of the whole
in
object and its related
recovery
in parallel
the two may be out of synchronization
be possible
later point
devices.
database
and redoing
changes
database
simultaneously,
as in normal
means that, during the restart
recovery of
the
database
cannot
be
accessed
for
objects, since that information
itself
with the object being recovered
and
[141. During
restart
recovery,
it should
to do selective
recovery
and defer recovery
of some objects to a
in time to speed up restart
and also to accommodate
some offline
Page-oriented
recovery
means
that
even
if one page
in the
database
is corrupted
because of a process failure
or a media problem,
it should be
possible to recover that page alone. To be able to do this efficiently,
we need
to log
every
spans
multiple
conjunction
page’s
with
change
pages
and
the writing
will make media recovery
image
copying
of different
different
frequencies.
Logical
undo.
that
is different
This
from
individually,
the
update
even
affects
if the
more
of CLRS for updates
object
than
performed
being
updated
one page.
This,
during
rollbacks,
in
very simple (see Section 8). This will also permit
objects to be performed
independently
and at
relates
to the ability,
during
undo, to affect
the one modified
during
forward
processing,
a page
as is
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
108
.
C. Mohan et al.
needed in the earlier-mentioned
context of the split
index page containing
uncommitted
data of another
to perform
logical
undos
especially
in search
rollback
processing,
allows
higher
levels
by one transaction
of an
transaction.
Being able
of concurrency
to be supported,
structures
[57, 59, 621. If logging
is not performed
during
logical
undos would be very difficult
to support,
if we
also desired
recovery
independence
and page-oriented
recovery.
but
at the expense
of
R and SQL/DS
support
logical
undos,
independence.
Parallelism
and
fast
With
recovery.
multiprocessors
becoming
System
recovery
very
com-
mon and greater
recovery
method
data availability
becoming
increasingly
important,
the
has to be able to exploit
parallelism
during
the different
stages
recovery
that
of restart
the recovery
method
and
during
media
be such that
recovery.
recovery
It
is also
can be very
fast,
hot-standby
approach
is going to be used (a la IBM’s IMS/VS
Tandem’s
NonStop
[4, 371). This means that redo processing
possible,
undo
processing
should
be page-oriented
(cf.
always
important
if in fact
a
XRF [431 and
and, whenever
logical
redos
and undos in System R and SQL/DS
for indexes and space management).
It
should also be possible to let the backup system start processing
new transactions, even before the undo processing
for the interrupted
transactions
completes.
there
This
is necessary
were
long
Minimal
update
because
overhead.
Our
restart
recovery
normal
and
storage
consumption,
undo
processing
may
take
a long
time
if
transactions.
etc.)
goal
is to have
processing.
imposed
by the
good
The
performance
overhead
recovery
(log
method
both
data
during
volume,
in virtual
and
nonvolatile
storages for accomplishing
the above goals should be minimal.
Contrast
this with the space overhead
caused by the shadow page technique.
This goal also implied
that we should minimize
the number
of pages that are
modified
(dirtied)
during
restart.
The idea is to reduce the number
of pages
that have to be written
back to nonvolatile
storage and also to reduce CPU
overhead.
This rules out methods
which,
during
restart
recovery,
first undo
some committed
changes that had already
reached
the nonvolatile
storage
before the failure
and then redo them (see, e.g., [16, 21, 72, 78, 881). It also
rules
out
nonvolatile
methods
storage
in which
updates
that
are not present
in a page on
are undone
unnecessarily
(see, e.g., [41, 71, 881). The
method
should not cause deadlocks
involving
transactions
that are already
rolling
back. Further,
the writing
of CLRS should not result in an unbounded
number
of log records having
to be written
for a transaction
because of the
undoing
of CLRS, if there were nested rollbacks
or repeated
system failures
during
rollbacks.
It should also be possible
to take checkpoints
and image
copies without
quiescing
significant
activities
in the system. The impact
of
these operations
on other activities
should be minimal.
To contrast,
checkpointing
and image copying
in System R cause major perturbations
in the
rest of the system [31].
As the reader will have realized
by now, some of these goals are contradictory.
Based on our
features,
experiences
ACM Transactions
knowledge
with IBM’s
on Database Systems, Vol
of different
developers’
existing
systems’
existing
transaction
systems and contacts
17, No 1, March 1992
ARIES: A TransactIon Recovery Method
.
109
with customers,
we made the necessary tradeoffs.
We were keen on learning
from the past successes and mistakes
involving
many prototypes
and products.
3. OVERVIEW
The
aim
method
OF ARIES
of this
ARIES,
section
which
is to provide
satisfies
a brief
quite
overview
reasonably
of the
the goals that
Section
2. Issues like
deferred
and selective
restart,
restart recovery,
and so on will be discussed in the later
ARIES
guarantees
the
atomicity
and durability
new
recovery
we set forth
in
parallelism
during
sections of the paper.
properties
of transactions
in the fact of process,
transaction,
system
and media
failures.
For this
purpose, ARIES keeps track of the changes made to the database by using a
log and it does write-ahead
logging
(WAL).
Besides
logging,
on a peraffected-page
transactions,
(CLRS),
basis, update
ARIES
also
updates
during
both
partial
rollback
activities
performed
during
forward
logs, typically
using
compensation
performed
normal
and
in which
back two of them
during
restart
partial
a transaction,
and then
or total
rollbacks
Figure
3 gives
processing.
starts
going
after
performing
forward
again.
the two updates,
two CLRS are written.
In ARIES,
that they are redo-only
log records. By appropriate
log records
written
during
forward
processing,
processing
of
log records
of transactions
an example
three
Because
of a
updates,
rolls
of the undo
of
CLRS have the property
chaining
of the CLRS to
a bounded
amount
of logging
is ensured
during
rollbacks,
even in the face of repeated
failures
during
restart
or of nested rollbacks.
This is to be contrasted
with what happens
in
IMS, which may undo the same non-CLR
multiple
times, and in AS/400,
DB2
and NonStop
SQL, which, besides undoing
may also undo CLRS one or more times
severe
problems
In ARIES,
to be written,
action
in real-life
as Figure
customer
5 shows,
the CLR,
for redo purposes,
besides
the same non-CLR
multiple
(see Figure
4). These have
situations.
when
the undo
containing
is made
times,
caused
to contain
points
to the predecessor
of the just
information
is readily
available
since
of a log record
a description
the
causes
a CLR
of the compensating
UndoNxtLSN
pointer
which
undone
log record.
The predecessor
every log record,
including
a CLR,
contains
the PreuLSN
pointer
which points to the most recent preceding
log
record written
by the same transaction.
The UndoNxtLSN
pointer
allows us
to determine
precisely
how much of the transaction
has not been undone so
far. In Figure
5, log record 3’, which is the CLR for log record 3, points to log
record 2, which is the predecessor
of log record 3. Thus, during
rollback,
the
UndoNxtLSN
field
of the most recently
written
CLR keeps track
of the
progress
of rollback.
the transaction,
rollback
or if
bypass
those
It tells
the system
from
whereto
continue
the rollback
if a system failure
were to interrupt
the completion
a nested rollback
were to be performed.
It lets the
log
records
that
had
already
been
undone.
Since
of
of the
system
CLRS
are
available
to describe what actions are actually
~erformed
during
the undo of
an original
action, the undo action need not be, in terms of which page(s) is
affected, the exact inverse of the original
action. That is, logical undo which
allows
very
high
concurrency
to be supported
is made
possible.
ACM Transactions on Database Systems, Vol
For example,
17, No. 1, March 1992.
110
C. Mohan et al.
.
w
Fig. 3.
Partial
rollback
example.
12
Log
After
performing
rollback
log
I
3 actions,
by undoing
records
and
33’2’4
3
performs
the
actions
and
2,
act~ons
!3j
transaction
3 and
and
then
4 and
5
performs
2, wrlt!ng
starts
>
a patilal
the compensation
go[ng
forward
aga!n
Before Failure
1
Log
During
DB2, s/38,
Encompass --------------------------AS/400
Restart
,
2’”
3“
3’
3’
2’
1’
~
1;
lMS
>
)
I’ is the CLR for I and I“ is the CLR for I’
Fig. 4
Problem
of compensating
compensations
or duplicate
compensations,
or both
a key inserted
on page 10 of a B ‘-tree by one transaction
may be moved to
page 20 by another
transaction
before the key insertion
is committed.
Later,
if the first transaction
were to roll back, then the key will be located on page
20 by retraversing
the tree and deleted from there. A CLR will be written
to
describe the key deletion
on page 20. This permits
page-oriented
redo which
is very efficient.
[59, 621 describe
this logical undo feature.
ARIES
uses a single
a page is updated
and
placed in the page-LSN
LSN
ARIES/LHS
and ARIES/IM
on each page to track
the page’s
a log record is written,
the LSN
field of the updated
page. This
which
state.
exploit
Whenever
of the log record is
tagging
of the page
with
the LSN
allows
ARIES
to precisely
track,
for restartand mediarecovery
purposes,
the state of the page with respect to logged updates
for
that page. It allows ARIES to support
novel lock modes! using which, before
an update
performed
on a record’s
field by one transaction
is committed,
another
transaction
may be permitted
to modify
the same data for specified
operations.
Periodically
during
checkpoint
log records
and the
modified
needed
begin
normal
identify
processing,
ARIES
takes
checkpoints.
the transactions
that are active, their
LSNS of their
most recently
written
log records,
data (dirty data) that is in the buffer pool. The latter
to determine
from
where
the
redo
pass
of restart
its processing.
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992.
The
states,
and also
information
recovery
the
is
should
ARIES: A Transaction Recovery Method
Before
12
‘,;
\\
Log
-%
111
Failure
3
3’ 2’ 1!
)
F i-. ?%
/
/
-=---/
/
------
---
During
I
.
Restart
,,
----------------------------------------------+1
I’ is the Compensation Log Record for I
I’ points to the predecessor, if any, of I
Fig. 5.
During
from
this
ARIES’
restart
the first
analysis
technique
recovery
record
pass,
for avoiding compensating
compensations.
(see Figure
of the last
information
6), ARIES
checkpoint,
about
compensation
dirty
first
and duplicate
scans the log, starting
up to the end of the
pages
log.
and transactions
During
that
were
in progress at the time of the checkpoint
is brought
up to date as of the end of
the log. The analysis
pass uses the dirty pages information
to determine
the
starting
point
( li!edoLSIV)
for the log scan of the immediately
pass. The analysis
pass also determines
the list of transactions
rolled back in the undo pass. For each in-progress
transaction,
most recently
written
log record
will
also be determined.
following
redo
that are to be
the LSN of the
Then,
during
the redo pass, ARIES
repeats history, with respect to those updates logged on
stable storage, but whose effects on the database pages did not get reflected
on nonvolatile
storage
before
the
failure
of the
system.
This
is done for the
updates of all transactions,
including
the updates of those transactions
that
had neither
committed
nor reached the in-doubt
state of two-phase
commit by
the time
loser
of the system
transactions
failure
are
(i.e.,
redone).
even the missing
This
essentially
updates
reestablishes
of the so-called
the
state
of
the database
as of the time of the system failure.
A log record’s
update
is
redone if the affected page’s page-LSN
is less than the log record’s LSN. No
logging
is performed
when updates
are redone.
The redo pass obtains
the
locks needed to protect the uncommitted
updates of those distributed
transactions that will remain
in the in-doubt
(prepared)
state [63, 64] at the end of
restart
The
updates
recovery.
next log pass
are rolled
is the
back,
undo
in reverse
pass
during
chronological
which
order,
all
loser
transactions’
in a single
sweep
of
the log. This is done by continually
taking
the maximum
of the LSNS of the
next log record to be processed for each of the yet-to-be-completely-undone
loser transactions,
until
no transaction
remains
to be undone.
Unlike
during
the redo pass, performing
undos is not a conditional
operation
during
the
undo pass (and during
normal
undo).
That
is, ARIES
does not compare
the page.LSN
of the affected
page to the LSN of the log record to decide
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
112
C. Mohan et al
.
m
Log
@
Checkpoint
r’
Follure
i
DB2
System
Analysis
I
Undo Losers
/
*—————
——.
R
Redo Nonlosers
—— — ————,&
Redo Nonlosers
. ------
IMS
“––-––––––X*
..:--------
(FP Updates)
1 -------
ARIES
Redo ALL
Undo Losers
.-:”---------
Fig. 6,
whether
or not
transaction
to undo
during
the
Restart
the
processing
update.
undo
& Analysis
Undo Losers (NonFP Updates)
in different
When
pass,
if
it
methods.
a non-CLR
is an
I
is encountered
undo-redo
for
or undo-only
a
log
record, then its update is undone.
In any case, the next record to process for
that transaction
is determined
by looking
at the PrevLSN
of that non-CLR.
Since
CLRS
are never
undone
(i.e.,
CLRS
are not
compensated–
see Figure
5), when a CLR is encountered
during
undo, it is used just to determine
the
next log record to process by looking
at the UndoNxtLSN
field of the CLR.
For those transactions
which were already
rolling
back at the time of the
system failure,
ARIES
will rollback
only those actions
been undone.
This is possible since history
is repeated
and since the last CLR written
for each transaction
indirectly)
to the next
non-CLR
record
that
that had not already
for such transactions
points
(directly
or
is to be undone,
The net result
is
that, if only page-oriented
undos are involved
or logical undos generate
only
CLRS, then, for rolled back transactions,
the number
of CLRS written
will be
exactly equal to the number
of undoable)
log records
processing
of those transactions.
This will
be the
repeated
failures
4. DATA
STRUCTURES
This
4.1
section
describes
restart
the major
or if there
data
are nested
structures
that
rollbacks.
are used by ARIES.
Log Records
Below,
types
during
written
during
forward
case even if there
are
we describe
the
important
fields
that
may
of log records.
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992,
be present
in
different
ARIES: A Transaction Recovery Method
.
113
LSN.
Address
of the first byte of the log record in the ever-growing
log
address space. This is a monotonically
increasing
value. This is shown here
as a field only to make it easier to describe
ARIES.
The LSN need not
actually
be stored
Type.
regular
pare’),
in the record.
Indicates
update
whether
record
this
is a compensation
(’update’),
a commit
or a nontransaction-related
TransID.
Identifier
PrevLSN.
LSN
record
(e.g.,
of the transaction,
of the preceding
record
(’compensation’),
protocol-related
record
‘OSfile_return’).
if any, that
log record
wrote
written
the log record.
by the
tion. This field has a value of zero in nontransaction-related
the first log record of a transaction,
thus avoiding
the need
begin
transaction
PageID.
identifier
PageID
same transacrecords and in
for an explicit
log record.
Present
only in records of type ‘update’
or ‘compensation’.
of the page to which the updates of this record were applied.
will
normally
consist
of two
parts:
an objectID
(e.g.,
and a page number
within
that object. ARIES can deal with
contains
updates for multiple
pages. For ease of exposition,
only
a
(e. g., ‘pre-
The
This
tablespaceID),
a log record
we assume
that
that
one page is involved.
UndoNxtLSN.
Present
of this
transaction
that
UndoNxtLSN
is the value
only in CLRS. It is the LSN of the next log record
is to be processed
during
rollback.
That
is,
of PrevLSN
of the log record that the current
log
record is compensating.
If there
this field contains
a zero.
Data.
This
is the
redo
are no more
and/or
undo
data
log records
that
to be undone,
describes
was performed.
CLRS contain
only redo information
undone.
Updates
can be logged in a logical fashion.
the
then
update
that
since they are never
Changes
to some fields
(e.g., amount
of free space) of that page need not be logged since they can be
easily derived.
The undo information
and the redo information
for the entire
object need not be logged. It suffices if the changed fields alone are logged.
For increment
or decrement
types of operations,
before and after-images
of
the field are not needed.
Information
about the type of operation
and the
decrement
or increment
amount
is enough.
The information
here would also
be used to determine
redo and/or
4.2
One
undo
the appropriate
of this
action
routine
to be used to perform
the
log record.
Page Structure
of the
fields
in every
page
of the
database
is the
page-LSN
field.
It
contains
the LSN of the log record that describes
the latest update
to the
page. This record may be a regular
update record or a CLR. ARIES
expects
the buffer manager
to enforce the WAL protocol.
Except for this, ARIES does
not place any restrictions
on the buffer
page replacement
policy.
The steal
buffer
management
policy may be used. In-place
updating
is performed
on
nonvolatile
storage.
Updates
are applied
immediately
and directly
to the
ACM Transactions on Database Systems, Vol. 17, No, 1, March 1992.
114
.
buffer
as in
ing
C. Mohan et al.
version of the page containing
INGRES
[861 is performed.
and,
flexible
4.3
consequently,
enough
A table
deferred
not to preclude
Transaction
If
the object. That is, no deferred updating
it is found
desirable,
deferred
updat-
logging
those
can
policies
be
from
implemented.
being
ARIES
is
implemented.
Table
called
the
transaction
table
is used during
restart
recovery
to track
the state of active transactions.
The table is initialized
during
the analysis
pass from the most recent checkpoint’s
record(s)
and is modified
during
the
analysis
of the
log records
written
after
the
During
the undo pass, the entries
of the
checkpoint
is taken
during
restart
recovery,
will
be included
in
the
checkpoint
during
normal
processing
by the
important
fields of the transaction
TransID.
State.
Transaction
Commit
or unprepared
LastLSN.
record(s).
The
same
transaction
manager.
table follows:
checkpoint.
table
If a
table
is also
A description
used
of the
ID.
state of the transaction:
prepared
The LSN
The
If the most
of the latest
LSN
recent
of the
log record
log record
next
written
record
(’P’ –also
then this field’s
is a CLR, then
UndoNxtLSN
CLR.
value
from
that
written
called
in-doubt)
by the transaction.
to be processed
or seen for this
undoable
non-CLR
log record,
If that most recent log record
4.4
of that
are also modified.
the contents
of the
(’U’).
UndoNxtLSN.
back.
beginning
table
then
value will
this field’s
during
transaction
rollis an
be set to LastLSN.
value is set to the
Dirty_ Pages Table
A table called the dirty .pages table is used to represent
information
about
dirty buffer pages during
normal
processing.
This table is also used during
restart
recovery.
The actual implementation
of this table may be done using
hashing
or via the deferred-writes
queue mechanism
the table consists of two fields: PageID and RecLSN
normal
processing,
when a nondirty
the intention
to modify,
the buffer
of [961. Each entry in
(recovery
LSN). During
page is being fixed in the buffers
manager
records in the buffer pool
with
(BP)
dirty .pages table, as RecLSN,
the current
end-of-log
LSN, which will be the
LSN of the next log record to be written.
The value of RecLSN indicates
from
what point in the log there may be updates which are, possibly,
not yet in the
nonvolatile
storage version
of the page. Whenever
pages are written
back
to nonvolatile
storage, the corresponding
entries in the BP dirty _pages table
are removed.
record(s) that
The contents
of this table
are included
is written
during
normal
processing.
The
in the checkpoint
restart
dirty –pages
table
is initialized
from the latest
checkpoint’s
record(s)
and
during
the analysis
of the other
records
during
the analysis
ACM Transactions
on Database Systems, Vol
17, No 1, March 1992
is modified
pass. The
ARIES: A Transaction Recovery Method
minimum
RecLSN
pass during
restart
5. NORMAL
This
discusses
the
processing.
part
of recovering
5.1
Updates
During
table
gives
the
starting
point
for
115
the
redo
PROCESSING
section
transaction
value in the
recovery.
.
normal
actions
Section
from
a system
processing,
that
are
6 discusses
performed
the actions
as part
that
of normal
are performed
as
failure.
transactions
may be in forward
processing,
partial
rollback
or total rollback.
The rollbacks
may be system- or application-initiated.
The causes of rollbacks
may be deadlocks,
error conditions,
integrity
constraint
violations,
unexpected
database
state, etc.
If the granularity
of locking
is a record, then, when an update
is to be
performed
on a record in a page, after the record is locked, that
in the buffer and latched in the X mode, the update is performed,
page is fixed
a log record
is appended
to the log, the LSN of the log record is placed in the page .LSN
field of the page and in the transaction
table, and the page is unlatched
and
unfixed.
The page latch
is held
during
the call to the logger.
This
is done to
ensure that the order of logging
of updates of a page is the same as the order
in which those updates are performed
on the page. This is very important
if
some
of the
redo
information
is going
to be logged
amount
of free space in the page) and
guaranteed
for the physical
redo to work
be held during
read and update operations
the page contents.
This is necessary
might
move records around
within
such garbage
collection
is going
look at the page since they
repetition
correctly.
to ensure
transaction
get confused.
Readers
S mode and modifiers
latch in the X mode.
The data page latch is not held while any
necessary
performed.
held
At
most
two
page
(e.g.,
the
because inserters
and updaters
of records
a page to do garbage
collection.
When
on, no other
might
physically
of history
has to be
The page latch must
physical
consistency
of
latches
are
should
be allowed
of pages latch
index
operations
simultaneously
to
in the
(also
are
see
[57, 621). This means
that
two transactions,
T1 and T2, that
are modifying different
pieces of data may modify a particular
data page in one order
(Tl, T2) and a particular
index page in another
order (T2, T1).4 This scenario
is impossible
in System R and SQL/DS
since in those systems, locks, instead
of latches
are used for providing
physical
consistency.
Typically,
all the
(physical)
page locks are released only at the end of the RSS (data manager)
call. A single
RSS call deals with
modifying
the data and all relevant
indexes.
This
deadlocks
may
involve
waiting
(physical)
page
for many
locks
1/0s
alone
and locks.
or (physical)
This
means
page
that
locks
and
gets very complicated if operations like increment/decrement
are supported
high concurrency lock modes and indexes are allowed to be defined on fields on which
operations are supported. We are currently studying those situations.
with
such
4 The situation
involving
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
116
.
(logical)
System
C. Mohan et al
record/key
locks
are possible.
They
have
been
a major
problem
in
R and SQL/DS.
Figure
7 depicts
a situation
at the time
of a system
failure
which
followed
the commit
of two transactions.
The dotted lines show how up to date the
states of pages PI and P2 are on nonvolatile
storage with respect to logged
updates of those pages. During
restart
recovery,
it must be realized
that the
most recent log record written
for PI, which was written
by a transaction
which later committed,
needs to be redone, and that there is nothing
to be
redone for P2. This situation
points to the need for having
the LSN to relate
the state of a page on nonvolatile
and the need for knowing
where
some information
storage
restart
in the checkpoint
record
to a particular
position
redo pass should begin
(see Section
5.4).
in the log
by noting
For the example
scenario,
the restart
redo log scan should begin at least from the log record
representing
the most recent update of PI by T2, since that update needs to
be redone.
It is not assumed that a single log record can always accommodate
information
needed to redo or undo the update
operation.
There
instances
when
more
than
one record
needs
to be written
for this
all the
may be
purpose.
For example,
one record
may be written
with
the undo information
and
another
one with the redo information.
In such cases, (1) the undo-only
log
record should be written
before the redo-only
log record is written,
and (2) it
is the LSN of the redo-only
log record
field.
The first condition
is enforced
situation
in which
the
written
to stable storage
the redo of that redo-only
history
feature)
only
redo-only
that should be placed in the page.LSN
to make sure that we do not have
record
and
not
the
undo-only
before a failure,
and that during
log record is performed
(because
to realize
later
that
there
isn’t
restart
of the
an undo-only
record
a
gets
recovery,
repeating
record
to
undo the effect of that operation.
Given that the undo-only
record is written
before the redo-only
record, the second condition
ensures that we do not have
a situation
in which
even though
the page in nonvolatile
storage
already
contains
the
unnecessarily
the undo-only
redo could
update
during
record
cause
of the redo-only
record, that same update
gets redone
restart
recovery because the page contained
the L SN of
instead of that of the redo-only
record. This unnecessary
integrity
problems
if operation
There may be some log records written
cannot or should not be undone
(prepare,
logging
is being
performed.
during
forward
processing
free space inventory
update,
that
etc.
records). These are identified
as redo-only
log records. See Section 10.3 for a
discussion
of this kind of situation
for free space inventory
updates.
Sometimes,
the identity
of the (data) record to be modified
or read may not
be known before a (data) page is examined.
For example,
during
an insert,
the record ID is not determined
until the page is examined
to find an empty
slot. In such cases, the record lock must be obtained
after the page is latched.
To avoid waiting
for a lock while
holding
a latch, which
could lead to an
undetected
deadlock,
the lock is requested
conditionally,
and if it is not
granted,
then the latch is released and the lock is requested
unconditionally.
Once the unconditionally
requested
lock is granted,
the page is latched again,
and any previously
verified
conditions
are rechecked.
This rechecking
is
ACM Transactions on Database Systems, Vol 17, No. 1, March 1992.
ARIES: A Transaction Recovery Method
/’
/’
j;:’
Log
PI
o
T1
a
T2
because,
changed.
The
page_LSN
bered
detect
quickly,
If
update,
it
taken.
If
the
can
the
page,
proceed
to support
if they
hold
performed
an
amount
rency
control
be used
5.2
used
conditions
any
to
changes
be
could
satisfied
for
Otherwise,
is
could
could
be
have
possibly
performing
corrective
granted
have
remem-
the
actions
immediately,
are
then
the
is
should
be made
while
reading
to normal
the
to hold
the
transaction
to
a
page
will
change,
if the
the
system
locking,
is
a transac-
X latch
on the
physical
consistency
Unlocked
reads
of
may
page
also
causing
the
systems
in
be
least
processing.
to
control
similar
the
assured
interest
this
But,
page
than
on the
for
case.
are
restricted
are
coarser
lock
Except
page.
in
concurrency
that
the
even with
locks
utility
not
something
since
record-locking
then,
copy
or
page
reads,
acquiring
is
page
transaction.
not
ARIES
as the
a
the
only
those
mechanism.
locking,
Even
like
the
other
ones
in
which
concur-
[2],
could
ARIES.
Total or Partial Rollbacks
To provide
flexibility
of a sauepoint
notion
of a transaction,
could
be
might
updates
atomicity.
request
at
the
After
undoing
outstanding
limiting
the
to the
savepoint.
the
transaction
for
This
such
the
a partial
rollback,
like
I)B2,
command
to support
after
of savepoints
a system
transaction
performed
the
the execution
number
in
manipulation
is needed
a while,
updates
data
rollbacks,
during
Any
Typically,
SQL
data.
After
of
established.
time.
every
executing
of all
be
in
before
extent
[1, 31]. At any point
can
a point
is established
perform
level
in
is supported
a savepoint
outstanding
savepoint
still
if
lock
as in the
image
schemes
with
the
of unlatching
above.
executing
a page
S latch
of
is
time
found
to latch
dirty
of interference
locking
still
locking
same
are
the
Applicability
the
rematching,
need
or
who
by
unlatched,
was
at
are
the
the
is updating
readers
state as a failure.
requested
is no
unlocked
that
Database
as before.
are
so that
Commit
P2
@ Checkpoint
as described
to isolate
taken
‘:\,;
w
Failure
page
on
of
there
‘“O
Commit
value
conditionally
be sufficient
actions
the
conditions
granularity
then
tion
after
is performed
If
update
the
‘!
‘!
/
required
to
/’
PI
Fig. 7.
occurred.
P
#“ PI
LZN’”S
pi
117
El
/
,’
.
SQL
or the
the
statementsystem
establishment
the
ACM Transactions on Database Systems, Vol
a
that
transaction
can
of a
can
17, No. 1, March 1992.
118
.
C. Mohan et al.
continue
lar
execution
savepoint
and
is
that
savepoint
LSN
of the
no
or
to
latest
remembered
in
start
log
record
of the
transaction
SaveLSN
is
to
savepoint,
it
to be exposed
expose
the
and
INGRES
[181.
Figure
locks
are
undo
get
activity
on
in
as
During
CLR.
to
written.
when
CLRS
(e.g.,
is
is
up to determine
helps
nested
rollback
during
the
first
various
have
to
rollback
then,
scenarios
methods,
efficiently
able
inverses
of
situations
are
management
ARIES’
safely
not
with
small
of the
records
Even
with
to
informarollback.
next
record
When
a
record
the
see how
to
CLR
is
is looked
UndoNxtLSN
means
that
UndoNxtLSN
that
the
to be written.
This
were
though
conjunction
be easy
CLRS,
having
actions.
in,
Section
guarantee
in
should
involved
possible
(see
ACM Transactions
via
not
original
was
it
log
again.
to contain
the
Thus,
records.
a
need
some
undo
of that
of
in
mentioned
during
field.
a
in
undone
Figures
if
a
CLRS,
during
4, 5, and
restart
undos
in
nested
rollbacks
13
the
are
ARIES.
to describe,
of the
which
by
because
of the
rollback
log
fit
CLRS
CLR
ignored
to be processed.
ease
will
As
to contain
field
For
performed,
62].
processed,
in
[1001.
chronological
is made
PrevLSN
undone
partial
recovery
is
its
is
this
are
UndoNxtLSN
record
none
it
up
already
occur,
records
after
be processed
flexibility
deal
don’t
log
would
the
page
they
the
log
over
were
Being
us
next
skip
second
caused
looking
rollback,
field
undo
do not
of
action
[59,
No
involved
multiple
undo
in
UndoNxtLSN
rollback
describe
handled
by
a logical
and
back
latches
reverse
undo
to
during
algorithms
where
described
[42]
acquired
get
a
not
TransID.
is written.
the
case
to
sequence
IMS
the
that
in
a CLR
whose
encountered,
the
us
its
Redo-only
is
during
as
or
is
the
undone
to the
when
be undone,
determined
pointer
that,
in
about
ARIES
record
before-images).
encountered
the
information
system
cannot
and
are
all
log
641
the
and
ensured
is undone,
written,
never
always
record)
concept
values
in
the
back
is used for rolling
a latch
transaction
records
the
which
SaveLSN
have
[31,
roll
as is done
is
at
savepoint
expect
though
SaueLSN,
a log
to
symbolic
back
that
possible
the
a non-CLR
process
R*
is written,
in
will
and
to extend
a CLR
value
we
log
sometimes
would
the
established
the
to
established,
written
If
some
is the
record
that
It
are
PrevLSN
When
log
It is easy
non-CLRs
Since
the
each
assume
single
before,
R
we
even
a rolling
yet
A particu-
performed
called
desires
ROLLBACK
Since
System
rollback,
for
exposition,
tion
in
the
and,
be
a page.
not
internally,
routine
rollback,
deadlocks,
has
use
to LSNS
to the
during
then
is
is being
SaveLSN.
but
routine
The input
involved
order
the
it
3).
been
transaction,
transaction
level,
user
mapping
acquired
deadlock,
user
to the
8 describes
to a savepoint.
the
remembered
at the
do the
the
Figure
has
a savepoint
savepoint
when
When
the
SaveLSNs
numbers
(i.e.,
zero.
supplies
by
the
(see
a rollback
When
written
If
again
if
one.
storage.
beginning
were
forward
outstanding
a preceding
virtual
set
going
longer
for
the
to
In
in
actions
force
the
particular,
performed
undo
the
during
actions
undo
to
action
the
original
action.
example,
index
management
undo
gives
the
exact
be
could
Such
affect
logical
[621
and
a
undo
space
10.3).
of a bounded
computer
amount
systems
of logging
situations
during
in which
on Database Systems, Vol. 17, No. 1, March 1992
undo
allows
a circular
us to
online
ARIES: A Transaction Recovery Method
\\\
***
.
119
\
u
,0
w
m
dFm
m
0
<0
m
c
m
L
..
~
v
al
sQ
..
..
z
x
m“.
-J
nc.1
WE
>
..!
!’. :
0 %’
:
0 CIA. . . .
Fl
‘n
..!
I
.
5
n“
-_l
z
WI-’-l
al
!!
..
w
M.-s
mztn
CL.
-am
UWL
aJ-.J
Crfu
u!
It
0
.-l
=%
ql-
ulc
&
l..-
..2
!!
;E
%’2
al-
ACM Transactions on Database Systems, Vol. 17, No 1, March 1992.
120
C. Mohan
.
log might
be used and log space is at a premium.
keep in reserve
transactions
enough
under
of ARIES
advantage
of this.
When
savepoint
partial
or
cannot
again,
any
thereby
ever
back,
when
and
a CLR
This
makes
rather
5.3
transaction’s
always
updates
read
record
occur
postpone
Once
any
an
the
this
action
log
erasing
return
of
the
lock
roll-
is undone
on that
partial
object.
rollbacks
not
take
5 When
the
same
the
site
SIX,
new
in
or a different
We
we
need
to
the
those
locks
is written,
the
would
some
other
site).
To
files
to be
complete
until
by
failure
uncommitted
locks
cause
objects’
that
of the
held
then
the
record
no
may
as part
etc.)
state,
state
files
Abort
and
if a system
protect
if
which
[191.
log
that
prepare
such
erasing
committing
to
prepare
of
X,
in-doubt
released,
the
logging
like
(IX,
the
be
to the
to ensure
recovery,
Presumed
transactions
deal
sure
log
of
with
erased,
contents,
are
be
part
we
that
the
pending
these
record.
the
they
in-doubt
locks.
its
they
must
place
when
it is committed
be performed.
a file
log record.
associated
state,
Once the end record
or returning
redo-only
is not
locks
of objects)
the
enters
actions,
record
does
(at
definitely
involves
OSfile.
locks
CLRS
a (partial)
(e. g.,
written
is done
into
dropping
prepare
protocol
to terminate
enters
of getting
and releasing
pending
locks
could
actions
a transaction
commit
is used
restart
IS)
avoiding
is
in
release
because
using
a
rollbacks.
transaction.
and
performing
end record
which
S
as the
of
transaction
actions
during
transaction
sake
such
undoes
object
the
deadlocks
of update-type
a transaction
as part
distributed
(such
resolving
64]))
of the
in-doubt
(e.g.,
later
the
once,
during
release
the
and
after
does
to a particular
can
after
do not
to be undone
never
than
update
DB2
updates
R
field,
synchronously
is
list
logging
after
of the
actions
first
system
of two-phase
which
reacquired,
for
more
establishment
because,
same
ARIES
UndoNxtLSN
to total
(see [63,
the
The
acquired
the
form
some
Commit
locks
non-CLR
the
resorting
includes
be
because
very
it,
like
System
But,
the
Termination
that
to
the
impletakes
be released
rollback
cause
The
Manager
after
systems
a partial
still
the
to consider
transaction.
could
using
it possible
prepare
were
particular
for
or Presumed
protocol
a
fact,
running
shortage).
may
we can
currently
Database
obtained
In
the bound,
all
space
rollback
inconsistencies.
is written
Transaction
the
data
log
locks
after
may
back
Edition
of the
completes.
CLRS
the
than
Assume
locks
rollback
rollback
of the
the
target
is completed.
of the
undoes
chaining
back,
is the
causing
a partial
Extended
Knowing
to roll
(e. g.,
0S/2
rolls
a later
to be able
conditions
rollback
release
space
the
which
total
release,
nor
in
a transaction
of the
after
log
critical
mentation
lock
et al
with
to the
For
each
operating
For ease of exposition,
any
particular
a checkpoint
is in
by
is written,
transaction
writing
an
if there
pending
system,
are
action
we
write
we assume
that
and
that
this
progress.
5Another possibility
is not to log the locks, but to regenerate the lock names during restart
recovery by examining
all the log records written by the in-doubt transaction— see Sections 6.1
and 64, and item 18 (Section 12) for further ramifications
of this approach
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992.
ARIES: A Transaction Recovery Method
A transaction
record, rolling
actions
list,
in-doubt
state is rolled
back by writing
the transaction
to its beginning,
discarding
in
releasing
its locks,
and then
writing
the end record.
not the rollback
and end records are synchronously
written
will depend on the type of two-phase
commit
protocol used.
record
may
be avoided
if the
transaction
121
a rollback
the pending
the
back
of the prepare
.
Whether
or
to stable storage
Also, the writing
is not
a distributed
one or is read-only.
5.4
Checkpoints
Periodically,
checkpoints
are taken
to reduce
the amount
of work
that
needs
to be performed
during
restart
recovery.
The work may relate to the extent of
the log that needs to be examined,
the number
of data pages that have to be
read from nonvolatile
storage, etc. Checkpoints
can be taken asynchronously
(i.e.,
while
fuzzy
transaction
checkpoint
end– chkpt
record
transaction
tion
the
writing
(like
tablespace,
table
information.
is going
begin-chkpt
and
indexspace,
any
file
etc.) that
Only
end-chkpt
record
Such
Then
a
the
of the normal
mapping
informa-
are “open”
for simplicity
can be accommodated
the case where multiple
Once the
on).
record.
in it the contents
table,
has entries).
assume that all the information
record. It is easy to deal with
updates,
a
by including
BP dirty-pages
BP dirty–pages
log this
including
by
is constructed
table,
for the objects
which
processing,
is initiated
(i.e.,
for
of exposition,
we
in a single end- chkpt
records are needed to
is constructed,
it is written
to the log. Once that record reaches stable storage, the LSN of the begin-chkpt
record is stored in the master record which is in a well-known
place on stable
storage.
If a failure
were to occur before the end–chkpt
record migrates
to
stable storage, but after the begin _chkpt
record migrates
to stable storage,
then that checkpoint
is considered
an incomplete
checkpoint.
Between
the
begin--chkpt
and
end. chkpt
log
records,
transactions
might
have
written
other log records.
If one or more transactions
are likely
to remain
in the
in-doubt
state for a long time because
of prolonged
loss of contact
with
the
commit
coordinator,
then
record
information
about
those
transactions.
This
recovery,
those
locks
it is a good idea
the update-type
way,
could
if a failure
be
reacquired
to include
locks
(e.g.,
were
to occur,
in the
end-chkpt
X, IX and SIX)
without
then,
having
during
to
held
by
restart
access
the
prepare records of those transactions.
Since latches
may need to be acquired
to read the dirty _pages table
correctly
while gathering
the needed information,
it is a good idea to gather
the information
a little
at a time to reduce contention
on the tables.
For
example,
tion
if the
dirty
100 entries
before
Figure
_pages
table
If the
already
during
written
important
by
because
transactions
the
entries
acquisichange
remain
correct (see
redo point,
besides
of the RecLSNs of the dirty pages included
also takes into account the log records that
since
effect
each latch
examined
the end of the checkpoint,
the recovery
algorithms
10). This is because,
in computing
the restart
taking
into account the minimum
in the end_chkpt
record, ARIES
were
has 1000 rows,
can be examined.
of some
the
beginning
of the
updates
of the
that
checkpoint.
were
performed
This
is
since
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
122
C. Mohan et al.
.
initiation
of the checkpoint
might
not
is recorded
as part of the checkpoint.
the
that
ARIES
does
storage
on
system
buffer
in
in
this
in case
make
a copy
minimizes
unavailability
to
before
site
analysis
dirty
gets
_pages
For
begin
failure
pass,
is the
the
its
pages
which
are
those
pages
restart
redo
prevention
buffer
1/0
multi
manages
are
work,
of updates
manager
from
the
of
the
a failure,
recovery
state
ensure
and
could
copy.
This
master
the
pass,
appropriately.
At
which
the
checkpoint
invokes
the
routines
order.
The
end
system.
contains
in that
the
be
RESTART
the
complete
to
atomicity
of a failed
record
last
routine
undo
restart
needs
the
9 describes
of the
of the
This
and
is updated
availability,
redo
and
undo
passes.
to
latch
pages
for
way
the
necessary
Analysis
The
first
taken
for
buffer
of restart
the
pool
recovery,
a
Figure
ments
the
contains
analysis
the
the
The
list
processing
is by
parallelism
are
by
must
exploiting
be as short
as
parallelism
is going
modified
to
be
during
allowing
new
during
employed
restart
is
it
recovery.
transaction
processing
[601.
is made
actions.
outputs
of pages
or was
The
of this
from
which
that
may
that
the
be written
rolled
back
routine
were
were
the
must
this
before
in
routine
in
the
which
end
but
missing.
ACM ‘llansactlons
imple-
LSN
table,
of the
which
the
in-doubt
or unprepared
the dirty–pages
table, which
processing
are
that
is the
transaction
dirty
failure,
analysis
is the
routine
routine
RedoLSN,
start
system
recovery
are the
potentially
and
pass
by
to this
or shutdown;
down;
redo
restart
ANALYSIS
input
which
failure
shut
during
RESTART_
the
of transactions
list
totally
that
of system
the
had
log
pass
failed
records
in
describes
system
log
if
they
availability
explored
of the
10
at the time
contains
Only
before
data
are
record.
master
of restart
this
Pass
pass
pass.
duration
of accomplishing
improving
recovery
6.1
are
LSN
record
pass
the
that
the
write
DB2
that
the
Figure
beginning
shutdown.
redo
One
state
and
how
is,
using
is taken.
high
Ideas
after
a consistent
at the
possible.
during
to
.chkpt
or
the
table
checkpoint
writes
manager
writes.
of transactions.
routine
the
restarts
data
invoked
to this
pointer
the
properties
that
input
for
background
reduce
To avoid
nonvolatile
the
hot-spot
to
perform
to
buffer
ensure
operation,
time
system
bring
durability
The
some
to
often
and
forced
list
page
the
about
are
has
1/0
pages
the
dirty
the
PROCESSING
to
routine
batch
to occur.
an
data
transaction
performed
were
during
of those
the
the
failure
in
details
reasonably
of each
6. RESTART
When
storage
pages
can
manager
be
pages
if there
buffer
in
is that
dirty
[961 gives
Even
the
a system
hot-spot
out
manager
fashion.
pages
assumption
writing
buffer
dirty
any
The
operation.
to nonvolatile
such
and
1/0
modified,
written
to
The
one
pools
frequently
just
basis,
processes.
pages
that
a checkpoint.
a continuous
ple
require
not
during
be reflected
on Database Systems, Vol. 17, No. 1, March 1992.
the
records
for
buffers
is the
log.
for
whom
when
location
The
only
the
on
log
transactions
end
records
ARIES: ATransaction
Recovery Method
.
123
RE.STAR7(Master Addr);
Restart_Analys~
Restart_
s(Master_Addr,
Redo(RedoLSN,
buffer
pool
remove
entries
Restart_
Dirty_Pages
for
table
locks
Dlrty_Pages,
for
RedoLSN);
Dlrty_Pages);
:=
Dirty_
Pages;
non-buffer-resident
Undo (Trans_Tabl
reacquire
Trans_Table,
Trans_Table,
pages
from
the
buffer
pool
Dirty_
Pages
table;
e);
transactions;
prepared
checkpoint;
RETURN;
Fig.9.
During
this
does not
the table
transaction
back.
if a log record
appear
in the
with
the current
table is modified
also to note
undone
pass,
already
the
LSN
if it were
file
which
of the
_pages
most
recent
log record
sure that
the redo
later,
original
for a page
table,
then
log record
ultimately
that
whose
an entry
table
identity
is made
causing
would
then
in
The
and
need
to be
had to be rolled
any pages belonging
are removed
no page belonging
pass. The same file
operation
that
the transaction
is encountered,
are in the dirty-pages
order to make
accessed during
once the
is encountered
dirty
log record’s
LSN as the page’s RecLSN.
to track the state changes of transactions
determined
If an OSfile.return
to that
Pseudocode for restart.
from
the latter
in
to that version
of that file is
may be recreated
and updated
the
file
erasure
is committed.
In
that case, some pages of the recreated
file will reappear
in the dirty-pages
table later with RecLSN
values greater
than the end-of-log
LSN when the
file was erased. The RedoLSN
is the minimum
RecLSN from the dirty-pages
table at the end of the analysis
are no pages in the dirty _pages
It is not necessary
ARIES
there
is no analysis
(see also
missing
logged
Hence,
tion.
This
that
there
implementation
in
pass.
pass.
table.
The
redo pass can be skipped
be a separate
the
This
0S/2
analysis
Extended
is especially
6.2),
in the
redo
pass,
ARIES
updates.
That
is, it redoes
them
irrespective
redo
transactions,
does not need to know
unlike
the loser
unconditionally
System
their
update
locks
are reacquired
computation
to consider
the
Begin _LSNs
turn requires
that we know, before the start
the in-doubt
transactions.
Without
the analysis
pass, the transaction
and DB2.
of a transac-
the
lock
names
as they are encountered
locks forces the RedoLSN
of in-doubt
transactions
which
of the redo pass, the identities
table
all
were
only for the undo pass.
in which
for in-doubt
by inferring
from the log records of the in-doubt
transactions,
during the redo pass. This technique
for reacquiring
before
they
R, SQL/DS
status
in the
Manager
redoes
of whether
or nonloser
That information
is, strictly
speaking,
needed
would
not be true
for a system
(like
DB2)
transactions
Database
as we mentioned
Section
by loser or nonloser
pass and, in fact,
Edition
because,
if there
could
be constructed
in
of
from
the checkpoint
record and the log records encountered
during
the redo pass.
The RedoLSN
would have to be the minimum(minimum(
RecLSN
from the
dirty-pages
table in the end.chkpt
record), LSN(begin-chkpt
record)).
Suppression of the analysis
pass would also require
that other methods be used to
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992
124
C. Mohan
0
et al.
#~START_ANALYSIS(Mast er_Addr,
ln]tiallze
the
Trans_’able,
Trans_Table
tables
D1rty_pages,
arm D1rty_Pages
to
RedoLSN) ;
empty;
Master_Rec := Read_Dl sk(Master_Addr)
;
Open_ Log_ Scan (Master_Rec .Chkpt LSN) ;
LogRec := Next_ Logo;
/’
/*
LogRec := Next_ Logo;
WHILE NOT(End_of_Log)
open log scan at Beg)n_Chkpt
/* read )n the Begln_Chkpt
read
log
record
followlng
record
record
‘/
‘/
Begln_Chkpt
*/
00;
ret Urn*/
IF trans related
record & LogRec.7ransi3
‘/C- ;n Trans Table THEN /* not chkpt/OSflle
/* log ~ecord */
Insert
(Log Rec. Trans ID, ’U’ ,Log Rec. LSN, Log Rec. Frev LSN) l!,:o Trans Table;
SELECT(LogRec. Type)
WHEN(’update’
I ‘compensation’)
DO;
Trans_Tabl
e[LogRec. Trans ID] .Last LSN := LogRt-:. LSN;
IF LogRec. Type = ‘update’
IF LogRec 1s undoable
THEN
THEN Trans_Tahl
e[.ogRec.
TransIO]
.UndoNxt LSN := LogRec. LSN;
ELSE Trans_Tabl e[LogRec. Trans IDU.UndoNxt LSN := LogRec. UndoNxt LSN;
/’ next record to undo 1s the one pointed
IF LogRec is redoable
& LogRec. ~age ID NOT IN DTrty_Pages THEN
insert
(LogRec. Page ID, Log Rec. LSN) Into Llrty_Pages;
END; /’ WHEN(‘update’
I ‘compensation’)
*/
WHEN(‘Begln_Chkpt ‘) ; /* found an Incomplete
ENO; /* SELECT ‘/
LogRec := Next_ Logo;
ENO; /’ WHILE ‘/
FOR EACH Trans Table entry with (State = ‘U’) & (Undo Nxt LSN = O) 00; /* rolled
back trans
write
end re~ord and remove entry from Trans Table;
I* w)th mlsslng
end record
*/
*[
IF Trans ID NOT IN Trans_Table
Insert
entry (Trans ID, State,
ENO;
END; /*
entry
entry
(Page IO, RecLSN) In Olrty_Pages;
to Rec LSN In Olrty_PagLst;
IF LogRec. Type = ‘prepare’
THEk Trans_Tabl e[Log Rec. Transit].
ELSE Trans Table [LogRec .Trans ID]. State
:= ‘U’;
Trans_Tabl~[LogRec
.TransID]
WHEN(‘OSfile_return’)
delete
ENO; /* FOR */
RedoLSN := minimum(Di rty_Pages.
RE-URN;
bac<’)
entry
begin_
pass
chkpt
Redo Pass
The
second
Figure
:= ‘P’ ;
/*
to files
used
which
is that
to
= LogRec. Trans ID;
pages of
filter
have
returned
return
file;
start
posltlon
for
~edo *I
analysis.
been
returned
the
dirty
.pages
table
update
log
records
which
to the
used
operating
during
occur
the
after
the
record.
6.2
pass.
be
TransID
all
Pseudocode for restart
consequence
cannot
which
Rec LSN) ;
updates
Another
*/
for
from Olrty_?ages
Fig. 10.
processing
State
.Last LSN := LogRec. LSN;
ENO; /’ WHEN(’prepare’
I ‘roll
WHEN(‘end’)
delete Trans_Table
redo
Table;
FOR ‘/
ELSE set RecLSN of Dlrty_Pages
END; /’ FOR ‘/
END; /’ WHEN(’End Chkpt’)
*/
WhEN( ‘prepare’
\ ‘rollback’)
DO;
avoid
ignore
00;
THEN 00;
Last LSN,UndoNxt LSN) In Trans
FOR each entry in LogRec.Dirty
PagLst 00;
IF Pagel Ll NOT IN Olrty_Pages-THEN
lrsert
system.
record.
CLR */
*/
DO;
in LogRec. Tran_Table
Begln_Chkpt
by this
It
WHEN(‘End_ Chkpt’)
FOR each entry
checkpoint’s
to
pass
11
of the
describes
log
that
the
is made
during
RESTART.REDO
restart
routine
ACM 11-ansact,ons on Database Systems, Vol. 17, No. 1, March 1992
recovery
that
is the
redo
implements
ARIES: A Transaction Recovery Method
.
125
Di rty_Pages);
RESTART-REDO(RedoLSN,
/* open log scan and :;s]tlon
at restart
pt *J
/* read log record a: restart
redo point
*/
/* look at all records
till
end of log */
Open_ Log_Scan(RedoLSN);
LojRec := Next_ Logo;
WHILE NOT(End_of_Log) 00;
IF LogRec. Type = (’update’
I ‘compensation’)
& LogRec is
redoable
&
LogRec. PageIO IN Oirty-Pages
& LogRec. LSN >= Oi rty_Pages[LogRec
.~ageID] .Rec LSN
THEN 00;
/’ a redoable
page update.
updated page mg-t
not have made It to */
/* disk before
sys failure.
need to access cage and check Its LSN */
Page := fix&l atch(LogRec. PageIO, ‘X’);
IF Page. LSN < LogRec. LSN THEN 00
/* update not or cage. need to redo It *I
Redo_Update(Page,
LogRec);
/’
redo
update
*/
update
*I
Pag.?. LSN := LogRec. LSN;
END;
ELSE Dlrty_Pages
[*
[LogRec. PageIO] .Rec LSN := Page. LSN+l;
.~date already
on page *I
update dirty
page list
with correct
info.
tr-s w1ll happen if this
*/
~~gewas written
to disk after
:Re checkpt b.t before sYs failure
*/
/’
I*
unfix&unlatch
(Page);
ENO;
LogRec : = Next_ Log ();
/“
LSN on ~age has to
/a read next
/*
ENO;
RETURN;
Fig. 11.
the
redo
the
dirty-pages
pass
records
log
records
from
a check
table.
If
it
might
be
less
than
the
if
the
that
log
log
record’s
the
log
record’s
record’s
page
LSN,
then
the
reestablishes
the
database
performed
by
loser
routine
updates
behind
this
repeating
some
of that
redo
[691 we
have
Since
table
reduce
redo
may
get
that
are
were
dirty
at
have
may
the
been
that
were
written
time
log
of the
like
to
records
to
get
with
the
redo
examined
redo.
last
pass.
This
log
write
be used
to
although
eliminate
the
out
this
the
listed
in
Not
all
pages
became
some
identify
dirty
system
CPU
the
option
corresponding
pass.
dirty-pages
of the
the
In
of history
pass.
which
that
unnecessary.
some
that
failure.
rationale
It turns
pages
this
To
to be
RecLSN
The
in
the
saving
that
the
during
before
and
records
storage,
ACM Transactions
or
storage
redone.
the
state
is found
repeating
entries
Only
be
of system
be
dirtied
to
page
to be examined.
10.1.
the
equal
Thus,
may
is because
volume
log
to
time
during
checkpoint
nonvolatile
nonvolatile
can
which
pages
and
reducing
systems
to
read
require
written
of reasons
expect
be
records
is encoun-
the
LSN
redone.
in Section
log
the
dirty-pages
or
have
are
of restricting
of pages
the
have
page’s
as of the
log
idea
only
during
will
read
do not
such
the
number
modified
we
and
the
table
pages
Because
further
is page-oriented,
dirty-pages
might
is explained
and
No
record
that
which
transactions
transactions’
*/
log
scanning
in the
is redone.
of pages
state
end of
RedoLSN
than
might
If the
update
number
of history
of loser
explored
to possibly
the
log
appears
suspected
update
This
Even
to limit
the
starts
is greater
is
is accessed.
serves
pass
page
it
*/
*/
routine.
a redoable
LSN
then
information
are
redo
When
till
be checked
1og record
redo,
routine
referenced
table,
reading
restart-analysis
The
point.
the
the
this
the
routine.
in
suspicion,
the
this
to
by
to see if the
page
such
inputs
RedoLSN
and
the
this
by
is made
does
for
resolve
The
Pseudocode for restart
supplied
written
tered,
RecLSN
actions.
table
are
redid
/’
the
the
that
later
failure.
overhead,
dirty
is
pages
available
pages
from
on Database Systems, Vol. 17, No. 1, March 1992,
126
C. Mohan et al.
0
the
dirty
.pages
analysis
table
pass.
complete,
being
when
Even
if
a system
written.
those
such
failure
The
log
records
in
records
were
a narrow
corresponding
are
encountered
always
to
window
pages
will
be
could
not
during
written
1/0s
them
from
during
this
prevent
get
modified
the
after
pass.
For
brevity,
after
logging
the
pending
of all
are
redone
For
the
all
before
the
pass.
Since
also
perform
records
table
to read
possibly
This
the
the
page
log.
This
its
Undo Pass
The
third
the
pass
Figure
undo
pass
table.
The
since
history
page
is
performed
systems
The
logical
or
like
not.
DB2
restart
order,
in
the
maximum
the
yet-to-be-completely-undone
remains
rolled
those
as
we
undo
The
described
protocol
before
this
routine
while
this
undo
with
next
loser
next
by
writing
pool,
processes.
Updates
to
represented
for
a given
supporting
These
disaster
5.2.
CLRS.
dirty
perform
selective
until
the
buffer
to
each
table
log
process
manager
nonvolatile
pass.
ACM TransactIons on Database Systems, Vol. 17, No. 1, March 1992
records
of
chronotaking
for
each
of
transaction
to be
for
each
of
is exactly
rolling
back
follows
the
storage
for
redo.
transaction
transaction
be
10.1
continually
no loser
the
should
reverse
to be processed
encountered
In
by
Also,
on
Section
in
is done
for
pass.
LSN
operation
transactions,
The
pages
the
but
the
transaction
undo
in
to process
of the
Section
undo
implements
restart
this
describe
record
in
the
undo
is the
that
initiated,
transactions,
entry
recovery
we
This
log
record
an
processing
writes
losers
log.
is
an
history
of the
is
pass
what
back
of the
in
as the
as before.
routine
during
whether
repeat
sweep
restart
routine
the
rolls
The
of
log
buffer
since
order
context
UNDO
consulted
this
determined
transactions.
transactions,
WAL
is
order
can
of
and,
process.
the
same
during
not
determine
LSNS
to be undone.
back
to
is
do not
a single
the
made
input
before
routine
of the
is
table
that
one
properties
in the
to
the
redo
we
information
multiple
from
correctness
RESTART_
the
Contrast
-undo
only
basis
in
[731.
that
to
by
the
queues
into
the
buffers
logged,
by the
using
the
in
not
come
orders
reapplied
are
of pages
queues
in
1/0s
in
in-memory
pages
with
applicable
The
consulted
information
encountered
pass
group
different
any
are
log
is repeated
not
actions
asynchronous
(as dictated
or
record
in
violate
describes
_pages
occur
execution
pending
the
are
redo
and
be dealt
backups
actions.
dirty
to
the
be available
building
page
log
also
of the
12
records
complete
queue
may
the
like
a per
applied
remote
6.3
pass.
on
get
are
were
before
remaining
of
to be reapplied
1/0s
updates
ideas
via
log
need
not
the
they
during
each
does
a failure
but
of initiating
so that
things
table)
missing
parallelism
recovery
pages
performed
may
if
availability
corresponding
that
pages
all
these
corresponding
requires
different
the
possibility
us the
initiated
processing
transaction,
gives
.pages
how,
pass.
potentially
dirty
as to
of a transaction,
of that
redo
sophisticated
asynchronously
here
record
parallelism,
updates
which
the
discuss
end
actions
during
..-pages
parallel
in
do not
of the
exploiting
dirty
in
we
the
during
the
usual
the
ARIES: A Transaction Recovery Method
.
127
.
REST,.4//T-UMM(T rans-Tabl
e);
WHILE EXISTS (Trans with
State
= ‘U’
UndoLSN := maxlmum(UndoNxtLSN)
/’
in
Trans_Table)
DO;
from Trans_Tab7e
pick
UP
LogRec := Log-Read (UndoLSN);
SELECT(LogRec. Type)
WHEN(‘update’)
DO;
entries
with
State
= ‘u’ ;
UndoNxtLSN of unprepared
trans with maximum UndoNxt LSN */
J* read log record to be undone or a CLR *J
IF LogRec is undoable THEN 00;
f’ record needs undoing (not
Page := flx&latch(LogRec
.Page IO, ‘X’);
Undo_Update(Page, LogRec);
Log_Wri te(’compensati
on’ ,LogRec .Trans ID, Trans_Tabl e[LogRec. TransID]
LogRec. Page ID, LogRec. PrevLSN,
Page. LSN := LgLSN;
/’
I* write
CLR */
CLR in page */
LSN of
*/
table
*/
undone
*/
/* pick UP addr of next record to examine
e[LogRec. TransIO] .UndoNxtLSN := LogRec. PrevLSN;
*/
[ ‘ prepare’)
Trans_Tabl
.UndoNxtLSN
I*
To exploit
parallelism,
processes.
single
leaves
open
undos
to
objects
the
the
parallel,
actually
trans
from
fully
*/
*I
:= LogRec. UndoNxt LSN;
UP addr of
a single
next
record
to examine
*I
log
for
CLRS
first,
in
6.2.
to the
In
pages
multiple
completely
the
CLRS.
without
by
This
then
redoing
this
fashion,
the
be performed
a
still
applying
accomplishing
and
can
using
with
in
problems
undos),
Section
an example
records
was
describe
written
rollback
transaction
to
was
went
missing
one
ARIES,
the
performed
(updates
regardless
we
have
after
restart
and
the
savepoint
Each
of how
the
update
option
recovery
concept,
and
6) are
many
the
this
for
the
CLRS
in
undo
work
in parallel,
of
After
6).
first
log
of
even
that
During
restart
the
then
a
the
the
undos
be matched
with
is performed.
continuation
ARIES
undo
the
write,
recovery,
then
recovery
Since
in the
and
restart
will
Here,
failure,
disk
3)
and
record
ARIES.
the
4 and
redone
allowing
could,
using
Before
records
times
is completed.
we
page.
update.
of log
5
scenario
same
second
(undo
performed.
transactions
supports
after
recovery
to the
(3, 4, 4’, 3’, 5 and
1) are
CLR,
disk
restart
updates
forward
updates
and
With
the
6.4
logical
changes
be performed
be dealt
transaction.
13 depicts
(of 6, 5,2
also
chaining
of writing
in
the
can
transaction
UndoNxtLSN
Section
require
explained
pass
each
of the
(see
may
applying
Figure
undo
possibility
pages
as
at most
I*
*I
*I
Pseudocode for estart undo.
that
because
that
partial
the
It is important
process
the
pick
LastLSN, . . .) ;
/* delete
trans
‘/
SELECT “/
WHILE */
Fig. 12.
the
store
Log_Wrlte( ’end’ ,LogRec .Trans IO, Trans_Tabl e[LogRec. Transit].
delete Trans_Table entry where TransID . LogRec. TransIO;
ENO;
ENO; /*
/*
END;
RETURN;
page
*I
Trans_Tabl e[LogRec. TransID] .LastLSN := LgLSN;
/’ store LSN of CLR in table
unfix&unl atch(Page);
ENO;
I* undoable
record case
ELSE;
/* record cannot be undone - ignore it
Trans_Tabl e[LogRec. Trans IO] .UndoNxt LSN := LogRec. PrevLSN; /x next record to process is
J* the one preceding
this record in its backward chain
IF LogRec. PrevLSN = O THEN DO;
/* have undone completely
- write
end
WHEN(‘rollback’
all
record)
.LastLSN,
. . . ,LgLSN, Data);
ENO; /* WHEN(‘update’)
*/
WHEN(‘compensation’)
Trans_Tabl e[LogRec. TransID]
for
redo-only
pass,
repeats
roll
of
loser
history
back
each
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
128
C. Mohan et al.
●
u
Wrl te !bdated
m
*
12344’3’5
REDO
344356
UNDO
6521
Fig. 13.
loser
only
to
its
transactions.
tion
at
a
savepoint
(1)
records
before
ever
its
after
some
recovery
amount
the
by
processing
perform
undo
work
offline
the
needs
undo
then
transactions.
solely
processing
the
of the
table)
that
offline
are
tions
with
to protect
until
need
are
positions,
and
then
DB2,
for
of the
because
the
non-CLR
[151.
That
is,
there
log
undo
are
and
they
transactions
remembered.
[141.
Unless
to those
accesses
The
there
to those
those
during
will
in
the
online,
are
on Database Systems, Vol. 17, No 1, March 1992
indexes)
exact
storage,
objects
forward
DB2.
ranges
some
DB2
not
brought
(DBA)
that
those
before
of log
in-doubt
need
will
fact
is
inverses
allocation
no locks
objects
based
the
undos
LSN
handling
for
be
up.
on those
finish
brought
are
objects,
to
and/or
be generated
database
are
possible
redo
minipage,
is
for
is brought
and
logical
the
in virtual
when
When
no
(called
It
system
transactions
can
(or
is
system
actions
of
doing
to reduce
the
which
written
page
the
done
it
for
alone
CLRS
records
Because
table
in the
completed.
the
defer
unavailable.
example,
objects
CLRS
to write
is
opening
the
in
offline
processing
to
is usually
is able
is possible
since
ACM Transactions
first
the
wish
data
the
those
objects
This
critical
some
may
loser
updates
is
locks
when-
cursor
to restart
we
when
uncommitted
recovery
log
those
information
restore
some
to other
also
wish
time.
In
to be recovered
accessible
to be applied
data
are
exceptions
is maintained
objects
made
an
in
of locking,
actions.
in
may
some
when
transactions
original
can
the
would
transaction’s
enough
system
for
granularity
remembers,
the
to be performed
information
smallest
the
such
even
DB2
This
on the
point
transactions.
needs
about
correctly
(2) reacquiring
Hence,
which
to be performed
work
objects,
we
possible.
a later
recovery
this
from
(3) logging
the
information
loser
applica-
so on.
failure,
as
during
new
and
the
its
Restart
recovering
of
restart
If some
to
time
names
back
invoking
Doing
updates,
so that
rolling
by
enough
lock
undone
and
soon
work
of
accomplished
of
as
passing
recovery,
a system
transactions
the
not
or Deferred
totally
transaction
is to be resumed.
generate
state,
of
the
and
established
program
Sometimes,
new
to
restart
are
Selective
6.4
point
execution
ability
completing
instead
resume
uncommitted,
savepoints
application
could
entry
which
the
for
recovery example with ARIES.
savepoint,
we
special
from
require
latest
Later,
Restart
they
records
transac-
to be acquired
be permitted
online,
then
ARIES:
recovery
the
is performed
remembered
for
offline
In
also,
transactions
undos.
object.
For
logical
the
cannot
rolling
forward
normal
similar
Method
using
rollbacks,
the
.
log
CLRS
space
of an
update
logical
129
records
maybe
in
written
based
they
that
page
retraversing
maybe
affected,
page-oriented
is not
the
undo
generally
can
write
But
for
tree
work
to
and
For
a CLR
the
high
since
do
is unpredictable;
not
of
CLRS.
possible,
index
will
state
10.3),
we
loser
require
page-oriented.
appropriate
is O% full.
this
the
current
Section
the
of [62]
may
always
operation,
the
of
that
on the
are
(see
record
none
objects
generate
methods
page
when
and
insert
(e.g.,
of which
predict
are
since
management
stating
undo
provided
offline
undos
a problem,
approach
undo
of the
logical
at all
actions,
or more
management
in terms
even
the
a
key
in fact,
we
hence
logical
is necessary.
It is not
during
possible
restart
records
that
reverse
each
to handle
recovery
at a later
Remember
in
the
index
the
deletion),
the
not
involving
during
of
take
is because
are
space-related
effect
by
during
one
a conservative
concurrency,
undo
This
undos
take
can
modified
Redos
example,
for
we
has
the
we can
Even
Recovery
objects.
ARIES
logical
efficiently
ranges.
A Transaction
in
record,
other
Even
all
the
the
records
have
point
the
PrevLSN
undos
if the
recovery
next
of some
the
order.
two
sets
methods,
Hence,
record
to
and/or
of the
the
is
of a transaction
logical)
of records
undo
it
be
records
(possibly,
of the
are
of a transaction
enough
processed
to
during
UndoNxtLSN
chain
rest
is
done
remember,
the
leads
of
interspersed.
undo;
for
from
us to all
the
to be processed.
under
the
circumstances
to perform,
potentially
needs
to be supported,
restart
undos
handle
in time,
chronological
transaction,
that
the
and
where
logical,
then
one
undos
we
or more
of the
on some
suggest
the
offline
loser
transactions
objects,
following
if deferred
algorithm:
it for
1. Perform
the repeating
of history
for the online objects, as usual; postpone
the log ranges.
the off/ine objects and remember
2. Proceed
with
the undo pass as usual,
but stop undoing
a loser transaction
when
one of its log records
is encountered
for which
a CLR
cannot
be
generated
for the above reasons. Call such a transaction
a stopped transaction.
But continue
undoing
the other, unstopped
transactions.
3. For the stopped transactions,
acquire locks to protect their updates which have
not yet been undone.
This could be done as part of the undo pass by continuing
to follow the pointers,
as usual, even for the stopped transactions
and acquiring locks based on the encountered
non-CLRs
that were written by the stopped
transactions.
4. When restart
recovery
is completed
and later the previously
offline
objects are
made online, fkst repeat history
based on the remembered
log ranges and then
continue
with
the undoing
of the stopped
transactions.
After
each of the
stopped transactions
is totally
rolled back, release its still held locks.
5. Whenever
an offline
object becomes online,
when the repeating
of history
is
completed
for that object, new transactions
can be allowed
to access that object
in parallel
with the further
undoing
of all of the stopped transactions
that can
make progress.
The
tion
above
in
in-doubt
requires
the
update
the
ability
(non-GLR)
to generate
log
records.
lock
names
DB2
is
based
doing
on the
that
informa-
already
for
transactions.
ACM Transactions on Database Systems, Vol
17, No, 1, March 1992.
130
C. Mohan
.
Even
the
if none
of the
processing
of
transactions
ing:
(1)
locks
are
first
for
start
completed,
released
as each
requires
that
the
log
If a loser
failure,
then,
with
are
the
the
redo
which
represent
object
as
the
soon
because
more
we
than
as
(e.g.,
normal
first
the
hence,
it
early
undo
because
not
we
work
that
possibly
undone.
in
been
undone.
the
undo
systems
that
a
non-CLR
can
be
performed
records
corresponding
the
that
object’s
works
same
CLRS
more
than
of
only
non-CLR
undo
in
resolution
the
some
log
This
do not
permit
to
obtained
release
undo
of locks
to
is
such
be
those
on
then
for
to release
specially
and
record
yet
like
transaction
redo
system
equal
to
not
the
pass
or
would
that
to be undone.
need
have
we
mark
that
than
(1)
ensure
of the
remain
are
step
during
time
loser
(1)
(e. g.,
once
ARIES
during
deadlocks
using
rollbacks.
DURING
In this
describe
section,
can
save
of the
by,
recovery
Analysis
can
we
be reduced
of restart
pass.
some
By
work
the
This
is
latter,
impact
taking
of failures
on CPU
checkpoints
a checkpoint
were
of this
checkpoint
at
the
of
this
dirty-pages
different
dirty
the
a failure
list
restart
the
taking
if
table
dirty–pages
the
how
optionally,
table
transaction
that
RESTART
processing
during
and
different
stages
processing.
transaction
the
of
can
log
or
release
7. CHECKPOINTS
1/0
will
and
step
analysis
less
is in effect)
and
DB2)
transaction
partial
by
locking
corresponding
AS/400,
This
we
of the
in
to
at the
Locks
the
and
Performing
the
that
back
then
CLRS
are
updates
rolled
rollbacks
records
CLR.
follow-
transactions,
encountered
back
log
LSNS
those
the
records,
acquired
during
last
update
undo
once;
IMS).
for
if record
do not
Encompass,
whose
as possible,
(e. g., record,
lock
obtained
transaction’s
only
are
loser
log
appropriately
rolling
that
the
doing
as the
adjusted
already
is being
as soon
be
of
their
in-doubt
locks
is desired
it by
completes.
as to which
records
pass
even
The
it
rollbacks
on
and
rollback
information
the
loser
transactions
was
transaction
locks
loser
be known
log
of
If a long
of its
the
it will
UndoNxtLSN
during
of the
based
parallel.
RedoLSN
but
the
accommodate
of the
transaction’s
transaction
a transaction,
can
transactions
in
restart
records
pass.
These
loser
before
reacquire,
updates
performed
the
we
and
new
is offline,
start
then
history
processing
are
to be recovered
transactions
uncommitted
transactions
all
objects
new
repeat
the
(2) then
et al.
from
list
will
be the
of
the
analysis
will
be
at
during
is obtained
of the
during
contains
happens
end
occur
checkpoint
table
what
.pages
end
at the
to
the
analysis
pass,
recovery.
same
The
as the
pass.
entries
The
the
same
as
of the
analysis
from
the
buffer
redo
pass,
the
the
checkpoint.
pool
(BP)
of
entries
end
a normal
we
entries
entries
pass.
For
the
dirty-pages
table.
Redo
pass.
notified
during
that
At
so that,
the
page
redo
by
ACM Transactions
the
beginning
whenever
pass,
making
it
the
of the
it writes
will
out
change
RecLSN
a modified
the
be equal
restart
to the
buffer
page
manager
to nonvolatile
dirty
_pages
table
LSN
of that
log
on Database Systems, Vol. 17, No. 1, March 1992.
(BM)
is
storage
entry
record
for
such
ARIES: A Transaction Recovery Method
that
all
BM
log
records
have
the
to maintain
ing.
The
pass
to
a failure
the
reduce
to
dirty-pages
of
the
of
entries
of
redo
the
is
pass.
during
the
pass,
undo
undo.
entries
undo
table
or redo
since
as
of
to
be
redone
the
the
checkpoint.
be
not
parallelism
in
the
of
entries
The
the
same
analysis
pass.
is
if
entries
will
the
as
This
employed
in
work
on a restart
view
complex
date
checkpoints
case,
they
8. MEDIA
will
such
called
a
are
are
the
same
to
dirty,
modified
the
as
undo
same
checkpoint.
be the
sometimes
some
depicted
physical
in
This
Figure
a system
failure
in
restart.
While
to take
place
The
as the
pass,
as the
entries
entries
17
be required
(the
shadow
is another
R. This
would
restart
of
during
consequence
no
these
are
after
its
effect
were
to easily
checkpoints
restart
true
and
restart
is able
System
be
logic
a
for
of the
the
longer
an earlier
ARIES
that
pages)
complicates
checkpoint
[31].
in
it may
pages
in System
The
that
(like
media
DBspace,
tions.
With
might
contain
archive
such
dump)
if
desired,
uncommitted
updates.
directly
the
from
with
a high
some
recovery
will
tablespace,
concurrently
course,
written
become
during
in
as it
consid-
accommo-
optional
in
our
R.
RECOVERY
fuzzy
performed
recovery,
up
completes.
be forced
table
are
to
table
time
of the
will
this
pages
up
no longer
time.
to be describable
assume
some
checkpoint
is cleaned
are
about
checkpoint
time
to be performed.
during
may
any
of that
at the
be repeated
following
too
list
when
are
transaction
is taken
dirty-pages
table
manipulates
pages
of the
restart
the
pages
entries
table
the
point,
manager
when
.pages
free
pass,
this
corresponding
BP
entries
restart
to
checkpoint
ered
the
of this
cannot
the
the
at that
taken
history
a restart
We
same
of
undo
At
the
entries
dirty
R, during
be
that
logic
the
end
or
of the
which
dirty–pages
table
System
more
the
whether
If a checkpoint
of the
BP
transaction
checkpoint
fact
during
The
time
processing–removing
the
transaction
In
time
checkpoint
at
table.
onward,
adding
of the
of the
the
by
any
need
not
process-
currently
pass.
be
the
this
table
for
storage,
normal
entries
then
normal
During
then
at
normal
are
redo
if
does
pages
would
the
will
of
dirty-pages
entries
From
nonvolatile
during
table
table
beginning
BP
those
buffers.
etc.
of
is enough
BM
during
taken
that
end
checkpoint
affected
the
the
by removing
does
this
log
be
It
fashion.
of what
to
of the
this
as it does
track
the
transaction
At
becomes
the
before
not
table
processed.
in
131
pass.
Undo
table
amount
transaction
checkpointing
the
the
been
table
checkpoints
occur
of
had
be keeping
dirty–pages
the
record
dirty--pages
allow
list
restart
the
log
dirty-pages
still
above
were
entries
own
it should
buffers.
redo
to that
restart
its
Of course,
the
Of
up
manipulates
.
concurrency
could
Let
nonvolatile
operation
entity.
us
to
also
storage
the
image
updates,
assume
at the
A
involving
modifications
uncommitted
we
be required
etc.)
easily
that
version
copy
in
such
entity
produce
image
of the
entity
can
other
transac-
the
image
method
be
copy
of [52].
image
copy
with
copying
is
performed
entity.
This
or
copy (also
an
to the
an
of a file
by
method,
contrast
the
level
image
fuzzy
means
no
that
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
C. Mohan et al.
132
.
more
recent
versions
transaction
version
of the
geometry
be
copying
up
via
(e.g.,
to
for
the
easy
to
case,
some
latching
When
begin.
the
fuzzy
remembered
image
point
along
with
information
with
LSNS
image-copied
externalized
tion
up
to
began.
record
of
the
the
image
us call
been
in
for
the
5.4
log.
taking
into
media
while
call
the
point
the
location
is
of
on this
in
y
records
of
have
been
copy
opera-
be at least
media
point
the
LSN
of the
as
recovery
begin.
same
of
the
record),
would
computation
the
check-
log
image
is the
and
pages
would
the
noted
checkpoint
dirt
entity
redo
discussing
is
that
example,
this
fuzzy
that
account
recovery
For
it
in
end.chkpt
the
of the
We
then
based
checkpoint))
time
version
the
of
checkpoint’s
the
course,
logged
SNs
copy
by
Of
the
be made
had
copy
[131),
checkpoint
Let
can
that
image-copied
reason
Section
initiated,
data.
updates
storage
point
computing
in
is
that
image
nonvolatile
The
in
given
the
as of that
redo point.
all
it.
desirable
in
be needed.
complete
copy
is found
not
than
be needed.
minimum(minimum(RecL
in
Hence,
to date
record
that
entity
LSN(begin_chkpt
image
buffer
does
convenient
latter
will
will
recent
assertion
more
the
device
the
system
as described
operation
most
the
than
If the
the
since
transaction
be
in
storage
since
and
accommodate
no locking
copy
The
is
less
but
the
copy checkpoint.
to
efficient
also
copying,
method
image
of
may
of synchronization
level,
record
the
present
nonvolatile
operation
buffers.
image
be
the
more
a copy
it
may
from
Since
system’s
amount
page
pages
much
copying,
presented
minimal
chkpt
direct
incremental
at the
be
such
be eliminated.
the
the
copied
directly
usually
transaction
modify
the
during
will
support
of
Copying
would
be exploited
overheads
to
some
buffers.
object
can
manager
have
of
system’s
chkpt
as the
the
one
restart
redo
point.
When
media
reloaded
redo
point.
being
recovery
and
then
During
recovered
the
are
unless
the
information
or the
LSN
on the
a log
record
refers
record’s
LSN
image
copy
pared
to the
end
of the
as
until
an
page
that
is reached,
had
if there
made
recovery.
be kept
DBA
table
end
analysis
of the
arbitrary
recovery,
ACM Transactions
in
any
to the
in
DB2—see
from
last
.pages
list
log
must
its
are
undone,
about
the
identities,
6.4)
in
or
complete
an
com-
Once
then
as in
of
exceptions
may
be
the
those
the
etc.
undo
such
table
obtained
checkpoint
in
if
log
of the
LSN
be redone.
entity
the
record
and
list
redo,
and
transactions,
(e.g.,
entity
applied,
restart
y_pages
accessed
update
Section
the
dirty
begin–chkpt
be
somewhere
pass
dirt
the
are
during
in-progress
information
separately
the
must
if the
record’s
is
recovery
to
updates
Unlike
entity
media
relating
checkpoint
of the
page
are
provides
every
database
records
of the
the
the
by
log
log.
logging
ARIES,
LSN
changes
The
may
log
from
corresponding
is not
to check
record’s
the
copy
LSN
an
needs
a page
version
starting
it unnecessary.
log
the
in
image
makes
the
Page-oriented
Since,
to
image-copied
the
that
that
the
in the
page
all
and
than
log
performing
scan,
then
of restart
such
redo
the
is initiated
processed
is greater
transactions
scan
checkpoint,
transactions
pass
is required,
a redo
recovery
database
page
the
is
page’s
damaged
recovery
can
independence
update
in
be
the
amongst
is logged
separately,
nonvolatile
storage
accomplished
on Database Systems, Vol. 17, NO 1, March 1992
easily
by
objects.
even
and
extracting
if
the
ARIES: A Transaction Recovery Method
an
earlier
copy
version
with
of the
systems
index
and
from
damage
the
image
rolled
back
be
so
useless
tion
being
not
made
had
would
log
records,
recovery
(see
if it
changes
while
problems
the
before
the
gets
database
code
is
may
occur
key)
or due
process
had
operation
to
update.
page
volatile
the
to
read
is
all
is
started
manager.
[151.
The
The
bit
from
set
is
modified),
the
read
or write,
case
automatic
is unacceptable
a broken
updates
that
version
of the
those
to
the
’1’
bit
page
page
‘O’.
situation
date
by
corrupted
that
were
is
rolled
of restart
only
because
by
rolling
updated,
entire
letting
From
restart
page
storage.
in
ACM Transactions
an
but
recovery
were
A related
the
fixed
redo
missing
by
non-
page
state
scan
of the
the
buffer
automatically
the
page
header.
the
update
page
LSN
is latched,
to ‘l’,
for
in which
viewpoint,
to recover
all
in the
problem
state
cor-
the
and
system
page
the
the
availability
transaction
every
Once
is equal
the
redo
a page
value
hitting
expensive
by
logged
whenever
to see if its
in
X-latched.
is
that
from
buffer
a bit
by
recover
operation
update
this,
which
abnormal
page
the
using
and
itself,
before
to
and
changes.
such
forward
for
pool
the
an
roll-forward
recovery
by
buffer
on noting
of the
The
termination
generally
efficient
is fixed
left
over
(e.g.,
action
It
page,
Given
the
on nonvolatile
pages
interruption
version
is initiated.
down
in the
implement,
system’s
internal
is tested
recovery
to bring
were
to
in the
state
to
page
alternative
describing
way
is detected
(i. e.,
in
transac-
pass
not
an
page
the
being
result
An
process
uninterruptable
that
of
page
to skip
process
DB2
remembered
kind
to
had
back
analysis
record
limit.
of a page
this
page
up
RecLSN
is reset
first
an
for
rolled
corrupted
user’s
time
after
complete
bit
it
records
transactions
may
recovered.
application
uncorrupted
records
this
corruption
is
of the
in
starting
log
the
scans
to a page
like
circumstances,
the
does
the
operating
bring
log
DB2
operation
for
and
relevant
by
CPU
written
by
rollback)
such
to
abnormal
a log
to the
process
these
an
changes
because
its
all
storage
using
that
put
Given
rupted
such
exhausted
of
the
not
transactions
pointers
be
of
systems
terminations
attention
may
to write
executed
are
when
logging
18).
making
performance-conscious
scans
being
which
total
any
some
forward
of reconeven
to the
or
(e. g.,
recovery
to date
If
that
R during
Figure
up
backward
page
System
a chance
if CLRS
changes
out
place
because
is actively
process
turns
database
also
the
log
and
the
but
process
the
what
10.2
of
in
R),
attention
any
These
and
for
backward
to the
log
as it is done
pages
media
the
Section
Individual
If
performed,
pages
undone.
made
undone.
for
state
that
updates
written,
operation
partial
be
then
pages’
not
index
System
paying
forward
complete
even
133
is to be contrasted
expensive
(commit,
they
are
any
are
a page’s
should
totally,
if
some
the
in
rolling
This
for
the
require
any,
see
be to preprocess
back
of
or
they
work
pages
bringing
and
records
Also,
state
if
to
that
require
would
actions,
partially
since
rebuilding
data
then
copy
above.
log
is damaged).
state
required
recovered
may
transaction’
what
would
which,
(e. g.,
(e.g.,
copy
the
determine
in
a page
is performed,
representing
image
as described
pages’)
object
explicitly
undo
from
such
an
log
R
of an index
is performed
when
System
entire
one page
from
the
management
to
the
page
using
like
space
structing
only
of that
page
.
those
logged
uncorrupted
is to make
the
it
from
sure
abnormally
on Database Systems, Vol. 17, No. 1, March 1992.
134
C. Mohan et al.
.
terminating
process,
leaving
and
latch,
unfix
calls
footprints
enough
the
user
are
around
process
issued
before
aids
by
the
transaction
performing
system
system.
operations
processes
like
in performing
fix,
the
By
unfix
necessary
clean-ups.
For
the
CLRS
variety
This
good
only
page
9. NESTED
TOP
committed,
not.
with
when
do need
illustrated
in
we
may
the
the
be allowed
transaction.
well
context
If the
of file
the
extension.
data
extended
area
the
effects
if the
of the
starting
mechanism
is,
transaction
and
In
ARIES,
the
of course,
the
very
should
not
which
is
dependent
undone
irrespective
of the
once
transaction
execution
nested
top
consists
action
(1)
ascertaining
(2)
logging
nested
(3)
on
the
the
top
completion
step
We
of the
of the
is
enclosing
until
that
of
the
be
initiating
unacceptable.
action,
to support
indepenfor
our
a transaction
and
some
is logged
inde-
transaction
we are able
to initiate
complete
action
[511. A
top actions
top
of actions
top
to
purwhich
later
action
stable
storage,
which
define
transaction.
a sequence
following
of the
undo
nested
it
traditionally
between
having
in the
of actions
a
steps:
current
transaction’s
information
last
associated
with
log
the
record;
actions
of the
and
of
UndoNxtLSN
A
sequence
nested
performing
and
action;
actions.
been
would
top action,
without
data
independent
which
very
completion,
waits
conflicts
not
might
transactions.
their
have
The
to lock
undo
system
called
proceeding.
subsequence
position
redo
actions
extending
it would
committed
before
is
a file
transactions
then
to the
or
This
of the
an
transaction
of a nested
the
Such
transactions,
the
the
outcome
A
of
efficiently,
any
on
back,
a failure
kinds
other
to roll
other
to be
commits
extends
database,
commit
transaction,
concept
to mean
be
[521, which
themselves.
to the
updates
by
before
to perform
is taken
in the
by the
vulnerable
the
a transaction
independent
independent
requirement
transactions
poses,
an
commits
using
above
dent
such
transaction
After
extension.
performed
independent
initiating
pendent
in
locking.
a transaction
of
updates
extension-related
were themselves
interrupted
to undo them,
These
is necessary
by
writing
page
ultimately
these
were
database
performed
only
updates
prior
transaction
of updates
hand,
transaction
elsewhere,
suggested
transaction
for
to some system
to undo
other
some
the
y property
extending
to a loss
lead
the
approach,
like
whether
atomicit
to use
be acceptable
no-CLRs
would
of
causes updates
which
and
is supporting
locking.
irrespective
We
the
section
in this
system
ACTIONS
are times
There
mentioned
even if the
idea
is to be contrasted
supports
On
of reasons
is a very
the
points
nested
to the
top
log
action,
record
writing
whose
dummy
CLR whose
was remembered
in
a
position
(l).
assume
associated
that
updates
the
externalized,
before
are
to only
referring
ACM Transactions
effects
to system
the
the
of
any
data
dummy
system
actions
normally
like
creating
resident
outside
CLR
is written.
data
that
on Database Systems, Vol
When
is resident
17, No 1, March 1992,
a file
the
we
in
the
and
their
database
discuss
database
are
redo,
we
itself.
ARIES: A Transaction Recovery Method
.
135
*
Fig. 14.
Using
roll
this
back
nested
after
will
ensure
that
not
undone.
If
written,
then
nested
top
redo-only)
top
the
the
top
the
action’s
action.
Unlike
sense
be thought
advantage
quent
to
do we
costly
Figure
14 gives
5. Log
transaction’s
rolled
It
then
we
writing
be
do not
conflict
can
example
6’ acts
log
action
will
as
CLRS,
there
redo
record
for
enclosing
pay
the
the
price
to redo
nested
not
with
in
a
The
wait
its
a new
this
when
action.
need
of starting
Contrast
the
CLR
top
proceeding
to
for
dummy
transaction
before
problems.
The
the
opposed
property
is nothing
pass.
is
since
(as
atomicity
are
CLR
undone
undo-redo
desired
the
be
action
for
subse-
transaction.
approach
with
the
interrupted
the
the
nested
update
CLR.
by
nested
that
using
top
of the
and
action
it
actions
enclosing
needs
to
be
undone.
implementation
of only
redo-only
log
action
top
relies
a single
nested
index
the
hence
is not
consists
a single
of the
though
and
action
action
method
consisting
Even
a failure
nested
top
action
CLR.
top
Applications
storage
top
dummy
record
update,
and
concept
management
can
avoid
in the
be found
62].
RECOVERY
section
(e.g.,
additional
of the
goals
and
System
R,
be
In particular,
were
problems
and
found
recovery
to motivate
which
of the
locking
can
existing
in ARIES.
some
record)
discussion
features
our
PARADIGMS
describes
granularity
include
top
of a nested
If the
that
dummy
storage
as the
that
of a hash-based
[59,
This
is
emphasized
dummy
top
the
to
CLR
approach.
an
history.
the
context
of
we
lock
6’ ensures
should
nested
written
the
were
dummy
before
commit
stable
record
activity
back,
on repeating
ing
into
to
the
of the
normal
is that
forced
then
occur
the
during
independent-transaction
3, 4 and
10.
the
of as the
6 Also,
run
for
approach
are
transaction
action,
as part
provides
encountered
be
actions.
Nor
in
is
of our
record
records
enclosing
top
to
were
nested
This
the
nested
performed
failure
log
if
of the
incomplete
CLR
can
approach,
updates
a system
a dummy
this
action
completion
10g records.
nested
Nested top action example.
in
[97].
methods
the
need
for
Our
aim
in
the
providing
rollbacks.
is to
show
us difficulties
certain
why
with
transaction
caused
we show
developed
associated
handling
some
features
of the
context
how
certain
in accomplishwhich
recovery
of
fineSome
the
we
had
to
paradigms
shadow
6 The dummy CLR may have to be forced if some urdogged updates may be performed
other transactions which depended on the nested top action having completed.
page
later by
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
136
C. Mohan et al.
.
are inappropriate
technique,
high
levels
of
paradigms
have
algorithms
with
The
System
— undo
been
In
adopted
redo
work
preceding
of
context
or
and
of WAL,
[3,
15,
of interest
of
those
leading
16,
is a need
there
more
52,
to the
71,
for
System
72,
R
design
78,
of
82,
881.
(i.e.,
no
are:
recovery.
redo
updates
one
errors
are
restart
is to be used
past,
and/or
that
during
logging
WAL
the
in the
limitations
R paradigms
— selective
— no
when
concurrency.
work
during
performed
restart
recovery.
during
transaction
rollback
CLRS).
— no logging
—no
of index
tracking
no LSNS
10.1
goal
has
been
subsection
implemented
in
aim
is to
in
supporting
is to motivate
why
database
recovery
undo
pass
(see Figure
6).
redo
pass.
As
show
is incorrect
DB2,
on the
[311.
We
as we
the
describing
record’s
the
log
record’s
is less
than
performed
not
ation
The
make
undo
the
page
(see
to
ACM Transactions
the
and
and
an
then
undo
redo
the
preceding
WAL-based
pass,
System
in-doubt)
selective
approach
redo
transac-
paradigm
to take,
it
Writing
the
an
If
has
of
many
the
the
CLR
when
also
pass,
then
the
not
it
an undo
when
page,
is not
on Database Systems, Vol. 17, No. 1, March 1992,
is
the
actually
a failure
to
LSN
action
is
Whether
a CLR
rolled
the
set
page
undo
being
record
record’s
than
page.
of the
are
contain
to handle
handling
less
if the
no
as part
log
LSN
on the
actions
does
force
is
a
before.
of a log
the
page’s
undo
consider
described
LSN
is performed
on
us
LSN
the
performed
page
not
in
as
page
and
transaction’s
and
and
whether
performed
the
LSN
to be undone,
been
locking
inconsistencies
Let
to the
the
During
undo
have
to be necessary
implemented.
redone
record
page
to data
determine
page.
15).
when
be
only
lead
is compared
to
is
actually
simpler
support
will
contains
the
log
when
DB2,
page
Figure
be
even
out
the
perform
pass
The
(i.e.,
it
recovery.
pass
of
During
While
to
Otherwise,
written,
way.
undo
locking.
prepared
were
page
to
would
recovery
turns
redo
that
that
generally
a
R paradigm
efficient
as
update
of the
that
is written,
log:
and
redo
WAL-based
the
opposite.
LSN
the
the
page.
needs
media
page
to
L SN
is always
each
the
LSN
updates
CLR
(i.e.,
problems
they
the
System
approach
locking
then
the
of
performs
redo.
such
be reapplied
on the
in a special
the
to
with
fine-granularity
the
This
which
pass,
LSN,
the
history.
to be the
[151.
update
needs
log
or
redo
of selective
show
failures,
the
selectiue
record
in
an
update
ing
redo
if
concept
after
of committed
systems,
technique
During
updates
below,
WAL-based
systems,
WAL
to logged
to
repeats
passes
and
seems
selective
and
R first
does just
this
discuss
2
later,
WAL
actions
call
pitfalls,
in
System
hand,
the
R intuitively
such
will
with
System
perform
changes.
it
locking
restart
updates
other
only
the
systems
ARIES
systems
we
redo
Some
to relate
fine-granularity
transaction
tions
information
itself
introduce
many
When
R redoes
management
on page
Redo
of this
introduces
The
space
state
on pages).
Selective
The
and
of page
describ-
undo
oper-
rolled
back.
update,
just
back
updates
performed
of the
to
on
system
ARIES: A Transaction Recovery Method
T1 Is a Nonloser
Update
30
20
Fig. 15.
during
restart
Selective
recovery.
which
did
not
PI
which
had
to be undone,
Pi’s
LSN
were
the
being
have
completion
the
other
hand,
problem.
page
It
should
locking
Given
these
would
by
the
subsequently
by
Tl)
comes
undone
redo
page
to
or
undo
not.
and
update
with
the
update
with
pass
present
in
value
to
had
to
beyond
the
15
LSN
perform
the
page.
is
the
20
and
since
16
it
30 since
the
or
than
page-LSN
or
not
equal
is no longer
a true
its
update
needs
problem
with
even
relies
not
be
of the
the
it
not
(undo
not
current
is
page_LSN
undone
By
redoing
causes
though
LSN).
be
redoing
but
on the
the
to
selective
transaction,
should
indicator
pushed
if
update
records
have
with
So, when
scenario,
update
log
was
update
loser.
transaction,
logic
modi-
T2)
the
latter
undo
by
(say,
would
under
page
20
a loser
former
an
only
by
to a nonloser
to
On
any
to a losing
the
LSN
latter
this
the
the
it.
be
when
respect
where
know
to
of the
appear
not
even
update
illustrate
is because
whether
greater
not
it belongs
would
method
with
with
The
In
if PI
interrupts
would
arises
situation
update
belongs
undo
This
the
would
that,
to undo
WAL-based
established
locking.
LSN
to
we
it
for
and
After
be made
page
U1
[15].
redone.
value
for
written
failure
there
transaction’s
be
the
loser,
Figures
determine
page_LSN
history,
a nonloser
fine-granularity
the
undo
by
being
restart,
of a page
(say,
U2
update
of U2).
next
redo
state
in
Ul)
problem
DB2
earlier
a system
then
selective
of the
transaction
which
of the
the
track
for
would
this
with
transaction
losing
modified
30
LSN
time
of
lose
that
case
an update
an
LSN
the
written,
emphasized
as is the
(>
an attempt
been
was
before
during
and
scenario.
was
(CLll
of l.Jl’
then,
had
or in-rollback)
first
LSN
be
U1’
storage
U2
properties
we
(in-progress
the
U2’
is used,
discussion,
fied
if
there
in
LSN
restart,
update
if there
but
resulting
nonvolatile
of this
the
happen,
to the
to
as if P1 contains
will
to be undone,
changed
to be written
redo with WAL—problem-free
This
PI
Redoes
137
Loser
T2 is a
UNDO Undoes Update
REDO
.
if
repeating
state
of the
page.
Undoing
an
harmless
oriented
DBMS
reuse
data
effect
action
only
locking
and
space,
is not
present
its
as
[81],
unique
will
be caused
the
is
for
they
and
and
in
effect
conditions;
logging,
Rdb/VMS
inconsistencies
when
certain
and
VAX
of freed
even
under
are
other
keys
for
by
not
records.
undoing
an
in
with
implemented
systems
all
present
example,
in
[6],
there
With
original
a page
will
be
physical/byteIMS
[76],
VAX
is no automatic
operation
operation
logging,
whose
page.
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992.
138
C. Mohan et al.
.
0T1
Vr! fe !Mated
~,
F“,2
10
20
I’Jq
,,
.
.
i
LSN
Commit
30
T2 is a Loser
T1 is a Nonloser
REDO Redoes Update
UNDO Will
Try
Update
Though
f
30
to Undo
Is NOT
20 Even
on Page
ERROR?!
Fig. 16.
Reversing
the
the
problem
pass
to precede
greater
the
of that
CLR’S
update
is redone
would
not
redoing
selective
redo
pass,
Figure
to
15,
the
only
30
even
then
the
undo
Since,
would
violate
the
passes
might
lose
of 20
during
in
make
and
redo
is not
durability
log
not
solve
the
undo
actions
page
LSN
assignment
a log
record’s
record’s
present
and
the
the
pass,
the
If
of which
would
than
will
[3].
track
of a CLR
the
update
the
undo
suggested
is less
that
scenario
is
writing
page-LSN
though
and
we
of the
page.
if the
update
redo
approach
30, because
LSN
redo
that
In
than
redo with WAL—problem
incorrect
This
to be redone.
become
of the
either.
were
need
order
Selective
LSN,
on the
atomicity
we
page.
Not
properties
of
transactions.
The
use
to have
be
of the
the
undone
during
shadow
concept
and
what
needs
a checkpoint,
shadow
uersion,
create
points
version
an
the
shadow
version,
As
a result,
there
not
in
and
the
database.7
correct]
which
database,
This
y even
management
is
and
version
not.
all
one
reason
selective
changes
are
updates
of the
logged
other
but
are
unnecessary
page
database,
the
recovery
after
restart
updates
before
the
R
recovery
last
the
current
is performed
during
the
to
called
constituting
which
needs
technique,
two check-
between
even
about
System
The
logged,
thus
restart,
is done
logged
the
redo.
not
page,
updates
it
what
shadow
Updates
During
ambiguity
All
and
with
1).
shadowing
no
the
storage.
updated
R makes
to determine
With
consistent
Figure
System
system
redone.
of the
is
are
be
by
that
on nonvolatile
(see
ery.
in
to
version
database
from
database
technique
action
is saved
a new
of the
page
of page.LSN
recov-
are
in
the
in
the
checkpoint
checkpoint
are
method
reason
is that
index
redone
or undone
are
functions
and
logically.
space
8
7 This simple view, as it is depicted in Figure 17, is not completely accurate–see
Section 10.2.
s In fact, if index changes had been logged, then selective redo would not have worked. The
problem would have come from structure modifications
(like page split) which were performed
which were taken advantage of later by transacafter the last checkpoint by loser transactions
tions which ultimately
committed. Even if logical undo were performed (if necessary), if redo was
page oriented, selective redo would have caused problems. To make it work, the structure
modifications
could have been performed using separate transactions.
Of course, this would have
been very expensive. For an alternate, efficient solution, see [62].
ACM Transactions
on Database Systems, Vol.
17,
No.
1,
March 1992.
ARIES: A Transaction Recovery Method
was described
history.
Apart
As
repeats
repeating
history
commit
some
ultimately
10.2
The
them.
not
time,
paper,
A
CLRS.
of
transaction.
very
care
of the
next
record
may
already
of a system
That
is,
never
the
to be
undone
be rolling
failure
is unimportant
restart
recovery
starts
before
the
the
time
at
written,
of
after
multiple
the
to avoid
redoing
backward
scan,
some
when
to do some
which
the
log
actions
the
The
during
only
the
the
special
to have
to undo
about
a partial
information
at
the
time
track
some
changes
performed
restart.
as of the
version
them
to handle
completed
a little
the
partial
need
wanted
later
having
are
those
the
designers
rollback
last
of
CLRS
is to avoid
The
of
at the
during
since
and
of
only
keeps
transactions,
processing
pass.
The
of a transaction
handling
redo
of the
written
R.
R
shadow
initiated
some
is
this,
a
of progress
database
Despite
special
R
database
of the
is
failure.
checkpoint.
over
state
is
Since
been
state
database
the
rollentire
partial
since
System
active
the
the
level,
have
System
state
in
—this
transactions
last
passes
the
failure
R needs
in-doubt
since
from
system
of the
and
of the
in
of
num-
systems.
System
rollback
uisible
not
system
System
or
rollbacks
each
not
might
in
of
the
Supporting
in
this
any
only
application
occurs
record
The
for
cause
back.
do this
state
for
back.
actions
the
track
to
in
advantages
transaction
to keep
checkpoint
present
13.
and
rollback
easy
would
elsewhere
will
at
a failure
the
a way
the
also
present-day
when
its
roll
not
undone
known
violation
partial
transaction
So,
are
committed
for
need
back
have
whether
these
Section
violation
of writing
in recovery
and
of
around
a significant
play
the
in
concept
been
advantages
In fact,
all
roll-
updates
the
has
literature,
they
to note
the
if
is relatively
checkpoint
database
to
by
describe
While
section
roll
during
last
checkpoint
ability
transaction
introduced
problems
key
a
back
we
taken.
which
the
the
that
the
that
this
causing
for
the
and
advantages
internally,
It
about
is
In
partially
illustrates
storage,
a checkpoint
after
or
in
additional
we try
performed
rollback.
we
but
locking,
us the
and
community.
a unique
be rolling
updates
to nonvolatile
time
3
systems
role
these
totally
least
problems.
many
been,
[56].
statement
at
may
of the
time
what
in
requirement
a transaction
transaction
redo,
9.
CLRS
to them
research
contexts,
Figure
31],
important
effects
and
example,
update
[1,
gives
difficulties
of the
in
really
by the
may
the
rollback
It
Section
writing
fundamental
summarize
For
the
how
relating
questions
We
of reasons.
back
selective
fine-granularity
of whether
in
discuss
some
not
the
appropriate
transaction
ber
has
undone
open
to
and
problems
and
be
as
in the
writing
perform
effect.
described
solves
recognized
could
left
side
implemented
there
utility
well
actions
were
been
of CLRS,
been
is
progress
rollbacks
has
Their
support
irrespective
as was
subsection
during
a long
not
to
beneficial
of a transaction
their
CLRS
discussion
another
does
us
139
State
of this
writing
allowing
or not,
in tracking
performed
for
has
commits
goal
ARIES
from
actions
Rollback
backs
before,
.
with
a
occurred
is encountered.
Figure
All
log
18 depicts
records
are
an
written
example
by the
of a restart
same
recovery
transaction,
scenario
say T1.
ACM Transactions on Database Systems, Vol
for
In the
System
R.
checkpoint
17, No. 1, March 1992.
140
.
C. Mohan et al
Last
‘“g~
Uncommitted
Changes
Need
Undo
Fig. 17.
Committed
Changes
Redo
Or In-Doubt
Need
Simple view of recovery processing in System R
~..----_- . .
12
3
5,,.’-6
4
7
8 ::jg
Log
@
Checkpoint
Fig. 18.
record,
the
information
checkpoint
was
partial
rollback.
write
a separate
information
records
of
points
the
log
to the
after
the
record
pointer
3, we conclude
with
the
During
log
5 and
of
records
6,
during
the
during
the
will
be undone
Here,
the
To
7,
and
redo
8.
pass,
pass
in
9 is a commit
and
same
during
transaction
see why
the
rollback
it
is
is the
undo
To
that
log
is patched
record
record
the
a partial
pass
is involved
pass
has
log
both
records
in the
to precede
or not.
9 points
caused
to log
undo
undo
redo
are
not
pointer
9.
log
record
5 will
pass
log
undo
record
pass,
pass
to
the
records
4 and
the
as of the
a forward
it point
of 3
from
1 needs
back
putting
the
undo
state
transaction
had
rolled
its
log
Whether
rollback
during
the
database
record
by
that
preceding
log
5 to make
then,
redo
the
via
When
notice
with
is a losing
that
ensure
the
log
T1
log
written
protocol.
database
of the
Such
of the
record
this
a
not
transaction
4 and
started
state
does
a transaction
log
to be undone.
determined
that
that
the
the
of
place.
immediately
that
needs
also
by
follow
restart,
on whether
it is concluded
analysis
record
pass.
pass
by
record
of the
during
2 definitely
depend
analysis
log
If log
will
hence
partial
Since,
written
not
log
1, instead
to be performed
record
or not
redone
2.
does
time
chaining
processing
pass,
it
took
the
written
forward
the
because
but
rollback
in
record
recently
by
undone
CLRS,
breakage
rollback
to
write
a log
first
2 since
been
a partial
the
analysis
the
of
needs
the
record
that
undo
checkpoint,
to be undone
of the
record
not
that
most
the
is pointing
ended
recovery
was
of a partial
record
which
say
from
But
as part
Prev-LSN
to
in System R,
already
does
Ordinarily,
that
completion
to log
3 had
only
inferred
pointer.
examine,
last
R not
a transaction.
handling
points
record
record
be
rollback
T1
log
System
PrevLSN
we
for
taken
must
Partial
and
in
2
be redone.
in
the
System
redo
R,g
g In the other systems, because of the fact that CLRS are written
and that, sometimes, page
LSNS are compared with log record’s LSNS to determine whether redo needs to be performed or
not, the redo pass precedes the undo pass— see the Section “10. 1. Selectlve Redo” and Figure 6.
ACM Transactions
on Database Systems, Vol
17, No. 1, March 1992
ARIES: A Transaction Recovery Method
consider
the
allowed
to
following
reuse
transaction,
the
in
partial
ID
with
sequence
redo
the
Since
record’s
above
rollback,
record’s
dealt
scenario:
that
ID
case,
which
have
been
reused
redo
pass.
To
of actions
be fore
the
dealt
in
have
with
the
repeat
the
portion
the
undo
is
same
because
pass,
and
transaction
respect
must
the
deleted
of the
with
by
undo
141
a record
later
been
in
history
failure,
deleted
inserted
might
to be
might
that
a record
a record
had
in
the
a transaction
for
.
to
of
that
that
the
be performed
is
original
before
the
is performed.
If 9 is neither
will
be
and
1 will
Since
one
a commit
determined
to
be undone.
CLRS
are
forward
as well
from
what
to this
Section
being
5.4).
being
done
example:
A
T1
had
logged
the
will
be a data
3
the
fancy
by
different
integrity
high
not
ln
will
logic.
This
whether
on
10.3).
Allowing
concurrency
of
Let
value
O after
2, T1 rolls
redo
and
because
case,
operation
after
recovery
Of
object.
does
not
necessarily
or
not
to be supported
[59,
the
undo,
will
for
T1
System
R did
the
and
is
not
the
being
support
updates
very
of
redo
efficiently
logging;
is
logically
examples).
T2
these
have
logging
management
621 for
Then,
2 concurrent
information
to
an
then
data
byte-oriented
storage
of undo
(see
for
also
has
If T1
performed
mean
flexible
logging
be
object
checkpoint.
Allowing
recovery
(see
us consider
commits.
to support
space
normal
information
F?, undo
course,
be needed
same
T2
System
redo
last
and
the
in
update.
the
the
back,
some
redo
is
further
processing
logging
the
different
during
on an
let
the
occur
or undo
a given
history
cause
not
which
R also
performed
would
to
physically
redo
quite
operation).
its
which
be
repeating
potentially
in
for
in System
did
2
transactions’
by
this
redoing
mode
may
(i.e.,
prevents
way
operation
for
transactions
depend
Section
2.
other
the
has
adds
exact
records
be redone.
created
problem
of
by
dumb
using
data
1, T2
after-image
lock
information
will
of
adds
instead
accomplished
(i.e.,
that
resiart
also
after-image
piece
transaction
the
CLRS
the
restart
could
split
will
processing,
changes
These
a
during
physically
the
as
transaction
log
with
the
processing
index
8).
such
writing
during
logging
footnote
required
Not
be logged—not
value
(see
problems
processing
pages
normal
Not
hence
known,
the
pass
records
interspersed
is not
then
undo
of the
R and
were
different
the
none
System
actions
record,
during
pass,
in
during
to guarantee).
contributes
redo
or undo
as across
management
from
the
a prepare
and
operations
happened
impossible
nor
a loser
written
undo
processing
page
In
not
transaction’s
record
be
that
used
will
ARIES
(see
permit
supports
these.
WAL-based
during
the
systems
rollbacks
data
is
being
rolled
which
the
rollbacks.
locking.
using
always
back.
Gontrast
method
immediate
to be rolled
and,
worse
more
than
once.
started
forward,
That
once
rolling
data,
works
then
still,
the
This
this
the
some
by
with
of its
before
in
the
ACM Transactions
logging
some
LSN,
page
failure
(or
are
are
4, in
of
in
the
which
granularity)
more
undone,
a transaction
system.
in
during
if a transaction
undone
also
of
are
[521,
back
coarser
is that,
actions
actions
Figure
suggested
CLRS
state
actions
is “pushed”
level
original
performed
the
original
approach,
the
actions
is concerned,
if
of writing
compensating
is illustrated
even
even
with
only
by
as recovery
as denoted
consequence
back,
back
problem
“marching”
of the
were
this
So, as far
state
The
handle
CLRS.
Then,
than
possibly
had
during
on Database Systems, Vol. 17, No. 1, March 1992.
142
C. Mohan
.
recovery,
CLRS
the
the
are
previously
undone
idea
lock
of writing
Section
the
next
CLRS
ARIES
avoids
Not
undoing
CLRS.
and
12, and
section
early
release
Section
and
Unfortunately,
support
written
again.
management
22,
et al.
in
6.4).
[691.
Additional
We
has
already
like
that
benefits
still
in
also
to
dead-
(see
item
are
discussed
the
Section
suggested
in
is an important
non-
retaining
relating
objects
of CLRS
one
undone
while
discussed
the
this
already
on undone
benefits
were
feel
and
a situation,
CLRS
methods
rollbacks.
undone
such
of locks
Some
recovery
partial
are
[921
in
8.
do
drawback
not
of such
methods.
10.3
Space
The
goal
Management
of this
management
length
records
A
are
management
deletion
in
the
before
the
Since
some
commit
of the
storage
do (see
byte
the
something
like
which
describes
how
is that
garbage
to lock
or log
us the
flexibility
and
modify
have
to
reduce
be
Figure
state
of being
run
e.g.,
in
space
left
in
collects
are
it.
the
in
redo
This
the
space
around
records
which
In
with
not
keeping
nonvolatile
from
an
earlier
management
the
same
is attempted
shows
the
is
page
and
on a page
need
for
on the
record
does
page.
not
IMS,
of the
same
has
tracking
the
gives
to store
utilities
These
version
Assuming
have
This
a page
in
ACM TransactIons on Database Systems, Vol. 17, No. 1, March 1992
log
consequence
used.
the
and
looks
The
point
which
want
name
like
track
exact
not
fragmentation.
storage
as
a location
within
systems
do
address
logging
The
the
storage
did
lock
on a page
around
the
The
record.
within
to
a page,
We
changed.
under
desirable
record.
identifies
got
efficiently.
deal
#
with
to use
record’s
pre-
another
[62].
within
page.
index
to
by
in
not
space
For
want
want
of the
unused
in the
storage
not
The
slot
not
is dealt
was
on the
record
to move
LSN
involve
for
location
to
bytes
name
moved
records
it
problem
this
consumed
of data
did
page.
do
another
This
to [50].
undo
a goal,
logging
during
to
is described
is, we
data
that
perform
19
a
that
we
way
storage
by
solutions
being
changed
to users.
the
200
lock
frequently
flexible
Figure
requiring
That
actual
of the
a scenario
to
when
was
flexible
is referred
from
and
y of data
storing
space
varying
a transaction
is committed.
approach
# ) where
able
in
and
consumed
with
The
were
the
length
quite
attempting
updates
free
records
19 shows
problems
insert
collection
the
availability
(by,
slot
to
not
reader
undo
within
contents
variable
the
#,
points
the
that
logical
(page
then
as the
with
by
concurrency,
locking
bytes
is
deal
transaction.
a logical
of a record
be
not
transaction
811).
locking
released
page
interested
one
[6, 76,
involved
of locking
transaction
do
management
specific
to
space
data
We
first
using
byte-oriented)
have
page
and
of the
(i.e.,
first
locking
by
flexible
to identify
a
The
problems
record
the
of increasing
released
systems
on
here,
circumstances
physical
that
[761.
the
granularity
doing
space-releasing
in
out
level
efficiently.
in
sure
interest
space
the
such
the
problem
updates,
vent
with
update
briefly
reservation
point
page
to be supported
or
until
discussed
to
than
is to make
transaction
is
is
finer
to be dealt
problem
record
subsection
when
actual
page
of the
page)
log
leads
that
transaction,
only
an
100 bytes
of
page
to
all
of
state
ARIES: A Transaction Recovery Method
Redo Attempted
From Here.
It Fails Due to
Lack of Space
Page Full
As of Here
143
.
Page State
On Disk
.Og
Oelete
RI
Free 200
Bytes
Fig. 19.
using
an
Typically,
each
pages
map
pages
possibly
based
location
of other
the
record,
new
enough
free
page
on
records
one
at
recovery
to 25%
not
require
If
are
already
update
change
the
need
for
to do logical
logging
That
is,
but
would
cause
state
a change,
to
the
does
needs
not
an example
ing
the
is not
then
exact
the
We
perform
the
of the
space
easily
update
to
to
the
handling
of
recovery
to
not
cause
as
space
the
system
FSIP
during
performed
during
a CLR
the
and
for
the
to determine
which
and
if it
describes
in
forward
rollback.
during
be
to the
updates.
example
during
the
would
points
to change
an
FSIP
which
has
the
to
record,
inventory
information
O%
does
then
update
changes
free
construct
which
scenario
full
from
back,
an
O% full,
This
write
full,
roll
23%
it
a redoiundo
redo-only
and
from
change
were
to say
to the
to the
update
space-re-
to provide
to go to 35%
page.
as
FSIP
of the
update
to change
record
to the
update
an update
in which
inverse
can
an
log
update,
the
special
FSIP
T1
entry
free
only
25%
every
avoid
of
with
keeps
least
not
also
page
FSIP
an
page
the
should
FSIP
that
and
space
data
page
the
update
FSIP.
FSIP
a data
at
the
as that
be logged.
if
this
respect
a data
causes
to perform
construct
to
with
undoing
the
of the
changes
to
change
the
undo,
about
keys)
requires
To
also
Now,
and
FSIP
sure
page
on the
update
FSIP.
full
current
undos
must
cause
the
3 l%
and
space
an
its
operation
change
transaction
the
might
to
the
while
that
T2
to
make
FSIP.
FSIPS
cause
written
need
cause
to the
an
to
operation,
index
The
a
space
information
insert
as that
has
called
space
related
record.
such
a data
redo
requiring
rollback
whether
during
might
Later,
given
to
the
to identify
new
corresponding
FSIPS
full.
had
etc.)
relations
are
a clustering
consulted
the
more
They
a record
(or closely
information
is full,
the
key
inserting
or
describes
from
are
operation
thereby
T1
T1’s
wrong,
5090
updates
would
FSIP.
for
one
During
same
FSIPS
(e.g.,
in
T1
full,
full
space
it
least
of the
Transaction
ing,
which
(FSIPS).
FSIP
pages.
the
or more
in
of
obtained
with
-consuming
independence,
the
/’
with space for insert.
operations
pages
Each
index
information
information
27%
or
information
or
space
does
redo
records
inventory
DB2.
data
space
is full,
leasing
space
in
many
approximate
then
problem
to
containing
free
(SMPS)
to
to
attempting
file
called
relating
the
avoid
Insert
Commit
R3
Consume
100 Bytes
to the page.
applied
few
Oelete
R2
Free 200
Bytes
Wrong redo point-causing
to
LSN
Insert
R2
Consume
200 Bytes
We
forward
which
a
processcan
also
process-
rollback.
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
144
@
C. Mohan
10.4
Multiple
LSNS
Noticing
the
support
record
problems
object’s
state
explain
why
DB2
precisely
in
transactions
state
an
LSN
for
the
log
is set
the
to determine
LSN
the
for
This
storing
able
for
the
and
supported
best.
when
tends
keys.
Further,
key
desired
into
handle
the
performing
a fixed
number
space
reservation
fine-granularity
the
terminology
11.
OTHER
In
the
methods
shadow
because
of
their
blocks
the
(see
the
sions).
First,
which
we
methods
along
method
Siemens.
are
unable
ACM Transactions
also
(like
each
at
page,
even
media
history
during
transactions
turns
out
divides
is
one
length
introduce
up a
needed
proposed
to
in
[61]
objects
(atoms
in
other
significant
in
the
this
has
been
for
the
extra
paper
and
different
have
shadow
1/0s
systems
been
of information
we
on Database Systems, Vol
additional
and
recovery
that
about
17, No. 1, March 1992
the
of
data,
page
for
significant
it here.
copies
compare
here
checkpoints,
involving
informed
with
based
considered
costly
[31]
Next,
methods
not
very
and
implemented
of lack
R) are
e.g.,
section.
We
some
Recovery
of System
of data,
of this
of
protocol.
overhead
dimensions.
because
properties
WAL
that
space
sections
of [25]
to include
the
clustering
previous
But,
to be
especially
the
varying
case
have
is cumbersome
for
technique
like
disadvantages,
storage
various
support
the
use
technique
be examining
recovery
we
summarize
we briefly
will
by
we
physical
special
avail-
to the
METHODS
well-known
nonvolatile
disturbing
do not
which
page
no
Methods
on
paper).
WAL-BASED
following,
minipages,
space
physically
DB2
is
overhead
objects
repeating
of loser
it
record’s
undone
waste)
recovery,
of
Since
log
space
objects
page
undo,
to the
(LSN)
the
The
During
length
deleted
rollback
field.
conveniently
varying
in ARIES.
problem.
locking
of that
on the
extra
of
over
technique
the
seen
therefore
having
is updated,
much
(and
way
of loser
minipage’s
to be actually
too
The
besides
LSNS.
variable
to make
done,
simple
each
LSN
carry
for
state
being
before
as we have
for
recovery
is
The
recovery
page
LSNS
efficient.
to be sufficient,
not
when
a single
locking
very
fragment
12].
of
2 to 16
actions
tracks
needs
option
into
[10,
is compared
incurring
a page.
the
index
redoing
minipage
update
it does
has
a minipage
that
besides
especially
have
to
trying
than
minipage,
minipage
LSN
record’s
to
each
in the
of the
Maintaining
to
minipage
recovery,
restart
locking,
efficiently.
We
log
technique,
less
of the
not
Whenever
page
is
user
DB2
with
is stored
the
LSNS,
storing
of record
not
if that
minipage.
LSN
maximum
and
despite
is as follows.
an
LSN
to the
LSN
when
of a minipage
pages,
pass,
the
leaf page
granularity
as a whole.
record’s
equal
minipage
redo
page
page
that
where
up each
on such
associating
leaf
per
of locking
indexes
at the
the
by
corresponding
LSN
of
divide
properly
during
LSN
to suggest
that
we track
each
a separate
LSN
to each object.
Next
we
idea.
case
do locking
separately
one
tempting
a granularity
the
recovery
having
be
assigning
to physically
and
does
by
may
by
supports
DB2
minipages
DB2
it
it is not a good
happens
requiring
caused
locking,
already
This
et al
the
the
map
discusmethods
different
DB-cache
modifications
implementation,
ARIES: A Transaction Recovery Method
IBM’s
IMS/VS
[41,
42,
database
system,
consists
relatively
flexible,
and
has
restrictions
many
transaction
can
buffering
locked
objects
fixed
length
make
the
page
locking
provides
ports
data
pOOk
[80,
sharing
different
locking
minipage
and
repeatable
read)
tables
A
in
support
field
to
Only
many
high-
IMS,
locking,
with
only
calls)
records.
have
each
storage
support
support.
global
FF,
of the
main
MSDB
DEDBs
with
also
its
own
supbuffer
record)
prefix
and
and
unlocked
or permanently
even
Schwarz
[881
(a
la
differences,
for
IMS)
as
which
is much
been
implemented
will
be
less
complex
CMU’S
management.
adopted
OLM
the
write
storage
and
written
back
in
pages
DB2
an
steal
a
that
has
Encompass,
and
fetch
no-force
record
end-write
to nonvolatile
OLM
alone.
might
The
operation
[23,
have
a sophisticated
been
in
buffer
stability,
repeat-
off temporarily
based
logging
SQL,
on
have
value
several
method
method
are
help
a
OLM,
normal
a page
the
commit
granularities
methods
logging
time
These
records
two
During
whenever
storage.
These
and
two-phase
locking
value
NonStop
every
Tan-
(VLM),
(OLM),
has
901.
policies.
record
changes
updates
methods
The
the
Camelot
some
data
on files.
below.
than
and
IMS
multisite
be turned
recovery
logging.
outlined
and
NonStop,
(cursor
can
operations
operation
Abort
and
stability,
Encompass
allow
levels
Logging
different
With
different
data,
loading
DB2
Both
They
DB2
supports
off temporarily
[4, 37] with
Presumed
consistency
read).
two
and
in
and
nonutility
presents
access.
supports
for
like
[95].
products.
It
(cursor
both
SQL
its
the
page
to be turned
access
The
19].
levels
algorithm
data
or dirty
and
system.
DB2.
15,
operations
can
for
operating
in
14,
table
NonStop
SQL
MVS
13,
logging
recovery
using
NonStop
[1,
utility
support
the
available
consistency
allows
Tandem’s
64].
for
are
transaction
transaction
[63,
key
failure.
for
via
and
In
granularities
(i.e.,
IMS,
in
during
distributed
read,
dirty
IMS
MSDBS
database
system
and
Encompass
able
also
A single
recovery
databases:
systems,
presented
only
single
(file,
ing
But,
(tablespace,
hot-standby
single
of
Buffer
of
is
but
differences.
the
large
functions
indexes)
indexes
The
SQL
a
been
for
incorporated
NonStop
and
database
[10, 11, 12]. DB2
provides
have
many
which
efficient
The
mechanisms
and
different
granularities
page
atomicity.
logging
two
data.
possible
[431.
access
has
data.
been
protocol
DEDBs.
support
data
and
reorganizing
within
for
features
relational
distributed
dem
minimum
is more
(DEDBs).
the
be the
across
have
kinds
a hierarchical
(FF),
indexes).
(FP)
operations,
two
provides
which
secondary
databases
is
Function
93],
Path
the
which
145
941.
algorithm
has
times
hot-standby
recovery
with
FP
for
parts
supports
parallelism
is IBM’s
Limited
for
and
supported
and
XRF,
types
941,
Full
42,
Fast
two
entry
80,
IMS
[28,
and
the
data
but
hold
is
availability
DB2
FP
records,
lock
by
and
76,
Path
FF
database
vary.
53,
parts:
no support
both
used
(MSDBS)
48,
two
Fast
(e.g.,
on the
databases
IMS
access
methods
depending
43,
of
.
is
written
in
buffer
manager
read
dirty
during
[10,
and
from
page
identifying
pool
VLM
processing,
nonvolatile
is
successfully
restart
the
at
the
961,
process-
super
time
and
DB2
VLM
set
of
of system
writes
a log
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
146
.
record
C. Mohan
whenever
whenever
after
the
storage.
DB2’s
For
not
writes,
to the
log
are
held
on
to stable
all
the
nonvolatile
pages
of the
release
updates
the
being
is used
pages
to nonvolatile
the
next
lelism
for
for
the
that
were
by
nonvolatile
section
the
in
SQL,
and
logging
Encompass
writing
we
the
dirty
their
storage
MSDBS,
no
version.
Also,
commit
record
not
it
present.
yet
ACM Transactions
is written.
been
ARIES.
it
written
that
DEDB
to
with
gain
paral-
policies.
than
Before
all
the
page
data
pages
locking
is
placed
on
being
considered
are the
to
in
System
R,
the
each
will
this
partial
since
For
DEDBs,
to
nonvolatile
dirty
on
one
present
committed
the
any
updates
committed
storage
is that,
are
on Database Systems, Vol. 17, No. 1, March 1992
(table
MSDBS
of two
files
in
spaces,
alone,
on
non-
is performed
their
changes
are
are
instead
For
updating
and
actions
objects
[961.
contents
NonStop
update
checkpoint
object
operation
IMS,
when
taken
quiesce
The
difference
deferred
be
an
DB2,
even
are
VLM
checkpoint.
major
writes
that
and
DB2’s
Since
ones
OLM
of ARIES.
alternately
no
needed
also
storage
finer
a no-steal
the
go ahead
force
mode.
The
for
changes
is
and
checkpoints
for
a checkpoint.
Care
with
process
and
of the
uncommitted
writing
algorithms
to those
contents
is ensured
steal
similar
concurrently.
table,
uncommitted
user
consistent)
(fuzzy)
a RecLSN
during
the
for
uncommitted
on
with
complete
locking
recovery
take,
_pages
etc. ) list
writes
are
described
page
recovery
similar
take
since
checkpoints
and
going
any
forced
processing.
restart
are
in
Since
transaction
do
are
to what
volatile
system
are
result
processes
After
committed),
transaction
to nonvolatile
some
the
Normal
in the
record
activities
indexspaces,
have
the
all
the
to
records
[28]).
been
not
the
in
commit
necessarily
checkpoint
tion
result
is used—see
as possible
forces
log
completion
to let
follows
the
on
of separate
FF
course,
by
record
of time
transferred
it force
has
(not
locks
log
which,
does
transaction.
during
is not
(not
of the
IMS
Of
log
system
activity
similar
may
modified
buffers
amount
are
FP
a single
record
commit
locks
logic
in
log
the
to let
processes
is used.
MSDB
the
transaction
as soon
FF
checkpointing.
the
consistent
of
this
storage.
force
Normal
all
FF,
DEDB
time
storage
the
the
minimizes
commit
This
use
that
before
the
were
IMS
by
objects
a transaction
policy
in
even
FP
is intended
IMS
modified
supported
when
1/0s.
only
nonvolatile
transaction
and
The
group
processing
a transaction,
how
system
storage
record
dirty
that
a no-steal
applied
is given
to nonvolatile
transaction’s
committing
records
after
The
to
the
means
log
the
are
released
locks.
DEDBs.
This
a given
is
that
DEDB
back
to bring
for
records.
using
forced
policy
are
(i.e.,
DEDBs
records
records
placing
(i.e.,
storage
log
written
DEDBs,
manager
completed
performed
been
updating.
This
ultimately
is
to
log
another
is
For
updates
MSDB
The
storage
logging
After
the
processes.
these
log
locks
and
operation
failure.
the
storage.
is opened,
close
have
deferred
MSDB
MSDB
stable
space
updates.
all
the
The
uses
does
MSDB
time,
The
on
system
1/0s,
pass
FP
indexspace
closed.
as of the
manager.
released.
the
of the
own
storage),
placed
locks
pages
IMS
at commit
on stable
are
is
analysis
see its
or an
space
up to date
MSDBS,
does
is
a
dirty
information
call
a tablespace
such
all
et al.
of a transac-
applied
updated
included
for
checkpointed
after
pages
in
the
the
which
check-
ARIES: A Transaction Recovery Method
point
records.
recovery,
These
any
Encompass
storage
once
of the
this
policy,
NonStop
the
Partial
partial
access
FP
undo
data
MSDBS.
in
The
its
DB2
provide
FF
supports
for
FP
does
because
needs
is always
to get
MSDBS,
time.
into
the
Since
the
some
This
restart
log
with
storage—
when
policy,
none
nonvolatile
writes
undo
such
reader
rollbacks,
often,
has
there
many
VLM
repeated
rollbacks.
media
that
people
amount
is
storage
been
in
there
would
hence
a no-steal
policy
problems
no-steal
and
FP)
write
IMS
FP
might
in-progress
pool
about
to
of the
been
Since
CLRS,
This
without
with
many
to
IMS
the
for
FP
FP
log
which
data
should
no-steal
written
to be undone,
[931.
com-
to nonvolatile
because
have
at
transaction.
written
these
to be dealt
DEDBs,
buffer
unmodified
eliminates
for
the
recovery
and
rollback
(FF
would
recovery.
for
at
IMS
been
corresponding
restart
is done
recovery,
to write
is
it never
from
though,
media
records
This
is performed
discarded
be nothing
just
the
some
and
are
OLM
log
commit
one
updates
to simplify
even
that
to
rollback,
any
processing—i.e.,
Even
information,
needed,
still
for
VLM,
made.
and
having
down.
during
are
commit
DB2,
a normal
locking
most)
already
is accessed
assume
write
system
is
restart
the
FP
hence
with
do not
not
the
written
purged
DB2
During
(at
records
redo
by
updating
page
SQL,
also.
records
only
does
rollback
lists
simply
corresponding
and
information
volatile
the
for
contain
are
went
have
to
deferred
and
by
have
system
the
storage
CLRS
records
the
of
Since
NonStop
log
use
SQL,
in two-phase
is followed
written
of its
application
is performed
During
not
decision
(to-do)
rollbacks
must
some
sup-
supports
that
FP
updating
NonStop
pending
DEDBs
records
transaction
mit,
of
at the
applications
internal
it would
the
state.
in
the
do not
1, IMS
is because
rollbacks.
coordinator
Encompass,
during
find
since
policy
pages
for
normal
until
kept
of
for
[1].
prepared
a no-steal
time.
CLRS
the
updates
modified
rollback
the
Because
VLM
is exposed
deferred
Encompass,
data
a
comple-
waiting
and
to those
is excluded
rollbacks
during
CLRS
such
only
because
atomicity
write
to
FP
and
that
the
page.
2 Release
concept
data
nonvolatile
before
delayed
OLM
Version
savepoint
FP
recovery.
requires
of the
be
SQL,
From
partial
CLRS
not
changes
NonStop
log records.
write
dirtying
to
that
storage
restart
data
pages.
is available
records
statement-level
IMS
IMS
the
reason
log
Compensation
and
fact,
support
FP
pages
policy
may
dirty
rollback.
In
This
data.
old
Encompass,
transaction
level.
the
during
for
dirty
the
nonvolatile
following
of the
examining,
some
of a checkpoint
writing
rollbacks.
program
force
to
for
checkpoint
enforce
be written
checkpoint
rollbacks.
partial
need
the
might
They
completion
of the
the
before
SQL
must
second
completion
avoid
written
a checkpoint.
dirtied
tion
together
records
and
during
page
port
log
147
.
on
the
non-
illustrate
supporting
at restart
for
problems.
to
partial
FP.
Too
Actually,
it
shortcomings.
does
not
write
of logging
will
failures
during
Of
recovery.
course,
OLM
CLRS
occur
during
for
restart.
this
has
writes
restart
a rolled
In
some
CLRS
rollbacks.
back
fact,
CLRS
negative
for
undos
As
transaction,
are
written
implications
and
redos
a result,
even
a bounded
in
only
with
performed
the
for
face
of
normal
respect
to
during
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
148
.
restart
C. Mohan
undomodify
(called
done
to
deal
modify
with
and
rupt
restart
causing
the
CLRS
for
case,
grows
recovery,
writing
the
The
net
result
is
up
IMS
need
Log record
of records)
(or
state)
logs
both
undo
by
to redo
the
objects.
XRF
records
for
to reduce
log
NonStop
OLM
and
logs
DB2
of each
or
But
map
address
work
the
before-
and
of the
OLM
OLM’S
modify,
only
IMS
does
not
and
The
and
set of pages
of
a modified
recovery
CLRS
and
fields.
of Encompass
since
their
consistent
records
of the
snap-
contain
corresponding
parts
VLM
DB2
updated
records
the
in
updated
and
information
undomodify
where
by
the
redomodify
L SNS
of
records.
of
undo
For
information
Encompass
logs an operation
the
redomodify
the
FF
Since
or restart
of updated
and
Ih’ls
occupied
updates.
value
[761).
names
takeover
operation.
redo
undomodify
the
buffer
does
(see
information.
lock
log
policy,
after-image
IMS
enough
the
of
force
(i.e.,
redo
after-images
periodically
but
specifies
update
the
the
record
recovery.
before,
includes
of DEDBs’
only
both
only
of the
information
number
IMS
given
of its
locking
a backup’s
redo
the
information
track
and
information
OLM’S
to
case,
media
them.
others,
a
information.
IMS
undo
object.
which
redo
have
during
of redo
be undone.
undo
the
is used
to contain
shot
a page
logs
description
redo
to
system
the
might
need
backup
need
CLRS
records.
log
the
the
amount
SQL
and
the
IMS
for
for
mentioned
byte-range)
problem.
CLR
during
the
failures
CLRS
Because
redo
As
support,
also
complete
only
(i.e.,
CLRS
only
policy.
hot-standby
information
the
writes
physical
updates,
FP
FP
no-steal
worst
linearly.
updates
information
the
This
also
and
undo
IMS
page.
IMS
the
grows
this
like
same
In
restart
write
failures,
the
In
OLM
CLR’S
of its
logging
CLRS’
log
the
contents.
because
providing
and
not
thus
identical
processing.
avoids
does
multiple
times
of CLRS,
repeated
ARIES
inter-
themselves.
of multiple,
or restart
hence
undo-
failures
CLRS
writing
during
how
of
undo
the
forward
processing.
IMS
changes
and
multiple
if
DB2
written
pass
record
and
5 shows
because
forward
written
will
that,
update
is
This
multiple
for
during
undo
writing
during
records
Figure
the
write
generated
and
records
respectively).
might
are
CLRS
written
of log
during
wind
written
its
record
number
CLRS
might
Encompass
for
OLM
a given
CLRS
of CLRS
a given
for
No
records,
restart.
records
exponentially.
ignores
during
processing.
During
redomodify
and
failures
redomodify
restart
worst
et al
no
modify
also
contain
modified
object
reside.
Encompass
and NonStop
SQL use one LSN on each page
Page overhead.
uses no LSNS, but OLM uses one
to keep track of the state of the page. VLM
LSN.
DB2
uses one LSN
and IMS
FF no LSN.
Not having
the LSN
in IMS
FF and VLM
to know
the exact
state
of a page does not cause
any problems
because of IMS’ and VLM’S value logging
and physical
locking
attributes.
It
is acceptable
to redo an already
present
update
or undo an absent update.
IMS FP uses a field in the pages of DEDBs as a version number
to correctly
handle redos after all the data sharing
systems have failed [671. When DB2
divides
an index
minipage,
besides
ACM Transactions
leaf page into minipages
then it
one LSN for the page as a whole.
on Database Systems, Vol
17, No. 1, March 1992.
uses
one LSN
for
each
ARIES: A Transaction Recovery Method
.
149
Log passes during
restart
recovery.
Encompass
and NonStop
SQL
two passes (redo and then undo), and DB2 makes three passes (analysis,
make
redo,
and
their
then
undo— see Figure
redo
passes
This
is sufficient
dirty
page
from
the
because
within
6).
Encompass
beginning
two
of the
of the buffer
checkpoints
and
NonStop
penultimate
management
after
the
SQL
start
successful
policy
page
checkpoint.
of writing
became
to disk
dirty.
They
a
also
seem to repeat history
before performing
the undo pass. They do not seem to
repeat history
if a backup system takes over when a primary
system fails [41.
In the case of a takeover
by a hot-standby,
locks are first reacquired
for the
losers’ updates and then the rollbacks
with the processing
of new transactions.
using
a separate
that
point,
process
which
is
to gain
of the losers are performed
in parallel
Each loser transaction
is rolled back
parallelism.
determined
using
DB2
successful checkpoint,
as modified
by the analysis
DB2 does selective redo (see Section 10.1).
VLM
makes one backward
undo, and then redo). Many
starts
information
its redo
recorded
scan from
in
the
pass. As mentioned
last
before,
pass and OLM makes three passes (analysis,
lists are maintained
during
OLM’S and VLM’S
passes. The undomodify
and redomodify
log records of OLM are used only to
modify
these lists, unlike
in the case of the CLRS written
in the other
systems.
In VLM,
the one backward
pass is used to undo uncommitted
changes on nonvolatile
storage and also to redo missing
committed
changes.
No log records are written
during
these operations.
In OLM, during
the undo
pass, for each object to be recovered,
if an operation
consistent
version
of
the object
does not
exist
on nonvolatile
storage,
of the object from the snapshot
log record
version of the object, (1) in the remainder
updates
that
precede
the snapshot
then
it restores
a snapshot
so that, starting
from a consistent
of the undo pass any to-be-undone
log record
can be undone
in the redo pass any committed
or in-doubt
updates (modify
follow
the snapshot
record can be redone
logically.
This
logically,
and (2)
records only) that
is similar
to the
shadowing
performed
in [16, 781
the database-wide
checkpointing
the use of a single log instead of
IMS first reloads MSDBS from
using a separate
log—the
difference
is that
is replaced by object-level
checkpointing
and
two logs.
the file that received their contents
during
the
latest
before
that
were
successful
included
buffers
as before.
number
of buffers
checkpoint
in the checkpoint
This
cannot
means
that,
be altered.
the
records
during
Then,
failure.
The
dirty
DEDB
buffers
are also reloaded
into
the
a failure,
restart
it makes
after
just
the same
one forward
the
pass
over the log (see Figure
6). During
that pass, it accumulates
log records in
memory
on a per-transaction
basis and redoes, if necessary,
completed
transactions’
FP updates.
Multiple
processes
are used in parallel
to redo the
DEDB updates.
As far as FP is concerned,
only the updates
starting
from
the last checkpoint
before the failure
are of interest.
At the end of that one
pass, in-progress
transactions’
FF updates are undone (using the log records
in memory),
in parallel,
using
one process per transaction.
If the space
allocated
in memory
for a transaction’s
log records is not enough,
then a
backward
scan of the log will be performed
to fetch the needed records during
that transaction’s
rollback.
In the XRF context,
when a hot-standby
IMS
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
150
C. Mohan
.
et al.
takes over, the handling
of the loser transactions
Tandem
does it. That
is, rollbacks
are performed
transaction
processing.
Page forces
the
end
during
of restart.
restart.
OLM,
Information
VLM
on
and
is similar
in parallel
DB2
Encompass
force
and
all
to
the
with
dirty
NonStop
way
new
pages
SQL
at
is
not
available.
Restart
checkpoints.
the end of restart
not available.
Restrictions
record
have
IMS,
DB2,
recovery.
on data.
a unique
OLM
and VLM
Information
Encompass
key.
This
take
on Encompass
and
unique
NonStop
key
a checkpoint
only
at
and NonStop
SQL
is
SQL
require
that
is used to guarantee
every
that
if an
attempt
is made to undo a logged action which
was never applied
to the
nonvolatile
storage
version
of the data, then
the latter
is realized
and
the undo
fails.
In other
words,
idempotence
of operations
is achieved
using
the unique
key.
IMS
in effect
hence does not allow records
results
in the fragmentation
imposes
that
some additional
an object’s
does byte-range
locking
and logging
and
to be moved around freely within
a page. This
and the less efficient
usage of free space. IMS
constraints
representation
with
respect
be divided
into
to FP data.
fixed
VLM
length
requires
(less
than
one
page sized), unrelocatable
quanta.
The consequences
of these restrictions
are
similar
to those for IMS.
[2, 26, 56] do not discuss recovery
from system failures,
while the theory of
[33] does not include
semantically
logging).
In other sections of this
rich
paper,
with
that
12.
some of the other
ATTRIBUTES
ARIES
makes
approaches
modes of locking
(i.e., operation
we have pointed
out the problems
have
been proposed
in the literature.
OF ARIES
few assumptions
about
the
data
or its model
and has several
advantages
over other recovery
methods.
While ARIES is simple, it possesses
several interesting
and useful
properties.
Each of most of these properties
has been demonstrated
in one or more existing
or proposed
systems,
as
summarized
in the last section.
However,
we
proposed or real, which has all of these properties.
ARIES are:
(1) Support for finer
larities
of locking.
a uniform
locking
fashion.
is. Depending
than page-level
ARIES
expected
control
page-level
is not
Recovery
on the
concurrency
supports
know
of no single
system,
Some of these properties
of
and
affected
by
contention
what
(2) Flexible
buffer management
long as the write-ahead
logging
schemes
the
for the data,
ate level of locking
can be chosen. It also allows
locking
(e.g., record, table, and tablespace-level)
tablespace).
Concurrency
control
schemes of [2]) can also be used.
and multiple
record-level
other
granu-
locking
of
the appropri-
multiple
granularities
of
for the same object (e. g.,
than
locking
(e.g.,
during
restart and normal
processing.
protocol
is followed,
the buffer manager
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992
in
granularity
the
As
is
ARIES: A Transaction Recovery Method
free to use any page
incomplete
transactions
transactions
commit
replacement
policy.
In particular,
dirty
pages of
can be written
to nonvolatile
storage before those
(steal
dirtied
by a transaction
transaction
is allowed
lead
to
reduced
151
.
policy).
Also,
be written
to commit
demands
for
it is not
required
that
back to nonvolatile
storage
(i.e., no-force policy).
These
buffer
storage
and
fewer
all
pages
before the
properties
1/0s
involving
frequently
updated
(hot-spot)
pages. ARIES does not preclude
the possibilities of using deferred-updating
and force-at-commit
policies and benefiting
from them. ARIES is quite flexible
in these respects.
(3) Minimal
(excluding
required
space overhead–only
log) space overhead
No
The LSN
constraints
on
actions.
There
logged
unique
keys,
around
within
ensured
etc,
should
Actions
taken
exact inverses
are being
original
recorded
to guarantee
are
LSN
on each
during
actions
the
and
the undo
of an update
taken
undos,
what
former.
undo
of
respect
to
can be moved
Idempotence
is used
or
with
Data
performed
value.
of operations
to determine
whether
is
an
or not.
of the actions
during
page
data
length.
collection.
action
of redo
on the
can be of variable
The permanent
to the storage
increasing
idempotence
no restrictions
be redone
written
in
data
Records
the
of the last logged
of a page is a monotonically
a page for garbage
since
operation
(5)
LSN
per page.
scheme is limited
on each page to store the LSN
on the page.
(4)
one
of this
during
any differences
actually
An
had
example
need not necessarily
the original
update.
between
to be done
of when
the
be the
Since
the inverses
during
inverse
CLRS
of the
undo
can
be
might
not
be
correct is the one that relates
to the free space information
10% free, 20% free) about data pages that are maintained
(like at least
in space map
pages.
while
Because
of finer
than
page-level
granularity
locking,
no free
space information
change takes place during the initial
update of a page by
a transaction,
a free space information
change might occur during the undo
(from 20% free to 10% free) of that original
change because of intervening
update activities
of other transactions
(see Section 10.3).
Other
benefits
of this attribute
in the context
of hash-based
storage
methods
and index
management
(6) Support for operation
to a page can be logged
redo information
can be found
in [59, 621.
logging and novel lock modes.
in a logical fashion.
The undo
for the entire
object
The changes made
information
and the
need not be logged.
It suffices
if the
changed fields alone are logged. Since history
is repeated,
for increment
or
decrement
kinds of operations
before- and after-images
of the field are not
needed.
Information
about the type of operation
and the decrement
or
increment
amount
is enough.
Garbage
collection
actions
and changes to
some fields (e.g., amount
of free space) of that page need not be logged.
Novel lock modes based on commutativity
and other properties
of operations can be supported
[2, 26, 881.
(7) Even redo-only
and undo-only
(single
call
to the
be efficient
undo and redo information
about
records are accommodated.
log component)
sometimes
an update
While
it may
to include
the
in the same log record,
at other
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
152
C. Mohan et al.
.
times it may be efficient
(from the original
data, the undo record
constructed
and, after the update is performed
in-place in the data
from
the
sary
(because
updated
different
tions,
data,
of log
records.
the
undo
the
record
ARIES
record
redo
size
can
must
record
can
be
restrictions)
handle
both
be logged
before
the
for partial
and total transaction
to be rolled back totally,
ARIES
savepoints
and
the
partial
rollback
system)
will
Under
redo
itself
for every
page
does not treat
affected
by that
multipage
objects
total
to
namically
and
permanently
to
the
two
condi-
update,
such
ARIES
savepoints.
recoverable
information
rollbacks
in any special
(10) Allows files to be acquired
or returned,
system.
ARIES
provides
the flexibility
these
record.
(9) Support for objects spanning
multiple
pages.
Objects
pages (e.g., an IMS “record”
which consists of multiple
scattered
over many pages). When an object is modified,
written
necesin
rollback.
Besides
allowing
allows the establishment
of
even logically
cached catalog
require
and/or
information
of transactions
Without
the support for partial
rollbacks,
(e.g., unique
key violation,
out-of-date
distributed
database
wasted work.
the
situations.
Support
transactions
(8)
constructed)
to log
can be
record,
and
errors
in a
result
in
can span multiple
segments
may be
if log records are
works
fine.
ARIES
way.
any time, from or to the operating
of being able to return
files dy-
operating
system
(see
[19]
for
the
detailed
description
of a technique
to accomplish
this). Such an action is
considered
to be one that cannot be undone.
It does not prevent
the same
file from being
reallocated
to the database
system.
Mappings
between
objects (table spaces,
as in System R.
(11)
Some actions
etc.) and files
of a transaction
a whole is rolled back.
This
a dummy
CLR to implement
given
as an example
situation
are not required
maybe
to be defined
committed
statically
even if the transaction
as
refers to the technique
of using the concept of
nested top actions.
File extension
has been
which
could benefit
from
tions of this technique,
in the context
of hash-based
index management,
can be found in [59, 621.
this.
Other
storage
applica-
methods
and
(12) Efficient
checkpoints
(including
during
restart recovery).
By supporting
fuzzy checkpointing,
ARIES makes taking
a checkpoint
an efficient
operation. Checkpoints
can be taken
even when update
activities
and logging
are
going
on concurrently.
processing
will help reduce
The dirty .pages
information
the number
redo pass.
of pages
which
Permitting
the
impact
written
are read
checkpoints
even
during
restart
of failures
during
restart
recovery.
during
checkpointing
helps reduce
from
nonvolatile
storage
during
the
(13) Simultaneous
processing
of multiple
transactions
in forward
processing
and /or in rollback
accessing same page.
Since many transactions
could
simultaneously
be going forward
or rolling
back on a given page, the level
of concurrent
access supported
could be quite high.
Except for the short
duration
latching
which
has to be performed
any time
a page is being
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992.
ARIES: A Transaction Recovery Method
physically
rollback,
modified
or examined,
rolling
back transactions
.
153
be it during
forward
processing
or during
do not affect one another
in any unusual
fashion.
(14) No locking or deadlocks
during
transaction
rollback.
is required
during
transaction
rollback,
no deadlocks
will
Since no locking
involve
transac-
tions that are rolling
back. Avoiding
locking
during
rollbacks
simplifies
not only the rollback
logic,
but also the deadlock
detector
logic.
The
deadlock
detector
need not worry about making
the mistake
of choosing
a
rolling
back transaction
as a victim
in the event of a deadlock (cf. System R
and R* [31, 49, 64]).
(15)
Bounded
logging
rollbacks.
Even
CLRS written
The number
time
during
restart
if repeated
is unaffected.
of log records
of transaction
in spite of repeated
failures
occur
during
failures
restart,
or of nested
the
number
of
This is also true if partial
rollbacks
are nested.
written
will be the same as that written
at the
rollback
during
normal
processing.
The latter
again
is
a fixed number
and is, usually,
equal to the number
of undoable
records
written
during
the forward
processing
of the transaction.
No log records
are written
during
the redo pass of restart.
(16)
Permits
faster
exploitation
restart.
of parallelism
Restart
and
can be made
selective/deferred
faster
by not doing
processing
all the needed
for
1/0s
synchronously
ARIES permits
one at a time while processing
the corresponding
log record.
the early identification
of the pages needing
recovery
and
the
of asynchronous
initiation
pages.
The
memory
pages
during
parallel
can be processed
the
redo
pass.
dling
of a given
transaction
processing
can be postponed
Undo
Fuzzy
image
copying
outside
the
reading
as they
parallelism
transactions
(archive
requires
the transaction
data
the
forward
traversal
of those
for
media
into
complete
hanrestart
offline
in parallel
recovery.
Media
are supported
very efficiently.
To
actual act of copying
can even be
system
(i.e.,
without
going
buffer
pool).
This can happen
even while
the latter
modifying
the information
being copied. During
media
(18) Continuation
repeats history
in
are brought
can be performed
dumping)
recovery
and image copying
of the
take advantage
of device geometry,
performed
for
by a single
process.
Some of the
to speed up restart
or to accommodate
devices. If desired, undo of loser
with new transaction
processing.
(17)
1/0s
concurrently
through
is accessing
recovery
only
the
and
one
of the log is made.
of loser transactions
after
and supports
the savepoint
a system
concept,
restart.
Since ARIES
we could, in the undo
pass, instead
of totally
rolling
back the loser transactions,
roll back each
loser only to its latest savepoint.
Locks must be acquired
to protect
the
transaction’s
uncommitted,
not undone
updates.
Later,
we could resume
the transaction
by invoking
its application
at a special entry
point
and
passing enough
be resumed.
(19)
Only
information
one backward
about
traversal
the savepoint
of log during
from
restart
which
execution
or media
is to
recovery.
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
154
C. Mohan
.
Both
during
media
Need
only
compensation
recovery
and restart
This is especially
important
in a slow medium
like tape.
redo
information
records
information.
are
in
never
So, on the average,
Support
for distributed
transactions.
Whether
does not affect
(22)
Early
compensation
undone
the
site
ARIES
of locks
rollback,
during
redo
during
the forward
distributed
or a subordinate
site
during
transaction
rollback
when
the transaction’s
very
and
deadlock
resolu-
never
undoes
CLRS and
more than once, during
a
first
update
to a particular
for it, the system can release the lock
to consider resolving
deadlocks
using
be noted that ARIES does not prevent
the shadow page technique
used for selected portions
of the data to avoid logging
of only undo
information
or both
dealing
with
long
Database
Manager.
undo
and
redo
information.
This
may
be useful
for
fields,
as is the case in the 0S/2
Extended
Edition
In such instances,
for such data, the modified
pages
have to be forced
to nonvolatile
storage
before
commit.
media recovery and partial
rollbacks
can be supported
logged and for which updates shadowing
is done.
13.
Since
only
accommodates
is a coordinator
object is undone and a CLR is written
on that object. This makes it possible
partial
rollbacks.
would
records.
to contain
ARIES.
release
It should
from being
traversal
of
of the log is
of log space consumed
tion using partial
rollbacks.
Because
ARIES
because it never undoes a particular
non-CLR
(partial)
log
need
space consumed
transactions.
a given
they
the amount
a transaction
rollback
will be half
processing
of that transaction.
(21)
one backward
if any portion
recovery
the log is sufficient.
likely to be stored
(20)
et al
will
Whether
depend
or not
on what
is
SUMMARY
In this
paper,
some of the
we presented
recovery
the
paradigms
ARIES
of System
recovery
method
and
R are inappropriate
showed
why
in the
WAL
context.
We dealt with
a variety
of features
that
are very important
in
building
and operating
an industrial-strength
transaction
processing
system.
Several
issues regarding
operation
logging,
fine-granularity
locking,
space
management,
and flexible
recovery
were discussed.
In brief, ARIES
accomplishes the goals that we set out with by logging
all updates
on a per-page
basis, using an LSN on every page for tracking
page state, repeating
history
during
restart
recovery
before undoing
the loser transactions,
and chaining
the CLRS to the predecessors
of the log records that they compensated.
Use of
ARIES
is not
restricted
to the
database
area
alone.
implementing
persistent
object-oriented
languages,
and transaction-based
operating
systems.
In fact,
QuickSilver
distributed
operating
system [401 and
aid the backing
up of workstation
In this section, we summarize
to which
specific
ACM Transactions
attributes
that
It can also be used
recoverable
it is being
in a system
data on a host [441.
as to which specific features
give
us flexibility
of ARIES
and efficiency.
on Database Systems, Vol. 17, No. 1, March 1992
for
file systems
used in the
designed
to
lead
ARIES: A Transaction Recovery Method
Repeating
history
CLRS during
chained
undos,
using
(1) Record
within
records
logged.
exactly,
which
permits
the following,
the UndoNxtLSN
in turn
field
implies
using
irrespective
155
.
LSNS
and writing
of whether
CLRS
are
or not:
level locking
to be supported
and records to be moved around
a page to avoid
storage
fragmentation
without
the moved
having
to be locked and without
the movements
having
to be
(2) Use only
one state
variable,
a log sequence
number,
per page.
(3) Reuse of storage released by one transaction
for the same transaction’s
later actions or for other transactions’
actions once the former
commits,
thereby
leading
to the
efficient
usage
of storage.
preservation
of clustering
of records
(4) The inverse of an action origianlly
performed
during
forward
of a transaction
to be different
from the action(s)
performed
undo
That
of that original
is, logical undo
undo
on the
(6) Recovery
of each page independently
relating
to transaction
state, especially
(7) If necessary,
the continuation
the time of system failure.
or deferred
transaction
(9) Partial
same
processing
during
the
concurrently
with
of other pages or of log
during
media recovery.
records
of transactions
restart,
processing
rollback
the
action (e. g., class changes in the space map pages).
with recovery
independence
is made possible.
(5) Multiple
transactions
may
transactions
going forward.
(8) Selective
and
and undo
to improve
data
page
which
of losers
were
in progress
concurrently
with
at
new
availability.
of transactions.
(10) Operation
logging
and logical
logging
of changes within
a page. For
example,
decrement
and increment
operations
may be logged,
rather
than the before- and after-images
of modified
data.
Chaining,
using the UndoNxtLSN
field,
forward
processing
permits
the following,
history
CLRS to log records written
during
provided
the protocol
of repeating
is also followed:
(1) The avoidance
CLRS.
This
of undoing
also makes
CLRS’
actions,
it unnecessary
thus
avoiding
to store undo
(2) The avoidance
of the undo of the same log record
processing
more than once.
(3) As a transaction
is being
rolled
back,
the ability
writing
information
written
to release
by partially
(4) Handling
partial
log, as in System
(5) Making
permanent,
rolling
rollbacks
R.
if
back
without
necessary
for
in CLRS.
during
object when all the updates to that object had been undone.
important
while
rolling
back a long transaction
or while
deadlock
CLRS
forward
the lock on an
This may
resolving
be
a
the victim.
any special
via
nested
actions
top
like
actions,
patching
some
the
of the
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
156
C. Mohan
.
et al.
changes made by a transaction,
irrespective
itself subsequently
rolls back or commits.
Performing
the analysis
(1) Checkpoints
pass before
to be taken
any
of whether
repeating
time
history
during
the
the
permits
redo
and
transaction
the following:
undo
passes
of
recovery.
(2) Files to be returned
ing dynamic
binding
(3) Recovery
of file-related
user data,
without
(4) Identifying
1/0s
to the operating
system dynamically,
between
database
objects and files.
pages
could
information
requiring
possibly
be initiated
concurrently
special
requiring
for them
treatment
redo,
with
volatile
storage
when
of
asynchronous
parallel
pages by eliminating
e.g., that some empty
been freed.
(6) Exploiting
opportunities
to avoid
writing
end. write
records
after
table
recovery
the redo pass starts.
(5) Exploiting
opportunities
to avoid redos on some
those pages from the dirty .pages table on noticing,
pages have
the
allow-
for the former.
so that
even before
thereby
and
by
the end. write
(7) Identifying
the transactions
locks could be reacquired
reading
some pages during
redo, e.g., by
dirt y pages have been written
to non-
eliminating
records
those
pages
from
the
dirty
.pages
are encountered.
in the in-doubt
and in-progress
states so that
for them
during
the redo pass to support
selective
or deferred
restart,
the continuation
of loser transactions
after
restart,
and undo of loser transactions
in parallel
with new transaction
processing.
13.1
Implementations
ARIES
forms
and Extensions
the basis
of the recovery
algorithms
used in the IBM
Research
prototype
systems Starburst
[871 and QuickSilver
[401, in the University
of
Wisconsin’s
EXODUS
and Gamma
database
machine
[201, and in the IBM
program
products
0S/2
Extended
Edition
Database
Manager
[71 and Workstation
history,
Data Save Facility/VM
has been implemented
[441. One feature
of ARIES, namely
repeating
in DB2 Version
2 Release 1 to use the concept
of nested top action
for supporting
segmented
tablespaces.
A simulation
study of the performance
of ARIES
is reported
in [981. The following
conclu“Simulation
results
indicate
the
sions from that
study are worth
noting:
success of the ARIES
recovery
method
in providing
fast recovery
from
failures,
caused by long intercheckpoint
intervals,
efficient
use of page LSNS,
log LSNS, and RecLSNs avoids redoing updates unnecessarily,
and the actual
recovery
load
is reduced
skillfully.
concurrency
control
and recovery
indicated
by the negligibly
small
Besides,
algorithms
difference
the
overhead
incurred
by
the
on transactions
is very low, as
between
the mean transaction
response time and the average
duration
of a transaction
if it ran alone in a
never failing
system.
This observation
also emerges
as evidence
that the
recovery method goes well with concurrency
control through
fine-granularity
locking,
an important
virtue. ”
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992
ARIES: A Transaction Recovery Method
We have
extended
transaction
methods,
model
called
ARIES
(see [70,
ARIES
to make
85]).
/KVL,
it work
in the
Based
on ARIES,
ARIES/IM
and
context
we have
ARIES
157
.
of the
nested
developed
/LHS,
to
new
efficiently
provide
high concurrency
and recovery
for B ‘-tree
indexes
[57, 62] and for
hash-based
storage structures
[59]. We have also extended
ARIES to restrict
the amount
of repeating
of history
that
takes
place
for the loser
transactions
[691. We have designed
concurrency
control
and recovery
algorithms,
on ARIES,
for the N-way data sharing
(i. e., shared disks) environment
66,67,
68]. Commit.LSN,
that
exists
reevaluation
in
[54,
a method
which
takes
in every
page to reduce
the
overheads,
and also to improve
58,
processing,
60].
Although
we did not
messages
discuss
are
message
advantage
based
[65,
of the page.LSN
locking,
latching
and predicate
concurrency,
has been presented
an
important
logging
part
and recovery
of transaction
in this
paper.
ACKNOWLEDGMENTS
We have benefited
immensely
from the work that was
System R project and in the DB2 and IMS product
groups.
valuable
lessons by looking
at the experiences
with those
the source code and internal
documents
of those systems
The
Starburst
project
gave
us the
opportunity
the
contributions
also like to thank
have adopted
our
Brian
and Irv
Oki,
Erhard
Traiger
of the
designers
to begin
of the
other
in the
learned
systems. Access to
was very helpful.
from
design some of the fundamental
algorithms
of a transaction
into account experiences
with the prior systems. We would
edge
performed
We have
scratch
and
system, taking
like to acknowl-
systems.
We
would
our colleagues
in the research
and product
groups that
research
results.
Our thanks
also go to Klaus
Kuespert,
Rahm,
for their
Andreas
detailed
Reuter,
comments
Pat
Selinger,
Dennis
Shasha,
on the paper.
REFERENCES
1. BAKER, J., CRUS, R., AND HADERLE, D. Method for assuring atomicity of multi-row
update
operations in a database system. U.S. Patent 4,498,145, IBM, Feb. 19S5.
2. BADRINATH, B. R., AND RAMAMRITHAM, K.
Semantics-based
concurrency control: Beyond
3rd IEEE
International
Conference
on Data Engineering
commutativity.
In Proceedings
(Feb. 1987).
Concurrency
Control
and Recovery
in
3. BERNSTEIN, P., HADZILACOS, V., AND GOODMAN, N.
Database
Systems. Addison-Wesley,
Reading, Mass., 1987.
4. BORR, A. Robustness to crash in a distributed
database: A non-shared-memory
multi10th International
Conference
on Very Large Data Bases
processor approach. In Proceedings
(Singapore, Aug. 1984).
5. CHAMBERLAIN,D., GILBERT, A., AND YOST, R. A history of System R and SQL)Data System.
7th International
Conference
on Very Large
Data Bases (Cannes, Sept.
In Proceedings
1981).
ACM Trans.
6. CHANG, A., AND MERGEN, M. 801 storage: Architecture
and programming.
Comput. Syst., 6, 1 (Feb. 1988), 28-50.
7. CHANG, P. Y., AND MYRE, W. W.
0S/2 EE database manager: Overview and technical
ZBM Syst. J. 27, 2 (198S).
highlights.
schemes
8. COPELAND, G., KHOSHAFIAN, S., SMITH, M., AND VALDURIEZ, P. Buffering
International
Conference
on Data
Engineering
for permanent
data. In Proceedings
(Los Angeles, Feb. 1986).
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
158
.
C. Mohan
et al.
9. CLARK, B. E., AND CORRTGAN,M. J.
Application
System/400
performance
characteristics.
IBM S@. J. 28, 3 (1989).
10. CHENG, J., LOOSELY, C., SHIBAMIYA, A., AND WORTHINGTON, P. IBM Database 2 perforIBM Sy.st. J. 23, 2 (1984).
mance: Design, implementation,
and tuning.
11. CRUS, R , HADERLE, D., AND HERRON, H. Method for managing
lock escalation
in a
multiprocessing,
multiprogramming
environment.
U.S. Patent 4,716,528, IBM, Dec. 1987.
IBM
Tech. Disclosure
12. CRUS, R., MALKEMUS, T., AND PUTZOLU, G. R. Index mini-pages
Bull. 26, 4 (April 1983), 5460-5463.
13. CRUS, R., PUTZOLU, F., AND MORTENSON, J. A
Incremental
data base log image copy IBM
!l’ec~. Disclosure Bull. 25, 7B (Dec. 1982), 3730-3732.
Bull. 25, 7B
14. CRUS, R., AND PUTZOLU, F. Data base allocation table. IBM Tech. Disclosure
(Dec. 1982), 3722-2724.
15. CRUS, R. Data recovery in IBM Database2.
IBM Syst. J. 23,2(1984).
Informix-Turbo,
In Proceedings LZEECornpcon
Sprmg’88(Feb.
-March l988),
16. CURTIS, R.
operating
17. DASGUPTA, P., LEBLANC, R., JR., AND APPELBE, W. The Clouds distributed
8th International
Conference
on Distributed
Computing
Systems
system. In Proceedings
(San Jose, Calif., June 1988).
AGuideto
INGRES. Addison-Wesley,
Reading, Mass., l987.
18. DATE, C.
data sets. IBM Tech. Disclosure
19. DEY, R., SHAN, M., AND TRAIGER, 1. Method fordropping
Bull. 25, 11A (April 1983), 5453-5455.
AND
20. DEWITT, D., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H.-I.,
Data Eng.
RASMUSSEN,R. The Gamma database machine project. IEEE Trans. Knowledge
2, 1 (March 1990).
21. DELORME, D., HOLM, M., LEE, W., PASSE, P., RICARD, G., TIMMS, G., JR., AND YOUNGREN, L.
Database index journaling
for enhanced recovery. U.S. Patent 4,819,156, IBM, April 1989
The treatment
of
22. DIXON, G. N., BARRINGTON, G. D., SHRIVASTAVA, S., AND WHEATER, S. M.
persistent objects in Arjuna. Comput. J. 32, 4 (1989).
management.
Ph.D. dissertation,
Tech. Rep. CMU-CS-88-192,
23. DUCHAMP, D. Transaction
Carnegie-Mellon
Univ., Dec. 1988,
ACM
of database buffer management,
24. EFFEUSBERG, W., AND HAERDER, T. Principles
Trans. Database
Syst. 9, 4 (Dec. 1984).
25. ELHARDT, K , AND BAYER, R. A database cache for high performance and fast restart in
database systems. ACM Tram Database Syst. 9, 4 (Dec. 1984).
locking for
26. FEKETE, A., LYNCH, N., MERRITT, M., AND WEIHL, W. Commutativity-based
nested transactions.
Tech. Rep. MIT/LCS/TM-370.b,
MIT, July 1989,
Data base integrity
as provided for by a particular
data base management
27. FOSSUM, B
J. W. Klimbie and K. L. Koffeman, Eds., North-Holland,
system. In Data Base Management,
Amsterdam,
1974.
of concurrency control in IMS/VS
Fast Path.
28. GAWLICK, D., AND KINKADE, D. Varieties
IEEE Database
Eng. 8, 2 (June 1985).
management
in an object-oriented
database system.
29. GARZA, J., AND KIM, W. Transaction
ACM-SIGMOD
International
Conference
on Management
of Data (Chicago,
In Proceedings
June 1988).
CHAOS’%
Support for real-time
atomic transactions.
In
30. GHEITH, A., AND SCHWAN, K.
Proceedings
19th International
Symposium
on Fault-Tolerant
Computing
(Chicago, June
1989).
31. GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND
ACM
Comput.
TRAIGER, I. The recovery manager of the System R database manager.
Suru. 13, 2 (June 1981).
Systems–An
Aduanced
systems. In Operating
32. GRAY, J. Notes on data base operating
Course, R. Bayer, R. Graham, and G. Seegmuller,
Eds., LNCS Vol. 60, Springer-Verlag,
New York, 1978.
m database systems. J. ACM 35, 1 (Jan. 1988),
33. HADZILACOS, V, A theory of reliability
121-145.
S.yst. 13, 2 (1988),
hot spot data in DB-sharing
systems. Inf
34. HAERDER, T. Handling
155-166.
ACM Transactions
on Database Systems, Vol. 17, No. 1, March 1992
ARIES: A Transaction Recovery Method
.
159
35. HADERLE, D., AND JACKSON, R.
IBM Database 2 overview. IBM Syst. J. 23, 2 (1984).
Principles
of transaction
oriented database recovery–A
taxonomy. ACM CornPUt. Sure. 15, 4 (Dec. 1983).
37. HELLAND, P. The TMF application programming
interface: Program to program communication, transactions,
and concurrency in the Tandem NonStop system. Tandem Tech. Rep.
TR89.3, Tandem Computers, Feb. 1989.
36. HAERDER, T., AND REUTER, A.
38. HERLIHY, M.,
Proceedings
AND WEIHL, W.
7th
ACM
Hybrid
concurrency
SIGAC’T-SIGMOD-SIGART
Systems (Austin, Tex., March 1988).
39. HERLIHY, M., AND WING, J. M. Avalon:
17th
International
systems. In Proceedings
(Pittsburgh,
Pa., July 1987).
control
for abstract
Symposium
Language
Symposium
support
on
data
on Principles
for
reliable
Fault-Tolerant
types.
In
of Database
distributed
Computing
40. HASKIN, R., MALACHI, Y., SAWDON, W., AND CHAN, G. Recovery management
in QuickSilver. ACM !/’runs. Comput. Syst. 6, 1 (Feb. 1988), 82-108.
Dec. GG24-1652, IBM, April 1984.
41. IMS/ VS Version 1 Release 3 Recovery/Restart.
Programming.
Dec. SC26-4178, IBM, March 1986.
42. IMS/ VS Version 2 Application
43. IMS/ VS Extended
April 1987.
Recovery
44. IBM Workstation Data
1990.
Facility
Save Facility
(XRF):
/ VM:
Technical
General
Reference.
Information.
Dec. GG24-3153,
IBM,
Dec. GH24-5232,
IBM,
45. KORTH, H. Locking primitives
in a database system. JACM 30, 1 (Jan. 1983), 55-79.
46. LUM, V., DADAM, P., ERBE, R., GUENAUER, J., PISTOR, P., WALCH, G., WERNER, H., AND
WOODFILL, J. Design of an integrated
DBMS to support advanced applications.
In Proceedings International
Conference
on Foundations
of Data Organization
(Kyoto, May 1985).
47. LEVINE, F., AND MOHAN, C. Method for concurrent record access, insertion,
deletion and
alteration using an index tree. U.S. Patent 4,914,569, IBM, April 1990.
Isolation
Locking.
Dec. GG66-3193, IBM Dallas Systems
48. LEWIS, R. Z. ZMS Program
Center, Dec. 1990.
49. LINDSAY, B., HAAS, L., MOHAN, C., WILMS, P., AND YOST, R. Computation
and communication in R*: A distributed
database manager. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984).
9th ACM Symposium
on Operating
Systems Principles
(Bretton Woods,
Also in Proceedings
Oct. 1983). Also available as IBM Res. Rep. RJ3740, San Jose, Calif., Jan. 1983.
50. LINDSAY, B., MOHAN, C., AND PIRAHESH, H. Method for reserving space needed for “rollBull. 29, 6 (Nov. 1986).
back” actions. IBM Tech. Disclosure
51. LISKOV, B.,
AND SCHEIFLER, R. Guardians
and actions: Linguistic
support for robust,
distributed
programs. ACM Trans. Program. Lang. Syst. 5, 3 (July 1983).
52. LINDSAY, B., SELINGER, P., GALTIERL C., GRAY, J., LORIE, R., PUTZOLU, F., TRAIGER, I., AND
WADE, B. Notes on distributed
databases. IBM Res. Rep. RJ2571, San Jose, Calif., July
1979.
53. MCGEE, W. C. The information
management
syste]m IMS/VS—Part
II: Data base faciliIBM Syst. J. 16, 2 (1977).
ties; Part V: Transaction processing facilities.
54. MOHAN, C., HADERLE, D., WANG, Y., AND CHENG, J.
Single table access using multiple
indexes: Optimization,
execution,
and concurrency
control techniques.
In Proceedings
International
Conference
on Extending
Data Base Technology
(Venice, March 1990). An
expanded version of this paper is available
as IBM Res. Rep. RJ7341, IBM Almaden
Research Center, March 1990.
55. MOHAN, C., FUSSELL, D., AND SILBERSCHATZ, A. Compatibility
and commutativity
of lock
modes. Znf Control 61, 1 (April 1984). Also available as IBM Res. Rep. RJ3948, San Jose,
Calif., July 1983.
56. MOSS, E., GRIFFETH, N., AND GRAHAM, M. Abstraction
in recovery management.
In
Proceedings
ACM SIGMOD
International
Conference
on Management
of Data (Washington,
D. C., May 1986).
57. MOHAN, C. ARIES /KVL: A key-value locking method for concurrency control of multiac16th International
Conference
tion transactions operating on B-tree indexes. In Proceedings
on Very Large Data Bases (Brisbane, Aug. 1990). Another version of this paper is available
as IBM Res. Rep. RJ7008, IBM Almaden Research Center, Sept. 1989.
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
160
.
C. Mohan et al
58. MOHAN, C.
Commit -LSN: A novel and simple method for reducing locking and latching in
16th International
Conference
on Very Large
processing systems In Proceedings
Data l?ases (Brisbane, Aug. 1990). Also available as IBM Res. Rep. RJ7344, IBM Almaden
Research Center, Feb. 1990.
59 MOHAN, C. ARIES/LHS:
A concurrency control and recovery method using write-ahead
logging for linear hashing with separators. IBM Res. Rep., IBM Almaden Research Center,
Nov. 1990.
60. MOHAN, C. A cost-effective method for providing improved data avadability
during DBMS
of the 4th International
Workshop
on HLgh
restart recovery after a failure
In Proceedings
Performance
Transachon
Systems (Asilomar,
Calif., Sept. 1991). Also available as IBM Res.
Rep. RJ81 14, IBM Almaden Research Center, April 1991.
transaction
61. Moss, E., LEBAN, B., AND CHRYSANTHIS, P. Fine grained concurrency for the database
3rd IEEE International
Conference
on Data Engineering
(Los Angeles,
cache. In Proceedings
Feb. 1987),
62. MOHAN, C., AND LEVINE, F. ARIES/IM:
An efficient and high concurrency index management method using write-ahead
logging. IBM Res. Rep. RJ6846, IBM Almaden Research
Center, Aug. 1989.
63. MOHAN, C., AND LINDSAY, B. Efficient commit protocols for the tree of processes model of
2nd ACM SIGACT/
SIGOPS
Sympos~um
on Pridistributed
transactions.
In Proceedings
nciples of Distributed
Computing
(Montreal,
Aug. 1983). Also available
as IBM Res. Rep.
RJ3881, IBM San Jose Research Laboratory,
June 1983.
64. MOHAN, C., LINDSAY, B., AND OBERMARCK, R. Transaction
management
in the R* dktributed database management
system. ACM Trans. Database Syst. 11, 4 (Dec. 1986).
65. MOHAN, C., ANn NARANG, I. Recovery and coherency-control
protocols for fast intersystem
page transfer and tine-granularity
locking in a shared disks transaction
environment.
In
Proceedings
17th International
Conference
on Very Large
Data Bases (Barcelona,
Sept.
1991). A longer version is available
as IBM Res. Rep. RJ8017, IBM Almaden Research
Center, March 1991.
66. MOHAN, C., AND NARANG, I. Efficient
locking and caching of data in the multisystem
of the International
Conference
on
shared disks transaction
environment.
In proceedings
Extending
Database
Technology
(Vienna, Mar. 1992). Also available
as IBM Res. Rep.
RJ8301, IBM Almaden Research Center, Aug. 1991.
67. MOHAN, C., NARANG, I., AND PALMER, J. A case study of problems in migrating
to
distributed
computing: Page recovery using multiple logs in the shared disks environment.
IBM Res. Rep. RJ7343, IBM Almaden Research Center, March 1990.
68. MOHAN, C., NARANG, I., SILEN, S. Solutions
to hot spot problems in a shared disks
of the 4th International
Workshop
on High Perfortransaction environment.
In proceedings
mance Transaction
Systems (Asilomar,
Calif., Sept. 1991). Also available as IBM Res Rep.
8281, IBM Almaden Research Center, Aug. 1991.
69. MOHAN, C., AND PIRAHESH, H. ARIES-RRH:
Restricted repeating of history in the ARIES
7th International
Conference
on Data Engitransaction
recovery method. In Proceedings
neering
(Kobe, April
1991). Also available
as IBM Res. Rep. RJ7342, IBM Almaden
Research Center, Feb. 1990
70. MOHAN, C , AND ROTHERMEL, K.
Recovery protocol for nested transactions
using writeBull. 31, 4 (Sept 1988).
ahead logging. IBM Tech. Dwclosure
3rd
71. Moss, E. Checkpoint and restart in distributed
transaction
systems. In Proceedings
Symposium
on Reliability
in Dwtributed
Software
and Database
Systems
(Clearwater
Beach, Oct. 1983).
13th International
72. Moss, E
Log-based recovery for nested transactions.
In Proceedings
Conference
on Very Large Data Bases (Brighton,
Sept. 1987).
73. MOHAN, C., TIUEBER, K., AND OBERMARCK, R. Algorithms
for the management
of remote
backup databases for disaster recovery. IBM Res. Rep. RJ7885, IBM Almaden Research
Center, Nov. 1990.
74. NETT, E., KAISER, J., AND KROGER, R. Providing recoverability
in a transaction
oriented
6th International
Conference
on Distributed
distributed
operating system. In Proceedings
Computing
Systems (Cambridge,
May 1986).
ACM Transactions
on Database Systems, Vol. 17, No, 1, March 1992
ARIES: A Transaction Recovery Method
75.
NOE,
J., KAISER, J., KROGER, R., AND NETT, E.
The commit/abort
problem
GMD Tech. Rep. 267, GMD mbH, Sankt Augustin, Sept. 1987.
locking.
76. OBERMARCK, R. IMS/VS
Calif., July 1980.
77. O’NEILL, P.
(Dec. 1986).
78. ONG, K.
The
SYNAPSE
SIGMOD
Symposium
program
Escrow
isolation
transaction
approach
IBM
ACM
method.
to database
on Principles
feature.
.
161
in type-specific
Res. Rep. RJ2879,
San Jose,
Trans. Database Syst. 11, 4
recovery.
of Database
In Proceedings
3rd ACM
SIGACT(Waterloo, April 1984).
contention in a stock trading database: A
Systems
79. PEINL, P., REUTER, A., AND SAMMER, H. High
ACM SIGMOD
International
Conference
on Management
of Data
case study. In Proceedings
(Chicago, June 1988).
80. PETERSON,R. J., AND STRICKLAND, J. P. Log write-ahead protocols and IMS/VS logging. In
Proceedings
(Atlanta,
2nd
ACM SIGACT-SIGMOD
Ga., March
Symposium on Principles of Database Systems
1983).
81. RENGARAJAN, T. K., SPIRO, P., AND WRIGHT, W.
DBMS software. Digital Tech. J. 8 (Feb. 1989).
82. REUTER, A.
Softw.
Eng.
A fast transaction-oriented
4 (July 1980).
scheme for UNDO
mechanisms
recovery.
of VAX
IEEE Trans.
SE-6,
83. REUTER, A.
SIGMOD
logging
“High availability
Concurrency
Symposium
on high-traffic
on Principles
84. REUTER, A. Performance
(Dec. 1984), 526-559.
analysis
data elements.
of Database
Systems
of recovery techniques.
ACM
SIGACTIn Proceedings
(Los Angeles, March 1982).
ACM Trans. Database Syst. 9,4
85. ROTHERMEL, K., AND MOHAN, C. ARIES/NT:
A recovery method based on write-ahead
15th International
Conference
on Very Large
logging fornested transactions.
In Proceedings
Data Bases (Amsterdam,
Aug. 1989). Alonger
version ofthis
paper is available as IBM
Res. Rep. RJ6650, lBMAlmaden
Research Center, Jan. 1989.
86. ROWE, L., AND STONEBRAKER, M.
The commercial INGRES epilogue. Ch. 3 in The ZNGRES Papers, Stonebraker,
M., Ed., Addson-Wesley, Reading, Mass., 1986.
87. SCHWARZ, P., CHANG, W., FREYTAG, J., LOHMAN, G., MCPHERSON, J., MOHAN, C., AND
Workshop
on
PIRAHESH, H. Extensibility
in the Starburst database system. In Proceedings
Object-Oriented
Data Base Systems (Asilomar,
Sept. 1986). Also available as IBM Res. Rep.
RJ5311, San Jose, Calif., Sept. 1986.
88. SCHWARZ,P. Transactions on typed objects. Ph.D. dissertation,
Carnegie Mellon Univ., Dec. 1984.
Tech. Rep. CMU-CS-84-166,
ACM
Trans.
89. SHASHA, D., AND GOODMAN, N. Concurrent
search structure
algorithms.
Database
Syst. 13, 1 (March 1988).
90. SPECTOR, A., PAUSCH, R., AND BRUELL, G. Came Lot: A flexible, distributed
transaction
IEEE
Compcon Spring
’88 (San Francisco, Calif., March
processing system. In Proceedings
1988).
91. SPRATT, L.
ACM
The transaction
resolution journal: Extending the before journal.
1985).
92. STONEBRAKER, M. The design of the POSTGRES storage system. In Proceedings
International
Conference
on Very Large Data Bases (Brighton,
Sept. 1987).
Syst.
Oper.
Rev. 19, 3 (July
IMSj VS Version 1 Release 3 Fast Path
93. STILLWELL, J. W., AND RADER, P. M.
Dec. G320-0149-0, IBM, Sept. 1984.
94. STRICKLAND, J., UHROWCZIK, P., AND WATTS, V. IMS/VS:
An evolving system.
13th
Notebook.
IBM
Syst.
J. 21, 4 (1982).
95.
high-performance,
THE TANDEM DATABASE GROUP. NonStop
SQL: A distributed,
Science Vol. 359,
high-availability
implementation
of SQL. In Lecture Notes in Computer
D. Gawlick, M. Haynie, and A. Reuter, Eds., Springer-Verlag,
New York, 1989.
96. TENG, J., AND GUMAER, R.
IBM
Syst.
97. TRAIGER, I.
Virtual
4 (Oct. 1982), 26-48.
98. VURAL, S.
Managing
IBM
Database
2 buffers
to maximize
performance.
J. 23, 2 (1984).
memory
management
for database systems.
A simulation
study for the performance
recovery method. M. SC. thesis, Middle East Technical
ACM
Oper.
Syst.
Rev.
16,
analysis of the ARIES transaction
Univ., Ankara, Feb. 1990.
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992,
162
.
C. Mohan et al.
WATSON, C. T., AND ABERLE, G. F
System/38 machine database support. In IBM Syst,
38/ Tech. Deu., Dec. G580-0237, IBM July 1980.
100. WEIKUM, G. Principles and realization
strategies of multi-level
transaction
management.
ACM Trans. Database
Syst. 16, 1 (Mar. 1991).
101. WEINSTEIN, M., PAGE, T., JR , LNEZEY, B., AND POPEK, G. Transactions
and synchroniza10th ACM Symposium
on Operating
tion in a distributed
operating system. In Proceedings
Systems Principles
(Orcas Island, Dec. 1985).
99
Received January
1989; revised November
1990; accepted April
1991
ACM TransactIons on Database Systems, Vol. 17, No. 1, March 1992
Related documents
Download