Lightweight Recoverable Virtual Memory

Lightweight
M. Satyanarayanan,
Recoverable
Henry
H. Mashburn,
School
Virtual
Puneet
Kumar,
of Computer
Carnegie
Mellon
Recoverable virtual memory refers to
regions
C. Steere, James J. Kistler
Science
combination
found
of a virtual
David
University
This
Abstract
Memory
in
of circumstances
situations
involving
the meta-data
transactional guarantees are
repositories.
Thus RVM
offered. This paper describes RVM, an efficient, portable,
and easily used implementation of recoverable virtual
applications
from distributed
memory for Unix environments. A unique characteristic
of RVM is that it allows independent control over the
RVM
transactional properties of atomicity, permanence, and
serializability. This leads to considerable flexibility in the
use of RVM, potentially
enlarging the range of
applications that can benefit from transactions. It also
control over the basic transactional
simplifies the layering of functionality such as nesting and
It may often be tempting,
distribution. The paper shows that RVM performs well
over its intended range of usage even though it does not
use a mechanism
integrated
benefit from specialized operating system support. It also
has been that such sophistication
address space on which
demonstrates the importance
transaction optimizations.
of
intra-
end
object-oriented
inter-
this
paper,
programming
is
a
library
implemented
code and no more
runtime library
called RVM,
be, while remaining
for input-output.
is implemented
with
considerable
flexibility
in about 10K lines
intrusive
This transactional
without
specialized
is intended for Unix applications
system-level
concerns
The total size of those data structures
of disk capacity,
easily fit within
manner.
and their working
main memory.
direct
commercial
title
of the
that
copying
and/or
specific
without
fee
the copies
advantage,
publication
Machinery.
SIGOPS
B 1993
that
To copy
all or part
of
are not made
the ACM
and Its date
is by permission
this
copyright
appear,
material
or distributed
and
of the Association
otherwise,
on our design.
appropriate,
2. Lessons
and
as an
lay not in
what could
RVM.
our experience
This experience, and
requirements
were
The description
architecture,
the
a discussion
of
dominant
of RVM
follows
and implementation.
we point out ways in which
our design.
with
We conclude
usage
with
an
of its use as a building
or to republish,
notice
notice
IS
Camelot
is a transactional
that
simplify
for
and the
distributed
IS given
general-purpose
and
encourage
systems [33].
nested transactions,
for Computing
requires
from
2.1. Overview
the choice
a fee
permission.
‘93/12 /93/N. C., USA
ACM 0-89791-632-81931001
RVM
block, and a summary of related work.
Permission
copy
of usability
Our design challenge
Venari [24, 37],
of RVM,
thesis
to
concerns
of the fault-tolerance
evaluation
Camelot
provided
programming
and performance,
[10], a predecessor of RVM.
influenced
This work was sponsored by the Avionics Laboratory, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, 45433-6543 under Contract F33615-90-C-1465, ARPA Order No. 7597. James Kistler is now affiliated with the DEC Systems Research Center, Palo Alto, CA.
granted
But our experience
one can view
crippling
experience
to
or better
comes at the cost of
this paper by describing
Wherever
set size must
unavoidable,
in functionality
of functionality
engineering
and
have
represents a balance between the
in three parts: rationale,
should be a small
of atomicity,
up features to add, but in determining
influences
structures that must be updated in a fault-tolerant
persistent
applications
system.
Alternatively,
Coda [16, 30]
tools.
independent
properties
ease of use and more onerous
our understanding
with persistent data
for
allows
and sometimes
that is richer
Thus RVM
Camelot
operating
of
in how they use transactions.
constraints,
We begin
facility,
support
serializability,
be omitted without
than a typical
range
tools, and CASE
runtime
with the operating
conjuring
wide range of hardware from laptops to servers.
fraction
and
of storage
a wide
Since RVM
exercise in minimalism.
minimal
system support, and has been in use for over two years on a
RVM
permanence,
maintenance.
Our answer, as elaborated
user-level
constraints,
of mainline
provide
to be
file systems and databases, to
CAD
languages.
and the software
facility
a potent tool for fault-tolerance?
in
can also
portability,
1. Introduction
How simple can a transactional
can benefit
repositories,
programming
is most likely
facility
built
to validate
transactional
the
support
construction
reliable
It supports local and distributed
and provides considerable
of logging,
of
the
would
synchronization,
flexibility
in
and transaction
commitment
strategies.
communication
which
is
Camelot
relies
management
page
external
facilities
binary
heavily
and
of the Mach operating
compatible
with
on the
the
Figure 1 shows the overall structure
of a Camelot
Each module
Mach
task and communication
Mach’s interprocess
is implemented
between
communication
is via
facility (IPC).
recoverable
Coda
control
on servers.
as well
and internal
m
9“”
system.
The meta-data included
housekeeping.
memory
contribute
to Coda meta-data in
as persistent
Server recovery
recoverable
we
of the Camelot thesis.
memory¹
directories
Yet
experience in
and because it would
Coda file was kept in a Unix
m
of an overkill.
We placed data structures pertaining
as a
modules
be something
towards the validation
Unix
operating system [20].
node.
would
the use of transactions,
system [2],
4.3BSD
Camelot
persisted, because it would give us first-hand
interprocess
data for
file on a server’s
consisted
which
local file
of Camelot
to the last committed
by a Coda salvager
replica
The contents of each
ensured
restoring
state, followed
mutual
consistency
between meta-data and data.
. Recoverable
Processes
2.3. Experience
The most valuable lesson we learned by using Camelot was
that recoverable
I
Camelot
and practically
system
Co.pon.nt
like
.
Mach K@r.el
Coda.
virtual
memory
was indeed a convenient
useful programming
Crash recovery
structures
were restored
operations
were merely
for systems
because data
in situ by Camelot.
manipulations
The Coda salvager
structures.
abstraction
was simplified
Directory
of in-memory
was simple
range of error states it had to handle was small.
the encapsulation
.
of messy crash
This figure shows the internal structure of Camelot as well as its
relationship to application code. Camelot is composed of several
Camelot considerably
Mach tasks: Master Control, Camelot, and Node Server, as well
as the Recovery, Transaction,
and Disk Managers.
Camelot
Unfortunately,
provides recoverable virtual memory for Data Servers; that is,
transactional operations are supported on portions of the virtual
address space of each Data Server. Application
code can be split
between Data Server and Application
tasks (as in this figure), or
these benefits
problems
we encountered
scalability,
programming
maintenance.
may be entirely linked into a Data Server’s address space. The
latter approach was used in Coda. Camelot facilities are accessed
via a library linked with application code.
simplified
consequences
Overall,
details
into
Coda server code.
came at a high price.
manifested
themselves
constraints,
In spite of considerable
able to circumvent
these problems.
and
effort,
The
as poor
difficulty
of
we were not
Since they were direct
of the design of Camelot,
these problems in the following
Figure 1: Structure of a Camelot Node
recoverv.
data
because the
we elaborate
on
paragraphs.
A key design goal of Coda was to preserve the scalability
2.2. Usage
Our interest
in Camelot
phase optimistic
replication
System.
Although
distributed
commit,
atomicity
arose in the context
the
meta-data
protocol
does
not
require
strategy for us would
have
an ad hoc fault tolerance mechanism for
using some form
of shadowing.
that we found
support
virtual
for recoverable
enables regions
address
be
endowed
of atomicity,
isolation
properties
space
we did not find
distributed
to
we
realized
responsible
is its
These experiments
contributor
server CPU
controlled
to loss of
utilization,
operation
also showed
scalability
was
and that Camelot
for over a third of this increase.
of Coda servers in operation
experiments
was
Examination
showed considerable
paging
overheads due to the fact that each
involved
interactions
between many of
processes shown in Figure 1. There was no
obvious way to reduce this overhead, since it was inherent
in the implementation
structure of Camelot.
virtual
transactional
and permanence.
a need for features
transactions,
the
increased
Camelot
This unique
of a process’
with
that the primary
the component
most useful
memory [9].
feature of Camelot
in an earlier paper [30]) showed that Coda was
and context switching
But we were
curious to see what Camelot could do for us.
The aspect of Camelot
But a set of carefully
(described
less scalable than AFS.
a
of local updates to meta-data in
The simplest
been to implement
used by the Coda File
it does require each server to ensure the
and permanence
the first phase.
protocol
of the two-
of AFS.
Since
such as nested or
that
our
use of
¹For brevity, we often omit “virtual” from “recoverable virtual memory” in the rest of this paper.
A
second
obstacle
programming
to
using
constraints
Camelot
was
it imposed.
the
count on the clean failure
set of
latter is only responsible
These constraints
came in a variety of guises. For example, Camelot required
all
processes
Manager
using
it
to be descendants
task shown in Figure
Coda servers required
scripts
It also made debugging
starting
a Coda server under
example
of a programming
required
us to use Mach
Coda
was capable
expensive,
thread
constraint
of using
context
we ended up paying
and
were
threads.
much
third
limitation
complexity
and
combinations
porting
tight
of Mach
difficult.
recoverable
dependence
features
made
size,
rarely
used
maintenance
memory,
we were usually
new bugs in Camelot.
As the cumulative
toll
and
hard to decide
for ways to preserve the virtues of Camelot
avoiding
its drawbacks.
Since recoverable
was the only aspect of Camelot
virtual
RVM
RVM
If
has to
for coping
unpleasant
is implemented
correctly
with
with
concurrency
to be multi-threaded
and
in the presence of true parallelism.
changes
no
on
threading
coroutine
C threads, and coroutine
fiml
RVM
mechanisms:
simplification
thus
user-level
thread
We have, in fact, used RVM
different
Mach kernel threads [8],
LWP [29].
was to factor
Standard techniques
will
adopts
with three
out resiliency
such as mirroring
Our expectation
most likely
to
can
is that
be implemented
in the
disk.
a layered
approach
to transactional
support, as shown in Figure 2. This approach is simple and
that
enhances flexibility:
That
into
an application
does not have to buy
those aspects of the transactional
irrelevant
concept
that are
to it.
Rationale
The central principle
we adopted in designing RVM
was to
In building
a tool that did
one thing well, we were heeding Lampson’s
sound advice
value simplicity
over generality.
on interface design [19].
long Unix tradition
We were also being faithful
of keeping building
blocks simple.
take radically
from Camelot in the areas
positions
operating
functionality,
system
dependence,
I
The
to simplicity
different
allowed
Application
to the
change in focus from generality
of
above
That layer is also responsible
device driver of a mirrored
quest led to RVM.
3. Design
are supporting.
a layer
and other
of their
at a granularity
starvation
this functionality
memory
was cheap, easy-to-use and had few strings attached.
they
is
out concurrency
synchronization
be used to achieve such resiliency.
we
while
into a realization
RVM
insist on a
to use a policy
deadlocks,
Our
we relied on, we sought to
distill the essence of this functionality
to factor
applications
is required,
media failure.
mounted,
we decided
implementations.
lay in Camelot or Mach.
of these problems
simplified
enforce it.
used
the first to expose
But it was often
problem
the
But it does not depend on kernel thread support, and can be
Since Coda was the sternest test case for
whether a particular
looked
on
its code
have
to the abstmctions
to function
was that
we
and to perform
Internally,
a hefty performance cost
of Camelot
while
control problems.
more
with little to show for it.
A
of RVM,
Rather than having RVM
This allows
serializability
even
where
control.
appropriate
was that
threads,
area
technique,
choice,
because
user-level
switches
specific
control.
was complex.
kernel
second
concurrency
that
complicated
more difficult
Camelot
Since kernel
procedure
a debugger
Another
though
A
the Disk
1. This meant that starting
a rather convoluted
made our system administration
fragile.
of
semantics
for local, non-nested transactions.
Nesting
Code
I
\
Distribution
Serializability
us to
and
structure.
Operating
Permanence:
System
media
failure
3.1. Functionality
was to eliminate support for nesting
Our first simplification
and distribution.
A cost-benefit analysis showed us that
each could be better provided
top of RVM².
efficient
property
than
as an independent
While a layered implementation
a monolithic
of keeping
one,
it
each layer simple.
may be
has the
Figure 2: Layering
layer on
less
attractive
Upper layers can
3.2. Operating
System
To make RVM
portable,
of Functionality
Dependence
we decided
Unix
²An implementation sketch is provided in Section 8.
only
on a
small, widely
supported,
A consequence of this decision was that we
VM
to rely
call interface.
subset of the Mach system
could not count on tight coupling
Zh
in RVM
subsystem.
The Camelot
between
Disk Manager
RVM
and the
module runs
as an external pager [39] and takes full responsibility
managing
the backing
process.
in
the
The use of advisory
Mach
recoverable
interface
regions
paged out until
with
Mach’s
backing
that
This close alliance
allows
Camelot
store
or
to avoid
regions whose
addressing
of large recoverable
3.3. Structure
The
ability
spaces
dirty
address space are not
commit.
subsystem
of a
and unpin)
ensure
and to support recoverable
handling
Camelot’s
Camelot
of a process’
VM
approaches
Efficient
lets
for
regions
VM calls (pin
transaction
double paging,
size
store for recoverable
limits.
regions is critical
to
to
communicate
allows
sacrificing
efficiently
robustness
good
decomposition,
to
be
fast [4],
its
experimental
lags
in
far
Our goals in building RVM were more modest. We were not trying to replace traditional forms of persistent storage, such as file systems and databases. Rather, we saw RVM
1, is predicated
commercial
behind
that
Even
implementations.
on
Unix
of
the
on Mach
best
2.5, the
measurements reported by Stout et al [34] indicate that IPC
is about 600 times more expensive
goals.
modular
it has been shown that IPC can be
performance
implementations
without
Camelot’s
shown earlier in Figure
Although
address
enhanced
performance.
fast IPC.
across
than local procedure call³. To make matters worse, Ousterhout [26] reports that the context switching performance of operating systems is not improving linearly with raw hardware performance.
as a building
block for meta-data in those systems, and in
Given
our desire to make RVM
portable,
higher-level
compositions
willing
to make its design critically
dependent on fast IPC.
of
them.
Consequently, we could assume that the recoverable memory requirements on a machine would only be a small fraction of its total disk storage. This in turn meant that it was acceptable to waste some disk space by duplicating the backing store for recoverable regions. Hence RVM’s backing store for a recoverable region, called its external data segment, is completely independent of the region’s VM swap space.
Crash recovery relies only on the state of the external data segment. Since a VM pageout does not modify the external data segment, an uncommitted dirty page can be reclaimed by the VM subsystem without loss of correctness. Of course, good performance also requires that such pageouts be rare.
One way to characterize our strategy is to view it as a complexity versus resource usage tradeoff. By being generous with memory and disk space, we have been able to keep RVM simple and portable. Our design supports the optional use of external pagers, but we have not implemented support for this feature yet. The most apparent impact on Coda has been slower startup because a process’ recoverable memory must be read in en masse rather than being paged in on demand.
Insulating
RVM
sharing
of
spaces.
But this is not a serious limitation.
primary
reason
increase
Sharing recoverable
this purpose.
virtual
memory
to use a separate
robustness
by
avoiding
memory
across
memory
address
After all, the
address
space is to
memory
corruption.
across address spaces defeats
In fact, it is worse than sharing
virtual memory because damage may be persistent!
our view
is involved
is that processes willing
(volatile)
Hence,
to share recoverable
as a library
that is linked
No external communication
in the servicing
implication
of this is, of course,
applications
not to damage RVM
of RVM
of any
calls.
An
that we have to trust
data structures
and vice
versa.
A less obvious implication
a single write-ahead
is common
movement
is a strong
because
determinant
infeasible
the Disk
significant
server
cannot share
Such sharing
because disk
head
of performance,
and
disk
per application
In
present.
Camelot,
is
for
serves as the multiplexing
The inability
limitation
legitimate
at
Manager
agent for the log.
RVM.
systems
the use of a separate
economically
file
is that applications
log on a dedicated disk.
in transactional
to share one log is not a
for Coda, because we run only one
process
on a machine.
But
concern for other applications
Fortunately,
there are two
it
may
be a
that wish to use
potential
alleviating
factors on the horizon.
First, independent
there
is
of transaction
considerable
implementations
from the VM subsystem also hinders the
recoverable
kind
example,
One way to characterize
complexity
Instead, we have structured RVM
in with an application.
we were not
to place the RVM
in
log-structured
of the Unix file system [28].
log for each application
on such a system, one would
head movement.
processing considerations,
interest
benefit
from
No log multiplexer
If one were
in a separate file
minimal
would
disk
be needed,
because that role would be played by the file system.
Second, there is a trend toward using disks of small form
factor,
partly
technology
[27].
motivated
by
interest
It has been predicted
in
disk
array
that the large disk
capacity in the future will be achieved by using many small
already trust each other enough to run as threads
in a single address space.
³430 microseconds versus 0.7 microseconds for a null call on a typical contemporary machine, the DECstation 5000/200.
disks.
startup latency, as mentioned
If this turns out to be true, there will be considerably
less economic
incentive
to avoiding
a dedicated
we plan to provide an optional
disk per
In summary,
each process using RVM
The log can be placed
partition.
in a Unix
has a separate log.
file
RVM’S
permanence
implementation
rely
of this system call.
on
the
correct
more
mappings
aliasing.
eliminate
information
follows
the major program-visible
the operations
abstractions,
memory
and then describe
for
is managed
in segments, which
segments.
to accommodate segments
although
current
segment
hardware
and
RVM
file
system
on a machine
is only
are
limited
number
of
by its storage
Since the distinction
programs,
we use the term
“external
RVM
to cope
with
of page size,
retains no
after its regions
and maintenance
storage
and
takes
the use of absolute
memory
care
also
of
mapping
in segments.
layered
of storage within
provided
by
and segment
to
operation
data segment”
to
external
A
on RVM,
a segment.
RVM
mapping
for
initialization,
are shown
in Figure
The log to be used by a process is specified
initialization
is invisible
a
This
each time.
pointers
allocator,
of a load map
Primitives
termination
refer to either.
RVM
mappings
the same base address
operations
4(a).
outstanding.
A segment loader package, built on top of
the creation
into
4.2. RVM
The
The backing store for a segment may be a file or
a raw disk partition.
These
memory.
must be done in multiples
supports heap management
has been
limitations
The
the need for
transactions
recoverable
recoverable
up to 2^64 bytes long,
to 2^32 bytes.
length
allows
simplifies
to Multics
virtual
about a segment’s
are unmapped.
RVM,
segment
designed
resources.
the rationale
and Regions
analogous
segments
from
below, we first present
supported on them.
4.1. Segments
Recoverable
logically
In the description
Also,
the same process.
in
Regions can be unmapped at any time, as long as they have
no uncommitted
presented earlier.
The most
and regions must be page-aligned.
4. Architecture
The design of RVM
once by
overlap
Mapping
are minimal.
is that no region of a segment may be
than
cannot
restrictions
on a dedicated
Unix file system.
on segment mapping
restriction
mapped
onto disk.
For best performance,
the log should either be in a raw partition
disk or in a file on a log-structured
important
uses the fsync
flush modifications
guarantees
Restrictions
or on a raw disk
When the log is on a file, RVM
system call to synchronously
restrict
Mach external pager to copy
data on demand.
process.
loosely
in Section 3.2. In the future,
via the options_desc
argument.
at RVM
The map
is called once for each region to be mapped.
data segment
addresses
for
argument.
the
and the range of virtual
mapping
The unmap
are
identified
operation
time that a region is quiescent.
in
the
can be invoked
Once unmapped,
The
memory
first
at any
a region
can be remapped to some other part of the process’ address
space.
After a region has been mapped, memory
it may be used in the transactional
.Ssgment-2
address range specified
Figure 3: Mapping
a transaction
set
As shown in Figure
record
Regions of Segments
3, applications
explicitly
memory.
A region
typically
collection
of objects,
segment.
In the current
data from external
begin_t
corresponds
transaction
image of
implementation,
when a region is mapped.
The limitation
the copying
memory
act
of
of this method is
that is used in all
that transaction.
The
know that a certain area
This allows RVM
to
value of the area so that it can undo
The rest
ore_mode
lets an application
ion
abort a transaction.
is more efficient,
regions require no RVM
occurs
with
lets RVM
copy data on a set -range.
to a related
and may be as large as the entire
data segment to virtual
rans
operation
t id,
is about to be modified.
will never explicitly
guarantees
that newly mapped data represents the committed
associated
operation
the current
shown in
begin_transaction
identifier,
changes in case of an abort.
map regions
RVM
The
operations
range
of a region
during mapping.
of segments into their virtual
the region.
4(b).
returns
further
Each shaded area represents a region. The contents of a region
are physically copied from its external data segment to the virtual
memory
Figure
addresses within
operations
since RVM
Such a
no-restore
does not have to
Read operations
intervention.
flag to
indicate that it
on mapped
initialize(version, options_desc);
map(region_desc, options_desc);
unmap(region_desc);
terminate();
(a) Initialization & Mapping Operations

begin_transaction(tid, restore_mode);
set_range(tid, base_addr, nbytes);
end_transaction(tid, commit_mode);
abort_transaction(tid);
(b) Transactional Operations

flush();
truncate();
(c) Log Control Operations

query(options_desc, region_desc);
set_options(options_desc);
create_log(options, log_len, mode);
(d) Miscellaneous Operations
Figure 4: RVM Primitives
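To make the transactional operations of Figure 4(b) more concrete, the following is a minimal sketch of how set-range, commit, and abort might interact. It is a toy in-memory model written in Python for brevity, not RVM's actual interface: the class name ToyRVM, the bytearray standing in for a mapped region, the byte-offset arguments, and the error raised when a no-restore transaction is aborted are all assumptions made for illustration.

```python
class ToyRVM:
    """Toy model of one mapped region with restore and no-restore transactions."""

    def __init__(self, size):
        self.memory = bytearray(size)   # stands in for a mapped recoverable region
        self._next_tid = 0
        self._txns = {}                 # tid -> {"restore": bool, "old": [(offset, bytes)]}

    def begin_transaction(self, restore=True):
        tid = self._next_tid
        self._next_tid += 1
        self._txns[tid] = {"restore": restore, "old": []}
        return tid

    def set_range(self, tid, offset, nbytes):
        """Declare that [offset, offset + nbytes) is about to be modified."""
        txn = self._txns[tid]
        if txn["restore"]:
            # Save the current value so that an abort can undo the change.
            old = bytes(self.memory[offset:offset + nbytes])
            txn["old"].append((offset, old))
        # A no-restore transaction skips the copy entirely.

    def end_transaction(self, tid):
        # Commit: the saved old values are no longer needed.
        del self._txns[tid]

    def abort_transaction(self, tid):
        txn = self._txns.pop(tid)
        if not txn["restore"]:
            # Assumption of this sketch: aborting a no-restore transaction is an error.
            raise RuntimeError("no-restore transaction cannot be aborted")
        # Undo in reverse order so the earliest saved value wins.
        for offset, old in reversed(txn["old"]):
            self.memory[offset:offset + len(old)] = old
```

In this model, an application that forgets to call set_range before writing gets no undo on abort, which mirrors the programming pitfall the paper reports for RVM itself.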
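A transaction is committed by end_transaction and aborted via abort_transaction. By default, a successful commit guarantees permanence of changes made in a transaction. But an application can indicate its willingness to accept a weaker permanence guarantee via the commit_mode parameter of end_transaction.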
Such a no-flush or
no-flush
transactions
flush RVM’S
where the bound is
Figure 4(c) shows
controlling
operation,
flush,
transactions
operation,
blocks
have
been
truncate,
in the write-ahead
segments.
a
operation,
committed
truncation
is
long-running
we have provided
and
RVM
for
variety
application
identity
to obtain information
i ons
changes
performed
of
application
internal
buffers.
can dynamically
then use it in an initialize
in a region.
sets a variety
Using
of tuning
a no-undo/redo
the old-value
space is available
records
only
of
from
memory
transactions.
records
to the log.
of committed
The format of a
The bounds and contents of old-value records are known to RVM from the set-range operations issued during a transaction. Upon commit, old-value records are replaced by new-value records that reflect the current contents of the corresponding ranges of memory. Note that each modified range results in only one new-value record even if that range has been updated many times in a transaction. The final step of transaction commitment consists of forcing the new-value records to the log and writing out a commit record.
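The commit step described above can be sketched as a small function that turns the ranges declared during a transaction into log records. This is an illustrative Python model only: the function name commit_records, the tuple layout of the records, and the treatment of ranges as (offset, nbytes) pairs are assumptions, and the sketch deduplicates only identical ranges.

```python
def commit_records(memory, ranges):
    """Build the log records for one committing transaction.

    memory: bytes-like object holding the current contents of the region.
    ranges: (offset, nbytes) pairs declared via set-range, possibly repeated.
    """
    records = []
    seen = set()
    for offset, nbytes in ranges:
        if (offset, nbytes) in seen:
            # A range updated many times still yields a single new-value record.
            continue
        seen.add((offset, nbytes))
        # The record carries the contents at commit time, not any earlier value.
        current = bytes(memory[offset:offset + nbytes])
        records.append(("new-value", offset, current))
    # The commit record is written last, after all new-value records.
    records.append(("commit",))
    return records
```

Because only the contents at commit time are logged, the log never holds uncommitted data, which is what makes the no-undo/redo strategy possible.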
knobs
an
in virtual
uncommitted
the new-value
have to be written
by new-value
c reate_log,
logging
assumes
transaction.
and the
value
The implementation
The bounds and contents of old-value
log truncation
create a write-ahead
use
to an external data segment.
RVM
for applications
allows
to
typical log record is shown in Figure 5.
such as the number and
for triggering
treatment
techniques can be found in
changes
transactions
resource-intensive
operation
transactions
operation
such as the threshold
sizes
able
Consequently,
second
shown in Figure 4(d), perform
The query
of uncommitted
set_opt
is
corresponding
of functions.
log
manual [22]
strategy [3] because it never reflects uncommitted
for
to control its timing.
The final set of primitives,
implementation
that adequate buffer
first
But since this is
a mechanism
The RVM
5.1.1. Log Format
Note
to external data
usually
by RVM.
aspects of its implementation:
5.1. Log Management
no-flush
The
disk.
blocks until all committed
in the background
potentially
all
to
for
our discussion
Gray and Reuter’s text [14].
When
The
log.
log have been reflected
Log
transparently
until
techniques
must explicitly
provided by RVM
forced
well-known
systems, we restrict
and optimization.
of transactional
independent of permanence.
the use of the write-ahead
upon
here to two important
management
bounded persistence,
the two operations
draws
transactional
offers many further details, and a comprehensive
the period between log flushes.
that atomicity is guaranteed
its
guarantee via
time to time.
provides
RVM
building
To ensure persistence
log tim
used in this manner, RVM
Since
changes
has reduced commit
the application
write-ahead
of
a
of end_transaction.
“lazy’’transaction
latency since a log force is avoided.
of its
default,
can indicate
to accept a weaker permanence
5. Implementation
and
By
abort_transaction.
No-restore and no-flush transactions are more efficient. The former result in both time and space savings since the contents of old-value records do not have to be copied or buffered. The latter result in considerably lower commit latency, since new-value and commit records can be spooled rather than forced to the log.
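The difference between forcing and spooling can be illustrated with a toy log, sketched here in Python. The class ToyLog and its methods are hypothetical names for illustration and do not correspond to RVM's implementation; each transaction's records are kept together as one unit so that a simulated crash loses whole transactions rather than parts of them.

```python
class ToyLog:
    """Toy write-ahead log distinguishing spooled from forced records."""

    def __init__(self):
        self.on_disk = []   # transactions whose records were forced to the log
        self.spooled = []   # committed no-flush transactions still in memory

    def commit(self, records, flush=True):
        # Keep each transaction's records together as one unit.
        self.spooled.append(list(records))
        if flush:
            self.flush()    # a flush-mode commit forces everything spooled so far

    def flush(self):
        self.on_disk.extend(self.spooled)
        self.spooled.clear()

    def crash(self):
        """Simulate a crash: spooled transactions are lost, forced ones survive."""
        self.spooled.clear()
        return list(self.on_disk)
```

In this model a no-flush commit returns without touching on_disk, so its permanence is bounded by the time until the next flush, while atomicity holds because a transaction is either wholly present in on_disk or wholly absent.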
[Figure 5 diagram labels: Reverse Displacements, Trans Hdr, Range Hdr 1, Forward Displacements]
This log record has three modification ranges. The bidirectional displacements records allow the log to be read either way.
Figure 5: Format of a Typical Log Record
[Figure 6 diagram labels: Tail Displacements, Head Displacements]
This figure shows the organization of a log during epoch truncation. The current tail of the log is to the right of the area marked “current epoch”. The log wraps around logically, and internal synchronization in RVM allows forward processing in the current epoch while truncation is in progress. When truncation is complete, the area marked “truncation epoch” will be freed for new log records.
Figure 6: Epoch Truncation
5.1.2. Crash Recovery and Log Truncation
undo/redo
Crash recovery consists of RVM first reading the log from tail to head, then constructing an in-memory tree of the latest committed changes for each data segment encountered in the log. The trees are then traversed, applying modifications in them to the corresponding external data segment. Finally, the head and tail location information in the log status block is updated to reflect an empty log. The idempotency of recovery is achieved by delaying this step until all other recovery actions are complete.
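The recovery procedure can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the log is a list of committed (segment name, offset, data) records ordered from head (oldest) to tail (newest), external data segments are bytearrays, and a per-segment dictionary keyed by byte offset stands in for the in-memory tree. The function name recover is hypothetical.

```python
def recover(log, segments):
    """Apply the latest committed changes in the log to the data segments.

    log: list of (segment_name, offset, data) records, head (oldest) first.
    segments: dict mapping segment_name to a bytearray.
    """
    latest = {}   # segment name -> {byte offset: latest committed value}
    for name, offset, data in reversed(log):          # read from tail to head
        changes = latest.setdefault(name, {})
        for i, value in enumerate(data):
            # A newer record was seen first, so it wins over older ones.
            changes.setdefault(offset + i, value)
    for name, changes in latest.items():
        segment = segments[name]
        for position, value in changes.items():
            segment[position] = value
    # Only after every change has been applied is the log marked empty.
    log.clear()
```

If a crash interrupts this function before the final step, the log is still intact and running it again from the start produces the same segment contents, which is the idempotency property the text describes.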
property
internal
locks
not violate
Certain
may result
have
RVM
maintains
truncation
situations,
transactions
or sustained
in incremental
truncation
The idempotency
those circumstances,
of recovery
all
other
is achieved by
recovery
actions
does
such as the
blocked for so long that log space becomes critical.
step until
been
high
being
Under
RVM reverts to epoch truncation.
are
Page Vector
Truncation
is the process of reclaiming
log entries by applying
the recoverable
because log
whenever
current
total size.
space is finite,
implementation
to
procedure
code
as
epoch
described
the log while
truncation
to implement
correctly.
the
forward
,..
,. ---,,-
crash
processing
Although
logically
recovery
This figure
truncation.
part of
and results in bursty system performance.
incremental
truncation
mechanism
periodically
obsolete by writing
recoverable
during
renders
normal
the
Now that RVM
a mechanism
operation.
oldest
out relevant pages directly
data
segment.
To
log
for
This
entries
from VM to
preserve
the
no-
shows the key data structures involved in incremental
R1 through R5 are log entries. The reserved bit in
to the recoverable
data segment.
The log head does not move, since P2 has the
same log offset as P1.
P2 is written next, and the log head is
moved to P3’s log offset. Incremental truncation is now blocked
until P3’s uncommitted reference count drops to zero.
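The scenario in this caption can be replayed with a small Python sketch. It models the page queue as a deque of (page, log offset) descriptors and the page vector as a dictionary of uncommitted reference counts. The function name incremental_truncate, the numeric log offsets, and the choice to move the log head to the tail when the queue empties are assumptions for illustration, not details taken from RVM.

```python
from collections import deque


def incremental_truncate(queue, uncommitted, log_head, log_tail):
    """Write out pages from the head of the queue until one is blocked.

    queue: deque of (page, log_offset) descriptors, oldest first.
    uncommitted: dict mapping page to its uncommitted reference count.
    Returns (pages_written, new_log_head).
    """
    written = []
    while queue and uncommitted.get(queue[0][0], 0) == 0:
        page, _ = queue.popleft()
        written.append(page)   # stands in for writing the page to the data segment
        # The log head moves to the offset named by the next descriptor.
        log_head = queue[0][1] if queue else log_tail
    return written, log_head


# P1 and P2 share a log offset; P3 still has an uncommitted reference.
queue = deque([("P1", 0), ("P2", 0), ("P3", 300)])
counts = {"P1": 0, "P2": 0, "P3": 1}
written, head = incremental_truncate(queue, counts, log_head=0, log_tail=500)
```

After this call the sketch has written P1 and P2 and stopped at P3, matching the caption; once P3's count drops to zero a second call can proceed.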
Figure
more than necessary,
is stable and robust, we are implementing
kg tail
Log Records
count of zero, it is the first page to be written
exclusive
reliance on epoch truncation
is a
correct strategy, it substantially
increases log
processing
●.,
:
page vector entries is used as an internal lock. Since page P1 is
at the head of the page queue and has an uncommitted reference
occurs in the
is in progress.
degrades forward
tail
&jR4~
rest of the log. Figure 6 depicts the layout of a log while an
epoch truncation
PagaQueua
-----
IOghead
approach,
to an initial
:
:
To
chose to reuse
In this
truncation,
head
:
of its
has proved to
truncation.
P
1
is
log truncation
above is applied
concurrent
to
and is triggered
effort, we initially
for
Unumm-med
R8fCni
B*
Resewed
in them to
log size exceeds a preset fraction
be the hardest part of RVM
crash recovery
Periodic
segment.
In our experience,
minimize
space allocated
the changes contained
data
necessary
the
that
cannot be written
data segment.
of long-running
complete.
traffic,
pages
to ensure that incremental
this propwty.
concurrency,
log,
transactions
in the log status block is updated to reflect an
this
referred
the
out to the recoverable
presence
the head and tail location
of
by uncommitted
7: Incremental
Truncation
Figure 7 shows the two data structures used in incremental truncation. The first data structure is a page vector for each mapped region that maintains the modification status of that region’s pages. The page vector is loosely analogous to a VM page table: the entry for a page contains a dirty bit and an uncommitted reference count.
A page is marked
dirty when it has committed
reference
count
executed,
and
committed
or aborted,
marked dirty.
is
changes.
incremented
decremented
the
descriptors
that specifies
the log head. Each descriptor
first record referencing
descriptor
incremental
offset
are
platforms
in which
truncation
specified
by
The queue contains no
it could
consists
the descriptor,
repeated until
out in order to move
only in the
appear.
of
A step in
selecting
the
first
out the pages specified
and moving
the next
This
descriptor.
the desired amount
5.2. Optimizations

Early experience with RVM indicated two distinct opportunities for substantially reducing the volume of data written to the log. We refer to these as intra-transaction and inter-transaction optimizations respectively.

Intra-transaction optimizations arise when set_range calls specifying identical, overlapping, or adjacent memory addresses are issued within a single transaction. Such situations typically occur because of modularity and defensive programming in applications. Forgetting to issue a set_range call is an insidious bug, while issuing a duplicate call is harmless. Hence applications are often written to err on the side of caution. This is particularly common when one part of an application begins a transaction, and then invokes procedures elsewhere to perform actions within that transaction. Each of those procedures may perform set_range calls for the areas of recoverable memory it modifies, even if the caller or some other procedure is supposed to have done so already. Optimization code in RVM causes duplicate set_range calls to be ignored, and overlapping and adjacent log records to be coalesced.

Inter-transaction optimizations occur only in the context of no-flush transactions. Temporal locality of reference in input requests to an application often translates into locality of modifications to recoverable memory. For example, the command "cp d1/* d2" on a Coda client will cause as many no-flush transactions updating the data structure in RVM for d2 as there are children of d1. Only the last of these updates needs to be forced to the log on a future flush. The check for inter-transaction optimization is performed at commit time. If the modifications being committed subsume those from an earlier unflushed transaction, the older log records are discarded.

6. Status and Experience

RVM has been in daily use for over two years on hardware platforms such as IBM RTs, DEC MIPS workstations, Sun Sparc workstations, and a variety of Intel 386/486-based laptops and workstations. Memory capacity on these machines ranges from 12MB to 64MB, while disk capacity ranges from 60MB to 2.5GB. Our personal experience with RVM has only been on Mach 2.5 and 3.0. But RVM has been ported to SunOS and SGI IRIX at MIT, and we are confident that ports to other Unix platforms will be straightforward. Most applications using RVM have been written in C or C++, but a few have been written in Standard ML. A version of the system that uses incremental truncation is being debugged.

Our original intent was just to replace Camelot by RVM on servers, in the role described in Section 2.2. But positive experience with RVM has encouraged us to expand its use. For example, transparent resolution of directory updates made to partitioned server replicas is done using a log-based strategy [17]. The logs for resolution are maintained in RVM. Clients also use RVM now, particularly for supporting disconnected operation [16]. The persistence of changes made while disconnected is achieved by storing replay logs in RVM, and user advice for long-term cache management is stored in a hoard database in RVM.

An unexpected use of RVM has been in debugging Coda servers and clients [31]. As Coda matured, we ran into hard-to-reproduce bugs involving corrupted persistent data structures. We realized that the information in RVM's log offered excellent clues to the source of these corruptions. All we had to do was to save a copy of the log before truncation, and to build a post-mortem tool to search and display the history of modifications recorded by the log.

The most common source of programming problems in using RVM has been in forgetting to do a set_range call prior to modifying an area of recoverable memory. The result is disastrous, because RVM does not create a new-value record for this area upon transaction commit. Hence the restored state after a crash or shutdown will not reflect modifications by the transaction to that area of memory. The current solution, as described in Section 5.2, is to program defensively. A better solution would be language-based, as discussed in Section 8.
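The intra-transaction optimization of Section 5.2, which is what keeps defensive set_range calls cheap, amounts to merging duplicate, overlapping, and adjacent ranges before log records are produced. The sketch below is a minimal illustration under stated assumptions: ranges are (start, length) pairs of byte offsets rather than real addresses, and the function name coalesce_ranges is hypothetical. RVM's own optimization code is not shown in this paper.

```python
def coalesce_ranges(ranges):
    """Merge duplicate, overlapping, and adjacent (start, length) ranges.

    Returns a sorted list of disjoint, non-adjacent (start, length) pairs
    that cover exactly the same bytes as the input.
    """
    merged = []  # list of [start, end) intervals, kept sorted by start
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Duplicate, overlapping, or adjacent: extend the previous interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


# Example: two procedures defensively declare the same 8 bytes, a third
# declares an overlapping range, a fourth an adjacent one, a fifth a distant one.
declared = [(0, 8), (0, 8), (4, 8), (12, 4), (32, 4)]
log_ranges = coalesce_ranges(declared)
```

With this kind of merging, a duplicate set_range call adds no log data, which is why erring on the side of caution costs little.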
7. Evaluation

A fair assessment of RVM must consider two distinct issues. From a software engineering perspective, we need to ask whether RVM's code size and complexity are commensurate with its functionality. From a systems perspective, we need to know whether RVM's focus on simplicity has resulted in unacceptable loss of performance.

To address the first issue, we compared the source code of RVM and Camelot. RVM's mainline code is approximately 10K lines of C, while utilities, test programs and other auxiliary code contribute a further 10K lines. Camelot has a mainline code size of about 60K lines of C, and auxiliary code of about 10K lines. These numbers do not include code in Mach for features like IPC and the external pager that are critical to Camelot.

Thus the total size of code that has to be understood, debugged, and tuned is considerably smaller for RVM. This translates into a corresponding reduction of effort in maintenance and porting. What is being given up in return is support for nesting and distribution, as well as flexibility in areas such as choice of logging strategies, a fair trade by our reckoning.

To evaluate the performance of RVM we used controlled experimentation as well as measurements from Coda servers and clients in actual use. The specific questions of interest to us were:
● How serious is the lack of integration between RVM and VM?
● What is RVM's impact on scalability?
● How effective are intra- and inter-transaction optimizations?

7.1. Lack of RVM-VM Integration

As discussed in Section 3.2, the separation of RVM from the VM component of an operating system could hurt performance. To quantify this effect, we designed a variant of the industry-standard TPC-A benchmark [32] and used it in a series of carefully controlled experiments.

7.1.1. The Benchmark

The TPC-A benchmark is stated in terms of a hypothetical bank with one or more branches, multiple tellers per branch, and many customer accounts per branch. A transaction updates a randomly chosen account, updates branch and teller balances, and appends a history record to an audit trail.

In our variant of this benchmark, we represent all the data structures in recoverable memory. The number of accounts is a parameter of our benchmark. The accounts and the audit trail are represented as arrays of 128-byte and 64-byte records respectively. Each of these data structures occupies close to half the total recoverable memory. The sizes of the data structures for teller and branch balances are insignificant. Access to the audit trail is always sequential, with wrap-around.

The pattern of accesses to the account array is a second parameter of our benchmark. The best case for paging performance occurs when accesses are sequential. The worst case occurs when accesses are uniformly distributed across all accounts. To represent the average case, the benchmark uses an access pattern that exhibits considerable temporal locality. In this access pattern, referred to as localized, 70% of the transactions update accounts on 5% of the pages, 25% of the transactions update accounts on a different 15% of the pages, and the remaining 5% of the transactions update accounts on the remaining 80% of the pages. Within each set, accesses are uniformly distributed.

7.1.2. Results

Our primary goal in these experiments was to understand the throughput of RVM over its intended domain of use. This corresponds to situations where paging rates are low, as discussed in Section 3.2. A secondary goal was to observe performance degradation relative to Camelot as paging becomes more significant. We expected this to shed light on the importance of RVM-VM integration.

To meet these goals, we conducted experiments for account arrays ranging from 32K entries to about 450K entries. This roughly corresponds to ratios of 10% to 175% of total recoverable memory size to total physical memory size. At each account array size, we performed the experiment for sequential, random, and localized account access patterns. Table 1 and Figure 8 present our results. Hardware and other relevant experimental conditions are described in Table 1.

For sequential account access, Figure 8(a) shows that RVM and Camelot offer virtually identical throughput. This throughput hardly changes as the size of recoverable memory increases. The average time to perform a log force on the disks used in our experiments is about 17.4 milliseconds. This yields a theoretical maximum throughput of 57.4 transactions per second, which is within 15% of the observed best-case throughputs for RVM and Camelot.

When account access is random, Figure 8(a) shows that RVM's throughput is initially close to its value for sequential access. As recoverable memory size increases, the effects of paging become more significant, and throughput drops. But the drop does not become serious until recoverable memory size exceeds about 70% of physical memory size. The random access case is precisely where one would expect Camelot's integration with Mach to be most valuable. Indeed, the convexities of the curves in Figure 8(a) show that Camelot's degradation is more graceful than RVM's. But even at the highest ratio of recoverable to physical memory size, RVM's throughput is better than Camelot's.
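The localized access pattern defined in Section 7.1.1 (70% of transactions on 5% of the pages, 25% on a different 15%, and 5% on the remaining 80%) can be sketched as a simple sampler. This is not the benchmark's code. The function name localized_page is hypothetical, and placing the three page sets contiguously at the start, middle, and end of the page range is an assumption made here for simplicity; the paper does not say which pages form each set.

```python
import random


def localized_page(num_pages, rng):
    """Pick a page index according to the 70/25/5 localized pattern.

    Assumption of this sketch: the hot 5% of pages come first, the warm 15%
    next, and the cold 80% last. Accesses are uniform within each set.
    """
    if num_pages < 20:
        raise ValueError("need at least 20 pages to form the three sets")
    hot_end = num_pages * 5 // 100     # first 5% of pages
    warm_end = num_pages * 20 // 100   # next 15% of pages
    u = rng.random()
    if u < 0.70:
        return rng.randrange(0, hot_end)
    if u < 0.95:
        return rng.randrange(hot_end, warm_end)
    return rng.randrange(warm_end, num_pages)


# Usage: draw the pages touched by a run of transactions.
rng = random.Random(0)
touched = [localized_page(1000, rng) for _ in range(20000)]
```

Mapping a sampled page to an account would additionally depend on the page size and the 128-byte account record size; that step is omitted here.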
No. of     Rmem/     RVM (Trans/Sec)                           Camelot (Trans/Sec)
Accounts   Pmem      Sequential    Random        Localized     Sequential    Random        Localized
 32768      12.5%    48.6 (0.0)    47.9 (0.0)    47.5 (0.0)    48.1 (0.0)    41.6 (0.4)    44.5 (0.2)
 65536      25.0%    48.5 (0.2)    46.4 (0.1)    46.6 (0.0)    48.2 (0.0)    34.2 (0.3)    43.1 (0.6)
 98304      37.5%    48.6 (0.0)    45.5 (0.0)    46.2 (0.0)    48.9 (0.1)    30.1 (0.2)    41.2 (0.2)
131072      50.0%    48.2 (0.0)    44.7 (0.2)    45.1 (0.0)    48.1 (0.0)    29.2 (0.0)    41.3 (0.1)
163840      62.5%    48.1 (0.0)    43.9 (0.0)    44.2 (0.1)    48.1 (0.0)    27.1 (0.2)    40.3 (0.2)
196608      75.0%    47.7 (0.0)    43.2 (0.0)    43.4 (0.0)    48.1 (0.4)    25.8 (1.2)    39.5 (0.8)
229376      87.5%    47.2 (0.1)    42.5 (0.0)    43.8 (0.1)    48.2 (0.2)    23.9 (0.1)    37.9 (0.2)
262144     100.0%    46.9 (0.0)    41.6 (0.0)    41.1 (0.0)    48.0 (0.0)    21.7 (0.0)    35.9 (0.2)
294912     112.5%    46.3 (0.6)    40.8 (0.5)    39.0 (0.6)    48.0 (0.0)    20.8 (0.2)    35.2 (0.1)
327680     125.0%    46.9 (0.7)    39.7 (0.0)    39.0 (0.5)    48.1 (0.1)    19.1 (0.0)    33.7 (0.0)
360448     137.5%    48.6 (0.0)    33.8 (0.9)    40.0 (0.0)    48.3 (0.0)    18.6 (0.0)    33.3 (0.1)
393216     150.0%    46.9 (0.2)    33.3 (1.4)    39.4 (0.4)    48.9 (0.0)    18.7 (0.1)    32.4 (0.2)
425984     162.5%    46.5 (0.4)    30.9 (0.3)    38.7 (0.2)    48.0 (0.0)    18.2 (0.0)    32.3 (0.2)
458752     175.0%    46.4 (0.4)    27.4 (0.2)    35.4 (1.0)    47.7 (0.0)    17.9 (0.1)    31.6 (0.0)
This table presents the measured steady-state throughput, in transactions per second, of RVM and Camelot on the benchmark described in Section 7.1.1. The column labelled "Rmem/Pmem" gives the ratio of recoverable to physical memory size. Each data point gives the mean and standard deviation (in parentheses) of the three trials with most consistent results, chosen from a set of five to eight. The experiments were conducted on a DEC 5000/200 with 64MB of main memory and separate disks for the log, external data segment, and paging file. Only one thread was used to run the benchmark. Only processes relevant to the benchmark ran on the machine during the experiments. Transactions were required to be fully atomic and permanent. Inter- and intra-transaction optimizations were enabled in the case of RVM, but not effective for this benchmark. This version of RVM only supported epoch truncation; we expect incremental truncation to improve performance significantly.

Table 1: Transactional Throughput

[Figure 8 plots: throughput versus Rmem/Pmem (percent). Panel (a), Best and Worst Cases: RVM Sequential, Camelot Sequential, RVM Random, Camelot Random. Panel (b), Average Case: localized access for RVM and Camelot.]

These plots illustrate the data in Table 1. For clarity, the average case is presented separately from the best and worst cases.

Figure 8: Transactional Throughput

For localized account access, Figure 8(b) shows that RVM's throughput drops almost linearly with increasing recoverable memory size. But the drop is relatively slow, and performance remains acceptable even when recoverable memory size approaches physical memory size. Camelot's throughput also drops linearly, and is consistently worse than RVM's throughput.

These measurements confirm that RVM's simplicity is not an impediment to good performance for its intended application domain. A conservative interpretation of the data in Table 1 indicates that applications with good locality can use up to 40% of physical memory for active recoverable data, while keeping throughput degradation to less than 10%. Applications with poor locality have to restrict active recoverable data to less than 25% for similar performance. Inactive recoverable data can be much larger, constrained only by startup latency and virtual memory limits imposed by the operating system.

The comparison with Camelot is especially revealing. In spite of the fact that RVM is not integrated with VM, it is able to outperform Camelot over a broad range of workloads.
[Figure 9 plots: CPU usage per transaction versus Rmem/Pmem (percent). Panel (a), Worst and Best Cases: RVM Sequential, Camelot Sequential, RVM Random, Camelot Random. Panel (b), Average Case.]

These plots depict the measured CPU usage of RVM and Camelot during the experiments described in Section 7.1.2. As in Figure 8, we have separated the average case from the best and worst cases for visual clarity. To save space, we have omitted the table of data (similar to Table 1) on which these plots are based.
Figure 9: Amortized CPU Cost per Transaction

Although we were gratified by these results, we were puzzled by Camelot's behavior. For low ratios of recoverable to physical memory we had expected both Camelot's and RVM's throughputs to be independent of the degree of locality in the access pattern. The data shows that this is indeed the case for RVM. But in Camelot's case, throughput is highly sensitive to locality even at the lowest recoverable to physical memory ratio of 12.5%. At that ratio Camelot's throughput in transactions per second drops from 48.1 in the sequential case to 44.5 in the localized case, and to 41.6 in the random case.

Closer examination of the raw data indicates that the drop in throughput is attributable to much higher levels of paging activity sustained by the Camelot Disk Manager. We conjecture that this increased paging activity is induced by an overly aggressive log truncation strategy in the Disk Manager. During truncation, the Disk Manager writes out all dirty pages referenced by entries in the affected portion of the log. When truncation is frequent and account access is random, many opportunities to amortize the cost of writing out a dirty page across multiple transactions are lost. Less frequent truncation or sequential account access result in fewer such lost opportunities.

7.2. Scalability

As discussed in Section 2.3, Camelot's heavy toll on the scalability of Coda servers was a key influence on the design of RVM. It is therefore appropriate to ask whether RVM has yielded the anticipated gains in scalability. The ideal way to answer this question would be to repeat the experiment mentioned in Section 2.3, using RVM instead of Camelot. Unfortunately, such a direct comparison is not feasible because server hardware has changed considerably. Instead of IBM RTs we now use the much faster DECstation 5000/200s. Repeating the original experiment on current hardware is also not possible, because Coda servers now use RVM to the exclusion of Camelot.

Consequently, our evaluation of RVM's scalability is based on the same set of experiments described in Section 7.1. For each trial of that set of experiments, the total CPU usage on the machine was recorded. Since no extraneous activity was present on the machine, all CPU usage (whether in system or user mode) is attributable to the running of the benchmark. Dividing the total CPU usage by the number of transactions gives the average CPU cost per transaction, which is our metric of scalability. Note that this metric amortizes the cost of sporadic activities like log truncation and page fault servicing over all transactions.

Figure 9 compares the scalability of RVM and Camelot for each of the three access patterns described in Section 7.1.1. For sequential account access, RVM requires about half the CPU usage of Camelot. The actual values of CPU usage remain almost constant for both systems over all the recoverable memory sizes we examined.

For random account access, Figure 9(a) shows that both RVM and Camelot's CPU usage increase with recoverable memory size. But it is astonishing that even at the limit of our experimental range, RVM's CPU usage is less than Camelot's. In other words, the inefficiency of page fault handling in RVM is more than compensated for by its lower inherent overhead.
Machine    Machine   Transactions   Bytes Written   Intra-Transaction   Inter-Transaction   Total
name       type      committed      to Log          Savings             Savings             Savings
grieg      server    267,224        289,215,032     20.7%                0.0%               20.7%
haydn      server    483,978        661,612,324     21.5%                0.0%               21.5%
wagner     server    248,169        264,557,372     20.9%                0.0%               20.9%
mozart     client     34,744          9,039,008     41.6%               26.7%               68.3%
ives       client     21,013          6,842,648     31.2%               22.0%               53.2%
verdi      client     21,907          5,789,696     28.1%               20.9%               49.0%
bach       client     26,209         10,787,736     25.8%               21.9%               47.7%
purcell    client     76,491         12,247,508     41.3%               36.2%               77.5%
berlioz    client    101,168         14,918,736     17.3%               64.3%               81.6%

This table presents the observed reduction in log traffic due to RVM optimizations. The column labelled "Bytes Written to Log" shows the log size after both optimizations were applied. The columns labelled "Intra-Transaction Savings" and "Inter-Transaction Savings" indicate the percentage of the original log size that was suppressed by each type of optimization. This data was obtained over a 4-day period in March 1993 from Coda clients and servers.

Table 2: Savings Due to RVM Optimizations
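Since Table 2 reports the log size after optimization together with the percentage of the original log that was suppressed, the unoptimized log size can be estimated by simple arithmetic. The sketch below is only an illustration: the function name original_log_bytes is hypothetical, and the result is approximate because the published percentages are rounded to one decimal place.

```python
def original_log_bytes(bytes_written, total_savings_pct):
    """Estimate the unoptimized log size from a row of Table 2.

    bytes_written is the log size after both optimizations, and
    total_savings_pct is the percentage of the original log suppressed, so
    bytes_written = original * (1 - total_savings_pct / 100).
    """
    if not 0 <= total_savings_pct < 100:
        raise ValueError("savings must be in the range [0, 100)")
    return bytes_written / (1.0 - total_savings_pct / 100.0)


# Usage with the "grieg" row: 289,215,032 bytes written, 20.7% total savings.
grieg_estimate = original_log_bytes(289_215_032, 20.7)
```

The same calculation applied to a client row with large total savings shows how much more log traffic that client would otherwise have generated.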
For localized account access, Figure 9(b) shows that CPU usage increases linearly with recoverable memory size for both RVM and Camelot. For all sizes investigated, RVM's CPU usage remains well below that of Camelot's.

Overall, these measurements establish that RVM is considerably less of a CPU burden than Camelot. Over most of the workloads investigated, RVM typically requires about half the CPU usage of Camelot. We anticipate that refinements to RVM such as incremental truncation will further improve its scalability.

RVM's lower CPU usage follows directly from our decision to structure it as a library rather than as a collection of tasks communicating via IPC. As mentioned in Section 3.3, Mach IPC costs about 600 times as much as a procedure call on the hardware we used for our experiments. Further contributing to reduced CPU usage are the substantially smaller path lengths in various RVM components due to their inherently simpler functionality.

7.3. Effectiveness of Optimizations

To estimate the value of intra- and inter-transaction optimizations, we instrumented RVM to keep track of the total volume of log data eliminated by each technique. Table 2 presents the observed savings in log traffic for a representative sample of Coda clients and servers in our environment.

The data in Table 2 shows that both servers and clients benefit significantly from intra-transaction optimization. The savings in log traffic is typically between 20% and 30%, though some machines exhibit substantially higher savings. Inter-transaction optimizations typically reduce log traffic on clients by another 20-30%. Servers do not benefit from this type of optimization, because it is only applicable to no-flush transactions. RVM optimizations have proved to be especially valuable for good performance on portable Coda clients, because disks on those machines tend to be selected on the basis of size, weight, and power consumption rather than performance.

7.4. Broader Analysis

A fair criticism of the conclusions drawn in Sections 7.1 and 7.2 is that they are based solely on comparison with a research prototype, Camelot. A favorable comparison with well-tuned commercial products would strengthen the claim that RVM's simplicity does not come at the cost of good performance. Unfortunately, such a comparison is not currently possible because no widely used commercial product supports recoverable virtual memory. Hence a performance analysis of broader scope will have to await the future.

8. RVM as a Building Block

The simplicity of the abstraction offered by RVM makes it a versatile base on which to implement more complex functionality. In principle, any abstraction that requires persistent data structures with clean local failure semantics can be built on top of RVM. In some cases, minor extensions of the RVM interface may be necessary.

For example, nested transactions could be implemented using RVM as a substrate for bookkeeping state such as the undo logs of nested transactions. Only top-level begin, commit, and abort operations would be visible to RVM. Recovery would be simple, since the restoration of committed state would be handled entirely by RVM. The feasibility of this approach has been confirmed by the Venari project [37].
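The layering of nested transactions described above can be sketched as follows. This is a toy model under loudly stated assumptions, not the Venari design or any RVM interface: TopLevelStub is a hypothetical stand-in that merely records the top-level begin, commit, and abort calls that RVM would see, a Python dict stands in for recoverable memory, and the per-level undo logs are kept in ordinary memory, whereas the text envisions keeping such bookkeeping state in RVM.

```python
_MISSING = object()  # marks a key that did not exist before a write


class TopLevelStub:
    """Hypothetical stand-in for the top-level transaction facility."""

    def __init__(self):
        self.calls = []

    def begin(self):
        self.calls.append("begin")

    def commit(self):
        self.calls.append("commit")

    def abort(self):
        self.calls.append("abort")


class NestedTransactions:
    def __init__(self, memory, top_level):
        self.memory = memory     # dict standing in for recoverable memory
        self.top = top_level
        self.undo_stack = []     # one undo log per nesting level

    def begin(self):
        if not self.undo_stack:  # only the outermost begin is visible below
            self.top.begin()
        self.undo_stack.append([])

    def write(self, key, value):
        old = self.memory.get(key, _MISSING)
        self.undo_stack[-1].append((key, old))
        self.memory[key] = value

    def commit(self):
        log = self.undo_stack.pop()
        if self.undo_stack:
            # A nested commit is provisional: the parent inherits the undo
            # records so that a later parent abort can still roll it back.
            self.undo_stack[-1].extend(log)
        else:
            self.top.commit()

    def abort(self):
        log = self.undo_stack.pop()
        for key, old in reversed(log):  # undo the most recent writes first
            if old is _MISSING:
                self.memory.pop(key, None)
            else:
                self.memory[key] = old
        if not self.undo_stack:
            self.top.abort()
```

In this sketch a nested abort rolls back only that subtransaction's writes, while the underlying facility sees a single top-level begin and a single commit or abort.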
Support for distributed transactions could also be provided by a library built on RVM. Such a library would provide coordinator and subordinate routines for each phase of a two-phase commit, as well as for operations such as beginning a transaction and adding new sites to a transaction. Recovery after a coordinator crash would involve RVM recovery, followed by appropriate termination of distributed transactions in progress at the time of the crash. The communication mechanism could be left unspecified until runtime by using upcalls from the library to perform communications. RVM would have to be extended to enable a subordinate to undo the effects of a first-phase commit if the coordinator decides to abort. One way to do this would be to extend end_transaction to return a list of the old-value records generated by the transaction. These records could be preserved by the library at each subordinate until the outcome of the two-phase commit is clear. On a global commit, the records would be discarded. On a global abort, the library at each subordinate could use the saved records to construct a compensating RVM transaction.

RVM can also be used as the basis of runtime systems for languages that support persistence. Experience with Avalon [38], which was built on Camelot, confirms that recoverable virtual memory is indeed an appropriate abstraction for implementing language-based local persistence. Language support would alleviate the problem mentioned in Section 6 of programmers forgetting to issue set_range calls: compiler-generated code could issue these calls transparently. An approximation to a language-based solution would be to use a post-compilation augmentation phase to test for accesses to mapped RVM regions and to generate set_range calls.

Further evidence of the versatility of RVM is provided by the recent work of O'Toole et al [25]. In this work, RVM segments are used as the stable to-space and from-space of the heap for a language that supports concurrent garbage collection of persistent data. While the authors suggest some improvements to RVM for this application, their work establishes the suitability of RVM for a very different context from the one that motivated it.

9. Related Work

The field of transaction processing is enormous. In the space available, it is impossible to fully attribute all the past work that has indirectly influenced RVM. We therefore restrict our discussion here to placing RVM's contribution in proper perspective, and to clarifying its relationship to its closest relatives.

Since the original identification of transactional properties and techniques for their realization [13, 18], attention has been focused on three areas. One area has been the enrichment of the transactional concept along dimensions such as distribution, nesting [23], and longevity [11]. A second area has been the incorporation of support for transactions into languages [21], operating systems [15], and hardware [6]. A third area has been the development of techniques for achieving high performance in OLTP environments with very large data volumes and poor locality [12].

In contrast to those efforts, RVM represents a "back to basics" movement. Rather than embellishing the transactional abstraction or its implementation, RVM seeks to simplify both. It poses and answers the question "What is the simplest realization of essential transactional properties for the average application?" By doing so, it makes transactions accessible to applications that have hitherto balked at the baggage that comes with sophisticated transactional facilities.

The virtues of simplicity for small databases have been extolled previously by Birrell et al [5]. Their design is even simpler than RVM's, and is based upon new-value logging and full-database checkpointing. Each transaction is constrained to update only a single data item. There is no support for explicit transaction abort. Updates are recorded in a log file on disk, then reflected in the in-memory database image. Periodically, the entire memory image is checkpointed to disk, the log file deleted, and the new checkpoint file renamed to be the current version of the database. Log truncation occurs only during crash recovery, not during normal operation.

The reliance of Birrell et al's technique on full-database checkpointing makes the technique practical only for applications which manage small amounts of recoverable data and which have moderate update rates. The absence of support for multi-item updates and for explicit abort further limits its domain of use. RVM is more versatile without being substantially more complex.

Transaction processing monitors (TPMs), such as Encina [35, 40] and Tuxedo [1, 36], are important commercial products. TPMs add distribution and support services to OLTP back-ends, and integrate heterogeneous systems. Like centralized database managers, TPM back-ends are usually monolithic in structure. They encapsulate all three of the basic transactional properties and provide data access via a query language interface. This is in contrast to RVM, which supports only atomicity and the process failure aspect of permanence, and which provides access to recoverable data as mapped virtual memory.

A more modular approach is used in the Transarc TP toolkit, which is the back-end for the Encina TPM. The functionality provided by RVM corresponds primarily to the recovery, logging, and physical storage modules of the Transarc toolkit. RVM differs from the corresponding Transarc toolkit components in two important ways. First, RVM is structured entirely as a library that is linked with applications, while some of the toolkit's modules are separate processes. Second, recoverable storage is accessed as mapped memory in RVM, whereas the Transarc toolkit offers access via the conventional buffered I/O model.
Acknowledgements
Marvin
designers
and
I/O model.
Theimer
discussions
and
Lily
system.
et al have recently
enhance the Mach
memory [7],
providing
kernel
Their
efforts
recoverable
carries
idea
of
MttrnmerL
memory
a
[1]
system
support,
thereby
debt
to Camelot
should
be obvious
Camelot taught us the value of recoverable
and showed us the merits and pitfalls
RVM
operating
system
has restrained
virtual
support
generality
[3]
memory
was willing
to achieve
within
limits
us
to thank
especially
understand
shepherd,
Peter
and
Bill
the
Stout
use
Weihl,
their
helped
significantly.
Andrsde, J.M., Carges, M. T., Kovach, K.R.
Buifding a Transaction Processing System on UNIX
Conference
Proceedings.
Baron, R.V., Black, D. L., Bolosky, W., Chew, J., Golub, D. B.,
Rashid, R. F., Tevarrisn, Jr, A., Young, M.W.
[4]
generality,
University,
Bernstein, P. A., Hadzilacos, V., Goodman, N.
Concumency Controt and Recovery in Data&zre
Addison
to
Wesley,
that preserve
[5]
Bershad, B.N., Anderson,
Birrell,
1987.
Systems,
1987.
T. E., Lar.owska,
E. D., Levy, H.M.
Lightweight
Remote Procedure CZU.
ACM Transaction
on ComptUer Systems 8(l),
operating system independence.
Systems.
San Francisco, CA,
Mach Kernel Interface Manual
School of Ccrrrsptrter Science, Carnegie Metlon
now.
of a specific approach
Whereas Camelot
to its implementation.
require
by
helping
in the early
wish
Febrtsaty, 19g9.
[2]
enhancing portability.
RVM’S
CameloL
of our SOSP
the presentation
In UniForurn
In contrast, RVM avoids the
operating
for
of
We
References
since their support is in the kernel rather than
specialized
implementors
pmticipated
of RVM.
to
virtual
Camelot’s
support for recoverable
in a user-level Disk Manager.
need for
on their
to support
work
system-level
step further,
reported
Hagmarm
to the design
The comments
us improve
Chew
rmd Robert
leading
A. B., Jcmes, M. B., Wobixr,
Febmary,
1990.
E.P.
A Simple and Efficient Implementation
for Small Databases.
In Proceedings of ihe Elevenlh ACM Symposium on Operating
10. Conclusion
In general, RVM
encountered
System Principles.
has proved to be useful wherever we have
a need to maintain persistent data structures with clean failure semantics. The only constraints upon its use have been the need for the size of the data structures to be a small fraction of disk capacity, and for the working set size of accesses to them to be significantly less than main memory.

The term “lightweight” in the title of this paper connotes two distinct qualities. First, it implies ease of learning and use. Second, it signifies minimal impact upon system resource usage. RVM is indeed lightweight along both these dimensions. A Unix programmer thinks of RVM in essentially the same way he thinks of a typical subroutine library, such as the stdio package.

While the importance of the transactional abstraction has been known for many years, its use in low-end applications has been hampered by the lack of a lightweight implementation. Our hope is that RVM will remedy this situation. While integration with the operating system may be unavoidable for very demanding applications, it can be a double-edged sword, as this paper has shown. For a broad class of less demanding applications, we believe that RVM represents close to the limit of what is attainable without hardware or operating system support.