Disconnected Operation in the Coda File System

JAMES J. KISTLER and M. SATYANARAYANAN
Carnegie Mellon University

Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository. An important, though not exclusive, application of disconnected operation is in supporting portable computers. In this paper, we show that disconnected operation is feasible, efficient and usable by describing its design and implementation in the Coda File System. The central idea behind our work is that caching of data, now widely used for performance, can also be exploited to improve availability.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management—distributed file systems; D.4.5 [Operating Systems]: Reliability—fault tolerance; D.4.8 [Operating Systems]: Performance—measurements

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Disconnected operation, hoarding, optimistic replication, reintegration, second-class replication, server emulation
1. INTRODUCTION
Every serious user of a distributed system has faced situations where critical work has been impeded by a remote failure. His frustration is particularly acute when his workstation is powerful enough to be used standalone, but has been configured to be dependent on remote resources. An important instance of such dependence is the use of data in a distributed file system. Placing data in a distributed file system simplifies collaboration between users, and allows them to delegate the administration of that data. The growing popularity of distributed file systems such as NFS [16] and AFS [19] attests to the compelling nature of these considerations. Unfortunately, the users of these systems have to accept the fact that a remote failure at a critical juncture may seriously inconvenience them.
This work was supported by the Defense Advanced Research Projects Agency (Avionics Lab, Wright Research and Development Center, Aeronautical Systems Division (AFSC), U.S. Air Force, Wright-Patterson AFB, Ohio, 45433-6543 under Contract F33615-90-C-1465, ARPA Order 7597), National Science Foundation (PYI Award and Grant ECD 8907068), IBM Corporation (Faculty Development Award, Graduate Fellowship, and Research Initiation Grant), Digital Equipment Corporation (External Research Project Grant), and Bellcore (Information Networking Research Grant).

Authors' address: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

(c) 1992 ACM 0734-2071/92/0200-0003 $01.50

ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, Pages 3-25.
How can we improve this state of affairs? Ideally, we would like to enjoy the benefits of a shared data repository, but be able to continue critical work when that repository is inaccessible. We call the latter mode of operation disconnected operation, because it represents a temporary deviation from normal operation as a client of a shared repository.

In this paper we show that disconnected operation in a file system is indeed feasible, efficient and usable. The central idea behind our work is that caching of data, now widely used to improve performance, can also be exploited to enhance availability. We have implemented disconnected operation in the Coda File System at Carnegie Mellon University.

Our initial experience with Coda confirms the viability of disconnected operation. We have successfully operated disconnected for periods lasting one to two days. For a disconnection of this duration, the process of reconnecting and propagating changes typically takes about a minute. A local disk of 100MB has been adequate for us during these periods of disconnection. Trace-driven simulations indicate that a disk of about half that size should be adequate for disconnections lasting a typical workday.
2. DESIGN OVERVIEW

Coda is designed for an environment consisting of a large collection of untrusted Unix(1) clients and a much smaller number of trusted Unix file servers. The design is optimized for the access and sharing patterns typical of academic and research environments. It is specifically not intended for applications that exhibit highly concurrent, fine granularity data access.

(1) Unix is a trademark of AT&T Bell Telephone Labs.

Each Coda client has a local disk and can communicate with the servers over a high bandwidth network. At certain times, a client may be temporarily unable to communicate with some or all of the servers. This may be due to a server or network failure, or due to the detachment of a portable client from the network.

Clients view Coda as a single, location-transparent shared Unix file system. The Coda namespace is mapped to individual file servers at the granularity of subtrees called volumes. At each client, a cache manager (Venus) dynamically obtains and caches volume mappings.

Coda uses two distinct, but complementary, mechanisms to achieve high availability. The first mechanism, server replication, allows volumes to have read-write replicas at more than one server. The set of replication sites for a volume is its volume storage group (VSG). The subset of a VSG that is currently accessible is a client's accessible VSG (AVSG). The performance cost of server replication is kept low by caching on disks at clients and through the use of parallel access protocols. Venus uses a cache coherence protocol based on callbacks [9] to guarantee that an open file yields its latest copy in the AVSG. This guarantee is provided by servers notifying clients when their cached copies are no longer valid, each notification being referred to as a 'callback break'. Modifications in Coda are propagated in parallel to all AVSG sites, and eventually to missing VSG sites.
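As a concrete illustration of the volume bookkeeping just described, the C sketch below models a VSG, its currently accessible subset, and the callback-break notification. The types and function names are our own invention for this exposition under stated assumptions; they are not Coda's actual data structures.

    /* Hypothetical sketch of per-volume bookkeeping; not Coda source code. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_VSG 8

    struct volume {
        const char *name;
        int    vsg[MAX_VSG];   /* server ids in the volume storage group (VSG) */
        bool   up[MAX_VSG];    /* which VSG members are currently reachable    */
        size_t vsg_size;
    };

    /* The accessible VSG (AVSG) is simply the reachable subset of the VSG. */
    static size_t avsg_size(const struct volume *v)
    {
        size_t n = 0;
        for (size_t i = 0; i < v->vsg_size; i++)
            if (v->up[i])
                n++;
        return n;
    }

    /* Disconnected operation takes effect when the AVSG becomes empty. */
    static bool is_disconnected(const struct volume *v)
    {
        return avsg_size(v) == 0;
    }

    /* A callback break tells the client its cached copy may be stale. */
    static void on_callback_break(const char *cached_object)
    {
        printf("callback break: %s no longer guaranteed current\n", cached_object);
    }

    int main(void)
    {
        struct volume vol = { "u.jjk", {1, 2, 3}, {true, false, true}, 3 };
        printf("AVSG size = %zu, disconnected = %d\n",
               avsg_size(&vol), is_disconnected(&vol));
        on_callback_break("/coda/usr/jjk/papers/sosp/draft.tex");
        return 0;
    }

Note that the AVSG is purely client-side bookkeeping: each client tracks reachability independently, which is why different clients can simultaneously be connected and disconnected with respect to the same volume.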
Disconnected operation, the second high availability mechanism used by Coda, takes effect when the AVSG becomes empty. While disconnected, Venus services file system requests by relying solely on the contents of its cache. Since cache misses cannot be serviced or masked, they appear as failures to application programs and users. When disconnection ends, Venus propagates modifications and reverts to server replication. Figure 1 depicts a typical scenario involving transitions between server replication and disconnected operation.

[Figure 1, showing a typical scenario of transitions between server replication and disconnected operation, appears here; the figure itself is not reproduced.]

Earlier Coda papers [18, 19] have described server replication in depth. In contrast, this paper restricts its attention to disconnected operation. We discuss server replication only in those areas where its presence has significantly influenced our design for disconnected operation.

3. DESIGN RATIONALE

At a high level, two factors influenced our strategy for high availability. First, we wanted to use conventional, off-the-shelf hardware throughout our system. Second, we wished to preserve transparency by seamlessly integrating the high availability mechanisms of Coda into a normal Unix environment.

At a more detailed level, other considerations influenced our design. These include the need to scale gracefully, the advent of portable workstations, the very different resource, integrity, and security assumptions made about clients and servers, and the need to strike a balance between availability and consistency. We examine each of these issues in the following sections.
3.1 Scalability

Successful distributed systems tend to grow in size. Our experience with Coda's ancestor, AFS, had impressed upon us the need to prepare for growth a priori, rather than treating it as an afterthought [17]. We brought this experience to bear upon Coda in two ways. First, we adopted certain mechanisms that enhance scalability. Second, we drew upon a set of general principles to guide our design choices.

An example of a mechanism we adopted for scalability is callback-based cache coherence. Another such mechanism, whole-file caching, offers the added advantage of a much simpler failure model: a cache miss can only occur on an open, never on a read, write, seek, or close. This, in turn, substantially simplifies the implementation of disconnected operation. A partial-file caching scheme such as that of AFS-4 [22], Echo [8] or MFS [1] would have complicated our implementation and made disconnected operation less transparent.

A scalability principle that has had considerable influence on our design is the placing of functionality on clients rather than on servers. Only if integrity or security would have been compromised have we violated this principle. Another scalability principle we have adopted is the avoidance of system-wide rapid change. Consequently, we have rejected strategies that require election or agreement by large numbers of nodes. For example, we have avoided algorithms such as that used in Locus [23] that depend on nodes achieving consensus on the current partition state of the network.
3.2 Portable Workstations

Powerful, lightweight and compact laptop computers are commonplace today. It is instructive to observe how a person with data in a shared file system uses such a machine. Typically, he identifies files of interest and downloads them from the shared file system into the local name space for use while isolated. When he returns, he copies modified files back into the shared file system. Such a user is effectively performing manual caching, with write-back upon reconnection!

Early in the design of Coda we realized that disconnected operation could substantially simplify the use of portable clients. Users would not have to use a different name space while isolated, nor would they have to manually propagate changes upon reconnection. Thus portable machines are a champion application for disconnected operation.

The use of portable machines also gave us another insight. The fact that people are able to operate for extended periods in isolation indicates that they are quite good at predicting their future file access needs. This, in turn, suggests that it is reasonable to seek user assistance in augmenting the cache management policy for disconnected operation.

Functionally, involuntary disconnections caused by failures are no different from voluntary disconnections caused by unplugging portable computers. Hence Coda provides a single mechanism to cope with all disconnections. Of course, there may be qualitative differences: user expectations as well as the extent of user cooperation are likely to be different in the two cases.

3.3 First- vs. Second-Class Replication

If disconnected operation is feasible, why is server replication needed at all? The answer to this question depends critically on the very different assumptions made about clients and servers in Coda.

Clients are like appliances: they can be turned off at will and may be unattended for long periods of time. They have limited disk storage capacity, their software and hardware may be tampered with, and their owners may not be diligent about backing up the local disks. Servers are like public utilities: they have much greater disk capacity, they are physically secure, and they are carefully monitored and administered by professional staff.

It is therefore appropriate to distinguish between first-class replicas on servers, and second-class replicas (i.e., cache copies) on clients. First-class replicas are of higher quality: they are more persistent, widely known, secure, available, complete and accurate. Second-class replicas, in contrast, are inferior along all these dimensions. Only by periodic revalidation with respect to a first-class replica can a second-class replica be useful.

The function of a cache coherence protocol is to combine the performance and scalability advantages of a second-class replica with the quality of a first-class replica. When disconnected, the quality of the second-class replica may be degraded because the first-class replica upon which it is contingent is inaccessible. The longer the duration of disconnection, the greater the potential for degradation. Whereas server replication preserves the quality of data in the face of failures, disconnected operation forsakes quality for availability. Hence server replication is important because it reduces the frequency and duration of disconnected operation, which is properly viewed as a measure of last resort.

Server replication is expensive because it requires additional hardware. Disconnected operation, in contrast, costs little. Whether to use server replication or not is thus a trade-off between quality and cost. Coda does permit a volume to have a sole server replica. Therefore, an installation can rely exclusively on disconnected operation if it so chooses.
3.4 Optimistic vs. Pessimistic Replica Control

By definition, a network partition exists between a disconnected second-class replica and all its first-class associates. The choice between two families of replica control strategies, pessimistic and optimistic [5], is therefore central to the design of disconnected operation. A pessimistic strategy avoids conflicting operations by disallowing all partitioned writes or by restricting reads and writes to a single partition. An optimistic strategy provides much higher availability by permitting reads and writes everywhere, and deals with the attendant danger of conflicts by detecting and resolving them after their occurrence.

A pessimistic approach towards disconnected operation would require a client to acquire shared or exclusive control of a cached object prior to disconnection, and to retain such control until reconnection. Possession of exclusive control by a disconnected client would preclude reading or writing at all other replicas. Possession of shared control would allow reading at other replicas, but writes would still be forbidden everywhere.

Acquiring control prior to voluntary disconnection is relatively simple. It is more difficult when the disconnection is involuntary, because the system may have to arbitrate among multiple requesters. Unfortunately, the information needed to make a wise decision is not readily available. For example, the system cannot predict which requesters would actually use the object, when they would release control, or what the relative costs of denying them access would be.

Retaining control until reconnection is acceptable in the case of brief disconnections. But it is unacceptable in the case of extended disconnections. A disconnected client with shared control of an object would force the rest of the system to defer all updates until it reconnected. With exclusive control, it would even prevent other users from making a copy of the object. Coercing the client to reconnect may not be feasible, since its whereabouts may not be known. Thus, an entire user community could be at the mercy of a single errant client for an unbounded amount of time.

Placing a time bound on exclusive or shared control, as done in the case of leases [7], avoids this problem but introduces others. Once a lease expires, a disconnected client loses the ability to access a cached object, even if no one else in the system is interested in it. This, in turn, defeats the purpose of disconnected operation which is to provide high availability. Worse, updates already made while disconnected have to be discarded.

An optimistic approach has its own disadvantages. An update made at one disconnected client may conflict with an update at another disconnected or connected client. For optimistic replication to be viable, the system has to be more sophisticated. There needs to be machinery in the system for detecting conflicts, for automating resolution when possible, and for confining damage and preserving evidence for manual repair. Having to repair conflicts violates transparency, is an annoyance to users, and reduces the usability of the system.

We chose optimistic replication because we felt that its strengths and weaknesses better matched our design goals. The dominant influence on our choice was the low degree of write-sharing typical of Unix. This implied that an optimistic strategy was likely to lead to relatively few conflicts. An optimistic strategy was also consistent with our overall goal of providing the highest possible availability of data.

In principle, we could have chosen a pessimistic strategy for server replication even after choosing an optimistic strategy for disconnected operation. But that would have reduced transparency, because a user would have faced the anomaly of being able to update data when disconnected, but being unable to do so when connected to a subset of the servers. Further, many of the previous arguments in favor of an optimistic strategy also apply to server replication.

Using an optimistic strategy throughout presents a uniform model of the system from the user's perspective. At any time, he is able to read the latest data in his accessible universe, and his updates are immediately visible to everyone else in that universe. His accessible universe is usually the entire set of servers and clients. When failures occur, his accessible universe shrinks to the set of servers he can contact, and the set of clients that they, in turn, can contact. In the limit, when he is operating disconnected, his accessible universe consists of just his machine. Upon reconnection, his updates become visible throughout his now-enlarged accessible universe.
4. DETAILED DESIGN AND IMPLEMENTATION

In describing our implementation of disconnected operation, we focus on the client since this is where much of the complexity lies. Section 4.1 describes the physical structure of a client, Section 4.2 introduces the major states of Venus, and Sections 4.3 to 4.5 discuss these states in detail. A description of the server support needed for disconnected operation is contained in Section 4.5.
Fig. 2. Structure of a Coda client. [The figure itself is not reproduced.]

4.1 Client Structure

Because of the complexity of Venus, we made it a user-level process rather than part of the kernel. The latter approach may have yielded better performance, but would have been less portable and considerably more difficult to debug. Figure 2 illustrates the high-level structure of a Coda client.

Venus intercepts Unix file system calls via the widely used Sun Vnode interface [10]. Since this interface imposes a heavy performance overhead on user-level cache managers, we use a tiny in-kernel MiniCache to filter out many kernel-Venus interactions. The MiniCache contains no support for remote access, disconnected operation or server replication; these functions are handled entirely by Venus.

A system call on a Coda object is forwarded by the Vnode interface to the MiniCache. If possible, the call is serviced by the MiniCache and control is returned to the application. Otherwise, the MiniCache contacts Venus to service the call. This, in turn, may involve contacting Coda servers. Control returns from Venus via the MiniCache to the application program, updating MiniCache state as a side effect. MiniCache state changes may also be initiated by Venus on events such as callback breaks from Coda servers. Measurements from our implementation confirm that the MiniCache is critical for good performance [21].
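A minimal sketch of this request routing is given below, assuming a hypothetical coda_request type; the real kernel-Venus interface is considerably richer than this, and all names here are illustrative only.

    /* Illustrative sketch of MiniCache filtering; not the real kernel interface. */
    #include <stdbool.h>
    #include <stdio.h>

    enum handler { BY_MINICACHE, BY_VENUS };

    struct coda_request {
        const char *path;
        bool cached_in_minicache;   /* can the in-kernel MiniCache answer it? */
    };

    /* A system call on a Coda object is forwarded via the Vnode interface to
     * the MiniCache; only if the MiniCache cannot satisfy it is the user-level
     * Venus contacted, which in turn may contact Coda servers.               */
    static enum handler route(const struct coda_request *req)
    {
        if (req->cached_in_minicache)
            return BY_MINICACHE;        /* fast path, no user-level upcall */
        return BY_VENUS;                /* upcall to Venus                 */
    }

    int main(void)
    {
        struct coda_request r1 = { "/coda/usr/jjk/notes", true };
        struct coda_request r2 = { "/coda/project/coda/lib/libutil.a", false };
        printf("%s -> %s\n", r1.path, route(&r1) == BY_MINICACHE ? "MiniCache" : "Venus");
        printf("%s -> %s\n", r2.path, route(&r2) == BY_MINICACHE ? "MiniCache" : "Venus");
        return 0;
    }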
4.2 Venus States

Logically, Venus operates in one of three states: hoarding, emulation, and reintegration. Figure 3 depicts these states and the transitions between them. Venus is normally in the hoarding state, relying on server replication but always on the alert for possible disconnection. Upon disconnection, it enters the emulation state and remains there for the duration of disconnection. Upon reconnection, Venus enters the reintegration state, resynchronizes its cache with its AVSG, and then reverts to the hoarding state. Since all volumes may not be replicated across the same set of servers, Venus can be in different states with respect to different volumes, depending on failure conditions in the system.

Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It transits to reintegration upon successful reconnection to an AVSG member, and thence to hoarding, where it resumes connected operation. [The figure itself is not reproduced.]
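The three states and their transitions can be summarized by a small state machine. The sketch below is our condensation of Figure 3, with invented names; it is not Venus source code.

    /* Per-volume Venus state machine, as described above (illustrative). */
    #include <stdio.h>

    enum venus_state { HOARDING, EMULATION, REINTEGRATION };

    /* Disconnection and reconnection are the only events that move a volume
     * between states; reintegration completes back into hoarding.          */
    static enum venus_state next_state(enum venus_state s, int avsg_empty,
                                       int reintegration_done)
    {
        switch (s) {
        case HOARDING:      return avsg_empty ? EMULATION : HOARDING;
        case EMULATION:     return avsg_empty ? EMULATION : REINTEGRATION;
        case REINTEGRATION: return reintegration_done ? HOARDING : REINTEGRATION;
        }
        return s;
    }

    int main(void)
    {
        enum venus_state s = HOARDING;
        s = next_state(s, 1, 0);   /* disconnection: hoarding -> emulation      */
        s = next_state(s, 0, 0);   /* reconnection:  emulation -> reintegration */
        s = next_state(s, 0, 1);   /* replay done:   reintegration -> hoarding  */
        printf("final state = %d (0 = hoarding)\n", s);
        return 0;
    }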
4.3 Hoarding

The hoarding state is so named because a key responsibility of Venus in this state is to hoard useful data in anticipation of disconnection. However, this is not its only responsibility. Rather, Venus must manage its cache in a manner that balances the needs of connected and disconnected operation. For instance, a user may have indicated that a certain set of files is critical but may currently be using other files. To provide good performance, Venus must cache the latter files. But to be prepared for disconnection, it must also cache the former set of files.

Many factors complicate the implementation of hoarding:

—File reference behavior, especially in the distant future, cannot be predicted with certainty.
—Disconnections and reconnections are often unpredictable.
—The true cost of a cache miss while disconnected is highly variable and hard to quantify.
—Activity at other clients must be accounted for, so that the latest version of an object is in the cache at disconnection.
—Since cache space is finite, the availability of less critical objects may have to be sacrificed in favor of more critical objects.

To address these concerns, we manage the cache using a prioritized algorithm, and periodically reevaluate which objects merit retention in the cache via a process known as hoard walking.
    # Personal files
    a /coda/usr/jjk d+
    a /coda/usr/jjk/papers 100:d+
    a /coda/usr/jjk/papers/sosp 1000:d+

    # System files
    a /usr/bin 100:d+
    a /usr/etc 100:d+
    a /usr/include 100:d+
    a /usr/lib 100:d+
    a /usr/local/gnu d+
    a /usr/local/rcs d+
    a /usr/ucb d+
    (a)

    # X11 files (from X11 maintainer)
    a /usr/X11/bin/X
    a /usr/X11/bin/Xvga
    a /usr/X11/bin/mwm
    a /usr/X11/bin/startx
    a /usr/X11/bin/xclock
    a /usr/X11/bin/xinit
    a /usr/X11/bin/xterm
    a /usr/X11/include/X11/bitmaps c+
    a /usr/X11/lib/app-defaults d+
    a /usr/X11/lib/fonts/misc c+
    a /usr/X11/lib/system.mwmrc
    (b)

    # Venus source files (shared among Coda developers)
    a /coda/project/coda/src/venus 100:c+
    a /coda/project/coda/include 100:c+
    a /coda/project/coda/lib c+
    (c)

Fig. 4. Sample hoard profiles. These are typical hoard profiles provided by a Coda user, an application maintainer, and a group of project developers. Each profile is interpreted separately by the HDB front-end program. The 'a' at the beginning of a line indicates an add-entry command. Other commands are delete an entry, clear all entries, and list entries. The modifiers following some pathnames specify nondefault priorities (the default is 10) and/or meta-expansion for the entry. Note that the pathnames beginning with '/usr' are actually symbolic links into '/coda'.
4.3.1 Prioritized Cache Management. Venus combines implicit and explicit sources of information in its priority-based cache management algorithm. The implicit information consists of recent reference history, as in traditional caching algorithms. Explicit information takes the form of a per-workstation hoard database (HDB), whose entries are pathnames identifying objects of interest to the user at the workstation.

A simple front-end program allows a user to update the HDB using command scripts called hoard profiles, such as those shown in Figure 4. Since hoard profiles are just files, it is simple for an application maintainer to provide a common profile for his users, or for users collaborating on a project to maintain a common profile. A user can customize his HDB by specifying different combinations of profiles or by executing front-end commands interactively. To facilitate construction of hoard profiles, Venus can record all file references observed between a pair of start and stop events indicated by a user.

To reduce the verbosity of hoard profiles and the effort needed to maintain them, Venus supports meta-expansion of HDB entries. As shown in Figure 4, if the letter 'c' (or 'd') follows a pathname, the command also applies to immediate children (or all descendants). A '+' following the 'c' or 'd' indicates that the command applies to all future as well as present children or descendants. A hoard entry may optionally indicate a hoard priority, with higher priorities indicating more critical objects.

The current priority of a cached object is a function of its hoard priority as well as a metric representing recent usage. The latter is updated continuously in response to new references, and serves to age the priority of objects no longer in the working set. Objects of the lowest priority are chosen as victims when cache space has to be reclaimed.

To resolve the pathname of a cached object while disconnected, it is imperative that all the ancestors of the object also be cached. Venus must therefore ensure that a cached directory is not purged before any of its descendants. This hierarchical cache management is not needed in traditional file caching schemes because cache misses during name translation can be serviced, albeit at a performance cost. Venus performs hierarchical cache management by assigning infinite priority to directories with cached children. This automatically forces replacement to occur bottom-up.
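The following sketch illustrates one way the priority computation and the infinite-priority rule for directories could be combined. The weighting and aging constants are invented for illustration, since the paper does not give Venus' exact formula.

    /* Illustrative current-priority computation; not Venus' actual formula. */
    #include <limits.h>
    #include <stdio.h>

    struct cache_obj {
        int hoard_priority;    /* from the HDB; 0 if the object is not hoarded */
        int recent_usage;      /* reference metric that decays as objects age  */
        int is_directory;
        int cached_children;   /* number of cached descendants                 */
    };

    static int current_priority(const struct cache_obj *o)
    {
        /* Directories with cached children are never eviction victims, so
         * replacement proceeds bottom-up.                                    */
        if (o->is_directory && o->cached_children > 0)
            return INT_MAX;
        /* Otherwise combine explicit (hoard) and implicit (usage) information;
         * the factor of 100 is an arbitrary illustrative weighting.          */
        return o->hoard_priority * 100 + o->recent_usage;
    }

    int main(void)
    {
        struct cache_obj paper   = { 1000, 3,   0, 0 };
        struct cache_obj scratch = { 0,    250, 0, 0 };
        struct cache_obj dir     = { 10,   0,   1, 4 };
        printf("paper=%d scratch=%d dir=%d\n",
               current_priority(&paper), current_priority(&scratch),
               current_priority(&dir));
        return 0;
    }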
4.3.2 Hoard Walking. We say that a cache is in equilibrium, signifying that it meets user expectations about availability, when no uncached object has a higher priority than a cached object. Equilibrium may be disturbed as a result of normal cache activity. For example, suppose an object, A, is brought into the cache on demand, replacing an object, B. Further suppose that B is mentioned in the HDB, but A is not. Some time after activity on A ceases, its priority will decay below the hoard priority of B. The cache is no longer in equilibrium, since the cached object A has lower priority than the uncached object B.

Venus periodically restores equilibrium by performing an operation known as a hoard walk. A hoard walk occurs every 10 minutes in our current implementation, but one may be explicitly required by a user prior to voluntary disconnection. The walk occurs in two phases. First, the name bindings of HDB entries are reevaluated to reflect update activity by other Coda clients. For example, new children may have been created in a directory whose pathname is specified with the '+' option in the HDB. Second, the priorities of all entries in the cache and HDB are reevaluated, and objects fetched or evicted as needed to restore equilibrium.

Hoard walks also address a problem arising from callback breaks. In traditional callback-based caching, data is refetched only on demand after a callback break. But in Coda, such a strategy may result in a critical object being unavailable should a disconnection occur before the next reference to it. Refetching immediately upon callback break avoids this problem, but ignores a key characteristic of Unix environments: once an object is modified, it is likely to be modified many more times by the same user within a short interval [15, 6]. An immediate refetch policy would increase client-server traffic considerably, thereby reducing scalability.

Our strategy is a compromise that balances availability, consistency and scalability. For files and symbolic links, Venus purges the object on callback break, and refetches it on demand or during the next hoard walk, whichever occurs earlier. If a disconnection were to occur before refetching, the object would be unavailable. For directories, Venus does not purge on callback break, but marks the cache entry suspicious. A stale cache entry is thus available should a disconnection occur before the next hoard walk or reference. The acceptability of stale directory data follows from its particular callback semantics. A callback break on a directory typically means that an entry has been added to or deleted from the directory. It is often the case that other directory entries and the objects they name are unchanged. Therefore, saving the stale copy and using it in the event of untimely disconnection causes consistency to suffer only a little, but increases availability considerably.
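The equilibrium test that a hoard walk restores can be stated compactly in code. The sketch below uses hypothetical arrays of priorities and captures only the rule itself; the actual walk also re-resolves HDB name bindings, as described above.

    /* Equilibrium check behind a hoard walk (illustrative data layout). */
    #include <stdbool.h>
    #include <stdio.h>

    /* Equilibrium: no uncached (but hoardable) object outranks any cached one. */
    static bool in_equilibrium(const int *cached, int ncached,
                               const int *uncached, int nuncached)
    {
        int min_cached = cached[0], max_uncached = uncached[0];
        for (int i = 1; i < ncached; i++)
            if (cached[i] < min_cached) min_cached = cached[i];
        for (int i = 1; i < nuncached; i++)
            if (uncached[i] > max_uncached) max_uncached = uncached[i];
        return max_uncached <= min_cached;
    }

    int main(void)
    {
        /* Object A (priority 20) was faulted in, displacing hoarded object B (100). */
        int cached[]   = { 20, 500, 900 };
        int uncached[] = { 100 };
        if (!in_equilibrium(cached, 3, uncached, 1))
            printf("hoard walk needed: fetch B, evict A\n");
        return 0;
    }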
4.4 Emulation

In the emulation state, Venus performs many actions normally handled by servers. For example, Venus now assumes full responsibility for access and semantic checks. It is also responsible for generating temporary file identifiers (fids) for new objects, pending the assignment of permanent fids at reintegration. But although Venus is functioning as a pseudo-server, updates accepted by it have to be revalidated with respect to integrity and protection by real servers. This follows from the Coda policy of trusting only servers, not clients. To minimize unpleasant delayed surprises for a disconnected user, it behooves Venus to be as faithful as possible in its emulation.

Cache management during emulation is done with the same priority algorithm used during hoarding. Mutating operations directly update the cache entries of the objects involved. Cache entries of deleted objects are freed immediately, but those of other modified objects assume infinite priority so that they are not purged before reintegration. On a cache miss, the default behavior of Venus is to return an error code. A user may optionally request Venus to block his processes until cache misses can be serviced.
4.4.1 Logging. During emulation, Venus records sufficient information to replay update activity when it reintegrates. It maintains this information in a per-volume log of mutating operations called a replay log. Each log entry contains a copy of the corresponding system call arguments as well as the version state of all objects referenced by the call.

Venus uses a number of optimizations to reduce the length of the replay log, resulting in a log size that is typically a few percent of cache size. A small log conserves disk space, a critical resource during periods of disconnection. It also improves reintegration performance by reducing latency and server load.

One important optimization to reduce log length pertains to write operations on files. Since Coda uses whole-file caching, the close after an open of a file for modification installs a completely new copy of the file. Rather than logging the open, close, and intervening write operations individually, Venus logs a single store record during the handling of a close.

Another optimization consists of Venus discarding a previous store record for a file when a new one is appended to the log. This follows from the fact that a store renders all previous versions of a file superfluous. The store record does not contain a copy of the file's contents, but merely points to the copy in the cache.

We are currently implementing two further optimizations to reduce the length of the replay log. The first generalizes the optimization described in the previous paragraph such that any operation which overwrites the effect of earlier operations may cancel the corresponding log records. An example would be the canceling of a store by a subsequent unlink or truncate. The second optimization exploits knowledge of inverse operations to cancel both the inverting and inverted log records. For example, a rmdir may cancel its own log record as well as that of the corresponding mkdir.
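The store-record optimization can be illustrated with a toy replay log, sketched below. The record layout is invented and far simpler than Venus' real log entries, which also carry system call arguments and version state.

    /* Toy replay log illustrating store-record cancellation (not Venus code). */
    #include <stdio.h>
    #include <string.h>

    enum op { OP_STORE, OP_MKDIR, OP_RMDIR, OP_UNLINK };

    struct replay_rec {
        enum op op;
        char    path[64];
        int     live;                /* cancelled records are skipped at replay */
    };

    /* A new store of a file supersedes any earlier store of the same file,
     * since the record only points at the (single) cached copy.             */
    static void append_store(struct replay_rec *log, int *n, const char *path)
    {
        for (int i = 0; i < *n; i++)
            if (log[i].live && log[i].op == OP_STORE &&
                strcmp(log[i].path, path) == 0)
                log[i].live = 0;                      /* cancel the older store */
        log[*n].op = OP_STORE;
        strncpy(log[*n].path, path, sizeof log[*n].path - 1);
        log[*n].path[sizeof log[*n].path - 1] = '\0';
        log[*n].live = 1;
        (*n)++;
    }

    int main(void)
    {
        struct replay_rec log[16];
        int n = 0;
        append_store(log, &n, "/coda/usr/jjk/paper.tex");
        append_store(log, &n, "/coda/usr/jjk/paper.tex");   /* supersedes first */
        for (int i = 0; i < n; i++)
            printf("rec %d: %s %s\n", i, log[i].path,
                   log[i].live ? "live" : "cancelled");
        return 0;
    }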
4.4.2 Persistence. A disconnected user must be able to restart his machine after a shutdown and continue where he left off. In case of a crash, the amount of data lost should be no greater than if the same failure occurred during connected operation. To provide these guarantees, Venus must keep its cache and related data structures in nonvolatile storage.

Meta-data, consisting of cached directory and symbolic link contents, status blocks for cached objects of all types, replay logs, and the HDB, is mapped into Venus' address space as recoverable virtual memory (RVM). Transactional access to this memory is supported by the RVM library [13] linked into Venus. The actual contents of cached files are not in RVM, but are stored as local Unix files.

The use of transactions to manipulate meta-data simplifies Venus' job enormously. To maintain its invariants Venus need only ensure that each transaction takes meta-data from one consistent state to another. It need not be concerned with crash recovery, since RVM handles this transparently. If we had chosen the obvious alternative of placing meta-data in local Unix files, we would have had to follow a strict discipline of carefully timed synchronous writes and an ad-hoc recovery algorithm.

RVM supports local, non-nested transactions and allows independent control over the basic transactional properties of atomicity, permanence, and serializability. An application can reduce commit latency by labelling the commit as no-flush, thereby avoiding a synchronous write to disk. To ensure persistence of no-flush transactions, the application must explicitly flush RVM's write-ahead log from time to time. When used in this manner, RVM provides bounded persistence, where the bound is the period between log flushes.

Venus exploits the capabilities of RVM to provide good performance at a constant level of persistence. When hoarding, Venus initiates log flushes infrequently, since a copy of the data is available on servers. Being more conservative when emulating, Venus flushes the log more frequently. This lowers performance, but keeps the amount of data lost by a client crash within acceptable limits.
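The no-flush commit discipline might look roughly as follows. The txn_* and log_flush calls below are stand-ins for the RVM primitives documented in [13]; their names, signatures, and the flush thresholds shown are assumptions made purely for illustration.

    /* Hypothetical no-flush transaction discipline; not the real RVM API. */
    #include <stdio.h>

    static int dirty_log_records;             /* commits not yet flushed to disk */

    static void txn_begin(void)                    { /* start a local, non-nested transaction */ }
    static void txn_set_range(void *p, int len)    { (void)p; (void)len; /* declare modified region */ }
    static void txn_end_no_flush(void)             { dirty_log_records++; /* commit, no synchronous write */ }
    static void log_flush(void)                    { dirty_log_records = 0; printf("write-ahead log flushed\n"); }

    /* Venus-style policy: flush rarely while hoarding (servers also hold the
     * data), aggressively while emulating (the client copy is the only copy). */
    static void maybe_flush(int emulating)
    {
        int threshold = emulating ? 1 : 100;       /* illustrative numbers only */
        if (dirty_log_records >= threshold)
            log_flush();
    }

    int main(void)
    {
        int meta_data = 0;
        txn_begin();
        txn_set_range(&meta_data, sizeof meta_data);
        meta_data = 42;                 /* mutate recoverable meta-data          */
        txn_end_no_flush();             /* bounded persistence until next flush  */
        maybe_flush(1);                 /* emulating: flush promptly             */
        return 0;
    }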
4.4.3 Resource Exhaustion. It is possible for Venus to exhaust its nonvolatile storage during emulation. The two significant instances of this are the file cache becoming filled with modified files, and the RVM space allocated to replay logs becoming full.

Our current implementation is not very graceful in handling these situations. When the file cache is full, space can be freed by truncating or deleting modified files. When log space is full, no further mutations are allowed until reintegration has been performed. Of course, non-mutating operations are always allowed.

We plan to explore at least three alternatives to free up disk space while emulating. One possibility is to compress file cache and RVM contents. Compression trades off computation time for space, and recent work [2] has shown it to be a promising tool for cache management. A second possibility is to allow users to selectively back out updates made while disconnected. A third approach is to allow portions of the file cache and RVM to be written out to removable media such as floppy disks.
4.5 Reintegration

Reintegration is a transitory state through which Venus passes in changing roles from pseudo-server to cache manager. In this state, Venus propagates changes made during emulation, and updates its cache to reflect current server state. Reintegration is performed a volume at a time, with all update activity in the volume suspended until completion.
4.5.1 Replay Algorithm. The propagation of changes from client to AVSG is accomplished in two steps. In the first step, Venus obtains permanent fids for new objects and uses them to replace temporary fids in the replay log. This step is avoided in many cases, since Venus obtains a small supply of permanent fids in advance of need, while in the hoarding state. In the second step, the replay log is shipped in parallel to the AVSG, and executed independently at each member. Each server performs the replay within a single transaction, which is aborted if any error is detected.

The replay algorithm consists of four phases. In phase one the log is parsed, a transaction is begun, and all objects referenced in the log are locked. In phase two, each operation in the log is validated and then executed. The validation consists of conflict detection as well as integrity, protection, and disk space checks. Except in the case of store operations, execution during replay is identical to execution in connected mode. For a store, an empty shadow file is created and meta-data is updated to reference it, but the data transfer is deferred. Phase three consists exclusively of performing these data transfers, a process known as back-fetching. The final phase commits the transaction and releases all locks.

If reintegration succeeds, Venus frees the replay log and resets the priority of cached objects referenced by the log. If reintegration fails, Venus writes out the replay log to a local replay file in a superset of the Unix tar format. The log and all corresponding cache entries are then purged, so that subsequent references will cause refetch of the current contents at the AVSG. A tool is provided which allows the user to inspect the contents of a replay file, compare it to the state at the AVSG, and replay it selectively or in its entirety.

Reintegration at finer granularity than a volume would reduce the latency perceived by clients, improve concurrency and load balancing at servers, and reduce user effort during manual replay. To this end, we are revising our implementation to reintegrate at the granularity of subsequences of dependent operations within a volume. Dependent subsequences can be identified using the precedence graph approach of Davidson [4]. In the revised implementation, Venus will maintain precedence graphs during emulation, and pass them to servers along with the replay log.
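The four phases can be summarized by the skeleton below. The function names and error handling are ours, not the server's; on a real server each phase operates on locked server replicas inside the enclosing transaction.

    /* Skeleton of the four-phase replay described above (stubs only). */
    #include <stdbool.h>
    #include <stdio.h>

    struct replay_log;                                     /* parsed log shipped by Venus */

    static bool lock_all_objects(struct replay_log *l)      { (void)l; return true; }
    static bool validate_and_execute(struct replay_log *l)  { (void)l; return true; }
    static bool back_fetch_store_data(struct replay_log *l) { (void)l; return true; }
    static void commit_and_unlock(struct replay_log *l)     { (void)l; }
    static void abort_and_unlock(struct replay_log *l)      { (void)l; }

    static bool replay(struct replay_log *l)
    {
        if (!lock_all_objects(l))           /* phase 1: parse, begin txn, lock all objects   */
            goto fail;
        if (!validate_and_execute(l))       /* phase 2: conflict/integrity checks, execute;
                                               stores create shadow files, data deferred     */
            goto fail;
        if (!back_fetch_store_data(l))      /* phase 3: back-fetch the deferred file contents */
            goto fail;
        commit_and_unlock(l);               /* phase 4: commit and release all locks          */
        return true;
    fail:
        abort_and_unlock(l);                /* any error aborts the entire replay             */
        return false;
    }

    int main(void)
    {
        struct replay_log *l = NULL;        /* placeholder; stubs ignore their argument */
        printf("reintegration %s\n", replay(l) ? "succeeded" : "failed");
        return 0;
    }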
4.5.2 Conflict Handling. Our use of optimistic replica control means that the disconnected operations of one client may conflict with activity at servers or other disconnected clients. The only class of conflicts we are concerned with are write/write conflicts. Read/write conflicts are not relevant to the Unix file system model, since it has no notion of atomicity beyond the boundary of a single system call.

The check for conflicts relies on the fact that each replica of an object is tagged with a storeid that uniquely identifies the last update to it. During phase two of replay, a server compares the storeid of every object mentioned in a log entry with the storeid of its own replica of the object. If the comparison indicates equality for all objects, the operation is performed and the mutated objects are tagged with a new storeid specified in the log entry.

If a storeid comparison fails, the action taken depends on the operation being validated. In the case of a store of a file, the entire reintegration is aborted. But for directories, a conflict is declared only if a newly created name collides with an existing name, if an object updated at the client or the server has been deleted by the other, or if directory attributes have been modified at the server and the client. This strategy of resolving partitioned directory updates is consistent with our strategy in server replication [11], and was originally suggested by Locus [23].

Our original design for disconnected operation called for preservation of replay files at servers rather than clients. This approach would also allow damage to be confined by marking conflicting replicas inconsistent and forcing manual repair, as is currently done in the case of server replication. We are awaiting more usage experience to determine whether this is indeed the correct approach for disconnected operation.
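A sketch of the storeid validation is shown below. The storeid layout and the boolean flags standing in for the three directory conflict tests are illustrative assumptions; only the decision rule itself comes from the text.

    /* Storeid check used during phase two of replay (illustrative types). */
    #include <stdbool.h>
    #include <stdio.h>

    struct storeid { unsigned host, uniquifier; };   /* identifies the last update */

    static bool same_storeid(struct storeid a, struct storeid b)
    {
        return a.host == b.host && a.uniquifier == b.uniquifier;
    }

    /* Returns true if the logged operation may be applied at the server. */
    static bool validate(struct storeid logged, struct storeid at_server,
                         bool is_directory, bool name_collision,
                         bool deleted_on_one_side, bool attrs_changed_both)
    {
        if (same_storeid(logged, at_server))
            return true;                 /* no partitioned update: proceed            */
        if (!is_directory)
            return false;                /* store of a file: abort the reintegration  */
        /* Directories: a conflict is declared only in the three cases listed above. */
        return !(name_collision || deleted_on_one_side || attrs_changed_both);
    }

    int main(void)
    {
        struct storeid logged = { 7, 41 }, server = { 7, 41 };
        printf("file ok: %d\n", validate(logged, server, false, false, false, false));
        server.uniquifier = 55;          /* the object was updated in the partition */
        printf("dir ok:  %d\n", validate(logged, server, true, false, false, false));
        return 0;
    }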
5. STATUS AND EVALUATION

Coda runs on IBM RTs, Decstation 3100s and 5000s, and 386-based laptops such as the Toshiba 5200. A small user community has been using Coda on a daily basis as its primary data repository since April 1990. All development work on Coda is done in Coda itself. As of July 1991 there were nearly 350MB of triply replicated data in Coda, with plans to expand to 2GB in the next few months.

A version of disconnected operation with minimal functionality was demonstrated in October 1990. A more complete version was functional in January 1991, and is now in regular use. We have successfully operated disconnected for periods lasting one to two days. Our experience with the system has been quite positive, and we are confident that the refinements under development will result in an even more usable system.

In the following sections we provide qualitative and quantitative answers to three important questions pertaining to disconnected operation. These are:

(1) How long does reintegration take?
(2) How large a local disk does one need?
(3) How likely are conflicts?
5.1 Duration of Reintegration

In our experience, typical disconnected sessions of editing and program development lasting a few hours required about a minute for reintegration. To characterize reintegration speed more precisely, we measured the reintegration times after disconnected execution of two well-defined tasks. The first task is the Andrew benchmark [9], now widely used as a basis for comparing file system performance. The second task is the compiling and linking of the current version of Venus. Table I presents the reintegration times for these tasks.

Table I. Time for Reintegration

                        Elapsed Time    Reintegration Time (seconds)        Size of Replay Log    Data Back-Fetched
    Benchmark           (seconds)       Total   AllocFid  Replay   COP2     Records   Bytes       (Bytes)
    Andrew Benchmark    288 (3)         43 (2)  4 (2)     29 (1)   10 (1)   223       65,010      1,141,315
    Venus Make          3,271 (28)      52 (4)  1 (0)     40 (1)   10 (3)   193       65,919      2,990,120

Note: This data was obtained with a Toshiba T5200/100 client (12MB memory, 100MB disk) reintegrating over an Ethernet with an IBM RT-APC server (12MB memory, 400MB disk). The values shown above are the means of three trials. Figures in parentheses are standard deviations.
The time for reintegration consists of three components: the time to allocate permanent fids, the time for the replay at the servers, and the time for the second phase of the update protocol used for server replication. The first component will be zero for many disconnections, due to the preallocation of fids during hoarding. We expect the time for the second component to fall, considerably in many cases, as we incorporate the last of the replay log optimizations described in Section 4.4.1. The third component can be avoided only if server replication is not used.

One can make some interesting secondary observations from Table I. First, the total time for reintegration is roughly the same for the two tasks even though the Andrew benchmark has a much smaller elapsed time. This is because the Andrew benchmark uses the file system more intensively. Second, reintegration for the Venus make takes longer, even though the number of entries in the replay log is smaller. This is because much more file data is back-fetched in the third phase of the replay. Finally, neither task involves any think time. As a result, their reintegration times are comparable to that after a much longer, but more typical, disconnected session in our environment.
[Figure 5, a plot of cache usage in megabytes against time in hours with Max, Avg and Min curves, appears here.]

Fig. 5. High-water mark of cache usage. This graph is based on a total of 10 traces from 5 active Coda workstations. The curve labelled "Avg" corresponds to the values obtained by averaging the high-water marks of all workstations. The curves labelled "Max" and "Min" plot the highest and lowest values of the high-water marks across all workstations. Note that the high-water mark does not include space needed for paging, the HDB or replay logs.
5.2 Cache Size

A local disk capacity of 100MB on our clients has proved adequate for our initial sessions of disconnected operation. To obtain a better understanding of the cache size requirements for disconnected operation, we analyzed file reference traces from workstations in our environment. The traces were obtained by instrumenting workstations to record information on every file system operation, regardless of whether the file was in Coda, AFS, or the local file system.

Our analysis is based on simulations driven by these traces. Writing and validating a simulator that precisely models the complex caching behavior of Venus would be quite difficult. To avoid this difficulty, we have modified Venus to act as its own simulator. When running as a simulator, Venus is driven by traces rather than requests from the kernel. Code to communicate with the servers, as well as code to perform physical I/O on the local file system, is stubbed out during simulation.

Figure 5 shows the high-water mark of cache usage as a function of time. From our data it appears that a disk of 50-60MB should be adequate for operating disconnected for a typical workday. Of course, the actual disk size needed has to be larger, since both the explicit and implicit sources of hoarding information are imperfect.

We plan to extend our work on trace-driven simulations in three ways. First, we will investigate cache size requirements for much longer periods of disconnection. Second, we will be sampling a broader range of user activity by obtaining traces from many more machines in our environment. Third, we will evaluate the effect of hoarding by simulating traces together with hoard profiles that have been specified ex ante by users.
5.3 Likelihood of Conflicts

In our use of optimistic server replication in Coda for nearly a year, we have seen virtually no conflicts due to multiple users updating an object in different network partitions. While gratifying to us, this observation is subject to at least three criticisms. First, it is possible that our users are being cautious, knowing that they are dealing with an experimental system. Second, perhaps conflicts will become a problem only as the Coda user community grows larger. Third, perhaps extended voluntary disconnections will lead to many more conflicts.

To obtain data on the likelihood of conflicts at larger scale, we instrumented the AFS servers in our environment. These servers are used by over 400 computer science faculty, staff and graduate students for research, program development, and education. Their usage profile includes a significant amount of collaborative activity. Since Coda is descended from AFS and makes the same kind of usage assumptions, we can use this data to estimate how frequent conflicts would be if Coda were to replace AFS in our environment.

Every time a user modifies an AFS file or directory, we compare his identity with that of the user who made the previous mutation. We also note the time interval between mutations. For a file, only the close after an open for update is counted as a mutation; individual write operations are not counted. For directories, all operations that modify a directory are counted as mutations.

Table II presents our observations over a period of twelve months. The data is classified by volume type: user volumes containing private user data, project volumes used for collaborative work, and system volumes containing program binaries, libraries, header files and other similar data. On average, a project volume has about 2600 files and 280 directories, and a system volume has about 1600 files and 130 directories. User volumes tend to be smaller, averaging about 200 files and 18 directories, because users often place much of their data in their project volumes.
[Table II, which classifies the write-sharing observed on the AFS servers over twelve months by volume type (user, project, system) and by the interval between mutations by different users, appears here; its detailed contents are not reproduced.]
Table II shows that over 99% of all modifications were by the previous writer, and that the chances of two different users modifying the same object less than a day apart is at most 0.75%. We had expected to see the highest degree of write-sharing on project files or directories, and were surprised to see that it actually occurs on system files. We conjecture that a significant fraction of this sharing arises from modifications to system files by operators, who change shift periodically. If system files are excluded, the absence of write-sharing is even more striking: more than 99.5% of all mutations are by the previous writer, and the chances of two different users modifying the same object within a week are less than 0.4%! This data is highly encouraging from the point of view of optimistic replication. It suggests that conflicts would not be a serious problem if AFS were replaced by Coda in our environment.
6. RELATED WORK

Coda is unique in that it exploits caching for both performance and high availability while preserving a high degree of transparency. We are aware of no other system, published or unpublished, that duplicates this key aspect of Coda.

By providing tools to link local and remote name spaces, the Cedar file system [20] provided rudimentary support for disconnected operation. But since this was not its primary goal, Cedar did not provide support for hoarding, transparent reintegration or conflict detection. Files were versioned and immutable, and a Cedar cache manager could substitute a cached version of a file on reference to an unqualified remote file whose server was inaccessible. However, the implementors of Cedar observe that this capability was not often exploited since remote files were normally referenced by specific version number.

Birrell and Schroeder pointed out the possibility of "stashing" data for availability in an early discussion of the Echo file system [14]. However, a more recent description of Echo [8] indicates that it uses stashing only for the highest levels of the naming hierarchy.

The FACE file system [3] uses stashing but does not integrate it with caching. The lack of integration has at least three negative consequences. First, it reduces transparency because users and applications deal with two different name spaces, with different consistency properties. Second, utilization of local disk space is likely to be much worse. Third, recent usage information from cache management is not available to manage the stash. The available literature on FACE does not report on how much the lack of integration detracted from the usability of the system.

An application-specific form of disconnected operation was implemented in the PCMAIL system at MIT [12]. PCMAIL allowed clients to disconnect, manipulate existing mail messages and generate new ones, and resynchronize with a central repository at reconnection. Besides relying heavily on the semantics of mail, PCMAIL was less transparent than Coda since it required manual resynchronization as well as preregistration of clients with servers.

The use of optimistic replication in distributed file systems was pioneered by Locus [23]. Since Locus used a peer-to-peer model rather than a client-server model, availability was achieved solely through server replication. There was no notion of caching, and hence of disconnected operation.

Coda has benefited in a general sense from the large body of work on transparency and performance in distributed file systems. In particular, Coda owes much to AFS [19], from which it inherits its model of trust and integrity, as well as its mechanisms and design philosophy for scalability.
7. FUTURE WORK

Disconnected operation in Coda is a facility under active development. In earlier sections of this paper we described work in progress in the areas of log optimization, granularity of reintegration, and evaluation of hoarding. Much additional work is also being done at lower levels of the system. In this section we consider two ways in which the scope of our work may be broadened.

An excellent opportunity exists in Coda for adding transactional support to Unix. Explicit transactions become more desirable as systems scale to hundreds or thousands of nodes, and the informal concurrency control of Unix becomes less effective. Many of the mechanisms supporting disconnected operation, such as operation logging, precedence graph maintenance, and conflict checking, would transfer directly to a transactional system using optimistic concurrency control. Although transactional file systems are not a new idea, no such system with the scalability, availability and performance properties of Coda has been proposed or built.

A different opportunity exists in the extension of disconnected operation to support weakly connected operation, in environments where connectivity is intermittent or of low bandwidth. Such conditions are found in networks that rely on voice-grade lines, or that use wireless technologies such as packet radio. The ability to mask failures, as provided by disconnected operation, is of value even with weak connectivity. But techniques which exploit and adapt to the communication opportunities at hand are also needed. Such techniques may include more aggressive write-back policies, compressed network transmission, partial file transfer, and caching at intermediate levels.
8. CONCLUSION

Disconnected operation is a tantalizingly simple idea. All one has to do is to preload one's cache with critical data, continue normal operation until disconnection, log all changes made while disconnected, and replay them upon reconnection.

Implementing disconnected operation is not so simple. It involves major modifications and careful attention to detail in many aspects of cache management. While hoarding, a surprisingly large volume and variety of interrelated state has to be maintained. When emulating, the persistence and integrity of client data structures become critical. During reintegration, there are dynamic choices to be made about the granularity of reintegration.

Only in hindsight do we realize the extent to which implementations of traditional caching schemes have been simplified by the guaranteed presence of a lifeline to a first-class replica. Purging and refetching on demand, a strategy often used to handle pathological situations in those implementations, is not viable when supporting disconnected operation. However, the obstacles to realizing disconnected operation are not insurmountable. Rather, the central message of this paper is that disconnected operation is indeed feasible, efficient and usable.

One way to view our work is to regard it as an extension of the idea of write-back caching. Whereas write-back caching has hitherto been used for performance, we have shown that it can be extended to mask temporary failures too. A broader view is that disconnected operation allows graceful transitions between states of autonomy and interdependence in a distributed system. Under favorable conditions, our approach provides all the benefits of remote data access; under unfavorable conditions, it provides continued access to critical data. We are certain that disconnected operation will become increasingly important as distributed systems grow in scale, diversity and vulnerability.
ACKNOWLEDGMENTS

We wish to thank Lily Mummert for her invaluable assistance in collecting and postprocessing the file reference traces used in Section 5.2, and Dimitris Varotsis, who helped instrument the AFS servers which yielded the measurements of Section 5.3. We also wish to express our appreciation to past and present contributors to the Coda project, especially Puneet Kumar, Hank Mashburn, Maria Okasaki, and David Steere.
REFERENCES

1. BURROWS, M. Efficient data sharing. PhD thesis, Univ. of Cambridge, Computer Laboratory, Dec. 1988.
2. CATE, V., AND GROSS, T. Combining the concepts of compression and caching for a two-level file system. In Proceedings of the 4th ACM Symposium on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr. 1991), pp. 200-211.
3. COVA, L. L. Resource management in federated computing environments. PhD thesis, Dept. of Computer Science, Princeton Univ., Oct. 1990.
4. DAVIDSON, S. B. Optimism and consistency in partitioned distributed database systems. ACM Trans. Database Syst. 9, 3 (Sept. 1984), 456-481.
5. DAVIDSON, S. B., GARCIA-MOLINA, H., AND SKEEN, D. Consistency in partitioned networks. ACM Comput. Surv. 17, 3 (Sept. 1985), 341-370.
6. FLOYD, R. A. Transparency in distributed file systems. Tech. Rep. TR 272, Dept. of Computer Science, Univ. of Rochester, 1989.
7. GRAY, C. G., AND CHERITON, D. R. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency. In Proceedings of the 12th ACM Symposium on Operating System Principles (Litchfield Park, Ariz., Dec. 1989), pp. 202-210.
8. HISGEN, A., BIRRELL, A., MANN, T., SCHROEDER, M., AND SWART, G. Availability and consistency tradeoffs in the Echo distributed file system. In Proceedings of the Second Workshop on Workstation Operating Systems (Pacific Grove, Calif., Sept. 1989), pp. 49-54.
9. HOWARD, J. H., KAZAR, M. L., MENEES, S. G., NICHOLS, D. A., SATYANARAYANAN, M., SIDEBOTHAM, R. N., AND WEST, M. J. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 1 (Feb. 1988), 51-81.
10. KLEIMAN, S. R. Vnodes: An architecture for multiple file system types in Sun UNIX. In Summer Usenix Conference Proceedings (Atlanta, Ga., June 1986), pp. 238-247.
11. KUMAR, P., AND SATYANARAYANAN, M. Log-based directory resolution in the Coda file system. Tech. Rep. CMU-CS-91-164, School of Computer Science, Carnegie Mellon Univ., 1991.
12. LAMBERT, M. PCMAIL: A distributed mail system for personal computers. DARPA-Internet RFC 1056, 1988.
13. MASHBURN, H., AND SATYANARAYANAN, M. RVM: Recoverable virtual memory user manual. School of Computer Science, Carnegie Mellon Univ., 1991.
14. NEEDHAM, R. M., AND HERBERT, A. J. Report on the Third European SIGOPS Workshop: "Autonomy or Interdependence in Distributed Systems." SIGOPS Rev. 23, 2 (Apr. 1989), 3-19.
15. OUSTERHOUT, J., DA COSTA, H., HARRISON, D., KUNZE, J., KUPFER, M., AND THOMPSON, J. A trace-driven analysis of the 4.2BSD file system. In Proceedings of the 10th ACM Symposium on Operating System Principles (Orcas Island, Wash., Dec. 1985), pp. 15-24.
16. SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, D., AND LYON, B. Design and implementation of the Sun network filesystem. In Summer Usenix Conference Proceedings (Portland, Ore., June 1985), pp. 119-130.
17. SATYANARAYANAN, M. On the influence of scale in a distributed system. In Proceedings of the 10th International Conference on Software Engineering (Singapore, Apr. 1988), pp. 10-18.
18. SATYANARAYANAN, M., KISTLER, J. J., KUMAR, P., OKASAKI, M. E., SIEGEL, E. H., AND STEERE, D. C. Coda: A highly available file system for a distributed workstation environment. IEEE Trans. Comput. 39, 4 (Apr. 1990), 447-459.
19. SATYANARAYANAN, M. Scalable, secure, and highly available distributed file access. IEEE Comput. 23, 5 (May 1990), 9-21.
20. SCHROEDER, M., GIFFORD, D. K., AND NEEDHAM, R. M. A caching file system for a programmer's workstation. In Proceedings of the 10th ACM Symposium on Operating System Principles (Orcas Island, Wash., Dec. 1985), pp. 25-34.
21. STEERE, D. C., KISTLER, J. J., AND SATYANARAYANAN, M. Efficient user-level file cache management on the Sun Vnode interface. In Summer Usenix Conference Proceedings (Anaheim, Calif., June 1990), pp. 325-331.
22. Decorum File System. Transarc Corporation, Jan. 1990.
23. WALKER, B., POPEK, G., ENGLISH, R., KLINE, C., AND THIEL, G. The LOCUS distributed operating system. In Proceedings of the 9th ACM Symposium on Operating System Principles (Bretton Woods, N.H., Oct. 1983), pp. 49-70.

Received June 1991; revised August 1991; accepted September 1991