Interposed Request Routing for Scalable Network Storage

CS-2000-05
May 2000

Darrell Anderson, Jeff Chase, Amin Vahdat
Department of Computer Science
Duke University
Durham, North Carolina 27708-0129
{anderson,chase,vahdat}@cs.duke.edu

This work is supported by the National Science Foundation (CCR-96-24857, CDA-95-12356, and EIA-9870724) and equipment grants from Intel Corporation and Myricom. Anderson is supported by a U.S. Department of Education GAANN fellowship.
This paper presents Slice, a new storage system architecture for high-speed LANs incorporating network-attached block storage. Slice interposes a request switching filter, called a proxy, along the network path between the client and the network storage system (e.g., in a network adapter or switch). The purpose of the proxy is to route requests among a server ensemble that implements the file service. We present a prototype that uses this approach to virtualize the standard NFS file protocol to provide scalable, high-bandwidth file service to ordinary NFS clients. The paper presents and justifies the architecture, proposes and evaluates several request routing policies realizable within the architecture, and explores the effects of these policies on service structure.
1 Introduction

Demand for storage is growing at roughly the same rate as Internet bandwidth. A prominent factor driving this storage growth is the return of server-based computing. Software is migrating from shrink-wrap distribution to a service model, in which applications are hosted in data centers and accessed through Internet sites. Additionally, consumer Internet service providers are managing and distributing high-volume content such as mail, message boards, and FTP archives in large data centers. At the same time, there is an increasing demand to store large amounts of data for scalable computing, multimedia, and visualization.

This rapid growth means that a successful storage system architecture must scale to handle current demands, placing a large premium on the costs (including human costs) associated with upgrading. These properties are so important that many high-end storage servers are priced an order of magnitude or more beyond the cost of raw disk storage.

Commercial systems increasingly provide scalable shared storage by interconnecting storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. Yet recent order-of-magnitude improvements in LAN performance have narrowed the bandwidth gap between SANs and LANs. This creates an opportunity to deliver competitive storage solutions by aggregating low-cost storage nodes and servers, using a general-purpose LAN as the storage backplane. In such a system it should be possible to incrementally scale either capacity or bandwidth of the shared storage resource by attaching more storage to the network, using a distributed software layer to unify storage resources into a file system view.
Figure 1: Combining functional decomposition and data decomposition in the Slice architecture. (The figure shows the proxy routing a client's namespace requests to metadata servers via an MD5 hash, small-file read/write requests to small-file servers via a file placement policy, and bulk I/O via a striping policy to a scalable block I/O service.)

This paper introduces a network storage architecture, called Slice,
that takes a unique approach to scaling storage services. Slice interposes a request switching filter (called a proxy) along the client's communication path to the storage service. The filter could reside in a programmable switch or network adapter, or it could reside in a self-contained module at the client's interface to the network. The role of the filter is to "wrap" a standard client/server file system framework, and redirect requests to a collection of storage elements and servers that cooperate to present a uniform view of a shared file volume with scalable bandwidth and capacity. File services are a good context to explore these techniques because they have well-defined protocols and semantics, established workloads, and a large installed base of clients and applications, many of which face significant scaling challenges today.
The contributions of this paper are:

- Illustrate and evaluate the architecture with a prototype proxy implementation based on an IP packet filter. We quantify the performance and state requirements of the proxy.

- Present request routing approaches achievable within this model, including striping for high-bandwidth bulk I/O, and separating small-file I/O and distributing it across multiple managers. We introduce mkdir switching and name hashing approaches that enable transparent scaling of the file system name space, including support for very large distributed directories. We also explore the effects of request routing policies on the structure of the file service.

- Evaluate the architecture and routing policies with a prototype that is compliant with Version 3 of the Network File System (NFS), the most popular protocol standard for commercial network-attached storage servers.
We present performance results from SPECsfs97, the standard commercial benchmark for network-attached storage. The results demonstrate the scalability of the architecture, and show that it can provide performance competitive with commercial-grade file servers by aggregating low-cost components on a switched LAN. Moreover, the architecture allows for growth of the system by adding storage nodes and servers, meeting demands for applications with a variety of access characteristics.

This paper is organized as follows. Section 2 outlines the architecture and its motivation, and sets Slice in context with related work. Section 3 discusses the role of the proxy, presents specific request routing mechanisms and policies, and explores their implications for file service structure including recovery and reconfiguration. Section 4 describes our prototype, and Section 5 presents experimental performance results. Section 6 concludes.
2 Overview

The Slice file service consists of a collection of servers, cooperating to serve a single arbitrarily large virtual volume of files and directories. To a client, the ensemble appears as a single file server at some virtual network address. The proxy intercepts packets and performs simple functions to route requests and to represent the ensemble as a unified file service to its client.

Figure 1 depicts the structure of a Slice ensemble. The figure illustrates a client's request stream partitioned into three functional groups corresponding to the major file workload components: (1) high-volume I/O to large files, (2) I/O on small files, and (3) operations on the name space or file attributes. The proxy switches on the request type and arguments to redirect requests to a selected server responsible for handling that request class. Bulk I/O operations route directly to an array of storage nodes, which provide raw block access to storage objects. Other operations are distributed among file managers for small-file I/O or name space requests.

This functional decomposition allows high-volume data flow to bypass the managers, while allowing specialization of the servers by workload component, tailoring the policies for disk layout, caching, and recovery. A single server node could combine the functions of multiple server classes; we separate them in order to highlight opportunities to benefit from distributing requests across many servers.

For each request the proxy selects a target server from the given class by switching on the identity of the target file, name entry, or block, possibly using a separate routing function for each request class. Thus the routing function induces a data decomposition of the volume data across the ensemble, with the side effect of creating or caching data items on the selected managers. Ideally, the request routing scheme spreads the data and request workload in a balanced fashion across all servers. The routing function may adapt to system conditions, e.g., to use new server sites as they become available. This allows each workload component to scale independently by adding resources to its server class.

The next three subsections outline the role of the proxy, storage nodes, and file manager servers. We then discuss implications of the architecture and set it in context with related work.
2.1 The proxy

The proxy examines file service requests and responses, applying the routing functions to redirect requests. An overarching goal is to keep the proxy simple, small, and fast. The proxy may (1) rewrite the source address, destination address, or other fields of request or response packets, (2) maintain a bounded amount of soft state, and (3) initiate or absorb packets to or from the Slice ensemble. The proxy does not itself manage any state shared across clients, so it may reside on the client host or network interface, or in a network element close to the server.

The proxy functions as a network element within the Internet architecture. Through the strict use of soft state, it is free to fail and discard its state and/or pending packets without compromising correctness. Higher-level end-to-end protocols (in this case NFS/RPC/UDP or TCP) retransmit packets as necessary to recover from drops in the proxy. Although the proxy resides "within the network", it acts as an extension of the service. For example, since the proxy is a layer-5 protocol component, it must reside (logically) at one end of the connection or the other; e.g., it cannot reside in the "middle" of the connection where end-to-end encryption could hide layer-5 protocol fields from the network elements. While it is logically an extension of the service, the proxy acts with the authority of its clients. In particular, the proxy may reside outside of the server ensemble's trust boundary. In this case, the damage from a compromised proxy is limited to the files and directories that its client(s) had permission to access.
2.2 Network Storage Nodes

A shared array of network storage nodes provides all disk storage used in a Slice ensemble. Slice servers and clients may access the storage array directly. Clients access the network storage array only by request redirection through a proxy. More storage nodes may be added to incrementally scale bandwidth, capacity, and disk arms. The placement policies of the file service are responsible for distributing data among storage objects so as to benefit fully from all of the resources in the network storage array.

The Slice block storage architecture is loosely based on a proposal in the National Storage Industry Consortium (NSIC) for object-based storage devices (OBSD) [3]. Key elements of the OBSD proposal were in turn inspired by research on Network Attached Secure Disks (NASD) [10, 11]. Storage nodes are "object-based" rather than sector-based, meaning that requesters address data on each storage node as logical offsets within storage objects. A storage object is an ordered sequence of bytes with a unique identifier. OBSDs support additional command sets to manage collections of objects, and they retain responsibility for assigning physical disk blocks to objects. The OBSD proposal allows for cryptographic protection of object identifiers if the network is insecure [10].

This block storage architecture is orthogonal to the question of which level arranges redundancy to tolerate disk failures. One alternative is to provide redundancy of disks and other vulnerable components internally to each storage node. A second option is for the file service layers to mirror data or maintain parity across the storage nodes. In Slice, the choice to employ extra redundancy across storage nodes may be made on a per-file basis through support for mirrored striping in our prototype. For stronger protection, a Slice configuration could employ redundancy at both levels.
2.3 File Managers

File management functions above the network storage array are split across two classes of servers. Each class governs functions that are common to any file server; the model separates them to distribute the request load and allow implementations specialized to the semantics of individual workloads.

- Directory servers handle name space operations, e.g., to create, remove, or lookup files and directories by symbolic name; they manage and cache directories and mappings from names to file identifiers and attributes for each file or directory.

- Small-file servers handle read and write operations on small files and the initial segments of large files (see Section 3.1).

Slice file managers are dataless; all of their state is backed by the network storage array. Their role is to aggregate their structures into larger storage objects backed by the storage nodes, and to provide memory and CPU resources to cache and manipulate those structures. In this way, the file managers can benefit from the parallel disk arms and high bandwidth of the storage array as more storage nodes are added.

The principle of dataless file managers also plays a key role in recovery. In addition to its backing objects, each manager journals its updates in a write-ahead log [12, 9]; the system can recover the state of any manager from its backing objects together with its log. This allows fast failover, in which a surviving site assumes the role of a failed server, recovering its state from shared storage [14, 4, 27].
2.4 Discussion

In summary, the Slice architecture presents several fundamental benefits to diverse application workloads:

- Compatibility with standard file system clients. The proxy factors request routing policies away from the client-side file system code. The purpose of this approach is to leverage a minimal computing capability within the network elements to virtualize the storage protocol.

- Direct storage access for high-volume I/O. The proxy routes bulk I/O traffic directly to the network storage array, removing the file managers from the critical path. Separating requests in this fashion effectively eliminates a key scaling barrier for conventional file services [10, 11]. Moreover, policies for block placement and replication may be applied on a per-file basis, by incorporating them in the policy for routing I/O requests to the storage array. At the same time, the small-file servers absorb and aggregate I/O operations on small objects, so there is no need for the storage node mapping structures to handle small objects efficiently.

- Content-based request switching. The proxy's routing function selects a server site for each request based on the request arguments. Requests with the same arguments are routed to the same server, unless the routing function changes (e.g., due to reconfiguration or rebalancing). This can improve locality in the request stream; a good routing function will induce a balanced distribution of file objects and requests across the servers, eliminating redundant caching.
2.5 Related Work

A large number of systems have interposed new system functionality by "wrapping" an existing interface, including kernel system calls [16], internal interfaces [15], communication bindings [13], or messaging endpoints. The concept of a proxy mediating between clients and servers [26] is now common in distributed systems.

We propose to mediate some storage functions by interposing on standard storage access protocols within the network elements. The role of the Slice proxy is primarily to route requests based on their content. This is analogous to the HTTP content switching features offered by prominent network switch vendors (e.g., Foundry, Extreme, Alteon) for Internet server sites, based in part on research that demonstrated the locality benefits in those environments [22]. Our proposal extends the concept to a file system context, in which updates are frequent.

A number of recent commercial and research efforts investigate techniques for building scalable storage systems for high-speed switched LANs. These systems are built from disks distributed through the network, e.g., attached to dedicated servers [17, 27, 14], cooperating peers [4, 29], or the network itself [11]. We separate these into two broad groups.

Many of these systems separate file managers (e.g., the name service) from the block storage service, as in Slice. This was first proposed for the Cambridge Universal File Server [6]. Subsequent systems adopted this separation to allow bulk I/O to bypass file managers [7, 14], and it is now a basic tenet of research in network-attached storage devices including the CMU NASD work on devices with object-based access control [11, 10]. The industry is moving to base new storage products on the NASD model. Slice shows how to incorporate essential placement and routing functions into a new filesystem structure for network-attached storage devices. The Slice model cleanly separates those functions from the base file access protocol, preserving compatibility with existing network file system clients. Other proposals have integrated functions similar to those of the proxy into file system clients [10].

An alternative structure for cluster file systems is to layer file system functions above a network storage volume using a shared disk model. Policies for striping, redundancy, and storage site selection are specified on a volume basis; cluster nodes coordinate their accesses to the shared storage blocks using an ownership protocol. This approach has been used with both log-structured (Zebra [14] and xFS [4]) and conventional (Frangipani/Petal [17, 27] and GFS [23]) file system organizations. The cluster may be viewed as "serverless" if all nodes are trusted and have direct access to the shared disk, or alternatively the entire cluster may act as a file server to untrusted clients using a standard network file protocol, with all I/O passing through the cluster nodes as they mediate access to the disk.
The key benefits of Slice request routing apply equally to these shared disk systems when untrusted clients are present. First, request routing is a key to incorporating secure network-attached storage devices, which allow untrusted clients to address storage objects directly without compromising the integrity of the file system. That is, a proxy could route bulk I/O requests directly to the devices, yielding a more scalable system that preserves compatibility with standard clients and accommodates per-file policies for block placement, redundancy or replication, prefetching, etc. Second, request routing could improve locality in the request stream to the file servers, improving cache effectiveness and reducing the incidence of block sharing among the servers.

The shared disk model is used in many commercial systems, which increasingly interconnect storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. This paper explores storage request routing for Internet networks, but the concepts are equally applicable in SANs.

Our proposal to separate small-file I/O from the request stream is similar in concept to the Amoeba Bullet Server [28], a specialized file server that optimizes small files. As described in Section 4.3, the prototype small-file server draws on techniques from the Bullet Server, FFS fragments [21], and SquidMLA [19], a Web proxy server that maintains a user-level "filesystem" of small cached Web pages.
3 Request Routing Policies

This section explains the structure of the proxy and the request routing schemes used in the Slice prototype. The purpose is to illustrate concretely the request routing policies enabled by the architecture, and the implications of those policies for the way the servers interact to maintain and recover consistent file system states. We use the NFS V3 protocol as a reference point, since it is widely understood and our prototype supports it.

The proxy examines and rewrites NFS requests and responses as they enter or arrive from the network. It intercepts requests addressed to virtual NFS servers, and selects a target server from the Slice ensemble by applying a request routing function to the request type and arguments. It then rewrites the IP address and port of the request to redirect it to the selected server. When a response arrives, the proxy rewrites the source address and port before forwarding it to the client, so the packet appears to originate from the virtual NFS server.

The request routing functions must permit reconfiguration to add or remove servers, while minimizing state requirements in the proxy. The proxy routes most requests by extracting relevant fields from the request, optionally hashing to combine multiple fields, and interpreting the result as a logical server site ID for the request. It then looks up the physical server corresponding to the logical site in a compact routing table installed at initialization. These tables constitute soft state; the proxy may reload them after a failure. It is possible to reconfigure the routing function by changing the tables. Multiple logical sites may map to the same physical server, leaving flexibility for reconfiguration.
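To make this two-level mapping concrete, the following minimal sketch illustrates it in Python (the prototype itself is a FreeBSD kernel packet filter); the table contents, addresses, and function names here are hypothetical, not the prototype's interface.

```python
import hashlib

NUM_LOGICAL_SITES = 256   # many logical sites per physical server; sets rebalancing granularity

# Routing table installed at proxy initialization: logical site ID -> physical server (addr, port).
# This is soft state: the proxy may discard it and reload it after a failure.
ROUTING_TABLE = {site: ("10.0.0.%d" % (1 + site % 4), 2049)
                 for site in range(NUM_LOGICAL_SITES)}

def logical_site(*fields: bytes) -> int:
    """Combine one or more request fields (e.g., the fileID) into a logical server site ID."""
    digest = hashlib.md5(b"".join(fields)).digest()
    return digest[0] % NUM_LOGICAL_SITES    # a few hash bits index the table

def route(*fields: bytes):
    """Return the (address, port) of the physical server that should handle this request."""
    return ROUTING_TABLE[logical_site(*fields)]

# Example: route a namespace request by the fileID extracted from the fhandle.
print(route(b"fileID:4711"))
```

Because only the table changes on reconfiguration, the hash function itself never needs to change.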
The proxy examines up to four fields of each request, depending on the particular policy configured:

- Request type. Routing policies are keyed by the NFS request type, so the proxy may employ different policies for different functions. Table 1 lists the important NFS request groupings discussed in this paper.

- File handle. Each NFS request applies to a specific file or directory, named by a unique identifier called a file handle (or fhandle), which is passed as an argument to the operation. Although the NFS fhandles are opaque to the client, their structure is known to the proxy, which acts as an extension of the service. For example, directory servers place an integer fileID field or type fields in each fhandle, which the proxies extract for routing.

- Read/write offset. NFS I/O operations specify the range of offsets covered by each read and write. When small-file servers are configured, the prototype's routing policy defines a fixed threshold offset (e.g., 64KB); the proxy directs I/O requests below the threshold to a small-file server selected from the request fhandle. I/O requests beyond the threshold are striped across multiple storage nodes using a static policy or per-file block maps.
Namespace Operations:
  lookup(dir, name) returns (file handle, attributes). Retrieve a directory entry.
  create(dir, name) / mkdir(dir, name) returns (file handle, attributes). Create a file/directory and update the parent entry/link count and modify timestamp.
  remove(dir, name) / rmdir(dir, name). Remove a file/directory or hard link and update the parent entry/link count and modify timestamp.
  link(olddir, oldname, newdir, newname) returns (file handle, attributes). Create a new name for an existing file, updating the file's link count and modify timestamp, and update the new parent's link count and modify timestamp.
  rename(olddir, oldname, newdir, newname) returns (file handle, attributes). Rename an existing file or hard link, and update the link count and modify timestamp in both the old and new parent.

Attribute Operations:
  getattr(object) returns (attributes). Retrieve the attributes of a file or directory.
  setattr(object, attributes). Modify the attributes of a file or directory, and update its modify timestamp.

I/O Operations:
  read(file, offset, len) returns (data, attributes). Read data from a file, updating its access timestamp.
  write(file, offset, len, data) returns (attributes). Write data to a file, updating its modify timestamp.

Directory Retrieval:
  readdir(dir, cookie) returns (entries, cookie). Read some or all of the entries in a directory.

Table 1: Some of the Sun Network File System (NFS) protocol operations and their meanings.
- Name component. NFS name space operations include a symbolic name component in their arguments (see Table 1). A key challenge for scaling file management is to attain a balanced distribution of these requests across a collection of directory servers; this is particularly important for metadata-intensive workloads with small files and heavy create/lookup/delete activity, as often occurs in Internet server environments for mail, news, message boards, and Web access. Section 3.2 outlines two policies that select the server by applying a hash function to the name.

We now outline the functioning of the proxy's routing policies for specific request groups.
3.1 Block I/O

Request routing for read/write requests has two goals: separate small-file read/write traffic from bulk I/O, and decluster the blocks of large files across the storage nodes for the desired access properties (e.g., high bandwidth or a specified level of redundancy). We address each in turn.

Our approach to separating small-file access traffic is to redirect all I/O below the fixed threshold offset to a small-file server. The threshold offset is necessary because the size of each file may change at any time, and in any case is not known to the proxy. Thus the small-file servers are also polluted by some subset of the I/O requests on large files, i.e., they receive all I/Os that fall below the threshold, even if the target file is large. Moreover, accesses to large files below the threshold offset are limited by the bandwidth of the small-file server. In practice, large files have little impact on the small-file servers because there is typically a relatively small number of these files, each consuming at most threshold bytes. Imbalances can be resolved by adding servers. Similarly, any performance cost for large files affects only the first threshold bytes, and becomes progressively less significant as the file grows.
The proxy redirects I/O traffic above the threshold directly to the network storage array, according to a placement policy that selects the storage site(s) for each block. A simple option is to employ static placement functions that compute on the block offset and/or fileID. More flexible placement policies would allow the proxy to select write sites considering other factors, e.g., load conditions on the network or storage nodes, or file attributes encoded in the fileID. As an example of such an attribute-based policy, Slice supports a mirrored striping policy that replicates each block of a mirrored file on multiple storage nodes, to tolerate failures up to the replication degree. Mirroring consumes more storage and network bandwidth than striping with parity, but it is simple and reliable, avoids the overhead of computing and updating parity, and allows load-balanced reads [5, 17].
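The sketch below illustrates how the threshold rule and a static striping function with optional mirroring might fit together; the server names, threshold, and placement function are assumptions for illustration, not the prototype's policy code.

```python
THRESHOLD = 64 * 1024      # I/O at offsets below this goes to a small-file server
BLOCK_SIZE = 8 * 1024      # stripe unit
STORAGE_NODES = ["store0", "store1", "store2", "store3"]
SMALL_FILE_SERVERS = ["small0", "small1"]

def route_io(file_id: int, offset: int, mirrored: bool = False):
    """Pick the server(s) for a read/write request under a static placement policy."""
    if offset < THRESHOLD:
        # Small-file traffic (and the first THRESHOLD bytes of large files).
        return [SMALL_FILE_SERVERS[file_id % len(SMALL_FILE_SERVERS)]]
    # Bulk I/O: a static function of the fileID and block offset selects the storage node.
    block = offset // BLOCK_SIZE
    primary = (file_id + block) % len(STORAGE_NODES)
    if not mirrored:
        return [STORAGE_NODES[primary]]
    # Mirrored striping: replicate the block on the next storage node as well.
    return [STORAGE_NODES[primary], STORAGE_NODES[(primary + 1) % len(STORAGE_NODES)]]

print(route_io(file_id=42, offset=4096))                     # below threshold: small-file server
print(route_io(file_id=42, offset=1 << 20, mirrored=True))   # bulk I/O, two replicas
```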
To generalize to more flexible placement policies, Slice records block locations in per-file block maps stored by map coordinator servers. The proxies interact with the coordinators to fetch fragments of the block maps as they handle I/O operations on files. A Slice configuration may include any number of coordinators, each managing a subset of the files, e.g., selected from the target fileID. The file managers could subsume the functions of the coordinators, but they are logically separate.
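For illustration, a per-file block map could be organized as in the following minimal sketch; the data layout and function names are assumed here, not taken from the prototype.

```python
from typing import Dict, List, Tuple

# fileID -> {logical block number -> list of (storage node, object ID) replicas}
BlockMap = Dict[int, List[Tuple[str, int]]]

block_maps: Dict[int, BlockMap] = {}   # held by a map coordinator, keyed by fileID

def lookup_block(file_id: int, block: int) -> List[Tuple[str, int]]:
    """Return the storage sites holding this block; a proxy would cache fetched fragments."""
    return block_maps.setdefault(file_id, {}).setdefault(block, [])

def record_write(file_id: int, block: int, sites: List[Tuple[str, int]]) -> None:
    """The coordinator records the placement chosen for a newly written block."""
    block_maps.setdefault(file_id, {})[block] = sites

record_write(7, 0, [("store2", 1001), ("store3", 1001)])   # a mirrored block
print(lookup_block(7, 0))
```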
3.2 Name Space Operations

Attaining a balanced distribution of name space requests presents different challenges from I/O request routing. Directory processing involves more computation, and name entries may benefit more from caching because they tend to be relatively small and fragmented. Moreover, directories are frequently shared. Directory servers act as a synchronization point to preserve integrity of the name space, e.g., to prevent two clients from concurrently creating a file with the same name in the same directory, or removing a directory while a name create is in progress in the directory.

A common approach to scaling the file manager or directory service is to manually partition the name space into a set of volumes, each managed by a single server. Unfortunately, a fixed volume partitioning strategy develops imbalances as load on the individual volumes grows at different rates. Rebalancing requires administrative intervention to repartition the name space among volumes, which may be visible to users through changes to the names of existing directories.
We propose a namespace routing policy called mkdir switching to automatically produce an even distribution of directories across directory servers. By default, the proxies route namespace operations, including mkdir, to a directory server selected by indexing a routing table with bits from the fileID in the fhandle. For mkdir switching, the proxy decides with probability p to redirect any given mkdir request to a different directory server, selected by hashing on the parent fhandle and the name of the new directory. The selected site handles the new directory and its descendants, except any subdirectories that happen to route to a different site. Directory servers must coordinate to correctly handle name entries that cross server boundaries, similar to mount points maintained by clients in the volume partitioning approach. Section 4.2 outlines the directory server implementation in Slice. The hashing simplifies the implementation because races over name manipulation involve at most two sites.
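A minimal sketch of the mkdir switching decision follows; the default policy, server names, and value of p are hypothetical placeholders, not the prototype's configuration.

```python
import hashlib
import random

DIRECTORY_SERVERS = ["dir0", "dir1", "dir2", "dir3"]
P_SWITCH = 0.25    # probability p of redirecting a mkdir away from its parent's server (assumed)

def default_server(file_id: int) -> str:
    """Default policy: route by bits of the fileID in the fhandle."""
    return DIRECTORY_SERVERS[file_id % len(DIRECTORY_SERVERS)]

def mkdir_server(parent_fhandle: bytes, name: str, parent_file_id: int) -> str:
    """With probability p, place the new directory by hashing (parent fhandle, name)."""
    if random.random() < P_SWITCH:
        digest = hashlib.md5(parent_fhandle + name.encode()).digest()
        return DIRECTORY_SERVERS[digest[0] % len(DIRECTORY_SERVERS)]
    return default_server(parent_file_id)

print(mkdir_server(b"fh:parent", "logs", parent_file_id=17))
```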
Mkdir switching yields balanced distributions when the average number of active directories is large relative to the number of directory server sites. However, very large directories are fixed to a single server. An alternative name hashing strategy routes all namespace operations to a site uniquely determined by a hash on the name component and its position in the directory, represented by the parent directory fhandle. This approach represents the entire volume name space as a unified global hash table distributed among the directory servers. It views directories as distributed collections of name entries, rather than files accessed as a unit. Conflicting operations on a particular name entry (e.g., create/create, create/remove, remove/lookup) are routed to the same server, where they synchronize on the shared hash chain. Operations on different entries in the same directory (e.g., create, remove, lookup) may proceed in parallel on multiple directory servers.
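Name hashing can be sketched as follows, again with hypothetical server names; the prototype keys its hash chains on an MD5 fingerprint, as described in Section 4.2.

```python
import hashlib

DIRECTORY_SERVERS = ["dir0", "dir1", "dir2", "dir3"]

def name_hash_server(parent_fhandle: bytes, name: str) -> str:
    """Route every namespace operation on (parent directory, name) to one server.

    Conflicting operations on the same entry (create/create, create/remove,
    remove/lookup) hash to the same site; operations on different entries in
    the same directory may land on different servers and proceed in parallel.
    """
    digest = hashlib.md5(parent_fhandle + b"/" + name.encode()).digest()
    return DIRECTORY_SERVERS[int.from_bytes(digest[:4], "big") % len(DIRECTORY_SERVERS)]

print(name_hash_server(b"fh:parent", "index.html"))
```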
Depending on the quality of the hash function, name hashing yields probabilistically balanced request distributions for name-intensive workloads, and it easily handles arbitrarily large directories. The cost of this policy is that it increases the complexity of peer-peer interactions among directory servers to manage the distributed directory tree, and for reconfiguration and recovery. Also, it requires sufficient memory to keep the hash chains memory-resident, since the hashing function destroys locality in the hash chain accesses. Finally, readdir operations span multiple servers; this is the right behavior for large directories, but increases readdir costs for small directories.
3.3 Request Routing and Service Structure

Request routing policies impact the structure of the servers that make up the ensemble. The primary challenges of this architecture are coordination and recovery to maintain a consistent view of the volume's data across all servers, and allowing for reconfiguration to add or remove servers within each class.

Most of the routing policies outlined above are independent of whether small files and name entries are bound to the server sites on which they are created. One option is for the servers to share their backing objects and coordinate through a locking protocol, as in shared disk systems such as Frangipani (see Section 2.5); in this case, the purpose of the proxy redirection is to improve locality in the request stream to each server. Alternatively, the system may use fixed placement in which items are fixed to their create sites, unless reconfiguration or failover causes them to move. Fixed placement stresses the role of the request routing policy in data placement, through its routing choices for requests that create new name entries or data items. It also simplifies some aspects of coordination and recovery because the managers do not concurrently share storage blocks; however, it introduces a need for atomic updates to state distributed among the managers. Name hashing requires fixed placement unless the directory servers support fine-grained distributed caching.
3.3.1 Reconfiguration

Consider the problem of reconfiguration to add or remove file managers, i.e., directory servers, small-file servers, or map coordinators. Most requests involving these servers are routed by the fileID in the request fhandle, including name space operations other than redirected mkdirs in the mkdir switching policy. If servers join or depart the ensemble, the binding from fileIDs to physical servers must change. The NFS specification prohibits the system from modifying any fhandle fields once the fhandle has been minted. It is sufficient to hash the fileID to produce a logical server site ID, then consult a routing table in the proxy to map the logical server ID to a physical server. It is then possible to change the routing function by distributing updated routing tables to change the mapping. This approach was used in xFS to partition file block maps across an xFS cluster [4]. Note that the logical servers define the minimal granularity for rebalancing; the granularity may be adjusted by changing the number of bits extracted from the logical server ID to index the routing table.
This approach generalizes to policies in which the logical server ID is derived from a hash that includes other request arguments, e.g., the name hashing approach. For name hashing systems and other systems with fixed placement, the reconfiguration procedure must move logical servers from one physical server to another. One approach is for each physical server to use multiple backing objects, one for each logical server site that it hosts, and reconfigure by reassigning the binding of physical servers to their backing objects in the storage array. Otherwise, reconfiguration must copy data from one backing object to another. In general, it is necessary to move 1/Nth of the data to rebalance after adding or removing a physical server.
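The following sketch illustrates the point that only the reassigned logical sites, roughly 1/N of them, move when a server is added; the assignment scheme here is a simplified stand-in, since reconfiguration is not implemented in the prototype (see Section 4).

```python
NUM_LOGICAL_SITES = 240   # many logical sites per physical server; sets rebalancing granularity

def build_initial(servers):
    """Initial table: logical sites assigned round-robin to the starting servers."""
    return {site: servers[site % len(servers)] for site in range(NUM_LOGICAL_SITES)}

def add_server(table, new_server):
    """Reassign roughly 1/N of the logical sites to the new server; only those sites move."""
    n_after = len(set(table.values())) + 1
    new_table = dict(table)
    moved = list(range(0, NUM_LOGICAL_SITES, n_after))   # every Nth logical site
    for site in moved:
        new_table[site] = new_server
    return new_table, moved

table = build_initial(["dir0", "dir1", "dir2"])
table, moved = add_server(table, "dir3")
print(len(moved), "of", NUM_LOGICAL_SITES, "logical sites (and their backing objects) move")
```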
3.3.2 Atomicity and Recovery

Request routing across multiple servers naturally implies that the servers must manage distributed state. File system structures have strong integrity requirements and frequent updates. This state must be kept consistent through failures and conflicting concurrent operations.

As previously stated, file managers may arrange for recovery by generating a write-ahead operation log in shared storage. For systems that use the shared-disk model with no fixed placement, all operations execute at a single manager site, and the system must provide locking and recovery procedures to protect and recover the state of the shared blocks [27].

For systems with fixed placement, some operations must update state at multiple sites through a peer-peer protocol. The system can synchronize these operations locally at the points where shared data resides (e.g., the shared hash chains in the name hashing policy). Thus there is no need for distributed locking or distributed recovery of individual blocks, but logging and recovery must be coordinated across sites, e.g., using two-phase commit.

For mkdir switching, the only operations that involve updates to data on multiple directory servers are those that involve the "orphaned" directories that were placed on a different site from their parents. These operations include the rerouted mkdirs themselves, associated rmdirs, and any rename operations that might happen to involve the orphaned entries. Since these operations are relatively rare, as determined by the redirection probability parameter p, it is acceptable to perform a full two-phase commit as needed to guarantee the atomicity of these operations on systems with fixed placement. However, for the name hashing policy, the majority of name space update operations involve state on multiple sites. While it is possible to limit the two-phase commit cost by logging asynchronously and coordinating rollback, this approach would provide weaker failure properties than are normally accepted for file systems. Unless this problem is solved, name hashing may be best suited to file systems that are read-mostly or that store primarily soft state, e.g., scratch files or Internet server caches.
Shared network storage arrays also present interesting atomicity and recovery challenges. In particular, remove and truncate must reclaim storage from multiple sites, under the direction of the map coordinator. The coordinator must maintain accurate maps, allocate new map fragments for extending write operations, and support failure-atomic object operations, including mirrored striping, truncate/remove, and NFS V3 write commitment (commit). The primary challenge is that there is no single serialization point for writes to the shared array. Amiri et al. [1] have recognized this problem and addressed the atomicity issues; their solution handles storage node failures, but not host failures.

The Slice map coordinators use an intention logging protocol in the coordinator to preserve atomicity. The protocol is fully implemented and is the subject of a separate paper [2]. The basic protocol is as follows. At the start of the operation, the proxy sends to the coordinator an intention to perform the operation. The coordinator logs the intention in stable memory. When the operation is complete, the proxy notifies the coordinator with a completion message, asynchronously clearing the intention. If the coordinator does not receive the completion within a specified timeout, it probes one or more participants to determine if the operation completed, and initiates recovery if necessary. A failed coordinator recovers by scanning its intentions log, completing or aborting operations in progress at the time of the failure.
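A minimal sketch of the coordinator side of this protocol follows; stable logging, probing, and the actual recovery actions are simplified away, and the names are hypothetical rather than the prototype's.

```python
import time

class MapCoordinator:
    """Sketch of the intention-logging idea; timeouts and recovery are simplified."""

    def __init__(self, timeout=5.0):
        self.intentions = {}      # op_id -> (description, deadline); kept in stable memory
        self.timeout = timeout

    def log_intention(self, op_id, description):
        """Proxy declares an operation (e.g., truncate, remove, mirrored write) before acting."""
        self.intentions[op_id] = (description, time.time() + self.timeout)

    def complete(self, op_id):
        """A completion message from the proxy clears the intention asynchronously."""
        self.intentions.pop(op_id, None)

    def check_timeouts(self, probe):
        """For overdue intentions, probe a participant and finish or roll back the operation."""
        now = time.time()
        for op_id, (description, deadline) in list(self.intentions.items()):
            if now >= deadline:
                if probe(op_id):                 # did the operation complete at the storage nodes?
                    self.complete(op_id)
                else:
                    print("recovering (abort or redo):", description)
                    self.complete(op_id)
```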
In practice, this protocol admits optimizations that eliminate message exchanges and log writes from the critical path of most common-case operations, by piggybacking messages, leveraging the NFS V3 commit semantics, and amortizing intention logging costs across multiple operations in ways that trade off recovery time for faster execution. The details of this protocol are beyond the scope of this paper, but our performance results fully reflect its costs.
4 Implementation

We have prototyped the key elements of the Slice architecture as a set of loadable kernel modules for the FreeBSD operating system. The prototype includes a proxy implemented as a standard packet filter, and kernel modules for the basic server classes: small-file server, directory server, block storage server, and block map coordinator. Our directory server implementations use fixed placement and support both the name hashing and mkdir switching policies.

The storage servers use a stackable file system layer that manages multiple disks, each containing a Fast File System (FFS) volume with a collection of inodes for storing data. The key operations in the storage node interface are essentially a subset of NFS, including read, write, commit, truncate (setattr), and remove. The storage nodes use opaque NFS file handles as object identifiers, and include an external hash to map opaque NFS file handles to local inodes. Our current prototype uses static striping across a static set of storage nodes.
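An in-memory sketch of such an object interface is shown below; it is illustrative only, since the real storage nodes back objects with FFS inodes on stable storage.

```python
class StorageNode:
    """Sketch of an object storage interface: a subset of NFS keyed by opaque handles."""

    def __init__(self):
        self.objects = {}                      # external hash: opaque handle -> bytes

    def write(self, handle, offset, data):
        buf = bytearray(self.objects.get(handle, b""))
        buf[len(buf):offset] = b"\0" * max(0, offset - len(buf))   # grow with a hole if needed
        buf[offset:offset + len(data)] = data
        self.objects[handle] = bytes(buf)

    def read(self, handle, offset, length):
        return self.objects.get(handle, b"")[offset:offset + length]

    def commit(self, handle):
        pass   # flush to stable storage; elided in this in-memory sketch

    def truncate(self, handle, length):
        self.objects[handle] = self.objects.get(handle, b"")[:length]

    def remove(self, handle):
        self.objects.pop(handle, None)
```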
The small-file servers and directory servers store their state in a collection of large backing objects, each of which is an NFS file backed directly by the block I/O service. Like the clients, these servers are configured with local proxies that govern the distribution of their I/O to the storage nodes, using maps supplied by a coordinator. The coordinator code is implemented as an extension to the storage node module, backing its intentions log and block maps on local disk. A more failure-resilient implementation might back the block maps with redundancy across multiple storage nodes, using a static placement function.

All communication in our prototype is carried by UDP/IP. The experiments reported in this paper use Trapeze/Myrinet network interfaces [8]. We have modified the NFS stacks to avoid data copies on the client and server for NFS write requests and read responses, using support for zero-copy TCP and UDP payloads in Trapeze and its driver. However, the Slice prototype runs over any network interface that supports IP; we debugged the system using Ethernet interfaces.

Recovery functionality is partially implemented in our prototype. The map coordinator's intention logging for failure-atomic large-file operations is fully implemented. Similarly, the small-file servers are fully compliant with the NFS V3 commit specification for write operations below the threshold offset. Directory servers generate a log entry for each operation that modifies critical metadata, but the recovery procedure is not implemented. We have not implemented reconfiguration in our prototype.
4.1 The proxy Packet Filter

We implemented the Slice proxy as a loadable packet filter module that intercepts and redirects packets exchanged with a registered Slice NFS V3 virtual server endpoint. When loaded, the proxy inserts itself into a low level of the IP stack, by rewriting IP packet handling functions to point to its own functions, and chaining the existing functions for network traffic that does not match its intercept profile. The Slice proxy implements the routing policies described earlier for I/O redirection, mkdir switching, and name hashing using MD5. It may be configured on the client host or within an IP firewall node on the network path between the clients and servers. We have validated the firewall configuration using unmodified Solaris clients.
The proxy is a nonblocking state machine with soft state consisting primarily of routing tables. To support the full NFS protocol, the proxy also maintains records for some pending requests, and a cache over file attributes returned in NFS responses. Although the NFS V3 specification does not require it, some NFS clients expect a full set of file attributes in each read and write response. The proxy caches these attributes, and updates the modify time and size fields as each write completes. The proxy may demand-fetch or write back file attributes using NFS getattr and setattr operations to the directory servers.
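For example, the size and modify-time update on a write reply might look like this minimal sketch; the field names and cache structure are assumed for illustration.

```python
import time

attr_cache = {}   # fhandle -> {"size": int, "mtime": float} cached from recent NFS responses

def on_write_reply(fhandle, offset, length):
    """Refresh cached size and modify time so read/write replies can carry fresh attributes."""
    attrs = attr_cache.setdefault(fhandle, {"size": 0, "mtime": 0.0})
    attrs["size"] = max(attrs["size"], offset + length)
    attrs["mtime"] = time.time()
    return attrs
```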
The proxy and other elements of our prototype provide no support for concurrent write sharing. However, our proposal is compatible with NFS file leasing extensions for consistent sharing, as defined in NQ-NFS [18] and included in a recent IETF draft for NFS V4.
Figure 2: Small-file server data structures.
4.2 Directory Servers

The Slice prototype directory servers store arbitrary collections of name entries indexed by hash chains, keyed by an MD5 [24] hash fingerprint in each fhandle. This makes it possible to support both the name hashing and mkdir switching policies easily within the same code base. We determined empirically that MD5 yields a combination of balanced distribution and low cost that is superior to competing hash functions with implementations available to us.

Each directory server's data is stored as a web of linked fixed-size cells representing naming entries and file attributes. Attribute cells may include a remote fhandle to reference an entry on another server, enabling cross-site links in the directory structure. Directory servers use a simple peer-peer protocol to update link counts for create/remove operations that cross sites, and to follow cross-site links for lookup, getattr, and readdir. The proxy interacts with a single directory site for each request.
4.3 Small-file Servers

The small-file server is implemented by a Smallfile module that manages each file as a sequence of 8KB logical blocks. Figure 2 illustrates the key Smallfile data structures and their use for a read or write request. The locations for each block are given by a per-file map record. The server accesses this record by using the fileID in the request fhandle to key a map descriptor array on disk. Each map record gives a fixed number of offset/length pairs mapping 8KB file extents to regions within the data file.
This structure allows efficient space allocation and supports file growth. Each logical block may have less than the full 8KB of physical space allocated. For example, an 8300-byte file would consume only 8320 bytes of physical storage space: 8192 bytes for the first block, and 128 bytes for the remaining 108 bytes (rounded up to the next power of two for space management). New files or writes to previously empty segments are allocated space according to best fit, or if no appropriately sized fragment is free, a new region is allocated at the end of the data file. Under a create-heavy workload, this policy lays out data within the data file similarly to LFS [25], batching newly created files into a single stream for efficient disk writes. The best-fit and variable-sized fragment approach is similar to SquidMLA [19].
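A small sketch of this space accounting follows; the 8 KB block size and power-of-two fragment rounding follow the text, while the function names are illustrative.

```python
BLOCK_SIZE = 8192

def fragment_size(n: int) -> int:
    """Round a partial last block up to the next power of two for space management."""
    size = 1
    while size < n:
        size *= 2
    return size

def physical_space(file_length: int) -> int:
    """Physical bytes consumed: full 8 KB blocks plus a power-of-two fragment for the tail."""
    full_blocks, tail = divmod(file_length, BLOCK_SIZE)
    return full_blocks * BLOCK_SIZE + (fragment_size(tail) if tail else 0)

print(physical_space(8300))   # 8192 + 128 = 8320, matching the example in the text
```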
All data from the Smallfile backing objects, including the map records, are cached within the server by the kernel buffer cache. This structure performs well if file accesses and the assignment of fileIDs show good locality. In particular, if the directory servers assign fileIDs with good spatial locality, and if files created together are accessed together, then the cost of reading the Smallfile map records is amortized across multiple files whose records fit in a single block.
5 Performance

This section presents Slice performance results. The intent is to show Slice scaling and performance. These experiments were performed on a Myrinet cluster of 450 MHz Pentium-III PCs with 512 MB main memory using the Asus P2B motherboard with an Intel 440BX chipset. The NFS server and Slice storage nodes use four IBM DeskStar 22GXP drives on separate Promise Ultra/33 IDE channels. All machines are equipped with Myricom LANai 7 adapters and run kernels built from the same FreeBSD 4.0 source pool. All network communication in these experiments uses NFS/UDP/IP over the Trapeze Myrinet firmware. In this configuration, Trapeze/Myrinet provides 110 MB/s of raw bandwidth with an 8KB transfer size.

First, we report results from the industry-standard SPECsfs97 benchmark suite for network-attached storage. This benchmark runs as a collection of user processes that generate a realistic mix of NFS V3 requests, check the responses against the NFS standard, and measure latency and delivered throughput. Because SPECsfs is designed to benchmark servers but not clients, it creates its own sockets, sending and receiving NFS packets without exercising any client NFS implementation. SPECsfs is a demanding, industrial-strength, self-scaling benchmark. We show results to: (1) demonstrate that the prototype is fully functional, (2) show that the Slice proxy architecture supports a scalable implementation that is compatible with NFS V3 and independent of any client NFS implementation, and (3) provide a basis for judging the performance of the current prototype against industrial-grade servers.
!"#$%&'
!"#$%& !"#$%& ()! &*+, "()! &.//.0
12 %. "
Figure 3: SPECsfs97 saturation.
Figure 3 shows throughput at saturation. The x-axis gives the SPECsfs offered load in requests per second. The y-axis gives delivered throughput of the Slice ensemble, also given in operations per second.

As a baseline, Figure 3 includes results from a single FreeBSD 4.0 NFS server (including our modifications for zero-copy I/O over Trapeze/Myrinet) exporting its four-disk array as a single volume (using CCD) and as separate volumes. This server saturates at 850 operations per second.

Because saturation in this experiment is determined primarily by the number of disk arms, the Slice configurations use a single directory server, two small-file servers, and a varying number of storage nodes. Slice-1 uses one storage node for a total of 4 disks, the same as the NFS configurations. Slice-3 and Slice-6 use 3 and 6 storage nodes, for a total of 12 and 24 disks respectively.
Slice-1 has a higher saturation throughput than either NFS configuration due to faster metadata operations; however, the experiment quickly becomes disk-arm limited. The SPECsfs file set is skewed heavily toward small files: 94% of files are 64 KB or less, though small files account for only 24% of the total bytes accessed. Despite this fact, SPECsfs I/O requests are almost entirely to small files; the large files serve to "pollute" the disks. Increasing the number of storage nodes (disk arms) has a significant effect on both the sustainable request rate and the average latency. With six storage nodes the Slice ensemble saturates at 2500 ops/s with an average latency around 8 ms.
Figure 4: SPECsfs97 latency. The curves coil backwards as servers saturate.
Figure 4 shows average operation latency for the same experiment, as a function of delivered throughput on the x-axis. The curves coil back as the Slice ensemble saturates, delivering lower throughput with higher average latency.

For comparison, Figure 4 includes lines for two recent commercial servers reported at spec.org. The first, the IBM RS/6000 Model 44P-170 with 1 GB cache, uses 48 10000-RPM SCSI disks serving 42 independent file systems. The second, the EMC Celerra File Server Cluster Model 506, has 34 Seagate Cheetah disks and 4 GB of disk cache. We chose the IBM line because it saturates at about the same point as Slice-6, and the EMC line because it most closely resembles the Slice-6 disk and memory resources. Both configurations have more disk arms, and deliver better average latency than our 24-disk Slice-6 cluster with 1 GB of cache on the file managers. Pricing information is not disclosed as part of the reported SPECsfs results.
Figure 5: Metadata-intensive workload.
Figure 5 shows Slice scaling for a metadata-intensive workload. For this experiment, name hashing and mkdir switching yield equivalent performance; we used p = 1 - (N - 1)/N, where N is the number of servers. In this test, an increasing number of client processes running on five client nodes untar a directory tree modeled on the entire FreeBSD distribution, consisting of 36,000 zero-length files, with no read or write activity. This synthetic benchmark is intended to stress name space operations. Each file create requires seven NFS operations: lookup, access, create, getattr, lookup, setattr, setattr, for a total of about 25,000 NFS operations per client. The y-axis shows the average total latency perceived by each client process for the number of processes shown on the x-axis. Client CPUs saturate at five processes.

This graph shows that the Slice directory service scales as more directory servers are added. Slice-1 measures one directory server, Slice-2 measures two, and so on. For comparison, the N-UFS line measures a FreeBSD NFS server exporting a soft-updates [20] enabled UFS file system over an IBM DeskStar 22GXP drive. The N-MFS line measures an NFS server exporting a memory-based file system (MFS). MFS initially performs better than Slice because Slice logs some operations; however, MFS does not scale. A Slice directory server saturates at 6000 ops/s, with the untar benchmark generating about 0.5 MB/s of log traffic.
                  single client   saturation
  read             66.8 MB/s      180.4 MB/s
  write            45.0 MB/s      129.8 MB/s
  read-mirrored    46.9 MB/s      117.4 MB/s
  write-mirrored   22.6 MB/s       97.7 MB/s

Table 2: Bulk I/O performance.
Table 2 shows raw read and write bandwidth through a FreeBSD NFS client coupled with a proxy, over Trapeze/Myrinet with six storage nodes. Each test reads or writes a 1.25 GB file, using a read-ahead and write-behind blocksize of 32 KB and a prefetching depth of four as NFS mount options. We have also modified the FreeBSD NFS client for zero-copy reading.

The two columns present I/O rates delivered to a single client and the aggregate bandwidth to four clients, saturating the storage nodes. Non-redundant as well as mirrored bandwidths are presented. A single client reads a non-mirrored file at 67 MB/s with 30% CPU utilization. The same client writes at only 45 MB/s, saturating the client CPU. At saturation, the six storage nodes with 24 disks altogether provide 180 MB/s of non-mirrored read bandwidth, or about 7.5 MB/s per disk, and 130 MB/s of non-mirrored write bandwidth, or 5.4 MB/s per disk. Storage nodes prefetch sequential files up to 256 KB beyond the current access; write clustering, however, is not as aggressive.
Mirroring degrades single-client read performance somewhat because storage nodes waste prefetching bandwidth loading data that the client chooses to read from another node. A mirrored writer must push twice as many bytes out its network interface and achieves half the bandwidth of a non-mirrored writer. Mirrored read saturation is lower for the same reason as in the single-client case; mirrored writes additionally consume extra transfer time and space overheads within the storage nodes.

Figure 6 shows the effect of varying the p parameter for mkdir switching in the metadata-intensive untar workload. The x-axis measures directory affinity, the probability that a new directory is placed on the same server as its parent. The y-axis shows the corresponding average client latency.
Figure 6: Metadata placement.
This test uses four client nodes hosting one, four, or sixteen client processes against four directory servers. One client process serializes operations within the client. Even with an affinity of 100%, placing all entries on the same server, single-client time does not change. However, performance degrades with multiple clients and high affinity. As directory affinity increases, average time first decreases slightly as the number of cross-server operations declines. As affinity continues to increase, the directory structure stays within a single server and becomes subject to load imbalances. We omit results for name hashing because its latency is unaffected by directory size or a p parameter.
  Operation                CPU
  Packet interception      0.7%
  Parsing                  4.1%
  Redirection/rewriting    0.5%
  Logic/overhead           0.8%

Table 3: proxy overheads for 6250 packets per second.
Table 3 summarizes the proxy overheads for the untar experiment, measured on a 500 MHz Compaq 21264 running FreeBSD/Alpha with iprobe (Instruction Probe), an on-line profiling tool. The workload sends and receives a mixed set of NFS packet types at a rate of 3125 request/response pairs per second with a total of 40% CPU utilization, of which 6% is due to proxy overheads.

The proxy packet interception costs 0.7% of the absolute CPU utilization. An additional 4.1% is spent to partially parse the packets into an intermediate representation used for redirection and/or request/response rewriting. Redirection replaces the packet destination and/or ports, using a differential checksum update instead of scanning the whole packet. Redirection accounts for another 0.5% of the CPU time. Finally, the redirection logic and soft-state management consume another 0.8%.
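The differential update follows the standard incremental Internet checksum adjustment (RFC 1624 gives the HC' = ~(~HC + ~m + m') form); a minimal sketch for a single 16-bit field follows, with a 32-bit address handled as two such fields. This illustrates the technique in general, not the prototype's code.

```python
def ones_complement_add(a: int, b: int) -> int:
    """16-bit one's complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def update_checksum(old_checksum: int, old_field: int, new_field: int) -> int:
    """Incrementally update a 16-bit Internet checksum when one 16-bit field changes.

    The packet body is never rescanned; only the old and new field values are needed.
    """
    sum_ = ones_complement_add(~old_checksum & 0xFFFF, ~old_field & 0xFFFF)
    sum_ = ones_complement_add(sum_, new_field & 0xFFFF)
    return ~sum_ & 0xFFFF
```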
Parsing is by far the most expensive aspect of proxy operation. Given the offsets to certain key fields in an NFS request or response, the proxy can quickly identify and redirect the packet; however, both the SunRPC and NFS version 3 protocol specifications have variable-length fields. Nearly half of the parsing overhead stems from reading lengths and skipping past otherwise ignored fields. Adding an RPC header-length field or fixing (and padding) the number of groups in a SunRPC header would eliminate about an eighth of our total parsing overhead. Likewise, fixing the size of an NFS file handle would expedite the handling of most messages.
6 Conclusion

This paper presents Slice, a new storage system architecture for high-speed LANs incorporating network-attached block storage. Slice interposes a request switching filter, called a proxy, along the network path between the client and the network storage system. The proxy virtualizes a client/server file access protocol, applying configurable request routing policies to distribute requests across an array of storage nodes and specialized servers that cooperate to provide a scalable file service.

This approach realizes the benefits of network-attached storage devices, scalable performance and direct device access for high-volume I/O, while preserving compatibility with standard file system clients. In addition, the proxy may incorporate routing policies for scalable file management. We presented and evaluated two such policies, mkdir switching and name hashing.

We have prototyped all elements of the Slice architecture within the FreeBSD kernel. The prototype proxy runs as a packet filter, with state requirements that show the feasibility of incorporating it into network elements. The Slice prototype delivers network-speed file access bandwidth and competitive performance on an industry-standard NFS benchmark, with evidence of scalability. We used name-intensive synthetic benchmarks to evaluate the name space policies, and to show that name space operations are scalable using these policies.

In addition to demonstrating a proof of concept, we explored the interaction of request routing policies with the storage service structure, including server cooperation protocols, reconfiguration, and recovery.
References

[1] Khalil Amiri, Garth Gibson, and Richard Golding. Highly concurrent shared storage. In Proceedings of the International Conference on Distributed Computing Systems, April 2000.

[2] Darrell Anderson and Jeff Chase. Failure-atomic file access in an interposed network storage system. Technical report.

[3] David Anderson. Object Based Storage Devices: A command set proposal. Technical report, National Storage Industry Consortium, October 1999.

[4] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless network file systems. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 109-126, December 1995.

[5] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David A. Patterson, and Katherine Yelick. Cluster I/O with River: Making the fast case common. In I/O in Parallel and Distributed Systems, May 1999.

[6] A. D. Birrell and R. M. Needham. A universal file server. IEEE Transactions on Software Engineering, SE-6(5):450-453, September 1980.

[7] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405-436, Fall 1991.

[8] Jeffrey S. Chase, Darrell C. Anderson, Andrew J. Gallatin, Alvin R. Lebeck, and Kenneth G. Yocum. Network I/O with Trapeze. In 1999 Hot Interconnects Symposium, August 1999.

[9] Sailesh Chutani, Owen T. Anderson, Michael L. Kazar, Bruce W. Leverett, W. Anthony Mason, and Robert N. Sidebotham. The Episode file system. In USENIX Conference, pages 43-60, Winter 1992.

[10] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. File server scaling with network-attached secure disks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, volume 25(1) of Performance Evaluation Review, pages 272-284, New York, June 15-18, 1997. ACM Press.

[11] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[12] Robert B. Hagmann. A crash recovery scheme for a memory-resident database system. IEEE Transactions on Computers, C-35(9):839-843, September 1986.

[13] Graham Hamilton, Michael L. Powell, and James J. Mitchell. Subcontract: A flexible base for distributed programming. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 69-79, 1993.

[14] John H. Hartman and John K. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274-310, August 1995.

[15] John S. Heidemann and Gerald J. Popek. File-system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58-89, February 1994.

[16] Michael B. Jones. Interposition agents: Transparently interposing user code at the system interface. In Barbara Liskov, editor, Proceedings of the 14th Symposium on Operating Systems Principles, pages 80-93, New York, NY, USA, December 1993. ACM Press.

[17] Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, MA, October 1996.

[18] R. Macklem. Not quite NFS, soft cache consistency for NFS. In USENIX Association Conference Proceedings, pages 261-278, January 1994.

[19] Carlos Maltzahn, Kathy Richardson, and Dirk Grunwald. Reducing the disk I/O of web proxy server caches. In USENIX Annual Technical Conference, June 1999.

[20] Marshall Kirk McKusick and Gregory R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In FREENIX Track, USENIX Annual Technical Conference, pages 1-17, 1999.

[21] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX (revised July 27, 1983). Technical Report CSD-83-147, University of California, Berkeley, July 1983.

[22] Vivek S. Pai, Mohit Aron, Gaurav Banga, Michael Svendsen, Peter Druschel, Willy Zwaenepoel, and Erich Nahum. Locality-aware request distribution in cluster-based network servers. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[23] Kenneth Preslan and Matthew O'Keefe. The Global File System: A shared disk file system for *BSD and Linux. In 1999 USENIX Technical Conference (FREENIX track), June 1999.

[24] R. Rivest. RFC 1321: The MD5 message-digest algorithm, April 1992. ftp://ftp.internic.net/rfc/rfc1321.txt.

[25] Mendel Rosenblum and John Ousterhout. The LFS storage manager, 1990.

[26] Marc Shapiro. Structure and encapsulation in distributed systems: The proxy principle. In Proceedings of the Sixth International Conference on Distributed Computing Systems, May 1986.

[27] Chandu Thekkath, Tim Mann, and Ed Lee. Frangipani: A scalable distributed file system. In Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224-237, October 1997.

[28] Robbert van Renesse, Andrew Tanenbaum, and Annita Wilschut. The design of a high-performance file server. In The 9th International Conference on Distributed Computing Systems, pages 22-27, Newport Beach, CA, June 1989. IEEE.

[29] Geoff M. Voelker, Eric J. Anderson, Tracy Kimbrel, Michael J. Feeley, Jeffrey S. Chase, Anna R. Karlin, and Henry M. Levy. Implementing cooperative prefetching and caching in a globally-managed memory system. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98), June 1998.