CS-2000-05
Interposed Request Routing for Scalable Network Storage
Darrell Anderson, Jeff Chase, Amin Vahdat
Department of Computer Science, Duke University
Durham, North Carolina 27708-0129
May 2000

{anderson,chase,vahdat}@cs.duke.edu

This work is supported by the National Science Foundation (CCR-96-24857, CDA-95-12356, and EIA-9870724) and equipment grants from Intel Corporation and Myricom. Anderson is supported by a U.S. Department of Education GAANN fellowship.

Abstract

This paper presents Slice, a new storage system architecture for high-speed LANs incorporating network-attached block storage. Slice interposes a request switching filter, called a proxy, along the network path between the client and the network storage system (e.g., in a network adapter or switch). The purpose of the proxy is to route requests among a server ensemble that implements the file service. We present a prototype that uses this approach to virtualize the standard NFS file protocol to provide scalable, high-bandwidth file service to ordinary NFS clients. The paper presents and justifies the architecture, proposes and evaluates several request routing policies realizable within the architecture, and explores the effects of these policies on service structure.

1 Introduction

Demand for storage is growing at roughly the same rate as Internet bandwidth. A prominent factor driving this storage growth is the return of server-based computing. Software is migrating from shrink-wrap distribution to a service model, in which applications are hosted in data centers and accessed through Internet sites. Additionally, consumer Internet service providers manage and distribute high-volume content such as mail, message boards, and FTP archives in large data centers.
At the same time, there is an increasing demand to store large amounts of data for scalable computing, multimedia, and visualization. This rapid growth means that a successful storage system architecture must scale to handle current demands, placing a large premium on the costs (including human costs) associated with upgrading. These properties are so important that many high-end storage servers are priced an order of magnitude or more beyond the cost of raw disk storage. Commercial systems increasingly provide scalable shared storage by interconnecting storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. Yet recent order-of-magnitude improvements in LAN performance have narrowed the bandwidth gap between SANs and LANs. This creates an opportunity to deliver competitive storage solutions by aggregating low-cost storage nodes and servers, using a general-purpose LAN as the storage backplane. In such a system it should be possible to incrementally scale either capacity or bandwidth of the shared storage resource by attaching more storage to the network, using a distributed software layer to unify storage resources into a file system view.

This paper introduces a network storage architecture, called Slice, that takes a unique approach to scaling storage services. Slice interposes a request switching filter (called a proxy) along the client's communication path to the storage service. The filter could reside in a programmable switch or network adapter, or it could reside in a self-contained module at the client's interface to the network.

[Figure 1: Combining functional decomposition and data decomposition in the Slice architecture. The figure shows a client's proxy routing name space requests to metadata servers by hashing, bulk I/O to a scalable block I/O service according to a striping policy, and small-file reads/writes to small-file servers according to a file placement policy.]
The role of the filter is to "wrap" a standard client/server file system framework, and redirect requests to a collection of storage elements and servers that cooperate to present a uniform view of a shared file volume with scalable bandwidth and capacity. File services are a good context to explore these techniques because they have well-defined protocols and semantics, established workloads, and a large installed base of clients and applications, many of which face significant scaling challenges today. The contributions of this paper are:

- Illustrate and evaluate the architecture with a prototype proxy implementation based on an IP packet filter. We quantify the performance and state requirements of the proxy.

- Present request routing approaches achievable within this model, including striping for high-bandwidth bulk I/O, and separating small-file I/O and distributing it across multiple file managers. We introduce mkdir switching and name hashing approaches that enable transparent scaling of the file system name space, including support for very large distributed directories. We also explore the effects of request routing policies on the structure of the file service.

- Evaluate the architecture and routing policies with a prototype that is compliant with Version 3 of the Network File System (NFS) protocol, the most popular protocol standard for commercial network-attached storage servers. We present performance results from SPEC SFS97, the standard commercial benchmark for network-attached storage.

The results demonstrate the scalability of the architecture, and show that it can provide performance competitive with commercial-grade file servers by aggregating low-cost components on a switched LAN. Moreover, the architecture allows for growth of the system by adding storage nodes and servers, meeting demands for applications with a variety of access characteristics.

This paper is organized as follows. Section 2 outlines the architecture and its motivation, and sets Slice in context with related work.
Section 3 discusses the role of the proxy, presents specific request routing mechanisms and policies, and explores their implications for file service structure, including recovery and reconfiguration. Section 4 describes our prototype, and Section 5 presents experimental performance results. Section 6 concludes.

2 Overview

The Slice file service consists of a collection of servers cooperating to serve a single, arbitrarily large virtual volume of files and directories. To a client, the ensemble appears as a single file server at some virtual network address. The proxy intercepts packets and performs simple functions to route requests and to represent the ensemble as a unified file service to its client. Figure 1 depicts the structure of a Slice ensemble. The figure illustrates a client's request stream partitioned into three functional groups corresponding to the major file workload components: (1) high-volume I/O to large files, (2) I/O on small files, and (3) operations on the name space or file attributes. The proxy switches on the request type and arguments to redirect requests to a selected server responsible for handling that request class. Bulk I/O operations route directly to an array of storage nodes, which provide raw block access to storage objects. Other operations are distributed among file managers for small-file I/O or name space requests. This functional decomposition allows high-volume data flow to bypass the managers, while allowing specialization of the servers by workload component, tailoring the policies for disk layout, caching, and recovery. A single server node could combine the functions of multiple server classes; we separate them in order to highlight opportunities to benefit from distributing requests across many servers. For each request, the proxy selects a target server from the given class by switching on the identity of the target file, name entry, or block, possibly using a separate routing function for each request class.
Thus the routing function induces a data decomposition of the volume data across the ensemble, with the side effect of creating or caching data items on the selected managers. Ideally, the request routing scheme spreads the data and request workload in a balanced fashion across all servers. The routing function may adapt to system conditions, e.g., to use new server sites as they become available. This allows each workload component to scale independently by adding resources to its server class. The next three subsections outline the roles of the proxy, the storage nodes, and the file manager servers. We then discuss implications of the architecture and set it in context with related work.

2.1 The proxy

The proxy examines file service requests and responses, applying the routing functions to redirect requests. An overarching goal is to keep the proxy simple, small, and fast. The proxy may (1) rewrite the source address, destination address, or other fields of request or response packets, (2) maintain a bounded amount of soft state, and (3) initiate or absorb packets to or from the Slice ensemble. The proxy does not itself manage any state shared across clients, so it may reside on the client host or network interface, or in a network element close to the server. The proxy functions as a network element within the Internet architecture. Through the strict use of soft state, it is free to fail and discard its state and/or pending packets without compromising correctness. Higher-level end-to-end protocols (in this case NFS/RPC over UDP or TCP) retransmit packets as necessary to recover from drops in the proxy. Although the proxy resides "within the network", it acts as an extension of the service. For example, since the proxy is a layer-5 protocol component, it must reside (logically) at one end of the connection or the other; e.g., it cannot reside in the "middle" of the connection, where end-to-end encryption could hide layer-5 protocol fields from the network elements.
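The proxy behavior described above (rewrite addresses, keep only bounded soft state, let end-to-end retransmission mask losses) can be sketched as follows. This is a simplified model under assumed names; a real proxy would operate on raw IP/UDP headers and RPC transaction IDs rather than Python dictionaries.

```python
# Sketch of the proxy's address rewriting with soft state only.
# All names are illustrative assumptions, not the prototype's code.

class Proxy:
    def __init__(self, virtual_addr, route_fn):
        self.virtual_addr = virtual_addr   # address clients believe they talk to
        self.route_fn = route_fn           # request -> physical server address
        self.pending = {}                  # soft state: (client, xid) -> server

    def on_request(self, pkt):
        # Redirect a request addressed to the virtual server.
        if pkt["dst"] != self.virtual_addr:
            return pkt                     # not ours; pass through
        server = self.route_fn(pkt)
        self.pending[(pkt["src"], pkt["xid"])] = server   # soft state only
        return {**pkt, "dst": server}

    def on_response(self, pkt):
        # Make the response appear to originate from the virtual server.
        self.pending.pop((pkt["dst"], pkt["xid"]), None)
        return {**pkt, "src": self.virtual_addr}
```

Because `pending` is soft state, the proxy may crash and discard it; the client's RPC layer retransmits, and the retransmitted request is simply routed again.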
While it is logically an extension of the service, the proxy acts with the authority of its clients. In particular, the proxy may reside outside of the server ensemble's trust boundary. In this case, the damage from a compromised proxy is limited to the files and directories that its client(s) had permission to access.

2.2 Network Storage Nodes

A shared array of network storage nodes provides all disk storage used in a Slice ensemble. Slice servers may access the storage array directly; clients access the network storage array only by request redirection through a proxy. More storage nodes may be added to incrementally scale bandwidth, capacity, and disk arms. The placement policies of the file service are responsible for distributing data among storage objects so as to benefit fully from all of the resources in the network storage array. The Slice block storage architecture is loosely based on a proposal in the National Storage Industry Consortium (NSIC) for object-based storage devices (OBSDs) [3]. Key elements of the OBSD proposal were in turn inspired by research on Network Attached Secure Disks (NASD) [10, 11]. Storage nodes are "object-based" rather than sector-based, meaning that requesters address data on each storage node as logical offsets within storage objects. A storage object is an ordered sequence of bytes with a unique identifier. OBSDs support additional command sets to manage collections of objects, and they retain responsibility for assigning physical disk blocks to objects. The OBSD proposal allows for cryptographic protection of object identifiers if the network is insecure [10]. This block storage architecture is orthogonal to the question of which level arranges redundancy to tolerate disk failures. One alternative is to provide redundancy of disks and other vulnerable components internally to each storage node. A second option is for the file service layers to mirror data or maintain parity across the storage nodes.
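One way the file service layers could mirror data across storage nodes is sketched below. The placement rule and names are assumptions for illustration; the point is only that each block maps to several distinct nodes, so a read can be served by any replica and the system tolerates failures up to the replication degree.

```python
# Sketch of per-file mirrored placement across storage nodes.
# The placement rule and names are illustrative assumptions.

def mirror_sites(file_id, block, degree, storage_nodes):
    # Place `degree` replicas of one block on distinct storage nodes,
    # declustering successive blocks of the file across the array.
    assert degree <= len(storage_nodes)
    n = len(storage_nodes)
    base = (file_id + block) % n
    return [storage_nodes[(base + r) % n] for r in range(degree)]
```

A write would be sent to every returned site; a read may be load-balanced across them.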
In Slice, the choice to employ extra redundancy across storage nodes may be made on a per-file basis, through support for mirrored striping in our prototype. For stronger protection, a Slice configuration could employ redundancy at both levels.

2.3 File Managers

File management functions above the network storage array are split across two classes of servers. Each class governs functions that are common to any file server; the model separates them to distribute the request load and to allow implementations specialized to the semantics of individual workloads. Directory servers handle name space operations, e.g., to create, remove, or look up files and directories by symbolic name; they manage and cache directories and the mappings from names to file identifiers and attributes for each file or directory. Small-file servers handle read and write operations on small files and the initial segments of large files (see Section 3.1). Slice file managers are dataless; all of their state is backed by the network storage array. Their role is to aggregate their structures into larger storage objects backed by the storage nodes, and to provide memory and CPU resources to cache and manipulate those structures. In this way, the file managers can benefit from the parallel disk arms and high bandwidth of the storage array as more storage nodes are added. The principle of dataless file managers also plays a key role in recovery. In addition to its backing objects, each manager journals its updates in a write-ahead log [12, 9]; the system can recover the state of any manager from its backing objects together with its log. This allows fast failover, in which a surviving site assumes the role of a failed server, recovering its state from shared storage [14, 4, 27].

2.4 Discussion

In summary, the Slice architecture presents several fundamental benefits to diverse application workloads:

Compatibility with standard file system clients. The proxy factors request routing policies away from the client-side file system code.
The purpose of this approach is to leverage a minimal computing capability within the network elements to virtualize the storage protocol.

Direct storage access for high-volume I/O. The proxy routes bulk I/O traffic directly to the network storage array, removing the file managers from the critical path. Separating requests in this fashion effectively eliminates a key scaling barrier for conventional file services [10, 11]. Moreover, policies for block placement and replication may be applied on a per-file basis, by incorporating them in the policy for routing I/O requests to the storage array. At the same time, the small-file servers absorb and aggregate I/O operations on small objects, so there is no need for the storage node mapping structures to handle small objects efficiently.

Content-based request switching. The proxy's routing function selects a server site for each request based on the request arguments. Requests with the same arguments are routed to the same server, unless the routing function changes (e.g., due to reconfiguration or rebalancing). This can improve locality in the request stream; a good routing function will induce a balanced distribution of file objects and requests across the servers, eliminating redundant caching.

2.5 Related Work

A large number of systems have interposed new system functionality by "wrapping" an existing interface, including kernel system calls [16], internal interfaces [15], communication bindings [13], or messaging endpoints. The concept of a proxy mediating between clients and servers [26] is now common in distributed systems. We propose to mediate some storage functions by interposing on standard storage access protocols within the network elements. The role of the Slice proxy is primarily to route requests based on their content.
This is analogous to the HTTP content switching features offered by prominent network switch vendors (e.g., Foundry, Extreme, Alteon) for Internet server sites, based in part on research that demonstrated the locality benefits in those environments [22]. Our proposal extends the concept to a file system context, in which updates are frequent. A number of recent commercial and research efforts investigate techniques for building scalable storage systems for high-speed switched LANs. These systems are built from disks distributed through the network, e.g., attached to dedicated servers [17, 27, 14], cooperating peers [4, 29], or the network itself [11]. We separate these into two broad groups.

Many of these systems separate the file managers (e.g., the name service) from the block storage service, as in Slice. This separation was first proposed for the Cambridge Universal File Server [6]. Subsequent systems adopted it to allow bulk I/O to bypass the file managers [7, 14], and it is now a basic tenet of research in network-attached storage devices, including the CMU NASD work on devices with object-based access control [11, 10]. The industry is moving to base new storage products on the NASD model. Slice shows how to incorporate essential placement and routing functions into a new filesystem structure for network-attached storage devices. The Slice model cleanly separates those functions from the base file access protocol, preserving compatibility with existing network file system clients. Other proposals have integrated functions similar to those of the proxy into file system clients [10].

An alternative structure for cluster file systems is to layer file system functions above a network storage volume using a shared disk model. Policies for striping, redundancy, and storage site selection are specified on a volume basis; cluster nodes coordinate their accesses to the shared storage blocks using an ownership protocol.
This approach has been used with both log-structured (Zebra [14] and xFS [4]) and conventional (Frangipani/Petal [17, 27] and GFS [23]) file system organizations. The cluster may be viewed as "serverless" if all nodes are trusted and have direct access to the shared disk; alternatively, the entire cluster may act as a file server to untrusted clients using a standard network file protocol, with all I/O passing through the cluster nodes as they mediate access to the disk. The key benefits of Slice request routing apply equally to these shared disk systems when untrusted clients are present. First, request routing is a key to incorporating secure network-attached storage devices, which allow untrusted clients to address storage objects directly without compromising the integrity of the file system. That is, a proxy could route bulk I/O requests directly to the devices, yielding a more scalable system that preserves compatibility with standard clients and accommodates per-file policies for block placement, redundancy or replication, prefetching, etc. Second, request routing could improve locality in the request stream to the file servers, improving cache effectiveness and reducing the incidence of block sharing among the servers. The shared disk model is used in many commercial systems, which increasingly interconnect storage devices and servers with dedicated Storage Area Networks (SANs), e.g., FibreChannel. This paper explores storage request routing for Internet networks, but the concepts are equally applicable in SANs.

Our proposal to separate small-file I/O from the request stream is similar in concept to the Amoeba Bullet Server [28], a specialized file server that optimizes small files. As described in Section 4.3, the prototype small-file server draws on techniques from the Bullet Server, FFS fragments [21], and SquidMLA [19], a Web proxy server that maintains a user-level "filesystem" of small cached Web pages.
3 Request Routing Policies

This section explains the structure of the proxy and the request routing schemes used in the Slice prototype. The purpose is to illustrate concretely the request routing policies enabled by the architecture, and the implications of those policies for the way the servers interact to maintain and recover consistent file system states. We use the NFS V3 protocol as a reference point, since it is widely understood and our prototype supports it.

The proxy examines and rewrites NFS requests and responses as they enter or arrive from the network. It intercepts requests addressed to virtual NFS servers, and selects a target server from the Slice ensemble by applying a request routing function to the request type and arguments. It then rewrites the IP address and port of the request to redirect it to the selected server. When a response arrives, the proxy rewrites the source address and port before forwarding it to the client, so the packet appears to originate from the virtual NFS server. The request routing functions must permit reconfiguration to add or remove servers, while minimizing state requirements in the proxy. The proxy routes most requests by extracting relevant fields from the request, optionally hashing to combine multiple fields, and interpreting the result as a logical server site ID for the request. It then looks up the physical server corresponding to the logical site in a compact routing table installed at initialization. These tables constitute soft state; the proxy may reload them after a failure. It is possible to reconfigure the routing function by changing the tables. Multiple logical sites may map to the same physical server, leaving flexibility for reconfiguration. The proxy examines up to four fields of each request, depending on the particular policy configured:

Request type. Routing policies are keyed by the NFS request type, so the proxy may employ different policies for different functions.
Table 1 lists the important NFS request groupings discussed in this paper.

File handle. Each NFS request applies to a specific file or directory, named by a unique identifier called a file handle (or fhandle), which is passed as an argument to the operation. Although NFS fhandles are opaque to the client, their structure is known to the proxy, which acts as an extension of the service. For example, directory servers place an integer fileID field or type fields in each fhandle, which the proxies extract for routing.

Read/write offset. NFS I/O operations specify the range of offsets covered by each read and write. When small-file servers are configured, the prototype's routing policy defines a fixed threshold offset (e.g., 64KB); the proxy directs I/O requests below the threshold to a small-file server selected from the request fhandle. I/O requests beyond the threshold are striped across multiple storage nodes using a static policy or per-file block maps.

Name component. NFS name space operations include a symbolic name component in their arguments (see Table 1). A key challenge for scaling file management is to attain a balanced distribution of these requests across a collection of directory servers; this is particularly important for metadata-intensive workloads with small files and heavy create/lookup/delete activity, as often occurs in Internet server environments for mail, news, message boards, and Web access. Section 3.2 outlines two policies that select the server by applying a hash function to the name.

Namespace Operations
  lookup(dir, name) returns (file handle, attributes): retrieve a directory entry.
  create(dir, name) / mkdir(dir, name) returns (file handle, attributes): create a file/directory and update the parent entry/link count and modify timestamp.
  remove(dir, name) / rmdir(dir, name): remove a file/directory or hard link and update the parent entry/link count and modify timestamp.
  link(olddir, oldname, newdir, newname) returns (file handle, attributes): create a new name for an existing file, updating the file's link count and modify timestamp and the new parent's link count and modify timestamp.
  rename(olddir, oldname, newdir, newname) returns (file handle, attributes): rename an existing file or hard link, updating the link count and modify timestamp in both the old and new parent.

Attribute Operations
  getattr(object) returns (attributes): retrieve the attributes of a file or directory.
  setattr(object, attributes): modify the attributes of a file or directory, and update its modify timestamp.

I/O Operations
  read(file, offset, len) returns (data, attributes): read data from a file, updating its access timestamp.
  write(file, offset, len) returns (data, attributes): write data to a file, updating its modify timestamp.

Directory Retrieval
  readdir(dir, cookie) returns (entries, cookie): read some or all of the entries in a directory.

Table 1: Some of the Sun Network File System (NFS) protocol operations and their meanings.

We now outline the functioning of the proxy's routing policies for specific request groups.

3.1 Block I/O

Request routing for read/write requests has two goals: separate small-file read/write traffic from bulk I/O, and decluster the blocks of large files across the storage nodes for the desired access properties (e.g., high bandwidth or a specified level of redundancy). We address each in turn.

Our approach to separating small-file access traffic is to redirect all I/O below the fixed threshold offset to a small-file server. The threshold offset is necessary because the size of each file may change at any time, and in any case is not known to the proxy. Thus the small-file servers are also polluted by some subset of the I/O requests on large files, i.e., they receive all I/Os that fall below the threshold, even if the target file is large. Moreover, I/O to large files below the threshold is limited by the bandwidth of the small-file server.
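The threshold policy and a static striping placement can be sketched together as follows. The names, the 64KB threshold, and the round-robin placement rule are assumptions for illustration, not the prototype's exact code.

```python
# Sketch of I/O request routing: offsets below a fixed threshold go to a
# small-file server chosen from the fileID; offsets above it are striped
# across the storage nodes by a static placement function.
# Names and parameter values are illustrative assumptions.

THRESHOLD = 64 * 1024          # fixed threshold offset (e.g., 64KB)
BLOCK_SIZE = 8 * 1024          # stripe unit; assumed value

def route_io(file_id, offset, small_file_servers, storage_nodes):
    if offset < THRESHOLD:
        # All small-file I/O on a given file hits the same server.
        return small_file_servers[file_id % len(small_file_servers)]
    # Static placement computed from the block offset and fileID,
    # declustering a large file across all storage nodes.
    block = (offset - THRESHOLD) // BLOCK_SIZE
    return storage_nodes[(file_id + block) % len(storage_nodes)]
```

Note that successive blocks of the same large file land on different storage nodes, which is what yields aggregate bandwidth from the array.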
In practice, large files have little impact on the small-file servers because there is typically a relatively small number of these files, each consuming at most threshold bytes. Imbalances can be resolved by adding servers. Similarly, any performance cost for large files affects only the first threshold bytes, and becomes progressively less significant as the file grows.

The proxy redirects I/O traffic above the threshold directly to the network storage array, using some placement policy to select the storage site(s) for each block. A simple option is to employ static placement functions that compute on the block offset and/or fileID. More flexible placement policies would allow the proxy to select write sites considering other factors, e.g., load conditions on the network or storage nodes, or file attributes encoded in the fileID. As an example of such an attribute-based policy, Slice supports a mirrored striping policy that replicates each block of a mirrored file on multiple storage nodes, to tolerate failures up to the replication degree. Mirroring consumes more storage and network bandwidth than striping with parity, but it is simple and reliable, avoids the overhead of computing and updating parity, and allows load-balanced reads [5, 17]. To generalize to more flexible placement policies, Slice records block locations in per-file block maps stored by map coordinator servers. The proxies interact with the coordinators to fetch fragments of the block maps as they handle I/O operations on files. A Slice configuration may include any number of coordinators, each managing a subset of the files, e.g., selected by the target fileID. The file managers could subsume the functions of the coordinators, but they are logically separate.

3.2 Name Space Operations

Attaining a balanced distribution of name space requests presents different challenges from I/O request routing.
Directory processing involves more computation, and name entries may benefit more from caching because they tend to be relatively small and fragmented. Moreover, directories are frequently shared. Directory servers act as a synchronization point to preserve the integrity of the name space, e.g., to prevent two clients from concurrently creating a file with the same name in the same directory, or removing a directory while a name create is in progress in that directory.

A common approach to scaling the file manager or directory service is to manually partition the name space into a set of volumes, each managed by a single server. Unfortunately, a fixed volume partitioning strategy develops imbalances as load on the individual volumes grows at different rates. Rebalancing requires administrative intervention to repartition the name space among volumes, which may be visible to users through changes to the names of existing directories.

We propose a namespace routing policy called mkdir switching to automatically produce an even distribution of directories across directory servers. By default, the proxies route namespace operations, including mkdir, to a directory server selected by indexing a routing table with bits from the fileID in the fhandle. For mkdir switching, the proxy decides with probability p to redirect any given mkdir request to a different directory server, selected by hashing on the parent fhandle and the name of the new directory. The selected site handles the new directory and its descendants, except any subdirectories that happen to route to a different site. Directory servers must coordinate to correctly handle name entries that cross server boundaries, similar to the mount points maintained by clients in the volume partitioning approach. Section 4.2 outlines the directory server implementation in Slice. The hashing simplifies the implementation because races over name manipulation involve at most two sites.
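The mkdir switching decision described above can be sketched as follows. The hash choice, the value of p, and all names are assumptions for illustration; the key property is that the redirected target is a deterministic function of the parent fhandle and the new name.

```python
# Sketch of mkdir switching: with probability p, redirect a mkdir to a
# directory server chosen by hashing the parent fhandle and the new
# directory's name; otherwise route by the parent's fileID as usual.
# Names, P_SWITCH, and the hash are illustrative assumptions.

import hashlib
import random

P_SWITCH = 0.1   # redirection probability p (assumed value)

def route_mkdir(parent_fhandle: bytes, name: str, servers, rng=random):
    if rng.random() < P_SWITCH:
        # Deterministic hash of (parent fhandle, name) picks the new site.
        digest = hashlib.sha1(parent_fhandle + name.encode()).digest()
        return servers[int.from_bytes(digest[:4], "big") % len(servers)]
    # Default: follow the parent's fileID, so the new directory stays on
    # the same server as its parent.
    file_id = int.from_bytes(parent_fhandle[:4], "big")
    return servers[file_id % len(servers)]
```

Because the switched target depends only on (parent, name), a racing rmdir or lookup for the same entry hashes to the same site, so conflicts involve at most the parent's site and the hashed site.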
Mkdir switching yields balanced distributions when the average number of active directories is large relative to the number of directory server sites. However, very large directories are fixed to a single server.

An alternative name hashing strategy routes all namespace operations to a site uniquely determined by a hash on the name component and its position in the directory, represented by the parent directory fhandle. This approach represents the entire volume name space as a unified global hash table distributed among the directory servers. It views directories as distributed collections of name entries, rather than files accessed as a unit. Conflicting operations on a particular name entry (e.g., create/create, create/remove, remove/lookup) are routed to the same server, where they synchronize on the shared hash chain. Operations on different entries in the same directory (e.g., create, remove, lookup) may proceed in parallel on multiple directory servers. Depending on the quality of the hash function, name hashing yields probabilistically balanced request distributions for name-intensive workloads, and it easily handles arbitrarily large directories. The cost of this policy is that it increases the complexity of the peer-to-peer interactions among directory servers to manage the distributed directory tree, and of reconfiguration and recovery. Also, it requires sufficient memory to keep the hash chains memory-resident, since the hashing function destroys locality in the hash chain accesses. Finally, readdir operations span multiple servers; this is the right behavior for large directories, but it increases readdir costs for small directories.

3.3 Request Routing and Service Structure

Request routing policies impact the structure of the servers that make up the ensemble.
The primary challenges of this architecture are coordination and recovery to maintain a consistent view of the volume's data across all servers, and allowing for reconfiguration to add or remove servers within each class. Most of the routing policies outlined above are independent of whether small files and name entries are bound to the server sites on which they are created. One option is for the servers to share their backing objects and coordinate through a locking protocol, as in shared disk systems such as Frangipani (see Section 2.5); in this case, the purpose of the proxy redirection is to improve locality in the request stream to each server. Alternatively, the system may use fixed placement, in which items are fixed to their create sites unless reconfiguration or failover causes them to move. Fixed placement stresses the role of the request routing policy in data placement, through its routing choices for requests that create new name entries or data items. It also simplifies some aspects of coordination and recovery because the managers do not concurrently share storage blocks; however, it introduces a need for atomic updates to state distributed among the managers. Name hashing requires fixed placement unless the directory servers support fine-grained distributed caching.

3.3.1 Reconfiguration

Consider the problem of reconfiguration to add or remove file managers, i.e., directory servers, small-file servers, or map coordinators. Most requests involving these servers are routed by the fileID in the request fhandle, including name space operations other than the redirected mkdirs in the mkdir switching policy. If servers join or depart the ensemble, the binding from fileIDs to physical servers must change. The NFS specification prohibits the system from modifying any fhandle fields once the fhandle has been minted. It is sufficient to hash the fileID to produce a logical server site ID, then consult a routing table in the proxy to map the logical server ID to a physical server.
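The two-level mapping just described can be sketched as follows. The table size and names are assumptions; the essential point is that fhandles (and the fileIDs inside them) never change, while the table mapping logical sites to physical servers can.

```python
# Sketch of the fileID -> logical site -> physical server mapping.
# LOGICAL_SITES and all names are illustrative assumptions.

LOGICAL_SITES = 256    # many more logical sites than physical servers

def logical_site(file_id: int) -> int:
    # Stand-in for a real hash over the fileID extracted from the fhandle.
    return hash(file_id) % LOGICAL_SITES

def make_table(servers):
    # Initial table: logical sites dealt round-robin to physical servers.
    return [servers[i % len(servers)] for i in range(LOGICAL_SITES)]

def route(file_id, table):
    return table[logical_site(file_id)]
```

Reconfiguration changes only table entries, never fhandles: reassigning a logical site to another physical server transparently reroutes every fileID that hashes to that site.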
It is then possible to change the routing function by distributing updated routing tables to change the mapping. This approach was used in xFS to partition file block maps across an xFS cluster [4]. Note that the logical servers define the minimal granularity for rebalancing; the granularity may be adjusted by changing the number of bits extracted from the logical server ID to index the routing table. This approach generalizes to policies in which the logical server ID is derived from a hash that includes other request arguments, e.g., the name hashing approach. For name hashing systems and other systems with fixed placement, the reconfiguration procedure must move logical servers from one physical server to another. One approach is for each physical server to use multiple backing objects, one for each logical server site that it hosts, and reconfigure by reassigning the binding of physical servers to their backing objects in the storage array. Otherwise, reconfiguration must copy data from one backing object to another. In general, it is necessary to move 1/Nth of the data to rebalance after adding or removing a physical server.

3.3.2 Atomicity and Recovery

Multiple servers with request routing naturally imply that the servers must manage distributed state. File system structures have strong integrity requirements and frequent updates. This state must be kept consistent through failures and conflicting concurrent operations. As previously stated, file managers may arrange for recovery by generating a write-ahead operation log in shared storage. For systems that use the shared-disk model with no fixed placement, all operations execute at a single manager site, and the system must provide locking and recovery procedures to protect and recover the state of the shared blocks [27]. For systems with fixed placement, some operations must update state at multiple sites through a peer-peer protocol.
The system can synchronize these operations locally at the points where shared data resides (e.g., the shared hash chains in the name hashing policy). Thus there is no need for distributed locking or distributed recovery of individual blocks, but logging and recovery must be coordinated across sites, e.g., using two-phase commit. For mkdir switching, the only operations that involve updates to data on multiple directory servers are those that involve the "orphaned" directories that were placed on a different site from their parents. These operations include the rerouted mkdirs themselves, associated rmdirs, and any rename operations that happen to involve the orphaned entries. Since these operations are relatively rare, as determined by the redirection probability parameter p, it is acceptable to perform a full two-phase commit as needed to guarantee the atomicity of these operations on systems with fixed placement. However, for the name hashing policy, the majority of name space update operations involve state on multiple sites. While it is possible to limit the two-phase commit cost by logging asynchronously and coordinating rollback, this approach would provide weaker failure properties than are normally accepted for file systems. Unless this problem is solved, name hashing may be best suited to file systems that are read-mostly or that store primarily soft state, e.g., scratch files or Internet server caches. Shared network storage arrays also present interesting atomicity and recovery challenges. In particular, remove and truncate must reclaim storage from multiple sites, under the direction of the map coordinator. The coordinator must maintain accurate maps, allocate new map fragments for extending write operations, and support failure-atomic object operations, including mirrored striping, truncate/remove, and NFS V3 write commitment (commit). The primary challenge is that there is no single serialization point for writes to the shared array. Amiri et al.
[1] have recognized this problem and addressed the atomicity issues; their solution handles storage node failures, but not host failures. The Slice map coordinators use an intention logging protocol in the coordinator to preserve atomicity. The protocol is fully implemented and is the subject of a separate paper [2]. The basic protocol is as follows. At the start of the operation, the proxy sends to the coordinator an intention to perform the operation. The coordinator logs the intention in stable memory. When the operation is complete, the proxy notifies the coordinator with a completion message, asynchronously clearing the intention. If the coordinator does not receive the completion within a specified timeout, it probes one or more participants to determine if the operation completed, and initiates recovery if necessary. A failed coordinator recovers by scanning its intentions log, completing or aborting operations in progress at the time of the failure. In practice, this protocol admits optimizations to eliminate message exchanges and log writes from the critical path of most common-case operations, by piggybacking messages, leveraging the NFS V3 commit semantics, and amortizing intention logging costs across multiple operations in ways that trade off recovery time for faster execution. The details of this protocol are beyond the scope of this paper, but our performance results fully reflect its costs.

4 Implementation

We have prototyped the key elements of the Slice architecture as a set of loadable kernel modules for the FreeBSD operating system. The prototype includes a proxy implemented as a standard packet filter, and kernel modules for the basic server classes: small-file server, directory server, block storage server, and block map coordinator. Our directory server implementations use fixed placement and support both the name hashing and mkdir switching policies.
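A minimal sketch of the coordinator's role in the intention-logging protocol of Section 3.3.2, before turning to the remaining implementation details. This is a simplification with hypothetical names; the real protocol batches, piggybacks, and amortizes these messages.

```python
class MapCoordinator:
    """Coordinator side of the intention-logging protocol: an intention
    is logged before a multi-site operation (remove, truncate, mirrored
    write) and cleared asynchronously by a completion message; recovery
    scans the log and rolls each in-flight operation forward or back."""

    def __init__(self):
        self.intentions = {}  # op_id -> operation (models stable memory)

    def log_intention(self, op_id, operation):
        self.intentions[op_id] = operation  # logged before work begins

    def complete(self, op_id):
        self.intentions.pop(op_id, None)    # completion clears the intention

    def recover(self, probe_completed):
        # After a crash, or after a completion timeout, probe the
        # participants for each logged operation and finish or abort it.
        actions = []
        for op_id in sorted(self.intentions):
            outcome = "complete" if probe_completed(op_id) else "abort"
            actions.append((op_id, outcome))
        for op_id, _ in actions:
            self.complete(op_id)
        return actions
```

Only operations whose intentions were logged but never cleared are examined at recovery, which is what keeps the common-case cost low.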
The storage servers use a stackable file system layer that manages multiple disks, each containing a Fast File System (FFS) volume with a collection of inodes for storing data. The key operations in the storage node interface are essentially a subset of NFS, including read, write, commit, truncate (setattr), and remove. The storage nodes use opaque NFS file handles as object identifiers, and include an external hash to map opaque NFS file handles to local inodes. Our current prototype uses static striping across a static set of storage nodes. The small-file servers and directory servers store their state in a collection of large backing objects, each of which is an NFS file backed directly by the block I/O service. Like the clients, these servers are configured with local proxies that govern the distribution of their I/O to the storage nodes, using maps supplied by a coordinator. The coordinator code is implemented as an extension to the storage node module, backing its intentions log and block maps on local disk. A more failure-resilient implementation might back the block maps with redundancy across multiple storage nodes, using a static placement function. All communication in our prototype is carried by UDP/IP. The experiments reported in this paper use Trapeze/Myrinet network interfaces [8]. We have modified the NFS stacks to avoid data copies on the client and server for NFS write requests and read responses, using support for zero-copy TCP and UDP payloads in Trapeze and its driver. However, the Slice prototype runs over any network interface that supports IP; we debugged the system using Ethernet interfaces. Recovery functionality is partially implemented in our prototype. The map coordinator's intention logging for failure-atomic large file operations is fully implemented. Similarly, the small-file servers are fully compliant with the NFS V3 commit specification for write operations below the threshold offset.
Directory servers generate a log entry for each operation that modifies critical metadata, but the recovery procedure is not implemented. We have not implemented reconfiguration in our prototype.

4.1 The proxy Packet Filter

We implemented the Slice proxy as a loadable packet filter module that intercepts and redirects packets exchanged with a registered Slice NFS V3 virtual server endpoint. When loaded, the proxy inserts itself into a low level of the IP stack, by rewriting IP packet handling functions to point to its own functions, and chaining the existing functions for network traffic that does not match its intercept profile. The Slice proxy implements the routing policies described earlier for I/O redirection, mkdir switching, and name hashing using MD5. It may be configured on the client host or within an IP firewall node on the network path between the clients and servers. We have validated the firewall configuration using unmodified Solaris clients. The proxy is a nonblocking state machine with soft state consisting primarily of routing tables. To support the full NFS protocol, the proxy also maintains records for some pending requests, and a cache over file attributes returned in NFS responses. Although the NFS V3 specification does not require it, some NFS clients expect a full set of file attributes in each read and write response. The proxy caches these attributes, and updates the modify time and size fields as each write completes. The proxy may demand-fetch or write back file attributes using NFS getattr and setattr operations to the directory servers. The proxy and other elements of our prototype provide no support for concurrent write sharing. However, our proposal is compatible with NFS file leasing extensions for consistent sharing, as defined in NQ-NFS [18] and included in a recent IETF draft for NFS V4.

Figure 2: Smallfile server data structures.
4.2 Directory Servers

The Slice prototype directory servers store arbitrary collections of name entries indexed by hash chains, keyed by an MD5 [24] hash fingerprint in each fhandle. This makes it possible to support both the name hashing and mkdir switching policies easily within the same code base. We determined empirically that MD5 yields a combination of balanced distribution and low cost that is superior to competing hash functions with implementations available to us. Each directory server's data is stored as a web of linked fixed-size cells representing naming entries and file attributes. Attribute cells may include a remote fhandle to reference an entry on another server, enabling cross-site links in the directory structure. Directory servers use a simple peer-peer protocol to update link counts for create/remove operations that cross sites, and to follow cross-site links for lookup, getattr, and readdir. The proxy interacts with a single directory site for each request.

4.3 Small-file Servers

The small-file server is implemented by a Smallfile module that manages each file as a sequence of 8KB logical blocks. Figure 2 illustrates the key Smallfile data structures and their use for a read or write request. The locations for each block are given by a per-file map record. The server accesses this record by using the fileID in the request fhandle to key a map descriptor array on disk. Each map record gives a fixed number of offset/length pairs mapping 8KB file extents to regions within the data file. This structure allows efficient space allocation and supports file growth. Each logical block may have less than the full 8KB of physical space allocated. For example, an 8300-byte file would consume only 8320 bytes of physical storage space: 8192 bytes for the first block, and 128 for the remaining 108 bytes (rounded up to the next power of two for space management).
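The space accounting above can be checked with a short sketch; the function name is ours, while the 8KB block size and power-of-two tail rounding follow the text.

```python
def physical_bytes(file_size: int, block: int = 8192) -> int:
    """Physical space consumed by a Smallfile: whole 8KB blocks, plus a
    tail fragment rounded up to the next power of two."""
    full_blocks = (file_size // block) * block
    tail = file_size - full_blocks
    if tail == 0:
        return full_blocks
    rounded = 1
    while rounded < tail:
        rounded *= 2   # round the tail up to the next power of two
    return full_blocks + rounded

# The paper's example: an 8300-byte file uses 8192 + 128 = 8320 bytes.
```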
New files, or writes to previously empty segments, are allocated space according to best fit; if no appropriately sized fragment is free, a new region is allocated at the end of the data file. Under a create-heavy workload, this policy lays out data within the data file similarly to LFS [25], batching newly created files into a single stream for efficient disk writes. The best-fit and variable-sized fragment approach is similar to SquidMLA [19]. All data from the Smallfile backing objects, including the map records, are cached within the server by the kernel buffer cache. This structure performs well if file accesses and the assignment of fileIDs show good locality. In particular, if the directory servers assign fileIDs with good spatial locality, and if files created together are accessed together, then the cost of reading the Smallfile map records is amortized across multiple files whose records fit in a single block.

5 Performance

This section presents Slice performance results. The intent is to show Slice scaling and performance. These experiments were performed on a Myrinet cluster of 450 MHz Pentium-III PCs with 512 MB main memory using the Asus P2B motherboard with an Intel 440BX chipset. The NFS server and Slice storage nodes use four IBM DeskStar 22GXP drives on separate Promise Ultra/33 IDE channels. All machines are equipped with Myricom LANai 7 adapters and run kernels built from the same FreeBSD 4.0 source pool. All network communication in these experiments uses NFS/UDP/IP over the Trapeze Myrinet firmware. In this configuration, Trapeze/Myrinet provides 110 MB/s of raw bandwidth with an 8KB transfer size. First, we report results from the industry-standard SPECsfs97 benchmark suite for network-attached storage. This benchmark runs as a collection of user processes that generate a realistic mix of NFS V3 requests, check the responses against the NFS standard, and measure latency and delivered throughput.
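The best-fit allocation policy described in Section 4.3 can be sketched as follows, assuming a simple free list of (offset, length) fragments; the function name and return convention are ours.

```python
def best_fit(free_list, need, end_of_file):
    """Allocate `need` bytes: choose the smallest free fragment that
    fits; if none fits, extend the data file (an LFS-like append).
    Returns (offset, new_end_of_file)."""
    fits = [frag for frag in free_list if frag[1] >= need]  # (offset, length)
    if fits:
        offset, length = min(fits, key=lambda frag: frag[1])
        free_list.remove((offset, length))  # fragment is consumed
        return offset, end_of_file
    return end_of_file, end_of_file + need  # append a new region
```

When the free list is cold (a create-heavy workload), every allocation falls through to the append case, which is what batches new files into a single write stream.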
Because SPECsfs is designed to benchmark servers but not clients, it creates its own sockets, sending and receiving NFS packets without exercising any client NFS implementation. SPECsfs is a demanding, industrial-strength, self-scaling benchmark. We show results to: (1) demonstrate that the prototype is fully functional, (2) show that the Slice proxy architecture supports a scalable implementation that is compatible with NFS V3 and independent of any client NFS implementation, and (3) provide a basis for judging the performance of the current prototype against industrial-grade servers.

Figure 3: SPECsfs97 saturation.

Figure 3 shows throughput at saturation. The x-axis gives the SPECsfs offered load in requests per second. The y-axis gives delivered throughput of the Slice ensemble, also given in operations per second. As a baseline, Figure 3 includes results from a single FreeBSD 4.0 NFS server (including our modifications for zero-copy I/O over Trapeze/Myrinet) exporting its four-disk array as a single volume (using CCD) and as separate volumes. This server saturates at 850 operations per second. Because saturation in this experiment is determined primarily by the number of disk arms, the Slice configurations use a single directory server, two small-file servers, and a varying number of storage nodes. Slice-1 uses one storage node for a total of 4 disks, the same as the NFS configurations. Slice-3 and Slice-6 use 3 and 6 storage nodes, for a total of 12 and 24 disks respectively. Slice-1 has a higher saturation throughput than either NFS configuration due to faster metadata operations; however, the experiment quickly becomes disk-arm limited. The SPECsfs file set is skewed heavily toward small files: 94% of files are 64 KB or less, though small files account for only 24% of the total bytes accessed. Despite this fact, SPECsfs I/O requests are almost entirely to small files; the large files serve to "pollute" the disks.
Increasing the number of storage nodes (disk arms) has a significant effect on both sustainable request rate and average latency. With six storage nodes the Slice ensemble saturates at 2500 ops/s with an average latency around 8 ms.

Figure 4: SPECsfs97 latency. The curves coil backwards as servers saturate.

Figure 4 shows average operation latency for the same experiment, as a function of delivered throughput on the x-axis. The curves coil back as the Slice ensemble saturates, delivering lower throughput with higher average latency. For comparison, Figure 4 includes lines for two recent commercial servers reported at spec.org. The first, the IBM RS/6000 Model 44P-170 with 1 GB cache, uses 48 10000-RPM SCSI disks serving 42 independent file systems. The second, the EMC Celerra File Server Cluster Model 506, has 34 Seagate Cheetah disks and 4 GB of disk cache. We chose the IBM line because it saturates at about the same point as Slice-6, and the EMC line because it most closely resembles the Slice-6 disk and memory resources. Both configurations have more disk arms, and deliver better average latency, than our 24-disk Slice-6 cluster with 1 GB cache on the file managers. Pricing information is not disclosed as part of the reported SPECsfs results.

Figure 5: Metadata intensive workload.

Figure 5 shows Slice scaling for a metadata-intensive workload. For this experiment, name hashing and mkdir switching yield equivalent performance; we used p = 1 - (N-1)/N, where N is the number of servers. In this test, an increasing number of client processes running on five client nodes untar a directory tree modeled from the entire FreeBSD distribution, consisting of 36,000 zero-length files, with no read or write activity.
This synthetic benchmark is intended to stress name space operations. Each file create requires seven NFS operations: lookup, access, create, getattr, lookup, setattr, setattr, for a total of about 25,000 NFS operations per client. The y-axis shows the average total latency perceived by each client process for the number of processes shown on the x-axis. Client CPUs saturate at five processes. This graph shows that the Slice directory service scales as more directory servers are added. Slice-1 measures one directory server, Slice-2 measures two, and so on. For comparison, the N-UFS line measures a FreeBSD NFS server exporting a soft-updates [20] enabled UFS file system over an IBM DeskStar 22GXP drive. The N-MFS line measures an NFS server exporting a memory-based file system (MFS). MFS initially performs better than Slice because Slice logs some operations; however, MFS does not scale. A Slice directory server saturates at 6000 ops/s with the untar benchmark generating about 0.5 MB/s of log traffic.

                 single client   saturation
read             66.8 MB/s       180.4 MB/s
write            45.0 MB/s       129.8 MB/s
read-mirrored    46.9 MB/s       117.4 MB/s
write-mirrored   22.6 MB/s        97.7 MB/s

Table 2: Bulk-I/O performance.

Table 2 shows raw read and write bandwidth through a FreeBSD NFS client coupled with a proxy, over Trapeze/Myrinet with six storage nodes. Each test reads or writes a 1.25 GB file, using a read-ahead and write-behind block size of 32 KB and a prefetching depth of four as NFS mount options. We have also modified the FreeBSD NFS client for zero-copy reading. The two columns present I/O rates delivered to a single client and the aggregate bandwidth to four clients, saturating the storage nodes. Non-redundant as well as mirrored bandwidths are presented. A single client reads a non-mirrored file at 67 MB/s with 30% CPU utilization. The same client writes at only 45 MB/s, saturating the client CPU.
At saturation, the six storage nodes with 24 disks altogether provide 180 MB/s of non-mirrored read bandwidth, or about 7.5 MB/s per disk, and 130 MB/s of non-mirrored write bandwidth, or 5.4 MB/s per disk. Storage nodes prefetch sequential files up to 256 KB beyond the current access; write clustering is not as aggressive. Mirroring degrades single-client read performance somewhat because storage nodes waste prefetching bandwidth loading data that the client chooses to read from another node. A mirrored writer must push twice as many bytes out its network interface and achieves half the bandwidth of a non-mirrored writer. Mirrored read saturation is lessened for the same reason as the single-client case; mirrored writes additionally consume extra transfer time and space overheads within the storage nodes.

Figure 6: Metadata placement.

Figure 6 shows the effect of varying the p parameter for mkdir switching in the metadata-intensive untar workload. The x-axis measures directory affinity, the probability p that a new directory is placed on the same server as its parent. The y-axis shows the corresponding average client latency. This test uses four client nodes hosting one, four, or sixteen client processes against four directory servers. One client process serializes operations within the client. Even with an affinity of 100%, placing all entries on the same server, single-client time does not change. However, performance degrades with multiple clients and high affinity. As directory affinity increases, average time decreases slightly as the number of cross-server operations declines. As affinity continues to increase, the directory structure keeps within a single server and becomes subject to load imbalances. We omit results for name hashing because latency is unaffected by directory size or a p parameter.
Operation                CPU
Packet interception      0.7%
Parsing                  4.1%
Redirection/rewriting    0.5%
Logic/overhead           0.8%

Table 3: proxy overheads for 6250 packets per second.

Table 3 summarizes the proxy overheads for the untar experiment, measured on a 500 MHz Compaq 21264 running FreeBSD/Alpha with iprobe (Instruction Probe), an on-line profiling tool. The workload sends and receives a mixed set of NFS packet types at a rate of 3125 request/response pairs per second and a total of 40% CPU utilization, of which 6% is due to proxy overheads. The proxy packet interception costs 0.7% of the absolute CPU utilization. An additional 4.1% is spent to partially parse the packets into an intermediate representation used for redirection and/or request/response rewriting. Redirection replaces the packet destination and/or ports, using a differential checksum update instead of scanning the whole packet. Redirection accounts for another 0.5% of the CPU time. Finally, the redirection logic and soft-state management consume another 0.8%. Parsing is by far the most expensive aspect of proxy operation. Given the offsets to certain key fields in an NFS request or response, the proxy can quickly identify and redirect the packet; however, both the SunRPC and NFS version 3 protocol specifications have variable-length fields. Nearly half of the parsing overhead stems from reading lengths and skipping past otherwise ignored fields. Adding an RPC header-length field or fixing (and padding) the number of groups in a SunRPC header would alleviate an eighth of our total parsing overhead. Likewise, fixing the size of an NFS file handle would expedite the handling of most messages.

6 Conclusion

This paper presents Slice, a new storage system architecture for high-speed LANs incorporating network-attached block storage. Slice interposes a request switching filter (called a proxy) along the network path between the client and the network storage system.
The proxy virtualizes a client/server file access protocol, applying configurable request routing policies to distribute requests across an array of storage nodes and specialized servers that cooperate to provide a scalable file service. This approach realizes the benefits of network-attached storage devices (scalable performance and direct device access for high-volume I/O) while preserving compatibility with standard file system clients. In addition, the proxy may incorporate routing policies for scalable file management. We presented and evaluated two such policies, mkdir switching and name hashing. We have prototyped all elements of the Slice architecture within the FreeBSD kernel. The prototype proxy runs as a packet filter, with state requirements that show the feasibility of incorporating it into network elements. The Slice prototype delivers network-speed file access bandwidth and competitive performance on an industry-standard NFS benchmark, with evidence of scalability. We used name-intensive synthetic benchmarks to evaluate the name space policies, to show that name space operations are scalable using these policies. In addition to demonstrating a proof-of-concept, we explored the interaction of request routing policies with the storage service structure, including server cooperation protocols, reconfiguration, and recovery.

References

[1] Khalil Amiri, Garth Gibson, and Richard Golding. Highly concurrent shared storage. In Proceedings of the International Conference on Distributed Computing Systems, April 2000.

[2] Darrell Anderson and Jeff Chase. Failure-atomic file access in an interposed network storage system. Technical report.

[3] David Anderson. Object Based Storage Devices: A command set proposal. Technical report, National Storage Industry Consortium, October 1999.

[4] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless network file systems.
In Proceedings of the ACM Symposium on Operating Systems Principles, pages 109-126, December 1995.

[5] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David A. Patterson, and Katherine Yelick. Cluster I/O with River: Making the fast case common. In I/O in Parallel and Distributed Systems, May 1999.

[6] A. D. Birrell and R. M. Needham. A universal file server. IEEE Transactions on Software Engineering, SE-6(5):450-453, September 1980.

[7] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405-436, Fall 1991.

[8] Jeffrey S. Chase, Darrell C. Anderson, Andrew J. Gallatin, Alvin R. Lebeck, and Kenneth G. Yocum. Network I/O with Trapeze. In 1999 Hot Interconnects Symposium, August 1999.

[9] Sailesh Chutani, Owen T. Anderson, Michael L. Kazar, Bruce W. Leverett, W. Anthony Mason, and Robert N. Sidebotham. The Episode file system. In USENIX Conference, pages 43-60, Winter 1992.

[10] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Eugene M. Feinberg, Howard Gobioff, Chen Lee, Berend Ozceri, Erik Riedel, David Rochberg, and Jim Zelenka. File server scaling with network-attached secure disks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, volume 25,1 of Performance Evaluation Review, pages 272-284, New York, June 15-18 1997. ACM Press.

[11] Garth A. Gibson, David F. Nagle, Khalil Amiri, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[12] Robert B. Hagmann. A crash recovery scheme for a memory-resident database system. IEEE Transactions on Computers, C-35(9):839-843, September 1986.
[13] Graham Hamilton, Michael L. Powell, and James J. Mitchell. Subcontract: A flexible base for distributed programming. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 69-79, 1993.

[14] John H. Hartman and John K. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274-310, August 1995.

[15] John S. Heidemann and Gerald J. Popek. File-system development with stackable layers. ACM Transactions on Computer Systems, 12(1):58-89, February 1994.

[16] Michael B. Jones. Interposition agents: Transparently interposing user code at the system interface. In Barbara Liskov, editor, Proceedings of the 14th Symposium on Operating Systems Principles, pages 80-93, New York, NY, USA, December 1993. ACM Press.

[17] Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, MA, October 1996.

[18] R. Macklem. Not quite NFS, soft cache consistency for NFS. In USENIX Association Conference Proceedings, pages 261-278, January 1994.

[19] Carlos Maltzahn, Kathy Richardson, and Dirk Grunwald. Reducing the disk I/O of web proxy server caches. In USENIX Annual Technical Conference, June 1999.

[20] Marshall Kirk McKusick and Gregory R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In FREENIX Track, USENIX Annual Technical Conference, pages 1-17, 1999.

[21] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX (revised July 27, 1983). Technical Report CSD-83-147, University of California, Berkeley, July 1983.

[22] Vivek S. Pai, Mohit Aron, Gaurav Banga, Michael Svendsen, Peter Druschel, Willy Zwaenepoel, and Erich Nahum. Locality-aware request distribution in cluster-based network servers.
In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.

[23] Kenneth Preslan and Matthew O'Keefe. The global file system: A shared disk file system for *BSD and Linux. In 1999 USENIX Technical Conference (FREENIX track), June 1999.

[24] R. Rivest. RFC 1321: The MD5 message-digest algorithm, April 1992. ftp://ftp.internic.net/rfc/rfc1321.txt.

[25] Mendel Rosenblum and John Ousterhout. The LFS storage manager, 1990.

[26] Marc Shapiro. Structure and encapsulation in distributed systems: The proxy principle. In Proceedings of the Sixth International Conference on Distributed Computing Systems, May 1986.

[27] Chandu Thekkath, Tim Mann, and Ed Lee. Frangipani: A scalable distributed file system. In Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224-237, October 1997.

[28] Robbert van Renesse, Andrew Tanenbaum, and Annita Wilschut. The design of a high-performance file server. In The 9th International Conference on Distributed Computing Systems, pages 22-27, Newport Beach, CA, June 1989. IEEE.

[29] Geoff M. Voelker, Eric J. Anderson, Tracy Kimbrel, Michael J. Feeley, Jeffrey S. Chase, Anna R. Karlin, and Henry M. Levy. Implementing cooperative prefetching and caching in a globally-managed memory system. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98), June 1998.