CSD PhD Proposal Document
Abstract
Distributed file systems that scale by partitioning files and directories among a collection of servers inevitably encounter cross-server operations. A common example is a rename that moves a file from a directory managed by one server to a directory managed by another. Systems that provide the same semantics for cross-server operations as for those that do not span servers traditionally implement dedicated protocols for these rare operations. This thesis explores an alternate, simpler, approach that exploits the existence of dynamic redistribution functionality
(e.g., for load balancing, incorporation of new servers, and so on). When a client request would involve files on multiple servers, the system can redistribute those files onto one server and have it service the request. Although such redistribution is more expensive than a dedicated cross-server protocol, preliminary analysis of NFS traces indicates that such operations are extremely rare in file system workloads. Thus, when dynamic redistribution functionality exists in the system, cross-server operations can be handled with very little additional implementation complexity.
Transparent scalability is a goal of many distributed systems. That is, it should be possible to increase both capacity and performance by adding servers and spreading data and work among them. Furthermore, it is desirable that client applications and users are presented a consistent set of semantics, regardless of which servers are hosting which data.
In the case of file systems, many designs scale by partitioning the set of files across the set of servers. Each file is managed by a particular server, and accesses to files managed by different servers are completely independent. The vast majority of operations affect only a single file, so this approach works well in the common case. A few operations, such as a cross-directory rename or a snapshot, affect more than one file or directory and so may involve files managed by two distinct servers. A transparently scalable system would provide the same semantics in this case as in the case where all files were on the same server. Providing strong semantics on a single server
is relatively easily done using techniques like local locking and write-ahead logging, but doing so is more difficult when multiple servers are involved.
Many existing systems do not provide identical semantics in this case. Applications, on the other hand, frequently rely on specific consistency semantics (e.g., atomicity of rename) and have no convenient way of knowing whether any given operation will get the same-server or cross-server semantics. Additionally, systems that aim to scalably support legacy protocols, such as NFS v3 or CIFS, must maintain the semantics defined by the protocol, regardless of whether any end application relies on them.
Systems that do provide transparent scalability use some sort of distributed protocol to handle operations that involve multiple servers. In the case where all objects are accessible from any server, this is often done by having one server acquire locks on all objects involved, update the objects and any logs as necessary, and then release the locks. The underlying distributed lock manager will trigger the appropriate cache invalidations to maintain consistency between servers. GPFS uses such an approach. An alternative, frequently used when servers do not share a common storage pool, is to have each server execute only the portion of the operation that pertains to it. All objects are then updated to their final state using a multi-phase-commit protocol to ensure atomicity.
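To make the second alternative concrete, the following is a minimal sketch of a two-phase commit in which each server applies only its own portion of an operation. It is an illustration under simplifying assumptions, not the protocol of any system cited here: the `Participant` class, the `two_phase_commit` function, the `healthy` flag, and the use of `None` to mean "delete this key" are all hypothetical, and durable logging, timeouts, and coordinator recovery are omitted.

```python
class Participant:
    """Hypothetical server holding part of the state touched by an operation."""

    def __init__(self, name, state, healthy=True):
        self.name = name
        self.state = dict(state)
        self.healthy = healthy   # assumption: an unhealthy server votes "no"
        self.staged = None       # updates promised in phase 1, not yet applied

    def prepare(self, updates):
        # Phase 1: promise to apply the updates (a real system would log them durably).
        if not self.healthy:
            return False
        self.staged = dict(updates)
        return True

    def commit(self):
        # Phase 2: apply the staged updates; a value of None deletes the key.
        for key, value in self.staged.items():
            if value is None:
                self.state.pop(key, None)
            else:
                self.state[key] = value
        self.staged = None

    def abort(self):
        self.staged = None


def two_phase_commit(work):
    """work is a list of (participant, updates) pairs; returns True if committed."""
    prepared = []
    for participant, updates in work:
        if participant.prepare(updates):
            prepared.append(participant)
        else:
            for p in prepared:   # any "no" vote aborts everyone who voted "yes"
                p.abort()
            return False
    for participant, _ in work:
        participant.commit()
    return True


# Cross-server rename of "f" from a directory on server A to one on server B.
a = Participant("A", {"f": "inode-7"})
b = Participant("B", {})
committed = two_phase_commit([(a, {"f": None}), (b, {"f": "inode-7"})])
```

Even this toy version leaves the hard cases unhandled, such as a coordinator or participant failing between the two phases.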
These solutions, while effective, are also complex to implement, debug, and verify, particularly for the cases involving failures. Furthermore, in many workloads, cross-server operations occur very rarely. For example, in Farsite’s experimental workload of 2.1M operations, only 8 involved multiple servers [?]. Thus, the programmer effort involved in supporting transparent cross-server operations is out of proportion with its utilization in most workloads.
This thesis develops an alternate approach to handling cross-server operations. Specifically:
Reusing dynamic redistribution to eliminate cross-server operations is an effective approach to transparently scaling decentralized storage systems
Most scalable systems have mechanisms to dynamically redistribute objects across servers. This capability is used to ensure that server resources are utilized efficiently. For example, when a server is overloaded, some of the objects it is responsible for should be moved to other servers, moving the load associated with them. Similarly, when a server’s disks are nearly full, some of its objects are moved to other servers with more free space. There has been much work on load balancing policies, and the underlying mechanisms used for transferring objects are fairly straightforward.
I propose to use this same redistribution mechanism to support multi-server operations, instead of a dedicated protocol. If servers can only process requests for objects they are responsible for, all the objects involved must be the responsibility of the same server. If this precondition is not satisfied, the system will redistribute the required objects until it is satisfied, and only then execute the operation. Objects that were moved may be returned to their original servers immediately, or
they can be left until load-balancing policy dictates that they be moved again. All of the inter-server communication is encapsulated in the dynamic redistribution mechanism, which already exists for other purposes, saving implementation and debugging time.
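The approach described above can be sketched in a few lines. This is a toy model under stated assumptions, not the proposed implementation: the `Server` and `Cluster` classes, the `migrate` and `colocate` methods, and the policy of moving everything to the owner of the first object are hypothetical choices made only for illustration. The point is that `rename` itself contains no inter-server logic; all of it lives in `migrate`, which stands in for the existing redistribution mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    objects: dict = field(default_factory=dict)   # object id -> contents


class Cluster:
    """Toy cluster in which every object is owned by exactly one server."""

    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}
        self.owner = {oid: s.name for s in servers for oid in s.objects}
        self.migrations = 0

    def migrate(self, oid, dst):
        """Stand-in for the existing dynamic redistribution mechanism."""
        src = self.owner[oid]
        if src == dst:
            return
        self.servers[dst].objects[oid] = self.servers[src].objects.pop(oid)
        self.owner[oid] = dst
        self.migrations += 1

    def colocate(self, oids):
        """Redistribute until one server owns all the objects; return its name."""
        target = self.owner[oids[0]]   # hypothetical policy: first object's owner
        for oid in oids[1:]:
            self.migrate(oid, target)
        return target

    def rename(self, src_dir, dst_dir, name):
        """Cross-directory rename executed on a single server."""
        server = self.servers[self.colocate([src_dir, dst_dir])]
        # Single-server code path: both directories are now local.
        entry = server.objects[src_dir].pop(name)
        server.objects[dst_dir][name] = entry


a = Server("A", {"dir1": {"f": 42}})
b = Server("B", {"dir2": {}})
cluster = Cluster([a, b])
cluster.rename("dir1", "dir2", "f")   # dir2 is moved to server A first
```

In this sketch the moved directory simply stays on its new server, which corresponds to leaving objects in place until load-balancing policy moves them again.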
This approach will likely involve more overhead than a dedicated cross-server protocol. For example, the data structures for storing contents and mappings may not permit the redistribution of a single object; a collection of objects may have to be moved together. The reduced performance, however, may be an acceptable tradeoff for the reduced implementation complexity of reusing existing code. This is especially true if cross-server operations are expected to be relatively infrequent, as is the case in distributed file systems. If cross-server performance is important, the dynamic redistribution mechanism can be optimized in various ways to tune it for its new use, rather than just load-balancing. Taken to the logical limit, these optimizations can become as complex as a dedicated cross-server protocol. However, this affords a system designer a spectrum of performance versus complexity options. And unlike a dedicated protocol, it allows for incrementally improving the system from “simple but slow” to “complex and fast” while leaving the rest of the system unaffected.
I will validate the thesis by demonstrating that it is a good choice for real system scenarios. I will also quantify other hypothetical scenarios in which it is and is not effective. My exploration will include the following steps:
• I will implement dynamic redistribution in the metadata service of a large-scale object storage system. It will implement a range of designs representing SAN storage, private storage, and a variety of optimizations.
• I will implement cross-server operations using dynamic redistribution in the object metadata service.
• I will implement cross-server operations using two-phase commit in the object metadata service.
• I will implement dynamic redistribution in a scalable filesystem metadata service, and implement cross-server operations in that service using dynamic redistribution.
• I will measure the sensitivity of throughput to scaling the number of servers.
• I will measure the sensitivity of throughput to the frequency of cross-server operations.
• I will measure the sensitivity of average response time and its variance to the frequency of cross-server operations.
• I will conduct the previous experiments with both shared and private server storage.
• I will analyze file system traces to determine the rates at which such operations occur in real workloads.
• I will quantify the implementation complexity of each option.
The major metric is throughput, as one of the major motivations for constructing a scalable system is to gain throughput beyond the capabilities of a single server. Two-phase commit will likely perform better in all cases, but the difference is expected to be insignificant at the rename frequencies commonly seen in file system workloads. In that case, the reduced complexity will be a clear benefit.
Dynamic redistribution will have the effect of introducing brief stalls, possibly for other objects as well as those involved in a cross-server operation. Although overall throughput may be unaffected, this will affect the distribution of response times. This is likely insignificant to batch applications but may affect the user’s perception of responsiveness in an interactive application. If the dynamic redistribution mechanism can limit stalls to below a reasonable threshold, then the user should remain unaware.
The following sections describe major items referred to above in more detail.
Ursa Minor is a prototype object storage system being developed by the Parallel Data Lab. Ursa Minor presents clients with a single flat namespace of 128-bit Object IDs. The Ursa Minor Metadata Service manages the metadata for all objects in an Ursa Minor installation. This metadata consists of per-object attributes, such as length, permissions, and the list of storage nodes the data is stored on. When any client wants to read a block of an object, it must contact the metadata service to retrieve the mapping of that block to storage nodes. The metadata service stores metadata in tables, each table covering a range of objects. Each table is managed by a single metadata server.
Redistribution takes place by transferring responsibility for an entire table to another server. The only multi-object operation the MDS currently sees is Rename. It atomically updates the attributes of two objects, which correspond to the source and destination directories.
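As an illustration of range-partitioned tables with table-granularity redistribution, the sketch below maps object IDs to tables by ID range and tables to servers. The `TableMap` class, its method names, the server names, and the particular range boundaries are assumptions made for this example; the actual Ursa Minor data structures are not described here.

```python
import bisect


class TableMap:
    """Maps object IDs to tables by ID range, and tables to servers."""

    def __init__(self, range_starts, servers):
        # range_starts[i] is the lowest object ID covered by table i.
        if list(range_starts) != sorted(range_starts) or range_starts[0] != 0:
            raise ValueError("range starts must be sorted and begin at 0")
        if len(range_starts) != len(servers):
            raise ValueError("one server per table is required")
        self.range_starts = list(range_starts)
        self.servers = list(servers)

    def table_for(self, object_id):
        return bisect.bisect_right(self.range_starts, object_id) - 1

    def server_for(self, object_id):
        return self.servers[self.table_for(object_id)]

    def redistribute(self, table, new_server):
        # Responsibility moves for the whole table, never for a single object.
        self.servers[table] = new_server

    def spans_servers(self, *object_ids):
        return len({self.server_for(oid) for oid in object_ids}) > 1


# Three tables splitting a 128-bit ID space (boundaries chosen arbitrarily).
tables = TableMap([0, 1 << 126, 1 << 127], ["mds0", "mds1", "mds2"])
src_dir, dst_dir = 5, (1 << 127) + 9   # hypothetical directory object IDs
```

A Rename on `src_dir` and `dst_dir` would initially span two servers; calling `tables.redistribute(tables.table_for(dst_dir), tables.server_for(src_dir))` places both under one server, at the cost of moving every object in that table.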
While Ursa Minor provides only a flat namespace, traditional file system clients and applications expect a traditional hierarchical filesystem namespace. To this end, Ursa Minor will include a Namespace Service that translates filesystem paths into Object IDs for use by the lower layers of the system. Thus, the Namespace Service is essentially responsible for storing and manipulating directory contents while the Object Metadata Service is responsible for inode contents. This functionality is currently incorporated in Ursa Minor’s centralized NFS server, which exports an NFS file system to standard NFS clients. The task of scaling an NFS server is greatly complicated by the limitations of the NFS v3 protocol. The underlying namespace functionality can, however, be abstracted and scaled independently of the NFS protocol layer. The resulting scalable Namespace Service could become one component of a scalable pNFS server. Namespace operations that modify multiple items include Link and Unlink, which update the parent directory and the child, and Rename, which updates two parents and the child.
The design of the redistribution mechanism has a large influence on the cost of using it to eliminate cross-server operations. The choice of redistribution mechanism is driven by the system model and the expected frequency, purpose, and granularity of redistribution. In turn, the chosen design constrains the minimum latency and minimum granularity of redistribution. I will implement a number of mechanisms that represent designs commonly seen in existing systems for the purpose of load balancing, as well as ones that could be expected in a system designed from the ground up to use redistribution to handle cross-server operations. Experimenting with both options will reveal the overhead of reusing an existing mechanism.
The cost of performing redistribution varies greatly depending on the work required to redistribute.
In some systems, all servers share a common back-end storage pool. In such systems, redistribution merely involves logically handing off control of a portion of the storage pool. In the other common model, each server has its own private storage, and transferring responsibility requires transferring the relevant contents of one server’s local storage to the other’s. I will conduct experiments with both storage models to determine the effects of the increased redistribution cost.
The expected frequency of cross-server operations in a system is a significant criterion in selecting a mechanism for performing them. If cross-server operations are common, then the designer must select a mechanism that handles them very efficiently. If, on the other hand, they are rare, the designer has the flexibility to use a less efficient, but simpler, mechanism. Through trace analysis, I will quantify the frequencies seen in common storage workloads. I will evaluate the performance of the system at these and greater frequencies. Exploring the performance of this scheme at frequencies greater than those seen in current workloads reveals its capability to handle possible future storage workloads or other non-storage applications.
This section discusses scalability and transparency in distributed storage systems, cross-server operations, dynamic redistribution, and approaches to supporting them.
Many distributed file systems have been proposed and implemented over the years. To provide context for this work, they can be categorized by how transparently they scale. “Scalability” is used here to refer to increasing storage capacity and throughput by adding more servers. Techniques like using client-side caching to increase the number of clients that a given server can support [?] are orthogonal. Thus, a system scales by partitioning data and responsibilities across servers. “Transparent” scaling implies scaling without client applications being aware of how data is spread across servers. A distributed file system is not transparently scalable if applications or users are exposed to the effects of capacity exhaustion on a single server or receive different semantics depending on which server(s) hold the files being accessed.
No transparent scalability: Many distributed file systems, including those most widely deployed, do not scale transparently. NFS, CIFS, and AFS allow file servers to be added incrementally but each server serves independent file systems (or volumes, in the case of AFS). A client can mount file systems from multiple file servers, but must cope with each server’s limited capacity and the fact that multi-file operations (e.g., rename) are not atomic across servers.
Transparent data scalability: An increasingly popular design principle is to separate metadata management (e.g., directories, quotas, data locations) from data storage [?, ?, ?, ?]. The latter can be transparently scaled relatively easily as each data access is independent of any others.
Clients interact with the metadata server(s) for metadata activity and to discover the locations of data. They then access data directly at the appropriate data server(s). Metadata semantics and policy management stay with the metadata server(s), permitting simple centralized solutions. The metadata server(s) can limit throughput, of course, but offloading data accesses pushes the overall system’s limit much higher [?]. Scaling of the metadata service is needed to go beyond this limit, and many existing systems do not offer a transparent metadata scaling solution (e.g., Lustre [?] and most SAN file systems); in such systems, a second metadata server can only be added by creating a second file system.
Transparent scalability: A few distributed file systems offer full transparent scalability, including Farsite [?], GPFS [?], and Frangipani [?]. Most use the data scaling architecture above, separating data storage from metadata management. Then, they add protocols for handling metadata operations that span metadata servers. A later section discusses these further.
There are a variety of operations that manipulate multiple files, thus creating a consistency challenge when the files are not all on the same server. Of course, every file create and delete involves two files: the parent directory and the file being created or deleted. But, most systems assign a file to the server that owns its parent directory. At some points in the namespace, of course, a directory must be assigned somewhere other than the home of its parent, or else all metadata will be managed by a single metadata server. So, the create and delete of that directory will involve more than one server, but none of the other operations on it will do so. This section describes three more significant sources of multi-file operations: non-trivial namespace manipulations, transactions, and snapshots.
Namespace manipulations: The most commonly noted multi-file operation is rename, which changes the name of a file. The new name can be in a different directory, which would make the rename operation involve both the source and destination parent directories. Also, a rename operation can involve additional files if the destination exists (and, thus, should be deleted) or if the file being renamed is a directory (in which case, the ‘..’ entry must be modified).
Application programming is simplest when the rename operation is atomic, and both the POSIX and the NFSv3 specifications call for atomicity (see footnote 1). Many applications assume atomic rename, or at least that the destination will be either its before or after version, as a building block. For example, many document editing programs implement atomic updates by writing the new document version into a temporary file and then using rename to move it to the user-assigned name. Without atomicity, applications and users can see strange intermediate states, such as two identical files (one with each name) existing or one file with both names as hard links.
Creation and deletion of hard links are also multi-file operations. Links created in the same directory should rarely result in cross-server operations. But, for links created in other directories, it is possible for the two directories with names for a file to be on different servers. Thus, the create and subsequent delete of that file would involve both servers. The same can happen with a rename that moves a file into a directory managed by a different server than the directory in which it was originally created.
Transactions: Transactions are a very useful building block. Modern file systems, such as NTFS [?] and Reiser4 [?], are adding support for multi-request transactions. So, for example, an application could update a set of files atomically, rather than one at a time, and thereby preclude others seeing intermediate forms of the set. This is particularly useful for application installation, application upgrades, and source control applications.
Footnote 1: Each specification indicates one or more corner cases where atomicity is not necessarily required. For example, POSIX requires that, if the destination currently exists, the destination name must continue to exist and point to either the old file or the new file throughout the operation.
Snapshot: Point-in-time snapshots [?, ?, ?, ?] have become a mandatory feature of most storage systems, as a tool for on-line and consistent off-line back-ups. Snapshots also offer a building block for on-line integrity checking [?] and remote mirroring of data [?]. Snapshot is usually supported only for entire file system volumes, but some systems allow snapshot of particular subtrees of the directory hierarchy. In any case, it is clearly a substantial multi-file operation, with the expectation that the snapshot captures all covered files at a single point in time.
Dynamic redistribution has been an integral feature of many previous system designs. AFS, for example, allows volumes to be migrated among the servers in an AFS cell. xFS’s design [?], which splits metadata management from data storage, allows for migration of each (for subsets of data) among participating systems. Ceph [?] is a recent scalable object-based storage system that supports dynamic redistribution of metadata responsibilities among multiple metadata servers.
Implementing dynamic redistribution, while not trivial, is not overly complex. In all systems, it involves four primary components:
Mapping objects to servers: A level of indirection is a common tool for gaining flexibility in mapping virtual identities to physical resources. A level of indirection between the data or metadata ID and the server that manages that object allows the object to be transparently redistributed. The mapping can be provided by a mapping server (e.g., the volume location database in AFS) or a read-shared table (e.g., the manager map in xFS).
Changing the mapping: When a mapping is changed, all relevant parties must find out about it. The previous and next servers must know about it, since the previous one must stop serving requests and the next one should start accepting requests. At all times, the servers involved must agree on which one “owns” a given object.
Clients must also know which server to contact in order to access a given object. If clients cache the mapping, they must know when their cache is stale. Explicit invalidations through a callback mechanism are one approach. Another is to invalidate the cached mapping when the client contacts a server that is no longer managing that object. A third option is to have the old servers forward requests to the new server.
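The second option above (invalidating a cached mapping when a server reports that it no longer manages the object) can be sketched as follows. All names here (`NotResponsible`, `MappingService`, `MetadataServer`, `Client`) are hypothetical, and the sketch assumes the mapping service is always up to date by the time the old server starts refusing requests.

```python
class NotResponsible(Exception):
    """Raised by a server asked about an object it no longer manages."""


class MappingService:
    """Authoritative object-to-server mapping (hypothetical stand-in)."""

    def __init__(self, mapping):
        self._mapping = dict(mapping)
        self.lookups = 0

    def lookup(self, object_id):
        self.lookups += 1
        return self._mapping[object_id]

    def reassign(self, object_id, server_name):
        self._mapping[object_id] = server_name


class MetadataServer:
    def __init__(self, name, owned):
        self.name = name
        self.owned = set(owned)

    def get(self, object_id):
        if object_id not in self.owned:
            raise NotResponsible(object_id)
        return f"{object_id}@{self.name}"


class Client:
    def __init__(self, mapping_service, servers):
        self.mapping_service = mapping_service
        self.servers = servers   # server name -> MetadataServer
        self.cache = {}          # object id -> server name (may be stale)

    def get(self, object_id):
        for _ in range(2):       # at most one refresh per request in this sketch
            if object_id not in self.cache:
                self.cache[object_id] = self.mapping_service.lookup(object_id)
            try:
                return self.servers[self.cache[object_id]].get(object_id)
            except NotResponsible:
                del self.cache[object_id]   # stale entry: drop it and retry
        raise RuntimeError("mapping still stale after one refresh")


servers = {"A": MetadataServer("A", {"x"}), "B": MetadataServer("B", set())}
mapping = MappingService({"x": "A"})
client = Client(mapping, servers)
first = client.get("x")

# Redistribute "x" from A to B, then access it again through the stale cache.
servers["A"].owned.discard("x")
servers["B"].owned.add("x")
mapping.reassign("x", "B")
second = client.get("x")
```

The client pays one extra round trip to the old server and one extra mapping lookup after a redistribution, and nothing otherwise.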
Moving the content: In addition to shifting responsibility, the actual content being served needs to be put under the control of the new server. There are two cases to consider. First, when the object metadata being redistributed is stored in a storage infrastructure shared by the servers (e.g., in a SAN to which they are all attached), it is sufficient for the previous server to checkpoint any relevant dirty cache state and stop accessing it. The new server can then take over responsibility for that metadata and access it directly on the same storage device. Second, when objects are stored on each server’s locally-attached storage devices, they must actually be copied from one server to another. Clearly, such copying is much more expensive than the first case, and thus redistribution in such an environment must be used more carefully. While the content is being moved, most systems block write access to those contents. This may result in a noticeable performance hiccup.
Coordinating changes: A system that supports dynamic redistribution needs policies for deciding which changes to make and when to make them. In the case of load balancing to address performance or capacity limitations, the administrator or an automated process must balance the benefits of a proposed redistribution against its costs and initiate the redistribution.
In addition to transferring objects for load balancing, I propose to use the dynamic redistribution mechanism to ensure that all the objects involved in an operation are always the responsibility of a single server. Then the operation can be executed using the simple single-server codepath.
As discussed above, few distributed file systems support cross-server operations transparently. Those that do use one of two approaches: client-driven protocols and inter-server protocols.
In the first approach, illustrated by GPFS [?] and Frangipani [?], the client requesting the cross-server operation implements it. This approach generally requires clients and servers to share access to the underlying storage. The client acquires locks on the relevant files and updates them in the shared disk system, using write-ahead logging into shared journals to update them atomically. This approach offloads work from servers but also trusts clients to participate correctly in metadata updates and introduces interesting failure scenarios when clients fail.
In the second approach, illustrated by Farsite [?] (see footnote 2), operations that span servers are handled by an inter-server protocol. The protocol usually consists of some form of two-phase commit, with a lead server acting as the initiator. Because this second approach does not require modification of the client-server protocol, it is the one most often identified as a “planned extension” for systems that do not yet scale to more than one metadata server (e.g., [?, ?]). Such extensions often stay “planned” for a long time, however, because of complexity and performance overhead concerns (e.g., [?, ?]). When the functionality does get implemented, it is often significantly more complex than originally expected [?].
Because cross-server operations are rare, protocols implemented just to support them usually do not receive the same attention as the common cases. They tend to be exercised and tested less thoroughly, becoming a breeding ground for long-lived bugs that will arise when this code is finally relied upon.
We promote an approach based on eliminating cross-server operations, via heavy-weight redistribution mechanisms that serve multiple functions, rather than supporting them with their own protocols. Doing so can simplify scalable distributed file systems with minimal impact on performance, so long as cross-server operations are truly rare.
Footnote 2: A group of replicated state machines may be considered to be a single logical server for the purposes of this discussion. Farsite uses BFT-based [?] Byzantine fault-tolerant state machine replication to form each “directory group” that manages a subset of the namespace. For Farsite, “cross-server operations” means “cross-directory group operations.”
This section describes the initial results of trace analysis and experiments in the context of the
Ursa Minor Metadata Service.
Three real NFS traces were analyzed in order to determine the frequency and distribution of rename operations in real workloads. The results show that renames are rare and cross-directory renames are even rarer: they comprise a maximum of only 0.006% of all operations in the traces after excluding getattrs. Among these, only a small subset would be cross-server if assignment of files to servers was related to namespace locality, because many cross-directory renames are between ‘nearby’ directories. Additionally, cross-directory renames arrive in quick bursts, indicating that it might be possible to amortize the overhead of redistribution over multiple rename operations.
This subsection is organized as follows. Section 5.1.1 describes the traces used for our analysis. Section 5.1.2 analyzes the frequency and distribution of renames in the traces.
5.1.1 Traces Used
Three NFS traces [?] from Harvard University were analyzed. We describe each trace and its constituent workload below.
EECS: The EECS trace captures NFS traffic observed at a Network Appliance filer on February 2nd–28th, 2003. This filer serves home directories for the Electrical Engineering and Computer Science (EECS) Department. It sees an engineering workload of research, software development, and course work. Detailed characterization of this environment (over different time periods) can be found in [?].
DEAS: The DEAS trace captures NFS traffic observed at another Network Appliance server at Harvard University on February 2nd–28th, 2003. This filer serves the home directories of the Department of Engineering and Applied Sciences (DEAS). It sees a heterogeneous workload of research and development combined with e-mail and a small amount of WWW traffic. The workload seen in the DEAS environment can be best described as a combination of that seen in the EECS environment and e-mail traffic. Detailed characterization of this environment can be found in [?] and [?].
Operation                EECS                     DEAS                     CAMPUS
Total operations         176,933,945              577,751,752              655,637,109
lookup                   165,511,605 (92.979%)    567,130,767 (98.162%)    638,424,369 (97.375%)
create                   1,241,034 (0.758%)       1,856,980 (0.321%)       1,917,930 (0.293%)
update                   9,469,619 (6.207%)       8,593,249 (1.487%)       15,288,896 (2.332%)
Same-directory rename    96,086 (0.0543%)         134,680 (0.023%)         5,666 (ε)
Cross-directory rename   2,186 (0.0012%)          36,076 (0.006%)          248 (ε)
Table 1: Metadata operation breakdowns for the EECS, DEAS, and CAMPUS month-long traces. The equivalent metadata operations necessary to perform each operation in the traces are counted and shown in this table. Note that renames are very rare. NFS GETATTR operations are not counted in this table, since they are an artifact of the NFS caching model and, as such, clients of an object-based storage system would not issue these requests, as the storage system would likely support a more advanced caching model (e.g., leases). Note that if GETATTRs were counted, the contribution to total operations by renames would decrease even further.
CAMPUS: The CAMPUS trace captures activity observed by three NFS servers serving the campus community at Harvard University on October 15th–28th, 2001. It provides storage for the email, web, and computing activities of 10,000 students, staff, and faculty. Though it contains heterogeneous data, accesses to it are dominated almost entirely by e-mail traffic, thus making the CAMPUS workload resemble that of a dedicated e-mail server. Detailed characterization of this environment can be found in [?].
5.1.2 Trace Analysis
Real-world workloads contain far fewer cross-server operations than the proposed method is capable of supporting. Table 1 shows the breakdown of operation counts for the three traces after excluding getattr operations, which are an unfortunate artifact of the NFS caching model. Renames account for only 0.054% (96,086) of all remaining operations in the EECS trace, 0.023% (134,680) of all remaining operations in the DEAS trace, and are almost non-existent in the CAMPUS trace. Additionally, most of these renames will not require a multiple-server interaction, as they move files within the same directory (i.e., the source directory and destination directory are the same). Cross-directory renames, which may require a multiple-server interaction, account for only 0.001% (2,186) of all operations in EECS and 0.006% (36,076) of all operations in DEAS.
Among these, many rename files to nearby directories and would likely not be cross-server.
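The percentages quoted above follow directly from the counts in Table 1. As a quick check, the short script below recomputes the cross-directory rename share for each trace and expresses it as "one operation in N". The dictionary and function names are introduced here for illustration only; the counts are copied from Table 1.

```python
# Counts copied from Table 1 (GETATTR operations already excluded).
total_ops = {"EECS": 176_933_945, "DEAS": 577_751_752, "CAMPUS": 655_637_109}
cross_dir_renames = {"EECS": 2_186, "DEAS": 36_076, "CAMPUS": 248}


def percent_of_total(trace):
    """Cross-directory renames as a percentage of all counted operations."""
    return 100.0 * cross_dir_renames[trace] / total_ops[trace]


def ops_per_cross_dir_rename(trace):
    """Average number of operations per cross-directory rename."""
    return total_ops[trace] / cross_dir_renames[trace]


for trace in total_ops:
    print(f"{trace}: {percent_of_total(trace):.5f}% of operations, "
          f"about one in {ops_per_cross_dir_rename(trace):,.0f}")
```

Even in DEAS, the trace with the highest share, this works out to roughly one cross-directory rename per 16,000 operations, and that is an upper bound on the cross-server rate since many of these renames stay on one server.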
Figure 1 shows that cross-directory renames in real-world workloads are bursty. Specifically, in the EECS trace, there is a 70% chance that a second rename will be seen within one second of the first; this probability is 93% in the DEAS trace. Such bursts might result, for example, from a user reorganizing the namespace by moving many files or subdirectories to a different location in the hierarchy. The presence of burstiness suggests that the overhead of dynamic redistribution can be amortized across multiple cross-server renames as long as all of them involve the same two metadata servers. Cross-directory renames are not as bursty in the CAMPUS trace as in the EECS and DEAS traces, but the overhead of dynamic redistribution in a workload resembling CAMPUS is not likely to be significant as renames are extraordinarily rare in this trace.
[Figure 1 plot: three panels (EECS 02/02/03–02/28/03, DEAS 02/02/03–02/28/03, CAMPUS 10/15/01–10/28/01); x-axis: max. cross-dir rename inter-arrival time (1s, 10s, 100s, 16.7m, 2.7h, 27.7h, Inf); y-axis: 0 to 100]
Figure 1: These graphs show the percentage of cross-directory rename operations vs. maximum inter-arrival time. Most cross-directory renames occur in bursts: in the EECS and DEAS traces, 70% and 93% of all cross-directory renames occur within one second of each other. The inter-arrival times of cross-directory renames are larger in the CAMPUS trace, but renames are also much less frequent in this trace than in the EECS and DEAS traces.
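One plausible way to compute a burstiness curve of this kind is to take the gaps between consecutive cross-directory renames and report the fraction that fall at or below each threshold. The sketch below does this on synthetic timestamps; it is an assumption about the method, not a description of the analysis scripts actually used, and the function names and sample data are invented for illustration.

```python
def inter_arrival_gaps(timestamps):
    """Gaps in seconds between consecutive events, after sorting by time."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def fraction_within(timestamps, max_gap_seconds):
    """Fraction of consecutive arrivals separated by at most max_gap_seconds."""
    gaps = inter_arrival_gaps(timestamps)
    if not gaps:
        return 0.0
    return sum(gap <= max_gap_seconds for gap in gaps) / len(gaps)


# Synthetic arrival times in seconds (NOT taken from the traces):
# a burst of four renames, a second short burst, and one isolated rename.
arrivals = [0.0, 0.2, 0.5, 0.9, 300.0, 300.4, 7200.0]
thresholds = [1, 10, 100, 1000, 10_000]
cdf = [fraction_within(arrivals, t) for t in thresholds]
```

With real trace data, the value at the one-second threshold would correspond to the 70% and 93% figures reported for EECS and DEAS, provided the same definition of inter-arrival time is used.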
I modified the Ursa Minor Metadata Service from a centralized service to a distributed service that uses dynamic redistribution to handle multi-object operations. The goal was to test two hypotheses: first, that this system will be transparently scalable for traditional file system workloads with negligible performance penalty; second, that workloads with higher proportions of cross-server operations can be reasonably handled.
5.2.1 Design and Implementation
Redistribution in Ursa Minor takes place at the granularity of a metadata table. In the base Metadata Service, these tables are stored as objects within the shared object storage system. Thus, any server may access the storage of any table. Servers are delegated the responsibility for certain tables by a Delegation Manager. The Delegation Manager maintains the authoritative list of delegations. Clients fetch the delegation list at startup and again any time the client’s cached copy of the list disagrees with a server’s, as indicated by a server responding that it is not responsible for an object that the client thinks it should be. Redistribution of a table merely involves the sending server completing any in-progress requests for that table and flushing any dirty cache state for it. The receiving server will then fault into its cache any portions of the table it needs to serve subsequent requests.
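The client-side handling of a stale delegation list described above can be sketched as follows. This is an illustrative model only: the class names, the `NotResponsible` exception, and the single-retry policy are assumptions, and the toy server consults the Delegation Manager directly rather than tracking its own delegations.

```python
class NotResponsible(Exception):
    """Raised by a server asked about a table it is not delegated (hypothetical)."""


class DelegationManager:
    def __init__(self, owner):
        self.owner = dict(owner)  # authoritative table_id -> server name

    def fetch(self):
        return dict(self.owner)   # clients receive a snapshot of the list


class Server:
    def __init__(self, name, delegation_manager):
        self.name = name
        self.dm = delegation_manager

    def handle(self, table_id, request):
        # Simplification: check the authoritative list on every request.
        if self.dm.owner[table_id] != self.name:
            raise NotResponsible(table_id)
        return f"{self.name} handled {request} on {table_id}"


class Client:
    def __init__(self, delegation_manager, servers):
        self.dm = delegation_manager
        self.servers = servers               # server name -> Server
        self.delegations = self.dm.fetch()   # fetched once at startup

    def send(self, table_id, request):
        try:
            target = self.servers[self.delegations[table_id]]
            return target.handle(table_id, request)
        except NotResponsible:
            # The cached list disagreed with the server: refetch and retry once.
            self.delegations = self.dm.fetch()
            target = self.servers[self.delegations[table_id]]
            return target.handle(table_id, request)
```

In this sketch a client pays the cost of refetching only on the first request after a table has moved; requests for unmoved tables never contact the Delegation Manager.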
To emulate a system that does not have shared storage, I modified the redistribution system so that each server could only access tables stored on certain assigned storage nodes. In this configuration, redistribution of a table involves the sending server reading the table from its storage and transmitting it over the network to the receiving server, and finally the receiving server writing the table to its private storage. Unlike the shared-storage case, this copy effectively pre-loads the cache of the receiving server. Hence, the first few requests that reference the moved table will not suffer cache misses.
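The difference between the two redistribution modes can be made concrete with a toy model. Everything here is an assumption made for illustration: tables are plain dictionaries, a cache miss faults in a whole table rather than a portion of it, and completion of in-progress requests is omitted.

```python
class MetadataServer:
    """Toy metadata server: a cache of whole tables plus a miss counter."""

    def __init__(self, name):
        self.name = name
        self.cache = {}          # table_id -> {key: value}
        self.private_store = {}  # used only in the dedicated-storage mode
        self.misses = 0

    def read(self, table_id, key, store):
        """Read one entry, faulting the table in from `store` on a miss."""
        if table_id not in self.cache:
            self.misses += 1
            self.cache[table_id] = dict(store[table_id])
        return self.cache[table_id][key]


def redistribute_shared(table_id, sender, shared_store):
    """Shared storage: the sender flushes and forgets; nothing is copied."""
    if table_id in sender.cache:
        shared_store[table_id] = dict(sender.cache.pop(table_id))


def redistribute_dedicated(table_id, sender, receiver):
    """Dedicated storage: read, ship, and write the table; the cache ends up warm."""
    if table_id in sender.cache:
        sender.private_store[table_id] = dict(sender.cache.pop(table_id))
    table = sender.private_store.pop(table_id)
    receiver.private_store[table_id] = dict(table)
    receiver.cache[table_id] = dict(table)
```

In this model, the receiver's first read after `redistribute_shared` counts as a miss, while the first read after `redistribute_dedicated` does not, mirroring the cache behavior described in the text.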
Once dynamic redistribution existed, adding support for cross-server operations was easy. One major operation in Ursa Minor potentially involves multiple objects: rename. When the Ursa Minor NFS server performs a cross-server rename, it must atomically modify the attributes of the source and destination directories, which it does using a rename metadata operation. The major addition required to support this was a Coordination Manager to coordinate the necessary delegation changes. A server needing to perform an operation for which it does not have all tables sends a grab request to the Coordination Manager to request the needed table(s). The Coordination Manager waits until the requested tables are not used by any other in-progress operation and then issues a sequence of requests to the Delegation Manager to move them. When the server is done with the rename, it signals the Coordination Manager with a release, at which point the Coordination Manager is free to redistribute the table as it sees fit. Because Ursa Minor does not yet have a dynamic load-balancer, the choices it has are to return the table to its original server immediately or to leave it in place until some other rename requires moving it. By default, it returns the table immediately.
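The grab/release exchange described above might look roughly like the following single-threaded sketch. The class and method names are assumptions, not the Ursa Minor interfaces, and where the text describes the Coordination Manager waiting for busy tables, this sketch simply returns `False` so that the caller can retry.

```python
class DelegationManager:
    """Authoritative map from table id to the server responsible for it."""

    def __init__(self, owner):
        self.owner = dict(owner)

    def move(self, table_id, server):
        self.owner[table_id] = server


class CoordinationManager:
    def __init__(self, delegation_manager, return_immediately=True):
        self.dm = delegation_manager
        self.return_immediately = return_immediately
        self.in_use = set()  # tables pinned by an in-progress operation
        self.home = {}       # table id -> server to return it to on release

    def grab(self, server, table_ids):
        """Delegate every table in table_ids to server, or return False if busy."""
        if any(t in self.in_use for t in table_ids):
            return False  # simplification: caller retries instead of waiting
        for t in table_ids:
            self.in_use.add(t)
            previous = self.dm.owner[t]
            if previous != server:
                if self.return_immediately:
                    self.home[t] = previous
                self.dm.move(t, server)
        return True

    def release(self, table_ids):
        """End the operation; return borrowed tables if the policy says so."""
        for t in table_ids:
            self.in_use.discard(t)
            if t in self.home:
                self.dm.move(t, self.home.pop(t))
```

With `return_immediately=True` (the default behavior described in the text), a table goes back to its original server on release; with `False` it stays where it is until another grab moves it.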
5.2.2 Scalability Evaluation
To apply load to the system, I used a purpose-built load generator, which issues requests to the Metadata Servers as fast as possible. The load generator emulates the MDS requests issued by an operating Ursa Minor NFS server. The NFS operations emulated are randomly chosen with a proportion of types matching those seen in the specified trace, and operations are uniformly distributed across files. This is pessimistic compared to the trace workload, because it does not include spatial or temporal locality.
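A load generator with the properties described above could be sketched as follows. This is not the actual tool: the function name, its parameters, and the operation mix in the example are hypothetical, and a real generator would issue the requests to the metadata servers rather than yield them.

```python
import random


def generate_requests(op_mix, num_files, count, seed=0):
    """Yield (operation, file_id) pairs.

    Operation types are drawn in proportion to the weights in `op_mix`;
    file ids are drawn uniformly, so there is no spatial or temporal locality.
    """
    rng = random.Random(seed)
    ops = list(op_mix)
    weights = [op_mix[op] for op in ops]
    for _ in range(count):
        op = rng.choices(ops, weights=weights, k=1)[0]
        file_id = rng.randrange(num_files)
        yield op, file_id


# Hypothetical mix; real proportions would be taken from the trace being emulated.
example_mix = {"getattr": 60, "lookup": 30, "setattr": 9, "rename": 1}
requests = list(generate_requests(example_mix, num_files=1000, count=5))
```

Seeding the generator makes runs repeatable, which is convenient when comparing configurations against the same request stream.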
Figure 2 shows that Ursa Minor is transparently scalable for the EECS and DEAS workloads when the metadata servers are using either shared or dedicated storage. Specifically, as metadata servers are added, throughput increases linearly. CAMPUS is not shown in these graphs, as the number of cross-directory renames in this workload is extremely small.
Figure 2: Ursa Minor with both shared and dedicated storage is transparently scalable for regular filesystem workloads. (Two panels, each plotting the DEAS and EECS workloads against the number of metadata servers, from 2 to 10.)

In addition to traditional filesystem workloads, it is important to identify the most demanding workloads for which using dynamic redistribution to eliminate cross-server operations will perform well. To do so, I added additional cross-server renames to the EECS workload. Figure 3 shows the degradation in throughput as the percentage of cross-directory renames in the EECS workload is increased. Performance degradation is measured by the reduction in normalized throughput, which is defined as the throughput obtained when the directories involved in a cross-directory rename are stored on different metadata servers divided by the throughput obtained with the same number of cross-directory renames when both directories are stored on the same metadata server. Hence, any degradation in performance is solely a result of the coordination overhead and the dynamic redistribution mechanism.
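The definition of normalized throughput above can be written compactly. The symbols are introduced here for illustration only: p is the percentage of cross-directory renames in the workload, T_diff(p) is the throughput when the two directories are on different metadata servers, and T_same(p) is the throughput when both are on the same metadata server.

```latex
\[
  T_{\mathrm{norm}}(p) \;=\; \frac{T_{\mathrm{diff}}(p)}{T_{\mathrm{same}}(p)},
  \qquad
  \mathrm{degradation}(p) \;=\; 1 - T_{\mathrm{norm}}(p)
\]
```

As a purely hypothetical example, if a configuration sustained 900 requests per second with the directories on different servers and 1000 requests per second with them on the same server, its normalized throughput would be 0.9, i.e., a 10% degradation attributable to coordination and redistribution.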
The figure shows that noticeable performance degradation does not occur until the percentage of cross-server renames in the applied workload exceeds 0.01%. This critical point occurs sooner with dedicated storage, but both points are well beyond the percentage of cross-directory renames contained in the EECS, DEAS, or CAMPUS workloads. After these initial degradation points, the performance with dedicated storage drops at a faster rate than with shared. From this, we can conclude that systems that use dynamic redistribution to eliminate cross-server operations will perform well even for workloads comprised of a much larger percentage of cross-server operations than traditional filesystem workloads. However, shared-storage systems will be able to use this technique for more demanding workloads than dedicated-storage systems.
Figure 3: Throughput vs. percent of cross-server renames in the workload (0.001 to 0.100) for Ursa Minor with both shared and dedicated storage. (Series: "Throughput for System Xts-shared" and "Throughput for System Xts-copy".)

Figure 4 shows the overhead of the cache misses incurred by the metadata server grabbing a table in the shared-storage case. Recall that in the dedicated-storage case, moving a table from one metadata server to another effectively pre-loads the cache of the metadata server that is receiving the table. Conversely, in the shared-storage case, no tables are actually moved and so the cache of the server that is grabbing the table is not pre-loaded with the grabbed table. As a result, every cross-server operation will result in at least one cache miss. Additionally, since the metadata server that originally lent the table must flush its cache before allowing the grab to proceed, the first few requests to the lent table after it is returned will also suffer cache misses. The figure shows that the extra overhead imposed by these cache misses is small for workloads comprised of only a small percentage of renames, but increases to 10% when the percentage of renames is large (0.1%). Moreover, it shows that most of the degradation caused by using redistribution to eliminate cross-server operations with shared storage is due to these cache misses, since avoiding them eliminates the degradation.

Figure 4: Cost of initial cache misses at the metadata server that has grabbed a table in order to perform a cross-server operation when using shared storage. (Series: "Throughput w/o cache misses" and "Actual throughput"; x-axis: percent of cross-server renames in workload, 0.001 to 0.100.)
Figure 4 illustrates the tradeoff between performance and the effort of implementing special-purpose protocols. It is conceivable that the cache misses incurred by the lender upon return of the grabbed table could be avoided by implementing a special-purpose protocol, which would allow the metadata server returning the table to also transfer the entries it modified. Such a protocol would obviate the need for the lender to flush its cache before lending the table and hence would eliminate many of the cache misses seen by it. However, it is questionable whether the extra implementation effort is worthwhile given that the performance difference will only be non-negligible when the percentage of renames is very high.
Task                                 Months   Completion
MDS Dynamic Redistribution           2        Mar 07
MDS Cross Server Rename              2        Mar 07
Proposal                             1        June 07
MDS 2-Phase Commit                   1        July 07
MDS Sensitivity Experiments          1        July 07
MDS Load Balancing                   1        Aug 07
MDS Optimizations and Experiments    2        Aug 07
Trace Analysis                       2        Aug 07
Internship                           3-4      Dec 07
NFS Dynamic Redistribution           3        Mar 08
NFS Cross-Server Ops                 2        May 08
NFS 2-Phase Commit                   1        June 08
NFS Experiments                      1        July 08
Code Complexity Analysis             1        Aug 08
Thesis Writing and Job Search        4        Dec 08