Improving Performance of a Distributed File System Using OSDs and Cooperative Cache

A. Teperman (teperman@il.ibm.com) and A. Weit (weit@il.ibm.com)
IBM Haifa Labs, Haifa University, Mount Carmel, Haifa 31905, Israel

Abstract

zFS is a scalable distributed file system that uses Object Store Devices (OSDs) for storage management and a set of cooperative machines for distributed file management. zFS evolved from the DSF project [7], and its high-level architecture is described in [11]. This work uses a cooperative cache algorithm that is resilient to network delays and node failures. We explore the effectiveness of this algorithm, and of zFS as a file system, by comparing the system's performance to NFS using the IOZONE benchmark [9]. We also investigate whether using a cooperative cache results in better performance, despite the fact that the OSDs have caches of their own. Our results show that the zFS prototype performs better than NFS when the cooperative cache is activated, and that zFS provides better performance even though the OSDs have caches of their own. We also demonstrate that using pre-fetching in zFS increases performance significantly. Thus, zFS performance scales well as the number of participating clients increases.

1 Introduction

zFS was designed as a distributed file system that offers extended scalability by separating storage management from file management. Storage management is carried out using Object Store Devices (OSDs), and file management is distributed over a set of cooperative machines [11]. These machines are commodity, off-the-shelf components, such as PCs running existing operating systems, connected by high-speed networks.

The two most prominent features of zFS are its cooperative cache and distributed transactions. In this article, we focus on the cooperative cache and investigate its effectiveness in the face of the internal cache used by the OSDs themselves. The cooperative cache of zFS integrates the memory of all participating machines into one coherent cache. Thus, when machine A requests a data block that already resides in the memory of machine B, the zFS cooperative cache retrieves the data block from the memory of B instead of going to the OSD. Our work examines whether a cooperative cache can improve performance, even though the OSDs have caches of their own.

Accepting data from the memory of another machine poses a security problem in the presence of un-trusted clients. In this article, we assume that all clients are trusted and the network is secure.

The rest of this article is organized as follows: Section 2 briefly describes the zFS architecture. In Section 3, we detail how the cooperative cache works. Section 4 describes the test environment, followed by Section 5, where we compare the performance of zFS to that of NFS. In Section 6, we compare the zFS cooperative cache to that of other file systems, and we close with concluding remarks in Section 7.

2 zFS Architecture

This section briefly describes the zFS components and how they interact with each other. A full description can be found in [11].

All storage allocation and management in zFS is delegated to the OSDs. When a file is created and written to, the data blocks are sent to the OSD, which allocates space on the physical disk and writes the data to the allocated space. Usually, this is done asynchronously for performance reasons.

The zFS front-end runs on every node that mounts zFS (in this article, the term client or FE refers to the zFS front-end). It presents the user with a standard file system API and provides access to zFS files and directories. Other zFS components, including the file manager (FMGR), lease manager (LMGR), and transaction manager (TMGR), can run on nodes that mount zFS or on other nodes. In the latter case, applications running on these nodes cannot access zFS files, although the nodes participate in the file system management.

To ensure data integrity, file systems use various locking mechanisms. zFS uses leases rather than locks. Lease management is split between the lease manager and the file manager. The lease manager acquires an OSD lease from the OSD and, when requested, grants the file managers exclusive leases on whole files. After receiving the exclusive file lease, the file manager manages ranges of leases, which it grants to the clients (FEs). (Details on how range leases are handled are beyond the scope of this document and can be found in [11].)
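The following is a minimal sketch, not the zFS source, of the three-level lease hierarchy just described: the lease manager's OSD lease, a file manager's exclusive whole-file lease, and the range leases handed out to clients. All type and field names are illustrative assumptions.

```c
#include <stdint.h>
#include <time.h>

typedef uint64_t obj_id_t;   /* object (file) identifier on an OSD */
typedef int      node_id_t;  /* cluster node identifier */

/* Held by the lease manager for a whole OSD. Leases, unlike locks,
 * expire, so a failed holder cannot block the system forever. */
struct osd_lease {
    uint32_t osd_id;
    time_t   expiry;
};

/* Exclusive whole-file lease granted by the lease manager to the single
 * file manager responsible for the file. */
struct file_lease {
    obj_id_t file;
    time_t   expiry;
};

enum range_mode { RL_READ, RL_WRITE };

/* Range lease granted by the file manager to a client (FE). */
struct range_lease {
    obj_id_t        file;
    uint64_t        first_page;
    uint64_t        n_pages;    /* contiguous range of pages */
    enum range_mode mode;
    node_id_t       holder;
    time_t          expiry;
};
```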
Figure 1: zFS architecture. Each participating node runs the front-end and cooperative cache. Each OSD has a single lease manager associated with it. Several file managers and transaction managers run on various nodes in the cluster.

Every file opened in zFS is managed by a single file manager that is assigned to the file when it is first opened. The set of all currently active file managers manages all opened zFS files. Initially, no file has an associated file manager. The first machine to perform an open() on file F creates a local instance of the file manager for that file. Henceforth, and until that file manager is shut down, each lease request for any part of F is handled by this file manager. (The manner in which nodes find the corresponding file manager for F is described in [11].)

The cooperative cache of zFS is a key component in achieving high performance. Its advantage stems from today's fast networks, which enable data to be retrieved more rapidly from the memory of a remote machine than from the local disk. Thus, if the same data is read by two different nodes, the first node reads the data blocks from the OSD, while the second reads them from the memory of the first node, achieving better performance. This also eliminates a potential bottleneck at the OSD when many clients read the same data: each client retrieves the data from a different node, as explained below.

3 zFS Cooperative Cache

In zFS, we made an architectural design decision to integrate the cooperative cache with the Linux kernel page cache, for two main reasons. First, doing so avoids having the operating system run two caches with two different cache policies, which may interfere with each other. Second, we wanted comparable local performance between zFS and the other local file systems supported by Linux, all of which use the kernel page cache. As a result, we achieve the following with no extra effort:

1. Eviction is invoked by the kernel according to its internal algorithm, when free available memory is low. We do not need a special zFS mechanism to detect this.
2. Caching is done on a per-page basis, not on whole files.
3. Fairness exists between zFS and other file systems; pages are treated equally by the kernel algorithm, regardless of the file system type.
4. When a file is closed, its pages remain in the cache until memory pressure causes the kernel to discard/evict them.

When eviction is invoked and the candidate page for eviction is a zFS page, the decision is passed to a zFS-specific routine, which decides whether to forward the page to the cache of another node or to discard it, as described in Section 3.2 and sketched below.
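A minimal sketch of that decision point, assuming helper routines for messaging and page-cache manipulation (all names here are illustrative, not actual kernel or zFS interfaces):

```c
#include <stdbool.h>

struct zfs_page {
    bool singlet;      /* cached on this node only? (see Section 3.2) */
    int  recirc;       /* hops since last access (recirculation counter) */
};

/* Assumed helpers, not real APIs. */
void notify_fmgr_discard(struct zfs_page *p);
void notify_fmgr_move(struct zfs_page *p, int target);
void forward_page(struct zfs_page *p, int target);
void discard_page(struct zfs_page *p);
int  node_with_most_free_memory(void);

/* Called by the eviction path when the victim is a zFS page. */
void zfs_evict(struct zfs_page *p)
{
    if (!p->singlet || p->recirc >= 2) {
        /* Replicated pages, and singlets that hopped twice without being
         * accessed, are simply dropped after telling the file manager. */
        notify_fmgr_discard(p);
        discard_page(p);
        return;
    }
    /* Singlet: notify first, then forward, then discard locally.
     * This order matters for the failure handling in Section 3.3. */
    int target = node_with_most_free_memory();
    notify_fmgr_move(p, target);
    forward_page(p, target);
    discard_page(p);
}
```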
3.1 Existing Optimizations

The current implementation of the zFS page cache supports the following optimizations:

1. When an application writes a whole page of a zFS file, no read is done from the OSD; only the write lease is acquired.
2. If one application/user on a machine has a write lease, all other applications/users on that machine may read and write the page without going to the file manager for a lease. The kernel checks the permission to read/write based on the permissions specified in the mode parameter when the file is opened. If the mode bits allow the operation, zFS does not prevent it.
3. When client A has a write lease and client B requests a read lease, the page is written to the OSD if it is dirty, and A's lease is downgraded from a write to a read lease without discarding the page. This increases the probability of a cache hit, thus increasing performance.

3.2 zFS Cooperative Cache Algorithm

In cooperative cache terminology, each data block/page is either a singlet or replicated [5]. (The zFS prototype is implemented under Linux, and the data blocks used are pages; thus, we use the term page for data block.) A singlet page resides in the memory of only one node, while a replicated page resides in the memory of several nodes.

When client A opens and reads a file, the local page cache is checked. In case of a cache miss, zFS requests the page and its read lease from the file manager. The file manager checks whether a range of pages starting with the requested page has already been read into the memory of another machine. If not, zFS grants the leases to A, instructing it to read the range of pages from the OSD. A then reads the range from the OSD, marking each page as a singlet. If the file manager finds that the range resides in the memory of node B, it sends a message to B requesting that B send the range of pages and leases to A; the file manager then records internally that A also holds this range. In this case, both A and B mark the pages as replicated. We call node B a third-party node, since A gets the requested data not from the OSD but from a third party.

When memory pressure occurs on client A, the Linux kernel invokes the kswapd() daemon, which scans inactive pages and starts discarding them. In our modified kernel, if the page is a replicated zFS page, a message is sent to the file manager indicating that machine A no longer holds the page, and the page is discarded. (Discarding a page of a file involves several page cache operations, such as removing the page from the inode hash; we do not detail these operations here, to keep the description concise.) If the zFS page is a singlet, the following sequence takes place.

Forwarding a singlet page:

1. A message is sent to the file manager indicating that the page is being sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.

Similar to [5], zFS uses a recirculation counter: if a singlet page has not been accessed after two hops, it is discarded. Once the page has been accessed, the recirculation counter is reset.

When a file manager is notified about a discarded page, it updates the lease and page location and checks whether the page has become a singlet. If only one other node N now holds the page, the file manager sends N a singlet message to that effect, as in the sketch below.
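A sketch of this bookkeeping on the file manager side, under an assumed holder-set representation (illustrative names, not the zFS source):

```c
#define MAX_HOLDERS 64

struct page_rec {
    unsigned long page_id;
    int holders[MAX_HOLDERS];   /* nodes currently caching the page */
    int n_holders;
};

void send_singlet_msg(int node, unsigned long page_id);  /* assumed helper */

/* A client reported that it discarded (or forwarded away) a page. */
void fmgr_page_discarded(struct page_rec *pr, int node)
{
    for (int i = 0; i < pr->n_holders; i++) {
        if (pr->holders[i] == node) {
            /* Remove by swapping in the last holder. */
            pr->holders[i] = pr->holders[--pr->n_holders];
            break;
        }
    }
    /* The page may just have turned from replicated into a singlet. */
    if (pr->n_holders == 1)
        send_singlet_msg(pr->holders[0], pr->page_id);
}
```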
However, this idyllic description becomes complicated when we take node failures and network delays into consideration, as explained below.

3.3 Node Failures and Network Delay Issues

The approach we adopted can be stated informally as follows. It is acceptable for the file manager to assume the existence of pages on nodes even if this is not true; if the file manager is wrong in its assumption, its request will be rejected and it will update its records. It is unacceptable, however, to have pages on nodes that the file manager is not aware of, since this may cause data corruption; thus, it is not allowed.

Therefore, the order of the messages described under forwarding a singlet page above is important. If the node fails before Step 1, the file manager will eventually detect this and update its data to reflect that the respective node holds no pages and leases. If the node cannot execute Step 1 and notify the file manager, it does not forward the page but only discards it. We then end up with a situation where the file manager assumes the page exists on node A although it does not; as stated above, this is acceptable, since it can be corrected without data corruption. The same applies if the node fails after Step 1: the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, the file manager assumes the page is on B although in reality it is not. Failure after Step 2 does not pose any problem.

Network delays also need to be handled correctly. Consider the case where a replicated page residing on nodes M and N is discarded from the memory of M. The file manager, finding that the page has become a singlet, sends a message to N with this information. Due to network delays, however, this message may arrive after memory pressure has developed on N: the page will then be discarded because it is still marked as replicated, while in reality it is a singlet and should have been forwarded to another node. We handle this situation as follows. If a singlet message arrives at N and the page is no longer in N's cache, the cooperative cache algorithm on N ignores the singlet message. Because the file manager still "knows" that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N sends back a reject message to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client that has in the meantime read the page from the OSD, or by telling B to read the page from the OSD. In other words, network delays in such cases cause performance degradation, but not inconsistency.

Another possible scenario is that no memory pressure occurred on N, a forwarded page has not arrived yet, and a singlet message arrived and was ignored. The file manager then asked N to forward the page to B, and N sent a reject message back to the file manager. If the page never arrives at N, due to sender failure or network failure, there is no problem. However, if the page arrives after the reject message was sent, we face a consistency problem if a write lease exists: because the file manager is not "aware" of the page on N, another node may get the write lease and the page from the OSD, leaving two clients on two different nodes holding the same page with write leases. To avoid this situation, N keeps a reject list, which records the pages (and their corresponding leases) that were requested but rejected. When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically by the FE, and each entry whose time on the list exceeds T is deleted. T is the maximum time it can take a page to reach its destination node; it is determined experimentally and depends on the network topology.
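The reject list itself can be as simple as a timestamped linked list. The sketch below, with assumed names and a made-up value of T, shows the three operations the text requires: recording a rejected page, dropping a late arrival, and expiring old entries.

```c
#include <time.h>
#include <stdbool.h>
#include <stdlib.h>

struct reject_entry {
    unsigned long page_id;
    time_t rejected_at;
    struct reject_entry *next;
};

static struct reject_entry *reject_list;
static const double T = 2.0;   /* max transit time (seconds); experimental */

/* A request for this page was rejected: remember it. */
void reject_add(unsigned long page_id)
{
    struct reject_entry *e = malloc(sizeof *e);
    e->page_id = page_id;
    e->rejected_at = time(NULL);
    e->next = reject_list;
    reject_list = e;
}

/* Called when a forwarded page arrives; true means drop the page. */
bool reject_check_and_remove(unsigned long page_id)
{
    for (struct reject_entry **pp = &reject_list; *pp; pp = &(*pp)->next) {
        if ((*pp)->page_id == page_id) {
            struct reject_entry *e = *pp;
            *pp = e->next;
            free(e);
            return true;
        }
    }
    return false;
}

/* Periodic scan by the FE: entries older than T are deleted. */
void reject_expire(void)
{
    struct reject_entry **pp = &reject_list;
    while (*pp) {
        if (difftime(time(NULL), (*pp)->rejected_at) > T) {
            struct reject_entry *e = *pp;
            *pp = e->next;
            free(e);
        } else {
            pp = &(*pp)->next;
        }
    }
}
```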
An alternative for handling these network delay issues would be a complicated synchronization mechanism that keeps track of the state of each page in the cluster. This is unacceptable for two reasons: first, it incurs overhead from extra messages; second, the synchronization delays the kernel when it needs to evict pages quickly.

Network delays cause one further problem. Suppose node N notifies the file manager upon forwarding a page to M, and M does the same upon forwarding the page to O, but the link from N to the file manager is slow compared to the other links. The file manager may then receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M; moreover, its records do not show this page and lease as residing on M. The problem is further complicated by the fact that M may decide to discard the page, and this notification may arrive at the file manager before the move notification. Figure 2 illustrates this scenario.

To solve this problem, we use the following data structures:

- Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another. Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero; it is incremented whenever the lease and page are transferred to another node.
- When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node.
- Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: last_hop_count, initially set to -1, and target_node, initially set to NULL.

Figure 2: Delayed MOVE notification messages. Due to network delays, the correct order of messages (1), (2), (3), and (4) occurs in reality as (2), (3), (4), and (1).

Using Figure 2 as a reference, we now describe informally how the move is executed. If message (3) arrives first, its hop count and target node are saved in the lease record, since node M is not registered as holding the lease and page. When message (1) then arrives, N is the registered node; therefore, the lease is "moved" to the target node stored in the target_node field, by updating the information stored in the internal tables of the file manager. If message (3) arrives first and then message (5) arrives, the information from message (5) is stored, due to its larger hop count, and used when message (1) arrives; when message (3) finally arrives, it is ignored due to its smaller hop count. In other words, the hop count enables us to ignore late messages that have become irrelevant.

Now suppose the page was moved from N to M and then to O, where it was discarded because of memory pressure on O and because its recirculation count exceeded its limit. O then sends a release_lease message, which arrives at the file manager before the move notifications. This case is resolved as follows: since O is not registered as holding the page and lease, the release_lease message is placed on a pending queue and a flag is raised in the lease record. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.
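The following sketch shows one way the hop-count rules above can be realized on the file manager side. It is a simplified reconstruction from the description (it omits the pending queue for release_lease messages), and all names are assumptions.

```c
#define NO_NODE (-1)

struct move_msg {
    int source;      /* node that initiated this move */
    int target;      /* node the page was forwarded to */
    int hop_count;   /* hops the lease had made when the move started */
};

struct lease_rec {
    int holder;          /* node registered as holding lease and page */
    int last_hop_count;  /* initially -1 */
    int saved_target;    /* initially NO_NODE (NULL in the text) */
};

void fmgr_handle_move(struct lease_rec *l, const struct move_msg *m)
{
    if (m->source == l->holder) {
        /* The registered holder reports its move. If later hops in the
         * chain were already announced out of order, jump straight to
         * the freshest saved target. */
        l->holder = (l->saved_target != NO_NODE) ? l->saved_target
                                                 : m->target;
        if (m->hop_count > l->last_hop_count)
            l->last_hop_count = m->hop_count;
        l->saved_target = NO_NODE;
    } else if (m->hop_count > l->last_hop_count) {
        /* Out-of-order notification from a node not yet registered:
         * remember the freshest hop count and target for later use. */
        l->last_hop_count = m->hop_count;
        l->saved_target   = m->target;
    }
    /* Otherwise: a late message about a superseded hop; ignore it. */
}
```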
3.4 Choosing the Proper Third-party Node

The file manager uses an enhanced round-robin method to choose the third-party node that holds a range of pages starting with the requested page. For each range granted to a node N, the file manager records the time it was granted, t(N). When a request arrives, the file manager scans the list of all nodes holding a potential range, N0…Nk. For each node Ni, it checks whether currentTime - t(Ni) > C, i.e., whether enough time has passed for the range of pages granted to Ni to actually reach that node (C is a constant determined experimentally). If so, Ni is marked as a potential provider for the requested range; in either case, the scan proceeds to the next node, Ni+1. Once all nodes have been checked, the marked node with the largest range, Nmax, is chosen. The next time the file manager is asked for a page and lease, it starts the scan from node Nmax+1.

We achieve two goals using this simple algorithm. First, we make sure that no single node is overloaded with requests and becomes a bottleneck. Second, we can be quite sure the pages actually reside at the chosen node, thus reducing the probability of reject messages.
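A sketch of this selection, assuming the file manager keeps, per candidate node, the length of its contiguous range and the grant time t(N) (the names and array representation are illustrative):

```c
#include <stddef.h>
#include <time.h>

struct holder {
    int    node;          /* node id holding a candidate range        */
    size_t range_len;     /* contiguous pages from the requested page */
    time_t granted_at;    /* t(N): when the range was granted         */
};

static const double C = 1.0;   /* settle time, determined experimentally */
static int next_start;         /* where the next scan begins (Nmax + 1)  */

/* Returns the index of the chosen holder, or -1 if none qualifies. */
int choose_third_party(struct holder h[], int n)
{
    int best = -1;
    size_t best_len = 0;
    time_t now = time(NULL);

    for (int i = 0; i < n; i++) {
        int idx = (next_start + i) % n;     /* round-robin scan order */
        if (difftime(now, h[idx].granted_at) <= C)
            continue;                       /* pages may not have arrived */
        if (h[idx].range_len > best_len) {  /* prefer the largest range */
            best_len = h[idx].range_len;
            best = idx;
        }
    }
    if (best >= 0)
        next_start = (best + 1) % n;        /* start after Nmax next time */
    return best;
}
```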
3.5 Pre-fetching Data in zFS

The Linux kernel uses a read-ahead mechanism to improve file reading performance. Based on the read pattern of each file, the kernel dynamically calculates how many pages to read ahead, n, and invokes the readpage() routine n times. (More precisely, calling readpage() is skipped if the page to be read is already in the cache, and the next page in the read-ahead range is checked.)

This modus operandi is not efficient when the pages are transmitted over the network. The overhead for transmitting a data block is composed of two parts: the network setup overhead and the transmission time of the data block itself. For comparatively small blocks, the setup overhead is a significant part of the total. Intuitively, then, it is more efficient to transmit k pages in one (scatter/gather) message rather than in k separate messages, since the setup overhead is amortized over the k pages.

To confirm this, we wrote simple client and server programs that measure the time it takes to read a file, residing entirely in the memory of one node, from another node. Using a file size of N pages, we tested reading it in chunks of 1…k pages per TCP message; in other words, reading the file in N…N/k messages. We found that the best results are achieved for k=4 and k=8. When k is smaller, the setup time is significant; when k is larger (16 and above), the size of the L2 cache starts to affect performance. As noted in [8], TCP performance decreases when the transmitted block size exceeds the size of the L2 cache. This is visible in the performance results described in Section 5.

The zFS pre-fetching mechanism is designed to achieve similar performance gains. When the file manager is instantiated, it is passed a pre-fetching parameter, R, indicating the maximum range of pages to grant. When a client A requests a page p (and its lease), the file manager searches for a client B holding the largest contiguous range of pages r starting with the requested page p, where r <= R. If such a client B is found, the file manager sends B a message telling it to send the r pages (and their leases) to A. The selected range r can be smaller than R if the file manager finds a page with a conflicting lease before reaching the full range R. (For read requests, a conflicting lease is a write lease; for write requests, a conflicting lease is a read lease.) If no range is found in any client, the file manager grants R leases to client A and instructs A to read R pages from the OSD.

The requested page may reside on client A while the next one resides on client B, and the next on client C. In this case, the granted range will be only the requested page from client A; the next request initiated by the kernel read-ahead mechanism will be granted from client B, and the next from client C. Thus, we do not interfere with the kernel read-ahead mechanism. However, if the file manager finds that client A has a range of k pages, it ignores the subsequent requests that are initiated by the kernel read-ahead mechanism and covered by the granted range.
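A sketch of the grant decision, with the range search and messaging left as assumed helpers (the names are not from the zFS source):

```c
#include <stddef.h>

/* Assumed helpers (illustrative only). */
size_t largest_cached_range(unsigned long p, size_t R, int *holder_out);
void   forward_range(int holder, int client, unsigned long p, size_t r);
void   grant_leases(int client, unsigned long p, size_t n);
void   read_from_osd(int client, unsigned long p, size_t n);

/* Client 'client' requested page p; R is the pre-fetching parameter. */
void fmgr_grant(int client, unsigned long p, size_t R)
{
    int holder;
    /* The search stops early at the first page with a conflicting
     * lease, which is why the returned range r satisfies r <= R. */
    size_t r = largest_cached_range(p, R, &holder);
    if (r > 0) {
        /* Third-party transfer: holder sends r pages plus leases. */
        forward_range(holder, client, p, r);
    } else {
        /* No cached range anywhere: grant R leases, read from OSD. */
        grant_leases(client, p, R);
        read_from_osd(client, p, R);
    }
}
```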
4 Test Environment

The zFS performance test environment comprised a cluster of PCs and one server PC. Each PC in the cluster was an 800 MHz Pentium III with 256 MB of memory, a 256 KB L2 cache, and a 15 GB IDE disk. All of the PCs in the cluster ran Linux; the kernel was a modified 2.4.19 kernel with a VFS implementing zFS and some patches enabling the integration of zFS with the kernel's page cache.

The server PC was a 2 GHz Pentium 4 with 512 MB of memory and a 30 GB IDE disk, running a vanilla Linux kernel. The server PC ran a simulator of the Antara [6] OSD when we tested zFS performance, and an NFS server when we compared the results to NFS.

The PCs in the cluster and the server PC were connected via a 1 Gbit LAN. The zFS front-end (client) was implemented as a kernel loadable module, while all other components were implemented as user-mode processes. The file manager and lease manager were fully implemented; the transaction manager implemented all operations in memory, without writing the log to the OSD. Because we are measuring read operations using the cooperative cache, not metadata operations, this does not influence the results.

To test zFS, we configured the system much like a SAN file system: the server PC ran an OSD simulator, a separate PC ran the lease manager, file manager, and transaction manager processes (thus acting as a metadata server), and four PCs ran the zFS front-end.

Figure 3: System configuration for testing zFS performance.

When testing NFS performance, we configured the system differently. The server PC ran an NFS server with eight NFS daemons (nfsd), and the four PCs ran NFS clients. The reported results are averages over several runs, where the caches of the machines were cleared before each run.

Figure 4: System configuration for testing NFS performance.

5 Comparison to NFS

To evaluate zFS performance relative to an existing file system, we compared it to the widely used NFS, using the IOZONE benchmark [9].

The comparison to NFS was complicated by the fact that NFS does not carry out pre-fetching. To make up for this, we configured IOZONE to read the NFS-mounted file using record sizes of n = 1, 4, 8, 16 pages, and compared its performance with reading zFS-mounted files using a record size of one page but with pre-fetching parameter R = 1, 4, 8, 16 pages. (A record in IOZONE is the data size read in one read() system call.)

5.1 Comparing zFS and NFS on a Cluster

Our main goal was to test whether, and by how much, performance is gained when the total free memory in the cluster surpasses the cache size of the server. To this end, we ran two scenarios. In the first, the file was smaller than the server's cache, so all the data resided in the server's cache. In the second, the file was much larger than the server's cache. The results appear in Figure 5 and Figure 6, respectively.

In both cases, we observed that the performance of NFS was almost the same for different block sizes. However, the absolute performance is greatly influenced by the data size relative to the available memory: when the file fits entirely into memory, NFS performance is almost four times better than when the file is larger than the available memory.

Figure 5: Performance results for a large server cache (file smaller than available server memory: 256 MB file, 512 MB server memory; throughput in KB/sec vs. range of 1, 4, 8, 16 pages, for NFS, zFS without cooperative cache, and zFS with cooperative cache). The figure shows the performance when the data fits entirely in the server's memory.
When the file fit entirely into memory (see Figure 5), the performance of zFS with cooperative cache was much better than that of NFS. When cooperative cache was deactivated, we observed different behavior for a range of one page compared to larger ranges. This stems from the extra messages passed between the clients and the file manager: zFS performance for R=1 is lower than that of NFS, but for larger ranges there are fewer messages to the file manager (since the pages were pre-fetched) and zFS performance was slightly better than that of NFS.

We also saw that, with cooperative cache, the performance for a range of 16 was lower than for ranges of 4 and 8. Because IOZONE starts the requests of each client with a fixed time delay relative to the other clients, each new request was for a different 256 KB: with four clients reading 16 pages (64 KB) each, each round of requests covers 256 KB, the size of the L2 cache. Since almost the entire file is in memory, the L2 cache is flushed and reloaded for each granted request, resulting in reduced performance. This fits the findings of [8], since in this case the L2 cache is fully replaced with new data for each transmission.

When the server cache was smaller than the data requested, we expected memory pressure to occur in the server (NFS and OSD) and the server's local disk to be used. In this case, we anticipated that the cooperative cache would show improved relative performance. The results are shown in Figure 6.

Figure 6: Performance results for a small server cache (file larger than available server memory: 1 GB file, 512 MB server memory; throughput in KB/sec vs. range of 1, 4, 8, 16 pages, for NFS, zFS without cooperative cache, and zFS with cooperative cache). When the data size exceeds the server cache size and the server's local disk has to be used, cooperative cache provides much better performance than NFS; deactivating cooperative cache results in worse performance than NFS.

Indeed, we see that zFS performance with cooperative cache deactivated is lower than that of NFS, although it improves for larger ranges. When the cooperative cache is active, zFS performance is significantly better than NFS and increases as the range increases. The performance with cooperative cache enabled is lower here than in the case where the file fits into memory: because the file was larger than the available memory, clients suffered memory pressure, evicted/discarded pages, and responded to the file manager with reject messages. Thus, sending data blocks to clients was interleaved with reject messages to the file manager, and the probability that the requested data was in memory was smaller than when the file was almost entirely in memory.

6 Related Work

Several existing works have researched cooperative caching [2], [3], [4], [5], [12]. We cannot cover all of them here, so we concentrate on systems that describe network file systems, rather than those describing caching for parallel machines [3] or those that use a micro-kernel [2].

Dahlin et al. [4] describe four caching techniques to improve the performance of the xFS file system. xFS uses a central server to coordinate between the various clients, and the load on the server increases as the number of clients increases; thus, the scalability of xFS was limited by the strength of the server. Implementing these techniques and testing them on a trace of 237 users (over more than one week) showed that the load on the central server was reduced by a factor of six, making xFS more scalable than AFS and NFS. They also found that each of the four techniques contributed significantly to the load reduction; in other words, no single technique contributed significantly more than the others.

It should be noted that in their article, Dahlin et al. [4] use the term caching to mean caching of whole files, as in AFS; i.e., the clients cache whole files on the local disk. The four techniques described are:

1. no write through
2. client-to-client data transfer
3. write ownership
4. cluster servers

(For details of these techniques, see [4].) Measurements showed that using these techniques, the load on the central server is reduced noticeably, and that the load on the cluster servers is similar to that of the central server.

There are three major differences between the zFS architecture and the one described in [4]:

1. zFS does not have a central server, and the management of files is distributed among several file managers. There is no hierarchy of cluster servers; if two clients work on the same file, they interact with the same file manager.
2. Caching in zFS is done on a per-page basis rather than on whole files. This increases sharing, since different clients can work on different parts of the same file.
3. In zFS, no caching is done on the local disk.

Additionally, "no write through" and client-to-client forwarding are inherent in zFS. Thus, in addition to supporting the techniques mentioned in [4], zFS is more scalable because it has no central server, and file managers can dynamically be added or removed in response to load changes in the cluster. Moreover, performance is better due to zFS's stronger sharing capability.

In [5], several techniques were compared experimentally to find the best policy for cooperative caching. The conclusion reached was that N-chance Forwarding gives the best performance; this technique is also employed in zFS. The main difference is that in [5] the target node for eviction of a singlet is chosen randomly, with a suggested optimization of choosing an idle client. As explained above, in zFS we choose the client that has the largest amount of free memory and uses the same file manager.
Sarkar and Hartman [12] investigated the use of a hints mechanism, i.e., information that only approximates the global state of the system, for cooperative caching. In their system, communication among clients, rather than between clients and a central server, is used to determine where to evict singlet blocks or where to get a data block. The decisions made by such a system may not be optimal, but managing hints is less expensive than managing facts; as long as the overhead eliminated by using hints more than offsets the cost of the resulting mistakes, the approach pays off. The environment consists of a central server with a cache. For each block of a file, a master copy is defined as the copy received from the server. The hints mechanism keeps track only of where the master copy is located; this avoids keeping too much information and thus simplifies the maintenance of accurate hints.

Each client builds its own hints database, which is updated when blocks are evicted or received. When a client wishes to evict or read a block, it looks up the proper client in this internal database. Because the hints mechanism is not accurate, the lookup can fail; in this case, the client contacts the central server.

Simulation showed that this algorithm works well in a UNIX environment. However, there are scenarios where it will not be effective. For example, if several clients share a working set that is larger than the cache size, the locations of the master copies will change rapidly as blocks move in and out of the clients' caches. This causes the probable master copy locations to become inaccurate, resulting in excessive forwarding of requests.

By comparison, zFS does not have a central server that can become a bottleneck. All control information is exchanged between clients and file managers.
The set of file managers dynamically adapts itself to the load on the cluster, and clients in zFS pass only data among themselves (in cooperative cache mode).

Lustre [10] suggests a different kind of caching, called collaborative caching. In this scheme, data that is frequently used by many clients is pushed from the Object Store Target (OST) where the file/data resides to the caches of other OSTs. When a client wishes to read data, it contacts the OST holding the data for a read lock and receives, with the granted lock, a list of other OSTs holding the data in their caches. The client then redirects its read request to one of these other OSTs. This scheme differs from the zFS scheme in two main ways. First, zFS uses the collective memory of all clients as a global cache, while Lustre uses only the memory of the OSTs. Second, zFS uses third-party communication, which requires three messages to get the data from the cache of another client, while Lustre uses four messages. On the other hand, using OSTs alleviates the security issue in the presence of un-trusted clients.

7 Conclusions and Future Research

Our results show that using the memory of all participating nodes as one global cache gives better performance compared to NFS. This is evident even when we use a range of one page. It is also clear from the performance results that using pre-fetching with ranges of four and eight pages results in much better performance.

In zFS, the selection of the target node for page eviction is done by the file manager, which chooses the node with the largest free memory as the target. However, the file manager chooses target nodes only from those interacting with it. There may be an idle machine with a large free memory that is not connected to this file manager and will therefore not be used; hence, the eviction algorithm in zFS is not globally optimized. Future research will address this issue and consider lightweight mechanisms (e.g., gossip protocols) for distributing load information over the whole cluster of zFS nodes.

8 Acknowledgements

We wish to thank Alain Azagury, Kalman Meth, Ohad Rodeh, Julian Satran, Dalit Naor, and Ami Tavori from the IBM Research Labs in Haifa for their participation in useful discussions on zFS.

9 References

[1] A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor, N. Rinetzky, O. Rodeh, J. Satran, A. Tavori, and L. Yerushalmi. "Toward an Object Store." In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003, pages 165-176.
[2] T. Cortes, S. Girona, and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona.
[3] T. Cortes, S. Girona, and J. Labarta. "PACA: A Cooperative File System Cache for Parallel Machines." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya.
[4] M. D. Dahlin, C. J. Mather, R. Y. Wang, T. E. Anderson, and D. A. Patterson. "A Quantitative Analysis of Cache Policies for Scalable Network File Systems." Computer Science Department, University of California at Berkeley.
[5] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI 1994).
[6] V. Drezin, N. Rinetzky, A. Tavory, and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.
[7] Z. Dubitzky, I. Gold, E. Henis, J. Satran, and D. Sheinwald. "DSF - Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. See also http://www.haifa.il.ibm.com/projects/systems/dsf.html.
[8] A. P. Foong, T. R. Huff, H. H. Hum, J. P. Patwardhan, and G. J. Regnier. "TCP Performance Re-Visited." ISPASS, March 2003.
[9] IOZONE. http://iozone.org/
[10] Lustre whitepaper. http://www.lustre.org/docs/whitepaper.pdf
[11] O. Rodeh and A. Teperman. "zFS - A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, pages 207-218, San Diego, CA, 2003.
[12] P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson.