Improving Performance of a Distributed File System Using OSDs
and Cooperative Cache
A. Teperman (teperman@il.ibm.com)
A. Weit (weit@il.ibm.com)
IBM Haifa Labs, Haifa University, Mount Carmel, Haifa 31905, Israel
Abstract
zFS is a scalable distributed file system that uses Object
Store Devices (OSDs) for storage management and a set of
cooperative machines for distributed file management. zFS
evolved from the DSF project [7], and its high-level architecture is described in [11].
This work uses a cooperative cache algorithm that is resilient to network delays and node failures. The work explores the effectiveness of this algorithm and of zFS as a file system. This is accomplished by comparing the system's performance to NFS using the IOZONE [9] benchmark. We
also investigate whether using a cooperative cache results
in better performance, despite the fact that OSDs have their
own caches. Our results show that the zFS prototype performs better than NFS when cooperative cache is activated
and that zFS provides better performance even though the
OSDs have caches of their own. We also demonstrate that
using pre-fetching in zFS increases performance significantly. Thus, zFS performance scales well when the number of participating clients increases.
1 Introduction
zFS was designed as a distributed file system that offers extended scalability by separating storage management from file management. Storage management is carried out using Object Store Devices (OSDs), and file management is distributed over a set of cooperative machines [11]. These machines are commodity, off-the-shelf components, such as PCs running existing operating systems, connected by high-speed networks.
The two most prominent features of zFS are its cooperative cache and distributed transactions. In this article, we focus on the cooperative cache and investigate its effectiveness in the face of the internal cache used by the OSDs themselves.
The cooperative cache of zFS integrates the memory of all participating machines into one coherent cache. Thus, when machine A requests a data block that already resides in the memory of machine B, the zFS cooperative cache retrieves the data block from the memory of B instead of going to the OSD. Our work examines whether a cooperative cache can improve performance, even though the OSDs have caches of their own.
Accepting data from the memory of another machine poses
a security problem in the presence of un-trusted clients. In
this article, we assume that all clients are trusted and the
network is secure.
The rest of this article is organized as follows: Section 2
briefly describes the zFS architecture. In Section 3, we detail how the cooperative cache works. Section 4 describes
the test environment, followed by Section 5, where we compare the performance of zFS to that of NFS. In Section 6,
we compare the zFS cooperative cache to that of other file
systems and close with concluding remarks in Section 7.
2 zFS Architecture
This section briefly describes the zFS components and how
they interact with each other. A full description can be
found in [11].
All storage allocation and management in zFS is delegated
to the OSDs. When a file is created and written to, the data
blocks are sent to the OSD, which allocates space on the
physical disk and writes the data on the allocated space.
Usually, this is done asynchronously for performance reasons.
The zFS front-end (in this article, the terms client and FE refer to the zFS front-end) runs on every node that mounts zFS. It presents the user with a standard file system API and provides access to zFS files and directories. Other zFS components, including the file manager (FMGR), lease manager
(LMGR) and transaction manager (TMGR), can run on
nodes that mount zFS or on other nodes. In the latter case,
applications running on these nodes cannot access zFS files
although they participate in the file system management.
To ensure data integrity, file systems use various locking
mechanisms. zFS uses leases rather than locks. The lease
management is split between the lease manager and the file
manager. The lease manager acquires an OSD lease from
the OSD and, when requested, grants the file managers exclusive file leases on whole files. After receiving the exclu-
sive file lease, the file manager manages ranges of leases, which it grants to the clients (FEs). (Details on how range leases are handled are beyond the scope of this article and can be found in [11].)
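To make the lease hierarchy concrete, the following C sketch shows one possible way to represent the three kinds of leases as plain structures. The types and field names are ours, for illustration only; the actual zFS data structures are not shown in this article.

#include <time.h>

typedef unsigned long long obj_id_t;

struct osd_lease {              /* held by the lease manager, one per OSD      */
    int    osd_id;
    time_t expires;             /* leases, unlike locks, eventually time out   */
};

struct file_lease {             /* exclusive whole-file lease, held by the FMGR */
    obj_id_t file_id;
    time_t   expires;
};

enum range_mode { RL_READ, RL_WRITE };

struct range_lease {            /* granted by the file manager to an FE        */
    obj_id_t        file_id;
    unsigned long   first_page, num_pages;
    enum range_mode mode;
    time_t          expires;
};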
Figure 1: zFS architecture
Each participating node runs the front-end and cooperative cache. Each
OSD has only one lease manager associated with it. Several file managers
and transaction managers run on various nodes in the cluster.
Every file opened in zFS is managed by a single file manager that is assigned to the file when it is first opened. The
set of all currently active file managers manage all opened
zFS files. Initially, no file has an associated file manager.
The first machine to perform an open() on file F creates a
local instance of the file manager for that file. Henceforth, and until that file manager is shut down, each lease request for any part of the file F is handled by this file manager. (The manner in which nodes find the corresponding file manager for F is described in [11].)
3 zFS Cooperative Cache
The cooperative cache of zFS is a key component in achieving high performance. Its advantage stems from today's fast networks, which enable data to be retrieved more rapidly from the memory of remote machines than from the local disk. Thus, if the same data is read by two different nodes, the first node reads the data blocks from the OSD, while the second reads them from the memory of the first node, achieving better performance. This eliminates a potential bottleneck at the OSD when many clients read the same data; in this case, each client can retrieve the data from a different node, as explained below.
In zFS, we made an architectural design decision to integrate the cooperative cache with the Linux kernel page cache, for two main reasons. First, by doing this we avoid having the operating system run two caches with two different cache policies, which may interfere with each other. Second, we wanted comparable local performance between zFS and the other local file systems supported by Linux; all supported file systems use the kernel page cache.
As a result, we achieve the following with no extra effort:
1. Eviction is invoked by the kernel according to its internal algorithm, when free available memory is low. We do not need a special zFS mechanism to detect this.
2. Caching is done on a per-page basis, not on whole files.
3. Fairness exists between zFS and other file systems; pages are treated equally by the kernel algorithm, regardless of the file system type.
4. When a file is closed, its pages remain in the cache until memory pressure causes the kernel to discard/evict them.
When eviction is invoked and the candidate page for eviction is a zFS page, the decision is passed to a specific zFS routine, which decides whether to forward the page to the cache of another node or to discard it, as described in Section 3.2.
3.1 Existing Optimizations
The current implementation of the zFS page cache supports the following optimizations:
1. When an application uses a zFS file and writes a
whole page, no read is done from the OSD—only
the write lease is acquired.
2. If one application/user on a machine has a write
lease, all other applications/users on the machine can
try to read and write to the page, without going to
the file manager for a lease. The kernel checks the
permission to read/write, based on the permissions
specified in the mode parameter when the file is
opened. If the mode bits allow the operation, zFS
does not prevent it.
3. When client A has a write lease and client B requests a read lease, the page is written to the OSD if it is dirty, and the lease on A is downgraded from write to read without discarding the page. This increases the probability of a cache hit, thus increasing performance (see the sketch below).
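As an illustration of optimization 3, the following C sketch shows the downgrade path taken on a conflicting read request. The names and the helper function are hypothetical stand-ins for the corresponding zFS messages.

#include <stdio.h>

enum lease_mode { LEASE_NONE, LEASE_READ, LEASE_WRITE };

struct page_state {
    enum lease_mode mode;
    int dirty;
};

/* stand-in for the real message that sends a dirty page to the OSD */
static void write_page_to_osd(struct page_state *p)
{
    printf("flushing dirty page to OSD\n");
    p->dirty = 0;
}

/* Called on client A when client B asks for a read lease on a page
   for which A holds a write lease (optimization 3). */
static void downgrade_on_read_request(struct page_state *p)
{
    if (p->mode == LEASE_WRITE) {
        if (p->dirty)
            write_page_to_osd(p);   /* make the OSD copy current        */
        p->mode = LEASE_READ;       /* downgrade; keep the page cached,
                                       raising the chance of a later hit */
    }
}

int main(void)
{
    struct page_state p = { LEASE_WRITE, 1 };
    downgrade_on_read_request(&p);
    return p.mode == LEASE_READ ? 0 : 1;
}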
3.2 zFS Cooperative Cache Algorithm
In cooperative cache terminology, each data block/page is either a singlet or replicated [5]. (The zFS prototype is implemented under Linux, and the data blocks used are pages; thus, we use the term page for data block.) A singlet page is a page that resides in the memory of only one node, while a replicated page resides in the memory of several nodes.
When client A opens and reads a file, the local page cache
is checked. In case of a cache miss, zFS requests the page
and its read lease from the file manager. The file manager
checks if a range of pages starting with the requested page
has already been read into the memory of another machine.
If not, zFS grants the leases to A, instructing it to read the
range of pages from the OSD. A then reads the range of
pages from the OSD, marking each page as a singlet. If the
file manager finds that the range resides in the memory of
node B, it sends a message to B requesting that B send the
range of pages and leases to A. The file manager then records internally that A also holds this particular range. In this case, both A and B mark the pages as replicated. We call node B a third-party node, since A gets the requested data not from the OSD but from a third party.
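The following C sketch illustrates this decision at the file manager. The function and message names are hypothetical stand-ins; the real zFS messaging is more involved.

#include <stdio.h>

struct holder { int node_id; int has_range; };

/* stand-ins for zFS messages (hypothetical) */
static void ask_third_party(int holder, int requester, unsigned long page)
{
    printf("node %d: forward pages from %lu to node %d\n",
           holder, page, requester);
}

static void grant_osd_read(int requester, unsigned long page, int npages)
{
    printf("node %d: read %d pages from the OSD starting at %lu\n",
           requester, npages, page);
}

/* The file manager's decision on a read miss: use a third-party
   holder if one exists, otherwise direct the client to the OSD. */
static void handle_read_request(struct holder *h, int requester,
                                unsigned long page, int npages)
{
    if (h && h->has_range) {
        ask_third_party(h->node_id, requester, page);
        /* both sides now mark the pages as replicated, and the FM
           records that the requester holds the range as well */
    } else {
        grant_osd_read(requester, page, npages);
        /* the requester marks each page it reads as a singlet */
    }
}

int main(void)
{
    struct holder b = { 2, 1 };
    handle_read_request(&b, 1, 100, 8);   /* served by node 2  */
    handle_read_request(NULL, 3, 100, 8); /* served by the OSD */
    return 0;
}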
When memory pressure occurs on client A, the Linux kernel invokes the kswapd() daemon, which scans inactive pages and starts discarding them. In our modified kernel, if the page is a replicated zFS page, a message is sent to the file manager indicating that machine A no longer holds the page, and the page is discarded. (Discarding a page of a file involves several page cache operations, such as removing the page from the inode hash; we do not detail these operations here in order to keep the description concise and clear.) If the zFS page is a singlet, the following sequence takes place.
Forwarding a singlet page:
1. A message is sent to the file manager indicating that the page is being sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.
Similar to [5], zFS uses a recirculation counter: if a singlet page has not been accessed after two hops, it is discarded. Whenever the page is accessed, the recirculation counter is reset.
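A minimal C sketch of this eviction decision, including the recirculation counter, is given below. All names are illustrative, and the real code runs inside the modified kernel rather than in user space.

#include <stdio.h>

#define MAX_HOPS 2   /* as in the text: discard after two unused hops */

struct zfs_page {
    unsigned long index;
    int is_singlet;
    int recirc;       /* hops since the page was last accessed */
};

/* hypothetical stand-ins for the real zFS messages */
static void notify_fm_discard(struct zfs_page *p)
{
    printf("FM: page %lu discarded here\n", p->index);
}

static void notify_fm_forward(struct zfs_page *p, int target)
{
    printf("FM: page %lu being sent to node %d\n", p->index, target);
}

static void send_page(struct zfs_page *p, int target)
{
    printf("net: page %lu forwarded to node %d\n", p->index, target);
}

static int node_with_most_free_memory(void)
{
    return 2;   /* in zFS this is known to A from earlier messages */
}

/* Decision taken when the kernel picks a zFS page for eviction. */
static void evict_zfs_page(struct zfs_page *p)
{
    if (!p->is_singlet || p->recirc >= MAX_HOPS) {
        notify_fm_discard(p);      /* replicated or over-circulated   */
        return;                    /* the kernel drops the page       */
    }
    int target = node_with_most_free_memory();
    notify_fm_forward(p, target);  /* step 1: the FM is told first    */
    p->recirc++;
    send_page(p, target);          /* step 2: forward the page        */
    /* step 3: the local copy is discarded by the page cache */
}

int main(void)
{
    struct zfs_page p = { 7, 1, 0 };
    evict_zfs_page(&p);
    return 0;
}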
When a file manager is notified about a discarded page, it
updates the lease and page location and checks whether the
page has become a singlet. If only one other node N holds
the page, the file manager sends a singlet message to N to
that effect.
However, this idyllic description becomes complicated
when we take into consideration node failure and network
delays, as explained below.
3.3 Node Failures and Network Delay Issues
The approach we adopted can be stated informally as follows: it is acceptable for the file manager to assume the existence of pages on nodes even if this is not true; it is unacceptable to have pages on nodes of which the file manager is unaware. If the file manager is wrong in its assumption, its request will be rejected and it will update its records. However, pages residing on nodes that the file manager is not aware of may cause data thrashing, and thus this is not allowed.
Therefore, the order of the messages described above under forwarding a singlet page is important. If the node fails before Step 1, the file manager will eventually detect this and update its data to reflect that the respective node holds no pages and leases. If the node fails to execute Step 1 and notify the file manager, it does not forward the page but only discards it. Thus, we end up with a situation where the file manager assumes the page exists on node A, although it does not. As stated above, this is acceptable, since it can be corrected without data corruption. The same applies if the node fails after Step 1: the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, we have a situation where the file manager assumes the page is on B, although in reality that is not true. Failure after Step 2 does not pose any problem.
Network delays also need to be handled correctly. Consider the case where a replicated page residing on nodes M and N is discarded from the memory of M. The file manager, finding that the page has become a singlet, sends a message to N with this information. However, due to network delays, this message may arrive after memory pressure has developed on N. The page will then be discarded because it is still marked as replicated, while in reality it is a singlet and should have been forwarded to another node.
We handle this situation as follows. If a singlet message arrives at N and the page is not in the cache of N, the cooperative cache algorithm on N ignores the singlet message. Because the file manager still "knows" that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N sends a reject message back to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client that has in the meantime read the page from the OSD or by telling B to read the page from the OSD.
In other words, network delay in such cases causes performance degradation, but not inconsistency.
Another possible scenario is that no memory pressure has occurred on N, a forwarded page has not yet arrived, and a singlet message for that page arrives and is ignored. The file manager then asks N to forward the page to B, and N sends a reject message back to the file manager.
If the page never arrives at N due to sender failure or network failure, we do not have a problem. However, if the
page arrives after the reject message was sent, we are faced
with a consistency problem if a write lease exists. Because
the file manager is not “aware” of the page on N, another
node may get the write lease and the page from the OSD.
This leaves us with two clients having the same page with
write leases on two different nodes. To avoid this situation,
N keeps a reject list, which records the pages (and their
corresponding leases) that were requested but rejected.
When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically (by the FE), and each entry whose time on the list exceeds T is deleted. T is the maximum time it can take a page to reach its destination node, and is determined experimentally, depending on the network topology.
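The following self-contained C sketch shows one plausible implementation of such a reject list. The names, the list representation, and the value of T are ours.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct reject_entry {
    unsigned long page;
    time_t rejected_at;
    struct reject_entry *next;
};

static struct reject_entry *reject_list;
static time_t T = 2;  /* max page transit time; determined experimentally */

/* Called when the FM asks N to forward a page N does not have:
   N replies with a reject and remembers the page here. */
static void record_reject(unsigned long page)
{
    struct reject_entry *e = malloc(sizeof *e);
    e->page = page;
    e->rejected_at = time(NULL);
    e->next = reject_list;
    reject_list = e;
}

/* Called when a forwarded page arrives: if it was rejected earlier,
   drop both the page and the entry, keeping the FM's view accurate. */
static int consume_if_rejected(unsigned long page)
{
    for (struct reject_entry **pp = &reject_list; *pp; pp = &(*pp)->next)
        if ((*pp)->page == page) {
            struct reject_entry *e = *pp;
            *pp = e->next;
            free(e);
            return 1;   /* caller discards the arriving page */
        }
    return 0;
}

/* Periodic scan by the FE: entries older than T are deleted. */
static void expire_rejects(void)
{
    time_t now = time(NULL);
    struct reject_entry **pp = &reject_list;
    while (*pp) {
        if (now - (*pp)->rejected_at > T) {
            struct reject_entry *e = *pp;
            *pp = e->next;
            free(e);
        } else
            pp = &(*pp)->next;
    }
}

int main(void)
{
    record_reject(42);
    printf("page 42 rejected earlier: %d\n", consume_if_rejected(42));
    expire_rejects();
    return 0;
}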
An alternative method for handling these network delay
issues is to use a complicated synchronization mechanism to
keep track of the state of each page in the cluster. This is
unacceptable for two reasons. First, it incurs overhead from
extra messages, and second, this synchronization delays the
kernel when it needs to evict pages quickly.
Another problem caused by network delays is the following. Suppose node N notifies the file manager upon forwarding a page to M, and M does the same when forwarding the page to O. However, the link from N to the file manager is slow compared to the other links. Thus, the file manager may receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M. Moreover, the file manager does not have in its records that this specific page and lease reside on M. The problem is further complicated by the fact that M may decide to discard the page, and this notification may arrive at the file manager before the move notification. Figure 2 illustrates this scenario.
To solve this problem, we use the following data structures:
• Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another.
• Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero; it is incremented whenever the lease and page are transferred to another node.
• When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node.
• Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: Last_hop_count, initially set to -1, and target_node, initially set to NULL.
Using Figure 2 as a reference, we now describe informally how the move is executed. If message (3) arrives first, its hop count and target node are saved in the lease record; this is done since node M is not registered as holding the lease and page. When message (1) arrives, N is the registered node; therefore, the lease is "moved" to the target node stored in the target_node field. This is done by updating the information stored in the internal tables of the file manager. If message (3) arrives first and then message (5) arrives, the information from message (5) is stored, due to its larger hop count, and used when message (1) arrives; message (3) is then ignored due to its smaller hop count. In other words, using the hop count enables us to ignore late messages that are irrelevant.
Figure 2: Delayed MOVE notification messages. Due to network delays, the correct order of messages (1), (2), (3) and (4) occurs in reality as (2), (3), (4) and (1).
Suppose the page was moved from N to M and then to O, where it was discarded because memory pressure occurred on O and the page's recirculation count exceeded its limit. O then sends a release_lease message, which arrives at the file manager before the move notifications. This case is resolved as follows: since O is not registered as holding the page and lease, the release_lease message is placed on a pending queue and a flag is raised in the lease record. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.
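The following C sketch illustrates how the file manager might apply these rules to out-of-order MOVE notifications, using the scenario of Figure 2. The function names are ours, and the pending release_lease queue is omitted for brevity.

#include <stdio.h>

struct lease_record {
    int registered_node;  /* node the FM believes holds the lease        */
    int last_hop_count;   /* largest hop count buffered so far; -1 if none */
    int target_node;      /* target of that buffered MOVE; -1 if none    */
};

/* File manager handling of a MOVE notification (hypothetical names). */
static void on_move(struct lease_record *lr, int from, int hops, int target)
{
    if (from == lr->registered_node) {
        if (lr->last_hop_count > hops) {
            /* a later MOVE was buffered: jump straight to its target */
            lr->registered_node = lr->target_node;
            lr->last_hop_count = -1;
            lr->target_node = -1;
        } else {
            lr->registered_node = target;
        }
    } else if (hops > lr->last_hop_count) {
        /* early MOVE from an unregistered node: buffer the newest one */
        lr->last_hop_count = hops;
        lr->target_node = target;
    }
    /* MOVEs with a smaller hop count than one already seen are stale
       and are simply ignored */
}

int main(void)
{
    struct lease_record lr = { /*N*/ 1, -1, -1 };
    on_move(&lr, /*M*/ 2, 2, /*O*/ 3); /* message (3) arrives early: buffered */
    on_move(&lr, /*N*/ 1, 1, /*M*/ 2); /* message (1): resolves through to O  */
    printf("lease now on node %d\n", lr.registered_node); /* prints 3 */
    return 0;
}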
3.4 Choosing the Proper Third-party Node
The file manager uses an enhanced round-robin method to choose the third-party node that holds a range of pages starting with the requested page.
For each range granted to a node N, the file manager records the time it was granted, t(N). When a request arrives, the file manager scans the list of all nodes holding a potential range, N0…Nk. For each node Ni, the file manager checks whether currentTime - t(Ni) > C, that is, whether enough time has passed for the range of pages granted to Ni to reach that node (C is a constant determined experimentally). If so, Ni is marked as a potential provider for the requested range; in either case, the next node, Ni+1, is then checked. Once all nodes have been checked, the marked node with the largest range, Nmax, is chosen. The next time the file manager is asked for a page and lease, it starts the scan from node Nmax+1.
We achieve two goals using this simple algorithm. First, we
make sure that no single node is overloaded with requests
and becomes a bottleneck. Second, we are quite sure the
pages reside at the chosen node, thus reducing the probability of reject messages.
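A compact C sketch of this selection policy appears below; the data layout and constants are illustrative only.

#include <stdio.h>
#include <time.h>

struct candidate {
    int    node_id;
    time_t granted_at;   /* t(N): when the range was granted to this node */
    int    range_len;    /* contiguous pages starting at the request      */
};

static time_t C = 1;     /* settle time; a constant found experimentally  */
static int next_start;   /* where the next scan begins (round robin)      */

/* Returns the index of the chosen third-party node, or -1, meaning no
   suitable holder exists and the requester should read from the OSD. */
static int choose_third_party(const struct candidate *c, int n)
{
    time_t now = time(NULL);
    int best = -1, best_len = 0;

    for (int k = 0; k < n; k++) {
        int i = (next_start + k) % n;
        /* only nodes whose granted pages had time to arrive qualify */
        if (now - c[i].granted_at > C && c[i].range_len > best_len) {
            best = i;
            best_len = c[i].range_len;
        }
    }
    if (best >= 0)
        next_start = (best + 1) % n;  /* start after Nmax next time */
    return best;
}

int main(void)
{
    struct candidate c[3] = { {1, 0, 4}, {2, 0, 8}, {3, time(NULL), 16} };
    printf("chosen index: %d\n", choose_third_party(c, 3)); /* node 2 */
    return 0;
}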
3.5 Pre-fetching Data in zFS
The Linux kernel uses a read-ahead mechanism to improve file reading performance. Based on the read pattern of each file, the kernel dynamically calculates how many pages to read ahead, n, and invokes the readpage() routine n times. (More precisely, the call to readpage() is skipped if the page to be read is already in the cache, and the next page in the read-ahead range is checked.)
This modus operandi is not efficient when the pages are
transmitted over the network. The overhead for transmitting
a data block is composed of two parts: the network setup
overhead and the transmission time of the data block itself.
For comparatively small blocks, the setup overhead is a
significant part of the total overhead.
Intuitively, it seems more efficient to transmit k pages in one (scatter/gather) message rather than in k separate messages, since the setup overhead is amortized over k pages. To confirm this, we wrote simple client and server programs that measure the time it takes to read a file residing entirely in memory from one node to another. Using a file size of N pages, we tested reading it in chunks of 1…k pages per TCP message; in other words, reading the file in N…N/k messages. We found that the best results are achieved for k = 4 and k = 8. When k is smaller, the setup time is significant, and when k is larger (16 and above), the size of the L2 cache starts to affect the performance. As noted in [8], TCP performance decreases when the transmitted block size exceeds the size of the L2 cache. This is shown in the performance results described in Section 5.
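The following toy C program illustrates the amortization argument only. The setup and per-page costs are assumed values, and the model deliberately ignores the L2 cache effect that caps the benefit for k = 16 and above.

#include <stdio.h>

int main(void)
{
    const double setup_us = 60.0;  /* assumed per-message setup cost       */
    const double page_us  = 33.0;  /* assumed 4 KB wire time at 1 Gbit/s   */
    const int    npages   = 1024;  /* file size N in pages                 */

    /* total cost of sending the file in chunks of k pages per message */
    for (int k = 1; k <= 16; k *= 2) {
        int messages = (npages + k - 1) / k;
        double total_us = messages * setup_us + npages * page_us;
        printf("k=%2d: %4d messages, %8.1f us total\n",
               k, messages, total_us);
    }
    return 0;
}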
The zFS pre-fetching mechanism is designed to achieve similar performance gains. When the file manager is instantiated, it is passed a pre-fetching parameter, R, indicating the maximum range of pages to grant. When a client A requests a page (and lease), the file manager searches for a client B having the largest contiguous range of pages, r, starting with the requested page p, where r ≤ R. If such a client B is found, the file manager sends B a message to send the r pages (and their leases) to A. The selected range r can be smaller than R if the file manager finds a page with a conflicting lease before reaching the full range R. (For a read request, a conflicting lease is a write lease; for a write request, a conflicting lease is a read lease.) If no range is found in any client, the file manager grants R leases to client A and instructs A to read R pages from the OSD.
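The following C sketch illustrates this range-selection logic. Lease state is reduced to a per-page mode array, and all names are ours.

#include <stdio.h>

#define FILE_PAGES 64

enum mode { FREE, READ_LEASE, WRITE_LEASE };

struct client {
    int node_id;
    enum mode pages[FILE_PAGES]; /* lease held on each page, if any */
};

/* Length of the contiguous range client c holds starting at page p,
   capped at R and cut short at the first conflicting lease. */
static int usable_range(const struct client *c, int p, int R,
                        enum mode conflicting)
{
    int r = 0;
    while (p + r < FILE_PAGES && r < R &&
           c->pages[p + r] != FREE && c->pages[p + r] != conflicting)
        r++;
    return r;
}

/* Returns the index of the client with the largest usable range, or -1,
   meaning: grant R leases to the requester and send it to the OSD. */
static int grant_prefetch(const struct client *cl, int n, int p, int R,
                          enum mode conflicting)
{
    int best = -1, best_r = 0;
    for (int i = 0; i < n; i++) {
        int r = usable_range(&cl[i], p, R, conflicting);
        if (r > best_r) { best = i; best_r = r; }
    }
    return best;
}

int main(void)
{
    struct client cl[2] = { { .node_id = 1 }, { .node_id = 2 } };
    for (int i = 0; i < 6; i++) cl[1].pages[10 + i] = READ_LEASE;
    int b = grant_prefetch(cl, 2, 10, 8, WRITE_LEASE); /* read request */
    printf("range served by %s\n", b < 0 ? "the OSD" : "a third party");
    return 0;
}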
The requested page may reside on client A, while the next
one resides on client B, and the next on client C. In this
case, the granted range will be only the requested page from
client A. The next request initiated by the kernel read-ahead
mechanism will be granted from client B and the next from
client C. Thus, we do not interfere with the kernel read-ahead mechanism. However, if the file manager finds that client A has a range of k pages, it ignores the subsequent requests that are initiated by the kernel read-ahead mechanism and are covered by the granted range.

4 Test Environment
The zFS performance test environment comprised a cluster of PCs and one server PC. Each of the PCs in the cluster was an 800 MHz Pentium III with 256 MB of memory, a 256 KB L2 cache and a 15 GB IDE disk. All of the PCs in the cluster ran the Linux operating system. The kernel was a modified 2.4.19 kernel, with a VFS implementing zFS and some patches enabling the integration of zFS with the kernel's page cache.
The server PC was a 2 GHz Pentium 4 with 512 MB of memory and a 30 GB IDE disk, running a vanilla Linux kernel. The server PC ran a simulator of the Antara [6] OSD when we tested zFS performance, and an NFS server when we compared the results to NFS.
The PCs in the cluster and the server PC were connected via a 1 Gbit LAN.
The zFS front-end (client) was implemented as a kernel loadable module, while all other components were implemented as user mode processes. The file manager and lease manager were fully implemented. The transaction manager implemented all operations in memory, without writing the log to the OSD. However, because we are looking at the results of read operations using the cooperative cache, and not metadata operations, this fact does not influence the results.
To begin testing zFS, we configured the system much like a SAN file system. The server PC ran an OSD simulator; a separate PC ran the lease manager, file manager and transaction manager processes (thus acting as a metadata server); and four PCs ran the zFS front-end.
Figure 3: System configuration for testing zFS performance
When testing NFS performance, we configured the system differently. The server PC ran an NFS server with eight NFS daemons (nfsd), and the four PCs ran NFS clients. The reported results are an average over several runs, where the caches of the machines were cleared before each run.
Figure 4: System configuration for testing NFS performance
5 Comparison to NFS
To evaluate zFS performance relative to an existing file
system, we compared it to the widely-used NFS system,
using the IOZONE benchmark.
Figure 5: Performance results for large server cache (256 MB file, 512 MB server memory)
This figure shows throughput (KB/sec) versus the range size (1, 4, 8 and 16 pages) when the data fits entirely in the server's memory, for NFS, zFS without cooperative cache (No c_cache) and zFS with cooperative cache (c_cache).
The comparison to NFS was difficult because NFS does not carry out pre-fetching. To make up for this feature, we configured IOZONE to read the NFS-mounted file using record sizes of n = 1, 4, 8, 16 pages (a record in IOZONE is the data size read in one read system call) and compared its performance with reading zFS-mounted files with record sizes of one page but with pre-fetching parameter R = 1, 4, 8, 16 pages.
5.1 Comparing zFS and NFS on a Cluster
Our main goal was to test whether, and by how much, performance is gained when the total free memory in the cluster surpasses the cache size of the server. To this end, we ran two
scenarios. In the first one, the file size was smaller than the
cache of the server and all the data resided in the server’s
cache. In the second, the file size was much larger than the
size of the server’s cache. The results appear in Figure 5 and
Figure 6, respectively.
In both cases, we observed that the performance of NFS
was almost the same for different block sizes. However, the
absolute performance is greatly influenced by the data size
compared to the available memory. When the file can fit
entirely into memory, NFS performance is almost four
times better compared to the case where the file is larger
than available memory.
When the file fit entirely into memory (see Figure 5), the performance of zFS with cooperative cache was much better than that of NFS. When the cooperative cache was deactivated, we observed different behavior for ranges of one page compared to larger ranges. This stems from the extra messages passed between the clients and the file manager. Thus, zFS performance for R = 1 is lower than that of NFS. However, for larger ranges, there are fewer messages to the file manager (since the pages were pre-fetched) and the performance of zFS was slightly better than that of NFS.
We also saw that when using the cooperative cache, the performance for a range of 16 was lower than for ranges of 4 and 8. Because IOZONE starts the requests of each client with a fixed time delay relative to the other clients, each new request was for a different 256 KB. This stems from the following calculation: for four clients with 16 pages (4 KB each), we get 256 KB, the size of the L2 cache. Since almost the entire file is in memory, the L2 cache is flushed and reloaded for each granted request, resulting in reduced performance. This fits with [8], since in this case the L2 cache is fully replaced with new data for each transmission.
When the server cache was smaller than the data requested, we expected memory pressure to occur in the server (NFS and OSD) and the server's local disk to be used. In this case, we anticipated that the cooperative cache would exhibit improved performance. The results are shown in Figure 6.
Indeed, we see that zFS performance when cooperative
cache is deactivated is lower than that of NFS, although it
gets better for larger ranges. When the cooperative cache is
active, zFS performance is significantly better than NFS and
increases as the range increases.
The performance with the cooperative cache enabled is lower here compared to the case where the file fits into memory. Because the file was larger than the available memory, clients suffered memory pressure, evicted/discarded pages and responded to the file manager with reject messages. Thus, sending data blocks to clients was interleaved with reject messages to the file manager, and the probability that the requested data was in memory was smaller than when the file was almost entirely in memory.
Figure 6: Performance results for small server cache (1 GB file, 512 MB server memory)
This figure shows throughput (KB/sec) versus the range size (1, 4, 8 and 16 pages) when the data size is greater than the server cache size and the server's local disk has to be used, for NFS, zFS without cooperative cache (No c_cache) and zFS with cooperative cache (c_cache). We see that the cooperative cache provides much better performance than NFS, while deactivating the cooperative cache results in worse performance than NFS.
6 Related Work
Several existing works have researched cooperative caching
[2], [3], [4], [5], [12]. Although we cannot cover all these
works, we will concentrate on those systems that describe
network file systems, rather than those describing caching
for parallel machines [3], or those that use a micro-kernel
[2].
Dahlin et al. [4] describe four caching techniques to improve the performance of the xFS file system. xFS uses a central server to coordinate between the various clients, and the load on the server increases as the number of clients increases. Thus, the scalability of xFS was limited by the strength of the server. Implementing these techniques and testing them on a trace of 237 users (over more than one week) showed that the load on the central server was reduced by a factor of six, making xFS more scalable than AFS and NFS. They also found that each of the four techniques contributed significantly to the load reduction; in other words, no one technique contributed significantly more than the others.
It should be noted that in their article, Dahlin et al. [4] use the term caching to mean caching of whole files, as in AFS; i.e., the clients cache whole files on the local disk. The four techniques described are:
1. no write-through
2. client-to-client data transfer
3. write ownership
4. cluster servers
(For details of these techniques, see [4].) Measurements showed that using these techniques, the load on the central server is reduced noticeably and that the load on the cluster servers is similar to that of the central server.
There are three major differences between the zFS architecture and the one described in [4]:
1. zFS does not have a central server, and the management of files is distributed among several file managers. There is no hierarchy of cluster servers; if two clients work on the same file, they interact with the same file manager.
2. Caching in zFS is done on a per-page basis rather than on whole files. This increases sharing, since different clients can work on different parts of the same file.
3. In zFS, no caching is done on the local disk.
Additionally, no-write-through and client-to-client forwarding are inherent in zFS. Thus, in addition to supporting the techniques mentioned in [4], zFS is more scalable because it has no central server, and file managers can be added or removed dynamically to respond to load changes in the cluster. Moreover, performance is better due to zFS's stronger sharing capability.
In [5], several techniques were tested experimentally to find the best policy for cooperative caching. The conclusion reached was that N-chance Forwarding gives the best performance; this technique is also employed in zFS. The main difference is that in [5] the target node for the eviction of a singlet is chosen randomly, with a suggested optimization of choosing an idle client. As explained above, in zFS we choose the client that has the largest amount of free memory and uses the same file manager.
Sarkar and Hartman [12] investigated the use of a hints mechanism, information that only approximates the global state of the system, for cooperative caching. In their system, communication among clients, rather than between clients and a central server, is used to determine where to evict singlet blocks or where to get a data block. The decisions made by such a system may not be optimal, but managing hints is less expensive than managing facts. As long as the overhead eliminated by using hints more than offsets the effect of making mistakes, this system pays off.
The environment consists of a central server with a cache. For each block of a file, a master copy is defined as the one received from the server. The hints mechanism keeps track only of where the master copy is located, in order to avoid keeping too much information; this simplifies the maintenance of accurate hints.
Each client builds its own hints database, which is updated when blocks are evicted or received. When a client wishes to evict or read a block, it looks up the proper client in its internal database. Because the hints mechanism is not accurate, the lookup can fail; in this case, the client contacts the central server.
Simulation showed that this algorithm works well in a UNIX environment. However, there are scenarios where it will not be effective; for example, if several clients share a working set that is larger than the cache size, the locations of the master copies will change rapidly as blocks move in and out of the clients' caches. This will cause the probable master copy location to be inaccurate, resulting in excessive forwarding requests.
By comparison, zFS does not have a central server that can become a bottleneck. All control information is exchanged between clients and file managers, and the set of file managers dynamically adapts itself to the load on the cluster. Clients in zFS pass only data among themselves (in cooperative cache mode).
Lustre [10] suggests a different kind of caching, called collaborative caching. In this scheme, data that is frequently used by many clients is pushed from the Object Store Target (OST) where the file/data resides to the caches of other OSTs. When a client wishes to read data, it contacts the OST holding the data for a read lock and receives, with the granted lock, a list of other OSTs holding the data in their caches. The client then redirects its read request to one of the other OSTs.
This scheme differs from the one zFS uses in two main ways. First, zFS uses the collective memory of all clients as a global cache, while Lustre uses only the memory of the OSTs. Second, zFS uses third-party communication, which requires three messages to get the data from the cache of another client, while Lustre uses four messages. On the other hand, using OSTs alleviates the security issue in the presence of un-trusted clients.

7 Conclusions and Future Research
Our results show that using the memory of all participating nodes as one global cache gives better performance when compared to NFS. This is evident when we use a range of one page. It is also clear from the performance results that using pre-fetching with ranges of four and eight pages results in much better performance.
In zFS, the selection of the target node for page eviction is done by the file manager, which chooses the node with the largest free memory as the target node. However, the file manager chooses target nodes only from those interacting with it. It may be the case that there is an idle machine with a large free memory that is not connected to this file manager and thus will not be used. Hence, the eviction algorithm in zFS is not globally optimized. Future research will address this issue and consider lightweight mechanisms (e.g., gossip protocols) for distributing load information over the whole cluster of zFS nodes.

8 Acknowledgements
We wish to thank Alain Azagury, Kalman Meth, Ohad Rodeh, Julian Satran, Dalit Naor and Ami Tavori from IBM Research Labs in Haifa for their participation in useful discussions on zFS.
9 References
[1] A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor, N. Rinetzky, O. Rodeh, J. Satran, A. Tavori and L. Yerushalmi. "Toward an Object Store." In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003, pages 165-176.
[2] T. Cortes, S. Girona and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona.
[3] T. Cortes, S. Girona and J. Labarta. "PACA: A Cooperative File System Cache for Parallel Machines." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya.
[4] M. D. Dahlin, C. J. Mather, R. Y. Wang, T. E. Anderson and D. A. Patterson. "A Quantitative Analysis of Cache Policies for Scalable Network File Systems." Computer Science Department, University of California at Berkeley.
[5] M. D. Dahlin, R. Y. Wang, T. E. Anderson and D. A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI 1994).
[6] V. Dreizin, N. Rinetzky, A. Tavory and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.
[7] Z. Dubitzky, I. Gold, E. Henis, J. Satran and D. Sheinwald. "DSF – Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. See also http://www.haifa.il.ibm.com/projects/systems/dsf.html.
[8] A. P. Foong, T. R. Huff, H. H. Hum, J. P. Patwardhan and G. J. Regnier. "TCP Performance Re-Visited." In Proceedings of ISPASS, March 2003.
[9] IOZONE. See http://iozone.org/.
[10] Lustre whitepaper. http://www.lustre.org/docs/whitepaper.pdf.
[11] O. Rodeh and A. Teperman. "zFS – A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, pages 207-218, San Diego, CA, USA, 2003.
[12] P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson.