Replica Placement Strategies for Wide-Area Storage Systems

Byung-Gon Chun and Hakim Weatherspoon
Computer Science Division, University of California, Berkeley
bgchun@cs.berkeley.edu and hweather@cs.berkeley.edu

Abstract

Wide-area durable storage systems trigger data recovery to maintain target data availability levels. Data recovery needs to be triggered when storage nodes permanently fail, i.e., when data is lost. Transient failures, where nodes return from failure with their data, add noise to determining that a node has failed permanently. Replica placement strategies maintain data availability while minimizing cost, namely wide-area bandwidth. Correlated failures can compromise data availability. We demonstrate that correlated downtimes exist and that simple replica placement strategies can overcome the correlation and maintain target data availability levels.

1 Introduction

Abstractly, wide-area storage systems are as simple as their local-area counterparts: they replicate data to reduce the risk of data loss and replace lost redundancy when they detect permanent failures. A permanent failure is the loss of data on a node. A transient failure is when a node becomes temporarily unavailable and comes back with its data intact. Wide-area storage systems are required to continuously maintain data availability¹ as nodes fail (permanently or transiently). Data maintenance² is the process of copying replicas from one node to another to ensure data availability.

A fundamental issue for any wide-area storage system is replica placement. Replica placement is the process in which nodes are selected to store data replicas. The goal of replica placement is to efficiently maintain a target data availability, where efficiency means using minimal wide-area bandwidth. There are two categories of replica placement: random and selective. Random placement assumes node failures are independent, or independent enough; if node failures are correlated, the result could be data availability below the target. In contrast, selective placement chooses nodes specifically for low correlation and long expected remaining lifetime.

In this paper, we analyze a wide-area system for correlated failures. We compare different replica placement strategies to best place replicas so that they simultaneously maintain data availability, avoid correlated nodes, and reduce cost (we focus on wide-area bandwidth). The contribution of this work is to combine the analysis of correlated failures, replica placement, and the cost of maintaining data availability.

The rest of the paper is organized as follows. In Section 2, we discuss, at a high level, how a wide-area storage system maintains data. We analyze correlated failures in a wide-area system in Section 3. Based on the correlation analysis, we suggest some replica placement strategies for comparison in Section 4. In Sections 5 and 6, we evaluate and compare the proposed replica placement strategies. Finally, we conclude in Section 7.

¹ Continuously maintaining data availability is not necessary but is sufficient to maintain data durability.
² We use data maintenance and data recovery interchangeably.

2 Wide-Area Storage System Overview

In this section, we present an overview of the requirements and nomenclature of a wide-area storage system. We assume that data is maintained on nodes, in the wide area, in well-maintained sites. Sites contribute resources such as nodes (storage, CPU, RAM) and network bandwidth. Nodes collectively maintain data availability; that is, they are self-organizing, self-maintaining, and adaptive to change. In particular, as nodes fail, data recovery is triggered, the process of copying data replicas to new nodes to maintain data availability.

There are five variables of concern when maintaining data in the wide area. First, the redundancy type m is the number of components required to reconstruct the original data item; for example, m = 1 for replication. The minimum data availability threshold th is the minimum number of replicas required to maintain the target data availability; when the number of available replicas decreases below th, new replicas are created. e is the number of extra replicas above the threshold; e + 1 replicas have to be simultaneously down for data recovery to be triggered. n = th + e is the total number of replicas. When data recovery is triggered, enough new replicas are created to refresh the total number back to n. Finally, the timeout to is used to determine when a node has failed.

For example, Figure 1 shows one object replicated (m = 1) into a wide-area storage system. The minimum threshold number of replicas is th = 7, and there is one extra replica, e = 1. The total number of replicas is n = 7 + 1 = 8. Data recovery is triggered when 2 replicas (e + 1) are simultaneously down. In this example, a replica in Georgia has permanently failed and a replica in Washington has transiently failed, since a heartbeat has been lost. Data recovery is triggered because the number of remaining replicas, 6, is less than the threshold; two new nodes are chosen and replicas are copied to the new nodes. Notice that the node in Washington could return from failure, in which case the recovery work of creating a new replica would have been wasted. The problem is that it is not possible to tell whether a missed heartbeat is caused by a lost heartbeat message (e.g., a lossy link due to congestion, a bad router configuration, etc.), a node being transiently down (e.g., a reboot), or a permanent failure (data on the node removed permanently from the network). A timeout to is the only tool available to a wide-area system to determine failure.

[Figure 1: Example. Maintaining Data in a Wide-Area Storage System. Legend: Directory, Replica, Heartbeat, Missed Heartbeat, Failed Node. Parameters: m = 1 (one replica required to read); th = 7 (threshold number of replicas required to be available); n = 8 (total number of replicas); e = 1 (extra replicas).]
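To make the trigger rule concrete, the following minimal sketch shows one way the decision could be expressed in terms of th, e, n, and to. The heartbeat bookkeeping and function names are illustrative assumptions, not the implementation of any particular system.

```python
# Illustrative sketch (not a specific system's code): deciding when to trigger
# data recovery for one object, using the variables defined above.
# th: minimum availability threshold, e: extra replicas, n = th + e,
# to: timeout in seconds after which a silent node is considered down.
import time

def down_replicas(last_heartbeat, to, now=None):
    """Return the replica holders whose last heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    return [node for node, t in last_heartbeat.items() if now - t > to]

def replicas_to_recreate(last_heartbeat, th, e, to, now=None):
    """Trigger recovery when e + 1 replicas are simultaneously considered down.

    Returns how many new replicas to create so the total is refreshed back to
    n = th + e, or 0 if no recovery is needed.
    """
    n = th + e
    down = down_replicas(last_heartbeat, to, now)
    if len(down) >= e + 1:                 # e.g. th = 7, e = 1: trigger at 2 down
        available = len(last_heartbeat) - len(down)
        return n - available               # refresh back to n replicas
    return 0

# In the Figure 1 example, 2 of the 8 replicas miss heartbeats (Georgia and
# Washington), so 2 >= e + 1 and two new replicas are created, even though the
# Washington node may later return with its data intact.
```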
3 Correlated Failure Analysis

In this section we present our methodology for characterizing a wide-area storage system. First, we describe the wide-area environment that we analyzed. Second, we analyze the correlated failures in that environment.

3.1 PlanetLab Wide-Area Environment

We used the all-pairs ping data set [6] to characterize PlanetLab. The data set was collected over a period of 20 months, between Feb 16, 2003 and Oct 6, 2004, and included a total of 512 nodes in that time period (Table 1). We used the data set to characterize the sessiontime, downtime, and lifetime distributions.

Table 1: All-pairs ping data set.
  Dates        Feb 16, 2003 to Oct 6, 2004
  MaxNodes     390
  TotalNodes   512
  Size         17GB

We begin by describing how the all-pairs ping data set was collected and how we used it to interpret a node as being available or not. All-pairs ping collects the minimum, average, and maximum ping times (over 10 attempts) between all pairs of nodes in PlanetLab. Measurements were taken approximately every 15 minutes from each node, from the individual node's perspective, stored locally, and periodically archived at a central location at MIT. Failed ping attempts were also recorded.

A node was determined available (up) if a majority of the other nodes that returned ping data successfully pinged that node. An alternative classification method, used by [4], considers a node available after a single successful ping. We used the majority vote because results from [4] suggest that if a PlanetLab node is up and reachable, it is usually reachable from all other PlanetLab nodes. Our own independent evaluation of the data set found this to be true, which is why we used a majority vote to determine node availability.
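A minimal sketch of this availability rule follows, assuming a simplified per-interval data layout (a mapping from each reporting source node to its ping outcomes), which is an assumption rather than the archive's actual format.

```python
# Illustrative sketch (assumed data layout): classify a node as available in a
# 15-minute interval if a majority of the other reporting nodes pinged it
# successfully.

def node_available(target, ping_results):
    """ping_results: {source_node: {dest_node: True/False}} for one interval,
    containing only the sources that returned data for this interval."""
    votes = [ok[target] for src, ok in ping_results.items()
             if src != target and target in ok]
    if not votes:
        return False                      # no other node reported on this target
    return sum(votes) > len(votes) / 2    # majority of reporting nodes reached it

# The single-successful-ping rule used by [4] would instead be: return any(votes)
```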
Figure 2 shows the PlanetLab characteristics derived from the all-pairs ping data set.

[Figure 2: PlanetLab Characterization. Details discussed in Section 3.1. Panels show the number of available nodes vs. time, the sessiontime CDF and PDF, the downtime CDF and PDF, node attrition, and the statistics and expected lifetime estimates reproduced below.]

Sessiontime, downtime, and availability statistics (d = days, m = minutes; availability is a unitless fraction):
  Distribution  Mean(d)  Stddev(d)  Median(d)       Mode(m)  Min(m)  90th(d)  Max(d)
  Session       12.5     23.4       2.3             30       15      38.1     374.8
  Down          3.1      13.1       0.0625 (90min)  30       15      5.6      308.0
  Availability  0.822    0.202      0.888           1.000    0.002   0.996    1.000

Expected node lifetime estimates: upper bound 878 days, lower bound 756 days.

Issues and Limitations. The all-pairs ping data collection technique is vulnerable to experimental and measurement bias. First, there were instances where the all-pairs ping program had a bug and the process failed on almost all nodes simultaneously. Second, there were related instances where a majority of nodes were unusable due to high load. In both cases the all-pairs ping program did not return data for most nodes. We did not use these all-pairs ping instances since they were experimental error.

Furthermore, all-pairs ping does not measure permanent failure. Instead, it measures the availability of a node name (i.e., the availability of an IP address) and not the availability of data on a node. As a result, all-pairs ping was used to produce an estimated upper bound on node lifetimes. In addition, we computed an estimated lower bound on the availability of data on a node by supplementing the trace with a disk failure distribution obtained from [5].

We used a technique described by Bolosky et al. [3] to estimate the expected node lifetime based on node attrition, where the rate of attrition is constant. In particular, we counted the number of remaining nodes that started before December 5, 2003 and permanently failed before July 1, 2004. The expected lifetime of a PlanetLab node is 878 days (Figure 2.(f)). Figure 2.(d) shows the node attrition.

No all-pairs ping data exist between December 17, 2003 and January 20, 2004 due to a near-simultaneous compromise and upgrade of PlanetLab. In particular, Figure 2.(a) shows that 150 nodes existed on December 17, 2003 and 200 existed on January 20, 2004, but no ping data was collected between those dates.
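The constant-attrition lifetime estimate can be illustrated with a simple exposure calculation. The sketch below is only one plausible way to form such an estimate, not necessarily the exact procedure of Bolosky et al. [3], and the input numbers are placeholders rather than the paper's exact counts.

```python
# Illustrative sketch: expected node lifetime under a constant attrition rate.
# If F of the N nodes alive at the start of an observation window of T days fail
# permanently, and attrition is roughly constant, the per-node failure rate is
# about F / (N * T) and the expected lifetime is its inverse.

def expected_lifetime_days(nodes_at_start, permanent_failures, window_days):
    failure_rate = permanent_failures / (nodes_at_start * window_days)
    return 1.0 / failure_rate

# Placeholder inputs: roughly 270 nodes observed over the ~209-day window from
# December 5, 2003 to July 1, 2004.
print(expected_lifetime_days(270, 65, 209))   # ~868 days, on the order of the
                                              # 756-878 day estimates above
```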
3.2 Correlated Failures

In this section, we test the PlanetLab data for the possibility of correlated node failures. Figure 3 shows different views of correlation between PlanetLab nodes. Figure 3.(a) shows the total and unavailable number of nodes per day vs. time; it demonstrates that during any daily interval, a number of failures (transient or permanent) occur. The later panels show that many failures are correlated, although most seem to be uncorrelated.

[Figure 3: Correlated Failure. Details discussed in Section 3.2. Panels: (a) failure trace (total and failed nodes vs. time); (b) two-dimensional correlation; (c) two-dimensional correlation, upper-right quadrant; (d) per-node total downtime by rank (log scale); (e) two-dimensional correlation for nodes with total downtime ≤ 14 days; (f) node failure joint distribution; (g) and (h) fraction of correlated nodes, reproduced as Tables 3.(g) and 3.(h) below.]

Table 3.(g): Fraction of correlated nodes.
  Site       2D Correlation Threshold  Fraction Correlated
  Same       0.5                       22.43%
  Same       0.8                       8.86%
  Different  0.5                       0.91%
  Different  0.8                       0.04%

Table 3.(h): Fraction of correlated nodes (nodes with total downtime ≤ 14 days).
  Site       2D Correlation Threshold  Fraction Correlated
  Same       0.5                       33.33%
  Same       0.8                       13.83%
  Different  0.5                       0.34%
  Different  0.8                       0.04%

As far as we know, most studies that produce a correlation metric (i.e., the probability that y is down given that x is down) use only one dimension; as a result, nodes that are chronically down can inflate the number of correlated nodes. To capture node correlation that is not influenced by long downtimes, we use a two-dimensional space of conditional downtime probabilities, both p(x is down | y is down) and p(y is down | x is down). Figures 3.(b) and (c) show the two-dimensional conditional downtime probability; Figure 3.(c) is the upper-right quadrant of Figure 3.(b). Both figures highlight nodes that are in the same site with light open circles. The fraction of correlated nodes in both figures is summarized in Table 3.(g). Table 3.(g) shows that 22% of the time that a node goes down, there is at least a 50% chance that another node in the same site goes down as well. In contrast, the two-dimensional conditional downtime probability for nodes in different sites is insignificant.

Next we look at the effects of nodes with long downtimes. Figure 3.(d) shows the total downtime for each node, ordering the nodes from most to least total downtime. Of the 512 total PlanetLab nodes, 188 nodes have total downtimes greater than 1000 hours. In Figure 3.(e), we again show the two-dimensional conditional downtime probability, but with nodes whose total downtime is longer than 14 days factored out. Figure 3.(e) shows more density along the diagonal, meaning that p(x is down | y is down) is more symmetric with p(y is down | x is down); that is, the pairs are not asymmetrically influenced by long downtimes. Similar to Table 3.(g), Table 3.(h) shows that 33% of the time that a node whose total downtime is less than 14 days goes down, there is at least a 50% chance that another node in the same site goes down. However, the inter-site two-dimensional conditional downtime probability is still insignificant.
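A minimal sketch of the two-dimensional metric follows, assuming the trace has been reduced to per-node boolean down-vectors on a common time grid. Treating a pair as correlated only when both conditional probabilities exceed the threshold is our reading of the two-dimensional criterion, so the helper below should be taken as an assumption rather than the exact definition used for Tables 3.(g) and 3.(h).

```python
# Illustrative sketch: two-dimensional conditional downtime probabilities for a
# pair of nodes, computed from boolean down-vectors on a common time grid.

def conditional_down_probs(down_x, down_y):
    """down_x, down_y: equal-length lists of booleans, True = down in that slot.
    Returns (P[x down | y down], P[y down | x down])."""
    both = sum(1 for dx, dy in zip(down_x, down_y) if dx and dy)
    x_down, y_down = sum(down_x), sum(down_y)
    p_x_given_y = both / y_down if y_down else 0.0
    p_y_given_x = both / x_down if x_down else 0.0
    return p_x_given_y, p_y_given_x

def correlated(down_x, down_y, threshold=0.5):
    """Count a pair as correlated only if BOTH conditional probabilities exceed
    the threshold, so a single chronically-down node cannot inflate the metric."""
    p_xy, p_yx = conditional_down_probs(down_x, down_y)
    return p_xy >= threshold and p_yx >= threshold
```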
Finally, Figure 3.(f) shows the joint probability that k randomly picked nodes fail simultaneously; that is, the probability of a catastrophic failure if there are k replicas. Figure 3.(f) shows that the probability of a catastrophic failure decreases exponentially as k increases. In particular, k = 5 is enough to make the joint probability that k failures occur simultaneously 0.00001 (1 in 100,000). The key observation is that as long as all replicas do not fail simultaneously, the lost replicas can be reproduced from the remaining ones and data will not be lost.

4 Replica Placement Strategies

The analysis of correlated failures from Section 3.2 suggests that random placement, combined with replacement of duplicate-site nodes and a "blacklist" of nodes showing long downtimes, may be a good replica placement strategy. In order to validate this hypothesis, we compare four different replica placement strategies, including random, and four different "oracle" placement strategies.

The four replica placement strategies that we compare are Random, RandomBlacklist, RandomSite, and RandomSiteBlacklist. Random placement picks n unique nodes at random to store replicas. RandomBlacklist placement is the same as Random but avoids using nodes that show long downtimes; the blacklist is comprised of the top k nodes with the longest total downtimes. RandomSite avoids placing multiple replicas in the same site: it picks n unique nodes at random while avoiding nodes in sites that are already used. We identify a site by the two-byte IP address prefix; other criteria could be geography or domains. Finally, RandomSiteBlacklist placement is the combination of RandomSite and RandomBlacklist.
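The random family of strategies can be sketched as a single selection routine. The site function, the blacklist representation, and the sampling details below are illustrative assumptions rather than the exact implementation.

```python
# Illustrative sketch of the random placement family described above.
import random

def site_of(node_ip):
    """Identify a site by the first two bytes of the node's IP address."""
    return tuple(node_ip.split(".")[:2])

def place_replicas(nodes, n, blacklist=frozenset(), one_per_site=False):
    """Pick n unique nodes at random. Skipping blacklisted nodes gives
    RandomBlacklist, skipping already-used sites gives RandomSite, and
    combining both constraints gives RandomSiteBlacklist."""
    candidates = [node for node in nodes if node not in blacklist]
    random.shuffle(candidates)
    chosen, used_sites = [], set()
    for node in candidates:
        if one_per_site and site_of(node) in used_sites:
            continue
        chosen.append(node)
        used_sites.add(site_of(node))
        if len(chosen) == n:
            return chosen
    raise ValueError("not enough eligible nodes to place %d replicas" % n)
```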
The "oracle" placement strategies use future knowledge of node lifetimes, downtimes, and availability to place replicas. In all, we define four different oracle algorithms. First, the Min-Max-TTR oracle places replicas on nodes that exhibit the minimum maximum future downtime. Similarly, the second and third oracles, Min-Mean-TTR and Min-Sum-TTR, place replicas on nodes that exhibit the minimum mean and the minimum sum of downtimes, respectively. The fourth oracle, Max-Lifetime-Avail, places replicas on nodes that permanently fail furthest in the future and exhibit the highest availability, where availability is the ratio of total uptime to lifetime.

5 Evaluation Methodology

In this section we present our methodology of using a trace-driven simulation to compare and evaluate the proposed replica placement strategies.

The trace-driven simulation runs the entire 20-month trace of PlanetLab (Figure 2.(a)). The simulator added a node at the time the node became available in the trace and removed a node when it was not available; that is, nodes that are not available in the trace are not available in the simulator (and vice versa). We implemented timeouts and heartbeats in the simulator to detect node failures. Data recovery was triggered when the extra replicas plus one, e + 1, were simultaneously considered down (as described in Section 2). Recall that e = n - th (n >= th). The timeout used to detect failure was to = 1hr.

We set the target data availability to six 9's (0.999999) using the equations described in [1, 2]. Data availability equals 1 - p[no replicas available], that is,

  availability = 1 - ε = 1 - (1 - a)^th,

where a is the average node availability, th is the minimum data availability threshold, and ε is the probability that no replicas are available. The equality assumes that all nodes fail independently. As a result, we calculated the minimum data availability threshold to be th = 9 for replication (m = 1) and an average PlanetLab node availability of a = 0.822.
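The threshold can be checked directly from the availability equation above; the small script below simply searches for the smallest th that meets the six-9's target under the independence assumption.

```python
# Illustrative check of the minimum availability threshold:
# availability = 1 - (1 - a)^th, assuming independent node failures.

def min_threshold(a, target_nines=6):
    target = 1.0 - 10.0 ** (-target_nines)   # six 9's = 0.999999
    th = 1
    while 1.0 - (1.0 - a) ** th < target:
        th += 1
    return th

print(min_threshold(0.822))   # -> 9, matching th = 9 for a = 0.822 and m = 1
```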
The amount of data that we simulated maintaining was 10,000 unique objects, or 2TB if each object is 200MB. The replication factor, k = n/m, increases the size of the repository; for example, with n = 10 and m = 1, the total size of the repository was 100,000 replicas and 20TB of redundant data.

We use the trace, supplemented with the disk failure distribution, to compare the rate of data recovery: we measured the total number of repairs triggered vs. time and the bandwidth used per node vs. time. Additionally, we measured the average bandwidth per node as we varied the timeout and the number of extra replicas. Finally, we measured the average number of replicas available vs. time, but we omit this graph since the number of replicas was always at least the threshold.

6 Evaluation

In this section, we present results using replication (m = 1) with the trace modified with disk failures. We did produce results for different erasure code schemes (e.g., m = 4), but we omit those graphs for space since the trends of using extra redundancy with erasure codes were the same. We also separate repairs triggered by permanent failures (loss of data on a node) from repairs triggered by transient failures (a node returns from failure with its data).

Table 2: Comparison of Replica Placement Strategies.

(a) n = 10 (m = 1, th = 9, |blacklist| = 35, to = 1hr)
  Strategy                   # Repairs  % Improvement over Random
  Random                     400726     --
  RandomBlacklist            391476     2.31
  RandomSite                 398045     0.67
  RandomSiteBlacklist        387233     3.37
  Oracle Min-Max-TTR         411141     -2.60
  Oracle Min-Mean-TTR        413039     -3.07
  Oracle Min-Sum-TTR         411176     -2.61
  Oracle Max-Lifetime-Avail  380521     5.04

(b) n = 15 (m = 1, th = 9, |blacklist| = 35, to = 1hr)
  Strategy                   # Repairs  % Improvement over Random
  Random                     63251      --
  RandomBlacklist            60909      3.70
  RandomSite                 61378      2.96
  RandomSiteBlacklist        58673      7.24
  Oracle Min-Max-TTR         53836      14.89
  Oracle Min-Mean-TTR        52827      16.48
  Oracle Min-Sum-TTR         52026      17.75
  Oracle Max-Lifetime-Avail  43695      30.92

[Figure 4: Cost Tradeoffs for Maintaining Minimum Data Availability for 10,000 200MB objects (th = 9, m = 1, 1Kbps per-node write rate). (a) Total data repairs triggered vs. total number of replicas. (b) and (c) Average bandwidth per node vs. total number of replicas. (a) and (b) vary the replica placement strategy and fix the timeout to = 1hr; (c) fixes the replica placement strategy to Random and varies the timeout from 15 minutes to 8 days.]

6.1 Evaluation of Replica Placement Strategies

First, we compare the different replica placement strategies. Table 2 shows the number of repairs and the percentage improvement over Random for the four random placement strategies and the four oracle placement strategies. Tables 2.(a) and (b) use n = 10 and n = 15, respectively.

For n = 10 (Table 2.(a)), the more sophisticated placement algorithms show only marginal improvement over Random; for example, RandomSiteBlacklist shows a 3.37% improvement. The observation that three of the four oracles perform worse than Random indicates that more is happening than placement alone. In the next section, we investigate further by varying the number of extra replicas and the timeout.

For n = 15 (Table 2.(b)), the more sophisticated placement algorithms exhibit a noticeable increase in performance; that is, fewer repairs are triggered compared to Random. For example, the RandomSiteBlacklist placement shows a 7.24% improvement over Random, which is about the sum of its parts: 3.70% for RandomBlacklist and 2.96% for RandomSite. The oracle placement algorithms exhibit a 14.89% to 17.75% improvement for the oracles that do not consider lifetime, and a 30.92% improvement for the oracle that does.

The fact that the absolute number of repairs decreases from n = 10 to n = 15 suggests that the system is responding to transient failures rather than to placement and permanent failures. The improvement over Random increases when n increases and then decreases as n increases further. We investigate this further in the next section.
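Before turning to the measurements, a back-of-the-envelope model hints at why extra replicas should suppress recovery triggers: if replicas were unavailable independently with probability 1 - a, the chance that e + 1 of the n replicas are down at once shrinks geometrically in e. This simplified model ignores correlation, timeouts, and repair dynamics, and is only meant to illustrate the trend examined next.

```python
# Illustrative model (independent failures only, not the trace-driven simulator):
# probability that at least e + 1 of the n = th + e replicas are down at once,
# using the mean PlanetLab unavailability 1 - a = 0.178 from Section 3.1.
from math import comb

def p_trigger(n, e, q=0.178):
    """P[at least e + 1 of n replicas simultaneously down] under an i.i.d. model."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(e + 1, n + 1))

th = 9
for e in range(7):
    print(e, round(p_trigger(th + e, e), 6))
# The probability falls by a roughly constant factor for each added replica,
# consistent with the exponential decrease in repairs observed in Figure 4.(a).
```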
tal replicas subtracted by the threshold e = n − th) Figure 5 shows the breakdown in bandwidth cost to investigate the effects that transient failures influfor maintaining a minimum data availability. Figence triggering data recovery. Also, we continuously ure 5 fixes both the timeout to = 1hr and replica add data to the repository increasing the size of the placement strategy to Random. Figure 5.(a) and repository over time. We use two different write rates (b) used a per node unique write rate of 1Kbps and 10Kbps and 1Kbps per node. To make the write rates 10Kbps, respectively. Both Figures 5.(a) and (b) ilconcrete, we used the measurement that an average lustrate that the cost of maintaining data due to tranworkstation creates 35MB/hr (or 10Kbps) [3] of tosient failures dominates the total cost. The total cost tal data and 3.5MB/hr of permanent data (or 1Kbps). is dominated by unnecessary work. As the number Finally, 10Kbps and 1Kbps per node correspond to of extra replicas (e = n − th), which are required 40GB and 4GB of unique data per node per year to be simultaneously down in order to trigger data added to the repository, respectively. With an averrecovery, increases, the cost due to transient failures age of 300 nodes, the system increases in size at a decreases, thus the cost due to actual permanent failrate of 30TB and 3TB, respectively. ures, which is a system fundamental characteristic, Figure 4 show the effects of varying the number dominates. The difference between Figure 5.(a) and of extra replicas and the timeout. Figure 4.(a) and (b) is that the cost due to permanent failures domi(b) fix the timeout to = 1hr and vary the replica nates in (a) and the cost due to new writes dominates placement strategy; where as, Figure 4.(c) varies the in (b). Finally, the cost due to probing each node timeout and fixes the replica placement strategy to once an hour is insignificant. Random. Figure 4.(a) shows that the number of repairs trigIn summary, Figure 4 demonstrate that extra regered decreases exponentially as the number of ex- dundancy reduces the rate of triggering data recovtra replicas increases linearly. Correspondingly, the ery, thus reduces the cost of maintaining data. In bandwidth required to maintain the minimum data particular, wide-area storage systems can tradeoff inavailability threshold and extra replicas decreases as creased storage for decreased bandwidth consump7 tion and maintain the same minimum data availability threshold. 7 Conclusion In this paper, we analyze the machine failure characteristics in wide area systems and we examine replica placement strategies for such systems. We found that there does exist correlated downtimes, especially significant correlated downtimes between nodes in the same site. As a placement strategy, random seems to be good enough to maintain data availability efficiently. In particular, the random placement with extra replicas and with constraints to avoid using same sites and unreliable nodes is sufficient to reduce the rate of triggering data recovery due to node failures and most observed correlation. References [1] R. Bhagwan, S. Savage, and G. Voelker. Replication strategies for highly available peer-to-peer storage systems. Technical Report CS2002-0726, U. C. San Diego, November 2002. [2] C. Blake and R. Rodrigues. High availability, scalable storage, dynamic peer networks: Pick two. In Proc. of HOTOS, May 2003. [3] W. Bolosky, J. Douceur, D. Ely, and M. Theimer. 
References

[1] R. Bhagwan, S. Savage, and G. Voelker. Replication strategies for highly available peer-to-peer storage systems. Technical Report CS2002-0726, U. C. San Diego, November 2002.

[2] C. Blake and R. Rodrigues. High availability, scalable storage, dynamic peer networks: Pick two. In Proc. of HotOS, May 2003.

[3] W. Bolosky, J. Douceur, D. Ely, and M. Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In Proc. of SIGMETRICS, June 2000.

[4] B. Chun and A. Vahdat. Workload and failure characterization on a large-scale federated testbed. Technical Report IRB-TR-03-040, Intel Research, November 2003.

[5] D. Patterson and J. Hennessy. Computer Architecture: A Quantitative Approach. Forthcoming edition.

[6] J. Stribling. PlanetLab all-pairs ping. http://infospect.planet-lab.org/pings.