Replica Placement Strategies for Wide-Area Storage Systems
Byung-Gon Chun and Hakim Weatherspoon
Computer Science Division, University of California, Berkeley
bgchun@cs.berkeley.edu and hweather@cs.berkeley.edu
Abstract
Wide-area durable storage systems trigger data recovery to maintain target data availability levels. Data recovery needs to be triggered when storage nodes permanently fail, i.e., when data is lost. Transient failures, where nodes return from failure with their data, add noise to determining that a node has failed permanently. Replica placement strategies maintain data availability while minimizing cost, i.e., wide-area bandwidth. Correlated failures can compromise data availability. We demonstrate that correlated downtimes exist and that simple replica placement strategies can overcome the correlation and maintain target data availability levels.
1 Introduction

Abstractly, wide-area storage systems are as simple as their local-area counterparts: they replicate data to reduce the risk of data loss, and they replace lost redundancy when they detect permanent failures. A permanent failure is the loss of data on a node. A transient failure is when a node becomes temporarily unavailable and comes back with its data intact. Wide-area storage systems are required to continuously maintain data availability^1 as nodes fail, permanently or transiently. Data maintenance^2 is the process of copying replicas from one node to another to ensure data availability.

A fundamental issue for any wide-area storage system is replica placement, the process in which nodes are selected to store data replicas. The goal of replica placement is to efficiently maintain a target data availability, where efficiency means using minimal wide-area bandwidth. There are two categories of replica placement: random and selective. Random placement assumes node failures are independent, or independent enough; if node failures are correlated, the result could be data availability below the target. In contrast, selective placement chooses nodes specifically for their low correlation and long expected remaining lifetime.

In this paper, we analyze a wide-area system for correlated failures. We compare different replica placement strategies to best place replicas so as to simultaneously maintain data availability, avoid correlated nodes, and reduce cost (we focus on wide-area bandwidth). The contribution of this work is to combine the analysis of correlated failures, replica placement, and the cost of maintaining data availability.

The rest of the paper is organized as follows. In Section 2, we discuss, at a high level, how a wide-area storage system maintains data. We analyze correlated failures in a wide-area system in Section 3. Based on the correlation analysis, we suggest replica placement strategies for comparison in Section 4. In Sections 5 and 6, we evaluate and compare the proposed replica placement strategies. Finally, we conclude in Section 7.

^1 Continuously maintaining data availability is not necessary but is sufficient to maintain data durability.
^2 We use data maintenance and data recovery interchangeably.
Dates        Feb 16, 2003 to Oct 6, 2004
MaxNodes     390
TotalNodes   512
Size         17GB

Table 1: All-pairs ping data set.
[Figure 1: Example. Maintaining Data in a Wide-Area Storage System. Legend: Directory; Replica; Heartbeat; Missed Heartbeat; Failed Node. Parameters: m = 1 (one replica required to read), th = 7 (threshold number of replicas required to be available), n = 8 (total number of replicas), e = 1 (extra replicas).]
2 Wide-Area Storage System Overview
In this section, we present an overview of the requirements and nomenclature of a wide-area storage
system.
We assume that data is maintained on nodes, in the wide area, and in well-maintained sites. Sites contribute resources such as nodes (storage, CPU, RAM) and network bandwidth. Nodes collectively maintain data availability; that is, they are self-organizing, self-maintaining, and adaptive to changes. In particular, as nodes fail, data recovery, the process of copying data replicas to new nodes, is triggered to maintain data availability.
There are five variables of concern when maintaining data in the wide-area. First, the redundancy type
m. m is the required number of components necessary to reconstruct the original data item. For example, m = 1 for replication. The minimum data
availability threshold th is the minimum number of
replicas required to maintain the target data availability. That is, when the number of available replicas decreases below th, new replicas are created. e is the number of extra replicas above the threshold. e + 1 replicas have
to simultaneously be down for data recovery to be
triggered. n = th + e is the total number of replicas.
When data recovery is triggered, enough new replicas are created to refresh the total number back to n.
Finally, the timeout to is used to determine when a
node has failed.
For example, Figure 1 shows one object replicated (m = 1) into a wide-area storage system. The minimum threshold number of replicas is th = 7, and there is one extra replica, e = 1; the total number of replicas is n = 7 + 1 = 8. Data recovery is triggered when 2 replicas (e + 1) are simultaneously down. In this example a replica in Georgia has permanently failed and a replica in Washington has transiently failed, since a heartbeat has been lost. Data recovery is triggered because the number of remaining replicas, 6, is less than the threshold; that is, two new nodes are chosen and replicas are copied to the new nodes. Notice that the node in Washington could return from failure, in which case the recovery work of creating a new replica would have been wasted. The problem is that it is not possible to distinguish among a lost heartbeat (e.g., a lossy link due to congestion or a bad router configuration), a node being transiently down (e.g., a reboot), and a permanent failure (data on the node removed permanently from the network). A timeout to is the only tool available to a wide-area system to determine failure.
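To make the trigger rule concrete, the following short sketch (our illustration with invented names, not code from any particular system) implements the condition just described: recovery fires only once e + 1 replicas are simultaneously considered down, i.e., when fewer than th replicas remain.

    # Sketch of the recovery trigger described above (illustrative names only).
    # A replica is "considered down" once its heartbeats have been missing for
    # longer than the timeout `to`.
    def needs_recovery(replicas_considered_down, th, e):
        """True when e + 1 replicas are simultaneously considered down, i.e.,
        when the number of remaining replicas drops below the threshold th."""
        n = th + e                            # total number of replicas
        remaining = n - replicas_considered_down
        return remaining < th                 # same as replicas_considered_down >= e + 1

    # Figure 1 example: th = 7, e = 1, n = 8. One permanent failure (Georgia) and
    # one missed heartbeat (Washington) leave 6 replicas, so recovery is triggered
    # and enough new replicas are created to bring the total back to n = 8.
    print(needs_recovery(replicas_considered_down=2, th=7, e=1))   # True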
3 Correlated Failure Analysis
In this section we present our methodology of characterizing a wide-area storage system. First, we describe the wide-area environment that we analyzed.
Second, we analyze the correlated failures in the
wide-area environment.
3.1 PlanetLab Wide-Area Environment
We used the all-pairs ping data set [6] to characterize PlanetLab. The data set was collected over a period of 20 months, between Feb 16, 2003 and Oct 6,
2004 and included a total of 512 nodes in that time
period (Table 1). We used the data set to characterize the sessiontime, downtime, and lifetime distributions. We begin by describing how the all-pairs ping
data set was collected and how we used the data set
to interpret a node as being available or not.
All-pairs ping data collects the minimum, average, and maximum ping times (over 10 attempts) between all pairs of nodes in PlanetLab. Measurements were taken approximately every 15 minutes, collected locally from each individual node's perspective, stored locally, and periodically archived at a central location at MIT. Failed ping attempts were also recorded.

[Figure 2: PlanetLab Characterization. Details discussed in Section 3.1. (a) Number of available nodes vs. time. (b) Sessiontime CDF and PDF. (c) Downtime CDF and PDF. (d) Node attrition (number of remaining nodes vs. time). (e) Sessiontime, downtime, and availability statistics. (f) Expected lifetime estimates: upper bound 878 days, lower bound 756 days.]

Figure 2.(e): Sessiontime, Downtime, and Availability Statistics
Distribution    Mean(d)   Stddev(d)   Median(d)        Mode(m)   Min(m)   90th(d)   Max(d)
Session         12.5      23.4        2.3              30        15       38.1      374.8
Down            3.1       13.1        0.001 (90min)    30        15       5.6       308.0
Availability    0.822     0.202       0.888            1.000     0.002    0.996     1.000
A node was determined available (up) if a majority of the other nodes that returned ping data successfully pinged that node. An alternative classification method, used by [4], deems a node available after a single successful ping. We used the majority vote because results from [4] suggest that if a PlanetLab node is up and reachable, it is usually reachable from all other PlanetLab nodes; our own independent evaluation of the data set found this to be true.
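A minimal sketch of this majority-vote classification (our own illustration; the data structures and names are assumptions, not the format of the actual trace tooling):

    # A node is considered up in a measurement interval if a majority of the
    # nodes that returned ping data in that interval pinged it successfully.
    def node_is_up(target, ping_results):
        """ping_results maps prober -> {target -> success flag} for one interval."""
        probers = [p for p, res in ping_results.items() if target in res]
        if not probers:
            return False                       # no data for this interval
        successes = sum(1 for p in probers if ping_results[p][target])
        return successes > len(probers) / 2    # strict majority

    interval = {"nodeA": {"nodeX": True}, "nodeB": {"nodeX": True}, "nodeC": {"nodeX": False}}
    print(node_is_up("nodeX", interval))       # True: 2 of 3 probers reached nodeX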
Figure 2 shows the PlanetLab characteristics using
the all-pairs ping data set.
Issues and Limitations. The all-pairs ping data collection technique is vulnerable to experimental and measurement bias. First, there were instances where the all-pairs ping program had a bug and the process failed on almost all nodes simultaneously. Second, there were related instances where a majority of nodes were unusable due to high load. In both cases the all-pairs ping program did not return data for most nodes. We did not use these all-pairs ping instances since they were experimental error.

No all-pairs ping data exist between December 17, 2003 and January 20, 2004 due to a near-simultaneous compromise and upgrade of PlanetLab. In particular, Figure 2.(a) shows that 150 nodes existed on December 17, 2003 and 200 existed on January 20, 2004, but no ping data was collected between these dates.

All-pairs ping does not measure permanent failure. Instead, all-pairs ping measures the availability of a node name (i.e., the availability of an IP address) and not the availability of data on a node. As a result, all-pairs ping was used to produce an estimated upper bound on node lifetimes. In addition, we computed an estimated lower bound on the availability of data on a node by supplementing the trace with a disk failure distribution obtained from [5]. We used a technique described by Bolosky et al. [3] to estimate the expected node lifetime based on node attrition, where the rate of attrition is constant. In particular, we counted the number of remaining nodes that started before December 5, 2003 and permanently failed before July 1, 2004. The expected node lifetime of a PlanetLab node is 878 days (Figure 2.(f)). Figure 2.(d) shows the node attrition.
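As a rough sketch of such an attrition-based estimate, assuming a constant rate of permanent departures (the numbers below are hypothetical and only illustrate the arithmetic; they are not the measured counts):

    # Constant-attrition lifetime estimate (sketch). If `failed` of `n0` nodes
    # depart permanently over `window_days`, the attrition rate is
    # failed / window_days nodes per day and the expected node lifetime is
    # roughly the population divided by that rate.
    def expected_lifetime_days(n0, failed, window_days):
        rate = failed / window_days            # permanent failures per day
        return n0 / rate

    # Hypothetical example: 250 nodes observed over the ~208-day window
    # (December 5, 2003 to July 1, 2004), 59 of which failed permanently.
    print(round(expected_lifetime_days(250, 59, 208)))   # ~881 days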
3.2 Correlated Failures
In this section, we test the PlanetLab data for the
possibility of nodes with correlated failures. Figure 3 shows the different views of correlation between PlanetLab nodes. Figure 3.(a) shows the (total
and unavailable) number of nodes per day vs time.
Figure 3 demonstrates that during any daily interval,
a number of failures (transient or permanent) occur.
Later graphs show that many failures are correlated
although most seem to be uncorrelated.
[Figure 3: Correlated Failure. Details discussed in Section 3.2. (a) Failure trace: total and failed nodes vs. time. (b) Two-dimension correlation: P[X is down | Y is down] vs. P[Y is down | X is down] for same-site and different-site node pairs. (c) Upper-right quadrant of (b). (d) Per-node total downtime (log scale), nodes ordered by rank. (e) Two-dimension correlation for nodes with total downtime ≤ 14 days. (f) Node failure joint distribution: P[k nodes down simultaneously] vs. k. (g) Fraction of correlated nodes. (h) Fraction of correlated nodes (nodes w/ total downtime ≤ 14 days).]

Table 3.(g): Fraction of Correlated Nodes
Site        2D Correlation Threshold    Fraction Correlated
Same        0.5                         22.43%
Same        0.8                         8.86%
Different   0.5                         0.91%
Different   0.8                         0.04%

Table 3.(h): Fraction of Correlated Nodes (nodes w/ total downtime ≤ 14 days)
Site        2D Correlation Threshold    Fraction Correlated
Same        0.5                         33.33%
Same        0.8                         13.83%
Different   0.5                         0.34%
Different   0.8                         0.04%

As far as we know, most studies that produce a correlation metric (i.e., the probability that y is down given that x is down) use only one dimension;
that is, nodes that are chronically down can increase
the number of correlated nodes. To capture node
correlation not influenced by long downtimes, we
use a two-dimensional space of conditional downtime probabilities, both p(x is down | y is down) and
p(y is down | x is down). Figure 3.(b) and (c) show
the two-dimension conditional downtime probability; Figure 3.(c) is the upper right quadrant of Figure 3.(b). Both Figures 3.(b) and (c) highlight nodes that are in the same site with light open circles. The fraction of correlated nodes in both figures is highlighted in Table 3.(g). Table 3.(g) shows that 22%
of the time that a node goes down, there is at least
a 50% chance another node in the same site will go
down as well. Alternatively, Table 3.(g) shows that
the two-dimension conditional downtime probability
of nodes in different sites is insignificant.
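The two-dimensional metric can be computed directly from per-node downtime sets. The sketch below is our own illustration (the threshold rule is an assumption on our part); it discretizes time into measurement slots and evaluates both conditional probabilities for a node pair:

    # Two-dimension conditional downtime probability for a pair of nodes.
    # down_x and down_y are sets of measurement-slot indices (e.g., 15-minute
    # intervals) during which each node was observed down.
    def conditional_downtime(down_x, down_y):
        """Return (P[x down | y down], P[y down | x down])."""
        both = len(down_x & down_y)
        p_x_given_y = both / len(down_y) if down_y else 0.0
        p_y_given_x = both / len(down_x) if down_x else 0.0
        return p_x_given_y, p_y_given_x

    # One plausible decision rule (an assumption here): flag a pair as correlated
    # only if BOTH conditionals exceed the threshold, so that a single
    # chronically-down node cannot inflate the correlation by itself.
    def correlated(down_x, down_y, threshold=0.5):
        p_xy, p_yx = conditional_downtime(down_x, down_y)
        return p_xy >= threshold and p_yx >= threshold

    print(correlated({1, 2, 3, 4}, {4, 5, 6, 7, 8, 9}))   # False: little overlap either way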
Next we look at the effects of nodes with long downtimes. Figure 3.(d) shows the total downtime for each node and orders the nodes from most to least total downtime. Of the 512 total PlanetLab nodes, 188 nodes have total downtimes greater than 1000 hours. In Figure 3.(e), we again show the two-dimension conditional downtime probability, but with nodes with total downtime longer than 14 days factored out. Figure 3.(e) shows more density along the diagonal, meaning that p(x is down | y is down) is more symmetric with p(y is down | x is down); that is, the nodes are not asymmetrically influenced by long downtimes. Similar to Table 3.(g), Table 3.(h) shows that 33% of the time that a node whose total downtime is less than 14 days goes down, there is at least a 50% chance another node in the same site will go down. However, the inter-site two-dimension conditional downtime probability is still insignificant.

Finally, Figure 3.(f) shows the joint probability that k nodes, picked randomly, simultaneously fail; that is, the probability of a catastrophic failure if there are k replicas. Figure 3.(f) shows that this probability decreases exponentially as k increases. In particular, k = 5 is enough to make the joint probability that k failures occur simultaneously 0.00001 (1 in 100,000). The key is that as long as all replicas do not fail simultaneously, the lost replicas can be reproduced from the remaining ones and data will not be lost.
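The joint probability shown in Figure 3.(f) can be estimated from the trace by repeatedly sampling k nodes and checking whether any measurement slot finds all of them down at once. The Monte Carlo sketch below is our own illustration, not the authors' tooling:

    import random

    # Estimate P[k randomly chosen nodes are down simultaneously].
    # down_slots maps node -> set of slot indices during which the node was down.
    def p_simultaneous_failure(down_slots, k, trials=10000, rng=random):
        nodes = list(down_slots)
        hits = 0
        for _ in range(trials):
            sample = rng.sample(nodes, k)
            if set.intersection(*(down_slots[n] for n in sample)):
                hits += 1                      # some slot had all k nodes down together
        return hits / trials

    # Toy data: only the pair ("a", "b") ever overlaps, so the estimate is ~1/3.
    trace = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {9}}
    print(p_simultaneous_failure(trace, k=2, trials=1000))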
4 Replica Placement Strategies

The analysis of correlated failures from Section 3.2 suggests that random placement, with replacement of duplicate-site nodes and with a "blacklist" of nodes showing long downtimes, may be a good replica placement strategy. To validate this hypothesis, we compare four different random-based replica placement strategies and four different "oracle" placement strategies.

The four random-based replica placement strategies that we compare are Random, RandomBlacklist, RandomSite, and RandomSiteBlacklist. Random placement picks n unique nodes at random to store replicas. RandomBlacklist placement is the same as Random but avoids the use of nodes that show long downtimes; the blacklist is composed of the top k nodes with the longest total downtimes. RandomSite avoids placing multiple replicas in the same site: it picks n unique nodes at random while avoiding nodes in the same site. We identify a site by the 2-byte IP address prefix; other criteria could be geography or administrative domains. Finally, RandomSiteBlacklist placement is the combination of RandomSite and RandomBlacklist.
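The four random strategies differ only in which candidate nodes they exclude. The following sketch of the selection logic is our own illustration (data structures and names are assumptions), with the site test using the 2-byte IP prefix described above:

    import random

    # Sketch of the four random placement strategies. `nodes` maps node name to
    # IP address; `blacklist` holds the top-k nodes by total downtime.
    def site_of(ip):
        return tuple(ip.split(".")[:2])        # 2-byte IP prefix identifies a site

    def place(nodes, n, blacklist=(), one_per_site=False, rng=random):
        candidates = [name for name in nodes if name not in blacklist]
        rng.shuffle(candidates)
        chosen, used_sites = [], set()
        for name in candidates:
            site = site_of(nodes[name])
            if one_per_site and site in used_sites:
                continue                        # RandomSite: skip duplicate-site nodes
            chosen.append(name)
            used_sites.add(site)
            if len(chosen) == n:
                return chosen
        raise ValueError("not enough eligible nodes")

    # Random:              place(nodes, n)
    # RandomBlacklist:     place(nodes, n, blacklist)
    # RandomSite:          place(nodes, n, one_per_site=True)
    # RandomSiteBlacklist: place(nodes, n, blacklist, one_per_site=True)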
The "oracle" placement strategies use future knowledge of node lifetimes, downtimes, and availability to place replicas. In all, we define four different oracle algorithms. First, the Min-Max-TTR oracle places replicas on nodes that exhibit the minimum maximum future downtimes. Similarly, the second and third oracles, Min-Mean-TTR and Min-Sum-TTR, place replicas on nodes that exhibit the minimum mean and minimum total downtimes, respectively. The fourth oracle, Max-Lifetime-Avail, places replicas on nodes that permanently fail furthest in the future and exhibit the highest availability, defined as the ratio of total uptime to lifetime.

5 Evaluation Methodology

In this section we present our methodology of using a trace-driven simulation to compare and evaluate the proposed replica placement strategies.

The trace-driven simulation runs the entire 20-month trace of PlanetLab (Figure 2.(a)). The simulator added a node at the time a node became available in the trace and removed a node that was not available; that is, nodes that are not available in the trace are not available in the simulator (and vice versa). We implemented timeouts/heartbeats in the simulator to detect node failures. Data recovery was triggered when the extra replicas plus one, e + 1, were simultaneously considered down (described in Section 2). Recall that e = n - th (n >= th). The timeout used to detect failure was to = 1hr.

We set the target data availability to six 9's using the equations described in [1, 2]. Data availability equals 1 - p[no replicas available], or 1 - epsilon = 1 - (1 - a)^th, where a is the average node availability, th is the minimum data availability threshold, and epsilon is the probability that no replicas are available. The equality assumes that all nodes are independent. As a result, we calculated the minimum data availability threshold to be th = 9 for replication (m = 1) and an average PlanetLab node availability of a = 0.822.
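Plugging the measured availability into this equation, the minimum threshold that meets six 9's can be checked directly; the short sketch below (our verification, not the simulator) reproduces th = 9:

    # Smallest th with 1 - (1 - a)^th >= target, assuming independent node failures.
    def min_threshold(a, target=0.999999):
        th = 1
        while 1 - (1 - a) ** th < target:
            th += 1
        return th

    # Average PlanetLab node availability from the trace: a = 0.822.
    print(min_threshold(0.822))   # 9 (th = 8 yields about 0.9999990, just below six 9's)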
The amount of data that we simulated maintaining was 10,000 unique objects, or 2TB if each object was 200MB. The replication factor, k = n/m, increased the size of the repository. For example, with n = 10 and m = 1 the total size of the repository was 100,000 replicas and 20TB of redundant data.

We use the trace supplemented with the disk failure distribution to compare the rate of data recovery: we measured the total number of repairs triggered vs. time and the bandwidth used per node vs. time. Additionally, we measured the average bandwidth per node as we vary the timeout and extra replicas. Finally, we measured the average number of replicas available vs. time, but we omit this graph since the number of available replicas was always at least the threshold.
Table 2: Comparison of Replica Placement Strategies.

(a) n = 10 (m = 1, th = 9, n = 10, |blacklist| = 35, to = 1hr)
Strategy                     # Repairs   % Improvement
Random                       400726      (baseline)
RandomBlacklist              391476      2.31
RandomSite                   398045      0.67
RandomSiteBlacklist          387233      3.37
Oracle Min-Max-TTR           411141      -2.60
Oracle Min-Mean-TTR          413039      -3.07
Oracle Min-Sum-TTR           411176      -2.61
Oracle Max-Lifetime-Avail    380521      5.04

(b) n = 15 (m = 1, th = 9, n = 15, |blacklist| = 35, to = 1hr)
Strategy                     # Repairs   % Improvement
Random                       63251       (baseline)
RandomBlacklist              60909       3.70
RandomSite                   61378       2.96
RandomSiteBlacklist          58673       7.24
Oracle Min-Max-TTR           53836       14.89
Oracle Min-Mean-TTR          52827       16.48
Oracle Min-Sum-TTR           52026       17.75
Oracle Max-Lifetime-Avail    43695       30.92
[Figure 4: Cost Tradeoffs for Maintaining Minimum Data Availability for 10,000 200MB objects (th = 9, m = 1; 10,000 initial objects, increasing by 10 objects per 86400000 ms, i.e., one day). (a) Total data repairs triggered vs. total number of replicas n (greater than the threshold th = 9). (b) and (c) Average bandwidth per node vs. total number of replicas. (a) and (b) vary the replica placement strategy (Oracle Max-Lifetime-Avail, Random, RandomBlacklist, RandomSite, RandomSiteBlacklist) and fix the timeout to = 1hr; (c) fixes the replica placement strategy to Random and varies the timeout from 15 minutes to 8 days.]
6 Evaluation

In this section, we present results using replication (m = 1) with the trace modified with disk failures. We did produce results for different erasure coding schemes (e.g., m = 4), but we omit the graphs for space since the trends of using extra redundancy with erasure codes were the same.

6.1 Evaluation of Replica Placement Strategies

First, we compare the different replica placement strategies. Table 2 shows the number of repairs and the percentage of improvement over Random for the four random placement and four oracle placement strategies. Table 2.(a) and (b) use n = 10 and n = 15, respectively.

For n = 10 (Table 2.(a)), the more sophisticated placement algorithms have marginal improvement over Random; for example, RandomSiteBlacklist shows a 3.37% improvement. The observation that three of the four oracles perform worse than Random indicates that more is happening beyond just placement. In the next section, we investigate further by varying the number of extra replicas and timeouts. We also separate repairs triggered due to permanent failure (loss of data on a node) from those due to transient failure (node returns from failure with data).

For n = 15 (Table 2.(b)), the more sophisticated placement algorithms exhibit a noticeable increase in performance, that is, fewer repairs triggered compared to Random. For example, the RandomSiteBlacklist placement shows a 7.24% improvement over Random, which is about the same as the sum of the 3.70% and 2.96% improvements for RandomBlacklist and RandomSite, respectively. The oracle placement algorithms exhibit a 14.89% to 17.75% improvement for the oracles that do not consider lifetime and a 30.92% improvement for the oracle that considers lifetime.

The fact that the absolute number of repairs decreases from n = 10 to n = 15 suggests that the system is responding to transient failures rather than to placement and permanent failures. The improvement increases when n increases, and then it decreases as n increases further. We investigate this further in the next section.
[Figure 5: Cost Breakdown for Maintaining Minimum Data Availability for 10,000 200MB objects. (a) and (b) Cost breakdown (total, permanent, transient, heartbeat, and write bandwidth per node) vs. number of replicas, with a unique write rate of 1Kbps and 10Kbps per node, respectively. Both (a) and (b) fix the replica placement strategy to Random and the timeout to = 1hr.]
6.2 Evaluation of Extra Redundancy

In this section, we vary the extra redundancy (i.e., the total number of replicas minus the threshold, e = n - th) to investigate the effect transient failures have on triggering data recovery. We also continuously add data to the repository, increasing its size over time. We use two different write rates, 10Kbps and 1Kbps per node. To make the write rates concrete, we used the measurement that an average workstation creates 35MB/hr (or 10Kbps) of total data [3] and 3.5MB/hr (or 1Kbps) of permanent data. Finally, 10Kbps and 1Kbps per node correspond to 40GB and 4GB of unique data per node per year added to the repository, respectively. With an average of 300 nodes, the system increases in size at a rate of 30TB and 3TB per year, respectively.

Figure 4 shows the effects of varying the number of extra replicas and the timeout. Figure 4.(a) and (b) fix the timeout to = 1hr and vary the replica placement strategy, whereas Figure 4.(c) varies the timeout and fixes the replica placement strategy to Random.

Figure 4.(a) shows that the number of repairs triggered decreases exponentially as the number of extra replicas increases linearly. Correspondingly, the bandwidth required to maintain the minimum data availability threshold and extra replicas decreases, as seen in Figure 4.(b). Figure 4.(c) shows that large timeout values exponentially decrease the cost of data recovery but potentially compromise data availability. Similarly, a linear increase in extra replicas exponentially decreases the cost of data recovery without sacrificing data availability.

Figure 5 shows the breakdown in bandwidth cost for maintaining a minimum data availability. Figure 5 fixes both the timeout, to = 1hr, and the replica placement strategy, Random. Figure 5.(a) and (b) use a per-node unique write rate of 1Kbps and 10Kbps, respectively. Both Figures 5.(a) and (b) illustrate that the cost of maintaining data due to transient failures dominates the total cost; that is, the total cost is dominated by unnecessary work. As the number of extra replicas (e = n - th), which are required to be simultaneously down in order to trigger data recovery, increases, the cost due to transient failures decreases, and the cost due to actual permanent failures, which is a fundamental characteristic of the system, dominates. The difference between Figure 5.(a) and (b) is that the cost due to permanent failures dominates in (a) and the cost due to new writes dominates in (b). Finally, the cost of probing each node once an hour is insignificant.

In summary, Figure 4 demonstrates that extra redundancy reduces the rate of triggering data recovery and thus reduces the cost of maintaining data. In particular, wide-area storage systems can trade off increased storage for decreased bandwidth consumption and maintain the same minimum data availability threshold.
7 Conclusion
In this paper, we analyze the machine failure characteristics of wide-area systems and examine replica placement strategies for such systems. We found that correlated downtimes do exist, and that the correlation is especially significant between nodes in the same site. As a placement strategy, random placement appears to be good enough to maintain data availability efficiently. In particular, random placement with extra replicas, combined with constraints that avoid placing replicas in the same site and on unreliable nodes, is sufficient to reduce the rate of triggering data recovery due to node failures and most observed correlation.
References
[1] R. Bhagwan, S. Savage, and G. Voelker. Replication strategies for highly available peer-to-peer storage systems. Technical Report CS2002-0726, U. C. San Diego, November
2002.
[2] C. Blake and R. Rodrigues. High availability, scalable storage, dynamic peer networks: Pick two. In Proc. of HOTOS,
May 2003.
[3] W. Bolosky, J. Douceur, D. Ely, and M. Theimer. Feasibility of a serverless distributed file system deployed on an
existing set of desktop PCs. In Proc. of Sigmetrics, June
2000.
[4] B. Chun and A. Vahdat. Workload and failure characterization on a large-scale federated testbed. Technical Report
IRB-TR-03-040, Intel Research, November 2003.
[5] D. Patterson and J. Hennessy. Computer Architecture: A
Quantitative Approach. Forthcoming Edition.
[6] J. Stribling. PlanetLab all-pairs ping. http://infospect.planet-lab.org/pings.