Topological Similarity-based Scheme for Large-scale Group Communication Services

advertisement
Topological Similarity-based Scheme for
Large-scale Group Communication Services
Yuehua Wang1,2 , Zhong Zhou1,2 , Ling Liu3 , Liang Cheng4 , Wei Wu1,2
Key Lab of Virtual Reality Technology and Systems, Beihang University, China
2 School of Computer Science and Engineering, Beihang University, China
3 College of Computing, Georgia Institute of Technology, USA
4 Department of Computer Science and Engineering, Lehigh University, USA
Email:yuehua.research@gmail.com, zz@vrlab.buaa.edu.cn, lingliu@cc.gatech.edu,
cheng@cse.lehigh.edu, wuwei@vrlab.buaa.edu.cn
1 State
Abstract—Group communication is essential for multi-user
applications. However, due to unpredictable node departures
and non-deterministic network partitions, providing reliable
and scalable group communication services is challenging when
the applications are utilized by the users with heterogeneous
capacities on a large scale. To address this challenge, we propose
a novel replication scheme to achieve high reliability and lowcost scalability in group communication with following three
features. First, it introduces a new concept of replication based
on topological similarity, which empowers each node with an
ability of measuring similarity between the nodes in topology.
By eliminating the topological similarity between the replicas,
it intelligently mitigates service interruptions caused by node
failures and network partitions. Second, instead of specifying
the number of replicas, it provides a technique for nodes
to dynamically adapt the replication placement schemes by
exploiting functionality importance of the nodes in the groupcommunication session. It eliminates the bottleneck problem and
improves the network resource utilization. Third, the scheme is
self-converging and it can stabilize within a few adaptations even
facing a high churn rate. Extensive simulations show that it yields
significant improvements in reduction of replication overhead and
service interruption when comparing to existing approaches.
Keywords- Topological similarity, Importance, Reliability, Scalability, Replication, Group communication.
I. I NTRODUCTION
The explosive growth of network applications and the
increasing popularity of handhold devices in recent years have
made reliability and scalability important and challenging to
achieve large-scale communication. The applications include
online game, real-time conference, large-scale distributed interactive simulation, instant messaging and RSS services,
which are characterized by exchanging information contents
among multiple unreliable participants with heterogeneous
capacities. In those applications, a large number of participants
who are often geographically dispersed access the services and
require reliable support by applications even in a network of
unpredictable node departures and non-deterministic network
partitions.
However, it is difficult to design a new network infrastructure or add a new service to the network layer to satisfy the
requirements of the participants given the existing infrastructures of systems deployed by the applications. One intuitive
approach to address such issue is to do replication, where
data/file copies are created and placed on other participants in
the network for high reliability and scalability. This approach
provides a proactive component that can be used to augment
the performance of the applications.
In the past decades, a fair amount of research [1–8]
with replication scheme has been developed. Generally, there
are three categories: ID-based replication [1, 2, 6], neighbor
locality based replication [7, 8] and path-based replication [3–
5]. These approaches either have mainly focused on achieving query efficiency without considering network resource
utilization, or ignore the influence of network partitions on
the system performance. In fact, the occurrence of network
partitioning could make the problem of replication even more
severe and consequently deteriorate the system performance
because of massive volumes of messages that are generated
for replica maintenance and service recovery.
Fig. 1 depicts three different scenarios that are group
communication sessions obtained in the presence of failures
in different patterns. In each scenario, there are 11 nodes
that are divided into two classes: content publisher nodes
(i.e., darkened nodes), being responsible for collecting and
disseminating data to content subscriber nodes (i.e., shaded
nodes) that are located at the edge of the network, and content
subscriber nodes, listening to their parent nodes and receiving
data from them as the data arrives.
To minimize the replication cost and improve reliability
of group communication services provided by nodes, one
approach can choose close nodes and replica nodes’ content
on them. In Fig. 1(a), nodes 7 and 9 are two replicas deployed
by node 8 with this approach [3–5, 7, 8]. This approach
brings two unique benefits. First, it enables the node in the
network with the ability of reacting quickly to the changes
of network. Second, it reduces the creation and maintenance
cost of replication. However, such an approach does not work
well when a network partition occurs as shown in Fig. 1(b).
In Fig. 1(b), a circle region of three nodes 7, 8 and 9 may
appear to be unreachable from the other nodes. As a result,
nodes 2, 5, and 11 are isolated from the session and have
no data received from the content publisher node. To address
Fig. 1.
A motivating example
that, one may suggest placing the replicas in a hybrid manner
such that node 2 could also employ a remote node 26 as a
replica node to improve the efficiency of replication scheme.
It, however, subjects to a difficulty of determining the locations
and the number of remote nodes.
We also argue that the data/file replication often leads to
inefficient network resource utilization due to a large volume
of data/file copy propagation. Given accessing to the most
popular ones is frequently skewed, the replication approaches,
such as [3], [4] and [5], may exhaust the capacity of those
nodes and decrease the network resource utilization and service quality. To avoid that, one intuitive approach is to place
more replicas for important nodes (e.g., node 8) while the
nodes with less importance have only a few. However, this
approach ignores the heterogeneity of nodes’ importance in
the different sessions. In fact, it is a normal case that some
nodes may be very important in one session but not in others
and some nodes may participate in multiple sessions while
playing less important roles.
In this paper, we focus on characterizing the replication
problem and devise a novel replication scheme to provide
scalable and reliable group communication services for the
participants at the edge of network. The scheme is unique in
three aspects. First, a new concept of replication based on
topological similarity is exploited. It empowers each participant with the ability of measuring similarity between the nodes
in topology. Instead of making replication among the physically close nodes, the nodes with low topological similarity are
employed as replicas in our replication scheme in such a way
it increases the immunity of replication placement strategies to
the network partitions. Second, we outline node importance in
groups in replication placement. With node importance consideration, the nodes adaptively determine the number of replicas
by themselves which eliminates unnecessary replicas, resulting
in a reduction of replica creation and maintenance overhead. In
addition, it avoids exacerbating the bottleneck problem. Third,
in contrast to the existing replication schemes [3, 4, 9–11],
our replication scheme is self-converging and it can stabilize
within a few adaptations even facing high churn rate.
The rest of this paper is organized as follows: we first give
a formal definition of replication problem. In Section 3, the
design of the replication algorithm is presented. We provide
extensive simulations in Section 4 and review some related
work in Section 5. Finally, Section 6 concludes the paper and
discusses the research directions.
II. P ROBLEM DEFINITION
We study replica placement problem in a general distributed
network G = (V, E, W ), where V is the set of nodes,
E = V × V is the set of edges and W is the set of weights
of the edges. Nodes represent users who participate in the
overlay network G, edges represent links between the nodes.
Each node vi in V has a limited capacity Ci . Ci refers to
(t) over all t, where cavail
(t) denotes
the maximum of cavail
i
i
the available storage capacity of the node i at time t given the
storage capacity is one main factor in the Internet applications
like content searching and media streaming dissemination.
Weights represent communication cost of the links. Concretely,
for each weight ωk in W , it refers to the propagation delay
li,j required for a packet to transmit from node vi to node vj
through the link ek = (vi , vj ) , ∀ vi , vj ∈ V, ek ∈ E.
Given the dynamic nature of the network, we consider G
as a network where the delays of the links and the available
capacities of the nodes may change as the nodes continuously
join or leave the network. In such a dynamic network, the
problem of replication can be modeled as a multi-objective
optimization problem, which can be represented as: for any
node vi ∈ V ,t ∈ Q+
X
X
(t), 1 − RR (t + τ |t)) s.t.
(1)
cavail
li,j ,
min (
j
vj ∈R
j
j
1)
R = {vj }
2)
3)
cavail
(t) ≤ Ci
k
RR (t + τ |t)) > Pi (t)
(∀ j ∈ [1, N ], i 6= j)
(∀ k ∈ [1, N ])
(∀ τ > 0, R 6= Ø)
where the symbols Q+ and N denote the set of positive rational
numbers and the number of nodes in network G respectively.
RR (t + τ |t) refers to the service reliability offered by node vi
and its replica nodes vj . It is the probability of a set of nodes
{vj } remaining in the network in next time slot τQ, which is
defined by RR (t + τ |t) = 1 − Pi (x < t + τ |x > t) j Pj (x <
Q Pj (x<t+τ )
i (x<t+τ )
t + τ |x > t) = 1 − P1−P
j 1−Pj (x<t) . Pi (t) is the
i (x<t)
probability of node vi with a lifetime of t seconds.
Our goal is to find a replica placement strategy that minimizes the replica maintenance cost and replica resources usage
while maximizes the service reliability offered by nodes {vj }.
However, it is important to note that this problem does not have
an exact solution since practical systems often contain a large
number of nodes and the storage capacity of the nodes may
range from a few units to a few thousands of units. Potentially,
it is impossible to enumerate all of the solutions. Therefore,
our replication algorithm is developed and we will describe it
in details in the following section.
III. T OPOLOGICAL SIMILARITY- BASED REPLICATION
SCHEME
In this section, we start off by introducing topological similarity and describing replica placement consideration. Then,
the design of the scheme is described.
A. Topological similarity
Consider the example showed in Fig. 1(b), where the nodes
in the region marked with the dash line depart from the
multicast session because of the network partition. To avoid
that, one remote node is preferred to be placed by node 8 as
mentioned before. But the challenge is how to find this type
of nodes (e.g., node 26) while keeping cost low. Based on
the analysis of the relationship between nodes, we find that
two nodes in vicinity have a high similarity in topology. The
more shared nodes they have, the closer and more similar they
are. Given that, we define a quality index named topological
similarity as follows:
Definition 1: Given a node vi , the topological similarity of
vi to node vj , denoted by Sa(i, j), is used to describe the
similarity between vi and vj in topology, which is given by:
|Sineigh ∩ Sjneigh |
Sa(i, j) = neigh
(2)
|Si
| + |Sjneigh |
where Sineigh and Sjneigh refer to the neighbor set of node vi
and vj that contain the two-hop neighbors of node vi and vj
respectively.
Intuitively, the value of Sa(i, j) reflects how similar the
nodes are in topology. Given a node vi , a large value of
topological similarity means the node vj is located close and
there is high overlap of the sets of the two nodes, while a
small value of that means that the node vj is located far away
from vi and there is a large space between those two nodes.
For simplicity, we use Sa to represent Sa(i, j) if there is no
confusion.
Consider the simple network given in Fig. 1. After receiving
update messages from nodes 4, 7, 9 and 14, node 8 can
have a neighbor set S8neigh consisting of nodes 3, 4, 5, 7,
9, 10, 12, 13, and 15. S8neigh shares 3 and 4 nodes with
that of node 7 and node 9, respectively. Then, we can get
Sa(7, 8) = 0.19 and Sa(9, 8) = 0.21. For node 26, it has
neigh
which contains 1 node
nodes 14, 15, 16 and 27 in S26
neigh
and Sa(26, 8) = 0.07. Compared to nodes 7 and 9,
of S8
node 26 shares fewer nodes and has lower similarity with node
8 in topology. Given the probability of one network partition
involving both nodes 8 and 26 is lower than that of network
partition marked with dash line, exploring node 26 as one
replica node can be a better choice for node 8 to improve its
service reliability. Therefore, in this paper, the technique that
is related to topological similarity is developed.
B. Algorithm Description
To begin with, we first introduce some important notions.
• The one-hop neighbor of node vi refers to the immediate
neighbor node of node vi [12].
• The k-hop neighbor of node vi refers to the immediate
neighbor of node vi ’s k-1 hop neighbor, ∀k > 1.
neigh
, denoted by
• The incidence vector of neighbor set Si
Ui , is a vector whose entries are labeled with the members
of Sineigh . It subjects to: (Ui )k = 1 ⇔ vk ∈ Sineigh ,
otherwise (Ui )k = 0.
• The multicast session set of the network, denoted by
S M , contains the multicast sessions that are announced
by nodes to transmit information to multiple demanding
nodes. S M (i) and S M (j) are considered to be different
if and only if they are published by two different nodes.
M
• The multicast session set of node vi , denoted by Si , contains the multicast sessions that vi involves in. It satisfies:
M
M
M
for 0 ≤ i ≤ N , SiM ⊂ SG
, where SG
= ∪N
i=0 Si .
j
M
• The importance of node vi in S (j), denoted by Impi ,
j
N umber of descendants of vi
,
is defined as Impi =
N umber of nodes in S M (j)
where the descendants of vi are the nodes who are
listening data message from vi through the message paths.
With the above notions, each node along the multicast tree
creates its replica nodes by performing following procedure,
which consists of five steps: candidatelist construction, reliability determination, topological similarity minimization, cost
reduction and replica placing.
Step 1: Candidatelist Construction It is accomplished by
two operations: filtration and reliability calculation.
Filtration A tree nodes vi first builds a new list listca
vi
(called candidatelist) and then adds the nodes in Sineigh whose
capacity satisfies cavail (t) > 0 to the list. Neither failed nodes
nor heavy loaded nodes are considered as the candidate nodes.
Reliability Calculation This operation is triggered as the
filtration is finished. Node vi begins to calculate conditional
reliability of nodes in listca
vi using the formula mentioned in
section II. Given RR (t + τ |t) measures the probability of the
list nodes remaining in the network in next time slot τ , it
is utilized to help node vi choose one appropriate placement
strategy, which will be further described in the following steps.
Step 2: Replica Degree Determination The goal of this
step is to determine an appropriate value for |R| by carefully
combining the reliability theory and importance-aware replication strategy. The procedure of replica degree determination
is as follows.
• Node vi first lists all the nodes in the candidatelist with
their corresponding reliability in the descending order and
computes the minimum number of replica nodes r1 that
satisfies |ϕ(rP
1 +1)−ϕ(r1 )| < δ, where service reliability
x
ϕ(x) = 1 − i=1 RR (t|τ ) and δ is a system parameter
that is configured by default. Based on both reliability
theory and experimental results, we find that there is no
apparent improvement in service reliability after δ reaches
a certain value. In GeoCast, it is set to 0.01 (as suggested
by results in Fig. 5).
•
•
The importance of node vi Im is then measured by
using the information collected from its children nodes,
M
P
Impji . Given
which is defined as Im = P|S|S M(j)|
(j)|
the importance of node vi Im, the number of replica nodes r2 desired by node vi can be calculated as
r2 = rounded(1 + Im ∗ W ), where W represents the
local knowledge of node vi about the network, defined
2d G.size−log2d R.size
as W = log
|immediate neighbors| . The method of r2 ’s
calculation ensures that the more important vi is, the
larger r2 is.
At last, the replication degree is fixed by choosing
the bigger one from r1 and r2 , which satisfies |R| =
max{r1 , r2 }. To minimize the topological similarity between replica candidate nodes, the first |R| nodes in
sorted candidatelist are selected and added to a new set
(called prereplicaset).
Step 3: Topological Similarity Minimization It is achieved
by iteratively choosing the candidate nodes with low topological similarity but not including in prereplicaset to replace
the prereplicaset nodes. At each time, node vi checks if
the set prereplicaset satisfies: ϕR′ (|R|) ≤ ϕR (|R|), where
′
′
R is a temporary set such that R ⊂ candidatelist. If
so, the node with highest topological similarity is removed
and the candidate node is added into prereplicaset. This
procedure is executed P
repeatedly until SaR′ ≥ SaR , where
1
SaR = Sa(i, j) = |R|
Sa(i, j), ∀vi , vj ∈ R.
In this procedure, nodes in candidatelist that have lower topological similarity to the nodes in prereplicaset
are only considered. If there is a node uj such that
Sa{prereplicaset−{ui }}∪{uj } < SaR , ui that is the node
with highest topological similarity is replaced with uj in
prereplicaset. Another operation could also be triggered in the
similar way. It executes repeatedly until there is no replica
placement having the lower topological similarity than SaF ,
where R is the improvement replica placement strategy after
the mth operation. In such a way, SaR is convergent.
Step 4: Cost Reduction Similar to Step 3, this step is done
by replacing the prereplicaset nodes with the candidate nodes
with less replication cost which satisfying the conditions:
ϕR′ (|R|) ≤ ϕR (|R|) and SaR′ ≤ SaR .
Step 5: Replica Placing Nodes remaining in prereplicaset
are selected as replicas and node vi places its files/information
on them.
To avoid introducing extra overhead, we attach the maintenance information of replicas to the existing heartbeat messages. Every T seconds, heartbeat mechanism is performed
where T is a constant that refers to the parameterized heartbeat period. Long time absence of heartbeat or its responds
indicates that the node has left, an then initials the recovery
process mentioned in [13].
Once a replica departure is detected by node vi , the update
of replica list is triggered by performing the following steps.
•
Node vi first checks if the condition of |R| ≥
max{r1 , r2 } is still satisfied. If so, end this update.
Otherwise, go to next step.
TABLE I
S IMULATION S ETUP
Parameter
Number of Routers
Number of Nodes
Link Latency
Capacity Distribution
Group Number
Value
8080 (80 transit routers and 8000 stub routers)
[1000,2000,4000,6000,8000]
intra-transit link in [50ms,80ms]
transit-stub link in [10ms, 20ms]
intra-stub link in [1ms, 5ms]
5% nodes with 1000 units
15% nodes with 100 units
30% nodes with 10 units
j5% nodes with 1 unit
k
Group Size
Lifetime Distribution
Simulation Time
r
RT
•
•
N −1
(log1.25 11.5
)
-1
[ N ∗ r −1.25 + 0.5 , 11]
Pareto (α = 1.1,β = 0.05)
7200
[0,8]
[30,120]s
The candidatelist and prereplicaset are created and initialized by adding the qualified nodes that are in the
replica set. Then, go to next step.
With such setting of candidatelist and prereplicaset, the
proposed replication algorithm is performed.
The computation complexity of the topological similaritybased replication algorithm is O(n(logk + k)), where n is
the number of nodes contained in S neigh and k is a variable
ranged from 1 to n. Since a network may consist of a large
number of nodes, it is necessary for the nodes to reduce the
computation complexity. The most effective way is to reduce
the number of nodes in the neighbor sets and the number of
replica nodes. But from experiments we find that n and k are
normally within [10, 20] and [1,6] respectively that enable the
complexity of the proposed scheme is low even in a large
network. In addition, it is important to point out that not all
nodes in S neigh are valid for replica selection since some
nodes may be overloaded at that point in time.
IV. P ERFORMANCE EVALUATION
In this section, we use extensive simulations to evaluate the
proposed algorithm.
A. Experimental environment
We use Transit-Stub graph model of the GT-ITM topology
generator to generate network topologies for our simulation.
All experiments in this paper are run on 10 topologies with
8080 routers. Table I lists the detailed parameter setting used
in our simulations.
With the lack of real-world trace data to drive these experiments, we use the method mentioned in [14] to simulate
distribution
of groupsize and group number. A modified range
[ N ∗ ra−1.25 + 0.5 , 11] is set on the group size, where N
is the number of nodes in the system and ra is a system
N −1
)
. We
configured variable, ranging from 2 to (log1.25 11.5
simulate the behaviors of joining nodes in the network using
Pareto lifetime function [15] with α = 1.1 and β = 0.05,
where the average lifetime of joining nodes is around 0.5 hour.
(a) Neigh
Fig. 2.
(b) RAM
(c) N&R
Comparison of replica number in terms of node number
To study the effectiveness of TSRS, we compare the scheme
TSRS to RAM, Neigh, N&R and Mcost in various scenarios. RAM [10, 16] in our simulations refers to the random
replication scheme where the locations of data copies are
selected by following an uniform distribution. Neigh refers
to the neighbor-based replication approaches [8, 17]. N&R is
a simple extension scheme of Neigh and RAM designed to
reduce construction cost caused by RAM and improve reliability of Neigh. Mcost refers to the replication schemes [9, 18]
with the aim of minimizing both replica construction cost
and its maintenance cost while satisfying system reliability
requirement.
B. Results
We first make a comparison between the replication schemes
in networks with different size, as shown in Table II. In Table
II, n1 and n2 denote total number of interrupted services with
and without replication scheme respectively. The results lead
to two observations. First, all of three replication schemes:
Neigh, RAM and N&R, can improve their performances by
increasing the replica number r. However, after r reaches
5, increasing the number of replica nodes does not achieve
dramatic improvement in the metric. This is, essentially,
because of the redundancy among the replica nodes. Second,
the scheme TSRS yields better performance in all cases than
Mcost and other counterparts with smaller r (i.e., r < 5).
Similar to Neigh and N&R with r = 5, TSRS achieves a
reduction from 99.35% to 99.61% in service interruption. This
benefits from the flexibility and reliability of the TSRS replica
nodes, which enables them to have high probability to be alive
and consequently be able to detect and react to the changes
of the network quickly. The results also show that Mcost
achieves a poor performance. This is due to the influence of
the failures of nodes with high reliability. In Mcost, the nodes
with higher reliability may have fewer replicas, but heavier
backup workload.
Fig. 2 measures the average number of the valid replica
nodes per node as a function of network size for the schemes.
As the results in Table II suggest, we vary r from 5 to 8
to obtain the optimal performance of the schemes so that we
can study the substantial benefits of TSRS. We observe that
both TSRS and Mcost yield better performance than the other
schemes in terms of replica node number. But unlike TSRS,
Mcost fails to achieve high efficiency in terms of the number
of service interruptions, as shown in Table II. Potentially, this
indicts that the scheme of TSRS provide a better way for data
(d) Mcost
TABLE II
C OMPARISON BETWEEN REPLICATION SCHEMES IN NETWORKS WITH
DIFFERENT SIZE
r=1
r=2
r=3
r=4
Neigh
r=5
r=6
r=7
r=8
r=1
r=2
r=3
r=4
RAM
r=5
r=6
r=7
r=8
r=1
r=2
r=3
r=4
N&R
r=5
r=6
r=7
r=8
Target = 0.7
Target = 0.8
Mcost
Target = 0.9
Target = 0.99
TSRS
n2
Reduction Rate(%)=
n2 −n1
n2
n1
N=1000 N=2000 N=4000 N=6000 N=8000
82
220
397
480
690
36
131
197
353
495
35
92
182
284
234
23
72
105
161
168
14
49
54
126
145
11
40
36
76
111
9
16
33
65
91
5
11
21
44
82
84
211
380
585
649
45
150
248
402
437
36
80
180
177
277
13
76
96
153
136
8
23
53
97
110
6
25
48
9p
75
5
23
41
64
70
2
20
33
50
65
80
203
385
496
687
47
91
333
296
403
46
74
124
210
301
17
65
142
197
217
13
45
90
105
154
11
38
84
99
107
8
17
50
95
103
2
16
49
58
72
122
246
562
700
898
124
247
558
702
897
120
217
465
588
760
102
224
372
563
690
13
46
77
93
135
1988
7232
12409
22302
34419
99.35% 99.36% 99.38% 99.58% 99.61%
replication with consideration of the reliability of nodes. Given
Mcost is not practical due to the poor performance of Most
in dealing with node failures and the network partitions, we
ignore the analysis of such method in the rest of paper for
reliability purposes.
Fig. 3 shows that the performance of RAM and N&R
heavily depends on the setting of parameter r. The larger r
is, the more messages are generated. By comparing RAM
with N&R, we see that RAM generates less messages for
replica creation than N&R, especially when the network size
is larger. This is because of the extra overhead generated for
neighborhood inquisition. In N&R, the inquire messages are
initiated and sent to both remote nodes and neighbor nodes. Since it is hard to determine the range of the inquisition, a large
number of messages might be issued during this procedure
and consequently degrades the performance of applications.
(a) Neigh
Fig. 3.
(a) N ∈ [1000, 8000]
Fig. 4.
(b) RAM
(c) N&R
Comparison of Creation Cost in Terms of Node Number
(b) N=4000
Effect of RT
Different with them, TSRS yields better performance. The
replica number in TSRS are delicately determined by nodes
based on their statuses and the network condition.
Fig. 4(a) illustrates the effectiveness of TSRS under different RT. RT refers to the time required for node initialization
right after it is selected as a replacement node to take over the
island previously owned by failed node. We vary RT from 30s
to 120s. It is observed that more services tend to be interrupted
when RT is larger, as well as when the network size is larger.
This can be explained by the fact that as growing RT the nodes
in the network are more likely to fail in a long time interval and
it consequently leads to a poor performance in terms of service
interruption. However, given the results showed in Table II, we
find that the scheme of TSRS with RT= 60s can still have a
high reduction rate. For instance, in the network of 8000 nodes,
it achieves 99.3% saving. The results in Fig. 4(b) show the
distribution of service interruptions over time in the network
of 4000 nodes. We note that the metric gradually drops with
the run time. This is because that majority of node failures
happen at the beginning stage of simulation, which confirms
the nature of nodes in the underlay network.
Fig. 5(a) illustrates the convergence of the algorithm TSRS
when the network size changes. The upper bound of parameter
∆Sa, defined as ∆Sa = Sasi+1 − Sasi is set to 10−4 . This
value is chosen to make sure that there is no big difference
between the topological similarity of the replica placement
strategies when the algorithm is stable. The results show
that in all cases the time required for convergence to stable
is short(less than 10). Fig. 5(b) provides a further study of
the performance of the algorithm in the network of 4000
nodes. We see that after a few adaptations, TSRS reduces the
topological similarity of placement strategies to a small value,
which implies the fast convergence of the algorithm.
(a) ∆Sa ≤ 10−4
Fig. 5.
(d) Mcost
(b) N=4000
Convergence Speed
V. R ELATED WORK
Many research studies have also been conducted to cope
with massive node failure and keep high data availability by
using the technique of file replication. In those studies, location
of the replica nodes are determined based on node ID [1, 2, 6],
query path [3–5], or neighbor locality [7, 8].
The ID-based replication determines replica nodes based
on the relationship between the node ID and the data/file’s
ID, where the replica nodes are the nodes whose IDs match
most closely to the data/files’ IDs. In PAST [1], each file is
replicated on a set of nodes whose node ID are numerically
closest to the files’ ID. A new replica is created immediately
following a failure. CFS [2] divides files into blocks and
spreads them evenly over the available servers to prevent
large files from causing unbalanced use of storage. In recent
work, Hirokazu et al. [6] propose a distributed interval tree
replication scheme for adaptively setting the replicated objects
considering the scale of networks. The replicas are assigned
to the nodes of the interval tree created by the node with file.
In the path-based replication, the data/file copies are placed
on the nodes along the file routing path from the requester
node to the provider node. LAR [3] creates replicas on the
source of the query and adds routing hints to the replica on
nodes along the message query path. In Freenet [4], each node
along the path from the source node to the query node stores
a copy of the file. OceanStore [5] places the replicates data on
or near the client machines where the data is accessed. Those
replicas function as local access points for the data. However,
due to time-varying file popularity and node interest variation,
most of the replicas cannot be fully utilized.
The neighbor locality based replication makes file replication among the nodes located in the neighborhood of the
data/file host node. Plover [7] makes file replication among
physically close nodes based on node available capacities.
It enables the file replication and consistency maintenance
to be conducted among physically close nodes. Peercast [8]
developed a neighbor-based replication scheme, where data
copies are put on nodes who are locating in the neighborhood.
However, this type of schemes has two drawbacks. First, it
ignores the heterogeneity of nodes’ importance in the services.
Second, it subjects to a difficulty in determining replication
degree.
There are other studies [9–11, 19] for file replication based
on the file popularity or request rate. Most of them focus on
the relationship among the number of replicas, file search time,
and load balance, but do not investigate the impact of replica
location on file query efficiency.
There are two important features distinguish our approach
from the existing approaches. First, our approach leverages
node heterogeneity in lifetime, capacity and importance to
improve the replication flexibility and reduce the replication
cost caused by replica creation and maintenance. Second, our
approach makes good use of the information encoded in the
link structure of the nodes and reduces the impact of different
network partition on the service quality.
VI. C ONCLUSION AND FUTURE WORK
This paper proposes a cost-effective replication scheme to
provide scalable and reliable group communication services.
It is novel as it exploits nodes’ topological similarity and
importance heterogeneity in a distributed fashion to improve
the service reliability while keeping the replication overhead
low. Through extensive experiments, we demonstrate that
TSRS increases the reduction rate up to 99.63% and reduces
the number of data copies up to 60% when comparing to the
existing replication degree based approaches, the cost-aware
replication schemes, and their simple alternative schemes. The
results also show that TSRS is practical as it can deal with failures in different patterns without scarifying the service quality
in terms of service interruption and replication overhead. Our
future work will focus on developing techniques to increase
the utilization of the replica nodes and improve the efficiency
of our proposed scheme.
VII. ACKNOWLEDGMENT
This paper is supported by the National Basic Research
973 Program of China under Grant No. 2009CB320805,
the National Natural Science Foundation of China under
Grant No. 61170188, the National High Technology Research
and Development 863 Program of China under Grant No.
2012AA011803, and Fundamental Research Funds for the
Central Universities of China. The third author is partially supported by grants from NSF CISE NetSE program, CyberTrust
program, an IBM faculty award and an Intel ISTC on Cloud
Computing.
R EFERENCES
[1] P. Druschel and A. Rowstron, “Past: A large-scale, persistent
peer-to-peer storage utility,” in HotOS’01, 2001, p. 0075.
[2] F. Dabek, M. Kaashoek, D. Karger, R. Morris, and I. Stoica,
“Wide-area cooperative storage with cfs,” ACM SIGOPS Operating Systems Review, vol. 35, no. 5, pp. 202–215, 2001.
[3] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher,
“Adaptive replication in peer-to-peer systems,” in ICDCS’04,
2004, pp. 360–369.
[4] I. Clarke, O. Sandberg, B. Wiley, and T. Hong, “Freenet: A distributed anonymous information storage and retrieval system,”
pp. 46–66, 2001.
[5] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton,
D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, C. Wells
et al., “Oceanstore: An architecture for global-scale persistent
storage,” ACM SIGARCH Computer Architecture News, vol. 28,
no. 5, pp. 190–201, 2000.
[6] H. Yoshinaga, T. Tsuchiya, H. Sawano, and K. Koyanagi, “A
study on scalable object replication method for the distributed
cooperative storage system,” in ICDT’09, 2009, pp. 96–101.
[7] H. Shen and Y. Zhu, “Plover: a proactive low-overhead file
replication scheme for structured p2p systems,” in ICC’08,
2008, pp. 5619–5623.
[8] J. Zhang, L. Liu, L. Ramaswamy, and C. Pu, “PeerCast: Churnresilient end system multicast on heterogeneous overlay networks,” Journal of Network and Computer Applications(JNCA),
vol. 31, no. 4, pp. 821–850, 2008.
[9] H. Yu and A. Vahdat, “Minimal replication cost for availability,”
in PODC’02, 2002, pp. 98–107.
[10] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and
replication in unstructured peer-to-peer networks,” in ICS’02,
2002, pp. 84–95.
[11] S. Tewari and L. Kleinrock, “Proportional replication in peerto-peer networks,” in Infocom’06, 2006.
[12] Y. Wang, L. liu, C. Pu and G. Zhang, “GeoCast: An Efficient
Overlay System for Multicast Application,” Tech. Rep., GITCERCS-09-14, Georgia Tech, 2009.
[13] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker,
“A scalable content-addressable network,” in SIGCOMM’01,
vol. 31, no. 4, 2001, pp. 161–172.
[14] M. Castro, P. Druschel, A. Kermarrec, and A. Rowstron,
“SCRIBE: A large-scale and decentralized application-level
multicast infrastructure,” IEEE Journal on Selected Areas in
communications, vol. 20, no. 8, pp. 1489–1499, 2002.
[15] X. Wang, Z. Yao, and D. Loguinov, “Residual-based estimation of peer and link lifetimes in P2P networks,” IEEE/ACM
Transactions on Networking(TON), vol. 17, no. 3, pp. 726–739,
2009.
[16] Y. Liu, X. Liu, L. Xiao, L. Ni, and X. Zhang, “Location-aware
topology matching in P2P systems,” in Infocom’04, 2004, pp.
2220–2230.
[17] T. Chang and M. Ahamad, “Improving service performance
through object replication in middleware: a peer-to-peer approach,” in P2P’05, 2005, pp. 245–252.
[18] Y. Saito and M. Shapiro, “Optimistic replication,” ACM Computing Surveys (CSUR), vol. 37, no. 1, p. 81, 2005.
[19] S. Tewari and L. Kleinrock, “Analysis of search and replication
in unstructured peer-to-peer networks,” vol. 33, no. 1, pp. 404–
405, 2005.
Download