Supplementary File for “Churn-Resilient Protocol for Massive
Data Dissemination in P2P Networks”
Zhenyu Li, Member, IEEE, Gaogang Xie, Member, IEEE,
Kai Hwang, Fellow, IEEE, and Zhongcheng Li, Member, IEEE
This file supplements the manuscript entitled "Churn-Resilient Protocol for Massive Data Dissemination in P2P Networks", submitted to IEEE Transactions on Parallel and Distributed Systems. It includes the pseudo-code for two algorithms, the proofs for the theoretical analysis, and detailed simulation experiment settings and results. The notation used in this file is defined in Table 1 of the main paper.
1 PSEUDO-CODE FOR ALGORITHMS
The pseudo-code of the base ring join procedure for
the (n+1)-th node x is given in Algorithm S-1, where n > 0.
SL(x) denotes the successor list of the node x.
Algorithm S-1: Base Ring Overlay Construction
Input: The base ring of n existing nodes (n > 0), a new node x to be inserted, and the successor lists of all existing nodes along the ring.
Output: Updated base ring after adding the new node x.
Procedure:
1: x randomly chooses an existing node z
2: x joins the base ring as the successor of z
3: SL(x) ← SL(z)
4: z adds x into SL(z) at the first position
5: counter ← 2m_x − 1
6: u ← prd_z
7: loop
8:   if counter > 0 then
9:     counter ← counter − 1
10:    if u == x then break loop
11:    if suc_x ∈ SL(u) then
12:      u deletes the last (i.e., m_u-th) entry of SL(u)
13:      u adds x into SL(u) at the position before suc_x
14:    end if
15:    u ← prd_u
16:    goto loop
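For concreteness, the Python sketch below mirrors Algorithm S-1 on a simulated ring. The Node class, its field names, and the bootstrap of a one-node ring are illustrative assumptions of this sketch, not part of the protocol specification.

import random

class Node:
    """Illustrative ring node: suc/prd pointers and a successor list SL."""
    def __init__(self, nid):
        self.nid = nid
        self.suc = self          # immediate successor on the base ring
        self.prd = self          # immediate predecessor on the base ring
        self.SL = [self]         # successor list, nearest successor first

def ring_join(x, nodes):
    """Insert node x into the base ring of `nodes`, mirroring Algorithm S-1."""
    z = random.choice(nodes)         # step 1: pick a random existing node z
    x.suc, x.prd = z.suc, z          # step 2: x becomes the successor of z
    z.suc.prd = x
    z.suc = x
    x.SL = list(z.SL)                # step 3: SL(x) <- SL(z)
    z.SL.insert(0, x)                # step 4: x becomes the first entry of SL(z)
    counter = 2 * len(x.SL) - 1      # step 5: counter <- 2*m_x - 1
    u = z.prd                        # step 6: start from z's predecessor
    while counter > 0:               # steps 7-16: walk the ring backwards
        counter -= 1
        if u is x:                   # wrapped all the way around the ring
            break
        if x.suc in u.SL:            # u should now hold x just before suc_x
            u.SL.pop()               # drop the last (m_u-th) entry
            u.SL.insert(u.SL.index(x.suc), x)
        u = u.prd
    nodes.append(x)

# Usage: bootstrap a one-node ring, then join nodes 1..99 one at a time.
ring = [Node(0)]
for i in range(1, 100):
    ring_join(Node(i), ring)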
The pseudo-code for the parent switch process of a node x to repair the tree structure is listed in Algorithm S-2. Rchd(x) is the set of x's chord neighbors.
Algorithm S-2: Recovery of the Data Delivery Tree
Input: Node x whose parent fails, the existing overlay and its delivery tree.
Output: Recovered data delivery tree.
Procedure:
1: z ← nil
2: z ← a node in Rchd(x) that is not a tree descendant of x
3: if z is not nil then
4:   x takes z as its new parent node
5:   x's descendants update all ancestors
6: else u ← x
7:   loop:  // check the successor to act as the parent //
8:     u takes suc_u as a new parent
9:     u's descendants update all ancestors
10:    if suc_u is not a tree descendant of x then
11:      break loop
12:    else
13:      suc_u disconnects the tree link to its parent
14:      u ← suc_u
15:      goto loop
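A corresponding Python sketch of Algorithm S-2 is given below. The node fields (parent, children, Rchd, suc) and the helper functions are assumptions of this example; the ancestor-list updates of steps 5 and 9 are omitted for brevity.

def is_descendant(candidate, root):
    """True if `candidate` lies in the tree rooted at `root` (illustrative helper)."""
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if node is candidate:
            return True
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.children)
    return False

def attach(child, parent):
    """Make `parent` the new tree parent of `child`."""
    child.parent = parent
    parent.children.append(child)

def detach(node):
    """Remove the tree link between `node` and its current parent."""
    if node.parent is not None:
        node.parent.children.remove(node)
        node.parent = None

def repair_tree(x):
    """Reattach node x, whose tree parent has failed, following Algorithm S-2."""
    # Steps 1-5: prefer a chord neighbor of x that is not a descendant of x.
    z = next((c for c in x.Rchd if not is_descendant(c, x)), None)
    if z is not None:
        attach(x, z)
        return
    # Steps 6-15: otherwise walk ring successors until leaving x's subtree.
    u = x
    while True:
        attach(u, u.suc)                 # u takes suc_u as a new parent
        if not is_descendant(u.suc, x):
            break                        # reconnected to the rest of the tree
        detach(u.suc)                    # suc_u drops the link to its old parent
        u = u.suc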
2 PROOFS FOR THEORETICAL ANALYSIS

Theorem 1. The average size of the node successor list converges to O(log n), where n is the network size.

Proof. We prove this theorem by mathematical induction. Let E[M_n] be the average size of the successor list for n > 0. When n is 1, the successor list is of size 1. Suppose that, with an n-node base ring, E[M_n] = log n. The addition of the (n+1)-th node increases the size of its predecessor's list by 1. Since the new node selects any existing node as its predecessor with probability 1/n, we have

E[M_{n+1}] = log n + 1/n ≈ log n + log((n+1)/n) = log(n+1)   for large n.

This completes the proof. □
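The growth rate can also be checked empirically. The throwaway sketch below replays only the bookkeeping step used in the proof (each join copies a random node's list and enlarges that node's list by one entry), which is a simplification of Algorithm S-1.

import math
import random

def average_list_size(n, seed=0):
    """Replay the accounting in the proof of Theorem 1: each new node copies
    the successor list of a random existing node z, and SL(z) grows by one."""
    rng = random.Random(seed)
    sizes = [1]                              # n = 1: a single node with a list of size 1
    for _ in range(2, n + 1):
        z = rng.randrange(len(sizes))        # the new node picks z uniformly at random
        sizes.append(sizes[z])               # SL(new) <- SL(z)
        sizes[z] += 1                        # z's list gains the new node
    return sum(sizes) / len(sizes)

for n in (100, 1_000, 10_000):
    print(n, round(average_list_size(n), 2), round(math.log(n), 2))
# In expectation the average equals the harmonic number H_n ~ ln n + 0.58,
# consistent with the O(log n) growth claimed by Theorem 1.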
Theorem 2. If any node fails with probability e^{−a} (a ≥ 1) independently, then the ring structure is connected with probability higher than 1 − n^{1−a}, where e is the base of the natural logarithm.
Proof. Let p denote the probability that a node fails, m denote the successor list size, and P_f(n, m, p) denote the probability that the ring is disconnected. Let P_r(n, m, p) denote the probability that a line, which contains n nodes with each node having log n successors, is disconnected. Then, following the analysis results in [1], we have

P_f(n, m, p) ≤ P_r(n − m + 1, m, p),

where P_r(n, m, p) = 0 if n < m, P_r(n, m, p) = p^m if n = m, and, if n > m,

P_r(n, m, p) = P_r(n − 1, m, p) + (1 − p) p^m (1 − P_r(n − m − 1, m, p))
            = P_r(m, m, p) + (1 − p) p^m Σ_{i=m}^{n−1} (1 − P_r(i − m, m, p))
            ≤ p^m + p^m (n − m).

Thus, P_f(n, m, p) ≤ p^m + p^m (n − 1) ≤ n p^m. If p = e^{−a} and m = O(log n), we have P_f(n, log n, e^{−a}) ≤ n^{1−a}.
Therefore, the ring structure is connected with probability higher than 1 − n^{1−a}. As n grows, this probability converges to 1. □
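To make the bound concrete, the short sketch below evaluates only the closing inequality P_f(n, log n, e^{−a}) ≤ n·p^m ≤ n^{1−a} for a few illustrative values of n and a; it does not re-evaluate the full recursion from [1].

import math

def ring_disconnect_bound(n, a):
    """Closing step of the proof: with failure probability p = e^{-a} and
    successor lists of size m = ceil(log n), n * p**m is at most n^{1-a}."""
    p = math.exp(-a)
    m = math.ceil(math.log(n))
    return n * p ** m, n ** (1 - a)

for n in (1_000, 10_000):
    for a in (1, 2, 3):
        bound, closed_form = ring_disconnect_bound(n, a)
        print(f"n={n} a={a}  n*p^m={bound:.2e}  n^(1-a)={closed_form:.2e}")
# In every row n*p^m <= n^(1-a), so the ring stays connected with
# probability higher than 1 - n^(1-a), as stated in Theorem 2.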
Lemma 1. Suppose s random nodes on the tree already know a data item. These nodes broadcast the item on the tree independently and simultaneously. Then, on average, n + s − 2 messages are generated to ensure that all the nodes receive the item.
Proof. Let g_i denote the number of tree neighbors that a node i has. Then, except for the root node, node i has g_i − 1 child nodes. Let b = n − 1 denote the number of tree links. Each of the s source nodes sends the item to all of its g_i tree neighbors. Because a node does not transmit the item back to the node from which it received it, every other node i forwards at most g_i − 1 copies of the item to its tree neighbors. Thus, the total number of copies used is given by

G = Σ_{i=1}^{s} g_i + Σ_{i=s+1}^{n} (g_i − 1) = 2b − (n − s) = n + s − 2. □
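The count in Lemma 1 can be verified directly on a random tree. The sketch below builds an arbitrary random tree (its shape does not affect the count) and floods an item from s random seeds, counting one message per forwarded copy under the rule that a node never sends the item back to the node it received it from.

import random

def broadcast_messages(n, s, seed=0):
    """Count messages when s random seeds broadcast over a random n-node tree."""
    rng = random.Random(seed)
    parent = {0: None}
    children = {i: [] for i in range(n)}
    for v in range(1, n):                      # attach each node to a random earlier node
        u = rng.randrange(v)
        parent[v] = u
        children[u].append(v)
    neighbors = {v: children[v] + ([parent[v]] if parent[v] is not None else [])
                 for v in range(n)}
    seeds = rng.sample(range(n), s)
    informed = set(seeds)
    frontier = [(v, None) for v in seeds]      # (node, node it heard the item from)
    messages = 0
    while frontier:
        nxt = []
        for v, src in frontier:
            for w in neighbors[v]:
                if w == src:
                    continue                   # never send the item back where it came from
                messages += 1
                if w not in informed:
                    informed.add(w)
                    nxt.append((w, v))
        frontier = nxt
    return messages, len(informed)

print(broadcast_messages(1000, 5))
# Prints (1003, 1000): exactly n + s - 2 messages, and every node is informed.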
Theorem 3. If a data item is flooded for t hops on the overlay, and then broadcast on the tree structure, the average number of messages used is bounded by n + 2d(d − 1)^{t−1}.
Proof. Let V_i (0 ≤ i ≤ t) denote the set containing the unique nodes that have been reached within i hops, K_i denote the cardinality of the set V_i, and k_i (1 ≤ i ≤ t) denote the number of nodes that are reached at the i-th hop and are not included in V_{i−1}. The average number of messages used by the flooding is Q_1.

Before the item is broadcast on the tree, the number of nodes that are aware of it is K_t. The nodes in V_{t−1} do not forward the item any more, as they are shielded by the nodes reached at the t-th hop. From Lemma 1, the number of messages Q_2 used by broadcasting on the tree is

Q_2 = (n − K_{t−1}) + k_t − 2
    ≤ n − Q_1 + 2k_t
    ≤ n − Q_1 + 2d(d − 1)^{t−1}.

Thus, the total number of messages Q is

Q = Q_1 + Q_2 ≤ n + 2d(d − 1)^{t−1}. □
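For a sense of scale, the bound of Theorem 3 can be evaluated directly; the overlay degree d and flooding depth t used below are illustrative values, not taken from the experiments in Section 3.

def message_bound(n, d, t):
    """Total-message bound n + 2*d*(d-1)**(t-1) from Theorem 3 for t-hop
    scoped flooding followed by a tree broadcast on an n-node overlay."""
    return n + 2 * d * (d - 1) ** (t - 1)

for d, t in ((8, 1), (8, 2), (8, 3)):
    print(d, t, message_bound(10_000, d, t))
# d = 8, t = 2 gives 10,112 messages on a 10,000-node overlay, i.e. the scoped
# flooding adds only about 1% on top of the n - 1 messages of a pure tree broadcast.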
Theorem 4. If each node on the base ring of the CRP-enabled overlay keeps a correct successor, the delivery tree can be recovered from any node failure.

Proof. Suppose the faulty node has r (r ≥ 1) child nodes. The tree is broken into r + 1 sub-trees: r sub-trees rooted at its r child nodes and 1 sub-tree rooted at the original root. We denote the i-th (1 ≤ i ≤ r) child node by x_i and the sub-tree rooted at x_i by T_i. Each child node actively locates its new parent node according to Algorithm S-2.

For x_i (1 ≤ i ≤ r), if its successor resides in another sub-tree T_j (j ≠ i), then x_i can take it as its new parent. Thus, T_i and T_j form a new sub-tree, reducing the number of sub-trees by 1.

Otherwise, x_i connects to its successor suc_{x_i} as the new parent. The successor suc_{x_i} disconnects from its original parent and becomes the new root of T_i. The node suc_{x_i} then performs the same operation as x_i does, and this procedure continues. Since each node maintains the correct successor on the ring, there must be at least one node in T_i whose successor is in another sub-tree T_j (j ≠ i). This node takes its successor as its parent. Thus, T_i and T_j form a new sub-tree, reducing the number of sub-trees by 1.

Since each child node actively locates its new parent node, all the r + 1 sub-trees will connect again as a new delivery tree. □

The following corollary is derived directly from Theorems 2 and 4.

Corollary 3. The delivery tree can be recovered from node failures, even when the failure rate of every single node is as high as e^{−a} (a ≥ 1). □
3 DETAILED SIMULATION EXPERIMENTS AND
PERFORMANCE RESULTS
In this section, we report detailed experimental results
on the performance of CRP in a simulated P2P network
environment. While some results have been briefly reported in the main paper, we provide detailed explanations in this supplementary file.
We compare the CRP with other schemes, including
four tree-based designs: CAM-Chord [10], ACOM [2],
GoCast [8], Plumtree [6], and one mesh-based design:
Chainsaw [7].
3.1. Simulation Experimental Settings
We developed an event-driven P2P simulator. We
simulate a P2P network of up to 20,000 nodes. The simulator runs on a Linux SMP server with 8 processors. After the nodes join the system, the timer starts to run. From the 300th second, 1,000 nodes are randomly chosen as data sources to send data items simultaneously. We apply two physical topologies (i.e., IP topologies) to simulate IP networks.
 TS-topology: The topology is generated by GT-ITM
[9]. It has about 2,500 routers: 4 transit domains each
with 4 transit routers, 5 stub domains attached to each
transit router, and 30 stub routers in each stub domain
on average. Nodes are attached to randomly chosen
stub routers. The latency of the link between a node
and the router to which it is attached is 1 ms; the
latency of the link between two transit routers is 2 ms;
the latency of the link between a transit router and the
stub router that attaches to it is 10 ms; and the latency
of the link between two stub routers is 50 ms.
 King-topology: This is a real topology from the King
dataset [3]. It consists of 1,740 DNS servers in the
Internet. The latencies are from real measurements of
the RTTs between the DNS servers. The RTTs are
divided by two to obtain one-way latencies. The
average link latency is about 91 ms. Multiple nodes
are simulated at a single DNS server if the number of
member nodes is above 1,740.
Node capacity follows a bounded Pareto distribution. The shape parameter is set at 2 and the scale parameter at c/2, where c is the average node capacity. Samples outside the range [c/2, 8c] are discarded. The average node capacity is set at 8 by default.
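A sketch of the capacity sampler as described above; the use of rejection sampling on top of Python's random.paretovariate is an implementation choice of this example, not necessarily that of our simulator.

import random

def node_capacity(c=8, shape=2.0, rng=random.Random(0)):
    """Bounded Pareto sample: shape 2, scale c/2, resampling until the
    value falls inside [c/2, 8c]."""
    scale = c / 2
    while True:
        sample = scale * rng.paretovariate(shape)   # Pareto with minimum value `scale`
        if scale <= sample <= 8 * c:
            return sample

capacities = [node_capacity() for _ in range(100_000)]
print(sum(capacities) / len(capacities))
# Prints roughly 7.5: the upper cut-off at 8c pulls the mean slightly
# below the nominal average capacity c = 8.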
We implement CAM-Chord with c_x log_{c_x} n neighbors for each node x on the Chord overlay. In ACOM, a node x has c_x − 1 randomly selected neighbors and 1 ring neighbor. In GoCast, a typical node x keeps c_x − 1 physically close nodes and 1 randomly selected node as neighbors. In Plumtree, a node x has c_x randomly selected neighbors. Gossip messages are periodically (every 0.5 second) exchanged among neighbors on the GoCast and Plumtree overlays for dependable data delivery.
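To make these neighbor budgets concrete, the snippet below computes the per-node neighbor counts implied by the configurations above, using the default c_x = 8 and n = 10,000; it is purely illustrative arithmetic.

import math

def neighbor_counts(c_x=8, n=10_000):
    """Per-node neighbor budgets implied by the configurations listed above."""
    return {
        "CAM-Chord": round(c_x * math.log(n, c_x)),   # c_x * log_{c_x} n neighbors
        "ACOM":      (c_x - 1) + 1,                   # c_x - 1 random + 1 ring neighbor
        "GoCast":    (c_x - 1) + 1,                   # c_x - 1 nearby + 1 random neighbor
        "Plumtree":  c_x,                             # c_x random neighbors
    }

print(neighbor_counts())
# With the default c_x = 8 and n = 10,000: CAM-Chord keeps about 35 neighbors,
# while ACOM, GoCast and Plumtree keep 8 each.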
In Chainsaw, a node notifies its neighbors once it has received h messages. Its neighbors then request the messages from it as needed. We simulated Chainsaw with h set to 1 and 5, denoted as Chainsaw-1 and Chainsaw-5, respectively.
In our CRP system, the default weighting factor α was set at 0.3. The default message replication ratio q was set at 0.01.
3.2. Performance Metrics and Results
Five performance metrics are specified below for use
in our P2P network experiments:
M 1. Average delivery hops: the average number of hops
needed for a data message to reach its destination.
M 2. Average delivery time: the average time needed for a
data message to reach its destination.
M 3. Message replication ratio: Let m be the total
number of messages used to deliver a data item and
n be the number of nodes. The message replication
ratio is defined by the ratio [m-(n-1)]/m.
M 4. Control overhead: the number of control messages
used per node.
M 5. Overlay reliability: the fraction of nodes in the largest connected subgraph of the overlay network [8].
A. Relative Performance
We compare the performances of four tree-based
designs and one mesh-based design. The TS-topology
was taken as the default IP network topology. The results
are plotted in Fig. 1 and shown in Table 1. Each data point
represents the average value of 5 trials. The notation "CRP:q" denotes the CRP system with a message replication ratio q.

Fig. 1. Relative performance of the CRP overlay network against the CAM-Chord, GoCast, Plumtree, and ACOM overlay designs: (a) delivery hops, (b) delivery time, (c) message replication ratio, (d) control overhead.
(a) As shown in Fig.1 (a), the CAM-Chord requires the
least hop count and the GoCast and Plumtree designs
have the highest hop counts. The CRP achieves the second lowest hop count, lower than all the remaining overlay designs. In all designs, the hop count grows slowly
with the network size.
(b) In Fig.1 (b), we see the CRP outperforms all other
designs in terms of delivery time. The delivery time
depends on the hop count and link latency. The link
latency is reduced by using network proximity
information in the CRP and GoCast schemes. The
CAM-Chord and Plumtree designs ignore this
information. Although the ACOM design uses
network proximity information in the last several hops,
the initial delivery process amongst random
neighboring nodes prolongs the overall time
remarkably.
(c) In our CRP, increasing the replication ratio q from 0 to 0.01 significantly reduces the delivery hop count and delivery time. In Fig.1 (a, b), we see the benefit of scoped flooding at the first few hops.
(d) In Fig.1 (c), we see the effect of using no redundant
messages in the CAM-Chord, Plumtree and GoCast.
The ACOM applies a message replication ratio as high
as 0.18. The replication ratio in the CRP can be
controlled by adjusting q in an application.
(e) In Fig.1 (d), we see that the CRP and ACOM designs require the fewest control messages. This is because both of them maintain flexible overlays that do not enforce strict topologies. The CAM-Chord requires 3 times more overhead than our CRP because of the high maintenance cost of the DHT-based overlay. The considerable overhead in the GoCast and Plumtree designs is due to the periodic gossips exchanged every 0.5 second.
(f) As the number of nodes grows beyond 10,000, the CRP:0.01 design reduces the average delivery time by 28% to 50% (Fig.1 (b)) compared with the other four tree designs. In addition, the CRP requires only 1/2 and 1/3 of the control messages used in Plumtree and CAM-Chord, respectively (Fig.1 (d)).

The performance results of Chainsaw are shown in Table 1. The network size is fixed at 10,000. 1,000 nodes are randomly chosen as data sources to send data items simultaneously, one data item per node. The overhead refers to the notification messages sent to neighbors by nodes once they have received h new messages, where h is 2 or 5 in our simulations.

TABLE 1. Performance comparison between CRP and Chainsaw.

Two findings are notable from Table 1. First, the CRP requires less delivery time. This is because nodes push data items without waiting. Second, the mesh-based Chainsaw trades off delay against overhead.
B. Performance under Massive Failures
We have conducted a set of experiments (Fig. 2) to evaluate the static resilience [4] of the CRP under node failure rates varying from 10% to 80%. Any failure exceeding 30% of the nodes is considered massive, as P2P networks often have tens of thousands of active nodes at the same time. The faulty or failed nodes stop forwarding data messages. The nodes in our dissemination tree detect the failures of their parent nodes by periodic heartbeat checking. The failing conditions are quickly repaired with dynamic reconfiguration.

Fig.2 shows the performance of the CRP-enabled overlay networks under various node failure rates. Fig.2 (a) depicts the impact of massive node failures on the data delivery time and on the number of unreached live nodes. The average delivery time grows linearly when the failure rate is below 50%. The performance degrades much more rapidly when the failure rate exceeds 50%. All nodes are reachable when the failure rate is below 40%. However, there are as many as 125 unreached nodes at a rate of 60%.

Fig.2 (b) shows the average time for a node to repair its faulty parent in the delivery tree. The time grows with the failure rate since it becomes harder to locate a live parent on the delivery tree when many nodes fail simultaneously. However, even with a failure rate as high as 80%, only two rounds of repair are needed.
In particular, at the failure rate of 40%, all live nodes are reached by the data messages from any source. Thus, the CRP guarantees atomic data dissemination (defined in [5]) even when 40% of the nodes fail simultaneously. With this high failure rate, the delivery time is prolonged by about 110 ms over the 270 ms needed under no failures. The key point is that the CRP design is robust to a large number of simultaneous node failures.

Fig. 2. Performance of the CRP under node failures: (a) average delivery time as a function of the failure rate; (b) time to repair the delivery tree as a function of the failure rate.
C. Effects of Network Topologies
In Fig.3, we evaluate the relative performance of various P2P networks in terms of average delivery times for
two different topologies. Fig.3 (a) and Fig.3 (b) show the
results for the TS-topology and King-topology, respectively.
We plot the fraction of nodes reached as time increases. As shown in Fig.3 (a, b), the CRP design takes much less time to reach most of the nodes than all four other overlay designs under both the TS and King topologies. With the King-topology, the curve for ACOM has a flatter tail. This is because the link latencies in the King topology are randomly distributed, while the TS topology has evenly distributed link latencies.
The network proximity-aware designs, including the CRP, GoCast and ACOM, perform better in the King-topology. This stems from the fact that the distribution of physical link latencies in the King-topology is more skewed, which facilitates proximity awareness. Our CRP design reduces the delivery time by 33% to 50% compared with the others in the King-topology.
Fig. 3. Average delivery times of five overlay networks for (a) the TS-topology and (b) the King-topology.

3.3. Effects of Parameter Variations
Now, we consider the effects of adjusting three key system parameters on the CRP design.

A. Impact of the Weighting Factor α
In the CRP overlay, the parameter α balances network proximity against node capacity proximity. A lower value of α puts more emphasis on network proximity, while a higher value emphasizes capacity proximity. Fig.4 shows that when α is less than 0.3, the average delivery time is reduced due to a sharp decrease in the hop count. When α goes beyond 0.6, the hop count grows since the effect of network proximity awareness becomes marginal. We select α = 0.3 to yield a compromise between the performance attributes.

Fig. 4. Effect of the weighting parameter α on the average delivery time and average hop count.

B. Effect of Messaging Overhead
We consider the variations of the overlay link latency and the messaging overhead over time. In our simulation experiments, from the 300th second, 500 nodes leave; after that, another 500 nodes join. The departure and join intervals follow an exponential distribution with a mean value of 100 ms, assuming a Poisson process. Fig.5 reports the results on the average link latency and the average number of control messages used per node. The tree construction messages are included when estimating the control overhead.

Fig. 5. Average link latency and control overhead.

After the 120th second, the message overhead per node grows linearly due to the tree construction messages. The join and departure operations have very limited impact on the average overlay link latency, although they slightly increase the control messages. This further demonstrates the responsiveness of the CRP to node churn.
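A minimal sketch of the churn workload described above (500 departures followed by 500 joins, exponentially distributed gaps with a 100 ms mean); the event-tuple format is an assumption of this example and not the interface of our simulator.

import random

def churn_events(start_s=300.0, departures=500, joins=500, mean_gap_s=0.1, seed=1):
    """Generate (time, action) churn events: departures first, then joins,
    with exponentially distributed gaps (Poisson process, 100 ms mean)."""
    rng = random.Random(seed)
    t, events = start_s, []
    for action, count in (("leave", departures), ("join", joins)):
        for _ in range(count):
            t += rng.expovariate(1.0 / mean_gap_s)   # exponential gap, mean 100 ms
            events.append((t, action))
    return events

events = churn_events()
print(len(events), round(events[-1][0] - events[0][0], 1))
# 1,000 churn events spread over roughly 100 seconds of simulated time.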
C. Overlay Reliability Analysis
We analyze below the reliability of the CRP overlay.
The successor list size on each node directly determines
the overlay reliability. We show in Fig.6 (a) the distribution of the successor list size after 3 runs in a 10,000-node
system. The mean size is very close to the desired value
(i.e., 9.21), which is on the order of log n. Moreover, the
distribution of the successor list size concentrates around
the mean value.
We compare the reliabilities of the CRP, Plumtree and
GoCast overlays. Fig.6 (b) plots the measured overlay
reliability against network size. Both the CRP and
Plumtree overlays have almost perfect fault resilience
close to 100%. The GoCast overlay drops to 94% resilience
ratio when 50% of the nodes fail. The high overlay reliability greatly facilitates the churn-resilience in massive
data dissemination.
Fig. 6. (a) Distribution of the successor list size and (b) measured overlay reliability of three overlay networks.
REFERENCES FOR THIS FILE
[1] A. Binzenhofer, D. Staehle, and R. Henjes, "On the Stability of Chord-based P2P Systems," Proc. IEEE Globecom, St. Louis, Dec. 2005.
[2] S. Chen, B. Shi, S. Chen, and Y. Xia, "ACOM: Capacity-constrained Overlay Multicast in Non-DHT P2P Networks," IEEE Trans. Parallel and Distributed Systems, pp. 1188-1201, Sept. 2007.
[3] F. Dabek, J. Li, E. Sit, J. Robertson, M. F. Kaashoek, and R. Morris, "Designing a DHT for Low Latency and High Throughput," Proc. USENIX Symp. Networked Systems Design and Implementation, San Francisco, March 2004.
[4] K. Gummadi, R. Gummadi, S. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica, "The Impact of DHT Routing Geometry on Resilience and Proximity," Proc. ACM SIGCOMM, Karlsruhe, Germany, 2003.
[5] A.-M. Kermarrec, L. Massoulie, and A. J. Ganesh, "Probabilistic Reliable Dissemination in Large-scale Systems," IEEE Trans. Parallel and Distributed Systems, pp. 248-258, March 2003.
[6] J. Leitao, J. Pereira, and L. Rodrigues, "Epidemic Broadcast Trees," Proc. 26th IEEE Int'l Symp. Reliable Distributed Systems, Beijing, China, Oct. 2007.
[7] V. Pai, K. Kumar, K. Tamilmani, V. Sambamurthy, and A. Mohr, "Chainsaw: Eliminating Trees from Overlay Multicast," Proc. Int'l Workshop on P2P Systems, NY, USA, Feb. 2005.
[8] C. Tang, R. N. Chang, and C. Ward, "GoCast: Gossip-enhanced Overlay Multicast for Fast and Dependable Group Communication," Proc. Int'l Conf. on Dependable Systems and Networks, Japan, June 2005.
[9] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, "How to Model an Internetwork," Proc. IEEE INFOCOM, CA, March 1996.
[10] Z. Zhang, S. Chen, Y. Ling, and R. Chow, "Capacity-Aware Multicast Algorithms in Heterogeneous Overlay Networks," IEEE Trans. Parallel and Distributed Systems, pp. 135-147, Feb. 2006.