Supplementary File for "Churn-Resilient Protocol for Massive Data Dissemination in P2P Networks"

Zhenyu Li, Member, IEEE, Gaogang Xie, Member, IEEE, Kai Hwang, Fellow, IEEE, and Zhongcheng Li, Member, IEEE

This file supplements the manuscript "Churn-Resilient Protocol for Massive Data Dissemination in P2P Networks", submitted to IEEE Transactions on Parallel and Distributed Systems. It contains the pseudo-code of two algorithms, the proofs for the theoretical analysis, and the detailed simulation settings and results. The notation used in this file is defined in Table 1 of the main paper.

1 PSEUDO-CODE FOR ALGORITHMS

The pseudo-code of the base ring join procedure for the (n+1)-th node x is given in Algorithm S-1, where n > 0. SL(x) denotes the successor list of node x.

Algorithm S-1: Base Ring Overlay Construction
Input: The base ring of n existing nodes (n > 0), a new node x to be inserted, and the successor lists of all existing nodes along the ring.
Output: Updated base ring after adding the new node x.
Procedure:
1: x randomly chooses an existing node z
2: x joins the base ring as the successor of z
3: SL(x) ← SL(z)
4: z adds x into SL(z) at the first position
5: counter ← 2m_x − 1
6: u ← prd_z
7: loop
8: if counter > 0 then
9:   counter ← counter − 1
10:  if u == x then break loop
11:  if suc_x ∈ SL(u) then
12:    u deletes the last (i.e., the m_u-th) entry of SL(u)
13:    u adds x into SL(u) at the position before suc_x
14:  end if
15:  u ← prd_u
16:  goto loop

The pseudo-code for the parent switch process of a node x to repair the tree structure is listed in Algorithm S-2. Rchd(x) is the set of x's chord neighbors.

Algorithm S-2: Recovery of the Data Delivery Tree
Input: A node x whose parent fails, the existing overlay and its delivery tree.
Output: Recovered data delivery tree.
Procedure:
1: z ← nil
2: z ← a node in Rchd(x) that is not a tree descendant of x
3: if z is not nil then
4:   x takes z as its new parent node
5:   x's descendants update all ancestors
6: else u ← x
7: loop: // check the successor to act as the parent //
8:   u takes suc_u as a new parent
9:   u's descendants update all ancestors
10:  if suc_u is not a tree descendant of x then
11:    break loop
12:  else suc_u disconnects the tree link to its parent
13:    u ← suc_u
14:    goto loop

2 PROOFS FOR THEORETICAL ANALYSIS

Theorem 1. The average size of the node successor list converges to O(log n), where n is the network size.

Proof. We prove this theorem by mathematical induction. Let E[M_n] be the average size of the successor list for n > 0. When n = 1, the successor list is of size 1. Suppose that with an n-node base ring, E[M_n] = log n. The addition of the (n+1)-th node increases the size of its predecessor's list by 1. Since the new node selects any existing node as its predecessor with probability 1/n, we have

E[M_{n+1}] = log n + 1/n ≈ log n + log((n+1)/n) = log(n+1).

This completes the proof.
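As an informal illustration of Theorem 1, the following Python sketch (ours, not the authors' simulator) reproduces only the list-size dynamics of Algorithm S-1: each new node copies the successor list of a uniformly chosen node z (line 3) while z's own list gains one entry (line 4); lines 11-14 delete one entry and insert another, so they leave all list sizes unchanged and are omitted.

```python
import math
import random

def average_successor_list_size(n, trials=10, seed=1):
    """Simulate only the list-size dynamics of Algorithm S-1.

    Each new node copies the successor list of a uniformly chosen
    existing node z (line 3), and z's own list gains one entry
    (line 4). Lines 11-14 delete one entry and insert another, so
    they do not change any list size and are omitted here.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        sizes = [1]                    # the first node's list holds one entry
        for _ in range(1, n):
            z = rng.randrange(len(sizes))
            sizes.append(sizes[z])     # SL(x) <- SL(z)
            sizes[z] += 1              # z adds x into SL(z)
        averages.append(sum(sizes) / len(sizes))
    return sum(averages) / trials

if __name__ == "__main__":
    for n in (1_000, 10_000):
        avg = average_successor_list_size(n)
        print(f"n = {n:6d}  average list size = {avg:5.2f}  ln n = {math.log(n):5.2f}")
```

For n = 10,000 the reported average lands at ln(10,000) plus a small constant, the same order as the mean list size of 9.21 later observed in Fig. 6(a).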
Theorem 2. If each node fails independently with probability e^{-a} (a > 1), then the ring structure is connected with probability higher than 1 − n^{1−a}, where e is the base of the natural logarithm.

Proof. Let p denote the probability that a node fails, m denote the successor list size, and P_f(n, m, p) denote the probability that the ring is disconnected. Let P_r(n, m, p) denote the probability that a line of n nodes, each keeping m successors, is disconnected. Then, following the analysis results in [1], we have

P_f(n, m, p) ≤ P_r(n + m − 1, m, p),

where P_r(n, m, p) = 0 if n < m, P_r(n, m, p) = p^m if n = m, and, if n > m,

P_r(n, m, p) = P_r(n − 1, m, p) + (1 − p) p^m (1 − P_r(n − m − 1, m, p))
             = P_r(m, m, p) + (1 − p) p^m Σ_{i=m}^{n−1} (1 − P_r(i − m, m, p))
             ≤ p^m + p^m (n − m).

Thus, P_f(n, m, p) ≤ p^m + p^m (n − 1) = n p^m. If m = O(log n) and p = e^{-a}, we have P_f(n, log n, e^{-a}) ≤ n^{1−a}. Therefore, the ring structure is connected with probability higher than 1 − n^{1−a}. As n grows, this probability converges to 1.

Lemma 1. Suppose s random nodes on the tree already hold a data item. These nodes broadcast the item on the tree independently and simultaneously. Then, on average, n + s − 2 messages are generated to ensure that all the nodes receive the item.

Proof. Let g_i denote the number of tree neighbors that a node i has. Then, except for the root node, node i has g_i − 1 children. Let b = n − 1 denote the number of tree links. Because a node does not transmit the item back to the node from which it received it, each of the n − s non-source nodes forwards at most g_i − 1 copies of the item to its tree neighbors, while each of the s source nodes forwards at most g_i copies. Thus, the total number of copies used is

G = Σ_{i=1}^{s} g_i + Σ_{i=s+1}^{n} (g_i − 1) = 2b − (n − s) = n + s − 2.

Theorem 3. If a data item is flooded for t hops on the overlay, and then broadcast on the tree structure, the average number of messages used is bounded by n + 2d(d − 1)^{t−1}.

Proof. Let V_i (0 ≤ i ≤ t) denote the set of unique nodes reached within i hops, K_i denote the cardinality of V_i, and k_i (1 ≤ i ≤ t) denote the number of nodes that are reached at the i-th hop and not included in V_{i−1}. Let Q_1 be the average number of messages used by flooding. Before the item is broadcast on the tree, the number of nodes aware of the data item is K_t. The nodes in V_{t−1} do not forward the item any more, or they are shielded by the nodes reached at the t-th hop. From Lemma 1, the number of messages Q_2 used by broadcasting on the tree is

Q_2 = (n − K_{t−1}) + k_t − 2 ≤ n − Q_1 + 2k_t ≤ n − Q_1 + 2d(d − 1)^{t−1},

where the last step uses the fact that at most d(d − 1)^{t−1} nodes can be newly reached at the t-th hop. Thus, the total number of messages Q is

Q = Q_1 + Q_2 ≤ n + 2d(d − 1)^{t−1}.
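The bounds in Lemma 1 and Theorem 3 are easy to evaluate numerically. The short Python sketch below (ours; the parameter values are chosen only for illustration) computes the Lemma 1 message count n + s − 2 and the Theorem 3 bound n + 2d(d − 1)^{t−1} for a given overlay degree d and flooding scope t.

```python
def tree_broadcast_messages(n, s):
    """Lemma 1: s simultaneous sources broadcasting on a tree of n
    nodes generate n + s - 2 message copies (2(n-1) link traversals,
    minus the n - s non-sources that do not send back to their parent)."""
    return n + s - 2

def flood_then_tree_bound(n, d, t):
    """Theorem 3: flooding for t hops on an overlay where each node
    has at most d neighbours, followed by a tree broadcast, uses at
    most n + 2*d*(d-1)**(t-1) messages."""
    return n + 2 * d * (d - 1) ** (t - 1)

if __name__ == "__main__":
    # Illustrative values only: a 10,000-node overlay, degree 8, 2-hop scope.
    print(tree_broadcast_messages(10_000, s=57))    # 10055
    print(flood_then_tree_bound(10_000, d=8, t=2))  # 10112
```

With d = 8 and t = 2, the 2d(d − 1)^{t−1} = 112 redundant messages on a 10,000-node overlay amount to a replication ratio of roughly 1%, of the same order as the default q = 0.01 used in the experiments of Section 3; this is only a rough consistency check on our part, not a claim from the main paper.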
Theorem 4. If each node on the base ring of the CRP-enabled overlay keeps a correct successor, the delivery tree can be recovered from any node failure.

Proof. Suppose the faulty node has r (r ≥ 1) child nodes. The tree is broken into r + 1 sub-trees: r sub-trees rooted at its r child nodes and 1 sub-tree rooted at the original root. We denote the i-th (1 ≤ i ≤ r) child node as x_i and the sub-tree rooted at x_i as T_i. Each child node actively locates its new parent node according to Algorithm S-2.

For x_i (1 ≤ i ≤ r), if its successor resides in another sub-tree T_j (j ≠ i), then x_i can take it as its new parent. Thus, T_i and T_j form a new sub-tree, reducing the number of sub-trees by 1.

Otherwise, x_i connects to its successor suc_{x_i} as the new parent. The successor suc_{x_i} disconnects from its original parent and becomes the new root of T_i. The node suc_{x_i} then performs the same operation as x_i does, and the procedure continues. Since each node maintains the correct successor on the ring, there must be at least one node in T_i whose successor is in another sub-tree T_j (j ≠ i). This node takes its successor as its parent. Thus, T_i and T_j form a new sub-tree, reducing the number of sub-trees by 1.

Since each child node actively locates its new parent node, all the r + 1 sub-trees will connect again into a new delivery tree.

The following corollary is directly derived from Theorems 2 and 4.

Corollary 3. The delivery tree can be recovered from node failures, even when the failure rate of every single node is as high as e^{-a} (a > 1).

3 DETAILED SIMULATION EXPERIMENTS AND PERFORMANCE RESULTS

In this section, we report detailed experimental results on the performance of CRP in a simulated P2P network environment. While some results have been briefly reported in the main paper, we provide detailed explanations in this supplementary file. We compare CRP with four tree-based designs, CAM-Chord [10], ACOM [2], GoCast [8], and Plumtree [6], and with one mesh-based design, Chainsaw [7].

3.1 Simulation Experimental Settings

We developed an event-driven P2P simulator that simulates P2P networks of up to 20,000 nodes. The simulator runs on a Linux SMP server with 8 processors. The timer starts to run once the nodes have joined the system. From the 300th second, 1,000 nodes are randomly chosen as data sources to send data items simultaneously.

We apply two physical topologies (i.e., IP topologies) to simulate IP networks.

TS-topology: The topology is generated by GT-ITM [9]. It has about 2,500 routers: 4 transit domains, each with 4 transit routers; 5 stub domains attached to each transit router; and 30 stub routers in each stub domain on average. Nodes are attached to randomly chosen stub routers. The latency of the link between a node and the router to which it is attached is 1 ms; the latency of the link between two transit routers is 2 ms; the latency of the link between a transit router and a stub router attached to it is 10 ms; and the latency of the link between two stub routers is 50 ms.

King-topology: This is a real topology from the King data set [3]. It consists of 1,740 DNS servers in the Internet. The latencies come from real measurements of the RTTs between the DNS servers; the RTTs are divided by two to obtain one-way latencies. The average link latency is about 91 ms. Multiple nodes are simulated at a single DNS server when the number of member nodes exceeds 1,740.

Node capacity follows a bounded Pareto distribution. The shape parameter is set to 2 and the scale parameter to c/2, where c is the average node capacity. Samples outside the range [c/2, 8c] are discarded. The average node capacity is set to 8 by default.

We implement CAM-Chord with c_x · log_{c_x} n neighbors per node x on the Chord overlay. In ACOM, a node x has c_x − 1 randomly selected neighbors and 1 ring neighbor. In GoCast, a typical node x keeps c_x − 1 physically close nodes and 1 randomly selected node as neighbors. In Plumtree, a node x has c_x randomly selected neighbors. Gossip messages are exchanged periodically (every 0.5 second) amongst neighbors on the GoCast and Plumtree overlays for dependable data delivery. In Chainsaw, a node notifies its neighbors once it has received h messages; its neighbors then request the messages from it if needed. We simulated Chainsaw with h set to 1 and 5, denoted Chainsaw-1 and Chainsaw-5, respectively.

In our CRP system, the default weighting factor was set to 0.3 and the default message replication ratio q was set to 0.01.
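The capacity model above can be reproduced with a simple rejection sampler. The Python sketch below (ours) draws Pareto samples with shape 2 and scale c/2 and discards values outside [c/2, 8c]; the resulting mean is close to the configured average capacity of 8 (truncation lowers it slightly).

```python
import random

def bounded_pareto_capacity(c=8.0, shape=2.0, rng=random):
    """Draw one node capacity: a Pareto(shape) sample scaled by c/2,
    rejected if it falls outside [c/2, 8c]. The lower bound holds by
    construction, so only the upper bound is checked."""
    scale = c / 2.0
    while True:
        sample = scale * rng.paretovariate(shape)
        if sample <= 8 * c:
            return sample

if __name__ == "__main__":
    random.seed(7)
    capacities = [bounded_pareto_capacity() for _ in range(100_000)]
    print("mean capacity =", round(sum(capacities) / len(capacities), 2))
```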
3.2 Performance Metrics and Results

Five performance metrics are specified below for use in our P2P network experiments:

M1. Average delivery hops: the average number of hops needed for a data message to reach its destination.
M2. Average delivery time: the average time needed for a data message to reach its destination.
M3. Message replication ratio: let m be the total number of messages used to deliver a data item and n be the number of nodes; the message replication ratio is defined as [m − (n − 1)]/m.
M4. Control overhead: the number of control messages used per node.
M5. Overlay reliability: the fraction of nodes in the largest connected subgraph of the overlay network [8].
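Metrics M3 and M5 are the only two that are not direct averages, and both are easy to compute from simulation output. The Python sketch below (ours; the overlay is represented as a plain adjacency dictionary) evaluates the replication ratio [m − (n − 1)]/m and the fraction of nodes in the largest connected component.

```python
from collections import deque

def replication_ratio(total_messages, n):
    """Metric M3: [m - (n - 1)] / m, the fraction of redundant copies."""
    return (total_messages - (n - 1)) / total_messages

def overlay_reliability(adjacency):
    """Metric M5: fraction of nodes in the largest connected subgraph.
    `adjacency` maps each live node to an iterable of its live neighbours."""
    unvisited = set(adjacency)
    largest = 0
    while unvisited:
        start = unvisited.pop()
        queue, size = deque([start]), 1
        while queue:
            for v in adjacency[queue.popleft()]:
                if v in unvisited:
                    unvisited.remove(v)
                    queue.append(v)
                    size += 1
        largest = max(largest, size)
    return largest / len(adjacency)

if __name__ == "__main__":
    # A 5-node overlay split into a 4-node component and an isolated node.
    g = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: []}
    print(replication_ratio(total_messages=1180, n=1000))  # about 0.153
    print(overlay_reliability(g))                          # 0.8
```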
Fig. 1. Relative performance of the CRP overlay network against the CAM-Chord, GoCast, Plumtree, and ACOM overlay designs: (a) delivery hops, (b) delivery time, (c) message replication ratio, (d) control overhead.

A. Relative Performance

We compare the performances of the four tree-based designs and the mesh-based design. The TS-topology is taken as the default IP network topology. The results are plotted in Fig. 1 and shown in Table 1. Each data point represents the average value of 5 trials. The notation "CRP: q" denotes the CRP system with a message replication ratio q.

(a) As shown in Fig. 1(a), CAM-Chord requires the fewest hops, while GoCast and Plumtree have the highest hop counts. CRP has the second-lowest hop count, lower than all the remaining overlay designs. In all designs, the hop count grows slowly with the network size.

(b) In Fig. 1(b), we see that CRP outperforms all the other designs in terms of delivery time. The delivery time depends on the hop count and the link latency. The link latency is reduced by using network proximity information in the CRP and GoCast schemes, whereas the CAM-Chord and Plumtree designs ignore this information. Although the ACOM design uses network proximity information in the last several hops, the initial delivery process amongst random neighboring nodes prolongs the overall time remarkably.

(c) In our CRP, increasing the replication ratio q from 0 to 0.01 significantly reduces the delivery hops and the delivery time. In Fig. 1(a, b), we see the benefit of scoped flooding at the first few hops.

(d) In Fig. 1(c), we see the effect of using no redundant messages in CAM-Chord, Plumtree and GoCast. ACOM incurs a message replication ratio as high as 0.18. The replication ratio in CRP can be controlled by adjusting q in an application.

(e) In Fig. 1(d), we see that the CRP and ACOM designs require the fewest control messages. This is because both of them maintain flexible overlays that do not impose strict topologies. CAM-Chord requires 3 times more overhead than our CRP because of the high maintenance cost of DHT-based overlays. The considerable overhead of the GoCast and Plumtree designs is due to the periodic gossip every 0.5 second.

(f) As the number of nodes grows beyond 10,000, the CRP:0.01 design reduces the average delivery time by 28%~50% (Fig. 1(b)) compared with the other four tree designs. In addition, CRP requires only 1/2 and 1/3 of the control messages used in Plumtree and CAM-Chord, respectively (Fig. 1(d)).

The performance results of Chainsaw are shown in Table 1. The network size is fixed at 10,000. 1,000 nodes are randomly chosen as data sources to send data items simultaneously, one data item per node. The overhead refers to the notification messages sent to neighbors by a node once it has received h new messages, where h is 1 or 5 in our simulations.

TABLE 1. Performance comparison between CRP and Chainsaw.

Two findings are notable from Table 1. First, CRP requires less delivery time, because nodes push data items without waiting. Second, the mesh-based Chainsaw makes a compromise between delay and overhead.

B. Performance under Massive Failures

We have conducted a set of experiments, reported in Fig. 2, to evaluate the static resilience [4] of CRP under node failure rates varying from 10% to 80%. Any failure exceeding 30% of the nodes is considered massive, as P2P networks often have tens of thousands of active nodes at the same time. The faulty or failed nodes stop forwarding data messages. The nodes in our dissemination tree detect the failures of their parent nodes by periodic heartbeat checking. The failing conditions are quickly repaired with dynamic reconfiguration.

Fig. 2 shows the performance of the CRP-enabled overlay networks under various node failure rates. Fig. 2(a) depicts the impact of massive node failures on the data delivery time and on the number of unreached live nodes. The average delivery time grows linearly when the failure rate is below 50%; the performance degrades much more rapidly when the failure rate exceeds 50%. All nodes are reachable when the failure rate is below 40%. However, there are as many as 125 unreached nodes at a failure rate of 60%.

Fig. 2(b) shows the average time for a node to repair its faulty parent in the delivery tree. The time grows with the failure rate since it becomes harder to locate a live parent on the delivery tree when many nodes fail simultaneously. However, even with a failure rate as high as 80%, only 2 rounds of the repairing service are demanded.

In particular, at the failure rate of 40%, all live nodes are reached by the data messages from any source. Thus, the CRP guarantees atomic data dissemination (defined in [5]) even when 40% of the nodes fail simultaneously. With this high failure rate, the delivery time is prolonged by about 110 ms over the 270 ms needed under no failures. The key point is that the CRP design is robust to a large number of simultaneous node failures.

Fig. 2. Performance of the CRP under node failures: (a) average delivery time as a function of failure rate, (b) time to repair as a function of failure rate.
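To make concrete what static resilience measures, the following Python sketch (ours) builds a generic random delivery tree, fails each non-root node with a given probability, applies no repair at all, and counts the live nodes that lose their path to the root, which is the quantity reported as unreached live nodes in Fig. 2(a). It illustrates the metric only; it does not model the CRP tree or Algorithm S-2, whose repair is precisely what keeps this count at zero for failure rates below 40% in our experiments.

```python
import random

def unreached_live_nodes(n, failure_rate, rng):
    """Build a random tree on n nodes (node 0 is the root), fail each
    non-root node independently with `failure_rate`, and count the live
    nodes whose path to the root crosses a failed node (no repair)."""
    parent = [None] + [rng.randrange(i) for i in range(1, n)]
    failed = [False] + [rng.random() < failure_rate for _ in range(1, n)]
    reachable = [False] * n
    reachable[0] = True
    for v in range(1, n):          # parents have smaller ids, so one pass suffices
        reachable[v] = (not failed[v]) and reachable[parent[v]]
    return sum(1 for v in range(1, n) if not failed[v] and not reachable[v])

if __name__ == "__main__":
    rng = random.Random(3)
    for rate in (0.2, 0.4, 0.6, 0.8):
        print(rate, unreached_live_nodes(10_000, rate, rng))
```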
C. Effects of Network Topologies

In Fig. 3, we evaluate the relative performance of the various P2P networks in terms of average delivery time for two different topologies. Fig. 3(a) and Fig. 3(b) show the results for the TS-topology and the King-topology, respectively. We plot the fraction of nodes reached as the time increases. As shown in Fig. 3(a, b), the CRP design takes much less time to reach most of the nodes than all 4 other overlay designs under both the TS and King topologies.

With the King-topology, the curve for ACOM has a flatter tail. This is due to the fact that the link latencies in the King topology are randomly distributed, while the TS topology has evenly distributed link latencies. The network proximity-aware designs, including CRP, GoCast and ACOM, perform better in the King-topology. This stems from the fact that the distribution of physical link latencies in the King-topology is more skewed, which facilitates proximity awareness. Our CRP design reduces the delivery time by 33% to 50% compared with the others in the King-topology.

Fig. 3. Average delivery times of 5 overlay networks: (a) TS-topology, (b) King-topology.

3.3 Effects of Parameter Variations

Now we consider the effects of adjusting three key system parameters on the CRP design.

A. Impact of the Weighting Factor

In the CRP overlay, the weighting factor reflects the trade-off between network proximity and node capacity proximity. A lower value emphasizes network proximity, while a higher value emphasizes capacity proximity. Fig. 4 shows that, while the weighting factor stays below 0.3, the average delivery time is reduced due to a sharp decrease in the hop count. When it goes beyond 0.6, the hop count grows since the effect of network proximity awareness becomes marginal. We select 0.3 to yield a compromise between the performance attributes.

Fig. 4. Effect of the weighting parameter on the average delivery time and average hop count.

B. Effect of Messaging Overhead

We consider the variations of the overlay link latency and the messaging overhead with respect to time. In our simulation experiments, from the 300th second, 500 nodes leave; after that, another 500 nodes join. The departure and join intervals follow an exponential distribution with a mean value of 100 ms, assuming a Poisson process. Fig. 5 reports the results on the average link latency and the average number of control messages used per node. The tree construction messages are included when estimating the control overhead.

After the 120th second, the message overhead per node grows linearly due to the tree construction messages. The join and departure operations have very limited impact on the average overlay link latency, although they slightly increase the control messages. This further demonstrates the responsiveness of the CRP to node churn.

Fig. 5. Average link latency and control overhead.
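The churn pattern used in this experiment is straightforward to reproduce: starting at the 300th second, 500 departures and then 500 joins are scheduled with exponentially distributed inter-event gaps of mean 100 ms (a Poisson process). The Python sketch below (ours) generates such an event timeline; only the structure is taken from the description above, the function and parameter names are our own.

```python
import random

def churn_timeline(start=300.0, events_per_phase=500, mean_gap=0.1, seed=11):
    """Schedule 500 'leave' events followed by 500 'join' events, with
    exponential inter-event gaps of mean 100 ms (Poisson churn)."""
    rng = random.Random(seed)
    t, timeline = start, []
    for kind in ("leave", "join"):
        for _ in range(events_per_phase):
            t += rng.expovariate(1.0 / mean_gap)   # mean gap = 0.1 s
            timeline.append((t, kind))
    return timeline

if __name__ == "__main__":
    events = churn_timeline()
    print(len(events), "events between", round(events[0][0], 2), "s and",
          round(events[-1][0], 2), "s")
```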
Stoica, "The impact of DHT routing geometry on resilience and proximity," Proc. ACM SIGCOMM, Karlsruhe, Germany, 2003. A.-M. Kermarrec, L. Massoulie, and A.J. Ganesh, “Probabilistic Reliable Dissemination in Large-scale Systems,” IEEE Trans. Parallel and Distributed Systems, pp.248-258, March 2003. J. Leitao, J. Pereira, and L. Rodrigues, “Epidemic broadcast trees”, Proc. 26th IEEE Int’l Symp. Reliable Distributed Systems, Beijing, China, Oct. 2007. V. Pai, K. Kumar, K. Tamilmani, V. Sambamurthy, and A. Mohr, "Chainsaw: Eliminating Trees from Overlay Multicast," Proc. Int’l Workshop P2P Systems, NY, USA, Feb. 2005. C. Tang, R. N. Chang, C. Ward, “GoCast: Gossip-enhanced Overlay Multicast for Fast and Dependable Group Communication,” Int’l Conf. on Dependable Systems and Networks, Japan, June 2005. E. W. Zegura, K. L. Calvert, and S. Bhattacharjee, “How to Model an Internetwork,” Proc. IEEE INFOCOM, CA. March 1996. [10] Z. Zhang, S. Chen, Y. Ling, and R. Chow, “Capacity-Aware Multicast Algorithms in Heterogeneous Overlay Networks,” IEEE Trans. Parallel and Distributed Systems, pp. 135-147, Feb. 2006.