Stochastic Analysis and Improvement of the Reliability of DHT-based Multicast

Guang Tan and Stephen A. Jarvis
Department of Computer Science, University of Warwick, Coventry, CV4 7AL, United Kingdom
Email: {gtan, saj}@dcs.warwick.ac.uk

Abstract: This paper investigates the reliability of application-level multicast based on a distributed hash table (DHT) in a highly dynamic network. Using a node residual lifetime model, we derive the stationary end-to-end delivery ratio of data streaming between a pair of nodes in the worst case, and show through numerical examples that in a practical DHT network, this ratio can be very low (e.g., less than 50%). Leveraging the property of heavy-tailed lifetime distribution, we then consider three optimizing techniques, namely Senior Member Overlay (SMO), Longer-Lived Neighbor Selection (LNS), and Reliable Route Selection (RRS), and present a quantitative analysis of data delivery reliability under these schemes. In particular, we discuss the tradeoff between delivery ratio and the load imbalance among nodes. Simulation experiments are also used to evaluate the multicast performance under practical settings. Our model and analytic results provide useful tools for reliability analysis of other overlay-based applications (e.g., those involving persistent data transfers).

I. INTRODUCTION

Overlay multicast [12] is an effective paradigm for large-scale data dissemination over the Internet. There are two basic approaches to organizing multicast groups. The first is to make all multicast members self-organize into a group according to some kind of topology (e.g., tree or mesh); the multicast members need to locate upstream nodes and take care of link maintenance [11] [20]. In the second approach [10] [31] [21], the multicast protocol is layered on top of a distributed hash table (DHT) protocol that supports application-layer routing between overlay nodes.
The different routes provided by the DHT from receiving nodes to the source node automatically form a tree topology. Two main advantages of the DHT-based approach are (1) multicast applications can easily exploit the DHT's routing and failure recovery functions to organize the multicast group, obviating the need to handle network dynamics and maintain neighbor sets themselves, and (2) the same DHT-based overlay (e.g., OpenDHT [23]) can be shared by many overlay applications and multicast trees simultaneously. As DHTs have been increasingly deployed and used as a building block of distributed systems, developing multicast based on DHTs can considerably simplify software development and has thus become an appealing scheme for multicast. A number of projects have been or are being undertaken using this technique (e.g., SplitStream [9], RSSDHT [1], MOOD [2], and QStream [23]).

In this paper we study the reliability of DHT-based multicast in the context of low-bit-rate streaming applications, such as text/voice streaming and distributed whiteboards. Reliability is one of the major concerns for any overlay-based multicast protocol. In an overlay network, nodes are highly transient and the data streaming between two end-points can suffer frequent interruptions, which may last on the order of tens of seconds [11]. As a result, the nodes may either receive only a small proportion of the data or have to rely heavily on some kind of error recovery mechanism. This problem is particularly critical for a DHT-based multicast protocol, as the DHT routes often pass through some non-multicast-group nodes, which leads to longer data delivery paths and hence poorer transfer reliability.

Our analysis begins with a stochastic model for node lifetime and data delivery over a set of nodes. Specifically, we consider the distribution of node residual lifetime, which plays a fundamental role in the reliability analysis throughout the paper.
With this result, we then obtain the worst-case stationary data delivery ratio between the source node and a receiving node. The numerical examples show that this ratio can be very low (e.g., less than 50%). Using the fact that a node's lifetime generally follows a heavy-tailed distribution [8] [28] [25] [26], which itself implies that longer-lived nodes are likely to be more stable than short-lived ones, we then consider three optimization schemes. In the first scheme, called Senior Member Overlay (SMO), the nodes above a certain threshold age are organized into a special overlay, which takes responsibility for all the forwarding tasks in data delivery. The second scheme, called Longer-Lived Neighbor Selection (LNS), leverages the flexibility of neighbor selection provided by some DHT algorithms to make every node choose relatively stable nodes as its neighbors, thus improving the stability of the data delivery paths. In the third scheme, called Reliable Route Selection (RRS), the DHT algorithm no longer progresses through the ID space in a greedy manner; instead it chooses stable next-hop nodes from the available options under the constraint that the path length remains unchanged. We examine the data delivery ratio under these schemes and discuss their implications for the load imbalance among nodes. To obtain insight into DHT-based multicast under more realistic settings, we also conduct simulations to evaluate their performance using practical metrics.

For illustration purposes, we use Chord [27] and Scribe [10] as examples of the DHT and multicast group management protocols, respectively. We will also discuss the applicability of certain techniques to a number of representative DHT algorithms, including Pastry [24], CAN [22], Kademlia [16] and de Bruijn [18].
Due to space limitations, we do not discuss other multicast group management protocols, such as Bayeux [31] and the one proposed in [21], or their combinations with the different underlying DHT algorithms, but the analysis can be readily applied to these variations. We confine our study to applications for which the bandwidth constraint of nodes is not a major issue. Handling bandwidth constraints [5] poses major challenges to the formal analysis and complicates the design of optimization schemes. We leave this as a subject of future research.

The paper proceeds as follows: Section II establishes the stochastic model; based on the model, Section III analyzes the data delivery ratio of an overlay path under the plain DHT; Section IV introduces and analyzes the three optimization techniques; Section V presents the simulation results; Section VI documents some related work; and finally Section VII concludes the paper.

II. MODEL AND PERFORMANCE METRICS

We assume a simple multicast application that transfers data in the following manner: in a multicast group, the data is sent from the source indefinitely at a constant data rate, without buffering at any node, and no retransmission or recovery mechanism is present in the system.

It has been widely observed that application endpoints' lifetimes (in terms of node uptime, user session time or file transfer time) often follow a heavy-tailed distribution. Typical applications exhibiting such characteristics include file sharing [8] [28] and multimedia streaming [6] [25] [26] [30]. We therefore use a shifted Pareto distribution [14], $F(x) = 1 - (1 + x/\beta)^{-\alpha}$, $x > 0$, $\beta > 0$, to model nodes' lifetimes $L$. The shape parameter $\alpha$ is assumed to be greater than 2 so that the finite mean $\mu$ and variance $\sigma^2$ exist. We also use the density function $f(x) = (\alpha/\beta)(1 + x/\beta)^{-\alpha-1}$. It is easy to verify that $\mu = \beta/(\alpha - 1)$ and $\sigma^2 = \alpha\beta^2 / [(\alpha - 1)^2(\alpha - 2)]$. Other assumptions are as follows.
Nodes enter the system in a Poisson process (PP) [28] at a constant rate $\lambda$, with node $i$'s arrival time at $\tau_i$; during the evolution of the network, nodes' lifetimes follow the same distribution $F(x)$. The overlay network has a very large number of nodes (e.g., one million) and has entered a steady state. We further assume an $m$-bit ($m \ge 1$) DHT identifier and that $2^m$ is much larger than the number of nodes in the system.

Throughout the paper, we use delivery ratio as the primary metric for the reliability of the multicast. It is defined as the proportion of data units successfully received by a node from the source [4]. Under the assumed transfer mode, the delivery ratio is approximately the fraction of time during which a receiving node finds that all the forwarding nodes between the source and itself function normally. We focus on the worst-case stationary delivery ratio between two nodes, which is achieved when the two nodes have a maximum path length $l$ (e.g., $l$ hops for two nodes at ID distance $2^l - 1$). The quantity $l$ is determined by the two nodes' IDs and the routing algorithm of the DHT. Assuming a fixed maximum path length allows us to obtain an exact lower bound for the stationary delivery ratio between two general nodes.

A second metric in our model is the node stress, defined as the number of children supported by a node. In our model, a key tradeoff is between delivery ratio and the load balancing of nodes. Since the minimum node stress is zero, the maximum node stress (MNS) of all nodes, which reflects the range of individual node stresses, is used as a metric for load balancing. When the number of multicast groups $G > 1$, the MNS of a DHT is the per-group MNS times $G$. Usually $G$ is independent of the DHT and multicast protocols, so we often omit this factor (i.e., assume $G = 1$) when comparing the MNSs of various schemes.
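As a numerical illustration of the shifted Pareto lifetime model introduced at the start of this section, the following Python sketch evaluates the distribution function, mean, and variance, and draws lifetimes by inverting $F$. It is a minimal sketch, assuming the parameterization $F(x) = 1 - (1 + x/\beta)^{-\alpha}$; the function names are hypothetical, and the values $\alpha = 3$, $\beta = 1$ hour are the ones the paper uses in its later numerical examples.

```python
import random


def pareto_cdf(x, alpha, beta):
    """Shifted Pareto CDF F(x) = 1 - (1 + x/beta)^(-alpha), for x >= 0."""
    return 1.0 - (1.0 + x / beta) ** (-alpha)


def pareto_mean(alpha, beta):
    """Mean lifetime mu = beta / (alpha - 1); finite only for alpha > 1."""
    return beta / (alpha - 1.0)


def pareto_variance(alpha, beta):
    """Variance sigma^2 = alpha*beta^2 / ((alpha-1)^2 (alpha-2)); needs alpha > 2."""
    return alpha * beta ** 2 / ((alpha - 1.0) ** 2 * (alpha - 2.0))


def sample_lifetime(alpha, beta, rng=random):
    """Draw one lifetime by inverse-transform sampling: solve F(x) = u for x."""
    u = rng.random()  # uniform on [0, 1)
    return beta * ((1.0 - u) ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    alpha, beta = 3.0, 1.0  # assumed units: hours
    print("mean lifetime (h):", pareto_mean(alpha, beta))
    print("variance (h^2):", pareto_variance(alpha, beta))
    print("one sampled lifetime (h):", sample_lifetime(alpha, beta))
```

Inverse-transform sampling is used here only because $F$ has a closed-form inverse; any other sampler for the same distribution would serve equally well.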
Note that in a practical network, the dynamics of nodes may introduce a substantial number of undesirable non-DHT links among the nodes [5], which complicates the analysis of MNS. We therefore assume that the multicast protocol can automatically correct these non-DHT links (by, for example, periodically re-establishing the application-layer connections according to the DHT routing table), so that a node's stress can be approximated by the number of DHT nodes pointing to it (i.e., its in-degree in the DHT graph).

III. DATA DELIVERY RATIO IN A PLAIN DHT

In this section we derive the worst-case delivery ratio between two nodes in a DHT network. We first examine the mean residual lifetime of a randomly chosen node, and then obtain the stationary delivery ratio using renewal theory.

A. Node Residual Lifetime

Given the current time $t$, we let random variables $X$ and $Y$ denote a (randomly chosen) node's age and residual lifetime, respectively. This is illustrated in Figure 1(a), where the sum of $X$ and $Y$ is equal to $L$. We first examine the properties of $X$, and then consider the joint distribution of $(X, Y)$, whose marginal distribution will characterize our variable of interest $Y$.

As assumed in the previous section, the nodes enter the overlay network as a homogeneous Poisson process (PP) with rate $\lambda$. Let $\{N(s), 0 \le s \le t\}$ be the corresponding counting process [19] formed by the arrival of all the nodes that have ever entered the system, and $\{N'(s), 0 \le s \le t\}$ the counting process formed by the arrival of all nodes that are present in the network at time $t$. In Figure 1(b), the former process is depicted by a sequence of solid points along the top time axis, while the latter process is formed by the sequence of circles on the bottom time axis. Indeed, $\{N'(s), 0 \le s \le t\}$ is a non-homogeneous Poisson process (NPP), as stated by the following lemma. (The proofs of some lemmas and theorems are provided in our technical report [29].)
Lemma 1: The counting process $\{N'(s), 0 \le s \le t\}$ is a non-homogeneous Poisson process (NPP) with rate function $\lambda(s) = \lambda\,(1 - F(t - s))$, $0 \le s \le t$.

Fig. 1. The point processes formed by the arrival of nodes. (a) Age $X$, residual lifetime $Y$, and lifetime $L$ of a node present at time $t$. (b) The arrival process $N(t)$, a PP with rate $\lambda$, and the process $N'(t)$ of nodes present at time $t$, an NPP with rate function $\lambda(\cdot)$.

Next, we apply the generalized Campbell theorem [19] to $\{N'(s)\}$ to capture the characteristics of the arrival times (and thus the ages) of present nodes. This theorem is restated in Lemma 2.

Lemma 2: [19, page 227] Let $\tau_1 \le \tau_2 \le \cdots \le \tau_n$ be the event times in an NPP with rate function $\lambda(s)$ on $[0, t]$; let $U_1, U_2, \ldots, U_n$ be i.i.d. random variables with distribution $\Pr\{U_i \le u\} = \Lambda(u)/\Lambda(t)$, $0 \le u \le t$, where $\Lambda(u) = \int_0^u \lambda(s)\,ds$, and let $U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)}$ be the order statistics of $U_1, U_2, \ldots, U_n$. Then, conditioned on $N'(t) = n$, $(\tau_1, \tau_2, \ldots, \tau_n)$ is equal in distribution to $(U_{(1)}, U_{(2)}, \ldots, U_{(n)})$.

Lemma 2 states that the joint distribution of the arrival times of the present nodes is equivalent in distribution to that of $U_{(1)}, U_{(2)}, \ldots, U_{(n)}$, whose marginal distribution is easy to manipulate. With this lemma, we can proceed to obtain the density function of $X$.

Lemma 3: As $t \to \infty$, the density function of the age of a randomly chosen node is given by

$f_X(x) = \frac{1 - F(x)}{\mu}$. (1)

Proof: Assume that at time $t$ there are $n$ active nodes in the system, whose arrival times are $\tau_1, \tau_2, \ldots, \tau_n$, respectively. Define i.i.d. random variables $U_i$ ($i = 1, 2, \ldots, n$) and their order statistics as in Lemma 2. Let the indicator $\mathbf{1}\{A\}$ be one when event $A$ occurs and zero otherwise. In an ordinary DHT, the selection of a node for general purposes such as finding its neighbors is independent of the node's properties such as arrival time and age, so the arrival time of a selected node can be seen as equiprobable among all $\tau_i$, $i = 1, 2, \ldots, n$. Thus,

$\Pr\{X \le x \mid N'(t) = n\} = E\left[\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t - \tau_i \le x\} \,\Big|\, N'(t) = n\right]$.

Applying Lemma 2 to the NPP $\{N'(s), 0 \le s \le t\}$ and observing that all $U_i$'s are symmetric, we have

$\Pr\{X \le x \mid N'(t) = n\} = \Pr\{t - U_1 \le x\} = 1 - \frac{\Lambda(t - x)}{\Lambda(t)}$, $0 \le x \le t$.

Since the right-hand side does not depend on $n$, and $\lambda(s) = \lambda\,(1 - F(t - s))$ by Lemma 1, it follows that

$\Pr\{X \le x\} = \frac{\int_{t-x}^{t} \lambda(s)\,ds}{\int_{0}^{t} \lambda(s)\,ds} = \frac{\int_{0}^{x} (1 - F(s))\,ds}{\int_{0}^{t} (1 - F(s))\,ds}$,

which gives

$f_X(x) = \frac{1 - F(x)}{\int_{0}^{t} (1 - F(s))\,ds}$.

Letting $t \to \infty$ and noting that $\int_0^\infty (1 - F(s))\,ds = \mu$, we have $f_X(x) = (1 - F(x))/\mu$, which establishes the lemma.

With the above conclusion, we can solve the distribution of $Y$, the node's residual lifetime.

Lemma 4: As $t \to \infty$, the density function of the residual lifetime of a randomly chosen node is given by

$f_Y(y) = \frac{1 - F(y)}{\mu}$. (2)

Moreover,

$E[Y] = \frac{\mu^2 + \sigma^2}{2\mu}$. (3)

Notice that $f_Y(x) = f_X(x)$, which means that a node's residual lifetime is equivalent in distribution to a node's age. Also from Eq. (3), it can be seen that $E[Y] - E[L] = (\sigma^2 - \mu^2)/(2\mu)$, which is positive under the Pareto model (where $\sigma^2/\mu^2 = \alpha/(\alpha - 2) > 1$), indicating that, given the network model, the mean residual lifetime of participating nodes is even greater than the node's mean lifetime. This somewhat counterintuitive result explains the measurement observation in [28]: in a steady-state network, a majority of participating nodes in the system are long-lived nodes, while the remaining short-lived nodes turn over at a high rate. An important implication of this fact is that, while a reliability-ignorant multicast protocol may have a poor delivery ratio due to a few highly transient nodes on the path, the existence of the many long-lived nodes provides the opportunity to achieve a high delivery ratio without incurring significant load imbalance, if the underlying DHT is able to provide a reliability-aware routing service that avoids passing through those unstable nodes.

The above relationship between node lifetime and residual lifetime is particularly stressed in [7], where it is found that randomly selecting a replacement from all existing nodes after some node fails produces surprisingly lower churn than choosing a replacement from a fixed set of nodes. It should also be noted that Lemma 4 arrives at the result that has been used by Leonard et al. [14].
However, the results are obtained by distinct approaches and under different modeling assumptions. In [14], the node arrival/departure is modeled as a renewal process and the residual lifetime distribution is immediately obtained from existing results of renewal theory. Their model relies on an important assumption: that a newly arriving node is equally likely to find a neighbor at any point within that neighbor's lifespan. Unfortunately, it is not clear under what circumstances this assumption holds, or how it can be interpreted in a more plausible way. Also in their simulations, they assume that a leaving node is immediately replaced by a fresh node with the same lifetime distribution, an arrival pattern yet to be justified. In contrast, we assume a Poisson arrival pattern, which has been verified by previous empirical studies [28] [3]. More importantly, our modeling reveals more details of the stochastic properties of a node's age and residual lifetime, which facilitate the analysis of data delivery reliability in more complicated contexts, as will be shown later in the paper.

B. Delivery Ratio

Consider a data delivery path from the source node $n_0$ to some receiving node $n_l$ ($l \ge 1$). We assume that the two nodes have a maximum path length of $l$. Between the two nodes is a sequence of forwarding nodes $n_1, n_2, \ldots, n_{l-1}$. When a forwarding node $n_i$ on the path fails (departs), its child node $n_{i+1}$ will try to find a substitute $n_i'$ for $n_i$ and re-establish a new path $n_i', n_{i-1}', \ldots, n_1', n_0$, where, in most instances, $n_j'$ ($j = 1, 2, \ldots, i$) is a node succeeding $n_j$ on the Chord ring. (It is possible that the length of the new path becomes less than $l$. We again assume the worst case, where the path length remains unchanged.)
We make a minor modification to the Scribe protocol so that the nodes $n_1, n_2, \ldots, n_{i-1}$ need not be changed: when $n_i$ fails, its child $n_{i+1}$ finds a substitute $n_i'$ and requests that $n_i'$ connect directly to its original grandparent $n_{i-1}$, the original path from $n_{i-1}$ to $n_0$ being re-used. Now, the path $n_0, n_1, \ldots, n_l$ will have only one forwarding node replaced when $n_i$ fails. An important consequence of this is that the replacements of forwarding nodes on the path become independent of each other, and so the modeling analysis can be greatly simplified. From the viewpoint of the system, this change obviates the need to destroy the previously established path and thus reduces the communication cost. In addition, this modification is easy to implement: recall that a Chord node already maintains a successor list for each of its neighbors for the purpose of fault tolerance [27].

Now, each forwarding node on the path has two states: a normal state and a failure state. The normal period is equivalent to the residual lifetime $Y$ of a randomly chosen node among the present nodes in the network. The failure period includes failure detection and finding a substitute node, which we assume takes a random time $R$, called the fixing time. We can therefore treat a forwarding node's evolution as an alternating renewal process [19], and using Smith's theorem, we obtain the stationary probability of a forwarding node being found in a normal state as $E[Y]/(E[Y] + E[R])$. Since all forwarding nodes on the path are independent of each other in their own renewal processes, the joint probability of all $l - 1$ of them being normal simultaneously is simply $\left(E[Y]/(E[Y] + E[R])\right)^{l-1}$, which corresponds to the probability of the receiving node not being in starvation. Thus the following result concerning the delivery ratio seen by the receiver node holds.
Theorem 1: In a DHT network, the worst-case stationary delivery ratio between two nodes that are at most $l$ hops apart is given by

$\left(\frac{E[Y]}{E[Y] + E[R]}\right)^{l-1} = \left(\frac{\mu^2 + \sigma^2}{\mu^2 + \sigma^2 + 2\mu E[R]}\right)^{l-1}$. (4)

Theorem 1 provides an estimate for the worst-case delivery ratio regardless of the actual node lifetime distribution (such as Pareto, lognormal, Weibull, etc.). As an example, Table I shows the worst-case delivery ratios between two nodes with a Pareto node lifetime distribution for varying maximum path lengths $l$ and mean fixing times $E[R]$. As we often do throughout the paper, the parameters $\alpha$ and $\beta$ are set to 3 and 1 hour respectively, such that the mean lifetime $\mu = 0.5$ hours and the mean residual lifetime $E[Y] = 1$ hour. In the table, the worst-case delivery ratio drops noticeably as the maximum path length or mean fixing time increases. For example, for $l = 30$ and $E[R] = 1$ minute, the worst-case delivery ratio is only slightly higher than 60%, a level far from satisfactory for many applications.

TABLE I
WORST-CASE DELIVERY RATIO FOR VARYING $l$ AND $E[R]$

Mean fixing time    Maximum path length
                    10        20        30
30 seconds          92.8%     85.4%     78.6%
1 minute            86.2%     73.0%     61.9%
2 minutes           74.4%     53.6%     38.6%

C. Maximum Node Stress

In a DHT network with $n$ nodes, a node can be responsible for at most $O(\log n / \log\log n)$ IDs with high probability (whp) (balls-in-bins model [17]); each of these IDs corresponds to a virtual node which has an in-degree of at most $m$, so the following theorem holds.

Theorem 2: With high probability, the maximum node stress of an $n$-node DHT network is $O(m \log n / \log\log n)$.

IV. OPTIMIZING SCHEMES AND ANALYSIS

Given a fixed path length, Theorem 1 suggests two ways to improve the delivery ratio: increase $E[Y]$ and decrease $E[R]$. In this work, we focus on the first approach. The general idea of increasing $E[Y]$ is to give preference to nodes that have stayed alive for a relatively long period of time when selecting forwarders in the delivery path. In this section, we introduce three schemes that can assist with this.
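The dependence of Eq. (4) on $E[Y]$, $E[R]$, and the path length can be explored numerically. The following Python sketch evaluates the worst-case delivery ratio of Theorem 1 for the Pareto model with $\alpha = 3$ and $\beta = 1$ hour (mean 0.5 hours, variance 0.75 hours squared), which should reproduce the entries of Table I to the printed precision. The function names are hypothetical and all times are expressed in seconds.

```python
def mean_residual_lifetime(mean, variance):
    """E[Y] = (mu^2 + sigma^2) / (2 mu), from Eq. (3)."""
    return (mean ** 2 + variance) / (2.0 * mean)


def worst_case_delivery_ratio(path_length, mean_residual, mean_fixing):
    """Eq. (4): each of the (path_length - 1) forwarders is up with
    stationary probability E[Y] / (E[Y] + E[R]), independently."""
    up_probability = mean_residual / (mean_residual + mean_fixing)
    return up_probability ** (path_length - 1)


if __name__ == "__main__":
    # Pareto lifetime with alpha = 3, beta = 1 hour, converted to seconds.
    mean_s = 0.5 * 3600.0
    variance_s2 = 0.75 * 3600.0 ** 2
    residual_s = mean_residual_lifetime(mean_s, variance_s2)  # one hour
    for fixing_s in (30.0, 60.0, 120.0):
        row = [worst_case_delivery_ratio(l, residual_s, fixing_s)
               for l in (10, 20, 30)]
        print(fixing_s, ["%.1f%%" % (100.0 * r) for r in row])
```

Because the ratio is a power of a single per-node availability, halving $E[R]$ or doubling $E[Y]$ has the same effect; the schemes in this section act on $E[Y]$.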
Note that we do not elaborate on low-level protocol details here; rather, we focus on the main ideas and the modifications to the original multicast and DHT protocols.

A. The Senior Member Overlay (SMO) Scheme

The SMO scheme organizes nodes above a certain threshold age into a special overlay, called a senior member overlay (SMO), which takes responsibility for all the forwarding tasks in data delivery. Since most young (and thus unstable) nodes are pushed to the leaf level of the tree, they will not affect other nodes and thus the data delivery ratio can be improved. The idea of biased task allocation among heterogeneous nodes (in terms of processing capacity, bandwidth, uptime, etc.) is not new. Our contribution here is the application of this idea to the new context of DHT-based multicast and a formalized analysis with respect to data delivery reliability.

The formation of the SMO is simple. The only parameter involved is the threshold age $T_0$; when a node has stayed in the base overlay for $T_0$, it joins the SMO with the help of some bootstrap node, which can be easily obtained through the propagation and exchange of node information in the base overlay. In the SMO scheme, every publisher is required to join the SMO. A non-publisher node that is not in the SMO needs to identify on the Chord ring the nearest successor node that belongs to the SMO, called its SMO successor. When a node joins a multicast group, if it is already a member of the SMO, it simply performs the joining routine of the multicast protocol within the SMO; otherwise it asks its SMO successor to perform the joining routine within the SMO, and then asks the parent of the SMO successor to add itself as a child. After this, the SMO successor is dropped from the parent's child list.

To help understand the tradeoff between delivery ratio and load balancing, we define another metric that is related to $T_0$.
The SMO fraction, denoted $\delta$, is the proportion of all nodes that are selected to join the SMO. Clearly, $T_0$ is equal to the $(1 - \delta)$-quantile, denoted $x_{1-\delta}$, of a node's age distribution $F_X$. To calculate this quantile, a node first estimates the nodes' lifetime distribution $F$ by monitoring the arrival and departure times of its neighbors and exchanging this information with others; it then obtains $F_X$ according to Lemma 3 and finally calculates $x_{1-\delta}$.

Let random variable $Y'$ denote the residual lifetime of a randomly chosen SMO node. (We are interested only in SMO nodes because they are the forwarding nodes for data delivery.) The following lemma concerning the density function of $Y'$ holds.

Lemma 5: In a DHT network with the SMO scheme, as $t \to \infty$,

$f_{Y'}(y) = \frac{1 - F(y + T_0)}{\mu'}$, (5)

where $\mu' = \int_{T_0}^{\infty} (1 - F(s))\,ds$; moreover,

$E[Y'] = \frac{1}{\mu'} \int_0^\infty y\,(1 - F(y + T_0))\,dy$. (6)

Now, the worst-case data delivery ratio can be obtained using the asymptotic results of renewal theory, as done in Theorem 1.

Theorem 3: In a DHT network with the SMO scheme, the worst-case stationary delivery ratio of two nodes that are at most $l$ hops apart is given by

$\left(\frac{E[Y']}{E[Y'] + E[R]}\right)^{l-1}$, (7)

where $E[Y']$ is given by Eq. (6).

When node lifetime follows a Pareto distribution, the worst-case delivery ratio has an elegant expression. The following corollary can be obtained after simple integration, using $\delta = \Pr\{X \ge T_0\} = (1 + T_0/\beta)^{-(\alpha-1)}$ and $E[Y'] = (\beta + T_0)/(\alpha - 2)$.

Corollary 1: For a Pareto node lifetime distribution, the worst-case stationary delivery ratio of two nodes that are at most $l$ hops apart in a DHT network with the SMO scheme is

$\left(\frac{\mu^2 + \sigma^2}{\mu^2 + \sigma^2 + 2\,\delta^{1/(\alpha-1)}\,\mu E[R]}\right)^{l-1}$. (8)

Eq. (8) differs from Eq. (4) only in the factor $\delta^{1/(\alpha-1)}$ in the denominator, which clearly shows the impact of $\delta$ on delivery ratio. This is further demonstrated in Figure 2, which shows the worst-case delivery ratios under the SMO scheme for varying values of $\delta$ and $E[R]$. The node lifetime model is the same as that used in Table I.

Fig. 2. Impact of SMO fraction $\delta$ on worst-case delivery ratio (curves for $E[R]$ = 30, 60, 90, and 120 sec).
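The relationship between the SMO fraction, the threshold age, and the resulting delivery ratio can be sketched in a few lines of Python. This is a minimal sketch under the Pareto lifetime model only: the closed forms below are derived for this example from the stationary age distribution of Lemma 3 (age survival $(1 + x/\beta)^{-(\alpha-1)}$ and mean residual lifetime $(\beta + x)/(\alpha - 2)$ for nodes of age at least $x$) rather than quoted from the paper, and the function names are hypothetical.

```python
def smo_threshold_age(delta, alpha, beta):
    """Threshold age T0 such that a fraction delta of present nodes is at
    least T0 old, i.e. the (1 - delta)-quantile of the stationary age
    distribution. Assumes Pr{X >= x} = (1 + x/beta)^(-(alpha - 1))."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    return beta * (delta ** (-1.0 / (alpha - 1.0)) - 1.0)


def smo_mean_residual(delta, alpha, beta):
    """Mean residual lifetime of a node whose age is at least T0 (alpha > 2)."""
    t0 = smo_threshold_age(delta, alpha, beta)
    return (beta + t0) / (alpha - 2.0)


def smo_delivery_ratio(path_length, delta, alpha, beta, mean_fixing):
    """Worst-case delivery ratio with all forwarders drawn from the SMO."""
    residual = smo_mean_residual(delta, alpha, beta)
    return (residual / (residual + mean_fixing)) ** (path_length - 1)


if __name__ == "__main__":
    alpha, beta_s, fixing_s = 3.0, 3600.0, 60.0  # seconds; values as in Table I
    for delta in (1.0, 0.5, 0.25, 0.1):
        print(delta,
              smo_threshold_age(delta, alpha, beta_s),
              smo_delivery_ratio(30, delta, alpha, beta_s, fixing_s))
```

Setting `delta = 1.0` gives a threshold age of zero and recovers the plain-DHT value of Theorem 1; smaller fractions raise the threshold age and hence the delivery ratio.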
It can be seen that the use of SMO can effectively improve the worst-case delivery ratio; moreover, the smaller the SMO, the higher this ratio. However, a small SMO may result in more load imbalance between the SMO nodes and non-SMO nodes. This tradeoff is quantified by the following theorem.

Theorem 4: With high probability, the maximum node stress of an $n$-node network with the SMO scheme is [bound illegible in the source; the surviving terms are $m$, $\log n$, and $\log\log n$].

B. The Longer-Lived Neighbor Selection (LNS) Scheme

LNS makes use of the flexibility of neighbor selection provided by some DHT algorithms such as Chord, Pastry, and Kademlia. In Chord, for example, it is possible for a node $n$ to choose its $i$th neighbor from a subset of nodes, called a candidate subset, in the range $[n + 2^{i-1}, n + 2^{i})$ [13]. This enables a node to choose reliable neighbors by selecting the oldest node from each candidate set, so that the data delivery path can be more reliable. Considering that the candidate set may grow too large as $i$ increases, we let LNS sample at most $K$ consecutive nodes starting from $n + 2^{i-1}$ for the $i$th neighbor of node $n$.

We now consider the residual lifetime of a forwarding node on a delivery path. The following lemma characterizes such a random variable and further provides its mean for the special case of a Pareto lifetime distribution.

Lemma 6: Let $Y_i$ ($1 \le i \le l$) be the residual lifetime of a node's $i$th neighbor node in a DHT network with the LNS scheme. Then the density of $Y_i$ is given by

$f_{Y_i}(y) = \frac{K_i}{\mu} \int_0^\infty f(x + y) \left[\frac{1}{\mu}\int_0^x (1 - F(s))\,ds\right]^{K_i - 1} dx$, (9)

where $K_i = \min\{K, 2^{i-1}\}$. Specially, if node lifetime follows a Pareto distribution,

$E[Y_i] = \frac{\beta}{\alpha - 1} \cdot \frac{\Gamma(K_i + 1)\,\Gamma(1 - c)}{\Gamma(K_i + 1 - c)}$, where $c = \frac{1}{\alpha - 1}$. (10)

To see the trend of $E[Y_i]$ as $K_i$ increases, we can use $\Gamma(K_i + 1)/\Gamma(K_i + 1 - c) \approx K_i^{\,c}$ for large $K_i$ to approximate Eq. (10) as

$E[Y_i] \approx \frac{\beta\,\Gamma(1 - c)}{\alpha - 1}\, K_i^{\,c}$, (11)

which indicates that $E[Y_i]$ grows with $K_i$ (and thus with $K$) and tends to infinity. Also notice that in the special case of $K_i = 1$, Eq. (10) reduces to $E[Y_i] = \beta/(\alpha - 2) = E[Y]$, which corresponds to the node residual lifetime in a plain DHT. The following theorem gives the delivery ratio result for a DHT network under the LNS scheme.

Theorem 5: In a DHT network where nodes' lifetimes follow a Pareto distribution and the LNS scheme is applied, the worst-case stationary delivery ratio of two nodes that are at most $l$ hops apart is given by

$\prod_{i=1}^{l-1} \frac{E[Y_{i+1}]}{E[Y_{i+1}] + E[R]}$, (12)

where $E[Y_{i+1}]$ is given by Eq. (10) with $K_{i+1} = \min\{K, 2^{i}\}$.

Figure 3 shows the worst-case delivery ratios under the LNS scheme for varying candidate set sizes and mean fixing times. The node lifetime model is the same as that used in Table I. As expected, the worst-case delivery ratio improves substantially as $K$ varies from 1 to 100. A large $K$, however, implies more significant load imbalance. The following theorem shows a linear relationship between $K$ and the maximum node stress.

Fig. 3. Impact of maximum candidate set size $K$ on worst-case delivery ratio (curves for $E[R]$ = 30, 60, 90, and 120 sec).

Theorem 6: With high probability, the maximum node stress for an $n$-node DHT network with the LNS scheme is $O(K m \log n / \log\log n)$.

C. The Reliable Route Selection (RRS) Scheme

Although in its original proposal Chord greedily routes a message towards a destination in decreasing ID distances, the order of the distances is in no way essential to the correctness and efficiency of routing. In other words, in terms of total hop count, a route that spans a sequence of distances is equivalent to any route that spans a permutation of that sequence, if that permuted sequence can be achieved. For example, two routes whose hop distances are the same powers of two taken in different orders reach the same destination in the same number of hops. This flexibility provides some room for choosing stable nodes for a data delivery path.

Specifically, at each node the RRS scheme chooses the oldest node from a set of neighbors, called the candidate set, as the next hop. The candidate set is selected in such a way that the total hop count will not be increased. Here we analyze one approach to achieving this: if the distance of a node to some destination is expressed as a binary number, then neighbor $i$ is added to the set if that binary number has a 1 in the $i$th position. This can be thought of as a binary string having its 1 bits cleared one at a time as a message routes from the source to the destination, and hence we call it bit-clearing. This heuristic was originally proposed in [13] for achieving network proximity; it can, however, guarantee a fixed hop count only in a fully populated DHT. Some other heuristics are possible and we will discuss these in Section V.

In the following, we assume two nodes that are initially $2^l - 1$ distance apart on the ring; then, starting from the receiving node, the first node can choose any of its $l$ neighbors as its next hop, the next node has $l - 1$ possible next hops, and so on to generate a route. Following the same line as Lemma 6 and Theorem 5, we can obtain the following results.

Fig. 4. Worst-case delivery ratio as a function of maximum path length $l$ (plain DHT and RRS DHT, each for $E[R]$ = 60 and 90 sec).

Lemma 7: Let $Y_i$ ($1 \le i \le l - 1$) be the residual lifetime of the $i$th forwarding node on the path $n_0, n_1, \ldots, n_l$, where $n_0$ is the source node and $n_l$ the receiving node, in a DHT network with the RRS scheme.
Then, since $n_i$ is the oldest of the $i + 1$ candidate next hops available to $n_{i+1}$, the density of $Y_i$ is given by

$f_{Y_i}(y) = \frac{i + 1}{\mu} \int_0^\infty f(x + y) \left[\frac{1}{\mu}\int_0^x (1 - F(s))\,ds\right]^{i} dx$. (13)

Specially, if node lifetime follows a Pareto distribution,

$E[Y_i] = \frac{\beta}{\alpha - 1} \cdot \frac{\Gamma(i + 2)\,\Gamma(1 - c)}{\Gamma(i + 2 - c)}$, where $c = \frac{1}{\alpha - 1}$. (14)

Theorem 7: In a DHT network where nodes' lifetimes follow a Pareto distribution and the RRS scheme is applied, the worst-case stationary delivery ratio of two nodes that are at most $l$ hops apart is given by

$\prod_{i=1}^{l-1} \frac{E[Y_i]}{E[Y_i] + E[R]}$, (15)

where $E[Y_i]$ is given by Eq. (14).

Theorem 8: With high probability, the maximum node stress of an $n$-node DHT network with the RRS scheme is $O(m \log n / \log\log n)$.

Figure 4 shows the delivery ratios under the RRS scheme for varying path lengths and mean fixing times. The node lifetime model is the same as that used in Table I. As can be seen, the RRS scheme leads to a considerable improvement in the worst-case delivery ratio. Moreover, the curves for the RRS scheme have smaller slopes than those for a plain DHT, indicating that as the path length increases, the worst-case delivery ratio under the RRS scheme drops more slowly than in a plain DHT. This is because a longer path provides larger candidate sets, which in turn increases $E[Y_i]$. This partly compensates for the loss of delivery ratio due to the increase in path length.

D. Comparison of the Schemes

Besides the differences in delivery ratio and maximum node stress, the three schemes differ in a number of other respects, including applicability to different DHT algorithms, control flexibility, implementation cost, and communication overhead.

The SMO scheme does not rely on any particular underlying DHT geometry, so it can be applied to all types of DHT network. It provides the SMO fraction parameter to balance data delivery ratio against load balancing, and therefore has good control flexibility.
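The bit-clearing rule used by the RRS scheme in Section IV-C can be sketched as follows. This is a minimal Python illustration rather than the paper's implementation: it assumes a fully populated ring, numbers bit positions from 1 at the least significant bit so that neighbor $i$ covers distance $2^{i-1}$, and takes node ages from a hypothetical mapping `neighbor_age`; the function names are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def bit_clearing_candidates(distance: int) -> List[int]:
    """Indices i of neighbors whose span 2^(i-1) matches a 1 bit of the
    remaining ID distance; taking any of them keeps the hop count fixed."""
    return [i for i in range(1, distance.bit_length() + 1)
            if (distance >> (i - 1)) & 1]


def choose_next_hop(distance: int,
                    neighbor_age: Dict[int, float]) -> Tuple[int, int]:
    """Pick the oldest candidate neighbor and return (neighbor index,
    remaining distance after clearing that neighbor's bit)."""
    candidates = bit_clearing_candidates(distance)
    if not candidates:
        raise ValueError("distance must be positive")
    oldest = max(candidates, key=lambda i: neighbor_age[i])
    return oldest, distance - (1 << (oldest - 1))


if __name__ == "__main__":
    # Distance 0b1011 has three 1 bits, so any route built this way has 3 hops.
    ages = {1: 5.0, 2: 9.0, 3: 7.0, 4: 1.0}  # hypothetical ages of neighbors 1..4
    remaining, hops = 0b1011, 0
    while remaining:
        hop, remaining = choose_next_hop(remaining, ages)
        hops += 1
        print("took neighbor", hop, "remaining distance", bin(remaining))
    print("hops:", hops)
```

In a real overlay each node on the route would consult the ages of its own neighbors; a single age table is reused here only to keep the example short.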
To implement this scheme, the original multicast protocol needs to be extended to maintain the age information of nodes and to be aware of the existence of two overlays, whereas the underlying DHT algorithm need not be changed. The major drawback of this scheme is that it requires a fraction of nodes to stay in two DHT overlays simultaneously, which means higher node overhead and message traffic for those nodes.

The LNS scheme makes use of the flexibility of neighbor selection, which is unavailable in some DHT algorithms such as CAN and de Bruijn. Like SMO, the scheme provides a parameter to trade off reliability against load balance. It requires only minor modifications to existing DHT algorithms, which are transparent to upper-layer multicast protocols. The extra overhead imposed on the nodes is small because each node only needs to sample a limited number of nodes and choose the oldest ones as its neighbors.

The RRS scheme relies on the flexibility of route selection, which can only be provided by Chord and CAN in our cases. Therefore, this scheme has the least applicability in terms of DHT choice. Like LNS, it has minimal implementation cost and running overhead.

V. SIMULATION STUDY

A. Methodology

An event-driven simulator is developed with the Chord and Scribe protocols implemented. Nodes enter the DHT network according to a homogeneous Poisson process such that the average number of DHT nodes remains at approximately 800,000. Upon joining the DHT, some nodes also participate in one of 30 multicast groups with equal probability; the total number of multicast members remains at approximately 300,000. Since the reliability of data delivery is our focus, the simulator does not model network latency. Two lifetime distributions are tested: the Pareto distribution and the Lognormal distribution, as observed by [26] and [30].
The Pareto distribution has the same parameters as those used in Table I, and the Lognormal distribution has its scale and shape parameters set to 5.5 and 2, respectively, such that both distributions have a mean of around 30 minutes. We wish to see how variations in the lifetime distribution affect multicast reliability and how the optimizing schemes adapt to them.

In the simulator, data loss is caused only by node departures (without notification). The total failure detection and recovery time is assumed to be uniformly distributed over an interval of seconds [bounds not recoverable from the source text]. We call the cumulative time a node spends in failure detection and recovery its failure time, and define the ratio of the failure time to its session time (current time minus arrival time) as the data loss ratio (or loss ratio), which is equal to one minus the delivery ratio. Other performance metrics include node stress and path length. For loss ratio and path length, we report only the statistics over all the leaf nodes, as they reflect the worst-case performance of the multicast. All the following results are taken from a typical snapshot of the network after it has evolved for 3.6 hours.

B. Simulation Results

1) Data loss ratio: Figures 5(a) and 5(b) show the cumulative distribution functions (CDFs) of loss ratio under the SMO scheme. It can be seen that the SMO scheme considerably reduces the loss ratios. In the Pareto case (Figure 5(a)), for example, an SMO fraction of 0.1 reduces the mean loss ratio by nearly a half (from 18.0% to 9.3%), and the fraction of nodes whose loss ratios are below 10% improves from 22.6% to 68.9%. Similar observations can be made in the Lognormal case (Figure 5(b)). Compared with the Pareto case, the Lognormal case has consistently lower mean loss ratios.
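The lifetime parameters quoted in the methodology can be sanity-checked numerically. The sketch below computes the mean lifetime and the mean stationary residual lifetime for both distributions, using the standard renewal-theory identity E[R] = E[L^2] / (2 E[L]), and the arrival rate that Little's law implies for a population of about 800,000 nodes. Several things here are assumptions rather than facts from the paper: the Lognormal parameters are taken to describe ln L with L in seconds; the Pareto distribution is taken to be the shifted form F(x) = 1 - (1 + x/beta)^(-alpha) with alpha = 3 and beta = 1 hour, values chosen only because they reproduce a 30-minute mean (Table I is not reproduced here); and all function names are illustrative.

```python
import math

HOUR = 3600.0  # seconds

def lognormal_moments(mu, sigma):
    """First and second moments of a Lognormal lifetime (parameters of ln L)."""
    return math.exp(mu + sigma ** 2 / 2), math.exp(2 * mu + 2 * sigma ** 2)

def shifted_pareto_moments(alpha, beta):
    """First and second moments for F(x) = 1 - (1 + x/beta)^(-alpha), alpha > 2."""
    mean = beta / (alpha - 1)
    second = 2 * beta ** 2 / ((alpha - 1) * (alpha - 2))
    return mean, second

def mean_residual(mean, second):
    """Mean stationary residual lifetime, E[R] = E[L^2] / (2 E[L])."""
    return second / (2 * mean)

# Lognormal parameters from the text; the unit (seconds) is an assumption.
ln_mean, ln_second = lognormal_moments(5.5, 2.0)
# Assumed Pareto parameters (not taken from Table I): alpha = 3, beta = 1 hour.
pa_mean, pa_second = shifted_pareto_moments(3.0, 1.0 * HOUR)

# Little's law: arrival rate needed to hold about 800,000 nodes in the system.
arrival_rate = 800_000 / pa_mean  # nodes per second

for name, m1, m2 in [("Lognormal", ln_mean, ln_second),
                     ("Pareto", pa_mean, pa_second)]:
    print(f"{name}: mean lifetime {m1 / 60:.1f} min, "
          f"mean residual lifetime {mean_residual(m1, m2) / HOUR:.1f} h")
print(f"arrival rate: {arrival_rate:.1f} nodes/s")
```

Under these assumptions both distributions have a mean lifetime of about 30 minutes, yet the Lognormal distribution has a much longer mean residual lifetime than the Pareto distribution, because the residual lifetime depends on the second moment as well as the mean.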
Lemma 4 helps us understand the case of SMO fraction = 1 (plain DHT): the mean residual lifetime $E[R]$ is 1 hour for the Pareto lifetime distribution, whereas with the Lognormal lifetime distribution $E[R]$ is longer [value in hours not recoverable from the source text], which means a higher delivery ratio according to Theorem 1. (The path lengths in the two cases are indeed very close to each other; the results are reported in [29].) Lemma 5 and Theorem 3 can explain the differences between the curves for the other SMO fractions.

Figures 6(a) and 6(b) show the CDFs of loss ratio under the LNS scheme, which demonstrate a noticeable improvement due to LNS. The difference between the Pareto and Lognormal cases lies in the sensitivity of the loss ratio to the value of $K$: in the Lognormal case, the loss ratio improves more significantly at small values of $K$. For example, the loss ratio drops by 49% as $K$ grows from 1 to 4, whereas a reduction of only 16% is observed as $K$ grows from 5 to 16.

The loss ratios under the RRS scheme with the Pareto lifetime distribution are shown in Figure 7(a). The bit-clearing heuristic results in a substantially higher mean loss ratio than in the plain DHT. This is because in a DHT that is not fully populated, the distance span of each hop is not necessarily a power of 2, so new 1 bits may often appear in the binary string of the distance as messages route from the source to the destination. This results in a larger hop count and hence a higher loss ratio than in the plain DHT.
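The effect described above can be reproduced on a toy ring. The following is a minimal sketch, not the simulator used in the paper: it assumes Chord-style fingers (neighbor i of a node is the first node at or after the node's identifier plus 2^i), picks the oldest node in the bit-clearing candidate set at each hop, and uses made-up node identifiers and ages. All function names are hypothetical.

```python
from bisect import bisect_left

def candidate_bits(distance):
    """Indices i such that bit i of the remaining distance is 1."""
    return [i for i in range(distance.bit_length()) if (distance >> i) & 1]

def successor(nodes, key, ring):
    """First node identifier at or after key on the ring (nodes is sorted)."""
    pos = bisect_left(nodes, key % ring)
    return nodes[pos % len(nodes)]

def route(nodes, age, src, dst, m):
    """Route from src to dst on a 2^m ring, always taking the oldest candidate."""
    ring = 1 << m
    path, cur = [src], src
    while cur != dst:
        dist = (dst - cur) % ring
        # Candidate set: finger i for every 1 bit i of the remaining distance.
        fingers = {successor(nodes, cur + (1 << i), ring)
                   for i in candidate_bits(dist)}
        cur = max(fingers, key=lambda n: (age[n], n))  # oldest; ties by id
        path.append(cur)
    return path

# Fully populated 16-node ring: hop count equals the number of 1 bits in 13.
full_nodes = list(range(16))
full_age = {n: (7 * n) % 16 for n in full_nodes}  # arbitrary illustrative ages
full_path = route(full_nodes, full_age, 0, 13, 4)

# Sparse ring: distance 12 (binary 1100) has two 1 bits, but fingers overshoot
# their nominal span, so new 1 bits appear and the route takes four hops.
sparse_nodes = [0, 5, 6, 8, 12]
sparse_age = {0: 0, 5: 50, 6: 40, 8: 30, 12: 1}
sparse_path = route(sparse_nodes, sparse_age, 0, 12, 4)

print(full_path, len(full_path) - 1)
print(sparse_path, len(sparse_path) - 1)
```

In the sparse case the first hop lands on node 5 rather than node 4, leaving a remaining distance of 7 (binary 0111), which has more 1 bits than the original distance. This is the mechanism by which the hop count grows when the identifier space is not fully populated.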
[Fig. 6. CDF of data loss ratio (x-axis: data loss ratio (%); y-axis: cumulative % of nodes). (a) LNS + Pareto lifetime distribution: mean loss ratio 18.0% for K = 1 (plain DHT), 14.6% for K = 4, 10.4% for K = 16. (b) LNS + Lognormal lifetime distribution: mean loss ratio 14.9% for K = 1 (plain DHT), 7.5% for K = 4, 6.4% for K = 16.]

In view of this, we consider another heuristic for the selection of next-hop nodes. Assuming that the distance of a neighbor is $d$ and the age of that neighbor is $a$, the neighbor with the maximum value of $d \cdot a^{\alpha}$ is selected as the next-hop node. The heuristic trades off total hop count against the choice of stable nodes, and the power of age $\alpha$ determines the effect of age. When $\alpha = 0$, the routing scheme reduces to that of a plain DHT. Figure 7 shows the CDFs of loss ratio for three values of $\alpha$. For $\alpha = 1$, the loss ratio is still higher than that of the plain DHT, which implies that the negative effect of the increased hop count exceeds the positive effect of stable node selection. Lowering $\alpha$ from 1 to 0.5 remedies this situation slightly, making the loss ratio very close to that of the plain DHT. Further reduction of $\alpha$, however, has very little effect on the loss ratio under the Pareto distribution. (The results are not shown to preserve the clarity of the figures.)
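The distance-and-age heuristic can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: "distance" is taken to mean the ring distance a hop covers, neighbors that would overshoot the destination are excluded, the score is the product of that distance and the neighbor's age raised to the power of age, and the neighbor table, the age unit (minutes), and the function name are all hypothetical.

```python
def next_hop(neighbors, remaining, power_of_age):
    """Return the id of the neighbor maximizing span * age ** power_of_age.

    neighbors: list of (node_id, span, age), where span is the ring distance
               covered by hopping to that neighbor and age is its uptime.
    remaining: ring distance from the current node to the destination.
    """
    # Only neighbors that make progress without passing the destination qualify.
    feasible = [n for n in neighbors if 0 < n[1] <= remaining]
    return max(feasible, key=lambda n: n[1] * n[2] ** power_of_age)[0]

# Hypothetical neighbor table: (id, span, age in minutes).
table = [("a", 64, 4.0), ("b", 32, 25.0), ("c", 16, 64.0)]

choice_plain = next_hop(table, 100, 0)    # power 0: pure greedy on distance
choice_half = next_hop(table, 100, 0.5)   # moderate weight on age
choice_full = next_hop(table, 100, 1)     # strong weight on age
print(choice_plain, choice_half, choice_full)
```

With a power of age of 0 the age term equals 1 for every neighbor, so the rule picks the longest feasible hop, as in plain greedy routing. Raising the power shifts the choice toward older neighbors with shorter spans, which is how the heuristic trades hop count for node stability.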
The stable node selection appears to be more effective for a Lognormal lifetime distribution (Figure 7(b)), although the improvement is only moderate (a reduction in loss ratio of 31%); the bit-clearing heuristic still performs worst.

[Fig. 5. CDF of data loss ratio (x-axis: data loss ratio (%); y-axis: cumulative % of nodes). (a) SMO + Pareto lifetime distribution: mean loss ratio 18.0% for SMO fraction = 1 (plain DHT), 14.6% for SMO fraction = 0.5, 9.3% for SMO fraction = 0.1. (b) SMO + Lognormal lifetime distribution: mean loss ratio 14.9% for SMO fraction = 1 (plain DHT), 11.7% for SMO fraction = 0.5, 6.1% for SMO fraction = 0.1.]

[Fig. 7. CDF of data loss ratio (x-axis: data loss ratio (%); y-axis: cumulative % of nodes). (a) RRS + Pareto lifetime distribution: mean loss ratio 18.0% for power of age = 0 (plain DHT), 17.9% for power of age = 0.5, 19.1% for power of age = 1, 34.6% for bit-clearing. (b) RRS + Lognormal lifetime distribution: mean loss ratio 14.9% for power of age = 0 (plain DHT), 11.4% for power of age = 0.5, 10.4% for power of age = 1, 15.4% for bit-clearing.]

2) Node stress: Figure 8 shows the distribution of node stress under the SMO scheme. When the SMO fraction $\delta < 1$, a fraction of nodes are outside the SMO and have a node stress of zero, so the total number of zero-stress nodes is larger than that of a plain DHT. For example, the number of zero-stress nodes is 105,695 for a plain DHT, whereas this figure grows to 150,010 for the SMO scheme with $\delta = 0.1$. On the other hand, since all the forwarding tasks are assigned to only a fraction of the nodes, there are likely to be more nodes with high stress values than in a plain DHT. This is shown by the bars in Figure 8 for stress values ranging from 12 to 36.
Generally, the smaller the SMO fraction, the more nodes are distributed near both ends of the range of stress values. Similar observations can be made from Figures 9 and 10, which depict the distributions of node stress under the LNS and the RRS schemes, respectively, although the latter figure shows a smaller deviation of the stress distribution from that of the plain DHT.

3) Path length: Besides being an important factor in data delivery reliability, path length is critical for many applications in its own right. Figure 11 shows the mean path length of the various schemes. It can be seen that the SMO and the LNS schemes slightly shorten the multicast paths, especially with a smaller SMO fraction or a larger candidate set size. This is because a smaller set of forwarding nodes reduces the number of necessary intermediate nodes on the path. Consider the extreme case where only one forwarder is present (corresponding to an SMO with a single node, or to a sufficiently large $K$): the overlay graph takes on a star-like structure, and the path length between any pair of nodes is only 2. The RRS scheme with the bit-clearing heuristic yields very large path lengths, which verifies the qualitative discussion in Section V-B.1. For the heuristic using the product of distance and age, the path length is still longer than that of the plain DHT, a consequence of sacrificing small hop counts for the choice of highly reliable routes.

VI. RELATED WORK

From the perspective of stochastic modeling, perhaps the closest work to ours is that of Leonard et al. [14]. Based on a node lifetime model similar to ours (see the discussion of differences in Section III-A), they analyze the resilience of general peer-to-peer networks and derive the expected delay before a user is isolated from the network and the probability of this occurring within his/her lifetime.
They model the evolution of a node’s neighbors as a superposition of renewal processes, and then obtain the limiting probability of at least one neighbor being alive using renewal theory. We also use renewal theory to analyze the probability that an intermediate node on a delivery path is in the normal state, but our focus is on the probability of all nodes being in the normal state simultaneously.

[Fig. 8. Node stress under the SMO scheme (x-axis: node stress value; y-axis: number of nodes; series: SMO fraction = 1 (plain DHT), 0.5, 0.1).]
[Fig. 9. Node stress under the LNS scheme (x-axis: node stress value; y-axis: number of nodes; series: candidate set size = 1 (plain DHT), 4, 8).]
[Fig. 10. Node stress under the RRS scheme (x-axis: node stress value; y-axis: number of nodes; series: power of age = 0 (plain DHT), 0.5, 1).]
[Fig. 11. Path lengths under different schemes (y-axis: path length (hop count), with 5th and 95th percentiles; schemes: plain DHT; SMO, 0.5; SMO, 0.1; LNS, 4; LNS, 16; RRS, 1; bit-clearing).]

For the resilience of DHT networks, Gummadi et al. [13] identify several representative routing geometries and analyze their degrees of flexibility, which benefit static resilience and proximity. Loguinov et al. [15] examine graph-theoretic properties of several DHTs and analyze their routing performance and fault resilience. Stutzbach et al. [28] characterize the churn of P2P networks through empirical experiments.

In the field of application-layer multicast, reliability has been a topic of enduring interest. The early work of Chu et al. [12] and Padmanabhan et al.
[20] proposes some simple and effective techniques for improving multicast reliability. Castro et al. [9] propose using multiple trees to improve the resilience of streaming to interruptions. In [25], a number of single-tree construction algorithms are proposed and evaluated using traces from large-scale commercial systems. Due to the lack of a generic supporting overlay topology, these optimizing schemes are somewhat ad hoc, and are therefore difficult to model and evaluate using an integrated framework.

VII. CONCLUSIONS

This paper investigates the reliability of DHT-based multicast. The contributions include: (1) a node residual lifetime model, which is fundamental to our analysis and which we believe will be a useful tool in other contexts of overlay-based applications; (2) a renewal-theory-based model for the stationary delivery ratio, which provides a worst-case estimate of the reliability of data delivery between two nodes in the DHT network; (3) three optimization schemes and an analysis of their reliability; and (4) simulation experiments which provide insight into the performance of DHT-based multicast in a number of major respects. In the future, we will consider how the model and the optimization schemes can be applied to bandwidth-demanding applications such as video streaming.

REFERENCES

[1] RSSDHT. http://sourceforge.net/projects/rssdht/
[2] Bamboo DHT project. http://bamboo-dht.org/
[3] K. C. Almeroth and M. H. Ammar. Collecting and Modeling the Join/Leave Behavior of Multicast Group Members in the MBone. Proc. of the High Performance Distributed Computing (HPDC), 1996.
[4] S. Banerjee, S. Lee, B. Bhattacharjee, and A. Srinivasan. Resilient multicast using overlays. ACM SIGMETRICS 2003.
[5] A. R. Bharambe, S. G. Rao, V. N. Padmanabhan, S. Seshan and H. Zhang. The Impact of Heterogeneous Bandwidth Constraints on DHT-Based Multicast Protocols. IPTPS, 2005.
[6] M. Bishop, S. Rao, and K. Sripanidkulchai.
Considering Priority in Overlay Multicast Protocols under Heterogeneous Environments. In Proc. of INFOCOM 2006.
[7] P. B. Godfrey, S. Shenker, and I. Stoica. Minimizing Churn in Distributed Systems. Proc. of SIGCOMM 2006.
[8] F. E. Bustamante and Y. Qiao. Friendships that last: peer lifespan and its role in P2P protocols. WCW workshop 2003.
[9] M. Castro, P. Druschel, A-M. Kermarrec, A. Nandi, A. Rowstron and A. Singh. SplitStream: High-bandwidth multicast in a cooperative environment. Proc. of SOSP 2003.
[10] M. Castro, P. Druschel, A-M. Kermarrec and A. Rowstron. Scribe: A large-scale and decentralised application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications (JSAC). Oct. 2002.
[11] Y. Chu, A. Ganjam, T. S. E. Ng, S. G. Rao, K. Sripanidkulchai, J. Zhan and H. Zhang. Early Experience with an Internet Broadcast System Based on Overlay Multicast. USENIX 2004 Annual Technical Conference.
[12] Y. Chu, S. Rao, and H. Zhang. A Case for End System Multicast. Proc. of ACM SIGMETRICS, June 2000.
[13] P. K. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker and I. Stoica. The impact of DHT routing geometry on resilience and proximity. Proc. of ACM SIGCOMM 2003.
[14] D. Leonard, V. Rai, and D. Loguinov. On lifetime-based node failure and stochastic resilience of decentralized peer-to-peer networks. Proc. of ACM SIGMETRICS 2005.
[15] D. Loguinov, A. Kumar, V. Rai, and S. Ganesh. Graph-theoretic analysis of structured peer-to-peer systems: routing distances and fault resilience. Proc. of ACM SIGCOMM 2003.
[16] P. Maymounkov and D. Mazieres. Kademlia: A Peer-to-peer Information System Based on the XOR Metric. IPTPS, 2002.
[17] M. Mitzenmacher and E. Upfal. Probability and Computing. Cambridge University Press, 2005.
[18] F. Kaashoek and D. R. Karger. Koorde: A simple degree-optimal distributed hash table. IPTPS 2003.
[19] V. G. Kulkarni. Modeling and Analysis of Stochastic Systems. Chapman & Hall Ltd. ISBN: 0-41204-991-0, 1996.
[20] V. N. Padmanabhan, H. J. Wang and P. A. Chou. Resilient Peer-to-Peer Streaming. Proc. of ICNP 2003.
[21] S. Ratnasamy, M. Handley, R. Karp, S. Shenker. Application-level Multicast using Content-Addressable Networks. Proc. of International Workshop on Networked Group Communication (NGC) 2001.
[22] S. Ratnasamy, P. Francis, M. Handley, R. Karp and S. Shenker. A Scalable Content-Addressable Network. Proc. of SIGCOMM 2001.
[23] S. Rhea, B. Godfrey, B. Karp, et al. OpenDHT: A Public DHT Service and Its Uses. Proc. of SIGCOMM 2005.
[24] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. IFIP/ACM Intl. Conference on Distributed Systems Platforms 2001.
[25] K. Sripanidkulchai, A. Ganjam, B. Maggs and H. Zhang. The feasibility of supporting large-scale live streaming applications with dynamic application end-points. Proc. of ACM SIGCOMM, 2004.
[26] K. Sripanidkulchai, B. Maggs and H. Zhang. An analysis of live streaming workloads on the Internet. SIGCOMM IMC 2004.
[27] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, H. Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. Proc. of ACM SIGCOMM 2001.
[28] D. Stutzbach and R. Rejaie. Understanding Churn in Peer-to-Peer Networks. SIGCOMM IMC 2006.
[29] G. Tan and S. A. Jarvis. On the reliability of DHT-based multicast. Technical Report CS-TR-06, University of Warwick, 2006.
[30] E. Veloso, V. Almeida, W. Meira, A. Bestavros, and S. Jin. A Hierarchical Characterization of A Live Streaming Media Workload. IEEE/ACM Trans. on Networking, 14(1), 2006.
[31] S. Q. Zhuang, B. Y. Zhao, A. D. Joseph, R. H. Katz, J. D. Kubiatowicz. Bayeux: An Architecture for Scalable and Fault-tolerant Wide-area Data Dissemination. Proc. of NOSSDAV 2001.