LIDS-P-2006    November 1990

Multinode Broadcast in Hypercubes and Rings with Randomly Distributed Length of Packets¹

by Emmanouel A. Varvarigos and Dimitri P. Bertsekas²

Abstract

We consider a multinode broadcast (MNB) in a hypercube and in a ring network of processors. This is the communication task where we want each node of the network to broadcast a packet to all the other nodes. The communication model that we use is different from those considered in the literature so far. In particular, we assume that the lengths of the packets that are broadcast are not fixed, but are distributed according to some probabilistic rule, and we compare the optimal times required to execute the MNB for variable and for fixed packet lengths. For large hypercubes we show, under very general probabilistic assumptions on the packet lengths, that the MNB is completed in essentially the same time as when the packet lengths are fixed. In particular, we show that the MNB is completed by time $(1+\delta)T_s$ with probability at least $1-\epsilon$, for any positive $\epsilon$ and $\delta$, where $T_s$ is the optimal time required to execute the MNB when the packet lengths are fixed at their mean, provided that the size of the hypercube is large enough. In the case of the ring we prove that the average time required to execute an MNB when the packet lengths are exponentially distributed exceeds by a factor of $\ln n$ the corresponding time for the case where the packet lengths are fixed at their mean, where $n$ is the number of nodes of the ring.

¹ Research supported by NSF under Grant NSF-DDM-8903385 and by the ARO under Grant DAAL03-86-K-0171.
² Laboratory for Information and Decision Systems, M.I.T., Cambridge, Mass. 02139.

1. INTRODUCTION

The processors of a multiprocessor system, when doing computations, often have to communicate intermediate results.
The interprocessor communication time may be substantial relative to the time needed exclusively for computations, so it is important to carry out the information exchange as efficiently as possible. One of the most frequent communication tasks is the multinode broadcast (MNB). In this task we want each node to broadcast a packet (the same packet) to all the other nodes. The MNB arises, for example, in iterations of the form $x^{t+1} = Ax^t$, where $A$ is a square matrix and $x^t$, $x^{t+1}$ are column vectors of appropriate dimensions. In this computation, each processor computes a specific component of the vector $x^{t+1}$ and broadcasts it to all the other processors so that it can be used during the next iteration.

Algorithms for routing messages between different processors have been studied by several authors under a variety of assumptions on the communication network connecting the processors. Saad and Schultz [SaS85], [SaS86] have introduced a number of generic communication problems that arise frequently in numerical and other methods. They have assumed that all packets take unit time to traverse any communication link. Johnson and Ho [JoH89] have developed minimum and nearly minimum completion time algorithms for routing problems similar to those of Saad and Schultz, but used a different communication model and a hypercube network. Their model quantifies the effects of setup time (or overhead) per packet, while it allows packets to have different lengths, and to be split and recombined prior to transmission on any link in order to save on setup time. In the model of [JoH89], each packet may consist of data originating at different nodes and/or destined for different nodes, and the extra overhead for splitting and combining packets is considered negligible. Bertsekas et al. [BOS89], and Bertsekas and Tsitsiklis [BeT89] have used the communication model of Saad and Schultz to derive minimum completion time algorithms for several communication problems in a hypercube.
In particular, they have given an algorithm for the multinode broadcast that executes in a minimum number of steps ($\lceil (n-1)/d \rceil$ for a hypercube with $n = 2^d$ processors). Several other works deal with various communication problems and network architectures related to those discussed in the present paper; see [BhI85], [DNS81], [HHL88], [Joh87], [KVC88], [McV87], [Ozv87], [SaS88], [StT90], [StW87], [Top85], [VaB90], and [Var90].

In this paper we will be dealing with a multinode broadcast in hypercubes and rings. The case where the packets broadcast by the processors have equal deterministic lengths has been studied in the literature for a variety of regular topologies. Algorithms that execute the MNB in optimal time exist for the case of the ring ([BeT89]), the d-dimensional wraparound mesh ([Ozv87], [BeT89]), the hypercube ([SaS85], [SaS86], [Ozv87], [BeT89], [BOS89], [JoH89]), and other topologies. A common assumption in the communication model adopted by the previous authors is that all the packets require one unit of time in order to travel over a link (i.e., the transmission time plus the propagation delay over the link and the processing delay at the node all sum up to one unit). As a result, the algorithms found were synchronous and assumed the existence of a global clock.

In this paper we will relax some of the previous assumptions. The existence of a global clock is no longer assumed. Furthermore, the lengths of the packets broadcast are not deterministic, but are distributed according to some probabilistic rule. In order to be able to make comparisons with the fixed packet length case, the mean value of the packet lengths will be taken equal to one unit. For both the hypercube and the ring network of processors the number of nodes will be denoted by $n$.

For the purposes of this paper, a hypercube network (or d-cube) consists of the set of points in d-dimensional space with each coordinate equal to zero or one.
There is a bidirectional communication link for every two points differing in a single coordinate. We thus obtain an undirected graph with the processors as nodes and the communication links as arcs. The binary string of length $d$ that corresponds to the coordinates of a node of the d-cube is referred to as the identity number (or ID) of the node. When confusion cannot arise, we refer to a d-cube node interchangeably in terms of its identity number (a binary string of length $d$) and in terms of the decimal representation of its identity number. Thus, for example, the nodes (00...0), (00...1), and (11...1) will also be referred to as nodes 0, 1, and $2^d - 1$, respectively. For any two binary strings $x$ and $y$ we denote by $x \oplus y$ the bitwise exclusive OR of $x$ and $y$. We also denote by $\log n$ and $\ln n$ the base-2 and the natural logarithm of $n$, respectively.

The optimal completion time of the MNB in a d-dimensional hypercube with $n = 2^d$ nodes, when each packet requires one time unit (or slot) for transmission over a link, was found in [BOS89] (see also [BeT89]) to be $\lceil (n-1)/d \rceil$ time slots, where by $\lceil x \rceil$ we denote the smallest integer which is greater than or equal to $x$. We evaluate the time complexity of the optimal algorithm described in [BOS89] when the packet lengths are not constant, but are distributed according to some probabilistic rule. We consider the natural adaptation of the optimal deterministic algorithm which can deal with the probabilistic case. In particular, we assume that the same schedule of packet transmissions (i.e., the order in which the packets are transmitted over the links) as that of the optimal synchronous algorithm is followed; however, the timing is not the same, since the model is no longer deterministic.
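The hypercube conventions above (IDs as d-bit integers, links between IDs differing in one bit, and the unit-length optimum $\lceil (n-1)/d \rceil$ quoted from [BOS89]) can be made concrete with a small sketch. The function names are illustrative and do not come from the paper.

```python
def hypercube_neighbors(node: int, d: int) -> list[int]:
    """IDs of the d neighbors of `node` in a d-cube: flip one bit at a time."""
    return [node ^ (1 << k) for k in range(d)]


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which the IDs x and y differ (ones in x XOR y)."""
    return bin(x ^ y).count("1")


def optimal_mnb_slots(d: int) -> int:
    """ceil((n - 1) / d) with n = 2**d, the unit-length optimum quoted from [BOS89]."""
    n = 2 ** d
    return -(-(n - 1) // d)  # integer ceiling, avoids floating point


if __name__ == "__main__":
    print(hypercube_neighbors(0, 3))   # neighbors of node 000 in a 3-cube
    print(optimal_mnb_slots(5))        # slots needed in a 5-cube
```

Integer arithmetic is used for the ceiling so that the value stays exact for large d.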
Let $T_{MNB}$ be the time required for the completion of the MNB in the asynchronous probabilistic case and let $T_s$ be the corresponding optimal time for the synchronous deterministic case. For a given $n$, $T_{MNB}$ is a random variable and $T_s$ is a constant. We assume that the probability distribution of the packet lengths has unit mean and that the corresponding characteristic function $\Phi(s)$ exists for some $s > 0$. We prove that given any $\delta > 0$ and $\epsilon > 0$, we can find $n_0 = n_0(\delta, \epsilon)$ such that

$$\Pr\big(T_{MNB} \le (1+\delta)T_s\big) \ge 1 - \epsilon \qquad \text{(for the hypercube)}$$

for all $n \ge n_0$. This means that as $n \to \infty$, the MNB is completed with probability one in time less than $(1+\delta)T_s$, where $\delta$ is arbitrarily small. Thus, the probabilistic nature of the packet lengths does not appreciably degrade the performance of the optimal MNB algorithm for large hypercubes. This is a rather surprising result; as shown in Section 2, in the case of the ring, the average completion time of the MNB with random packet lengths increases substantially over the corresponding deterministic case. In particular, the mean time $E(T_{MNB})$ required to complete the MNB in a ring for exponentially distributed packet lengths with mean one time unit is

$$E(T_{MNB}) \approx (C + \ln n)\,T_s \qquad \text{(for the ring with } n \text{ nodes)},$$

where $C = 0.577215$ is Euler's constant, and $T_s = \lceil (n-1)/2 \rceil$ is the time to execute the MNB in the case of unit packet lengths.

The organization of the paper is the following. The MNB in a ring is treated in Section 2. The analysis found there is rather straightforward, but it does give some insight for the case of the hypercube. Sections 3-5 deal with the hypercube network of processors. In Section 3 we describe the communication algorithm (scheduling) that will be used. Section 4 derives a loose upper bound for $E(T_{MNB})$ when the packet lengths are exponentially distributed.
Our main result, which holds for general distributions of the packet lengths, appears in Propositions 3 and 4 of Section 5. Finally, the appendices at the end resolve some technical issues arising in our analysis.

2. MULTINODE BROADCAST IN A RING

Consider a multinode broadcast in a ring. In the case when all the packets require one unit of time (or slot) to be transmitted over a link, the optimal time to perform the task is $\lceil (n-1)/2 \rceil$, where $n$ is the number of processors of the ring. The following algorithm, found in [BeT89], achieves this optimal time. At the first slot, each node sends its packet to its clockwise and counterclockwise neighbors. During slots $2, \ldots, \lceil (n-1)/2 \rceil$, every node sends to its clockwise (counterclockwise) neighbor the packet received from its counterclockwise (respectively, clockwise) neighbor at the previous slot (Figure 1).

Figure 1: Broadcast in a ring.

We are interested in the case where the lengths of the packets generated by the nodes (and therefore the time required to transmit them over a link) are not constant but follow some probability distribution. We will evaluate the mean time $E(T_{MNB})$ required to complete the MNB. The remainder of this section consists of two parts. In subsection 2.1 we derive a general expression for $E(T_{MNB})$ and apply it to the case where the packet lengths follow a uniform distribution. In subsection 2.2 we calculate $E(T_{MNB})$ for the case of exponentially distributed packet lengths.

2.1 General Expression for $E(T_{MNB})$ - Uniform Distribution and Bounded Distribution of Lengths

Let $x_i$, $i = 1, \ldots, n$, be the time required to transmit the packet generated at node $i$ over a link of the ring. Since for each processor $i$ there is a processor which is $\lceil (n-1)/2 \rceil$ links away from $i$, and since the packet from $i$ has to be broadcast to all the other nodes, we have $T_{MNB} \ge \lceil (n-1)/2 \rceil\, x_i$, for all $i$.
Thus,

$$T_{MNB} \ge \left\lceil \frac{n-1}{2} \right\rceil \max_i x_i.$$

On the other hand, it can be seen that

$$T_{MNB} \le \left\lceil \frac{n-1}{2} \right\rceil \max_i x_i,$$

by observing that a MNB in a ring with packets of lengths (in time units) $x_i$, $i = 1, \ldots, n$, cannot require more time than the time required when all the packets are equal to the largest of them. An alternative way to prove this inequality is to note that the largest packet generated by a node is the bottleneck of the algorithm, in the sense that by the time it arrives at the node opposite to its source on the ring, all the other packets of smaller length have already arrived at the node opposite to their source. Thus

$$T_{MNB} = \left\lceil \frac{n-1}{2} \right\rceil \max_i x_i, \qquad \text{and} \qquad E(T_{MNB}) = \left\lceil \frac{n-1}{2} \right\rceil E(\max_i x_i). \tag{1}$$

This equation holds for any probability distribution of the $x_i$'s. In what follows we will examine specific distributions.

We first note that if the $x_i$'s are bounded by a constant $B$, then the time required to execute the MNB is bounded by $\lceil (n-1)/2 \rceil B$. Assume now that the lengths of the packets are independent and uniformly distributed over the interval $(0, u)$. Then, a straightforward calculation yields

$$E(\max_i x_i) = \frac{un}{n+1},$$

and the desired time is

$$E(T_{MNB}) = \left\lceil \frac{n-1}{2} \right\rceil \frac{un}{n+1}.$$

If $u = 2$, so that the mean transmission time $E(x_i)$ of a packet is one time unit, the preceding formula becomes

$$E(T_{MNB}) = \frac{2n}{n+1} \left\lceil \frac{n-1}{2} \right\rceil = \frac{2n}{n+1}\, T_s,$$

where $T_s$ denotes the optimal completion time for the corresponding deterministic case.

2.2 Exponentially distributed length of packets

Assume now that the transmission times $x_i$ over a single link are exponentially distributed independent random variables with mean one time unit. Then we get

$$\Pr(x_i \le X) = 1 - e^{-X}.$$

Since $\Pr(\max_i x_i \le X) = \Pr(x_1 \le X, x_2 \le X, \ldots, x_n \le X)$ and the $x_i$'s are independent and identically distributed random variables, we obtain

$$\Pr(\max_i x_i \le X) = (1 - e^{-X})^n.$$

On the other hand we have

$$E(\max_i x_i) = \int_0^\infty \Pr(\max_i x_i > X)\, dX = \int_0^\infty \big(1 - (1 - e^{-X})^n\big)\, dX.$$

By calculating this integral, we get, after some manipulation, that

$$E(\max_i x_i) = 1 + \frac{1}{2} + \cdots + \frac{1}{n}.$$

Therefore,

$$E(T_{MNB}) = \left\lceil \frac{n-1}{2} \right\rceil \left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right).$$

For large $n$, the sum of the first $n$ terms of the harmonic series is approximately $\ln n$. In order to see this, we integrate $1/x$ from 1 to $n$, and bound the integral from above and below by discretizing it. Then we find that

$$\frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} < \int_1^n \frac{1}{x}\, dx = \ln n < 1 + \frac{1}{2} + \cdots + \frac{1}{n-1},$$

which gives

$$\ln n + \frac{1}{n} < \sum_{k=1}^{n} \frac{1}{k} < \ln n + 1. \tag{2}$$

If more accuracy is required, the following formula ([GrR80], p. 2) can be used:

$$\sum_{k=1}^{n} \frac{1}{k} = C + \ln n + \frac{1}{2n} - \sum_{k=2}^{\infty} \frac{A_k}{n(n+1)\cdots(n+k-1)},$$

where

$$A_k = \frac{1}{k} \int_0^1 x(1-x)(2-x)(3-x)\cdots(k-1-x)\, dx$$

and $C = 0.577215$ is Euler's constant. It is shown in Appendix 1 that

$$\lim_{n \to \infty} \left( \frac{1}{2n} - \sum_{k=2}^{\infty} \frac{A_k}{n(n+1)\cdots(n+k-1)} \right) = 0.$$

Therefore, for large $n$, we obtain the following approximation:

$$E(T_{MNB}) \approx \left\lceil \frac{n-1}{2} \right\rceil (C + \ln n).$$

By comparing the results for the uniform and the exponential distributions, one can see an interesting difference. For the uniform distribution with unit mean packet length ($u = 2$), the expected time for the completion of the multinode broadcast is $\frac{2n}{n+1} T_s \approx 2T_s$, where $T_s$ is the completion time for the corresponding synchronous deterministic case; for the exponential distribution the corresponding mean value of the completion time is approximately $\ln n \cdot T_s$. Thus, the mean completion time for a MNB in a ring strongly depends upon the particular distribution of the packet lengths. It is not only the mean, but also the tail of the distribution that plays a significant role. We will see that in the hypercube, the particular distribution of the packet lengths plays a less important role.

3. MODIFIED ALGORITHM FOR THE MNB IN A HYPERCUBE

In the remainder of the paper we will be dealing with a multinode broadcast in a hypercube network of processors. In order to obtain complexity results for the MNB in the random packet length case, we will analyze a slightly modified version of the optimal synchronous algorithm found in [BOS89] and [BeT89].
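Returning briefly to the ring results of Section 2, the closed forms obtained there are easy to evaluate numerically. The following sketch (the function names are illustrative, not from the paper) computes the ring value $T_s = \lceil (n-1)/2 \rceil$, the exact harmonic sum appearing in the exponential case, and the mean completion times for the uniform ($u = 2$) and exponential cases, so that the bounds (2) and the approximation $C + \ln n$ can be checked for a given $n$.

```python
from fractions import Fraction
from math import log

EULER_C = 0.5772156649  # Euler's constant (quoted as 0.577215 in the text)


def ring_ts(n: int) -> int:
    """Deterministic optimum for the ring: ceil((n - 1) / 2)."""
    return -(-(n - 1) // 2)


def harmonic(n: int) -> Fraction:
    """Exact value of 1 + 1/2 + ... + 1/n, i.e. E(max x_i) for unit-mean exponentials."""
    return sum((Fraction(1, k) for k in range(1, n + 1)), Fraction(0))


def ring_mean_exponential(n: int) -> float:
    """E(T_MNB) in the ring for exponential lengths with unit mean."""
    return ring_ts(n) * float(harmonic(n))


def ring_mean_uniform(n: int) -> float:
    """E(T_MNB) in the ring for lengths uniform on (0, 2): T_s * 2n / (n + 1)."""
    return ring_ts(n) * 2 * n / (n + 1)


if __name__ == "__main__":
    for n in (8, 64, 1024):
        ratio_exp = ring_mean_exponential(n) / ring_ts(n)
        ratio_uni = ring_mean_uniform(n) / ring_ts(n)
        print(n, round(ratio_exp, 4), round(log(n) + EULER_C, 4), round(ratio_uni, 4))
```

The printed ratios illustrate the contrast drawn in the text: the uniform case stays below 2 while the exponential case grows like $\ln n$.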
The results obtained for the modified algorithm will hold for the optimal algorithm as well. In this section, we first explain why a modified algorithm is analyzed instead of the asynchronous version of the optimal algorithm. We then give the description of the modified algorithm.

The optimal synchronous algorithm of [BOS89] can be viewed as consisting of $d$ phases. During phase $i$ the packet that originates at node $s$, where $s = 0, \ldots, n-1$, arrives at the nodes that are located $i$ links away from $s$. When $d$ is not prime, the phases may have to overlap in order for the completion time of the MNB to be strictly optimal. This complicates the analysis of the asynchronous case, since the time required to complete a phase will be affected by previous phases that have not finished yet. The reason is that in phase $i + 1$, a packet of some origin node may be scheduled to be transmitted over some link after the packet of another origin, but the latter packet may not have yet completed phase $i$. To circumvent this difficulty, we modify the algorithm so that its phases are not allowed to overlap. The modified algorithm is the same as the optimal algorithm, except that there is a constraint that each packet begins phase $i + 1$ only after all packets have completed phase $i$. The completion time of the modified algorithm is slightly larger than the actual $T_{MNB}$ achieved by the optimal algorithm.

We now describe briefly the modified algorithm, assuming the reader is somewhat familiar with the optimal synchronous version given in [BOS89] and [BeT89]. We first note that if we find $n$ synchronous single node broadcast algorithms in a hypercube, each one originating at a different node, and such that no two of them use the same link during a slot, then we have a MNB algorithm. Let $A_l(0)$ be the set of links on which the packet originated at node 00...0 is transmitted during the $l$th slot.
Obviously, each link in $A_l(0)$ connects two nodes with IDs that differ in a specific bit position. Our aim will be to define $A_l(0)$ in a way that no two links in $A_l(0)$ connect nodes whose IDs differ in the same position. If we do so, then the sets

$$A_l(s) = \{(s \oplus x,\ s \oplus y) \mid (x, y) \in A_l(0)\}, \qquad l = 1, 2, \ldots$$

can be the sets of links on which the packet generated at node $s$ is transmitted during the $l$th slot, for all $s = 0, 1, \ldots, n-1$. The fact that $A_l(s) \cap A_l(u) = \emptyset$ for $s \ne u$ can be seen by observing that $s \oplus x$ and $s \oplus y$ differ in a particular bit if and only if $x$ and $y$ differ in the same bit. Thus for $s = 0, 1, \ldots, n-1$, the sets $A_l(s)$ do not have common elements for a specific $l$, provided of course that $A_l(0)$ satisfies the condition mentioned above. In this way it is guaranteed that no two packets will claim the same link during the same slot. What is left to be done now is to specify $A_l(0)$.

For $i = 1, \ldots, d$, we denote by $N_i$ the set of d-bit binary numbers with exactly $i$ ones. The cardinality of $N_i$ is $\binom{d}{i}$. Each set $N_i$ is in turn partitioned into disjoint subsets $R_{i1}, \ldots, R_{in_i}$, which are equivalence classes under a single bit rotation to the left. $R_{i1}$ is selected to be the class of the element of $N_i$ whose $i$ rightmost bits are unity. Then each node ID $t$ is associated with a distinct number $m(t) \in \{1, 2, \ldots, n-1\}$ in the order

$$R_{11}\, R_{21} \ldots R_{2n_2} \ldots R_{(d-2)n_{d-2}}\, R_{(d-1)1}\, (11\ldots1). \tag{4}$$

The first element in each set $R_{ij}$ is chosen so that its bit in position $1 + [(m(t) - 1) \pmod d]$ from the right is a one. The subsequent elements of $R_{ij}$ are found by rotating the first element to the left. We successively group together $d$ elements of the set $N_i$ into sets $E_{ij}$, $j = 1, \ldots, \lceil \binom{d}{i}/d \rceil$, in the order they appear in (4). When $d$ is a prime integer, each one of the equivalence classes $R_{ij}$ (except for $R_{01}$ and $R_{d1}$) has $d$ elements. This happens because when $d$ is prime, all left shifts of less than $d$ positions produce distinct d-bit binary numbers.
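The partition of $N_i$ into classes under left rotation can be explored with a short script. This is only a sketch under stated assumptions: the helper names are hypothetical, the classes are listed in increasing order of their smallest element rather than in the order (4), and the rule for choosing the first element of each class is not implemented.

```python
def rotate_left(x: int, d: int) -> int:
    """Rotate the d-bit number x by one position to the left."""
    mask = (1 << d) - 1
    return ((x << 1) | (x >> (d - 1))) & mask


def rotation_classes(d: int, i: int) -> list[list[int]]:
    """Partition the d-bit numbers with exactly i ones into rotation classes."""
    seen: set[int] = set()
    classes: list[list[int]] = []
    for x in range(1 << d):
        if bin(x).count("1") != i or x in seen:
            continue
        cls = []
        y = x
        while y not in seen:      # follow left rotations until the cycle closes
            seen.add(y)
            cls.append(y)
            y = rotate_left(y, d)
        classes.append(cls)
    return classes


if __name__ == "__main__":
    for d, i in ((5, 2), (4, 2)):
        sizes = [len(c) for c in rotation_classes(d, i)]
        print(f"d={d}, i={i}: class sizes {sizes}")
```

For $d = 5$ and $i = 2$ the script yields two classes of five elements each, while for $d = 4$ and $i = 2$ it yields one class of four elements and one of two, which illustrates why the regrouping into sets $E_{ij}$ of $d$ elements is needed when $d$ is not prime.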
Then there are $\binom{d}{i}/d$ equivalence classes in $N_i$, each with $d$ elements, and the sets $R_{ij}$ and $E_{ij}$ coincide. If $d$ is not prime, we will have $\lceil \binom{d}{i}/d \rceil$ sets $E_{ij}$, which we order as in (4).

Consider now the following graph consisting of nodes $0, 1, \ldots, n-1$. Every element of $E_{ij}$ is connected to an element in $N_{i-1}$ so that each connection corresponds to a reversal of a different bit. Every element of $E_{ij}$ is connected to exactly one element of $N_{i-1}$. In this way a graph is obtained which is the single node broadcast tree for node 0. The links connecting layers $l - 1$ and $l$ of this graph constitute the links of $A_l(0)$. By construction, these links satisfy the conditions that we have set for $A_l(0)$ (see also [BeT89], p. 60). Then, the broadcast tree of node $s$ can be obtained from the broadcast tree of node 0 by simply adding $s$ (mod 2, bit by bit) to the ID of each node of the tree. The broadcast tree of node 0 for $d = 5$ is shown in Figure 2.

Figure 2: The broadcast tree of node 0 for d = 5. (The layers $N_1, \ldots, N_4$ are shown partitioned into the classes $R_{11}$, $R_{21}$, $R_{22}$, $R_{31}$, $R_{32}$, $R_{41}$; terminal packets are marked.)

4. A LOOSE UPPER BOUND FOR $E(T_{MNB})$

Assume that the lengths of the packets that are broadcast by the nodes are independent exponentially distributed random variables, and consider the asynchronous version of the algorithm presented in Section 3. This algorithm should be interpreted as specifying the order in which the packets are transmitted over the links and not the exact timing. We are interested in the completion time $T_{MNB}$ of the algorithm.
The following upper bound for the average time required for the MNB can be found using the reasoning developed in Section 2 for the ring topology:

$$E(T_{MNB}) \le \left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\right) T_s \approx (\ln n + C)\, T_s, \tag{5}$$

where $C$ is Euler's constant and $T_s = \lceil (n-1)/d \rceil$ is the time required by the deterministic optimal algorithm. Relation (5) can be proved by arguing that the time for the MNB when the packets which are broadcast by nodes $0, 1, \ldots, n-1$ have variable lengths $x_0, x_1, \ldots, x_{n-1}$ cannot be longer than the time for a MNB in which all packets have lengths equal to $\max_i x_i$. This gives

$$T_{MNB} \le (\max_i x_i)\, T_s,$$

and by using the relation $E(\max_i x_i) = 1 + 1/2 + \cdots + 1/n \approx \ln n + C$, used in Section 2, we obtain the bound (5). However, this bound is not tight as in the case of the ring, because the packet with the greatest length is not necessarily the one that determines the completion time of the MNB in a hypercube. The reason is that, during the MNB in hypercubes, a packet spends most of its time waiting behind other packets that were scheduled to use a link before it. Thus, although the bound in (5) is valid, it is not tight because it corresponds to a case in which the packet having the maximum length among all the packets has to wait behind other packets which also have maximum length. This is not a typical scenario and the mean completion time is considerably overestimated.

5. ASYMPTOTIC BEHAVIOR OF $T_{MNB}$ AS $n \to \infty$

To obtain a tighter estimate of $T_{MNB}$, a different approach based on Markoff's inequality will be used. We will look at the asymptotic behavior of $T_{MNB}$ as the dimension $d$ gets large.

During phase $i$ of the synchronous modified algorithm, the packets are received by all the destinations that are at a Hamming distance $i$ from the source of the packets. Phase $i$ consists of $b_i = \lceil \binom{d}{i}/d \rceil$ steps.
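To get a feel for the phase lengths, the sketch below tabulates the step counts of the non-overlapping phases, reading $b_i$ as $\lceil \binom{d}{i}/d \rceil$ (which agrees with the values $b_2 = \lceil (d-1)/2 \rceil$ and $b_3 = \lceil (d-1)(d-2)/6 \rceil$ used later in this section), and compares their sum with $T_s = \lceil (n-1)/d \rceil$. The function names are illustrative only.

```python
from math import comb


def phase_steps(d: int) -> list[int]:
    """Step counts b_1, ..., b_d, with b_i = ceil(C(d, i) / d)."""
    return [-(-comb(d, i) // d) for i in range(1, d + 1)]


def synchronous_optimum(d: int) -> int:
    """T_s = ceil((n - 1) / d) with n = 2**d."""
    return -(-(2 ** d - 1) // d)


if __name__ == "__main__":
    for d in (4, 5, 6, 7, 10):
        b = phase_steps(d)
        print(f"d={d}: b={b}, sum={sum(b)}, T_s={synchronous_optimum(d)}")
```

For prime $d$ (for example 5 or 7) the sum of the $b_i$ equals $T_s$, while for $d = 4$ or $d = 6$ it exceeds $T_s$ by a small amount, in line with the remark in Section 3 that the modified algorithm is slightly slower than the optimal one.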
As already mentioned, in order to analyze the case of random packet lengths, the scheduling of the asynchronous modified algorithm described in Section 3 will be used. Phase $i$ of the algorithm is considered to have been completed only when all the packets have completed phase $i$. There are $d$ copies of each packet which, upon completing transmission, mark the end of phase $i$ of this packet. These are the $r_d\big(\binom{d}{i}\big)$ copies which are transmitted during step $\sum_{j=1}^{i} b_j$ (last step of phase $i$), and the $d - r_d\big(\binom{d}{i}\big)$ copies which are transmitted during step $\sum_{j=1}^{i} b_j - 1$ (next to the last step of phase $i$) of the synchronous algorithm, where we denote by $r_x(y)$ the remainder of the division of $y$ by $x$. We will refer to these packets as the terminal packets of phase $i$. In Fig. 2 we show the terminal packets of phase 2 for the broadcast tree rooted at node 0.

Let us number the terminal packets of phase $i$ as $j = 1, 2, \ldots, nd$ (there are $d$ terminal packets per origin node, so the total is $nd$). Let us also denote by $w_{ij}$ the time that elapses between the beginning of phase $i$ and the time the $j$th of these terminal packets completes phase $i$. For every terminal packet of phase $i$, the time required for the packet to complete phase $i$ consists of at most $b_i$ packet transmissions: $b_i - 1$ or $b_i - 2$ transmissions of other packets that were scheduled to use the link before it, and one transmission of the copy of the packet under consideration. Then we can write

$$w_{ij} = \sum_{l \in A_{ij}} x_l,$$

where $A_{ij}$ is a set of distinct integers between 0 and $n - 1$ which has cardinality less than or equal to $b_i$, and $x_l$ is the length of the packet which originates from node $l$. In particular, a node belongs to the set $A_{ij}$ if a packet generated at that node is scheduled to use, during the $i$th phase, the same link that terminal packet $j$ of phase $i$ will use.

Let $\Phi(s) = E(e^{sx})$ be the characteristic function of the distribution of the packet lengths and let $D^+ = \{s > 0 \mid E(e^{sx}) \text{ exists}\}$ be the positive portion of its domain.
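For the two unit-mean distributions used in Section 2, $\Phi(s)$ has a closed form, which makes the definitions above concrete: for exponential lengths $\Phi(s) = 1/(1-s)$ with $D^+ = (0, 1)$, and for lengths uniform on $(0, 2)$, $\Phi(s) = (e^{2s} - 1)/(2s)$ with $D^+ = (0, \infty)$. The sketch below (function names are illustrative) evaluates these, and also the standard Markov-type tail bound $\Pr(w > t) \le \Phi(s)^b e^{-st}$ for a sum $w$ of $b$ independent packet lengths, which is the kind of estimate the text develops next.

```python
from math import exp, inf


def phi_exponential(s: float) -> float:
    """E(e^{sx}) for unit-mean exponential lengths; finite only for s < 1."""
    return 1.0 / (1.0 - s) if s < 1.0 else inf


def phi_uniform(s: float, u: float = 2.0) -> float:
    """E(e^{sx}) for lengths uniform on (0, u); u = 2 gives unit mean."""
    return 1.0 if s == 0.0 else (exp(s * u) - 1.0) / (s * u)


def tail_bound(phi, b: int, t: float, s: float) -> float:
    """Markov bound on Pr(w > t), w a sum of b independent lengths, for a given s > 0."""
    return min(1.0, phi(s) ** b * exp(-s * t))


if __name__ == "__main__":
    # Probability bound that 10 exponential packets take more than 20 time units.
    print(tail_bound(phi_exponential, 10, 20.0, 0.5))
    # Same question for uniform(0, 2) packets, with an arbitrarily chosen s.
    print(tail_bound(phi_uniform, 10, 20.0, 1.0))
```

The value of $s$ is a free parameter in $D^+$; here it is picked by hand rather than optimized.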
Since $w_{ij}$ is the sum of at most $b_i$ independent and identically distributed random variables, we have (since $\Phi(s) \ge 1$ for $s \in D^+$)

$$E(e^{s w_{ij}}) \le \Phi(s)^{b_i}, \qquad \forall\ s \in D^+.$$

Also, from Markoff's inequality [Ros83] we have

$$\Pr(e^{s w_{ij}} > a) \le \frac{E(e^{s w_{ij}})}{a}, \qquad a > 0,\ s \in D^+.$$

From the last two relations we obtain

$$\Pr\left(w_{ij} \le \frac{\ln a}{s}\right) \ge 1 - \frac{\Phi(s)^{b_i}}{a}, \qquad \forall\ a > 0,\ s \in D^+. \tag{6}$$

The time required for the completion of the $i$th phase by the $nd$ terminal packets of that phase is

$$T_i = \max_{j = 1, \ldots, nd} \{w_{ij}\}.$$

Therefore, for all $s \in D^+$ and $a > 0$,

$$\Pr\left(T_i \le \frac{\ln a}{s}\right) = \Pr\left(w_{ij} \le \frac{\ln a}{s},\ j = 1, \ldots, nd\right). \tag{7}$$

The $w_{ij}$'s that appear at the right hand side of (7) are not independent random variables, so we cannot readily use (6) to estimate $\Pr(T_i \le \ln a / s)$. To proceed further with the analysis, the following two propositions are needed.

Proposition 1: Let $IJ \subset N \times N$ be a subset of the set of index pairs $(i, j)$ with $1 \le i \le d$ and $1 \le j \le nd$. Then for all $m \in \{1, \ldots, d\}$, $k \in \{1, \ldots, nd\}$, and all $a > 0$, we have

$$\Pr\big(w_{ij} \le a,\ \forall\ (i,j) \in IJ \mid w_{mk} \le a\big) \ge \Pr\big(w_{ij} \le a,\ (i,j) \in IJ\big).$$

Proof: We know that $w_{ij} = \sum_{l \in A_{ij}} x_l$. Let $\bar{x} = (x_0, x_1, \ldots, x_{n-1})$ be the random vector of packet lengths. We define the sets

$$E_{ij} = \Big\{\bar{x} \in R^n \;\Big|\; w_{ij} = \max_{(p,q) \in IJ} w_{pq}\Big\} = \Big\{\bar{x} \in R^n \;\Big|\; \sum_{l \in A_{ij}} x_l = \max_{(p,q) \in IJ} \sum_{l \in A_{pq}} x_l\Big\}.$$

In order for the sets $E_{ij}$, $(i,j) \in IJ$, to be disjoint, we say that if for some $\bar{x}$ there are two pairs $(i_1, j_1)$ and $(i_2, j_2)$ in $IJ$, with $(i_1, j_1) \prec (i_2, j_2)$ ($\prec$ is any order in $N \times N$, for example the lexicographic order), such that $w_{i_1 j_1} = w_{i_2 j_2}$, then this $\bar{x}$ will be considered to belong to $E_{i_1 j_1}$, and not to $E_{i_2 j_2}$. Then

$$\Pr\big(w_{ij} \le a,\ (i,j) \in IJ \mid w_{mk} \le a\big) = \sum_{(i,j) \in IJ} \Pr(\bar{x} \in E_{ij}) \Pr\big(w_{ij} \le a \mid \bar{x} \in E_{ij},\ w_{mk} \le a\big). \tag{8}$$

In Appendix 2 we prove that $\Pr(w_{ij} \le a \mid w_{mk} \le b) \ge \Pr(w_{ij} \le a)$ for all $a, b > 0$. It can be seen that the proof of the last inequality is not altered if both probabilities are conditioned on the fact $\bar{x} \in E_{ij}$. Thus,

$$\Pr\big(w_{ij} \le a \mid w_{mk} \le a,\ \bar{x} \in E_{ij}\big) \ge \Pr\big(w_{ij} \le a \mid \bar{x} \in E_{ij}\big).$$

The last relation together with (8) yields

$$\Pr\big(w_{ij} \le a,\ (i,j) \in IJ \mid w_{mk} \le a\big) \ge \sum_{(i,j) \in IJ} \Pr(\bar{x} \in E_{ij}) \Pr\big(w_{ij} \le a \mid \bar{x} \in E_{ij}\big) = \Pr\big(w_{ij} \le a,\ (i,j) \in IJ\big).$$

Note that if in the preceding proof we replace $w_{ij}$ by $w_{ij} + a - a_{ij}$, we get the more general relation

$$\Pr\big(w_{ij} \le a_{ij},\ (i,j) \in IJ \mid w_{mk} \le a_{mk}\big) \ge \Pr\big(w_{ij} \le a_{ij},\ (i,j) \in IJ\big). \tag{9}$$

Q.E.D.

Proposition 2: For all $a > 0$ and $k \in \{1, \ldots, nd\}$,

$$\Pr(w_{ij} \le a,\ j = 1, \ldots, k) \ge \prod_{j=1}^{k} \Pr(w_{ij} \le a). \tag{10}$$

Proof: The proof will be given by induction on $k$. For $k = 2$, we know from Appendix 2 that $\Pr(w_{i1} \le a \mid w_{i2} \le a) \ge \Pr(w_{i1} \le a)$, which by using Bayes' rule yields

$$\Pr(w_{i1} \le a,\ w_{i2} \le a) \ge \Pr(w_{i1} \le a) \Pr(w_{i2} \le a).$$

Suppose that the proposition is true for $k - 1$, or equivalently:

$$\Pr(w_{ij} \le a,\ j = 1, \ldots, k-1) \ge \prod_{j=1}^{k-1} \Pr(w_{ij} \le a). \tag{11}$$

Then

$$\Pr(w_{ij} \le a,\ j = 1, \ldots, k) = \Pr(w_{ij} \le a,\ j = 1, \ldots, k-1 \mid w_{ik} \le a) \Pr(w_{ik} \le a). \tag{12}$$

By using Proposition 1, (12) gives

$$\Pr(w_{ij} \le a,\ j = 1, \ldots, k) \ge \Pr(w_{ij} \le a,\ j = 1, \ldots, k-1) \Pr(w_{ik} \le a),$$

which with the aid of (11) is transformed to

$$\Pr(w_{ij} \le a,\ j = 1, \ldots, k) \ge \prod_{j=1}^{k} \Pr(w_{ij} \le a),$$

completing the induction. Q.E.D.

The next proposition is an intermediate result leading to our main result.

Proposition 3: Let $s$ be a scalar in the positive portion $D^+$ of the domain of $\Phi(s)$, and let $\lambda > 0$ and $\mu > 0$ be any scalars such that

$$\xi \stackrel{\text{def}}{=} \frac{e^{\lambda s}}{\Phi(s)} > 1, \qquad \zeta \stackrel{\text{def}}{=} \frac{e^{\mu s/2}}{\sqrt{\Phi(s)}} > 2.$$

Then for each $s_0 \in D^+$ and $\theta > 1$, we have

$$\Pr\left(T_{MNB} \le \lambda \sum_{i=3}^{d-3} b_i + 2\mu b_2 + \frac{3\theta}{s_0} \ln n\right) \ge \left(1 - \frac{1}{n^{(d \log \xi)/7}}\right)^{nd(d-5)} \left(1 - \frac{\Phi(s_0)}{n^{\theta}}\right)^{3n} \left(1 - \frac{\zeta}{n^{\log \zeta}}\right)^{2nd}$$

for sufficiently large dimension $d$.

Proof: By combining (7) and (10) we get

$$\Pr\left(T_i \le \frac{\ln a}{s}\right) \ge \prod_{j=1}^{nd} \Pr\left(w_{ij} \le \frac{\ln a}{s}\right),$$

which with the aid of (6) gives

$$\Pr\left(T_i \le \frac{\ln a}{s}\right) \ge \left(1 - \frac{\Phi(s)^{b_i}}{a}\right)^{nd}. \tag{13}$$

Inequality (13) is valid for any $a > 0$ and any $s \in D^+$. We select $s \in D^+$ so that $\xi = e^{\lambda s}/\Phi(s) > 1$ and let $a = e^{\lambda s b_i}$.
By substituting these values of $a$ and $\xi$, (13) is transformed to

$$\Pr(T_i \le \lambda b_i) \ge \left(1 - \xi^{-b_i}\right)^{nd}. \tag{14}$$

By using the obvious inequality

$$\Pr\left(\sum_{i=p}^{q} T_i \le \sum_{i=p}^{q} c_i\right) \ge \Pr(T_i \le c_i,\ i = p, \ldots, q)$$

we further obtain

$$\Pr\left(\sum_{i=3}^{d-3} T_i \le \lambda \sum_{i=3}^{d-3} b_i\right) \ge \Pr(T_i \le \lambda b_i,\ i = 3, 4, \ldots, d-3). \tag{15}$$

Note that the $T_i$'s are not independent. However, we prove in Appendix 3 that for all $a_i > 0$,

$$\Pr(T_i \le a_i,\ i = 1, 2, \ldots, m-1 \mid T_m \le a_m) \ge \Pr(T_i \le a_i,\ i = 1, 2, \ldots, m-1). \tag{16}$$

(Relation (16) is intuitively clear, since the knowledge that the duration of the $m$th phase is less than $a_m$ cannot decrease the probability that the duration of phases $i = 1, 2, \ldots, m-1$ is less than $a_i$, $i = 1, 2, \ldots, m-1$, below its a priori value.) By successive use of (16), we obtain from (15)

$$\Pr\left(\sum_{i=3}^{d-3} T_i \le \lambda \sum_{i=3}^{d-3} b_i\right) \ge \prod_{i=3}^{d-3} \Pr(T_i \le \lambda b_i).$$

By combining this with the relations (14) for $i = 3, 4, \ldots, d-3$ we get

$$\Pr\left(\sum_{i=3}^{d-3} T_i \le \lambda \sum_{i=3}^{d-3} b_i\right) \ge \prod_{i=3}^{d-3} \Pr(T_i \le \lambda b_i) \ge \prod_{i=3}^{d-3} \left(1 - \xi^{-b_i}\right)^{nd}. \tag{17}$$

Since $\xi > 1$ and $b_i = \lceil \binom{d}{i}/d \rceil \ge b_3$ for $3 \le i \le d-3$, we get $\xi^{b_i} \ge \xi^{b_3}$, and (17) can be transformed to (there are $d - 5$ phases in this range)

$$\Pr\left(\sum_{i=3}^{d-3} T_i \le \lambda \sum_{i=3}^{d-3} b_i\right) \ge \left(1 - \xi^{-b_3}\right)^{nd(d-5)}. \tag{18}$$

Furthermore, we have that $b_3 = \lceil (d-1)(d-2)/6 \rceil \ge d^2/7$ for sufficiently large $d$. This in turn gives $\xi^{b_3} \ge \xi^{d^2/7} = n^{(d \log \xi)/7}$, since $\xi > 1$. Using this, relation (18) can be written as

$$\Pr\left(\sum_{i=3}^{d-3} T_i \le \lambda \sum_{i=3}^{d-3} b_i\right) \ge \left(1 - \frac{1}{n^{(d \log \xi)/7}}\right)^{nd(d-5)}, \tag{19}$$

with $\xi > 1$.

Note that $\sum_{i=3}^{d-3} T_i$ is the time required for the completion of phases $3, 4, \ldots, d-4, d-3$. Phases $1, 2, d-2, d-1$ and $d$ will be treated separately. The time to complete phases $1$, $d-1$ and $d$ is determined by the length of the longest packet. This is so because these phases consist of one step and, therefore, $T_i = \max_l x_l$, $i = 1, d-1, d$. Then by using Markoff's inequality we obtain

$$\Pr\left(T_i \le \frac{\ln a}{s}\right) = \Pr\left(x_l \le \frac{\ln a}{s},\ l = 0, 1, \ldots, n-1\right) \ge \left(1 - \frac{\Phi(s)}{a}\right)^{n}.$$

If we select $a = n^{\theta}$ with $\theta > 1$ and we let $s$ be equal to any $s_0 \in D^+$, the preceding inequality yields

$$\Pr\left(T_i \le \frac{\theta \ln n}{s_0}\right) \ge \left(1 - \frac{\Phi(s_0)}{n^{\theta}}\right)^{n}, \qquad \text{for } i = 1, d-1, d. \tag{20}$$

For the phases $i = 2$ and $i = d-2$, we get from relation (13) and the fact $b_2 = b_{d-2}$ that

$$\Pr\left(T_i \le \frac{\ln a}{s}\right) \ge \left(1 - \frac{\Phi(s)^{b_2}}{a}\right)^{nd}, \qquad \text{for } i = 2, d-2. \tag{21}$$

If we select $a = e^{\mu s b_2}$, then (21) is transformed to

$$\Pr(T_i \le \mu b_2) \ge \left(1 - \zeta^{-2 b_2}\right)^{nd}, \qquad \text{for } i = 2, d-2, \tag{22}$$

where $\zeta = e^{\mu s/2}/\sqrt{\Phi(s)} > 2$. Since $2b_2 = 2\lceil (d-1)/2 \rceil \ge d-1$, we get

$$\Pr(T_i \le \mu b_2) \ge \left(1 - \zeta^{-(d-1)}\right)^{nd} = \left(1 - \frac{\zeta}{n^{\log \zeta}}\right)^{nd}, \qquad \text{for } i = 2, d-2. \tag{23}$$

By combining equations (19), (20), and (23), and by using the fact $\sum_{i=1}^{d} T_i = T_{MNB}$, we finally obtain

$$\Pr\left(T_{MNB} \le \lambda \sum_{i=3}^{d-3} b_i + \frac{3\theta}{s_0} \ln n + 2\mu b_2\right) \ge \left(1 - \frac{1}{n^{(d \log \xi)/7}}\right)^{nd(d-5)} \left(1 - \frac{\Phi(s_0)}{n^{\theta}}\right)^{3n} \left(1 - \frac{\zeta}{n^{\log \zeta}}\right)^{2nd}, \tag{24}$$

with $\theta > 1$, $s_0 \in D^+$, $\zeta > 2$, and $\xi > 1$. Q.E.D.

Now we are in a position to prove the main result of the paper.

Proposition 4: Let $T_{MNB}$ be the completion time of the MNB when the lengths of the packets are distributed according to some probabilistic rule with unit mean, and let $T_s = \lceil (n-1)/d \rceil$ be the completion time of the MNB when packets have deterministic length equal to one time unit. Assume also that the positive portion $D^+$ of the domain of $\Phi(s)$ is nonempty. Then given any $\delta > 0$ and $\epsilon > 0$, we can find $n_0 = n_0(\delta, \epsilon)$ such that

$$\Pr\big(T_{MNB} \le (1+\delta) T_s\big) \ge 1 - \epsilon, \qquad \forall\ n \ge n_0.$$

Proof: Since $\Phi(s) < \infty$ for $s \in D^+$, there exists a $\lambda > 0$ such that $e^{\lambda s} > \Phi(s)$ for some $s > 0$, and a $\mu > 0$ such that $e^{\mu s} > 4\Phi(s)$. Let $\xi = e^{\lambda s}/\Phi(s) > 1$ and $\zeta = e^{\mu s/2}/\sqrt{\Phi(s)} > 2$. Consider also some $s_0 \in D^+$ and $\theta > 1$. Then the conditions of Proposition 3 are satisfied and equation (24) holds. In Appendix 4 we prove that for

$$\xi > 1, \qquad \zeta > 2, \qquad \theta > 1, \tag{25}$$

the right hand side of (24) goes to one as the number of processors $n$ goes to infinity. Thus, for all $\epsilon > 0$, we can find $n_1(\epsilon)$ such that for all $n \ge n_1(\epsilon)$

$$\Pr\left(T_{MNB} \le \lambda \sum_{i=3}^{d-3} b_i + \frac{3\theta}{s_0} \ln n + 2\mu b_2\right) \ge 1 - \epsilon. \tag{26}$$

We write $T_s(n) = \lceil (n-1)/d \rceil$ to make the dependence on $n$ explicit; $T_s(n)$ is the optimal completion time of the MNB for the synchronous (deterministic) case.
Since $n - 1 = \sum_{i=1}^{d} \binom{d}{i}$, it can be seen that

$$\frac{n-1}{d} \le \sum_{i=1}^{d} b_i = \sum_{i=1}^{d} \left\lceil \frac{1}{d}\binom{d}{i} \right\rceil \le \frac{n-1}{d} + d.$$

From this fact, it follows with some additional calculation that

$$\lim_{n \to \infty} \frac{T_a(n)}{T_s(n)} = \lambda,$$

where $T_a(n)$ denotes the time bound appearing in (26). Therefore, given some $\delta > 0$ we can find $n_2(\delta)$ such that

$$\lambda \sum_{i=3}^{d-3} b_i + 3\frac{\theta}{s_0} \ln n + 2\mu b_2 \le \left(\lambda + \frac{\delta}{2}\right) T_s(n) \tag{27}$$

for all $n \ge n_2(\delta)$. We define $n_0(\delta, \epsilon) = \max(n_1(\epsilon), n_2(\delta))$. For $n \ge n_0(\delta, \epsilon)$, both (26) and (27) hold. Thus for any $\delta > 0$ and any $\epsilon > 0$ there always exists an $n_0 = n_0(\delta, \epsilon)$, defined as above, such that for all $n \ge n_0(\delta, \epsilon)$

$$\Pr\left(T_{MNB} \le \left(\lambda + \frac{\delta}{2}\right) T_s\right) \ge 1 - \epsilon. \tag{28}$$

We will now prove that $\lambda$ can always be chosen to be equal to $1 + \delta/2$ for any $\delta > 0$. It is enough to prove that there exists an $s \in D^+$ such that

$$e^{(1 + \delta/2)s} > \Phi(s).$$

Let $F(s) = \Phi(s) e^{-(1 + \delta/2)s}$. Since $F(0) = \Phi(0) = 1$, it is enough to prove that $F(s)$ is strictly decreasing in a neighborhood of $0$. Since $\Phi(s)$ is differentiable (because the exponential function is differentiable), we have

$$F'(s) = e^{-(1 + \delta/2)s} \left(\Phi'(s) - \left(1 + \frac{\delta}{2}\right)\Phi(s)\right),$$

which gives $F'(0) = -\delta/2 < 0$ (we used the fact that $\Phi'(0) = 1$, since the packet lengths have unit mean). Therefore, there exists an $s \in D^+$ such that $F(s) < 1$ or equivalently $e^{(1 + \delta/2)s} > \Phi(s)$. Thus, we can always choose $\lambda = 1 + \delta/2$. By substituting this value of $\lambda$ in (28) the proof is completed. Q.E.D.

Proposition 4 constitutes the main result of this paper. It indicates that in hypercubes of large dimension, the factor by which the completion time of the MNB increases when the packet lengths have essentially any probability distribution, relative to the corresponding case where all the packets have constant lengths, is very close to one. This result should be compared with the case of the ring with $n$ processors, where the average time required for the MNB when the lengths of the packets are exponentially distributed is $\ln n$ times that of the corresponding deterministic case.

An intuitive explanation of this result is the following. Loosely speaking, as $d$ increases, the number of steps within each phase (except phases $1$, $2$, $d-2$, $d-1$, $d$) grows faster than $d$.
Therefore, the number of packets that are transmitted one after the other within some phase increases more rapidly than the number of phases. The sum of the lengths of the packets which are transmitted over some link during phase $i$ is then forced by the central limit theorem to come close to its mean value $b_i$. As a result the total time required for the MNB of the asynchronous probabilistic case approaches the time complexity of the synchronous deterministic case. Therefore, although the result may seem unexpected, it is not counter-intuitive.

We finally note that the existence of a global clock was not assumed at any point throughout the analysis. The only exception is the initialization of the algorithm, which was assumed to take place synchronously for all the processors. If this assumption is relaxed, the additional overhead for the initialization using a naive scheme (single node broadcast of a start signal) will be $O(d)$, which is small, so the result still holds.

REFERENCES

[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., "Parallel and Distributed Computation: Numerical Methods", Prentice-Hall, Englewood Cliffs, N.J., 1989.
[BOS89] Bertsekas, D. P., Ozveren, C., Stamoulis, G. D., Tseng, P., and Tsitsiklis, J. N., "Optimal Communication Algorithms for Hypercubes", Lab. for Information and Decision Systems Report LIDS-P-1847, M.I.T., Jan. 1989, J. of Parallel and Distributed Computing (to appear).
[BhI85] Bhatt, S. N., and Ipsen, I. C. F., "How to Embed Trees in Hypercubes", Yale University, Dept. of Computer Science, Research Report YALEU/DCS/RR-443, 1985.
[DaS87] Dally, W. J., and Seitz, C. L., "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks", IEEE Trans. on Computers, Vol. C-36, pp. 547-553, 1987.
[DNS81] Dekel, E., Nassimi, D., and Sahni, S., "Parallel Matrix and Graph Algorithms", SIAM J. Comput., Vol. 10, pp. 657-673, 1981.
[GrR80] Gradshteyn, I. S., and Ryzhik, I.
M., "Table of Integrals, Series, and Products", Academic Press, Orlando, Florida, 1980.
[HHL88] Hedetniemi, S. M., Hedetniemi, S. T., and Liestman, A. L., "A Survey of Gossiping and Broadcasting in Communication Networks", Networks, Vol. 18, pp. 319-349, 1988.
[Joh87] Johnsson, S. L., "Communication Efficient Basic Linear Algebra Computations on Hypercube Architectures", J. Parallel and Distr. Comput., Vol. 4, pp. 133-172, 1987.
[JoH89] Johnsson, S. L., and Ho, C. T., "Optimum Broadcasting and Personalized Communication in Hypercubes", IEEE Transactions on Computers, Vol. 38, pp. 1249-1268, 1989.
[KVC88] Krumme, D. W., Venkataraman, K. N., and Cybenko, G., "The Token Exchange Problem", Tufts University, Technical Report 88-2, 1988.
[McV87] McBryan, O. A., and Van de Velde, E. F., "Hypercube Algorithms and their Implementations", SIAM J. Sci. Stat. Comput., Vol. 8, pp. 227-287, 1987.
[Ozv87] Ozveren, C., "Communication Aspects of Parallel Processing", Laboratory for Information and Decision Systems Report LIDS-P-1721, M.I.T., Cambridge, MA, 1987.
[Ros83] Ross, S. M., "Stochastic Processes", J. Wiley, New York, 1983.
[SaS85] Saad, Y., and Schultz, M. H., "Data Communication in Hypercubes", Yale University Research Report YALEU/DCS/RR-428, October 1985 (revision of August 1987).
[SaS86] Saad, Y., and Schultz, M. H., "Data Communication in Parallel Architectures", Yale University Report, March 1986.
[SaS88] Saad, Y., and Schultz, M. H., "Topological Properties of Hypercubes", IEEE Trans. on Computers, Vol. 37, pp. 867-872, 1988.
[StT90] Stamoulis, G., and Tsitsiklis, J., "Efficient Routing Schemes for Multiple Broadcasts in Hypercubes", Laboratory for Information and Decision Systems, Report LIDS-P-1948, February 1990.
[StW87] Stout, Q. F., and Wagar, B., "Passing Messages in Link-Bound Hypercubes", in Proc. 1986 Hypercube Conference, SIAM, Philadelphia, pp. 251-257, 1987.
[Top85] Topkis, D.
M., "Concurrent Broadcast for Information Dissemination", IEEE Trans. Software Engineering, Vol. 13, pp. 207-231, 1983.
[Var90] Varvarigos, E. A., "Optimal Communication Algorithms for Multiprocessor Computers", M.S. Thesis, Dept. of Electrical Engineering and Computer Science, M.I.T.; also Center for Intelligent Control Systems Report CICS-TH-192, January 1990.
[VaB90] Varvarigos, E. A., and Bertsekas, D. P., "Optimal Communication Algorithms for Isotropic Tasks in Hypercubes and Wraparound Meshes", Laboratory for Information and Decision Systems, Report LIDS-P-1972, May 1990.

APPENDIX 1

In this appendix we will prove that

$$\lim_{n \to \infty} \sum_{k=2}^{\infty} \frac{A_k}{n(n+1)\cdots(n+k-1)} = 0.$$

Denote

$$F(n) = \sum_{k=2}^{\infty} \frac{A_k}{n(n+1)\cdots(n+k-1)}.$$

Since $A_k > 0$, obviously $F(n) > 0$ for all $n > 0$. From (2) and (3) we get that $C - F(n) > 0$, which gives $F(n) < C$ for all $n > 0$. Therefore, for $n = 1$, $F(1) < C$. It can also be seen from the definition of $F(n)$ that $F(n) \le F(1)/n$. This gives

$$0 < F(n) < \frac{C}{n}.$$

Therefore, $\lim_{n \to \infty} F(n) = 0$. Q.E.D.

APPENDIX 2

In this appendix we prove the inequality

$$\Pr(w_{ij} \le a \mid w_{mk} \le b) \ge \Pr(w_{ij} \le a), \tag{29}$$

which was used to prove Prop. 1. We have

$$w_{ij} = \sum_{l \in A_{ij}} x_l, \qquad w_{mk} = \sum_{l \in A_{mk}} x_l.$$

The dependence between $w_{ij}$ and $w_{mk}$ comes from the fact that some indices $l$ are common in $A_{mk}$ and $A_{ij}$. Let $C = A_{mk} \cap A_{ij}$, $A = A_{ij} - C$ and $B = A_{mk} - C$. Then if we denote $y = \sum_{l \in C} x_l$, $z = \sum_{l \in A} x_l$, and $r = \sum_{l \in B} x_l$, the random variables $y$, $z$, $r$ are independent, since for distinct $l$'s the random variables $x_l$ are independent. By using the above notation, it is enough to prove that

$$\Pr(y + z \le a \mid y + r \le b) \ge \Pr(y + z \le a). \tag{30}$$

Bayes' rule gives

$$\Pr(y + z \le a \mid y + r \le b) = \frac{\Pr(y + z \le a,\ y + r \le b)}{\Pr(y + r \le b)}.$$

We have that

$$\Pr(y + z \le a \mid y + r \le b,\ a - z \le b - r) = \frac{\Pr(y + z \le a \mid a - z \le b - r)}{\Pr(y + r \le b \mid a - z \le b - r)} \ge \Pr(y + z \le a \mid a - z \le b - r)$$

and

$$\Pr(y + z \le a \mid y + r \le b,\ a - z > b - r) = \frac{\Pr(y + r \le b \mid a - z > b - r)}{\Pr(y + r \le b \mid a - z > b - r)} \ge \Pr(y + z \le a \mid a - z > b - r).$$
These relations give

$$\begin{aligned}
\Pr(y + z \le a \mid y + r \le b) &= \Pr(y + z \le a \mid y + r \le b,\ a - z \le b - r)\Pr(a - z \le b - r) \\
&\quad + \Pr(y + z \le a \mid y + r \le b,\ a - z > b - r)\Pr(a - z > b - r) \\
&\ge \Pr(y + z \le a \mid a - z \le b - r)\Pr(a - z \le b - r) \\
&\quad + \Pr(y + z \le a \mid a - z > b - r)\Pr(a - z > b - r) \\
&= \Pr(y + z \le a).
\end{aligned}$$

As a result, relation (30), which is equivalent to relation (29), holds. Q.E.D.

APPENDIX 3

In this appendix, we prove Eq. (16), repeated below for convenience,

$$\Pr(T_i \le a_i,\ i = 1, 2, \ldots, m-1 \mid T_m \le a_m) \ge \Pr(T_i \le a_i,\ i = 1, 2, \ldots, m-1). \tag{31}$$

By definition we have

$$T_i = \max\{w_{ik} \mid k = 1, \ldots, nd\}.$$

Let $k_i$, $i = 1, 2, \ldots, m$, be the arguments that attain the maximum above for $i = 1, 2, \ldots, m$, and let $\Pr(k_i,\ i = 1, \ldots, m)$ be the corresponding probability. By substituting these equations in (31), we have to prove the equivalent relation

$$\begin{aligned}
&\sum_{k_1, \ldots, k_m} \Pr(k_i,\ i = 1, \ldots, m)\,\Pr(w_{ik_i} \le a_i,\ i = 1, 2, \ldots, m-1 \mid k_i,\ i = 1, 2, \ldots, m,\ w_{mk_m} \le a_m) \\
&\qquad \ge \sum_{k_1, \ldots, k_m} \Pr(k_i,\ i = 1, \ldots, m)\,\Pr(w_{ik_i} \le a_i,\ i = 1, 2, \ldots, m-1 \mid k_i,\ i = 1, 2, \ldots, m).
\end{aligned} \tag{32}$$

In Prop. 1 we proved [cf. Eq. (16)] that

$$\Pr(w_{ik_i} \le a_i,\ i = 1, 2, \ldots, m-1 \mid w_{mk_m} \le a_m) \ge \Pr(w_{ik_i} \le a_i,\ i = 1, 2, \ldots, m-1). \tag{33}$$

It can be seen from the proof of this proposition that the same result holds even if all the probabilities in (33) are conditioned on the event $\{k_i,\ i = 1, 2, \ldots, m\}$ and therefore

$$\Pr(w_{ik_i} \le a_i,\ i = 1, 2, \ldots, m-1 \mid k_i,\ i = 1, 2, \ldots, m,\ w_{mk_m} \le a_m) \ge \Pr(w_{ik_i} \le a_i,\ i = 1, 2, \ldots, m-1 \mid k_i,\ i = 1, 2, \ldots, m).$$

Using this, (32) is proved and from there (31) follows immediately. Q.E.D.

APPENDIX 4

In this appendix we will show that the right hand side of inequality (24) goes to 1 as $n \to \infty$, i.e.

$$\lim_{n \to \infty} \left(1 - \frac{\zeta}{n^{\log_2 \zeta}}\right)^{2nd} \left(1 - \frac{\Phi(s_0)}{n^{\theta}}\right)^{3n} \left(1 - \frac{1}{n^{(d \log_2 \xi)/7}}\right)^{nd^2} = 1 \tag{34}$$

when $\xi > 1$, $\zeta > 2$, and $\theta > 1$. We recall that $d = \log_2 n$. The proof consists of two steps. At first we show that all the terms of the product in (34) are of the form

$$\left(1 - \frac{1}{Q(x)}\right)^{x},$$

where $x = x(n)$ goes to infinity as $n$ goes to infinity and $Q(x)$ is a function of $x$ such that

$$\lim_{x \to \infty} \frac{x}{Q(x)} = 0. \tag{35}$$

As a second step we will prove that for a function $Q(x)$ satisfying (35), we have that

$$\lim_{x \to \infty} \left(1 - \frac{1}{Q(x)}\right)^{x} = 1.$$

Step 1: In order to prove that all the terms in (34) are of the desired form, we will use successive applications of L'Hospital's rule. This rule states that if $f(x)$, $g(x)$ are differentiable functions in the neighborhood of $\infty$ (i.e. for sufficiently large $x$) with the property that $\lim_{x \to \infty} f(x) = \lim_{x \to \infty} g(x) = \infty$, then

$$\lim_{x \to \infty} \frac{f(x)}{g(x)} = \lim_{x \to \infty} \frac{f'(x)}{g'(x)}.$$

Although the number of processors $n$ is an integer, we will treat it as a continuous variable here and allow differentiation with respect to it. This is permitted since we are interested in the limit $n \to \infty$ and we are dealing with continuous functions of $n$. Thus:

a) Since $\zeta > 2$, $\alpha \stackrel{\mathrm{def}}{=} \log_2 \zeta > 1$. Thus

$$\lim_{n \to \infty} \frac{2nd\,\zeta}{n^{\alpha}} = \lim_{n \to \infty} \frac{2\zeta\, n \ln n}{(\ln 2)\, n^{\alpha}} = \lim_{n \to \infty} \frac{2\zeta (1 + \ln n)}{(\alpha \ln 2)\, n^{\alpha - 1}} = \lim_{n \to \infty} \frac{2\zeta}{\alpha(\alpha - 1)(\ln 2)\, n^{\alpha - 1}} = 0$$

as $n \to \infty$ and $\alpha > 1$. Thus $\zeta^{-1} n^{\log_2 \zeta} = Q(2nd)$.

b) Obviously $\lim_{n \to \infty} \frac{3n\,\Phi(s_0)}{n^{\theta}} = 0$, since $\theta > 1$ and $\Phi(s_0)$ is a constant.

c) We denote $\alpha = \frac{\log_2 \xi}{7 \ln 2} > 0$ for $\xi > 1$ and take into account that $d = \log_2 n = \frac{\ln n}{\ln 2}$. Then for the third term of (34) to be of the desired form it is enough to prove that

$$\lim_{n \to \infty} \frac{n (\ln n)^2}{(\ln 2)^2\, n^{\alpha \ln n}} = 0.$$

By successive applications of L'Hospital's rule we find that

$$\lim_{n \to \infty} \frac{n (\ln n)^2}{n^{\alpha \ln n}} = \lim_{n \to \infty} \frac{n (\ln n + 2)}{2\alpha\, n^{\alpha \ln n}} = \lim_{n \to \infty} \frac{n (\ln n + 3)}{4\alpha^2 (\ln n)\, n^{\alpha \ln n}} = 0.$$

Therefore $n^{(d \log_2 \xi)/7} = Q(nd^2)$.

Step 2: For the second step we note that

$$\lim_{x \to \infty} \left(1 - \frac{1}{Q(x)}\right)^{x} = \lim_{x \to \infty} \left[\left(1 - \frac{1}{Q(x)}\right)^{Q(x)}\right]^{x/Q(x)}.$$

Since $\lim_{x \to \infty} \left(1 - \frac{1}{Q(x)}\right)^{Q(x)} = e^{-1}$ and $\lim_{x \to \infty} \frac{x}{Q(x)} = 0$, we finally obtain that

$$\lim_{x \to \infty} \left(1 - \frac{1}{Q(x)}\right)^{x} = 1.$$

Steps 1 and 2 together give Eq. (34). Q.E.D.
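The limit in (34) can also be checked numerically. The sketch below evaluates the product in log space, assuming the three factors are $(1 - \zeta/n^{\log_2 \zeta})^{2nd}$, $(1 - \Phi(s_0)/n^{\theta})^{3n}$ and $(1 - n^{-(d \log_2 \xi)/7})^{nd^2}$ with $n = 2^d$. The function name and the values $\xi = 2$, $\zeta = 3$, $\theta = 2$, $\Phi(s_0) = 2$ are hypothetical choices that merely satisfy $\xi > 1$, $\zeta > 2$, $\theta > 1$.

```python
import math


def product_in_34(d, xi=2.0, zeta=3.0, theta=2.0, phi_s0=2.0):
    """Evaluate the assumed three-factor product of (34) for n = 2**d.

    Each factor has the form (1 - t)**x; we sum x * log1p(-t) to avoid
    losing precision when t is tiny and x is huge.
    """
    n = 2.0 ** d
    ln_n = d * math.log(2.0)

    # Phases 2 and d-2: t = zeta / n**log2(zeta), exponent 2nd.
    t_a = zeta * math.exp(-math.log2(zeta) * ln_n)
    # Phases 1, d-1 and d: t = Phi(s0) / n**theta, exponent 3n.
    t_b = phi_s0 * math.exp(-theta * ln_n)
    # Phases 3, ..., d-3: t = n**(-(d * log2(xi)) / 7), exponent n * d**2.
    t_c = math.exp(-(d * math.log2(xi) / 7.0) * ln_n)

    if max(t_a, t_b, t_c) >= 1.0:
        return 0.0  # the bound carries no information for such small n

    log_p = (2.0 * n * d * math.log1p(-t_a)
             + 3.0 * n * math.log1p(-t_b)
             + n * d * d * math.log1p(-t_c))
    return math.exp(log_p)


if __name__ == "__main__":
    for d in (10, 20, 30, 40):
        print(d, product_in_34(d))
```

With these placeholder parameters the product is far from 1 for moderate $d$ and approaches 1 only as $d$ becomes large, which is consistent with the asymptotic nature of the result.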