Multinode Broadcast in Hypercubes and ... with Randomly Distributed Length of ...

advertisement
LIDS.P-2006
November 1990
Multinode Broadcast in Hypercubes and Rings
with Randomly Distributed Length of Packetsi
by
Emmanouel A. Varvarigos and Dimitri P. Bertsekas 2
Abstract
We consider a multinode broadcast (MNB) in a hypercube and in a ring network of processors. This is the
communication task where we want each node of the network to broadcast a packet to all the other nodes.
The communication model that we use is different than those considered in the literature so far. In particular,
we assume that the lengths of the packets that are broadcast are not fixed, but are distributed according to
some probabilistic rule, and we compare the optimal times required to execute the MNB for variable and
for fixed packet lengths. For large hypercubes we show under very general probabilistic assumptions on the
packet lengths, that the MNB is completed in essentially the same time as when the packet lengths are fixed.
In particular, we show that the MNB is completed by time (1 + 6)T 8 with probability at least 1 - e, for any
positive e and 6, where T 8 is the optimal time required to execute the MNB when the packet lengths are
fixed at their mean, provided that the size of the hypercube is big enough. In the case of the ring we prove
that the average time required to execute a MNB when the packet lengths are exponentially distributed
exceeds by a factor of In n the corresponding time for the case where the packet lengths are fixed at their
mean, where n is the number of nodes of the ring.
1 Research supported by NSF under Grant NSF-DDM-8903385 and by the ARO under Grant DAAL03-b6K-0171.
2
Laboratory for Information and Decision Systems, M.I.T, Cambridge, Mass. 02139.
1
1. Introduction
1. INTRODUCTION
The processors of a multiprocessor system, when doing computations, often have to communicate
intermediate results. The interprocessor communication time may be substantial relative to the time
needed exclusively for computations, so it is important to carry out the information exchange as
efficiently as possible.
One of the most frequent communication tasks is the multinode broadcast (MNB). In this task
we want each node to broadcast a packet (the same packet) to all the other nodes. The MNB
arises, for example, in iterations of the form x+l, = Ax*, where A is a square matrix and xi, x 1+l are
column vectors of appropriate dimensions. In this computation, each processor computes a specific
component of the vector xi+l and broadcasts it to all the other processors so that it can be used
during the next iteration.
Algorithms for routing messages between different processors have been studied by several authors
under a variety of assumptions on the communication network connecting the processors. Saad and
Shultz [SaS85], [SaS86] have introduced a number of generic communication problems that arise
frequently in numerical and other methods. They have assumed that all packets take unit time to
traverse any communication link. Johnson and Ho [JoH89] have developed minimum and nearly
minimum completion time algorithms for similar routing problems as those of Saad and Schultz but
used a different communication model and a hypercube network. Their model quantifies the effects of
setup time (or overhead) per packet, while it allows packets to have different lengths, and to be split
and be recombined prior to transmission on any link in order to save on setup time. In the model of
[JoH89], each packet may consist of data originating at different nodes and/or destined for different
nodes and the extra overhead for splitting and combining packets is considered negligible. Bertsekas
et al [BOS89], and Bertsekas and Tsitsiklis [BeT89] have used the communication model of Saad
and Shultz to derive minimum completion time algorithms for several communication problems in
a hypercube. In particular, they have given an algorithm for the multinode broadcast that executes
in a minimum number of steps ([-l-1 for a hypercube with n = 2d processors). Several other works
deal with various communication problems and network architectures related to those discussed in
the present paper; see [BhI85], [DNS81], [HHL88], [Joh87], [KVC88], [McV87], [Ozv87], [SaS88],
[StT90], [StW87], [Top85], [VaB90], and [Var90].
In this paper we will be dealing with a multinode broadcast in hypercubes and rings. The case
where the packets broadcast by the processors have equal deterministic lengths has been studied in
the literature for a variety of regular topologies. Algorithms that execute the MNB in optimal time
exist for the case of the ring ([BeT89]), the d-dimensional wraparound mesh ([Ozv87], [BeT89]),
2
1. Introduction
the hypercube ([SaS85], [SaS86], [Ozv87], [BeT89], [BOS89], [JoH89]), and other topologies.
A
common assumption in the communication model adopted by the previous authors is that all the
packets require one unit of time in order to travel over a link (i.e. the transmission time plus the
propagation delay over the link and the processing delay at the node all sum up to one unit). As a
result the algorithms found were synchronous and assumed the existence of a global clock.
In this paper we will relax some of the previous assumptions. The existence of a global clock is no
longer assumed. Furthermore, the lengths of the packets broadcast are not deterministic, but they
are distributed according to some probabilistic rule. In order to be able to make comparisons with
the fixed packet length case, the mean value of the packet lengths will be taken equal to one unit.
For both the hypercube and the ring network of processors the number of nodes will be denoted by
n. For the purposes of this paper, a hypercube network (or d-cube) consists of the set of points in ddimensional space with each coordinate equal to zero or one. There is a bidirectional communication
link for every two points differing in a single coordinate. We thus obtain an undirected graph with
the processors as nodes and the communication links as arcs. The binary string of length d that
corresponds to the coordinates of a node of the d-cube is referred to as the identity number (or ID)
of the node. When confusion cannot arise, we refer to a d-cube node interchangeably in terms of
its identity number (a binary string of length d) and in terms of the decimal representation of its
identity number. Thus, for example, the nodes (00 ... 0), (00
...
1), and (11 ... 1) will also be referred
to as nodes 0, 1, and 2d - 1, respectively. For any two binary strings x and y we denote by x
Ey
the bitwise exclusive OR of x and y. We also denote by log n and In n the base 2 and the Napier
logarithm of n, respectively.
The optimal completion time of the MNB in a d-dimensional hypercube with n = 2d nodes, when
each packet requires one time unit (or slot) for transmission over a link, was found in [BOS89] (see
also [BeT89]) to be [fL1 time slots, where by
rx]
we denote the smallest integer which is greater
or equal to x. We evaluate the time complexity of the optimal algorithm described in [BOS89]
when the packet lengths are not constant, but are distributed according to some probabilistic rule.
We consider the natural adaptation of the optimal deterministic algorithm which can deal with
the probabilistic case. In particular, we assume that the same schedule of packet transmissions
(i.e., order in which the packets are transmitted over the links) with that of the optimal synchronous
algorithm is followed; however, the timing is not the same, since the model is no longer deterministic.
Let TMNB be the time required for the completion of the MNB in the asynchronous probabilistic
case and let
be
the
the
corresponding
optimal
time
for
synchronous
d
eterministic case For a given n, T
be the corresponding optimal time for the synchronous deterministic case. For a given n,
3
is
TMNB is
2. Multinode Broadcast in a Ring
a random variable and T, is a constant. We assume that the probability distribution of the packet
lengths has unit mean and that the corresponding characteristic function 4>(s) exists for some s > 0.
We prove that given any 6 > 0 and e > 0, we can find no = no(6, c) such that
Pr(TMNs < (1 + 6)T,) > 1 - e
(for the hypercube)
for all n > no. This means that as n -- oo, the MNB is completed with probability one in time less
than (1 + 5)T,, where 6 is arbitrarily small. Thus, the probabilistic nature of the packet lengths does
not deteriorate appreciably the performance of the optimal MNB algorithm for large hypercubes.
This is a rather surprising result; as shown in Section 2, in the case of the ring, the average completion time of the MNB with random packet lengths increases substantially over the corresponding
deterministic case. In particular, the mean time E(TMNB) required to complete the MNB in a ring
for exponentially distributed length of packets with mean one time unit is
E(TMNB) ;,
(for the ring with n nodes)
(C + In n)T,,
where C = 0.577215 is Euler's number, and
T,= [ni-]
is the time to execute the MNB in the case of unit packet lengths.
The organization of the paper is the following. The MNB in a ring is treated in Section 2. The
analysis found there is rather straightforward, but it does give some insight for the case of the
hypercube. Sections 3-5 deal with the hypercube network of processors. In Section 3 we describe
the communication algorithm (scheduling) that will be used. Section 4 derives a loose upper bound
for
E(TMNB)
when the packet lengths are exponentially distributed. Our main result, which holds
for general distributions of the packet lengths, appears in Propositions 3 and 4 of Section 5. Finally,
the appendices at the end resolve some technical issues arising in our analysis.
2. MULTINODE BROADCAST IN A RING
Consider a multinode broadcast in a ring. In the case when all the packets require one unit of
time (or slot) to be transmitted over a link, the optimal time to perform the task is [H-l], where
n is the number of processors of the ring. The following algorithm, found in [BeT89], achieves this
optimal time.
At the first slot, each node sends its packet to its clockwise and counterclockwise neighbors.
During slots 2,..., [-21], every node sends to its clockwise (counterclockwise) neighbor the packet
4
2. Multinode Broadcast in a Ring
Figure 1: Broadcast in a ring.
received from its counterclockwise (respectively, counterclockwise neighbor) at the previous slot
(Figure 1).
We are interested in the case where the lengths of the packets generated by the nodes (and
therefore the time required to transmit them over a link) are not constant but follow some probability
distribution. We will evaluate the mean time E(TMNB) required to complete the MNB.
The remainder of this section consists of two parts. In subsection 2.1 we derive a general expression
for E(TMNB) and apply it to the case where the packet lengths follow a uniform distribution. In
subsection 2.2 we calculate
E(TMNB)
for the case of exponentially distributed length of packets.
2.1 General Expression for E(TMNB) - Uniform Distribution and Bounded Distribution
of Lengths
Let xi, i
=
1, .. , n, be the time required to transmit the packet generated at node i over a link
5
2. Multinode Broadcast in a Ring
of the ring. Since for each processor i there is a processor which is [r!1] links away from i, and
since the packet from i has to be broadcast to all the other nodes, we have TMNB
>
2l]
xi, for all
i. Thus,
1 maxX
TMNB > [
i
On the other hand, it can be seen that
TMNB
< [2
maxxi,
by observing that a MNB in a ring with packets of lengths (in time units) xi, i = 1,..., n, cannot
require more time than the time required when all the packets are equal to the largest of them. An
alternative way to prove this inequality, is to note that the largest packet generated by a node is
the bottleneck of the algorithm, in the sense that by the time it arrives to the node opposite to its
source on the ring, all the other packets of smaller length have already arrived to the node opposite
to their source.
Thus
TMNB =
[n]2 1 maxi xi, and
E(TMNB) =
[n 2
1 E(maxxi).
(1)
This equation holds for any probability distribution of the xi 's. In what follows we will examine
specific distributions.
We first note that if the xi's are bounded by a constant B, then the time required to execute
the MNB is bounded by [n-] B. Assume now that the lengths of the packets are independent and
uniformly distributed over the interval (O,u). Then, straigthforward calculation yields
un
E(maxx,) = n+1
n+l'
and the desired time is
E(TMNB) = n + 1
If u = 2 so that the mean transmission time E(x,) of a packet is one time unit, the preceding formula
becomes
E(TMNB) =
2nrn-i1
2
n+ 1
-=
2n
-. T,
n+ 1
where T, denotes the optimal completion time for the corresponding deterministic case.
6
2. Multinode Broadcast in a Ring
2.2 Exponentially distributed length of packets
Assume now that the transmission times xi over a single link are exponentially distributed independent random variables with mean one time unit. Then we get:
Pr(xz < X) = 1 - e -X.
Since Pr(max ixi < X) = Pr(xi < X,x
2
< X,...,x,
< X) and the xi's are independent and
identically distributed random variables, we obtain
Pr(max xi < X) = (1 -e-X)".
On the other hand we have
E(maxx)
=
Pr(maxxi > X)dX =
(1
- (1 - e-X)n)dX.
By calculating this integral, we get, after some manipulation, that
E(maxxi) = 1+
1
++
2
1
-.
n
Therefore,
l (1 +
E(TMNB)== i 2
For large n, the sum of the n first terms of the harmonic series is approximately In n. In order to
see that, we integrate 1/x from 1 to n, and bound the integral from above and below by discretizing
it. Then we find that
2+
2
++...-
1
1
+-<
-dx = lnn < 1 +
x-x3 nn
+ ..
+
1j
n-l'
which gives
1
In n + -
1
<
< Inn + 1.
(2)
k=l
If more accuracy is required, the following formula ([GrR80], p. 2) can be used:
=l
k=l
n(n + 1) * -(n+ k- 1)
n
k=2
where
Ak = -
x(1- x)(2- x)(3 -x)
... (k-
1- x)dx
and C=0.577215 is Euler's constant. It is shown in Appendix 1 that
nloo2n
k=2 n(n+ 1)...(n+k-1)
7
0.
3. Modified algorithm for the MNB in a hypercube
Therefore, for large n, we obtain the following approximation:
E(TMhNBe) r s1
(C +
+
n n).
By comparing the results for the uniform and the exponential distributions, one can see an
interesting difference.
For the uniform distribution with unit mean packet length (u = 2), the
expected time for the completion of the multinode broadcast is
2n
n+1
*T, -, 2T.,
where T, is the completion time for the corresponding synchronous deterministic case; for the exponential distribution the corresponding mean value of the completion time is approximately In n -Ts.
Thus, the mean completion time for a MNB in a ring strongly depends upon the particular distribution of the packet lengths. It is not only the mean, but also the tail of the distribution that plays a
significant role. We will see that in the hypercube, the particular distribution of the packet lengths
plays a less important role.
3. MODIFIED ALGORITHM FOR THE MNB IN A HYPERCUBE
In the remainder of the paper we will be dealing with a multinode broadcast in a hypercube
network of processors. In order to obtain complexity results for the MNB in the random packet
length case, we will analyze a slightly modified version of the optimal synchronous algorithm found
in [BOS89] and [BeT89]. The results obtained for the modified algorithm will hold for the optimal
algorithm as well. In this section, we first explain why a modified algorithm is analyzed instead of
the asynchronous version of the optimal algorithm. We then give the description of the modified
algorithm.
The optimal synchronous algorithm of [BOS89] can be viewed as consisting of d phases. During
phase i the packet that originates at node s, where s = 0,...,n - 1, arrives at the nodes that
are located i links away from s. When d is not prime, the phases may have to overlap in order
for the completion time of the MNB to be strictly optimal. This complicates the analysis of the
asynchronous case, since the time required to complete a phase will be affected by previous phases
that have not finished yet. The reason is that in phase i + 1, a packet of some origin node may be
scheduled to be transmitted over some link after the packet of another origin, but the latter packet
may not have yet completed phase i.
To circumvent this difficulty, we modify the algorithm so that its phases are not allowed to overlap.
The modified algorithm, is the same with the optimal algorithm except that there is a constraint
8
3. Modified algorithm for the MNB in a hypercube
that each packet begins phase i + 1 only after all packets have completed phase i. The completion
time of the modified algorithm is slightly larger than the actual
TMNB
achieved by the optimal
algorithm.
We now describe briefly the modified algorithm, assuming the reader is somewhat familiar with
the optimal synchronous version given in [BOS89] and [BeT89]. We first note that if we find n
synchronous single node broadcast algorithms in a hypercube, each one originating at a different
node, and such that no two of them use the same link during a slot, then we have a MNB algorithm.
Let A 1(0) be the set of links on which the packet originated at node 00 ... 0 is transmitted during
the lt h slot.
Obviously, each link in A 1(0) connects two nodes with IDs that differ in a specific bit
position. Our aim will be to define At(0) in a way that no two links in A 1(O) connect nodes whose
IDs differ in the same position. If we do so, then the sets
Al(s) = {(s 3 x, s 1 y) I (x, y) E Al(O)},
1 = 1,2,...
can be the sets of links on which the packet generated at node s is transmitted during the lih slot, for
all s = 0, 1,... n - 1. The fact that Al(s) n At(u) = 0 can be seen by observing that s G x and s E y
differ in a particular bit if and only if x and y differ in the same bit. Thus for s = 0, 1,..., n - 1,
the sets Al(s) do not have common elements for a specific 1, provided of course that A,(0) satisfies
the condition mentioned above. In this way it is guaranteed that no two packets will claim the same
link during the same slot. What is left to be done now is to specify A 1(0).
For i = 1, ... , d, we denote with Ni the set of d-bit binary numbers with exactly i ones. The
cardinality of Ni is (/). Each set Ni is in turn partitioned in disjoint subsets Ril,..., Rin, which
are equivalence classes under a single bit rotation to the left. Ri 1 is selected to be the class of the
element of Ni whose i rightmost bits are unity. Then each node ID is associated with a distinct
number m(t) E {1, 2,..., n - 1} in the order
RIR21 ...R2n2 ...R(d- 2 )nd_ 2 R(dl)l (11...1).
The first element in each set Rij is chosen so that its bit in position 1 + [(m(t)-
(4)
1)(mod d)] from
the right, is a one. The subsequent elements of Rij are found by rotating the first element to the
left. We successively group together d elements of the set Ni into sets Eij,j = 1,..., [(I)/dl in the
order they appear in (4). When d is a prime integer, each one of the equivalence classes Rij (except
for Rol and Rdl) has d elements. This happens because when d is prime, all left shifts of less than
d positions produce distinct d bit binary numbers. Then there are (/d)/d equivalence classes in Ni,
each with d elements, and the sets Rij and Eij coincide. If d is not prime, we will have r(l)/dl sets
Eij which we order as in (4).
9
4. A Loose Upper Bound for E(TMNB)
Consider now the following graph consisting of nodes 0, 1,..., n - 1. Every element of Eij is
connected to an element in Ni-_ so that each connection corresponds to a reversal of a different bit.
Every element of Eij is connected to exactly one element of Ni-1. In this way a graph is obtained
which is the single node broadcast tree for node 0. The links connecting layers I - 1 and I of this
graph constitute the links of Al(0). By construction, these links satisfy the conditions that we have
set for Al(0) (see also [BeT89], p.60). Then, the broadcast tree of node s can be obtained from
the broadcast tree of node 0 by simply adding (mod 2) s to the ID of each node of the tree. The
broadcast tree of node 0 for d = 5 is shown in Figure 2.
N2
N
N3
R 31
N4
R32
R4 1
R.21
R 22
10001
01001
11001
01101
11101
0010
00011
10010
10011
1010
I 1011
00100
00110
00101
00111
10101
10111
01000
01100
01010
01110
01011
01111
10000
11000
10100
11100
10110
11110
RI
00000
II I
Terminal packet
Figure 2: The broadcast tree of node 0 for d = 5.
4. A LOOSE UPPER BOUND FOR E(TMNB)
Assume that the lengths of the packets that are broadcast by the nodes are independent exponentially distributed random variables, and consider the asynchronous version of the algorithm
10
5. Asymptotic Behavior of TMNB as n -+ oo
presented in Section 3. This algorithm should be interpreted as specifying the order in which the
packets are transmitted over the links and not the exact timing. We are interested in the completion
time TMNB of the algorithm.
The following upper bound for the average time required for the MNB can be found using the
reasoning developed in Section 2 for the ring topology:
E(TMNB) < (1 + 1/2 + 1/3 + -
+ 1/n)T, t (In n + C)T,,
(5)
where C is Euler's constant and
T,
n-1
is the time required by the deterministic optimal algorithm.
Relation (5) can be proved by arguing that the time for the MNB when the packets which are
broadcast by nodes 0, 1,..., n - 1 have variable lengths x 0, zl,..., x,-1, cannot be longer than the
time for a MNB in which all packets have lengths equal to maxi xi. This gives
TMNB < (maxxi)T,,
and by using the relation
E(maxxi) = 1+ 1/2+...+1/n-lnn+C,
used in Section 2, we obtain the bound (5).
However, this bound is not tight as in the case of the ring, because the packet with the greatest
length is not necessarily the one that determines the completion time of the MNB in a hypercube.
The reason is that, during the MNB in hypercubes, a packet is wasting most of the time waiting
behind other packets that were scheduled to use a link before it. Thus, although the bound in (5)
is valid, it is not tight because it corresponds to a case in which the packet having the maximum
length among all the packets has to wait behind other packets which also have maximum length.
This is not a typical scenario and the mean completion time is considerably overestimated.
5. ASYMPTOTIC BEHAVIOR OF
To obtain a tighter estimate of
TMNB AS N --
TMNB,
oo
a different approach based on Markoff's inequality will be
used. We will look at the asymptotic behavior of the
11
TMNB
as the dimension d gets large.
5. Asymptotic Behavior of TMNB as n
-+
oo
During phase i of the synchronous modified algorithm, the packets are received by all the destinations that are at a Hamming distance i from the source of the packets. Phase i consists of
(1
bi =
steps.
As already mentioned, in order to analyze the case of random packet lengths, the scheduling of
the asynchronous modified algorithm described in Section 3 will be used. Phase i of the algorithm
is considered to have been completed only when all the packets have completed phase i. There are d
copies of each packet, which upon completing transmission, mark the end of phase i of this packet.
These are the rd ((n) copies which are transmitted during step Ej=l bi (last step of phase i), and
d - rd((d)) copies which are transmitted during step C}=l bi-1 (next to the last step of phase i) of
the synchronous algorithm, where we denote by rx (y) the remainder of the division of y by x. We
will refer to these packets as the terminal packets of phase i. In Fig. 2 we show the terminal packets
of phase 2 for the broadcast tree rooted at node 0. Let us number the terminal packets of phase i as
j = 1, 2, .. ., nd (there are d terminal packets per origin node, so the total is nd). Let us also denote
by wij the time that elapses between the beginning of phase i and the time the jih of these terminal
packets completes phase i. For every terminal packet of phase i, the time required for the packet to
complete phase i consists of at most bi packet transmissions; bi - 1 or bi - 2 transmissions of other
packets that were scheduled to use the link before it, and one transmission of the copy of the packet
under consideration. Then we can write
Wij =
XI,
X
iEAij
where Aij is a set of distinct integers between 0 and n - 1 which has cardinality less or equal to bi,
and xl is the length of the packet which originates from node 1. In particular, a node belongs to set
Aij if a packet generated at that node is scheduled to use during the /ih phase the same link that
terminal packet j of phase i will use.
Let
40(s) = E(es$)
be the characteristic function of the distribution of the packet lengths and let
1)+ = {s > 0 I E(e$x) exists}
be the positive portion of its domain. Since wii is the sum of at most bi independent and identically
distributed random variables, we have (since (I(s) > 1 for s E VZ+)
E(eswii) _<4(s),,,
12
V s E v+.
5. Asymptotic Behavior of TMNB as n -+ oo
Also, from Markoff's inequality [Ros83] we have
Pr(elWii > a) < E (ew
a > O, s E D+.
From the last two relations we obtain
Pr(w
< In
> 1-
()b
Va>,
a> O s E
+.
(6)
The time required for the completion of the it h phase by the nd terminal packets of that phase is
Ti = max {wij}.
j=l,...,nd
Therefore, for all s E V+ and a > 0,
(
-
s )=PrP=
w,j (
1,...,nd)(7)
The wij 's that appear at the right hand side of (7) are not independent random variables, so we
cannot readily use (6) to estimate Pr (T _< "i ). To proceed further with the analysis, the following
two propositions are needed:
Proposition 1: Let IJ C N x N be a subset of the set of index pairs (i, j) with 1 < i < d, and
1 < j < nd. Then for all m E {1,..., d}, k E{1,..., nd}, and all a > 0, we have
Pr(wij < a, V (i,j) E I I Wik < a) > Pr(wij < a,(i,j) E IJ).
Proof:
We know that wij = IlEAij xl. Let Z = (xo, Ix,..., x,-l) be the random vector of packet
lengths. We define the sets
Eij =
a Rn I| h=
= maxwp, = x=E Rn
xI = malxEa p
lEA,,
(p,)EIT EApq
In order for the sets Eij, (i, j) E IJ to be disjoint, we say that if for some x there are two pairs (il, jl)
and (i2 ,j 2 ) in IJ, with (il,jl) -< (i 2 ,j 2 ) (-< is any order in N x N, for example the lexicographic
order), such that wi 1j, = wi2j 2, then this x will be considered to belong to Efij, and not to Ei2j2.
Then
Pr (wij < a, (i,j) E IJ l Wmk < a)= E
x
Pr (Y E Eij)Pr(wij < a I E Eij,Wmk < a).
(8)
(ij)EIJ
In Appendix 2 we prove that Pr (wij < a I Wmk < b) > Pr (wj < a) for all a, b > 0. It can be seen
that the proof of the last inequality is not altered if both probabilities are conditioned on the fact
E Eij. Thus,
Pr(wij < a i wmk_< a,Y E Ei3 ) > Pr(wij < a I
13
E Eij)
5. Asymptotic Behavior of TMNB as n -- oo
The last relation together with (8) yields
Pr(wij < a, (i,j) E IJ I Wmk < a) > E
Pr(x E Eij)Pr(wij < a I[
E Eij)
(ij)ElJ
= Pr(wij < a, (i, j) E IJ).
Note that if in the preceding proof we replace wij by wij + a - aii we get the more general relation
Pr(wij < ai, (i,j) E J IJwmk < ak) > Pr(wii < aij, (i,j)
IJ).
(9)
Q.E.D.
Proposition 2: For all a > 0 and k E {1,...,nd},
k
Pr(wij < a,j=1,...,k) >
Pr(w
< a).
(10)
ji=1
Proof:
The proof will be given by induction on k. For k
=
2, we know from Appendix 2 that
Pr(wil < a wi2 < a) > Pr(wil < a), which by using Bayes' rule yields
Pr(wil < a, w;2 < a) > Pr (wil < a) Pr (Wi2 < a).
Suppose that the proposition is true for k - 1 or equivalently:
k-I
Pr(wj < a,j = 1,.. .,-
1) > ] Pr (wij < a).
(11)
j=1
Then
Wk < a)Pr(wik < a).
Pr(wij < a,j = 1,...,k) = Pr(wij < a,j = 1,...,k--1
(12)
By using Proposition 1, (12) gives
Pr(wij < a,j = 1,...,k) > Pr(w,, < a,j = 1,...,k- 1) Pr(wk < a),
which with the aid of (10) is transformed to
Pr(wi < a,j = 1,...,k) >
Pr(wi < a),
j=l
completing the induction. Q.E.D.
The next proposition is an intermediate result leading to our main result.
Proposition 3: Let s be a scalar in the positive portion /+ of the domain of ¢(s), and let A > 0,
and p > 0, 0 > 1 be any scalars such that
def
e__
def
>1,
Fe-.-,
C = V(s)
()
14
14
(s)
>
2.
5. Asymptotic Behavior of TMNB as n - oo
Then for each so E D+ and 0 > 0, we have
Pr (TMNB < A E bi + 2psb 2 + 3o in n
>
nlo
(1-
)
()
(1
)
(
1l 1og2
for sufficiently large dimension d.
Proof:
By combining (7) and (10) we get
Pr (T• <
) ŽPr
>
Wij <--)
j=1
which with the aid of (6) gives
) >a
>
Pr (S <
(s3)
(1
Inequality (13) is valid for any a > 0 and any s E D+. We select s E /D+so that
=
>
>
a
and
let a = eAbi. By substituting these values of a and ~, (13) is transformed to (14):
Pr(T < Ab) > (1-
' /)
(14)
By using the obvious inequality
Pr i
T<
Eci
i=P
> Pr(Ti <ci,i=p,...,q)
i=p
we further obtain
Pr
(
T _< AE ci) >
\i=3
Pr(TI < Abi,i = 3,4,..., d- 3) .
(15)
i=3
Note that the Ti's are not independent. However, we prove in Appendix 3 that for all ai > 0,
Pr(Ti < ai,i = 1,2,..., ,m-1I Tm < am) > Pr(Ti < ai, i = 1,2,...,m- 1).
(16)
(Relation (16) is intuitively clear, since the knowledge that the duration of the mih phase is less
than a,,, cannot decrease the probability that the duration of phases i = 1, 2,..., m - 1 is less than
ai, i = 1, 2,..., m - 1 below its a priori value.)
By successive use of (16), we obtain from (15)
Pr( T
i=3
< AZ
bi>
i=3
I Pr (Ti <
i=3
15
Ab,).
5. Asymptotic Behavior of TMNB as n -- oo
By combining the relations (14) for i = 3, 4,..., d - 3 we get:
d-3
· Prs£
d-3
TiA
bid)
a
<
i=3
d-3
Ž_IIPr
(Tt <Ab)
i=3
Since E > 1 and bi = r
>
Žl
2
2
d 1o-9 2
= n
nd
(1-bi*
>
=3
)
d
(17)
i=3
b3 for 3 < i < d - 3, we get
.bi
d-213 \
(d-i3
Furthermore, we have that
d-3
>
Cb3 and (17) can be transformed to
1 n
b3 = [(d-1(d-2)] > d2
for sufficiently large d. This in turn gives 5b3 >
dilog 2
7 , since ~ > 1. Using this, relation (18) can be written as
Pr (
t
T_<Aebi)
> (ii=3
i=3
'
1
n~dlog2
(19)
~
with ~ > 1.
Note that id-33Ti is the time required for the completion of phases 3, 4,..., d - 4, d - 3. Phases
1, 2, d - 2, d - 1 and d will be treated separately.
The time to complete phases 1, d - 1 and d is determined by the length of the longest packet.
This is so because these phases consist of one step and, therefore,
Ti
= maxt xt, i = 1, d - 1, d. Then
by using Markoff's inequality we obtain
Pr(_<l
pr(x_<n,
l
)
_
0
>
,1,...,n-1)
(1-
(:)).
If we select a = ng with 0 > 1 and we let s be equal to any so E D+, the preceding inequality yields:
Pr
Tn)<
(1
(so )
for i = 1, d-1,
d.
(20)
For the phases i =2 and i = d- 2, we get from relation (13) and the fact b2 = bd-2, that
( _ i- s )If we select a =
6
esb
2, then
(T
a
)f
for i = 2, d- 2.
(21)
(21) is transformed to
< pb)
>
(1-
,
for i=
2, d-
2,
(22)
where
¢=
e 1 > 2.
V (us)
Since 2b2 = 2r(d- 1)/21 > d- 1, we get
Pr(T _<pb2)
>
1-
-
I
=n
16
)
,
for i=2, d-2.
(23)
5. Asymptotic Behavior of TMNB as n --+ oo
By combining equations (19), (21), and (23), and by using the fact
id=l Ti = TMNB, we finally
obtain
Pr TMNB < A ij bi +3-
>(1(V
2
)
~~~~~~~~i=3
"
1
1
Inn + 2b
2nd
A
1 _
(so)
80
(1-
(-Inlom2V)
n)3l(-__
with a > 1, so E D+, ( > 2, and ~ =
d2
1
dlog2
~~~~~(24)
J
> 1. Q.E.D.
Now we are in a position to prove the main result of the paper.
Proposition 4: Let TMNB be the completion time of the MNB when the lengths of the packets are
distributed according to some probabilistic rule with unit mean, and let T, =
I[-!d
] be the completion
time of the MNB when packets have deterministic length equal to one time unit. Assume also that
the positive portion D+ of the domain of 4I(s) is nonempty. Then given any 6 > 0 and e > 0, we
can find no = no(5, e) such that
Pr(TMNB < (1 + 6)T) > 1-
Proof:
V n > no.
e,
Since 4~(s) < oo for s E D+, there exists a A > 0 such that eAs > 4>(s) for some s > 0, and
a y > 0 such that eMl > 44(s). Let
(=
eAs
s)
ejA
>
,
i
=
2
> 2.
Consider also some so E D+ and 0 > 1. Then the conditions of Proposition 3 are satisfied and
equation (24) holds. In Appendix 4 we prove that for
4 > 1,
C > 2,
0> 1,
(25)
the right hand side of (24) goes to one as the number of processors n goes to infinity. Thus, for all
e > 0, we can find nl(e) such that for all n > nl(e)
Pr
(TMNB
bi + 3
< A
In n + 2pb2
We denote
d-3
T,(n) =
E
b, + 3- In n + 2pb2
i=3
and
T.d
17
1
> 1-e
(26)
5. Asymptotic Behavior of TMNB as n --+ oo
T,(n) is the optimal completion time of the MNB for the synchronous (deterministic) case. Since
n-1
=
-d=lj (a it can be seen that
_<
bi =
<
From this fact, it follows with some additional calculation that
=
lim
n-oo T.(n)
A.
Therefore, given some 6 > 0 we can find n2 (6) such that
AE
bi + 3
n n + 2b
i=3
2
< (
+
T,(n)
(27)
for all n > n2 (6).
We define no(6, e) = max(nl(E),n 2 (6)). For n > no(6,e), both (26) and (27) hold. Thus for any
S > 0 and any c > 0 there always exists a no = no(6, c), defined as above, such that for all n > no(6, c)
Pr (TMNB < (A+
)T,
>-e.
(28)
We will now prove that A can always be chosen to be equal to 1 + 5 for any 6 > 0. It is enough to
prove that there exists an s E V+ such that
e(l+)' > b(s)
Let
F(s)= ,(s)e-(l+2)
.
Since F(O) = t(0) = 1, it is enough to prove that F(s) is strictly decreasing in a neighborhood of
0. Since qt(s) is differentiable (because the exponential function is differentiable), we have
F'(s) = e-(+)$' (V'(s) - (1 +
which gives F'(0) = -2
)q(s))
< 0 (we used the fact that 4X'(O) = 1, since the packet lengths have unit
mean). Therefore, there exists an s E D+ such that F(s) < 1 or equivalently e(l+f)' > 1'(s). Thus,
we can always choose A = 1 + 5. By substituting this value of A in (28) the proof is completed.
Q.E.D.
The last theorem constitutes the main result of this paper. It indicates that in hypercubes of large
dimension, the factor by which the completion time of the MNB increases when the packet lengths
have essentially any probability distribution over the corresponding case where all the packets have
18
References
constant lengths, is very close to one. This result should be compared with the case of the ring
with n processors, where the average time required for the MNB when the lengths of the packets
are exponentially distributed is In n times that of the corresponding deterministic case.
An intuitive explanation of this result is the following. Loosely speaking, as d increases, the
number of steps within each phase (except phases 1,2, d- 2, d - 1, d) grows faster than d. Therefore,
the number of packets that are transmitted one after the other within some phase increases more
rapidly than the number of phases. The sum of the lengths of the packets which are transmitted over
some link during phase i is then forced by the central limit theorem to come close to its mean value
bi. As a result the total time required for the MNB of the asynchronous probabilistic case approaches
the time complexity of the synchronous deterministic case. Therefore, although the result may seem
unexpected, it is not counter-intuitive.
We finally note that the existence of a global clock was not assumed at any point throughout the
analysis. The only exception is the initialization of the algorithm, which was assumed to take place
synchronously for all the processors. If this assumption is relaxed the additional overhead for the
initialization using a naive scheme (single node broadcast of a start signal) will be O(d) which is
small, so the result still holds.
REFERENCES
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., "Parallel and Distributed Computation: Numerical
Methods", Prentice-Hall, Englewood Cliffs, N.J., 1989.
[BOS89] Bertsekas, D. P., Ozveren, C., Stamoulis, G. D., P. Tseng, and Tsitsiklis, J. N., "Optimal
Communication Algorithms for Hypercubes", Lab. for Information and Decision Systems Report
LIDS-P-1847, M.I.T., Jan. 1989, J. of Parallel and Distributed Computing (to appear).
[BhI85] Bhatt, S. N., and Ipsen, I. C. F., "How to Embed Trees in Hypercubes", Yale University,
Dept. of Computer Science, Research Report YALEU/DCS/RR-443, 1985.
[DaS87] Dally, W. J., and Seitz, C. L., "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks", IEEE Trans. on Computers, Vol. C-36, pp. 547-553, 1987.
[DNS81] Dekel, E., Nassimi, D., and Sahni, S., "Parallel Matrix and Graph Algorithms", SIAM J.
Comput., Vol. 10, pp. 657-673, 1981.
[GrR80] Gradshteyn, I. S., and Ryzhik, I. M., "Table of Integrals, Series, and Products, Academic
Press, Orlando, Florida, 1980.
[HHL88] Hedetniemi, S. M., Hedetniemi, S. T., and Liestman, A. L., "A Survey of Gossiping and
19
Appendix 1
Broadcasting in Communication Networks", Networks, Vol. 18, pp. 319-349, 1988.
[Joh87] Johnsson, S. L., "Communication Efficient Basic Linear Algebra Computations on Hypercube Architectures", J. Parallel and Distr. Comput., Vol. 4, pp. 133-172, 1987.
[JoH89] Johnsson, S. L., and Ho, C. T., "Optimum Broadcasting and Personalized Communication
in Hypercubes", IEEE Transactions on Computers, Vol. 38, 1989,. pp. 1249-1268.
[KVC88] Krumme, D. W., Venkataraman, K. N., and Cybenko, G., "The Token Exchange Problem",
Tufts University, Technical Report 88-2, 1988.
[McV87] McBryan, O. A., and Van de Velde, E. F., "Hypercube Algorithms and their Implementations", SIAM J. Sci. Stat. Comput., Vol. 8, pp. 227-287, 1987.
[Ozv87] Ozveren, C., "Communication Aspects of Parallel Processing", Laboratory for Information
and Decision Systems Report LIDS-P-1721, M.I.T., Cambridge, MA, 1987.
[Ros83] Ross, S. M., "Stochastic Processes", J. Wiley, New York, 1983.
[SaS85] Saad, Y., and Schultz, M. H., "Data Communication in Hypercubes", Yale University Research Report YALEU/DCS/RR-428, October 1985 (revision of August 1987).
[SaS86] Saad Y., and Schultz, M. H., "Data Communication in Parallel Architectures", Yale University Report, March 1986.
[SaS88] Saad Y., and Schultz, M. H., "Topological Properties of Hypercubes", IEEE Trans. on
Computers, Vol. 37, 1988, pp. 867-872.
[StT90] Stamoulis G., and Tsitsiklis J., "Efficient Routing Schemes for Multiple Broadcasts in Hypercubes", Laboratory for Information and Decision Systems, Report LIDS-P-1948, February 1990.
[StW87] Stout, Q. F., and Wagar, B., "Passing Messages in Link-Bound Hypercubes", in Proc. 1986
Hypercube Conference, SIAM, Philadelphia, pp. 251-257, 1987.
[Top85] Topkis, D. M., "Concurrent Broadcast for Information Dissemination", IEEE Trans. Software Engineering, Vol. 13, pp. 207-231, 1983.
[Var90O] Varvarigos, E. A., "Optimal Communication Algorithms for Multiprocessor Computers",
M.S. Thesis, Dept. of Electrical engineering and Computer Science, M.I.T.; also Center for Intelligent
Control Systems Report, CICS-TH-192, January 1990
[VaB90] Varvarigos, E. A., Bertsekas, D. P., "Optimal Communication Algorithms for Isotropic
Tasks in Hypercubes and Wraparound Meshes", Laboratory for Information and Decision Systems,
Report LIDS-P-1972, May 1990.
20
Appendix 2
APPENDIX 1
In this appendix we will prove that
limn
n-+oo 2n
E
Ak
n(n + 1) . (n + k-1)
=0)
Denote
F(n) =E n(n + 1) . .(n+ k-1)
Since Ak > 0, obviously F(n) > 0 for all n > 0. From (2) and (3) we get that C - F(n) >
> 0,
which gives F(n) < C, for all n > 0. Therefore, for n = 1, F(1) < C. It can also be seen from the
definition of F(n) that F(n) < F(1)/n. This gives
0 < F(n) < C
n
Therefore, lim-,, F(n) = 0 and
lim
-F(n
= O.
Q.E.D.
APPENDIX 2
In this Appendix we prove the inequality
Pr(wi, < a Wi,k < b) > Pr(wij < a),
(29)
which was used to prove Prop. 1
We have
xr,
wij =
Wk =
The dependence between wzi and
Wink
I.
lE Amk
IEAjj
comes from the fact that some indices I are common in Amk
and A/j. Let C = Amk n Aij, A = Aij - C and B = Ak - C. Then if we denote y = EIc
Z = DIEA xi, and r =
Z1EB
,
xl, the random variables y, z, r are independent, since for distinct l's,
the random variables xl are independent.
By using the above notation, it is enough to prove that
Pr(y+z < a I yr+ r < b) > Pr(y+ z < a).
21
(30)
Appendix 3
Bayes' rule gives
Pr(y+ z < a I y+ r < b)= Pr(y+ z < a,y+r < b)
Pr(y+ r < b)
We have that
Pr(y+-z<ay+r<b
a-z<b
-r)=
Pr(y
+ z•ba-z
< b a- z <
b - r)
Pr(y+z<
<b-r)
> Pr(y + z <a a - z < b -r)
and
Pr(y+ z < a I y+ r < b, a- z > b- r) = Pr(y+r< b a- z > b-r)
Pr(y + r < b I a - z > b - r)
> Pr(y+ z < a a - z > b-r).
These relations give
Pr(y+ z < a I y + r < b) = Pr(y + z <a y + r < b,a - z < b- r)Pr(a- z < b- r)
y + r < b,a - z >b- r)Pr(a - z > b- r)
+ Pr(y+z < a
a - z < b - r)Pr(a- z < b - r)
> Pr(y + z < a
+ Pr(y + z <a a - z > b - r)Pr(a - z > b - r)
= Pr(y+ z < a).
As a result, relation (30), which is equivalent to relation (29) holds. Q.E.D.
APPENDIX 3
In this appendix, we prove Eq. (16), repeated below for convenience,
Pr(T_< ai,i = 1,2,...,m - 1 I Tm <am) > Pr(T < ai, i = 1,2,...,m - 1).
(31)
By definition we have
T = max{wik I k= ,...,nd}.
Let ki, i = 1, 2,..., m, be the arguments that attain the maximum above for i = 1, 2,..., m, and let
Pr(k;, i = 1,..., m) be the corresponding probability. By substituting these equations in (31), we
have to prove the equivalent relation
Pr(k, i = 1,...,m)Pr(wik, < ai,i = 1,2,...,m - 1 ki, i = 1, 2,..., m, Wmkm < am) >
(32)
> 2 Pr(kI, i = 1,..., m)Pr(wik, < ai, i = 1, 2,..., m22
22
I k, i = 1,2,..., m).
Appendix 4
In Prop. 1 we proved [cf. Eq. (16)] that
Pr(wik, < ai, i = 1, 2,..., m- 1 l wkm_ < am) > Pr(wik, < at, i= 1,2,...,m- 1).
(33)
It can be seen from the proof of this proposition that the same result holds even if all the probabilities
in (33) are conditioned on the event {k, i = 1, 2,.. ., m} and therefore
ki, i = 1,2,..., m, WMmkm
Pr(wiki; < ai, i = 1, 2,..., m -1
> Pr(wiki
< am)
ai, i = 1,2,..., m - 1 ki, i = 1, 2,..., m).
Using this, (32) is proved and from there (31) follows immediately. Q.E.D.
APPENDIX 4
In this appendix we will show that the right hand side of inequality (24) goes to 1 as n -- oo, i.e.
lim (1n-01
)
1
C-lfllog2 C )O
I
1
]
nO)7
J+,log
(34)
2
)
1
(34)
when ~ > 1, C > 2, and 0 > 1. We recall that d = log 2 n.
The proof consists of two steps. At first we show that all the terms of the product in (34) are of
the form
where x = x(n) goes to infinity as n goes to infinity and Q(x) is a function of x such that
=0.
lim
(35)
As a second step we will prove that for a function Q2(x) satisfying (35), we have that
lim
2-00 (
1
Q(z))
1.
Step 1: In order to prove that all the terms in (34) are of the desired form, we will use successive
applications of L' Hospital rule. This rule states that if f(x), g(x) are differentiable functions in the
neighborhood of oo (i.e. for sufficiently large x) with the property that limrS f(x) = limz_, g(x) =
oo then limL,-Ot
= lim_,
0(.
Although the number of processors n is an integer, we will treat
it as a continuous variable here and allow differentiation with respect to it. This is permitted since
we are interested in the limit n -+ oo and we are dealing with continuous functions of n. Thus
23
Appendix 4
a) Since
C > 2,
log 2, C
def
a > 1.
Thus
2nd
lim
n-ooo (lna
lim
2(1 + In n)(
n-oo (a In 2)n0-1
2=
lim
=0
n-eoo (a In 2)nc - '
as n -- oo and a > 1. Thus
(-lnlog2(
£Q(2nd).
=
b) Obviously
limoo
l3n1(so)
n O
0,
since 0 > 1 and <>(so) is a constant.
c) We denote a = lo21 > 0 for E > 1 and take into account that d = 1og2 n = 1.
third term of (34) to be of the desired form it is enough to prove that
n(ln n)2
(In 2)2narn
lim (
n-+no
= 0.
=°
By successive applications of L' Hospital rule we find that
2
lim n(ln =n)
m
n-oo nalnun
lim
lil n(ln n + 2)
n-oo 2anCr
-ln n
n(ln n + 3) = 0.
oo4a2(lnn)nalnn-
Therefore
°
nZdl"
92 = lQ (nd2) .
Step 2: For the second step we note that
lim
1-lim
lim (1_
Since
imx-
(i
-
()
(x)
= e-1
=-lim
((lQ())
() = 0, we finally obtain that
and lims,,
lim= 1
X-00
-)
(z)
(-
Steps 1 and 2 together give Eq. (34). Q.E.D.
24
Then for the
Download