January 1984
revised November 1984
LIDS-P-1425
DISTRIBUTED ASYNCHRONOUS DETERMINISTIC AND STOCHASTIC GRADIENT
OPTIMIZATION ALGORITHMS†
by
John N. Tsitsiklis*
Dimitri P. Bertsekas*
Michael Athans*
ABSTRACT
We present a model for asynchronous distributed computation and then proceed to
analyze the convergence of natural asynchronous distributed versions of a large
class of deterministic and stochastic gradient-like algorithms.
We show that
such algorithms retain the desirable convergence properties of their centralized
counterparts, provided that the time between consecutive communications between
processors and communication delays are not too large.
†Research supported by Grants ONR/N00014-77-C-0532 and NSF-ECS-8217668.
*Laboratory for Information and Decision Systems, Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, MA 02139.
1. INTRODUCTION
Many deterministic and stochastic iterative algorithms admit a natural distributed
implementation [1,4,5] whereby several processors perform computations and exchange
messages with the end-goal of minimizing a certain cost function.
If all processors
communicate to each other their partial results at each instance of time and perform
computations synchronously, the distributed algorithm is mathematically equivalent to
a single processor (serial) algorithm and its convergence may be studied by conventional means.
Synchronous algorithms may have, however, certain drawbacks:
a) Synchronism may be hard to enforce, or its enforcement may introduce substantial overhead.
b) Communication delays may introduce bottlenecks to the speed of the algorithm (the time required for one stage of the algorithm will be constrained by the slowest communication channel).
c) Synchronous algorithms may require far more communications than are actually necessary.
d) Even if all processors are equally powerful, some will perform certain computations faster than others, due solely to the fact that they operate on different inputs. This, in turn, may lead to having many processors idle for a large proportion of the time.
For these reasons, we choose to study
asynchronous distributed iterative optimization algorithms in which each processor does
not need to communicate to each other processor at each time instance; also, processors
may keep performing computations without having to wait until they receive the
messages that have been transmitted to them; processors are allowed to remain idle
some of the time; finally, some processors may perform computations faster than others.
Such algorithms can alleviate communication overloads, and they are not excessively slowed down by either communication delays or by differences in the time it takes processors to perform one computation.
(A similar discussion of the merits of
asynchronous algorithms is provided by H.T. Kung [10].)
The algorithms that we consider are gradient-like, meaning that the (expected)
updates by each processor are in a descent direction with respect to a cost function
being minimized.
There may be certain limitations to our approach, because we restrict attention to a special class of algorithms; algorithms with a different structure might offer some advantages over the ones we propose. However, the problem of finding an "optimal" distributed algorithm, which minimizes, say, the amount of information exchanged between processors, or some other measure of communication and computation, can be very hard or intractable (NP-complete or worse) [15,20].
For this reason, we prefer to assume a fixed structure and then proceed
to investigate the amount of asynchronism and the magnitude of the communication delays
that may be tolerated without adversely affecting the convergence of the algorithm.
In Section 2 we present the model of distributed computation to be employed.
In this model, there is a number of processors that perform certain computations and update some of the components of a vector stored in their memory. Meanwhile, they exchange messages, thus informing each other about the results of their latest computations. Processors that receive messages use them either to update directly some of the components of the vector in their memory, or they may combine the message with the outcome of their own computations, by forming a convex combination.
Very weak assump-
tions are made about the relative timing and frequency of computations or message
transmissions by the processors.
In Section 3 we employ this model of computation and also assume that the
(possibly random) updates of each processor are expected to be in a descent direction,
when conditioned on the past history of the algorithm.
Our main results show that,
under certain assumptions, asynchronous distributed algorithms have similar convergence
properties as their centralized counterparts, provided that the time between consecutive communications between processors plus communication delays are not too large.
We distinguish two cases: a) constant step-size algorithms (e.g., deterministic gradient-type algorithms), in which the time between consecutive communications has to be bounded for convergence to be guaranteed, and b) decreasing step-size algorithms (e.g., stochastic approximation-type algorithms), for which convergence is proved even if the time between consecutive communications increases without bound as the algorithm proceeds.
Sections 2 and 3 are developed in parallel with a variety of examples
which are used to motivate and explain the formal assumptions that are being introduced.
Finally, Section 4 suggests some extensions and areas of application, together
with our conclusions.
Appendix A contains the proof of Lemma 2.1.
Appendix B contains
the proofs of our main results.
2. A MODEL OF DISTRIBUTED COMPUTATION
We present here the model of distributed computation employed in this paper. We also define the notation and conventions to be followed. Related models of distributed computation have been used in [3,4,5,8], in which each processor specialized in updating a different component of some vector. The model developed here is more general, in that it allows different processors to update the same component of some vector.
If their individual updates are different, their disagreement is (asymptotically)
eliminated through a process of communicating and combining their individual updates.
In such a case, we will say that there is overlap between processors.
Overlapping
processors are probably not very useful in the context of deterministic algorithms,
unless redundancy is intended to provide a certain degree of fault tolerance and/or
safeguarding against malfunction of a particular processor.
For stochastic algorithms,
however, overlap essentially amounts to having different processors obtain noisy
measurements of the same unknown quantity and effectively increases the "signal-to-noise ratio."
Let $H_1, H_2, \ldots, H_L$ be finite-dimensional real vector spaces which we endow with the Euclidean norm,¹ and let $H = H_1 \times H_2 \times \cdots \times H_L$. If $x = (x_1, x_2, \ldots, x_L) \in H$, we will refer to $x_\ell$ as the $\ell$-th component of $x$.
Let {1,...,M} be the set of processors that participate in the distributed
computation.
As a general rule concerning notation, we use subscripts to indicate a
component of an element of H, superscripts to indicate an associated processor; we
indicate time by an argument that follows.
The algorithms to be considered evolve in discrete time.
Even if a distributed
algorithm is asynchronous and communication delays are real (i.e.,
not integer)
variables, the events of interest (an update by some processor, transmission or reception of a message) may be indexed by a discrete variable; so, the restriction to
discrete time entails no loss of generality.
It is important here to draw a distinction between "global" and "local" time.
The time variable we have just referred to corresponds to a global clock.
Such a
global clock is needed only for analysis purposes: it is the clock of an analyst who
watches the operation of the system.
On the other hand, the processors may be working without having access to a global clock. They may have access to a local clock or to no clock at all. We will see later that our results, based on the existence of a global clock, may be used in a straightforward way to prove convergence of algorithms implemented on the basis of local clocks only.

¹All of our results generalize to the case where each $H_i$ is a Banach space. No modifications whatsoever are needed in the assumptions or the proofs, except that matrices should now be called linear operators and that expressions like $\nabla J(x)$ should be interpreted as elements of the dual of $H$ rather than as vectors in $H$.
We assume that each processor has a buffer in its memory in which it keeps some element of $H$. The value stored by the $i$-th processor at time $n$ is denoted by $x^i(n)$. At time $n$, each processor may receive some exogenous measurements and/or perform some computations. This allows it to compute a "step" $s^i(n) \in H$, to be used in evaluating the new vector $x^i(n+1)$. Besides their own measurements and computations, processors may also receive messages from other processors, which will be taken into account in evaluating their next vector. The process of communications is assumed to be as follows:
At any time $n$, processor $i$ may transmit some (possibly all) of the components of $x^i(n)$ to some (possibly all or none) of the other processors. (In a physical implementation, messages do not need to go directly from their origin to their destination; they may go through some intermediate nodes. Of course, this does not change the mathematical model presented here.) We will assume that communication delays are bounded.² For convenience, we also assume that for any pair $(i,j)$ of processors, for any component $x_\ell$ and any time $n$, processor $i$ may receive at most one message originating from processor $j$ and containing an element of $H_\ell$. This leads to no significant loss of generality: for example, a processor that receives two messages simultaneously could keep only the one which was most recently sent; if messages do not carry timestamps, there could be some other arbitration mechanism. Physically, of course, simultaneous receptions are impossible; so, a processor may always identify and keep the most recently received message, even if all messages arrived at the same discrete time $n$. If a message from processor $j$, containing an element of $H_\ell$, is received by processor $i$ ($i \ne j$) at time $n$, let $t^{ij}_\ell(n)$ denote the time that this message was sent.
²For algorithms with decreasing step-size this assumption may be relaxed.
Therefore, the content of such a message is precisely $x^j_\ell(t^{ij}_\ell(n))$. Naturally, we assume that $t^{ij}_\ell(n) \le n$. For notational convenience, we also let $t^{ii}_\ell(n) = n$, for all $i,\ell,n$. We will be assuming that the algorithm starts at time 1; accordingly, we assume that $t^{ij}_\ell(n) \ge 1$. Finally, we denote by $T^{ij}_\ell$ the set of all times that processor $i$ receives a message from processor $j$, containing an element of $H_\ell$. To simplify matters we will assume that, for any $i,j,\ell$, the set $T^{ij}_\ell$ is either empty or infinite.
Once processor $i$ has received the messages arriving at time $n$ and has also evaluated $s^i(n)$, it evaluates its new vector $x^i(n+1) \in H$ by forming (componentwise) a convex combination of its old vector and the values in the messages it has just received, as follows:

$$x^i_\ell(n+1) = \sum_{j=1}^{M} a^{ij}_\ell(n)\, x^j_\ell\big(t^{ij}_\ell(n)\big) + \gamma^i(n)\, s^i_\ell(n), \qquad n \ge 1, \tag{2.1}$$
where $s^i_\ell(n)$ is the $\ell$-th component of $s^i(n)$ and the coefficients $a^{ij}_\ell(n)$ are scalars satisfying:

(i) $a^{ij}_\ell(n) \ge 0, \qquad \forall\, i,j,\ell,n,$  (2.2)

(ii) $\sum_{j=1}^{M} a^{ij}_\ell(n) = 1, \qquad \forall\, i,\ell,n,$  (2.3)

(iii) $a^{ij}_\ell(n) = 0, \qquad \text{if } n \notin T^{ij}_\ell.$  (2.4)
Remarks:

1. Note that $t^{ij}_\ell(n)$ has been defined only for those times $n$ at which processor $i$ receives a message of a particular type, i.e., for $n \in T^{ij}_\ell$. However, whenever $t^{ij}_\ell(n)$ is undefined, we have assumed above that $a^{ij}_\ell(n) = 0$, so that equation (2.1) has an unambiguous meaning.

2. When we refer to a processor performing a "computation," we mean the evaluation and addition of the term $\gamma^i(n) s^i_\ell(n)$ in (2.1). With this terminology, forming the convex combination in (2.1) is not called a computation. We denote by $T^i_\ell$ the set of all times that processor $i$ performs a computation involving the $\ell$-th component. Whenever $n \notin T^i_\ell$, it is understood that $s^i_\ell(n)$ in (2.1) equals zero. We assume again that for any $i,\ell$ the set $T^i_\ell$ is either infinite or empty. Accordingly, processor $i$ will be called computing, or non-computing, for component $\ell$.
3. The quantities $\gamma^i(n)$ in (2.1) are nonnegative scalar step-sizes. These step-sizes may be constant (e.g., $\gamma^i(n) = \gamma_0$, $\forall n$), or time-varying, e.g., $\gamma^i(n) = 1/t^i_n$, where $t^i_n$ is the number of times that processor $i$ has performed a computation up to time $n$. Notice that with such a choice each processor may evaluate its step-size using only a local counter rather than a global clock.
4. The envisaged sequence of events underlying (2.1) at any time $n$ is as follows. For each component $\ell$, processor $i$:
(i) Transmits $x^i_\ell(n)$ to other processors and receives messages $x^j_\ell(t^{ij}_\ell(n))$ from the processors $j$ for which $n \in T^{ij}_\ell$.
(ii) Computes $s^i_\ell(n)$ if $n \in T^i_\ell$ and sets $s^i_\ell(n) = 0$ otherwise.
(iii) Evaluates $x^i_\ell(n+1)$ according to (2.1).
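To make the mechanics of (2.1) concrete, the following Python sketch carries out one such update for a single processor $i$ and component $\ell$; the function and all variable names are illustrative only and are not part of the report.

```python
def update_component(x_i_l, received, a_row, a_self, gamma, s_i_l):
    """One instance of iteration (2.1) for processor i and component l.

    x_i_l    : current value x^i_l(n) held by processor i
    received : dict {j: x^j_l(t^{ij}_l(n))} of message contents received at time n
    a_row    : dict {j: a^{ij}_l(n)} of combining coefficients for those senders
    a_self   : coefficient a^{ii}_l(n) placed on the processor's own value
    gamma    : step-size gamma^i(n)
    s_i_l    : step s^i_l(n) (zero whenever n is not in T^i_l)
    """
    # Conditions (2.2)-(2.4): nonnegative coefficients summing to one, and zero
    # coefficients for senders not heard from at time n (they are simply absent here).
    total = a_self + sum(a_row.values())
    assert a_self >= 0.0 and all(a >= 0.0 for a in a_row.values())
    assert abs(total - 1.0) < 1e-12
    # Convex combination of own value and received values, plus the local step.
    combined = a_self * x_i_l + sum(a_row[j] * received[j] for j in received)
    return combined + gamma * s_i_l

# Example: combine with one incoming message and take a small gradient-like step.
x_new = update_component(x_i_l=1.0, received={2: 0.6}, a_row={2: 0.5},
                         a_self=0.5, gamma=0.1, s_i_l=-0.8)
```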
Examples
We now introduce a collection of simple examples representing various classes of algorithms we are interested in, so as to illustrate the nature of the assumptions to be introduced later. We actually start with a broad classification and then proceed to more special cases. In these examples, we model the message receptions and transmissions (i.e., the sets $T^{ij}_\ell$ and the variables $t^{ij}_\ell(n)$), the times at which computations are performed (i.e., the sets $T^i_\ell$) and the combining coefficients $a^{ij}_\ell(n)$ as deterministic. (This does not mean, however, that they have to be a priori known by the processors.)
Specialization: This is the case considered by Bertsekas [4,5], where each processor updates a particular component of the x-vector specifically assigned to it and relies on messages from the other processors for the remaining components. Formally:
(i) $M = L$. (There are as many processors as there are components.)
(ii) $s^i_\ell(n) = 0$, $\forall\, \ell \ne i$, $\forall n$. (A processor may update only its own component; $T^i_\ell = \emptyset$, $\forall\, i \ne \ell$.)
(iii) Processor $j$ only sends messages containing elements of $H_j$; if processor $i$ receives such a message, it uses it to update $x_j$ by setting $x_j$ equal to the value received. Equivalently,
(a) If $i \ne j$ and $j \ne \ell$, then $T^{ij}_\ell = \emptyset$ and $a^{ij}_\ell(n) = 0$, $\forall n$.
(b) If processor $i$ receives a message from processor $j$ at time $n$, i.e., if $n \in T^{ij}_j$, then $a^{ij}_j(n) = 1$. Otherwise, $a^{ij}_j(n) = 0$ and $a^{ii}_j(n) = 1$.

Overlap: This is the other extreme, in which $L = 1$ (we do not distinguish components of elements of $H$), messages contain elements of $H$ (not just components) and each processor may update any component of $x$. (For this case subscripts are redundant and will be omitted.)
We now let $H$ be finite-dimensional and assume that $J: H \to [0,\infty)$ is a continuously differentiable nonnegative cost function with a Lipschitz continuous derivative.
Example I: Deterministic Gradient Algorithm; Specialization. Let $\gamma^i(n) = \gamma_0 > 0$, $\forall\, n,i$. At each time $n \in T^i_i$ that processor $i$ updates $x_i$, it computes $s^i_i(n) = -\frac{\partial J}{\partial x_i}(x^i(n))$ and lets $s^i_j(n) = 0$, for $j \ne i$. We assume that each processor $i$ communicates its component $x_i$ to every other processor at least once every $B_1$ time units, for some constant $B_1$. Other than this restriction, we allow the transmission and reception times to be arbitrary. (A related stochastic algorithm could be obtained by letting $s^i_i(n) = -\frac{\partial J}{\partial x_i}(x^i(n))\,(1 + w^i(n))$, where $w^i(n)$ is unit variance white noise, independent for different $i$'s.)
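A brief simulation of Example I may help fix ideas. The quadratic cost, its dimensions and the broadcast pattern below are invented for illustration and are not taken from the report; each processor owns one coordinate, descends along it with a constant step-size, and broadcasts that coordinate every $B_1$ steps while working with possibly outdated copies of the others.

```python
import numpy as np

# Illustrative setup (not from the paper): minimize J(x) = 0.5 * x^T Q x
L = 3                                   # number of components = number of processors
Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
gamma0, B1, T = 0.05, 5, 400            # constant step-size, broadcast period, horizon

x = [np.array([5.0, -3.0, 2.0]) for _ in range(L)]   # x^i(n): each processor's local copy

for n in range(T):
    new_x = [xi.copy() for xi in x]
    for i in range(L):
        # Specialization: processor i updates only its own component x_i,
        # using its local (possibly outdated) copy x^i(n) of the full vector.
        grad_i = Q[i].dot(x[i])                      # dJ/dx_i evaluated at x^i(n)
        new_x[i][i] = x[i][i] - gamma0 * grad_i
    x = new_x
    if n % B1 == 0:
        # Every B1 steps, each processor broadcasts its own component and the
        # receivers overwrite that component (combining coefficient equal to one).
        for i in range(L):
            for j in range(L):
                if j != i:
                    x[j][i] = x[i][i]

print([np.round(xi, 4) for xi in x])                 # local copies approach the minimizer 0
```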
Example II: Newton's Method; Overlap. For simplicity we assume that there are only two processors ($M = 2$). Let $\gamma^i(n) = \gamma_0 > 0$, $\forall n$. We also assume that $J$ is twice continuously differentiable, strictly convex and its Hessian matrix, denoted by $G(x)$, satisfies $0 < \delta_1 I \le G(x) \le \delta_2 I$, $\forall\, x \in H$. At each time $n \in T^i$, processor $i$ computes $s^i(n) = -G^{-1}(x^i(n))\, \nabla J(x^i(n))$; for $n \notin T^i$, $s^i(n) = 0$. If at time $n$ processor 1 (respectively, 2) receives a message $x^2(t^{12}(n))$ (resp. $x^1(t^{21}(n))$), it updates its vector by

$$x^1(n+1) = a_{11} x^1(n) + a_{12} x^2(t^{12}(n)) + \gamma^1(n)\, s^1(n)$$
$$\big(\text{resp. } x^2(n+1) = a_{21} x^1(t^{21}(n)) + a_{22} x^2(n) + \gamma^2(n)\, s^2(n)\big).$$

Here we assume that $0 < a_{ij} < 1$ and that $a_{11} + a_{12} = a_{21} + a_{22} = 1$. For other times $n$ the same formula is used with $a_{12} = 0$ ($a_{21} = 0$). We make the same assumptions on transmission and reception times as in Example I.
Example III: Distributed Stochastic Approximation; Specialization. Let $\gamma^i(n)$ be such that, for some positive constants $A_1, A_2$, $A_1/n \le \gamma^i(n) \le A_2/n$, $\forall n$. Notice that the implementation of such a step-size only requires a local clock that runs in the same time scale (i.e., within a constant factor) as the global clock. For $n \in T^i_i$, let $s^i_i(n) = -\frac{\partial J}{\partial x_i}(x^i(n)) + w^i(n)$. Also, $s^i_j(n) = 0$, for $i \ne j$ and for all $n$. We assume that $w^i(n)$, conditioned on the past history of the algorithm, has zero mean and that $E[\|w^i(n)\|^2 \mid x^i(n)] \le K(\|\nabla J(x^i(n))\|^2 + 1)$, for some constant $K$. We assume that for some $B_1 > 0$, $\beta > 1$ and for all $n$, each processor communicates its component $x_i$ to every other processor at least once during the time interval $[B_1 n^\beta, B_1(n+1)^\beta)$. Other than the above restriction, we allow transmission and reception times to be arbitrary. Notice that the above assumptions allow the time between consecutive communications to grow without bound.
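The timing structure of Example III can also be sketched in code. The constants $B_1$, $\beta$ and the two-dimensional quadratic cost below are arbitrary choices for illustration, not values from the report; the point is that step-sizes come from local computation counters (so they behave like $1/n$) while the communication windows $[B_1 k^\beta, B_1(k+1)^\beta)$ grow without bound.

```python
import numpy as np

rng = np.random.default_rng(0)
B1, beta = 2.0, 1.5                      # illustrative constants, not from the paper
L, T = 2, 20000                          # two processors / components, time horizon

# Invented coupled cost J(x) = 0.5*(x_1^2 + x_2^2 + x_1*x_2); processor i owns x_i
# and observes noisy partial derivatives, s^i_i(n) = -dJ/dx_i(x^i(n)) + w^i(n).
x = [np.array([4.0, -4.0]) for _ in range(L)]        # local copies x^i(n)
t_count = [0, 0]                                     # local counters give gamma^i(n) ~ 1/n

# One broadcast inside each window [B1*k^beta, B1*(k+1)^beta), as in Assumption 2.3.
comm_times = {int(B1 * k ** beta) for k in range(1, int((T / B1) ** (1.0 / beta)) + 1)}

for n in range(1, T):
    for i in range(L):
        t_count[i] += 1
        gamma = 1.0 / t_count[i]
        noisy_grad = x[i][i] + 0.5 * x[i][1 - i] + rng.normal()
        x[i][i] -= gamma * noisy_grad
    if n in comm_times:                               # gaps between communications grow
        for i in range(L):
            for j in range(L):
                if j != i:
                    x[j][i] = x[i][i]

print([np.round(xi, 3) for xi in x])                  # both local copies drift toward 0
```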
Example IV: Distributed Stochastic Approximation; Overlap. Let $\gamma^i(n)$ be as in Example III and let $M = 2$. For $n \in T^i$, let $s^i(n) = -\nabla J(x^i(n)) + w^i(n)$, where $w^i(n)$ is as in Example III. We make the same assumptions on transmission and reception times as in Example III. Whenever a message is received, a processor combines its vector with the content of that message using the combining rules of Example II.
Example V: This example is rather academic but will serve to illustrate some of the ideas to be introduced later. Consider the case of overlap, assume that $H$ is one-dimensional, and let $\gamma^i(n) = 1$, $\forall n$. Assume that, at each time $n$, either all processors communicate to each other, or no processor sends any message. Let the communication delays be zero (so, $t^{ij}(n) = n$, whenever $t^{ij}(n)$ is defined) and assume that $a^{ij}(n) = a^{ij}$ (constant) at those times $n$ that messages are exchanged. We define vectors $x(n) = (x^1(n),\ldots,x^M(n))$ and $s(n) = (s^1(n),\ldots,s^M(n))$. Then, the algorithm (2.1) may be written as

$$x(n+1) = A(n)\, x(n) + s(n). \tag{2.5}$$

For each time $n$, either $A(n) = I$ (no communications) or $A(n) = A$, the matrix consisting of the coefficients $a^{ij}$. The latter is a "stochastic" matrix: it has nonnegative entries and each row sums to 1. We assume that $a^{ij}$ is positive. It follows that $\bar{A} = \lim_{n\to\infty} A^n$ exists and has identical rows with positive elements. We assume that the time between consecutive communications is bounded but otherwise arbitrary. Clearly then,

$$\lim_{n\to\infty} A(n) A(n-1) \cdots A(k) = \bar{A}, \qquad \text{for all } k.$$

This example corresponds to a set of processors who individually solve the same problem and, from time to time, simultaneously exchange their partial results. It is interesting to compare equation (2.5) with the generic equation

$$x(n+1) = x(n) + \gamma(n)\, s(n)$$

which arises in centralized algorithms.
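The agreement mechanism of Example V is easy to verify numerically. The combining matrix below is an arbitrary choice satisfying the stated requirements (positive entries, rows summing to one); the sketch checks that $A^n$ approaches a matrix with identical rows, so that processors which stop computing but keep combining reach a common value.

```python
import numpy as np

# An arbitrary "stochastic" combining matrix with positive entries and unit row sums.
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

A_bar = np.linalg.matrix_power(A, 60)     # numerical approximation of lim A^n
print(np.round(A_bar, 6))                 # rows are (numerically) identical and positive

# With s(n) = 0 but continued combining, x(n+1) = A(n) x(n) drives every local value
# to the common limit, which is any component of A_bar x(0).
x0 = np.array([3.0, -1.0, 5.0])
print(np.round(A_bar @ x0, 6))            # all entries equal: the agreement value
```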
Assumptions on the communications and the combining coefficients
We now consider a set of assumptions on the nature of the communications and
combining process, so that the preceding examples appear as special cases.
In formulating an appropriate set of assumptions, there is a trade-off between generality and ease of verification.
The assumptions below are not the most general possible,
but are very easy to enforce.
Some generalizations will be suggested later.
For each component $\ell \in \{1,\ldots,L\}$ we introduce a directed graph $G_\ell = (V, E_\ell)$ with nodes $V = \{1,\ldots,M\}$ corresponding to the set of processors. An edge $(j,i)$ belongs to $E_\ell$ if and only if $T^{ij}_\ell$ is infinite, that is, if and only if processor $j$ sends (in the long run) an infinite number of messages to processor $i$ with a value of the $\ell$-th component $x^j_\ell$.
Assumption 2.1: For each component $\ell \in \{1,\ldots,L\}$, the following hold:
a) There is at least one computing processor for component $\ell$.
b) There is a directed path in $G_\ell$ from every computing processor (for component $\ell$) to every other processor (computing or not).
c) There is some $\alpha > 0$ such that:
(i) If processor $i$ receives a message from processor $j$ at time $n$ (i.e., if $n \in T^{ij}_\ell$), then $a^{ij}_\ell(n) \ge \alpha$.
(ii) For every computing processor $i$, $a^{ii}_\ell(n) \ge \alpha$, $\forall n$.
(iii) If processor $i$ has in-degree³ (in $G_\ell$) larger than or equal to 2, then $a^{ii}_\ell(n) \ge \alpha$, $\forall n$.
Let us pause to indicate the intuitive content of part (c) of Assumption 2.1.
Part (i) states that a processor should not ignore the messages it receives.
Part (ii)
requires the past updates of any computing processor to have a lasting effect on its
state of computation.
Finally, part (iii) implies that if processor i receives
messages from two processors (say $i_1, i_2$), it does not forget the effects of messages of processor $i_1$ upon reception of a message from processor $i_2$.
These conditions,
together with part (b) of the Assumption, guarantee that any update by any computing
processor has a lasting effect on the states of computation of all other processors.
Assumption 2.2: The time between consecutive transmissions of component $x_\ell$ from processor $j$ to processor $i$ is bounded by some $B_1 > 0$, for all $(j,i) \in E_\ell$.

Assumption 2.3: There are constants $B_1 > 0$, $\beta > 1$ such that, for any $(j,i) \in E_\ell$ and for any $n$, at least one message $x^j_\ell$ is sent from processor $j$ to processor $i$ during the time interval $[B_1 n^\beta, B_1(n+1)^\beta)$. Moreover, the total number of messages transmitted and/or received during any such interval is bounded.
Assumption 2.4: Communication delays are bounded by some $B_0 > 0$, i.e., for all $i,j,\ell$ and $n \in T^{ij}_\ell$ we have $n - t^{ij}_\ell(n) \le B_0$.

³The in-degree of a processor (node) $i$ (in $G_\ell$) is the number of edges in $E_\ell$ pointing to node $i$.
Note that Assumption 2.2 is a special case of 2.3, with $\beta = 1$. Assumption 2.1 holds for all the examples introduced above. Assumption 2.2 holds for Examples I, II, V; Assumption 2.3 holds for Examples III, IV, except for its last part, which has to be explicitly introduced.
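Part (b) of Assumption 2.1 is a plain reachability condition on $G_\ell$ and can be checked mechanically. The sketch below, with an invented four-processor communication graph, verifies that every computing processor has a directed path to every other processor.

```python
def reachable_from(node, edges, num_nodes):
    """Set of nodes reachable from `node` along directed edges (j, i), meaning j -> i."""
    adj = {v: [] for v in range(num_nodes)}
    for j, i in edges:
        adj[j].append(i)
    seen, stack = {node}, [node]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def check_assumption_2_1b(computing, edges, num_nodes):
    """True iff every computing processor has a directed path to every other processor."""
    return all(reachable_from(c, edges, num_nodes) == set(range(num_nodes))
               for c in computing)

# Invented example: processors {0,1,2,3}, processor 0 is computing for component l,
# and messages for that component flow along the ring 0 -> 1 -> 2 -> 3 -> 0.
print(check_assumption_2_1b(computing={0}, edges=[(0, 1), (1, 2), (2, 3), (3, 0)],
                            num_nodes=4))   # True
```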
Equation (2.1), which defines the structure of the algorithm, is a linear system driven by the steps $s^i(n)$. In the special case where communication delays are zero, we have $t^{ij}_\ell(n) = n$, and (2.1) becomes a linear system with state vector $(x^1(n),\ldots,x^M(n))$. Equation (2.5) of Example V best illustrates this situation. In general, however, the presence of communication delays necessitates an augmented state if a state space representation is desired.⁴ (Notice that actually (2.1) defines a decoupled set of linear systems, one for each component $\ell \in \{1,\ldots,L\}$.) Exploiting linearity, we conclude that there exist scalars $\Phi^{ij}_\ell(n|k)$, for $n > k$, such that

$$x^i_\ell(n) = \sum_{j=1}^{M} \Phi^{ij}_\ell(n|0)\, x^j_\ell(1) + \sum_{k=1}^{n-1} \sum_{j=1}^{M} \gamma^j(k)\, \Phi^{ij}_\ell(n|k)\, s^j_\ell(k). \tag{2.6}$$

The coefficients $\Phi^{ij}_\ell(n|k)$ are determined by the sequence of transmission and reception times and the combining coefficients. Consequently, they are unknown, in general. Nevertheless, they have the following qualitative properties.

⁴Such an augmented state should incorporate all messages that have been transmitted but not yet received. Since we are assuming bounded communication delays, there can only be a bounded number of such messages and the augmented system may be chosen finite-dimensional.
Lemma 2.1:
(i) $0 \le \Phi^{ij}_\ell(n|k), \qquad \forall\, i,j,\ell,\, n > k,$  (2.7)
$\sum_{j=1}^{M} \Phi^{ij}_\ell(n|k) \le 1, \qquad \forall\, i,\ell,\, n > k.$  (2.8)
(ii) Under Assumptions 2.1, 2.4 and either Assumption 2.2 or 2.3, $\lim_{n\to\infty} \Phi^{ij}_\ell(n|k)$ exists, for any $i,j,k,\ell$. The limit is independent of $i$ and will be denoted by $\Phi^j_\ell(k)$. Moreover, there is a constant $\eta > 0$ such that, if $j$ is a computing processor for component $\ell$, then

$$\Phi^j_\ell(k) \ge \eta > 0, \qquad \forall k. \tag{2.9}$$

The constant $\eta$ depends only on the constants introduced in our assumptions (i.e., $B_0, B_1, \beta, \alpha$).
(iii) Under Assumptions 2.1, 2.2, 2.4, there exist $d \in [0,1)$, $B > 0$ (depending only on $B_0, B_1, \alpha$) such that

$$\max_{i,j} \big|\Phi^{ij}_\ell(n|k) - \Phi^j_\ell(k)\big| \le B\, d^{\,n-k}, \qquad \forall\, \ell,\, n \ge k. \tag{2.10}$$

(iv) Under Assumptions 2.1, 2.3, 2.4, there exist $d \in [0,1)$, $\delta \in (0,1]$, $B > 0$ (depending only on $B_0, B_1, \beta, \alpha$) such that

$$\max_{i,j} \big|\Phi^{ij}_\ell(n|k) - \Phi^j_\ell(k)\big| \le B\, d^{\,n^\delta - k^\delta}, \qquad \forall\, \ell,\, n \ge k. \tag{2.11}$$

Proof: See Appendix A.
In the light of equation (2.6), Lemma 2.1 admits the following interpretation: part (ii) states that if all processors cease updating (that is, if they set $s^i(n) = 0$) from some time on, they will asymptotically converge to a common limit. Moreover, this common limit depends by a non-negligible factor on all past updates of all computing processors. Parts (iii) and (iv) quantify the natural relationship between the frequency of interprocessor communications and the speed at which agreement is reached.
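Since the $\Phi$ coefficients are the impulse response of the linear system (2.1), they can be computed numerically for any given communication schedule; this is essentially the "experiment" used in the proof in Appendix A. The sketch below uses an invented zero-delay schedule with a single component: a unit update is injected at processor $j$ at time $k$, and the values held at time $n$ are exactly $\Phi^{ij}(n|k)$; consistently with part (ii) of the lemma, they become independent of $i$ and stay bounded away from zero.

```python
import numpy as np

def phi(A_seq, k, j, n, M):
    """Impulse response Phi^{ij}(n|k) of iteration (2.1): zero delays, one component.

    A_seq[m] is the combining matrix A(m) used at time m (identity when no messages
    are exchanged).  A unit update is injected at processor j at time k; the value
    held by processor i at time n is then exactly Phi^{ij}(n|k).
    """
    x = np.zeros(M)
    x[j] = 1.0                       # gamma^j(k) s^j(k) = 1, all other steps are zero
    for m in range(k + 1, n):        # x(m+1) = A(m) x(m), no further steps taken
        x = A_seq[m] @ x
    return x                         # entry i equals Phi^{ij}(n|k)

# Invented schedule: 3 processors; every 4th step all of them exchange and combine
# with the fixed matrix A below, otherwise A(m) = I.
M, horizon = 3, 80
A = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
A_seq = {m: (A if m % 4 == 0 else np.eye(M)) for m in range(horizon)}

print(np.round(phi(A_seq, k=0, j=0, n=horizon, M=M), 6))   # ~[1/3, 1/3, 1/3]
```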
For any pair $(i,j)$ of processors we may also define a linear transformation $\Phi^{ij}(n|k): H \to H$ by

$$\Phi^{ij}(n|k)\, x = \big(\Phi^{ij}_1(n|k)\, x_1, \ldots, \Phi^{ij}_L(n|k)\, x_L\big), \tag{2.12}$$

where $x = (x_1,\ldots,x_L)$. Note that if each $H_\ell$ is one-dimensional then $\Phi^{ij}(n|k)$ may be represented by a diagonal matrix. In general, it corresponds to a block-diagonal matrix, each block being a constant multiple of the identity matrix of suitable dimension. Clearly, $\lim_{n\to\infty} \Phi^{ij}(n|k)$ also exists, is independent of $i$ and will be denoted by $\Phi^j(k)$. We can now define a vector $y(n) \in H$ by

$$y(n) = \sum_{j=1}^{M} \Phi^j(0)\, x^j(1) + \sum_{k=1}^{n-1} \sum_{j=1}^{M} \gamma^j(k)\, \Phi^j(k)\, s^j(k) \tag{2.13}$$

and note that $y(n)$ is recursively generated by

$$y(n+1) = y(n) + \sum_{j=1}^{M} \gamma^j(n)\, \Phi^j(n)\, s^j(n). \tag{2.14}$$

The vector $y(n)$ is the element of $H$ at which all processors would asymptotically agree if they were to stop computing (but keep communicating and combining) at time $n$. It may be viewed as a concise global summary of the state of computation at time $n$, in contrast with the vectors $x^i(n)$ which are the local states of computation; it allows us to look at the algorithm from two different levels: an aggregate and a more detailed one. We will see later that this vector $y(n)$ is also a very convenient tool for proving convergence results. We have noted earlier that equation (2.1) is a linear system. However, it is a fairly complicated one, whereas the recursion (2.14) corresponds to a very simple linear system in standard state space form.
The content of the vector $y(n)$ and of the $\Phi^j(n)$'s is easiest to visualize in two special cases:

Specialization (e.g., Examples I and III): Here $y(n)$ takes each component from the processor who specializes in that component. That is, $y(n) = (x^1_1(n),\ldots,x^L_L(n))$. Accordingly, $\Phi^j_\ell(n) = 0$, for $\ell \ne j$, and $\Phi^j_j(n) = 1$.

Example V: Here $\Phi^{ij}(n|k)$ is the $ij$-th entry of the matrix $\prod_{m=k+1}^{n-1} A(m)$. It follows that the limit of $\Phi^{ij}(n|k)$ is the $ij$-th entry of $\bar{A}$, which by our assumptions depends only on $j$. Moreover, $y(n)$ equals any component of $\bar{A} x(n)$. (All components are equal by our assumptions.) If we multiply both sides of (2.5) by $\bar{A}$ and note that $\bar{A} A(n) = \bar{A}$, we obtain $y(n+1) = y(n) + \sum_{j=1}^{M} \bar{A}_{ij}\, s^j(n)$ (the choice of the row index $i$ is immaterial, since all rows of $\bar{A}$ are identical), which is precisely (2.14).
The model of computation introduced in this section may be generalized in several
directions
[7,19].
To name a few examples, the updating rules of each processor need
not have the linear structure of equation (2.1); also, it may be convenient to communicate other information (e.g.
derivatives of the cost function) and not just the
values of components of x. However, the present model is sufficient for the class of
algorithms under consideration.
Let us also mention that, while all of the above
examples refer to either specialization or overlap, we may think of intermediate
situations in which some of the components are updated by a single processor while
some of the components are updated by all processors simultaneously (partial overlap).
Finally, Assumptions 2.1, 2.4 are unnecessary, as long as the conclusions of
Lemma 2.1 may be somehow independently verified.
3. CONVERGENCE RESULTS
There is a large number of well-known centralized deterministic and stochastic
optimization algorithms which have been analyzed using a variety of analytical
tools [2,11,12,16].
A large class of them, the so-called "pseudo-gradient" algo-
rithms [16], have the distinguishing feature that the (expected) direction of update
(conditioned upon the past history of the algorithm) is a descent direction with
respect to the cost function to be minimized.
The Examples of Section 2 certainly have such a property. Reference [16] presents a larger list of examples. It is
also shown there that the development of results for pseudo-gradient algorithms leads
easily to results for broader classes of algorithms, such as Kiefer-Wolfowitz
stochastic approximation.
In this section we present convergence results for the
natural distributed asynchronous versions of pseudo-gradient algorithms.
We adopt
the model of computation and the corresponding notation of Section 2.
We allow the initialization $\{x^1(1),\ldots,x^M(1)\}$ of the algorithm to be random, with finite mean and variance. We also allow the updates $s^i(n)$ of each processor to be random. On the other hand, we assume that $\gamma^i(n)$ is deterministic; we also model the combining coefficients $a^{ij}_\ell(n)$ and the sequence of transmission and reception times as being deterministic. This is not a serious restriction because they do not need to be known by the processors in advance, in order to carry out the algorithm. We assume that all random variables of interest are defined on a probability space $(\Omega, F, P)$. We introduce $\{F_n\}$, an increasing sequence of $\sigma$-fields contained in $F$ and describing the history of the algorithm up to time $n$. In particular, $F_n$ is defined as the smallest $\sigma$-field such that $s^i(k)$, $k \le n-1$, and $x^i(1)$, $i \in \{1,\ldots,M\}$, are $F_n$-measurable.
We assume that the objective of the algorithm is to minimize a nonnegative cost function $J: H \to [0,\infty)$. For the time being, we only assume that $J$ is a smooth function. In particular, $J$ is allowed to have several local minima.

Assumption 3.1: $J$ is continuously differentiable and its derivative satisfies the Lipschitz condition

$$\|\nabla J(x) - \nabla J(x')\| \le K \|x - x'\|, \qquad \forall\, x, x' \in H, \tag{3.1}$$

where $K$ is some nonnegative constant.
Assumption 3.2: The updates $s^i(n)$ of each processor satisfy⁵

$$\frac{\partial J}{\partial x_\ell}(x^i(n))\, E[s^i_\ell(n) \mid F_n] \le 0, \quad \text{a.s.,} \quad \forall\, i,\ell,n. \tag{3.2}$$

This assumption states that each component of each processor's updates is in a descent direction, when conditioned on the past history of the algorithm, and it is satisfied by Examples I-IV. A slightly weaker version, under which our results remain true, would be

$$\frac{\partial J}{\partial x_\ell}(x^i(n))\, \Phi^i_\ell(n)\, E[s^i_\ell(n) \mid F_n] \le 0, \quad \text{a.s.,} \quad \forall\, i,\ell,n.$$

On the other hand, it can be shown that the condition

$$\nabla J(x^i(n))\, E[s^i(n) \mid F_n] \le 0, \quad \text{a.s.}$$

is not sufficient for proving convergence.
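For the noisy gradient steps used in Examples III and IV, Assumption 3.2 holds because $E[s^i_\ell(n) \mid F_n] = -\partial J/\partial x_\ell(x^i(n))$. A short Monte Carlo sanity check of this componentwise inequality, with an invented quadratic cost and evaluation point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented quadratic cost J(x) = 0.5 * ||x||^2, so grad J(x) = x.
x = np.array([2.0, -1.0])
grad = x

# Noisy step of Examples III/IV: s(n) = -grad J(x) + w(n), with zero-mean noise.
samples = -grad + rng.normal(size=(100000, 2))
s_mean = samples.mean(axis=0)                  # estimates E[s(n) | F_n] = -grad J(x)

# Assumption 3.2, componentwise: dJ/dx_l (x) * E[s_l(n) | F_n] <= 0.
print(np.round(grad * s_mean, 3))              # each entry is approximately -(dJ/dx_l)^2 <= 0
```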
The next assumption is easily seen to hold for Examples I and II. For stochastic algorithms, it requires that the variance of the updates (and hence of any noise contained in them) goes to zero, as the gradient of the cost function goes to zero.

⁵In (3.2), if $H_\ell$ has dimension larger than 1, $\partial J/\partial x_\ell$ should be interpreted as a row vector. In general, the appropriate interpretation should be clear from the context.
Assumption 3.3: For some $K_0 > 0$ and for all $i,\ell,n$,

$$E[\|s^i_\ell(n)\|^2] \le -K_0\, E\Big[\frac{\partial J}{\partial x_\ell}(x^i(n))\, s^i_\ell(n)\Big].$$

As a matter of verifying Assumption 3.3, one would typically check the validity of the slightly stronger condition

$$E[\|s^i_\ell(n)\|^2 \mid F_n] \le -K_0\, \frac{\partial J}{\partial x_\ell}(x^i(n))\, E[s^i_\ell(n) \mid F_n].$$

Also, using Lemma 2.1(ii) and Assumption 3.3 we obtain

$$E[\|s^i_\ell(n)\|^2] \le -K_0\, E\Big[\frac{\partial J}{\partial x_\ell}(x^i(n))\, s^i_\ell(n)\Big] \le -\bar{K}_0\, E\Big[\frac{\partial J}{\partial x_\ell}(x^i(n))\, \Phi^i_\ell(n)\, s^i_\ell(n)\Big], \tag{3.3}$$

where $\bar{K}_0 = K_0/\eta > 0$.
It turns out that (3.3) is all we need for our results to hold.
Our first convergence result states that the algorithm converges in a suitable sense, provided that the step-size employed by each processor is small enough and that the time between consecutive communications is bounded; it applies to Examples I and II. It should be noted, however, that Theorem 3.1 (as well as Theorem 3.2 later) does not yet prove convergence to a minimum or a stationary point of $J$. In particular, there is nothing in our assumptions that prohibits having $s^i(n) = 0$, $\forall\, i,n$. Optimality is obtained later, using a few auxiliary and fairly natural assumptions (see Corollary 3.1).
Theorem 3.1: Let Assumptions 2.1, 2.2, 2.4, 3.1, 3.2, 3.3 hold. Suppose also that $\gamma^i(n) \ge 0$ and that $\sup_{i,n} \gamma^i(n) = \gamma_0 < \infty$. There exists a constant $\gamma^* > 0$ (depending on the constants introduced in the Assumptions) such that the inequality $0 < \gamma_0 \le \gamma^*$ implies:

a) $J(x^i(n))$, $i = 1,2,\ldots,M$, as well as $J(y(n))$, converge almost surely, and to the same limit.

b) $\lim_{n\to\infty} (x^i(n) - x^j(n)) = \lim_{n\to\infty} (x^i(n) - y(n)) = 0$, $\forall\, i,j$, almost surely and in the mean square.

c) The expression

$$\sum_{n=1}^{\infty} \sum_{i=1}^{M} \gamma^i(n)\, \nabla J(x^i(n))\, E[s^i(n) \mid F_n] \tag{3.4}$$

is finite, almost surely. Its expectation is also finite.

Proof: See Appendix B.
Leaving technical issues aside, the idea behind the proof of the above (and the next) Theorem is rather simple: the difference between $y(n)$ and $x^i(n)$, for any $i$, is of the order of $B\gamma_0$, where $B$ is proportional to a bound on communication delays plus the time between consecutive communications between processors. Therefore, as long as $\gamma_0$ remains small, $\nabla J(x^i(n))$ is approximately equal to $\nabla J(y(n))$; hence $s^i(n)$ (and consequently $\Phi^i(n) s^i(n)$) is approximately in a descent direction, starting from point $y(n)$. Therefore, iteration (2.14) is approximately the same as a centralized descent (pseudo-gradient) algorithm, which is, in general, convergent [16].
Decreasing Step-Size Algorithms

We now introduce an alternative set of assumptions. We allow the magnitude of the updates $s^i(n)$ to remain nonzero, even if $\nabla J(x^i(n))$ is zero (Examples III and IV). Such situations are common in stochastic approximation algorithms or in system identification applications. Since the noise is persistent, the algorithm can be made convergent only by letting the step-size $\gamma^i(n)$ decrease to zero. The choice $\gamma^i(n) = 1/n$ is most commonly used in centralized algorithms and in the sequel we will assume that $\gamma^i(n)$ behaves like $1/n$.

Since the step-size is decreasing, the algorithm becomes progressively slower as $n \to \infty$. This allows us to let the communications process become progressively slower as well, provided that it remains fast enough when compared with the natural time scale of the algorithm, the latter being determined by the rate of decrease of the step-size. Such a possibility is captured by Assumption 2.3.
The next assumption, intended to replace Assumption 3.3, allows the noise to be persistent. It holds for Examples I-IV. As in Assumption 3.3, inequality (3.5) could be more naturally stated in terms of conditional expectations, but such a stronger version turns out to be unnecessary.

Assumption 3.4: For some $K_1, K_2 > 0$, and for all $i,\ell,n$,

$$E[\|s^i_\ell(n)\|^2] \le -K_1\, E\Big[\frac{\partial J}{\partial x_\ell}(x^i(n))\, s^i_\ell(n)\Big] + K_2. \tag{3.5}$$

Theorem 3.2: Let Assumptions 2.1, 2.3, 2.4, 3.1, 3.2, 3.4 hold and assume that for some $K_3 > 0$, $\gamma^i(n) \le K_3/n$, $\forall\, n,i$. Then, conclusions (a), (b), (c) of Theorem 3.1 remain valid.

Proof: See Appendix B.
Theorem 3.2 remains valid if (3.5) is replaced by the much weaker assumption

$$E[\|s^i(n)\|^2] \le K_0\, E[J(x^i(n))] - K_1\, E[\nabla J(x^i(n))\, \Phi^i(n)\, s^i(n)] + K_2. \tag{3.6}$$

The proof may be found in [19] and is significantly more complicated. Theorems 3.1 and 3.2 also remain valid even if Assumptions 3.2, 3.3 or 3.4 hold after some finite time $n_0$, but are violated earlier. We only need to assume that $s^i(k)$ has finite mean and variance, for $k \le n_0$.
We continue with a corollary which shows that, under reasonable conditions,
convergence to a stationary point or a global optimum may be guaranteed.
We only
need to assume that away from stationary points some processor will make a positive
improvement in the cost function.
Naturally, we only require the processors to make
positive improvements at times that they are not idle.
Corollary 3.1: Suppose that for some $K_4 > 0$, $\gamma^i(n) \ge K_4/n$, $\forall\, n,i$. Assume that $J$ has compact level sets and that there exist continuous functions $g^i_\ell: H \to [0,\infty)$ such that

$$\frac{\partial J}{\partial x_\ell}(x^i(n))\, E[s^i_\ell(n) \mid F_n] \le -g^i_\ell(x^i(n)), \qquad \forall\, n \in T^i_\ell. \tag{3.7}$$

We define $g: H \to [0,\infty)$ by $g(x) = \sum_{i=1}^{M} \sum_{\ell=1}^{L} g^i_\ell(x)$ and we assume that any point $x \in H$ satisfying $g(x) = 0$ is a stationary point of $J$. Finally, suppose that the difference between consecutive elements of $T^i_\ell$ is bounded, for any $i,\ell$ such that $T^i_\ell \ne \emptyset$. Then:

a) Under the Assumptions of either Theorem 3.1 or 3.2,

$$\liminf_{n\to\infty} \|\nabla J(x^i(n))\| = 0, \qquad \forall i, \quad \text{a.s.} \tag{3.8}$$

b) Under the Assumptions of Theorem 3.1 and if (for some $\varepsilon > 0$) $\gamma^i(n) \ge \varepsilon$, $\forall\, i,n$, we have

$$\lim_{n\to\infty} \|\nabla J(x^i(n))\| = 0, \qquad \forall i, \quad \text{a.s.,} \tag{3.9}$$

and any limit point of $\{x^i(n)\}$ is a stationary point of $J$.

c) Under the Assumptions of either Theorem 3.1 or 3.2 and if every point satisfying $g(x) = 0$ is a minimizing point of $J$ (this is implicitly assuming that all stationary points of $J$ are minima), then

$$\lim_{n\to\infty} J(x^i(n)) = \inf_{x \in H} J(x). \tag{3.10}$$

Proof: See Appendix B.
We now discuss the above corollary and apply it to our examples. Notice first that the assumption on $T^i_\ell$ states that, for each component $\ell$, the time between successive computations of $s^i_\ell$ is bounded, for any computing processor $i$ for that component. Such a condition will always be met in practice. The assumption $\gamma^i(n) \ge K_4/n$ may be enforced without the processor having access to a global clock. For example, apart from the trivial case of constant step-size, we may let $\gamma^i(n) = 1/t^i_n$, where $t^i_n$ is the number of times, before time $n$, that processor $i$ has performed a computation.

For Examples I and III, (3.7) holds with $g^i_i(x)$ a constant multiple of $(\partial J/\partial x_i)^2(x)$; for Examples II and IV, it holds with $g^i(x)$ a constant multiple of $\|\nabla J(x)\|^2$. We may conclude that Corollary 3.1 applies and proves convergence for Examples I-IV.
We close this section by pointing out that our results remain valid if we model the combining coefficients and the transmission and reception times as random variables defined on the same probability space $(\Omega, F, P)$, subject to some restrictions. Notice that such a generalization allows the processors to decide when and where to transmit based on information related to the progress of the algorithm. Hence, for stochastic algorithms, we have to avoid the possibility that a processor judiciously chooses to transmit at those times that it receives a "bad measurement" $s^i(n)$, or somehow ensure that the long-run outcome is independent of such decisions. It turns out that one only needs to assume that, for any $i \in \{1,\ldots,M\}$ and any $m \le n$, the random variables $\Phi^i(m)$ and $s^i(n)$ are conditionally independent given the history of the algorithm up to time $n$ [19]. This assumption holds if communications and the combining coefficients are independent of the noise in the computations (this is true, in particular, for deterministic algorithms), as well as for the specialization case, because $\Phi^i(n)$ turns out to be deterministic and a priori known, even if the communication times are random.
4. EXTENSIONS, APPLICATIONS AND CONCLUSIONS
A main direction along which our results may be extended is in analyzing the
convergence of distributed algorithms with decreasing step-size and with correlated
noise, for which the pseudo-gradient assumption fails to hold. Such algorithms arise frequently, for example in system identification. Very few global convergence results are available, even for the centralized case [17].
However, as in the central-
ized case an ordinary differential equation (ODE) may be associated with such algorithms, which may be used to prove local convergence subject to an assumption that
the algorithm returns infinitely often to a bounded region [11,12,19].
Another issue, arising in the case of constant step-size algorithms, concerns
the choice of a step-size which will guarantee convergence.
We may trace the steps in the proof of Theorem 3.1 and find some bounds on $\gamma_0$ so as to ensure convergence, but these bounds will not be particularly tight.
For a version of a distributed
deterministic gradient algorithm, tighter bounds have been obtained in [19] which
quantify the notion that the frequency of communications between different processors should in some sense reflect the degree of coupling inherent in the
optimization problem.
It should be clear that the bounds on the time between successive communications
and delays are imposed so as to guarantee that processors are being informed, without
excessive delays, about changes in the state of computation of other processors.
The same effect, however, could be accomplished by having each processor monitor its
state and inform the others only if a substantial change occurs.
It seems that such
an approach could lead to some savings in the number of messages being exchanged.
However, more research is needed on this issue and, in particular, the notion of a
"substantial change" in the state of computation of a processor needs to be made
more precise.
A final issue of interest is the rate of convergence of distributed algorithms.
With perfect synchronization, the rate of convergence is the same as for their
centralized counterparts. Asynchronism should be expected to bring about some deterioration.
We may then ask how much asynchronism may be tolerated before such
deterioration becomes significant.
For decreasing step-size stochastic algorithms,
as long as communications do not become too infrequent, it is safe to conjecture
that the convergence rate will not deteriorate at all; still, there would be some
deterioration in the transient performance of such algorithms which is also worth
investigating.
Concerning possible applications, there are three broad areas that come to
mind.
There is first the area of parallel computation, where an asynchronous
algorithm could avoid several types of bottlenecks [10].
Then, there is the area of data communication networks, in which there has been much interest in distributed
algorithms for routing and flow control [6,9].
Such algorithms resemble the ones
that we have analyzed and our methods of analysis, with a few modifications, are
applicable [18].
Finally, certain common algorithms for system identification or
adaptive filtering fall into the framework of decreasing step-size stochastic algorithms and our approach may be used for analyzing the convergence of their distributed versions [19].
Our results may not be applicable without any modifications or
refinements to such diverse application areas.
Nevertheless, our analysis indicates
what kind of results should be expected to hold and provides tools for proving them.
As a conclusion, the natural distributed asynchronous versions of a large
class of deterministic and stochastic optimization algorithms retain the desirable
convergence properties of their centralized (or synchronous) counterparts, under mild
assumptions on the timing of communications and computations and even in the presence
of communication delays.
It appears that there are many promising applications
of such algorithms in fields as diverse as parallel computation, data communication
networks and distributed signal processing.
REFERENCES
1. Arrow, K.J. and L. Hurwicz, "Decentralization and Computation in Resource Allocation," in Essays in Economics and Econometrics, R.W. Pfouts (ed.), Univ. of North Carolina Press, Chapel Hill, N.C., 1960, pp. 34-104.
2. Avriel, M., Nonlinear Programming, Prentice-Hall, Englewood Cliffs, N.J., 1976.
3. Baudet, G.M., "Asynchronous Iterative Methods for Multiprocessors," Journal of the ACM, Vol. 25, No. 2, 1978, pp. 226-244.
4. Bertsekas, D.P., "Distributed Dynamic Programming," IEEE Transactions on Automatic Control, Vol. AC-27, No. 3, 1982, pp. 610-616.
5. Bertsekas, D.P., "Distributed Asynchronous Computation of Fixed Points," Mathematical Programming, Vol. 27, 1983, pp. 107-120.
6. Bertsekas, D.P., "Optimal Routing and Flow Control Methods for Communication Networks," in Analysis and Optimization of Systems, A. Bensoussan and J.L. Lions (eds.), Springer Verlag, Berlin, 1982, pp. 615-643.
7. Bertsekas, D.P., J.N. Tsitsiklis and M. Athans, "Convergence Theories of Distributed Iterative Processes: A Survey," submitted to SIAM Review, 1984.
8. Chazan, D. and W. Miranker, "Chaotic Relaxation," Linear Algebra and Applications, Vol. 2, 1969, pp. 199-222.
9. Gallager, R.G., "A Minimum Delay Routing Algorithm Using Distributed Computation," IEEE Transactions on Communications, Vol. COM-25, No. 1, 1977, pp. 73-85.
10. Kung, H.T., "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," in Algorithms and Complexity, Academic Press, 1976, pp. 153-200.
11. Kushner, H.J. and D.S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Applied Math. Series, No. 26, Springer Verlag, Berlin, 1978.
12. Ljung, L., "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control, Vol. AC-22, No. 4, 1977, pp. 551-575.
13. Ljung, L. and T. Soderstrom, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.
14. Meyer, P.A., Probability and Potentials, Blaisdell, Waltham, MA, 1966.
15. Papadimitriou, C.H. and J.N. Tsitsiklis, "On the Complexity of Designing Distributed Protocols," Information and Control, Vol. 53, No. 3, June 1982, pp. 211-218.
16. Poljak, B.T. and Y.Z. Tsypkin, "Pseudogradient Adaptation and Training Algorithms," Automation and Remote Control, No. 3, 1973, pp. 45-68.
17. Solo, V., "The Convergence of AML," IEEE Transactions on Automatic Control, Vol. AC-24, 1979, pp. 958-962.
18. Tsitsiklis, J.N. and D.P. Bertsekas, "Distributed Asynchronous Optimal Routing for Data Networks," Proceedings of the 23rd Conference on Decision and Control, Las Vegas, Nevada, December 1984.
19. Tsitsiklis, J.N., "Problems in Decentralized Decision Making and Computation," Ph.D. Thesis, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, MA, 1984.
20. Yao, A.C., "Some Complexity Questions Related to Distributed Computing," Proceedings of the 11th STOC, 1979, pp. 209-213.
APPENDIX A
Proof of Lemma 2.1: We only need to prove the Lemma for each component $\ell$ separately, since (2.1) corresponds to a decoupled set of linear systems. We may therefore assume that there is only one component; so, the subscript $\ell$ will be omitted and $\Phi^{ij}(n|k)$ becomes a scalar. The coefficient $\Phi^{ij}(n|k)$, being the impulse response of the linear system (2.1), is determined by the following "experiment": let us fix a processor $j$ and some time $k$; let $x^h(1) = 0$, $\forall h$, $s^h(m) = 0$, $\forall\, h,m$, unless $h = j$ and $m = k$, in which case we let $\gamma^j(k) s^j(k) = v$, some nonzero element of $H$. For all times $n$ and for all processors $i$, $x^i(n)$ will be a scalar multiple of $v$. This proportionality factor is precisely equal to $\Phi^{ij}(n|k)$. Since this experiment takes place in a one-dimensional subspace of $H$, we may assume, without loss of generality, that $H$ is one-dimensional and that $v = 1$. We then have, for the above "experiment", $\Phi^{ij}(n|k) = x^i(n)$, $\forall\, i,n$. A trivial induction based on (2.1) and using (2.2) shows that $x^i(n) \ge 0$, $\forall\, i,n$, which proves (2.7). For inequality (2.8) we consider a different "experiment": let us fix some time $k$ and let $x^h(1) = 0$, $\forall h$, $s^h(m) = 0$, $\forall\, m,h$, unless $m = k$, in which case we let $\gamma^h(k) s^h(k) = 1$, $\forall h$ ($H$ is still assumed one-dimensional). We then use (2.1) and (2.3) to show, by a simple inductive argument, that $x^i(n) \le 1$, $\forall\, i,n$, which concludes the proof of part (i).
For the proof of the remaining parts of the Lemma we first perform a reduction to a simpler case. It is relatively easy to see that we may assume, without loss of generality, that communication delays are nonzero. For example, we could redefine the time variable so that one time unit for the old time variable corresponds to two time units for the new one and so that any message that had zero delay for the original description has unit delay for the new description. If any of the Assumptions 2.2, 2.3, 2.4 holds in terms of the old time variable, it also remains true with the new one.

Next, we perform a reduction to the case where all messages have zero delay, as follows: for any $(i,j) \in E$, we introduce a finite set of dummy processors between $i$ and $j$ (see Fig. A.1) which act as buffers. Any message from $i$ to $j$ is first transmitted to a buffer processor (with zero delay) which holds it for time equal to the desired delay and then transmits it to $j$, again with zero delay. (So, whenever a buffer processor $h$ receives a message, it places a combining coefficient equal to 1 on it.) The buffer processor which is to be employed for any particular message is determined by a round-robin rule. Note that for each pair $(i,j) \in E$ the number of buffer processors that needs to be introduced is equal to the maximum communication delay for messages from $i$ to $j$, which has been assumed finite.⁶ Note also that the buffer processors are non-computing ones and have in-degree equal to 1. It is easy to see that if Assumptions 2.2, 2.3 are valid for the original description of the algorithm, they are also valid for the above introduced augmented description, possibly with a different choice of constants. Assumption 2.1 also remains valid, except for part c(iii), which may be violated. (If in Figure A.1 processor $j$ had originally in-degree equal to 1, it now has in-degree equal to 4, but there is nothing in our assumptions that guarantees that $a^{jj}(n) \ge \alpha$, $\forall n$.) We have, nevertheless, the following condition:

⁶This procedure is equivalent to a state augmentation for the linear system (2.1).
Figure A.1: Reduction to the zero delay case.
Assumption A.1: For any processor $i$ with in-degree larger than 1, either
1. Assumption 2.1c(iii) holds, or
2. All predecessors of $i$ are non-computing, have in-degree 1 and a common predecessor.

From now on, we assume, without loss of generality, that communication delays are zero, provided that we replace Assumption 2.1c(iii) by Assumption A.1.
Let $A(n)$, $\Phi(n|k)$ be the matrices with coefficients $a^{ij}(n)$, $\Phi^{ij}(n|k)$, respectively. Because of the zero delay assumption it is easy to see that

$$\Phi(n|k) = \prod_{m=k+1}^{n-1} A(m), \qquad n \ge k+1.$$
We first prove the desired results under Assumptions 2.1 and 2.2. By combining Assumptions 2.1c(i) and 2.2, note that there exists a constant $B$ such that, for any $(i,j) \in E$ and any time interval $I$ of length $B$, there exists some $t \in I$ such that $a^{ji}(t) \ge \alpha > 0$. (These are times at which $i$ communicates to $j$.) Let us fix a computing processor $j$ and some time $k$. By relabelling, let us assume that $j = 1$. We will show by induction (with respect to a particular numbering of the processors) that for any processor $i$, there exists some $\alpha_i > 0$, independent of $k$, such that $\Phi^{i1}(n|k) \ge \alpha_i$, for all $n \in [k+(i-1)B+1,\, k+MB]$.⁷

To start the induction, consider first the case $i = 1$ and notice that

$$\Phi^{11}(n|k) \ge \prod_{t=k+1}^{n-1} a^{11}(t) \ge \alpha^{\,n-k-1} \ge \alpha^{MB} =: \alpha_1 > 0, \qquad \forall\, n \in [k+1,\, k+MB].$$

⁷Since each $A(n)$ is a "stochastic" matrix (nonnegative entries, each row sums to 1), questions of convergence of $\Phi(n|k)$ are equivalent to questions about the long-run behavior of a finite (time-varying) Markov chain.
Suppose now that a subset $S$ of the processors has been numbered $\{1,\ldots,i-1\}$, for some $i \ge 2$, and that the induction hypothesis has been proved for all processors in $S$. We show that it is always possible to find a new processor in $V \setminus S$, rename it to $i$ and prove the induction hypothesis for $i$ as well. Consider the set $Q$ of processors $q \notin S$ such that $(p,q) \in E$, for some $p \in S$. If $Q = \emptyset$, then $S$ is the set of all processors (because of Assumption 2.1b) and we are done. If not, we choose one processor from $Q$, and rename it to $i$, subject to the following restriction: we choose a processor with in-degree more than one only if no processor with in-degree equal to one belongs to $Q$.

We now prove the induction hypothesis for processor $i$. We first suppose that $i$ has in-degree 1. Let $h \in S$ be some predecessor of $i$, belonging to $S$. Then, $h < i$ and, by the induction hypothesis, $\Phi^{h1}(n|k) \ge \alpha_h > 0$, $\forall\, n \in [k+(i-2)B+1,\, k+MB]$. Moreover, for some $t \in [k+(i-2)B+1,\ldots,k+(i-1)B]$ we have $a^{ih}(t) \ge \alpha$, and consequently $\Phi^{i1}(t+1|k) \ge \alpha\,\alpha_h$. We prove by induction on $n$, for $n \in [t+1,\ldots,k+MB]$, that $\Phi^{i1}(n|k) \ge \alpha\,\alpha_h =: \alpha_i > 0$. Indeed,

$$\Phi^{i1}(n+1|k) = a^{ih}(n)\, \Phi^{h1}(n|k) + a^{ii}(n)\, \Phi^{i1}(n|k) \ge \min\{\Phi^{h1}(n|k),\, \Phi^{i1}(n|k)\} \ge \min\{\alpha_h,\, \alpha\,\alpha_h\} = \alpha\,\alpha_h = \alpha_i. \tag{A.1}$$

Suppose now that $i$ has in-degree more than 1 and that Assumption 2.1c(iii) holds. As before, let $h \in S$ be a predecessor of $i$ and let $t \in [k+(i-2)B+1,\ldots,k+(i-1)B]$ be a time at which $a^{ih}(t) \ge \alpha$, so that $\Phi^{i1}(t+1|k) \ge \alpha\,\alpha_h$. Then,

$$\Phi^{i1}(n|k) \ge \Big(\prod_{m=t+1}^{n-1} a^{ii}(m)\Big)\, \Phi^{i1}(t+1|k) \ge \alpha^{MB}\, \alpha\,\alpha_h =: \alpha_i > 0, \qquad \forall\, n \in [t+1,\, k+MB].$$

The last possibility is that $i$ has in-degree more than 1 (hence all processors in $Q$ have in-degree more than 1) and Assumption 2.1c(iii) fails. Then the set of predecessors of $i$ (denoted by $U$) has a single common predecessor, denoted by $j$. Since $i \in Q$, some $h \in U$ must belong to $S$. Since any $h \in U$ is non-computing, we have $h \ne 1$ and its predecessor $j$ must belong to $S$. Now, for any $p \in U$, $p$ does not belong to $Q$ (since it has in-degree 1) and therefore $p \in S$. We conclude that all predecessors of $i$ belong to $S$ ($U \subset S$). Again, let $t \in [k+(i-2)B+1,\ldots,k+(i-1)B]$ be a time at which $i$ receives a message from some $h \in U$, so that $\Phi^{i1}(t+1|k) \ge \alpha\,\alpha_h \ge \alpha \min_{h' \in U}\{\alpha_{h'}\}$. We now perform an easy induction on $n$, for $n \in [t+1,\ldots,k+MB]$, to show that

$$\Phi^{i1}(n|k) \ge \alpha \min_{h \in U} \{\alpha_h\} =: \alpha_i > 0.$$

Indeed,

$$\Phi^{i1}(n+1|k) = \sum_{h \in U} a^{ih}(n)\, \Phi^{h1}(n|k) + a^{ii}(n)\, \Phi^{i1}(n|k) \ge \min_{h \in U \cup \{i\}} \Phi^{h1}(n|k) \ge \min\big\{\alpha_i,\, \min_{h \in U}\{\alpha_h\}\big\} = \alpha_i.$$
This completes our inductive argument.
We may now conclude that $\Phi(k+MB|k)$ is a stochastic matrix with the property that all entries in some column (corresponding to any computing processor) are positive and bounded away from zero by a constant $\bar{\alpha} > 0$ which does not depend on $k$. We combine this fact with Lemma A.1 below to conclude that the assertions of Lemma 2.1 are true.
Lemma A.1: Consider a sequence $\{D_n\}$ of nonnegative matrices with the properties that:
(i) Each row sums to 1.
(ii) For some $\alpha > 0$ and for some column (say the first one), all entries of $D_n$, for any $n$, in that column are larger than or equal to $\alpha$.

Then,
a) $D = \lim_{n\to\infty} D_n D_{n-1} \cdots D_1$ exists.
b) All rows of $D$ are identical.
c) The entries in the first column of $D$ are bounded below by $\alpha$.
d) Convergence to $D$ takes place at the rate of a geometric progression.
Proof of Lemma A.1: Given any vector $x = (x_1,\ldots,x_M)$ we decompose it as $x = y + ce$, where $c$ is a scalar, $e$ is the vector with all entries equal to 1, and $y$ has one zero entry and all other entries nonnegative. (So $c$ equals the minimum of the components of $x$.) Let $x(n) = D_n D_{n-1} \cdots D_1 x(0)$, $\forall n$, and $x(n) = y(n) + c(n)e$. It is easy to see that

$$\|y(n+1)\| \le (1-\alpha)\, \|y(n)\|,$$

where $\|\cdot\|$ denotes the max-norm on $R^M$, which shows that $y(n)$ converges geometrically to zero. Moreover, $c(n) \le c(n+1) \le c(n) + \|y(n)\|$, which shows that $c(n)$ also converges geometrically to some $c$. Hence, $x(n)$ converges geometrically to $ce$. Since this is true for any $x(0) \in R^M$, parts (a), (b), (d) of the Lemma follow. Part (c) is proved by an easy induction for the finite products of the $D_k$'s and, therefore, it holds for $D$ as well. □
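A quick numerical illustration of Lemma A.1: the matrices below are generated arbitrarily, subject only to having unit row sums and a first column bounded below by $\alpha$, and the running products are seen to approach a matrix with identical rows at a geometric rate.

```python
import numpy as np

rng = np.random.default_rng(2)
M, alpha = 4, 0.2

def random_D(rng, M, alpha):
    """Nonnegative matrix with rows summing to 1 and first-column entries >= alpha."""
    rest = rng.random((M, M - 1))
    rest = (1.0 - alpha) * rest / rest.sum(axis=1, keepdims=True)
    return np.hstack([np.full((M, 1), alpha), rest])

P = np.eye(M)
spreads = []
for n in range(40):
    P = random_D(rng, M, alpha) @ P           # running product D_n D_{n-1} ... D_1
    spreads.append(np.ptp(P, axis=0).max())   # largest spread within any column

print(np.round(P, 4))                          # rows nearly identical; first column >= alpha
print(["%.1e" % s for s in spreads[::10]])     # column spreads decay roughly geometrically
```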
We now consider the case where Assumption 2.2 is replaced by 2.3. The key observation here is that during an interval of the form $[B_1 n^\beta, B_1(n+1)^\beta)$ a bounded number of messages is transmitted; hence, $A(k) = I$, except for a bounded number of times in that interval. If we redefine the time variable so that time is incremented only at communication times, we have reduced the problem to the case of Assumption 2.2. The only difference, due to the change of the time variable, is in the rate of convergence. Under Assumption 2.2, $\|\Phi(n|k) - \Phi(k)\|$ decreases by a constant factor during intervals of constant length. This implies that, under Assumption 2.3, $\|\Phi(n|k) - \Phi(k)\|$ decreases by a constant factor during intervals of the form $[B_1 t^\beta, B_1(t+c)^\beta]$, for some appropriate constant $c$. Therefore, if $B_1 t^\beta = k$ and $B_1(t+m)^\beta = n$, we have $\|\Phi(n|k) - \Phi(k)\| \le B\, d_0^{\,m/c}$, for some $B > 0$, $d_0 \in [0,1)$. Eliminating $t$ and solving for $m$, we obtain

$$m = \Big(\frac{n}{B_1}\Big)^{1/\beta} - \Big(\frac{k}{B_1}\Big)^{1/\beta},$$

which finally yields

$$\|\Phi(n|k) - \Phi(k)\| \le B\, d_0^{\,(n^{1/\beta} - k^{1/\beta})/(c B_1^{1/\beta})}.$$

Let $\delta = 1/\beta$ and $d = d_0^{\,1/(c B_1^{1/\beta})}$ to recover the desired result. □
APPENDIX B
This Appendix contains the proofs of the results of Section 3.
Remark on Notation: In the course of the proofs in this section, we will use the symbol $A$ to denote non-negative constants which are independent of $n$, $\gamma_0$, $\gamma^i(n)$, $x^i(n)$, $s^i(n)$, etc., but which may depend on the constants introduced in the various assumptions (that is, $M$, $L$, $K$, $K_0$, $B_0$, $B_1$, $\alpha$, etc.). When $A$ appears in different expressions, or even in different sides of the same equality (or inequality), it will not necessarily represent the same constant. (With this convention, an inequality of the form $A + 1 \le A$ is meaningful and has to be interpreted as saying that $A + 1$, where $A$ is some constant, is smaller than some other constant, denoted again by $A$.) This convention is followed so as to avoid the introduction of unnecessarily many symbols.
Without loss of generality, we will assume that the algorithm is initialized so that $x^i(1) = 0$, $\forall i$. In the general case where $x^i(1) \ne 0$, we may think of the algorithm as having started at time 0, with $x^i(0) = 0$; then, a random update $s^i(0)$ sets $x^i(1)$ to a nonzero value. So, the case in which the processors initially disagree may be easily reduced to the case where they initially agree.

Note that we may define $\bar{s}^i(n) = \frac{\gamma^i(n)}{\gamma_0}\, s^i(n)$ and view $\bar{s}^i(n)$ as the new step with step-size $\gamma_0$. It is easy to see that Assumptions (3.2) and (3.3) also hold for $\bar{s}^i(n)$. For these reasons, no generality is lost if we assume that $\gamma^i(n) = \gamma_0$, $\forall n$, and this is what we will do.
Let us define
$$ b(n) = \sum_{i=1}^{M} \| s^i(n) \| $$  (B.1)
and note that
$$ b^2(n) \le M \sum_{i=1}^{M} \| s^i(n) \|^2 \le M\, b^2(n). $$
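The two-sided bound following (B.1), in the form reconstructed above, is just the Cauchy–Schwarz inequality applied to the M terms of the sum; written out:
$$ b^2(n) = \Big( \sum_{i=1}^{M} \| s^i(n) \| \Big)^{2} \le M \sum_{i=1}^{M} \| s^i(n) \|^{2} \le M \Big( \sum_{i=1}^{M} \| s^i(n) \| \Big)^{2} = M\, b^2(n). $$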
Using (2.6), (2.13) and Lemma 2.1(iii), we obtain
$$ \| y(n) - x^i(n) \| \le \gamma_0 \sum_{k=1}^{n-1} \sum_{j=1}^{M} \big| D^{ij}(n|k) - D^{j}(k) \big| \, \| s^j(k) \| \le A \gamma_0 \sum_{k=1}^{n-1} d^{\, n-k}\, b(k). $$  (B.2)
From a Taylor series expansion for J we obtain
$$ J(y(n+1)) = J\Big( y(n) + \gamma_0 \sum_{i=1}^{M} D^{i}(n)\, s^i(n) \Big) \le J(y(n)) + \gamma_0\, \nabla J(y(n)) \cdot \sum_{i=1}^{M} D^{i}(n)\, s^i(n) + A \gamma_0^2\, b^2(n). $$  (B.3)
Assumption 3.2 is in terms of $\nabla J(x^i(n))$, whereas above we have $\nabla J(y(n))$. To overcome this difficulty, we use the Lipschitz continuity of the derivative of J and invoke (B.2) to obtain
$$ \Big| \nabla J(y(n)) \cdot \sum_{i=1}^{M} D^{i}(n)\, s^i(n) - \sum_{i=1}^{M} \nabla J(x^i(n)) \cdot D^{i}(n)\, s^i(n) \Big| \le A \sum_{i=1}^{M} \| y(n) - x^i(n) \| \, \| s^i(n) \| \le \gamma_0 A \sum_{k=1}^{n-1} d^{\, n-k}\, b(k)\, b(n) \le \gamma_0 A \sum_{k=1}^{n-1} d^{\, n-k}\, \big[ b^2(k) + b^2(n) \big]. $$  (B.4)
Let us define
$$ G^{i}(n) = -\nabla J(x^i(n)) \cdot D^{i}(n)\, s^i(n), $$  (B.5)
$$ G(n) = \sum_{i=1}^{M} G^{i}(n), $$  (B.6)
and note that Assumption 3.2 implies that $E[G(n)] \ge 0$.
We now rewrite inequality (B.3), using (B.4) to replace the derivative term, to obtain
$$ J(y(n+1)) \le J(y(n)) - \gamma_0\, G(n) + A \gamma_0^2\, b^2(n) + A \gamma_0^2 \sum_{k=1}^{n-1} d^{\, n-k}\, \big[ b^2(k) + b^2(n) \big] \le J(y(n)) - \gamma_0\, G(n) + A \gamma_0^2 \sum_{k=1}^{n} d^{\, n-k}\, b^2(k). $$  (B.7)
Assumption 3.3 implies that [cf. inequality (3.3)]
$$ E\big[ b^2(k) \big] \le A\, E[G(k)]. $$  (B.8)
Taking expectations in (B.7) and using (B.8), we obtain
$$ E[J(y(n+1))] \le E[J(y(n))] - \gamma_0\, E[G(n)] + A \gamma_0^2 \sum_{k=1}^{n} d^{\, n-k}\, E[G(k)]. $$  (B.9)
We then sum (B.9), for different values of n, to obtain
$$ 0 \le E[J(y(n+1))] \le E[J(y(1))] - \gamma_0 \sum_{k=1}^{n} E[G(k)] + A \gamma_0^2 \sum_{k=1}^{n} E[G(k)]. $$  (B.11)
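The passage from (B.9) to (B.11) rests on interchanging the order of summation and summing the geometric series; written out, for any n:
$$ \sum_{m=1}^{n} \sum_{k=1}^{m} d^{\, m-k}\, E[G(k)] = \sum_{k=1}^{n} E[G(k)] \sum_{m=k}^{n} d^{\, m-k} \le \frac{1}{1-d} \sum_{k=1}^{n} E[G(k)], $$
which is how the double sum produced by summing (B.9) over n is absorbed into the single sum appearing in (B.11).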
Provided that $\gamma_0$ has been chosen small enough so that $\gamma_0 - A\gamma_0^2 > 0$, inequality (B.11) and the nonnegativity of J imply
$$ \sum_{k=1}^{\infty} E[G(k)] < \infty. $$  (B.12)
By Assumption 3.2, $E[G(k) \mid F_k] \ge 0$, $\forall k$; we may apply the monotone convergence theorem to (B.12) and obtain
$$ E\Big[ \sum_{k=1}^{\infty} E\big[ G(k) \mid F_k \big] \Big] = \sum_{k=1}^{\infty} E[G(k)] < \infty, $$  (B.13)
which implies
$$ \sum_{k=1}^{\infty} E\big[ G(k) \mid F_k \big] < \infty, \quad \text{a.s.} $$  (B.14)
From (B.14) we obtain
$$ \sum_{n=1}^{\infty} \nabla J(x^i(n)) \cdot E\big[ D^{i}(n)\, s^i(n) \mid F_n \big] > -\infty, \quad \text{a.s.} $$
Now use the fact (Lemma 2.1(ii), inequality (2.9)) that $D^{i}_{\ell}(n) \ge \alpha > 0$ for any computing processor i for component $\ell$, together with $-\frac{\partial J}{\partial x_\ell}(x^i(n))\, E[ s^i_\ell(n) \mid F_n ] \ge 0$, a.s. This implies that
$$ \sum_{n=1}^{\infty} \frac{\partial J}{\partial x_\ell}(x^i(n))\, E\big[ s^i_\ell(n) \mid F_n \big] > -\infty, \quad \text{a.s.,} $$
and establishes part (c) of the theorem.
Lemma B.1: Let X(n), Z(n) be non-negative stochastic processes (with finite expectation) adapted to $\{F_n\}$ and such that
$$ E\big[ X(n+1) \mid F_n \big] \le X(n) + Z(n), $$  (B.15)
$$ \sum_{n=1}^{\infty} E[Z(n)] < \infty. $$  (B.16)
Then X(n) converges almost surely, as $n \to \infty$.

Proof of Lemma B.1: By the monotone convergence theorem and (B.16) it follows that $\sum_{n=1}^{\infty} Z(n) < \infty$, almost surely. Then, Lemma B.1 becomes the same as Lemma 4.C.1 in [13, p.453], which in turn is a consequence of the supermartingale convergence theorem [14]. □
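Lemma B.1 is an "almost supermartingale" statement, and its conclusion is easy to see on a toy example. The simulation below is an illustration only; the particular processes X(n), Z(n) are hypothetical choices made here to satisfy the hypotheses, not objects from the paper.

```python
# Toy instance of Lemma B.1: X(n+1) = X(n) * U(n) + Z(n) with E[U(n)] = 1,
# U(n) >= 0 and Z(n) = 1/n^2 deterministic, so that
#   E[X(n+1) | F_n] = X(n) + Z(n)      (hypothesis (B.15))
#   sum_n E[Z(n)] < infinity           (hypothesis (B.16)).
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_paths = 20000, 4
X = np.full(n_paths, 5.0)

for n in range(1, n_steps + 1):
    U = rng.uniform(0.5, 1.5, size=n_paths)   # nonnegative, mean one
    Z = 1.0 / n**2                             # summable excess term
    X = X * U + Z
    if n in (10, 100, 1000, 10000, 20000):
        print(f"n={n:6d}   X = {np.array_str(X, precision=5)}")
# Every sample path settles down (here to zero, since E[log U] < 0); the lemma
# itself only asserts that the limit exists almost surely.
```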
Now let A be the constant in the right hand side of (B.7) and let
$$ Z(n) = A \gamma_0^2\, E\Big[ \sum_{k=1}^{n} d^{\, n-k}\, b^2(k) \,\Big|\, F_n \Big]. $$  (B.17)
Then, $Z(n) \ge 0$ and, by (B.8),
$$ E[Z(n)] \le A \sum_{k=1}^{n} d^{\, n-k}\, E[G(k)]. $$  (B.18)
Therefore,
$$ \sum_{n=1}^{\infty} E[Z(n)] \le A \sum_{n=1}^{\infty} \sum_{k=1}^{n} d^{\, n-k}\, E[G(k)] = \frac{A}{1-d} \sum_{k=1}^{\infty} E[G(k)] < \infty, $$  (B.19)
where the last inequality follows from (B.12). We take the conditional expectation of (B.7), given $F_n$. Note that J(y(n)) is $F_n$-measurable and that $E[G(n) \mid F_n] \ge 0$. Therefore, $X(n) = J(y(n))$ and Z(n) satisfy (B.15) and (B.16), and Lemma B.1 applies: J(y(n)) converges almost surely.
Using Assumption 3.3 once more, together with (B.12),
$$ \sum_{k=1}^{\infty} E\big[ b^2(k) \big] \le A \sum_{k=1}^{\infty} E[G(k)] < \infty, $$  (B.20)
which implies that b(k) converges to zero, almost surely. Recall (B.2) to conclude that $y(n) - x^i(n)$ converges to zero, almost surely. Also, by squaring (B.2), taking expectations and using the fact that $E[b^2(k)]$ converges to zero, we conclude that $E[\| y(n) - x^i(n) \|^2]$ also converges to zero, and this proves part (b) of the Theorem.
Now, let us use Assumption 3.1 and a second order expansion of J to obtain, for any $a \in R$,
$$ 0 \le J\big( x - a \nabla J(x) \big) \le J(x) - a A_1 \| \nabla J(x) \|^2 + a^2 A_2 \| \nabla J(x) \|^2, $$  (B.21)
where $A_1$, $A_2$ are positive constants not depending on a. Assuming that a was chosen small enough, we may use (B.21) and the nonnegativity of J to conclude
$$ \| \nabla J(x) \|^2 \le A\, J(x), \qquad \forall x \in H. $$  (B.22)
Since J(y(n)) converges, it is bounded; hence $\nabla J(y(n))$ is also bounded, by (B.22). We then use the fact that $y(n) - x^i(n)$ converges to zero, to conclude that $J(x^i(n)) - J(y(n))$ also converges to zero. This proves part (a) and concludes the proof of the Theorem. □
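For completeness, here is the small piece of algebra that turns (B.21) into (B.22), written out with the constants $A_1$, $A_2$ introduced there:
$$ 0 \le J(x) - a A_1 \| \nabla J(x) \|^2 + a^2 A_2 \| \nabla J(x) \|^2 \;\Longrightarrow\; \big( a A_1 - a^2 A_2 \big) \| \nabla J(x) \|^2 \le J(x), $$
so that, fixing any $a \in (0, A_1 / A_2)$ and setting $A = ( a A_1 - a^2 A_2 )^{-1}$, we obtain $\| \nabla J(x) \|^2 \le A\, J(x)$ for all x, which is (B.22).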
In the following Lemma we bound certain infinite series by corresponding infinite integrals. This is justified as long as the integrand cannot change by more than a constant factor between any two consecutive integer points. For notational convenience, we use c(n|k) to denote $d^{\, n^{\delta} - k^{\delta}}$, where d and $\delta$ are as in Lemma 2.1(iv).

Lemma B.2: The following hold:
$$ \sum_{n=1}^{\infty} \frac{1}{n} \sum_{m=n}^{\infty} \frac{1}{m}\, c(m|n) < \infty, $$  (B.23)
$$ \lim_{m \to \infty} m \sum_{k=1}^{m} \frac{1}{k^2}\, c(m|k) = \lim_{m \to \infty} \sum_{k=1}^{m} \frac{1}{k}\, c(m|k) = 0, $$  (B.24)
$$ \lim_{m \to \infty} \sum_{k=m}^{\infty} \frac{1}{k}\, c(k|m) = 0. $$  (B.25)
Proof of Lemma B.2: Let $t^{\delta} = y$; then $t = y^{1/\delta}$ and $dt = (1/\delta)\, y^{(1/\delta) - 1}\, dy$. Therefore,
$$ \int_{t=s}^{\infty} d^{\, t^{\delta} - s^{\delta}}\, dt = \frac{1}{\delta} \int_{y=s^{\delta}}^{\infty} y^{(1/\delta) - 1}\, d^{\, y - s^{\delta}}\, dy \le A\, s^{\, 1 - \delta}, $$  (B.26)
where A does not depend on s. Equation (B.25) follows. Since $A\, s^{-1-\delta}$ is an integrable function of s, (B.23) follows as well. The left hand side of (B.24) is bounded by
$$ A\, m \int_{t=1}^{m} \frac{1}{t^2}\, d^{\, m^{\delta} - t^{\delta}}\, dt = \frac{A\, m}{\delta} \int_{y=1}^{m^{\delta}} y^{-(1/\delta) - 1}\, d^{\, m^{\delta} - y}\, dy \le A\, m^{-\delta}, $$
which converges to zero. The middle term in (B.24) is certainly smaller and converges to zero as well. □
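As a sanity check on the reading of (B.24) adopted above, the two quantities in that display can be evaluated numerically; the values of d and $\delta$ below are arbitrary illustrative choices in (0,1), not constants from the paper.

```python
# Numerical check of the two limits in (B.24), with c(m|k) = d**(m**delta - k**delta).
import numpy as np

d, delta = 0.8, 0.5   # arbitrary illustrative values, 0 < d < 1, 0 < delta < 1

def terms(m):
    k = np.arange(1, m + 1, dtype=float)
    c = d ** (m**delta - k**delta)
    return m * np.sum(c / k**2), np.sum(c / k)

for m in (10, 100, 1000, 10_000, 100_000):
    left, middle = terms(m)
    print(f"m={m:6d}   m*sum c(m|k)/k^2 = {left:.3e}   sum c(m|k)/k = {middle:.3e}")
# Both columns drift toward zero (roughly like m**(-delta)), and the second
# quantity is indeed the smaller of the two, as claimed in the proof.
```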
Proof of Theorem 3.2: Using the same arguments as in the proof of Theorem 3.1, we may assume, without loss of generality, that $x^i(1) = 0$ and that $\gamma^i(n) = 1/n$, $\forall i, n$. (Otherwise we could define $\bar{s}^i(n) = n \gamma^i(n)\, s^i(n)$.) We still use c(n|k) to denote $d^{\, n^{\delta} - k^{\delta}}$. We define again b(n), $G^{i}(n)$, G(n) by (B.1), (B.5), (B.6), respectively, as in the proof of Theorem 3.1. Also, let
$$ \varphi(n|k) = \begin{cases} \dfrac{1}{n^2} + \dfrac{1}{n} \displaystyle\sum_{m=1}^{n-1} \dfrac{1}{m}\, c(n|m), & n = k, \\ \dfrac{1}{nk}\, c(n|k), & n > k, \\ 0, & n < k. \end{cases} $$  (B.27)
By replicating the steps leading to inequality (B.7) in the proof of Theorem 3.1 and using Lemma 2.1(iv) and (B.27) we obtain, for some A > 0,
$$ J(y(n+1)) \le J(y(n)) - \frac{1}{n}\, G(n) + A \sum_{k=1}^{n} \varphi(n|k)\, b^2(k), \qquad \forall n. $$  (B.28)
Taking expectations in (B.28), we have
$$ E[J(y(n+1))] \le E[J(y(n))] - \frac{1}{n}\, E[G(n)] + A \sum_{k=1}^{n} \varphi(n|k)\, E\big[ b^2(k) \big], \qquad \forall n, $$  (B.29)
and, using Assumption 3.4,
$$ E[J(y(n+1))] \le E[J(y(n))] - \frac{1}{n}\, E[G(n)] + A \sum_{k=1}^{n} \varphi(n|k)\, \big( E[G(k)] + 1 \big). $$  (B.30)
We then sum (B.30), for different values of n, to obtain
$$ 0 \le E[J(y(n+1))] \le E[J(y(1))] + A \sum_{m=1}^{n} \sum_{k=1}^{m} \varphi(m|k) + A \sum_{m=1}^{n} \sum_{k=1}^{m} \varphi(m|k)\, E[G(k)] - \sum_{m=1}^{n} \frac{1}{m}\, E[G(m)]. $$  (B.31)
The definition (B.27) and (B.23) imply that the middle term in the right hand side of (B.31) is bounded.
Moreover, using (B.27),
$$ m \sum_{k=m}^{\infty} \varphi(k|m) = \frac{1}{m} + \sum_{k=1}^{m-1} \frac{1}{k}\, c(m|k) + \sum_{k=m+1}^{\infty} \frac{1}{k}\, c(k|m), $$
which converges to zero, as $m \to \infty$, by (B.24) and (B.25). Therefore, for large enough m,
$$ A \sum_{k=m}^{\infty} \varphi(k|m) \le \frac{1}{2m}. $$
It follows that E[J(y(n))] is bounded. Inequality (B.31) and the above also imply that $\sum_{m=1}^{\infty} \frac{1}{m}\, E[G(m)] < \infty$ and part (c) of the Theorem follows, as in the proof of Theorem 3.1.
We now define
$$ Z(n) = A\, E\Big[ \sum_{k=1}^{n} \varphi(n|k)\, b^2(k) \,\Big|\, F_n \Big] $$
and note that
$$ \sum_{n=1}^{\infty} E[Z(n)] \le A \sum_{k=1}^{\infty} \sum_{n=k}^{\infty} \varphi(n|k)\, \big( E[G(k)] + 1 \big) \le A + A \sum_{k=1}^{\infty} \frac{1}{k}\, E[G(k)] < \infty. $$
Taking conditional expectations in (B.28) (with respect to $F_n$) and using Lemma B.1, we conclude that J(y(n)) converges, almost surely.
We now turn to the proof of part (b). Using (3.5) and (B.22), we have
$$ E\big[ \| s^i(n) \|^2 \big] \le A\, E\big[ J(x^i(n)) \big] + A. $$  (B.33)
Now, using (B.22) once more,
$$ J(x^i(n)) - J(y(n)) \le \| \nabla J(y(n)) \| \, \| x^i(n) - y(n) \| + A \| x^i(n) - y(n) \|^2 \le A\, J(y(n)) + A \| x^i(n) - y(n) \|^2. $$  (B.34)
Inequalities (B.33), (B.34) and the boundedness of E[J(y(n))] (which is a consequence of (B.31)) yield
$$ E\big[ b^2(n) \big] \le A + A\, E\big[ \| x^i(n) - y(n) \|^2 \big]. $$  (B.35)
Similarly with (B.2), we have
$$ \| y(n) - x^i(n) \| \le A \sum_{k=1}^{n-1} \frac{1}{k}\, c(n|k)\, b(k) $$  (B.36)
and, by the Cauchy–Schwarz inequality,
$$ \| y(n) - x^i(n) \|^2 \le A\, n \sum_{k=1}^{n-1} \frac{1}{k^2}\, c(n|k)\, b^2(k). $$  (B.37)
Therefore,
$$ E\big[ \| y(n) - x^i(n) \|^2 \big] \le w(n) \max_{1 \le k \le n} E\big[ b^2(k) \big], \qquad \text{where} \quad w(n) = A\, n \sum_{k=1}^{n} \frac{1}{k^2}\, c(n|k) $$  (B.38)
converges to zero, by (B.24). Using (B.35),
$$ E\big[ \| y(n) - x^i(n) \|^2 \big] \le A\, w(n) \Big( 1 + \max_{k < n} E\big[ \| y(k) - x^i(k) \|^2 \big] \Big), $$
and since w(n) converges to zero, it follows that $E[\| y(n) - x^i(n) \|^2]$ converges to zero as well. We also conclude from (B.35) that $\sup_n E[b^2(n)] < \infty$.
Let
$$ D_k = \sum_{k^{1/\delta} \le i < (k+1)^{1/\delta}} \frac{1}{i}\, b(i), \qquad k \ge 1. $$  (B.39)
Using the fact that there exists an A such that $(k+1)^{1/\delta} - k^{1/\delta} \le A\, k^{(1/\delta) - 1}$, $\forall k$, we obtain from (B.39)
$$ \sum_{k=1}^{\infty} E\big[ D_k^2 \big] \le \sum_{k=1}^{\infty} \frac{1}{k^{2/\delta}} \big[ (k+1)^{1/\delta} - k^{1/\delta} \big]^2 \, \sup_n E\big[ b^2(n) \big] \le A \sum_{k=1}^{\infty} \frac{1}{k^{2/\delta}}\, k^{(2/\delta) - 2} < \infty. $$
It follows that $D_k$ converges to zero, almost surely. Consequently,
$$ \sum_{k=1}^{n} d^{\, n-k}\, D_k $$  (B.40)
converges to zero as well.
Let us fix some n, let N denote the largest integer such that $N \le n^{\delta}$, and use (B.36) to obtain
$$ \| x^i(n) - y(n) \| \le A \sum_{k=1}^{n-1} \frac{1}{k}\, c(n|k)\, b(k) \le A \sum_{k=1}^{N} d^{\, N-k} \sum_{k^{1/\delta} \le j < (k+1)^{1/\delta}} \frac{1}{j}\, b(j) \le A \sum_{k=1}^{N} d^{\, N-k}\, D_k. $$
As n converges to infinity, so does N and, by the above discussion, $x^i(n) - y(n)$ converges to zero, as $n \to \infty$. Consequently, $x^i(n) - x^j(n)$ also converges to zero, for any i, j, completing the proof of part (b). Finally, since J(y(n)) converges and $x^i(n) - y(n)$ converges to zero, part (a) of the theorem follows, as in the proof of Theorem 3.1. □
Proof of Corollary 3.1: From part (c) of either Theorem 3.1 or 3.2 and (3.7) we obtain
$$ \sum_{i=1}^{M} \sum_{\ell=1}^{L} \sum_{n \in T^i_\ell} \gamma^i(n)\, g_\ell\big( x^i(n) \big) < \infty, \quad \text{a.s.} $$  (B.42)
Because of our assumption on the sets $T^i_\ell$, it follows that there exists a positive integer c such that, for any $i, \ell, m$, the interval $\{ cm+1, cm+2, \ldots, c(m+1) \}$ contains at least one element of $T^i_\ell$. Let us choose sequences of such elements, denoted by $t^i_{\ell,m}$. By (B.42), we have
$$ \sum_{i=1}^{M} \sum_{\ell=1}^{L} \sum_{m=1}^{\infty} \gamma^i\big( t^i_{\ell,m} \big)\, g_\ell\big( x^i( t^i_{\ell,m} ) \big) < \infty, \quad \text{a.s.} $$  (B.43)
Now notice that, for some constant $K_5 > 0$,
$$ \gamma^i\big( t^i_{\ell,m} \big) \ge \frac{K_5}{t^i_{\ell,m}} \ge \frac{K_5}{c(m+1)}, \qquad \forall i, \ell, m. $$  (B.44)
Hence, (B.43) yields
$$ \sum_{m=1}^{\infty} \frac{1}{m} \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell\big( x^i( t^i_{\ell,m} ) \big) < \infty, \quad \text{a.s.} $$  (B.45)
From either Theorem 3.1 or 3.2 and its proof we obtain
$$ \lim_{n \to \infty} \big( x^i(n) - y(n) \big) = \lim_{n \to \infty} \big( y(n+1) - y(n) \big) = 0, $$
which implies that
$$ \lim_{m \to \infty} \Big( x^i\big( t^i_{\ell,m} \big) - y\big( t^i_{\ell,m} \big) \Big) = 0, \qquad \forall i, \ell. $$  (B.46)
Since J has compact level sets and J(y(n)) converges, the sequence {y(n)} is bounded. We therefore need to consider the functions $g_\ell$ only on a compact set, on which they are uniformly continuous. Therefore,
$$ \lim_{m \to \infty} \Big( g_\ell\big( x^i( t^i_{\ell,m} ) \big) - g_\ell\big( y( t^i_{\ell,m} ) \big) \Big) = 0, \qquad \forall i, \ell. $$  (B.47)
By combining (B.45) and (B.47) we obtain
$$ \liminf_{m \to \infty} \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell\big( y( t^i_{\ell,m} ) \big) = 0. $$  (B.48)
a) By (B.48), there must be some subsequence of $\{ t^i_{\ell,m} \}$ along which $g_\ell\big( y( t^i_{\ell,m} ) \big)$ converges to zero. Let $y^*$ be a limit point of the corresponding subsequence of $\{ y( t^i_{\ell,m} ) \}$. By continuity, $g_\ell(y^*) = 0$ and, by assumption, $y^*$ must be a stationary point of J, so $\nabla J(y^*) = 0$. Moreover, $x^i\big( t^i_{\ell,m} \big)$ also converges to $y^*$ along the same subsequence. By continuity of $\nabla J$, (3.8) follows.
b) In this case, (B.43) implies
$$ \lim_{m \to \infty} \sum_{i=1}^{M} \sum_{\ell=1}^{L} g_\ell\big( x^i( t^i_{\ell,m} ) \big) = 0, \quad \text{a.s.,} $$  (B.49)
and the rest of the proof is the same as for part (a), except that we do not need to restrict ourselves to a convergent subsequence.
c) From part (a) we conclude that some subsequence of $\{ y( t^i_{\ell,m} ) \}$ converges to some $y^*$ for which $g_\ell(y^*) = 0$. Consequently, $y^*$ minimizes J. Using the continuity of J,
$$ \liminf_{n \to \infty} J(y(n)) \le \liminf_{m \to \infty} J\big( y( t^i_{\ell,m} ) \big) \le J(y^*) = \inf_{x \in H} J(x). $$
On the other hand, J(y(n)) converges (part (a) of either Theorem 3.1 or 3.2), which shows that (3.10) holds. □