January 1984 (revised November 1984)                                  LIDS-P-1425

DISTRIBUTED ASYNCHRONOUS DETERMINISTIC AND STOCHASTIC GRADIENT OPTIMIZATION ALGORITHMS†

by John N. Tsitsiklis*, Dimitri P. Bertsekas*, Michael Athans*

ABSTRACT

We present a model for asynchronous distributed computation and then analyze the convergence of natural asynchronous distributed versions of a large class of deterministic and stochastic gradient-like algorithms. We show that such algorithms retain the desirable convergence properties of their centralized counterparts, provided that the time between consecutive interprocessor communications and the communication delays are not too large.

† Research supported by Grants ONR/N00014-77-C-0532 and NSF-ECS-8217668.
* Laboratory for Information and Decision Systems, Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, MA 02139.

1. INTRODUCTION

Many deterministic and stochastic iterative algorithms admit a natural distributed implementation [1,4,5] whereby several processors perform computations and exchange messages with the common goal of minimizing a certain cost function. If all processors communicate their partial results to each other at every time instant and perform computations synchronously, the distributed algorithm is mathematically equivalent to a single-processor (serial) algorithm and its convergence may be studied by conventional means. Synchronous algorithms may have, however, certain drawbacks:
a) Synchronism may be hard to enforce, or its enforcement may introduce substantial overhead.
b) Communication delays may introduce bottlenecks to the speed of the algorithm (the time required for one stage of the algorithm will be constrained by the slowest communication channel).
c) Synchronous algorithms may require far more communications than are actually necessary.
d) Even if all processors are equally powerful, some will perform certain computations faster than others, due solely to the fact that they operate on different inputs. This, in turn, may leave many processors idle for a large proportion of the time.

For these reasons, we choose to study asynchronous distributed iterative optimization algorithms in which each processor does not need to communicate with every other processor at every time instant; processors may keep performing computations without having to wait for the messages that have been transmitted to them; processors are allowed to remain idle some of the time; finally, some processors may perform computations faster than others. Such algorithms can alleviate communication overloads and are not excessively slowed down either by communication delays or by differences in the time it takes processors to perform one computation. (A similar discussion of the merits of asynchronous algorithms is provided by H.T. Kung [10].)

The algorithms that we consider are gradient-like, meaning that the (expected) update of each processor is in a descent direction with respect to the cost function being minimized. There may be certain limitations to our approach, because we restrict attention to a special class of algorithms, whereas algorithms with a different structure might offer some advantages over the ones we propose. However, the problem of finding an "optimal" distributed algorithm, which minimizes, say, the amount of information exchanged between processors, or some other measure of communication and computation, can be very hard or intractable (NP-complete [15,20] or worse).
For this reason, we prefer to assume a fixed structure and then investigate the amount of asynchronism and the magnitude of the communication delays that may be tolerated without adversely affecting the convergence of the algorithm.

In Section 2 we present the model of distributed computation to be employed. In this model, a number of processors perform certain computations and update some of the components of a vector stored in their memory. In the meantime, they exchange messages, thus informing each other about the results of their latest computations. A processor that receives a message either uses it to directly update some of the components of the vector in its memory, or combines the message with the outcome of its own computations by forming a convex combination. Very weak assumptions are made about the relative timing and frequency of computations or message transmissions by the processors.

In Section 3 we employ this model of computation and also assume that the (possibly random) updates of each processor are expected to be in a descent direction, when conditioned on the past history of the algorithm. Our main results show that, under certain assumptions, asynchronous distributed algorithms have convergence properties similar to those of their centralized counterparts, provided that the time between consecutive interprocessor communications plus the communication delays are not too large. We distinguish two cases: a) constant step-size algorithms (e.g., deterministic gradient-type algorithms), in which the time between consecutive communications has to be bounded for convergence to be guaranteed, and b) decreasing step-size algorithms (e.g., stochastic approximation-type algorithms), for which convergence is proved even if the time between consecutive communications increases without bound as the algorithm proceeds. Sections 2 and 3 are developed in parallel with a variety of examples which are used to motivate and explain the formal assumptions being introduced. Finally, Section 4 suggests some extensions and areas of application, together with our conclusions. Appendix A contains the proof of Lemma 2.1; Appendix B contains the proofs of our main results.

2. A MODEL OF DISTRIBUTED COMPUTATION

We present here the model of distributed computation employed in this paper; we also define the notation and conventions to be followed. Related models of distributed computation have been used in [3,4,5,8], in which each processor specializes in updating a different component of some vector. The model developed here is more general, in that it allows different processors to update the same component of some vector. If their individual updates are different, their disagreement is (asymptotically) eliminated through a process of communicating and combining their individual updates. In such a case, we will say that there is overlap between processors. Overlapping processors are probably not very useful in the context of deterministic algorithms, unless redundancy is intended to provide a certain degree of fault tolerance and/or to safeguard against malfunction of a particular processor. For stochastic algorithms, however, overlap essentially amounts to having different processors obtain noisy measurements of the same unknown quantity, and it effectively increases the "signal-to-noise ratio."

Let H_1, H_2, ..., H_L be finite-dimensional real vector spaces, which we endow with the Euclidean norm, and let H = H_1 × H_2 × ... × H_L.
If x = (x_1, x_2, ..., x_L) ∈ H, we will refer to x_ℓ as the ℓ-th component of x. Let {1,...,M} be the set of processors that participate in the distributed computation. As a general rule concerning notation, we use subscripts to indicate a component of an element of H and superscripts to indicate an associated processor; time is indicated by an argument that follows.

The algorithms to be considered evolve in discrete time. Even if a distributed algorithm is asynchronous and communication delays are real (i.e., not integer) variables, the events of interest (an update by some processor, transmission or reception of a message) may be indexed by a discrete variable; so, the restriction to discrete time entails no loss of generality. It is important here to draw a distinction between "global" and "local" time. The time variable we have just referred to corresponds to a global clock. Such a global clock is needed only for analysis purposes: it is the clock of an analyst who watches the operation of the system. The processors, on the other hand, may be working without having access to a global clock. They may have access to a local clock or to no clock at all. We will see later that our results, based on the existence of a global clock, may be used in a straightforward way to prove convergence of algorithms implemented on the basis of local clocks only. (All of our results generalize to the case where each H_ℓ is a Banach space. No modifications whatsoever are needed in the assumptions or the proofs, except that matrices should now be called linear operators and that expressions like ∇J(x) should be interpreted as elements of the dual of H rather than as vectors in H.)

We assume that each processor has a buffer in its memory in which it keeps some element of H. The value stored by the i-th processor at time n is denoted by x^i(n). At time n, each processor may receive some exogenous measurements and/or perform some computations. This allows it to compute a "step" s^i(n) ∈ H, to be used in evaluating the new vector x^i(n+1). Besides their own measurements and computations, processors may also receive messages from other processors, which will be taken into account in evaluating their next vector. The process of communications is assumed to be as follows: at any time n, processor i may transmit some (possibly all) of the components of x^i(n) to some (possibly all or none) of the other processors. (In a physical implementation, messages do not need to go directly from their origin to their destination; they may go through some intermediate nodes. Of course, this does not change the mathematical model presented here.) We will assume that communication delays are bounded. (For algorithms with a decreasing step-size this assumption may be relaxed.) For convenience, we also assume that, for any pair (i,j) of processors, for any component x_ℓ and any time n, processor i may receive at most one message originating from processor j and containing an element of H_ℓ. This leads to no significant loss of generality: for example, a processor that receives two messages simultaneously could keep only the one which was most recently sent; if messages do not carry timestamps, there could be some other arbitration mechanism. Physically, of course, simultaneous receptions are impossible; so a processor may always identify and keep the most recently received message, even if all messages arrived at the same discrete time n. If a message from processor j, containing an element of H_ℓ, is received by processor i (i ≠ j) at time n, let t^{ij}_ℓ(n) denote the time that this message was sent.
Therefore, the content of such a message is precisely x^j_ℓ(t^{ij}_ℓ(n)). Naturally, we assume that t^{ij}_ℓ(n) ≤ n. For notational convenience, we also let t^{ii}_ℓ(n) = n, for all i, ℓ, n. We will be assuming that the algorithm starts at time 1; accordingly, we assume that t^{ij}_ℓ(n) ≥ 1. Finally, we denote by T^{ij}_ℓ the set of all times that processor i receives a message from processor j containing an element of H_ℓ. To simplify matters, we will assume that, for any i, j, ℓ, the set T^{ij}_ℓ is either empty or infinite.

Once processor i has received the messages arriving at time n and has also evaluated s^i(n), it evaluates its new vector x^i(n+1) ∈ H by forming (componentwise) a convex combination of its old vector and the values in the messages it has just received, as follows:

    x^i_ℓ(n+1) = Σ_{j=1}^M a^{ij}_ℓ(n) x^j_ℓ(t^{ij}_ℓ(n)) + γ^i(n) s^i_ℓ(n),    n ≥ 1,    (2.1)

where s^i_ℓ(n) is the ℓ-th component of s^i(n) and the coefficients a^{ij}_ℓ(n) are scalars satisfying:

    (i)   a^{ij}_ℓ(n) ≥ 0,    ∀ i, j, ℓ, n,    (2.2)
    (ii)  Σ_{j=1}^M a^{ij}_ℓ(n) = 1,    ∀ i, ℓ, n,    (2.3)
    (iii) a^{ij}_ℓ(n) = 0,    if n ∉ T^{ij}_ℓ, i ≠ j.    (2.4)

Remarks:
1. Note that t^{ij}_ℓ(n) has been defined only for those times n at which processor i receives a message of the corresponding type, i.e., for n ∈ T^{ij}_ℓ. However, whenever t^{ij}_ℓ(n) is undefined, we have assumed above that a^{ij}_ℓ(n) = 0, so that equation (2.1) has an unambiguous meaning.
2. When we refer to a processor performing a "computation," we mean the evaluation and addition of the term γ^i(n) s^i_ℓ(n) in (2.1). With this terminology, forming the convex combination in (2.1) is not called a computation. We denote by T^i_ℓ the set of all times that processor i performs a computation involving the ℓ-th component. Whenever n ∉ T^i_ℓ, it is understood that s^i_ℓ(n) in (2.1) equals zero. We assume again that, for any i, ℓ, the set T^i_ℓ is either infinite or empty. Accordingly, processor i will be called computing, or non-computing, for component ℓ.
3. The quantities γ^i(n) in (2.1) are nonnegative scalar step-sizes. These step-sizes may be constant (e.g., γ^i(n) = γ_0, ∀ n) or time-varying, e.g., γ^i(n) = 1/t^i_n, where t^i_n is the number of times that processor i has performed a computation up to time n. Notice that with such a choice each processor may evaluate its step-size using only a local counter rather than a global clock.
4. The envisaged sequence of events underlying (2.1) at any time n is as follows. For each component ℓ, processor i:
   (i)   transmits x^i_ℓ(n) to other processors and receives messages x^j_ℓ(t^{ij}_ℓ(n)) from those processors j for which n ∈ T^{ij}_ℓ;
   (ii)  computes s^i_ℓ(n) if n ∈ T^i_ℓ and sets s^i_ℓ(n) = 0 otherwise;
   (iii) evaluates x^i_ℓ(n+1) according to (2.1).
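As a purely illustrative aside (not part of the original development), one way to realize one pass of iteration (2.1) for a single component is sketched below in Python. The data layout and all names are our own assumptions: each processor is assumed to keep, for every other processor, the most recently received value of that component.

    import numpy as np

    def async_update(x, inbox, a, gamma, s):
        """One pass of iteration (2.1) for a single component.

        x[i]        : current value x^i_l(n) held by processor i
        inbox[i][j] : most recent value x^j_l(t^{ij}_l(n)) received by i from j,
                      or None if no message from j arrived at this time
        a[i][j]     : combining coefficients a^{ij}_l(n); each row is assumed to
                      sum to 1, with a[i][j] = 0 whenever inbox[i][j] is None, i != j
        gamma[i]    : step-size gamma^i(n)
        s[i]        : step s^i_l(n) (zero when processor i is not computing now)
        """
        M = len(x)
        x_new = np.zeros(M)
        for i in range(M):
            total = a[i][i] * x[i]                        # own (possibly outdated) value
            for j in range(M):
                if j != i and inbox[i][j] is not None:
                    total += a[i][j] * inbox[i][j]        # convex combination with messages
            x_new[i] = total + gamma[i] * s[i]            # add the local step
        return x_new

The sketch only makes the bookkeeping of (2.1) explicit; it does not prescribe when messages are sent or how the coefficients are chosen, which is exactly what the assumptions of this section leave free.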
Examples

We now introduce a collection of simple examples representing various classes of algorithms we are interested in, so as to illustrate the nature of the assumptions to be introduced later. We actually start with a broad classification and then proceed to more special cases. In these examples, we model the message receptions and transmissions (i.e., the sets T^{ij}_ℓ and the variables t^{ij}_ℓ(n)), the times at which computations are performed (i.e., the sets T^i_ℓ) and the combining coefficients a^{ij}_ℓ(n) as deterministic. (This does not mean, however, that they have to be known a priori by the processors.)

Specialization: This is the case considered by Bertsekas [4,5], where each processor updates a particular component of the x-vector specifically assigned to it and relies on messages from the other processors for the remaining components. Formally:
(i)   M = L. (There are as many processors as there are components.)
(ii)  s^i_ℓ(n) = 0, ∀ ℓ ≠ i, ∀ n. (A processor may update only its own component; T^i_ℓ = ∅ for ℓ ≠ i.)
(iii) Processor j only sends messages containing elements of H_j; if processor i receives such a message, it uses it to update x_j by setting x_j equal to the value received. Equivalently: (a) if i ≠ j and j ≠ ℓ, then T^{ij}_ℓ = ∅ and a^{ij}_ℓ(n) = 0, ∀ n; (b) if processor i receives a message from processor j at time n, i.e., if n ∈ T^{ij}_j, then a^{ij}_j(n) = 1; otherwise, a^{ij}_j(n) = 0 and a^{ii}_j(n) = 1.

Overlap: This is the other extreme, at which L = 1 (we do not distinguish components of elements of H), messages contain elements of H (not just components) and each processor may update any component of x. (For this case subscripts are redundant and will be omitted.)

We now let H be finite-dimensional and assume that J: H → [0,∞) is a continuously differentiable nonnegative cost function with a Lipschitz continuous derivative.

Example I: Deterministic Gradient Algorithm; Specialization. Let γ^i(n) = γ_0 > 0, ∀ n, i. At each time n ∈ T^i_i that processor i updates x_i, it computes s^i_i(n) = -(∂J/∂x_i)(x^i(n)), and lets s^i_j(n) = 0 for j ≠ i. We assume that each processor i communicates its component x_i to every other processor at least once every B_1 time units, for some constant B_1. Other than this restriction, we allow the transmission and reception times to be arbitrary. (A related stochastic algorithm could be obtained by letting s^i_i(n) = -(∂J/∂x_i)(x^i(n))(1 + w_i(n)), where w_i(n) is unit-variance white noise, independent for different i's.)

Example II: Newton's Method; Overlap. For simplicity we assume that there are only two processors (M = 2). Let γ^i(n) = γ_0 > 0, ∀ n. We also assume that J is twice continuously differentiable, strictly convex, and that its Hessian matrix, denoted by G(x), satisfies 0 < δ_1 I ≤ G(x) ≤ δ_2 I, ∀ x ∈ H. At each time n ∈ T^i, processor i computes s^i(n) = -G^{-1}(x^i(n)) ∇J(x^i(n)); for n ∉ T^i, s^i(n) = 0. If at time n processor 1 (respectively, 2) receives a message x^2(t^{12}(n)) (respectively, x^1(t^{21}(n))), it updates its vector by

    x^1(n+1) = a_{11} x^1(n) + a_{12} x^2(t^{12}(n)) + γ^1(n) s^1(n)
    (respectively, x^2(n+1) = a_{21} x^1(t^{21}(n)) + a_{22} x^2(n) + γ^2(n) s^2(n)).

Here we assume that 0 < a_{ij} < 1 and that a_{11} + a_{12} = a_{21} + a_{22} = 1. For other times n the same formula is used with a_{12} = 0 (respectively, a_{21} = 0). We make the same assumptions on transmission and reception times as in Example I.

Example III: Distributed Stochastic Approximation; Specialization. Let γ^i(n) be such that, for some positive constants A_1, A_2, A_1/n ≤ γ^i(n) ≤ A_2/n, ∀ n. Notice that the implementation of such a step-size only requires a local clock that runs on the same time scale (i.e., within a constant factor) as the global clock. For n ∈ T^i_i, let s^i_i(n) = -(∂J/∂x_i)(x^i(n)) + w^i(n); also, s^i_j(n) = 0, for j ≠ i and for all n. We assume that w^i(n), conditioned on the past history of the algorithm, has zero mean and that E[‖w^i(n)‖² | x^i(n)] ≤ K(‖∇J(x^i(n))‖² + 1), for some constant K. We assume that, for some B_1 > 0, β > 1 and for all n, each processor communicates its component x_i to every other processor at least once during the time interval [B_1 n^β, B_1 (n+1)^β]. Other than the above restriction, we allow transmission and reception times to be arbitrary. Notice that the above assumptions allow the time between consecutive communications to grow without bound.

Example IV: Distributed Stochastic Approximation; Overlap. Let γ^i(n) be as in Example III and let M = 2. For n ∈ T^i, let s^i(n) = -∇J(x^i(n)) + w^i(n), where w^i(n) is as in Example III. We make the same assumptions on transmission and reception times as in Example III. Whenever a message is received, a processor combines its vector with the content of that message using the combining rules of Example II.
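To make Examples I-IV a little more concrete, here is a toy simulation in the spirit of Example I; it is our own construction, with an arbitrary quadratic cost, a hypothetical broadcast period B_1 = 4, and a crude stand-in for bounded communication delays (each broadcast delivers the values the owners held at the previous broadcast time).

    import numpy as np

    # Hypothetical quadratic cost J(x) = 0.5 x'Qx, so grad J(x) = Qx (our choice).
    Q = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])
    L = M = 3                 # specialization: one processor per component
    gamma0 = 0.05             # constant step-size (must be small enough)
    B1 = 4                    # each component is broadcast every B1 steps
    x = [np.array([5.0, -3.0, 2.0]) for _ in range(M)]   # local copies x^i(n)
    stale = [xi.copy() for xi in x]                      # values known to the others

    for n in range(1, 201):
        grads = [Q @ x[i] for i in range(M)]
        for i in range(M):
            # processor i updates only its own component, using outdated
            # values of the other components
            x[i][i] -= gamma0 * grads[i][i]
        if n % B1 == 0:
            for i in range(M):
                for j in range(M):
                    if j != i:
                        x[i][j] = stale[j][j]   # value owner j had last broadcast
            stale = [xi.copy() for xi in x]

    # local costs should decrease toward 0 when gamma0 is small enough
    print([0.5 * xi @ Q @ xi for xi in x])

The details (cost, delay model, schedule) are illustrative only; the point is that each processor works with outdated copies of the components it does not own, which is exactly the situation the analysis below addresses.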
Example V: This example is rather academic but will serve to illustrate some of the ideas to be introduced later. Consider the case of overlap, assume that H is one-dimensional, and let γ^i(n) = 1, ∀ n. Assume that, at each time n, either all processors communicate to each other or no processor sends any message. Let the communication delays be zero (so t^{ij}(n) = n, whenever t^{ij}(n) is defined) and assume that a^{ij}(n) = a^{ij} (constant) at those times n that messages are exchanged. We define vectors x(n) = (x^1(n),...,x^M(n)) and s(n) = (s^1(n),...,s^M(n)). Then the algorithm (2.1) may be written as

    x(n+1) = A(n) x(n) + s(n).    (2.5)

For each time n, either A(n) = I (no communications) or A(n) = A, the matrix consisting of the coefficients a^{ij}. The latter is a "stochastic" matrix: it has nonnegative entries and each row sums to 1. We assume that each a^{ij} is positive. It follows that A^∞ = lim_{n→∞} A^n exists and has identical rows with positive elements. We assume that the time between consecutive communications is bounded but otherwise arbitrary. Clearly then, lim_{n→∞} Π_{m=k}^n A(m) = A^∞, for all k. This example corresponds to a set of processors who individually solve the same problem and, from time to time, simultaneously exchange their partial results. It is interesting to compare equation (2.5) with the generic equation x(n+1) = x(n) + γ(n) s(n) which arises in centralized algorithms.

Assumptions on the communications and the combining coefficients

We now consider a set of assumptions on the nature of the communication and combining process, under which the preceding examples appear as special cases. In formulating an appropriate set of assumptions, there is a trade-off between generality and ease of verification. The assumptions below are not the most general possible, but they are very easy to enforce. Some generalizations will be suggested later.

For each component ℓ ∈ {1,...,L} we introduce a directed graph G_ℓ = (V, E_ℓ) with nodes V = {1,...,M} corresponding to the set of processors. An edge (j,i) belongs to E_ℓ if and only if T^{ij}_ℓ is infinite, that is, if and only if processor j sends (in the long run) an infinite number of messages to processor i containing a value of the ℓ-th component x^j_ℓ.

Assumption 2.1: For each component ℓ ∈ {1,...,L}, the following hold:
a) There is at least one computing processor for component ℓ.
b) There is a directed path in G_ℓ from every computing processor (for component ℓ) to every other processor (computing or not).
c) There is some α > 0 such that:
   (i)   if processor i receives a message from processor j at time n (i.e., if n ∈ T^{ij}_ℓ), then a^{ij}_ℓ(n) ≥ α;
   (ii)  for every computing processor i, a^{ii}_ℓ(n) ≥ α, ∀ n;
   (iii) if processor i has in-degree (in G_ℓ) larger than or equal to 2, then a^{ii}_ℓ(n) ≥ α, ∀ n. (The in-degree of a processor (node) i in G_ℓ is the number of edges in E_ℓ pointing to node i.)

Let us pause to indicate the intuitive content of part (c) of Assumption 2.1. Part (i) states that a processor should not ignore the messages it receives. Part (ii) requires the past updates of any computing processor to have a lasting effect on its state of computation. Finally, part (iii) implies that if processor i receives messages from two processors (say i_1, i_2), it does not forget the effect of the messages of processor i_1 upon reception of a message from processor i_2. These conditions, together with part (b) of the Assumption, guarantee that any update by any computing processor has a lasting effect on the states of computation of all other processors.
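Part (b) of Assumption 2.1 is a plain reachability condition on the graph G_ℓ and is easy to check mechanically. The sketch below is our own illustration (the graph and the set of computing processors are made up); it verifies the condition with a breadth-first search from each computing processor.

    from collections import deque

    def satisfies_2_1b(num_procs, edges, computing):
        """Check Assumption 2.1(b) for one component: every computing processor
        must reach every other processor through the directed graph G_l.

        edges     : set of pairs (j, i), meaning j sends infinitely many
                    messages containing this component to i
        computing : set of processors that compute for this component
        """
        succ = {p: [] for p in range(num_procs)}
        for j, i in edges:
            succ[j].append(i)
        for c in computing:
            seen, queue = {c}, deque([c])
            while queue:
                u = queue.popleft()
                for v in succ[u]:
                    if v not in seen:
                        seen.add(v)
                        queue.append(v)
            if len(seen) < num_procs:
                return False
        return True

    # Hypothetical example: processor 0 computes and relays through 1 to 2.
    print(satisfies_2_1b(3, {(0, 1), (1, 2)}, {0}))   # True
    print(satisfies_2_1b(3, {(0, 1)}, {0}))           # False: 2 is unreachable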
Assumption 2.2: The time between consecutive transmissions of component x^j_ℓ from processor j to processor i is bounded by some B_1 > 0, for all (j,i) ∈ E_ℓ.

Assumption 2.3: There are constants B_1 > 0, β > 1 such that, for any (j,i) ∈ E_ℓ and for any n, at least one message containing x^j_ℓ is sent from processor j to processor i during the time interval [B_1 n^β, B_1 (n+1)^β]. Moreover, the total number of messages transmitted and/or received during any such interval is bounded.

Assumption 2.4: Communication delays are bounded by some B_0 > 0; i.e., for all i, j, ℓ and n ∈ T^{ij}_ℓ, we have n - t^{ij}_ℓ(n) ≤ B_0.

Note that Assumption 2.2 is a special case of Assumption 2.3, with β = 1. Assumption 2.1 holds for all the examples introduced above. Assumption 2.2 holds for Examples I, II, V; Assumption 2.3 holds for Examples III, IV, except for its last part, which has to be explicitly introduced.

Equation (2.1), which defines the structure of the algorithm, is a linear system driven by the steps s^i(n). In the special case where communication delays are zero, we have t^{ij}_ℓ(n) = n, and (2.1) becomes a linear system with state vector (x^1(n),...,x^M(n)). Equation (2.5) of Example V best illustrates this situation. In general, however, the presence of communication delays necessitates an augmented state, if a state space representation is desired. (Such an augmented state should incorporate all messages that have been transmitted but not yet received. Since we are assuming bounded communication delays, there can only be a bounded number of such messages and the augmented system may be chosen finite-dimensional.) Notice that (2.1) actually defines a decoupled set of linear systems, one for each component ℓ ∈ {1,...,L}. Exploiting linearity, we conclude that there exist scalars Φ^{ij}_ℓ(n|k), for n > k, such that

    x^i_ℓ(n) = Σ_{j=1}^M Φ^{ij}_ℓ(n|0) x^j_ℓ(1) + Σ_{k=1}^{n-1} Σ_{j=1}^M γ^j(k) Φ^{ij}_ℓ(n|k) s^j_ℓ(k).    (2.6)

The coefficients Φ^{ij}_ℓ(n|k) are determined by the sequence of transmission and reception times and by the combining coefficients. Consequently, they are unknown, in general. Nevertheless, they have the following qualitative properties.

Lemma 2.1:
(i) For all i, j, ℓ and n > k,

    0 ≤ Φ^{ij}_ℓ(n|k),    (2.7)
    Σ_{j=1}^M Φ^{ij}_ℓ(n|k) ≤ 1.    (2.8)

(ii) Under Assumptions 2.1, 2.4 and either Assumption 2.2 or 2.3, lim_{n→∞} Φ^{ij}_ℓ(n|k) exists, for any i, j, k, ℓ. The limit is independent of i and will be denoted by Φ^j_ℓ(k). Moreover, there is a constant η > 0 such that, if j is a computing processor for component ℓ, then

    Φ^j_ℓ(k) ≥ η,    ∀ k.    (2.9)

The constant η depends only on the constants introduced in our assumptions (i.e., B_0, B_1, β, α).
(iii) Under Assumptions 2.1, 2.2, 2.4, there exist d ∈ [0,1) and B > 0 (depending only on B_0, B_1, α) such that

    max_{i,j} |Φ^{ij}_ℓ(n|k) - Φ^j_ℓ(k)| ≤ B d^{n-k},    ∀ ℓ, n > k.    (2.10)

(iv) Under Assumptions 2.1, 2.3, 2.4, there exist d ∈ [0,1), δ ∈ (0,1] and B > 0 (depending only on B_0, B_1, β, α) such that

    max_{i,j} |Φ^{ij}_ℓ(n|k) - Φ^j_ℓ(k)| ≤ B d^{n^δ - k^δ},    ∀ ℓ, n > k.    (2.11)

Proof: See Appendix A.

In light of equation (2.6), Lemma 2.1 admits the following interpretation: part (ii) states that if all processors were to cease updating (that is, if they set s^i(n) = 0) from some time on, they would asymptotically converge to a common limit. Moreover, this common limit depends by a non-negligible factor on all past updates of all computing processors. Parts (iii) and (iv) quantify the natural relationship between the frequency of interprocessor communications and the speed at which agreement is reached.
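A quick numerical way to see parts (ii) and (iii) at work is to form, in the delay-free case, the products of row-stochastic matrices that generate the coefficients Φ^{ij}(n|k) and watch the rows agree geometrically. The sketch below is our own illustration, with an arbitrary 3-processor combining matrix and a schedule where messages are exchanged every third step (so A(n) = I otherwise), mimicking Example V.

    import numpy as np

    A = np.array([[0.6, 0.2, 0.2],     # arbitrary row-stochastic matrix,
                  [0.3, 0.5, 0.2],     # all entries positive
                  [0.25, 0.25, 0.5]])
    I = np.eye(3)

    P = I.copy()                       # running product, i.e. Phi(n|k) for a fixed k
    for n in range(1, 31):
        P = (A if n % 3 == 0 else I) @ P
        spread = np.max(np.abs(P - P.mean(axis=0)))   # disagreement between rows
        if n % 6 == 0:
            print(n, spread)           # decays roughly geometrically

Once the rows of the product become (nearly) identical, every processor weighs the past steps in (2.6) in (nearly) the same way, which is the content of the limits Φ^j_ℓ(k).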
For any pair (i,j) of processors we may also define a linear transformation Φ^{ij}(n|k): H → H by

    Φ^{ij}(n|k) x = (Φ^{ij}_1(n|k) x_1, ..., Φ^{ij}_L(n|k) x_L),    (2.12)

where x = (x_1,...,x_L). Note that if each H_ℓ is one-dimensional, then Φ^{ij}(n|k) may be represented by a diagonal matrix. In general, it corresponds to a block-diagonal matrix, each block being a constant multiple of the identity matrix of suitable dimension. Clearly, lim_{n→∞} Φ^{ij}(n|k) also exists, is independent of i, and will be denoted by Φ^j(k).

We can now define a vector y(n) ∈ H by

    y(n) = Σ_{j=1}^M Φ^j(0) x^j(1) + Σ_{k=1}^{n-1} Σ_{j=1}^M γ^j(k) Φ^j(k) s^j(k)    (2.13)

and note that y(n) is recursively generated by

    y(n+1) = y(n) + Σ_{j=1}^M γ^j(n) Φ^j(n) s^j(n).    (2.14)

The vector y(n) is the element of H at which all processors would asymptotically agree if they were to stop computing (but keep communicating and combining) at time n. It may be viewed as a concise global summary of the state of computation at time n, in contrast with the vectors x^i(n), which are the local states of computation; it allows us to look at the algorithm from two different levels: an aggregate and a more detailed one. We will see later that this vector y(n) is also a very convenient tool for proving convergence results. We have noted earlier that equation (2.1) is a linear system. However, it is a fairly complicated one, whereas the recursion (2.14) corresponds to a very simple linear system in standard state space form.

The content of the vector y(n) and of the Φ^j(n)'s is easiest to visualize in two special cases:

Specialization (e.g., Examples I and III): Here y(n) takes each component from the processor who specializes in that component. That is, y_ℓ(n) = x^ℓ_ℓ(n). Accordingly, Φ^j_ℓ(n) = 0, for ℓ ≠ j, and Φ^j_j(n) = 1.

Example V: Here Φ^{ij}(n|k) is the ij-th entry of the matrix Π_{m=k+1}^{n-1} A(m). It follows that the limit Φ^j(k) is the ij-th entry of A^∞, which by our assumptions depends only on j. Moreover, y(n) equals any component of A^∞ x(n). (All components are equal, by our assumptions.) If we multiply both sides of (2.5) by A^∞ and note that A^∞ A(n) = A^∞, we obtain y(n+1) = y(n) + Σ_{j=1}^M A^∞_{1j} s^j(n), which is precisely (2.14).

The model of computation introduced in this section may be generalized in several directions [7,19]. To name a few examples, the updating rules of each processor need not have the linear structure of equation (2.1); also, it may be convenient to communicate other information (e.g., derivatives of the cost function) and not just the values of components of x. However, the present model is sufficient for the class of algorithms under consideration. Let us also mention that, while all of the above examples refer to either specialization or overlap, we may think of intermediate situations in which some of the components are updated by a single processor while some of the components are updated by all processors simultaneously (partial overlap). Finally, Assumptions 2.1-2.4 are unnecessary, as long as the conclusions of Lemma 2.1 may be somehow independently verified.

3. CONVERGENCE RESULTS

There is a large number of well-known centralized deterministic and stochastic optimization algorithms, which have been analyzed using a variety of analytical tools [2,11,12,16]. A large class of them, the so-called "pseudo-gradient" algorithms [16], have the distinguishing feature that the (expected) direction of update (conditioned upon the past history of the algorithm) is a descent direction with respect to the cost function to be minimized.
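For instance (a sketch of our own, not taken from the paper), a noisy gradient step s(n) = -∇J(x(n)) + w(n) with zero-mean noise is pseudo-gradient, since E[s(n) | F_n] = -∇J(x(n)) and hence ∇J(x(n))·E[s(n) | F_n] ≤ 0. The toy cost and all names below are ours; the snippet simply estimates that inner product by averaging over noise draws.

    import numpy as np

    rng = np.random.default_rng(0)
    grad_J = lambda x: 2.0 * x          # gradient of the toy cost J(x) = ||x||^2

    x = np.array([1.0, -2.0, 0.5])
    g = grad_J(x)
    # Monte Carlo estimate of grad_J(x)' E[s | F_n] for s = -grad_J(x) + w
    samples = [-g + rng.normal(size=3) for _ in range(20000)]
    print(g @ np.mean(samples, axis=0))   # close to -||grad_J(x)||^2 <= 0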
The examples of Section 2 certainly have such a property. Reference [16] presents a larger list of examples. It is also shown there that the development of results for pseudo-gradient algorithms leads easily to results for broader classes of algorithms, such as Kiefer-Wolfowitz stochastic approximation. In this section we present convergence results for the natural distributed asynchronous versions of pseudo-gradient algorithms. We adopt the model of computation and the corresponding notation of Section 2.

We allow the initialization {x^1(1),...,x^M(1)} of the algorithm to be random, with finite mean and variance. We also allow the updates s^i(n) of each processor to be random. On the other hand, we assume that γ^i(n) is deterministic; we also model the combining coefficients a^{ij}_ℓ(n) and the sequence of transmission and reception times as being deterministic. This is not a serious restriction, because they do not need to be known by the processors in advance in order to carry out the algorithm. We assume that all random variables of interest are defined on a probability space (Ω, F, P). We introduce {F_n}, an increasing sequence of σ-fields contained in F and describing the history of the algorithm up to time n. In particular, F_n is defined as the smallest σ-field such that s^i(k), k ≤ n-1, and x^i(1), i ∈ {1,...,M}, are F_n-measurable.

We assume that the objective of the algorithm is to minimize a nonnegative cost function J: H → [0,∞). For the time being, we only assume that J is a smooth function. In particular, J is allowed to have several local minima.

Assumption 3.1: J is continuously differentiable and its derivative satisfies the Lipschitz condition

    ‖∇J(x) - ∇J(x')‖ ≤ K ‖x - x'‖,    ∀ x, x' ∈ H,    (3.1)

where K is some nonnegative constant.

Assumption 3.2: The updates s^i(n) of each processor satisfy

    (∂J/∂x_ℓ)(x^i(n)) E[s^i_ℓ(n) | F_n] ≤ 0,    a.s., ∀ i, ℓ, n.    (3.2)

(In (3.2), if H_ℓ has dimension larger than 1, ∂J/∂x_ℓ should be interpreted as a row vector. In general, the appropriate interpretation should be clear from the context.)

This assumption states that each component of each processor's update is in a descent direction, when conditioned on the past history of the algorithm, and it is satisfied by Examples I-IV. A slightly weaker version, under which our results remain true, would be (∂J/∂x_ℓ)(x^i(n)) Φ^i_ℓ(n) E[s^i_ℓ(n) | F_n] ≤ 0, a.s., ∀ i, ℓ, n. On the other hand, it can be shown that the condition ∇J(x^i(n)) E[s^i(n) | F_n] ≤ 0, a.s., is not sufficient for proving convergence.

The next assumption is easily seen to hold for Examples I and II. For stochastic algorithms, it requires that the variance of the updates (and hence of any noise contained in them) go to zero as the gradient of the cost function goes to zero.

Assumption 3.3: For some K_0 > 0 and for all i, ℓ, n,

    E[‖s^i_ℓ(n)‖²] ≤ -K_0 E[(∂J/∂x_ℓ)(x^i(n)) s^i_ℓ(n)].

As a matter of verifying Assumption 3.3, one would typically check the validity of the slightly stronger condition

    E[‖s^i_ℓ(n)‖² | F_n] ≤ -K_0 (∂J/∂x_ℓ)(x^i(n)) E[s^i_ℓ(n) | F_n].

Also, using Lemma 2.1(ii) and Assumption 3.3, we obtain

    E[‖s^i_ℓ(n)‖²] ≤ -K_0 E[(∂J/∂x_ℓ)(x^i(n)) s^i_ℓ(n)] ≤ -K̄_0 E[(∂J/∂x_ℓ)(x^i(n)) Φ^i_ℓ(n) s^i_ℓ(n)],    (3.3)

where K̄_0 = K_0/η > 0. It turns out that (3.3) is all we need for our results to hold.

Our first convergence result states that the algorithm converges in a suitable sense, provided that the step-size employed by each processor is small enough and that the time between consecutive communications is bounded; it applies to Examples I and II.
It should be noted, however, that Theorem 3.1 (as well as Theorem 3.2 below) does not yet prove convergence to a minimum or a stationary point of J. In particular, there is nothing in our assumptions that prohibits having s^i(n) = 0, ∀ i, n. Optimality is obtained later, using a few auxiliary and fairly natural assumptions (see Corollary 3.1).

Theorem 3.1: Let Assumptions 2.1, 2.2, 2.4, 3.1, 3.2, 3.3 hold. Suppose also that γ^i(n) ≥ 0 and that sup_{i,n} γ^i(n) = γ_0 < ∞. There exists a constant γ* > 0 (depending on the constants introduced in the Assumptions) such that the inequality 0 < γ_0 < γ* implies:
a) J(x^i(n)), i = 1,2,...,M, as well as J(y(n)), converge almost surely, and to the same limit.
b) lim_{n→∞} (x^i(n) - x^j(n)) = lim_{n→∞} (x^i(n) - y(n)) = 0, ∀ i, j, almost surely and in the mean square.
c) The expression

    Σ_{n=1}^∞ Σ_{i=1}^M γ^i(n) ∇J(x^i(n)) E[s^i(n) | F_n]    (3.4)

is finite, almost surely. Its expectation is also finite.

Proof: See Appendix B.

Leaving technical issues aside, the idea behind the proof of this (and the next) theorem is rather simple: the difference between y(n) and x^i(n), for any i, is of the order of B γ_0, where B is proportional to a bound on the communication delays plus the time between consecutive interprocessor communications. Therefore, as long as γ_0 remains small, ∇J(x^i(n)) is approximately equal to ∇J(y(n)); hence s^i(n) (and consequently Φ^i(n) s^i(n)) is approximately in a descent direction, starting from the point y(n). Therefore, iteration (2.14) is approximately the same as a centralized descent (pseudo-gradient) algorithm, which is, in general, convergent [16].

Decreasing Step-Size Algorithms

We now introduce an alternative set of assumptions. We allow the magnitude of the updates s^i(n) to remain nonzero even if ∇J(x^i(n)) is zero (Examples III and IV). Such situations are common in stochastic approximation algorithms or in system identification applications. Since the noise is persistent, the algorithm can be made convergent only by letting the step-size γ^i(n) decrease to zero. The choice γ^i(n) = 1/n is most commonly used in centralized algorithms, and in the sequel we will assume that γ^i(n) behaves like 1/n. Since the step-size is decreasing, the algorithm becomes progressively slower as n → ∞. This allows us to let the communication process become progressively slower as well, provided that it remains fast enough when compared with the natural time scale of the algorithm, the latter being determined by the rate of decrease of the step-size. Such a possibility is captured by Assumption 2.3.

The next assumption, intended to replace Assumption 3.3, allows the noise to be persistent. It holds for Examples I-IV. As in Assumption 3.3, inequality (3.5) could be more naturally stated in terms of conditional expectations, but such a stronger version turns out to be unnecessary.

Assumption 3.4: For some K_1, K_2 > 0 and for all i, ℓ, n,

    E[‖s^i_ℓ(n)‖²] ≤ -K_1 E[(∂J/∂x_ℓ)(x^i(n)) s^i_ℓ(n)] + K_2.    (3.5)

Theorem 3.2: Let Assumptions 2.1, 2.3, 2.4, 3.1, 3.2, 3.4 hold and assume that, for some K_3 > 0, γ^i(n) ≤ K_3/n, ∀ n, i. Then conclusions (a), (b), (c) of Theorem 3.1 remain valid.

Proof: See Appendix B.

Theorem 3.2 remains valid if (3.5) is replaced by the much weaker assumption

    E[‖s^i(n)‖²] ≤ K_0 E[J(x^i(n))] - K_1 E[∇J(x^i(n)) Φ^i(n) s^i(n)] + K_2.    (3.6)

The proof may be found in [19] and is significantly more complicated. Theorems 3.1 and 3.2 also remain valid even if Assumptions 3.2, 3.3 or 3.4 hold only after some finite time n_0, but are violated earlier. We only need to assume that s^i(k) has finite mean and variance, for k < n_0.
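As a rough illustration of the decreasing step-size case, the following toy simulation (our own construction, in the spirit of Example III; the cost, noise level and communication schedule are made up) has two processors take noisy gradient steps on their own components, with step-sizes computed from purely local counters, while exchanging components only at the increasingly sparse times B_1 k^β.

    import numpy as np

    rng = np.random.default_rng(1)
    grad = lambda x: np.array([2.0 * x[0] + x[1], x[0] + 4.0 * x[1]])  # toy convex J

    M = 2
    x = [np.array([4.0, -3.0]) for _ in range(M)]   # local copies x^i(n)
    counts = [0, 0]                                  # local computation counters t^i_n
    B1, beta = 2.0, 1.5
    comm_times = {int(B1 * k**beta) for k in range(1, 200)}  # sparser and sparser

    for n in range(1, 5001):
        for i in range(M):
            counts[i] += 1
            g = grad(x[i])[i] + rng.normal()         # noisy partial derivative
            x[i][i] -= (1.0 / counts[i]) * g         # step-size from a local counter
        if n in comm_times:
            x[0][1], x[1][0] = x[1][1], x[0][0]      # exchange own components
    print(x)   # both local vectors should end up near the minimizer (0, 0)

The exchange here is instantaneous and simultaneous, which is only a simplification; the theorem tolerates bounded delays and asymmetric schedules, as long as Assumption 2.3 holds.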
We continue with a corollary which shows that, under reasonable conditions, convergence to a stationary point or to a global optimum may be guaranteed. We only need to assume that, away from stationary points, some processor will make a positive improvement in the cost function. Naturally, we only require the processors to make positive improvements at times that they are not idle.

Corollary 3.1: Suppose that, for some K_4 > 0, γ^i(n) ≥ K_4/n, ∀ n, i. Assume that J has compact level sets and that there exist continuous functions g^i_ℓ: H → [0,∞) such that

    (∂J/∂x_ℓ)(x^i(n)) E[s^i_ℓ(n) | F_n] ≤ -g^i_ℓ(x^i(n)),    ∀ n ∈ T^i_ℓ.    (3.7)

We define g: H → [0,∞) by g(x) = Σ_{i=1}^M Σ_{ℓ=1}^L g^i_ℓ(x) and we assume that any point x ∈ H satisfying g(x) = 0 is a stationary point of J. Finally, suppose that the difference between consecutive elements of T^i_ℓ is bounded, for any i, ℓ such that T^i_ℓ ≠ ∅. Then:
a) Under the Assumptions of either Theorem 3.1 or 3.2,

    liminf_{n→∞} ‖∇J(x^i(n))‖ = 0,    ∀ i, a.s.    (3.8)

b) Under the Assumptions of Theorem 3.1 and if (for some ε > 0) γ^i(n) ≥ ε, ∀ i, n, we have

    lim_{n→∞} ‖∇J(x^i(n))‖ = 0,    ∀ i, a.s.,    (3.9)

and any limit point of {x^i(n)} is a stationary point of J.
c) Under the Assumptions of either Theorem 3.1 or 3.2, and if every point satisfying g(x) = 0 is a minimizing point of J (this implicitly assumes that all stationary points of J are minima), then

    lim_{n→∞} J(x^i(n)) = inf_{x∈H} J(x).    (3.10)

Proof: See Appendix B.

We now discuss the above corollary and apply it to our examples. Notice first that the assumption on T^i_ℓ states that, for each component ℓ, the time between successive computations of s^i_ℓ is bounded, for any computing processor i for that component. Such a condition will always be met in practice. The assumption γ^i(n) ≥ K_4/n may be enforced without the processor having access to a global clock. For example, apart from the trivial case of constant step-size, we may let γ^i(n) = 1/t^i_n, where t^i_n is the number of times, before time n, that processor i has performed a computation. For Examples I and III, (3.7) holds with g^i_i(x) a constant multiple of (∂J/∂x_i)²(x); for Examples II and IV, it holds with g^i(x) a constant multiple of ‖∇J(x)‖². We may conclude that Corollary 3.1 applies and proves convergence for Examples I-IV.

We close this section by pointing out that our results remain valid if we model the combining coefficients and the transmission and reception times as random variables defined on the same probability space (Ω, F, P), subject to some restrictions. Notice that such a generalization allows the processors to decide when and where to transmit based on information related to the progress of the algorithm. Hence, for stochastic algorithms, we have to avoid the possibility that a processor judiciously chooses to transmit at those times that it receives a "bad measurement" s^i(n), or somehow ensure that the long-run outcome is independent of such decisions. It turns out that one only needs to assume that, for any i ∈ {1,...,M} and any m ≤ n, the random variables Φ^i(m) and s^i(n) are conditionally independent given the history of the algorithm up to time n [19]. This assumption holds if communications and the combining coefficients are independent of the noise in the computations (this is true, in particular, for deterministic algorithms), as well as for the specialization case, because Φ^i(n) turns out to be deterministic and a priori known, even if the communication times are random.
4. EXTENSIONS, APPLICATIONS AND CONCLUSIONS

A main direction along which our results may be extended is in analyzing the convergence of distributed algorithms with decreasing step-size and correlated noise, for which the pseudo-gradient assumption fails to hold. Such algorithms arise frequently, for example, in system identification. Very few global convergence results are available, even for the centralized case [17]. However, as in the centralized case, an ordinary differential equation (ODE) may be associated with such algorithms, which may be used to prove local convergence, subject to the assumption that the algorithm returns infinitely often to a bounded region [11,12,19].

Another issue, arising in the case of constant step-size algorithms, concerns the choice of a step-size which will guarantee convergence. We may trace the steps in the proof of Theorem 3.1 and find some bounds on γ_0 so as to ensure convergence, but these bounds will not be particularly tight. For a version of a distributed deterministic gradient algorithm, tighter bounds have been obtained in [19], which quantify the notion that the frequency of communications between different processors should in some sense reflect the degree of coupling inherent in the optimization problem.

It should be clear that the bounds on the time between successive communications and on the delays are imposed so as to guarantee that processors are informed, without excessive delays, about changes in the state of computation of other processors. The same effect, however, could be accomplished by having each processor monitor its state and inform the others only if a substantial change occurs. It seems that such an approach could lead to some savings in the number of messages being exchanged. However, more research is needed on this issue and, in particular, the notion of a "substantial change" in the state of computation of a processor needs to be made more precise.

A final issue of interest is the rate of convergence of distributed algorithms. With perfect synchronization, the rate of convergence is the same as for their centralized counterparts. Asynchronism should be expected to bring about some deterioration. We may then ask how much asynchronism may be tolerated before such deterioration becomes significant. For decreasing step-size stochastic algorithms, as long as communications do not become too infrequent, it is safe to conjecture that the convergence rate will not deteriorate at all; still, there would be some deterioration in the transient performance of such algorithms, which is also worth investigating.

Concerning possible applications, there are three broad areas that come to mind. There is first the area of parallel computation, where an asynchronous algorithm could avoid several types of bottlenecks [10]. Then, there is the area of data communication networks, in which there has been much interest in distributed algorithms for routing and flow control [6,9]. Such algorithms resemble the ones that we have analyzed, and our methods of analysis, with a few modifications, are applicable [18]. Finally, certain common algorithms for system identification or adaptive filtering fall into the framework of decreasing step-size stochastic algorithms, and our approach may be used for analyzing the convergence of their distributed versions [19]. Our results may not be applicable, without any modifications or refinements, to such diverse application areas.
Nevertheless, our analysis indicates what kind of results should be expected to hold and provides tools for proving them.

In conclusion, the natural distributed asynchronous versions of a large class of deterministic and stochastic optimization algorithms retain the desirable convergence properties of their centralized (or synchronous) counterparts, under mild assumptions on the timing of communications and computations, and even in the presence of communication delays. It appears that there are many promising applications of such algorithms in fields as diverse as parallel computation, data communication networks and distributed signal processing.

REFERENCES

1. Arrow, K.J. and L. Hurwicz, "Decentralization and Computation in Resource Allocation," in Essays in Economics and Econometrics, R.W. Pfouts (ed.), Univ. of North Carolina Press, Chapel Hill, NC, 1960, pp. 34-104.
2. Avriel, M., Nonlinear Programming, Prentice-Hall, Englewood Cliffs, NJ, 1976.
3. Baudet, G.M., "Asynchronous Iterative Methods for Multiprocessors," Journal of the ACM, Vol. 25, No. 2, 1978, pp. 226-244.
4. Bertsekas, D.P., "Distributed Dynamic Programming," IEEE Transactions on Automatic Control, Vol. AC-27, No. 3, 1982, pp. 610-616.
5. Bertsekas, D.P., "Distributed Asynchronous Computation of Fixed Points," Mathematical Programming, Vol. 27, 1983, pp. 107-120.
6. Bertsekas, D.P., "Optimal Routing and Flow Control Methods for Communication Networks," in Analysis and Optimization of Systems, A. Bensoussan and J.L. Lions (eds.), Springer-Verlag, Berlin, 1982, pp. 615-643.
7. Bertsekas, D.P., J.N. Tsitsiklis and M. Athans, "Convergence Theories of Distributed Iterative Processes: A Survey," submitted to SIAM Review, 1984.
8. Chazan, D. and W. Miranker, "Chaotic Relaxation," Linear Algebra and its Applications, Vol. 2, 1969, pp. 199-222.
9. Gallager, R.G., "A Minimum Delay Routing Algorithm Using Distributed Computation," IEEE Transactions on Communications, Vol. COM-25, No. 1, 1977, pp. 73-85.
10. Kung, H.T., "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," in Algorithms and Complexity, Academic Press, 1976, pp. 153-200.
11. Kushner, H.J. and D.S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Applied Math. Series, No. 26, Springer-Verlag, Berlin, 1978.
12. Ljung, L., "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control, Vol. AC-22, No. 4, 1977, pp. 551-575.
13. Ljung, L. and T. Soderstrom, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.
14. Meyer, P.A., Probability and Potentials, Blaisdell, Waltham, MA, 1966.
15. Papadimitriou, C.H. and J.N. Tsitsiklis, "On the Complexity of Designing Distributed Protocols," Information and Control, Vol. 53, No. 3, June 1982, pp. 211-218.
16. Poljak, B.T. and Y.Z. Tsypkin, "Pseudogradient Adaptation and Training Algorithms," Automation and Remote Control, No. 3, 1973, pp. 45-68.
17. Solo, V., "The Convergence of AML," IEEE Transactions on Automatic Control, Vol. AC-24, 1979, pp. 958-962.
18. Tsitsiklis, J.N. and D.P. Bertsekas, "Distributed Asynchronous Optimal Routing for Data Networks," Proceedings of the 23rd IEEE Conference on Decision and Control, Las Vegas, Nevada, December 1984.
19. Tsitsiklis, J.N., "Problems in Decentralized Decision Making and Computation," Ph.D. Thesis, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, MA, 1984.
20. Yao, A.C., "Some Complexity Questions Related to Distributed Computing," Proceedings of the 11th ACM Symposium on Theory of Computing, 1979, pp. 209-213.
APPENDIX A

Proof of Lemma 2.1: We only need to prove the Lemma for each component separately, since (2.1) corresponds to a decoupled set of linear systems. We may therefore assume that there is only one component; the subscript ℓ will be omitted and Φ^{ij}(n|k) becomes a scalar. The coefficient Φ^{ij}(n|k), being the impulse response of the linear system (2.1), is determined by the following "experiment": let us fix a processor j and some time k; let x^h(1) = 0, ∀ h, and s^h(m) = 0, ∀ h, m, unless h = j and m = k, in which case we let γ^j(k) s^j(k) = v, some nonzero element of H. For all times n and all processors i, x^i(n) will then be a scalar multiple of v, and this proportionality factor is precisely Φ^{ij}(n|k). Since this experiment takes place in a one-dimensional subspace of H, we may assume, without loss of generality, that H is one-dimensional and that v = 1. We then have, for the above "experiment", Φ^{ij}(n|k) = x^i(n), ∀ i, n. A trivial induction based on (2.1) and (2.2) shows that x^i(n) ≥ 0, ∀ i, n, which proves (2.7). For inequality (2.8) we consider a different "experiment": let us fix some time k and let x^h(1) = 0, ∀ h, and s^h(m) = 0, ∀ m, h, unless m = k, in which case we let γ^h(k) s^h(k) = 1, ∀ h (H is still assumed one-dimensional). We then use (2.1) and (2.3) to show, by a simple inductive argument, that x^i(n) ≤ 1, ∀ i, n, which concludes the proof of part (i).

For the proof of the remaining parts of the Lemma, we first perform a reduction to a simpler case. It is relatively easy to see that we may assume, without loss of generality, that communication delays are nonzero. For example, we could redefine the time variable so that one time unit of the old time variable corresponds to two time units of the new one, and so that any message that had zero delay under the original description has unit delay under the new one. If any of Assumptions 2.2, 2.3, 2.4 holds in terms of the old time variable, it also remains true with the new one.

Next, we perform a reduction to the case where all messages have zero delay, as follows: for any (i,j) ∈ E, we introduce a finite set of dummy processors between i and j (see Figure A.1), which act as buffers. Any message from i to j is first transmitted (with zero delay) to a buffer processor, which holds it for a time equal to the desired delay and then transmits it to j, again with zero delay. (So, whenever a buffer processor receives a message, it sets the combining coefficient associated with that message equal to 1.) The buffer processor to be employed for any particular message is determined by a round-robin rule. Note that for each pair (i,j) ∈ E the number of buffer processors that needs to be introduced is equal to the maximum communication delay for messages from i to j, which has been assumed finite. (This procedure is equivalent to a state augmentation for the linear system (2.1).) Note also that the buffer processors are non-computing ones and have in-degree equal to 1. It is easy to see that if Assumptions 2.2, 2.3 are valid for the original description of the algorithm, they are also valid for the augmented description introduced above, possibly with a different choice of constants. Assumption 2.1 also remains valid, except for part c(iii), which may be violated. (If, in Figure A.1, processor j had originally in-degree equal to 1, it now has in-degree equal to 4, but there is nothing in our assumptions that guarantees that a^{jj}(n) ≥ α, ∀ n.) We have, nevertheless, the following condition:
Figure A.1: Reduction to the zero delay case.

Assumption A.1: For any processor i with in-degree larger than 1, either
1. Assumption 2.1 c(iii) holds, or
2. all predecessors of i are non-computing, have in-degree 1, and have a common predecessor.

From now on, we assume, without loss of generality, that communication delays are zero, provided that we replace Assumption 2.1 c(iii) by Assumption A.1. Let A(n) and Φ(n|k) be the matrices with entries a^{ij}(n) and Φ^{ij}(n|k), respectively. Because of the zero delay assumption, it is easy to see that

    Φ(n|k) = Π_{m=k+1}^{n-1} A(m),    n ≥ k+1.

(Since each A(n) is a "stochastic" matrix (nonnegative entries, each row summing to 1), questions about the convergence of Φ(n|k) are equivalent to questions about the long-run behavior of a finite, time-varying Markov chain.)

We first prove the desired results under Assumptions 2.1 and 2.2. By combining Assumptions 2.1 c(i) and 2.2, note that there exists a constant B̄ such that, for any (i,j) ∈ E and any time interval I of length B̄, there exists some t ∈ I such that a^{ji}(t) ≥ α > 0. (These are times at which i communicates to j.) Let us fix a computing processor j and some time k. By relabelling, let us assume that j = 1. We will show by induction (with respect to a particular numbering of the processors) that, for any processor i, there exists some α_i > 0, independent of k, such that Φ^{i1}(n|k) ≥ α_i, for all n ∈ [k+(i-1)B̄+1, k+MB̄]. To start the induction, consider first the case i = 1 and notice that

    Φ^{11}(n|k) ≥ Π_{t=k+1}^{n-1} a^{11}(t) ≥ α^{n-k-1} ≥ α^{MB̄} ≡ α_1 > 0,    ∀ n ∈ [k+1, k+MB̄].
Indeed, hi (n k)> min (nlk)> heUUt{i}l > mintaC, rain {ah }} = a heU This completes our inductive argument. 'We may now conclude that O(k+MBtk) is a stochastic matrix with the property that all entries in some column (corresponding to any computing processor) are positive and bounded away from zero by a constant ca>0 which does not depend on k. We combine this fact with Lemma A.1 below to conclude that the assertions of Lenmma 2.1 are true. Lemma A.l: Consider a sequence {D } of nonnegative matrices with the properties that: (i) Each row sums to 1. (ii) For some a>0 and for some column (say the first one), all entries of Dn, for any n, in that column are larger or equal than a. --------....-- -- ----- --- -36- Then, a) b) D = im n-n n 1 Dk exists. k=l All rows of D are identical. c) The entry in the first column of D is bounded below by a. d) Convergence to D takes place at the rate of a geometric progression. Given any vector Proof of Lemma A.1: x=(xl,...,xM) we decompose it as x=y+ce, where c is a scalar, eis the vector with all entries equal to 1 and y has one zero entry and all other entries are nonnegative. (So c equals the minimum of the components of x. n Let x(n) j|y(n+l) (I_ < f= Dkx(O), Vn k=l (1-CL) y(n) j|, and x(n)=y(n) c(n)e. where |*j|*| denotes the max-norm on shows that y(n) converges geometrically to zero. I y(n) jl, It is easy to see that R , which Moreover, c(n)< c(n+l)< c(n) + which shows that c(n) also converges geometrically to some c. x(n) converges geometrically to ce. Hence, Since this is true for any x(O)e R , parts (a),(b),(d) ;of the Lemma follow. Part (c) is proved by an easy induction, for the finite products of the Dk's and, therefore, it holds for D as well. C We now consider the case where Assumption 2.2. is replaced by 2.3. The key observation here is that during an interval of the form [B n , B (n+l)B] a bounded number of messages is transmitted; hence, A(k)=I, except for a bounded number of times in that interval. If we redefine the time variable so that time is incremented only at communication times, we have reduced the problem to the case of Assumption 2.2. The only difference, due to the change of the time variable, -37-. Under Assumption is in the rate of convergence. 2.2, 1l|(n1k)-(k) || This implies that, by a constant factor during intervals of constant length. 2.3, j|D(njk)-D(k)jj decreases by a constant factor during under Assumption intervals of the form if 3Bl (t+c) 1] for some appropriate constant c. [Bt we have | B.t"=k and Bl(t+m) =n, 'I t9(nfk)-9~k) [< Em'n/c , for some B> !< Bd (nlk)-(k) B Eliminating t and solving for m, we obtain .(Bv ) 4m . ( )- - yields which finallT 1-* cB 1/ Let 6=1/a and decreases d=d o c , to recover the desired result. m Therefore, d e3i) -38- APPENDIX B This Appendix contains the proofs of the results of Section 3. Remark on Notation: In the course of the proofs in this section, we will use the symbol A to denote non-negative constants which are independent i i y (n), x (n), of n, Y , I s (n) etc., but which may depend on the constants introduced in the various assumptions (that is, M, L, K, K0 , BB, a, etc.). When A appears in different expressions, or even in different sides of the same equality inequality), it will not necessarily represent the same constant. convention, an inequality of the form A + i<A (or (With this is meaningful and has to be inter- preted as saying that A+1, where A is some constant, is smaller than some other constant, denoted again by A.) 
This convention is followed so as to avoid the introduction of unnecessarily many symbols. Without loss of generality, we will assume that the algorithm is initialized so that x (l)=0, Vi. In the general case where x (1)#0, we may think of the algorithm as having started at time 0, with x (0)=O; then, a random update i i s (0) sets x (1) to a nonzero value. So, the case in which the processors initially disagree may be easily reduced to the case where they initially agree. -i Note that we may define step with step-size yo. hold for s (r). y (n) = y0 , vn s (n) = Y .n) (n) Y0 (n) and view It is easy to see that Assumptions s (n) as the new (3.2) and (3.3) also For these reasons, no generality is iost if we assume t2hat and this is what we will do. Let us define b(n) = M i i=l Isi(n) I (B.1) and note that 2 b 2 (n) < M (n) i=l 2 < b (n ) . -39- Using and Lemma (Z.6).,- (2.13) M n-I < f 'ol| j=L ! k=l Ily(n)-xi(nll 2.1 (iii), we obtain n-I d < AyO -j (k) iJ (nl k) |k| | s (k) !1 n-k (B.2) b (k) k=l From a Taylor series expansion for J we obtain () (n)+ J(y(n+l)) = J iL (n)(n M M Z ' < J(y(n) ) + y 0 7J(y(n)) < J(y(n)) + y 0 VJ(y(n))) Y (n) s (n)+ Aly Ci(n) i i (n)si(n) 2 (B.3) s (n) + Ay 0 b(n i=i . is in terms of 7J(x (n)), whereas above we have 7J(y(n)) Assumption 3.2 To overcome this difficulty, we use the Lipschitz continuity of the derivative of J (B..2:) and invoke |tVJ(y(n)) to obtain i iM . I D (n)sA i-l (n)- (n) Z 7J(xi(n) i=l 1(n) M <A i-=i Iy(n)-xi(n) I It s(n) I <_ n-l CyOA I nn-k dn b(k) k=l n-l M i = y 0OA n i si(n) s i=l d n-k b(k)b(n) < k=l n-! <y A n' d n - k [b2 (k)+b 2 (n)] k=i (B.4) Let us define G (n) = -7J(x (n))( (n)s (n) , (B.5) M G(n) = G (n ), (.6) i=l and note that Assumption 3.2 implies that E[G(n)]>-0. We now rewrite inequality -40- (B,3) using (B.4) to replace the derivative term, to obtain: J(y(n+l))< J(y(n)) - Y0 G(n) + Ay0b2 (n) + AY dn [b (k)+b (n)]< k=l ( J(y(n)) - yOG(n) + Ay d b (k) (B..7) k=l Assumption 3.3 implies that [cf. inequality (3.3)] E2Lb (k)I< AE[G(k) Taking expectations in (B.7) (B.8) . and using 2 E[J(y(n+l))]< E[J(y(n))]- we obtain (B.8) YOEL (n)] + AyO n - i dn-k d E(k) (B.9) ~ for different values of n, to obtain We then sum -(B.9) n~n 0 . <EJ(y(+l)) E[J ]< k=l k=l ())]+ A EG(, k) (B.ll) -41- and obtain theorem to' (B.12) E Vk; we may apply the monotone convergence E[G(k) IFk >O, 3.2, By Assumption E E[ (k) <'B Zk E E[G(k)F -k=l 1o3) k=l which implies cO E[G(k) IFk<co, a.s. (14) k=1 From (B.14) we obtain n=1t 77zJ(x (n))E NIow use the fact (Lemma 5.2.1 (n) s(n) (ii), IF inequality any computing processor i for component Z. k=l ] >-co, VJ(x (n))E[s (n)IF ]>-0, 21 Z n a.s. (2.9)) that g(n)>o>O, for This implies that a.s. and establishes part (c) of the theorem. Lemma B. 1: Let X(n), Z(n) be non-negative stochastic processes expectation) adapted to (F } n E[X(n.+) i IF]< EtZ(n)n]< (with finite and such that X(n) + Z(n), - n=l Then X(n) converges almost surely, as new. (B.5) (B.16) Proof of Lemma 5.3.1: X follows that By the monotone convergence theorem and (5.3.23) it Zn <, almost surely. Then, Lemma 5.3.1 becomes the same n=l as Lemma 4.C.1 in [13, p.4531, which in turn is a consequence of the supermartingale convergence theorem [14]. O Now let A be the constant in the right hand side of k~lC A Z(n) = Ay Then, Z(n)> 0 and by 2 n-k I d (B.7) and let b (k) IFj (B-.17) (B.8) EZ(n) ]< A d I EkEG(k)I . (B.18) k=l Therefore, ECZ(n)]< A n-l I n=l A = dn-kE[G(k)] = k=l I1d I E[G(k)3]<0 k=l where the last inequality follows from (B.16). 
Now let $A$ be the constant in the right hand side of (B.7) and let
$$Z(n)\ =\ A\gamma_0^2\,E\Bigl[\sum_{k=1}^{n}d^{\,n-k}\,b^2(k)\ \Big|\ F_n\Bigr].\qquad(B.17)$$
Then $Z(n)\ge 0$ and, by (B.8),
$$E[Z(n)]\ \le\ A\gamma_0^2\sum_{k=1}^{n}d^{\,n-k}\,E[G(k)].\qquad(B.18)$$
Therefore,
$$\sum_{n=1}^{\infty}E[Z(n)]\ \le\ A\sum_{n=1}^{\infty}\sum_{k=1}^{n}d^{\,n-k}\,E[G(k)]\ =\ \frac{A}{1-d}\sum_{k=1}^{\infty}E[G(k)]\ <\ \infty,\qquad(B.19)$$
where the last inequality follows from (B.12).

We now take the conditional expectation of (B.7), given $F_n$. Note that $J(y(n))$ is $F_n$-measurable and that $E[G(n)\,|\,F_n]\ge 0$; therefore,
$$E[J(y(n+1))\,|\,F_n]\ \le\ J(y(n))+Z(n).$$
By (B.19), $Z(n)$ satisfies (B.16) as well, so Lemma B.1 applies and $J(y(n))$ converges almost surely. Using Assumption 3.3 once more, together with (B.12),
$$\sum_{k=1}^{\infty}E[b^2(k)]\ \le\ A\sum_{k=1}^{\infty}E[G(k)]\ <\ \infty,\qquad(B.20)$$
which implies that $b(k)$ converges to zero, almost surely. Recall (B.2) to conclude that $y(n)-x^i(n)$ converges to zero, almost surely. Also, by squaring (B.2), taking expectations and using the fact that $E[b^2(k)]$ converges to zero, we conclude that $E[\|y(n)-x^i(n)\|^2]$ also converges to zero, and this proves part (b) of the Theorem.

Now, let us use Assumption 3.1 and a second order expansion of $J$ to obtain, for any $a\in R$ and $x\in H$,
$$0\ \le\ J\bigl(x-a\nabla J(x)\bigr)\ \le\ J(x)-aA_1\|\nabla J(x)\|^2+a^2A_2\|\nabla J(x)\|^2,\qquad(B.21)$$
where $A_1$, $A_2$ are positive constants not depending on $a$. Assuming that $a$ was chosen small enough, we may use (B.21) and the nonnegativity of $J$ to conclude
$$\|\nabla J(x)\|^2\ \le\ A\,J(x),\qquad\forall x\in H.\qquad(B.22)$$
Since $J(y(n))$ converges, it is bounded; hence $\nabla J(y(n))$ is also bounded, by (B.22). We then use the fact that $y(n)-x^i(n)$ converges to zero to conclude that $J(x^i(n))-J(y(n))$ also converges to zero. This proves part (a) and concludes the proof of the Theorem. $\square$

In the following Lemma we bound certain infinite series by corresponding infinite integrals. This is justified as long as the integrand cannot change by more than a constant factor between any two consecutive integer points. For notational convenience, we use $c(n|k)$ to denote $d^{\,n^\delta-k^\delta}$, where $d$ and $\delta$ are as in Lemma 2.1(iv).

Lemma B.2: The following hold:
$$\sum_{n=1}^{\infty}\frac{1}{n}\sum_{m=n}^{\infty}\frac{1}{m}\,c(m|n)\ <\ \infty,\qquad(B.23)$$
$$\lim_{m\to\infty}\,m\sum_{k=1}^{m}\frac{1}{k^2}\,c(m|k)\ =\ \lim_{m\to\infty}\,\sum_{k=1}^{m}\frac{1}{k}\,c(m|k)\ =\ 0,\qquad(B.24)$$
$$\lim_{m\to\infty}\,\sum_{k=m}^{\infty}\frac{1}{k}\,c(k|m)\ =\ 0.\qquad(B.25)$$

Proof of Lemma B.2: Let $t=y^\delta$; then $y=t^{1/\delta}$ and $dy=(1/\delta)\,t^{(1/\delta)-1}\,dt$. Therefore,
$$\sum_{k=s}^{\infty}\frac{1}{k}\,c(k|s)\ \le\ A\int_{y=s}^{\infty}\frac{1}{y}\,d^{\,y^\delta-s^\delta}\,dy\ =\ \frac{A}{\delta}\int_{t=s^\delta}^{\infty}\frac{1}{t}\,d^{\,t-s^\delta}\,dt\ \le\ \frac{A}{s^\delta},\qquad(B.26)$$
where $A$ does not depend on $s$. Equation (B.25) follows. Since $s^{-1-\delta}$ is an integrable function of $s$, (B.23) follows as well. The left hand side of (B.24) is bounded by
$$A\,m\int_{y=1}^{m}\frac{1}{y^{2}}\,d^{\,m^\delta-y^\delta}\,dy\ =\ \frac{A\,m}{\delta}\int_{t=1}^{m^\delta}t^{-(1/\delta)-1}\,d^{\,m^\delta-t}\,dt\ \le\ A\,m^{-\delta},$$
which converges to zero. The middle term in (B.24) is certainly smaller and converges to zero as well. $\square$
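Lemma B.2 is easy to check numerically for particular parameter values. In the minimal sketch below, the values d = 0.9 and delta = 0.5 are arbitrary choices (any $d\in(0,1)$ and $\delta\in(0,1]$ should behave similarly), and c(n|k) follows the notation fixed above; the three printed columns correspond to the two terms of (B.24) and a truncated version of the sum in (B.25), all of which shrink as $m$ grows.

```python
import numpy as np

# Hypothetical parameter values, chosen only for this illustration.
d, delta = 0.9, 0.5

def c(n, k):
    """c(n|k) = d**(n**delta - k**delta), the notation fixed before Lemma B.2."""
    return d ** (n ** delta - k ** delta)

for m in [10, 100, 1000, 10000]:
    k = np.arange(1, m + 1)
    lhs = m * np.sum(c(m, k) / k ** 2)      # left-hand term of (B.24)
    mid = np.sum(c(m, k) / k)               # middle term of (B.24)
    # (B.25): sum truncated where c(k|m) drops below roughly 1e-18.
    kk = np.arange(m, int((m ** delta + 400.0) ** (1.0 / delta)) + 1)
    tail = np.sum(c(kk, m) / kk)
    print(m, lhs, mid, tail)                # all three columns shrink as m grows
```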
Proof of Theorem 3.2: Using the same arguments as in the proof of Theorem 3.1, we may assume, without loss of generality, that $x^i(1)=0$ and that $\gamma^i(n)=1/n$, $\forall i,n$. (Otherwise we could define $\bar s^i(n)=n\gamma^i(n)s^i(n)$.) We still use $c(n|k)$ to denote $d^{\,n^\delta-k^\delta}$. We define again $b(n)$, $G^i(n)$, $G(n)$ by (B.1), (B.5), (B.6), respectively, as in the proof of Theorem 3.1. Also, let
$$\phi(n|k)=\begin{cases}\dfrac{1}{n}\Bigl[\dfrac{1}{n}+\displaystyle\sum_{m=1}^{n-1}\dfrac{1}{m}\,c(n|m)\Bigr], & n=k,\\[3mm] \dfrac{1}{nk}\,c(n|k), & n>k,\\[2mm] 0, & n<k.\end{cases}\qquad(B.27)$$
By replicating the steps leading to inequality (B.7) in the proof of Theorem 3.1 and using Lemma 2.1(iv) and (B.27), we obtain, for some $A>0$,
$$J(y(n+1))\ \le\ J(y(n))-\frac{1}{n}\,G(n)+A\sum_{k=1}^{n}\phi(n|k)\,b^2(k),\qquad\forall n.\qquad(B.28)$$
Taking expectations in (B.28), we have
$$E[J(y(n+1))]\ \le\ E[J(y(n))]-\frac{1}{n}\,E[G(n)]+A\sum_{k=1}^{n}\phi(n|k)\,E[b^2(k)],\qquad\forall n,\qquad(B.29)$$
and, using Assumption 3.4,
$$E[J(y(n+1))]\ \le\ E[J(y(n))]-\frac{1}{n}\,E[G(n)]+A\sum_{k=1}^{n}\phi(n|k)\bigl(E[G(k)]+1\bigr).\qquad(B.30)$$
We then sum (B.30), for different values of $n$, to obtain
$$0\ \le\ E[J(y(n+1))]\ \le\ E[J(y(1))]+A\sum_{m=1}^{n}\sum_{k=m}^{n}\phi(k|m)+\sum_{m=1}^{n}\Bigl[A\sum_{k=m}^{n}\phi(k|m)-\frac{1}{m}\Bigr]E[G(m)].\qquad(B.31)$$
The definition (B.27) and (B.23) imply that the middle term in the right hand side of (B.31) is bounded. Moreover, using (B.27),
$$m\sum_{k=m}^{\infty}\phi(k|m)\ =\ \frac{1}{m}+\sum_{k=1}^{m-1}\frac{1}{k}\,c(m|k)+\sum_{k=m+1}^{\infty}\frac{1}{k}\,c(k|m),$$
which converges to zero, as $m\to\infty$, by (B.24) and (B.25). Therefore, for large enough $m$, $A\sum_{k=m}^{n}\phi(k|m)-\frac{1}{m}\le 0$. It follows that $E[J(y(n))]$ is bounded.

Inequality (B.31) and the above also imply that $\sum_{m=1}^{\infty}\frac{1}{m}\,E[G(m)]<\infty$, and part (c) of the Theorem follows, as in the proof of Theorem 3.1. We now define
$$Z(n)\ =\ A\,E\Bigl[\sum_{k=1}^{n}\phi(n|k)\,b^2(k)\ \Big|\ F_n\Bigr]$$
and note that
$$\sum_{n=1}^{\infty}E[Z(n)]\ \le\ A\sum_{k=1}^{\infty}\sum_{n=k}^{\infty}\phi(n|k)\bigl(E[G(k)]+1\bigr)\ \le\ A+A\sum_{k=1}^{\infty}\frac{1}{k}\,E[G(k)]\ <\ \infty.$$
Taking conditional expectations in (B.28) (with respect to $F_n$) and using Lemma B.1, we conclude that $J(y(n))$ converges, almost surely.

We now turn to the proof of part (b). Using (3.5) and (B.22), we have $E[\|s^i(n)\|^2\,|\,F_n]\le A\,J(x^i(n))+A$, which finally implies that
$$E[\|s^i(n)\|^2]\ \le\ A\,E[J(x^i(n))]+A.\qquad(B.33)$$
Now, using (B.22) once more,
$$J(x^i(n))-J(y(n))\ \le\ \|\nabla J(y(n))\|\,\|x^i(n)-y(n)\|+A\|x^i(n)-y(n)\|^2\ \le\ A\|\nabla J(y(n))\|^2+A\|x^i(n)-y(n)\|^2\ \le\ A\,J(y(n))+A\|x^i(n)-y(n)\|^2.\qquad(B.34)$$
Inequalities (B.33), (B.34) and the boundedness of $E[J(y(n))]$ (which is a consequence of (B.31)) yield
$$E[b^2(n)]\ \le\ A+A\max_{1\le i\le M}E[\|x^i(n)-y(n)\|^2].\qquad(B.35)$$
Similarly with (B.2), we have
$$\|y(n)-x^i(n)\|\ \le\ A\sum_{k=1}^{n-1}\frac{1}{k}\,c(n|k)\,b(k)\qquad(B.36)$$
and
$$\|y(n)-x^i(n)\|^2\ \le\ A\Bigl(\sum_{k=1}^{n-1}c(n|k)\Bigr)\sum_{k=1}^{n-1}\frac{1}{k^2}\,c(n|k)\,b^2(k)\ \le\ A\,n\sum_{k=1}^{n-1}\frac{1}{k^2}\,c(n|k)\,b^2(k).\qquad(B.37)$$
Therefore,
$$E[\|y(n)-x^i(n)\|^2]\ \le\ \beta(n)\max_{1\le k\le n}E[b^2(k)],\qquad\hbox{where }\ \beta(n)=A\,n\sum_{k=1}^{n}\frac{1}{k^2}\,c(n|k),\qquad(B.38)$$
and $\beta(n)$ converges to zero, by (B.24). Using (B.35),
$$E[\|y(n)-x^i(n)\|^2]\ \le\ A\,\beta(n)\Bigl(1+\max_{k\le n}E[\|y(k)-x^i(k)\|^2]\Bigr),$$
and since $\beta(n)$ converges to zero, it follows that $E[\|y(n)-x^i(n)\|^2]$ converges to zero as well. We also conclude from (B.35) that $\sup_n E[b^2(n)]<\infty$.

Let
$$D_k\ =\ \sum_{k^{1/\delta}\le i<(k+1)^{1/\delta}}b(i),\qquad k\ge 1.\qquad(B.39)$$
Using the fact that there exists an $A$ such that $(k+1)^{1/\delta}-k^{1/\delta}\le A\,k^{(1/\delta)-1}$, $\forall k$, we obtain from (B.39)
$$\sum_{k=1}^{\infty}\frac{1}{k^{2/\delta}}\,E[D_k^2]\ \le\ A\sum_{k=1}^{\infty}\frac{1}{k^{2/\delta}}\bigl[(k+1)^{1/\delta}-k^{1/\delta}\bigr]^2\,\sup_n E[b^2(n)]\ \le\ A\sum_{k=1}^{\infty}\frac{k^{(2/\delta)-2}}{k^{2/\delta}}\ <\ \infty.$$
It follows that $D_k/k^{1/\delta}$ converges to zero, almost surely. Consequently, so does
$$\sum_{k=1}^{N}d^{\,N-k}\,\frac{D_k}{k^{1/\delta}},\qquad\hbox{as }N\to\infty.\qquad(B.40)$$
Let us fix some $n$, let $N$ denote the largest integer such that $N^{1/\delta}\le n$, and use (B.36) to obtain
$$\|x^i(n)-y(n)\|\ \le\ A\sum_{k=1}^{n-1}\frac{1}{k}\,c(n|k)\,b(k)\ \le\ A\sum_{k=1}^{N}\frac{1}{k^{1/\delta}}\,d^{\,N-k-1}\!\!\sum_{k^{1/\delta}\le i<(k+1)^{1/\delta}}\!\!b(i)\ \le\ A\sum_{k=1}^{N}d^{\,N-k}\,\frac{D_k}{k^{1/\delta}}.\qquad(B.41)$$
As $n$ converges to infinity, so does $N$ and, by the above discussion, $x^i(n)-y(n)$ converges to zero, as $n\to\infty$. Consequently, $x^i(n)-x^j(n)$ also converges to zero, for any $i,j$, completing the proof of part (b). Finally, since $J(y(n))$ converges and $x^i(n)-y(n)$ converges to zero, part (a) of the Theorem follows, as in the proof of Theorem 3.1. $\square$

Proof of Corollary 3.1: From part (c) of either Theorem 3.1 or 3.2 and (3.7) we obtain
$$\sum_{i=1}^{M}\sum_{\ell=1}^{L}\sum_{n\in T^i_\ell}\gamma^i(n)\,g^i_\ell\bigl(x^i(n)\bigr)\ <\ \infty,\qquad a.s.\qquad(B.42)$$
Because of our assumption on the sets $T^i_\ell$, it follows that there exists a positive integer $c$ such that, for any $i,\ell,m$, the interval $\{cm+1,cm+2,\dots,c(m+1)\}$ contains at least one element of $T^i_\ell$. Let us choose sequences of such elements, denoted by $t^i_{\ell,m}$. By (B.42), we have
$$\sum_{i=1}^{M}\sum_{\ell=1}^{L}\sum_{m=1}^{\infty}\gamma^i(t^i_{\ell,m})\,g^i_\ell\bigl(x^i(t^i_{\ell,m})\bigr)\ <\ \infty,\qquad a.s.\qquad(B.43)$$
Now notice that, for some constant $K_5>0$,
$$\gamma^i(t^i_{\ell,m})\ \ge\ \frac{K_5}{t^i_{\ell,m}}\ \ge\ \frac{K_5}{c(m+1)},\qquad\forall i,\ell,m.\qquad(B.44)$$
Hence, (B.43) yields
$$\sum_{m=1}^{\infty}\frac{1}{m}\sum_{i=1}^{M}\sum_{\ell=1}^{L}g^i_\ell\bigl(x^i(t^i_{\ell,m})\bigr)\ <\ \infty,\qquad a.s.\qquad(B.45)$$
From either Theorem 3.1 or 3.2 and its proof we obtain
$$\lim_{n\to\infty}\bigl(x^i(n)-y(n)\bigr)\ =\ \lim_{n\to\infty}\bigl(y(n+1)-y(n)\bigr)\ =\ 0,$$
which implies that
$$\lim_{m\to\infty}\bigl(x^i(t^i_{\ell,m})-y(t^i_{\ell,m})\bigr)\ =\ 0,\qquad\forall i,\ell.\qquad(B.46)$$
Since $J$ has compact level sets and $J(y(n))$ converges, the sequence $\{y(n)\}$ is bounded. We therefore need to consider the functions $g^i_\ell$ only on a compact set, on which they are uniformly continuous. Therefore,
$$\lim_{m\to\infty}\Bigl[g^i_\ell\bigl(x^i(t^i_{\ell,m})\bigr)-g^i_\ell\bigl(y(t^i_{\ell,m})\bigr)\Bigr]\ =\ 0,\qquad\forall i,\ell.\qquad(B.47)$$
a) By combining (B.45) and (B.47) we obtain
$$\liminf_{m\to\infty}\ \sum_{i=1}^{M}\sum_{\ell=1}^{L}g^i_\ell\bigl(y(t^i_{\ell,m})\bigr)\ =\ 0.\qquad(B.48)$$
By (B.48), there must be some subsequence of $\{t^i_{\ell,m}\}$ along which $\sum_{i=1}^{M}\sum_{\ell=1}^{L}g^i_\ell\bigl(y(t^i_{\ell,m})\bigr)$ converges to zero. Let $y^*$ be a limit point of the corresponding subsequence of $\{y(t^1_{1,m})\}$. By continuity, $g^i_\ell(y^*)=0$, for all $i,\ell$, and, by assumption, $y^*$ must be a stationary point of $J$, so $\nabla J(y^*)=0$. Moreover, $x^i(t^i_{\ell,m})$ also converges to $y^*$ along the same subsequence. By continuity of $\nabla J$, (3.8) follows.

b) In this case, (B.43) implies
$$\lim_{m\to\infty}\sum_{i=1}^{M}\sum_{\ell=1}^{L}g^i_\ell\bigl(x^i(t^i_{\ell,m})\bigr)\ =\ 0,\qquad a.s.,\qquad(B.49)$$
and the rest of the proof is the same as for part (a), except that we do not need to restrict ourselves to a convergent subsequence.

c) From part (a) we conclude that some subsequence of $\{y(t^1_{1,m})\}$ converges to some $y^*$ for which $g^i_\ell(y^*)=0$, for all $i,\ell$. Consequently, $y^*$ minimizes $J$. Using the continuity of $J$,
$$\liminf_{n\to\infty}J(y(n))\ \le\ \liminf_{m\to\infty}J\bigl(y(t^1_{1,m})\bigr)\ \le\ J(y^*)\ =\ \inf_{x\in H}J(x).$$
On the other hand, $J(y(n))$ converges (part (a) of either Theorem 3.1 or 3.2), which shows that (3.10) holds. $\square$
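As a closing illustration of the flavor of Theorem 3.2 and Corollary 3.1, the following is a toy sketch only, not the model of Section 2: it uses a scalar quadratic cost, two processors, step size $1/n$, and periodic exact averaging in place of the general asynchronous communication pattern, and the names grad_J and simulate are ad hoc. The simulation shows the processors' disagreement vanishing while both iterates approach the minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_J(x):
    """Gradient of the toy cost J(x) = 0.5 * x**2."""
    return x

def simulate(n_steps=100000, comm_period=50, noise=1.0):
    """Two processors take noisy gradient steps with step size 1/n on the same
    scalar variable; every comm_period steps they adopt the average of their
    values (a crude stand-in for the combining step of the model)."""
    x = np.array([5.0, -3.0])          # initial disagreement between the processors
    for n in range(1, n_steps + 1):
        step = 1.0 / n
        g = grad_J(x) + noise * rng.standard_normal(2)   # noisy descent directions
        x = x - step * g
        if n % comm_period == 0:
            x[:] = x.mean()            # combine: both processors adopt the average
    return x

x_final = simulate()
print(x_final, abs(x_final[0] - x_final[1]))
# Both components end up close to the minimizer 0 and close to each other,
# in the spirit of parts (a) and (b) of the theorems.
```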