SIAM J. CONTROL OPTIM. Vol. 38, No. 2, pp. 447–469
© 2000 Society for Industrial and Applied Mathematics

THE O.D.E. METHOD FOR CONVERGENCE OF STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING∗

V. S. BORKAR† AND S. P. MEYN‡

Abstract. It is shown here that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE. This in turn implies convergence of the algorithm. Several specific classes of algorithms are considered as applications. It is found that the results provide (i) a simpler derivation of known results for reinforcement learning algorithms; (ii) a proof for the first time that a class of asynchronous stochastic approximation algorithms are convergent without using any a priori assumption of stability; (iii) a proof for the first time that asynchronous adaptive critic and Q-learning algorithms are convergent for the average cost optimal control problem.

Key words. stochastic approximation, ODE method, stability, asynchronous algorithms, reinforcement learning

AMS subject classifications. 62L20, 93E25, 93E15

PII. S0363012997331639

∗ Received by the editors December 17, 1997; accepted for publication (in revised form) February 22, 1999; published electronically January 11, 2000. http://www.siam.org/journals/sicon/38-2/33163.html
† School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai 400005, India (borkar@tifr.res.in). The research of this author was supported in part by the Department of Science and Technology (Government of India) grant III5(12)/96-ET.
‡ Department of Electrical and Computer Engineering and the Coordinated Sciences Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 (s-meyn@uiuc.edu). The research of this author was supported in part by NSF grant ECS 940372 and JSEP grant N00014-90-J-1270. This research was completed while this author was a visiting scientist at the Indian Institute of Science under a Fulbright Research Fellowship.

1. Introduction. The stochastic approximation algorithm considered in this paper is described by the d-dimensional recursion

(1.1)    X(n + 1) = X(n) + a(n)[h(X(n)) + M(n + 1)],    n ≥ 0,

where X(n) = [X1(n), . . . , Xd(n)]T ∈ Rd, h : Rd → Rd, and {a(n)} is a sequence of positive numbers. The sequence {M(n) : n ≥ 0} is uncorrelated with zero mean. Though more than four decades old, the stochastic approximation algorithm is now of renewed interest due to novel applications to reinforcement learning [20] and as a model of learning by boundedly rational economic agents [19].

Traditional convergence analysis usually shows that the recursion (1.1) will have the desired asymptotic behavior provided that the iterates remain bounded with probability one, or that they visit a prescribed bounded set infinitely often with probability one [3, 14]. Under such stability or recurrence conditions one can then approximate the sequence X = {X(n) : n ≥ 0} with the solution to the ordinary differential equation (ODE)

(1.2)    ẋ(t) = h(x(t)),

with identical initial conditions x(0) = X(0).

The recurrence assumption is crucial, and in many practical cases this becomes a bottleneck in applying the ODE method. The most successful technique for establishing stochastic stability is the stochastic Lyapunov function approach (see, e.g., [14]). One also has techniques based upon the contractive properties or homogeneity properties of the functions involved (see, e.g., [20] and [12], respectively).
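As an informal illustration of the recursion (1.1) and its ODE approximation (1.2) (this sketch is not part of the paper), the following Python fragment runs the algorithm for an assumed affine function h(x) = −x + b with i.i.d. Gaussian noise and tapering stepsizes a(n) = 1/(n + 1); the choice of h, the noise law, and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
b = np.array([1.0, -0.5])

def h(x):
    # Illustrative choice: h(x) = -x + b, so the ODE x' = h(x) has
    # a globally asymptotically stable equilibrium x* = b.
    return -x + b

# Stochastic approximation recursion (1.1) with tapering steps a(n) = 1/(n+1).
X = np.zeros(d)
N = 5000
for n in range(N):
    a_n = 1.0 / (n + 1)
    M = rng.normal(scale=0.5, size=d)   # martingale difference noise M(n+1)
    X = X + a_n * (h(X) + M)

# Euler approximation of the ODE (1.2), run on the same "algorithmic time"
# scale t(n) = a(0) + ... + a(n-1), as in the ODE method.
x = np.zeros(d)
for n in range(N):
    x = x + (1.0 / (n + 1)) * h(x)

print("SA iterate X(N):", X)   # both should be close to x* = b
print("ODE solution   :", x)
```

Both the stochastic iterates and the Euler approximation of (1.2) settle near the equilibrium x∗ = b, which is the behavior the ODE method is designed to capture once boundedness of the iterates is known.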
The main contribution of this paper is to add to this collection another general technique for proving stability of the stochastic approximation method. This technique is inspired by the fluid model approach to stability of networks developed in [9, 10], which is itself based upon the multistep drift criterion of [15, 16]. The idea is that the usual stochastic Lyapunov function approach can be difficult to apply due to the fact that time-averaging of the noise may be necessary before a given positive valued function of the state process will decrease towards zero. In general such timeaveraging of the noise will require infeasible calculation. In many models, however, it is possible to combine time-averaging with a limiting operation on the magnitude of the initial state, to replace the stochastic system of interest with a simpler deterministic process. The scaling applied in this paper to approximate the model (1.1) with a deterministic process is similar to the construction of the fluid model of [9, 10]. Suppose that e the state is scaled by its initial value to give X(n) = X(n)/ max(|X(0)|, 1), n ≥ 0. We then scale time to obtain a continuous function φ : R+ → Rd which interpolates e e the values of {X(n)}. At a sequence of times {t(j) : j ≥ 0} we set φ(t(j)) = X(j), and for arbitrary t ≥ 0, we extend the definition by linear interpolation. The times {t(j) : j ≥ 0} are defined in terms of the constants {a(j)} used in (1.1). For any r > 0, the scaled function hr : Rd → Rd is given by (1.3) hr (x) = h(rx)/r, x ∈ Rd . Then through elementary arguments we find that the stochastic process φ approximates the solution φb to the associated ODE ¡ (1.4) ẋ(t) = hr x(t) , t ≥ 0, b = φ(0) and r = max(|X(0)|, 1). with φ(0) With our attention on stability considerations, we are most interested in the behavior of X when the magnitude of the initial condition |X(0)| is large. Assuming that the limiting function h∞ = limr→∞ hr exists, for large initial conditions we find that φ is approximated by the solution φ∞ of the limiting ODE ¡ (1.5) ẋ(t) = h∞ x(t) , where again we take identical initial conditions φ∞ (0) = φ(0). Thus, for large initial conditions all three processes are approximately equal, φ ≈ φb ≈ φ∞ . Using these observations we find in Theorem 2.1 that the stochastic model (1.1) is stable in a strong sense provided the origin is asymptotically stable for the limiting ODE (1.5). Equation (1.5) is precisely the fluid model of [9, 10]. Thus, the major conclusion of this paper is that the ODE method can be extended to establish both the stability and convergence of the stochastic approximation method, as opposed to only the latter. The result [14, Theorem 4.1, p. 115] arrives at a similar conclusion: if the ODE (1.2) possesses a “global” Lyapunov function with bounded partial derivatives, then this will serve as a stochastic Lyapunov function, thereby establishing recurrence of the algorithm. Though similar in flavor, there are STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 449 significant differences between these results. First, in the present paper we consider a scaled ODE, not the usual ODE (1.2). The former retains only terms with dominant growth and is frequently simpler. Second, while it is possible that the stability of the scaled ODE and the usual one go hand in hand, this does not imply that a Lyapunov function for the latter is easily found. 
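To make the scaling concrete (again a hypothetical sketch, not from the paper), suppose h(x) = Ax + g(x) with A a Hurwitz matrix and g bounded; then hr(x) = h(rx)/r converges to h∞(x) = Ax, and the limiting ODE (1.5) is stable at the origin. The matrix A, the perturbation g, and the numerical choices below are assumptions made only for illustration.

```python
import numpy as np

# Illustrative h with linear dominant growth: h(x) = A x + g(x), g bounded.
# Then h_r(x) = h(r x)/r -> h_inf(x) = A x as r -> infinity (assumed example).
A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])     # Hurwitz, so the origin is stable for (1.5)

def h(x):
    return A @ x + np.tanh(x)   # bounded perturbation g(x) = tanh(x)

def h_r(x, r):
    return h(r * x) / r         # the scaled function (1.3)

def h_inf(x):
    return A @ x                # pointwise limit of h_r

# Check that h_r approaches h_inf on a test point as r grows, and that the
# limiting ODE (1.5) contracts a unit initial condition toward the origin.
x0 = np.array([1.0, -1.0]) / np.sqrt(2.0)
for r in (1.0, 10.0, 1000.0):
    print(r, np.linalg.norm(h_r(x0, r) - h_inf(x0)))

x, dt = x0.copy(), 1e-3
for _ in range(int(10.0 / dt)):   # Euler integration of (1.5)
    x = x + dt * h_inf(x)
print("|x(10)| for the limiting ODE:", np.linalg.norm(x))   # near 0
```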
The reinforcement learning algorithms for ergodic-cost optimal control and asynchronous algorithms, both considered as applications of the theory in this paper, are examples where the scaled ODE is conveniently analyzed. Though the assumptions made in this paper are explicitly motivated by applications to reinforcement learning algorithms for Markov decision processes, this approach is likely to find a broader range of applications.

The paper is organized as follows. The next section presents the main results for the stochastic approximation algorithm with vanishing stepsize or with bounded, nonvanishing stepsize. Section 2 also gives a useful error bound for the constant stepsize case and briefly sketches an extension to asynchronous algorithms, omitting details that can be found in [6]. Section 3 gives examples of algorithms for reinforcement learning of Markov decision processes to which this analysis is applicable. The proofs of the main results are collected together in section 4.

2. Main results. Here we collect together the main general results concerning the stochastic approximation algorithm. Proofs not included here may be found in section 4. We shall impose the following additional conditions on the functions {hr : r ≥ 1} defined in (1.3) and the sequence M = {M(n) : n ≥ 1} used in (1.1). Some relaxations of assumption (A1) are discussed in section 2.4.

(A1) The function h is Lipschitz, and there exists a function h∞ : Rd → Rd such that

lim_{r→∞} hr(x) = h∞(x),    x ∈ Rd.

Furthermore, the origin in Rd is an asymptotically stable equilibrium for the ODE (1.5).

(A2) The sequence {M(n), Fn : n ≥ 1}, with Fn = σ(X(i), M(i), i ≤ n), is a martingale difference sequence. Moreover, for some C0 < ∞ and any initial condition X(0) ∈ Rd,

E[‖M(n + 1)‖² | Fn] ≤ C0(1 + ‖X(n)‖²),    n ≥ 0.

The sequence {a(n)} is deterministic and is assumed to satisfy one of the following two assumptions. Here TS stands for “tapering stepsize” and BS for “bounded stepsize.”

(TS) The sequence {a(n)} satisfies 0 < a(n) ≤ 1, n ≥ 0, and

Σ_n a(n) = ∞,    Σ_n a(n)² < ∞.

(BS) The sequence {a(n)} satisfies, for some constants 1 > ᾱ > α̲ > 0,

α̲ ≤ a(n) ≤ ᾱ,    n ≥ 0.

2.1. Stability and convergence. The first result shows that the algorithm is stabilizing for both bounded and tapering step sizes.

Theorem 2.1. Assume that (A1) and (A2) hold. Then we have the following:
(i) Under (TS), for any initial condition X(0) ∈ Rd,

sup_n ‖X(n)‖ < ∞    almost surely (a.s.).

(ii) Under (BS) there exist α∗ > 0 and C1 < ∞ such that for all 0 < α < α∗ and X(0) ∈ Rd,

lim sup_{n→∞} E[‖X(n)‖²] ≤ C1.

An immediate corollary to Theorem 2.1 is convergence of the algorithm under (TS). The proof is a standard application of the Hirsch lemma (see [11, Theorem 1, p. 339] or [3, 14]), but we give the details below for sake of completeness.

Theorem 2.2. Suppose that (A1), (A2), and (TS) hold and that the ODE (1.2) has a unique globally asymptotically stable equilibrium x∗. Then X(n) → x∗ a.s. as n → ∞ for any initial condition X(0) ∈ Rd.

Proof. We may suppose that X(0) is deterministic without any loss of generality so that the conclusion of Theorem 2.1 (i) holds that the sample paths of X are bounded with probability one. Fixing such a sample path, we see that X remains in a bounded set H, which may be chosen so that x∗ ∈ int(H). The proof depends on an approximation of X with the solution to the primary ODE (1.2).
To perform this approximation, first define t(n) ↑ ∞, T(n) ↑ ∞ as follows: Set t(0) = T(0) = 0 and for n ≥ 1, t(n) = Σ_{i=0}^{n−1} a(i). Fix T > 0 and define inductively

T(n + 1) = min{t(j) : t(j) > T(n) + T},    n ≥ 0.

Thus T(n) = t(m(n)) for some m(n) ↑ ∞ and T ≤ T(n + 1) − T(n) ≤ T + 1 for n ≥ 0. We then define two functions from R+ to Rd:
(a) {ψ(t), t > 0} is defined by ψ(t(n)) = X(n) with linear interpolation on [t(n), t(n + 1)] for each n ≥ 0.
(b) {ψ̂(t), t > 0} is piecewise continuous, defined so that, for any j ≥ 0, ψ̂ is the solution to (1.2) for t ∈ [T(j), T(j + 1)), with the initial condition ψ̂(T(j)) = ψ(T(j)).

Let ε > 0 and let B(ε) denote the open ball centered at x∗ of radius ε. We may then choose the following:
(i) 0 < δ < ε such that x(t) ∈ B(ε) for all t ≥ 0 whenever x(·) is a solution of (1.2) satisfying x(0) ∈ B(δ).
(ii) T > 0 so large that for any solution of (1.2) with x(0) ∈ H we have x(t) ∈ B(δ/2) for all t ≥ T. Hence, ψ̂(T(j)−) ∈ B(δ/2) for all j ≥ 1.
(iii) An application of the Bellman Gronwall lemma as in Lemma 4.6 below that leads to the limit

(2.1)    ‖ψ(t) − ψ̂(t)‖ → 0    a.s., t → ∞.

Hence we may choose j0 > 0 so that we have

‖ψ(T(j)−) − ψ̂(T(j)−)‖ ≤ δ/2,    j ≥ j0.

Since ψ(·) is continuous, we conclude from (ii) and (iii) that ψ(T(j)) ∈ B(δ) for all j ≥ j0. Since ψ̂(T(j)) = ψ(T(j)), it then follows from (i) that ψ̂(t) ∈ B(ε) for t ≥ T(j0). Hence by (2.1),

lim sup_{t→∞} ‖ψ(t) − x∗‖ ≤ ε    a.s.

This completes the proof since ε > 0 was arbitrary.

We now consider (BS), focusing on the absolute error defined by

(2.2)    e(n) := ‖X(n) − x∗‖,    n ≥ 0.

Theorem 2.3. Assume that (A1), (A2), and (BS) hold, and suppose that (1.2) has a globally asymptotically stable equilibrium point x∗. Then for any 0 < α ≤ α∗, where α∗ is introduced in Theorem 2.1 (ii),
(i) for any ε > 0, there exists b1 = b1(ε) < ∞ such that

lim sup_{n→∞} P(e(n) ≥ ε) ≤ b1 α;

(ii) if x∗ is a globally exponentially asymptotically stable equilibrium for the ODE (1.2), then there exists b2 < ∞ such that for every initial condition X(0) ∈ Rd,

lim sup_{n→∞} E[e(n)²] ≤ b2 α.

2.2. Rate of convergence. A uniform bound on the mean square error E[e(n)²] for n ≥ 0 can be obtained under slightly stronger conditions on M via the theory of ψ-irreducible Markov chains. We find that this error can be bounded from above by a sum of two terms: the first converges to zero as α ↓ 0, while the second decays to zero exponentially as n → ∞.

To illustrate the nature of these bounds, consider the linear recursion

X(n + 1) = X(n) + α[−(X(n) − x∗) + W(n + 1)],    n ≥ 0,

where {W(n)} is independently and identically distributed (i.i.d.) with mean zero and variance σ². This is of the form (1.1) with h(x) = −(x − x∗) and M(n) = W(n). The error e(n + 1) defined in (2.2) may be bounded as follows:

E[e(n + 1)²] ≤ α²σ² + (1 − α)²E[e(n)²] ≤ ασ²/(2 − α) + exp(−2αn)E[e(0)²],    n ≥ 0.

For a deterministic initial condition X(0) = x we thus arrive at the formal bound

(2.3)    E[e(n)² | X(0) = x] ≤ B1(α) + B2(‖x‖² + 1) exp(−ε0(α)n),

where B1, B2, and ε0 are positive-valued functions of α. The bound (2.3) is of the form that we seek: the first term on the right-hand side (r.h.s.) decays to zero with α, while the second decays exponentially to zero with n. However, the rate of convergence for the second term becomes vanishingly small as α ↓ 0. Hence to maintain a small probability of error the variable α should be neither too small nor too large.
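A small simulation (illustrative only, not from the paper) makes the two terms in (2.3) visible for the scalar linear recursion above: a large constant stepsize α forgets the initial condition quickly but settles at a mean square error of roughly ασ²/(2 − α), while a small α gives a small limiting error but a very slow transient. All numerical values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x_star, sigma = 2.0, 1.0

def mse(alpha, n_steps=20000, n_runs=200, x0=10.0):
    # Simulate X(n+1) = X(n) + alpha*(-(X(n)-x*) + W(n+1)) and return E[e(n)^2]
    # at a few checkpoints, averaged over independent runs.
    X = np.full(n_runs, x0)
    out = {}
    for n in range(n_steps):
        W = rng.normal(scale=sigma, size=n_runs)
        X = X + alpha * (-(X - x_star) + W)
        if n + 1 in (100, 1000, n_steps):
            out[n + 1] = np.mean((X - x_star) ** 2)
    return out

for alpha in (0.5, 0.05, 0.005):
    print(alpha, mse(alpha))
# Large alpha: fast transient decay but steady-state error near alpha*sigma^2/(2-alpha).
# Small alpha: small steady-state error but very slow decay of the initial error.
```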
This recalls the well-known trade-off between mean and variance that must be made in the application of stochastic approximation algorithms. 452 V. S. BORKAR AND S. P. MEYN A bound of this form carries over to the nonlinear model under some additional conditions. For convenience, we take a Markov model of the form ¡ ¡ (2.4) X(n + 1) = X(n) + α h X(n) + m X(n), W (n + 1) , where again {W (n)} is i.i.d. and also independent of the initial condition X(0). We assume that the functions h : Rd → Rd and m : Rd × Rq → Rd are smooth (C 1 ) and that assumptions (A1) and (A2) continue to hold. The recursion (2.4) then describes a Feller–Markov chain with stationary transition kernel to be denoted by P . Let V : Rd → [1, ∞) be given. The Markov chain X with transition function P is called V -uniformly ergodic if there is a unique invariant probability π, an R < ∞, and ρ < 1 such that for any function g satisfying |g(x)| ≤ V (x), ¡ ¡ (2.5) E g X(n) | X(0) = x − Eπ g X(n) ≤ RV (x)ρn , x ∈ Rd , n ≥ 0, R where Eπ [g(X(n))] = g(x) π(dx), n ≥ 0. The following result establishes bounds of the form (2.3) using V -ergodicity of the model. Assumptions (2.6) and (2.7) below are required to establish ψ-irreducibility of the model in Lemma 4.10. There exists a w∗ ∈ Rq with m(x∗ , w∗ ) = 0, and for a continuous function p : q R → [0, 1] with p(w∗ ) > 0, Z ¡ (2.6) p(z)dz, A ∈ B(Rq ). P W (1) ∈ A ≥ A The pair of matrices (F, G) is controllable with (2.7) F = ∂ d h(x∗ ) + m(x∗ , w∗ ) dx ∂x and G= ∂ m(x∗ , w∗ ). ∂w Theorem 2.4. Suppose that (A1), (A2), (2.6), and (2.7) hold for the Markov model (2.4) with 0 < α ≤ α∗ . Then the Markov chain X is V -uniformly ergodic, with V (x) = kxk2 + 1, and we have the following bounds: (i) There exist positive-valued functions A1 and 0 of α and a constant A2 independent of α, such that ¡ ¡ ª P e(n) ≥ | X(0) = x ≤ A1 (α) + A2 kxk2 + 1 exp − 0 (α)n . The functions satisfy A1 (α) → 0, 0 (α) → 0 as α ↓ 0. (ii) If in addition the ODE (1.2) is exponentially asymptotically stable, then the stronger bound (2.3) holds, where again B1 (α) → 0, 0 (α) → 0 as α ↓ 0, and B2 is independent of α. Proof. The V -uniform ergodicity is established in Lemma 4.10. From Theorem 2.3 (i) we have, when X(0) ∼ π, ¡ ¡ Pπ e(n) ≥ = Pπ e(0) ≥ ≤ b1 α, and hence from V -uniform ergodicity, ¡ ¡ ¡ ¡ P e(n) ≥ | X(0) = x ≤ Pπ e(n) ≥ + P e(n) ≥ | X(0) = x − Pπ e(n) ≥ ≤ b1 α + RV (x)ρn , n ≥ 0. This and the definition of V establishes (i). The proof of (ii) is similar. The fact that ρ = ρα → 1 as α ↓ 0 is discussed in section 4.3. STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 453 2.3. The asynchronous case. The conclusions above also extend to the model of asynchronous stochastic approximation analyzed in [6]. We now assume that each component of X(n) is updated by a separate processor. We postulate a set-valued process {Y (n)} taking values in the set of subsets of {1, 2, . . . , d}, with the interpretation: Y (n) = {indices of the components updated at time n}. For n ≥ 0, 1 ≤ i ≤ d, define ν(i, n) = n X ª I i ∈ Y (m) , m=0 the number of updates executed by the ith processor up to time n. A key assumption is that there exists a deterministic ∆ > 0 such that for all i, lim inf n→∞ ν(i, n) ≥∆ n a.s. This ensures that all components are updated comparably often. Furthermore, if ) ( m X a(n) > x N (n, x) = min m > n : k=n+1 for x > 0, the limit Pv(i,N (n,x)) k=v(i,n) a(k) k=v(j,n) a(k) limn→∞ Pv(j,N (n,x)) exists a.s. for all i, j. 
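As a sketch of the bookkeeping behind this relative-frequency assumption (not part of the paper), the following fragment simulates a hypothetical update pattern in which each component is included in Y(n) independently with probability 1/2 and checks empirically that ν(i, n)/n stays bounded away from zero; the law of Y(n) is an assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
N = 200000

# Illustrative Y(n): each component is updated independently with probability 1/2,
# so every component is updated a positive fraction of the time (Delta near 1/2).
nu = np.zeros(d, dtype=int)          # nu[i] tracks nu(i, n), the update counts
min_ratio = np.full(d, np.inf)
for n in range(1, N + 1):
    Y = rng.random(d) < 0.5          # indices updated at time n
    nu += Y
    if n > 1000:                     # ignore the initial transient
        min_ratio = np.minimum(min_ratio, nu / n)

print("empirical lower bounds on nu(i, n)/n:", np.round(min_ratio, 3))
# All components stay bounded away from zero, as the assumption requires.
```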
At time n, the kth processor has available the following data:
(i) Processor (k) is given ν(k, n), but it may not have n, the “global clock.”
(ii) There are interprocessor communication delays τkj(n), 1 ≤ k, j ≤ d, n ≥ 0, so that at time n, processor (k) may use the data Xj(m) only for m ≤ n − τkj(n). We assume that τkk(n) = 0 for all n and that {τkj(n)} have a common upper bound τ < ∞ ([6] considers a slightly more general situation).

To relate the present work to [6], we recall that the “centralized” algorithm of [6] is

X(n + 1) = X(n) + a(n)f(X(n), W(n + 1)),

where {W(n)} are i.i.d. and {f(·, y)} are uniformly Lipschitz. Thus F(x) := E[f(x, W(1))] is Lipschitz. The correspondence with the present set up is obtained by setting h(x) = F(x) and

M(n + 1) = f(X(n), W(n + 1)) − F(X(n))

for n ≥ 0. The asynchronous version then is

(2.8)    Xi(n + 1) = Xi(n) + a(ν(i, n)) fi(X1(n − τi1(n)), X2(n − τi2(n)), . . . , Xd(n − τid(n)), W(n + 1)) I{i ∈ Y(n)},    n ≥ 0,

for 1 ≤ i ≤ d. Note that this can be executed by the ith processor without any knowledge of the global clock which, in fact, can be a complete artifice as long as causal relationships are respected.

The analysis presented in [6] depends upon the following additional conditions on {a(n)}:
(i) a(n + 1) ≤ a(n) eventually;
(ii) for x ∈ (0, 1), sup_n a([xn])/a(n) < ∞;
(iii) for x ∈ (0, 1),

(Σ_{i=0}^{[xn]} a(i)) / (Σ_{i=0}^{n} a(i)) → 1,

where [·] stands for “the integer part of (·).” A fourth condition is imposed in [6], but this becomes irrelevant when the delays are bounded. Examples of {a(n)} satisfying (i)–(iii) are a(n) = 1/(n + 1) or 1/(1 + n log(n + 1)).

As a first simplifying step, it is observed in [6] that {Y(n)} may be assumed to be singletons without any loss of generality. We shall do likewise. What this entails is simply unfolding a single update at time n into |Y(n)| separate updates, each involving a single component. This blows up the delays at most d-fold, which does not affect the analysis in any way.

The main result of [6] is the analog of our Theorem 2.2 given that the conclusions of our Theorem 2.1 hold. In other words, stability implies convergence. Under (A1) and (A2), our arguments above can be easily adapted to show that the conclusions of Theorem 2.2 also hold for the asynchronous case. One argues exactly as above and in [6] to conclude that the suitably interpolated and rescaled trajectory of the algorithm tracks an appropriate ODE. The only difference is a scalar factor 1/d multiplying the r.h.s. of the ODE (i.e., ẋ(t) = (1/d)h(x(t))). This factor, which reflects the asynchronous sampling, amounts to a time-scaling that does not affect the qualitative behavior of the ODE.

Theorem 2.5. Under the conditions of Theorem 2.2 and the above hypotheses on {a(n)}, {Y(n)}, and {τij(n)}, the asynchronous iterates given by (2.8) remain a.s. bounded and (therefore) converge to x∗ a.s.

2.4. Further extensions. Although satisfied in all of the applications treated in section 3, in some other models assumption (A1) that hr → h∞ pointwise may be violated. If this convergence does not hold, then we may abandon the fluid model and replace (A1) by

(A1′) The function h is Lipschitz, and there exist T > 0, R > 0 such that

|φ̂(t)| ≤ 1/2,    t ≥ T,

for any solution to (1.4) with r ≥ R and with initial condition satisfying |φ̂(0)| = 1.

Under the Lipschitz condition on h, at worst we may find that the pointwise limits of {hr : r ≥ 1} will form a family Λ of Lipschitz functions on Rd.
That is, h∞ ∈ Λ if and only if there exists a sequence {ri} ↑ ∞ such that hri(x) → h∞(x), i → ∞, where the convergence is uniform for x in compact subsets of Rd. Under (A1′) we then find, using the same arguments as in the proof of Lemma 4.1, that the family Λ is uniformly stable.

Lemma 2.6. Under (A1′) the family of ODEs defined via Λ is uniformly exponentially asymptotically stable in the following sense. For some b < ∞, δ > 0, and any solution φ∞ to the ODE (1.5) with h∞ ∈ Λ,

|φ∞(t)| ≤ b e^{−δt} |φ∞(0)|,    t ≥ 0.

Using this lemma the development of section 4 goes through with virtually no changes, and hence Theorems 2.1–2.5 are valid with (A1) replaced by (A1′).

Another extension is to broaden the class of scalings. Consider a nonlinear scaling defined by a function g : R+ → R+ satisfying g(r)/r → ∞ as r → ∞, and suppose that hr(·) redefined as hr(x) = h(rx)/g(r) satisfies hr(x) → h∞(x) uniformly on compacts as r → ∞. Then, assuming that the a.s. boundedness of rescaled iterates can be separately established, a completely analogous development of the stochastic algorithm is possible. An example would be a “stochastic gradient” scheme, where h(·) is the gradient of an even degree polynomial, with degree, say, 2n. Then g(r) = r^{2n−1} will do. We do not pursue this further because the reinforcement learning algorithms we consider below do conform to the case g(r) = r.

3. Reinforcement learning. As both an illustration of the theory and an important application in its own right, in this section we analyze reinforcement learning algorithms for Markov decision processes. The reader is referred to [4] for a general background of the subject and to other references listed below for further details.

3.1. Markov decision processes. We consider a Markov decision process Φ = {Φ(t) : t ∈ Z} taking values in a finite state space S = {1, 2, . . . , s} and controlled by a control sequence Z = {Z(t) : t ∈ Z} taking values in a finite action space A = {a0, . . . , ar}. We assume that the control sequence is admissible in the sense that Z(n) ∈ σ{Φ(t) : t ≤ n} for each n. We are most interested in stationary policies of the form Z(t) = w(Φ(t)), where the feedback law w is a function w : S → A. The controlled transition probabilities are given by p(i, j, a) for i, j ∈ S, a ∈ A.

Let c : S × A → R be the one-step cost function, and consider first the infinite horizon discounted cost control problem of minimizing over all admissible Z the total discounted cost

J(i, Z) = E[ Σ_{t=0}^{∞} β^t c(Φ(t), Z(t)) | Φ(0) = i ],

where β ∈ (0, 1) is the discount factor. The minimal value function is defined as V(i) = min J(i, Z), where the minimum is over all admissible control sequences Z. The function V satisfies the dynamic programming equation

V(i) = min_a [ c(i, a) + β Σ_j p(i, j, a)V(j) ],    i ∈ S,

and the optimal control minimizing J is given as the stationary policy defined through the feedback law w∗ given as any solution to

w∗(i) := arg min_a [ c(i, a) + β Σ_j p(i, j, a)V(j) ],    i ∈ S.

The value iteration algorithm is an iterative procedure to compute the minimal value function. Given an initial function V0 : S → R+ one obtains a sequence of functions {Vn} through the recursion

(3.1)    Vn+1(i) = min_a [ c(i, a) + β Σ_j p(i, j, a)Vn(j) ],    i ∈ S, n ≥ 0.

This recursion is convergent for any initialization V0 ≥ 0.
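As a worked illustration of the value iteration recursion (3.1) (a sketch under assumed data, not an implementation from the paper), the following fragment computes the minimal value function and an optimal feedback law for a small randomly generated model; the transition probabilities p(i, j, a), costs c(i, a), and discount factor β are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, beta = 5, 3, 0.9

# Random controlled transition probabilities p(i, j, a) and one-step costs c(i, a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((n_states, n_actions))

# Value iteration (3.1): V_{n+1}(i) = min_a [ c(i,a) + beta * sum_j p(i,j,a) V_n(j) ].
V = np.zeros(n_states)
for _ in range(500):
    Q = c + beta * np.einsum('ias,s->ia', P, V)   # Q(i,a) = c(i,a) + beta*sum_j p(i,j,a)V(j)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

w_star = Q.argmin(axis=1)        # an optimal stationary feedback law
print("V* approx:", np.round(V, 4))
print("optimal actions:", w_star)
```

The inner minimand c(i, a) + β Σ_j p(i, j, a)V(j) is exactly the quantity whose fixed point defines the Q-values introduced next.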
If we define Q-values via X p(i, j, a)V (j), i ∈ S, a ∈ A, Q(i, a) = c(i, a) + β j then V (i) = mina Q(i, a) and the matrix Q satisfies X Q(i, a) = c(i, a) + β p(i, j, a) min Q(j, b), b j i ∈ S, a ∈ A. The matrix Q can also be computed using the equivalent formulation of value iteration, X (3.2) Qn+1 (i, a) = c(i, a) + β p(i, j, a) min Qn (j, b), i ∈ S, a ∈ A, n ≥ 0, j b where Q0 ≥ 0 is arbitrary. The value iteration algorithm is initialized with a function V0 : S → R+ . In contrast, the policy iteration algorithm is initialized with a feedback law w0 and generates a sequence of feedback laws {wn : n ≥ 0}. At the nth stage of the algorithm a feedback law wn is given and the value function for the resulting control sequence Z n = {wn (Φ(0)), wn (Φ(1)), wn (Φ(2)), . . . } is computed to give ¡ Jn (i) = J i, Z n , i ∈ S. Interpreted as a column vector in Rs , the vector Jn satisfies the equation ¡ I − βPn Jn = cn , (3.3) where the s × s matrix Pn is defined by Pn (i, j) = p(i, j, wn (i)), i, j ∈ S, and the column vector cn is given by cn (i) = c(i, wn (i)), i ∈ S. Equation (3.3) can be solved for fixed n by the “fixed-policy” version of value iteration given by (3.4) Jn (i + 1) = βPn Jn (i) + cn , i ≥ 0, where Jn (0) ∈ Rs is given as an initial condition. Then Jn (i) → Jn , the solution to (3.3), at a geometric rate as i → ∞. Given Jn , the next feedback law wn+1 is then computed via X n+1 (3.5) (i) = arg min c(i, a) + β p(i, j, a)Jn (j) , i ∈ S. w a j Each step of the policy iteration algorithm is computationally intensive for large state spaces since the computation of Jn requires the inversion of the s × s matrix I − βPn . In the average cost optimization problem one seeks to minimize over all admissible Z , (3.6) lim sup n→∞ n−1 1X ¡ E c Φ(t), Z(t) . n t=0 STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 457 The policy iteration and value iteration algorithms to solve this optimization problem remain unchanged with three exceptions. One is that the constant β must be set equal to unity in (3.1) and (3.5). Second, in the policy iteration algorithm the value function Jn is replaced by a solution Jn to Poisson’s equation X ¡ ¡ p i, j, wn (i) Jn (j) = Jn (i) − c i, wn (i) + ηn , i ∈ S, where ηn is the steady state cost under the policy wn . The computation of Jn and ηn again involves matrix inversions via ¡ ¡ πn I − Pn + ee0 = e0 , ηn = πn cn , I − Pn + ee0 Jn = cn , where e ∈ Rs is the column vector consisting of all ones and the row vector πn is the invariant probability for Pn . The introduction of the outer product ensures that the matrix (I − Pn + ee0 ) is invertible, provided that the invariant probability πn is unique. Lastly, the value iteration algorithm is replaced by the “relative value iteration,” where a common scalar offset is subtracted from all components of the iterates at each iteration (likewise for the Q-value iteration). The choice of this offset term is not unique. We shall be considering one particular choice, though others can be handled similarly (see [1]). 3.2. Q-learning. If the matrix Q defined in (3.2) can be computed via value iteration or some other scheme, then the optimal control is found through a simple minimization. If transition probabilities are unknown so that value iteration is not directly applicable, one may apply a stochastic approximation variant known as the Q-learning algorithm of Watkins [1, 20, 21]. 
This is defined through the recursion

Qn+1(i, a) = Qn(i, a) + a(n)[β min_b Qn(Ψn+1(i, a), b) + c(i, a) − Qn(i, a)],    i ∈ S, a ∈ A,

where Ψn+1(i, a) is an independently simulated S-valued random variable with law p(i, ·, a). Making the appropriate correspondences with our set up, we have X(n) = Qn and h(Q) = [hia(Q)]i,a with

hia(Q) = β Σ_j p(i, j, a) min_b Q(j, b) + c(i, a) − Q(i, a),    i ∈ S, a ∈ A.

The martingale is given by M(n + 1) = [Mia(n + 1)]i,a with

Mia(n + 1) = β[ min_b Qn(Ψn+1(i, a), b) − Σ_j p(i, j, a) min_b Qn(j, b) ],    i ∈ S, a ∈ A.

Define F(Q) = [Fia(Q)]i,a by

Fia(Q) = β Σ_j p(i, j, a) min_b Q(j, b) + c(i, a).

Then h(Q) = F(Q) − Q and the associated ODE is

(3.7)    Q̇ = F(Q) − Q := h(Q).

The map F : Rs×(r+1) → Rs×(r+1) is a contraction with respect to the max norm ‖·‖∞. The global asymptotic stability of its unique equilibrium point is a special case of the results of [8]. This h(·) fits the framework of our analysis, with the (i, a)th component of h∞(Q) given by

β Σ_j p(i, j, a) min_b Q(j, b) − Q(i, a),    i ∈ S, a ∈ A.

This also is of the form h∞(Q) = F∞(Q) − Q where F∞(·) is a ‖·‖∞-contraction, and thus the asymptotic stability of the unique equilibrium point of the corresponding ODE is guaranteed (see [8]). We conclude that assumptions (A1) and (A2) hold, and hence Theorems 2.1–2.4 also hold for the Q-learning model.

3.3. Adaptive critic algorithm. Next we shall consider the adaptive critic algorithm, which may be considered as the reinforcement learning analog of policy iteration (see [2, 13] for a discussion). There are several variants of this, one of which, taken from [13], is as follows. For i ∈ S, we define

(3.8)    Vn+1(i) = Vn(i) + b(n)[c(i, ψn(i)) + βVn(Ψn(i, ψn(i))) − Vn(i)],

from which the policies are updated according to

(3.9)    ŵn+1(i) = Γ( ŵn(i) + a(n) Σ_{ℓ=1}^{r} { [c(i, a0) + βVn(ηn(i, a0))] − [c(i, aℓ) + βVn(ηn(i, aℓ))] } eℓ ).

Here {Vn} are s-vectors and for each i, {ŵn(i)} are r-vectors lying in the simplex {x ∈ Rr | x = [x1, . . . , xr], xi ≥ 0, Σ_i xi ≤ 1}. Γ(·) is the projection onto this simplex. The sequences {a(n)}, {b(n)} satisfy

Σ_n a(n) = Σ_n b(n) = ∞,    Σ_n (a(n)² + b(n)²) < ∞,    a(n) = o(b(n)).

The rest of the notation is as follows. For 1 ≤ ℓ ≤ r, eℓ is the unit r-vector in the ℓth coordinate direction. For each i, n, wn(i) = wn(i, ·) is a probability vector on A defined by the following. For ŵn(i) = [ŵn(i, 1), . . . , ŵn(i, r)],

wn(i, aℓ) = ŵn(i, ℓ) for ℓ ≠ 0,    wn(i, a0) = 1 − Σ_{j≠0} ŵn(i, j) for ℓ = 0.

Given wn(i), ψn(i) is an A-valued random variable independently simulated with law wn(i). Likewise, Ψn(i, ψn(i)) are S-valued random variables which are independently simulated (given ψn(i)) with law p(i, ·, ψn(i)) and {ηn(i, aℓ)} are S-valued random variables independently simulated with law p(i, ·, aℓ), respectively.

To see why this is based on policy iteration, recall that policy iteration alternates between two steps. One step solves the linear system of (3.3) to compute the fixed-policy value function corresponding to the current policy. We have seen that solving (3.3) can be accomplished by performing the fixed-policy version of value iteration given in (3.4). The first step (3.8) in the above iteration is indeed the “learning” or “simulation-based stochastic approximation” analog of this fixed-policy value iteration.
The second step in policy iteration updates the current policy by performing an appropriate minimization. The second iteration (3.9) is a particular search algorithm for computing this minimum over the simplex of probability measures on A. This search algorithm is by no means unique; the paper [13] gives two alternative schemes. However, the first iteration (3.8) is common to all. The different choices of stepsize schedules for the two iterations (3.8) and (3.9) induce the “two time-scale” effect discussed in [5]. The first iteration sees the policy computed by the second as nearly static, thus justifying viewing it as a fixed-policy iteration. In turn, the second sees the first as almost equilibrated, justifying the search scheme for minimization over A. See [13] for details.

The boundedness of {ŵn} is guaranteed by the projection Γ(·). For {Vn}, the fact that a(n) = o(b(n)) allows one to treat ŵn(i) as constant, say, w(i); see, e.g., [13]. The appropriate ODE then turns out to be

(3.10)    v̇ = G(v) − v := h(v),

where G : Rs → Rs is defined by

Gi(x) = Σ_ℓ w(i, aℓ)[ β Σ_j p(i, j, aℓ)xj + c(i, aℓ) ],    i ∈ S

(so that hi(x) = Gi(x) − xi). Once again, G(·) is a ‖·‖∞-contraction and it follows from the results of [8] that (3.10) is globally asymptotically stable. The limiting function h∞(x) is again of the form h∞(x) = G∞(x) − x with G∞(x) defined so that its ith component is

Σ_ℓ w(i, aℓ) β Σ_j p(i, j, aℓ)xj.

We see that G∞ is also a ‖·‖∞-contraction and the global asymptotic stability of the origin for the corresponding limiting ODE follows as before from the results of [8].

3.4. Average cost optimal control. For the average cost control problem, we impose the additional restriction that the chain Φ has a unique invariant probability measure under any stationary policy so that the steady state cost (3.6) is independent of the initial condition. For the average cost optimal control problem, the Q-learning algorithm is given by the recursion

Qn+1(i, a) = Qn(i, a) + a(n)[ min_b Qn(Ψn(i, a), b) + c(i, a) − Qn(i, a) − Qn(i0, a0) ],

where i0 ∈ S, a0 ∈ A are fixed a priori. The appropriate ODE now is (3.7) with F(·) redefined as Fia(Q) = Σ_j p(i, j, a) min_b Q(j, b) + c(i, a) − Q(i0, a0). The global asymptotic stability for the unique equilibrium point for this ODE has been established in [1]. Once again this fits our framework with h∞(x) = F∞(x) − x for F∞ defined the same way as F, except for the terms c(·, ·) which are dropped. We conclude that (A1) and (A2) are satisfied for this version of the Q-learning algorithm. Another variant of Q-learning for average cost, based on a “stochastic shortest path” formulation, is presented in [1]. This also can be handled similarly.

In [13], three variants of the adaptive critic algorithm for the average cost problem are discussed, differing only in the {ŵn} iteration. The iteration for {Vn} is common to all and is given by

Vn+1(i) = Vn(i) + b(n)[ c(i, ψn(i)) + Vn(Ψn(i, ψn(i))) − Vn(i) − Vn(i0) ],    i ∈ S,

where i0 ∈ S is a fixed state prescribed beforehand. This leads to the ODE (3.10) with G redefined as

Gi(x) = Σ_ℓ w(i, aℓ)[ Σ_j p(i, j, aℓ)xj + c(i, aℓ) ] − xi0,    i ∈ S.

The global asymptotic stability of the unique equilibrium point of this ODE has been established in [7]. Once more, this fits our framework with h∞(x) = G∞(x) − x for G∞ defined just like G, but without the c(·, ·) terms. Asynchronous versions of all the above can be written down along the lines of (2.8).
Then by Theorem 2.5, they have bounded iterates a.s. The important point to note here is that to date, a.s. boundedness for Q-learning and adaptive critic is proved by other methods for centralized algorithms [1, 12, 20]. For asynchronous algorithms, it is proved for discounted cost only [1, 13, 20] or by introducing a projection to enforce stability [14]. 4. Derivations. Here we provide proofs for the main results given in section 2. Throughout this section we assume that (A1) and (A2) hold. 4.1. Stability. The functions {hr , r ≥ 1} and the limiting function h∞ are Lipschitz with the same Lipschitz constant as h under (A1). It follows from Ascoli’s theorem that the convergence hr → h∞ is uniform on compact subsets of Rd . This observation is the basis of the following lemma. Lemma 4.1. Under (A1), the ODE (1.5) is globally exponentially asymptotically stable. Proof. The function h∞ satisfies h∞ (cx) = ch∞ (x), c > 0, x ∈ Rd . Hence the origin θ ∈ Rd is an equilibrium for (1.5), i.e., h∞ (θ) = θ. Let B() be the closed ball of radius centered at θ with chosen so that x(t) → θ as t → ∞ uniformly for initial conditions in B(). Thus there exists a T > 0 such that kx(T )k ≤ /2 whenever kx(0)k ≤ . For an arbitrary solution x( · ) of (1.5), y( · ) = x( · )/kx(0)k is another, with ky(0)k = . Hence ky(T )k < /2, implying kx(T )k ≤ 12 kx(0)k. The global exponential asymptotic stability follows. With the scaling parameter r given by r(j) = max(1, kX(m(j))k), j ≥ 0, we define three piecewise continuous functions from R+ to Rd as in the introduction: (a) {φ(t) : t ≥ 0} is an interpolated version of X defined as follows. For each j ≥ 0, define a function φj on the interval [T (j), T (j + 1)] by ¡ φj t(n) = X(n)/r(j), m(j) ≤ n ≤ m(j + 1), with φj ( · ) defined by linear interpolation on the remainder of [T (j), T (j + 1)] to form a piecewise linear function. We then define φ to be the piecewise continuous function t ∈ T (j), T (j + 1) , j ≥ 0. φ(t) = φj (t), STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 461 b (b) {φ(t) : t ≥ 0} is continuous on each interval [T (j), T (j + 1)), and on this interval it is the solution to the ODE ¡ ẋ(t) = hr(j) x(t) , (4.1) b (j)) = φ(T (j)), j ≥ 0. with initial condition φ(T (c) {φ∞ (t) : t ≥ 0} is also continuous on each interval [T (j), T (j + 1)), and on this interval it is the solution to the “fluid model” (1.5) with the same initial condition ¡ ¡ ¡ φ∞ T (j) = φb T (j) = φ T (j) j ≥ 0. b · ) and φ∞ ( · ) is crucial in deriving useful approximations. Boundedness of φ( Lemma 4.2. Under (A1) and (A2) and either (TS) or (BS), there exists C̄ < ∞ such that for any initial condition X(0) ∈ Rd b ≤ C̄ φ(t) φ∞ (t) ≤ C̄, and t ≥ 0. Proof. To establish the first bound use the Lipschitz continuity of h to obtain the bound ° ° ¡ ¡° d° b °2 = 2φ(t) b T hr(j) φ(t) b b °2 + 1 , °φ(t) ≤ C °φ(t) dt T (j) ≤ t < T (j + 1), where C is a deterministic constant, independent of j. The claim follows with C̄ = b (j))k ≤ 1. The proof of the second bound is therefore 2 exp((T + 1)C) since kφ(T identical. The following version of the Bellman Gronwall lemma will be used repeatedly. Lemma 4.3. (i) Suppose {α(n)}, {A(n)} are nonnegative sequences and β > 0 such that A(n + 1) ≤ β + n X α(k)A(k), n ≥ 0. k=0 Then for all n ≥ 1, Ã A(n + 1) ≤ exp ! n X α(k) ¡ α(0)A(0) + β . k=1 (ii) Suppose {α(n)}, {A(n)}, {γ(n)} are nonnegative sequences such that ¡ A(n + 1) ≤ 1 + α(n) A(n) + γ(n), n ≥ 0. Then for all n ≥ 1, Ã A(n + 1) ≤ exp Pn n X ! α(k) ¡¡ 1 + α(0) A(0) + β(n) , k=1 where β(n) = 0 γ(k). Proof. 
Define {R(n)} inductively by R(0) = A(0) and R(n + 1) = β + n X k=0 α(k)R(k), n ≥ 0. 462 V. S. BORKAR AND S. P. MEYN A simple induction shows that A(n) ≤ R(n), n ≥ 0. An alternative expression for R(n) is ! Ã n Y ¡ (1 + α(k) α(0)A(0) + β . R(n) = k=1 The inequality (i) then follows from the bound 1 + x ≤ ex . To see (ii) fix n ≥ 0 and observe that on summing both sides of the bound A(k + 1) − A(k) ≤ α(k)A(k) + γ(k) over 0 ≤ k ≤ ` we obtain for all 0 ≤ ` < n, A(` + 1) ≤ A(0) + β(n) + X̀ α(k)A(k). k=0 The result then follows from (i). b · ), and φ∞ ( · ). The following lemmas relate the three functions φ( · ), φ( Lemma 4.4. Suppose that (A1) and (A2) hold. Given any > 0, there exist T, R < ∞ such that for any r > R and any solution to the ODE (1.4) satisfying kx(0)k ≤ 1, we have kx(t)k ≤ for t ∈ [T, T + 1]. Proof. By global asymptotic stability of (1.5) we can find T > 0 such that kφ∞ (t)k ≤ /2, t ≥ T , for solutions φ∞ ( · ) of (1.5) satisfying kφ∞ (0)k ≤ 1. b − φ∞ (t)| ≤ /2 whenever φb is a solution With T fixed, choose R so large that |φ(t) ∞ b b to (1.4) satisfying φ(0) = φ (0); |φ(0)| ≤ 1; and r ≥ R. This is possible since, as we have already observed, hr → h∞ as r → ∞ uniformly on compact sets. The claim then follows from the triangle inequality. Define the following: For j ≥ 0, m(j) ≤ n < m(j + 1), e X(n) := X(n)/r(j), f(n + 1) := M (n + 1)/r(j), M and for n ≥ 1, ξ(n) := n−1 X f(m + 1). a(m)M m=0 Lemma 4.5. Under (A1), (A2), and either (TS) or (BS), for each initial condition X(0) ∈ Rd satisfying E[kX(0)k2 ] < ∞, we have the following: 2 e (i) sup n≥0 E[kX(n)k ] < ∞. (ii) sup j≥0 E[kX(m(j + 1))/r(j)k2 ] < ∞. (iii) sup j≥0,T (j)≤t≤T (j+1) E[kφ(t)k2 ] < ∞. (iv) Under (TS) the sequence {ξ(n), Fn } is a square integrable martingale with sup E[kξ(n)k2 ] < ∞. n≥0 Proof. To prove (i) note first that under (A2) and the Lipschitz condition on h there exists C < ∞ such that for all n ≥ 1, ¡ (4.2) E kX(n)k2 | Fn−1 ≤ 1 + Ca(n − 1) kX(n − 1)k2 + Ca(n − 1), n ≥ 0. STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 463 It then follows that for any j ≥ 0 and any m(j) < n < m(j + 1), ° ° ° ¡ ° e °2 | Fn−1 ≤ 1 + Ca(n − 1) °X(n e − 1)°2 + Ca(n − 1), E °X(n) so that by Lemma 4.3 (ii), for all such n, °2 ° ¡ ¡ ° ° e e + 1)°2 ≤ exp C(T + 1) 2E °X(m(j)) ° + C(T + 1) E °X(n ¡ ¡ ≤ exp C(T + 1) 2 + C(T + 1) . Claim (i) follows, and claim (ii) follows similarly. We then obtain claim (iii) from the f(n)k2 ] < ∞. Using this definition of φ( · ). From (i), (ii), and (A2), we have supn E[kM and the square summability of {a(n)} assumed in (TS), the bound (iv) immediately follows. Lemma 4.6. Suppose E[kX(0)k2 ] < ∞. Under (A1), (A2), and (TS), with probability one, b (i) kφ(t) − φ(t)k → 0 as t → ∞, (ii) sup t≥0 kφ(t)k < ∞. b · ) as follows: For m(j) ≤ n < m(j + 1), Proof. Express φ( b b (j)) + φ(t(n + 1)−) = φ(T Z n X i=m(j) b (j)) + 1 (j) + = φ(T (4.3) t(i+1) t(i) n X ¡ b hr(j) φ(s) ds ¡ b a(i)hr(j) φ(t(i)) , i=m(j) Pm(j+1) where 1 (j) = O( i=m(j) a(i)2 ) → 0 as j → ∞. The “−” covers the case where t(n + 1) = t(m(j + 1)) = T (j + 1). We also have by definition (4.4) n X ¡ ¡ ¡ ¡ f(i + 1) . a(i) hr(j) φ t(i) + M φ t(n + 1) − = φ T (j) + i=m(j) b For m(j) ≤ n ≤ m(j + 1), let ε(n) = kφ(t(n)−) − φ(t(n)−)k. Combining (4.3), (4.4), and the Lipschitz continuity of h, we have n X ¡ a(i)ε(i), ε(n + 1) ≤ ε m(j) + 1 (j) + kξ(n + 1) − ξ(m(j))k + C i=m(j) where C < ∞ is a suitable constant. 
Since ε(m(j)) = 0, we can use Lemma 4.3 (i) to obtain ¡ ¡ m(j) ≤ n ≤ m(j + 1), ε(n) ≤ exp C(T + 1) 1 (j) + 2 (j) , where 2 (j) = maxm(j)≤n≤m(j+1) kξ(n + 1) − ξ(m(j))k. By (iv) of Lemma 4.5 and the martingale convergence theorem [18, p. 62], {ξ(n)} converges a.s.; thus 2 (j) → 0 a.s., as j → ∞. Since 1 (j) → 0 as well, ° ° b °φ(t(n)−) − φ(t(n)−) °= sup ε(n) → 0 sup m(j)≤n≤m(j+1) m(j)≤n≤m(j+1) 464 V. S. BORKAR AND S. P. MEYN as j → ∞, which implies the first claim. Result (ii) then follows from Lemma 4.2 and the triangle inequality. Lemma 4.7. Under (A1), (A2), and (BS), there exists a constant C2 < ∞ such that for all j ≥ 0, 2 b | Fn(j) ] ≤ C2 α, (i) sup j≥0,T (j)≤t≤T (j+1) E[kφ(t) − φ(t)k 2 (ii) sup j≥0,T (j)≤t≤T (j+1) E[kφ(t)k | Fn(j) ] ≤ C2 . Proof. Mimic the proof of Lemma 4.6 to obtain ε(n + 1) ≤ n X Ca(i)ε(i) + 0 (j), m(j) ≤ n < m(j + 1), i=m(j) 2 b | Fm(j) ]1/2 for m(j) ≤ n ≤ m(j + 1), and the where ε(n) = E[kφ(t(n)−) − φ(t(n)−)k error term has the upper bound |0 (j)| = O(α), where the bound is deterministic. By Lemma 4.3 (i) we obtain the bound, ¡ m(j) ≤ n ≤ m(j + 1), ε(n) ≤ exp C(T + 1) 0 (j), which proves (i). We, therefore, obtain (ii) using Lemma 4.2, (i), and the triangle inequality. Proof of Theorem 2.1. (i) By a simple conditioning argument, we may take X(0) to be deterministic without any loss of generality. In particular, E[kX(0)k2 ] < ∞ trivially. By Lemma 4.6 (ii), it now suffices to prove that supn kX(m(n))k < ∞ a.s. Fix a sample point outside the zero probability set where Lemma 4.6 fails. Pick T > 0 as above and R > 0 such that for every solution x( · ) of the ODE (1.4) with kx(0)k ≤ 1 and r ≥ R, we have kx(t)k ≤ 14 for t ∈ [T, T + 1]. This is possible by Lemma 4.4. Hence by Lemma 4.6 (i) we can find an j0 ≥ 1 such that whenever j ≥ j0 and kX(m(j))k ≥ R, (4.5) ¡ 1 kX(m(j + 1))k = φ T (j + 1) − ≤ . kX(m(j))k 2 This implies that {X(m(j)) : j ≥ 0} is a.s. bounded, and the claim follows. (ii) For m(j) < n ≤ m(j + 1), (4.6) ° °2 ° ¡ °2 1/2 1/2 ¡° ° °X(m(j))° ∨ 1 = E °φ t(n) − ° | Fm(j) E °X(n)° | Fm(j) 1/2 ¡° ¡ ° ° ¡ ¡ °2 °X m(j) ° ∨ 1 ≤ E °φ t(n) − − φb t(n) − ° | Fm(j) °2 ¡ ° b ° | Fm(j) 1/2 kX(m(j) k ∨ 1 . + E °φ(t(n)−) Let 0 < η < 12 , and let α∗ = η/(2C2 ), for C2 as in Lemma 4.7. We then obtain for α ≤ α∗ , °2 ° ¡° ¡ ° 1/2 E °X(n)° | Fm(j) ≤ (η/2) °X m(j) ° ∨ 1 1/2 ¡° ¡ ° ° ¡ °2 °X m(j) ° ∨ 1 . (4.7) + E °φb t(n) − ° | Fm(j) STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 465 Choose R, T > 0 such that for any solution x( · ) of the ODE (1.4), kx(t)k < η/2 for t ∈ [T, T + 1], whenever kx(0)k < 1 and r ≥ R. When kX(m(j))k ≥ R, we then obtain ° ¡ ° ¡ °2 ° 1/2 E °X m(j + 1) ° | Fm(j) (4.8) ≤ η °X m(j) °, while by Lemma 4.7 (ii) there exists a constant C such that the left-hand side (l.h.s.) of the inequality above is bounded by C a.s. when kX(m(j))k ≤ R. Thus, ° ¡ °2 ° ¡ °2 E °X m(j + 1) ° ≤ 2η 2 E °X m(j) ° + 2C 2 . This establishes boundedness of E[kX(m(j + 1))k2 ], and the proof then follows from (4.7) and Lemma 4.2. 4.2. Convergence for (BS). Lemma 4.8. Suppose that (A1), (A2), and (BS) hold and that α ≤ α∗ . Then for some constant C3 < ∞, ° ° b − ψ(t)°2 ≤ C3 α. sup E °ψ(t) t≥0 Proof. By (A2) and Theorem 2.1 (ii), °2 °2 ° ° sup E °X(n)° < ∞, sup E °M (n)° < ∞. n n The claim then follows from familiar arguments using the Bellman Gronwall lemma exactly as in the proof of Lemma 4.6. Proof of Theorem 2.3. (i) We apply Theorem 2.1 which allows us to choose an R > 0 such that ¡ sup P kX(n)k > R < α. 
n Let B(c) denote the ball centered at x∗ of radius c > 0 and let 0 < µ < /2 be such that if a solution x( · ) of (1.2) satisfies x(0) ∈ B(µ), then x(t) ∈ B(/2) for t ≥ 0. Pick T > 0 such that if a solution x( · ) of (1.2) satisfies kx(0)k ≤ R, then x(t) ∈ B(µ/2) for t ∈ [T, T + 1]. Then for all j ≥ 0, ¡ ° ¡ ¡ ¡ ° P e m(j + 1) ≥ µ = P e m(j + 1) ≥ µ, °X m(j) ° > R) ¡ ¡ + P e m(j + 1 ≥ µ, kX(m(j))k ≤ R ¡ ¡ ≤ α + P ψ T (j + 1) 6∈ B(µ), ψb T (j + 1) ∈ B(µ/2) ¡ ¡ ≤ α + P kψ T (j + 1) − ψb T (j + 1) k > µ/2 ≤ O(α) by Lemma 4.8. Then for m(j) ≤ n < m(j + 1), ¡ ¡ ¡ P e(n) ≥ = P e(n) ≥ , e m(j) ≥ µ ¡ ¡ + P e(n) ≥ , e m(j) ≤ µ ¡ ¡ ¡ ≤ O(α) + P ψ(t(n) 6∈ B(), ψb t(n) ∈ B /2) ¡ ¡ ¡ ≤ O(α) + P kψ t(n) − ψb t(n) k > /2 ≤ O(α). 466 V. S. BORKAR AND S. P. MEYN Since the bound on the r.h.s. is uniform in n, the claim follows. (ii) We first establish the bound with n = m(j + 1), j → ∞. We have for any j, ° ¡ ¡ °2 1/2 ¡ 2 1/2 ≤ E °ψ T (j + 1) − − ψb T (j + 1) − ° E e m(j + 1) °2 1/2 ° ¡ (4.9) + E °ψb T (j + 1) − − x∗ ° . By exponential stability there exist C < ∞, δ > 0 such that for all j ≥ 0, ° ¡ ° ° ¡ ° ¡ °ψb T (j + 1) − − x∗ ° ≤ C exp − δ T (j + 1) − T (j) °ψb T (j) − x∗ ° ° ¡ ° ≤ C exp(−δT )°ψb T (j) − x∗ °. Choose T so large that C exp(−δT ) ≤ 1 2 so that °2 1/2 °2 1/2 ° ¡ 1 ° ¡ ≤ E °ψb T (j) − x∗ ° E °ψb T (j + 1) − − x∗ ° 2 2 1/2 1 ° ¡ ¡ °2 1/2 1 ¡ (4.10) ≤ E e m(j) + E °ψ T (j) − ψb T (j) ° . 2 2 Combining (4.9) and (4.10) with Lemma 4.8 gives p 2 1/2 ¡ 2 1/2 1 ¡ ≤ E e m(j) + 2 C3 α, E e m(j + 1) 2 which shows that ¡ 2 lim sup E e m(j) ≤ 16C3 α. j→∞ The result follows from this and Lemma 4.7 (ii). Proof of Theorem 2.5. The details of the proof, though pedestrian in the light of the foregoing and [6], are quite lengthy, not to mention the considerable overhead of additional notation, and are therefore omitted. We briefly sketch below a single point of departure in the proof. b · ) on the interval [T (j), T (j+ In Lemma 4.6 we compare two functions φ( · ) and φ( e 1)]. The former in turn involved the iterates X(n) for m(j) ≤ n < m(j + 1) or, equivalently, X(n) for m(j) ≤ n < m(j + 1). Here X(n + 1) was computed in terms of X(n) and the “noise” M (n + 1). In the asynchronous case, however, the evaluation of Xj (n + 1) can involve Xj (n) for n − τ ≤ m ≤ n, j 6= i. Therefore the argument leading to Lemma 4.6 calls for a slight modification. While computing X(n), m(j) ≤ ei (m) = Xi (m)/r(j). n < m(j + 1), we plug into the iteration as and when required X Note, however, that if the same Xi (m) also features in the computation of Xk (l) for ei (m) should be redefined there as m(q) ≤ ` < m(q + 1), say, with q 6= j, then X e Xi (m)/r(q). Thus the definition of Xi (m) now becomes context-dependent. With this minor change, the proofs of [6] can be easily combined with the arguments used in the proofs of Theorems 2.1 and 2.2 to draw the desired conclusions. 4.3. The Markov model. The bounds that we obtain for the Markov model (2.4) are based upon the theory of ψ-irreducible Markov chains. A subset S ⊂ Rd is called petite if there exists a probability measure ν on Rd and δ > 0 such that the resolvent kernel K satisfies K(x, A) := ∞ X k=0 2−k−1 P k (x, A) ≥ δν(A), x ∈ S, STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 467 for any measurable A ⊂ Rd . Under assumptions (2.6) and (2.7) we show below that every compact subset of Rd is petite, so that Φ is a ψ-irreducible T -chain. We refer the reader to [16] for further terminology and notation. Lemma 4.9. 
Suppose that (A1), (A2), (2.6), and (2.7) hold and that α ≤ α∗ . Then all compact subsets of Rd are petite for the Markov chain X, and hence the chain is ψ-irreducible. Proof. The conclusions of the theorem will be satisfied if we can find a function s which is bounded from below on compact sets and a probability ν such that the resolvent kernel K satisfies the bound K(x, A) ≥ s(x)ν(A) for every x ∈ Rd and any measurable subset A ⊂ Rd . This bound is written succinctly as K ≥ s ⊗ ν. The first step of the proof is to apply the implicit function theorem together with (2.6) and (2.7) to obtain a bound of the form P d (x, A) = P(X(d) ∈ A | X(0) = x) ≥ ν(A), x ∈ O, where O is an open set containing x∗ , > 0, and ν is the uniform distribution on O. The set O can be chosen independent of α, but the constant may depend on α. For details on this construction, see Chapter 7 of [16]. To complete the proof it is enough to show that K(x, O) > 0. To see this, suppose that α ≤ α∗ and that W (n) = w∗ for all n. Then the foregoing stability analysis shows that X(n) ∈ O for all n sufficiently large. Since w∗ is in the support of the marginal distribution of {W (n)}, it then follows that K(x, O) > 0. From these two bounds, we then have Z K(x, A) ≥ 2−d K(x, dy)P d (y, A) ≥ 2−d K(x, O)ν(A). This is of the form K ≥ s ⊗ ν with s lower semicontinuous and positive everywhere. The function s is therefore bounded from below on compact sets, which proves the claim. The previous lemma together with Theorem 2.1 allows us to establish a strong form of ergodicity for the model. Lemma 4.10. Suppose that (A1), (A2), (2.6), and (2.7) hold and that α ≤ α∗ . (i) There exists a function Vα : Rd → [1, ∞) and constants b, L < ∞ and 0 > 0 independent of α such that P Vα (x) ≤ exp(−0 α)Vα (x) + bIC (x), where C = {x : kxk ≤ L}. While the function Vα will depend upon α, it is uniformly bounded as follows, γ −1 (kxk2 + 1) ≤ Vα (x) ≤ γ(kxk2 + 1), where γ ≥ 1 does not depend upon α. (ii) The chain is V -uniformly ergodic, with V (x) = kxk2 + 1. Proof. Using (4.8) we may construct T and L independent of α ≤ α∗ such that °2 ° E °X(k0 )° + 1 | X(0) = x ≤ (1/2)(kxk2 + 1), kxk ≥ L, 468 V. S. BORKAR AND S. P. MEYN where k0 = [T /α] + 1. We now set Vα (x) = α kX 0 −1 h° i °2 E °X(k)° + 1 | X(0) = x 2k/k0 . k=0 From the previous bound, it follows directly that the desired drift inequality holds with 0 = log(2)/T . Lipschitz continuity of the model gives the bounds on Vα . This proves (i). The V -uniform ergodicity then follows from Lemma 4.9 and Theorem 16.0.1 of [16]. We note that for small α and large x, the Lyapunov function Vα approximates V∞ plus a constant, where Z T ¡ kx(s)k2 + 1 2s/T ds; x(0) = x, V∞ (x) = 0 and x( · ) is a solution to (1.5). If this ODE is asymptotically stable then the function V∞ is in fact a Lyapunov function for (1.5), provided T > 0 is chosen sufficiently large. In [17] a bound is obtained on the rate of convergence ρ given in (2.5) for a chain satisfying the drift condition P Vα (x) ≤ λV (x) + bIC (x). The bound depends on the “petiteness” of the set C and the constants b < ∞ and λ < 1. The bound on ρ obtained in [17] also tends to unity with vanishing α since in the preceding lemma we have λ = exp(−0 α) → 1 as α → 0. From the structure of the algorithm this is not surprising, but this underlines the fact that care must be taken in the choice of the stepsize α. REFERENCES [1] J. Abounadi, D. Bertsekas, and V. S. Borkar, Learning algorithms for Markov decision processes with average cost, SIAM J. 
Control Optim., submitted. [2] A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuron-like elements that can solve difficult learning control problems, IEEE Trans. Systems, Man and Cybernetics, 13 (1983), pp. 835–846. [3] A. Benveniste, M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, Berlin, 1990. [4] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996. [5] V. S. Borkar, Stochastic approximation with two time scales, Systems Control Lett., 29 (1997), pp. 291–294. [6] V. S. Borkar, Asynchronous stochastic approximation, SIAM J. Control Optim., 36 (1998), pp. 840–851. [7] V. S. Borkar, Recursive self-tuning control of finite Markov chains, Appl. Math., 24 (1996), pp. 169–188. [8] V. S. Borkar and K. Soumyanath, An analog scheme for fixed-point computation, part I: Theory, IEEE Trans. Circuits Systems I Fund. Theory Appl., 44 (1997), pp. 351–355. [9] J. G. Dai, On positive Harris recurrence for multiclass queueing networks: A unified approach via fluid limit models, Ann. Appl. Probab., 5 (1995), pp. 49–77. [10] J. G. Dai and S. P. Meyn, Stability and convergence of moments for multiclass queueing networks via fluid limit models, IEEE Trans. Automat. Control, 40 (1995), pp. 1889–1904. [11] M. W. Hirsch, Convergent activation dynamics in continuous time networks, Neural Networks, 2 (1989), pp. 331–349. STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING 469 [12] T. Jaakola, M. I. Jordan, and S. P. Singh, On the convergence of stochastic iterative dynamic programming algorithms, Neural Computation, 6 (1994), pp. 1185–1201. [13] V. R. Konda and V. S. Borkar, Actor-critic–type learning algorithms for Markov decision processes, SIAM J. Control Optim., 38 (1999), pp. 94–123. [14] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York, 1997. [15] V. A. Malyshev and M. V. Men’sikov, Ergodicity, continuity and analyticity of countable Markov chains, Trans. Moscow Math. Soc., 1 (1982), pp. 1–48. [16] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag, London, 1993. [17] S. P. Meyn and R. L. Tweedie, Computable bounds for geometric convergence rates of Markov chains, Ann. Appl. Probab., 4 (1994), pp. 981–1011. [18] J. Neveu, Discrete Parameter Martingales, North Holland, Amsterdam, 1975. [19] T. Sargent, Bounded Rationality in Macroeconomics, Clarendon Press, Oxford, 1993. [20] J. Tsitsiklis, Asynchronous stochastic approximation and q-learning, Mach. Learning, 16 (1994), pp. 195–202. [21] C. J. C. H. Watkins and P. Dayan, Q-learning, Mach. Learning, 8 (1992), pp. 279–292.