SIAM J. CONTROL OPTIM.
Vol. 38, No. 2, pp. 447–469
© 2000 Society for Industrial and Applied Mathematics
THE O.D.E. METHOD FOR CONVERGENCE OF STOCHASTIC
APPROXIMATION AND REINFORCEMENT LEARNING∗
V. S. BORKAR† AND S. P. MEYN‡
Abstract. It is shown here that stability of the stochastic approximation algorithm is implied
by the asymptotic stability of the origin for an associated ODE. This in turn implies convergence of
the algorithm. Several specific classes of algorithms are considered as applications. It is found that
the results provide (i) a simpler derivation of known results for reinforcement learning algorithms;
(ii) a proof for the first time that a class of asynchronous stochastic approximation algorithms is
convergent without using any a priori assumption of stability; (iii) a proof for the first time that
asynchronous adaptive critic and Q-learning algorithms are convergent for the average cost optimal
control problem.
Key words. stochastic approximation, ODE method, stability, asynchronous algorithms, reinforcement learning
AMS subject classifications. 62L20, 93E25, 93E15
PII. S0363012997331639
1. Introduction. The stochastic approximation algorithm considered in this
paper is described by the d-dimensional recursion
(1.1)    X(n+1) = X(n) + a(n)[h(X(n)) + M(n+1)],    n ≥ 0,

where X(n) = [X_1(n), . . . , X_d(n)]ᵀ ∈ R^d, h : R^d → R^d, and {a(n)} is a sequence of
positive numbers. The sequence {M(n) : n ≥ 0} is uncorrelated with zero mean.
Though more than four decades old, the stochastic approximation algorithm is
now of renewed interest due to novel applications to reinforcement learning [20] and as
a model of learning by boundedly rational economic agents [19]. Traditional convergence analysis usually shows that the recursion (1.1) will have the desired asymptotic
behavior provided that the iterates remain bounded with probability one, or that
they visit a prescribed bounded set infinitely often with probability one [3, 14]. Under such stability or recurrence conditions one can then approximate the sequence
X = {X(n) : n ≥ 0} with the solution to the ordinary differential equation (ODE)
(1.2)    ẋ(t) = h(x(t))
with identical initial conditions x(0) = X(0).
∗ Received by the editors December 17, 1997; accepted for publication (in revised form) February 22, 1999; published electronically January 11, 2000.
http://www.siam.org/journals/sicon/38-2/33163.html
† School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai 400005, India (borkar@tifr.res.in). The research of this author was supported in part by the Department of Science and Technology (Government of India) grant III5(12)/96-ET.
‡ Department of Electrical and Computer Engineering and the Coordinated Sciences Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 (s-meyn@uiuc.edu). The research of this author was supported in part by NSF grant ECS 940372 and JSEP grant N00014-90-J-1270. This research was completed while this author was a visiting scientist at the Indian Institute of Science under a Fulbright Research Fellowship.

The recurrence assumption is crucial, and in many practical cases this becomes
a bottleneck in applying the ODE method. The most successful technique for establishing stochastic stability is the stochastic Lyapunov function approach (see, e.g.,
[14]). One also has techniques based upon the contractive properties or homogeneity
properties of the functions involved (see, e.g., [20] and [12], respectively).
The main contribution of this paper is to add to this collection another general
technique for proving stability of the stochastic approximation method. This technique is inspired by the fluid model approach to stability of networks developed in
[9, 10], which is itself based upon the multistep drift criterion of [15, 16]. The idea
is that the usual stochastic Lyapunov function approach can be difficult to apply due
to the fact that time-averaging of the noise may be necessary before a given positive-valued function of the state process will decrease towards zero. In general such time-averaging of the noise will require infeasible calculation. In many models, however, it
is possible to combine time-averaging with a limiting operation on the magnitude of
the initial state, to replace the stochastic system of interest with a simpler deterministic process.
The scaling applied in this paper to approximate the model (1.1) with a deterministic process is similar to the construction of the fluid model of [9, 10]. Suppose that
the state is scaled by its initial value to give X̃(n) = X(n)/max(|X(0)|, 1), n ≥ 0.
We then scale time to obtain a continuous function φ : R+ → R^d which interpolates
the values of {X̃(n)}. At a sequence of times {t(j) : j ≥ 0} we set φ(t(j)) = X̃(j),
and for arbitrary t ≥ 0, we extend the definition by linear interpolation. The times
{t(j) : j ≥ 0} are defined in terms of the constants {a(j)} used in (1.1). For any
r > 0, the scaled function h_r : R^d → R^d is given by

(1.3)    h_r(x) = h(rx)/r,    x ∈ R^d.

Then through elementary arguments we find that the stochastic process φ approximates the solution φ̂ to the associated ODE

(1.4)    ẋ(t) = h_r(x(t)),    t ≥ 0,

with φ̂(0) = φ(0) and r = max(|X(0)|, 1).
With our attention on stability considerations, we are most interested in the
behavior of X when the magnitude of the initial condition |X(0)| is large. Assuming
that the limiting function h∞ = limr→∞ hr exists, for large initial conditions we find
that φ is approximated by the solution φ∞ of the limiting ODE
(1.5)    ẋ(t) = h∞(x(t)),

where again we take identical initial conditions φ∞(0) = φ(0).
Thus, for large initial conditions all three processes are approximately equal,

    φ ≈ φ̂ ≈ φ∞.
Using these observations we find in Theorem 2.1 that the stochastic model (1.1) is
stable in a strong sense provided the origin is asymptotically stable for the limiting
ODE (1.5). Equation (1.5) is precisely the fluid model of [9, 10].
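To make the scaling concrete, the following minimal numerical sketch in Python is included as an illustration only: the particular map h below is a hypothetical choice (not from the paper), and the sketch simply forms h_r(x) = h(rx)/r as in (1.3), compares it with its limit h∞, and integrates the limiting ODE (1.5) by Euler steps to check that trajectories contract toward the origin.

```python
import numpy as np

# Illustrative assumption: h(x) = -x + tanh(x), so that
# h_r(x) = h(r x)/r -> h_inf(x) = -x as r -> infinity, and the
# limiting ODE xdot = h_inf(x) has the origin asymptotically stable.
def h(x):
    return -x + np.tanh(x)

def h_r(x, r):
    return h(r * x) / r            # the scaled function (1.3)

def h_inf(x):
    return -x                      # pointwise limit of h_r

def euler(f, x0, dt=1e-2, T=10.0):
    """Crude Euler integration of xdot = f(x) up to time T."""
    x = np.array(x0, dtype=float)
    for _ in range(int(T / dt)):
        x = x + dt * f(x)
    return x

if __name__ == "__main__":
    x0 = np.ones(3)
    for r in [1.0, 10.0, 1e3]:
        print(r, np.linalg.norm(euler(lambda x: h_r(x, r), x0)))
    print("limit ODE:", np.linalg.norm(euler(h_inf, x0)))   # all close to zero
```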
Thus, the major conclusion of this paper is that the ODE method can be extended to establish both the stability and convergence of the stochastic approximation
method, as opposed to only the latter. The result [14, Theorem 4.1, p. 115] arrives at
a similar conclusion: if the ODE (1.2) possesses a “global” Lyapunov function with
bounded partial derivatives, then this will serve as a stochastic Lyapunov function,
thereby establishing recurrence of the algorithm. Though similar in flavor, there are
significant differences between these results. First, in the present paper we consider a
scaled ODE, not the usual ODE (1.2). The former retains only terms with dominant
growth and is frequently simpler. Second, while it is possible that the stability of the
scaled ODE and the usual one go hand in hand, this does not imply that a Lyapunov
function for the latter is easily found. The reinforcement learning algorithms for
ergodic-cost optimal control and asynchronous algorithms, both considered as applications of the theory in this paper, are examples where the scaled ODE is conveniently
analyzed.
Though the assumptions made in this paper are explicitly motivated by applications to reinforcement learning algorithms for Markov decision processes, this approach is likely to find a broader range of applications.
The paper is organized as follows. The next section presents the main results for
the stochastic approximation algorithm with vanishing stepsize or with bounded, nonvanishing stepsize. Section 2 also gives a useful error bound for the constant stepsize
case and briefly sketches an extension to asynchronous algorithms, omitting details
that can be found in [6]. Section 3 gives examples of algorithms for reinforcement
learning of Markov decision processes to which this analysis is applicable. The proofs
of the main results are collected together in section 4.
2. Main results. Here we collect together the main general results concerning
the stochastic approximation algorithm. Proofs not included here may be found in
section 4.
We shall impose the following additional conditions on the functions {hr : r ≥ 1}
defined in (1.3) and the sequence M = {M (n) : n ≥ 1} used in (1.1). Some relaxations
of assumption (A1) are discussed in section 2.4.
(A1) The function h is Lipschitz, and there exists a function h∞ : R^d → R^d such that

    lim_{r→∞} h_r(x) = h∞(x),    x ∈ R^d.
Furthermore, the origin in Rd is an asymptotically stable equilibrium for the ODE
(1.5).
(A2) The sequence {M (n), Fn : n ≥ 1}, with Fn = σ(X(i), M (i), i ≤ n), is a
martingale difference sequence. Moreover, for some C0 < ∞ and any initial condition
X(0) ∈ Rd ,
    E[ ‖M(n+1)‖² | F_n ] ≤ C0 (1 + ‖X(n)‖²),    n ≥ 0.
The sequence {a(n)} is deterministic and is assumed to satisfy one of the following two assumptions. Here TS stands for “tapering stepsize” and BS for “bounded
stepsize.”
(TS) The sequence {a(n)} satisfies 0 < a(n) ≤ 1, n ≥ 0, and

    Σ_n a(n) = ∞,    Σ_n a(n)² < ∞.

(BS) The sequence {a(n)} satisfies, for some constants α̲ and ᾱ with 1 > ᾱ > α̲ > 0,

    α̲ ≤ a(n) ≤ ᾱ,    n ≥ 0.
2.1. Stability and convergence. The first result shows that the algorithm is
stabilizing for both bounded and tapering step sizes.
Theorem 2.1. Assume that (A1) and (A2) hold. Then we have the following:
(i) Under (TS), for any initial condition X(0) ∈ R^d,

    sup_n ‖X(n)‖ < ∞    almost surely (a.s.).

(ii) Under (BS) there exist α∗ > 0 and C1 < ∞ such that for all 0 < α < α∗ and X(0) ∈ R^d,

    lim sup_{n→∞} E[‖X(n)‖²] ≤ C1.
An immediate corollary to Theorem 2.1 is convergence of the algorithm under
(TS). The proof is a standard application of the Hirsch lemma (see [11, Theorem 1,
p. 339] or [3, 14]), but we give the details below for the sake of completeness.
Theorem 2.2. Suppose that (A1), (A2), and (TS) hold and that the ODE (1.2)
has a unique globally asymptotically stable equilibrium x∗ . Then X(n) → x∗ a.s. as
n → ∞ for any initial condition X(0) ∈ Rd .
Proof. We may suppose that X(0) is deterministic without any loss of generality
so that the conclusion of Theorem 2.1 (i) holds: the sample paths of X are
bounded with probability one. Fixing such a sample path, we see that X remains in
a bounded set H, which may be chosen so that x∗ ∈ int(H).
The proof depends on an approximation of X with the solution to the primary
ODE (1.2). To perform this approximation, first define t(n) ↑ ∞, T(n) ↑ ∞ as follows: Set t(0) = T(0) = 0 and, for n ≥ 1, t(n) = Σ_{i=0}^{n−1} a(i). Fix T > 0 and define inductively

    T(n+1) = min{ t(j) : t(j) > T(n) + T },    n ≥ 0.

Thus T(n) = t(m(n)) for some m(n) ↑ ∞ and T ≤ T(n+1) − T(n) ≤ T + 1 for n ≥ 0.
We then define two functions from R+ to Rd :
(a) {ψ(t), t > 0} is defined by ψ(t(n)) = X(n) with linear interpolation on
[t(n), t(n + 1)] for each n ≥ 0.
(b) {ψ̂(t), t > 0} is piecewise continuous, defined so that, for any j ≥ 0, ψ̂ is the solution to (1.2) for t ∈ [T(j), T(j+1)), with the initial condition ψ̂(T(j)) = ψ(T(j)).
Let ε > 0 and let B(ε) denote the open ball centered at x∗ of radius ε. We may then choose the following:
(i) 0 < δ < ε such that x(t) ∈ B(ε) for all t ≥ 0 whenever x(·) is a solution of (1.2) satisfying x(0) ∈ B(δ).
(ii) T > 0 so large that for any solution of (1.2) with x(0) ∈ H we have x(t) ∈ B(δ/2) for all t ≥ T. Hence, ψ̂(T(j)−) ∈ B(δ/2) for all j ≥ 1.
(iii) An application of the Bellman–Gronwall lemma as in Lemma 4.6 below leads to the limit

(2.1)    ‖ψ(t) − ψ̂(t)‖ → 0    a.s., t → ∞.

Hence we may choose j0 > 0 so that we have

    ‖ψ(T(j)−) − ψ̂(T(j)−)‖ ≤ δ/2,    j ≥ j0.
Since ψ(·) is continuous, we conclude from (ii) and (iii) that ψ(T(j)) ∈ B(δ) for j ≥ j0. Since ψ̂(T(j)) = ψ(T(j)), it then follows from (i) that ψ̂(t) ∈ B(ε) for all t ≥ T(j0). Hence by (2.1),

    lim sup_{t→∞} ‖ψ(t) − x∗‖ ≤ ε    a.s.

This completes the proof since ε > 0 was arbitrary.
We now consider (BS), focusing on the absolute error defined by
(2.2)    e(n) := ‖X(n) − x∗‖,    n ≥ 0.
Theorem 2.3. Assume that (A1), (A2), and (BS) hold, and suppose that (1.2)
has a globally asymptotically stable equilibrium point x∗ .
Then for any 0 < α ≤ α∗, where α∗ is introduced in Theorem 2.1 (ii),
(i) for any ε > 0, there exists b1 = b1(ε) < ∞ such that

    lim sup_{n→∞} P(e(n) ≥ ε) ≤ b1 α;

(ii) if x∗ is a globally exponentially asymptotically stable equilibrium for the ODE (1.2), then there exists b2 < ∞ such that for every initial condition X(0) ∈ R^d,

    lim sup_{n→∞} E[e(n)²] ≤ b2 α.
2.2. Rate of convergence. A uniform bound on the mean square error E[e(n)2 ]
for n ≥ 0 can be obtained under slightly stronger conditions on M via the theory of
ψ-irreducible Markov chains. We find that this error can be bounded from above by
a sum of two terms: the first converges to zero as α ↓ 0, while the second decays to
zero exponentially as n → ∞.
To illustrate the nature of these bounds, consider the linear recursion

    X(n+1) = X(n) + α[ −(X(n) − x∗) + W(n+1) ],    n ≥ 0,

where {W(n)} is independently and identically distributed (i.i.d.) with mean zero and variance σ². This is of the form (1.1) with h(x) = −(x − x∗) and M(n) = W(n). The error e(n+1) defined in (2.2) may be bounded as follows:

    E[e(n+1)²] ≤ α²σ² + (1 − α)² E[e(n)²]
              ≤ ασ²/(2 − α) + exp(−2αn) E[e(0)²],    n ≥ 0.
For a deterministic initial condition X(0) = x and any n ≥ 0, we thus arrive at the formal bound

(2.3)    E[e(n)² | X(0) = x] ≤ B1(α) + B2 (‖x‖² + 1) exp(−ε0(α) n),

where B1, B2, and ε0 are positive-valued functions of α. The bound (2.3) is of the form
that we seek: the first term on the right-hand side (r.h.s.) decays to zero with α, while
the second decays exponentially to zero with n. However, the rate of convergence for
the second term becomes vanishingly small as α ↓ 0. Hence to maintain a small
probability of error the variable α should be neither too small nor too large. This
recalls the well-known trade-off between mean and variance that must be made in the
application of stochastic approximation algorithms.
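This trade-off is easy to see numerically. The sketch below simulates the scalar linear example above under a few constant stepsizes; the values x∗ = 0, σ = 1, X(0) = 10, and the stepsizes themselves are arbitrary illustrative choices.

```python
import numpy as np

def mse_path(alpha, x0=10.0, x_star=0.0, sigma=1.0, n_steps=2000, n_paths=500, seed=0):
    """Simulate X(n+1) = X(n) + alpha*[-(X(n) - x*) + W(n+1)] and return E[e(n)^2] over n."""
    rng = np.random.default_rng(seed)
    X = np.full(n_paths, x0)
    mse = np.empty(n_steps)
    for n in range(n_steps):
        W = sigma * rng.standard_normal(n_paths)
        X = X + alpha * (-(X - x_star) + W)
        mse[n] = np.mean((X - x_star) ** 2)
    return mse

if __name__ == "__main__":
    for alpha in [0.5, 0.1, 0.01]:
        mse = mse_path(alpha)
        # transient decays roughly like exp(-2*alpha*n); the residual error is
        # roughly alpha*sigma^2/(2 - alpha), matching the bound in the text
        print(f"alpha={alpha:4.2f}  mse after 100 steps={mse[100]:8.4f}  final mse={mse[-1]:8.4f}")
```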
A bound of this form carries over to the nonlinear model under some additional
conditions. For convenience, we take a Markov model of the form
(2.4)    X(n+1) = X(n) + α[ h(X(n)) + m(X(n), W(n+1)) ],
where again {W (n)} is i.i.d. and also independent of the initial condition X(0). We
assume that the functions h : Rd → Rd and m : Rd × Rq → Rd are smooth (C 1 ) and
that assumptions (A1) and (A2) continue to hold. The recursion (2.4) then describes
a Feller–Markov chain with stationary transition kernel to be denoted by P .
Let V : Rd → [1, ∞) be given. The Markov chain X with transition function P
is called V -uniformly ergodic if there is a unique invariant probability π, an R < ∞,
and ρ < 1 such that for any function g satisfying |g(x)| ≤ V (x),
(2.5)    | E[g(X(n)) | X(0) = x] − E_π[g(X(n))] | ≤ R V(x) ρ^n,    x ∈ R^d, n ≥ 0,

where E_π[g(X(n))] = ∫ g(x) π(dx), n ≥ 0.
The following result establishes bounds of the form (2.3) using V -ergodicity of the
model. Assumptions (2.6) and (2.7) below are required to establish ψ-irreducibility
of the model in Lemma 4.10.
There exists a w∗ ∈ R^q with m(x∗, w∗) = 0, and for a continuous function p : R^q → [0, 1] with p(w∗) > 0,

(2.6)    P(W(1) ∈ A) ≥ ∫_A p(z) dz,    A ∈ B(R^q).
The pair of matrices (F, G) is controllable, with

(2.7)    F = (d/dx) h(x∗) + (∂/∂x) m(x∗, w∗)    and    G = (∂/∂w) m(x∗, w∗).
Theorem 2.4. Suppose that (A1), (A2), (2.6), and (2.7) hold for the Markov
model (2.4) with 0 < α ≤ α∗ . Then the Markov chain X is V -uniformly ergodic, with
V (x) = kxk2 + 1, and we have the following bounds:
(i) There exist positive-valued functions A1 and ε0 of α and a constant A2 independent of α such that

    P(e(n) ≥ ε | X(0) = x) ≤ A1(α) + A2 (‖x‖² + 1) exp(−ε0(α) n).

The functions satisfy A1(α) → 0 and ε0(α) → 0 as α ↓ 0.
(ii) If in addition the ODE (1.2) is exponentially asymptotically stable, then the stronger bound (2.3) holds, where again B1(α) → 0, ε0(α) → 0 as α ↓ 0, and B2 is independent of α.
Proof. The V -uniform ergodicity is established in Lemma 4.10.
From Theorem 2.3 (i) we have, when X(0) ∼ π,

    P_π(e(n) ≥ ε) = P_π(e(0) ≥ ε) ≤ b1 α,

and hence from V-uniform ergodicity,

    P(e(n) ≥ ε | X(0) = x) ≤ P_π(e(n) ≥ ε) + | P(e(n) ≥ ε | X(0) = x) − P_π(e(n) ≥ ε) |
                           ≤ b1 α + R V(x) ρ^n,    n ≥ 0.

This and the definition of V establish (i). The proof of (ii) is similar.
The fact that ρ = ρ_α → 1 as α ↓ 0 is discussed in section 4.3.
2.3. The asynchronous case. The conclusions above also extend to the model
of asynchronous stochastic approximation analyzed in [6]. We now assume that each
component of X(n) is updated by a separate processor. We postulate a set-valued
process {Y (n)} taking values in the set of subsets of {1, 2, . . . , d}, with the interpretation: Y (n) = {indices of the components updated at time n}. For n ≥ 0, 1 ≤ i ≤ d,
define
    ν(i, n) = Σ_{m=0}^{n} I{ i ∈ Y(m) },

the number of updates executed by the ith processor up to time n. A key assumption is that there exists a deterministic ∆ > 0 such that for all i,

    lim inf_{n→∞} ν(i, n)/n ≥ ∆    a.s.
This ensures that all components are updated comparably often. Furthermore, if

    N(n, x) = min{ m > n : Σ_{k=n+1}^{m} a(k) > x }

for x > 0, the limit

    lim_{n→∞} ( Σ_{k=ν(i,n)}^{ν(i,N(n,x))} a(k) ) / ( Σ_{k=ν(j,n)}^{ν(j,N(n,x))} a(k) )

exists a.s. for all i, j.
At time n, the kth processor has available the following data:
(i) Processor (k) is given ν(k, n), but it may not have n, the “global clock.”
(ii) There are interprocessor communication delays τkj (n), 1 ≤ k, j ≤ d, n ≥ 0, so
that at time n, processor (k) may use the data Xj (m) only for m ≤ n − τkj (n).
We assume that τkk (n) = 0 for all n and that {τkj (n)} have a common upper
bound τ < ∞ ([6] considers a slightly more general situation).
To relate the present work to [6], we recall that the "centralized" algorithm of [6] is

    X(n+1) = X(n) + a(n) f(X(n), W(n+1)),

where {W(n)} are i.i.d. and {f(·, y)} are uniformly Lipschitz. Thus F(x) := E[f(x, W(1))] is Lipschitz. The correspondence with the present set up is obtained by setting h(x) = F(x) and

    M(n+1) = f(X(n), W(n+1)) − F(X(n))

for n ≥ 0. The asynchronous version then is

(2.8)    X_i(n+1) = X_i(n) + a(ν(i, n)) f( X_1(n − τ_{i1}(n)), X_2(n − τ_{i2}(n)), . . . , X_d(n − τ_{id}(n)), W(n+1) ) I{ i ∈ Y(n) },    n ≥ 0,
for 1 ≤ i ≤ d. Note that this can be executed by the ith processor without any
knowledge of the global clock which, in fact, can be a complete artifice as long as
causal relationships are respected.
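A rough sketch of one step of (2.8) follows. It is not the construction analyzed in [6]: the map f, the delay bound, and the random update sets are placeholder assumptions. Each component keeps its own counter ν(i, n), uses the stepsize a(ν(i, n)), and reads the other components through bounded random delays.

```python
import numpy as np

def async_step(history, nu, n, f, a, tau_max, rng):
    """One step of the asynchronous recursion (2.8), as a sketch.

    history : list of iterates X(0), ..., X(n), each an array of length d
    nu      : nu[i] = number of updates of component i so far (the local clock)
    """
    d = history[-1].size
    Y = rng.choice(d, size=rng.integers(1, d + 1), replace=False)  # components updated at time n
    W = rng.standard_normal()                                      # noise sample W(n+1)
    X_new = history[n].copy()
    for i in Y:
        # coordinate j is read with its own bounded delay tau_ij(n); tau_ii(n) = 0
        delayed = np.array([history[n][j] if j == i
                            else history[max(0, n - rng.integers(0, tau_max + 1))][j]
                            for j in range(d)])
        X_new[i] = history[n][i] + a(nu[i]) * f(delayed, W)[i]
        nu[i] += 1
    history.append(X_new)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    f = lambda x, w: -(x - 1.0) + 0.1 * w     # illustrative drift toward the all-ones vector
    a = lambda k: 1.0 / (k + 1)               # tapering stepsize
    history, nu = [np.zeros(d)], np.zeros(d, dtype=int)
    for n in range(5000):
        async_step(history, nu, n, f, a, tau_max=3, rng=rng)
    print(history[-1])                        # should be close to the all-ones vector
```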
The analysis presented in [6] depends upon the following additional conditions on
{a(n)}:
(i) a(n + 1) ≤ a(n) eventually;
(ii) for x ∈ (0, 1), supn a([xn])/a(n) < ∞;
(iii) for x ∈ (0, 1),

    ( Σ_{i=0}^{[xn]} a(i) ) / ( Σ_{i=0}^{n} a(i) ) → 1,

where [ · ] stands for "the integer part of ( · )."
A fourth condition is imposed in [6], but this becomes irrelevant when the delays
are bounded. Examples of {a(n)} satisfying (i)–(iii) are a(n) = 1/(n + 1) or 1/(1 +
n log(n + 1)).
As a first simplifying step, it is observed in [6] that {Y (n)} may be assumed to
be singletons without any loss of generality. We shall do likewise. What this entails is
simply unfolding a single update at time n into |Y (n)| separate updates, each involving
a single component. This blows up the delays at most d-fold, which does not affect
the analysis in any way.
The main result of [6] is the analog of our Theorem 2.2 given that the conclusions
of our Theorem 2.1 hold. In other words, stability implies convergence. Under (A1)
and (A2), our arguments above can be easily adapted to show that the conclusions of
Theorem 2.2 also hold for the asynchronous case. One argues exactly as above and in
[6] to conclude that the suitably interpolated and rescaled trajectory of the algorithm
tracks an appropriate ODE. The only difference is a scalar factor 1/d multiplying
the r.h.s. of the ODE (i.e., ẋ(t) = (1/d)h(x(t))). This factor, which reflects the
asynchronous sampling, amounts to a time-scaling that does not affect the qualitative
behavior of the ODE.
Theorem 2.5. Under the conditions of Theorem 2.2 and the above hypotheses on {a(n)}, {Y(n)}, and {τ_{ij}(n)}, the asynchronous iterates given by (2.8) remain a.s. bounded and (therefore) converge to x∗ a.s.
2.4. Further extensions. Although satisfied in all of the applications treated
in section 3, in some other models assumption (A1) that hr → h∞ pointwise may be
violated. If this convergence does not hold, then we may abandon the fluid model
and replace (A1) by
(A1′) The function h is Lipschitz, and there exist T > 0, R > 0 such that

    ‖φ̂(t)‖ ≤ 1/2,    t ≥ T,

for any solution φ̂ to (1.4) with r ≥ R and with initial condition satisfying ‖φ̂(0)‖ = 1.
Under the Lipschitz condition on h, at worst we may find that the pointwise limits
of {hr : r ≥ 1} will form a family Λ of Lipschitz functions on Rd . That is, h∞ ∈ Λ if
and only if there exists a sequence {r_i} ↑ ∞ such that

    h_{r_i}(x) → h∞(x),    i → ∞,

where the convergence is uniform for x in compact subsets of R^d. Under (A1′) we then find, using the same arguments as in the proof of Lemma 4.1, that the family Λ
is uniformly stable.
Lemma 2.6. Under (A1′) the family of ODEs defined via Λ is uniformly exponentially asymptotically stable in the following sense. For some b < ∞, δ > 0, and any solution φ∞ to the ODE (1.5) with h∞ ∈ Λ,

    ‖φ∞(t)‖ ≤ b e^{−δt} ‖φ∞(0)‖,    t ≥ 0.

Using this lemma the development of section 4 goes through with virtually no changes, and hence Theorems 2.1–2.5 are valid with (A1) replaced by (A1′).
Another extension is to broaden the class of scalings. Consider a nonlinear scaling
defined by a function g : R+ → R+ satisfying g(r)/r → ∞ as r → ∞, and suppose
that hr ( · ) redefined as hr (x) = h(rx)/g(r) satisfies
hr (x) → h∞ (x) uniformly on compacts as r → ∞.
Then, assuming that the a.s. boundedness of rescaled iterates can be separately established, a completely analogous development of the stochastic algorithm is possible.
An example would be a "stochastic gradient" scheme, where h(·) is the gradient of an even-degree polynomial, with degree, say, 2n. Then g(r) = r^{2n−1} will do. We
do not pursue this further because the reinforcement learning algorithms we consider
below do conform to the case g(r) = r.
3. Reinforcement learning. As both an illustration of the theory and an important application in its own right, in this section we analyze reinforcement learning
algorithms for Markov decision processes. The reader is referred to [4] for a general
background of the subject and to other references listed below for further details.
3.1. Markov decision processes. We consider a Markov decision process Φ =
{Φ(t) : t ∈ Z} taking values in a finite state space S = {1, 2, . . . , s} and controlled
by a control sequence Z = {Z(t) : t ∈ Z} taking values in a finite action space
A = {a0 , . . . , ar }. We assume that the control sequence is admissible in the sense
that Z(n) ∈ σ{Φ(t) : t ≤ n} for each n. We are most interested in stationary policies
of the form Z(t) = w(Φ(t)), where the feedback law w is a function w : S → A. The
controlled transition probabilities are given by p(i, j, a) for i, j ∈ S, a ∈ A.
Let c : S × A → R be the one-step cost function, and consider first the infinite
horizon discounted cost control problem of minimizing over all admissible Z the total
discounted cost
#
"∞
X
¡
t
β c Φ(t), Z(t) | Φ(0) = i ,
J(i, Z ) = E
t=0
where β ∈ (0, 1) is the discount factor. The minimal value function is defined as
V (i) = min J(i, Z ),
where the minimum is over all admissible control sequences Z . The function V satisfies
the dynamic programming equation

    V(i) = min_a [ c(i, a) + β Σ_j p(i, j, a) V(j) ],    i ∈ S,
and the optimal control minimizing J is given as the stationary policy defined through
the feedback law w∗ given as any solution to
    w∗(i) := arg min_a [ c(i, a) + β Σ_j p(i, j, a) V(j) ],    i ∈ S.
The value iteration algorithm is an iterative procedure to compute the minimal
value function. Given an initial function V0 : S → R+ one obtains a sequence of
functions {Vn } through the recursion
(3.1)    V_{n+1}(i) = min_a [ c(i, a) + β Σ_j p(i, j, a) V_n(j) ],    i ∈ S, n ≥ 0.

This recursion is convergent for any initialization V_0 ≥ 0. If we define Q-values via

    Q(i, a) = c(i, a) + β Σ_j p(i, j, a) V(j),    i ∈ S, a ∈ A,

then V(i) = min_a Q(i, a) and the matrix Q satisfies

    Q(i, a) = c(i, a) + β Σ_j p(i, j, a) min_b Q(j, b),    i ∈ S, a ∈ A.

The matrix Q can also be computed using the equivalent formulation of value iteration,

(3.2)    Q_{n+1}(i, a) = c(i, a) + β Σ_j p(i, j, a) min_b Q_n(j, b),    i ∈ S, a ∈ A, n ≥ 0,
where Q0 ≥ 0 is arbitrary.
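For concreteness, the Q-value iteration (3.2) can be sketched as follows; the transition array p and cost matrix c are randomly generated placeholders, not data from the paper.

```python
import numpy as np

def q_value_iteration(p, c, beta, n_iter=500):
    """Q-value iteration (3.2): Q_{n+1}(i,a) = c(i,a) + beta * sum_j p(i,j,a) * min_b Q_n(j,b)."""
    S, _, A = p.shape
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        V = Q.min(axis=1)                               # V(j) = min_b Q(j, b)
        Q = c + beta * np.einsum('ija,j->ia', p, V)     # contraction with modulus beta
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, beta = 5, 3, 0.9
    p = rng.random((S, S, A))
    p /= p.sum(axis=1, keepdims=True)                   # normalize over the next state j
    c = rng.random((S, A))
    Q = q_value_iteration(p, c, beta)
    print("value function V :", Q.min(axis=1))
    print("greedy policy w* :", Q.argmin(axis=1))
```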
The value iteration algorithm is initialized with a function V0 : S → R+ . In
contrast, the policy iteration algorithm is initialized with a feedback law w0 and
generates a sequence of feedback laws {wn : n ≥ 0}. At the nth stage of the algorithm
a feedback law wn is given and the value function for the resulting control sequence
Z n = {wn (Φ(0)), wn (Φ(1)), wn (Φ(2)), . . . } is computed to give
    J_n(i) = J(i, Z^n),    i ∈ S.
Interpreted as a column vector in R^s, the vector J_n satisfies the equation

(3.3)    (I − βP_n) J_n = c_n,

where the s × s matrix P_n is defined by P_n(i, j) = p(i, j, w_n(i)), i, j ∈ S, and the column vector c_n is given by c_n(i) = c(i, w_n(i)), i ∈ S. Equation (3.3) can be solved for fixed n by the "fixed-policy" version of value iteration given by

(3.4)    J_n(i+1) = βP_n J_n(i) + c_n,    i ≥ 0,

where J_n(0) ∈ R^s is given as an initial condition. Then J_n(i) → J_n, the solution to (3.3), at a geometric rate as i → ∞.
Given Jn , the next feedback law wn+1 is then computed via
(3.5)    w^{n+1}(i) = arg min_a [ c(i, a) + β Σ_j p(i, j, a) J_n(j) ],    i ∈ S.
Each step of the policy iteration algorithm is computationally intensive for large state
spaces since the computation of Jn requires the inversion of the s × s matrix I − βPn .
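The policy-evaluation/policy-improvement loop just described can be sketched in the same way (same placeholder model as above); here the linear system (3.3) is solved directly rather than by the fixed-policy iteration (3.4).

```python
import numpy as np

def policy_iteration(p, c, beta, n_iter=50):
    """Policy iteration: solve (I - beta*P_n) J_n = c_n as in (3.3), then improve via (3.5)."""
    S, _, A = p.shape
    w = np.zeros(S, dtype=int)                          # initial feedback law w0
    J = np.zeros(S)
    for _ in range(n_iter):
        P_n = p[np.arange(S), :, w]                     # P_n(i, j) = p(i, j, w(i))
        c_n = c[np.arange(S), w]                        # c_n(i) = c(i, w(i))
        J = np.linalg.solve(np.eye(S) - beta * P_n, c_n)
        w_new = (c + beta * np.einsum('ija,j->ia', p, J)).argmin(axis=1)
        if np.array_equal(w_new, w):
            break                                       # policy is stable, hence optimal
        w = w_new
    return w, J

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, beta = 5, 3, 0.9
    p = rng.random((S, S, A))
    p /= p.sum(axis=1, keepdims=True)
    c = rng.random((S, A))
    w, J = policy_iteration(p, c, beta)
    print("policy:", w, " values:", J)
```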
In the average cost optimization problem one seeks to minimize over all
admissible Z ,
(3.6)    lim sup_{n→∞} (1/n) Σ_{t=0}^{n−1} E[c(Φ(t), Z(t))].
The policy iteration and value iteration algorithms to solve this optimization problem
remain unchanged with three exceptions. One is that the constant β must be set
equal to unity in (3.1) and (3.5). Second, in the policy iteration algorithm the value
function Jn is replaced by a solution Jn to Poisson’s equation
    Σ_j p(i, j, w_n(i)) J_n(j) = J_n(i) − c(i, w_n(i)) + η_n,    i ∈ S,
where ηn is the steady state cost under the policy wn . The computation of Jn and
η_n again involves matrix inversions via

    π_n (I − P_n + ee′) = e′,    η_n = π_n c_n,    (I − P_n + ee′) J_n = c_n,

where e ∈ R^s is the column vector consisting of all ones and the row vector π_n is the invariant probability for P_n. The introduction of the outer product ensures that the matrix (I − P_n + ee′) is invertible, provided that the invariant probability π_n is unique.
Lastly, the value iteration algorithm is replaced by the “relative value iteration,”
where a common scalar offset is subtracted from all components of the iterates at
each iteration (likewise for the Q-value iteration). The choice of this offset term is not
unique. We shall be considering one particular choice, though others can be handled
similarly (see [1]).
3.2. Q-learning. If the matrix Q defined in (3.2) can be computed via value
iteration or some other scheme, then the optimal control is found through a simple
minimization. If transition probabilities are unknown so that value iteration is not
directly applicable, one may apply a stochastic approximation variant known as the
Q-learning algorithm of Watkins [1, 20, 21]. This is defined through the recursion
    Q_{n+1}(i, a) = Q_n(i, a) + a(n)[ β min_b Q_n(Ψ_{n+1}(i, a), b) + c(i, a) − Q_n(i, a) ],

i ∈ S, a ∈ A, where Ψ_{n+1}(i, a) is an independently simulated S-valued random variable with law p(i, ·, a).
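A minimal synchronous implementation of this recursion is sketched below for illustration: every (i, a) pair is updated at each step, and the model, run length, and stepsize a(n) = 1/(n+1)^0.6 (which satisfies (TS)) are placeholder choices.

```python
import numpy as np

def q_learning(p, c, beta, a_n=lambda n: 1.0 / (n + 1) ** 0.6, n_steps=20000, seed=0):
    """Watkins' Q-learning: each (i, a) pair is updated with an independently
    simulated next state Psi ~ p(i, ., a), as in the recursion above."""
    rng = np.random.default_rng(seed)
    S, _, A = p.shape
    Q = np.zeros((S, A))
    for n in range(n_steps):
        for i in range(S):
            for a in range(A):
                psi = rng.choice(S, p=p[i, :, a])        # simulated next state Psi_{n+1}(i, a)
                Q[i, a] += a_n(n) * (beta * Q[psi].min() + c[i, a] - Q[i, a])
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, beta = 5, 3, 0.9
    p = rng.random((S, S, A))
    p /= p.sum(axis=1, keepdims=True)
    c = rng.random((S, A))
    print(q_learning(p, c, beta).min(axis=1))            # approximates the optimal value function
```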
Making the appropriate correspondences with our set up, we have X(n) = Q_n and h(Q) = [h_{ia}(Q)]_{i,a} with

    h_{ia}(Q) = β Σ_j p(i, j, a) min_b Q(j, b) + c(i, a) − Q(i, a),    i ∈ S, a ∈ A.

The martingale difference term is given by M(n+1) = [M_{ia}(n+1)]_{i,a} with

    M_{ia}(n+1) = β [ min_b Q_n(Ψ_{n+1}(i, a), b) − Σ_j p(i, j, a) min_b Q_n(j, b) ],    i ∈ S, a ∈ A.

Define F(Q) = [F_{ia}(Q)]_{i,a} by

    F_{ia}(Q) = β Σ_j p(i, j, a) min_b Q(j, b) + c(i, a).

Then h(Q) = F(Q) − Q and the associated ODE is

(3.7)    Q̇ = F(Q) − Q := h(Q).
The map F : R^{s×(r+1)} → R^{s×(r+1)} is a contraction with respect to the max norm ‖·‖∞. The global asymptotic stability of its unique equilibrium point is a special case of the results of [8]. This h(·) fits the framework of our analysis, with the (i, a)th component of h∞(Q) given by

    β Σ_j p(i, j, a) min_b Q(j, b) − Q(i, a),    i ∈ S, a ∈ A.

This also is of the form h∞(Q) = F∞(Q) − Q, where F∞(·) is a ‖·‖∞-contraction, and thus the asymptotic stability of the unique equilibrium point of the corresponding ODE is guaranteed (see [8]). We conclude that assumptions (A1) and (A2) hold, and hence Theorems 2.1–2.4 also hold for the Q-learning model.
3.3. Adaptive critic algorithm. Next we shall consider the adaptive critic
algorithm, which may be considered as the reinforcement learning analog of policy
iteration (see [2, 13] for a discussion). There are several variants of this, one of which,
taken from [13], is as follows. For i ∈ S, we define
(3.8)    V_{n+1}(i) = V_n(i) + b(n)[ c(i, ψ_n(i)) + β V_n(Ψ_n(i, ψ_n(i))) − V_n(i) ],
from which the policies are updated according to

(3.9)    ŵ_{n+1}(i) = Γ( ŵ_n(i) + a(n) Σ_{ℓ=1}^{r} { [c(i, a_0) + β V_n(η_n(i, a_0))] − [c(i, a_ℓ) + β V_n(η_n(i, a_ℓ))] } e_ℓ ).

Here {V_n} are s-vectors and, for each i, {ŵ_n(i)} are r-vectors lying in the simplex {x ∈ R^r | x = [x_1, . . . , x_r], x_i ≥ 0, Σ_i x_i ≤ 1}. Γ(·) is the projection onto this simplex. The sequences {a(n)}, {b(n)} satisfy

    Σ_n a(n) = Σ_n b(n) = ∞,    Σ_n (a(n)² + b(n)²) < ∞,    a(n) = o(b(n)).
The rest of the notation is as follows. For 1 ≤ ℓ ≤ r, e_ℓ is the unit r-vector in the ℓth coordinate direction. For each i, n, w_n(i) = w_n(i, ·) is a probability vector on A defined as follows: for ŵ_n(i) = [ŵ_n(i, 1), . . . , ŵ_n(i, r)],

    w_n(i, a_ℓ) = ŵ_n(i, ℓ)                  for ℓ ≠ 0,
    w_n(i, a_0) = 1 − Σ_{j≠0} ŵ_n(i, j)      for ℓ = 0.

Given w_n(i), ψ_n(i) is an A-valued random variable independently simulated with law w_n(i). Likewise, Ψ_n(i, ψ_n(i)) are S-valued random variables which are independently simulated (given ψ_n(i)) with law p(i, ·, ψ_n(i)), and {η_n(i, a_ℓ)} are S-valued random variables independently simulated with law p(i, ·, a_ℓ), respectively.
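The sketch below spells out one sweep of (3.8)–(3.9) over the state space. The model arrays and the projection routine are illustrative assumptions; in particular Γ is implemented here as the Euclidean projection onto the simplex {x ≥ 0, Σ x_i ≤ 1}, which is one possible choice. In a full run one would take a(n) = o(b(n)) so that the policy moves on the slower timescale.

```python
import numpy as np

def project(x):
    """Euclidean projection onto {x >= 0, sum(x) <= 1}: a possible choice for Gamma."""
    y = np.maximum(x, 0.0)
    if y.sum() <= 1.0:
        return y
    u = np.sort(x)[::-1]                                  # otherwise project onto sum(x) = 1
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, x.size + 1) > css - 1.0)[0][-1]
    return np.maximum(x - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def adaptive_critic_step(V, w_hat, p, c, beta, a_n, b_n, rng):
    """One sweep of the two-timescale iteration (3.8)-(3.9) over all states.

    V     : array of length S (value estimates)
    w_hat : array of shape (S, A-1), each row in the simplex above
    """
    S, _, A = p.shape
    for i in range(S):
        w = np.concatenate(([1.0 - w_hat[i].sum()], w_hat[i]))   # full randomized policy w(i, .)
        psi = rng.choice(A, p=w)                                  # psi_n(i) ~ w(i, .)
        nxt = rng.choice(S, p=p[i, :, psi])                       # Psi_n(i, psi_n(i))
        V[i] += b_n * (c[i, psi] + beta * V[nxt] - V[i])          # critic update (3.8)
        eta = [rng.choice(S, p=p[i, :, a]) for a in range(A)]     # eta_n(i, a) ~ p(i, ., a)
        q = np.array([c[i, a] + beta * V[eta[a]] for a in range(A)])
        w_hat[i] = project(w_hat[i] + a_n * (q[0] - q[1:]))       # actor update (3.9)
    return V, w_hat
```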
To see why this is based on policy iteration, recall that policy iteration alternates
between two steps. One step solves the linear system of (3.3) to compute the fixed-policy value function corresponding to the current policy. We have seen that solving
(3.3) can be accomplished by performing the fixed-policy version of value iteration
given in (3.4). The first step (3.8) in the above iteration is indeed the “learning” or
“simulation-based stochastic approximation” analog of this fixed-policy value iteration. The second step in policy iteration updates the current policy by performing an
appropriate minimization. The second iteration (3.9) is a particular search algorithm
for computing this minimum over the simplex of probability measures on A. This
search algorithm is by no means unique; the paper [13] gives two alternative schemes.
However, the first iteration (3.8) is common to all.
The different choices of stepsize schedules for the two iterations (3.8) and (3.9)
induce the "two time-scale" effect discussed in [5]. The first iteration sees the policy
computed by the second as nearly static, thus justifying viewing it as a fixed-policy
iteration. In turn, the second sees the first as almost equilibrated, justifying the search
scheme for minimization over A. See [13] for details.
The boundedness of {ŵ_n} is guaranteed by the projection Γ(·). For {V_n}, the fact that a(n) = o(b(n)) allows one to treat ŵ_n(i) as constant, say, w(i); see, e.g., [13].
The appropriate ODE then turns out to be

(3.10)    v̇ = G(v) − v := h(v),

where G : R^s → R^s is defined by

    G_i(x) = Σ_ℓ w(i, a_ℓ) [ β Σ_j p(i, j, a_ℓ) x_j + c(i, a_ℓ) ],    i ∈ S.
Once again, G(·) is a ‖·‖∞-contraction and it follows from the results of [8] that (3.10) is globally asymptotically stable. The limiting function h∞(x) is again of the form h∞(x) = G∞(x) − x, with G∞(x) defined so that its ith component is

    β Σ_ℓ w(i, a_ℓ) Σ_j p(i, j, a_ℓ) x_j,    i ∈ S.

We see that G∞ is also a ‖·‖∞-contraction and the global asymptotic stability of the origin for the corresponding limiting ODE follows as before from the results of [8].
3.4. Average cost optimal control. For the average cost control problem, we
impose the additional restriction that the chain Φ has a unique invariant probability
measure under any stationary policy so that the steady state cost (3.6) is independent
of the initial condition.
For the average cost optimal control problem, the Q-learning algorithm is given by the recursion

    Q_{n+1}(i, a) = Q_n(i, a) + a(n)[ min_b Q_n(Ψ_n(i, a), b) + c(i, a) − Q_n(i, a) − Q_n(i_0, a_0) ],

where i_0 ∈ S, a_0 ∈ A are fixed a priori. The appropriate ODE now is (3.7) with F(·) redefined as F_{ia}(Q) = Σ_j p(i, j, a) min_b Q(j, b) + c(i, a) − Q(i_0, a_0). The
global asymptotic stability for the unique equilibrium point for this ODE has been
established in [1]. Once again this fits our framework with h∞ (x) = F∞ (x) − x for
F∞ defined the same way as F , except for the terms c(·, ·) which are dropped. We
conclude that (A1) and (A2) are satisfied for this version of the Q-learning algorithm.
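Only a small change to the discounted Q-learning sketch above is needed for this variant; the reference pair (i0, a0) = (0, 0) is an arbitrary illustrative choice, and at convergence Q(i0, a0) estimates the optimal average cost.

```python
import numpy as np

def rvi_q_learning(p, c, a_n=lambda n: 1.0 / (n + 1) ** 0.6, n_steps=20000, i0=0, a0=0, seed=0):
    """Average-cost Q-learning: the discount factor is replaced by the offset Q_n(i0, a0)."""
    rng = np.random.default_rng(seed)
    S, _, A = p.shape
    Q = np.zeros((S, A))
    for n in range(n_steps):
        for i in range(S):
            for a in range(A):
                psi = rng.choice(S, p=p[i, :, a])   # simulated next state Psi_n(i, a)
                Q[i, a] += a_n(n) * (Q[psi].min() + c[i, a] - Q[i, a] - Q[i0, a0])
    return Q          # Q[i0, a0] then approximates the optimal average cost
```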
Another variant of Q-learning for average cost, based on a “stochastic shortest
path” formulation, is presented in [1]. This also can be handled similarly.
In [13], three variants of the adaptive critic algorithm for the average cost problem are discussed, differing only in the {ŵ_n} iteration. The iteration for {V_n} is common to all and is given by

    V_{n+1}(i) = V_n(i) + b(n)[ c(i, ψ_n(i)) + V_n(Ψ_n(i, ψ_n(i))) − V_n(i) − V_n(i_0) ],    i ∈ S,
where i_0 ∈ S is a fixed state prescribed beforehand. This leads to the ODE (3.10) with G redefined as

    G_i(x) = Σ_ℓ w(i, a_ℓ) [ Σ_j p(i, j, a_ℓ) x_j + c(i, a_ℓ) ] − x_{i_0},    i ∈ S.
The global asymptotic stability of the unique equilibrium point of this ODE has been
established in [7]. Once more, this fits our framework with h∞ (x) = G∞ (x) − x for
G∞ defined just like G, but without the c(·, ·) terms.
Asynchronous versions of all the above can be written down along the lines of
(2.8). Then by Theorem 2.5, they have bounded iterates a.s. The important point to
note here is that to date, a.s. boundedness for Q-learning and adaptive critic is proved
by other methods for centralized algorithms [1, 12, 20]. For asynchronous algorithms,
it is proved for discounted cost only [1, 13, 20] or by introducing a projection to
enforce stability [14].
4. Derivations. Here we provide proofs for the main results given in section 2.
Throughout this section we assume that (A1) and (A2) hold.
4.1. Stability. The functions {hr , r ≥ 1} and the limiting function h∞ are Lipschitz with the same Lipschitz constant as h under (A1). It follows from Ascoli’s
theorem that the convergence hr → h∞ is uniform on compact subsets of Rd . This
observation is the basis of the following lemma.
Lemma 4.1. Under (A1), the ODE (1.5) is globally exponentially asymptotically
stable.
Proof. The function h∞ satisfies

    h∞(cx) = c h∞(x),    c > 0, x ∈ R^d.

Hence the origin θ ∈ R^d is an equilibrium for (1.5), i.e., h∞(θ) = θ. Let B(ε) be the closed ball of radius ε centered at θ, with ε chosen so that x(t) → θ as t → ∞ uniformly for initial conditions in B(ε). Thus there exists a T > 0 such that ‖x(T)‖ ≤ ε/2 whenever ‖x(0)‖ ≤ ε. For an arbitrary solution x(·) of (1.5), y(·) = ε x(·)/‖x(0)‖ is another, with ‖y(0)‖ = ε. Hence ‖y(T)‖ ≤ ε/2, implying ‖x(T)‖ ≤ (1/2)‖x(0)‖. The global exponential asymptotic stability follows.
With the scaling parameter r given by r(j) = max(1, kX(m(j))k), j ≥ 0, we
define three piecewise continuous functions from R+ to Rd as in the introduction:
(a) {φ(t) : t ≥ 0} is an interpolated version of X defined as follows. For each
j ≥ 0, define a function φj on the interval [T (j), T (j + 1)] by
    φ_j(t(n)) = X(n)/r(j),    m(j) ≤ n ≤ m(j+1),

with φ_j(·) defined by linear interpolation on the remainder of [T(j), T(j+1)] to form a piecewise linear function. We then define φ to be the piecewise continuous function

    φ(t) = φ_j(t),    t ∈ [T(j), T(j+1)),    j ≥ 0.
(b) {φ̂(t) : t ≥ 0} is continuous on each interval [T(j), T(j+1)), and on this interval it is the solution to the ODE

(4.1)    ẋ(t) = h_{r(j)}(x(t)),

with initial condition φ̂(T(j)) = φ(T(j)), j ≥ 0.
(c) {φ∞(t) : t ≥ 0} is also continuous on each interval [T(j), T(j+1)), and on this interval it is the solution to the "fluid model" (1.5) with the same initial condition

    φ∞(T(j)) = φ̂(T(j)) = φ(T(j)),    j ≥ 0.
Boundedness of φ̂(·) and φ∞(·) is crucial in deriving useful approximations.
Lemma 4.2. Under (A1) and (A2) and either (TS) or (BS), there exists C̄ < ∞ such that for any initial condition X(0) ∈ R^d,

    ‖φ̂(t)‖ ≤ C̄  and  ‖φ∞(t)‖ ≤ C̄,    t ≥ 0.

Proof. To establish the first bound use the Lipschitz continuity of h to obtain the bound

    (d/dt)‖φ̂(t)‖² = 2 φ̂(t)ᵀ h_{r(j)}(φ̂(t)) ≤ C(‖φ̂(t)‖² + 1),    T(j) ≤ t < T(j+1),

where C is a deterministic constant, independent of j. The claim follows with C̄ = 2 exp((T+1)C) since ‖φ̂(T(j))‖ ≤ 1. The proof of the second bound is identical.
The following version of the Bellman–Gronwall lemma will be used repeatedly.
Lemma 4.3.
(i) Suppose {α(n)}, {A(n)} are nonnegative sequences and β > 0 such that

    A(n+1) ≤ β + Σ_{k=0}^{n} α(k) A(k),    n ≥ 0.

Then for all n ≥ 1,

    A(n+1) ≤ exp( Σ_{k=1}^{n} α(k) ) (α(0) A(0) + β).

(ii) Suppose {α(n)}, {A(n)}, {γ(n)} are nonnegative sequences such that

    A(n+1) ≤ (1 + α(n)) A(n) + γ(n),    n ≥ 0.

Then for all n ≥ 1,

    A(n+1) ≤ exp( Σ_{k=1}^{n} α(k) ) ((1 + α(0)) A(0) + β(n)),

where β(n) = Σ_{k=0}^{n} γ(k).
Proof. Define {R(n)} inductively by R(0) = A(0) and

    R(n+1) = β + Σ_{k=0}^{n} α(k) R(k),    n ≥ 0.
A simple induction shows that A(n) ≤ R(n), n ≥ 0. An alternative expression is

    R(n+1) = ( Π_{k=1}^{n} (1 + α(k)) ) (α(0) A(0) + β),    n ≥ 0.

The inequality (i) then follows from the bound 1 + x ≤ e^x.
To see (ii), fix n ≥ 0 and observe that on summing both sides of the bound

    A(k+1) − A(k) ≤ α(k) A(k) + γ(k)

over 0 ≤ k ≤ ℓ we obtain, for all 0 ≤ ℓ < n,

    A(ℓ+1) ≤ A(0) + β(n) + Σ_{k=0}^{ℓ} α(k) A(k).

The result then follows from (i).
The following lemmas relate the three functions φ(·), φ̂(·), and φ∞(·).
Lemma 4.4. Suppose that (A1) and (A2) hold. Given any ε > 0, there exist T, R < ∞ such that for any r > R and any solution to the ODE (1.4) satisfying ‖x(0)‖ ≤ 1, we have ‖x(t)‖ ≤ ε for t ∈ [T, T+1].
Proof. By global asymptotic stability of (1.5) we can find T > 0 such that ‖φ∞(t)‖ ≤ ε/2, t ≥ T, for solutions φ∞(·) of (1.5) satisfying ‖φ∞(0)‖ ≤ 1.
With T fixed, choose R so large that ‖φ̂(t) − φ∞(t)‖ ≤ ε/2 whenever φ̂ is a solution to (1.4) satisfying φ̂(0) = φ∞(0), ‖φ̂(0)‖ ≤ 1, and r ≥ R. This is possible since, as we have already observed, h_r → h∞ as r → ∞ uniformly on compact sets. The claim then follows from the triangle inequality.
Define the following: For j ≥ 0, m(j) ≤ n < m(j+1),

    X̃(n) := X(n)/r(j),    M̃(n+1) := M(n+1)/r(j),

and for n ≥ 1,

    ξ(n) := Σ_{m=0}^{n−1} a(m) M̃(m+1).

Lemma 4.5. Under (A1), (A2), and either (TS) or (BS), for each initial condition X(0) ∈ R^d satisfying E[‖X(0)‖²] < ∞, we have the following:
(i) sup_{n≥0} E[‖X̃(n)‖²] < ∞.
(ii) sup_{j≥0} E[‖X(m(j+1))/r(j)‖²] < ∞.
(iii) sup_{j≥0, T(j)≤t≤T(j+1)} E[‖φ(t)‖²] < ∞.
(iv) Under (TS) the sequence {ξ(n), F_n} is a square-integrable martingale with

    sup_{n≥0} E[‖ξ(n)‖²] < ∞.
n≥0
Proof. To prove (i) note first that under (A2) and the Lipschitz condition on h
there exists C < ∞ such that for all n ≥ 1,
¡
(4.2)
E kX(n)k2 | Fn−1 ≤ 1 + Ca(n − 1) kX(n − 1)k2 + Ca(n − 1), n ≥ 0.
STOCHASTIC APPROXIMATION AND REINFORCEMENT LEARNING
463
It then follows that for any j ≥ 0 and any m(j) < n < m(j + 1),
°
°
°
¡
°
e °2 | Fn−1 ≤ 1 + Ca(n − 1) °X(n
e − 1)°2 + Ca(n − 1),
E °X(n)
so that by Lemma 4.3 (ii), for all such n,
°2 ° ¡
¡ °
°
e
e + 1)°2 ≤ exp C(T + 1) 2E °X(m(j))
° + C(T + 1)
E °X(n
¡
¡
≤ exp C(T + 1) 2 + C(T + 1) .
Claim (i) follows, and claim (ii) follows similarly. We then obtain claim (iii) from the
f(n)k2 ] < ∞. Using this
definition of φ( · ). From (i), (ii), and (A2), we have supn E[kM
and the square summability of {a(n)} assumed in (TS), the bound (iv) immediately
follows.
Lemma 4.6. Suppose E[‖X(0)‖²] < ∞. Under (A1), (A2), and (TS), with probability one,
(i) ‖φ(t) − φ̂(t)‖ → 0 as t → ∞,
(ii) sup_{t≥0} ‖φ(t)‖ < ∞.
Proof. Express φ̂(·) as follows: For m(j) ≤ n < m(j+1),

(4.3)    φ̂(t(n+1)−) = φ̂(T(j)) + Σ_{i=m(j)}^{n} ∫_{t(i)}^{t(i+1)} h_{r(j)}(φ̂(s)) ds
                     = φ̂(T(j)) + ε₁(j) + Σ_{i=m(j)}^{n} a(i) h_{r(j)}(φ̂(t(i))),

where ε₁(j) = O( Σ_{i=m(j)}^{m(j+1)} a(i)² ) → 0 as j → ∞. The "−" covers the case where t(n+1) = t(m(j+1)) = T(j+1).
We also have by definition

(4.4)    φ(t(n+1)−) = φ(T(j)) + Σ_{i=m(j)}^{n} a(i)[ h_{r(j)}(φ(t(i))) + M̃(i+1) ].

For m(j) ≤ n ≤ m(j+1), let ε(n) = ‖φ(t(n)−) − φ̂(t(n)−)‖. Combining (4.3), (4.4), and the Lipschitz continuity of h, we have

    ε(n+1) ≤ ε(m(j)) + ε₁(j) + ‖ξ(n+1) − ξ(m(j))‖ + C Σ_{i=m(j)}^{n} a(i) ε(i),

where C < ∞ is a suitable constant. Since ε(m(j)) = 0, we can use Lemma 4.3 (i) to obtain

    ε(n) ≤ exp(C(T+1)) (ε₁(j) + ε₂(j)),    m(j) ≤ n ≤ m(j+1),

where ε₂(j) = max_{m(j)≤n≤m(j+1)} ‖ξ(n+1) − ξ(m(j))‖. By (iv) of Lemma 4.5 and the martingale convergence theorem [18, p. 62], {ξ(n)} converges a.s.; thus ε₂(j) → 0 a.s. as j → ∞. Since ε₁(j) → 0 as well,

    sup_{m(j)≤n≤m(j+1)} ‖φ(t(n)−) − φ̂(t(n)−)‖ = sup_{m(j)≤n≤m(j+1)} ε(n) → 0

as j → ∞, which implies the first claim.
Result (ii) then follows from Lemma 4.2 and the triangle inequality.
Lemma 4.7. Under (A1), (A2), and (BS), there exists a constant C2 < ∞ such that for all j ≥ 0,
(i) sup_{j≥0, T(j)≤t≤T(j+1)} E[‖φ(t) − φ̂(t)‖² | F_{m(j)}] ≤ C2 α,
(ii) sup_{j≥0, T(j)≤t≤T(j+1)} E[‖φ(t)‖² | F_{m(j)}] ≤ C2.
Proof. Mimic the proof of Lemma 4.6 to obtain

    ε(n+1) ≤ Σ_{i=m(j)}^{n} C a(i) ε(i) + ε₀(j),    m(j) ≤ n < m(j+1),

where ε(n) = E[‖φ(t(n)−) − φ̂(t(n)−)‖² | F_{m(j)}]^{1/2} for m(j) ≤ n ≤ m(j+1), and the error term has the upper bound

    |ε₀(j)| = O(α),

where the bound is deterministic. By Lemma 4.3 (i) we obtain the bound

    ε(n) ≤ exp(C(T+1)) ε₀(j),    m(j) ≤ n ≤ m(j+1),

which proves (i). We then obtain (ii) using Lemma 4.2, (i), and the triangle inequality.
Proof of Theorem 2.1. (i) By a simple conditioning argument, we may take X(0) to be deterministic without any loss of generality. In particular, E[‖X(0)‖²] < ∞ trivially. By Lemma 4.6 (ii), it now suffices to prove that sup_n ‖X(m(n))‖ < ∞ a.s. Fix a sample point outside the zero probability set where Lemma 4.6 fails. Pick T > 0 as above and R > 0 such that for every solution x(·) of the ODE (1.4) with ‖x(0)‖ ≤ 1 and r ≥ R, we have ‖x(t)‖ ≤ 1/4 for t ∈ [T, T+1]. This is possible by Lemma 4.4.
Hence by Lemma 4.6 (i) we can find a j0 ≥ 1 such that whenever j ≥ j0 and ‖X(m(j))‖ ≥ R,

(4.5)    ‖X(m(j+1))‖ / ‖X(m(j))‖ = ‖φ(T(j+1)−)‖ ≤ 1/2.

This implies that {X(m(j)) : j ≥ 0} is a.s. bounded, and the claim follows.
(ii) For m(j) < n ≤ m(j+1),

(4.6)    E[‖X(n)‖² | F_{m(j)}]^{1/2} = E[‖φ(t(n)−)‖² | F_{m(j)}]^{1/2} (‖X(m(j))‖ ∨ 1)
         ≤ E[‖φ(t(n)−) − φ̂(t(n)−)‖² | F_{m(j)}]^{1/2} (‖X(m(j))‖ ∨ 1)
           + E[‖φ̂(t(n)−)‖² | F_{m(j)}]^{1/2} (‖X(m(j))‖ ∨ 1).

Let 0 < η < 1/2, and let α∗ = η/(2C2), for C2 as in Lemma 4.7. We then obtain, for α ≤ α∗,

(4.7)    E[‖X(n)‖² | F_{m(j)}]^{1/2} ≤ (η/2)(‖X(m(j))‖ ∨ 1) + E[‖φ̂(t(n)−)‖² | F_{m(j)}]^{1/2} (‖X(m(j))‖ ∨ 1).

Choose R, T > 0 such that for any solution x(·) of the ODE (1.4), ‖x(t)‖ < η/2 for t ∈ [T, T+1], whenever ‖x(0)‖ < 1 and r ≥ R. When ‖X(m(j))‖ ≥ R, we then obtain

(4.8)    E[‖X(m(j+1))‖² | F_{m(j)}]^{1/2} ≤ η ‖X(m(j))‖,

while by Lemma 4.7 (ii) there exists a constant C such that the left-hand side (l.h.s.) of the inequality above is bounded by C a.s. when ‖X(m(j))‖ ≤ R. Thus,

    E[‖X(m(j+1))‖²] ≤ 2η² E[‖X(m(j))‖²] + 2C².

This establishes boundedness of E[‖X(m(j+1))‖²], and the proof then follows from (4.7) and Lemma 4.2.
4.2. Convergence for (BS). Lemma 4.8. Suppose that (A1), (A2), and (BS) hold and that α ≤ α∗. Then for some constant C3 < ∞,

    sup_{t≥0} E[‖ψ̂(t) − ψ(t)‖²] ≤ C3 α.

Proof. By (A2) and Theorem 2.1 (ii),

    sup_n E[‖X(n)‖²] < ∞,    sup_n E[‖M(n)‖²] < ∞.

The claim then follows from familiar arguments using the Bellman–Gronwall lemma, exactly as in the proof of Lemma 4.6.
Proof of Theorem 2.3. (i) We apply Theorem 2.1, which allows us to choose an R > 0 such that

    sup_n P(‖X(n)‖ > R) < α.

Let B(c) denote the ball centered at x∗ of radius c > 0 and let 0 < µ < ε/2 be such that if a solution x(·) of (1.2) satisfies x(0) ∈ B(µ), then x(t) ∈ B(ε/2) for t ≥ 0. Pick T > 0 such that if a solution x(·) of (1.2) satisfies ‖x(0)‖ ≤ R, then x(t) ∈ B(µ/2) for t ∈ [T, T+1]. Then for all j ≥ 0,

    P(e(m(j+1)) ≥ µ) = P(e(m(j+1)) ≥ µ, ‖X(m(j))‖ > R) + P(e(m(j+1)) ≥ µ, ‖X(m(j))‖ ≤ R)
                     ≤ α + P(ψ(T(j+1)) ∉ B(µ), ψ̂(T(j+1)) ∈ B(µ/2))
                     ≤ α + P(‖ψ(T(j+1)) − ψ̂(T(j+1))‖ > µ/2)
                     ≤ O(α)

by Lemma 4.8. Then for m(j) ≤ n < m(j+1),

    P(e(n) ≥ ε) = P(e(n) ≥ ε, e(m(j)) ≥ µ) + P(e(n) ≥ ε, e(m(j)) ≤ µ)
                ≤ O(α) + P(ψ(t(n)) ∉ B(ε), ψ̂(t(n)) ∈ B(ε/2))
                ≤ O(α) + P(‖ψ(t(n)) − ψ̂(t(n))‖ > ε/2)
                ≤ O(α).
Since the bound on the r.h.s. is uniform in n, the claim follows.
(ii) We first establish the bound with n = m(j+1), j → ∞. We have for any j,

(4.9)    E[e(m(j+1))²]^{1/2} ≤ E[‖ψ(T(j+1)−) − ψ̂(T(j+1)−)‖²]^{1/2} + E[‖ψ̂(T(j+1)−) − x∗‖²]^{1/2}.

By exponential stability there exist C < ∞, δ > 0 such that for all j ≥ 0,

    ‖ψ̂(T(j+1)−) − x∗‖ ≤ C exp(−δ(T(j+1) − T(j))) ‖ψ̂(T(j)) − x∗‖ ≤ C exp(−δT) ‖ψ̂(T(j)) − x∗‖.

Choose T so large that C exp(−δT) ≤ 1/2, so that

(4.10)   E[‖ψ̂(T(j+1)−) − x∗‖²]^{1/2} ≤ (1/2) E[‖ψ̂(T(j)) − x∗‖²]^{1/2}
                                      ≤ (1/2) E[e(m(j))²]^{1/2} + (1/2) E[‖ψ(T(j)) − ψ̂(T(j))‖²]^{1/2}.

Combining (4.9) and (4.10) with Lemma 4.8 gives

    E[e(m(j+1))²]^{1/2} ≤ (1/2) E[e(m(j))²]^{1/2} + 2√(C3 α),

which shows that

    lim sup_{j→∞} E[e(m(j))²] ≤ 16 C3 α.

The result follows from this and Lemma 4.7 (ii).
Proof of Theorem 2.5. The details of the proof, though pedestrian in the light of
the foregoing and [6], are quite lengthy, not to mention the considerable overhead of
additional notation, and are therefore omitted. We briefly sketch below a single point
of departure in the proof.
In Lemma 4.6 we compare two functions φ(·) and φ̂(·) on the interval [T(j), T(j+1)]. The former in turn involved the iterates X̃(n) for m(j) ≤ n < m(j+1) or, equivalently, X(n) for m(j) ≤ n < m(j+1). Here X(n+1) was computed in terms of X(n) and the "noise" M(n+1). In the asynchronous case, however, the evaluation of X_i(n+1) can involve X_j(m) for n − τ ≤ m ≤ n, j ≠ i. Therefore the argument leading to Lemma 4.6 calls for a slight modification. While computing X(n), m(j) ≤ n < m(j+1), we plug into the iteration as and when required X̃_i(m) = X_i(m)/r(j). Note, however, that if the same X_i(m) also features in the computation of X_k(ℓ) for m(q) ≤ ℓ < m(q+1), say, with q ≠ j, then X̃_i(m) should be redefined there as X_i(m)/r(q). Thus the definition of X̃_i(m) now becomes context-dependent.
With this minor change, the proofs of [6] can be easily combined with the arguments used in the proofs of Theorems 2.1 and 2.2 to draw the desired conclusions.
4.3. The Markov model. The bounds that we obtain for the Markov model
(2.4) are based upon the theory of ψ-irreducible Markov chains.
A subset S ⊂ Rd is called petite if there exists a probability measure ν on Rd and
δ > 0 such that the resolvent kernel K satisfies
    K(x, A) := Σ_{k=0}^{∞} 2^{−k−1} P^k(x, A) ≥ δ ν(A),    x ∈ S,
for any measurable A ⊂ R^d. Under assumptions (2.6) and (2.7) we show below that every compact subset of R^d is petite, so that X is a ψ-irreducible T-chain. We refer
the reader to [16] for further terminology and notation.
Lemma 4.9. Suppose that (A1), (A2), (2.6), and (2.7) hold and that α ≤ α∗ .
Then all compact subsets of Rd are petite for the Markov chain X, and hence the
chain is ψ-irreducible.
Proof. The conclusions of the theorem will be satisfied if we can find a function
s which is bounded from below on compact sets and a probability ν such that the
resolvent kernel K satisfies the bound
K(x, A) ≥ s(x)ν(A)
for every x ∈ Rd and any measurable subset A ⊂ Rd . This bound is written succinctly
as K ≥ s ⊗ ν.
The first step of the proof is to apply the implicit function theorem together with
(2.6) and (2.7) to obtain a bound of the form
    P^d(x, A) = P(X(d) ∈ A | X(0) = x) ≥ ε ν(A),    x ∈ O,

where O is an open set containing x∗, ε > 0, and ν is the uniform distribution on O. The set O can be chosen independent of α, but the constant ε may depend on α. For
details on this construction, see Chapter 7 of [16].
To complete the proof it is enough to show that K(x, O) > 0. To see this, suppose
that α ≤ α∗ and that W (n) = w∗ for all n. Then the foregoing stability analysis shows
that X(n) ∈ O for all n sufficiently large. Since w∗ is in the support of the marginal
distribution of {W (n)}, it then follows that K(x, O) > 0.
From these two bounds, we then have

    K(x, A) ≥ 2^{−d} ∫ K(x, dy) P^d(y, A) ≥ 2^{−d} ε K(x, O) ν(A).
This is of the form K ≥ s ⊗ ν with s lower semicontinuous and positive everywhere.
The function s is therefore bounded from below on compact sets, which proves the
claim.
The previous lemma together with Theorem 2.1 allows us to establish a strong
form of ergodicity for the model.
Lemma 4.10. Suppose that (A1), (A2), (2.6), and (2.7) hold and that α ≤ α∗.
(i) There exist a function Vα : R^d → [1, ∞) and constants b, L < ∞ and ε0 > 0 independent of α such that

    P Vα(x) ≤ exp(−ε0 α) Vα(x) + b I_C(x),

where C = {x : ‖x‖ ≤ L}. While the function Vα will depend upon α, it is uniformly bounded as follows,

    γ^{−1}(‖x‖² + 1) ≤ Vα(x) ≤ γ(‖x‖² + 1),

where γ ≥ 1 does not depend upon α.
(ii) The chain is V-uniformly ergodic, with V(x) = ‖x‖² + 1.
Proof. Using (4.8) we may construct T and L independent of α ≤ α∗ such that

    E[‖X(k0)‖² + 1 | X(0) = x] ≤ (1/2)(‖x‖² + 1),    ‖x‖ ≥ L,

where k0 = [T/α] + 1. We now set

    Vα(x) = α Σ_{k=0}^{k0−1} E[‖X(k)‖² + 1 | X(0) = x] 2^{k/k0}.

From the previous bound, it follows directly that the desired drift inequality holds with ε0 = log(2)/T. Lipschitz continuity of the model gives the bounds on Vα. This proves (i).
The V-uniform ergodicity then follows from Lemma 4.9 and Theorem 16.0.1 of [16].
We note that for small α and large x, the Lyapunov function Vα approximates V∞ plus a constant, where

    V∞(x) = ∫_0^T (‖x(s)‖² + 1) 2^{s/T} ds,    x(0) = x,

and x(·) is a solution to (1.5). If this ODE is asymptotically stable then the function V∞ is in fact a Lyapunov function for (1.5), provided T > 0 is chosen sufficiently large.
In [17] a bound is obtained on the rate of convergence ρ given in (2.5) for a chain satisfying the drift condition

    P Vα(x) ≤ λ Vα(x) + b I_C(x).

The bound depends on the "petiteness" of the set C and the constants b < ∞ and λ < 1. The bound on ρ obtained in [17] also tends to unity with vanishing α since in the preceding lemma we have λ = exp(−ε0 α) → 1 as α → 0. From the structure of the algorithm this is not surprising, but it underlines the fact that care must be taken in the choice of the stepsize α.
REFERENCES
[1] J. Abounadi, D. Bertsekas, and V. S. Borkar, Learning algorithms for Markov decision
processes with average cost, SIAM J. Control Optim., submitted.
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuron-like elements that can solve
difficult learning control problems, IEEE Trans. Systems, Man and Cybernetics, 13 (1983),
pp. 835–846.
[3] A. Benveniste, M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, Berlin, 1990.
[4] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont,
MA, 1996.
[5] V. S. Borkar, Stochastic approximation with two time scales, Systems Control Lett., 29
(1997), pp. 291–294.
[6] V. S. Borkar, Asynchronous stochastic approximation, SIAM J. Control Optim., 36 (1998),
pp. 840–851.
[7] V. S. Borkar, Recursive self-tuning control of finite Markov chains, Appl. Math., 24 (1996),
pp. 169–188.
[8] V. S. Borkar and K. Soumyanath, An analog scheme for fixed-point computation, part I:
Theory, IEEE Trans. Circuits Systems I Fund. Theory Appl., 44 (1997), pp. 351–355.
[9] J. G. Dai, On positive Harris recurrence for multiclass queueing networks: A unified approach
via fluid limit models, Ann. Appl. Probab., 5 (1995), pp. 49–77.
[10] J. G. Dai and S. P. Meyn, Stability and convergence of moments for multiclass queueing
networks via fluid limit models, IEEE Trans. Automat. Control, 40 (1995), pp. 1889–1904.
[11] M. W. Hirsch, Convergent activation dynamics in continuous time networks, Neural Networks, 2 (1989), pp. 331–349.
[12] T. Jaakola, M. I. Jordan, and S. P. Singh, On the convergence of stochastic iterative
dynamic programming algorithms, Neural Computation, 6 (1994), pp. 1185–1201.
[13] V. R. Konda and V. S. Borkar, Actor-critic–type learning algorithms for Markov decision
processes, SIAM J. Control Optim., 38 (1999), pp. 94–123.
[14] H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications,
Springer-Verlag, New York, 1997.
[15] V. A. Malyshev and M. V. Men’sikov, Ergodicity, continuity and analyticity of countable
Markov chains, Trans. Moscow Math. Soc., 1 (1982), pp. 1–48.
[16] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag,
London, 1993.
[17] S. P. Meyn and R. L. Tweedie, Computable bounds for geometric convergence rates of
Markov chains, Ann. Appl. Probab., 4 (1994), pp. 981–1011.
[18] J. Neveu, Discrete Parameter Martingales, North Holland, Amsterdam, 1975.
[19] T. Sargent, Bounded Rationality in Macroeconomics, Clarendon Press, Oxford, 1993.
[20] J. Tsitsiklis, Asynchronous stochastic approximation and q-learning, Mach. Learning, 16
(1994), pp. 195–202.
[21] C. J. C. H. Watkins and P. Dayan, Q-learning, Mach. Learning, 8 (1992), pp. 279–292.