Stable Function Approximation
in Dynamic Programming
Geoffrey J. Gordon
January 1995
CMU-CS-95-103
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Abstract
The success of reinforcement learning in practical problems depends on the ability to combine function
approximation with temporal difference methods such as value iteration. Experiments in this area have
produced mixed results; there have been both notable successes and notable disappointments. Theory has
been scarce, mostly due to the difficulty of reasoning about function approximators that generalize beyond the
observed data. We provide a proof of convergence for a wide class of temporal difference methods involving
function approximators such as k-nearest-neighbor, and show experimentally that these methods can be
useful. The proof is based on a view of function approximators as expansion or contraction mappings.
In addition, we present a novel view of approximate value iteration: an approximate algorithm for one
environment turns out to be an exact algorithm for a different environment.
The author's current e-mail address is ggordon@cs.cmu.edu. This material is based on work supported under a
National Science Foundation Graduate Research Fellowship, and by NSF grant number BES-9402439. Any opinions,
findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily
reflect the views of the National Science Foundation or the United States Government.
Keywords: machine learning, dynamic programming, temporal differences, value iteration, function
approximation
1 Introduction and background
The problem of temporal credit assignment (deciding which of a series of actions is responsible for a
delayed reward or penalty) is an integral part of machine learning. The methods of temporal differences
are one approach to this problem. In order to learn how to select actions, they learn how easily the agent
can achieve a reward from various states of its environment. Then, they weigh the immediate rewards for
an action against its long-term consequences: a small immediate reward may be better than a large one,
if the small reward allows the agent to reach a high-payoff state. If a temporal difference method discovers
a low-cost path from one state to another, it will remember that the first state can't be much harder to
get rewards from than the second; in this way, information propagates backwards from states in which an
immediate reward is possible to those from which the agent can only achieve a delayed reward.
One of the first examples of a temporal difference method was the Bellman-Ford single-destination
shortest paths algorithm [Bel58, FF62], which learns paths through a graph by repeatedly updating the
estimated distance-to-goal for each node based on the distances for its neighbors. At around the same
time, research on optimal control led to the solution of Markov processes and Markov decision processes (see
below) by temporal difference methods [Bel61, Bla65]. More recently [Wit77, Sut88, Wat89], researchers have
attacked the problem of solving an unknown Markov process or Markov decision process by experimenting
with it.
Many of the above methods have proofs of convergence [BT89, WD92, Day92, JJS94, Tsi94]. Unfortunately,
most of these proofs assume that we represent our solution exactly and therefore expensively, so that
solving a Markov decision problem with n states requires O(n) storage. On the other hand, it is perfectly
possible to perform temporal differencing on an approximate representation of the solution to a decision
problem; Bellman discusses quantization and low-order polynomial interpolation in [Bel61], and approximation
by orthogonal functions in [BD59, BKK63]. These approximate temporal difference methods are
not covered by the above convergence proofs. But, if they do converge, they can allow us to find numerical
solutions to problems which would otherwise be too large to solve.
Researchers have experimented with a number of approximate temporal difference methods. Results
have been mixed: there have been notable successes, including Samuel's checkers player [Sam59], Tesauro's
backgammon player [Tes90], and Lin's robot navigation [Lin92]. But these algorithms are notoriously unstable;
Boyan and Moore [BM95] list several embarrassingly simple situations where popular approximate
algorithms fail miserably. Some possible reasons for these failures are given in [TS93, Sab93].
Several researchers have recently conjectured that some classes of function approximators work more
reliably with temporal difference methods than others. For example, Sutton [SS94] has provided experimental
evidence that linear functions of coarse codes, such as CMACs, can converge reliably for some problems.
He has also suggested that online exploration of a Markov decision process can help to concentrate the
representational power of a function approximator in the important regions of the state space.
We will prove convergence for a significant class of approximate temporal difference algorithms, including
algorithms based on k-nearest-neighbor, linear interpolation, some types of splines, and local weighted
averaging. These algorithms will converge when applied either to discounted decision processes or to an
important subset of nondiscounted decision processes. We will give sufficient conditions for convergence
to the exact value function, and for discounted processes we will bound the maximum error between the
estimated and true value functions.
2 Definitions and basic theorems
Our theorems in the following sections will be based on two views of function approximators. First, we will
cast function approximators as expansion or contraction mappings; this distinction captures the essential
difference between approximators that can exaggerate changes in their training values, like linear regression
and neural nets, and those like k-nearest-neighbor that respond conservatively to changes in their inputs.
Second, we will show that approximate temporal difference learning with some function approximators is
equivalent to exact temporal difference learning for a slightly different problem. To aid the statement of
these theorems, we will need several definitions.
Definition: Consider a vector space S and a norm ‖·‖ on S. If S is closed under limits in ‖·‖, then S is
a complete vector space. Unless otherwise noted, all vector spaces will be assumed to be complete.
Examples of complete vector spaces include the real numbers under absolute value and the n-vectors of
real numbers under Manhattan (L_1), Euclidean (L_2), and max (L_∞) norms. The max norm, defined as

    ‖a‖_∞ = max_i |a_i|

and the weighted max norm with weight vector W, defined as

    ‖a‖_W = max_i (1/W_i) |a_i|

are particularly important for reasoning about Markov decision problems.
Definition: A function f from a vector space S to itself is a contraction mapping if, for all points a and b
in S, ‖f(a) − f(b)‖ ≤ β ‖a − b‖. Here β, the contraction factor or modulus, is any real number in [0, 1). If
we merely have ‖f(a) − f(b)‖ ≤ ‖a − b‖, we call f a nonexpansion.
For example, the function f(x) = 5 + x/2 is a contraction with contraction factor 1/2. The identity function
is a nonexpansion. All contractions are nonexpansions.
Lemma 2.1 Let C and D be contractions with contraction factors α and β on some vector space S. Let M
and N be nonexpansions on S. Then
- C ∘ N and N ∘ C are each contractions with contraction factor α;
- C ∘ D is a contraction with factor αβ;
- M ∘ N is a nonexpansion.
Definition: A point x is a fixed point of the function f if f(x) = x.
A function may have any number of fixed points. For example, the function x^2 on the real line has two
fixed points, 0 and 1; any number is a fixed point of the identity function; and x + 1 has no fixed points.
Theorem 2.1 (Contraction Mapping) Let S be a vector space with norm ‖·‖. Suppose f is a contraction
mapping on S with contraction factor β. Then f has exactly one fixed point x* in S. For any initial point
x_0 in S, the sequence x_0, f(x_0), f(f(x_0)), ... converges to x*; the rate of convergence of the above sequence
in the norm ‖·‖ is at least β.
For example, the function 5 + x/2 has exactly one fixed point on the real line, namely x = 10.
If S is a finite-dimensional vector space, then convergence in any norm implies convergence in all norms.
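As a quick illustration of the contraction mapping theorem (our addition, not part of the original report), the following Python sketch iterates the example contraction f(x) = 5 + x/2 from an arbitrary starting point and converges to the fixed point x = 10, with the error halving on each step:

    # Fixed-point iteration for the contraction f(x) = 5 + x/2.
    # The unique fixed point is x = 10; |x_i - 10| shrinks by the factor 1/2 per step.
    def f(x):
        return 5 + x / 2

    x = 0.0                      # arbitrary initial point x_0
    for i in range(20):
        x = f(x)
    print(x)                     # ~10.0 (within about 1e-5 after 20 iterations)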
Definition: A Markov decision process is a tuple (S, A, δ, c, γ, S_0). MDPs are a formalism for describing
the experiences of an agent interacting with its environment. The set S is the state space; A is the action
space. At any given time t, the environment is in some state x_t ∈ S. The agent perceives x_t, and is allowed
to choose an action a_t ∈ A. The transition function δ (which may be probabilistic) then acts on x_t and a_t
to produce a next state x_{t+1}, and the process repeats. S_0 is a distribution on S which gives the probability
of being in each state at time 0. The cost function c (which may be probabilistic) measures how well the
agent is doing: at each time step t, the agent incurs a cost c(x_t, a_t). The agent must act to minimize the
expected discounted cost E[Σ_{t=0}^∞ γ^t c(x_t, a_t)]; γ ∈ [0, 1] is called the discount factor. We will write V*(x)
for the optimal value function, the minimal possible expected discounted cost starting from state x. We
introduce conditions below under which V* is unique and well-defined.
We will say that an MDP is deterministic if the functions c(x, a) and δ(x, a) are deterministic for all x
and a, i.e., if the current state and action uniquely determine the cost and the next state. An MDP is finite
if its state and action spaces are finite; it is discounted if γ < 1. We will call an MDP a Markov process
if |A| = 1. In a Markov process, we cannot influence the expected discounted cost; our goal is merely to
compute it.
There are several ways that we can ensure that V* exists. In a finite discounted MDP, it is sufficient to
require that the cost function c(x, a) have bounded mean and variance for all x and a. For a nondiscounted
MDP, even if c is bounded, cycles may cause the expected total cost for some states to be infinite. So, we
are usually interested in the case where some set of states G is absorbing and cost-free: that is, if we are
in G at time t, we will be in G at time t + 1, and c(x, a) = 0 for any x ∈ G and a ∈ A. Without loss of
generality, we may lump all such states together and replace them by a single state. Suppose that state 1 of
an MDP is absorbing. Call an action selection strategy proper if, no matter what state we start in, following
the strategy ensures that P(x_t = 1) → 1 as t → ∞. A finite nondiscounted MDP will have a well-defined
optimal value function as long as
- the cost function has bounded mean and variance,
- there exists a proper strategy, and
- there does not exist a strategy which has expected cost equal to −∞ from some initial state.
From now on, we will assume that all MDPs that we consider have a well-defined V*. We will also assume
that S_0 puts a nonzero probability on every state in S. This allows us to avoid worrying about inaccessible
states.
If we have two MDPs M_1 = (S, A_1, δ_1, c_1, γ_1, S_0) and M_2 = (S, A_2, δ_2, c_2, γ_2, S_0) which share the same
state space, we can define a new MDP M_{12}, the composition of M_1 and M_2, by alternately following
transitions from M_1 and M_2. More formally, let M_{12} = (S, A_1 × A_2, δ_{12}, c_{12}, γ_1 γ_2, S_0). At each step,
the agent will select one action from A_1 and one from A_2; we define the composite transition function
δ_{12} so that δ_{12}(x, (a_1, a_2)) = δ_2(δ_1(x, a_1), a_2). The cost of the composite action will be c_{12}(x, (a_1, a_2)) =
c_1(x, a_1) + γ_1 c_2(δ_1(x, a_1), a_2).
A trajectory is a sequence of tuples (x_0, a_0, c_0), (x_1, a_1, c_1), ...; trajectories describe the experiences of an
agent acting in an MDP. If the MDP is absorbing, there will be a point t so that c_t = c_{t+1} = ... = 0; we
will usually omit the portion of the trajectory after t.
Define a policy to be a function π : S → A. An agent may follow policy π by choosing action π(x)
whenever it is in state x. It is possible to generalize the above definition to include randomized strategies
and strategies which change over time; but the extra generality is unnecessary. It is well-known [Bel61, BT89]
that every Markov decision process with a well-defined V* has at least one optimal policy π*; an agent which
follows π* will do at least as well as any other agent, including agents which choose actions according to
non-policies. The policy π* will satisfy Bellman's equation

    (∀x ∈ S)  V*(x) = E( c(x, π*(x)) + γ V*(δ(x, π*(x))) )

and every policy which satisfies Bellman's equation is optimal.
There are two broad classes of learning problems for Markov decision processes: online and offline. In
both cases, we wish to compute an optimal policy for some MDP. In the offline case, we are allowed access
to the whole MDP, including the cost and transition functions; in the online case, we are only given S
and A, and then must discover what we can about the MDP by interacting with it. (In particular, in the
online case, we are not free to try an action from any state; we are limited to acting in the current state.)
We can transform an online problem into an offline one by observing one or more trajectories, estimating
the cost and transition functions, and then pretending that our estimates are the truth. (This approach is
called the certainty equivalent method.) Similarly, we can transform an offline problem into an online one
by pretending that we don't know the cost and transition functions. Most of the remainder of the paper
deals with offline problems, but we mention online problems again in section 5.
In the offline case, the optimal value function tells us the optimal policies: we may set π(x) to be any a
which minimizes E(c(x, a) + γ V*(δ(x, a))). (In the online case, V* is not sufficient, since we can't compute
the above expectation.) For a finite MDP, we can find V* by dynamic programming. With appropriate
assumptions, repeated application of the dynamic programming backup operator

    V(x) ← min_{a∈A} E( c(x, a) + γ V(δ(x, a)) )

to every state is guaranteed to converge to V* from any initial guess [BT89]. (In the case of a nondiscounted
problem with cost-free absorbing state g, we define the backup operator to set V(g) ← 0 as a special case.)
This dynamic programming algorithm is called value iteration. If we need to solve an infinite MDP that
satisfies certain continuity conditions, we may first approximate it by a finite MDP as described in [CT89],
then solve the finite MDP by value iteration. For this reason, the remainder of the paper will focus on finite
(although possibly very large) Markov decision processes.
We can generalize the above single-state version of the value iteration backup operator to allow parallel
updating: instead of merely changing our estimate for one state at a time, we compute the new value for
every state before altering any of the estimates. The result of this change is the parallel value iteration
operator. The following two theorems imply the convergence of parallel value iteration. See [BT89] for
proofs.
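To make the parallel update concrete, here is a minimal Python sketch of parallel value iteration on a small finite discounted MDP (our illustration; the transition probabilities, costs, and discount below are arbitrary placeholders, not taken from the report):

    import numpy as np

    # Hypothetical finite MDP: P[a, x, y] = P(next = y | state = x, action = a), c[x, a] = expected cost.
    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.random((n_actions, n_states, n_states))
    P /= P.sum(axis=2, keepdims=True)                  # make each row a probability distribution
    c = rng.random((n_states, n_actions))

    V = np.zeros(n_states)
    for _ in range(1000):
        # Parallel backup: compute the new value of every state before overwriting any estimate.
        Q = c + gamma * np.einsum('axy,y->xa', P, V)   # Q[x, a] = c(x, a) + gamma * E[V(next state)]
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    print(V)                                           # converges to the optimal value function V*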
Theorem 2.2 (value contraction) The parallel value iteration operator for a discounted Markov decision
process is a contraction in max norm, with contraction factor equal to the discount. If all policies in a
nondiscounted Markov decision process are proper, then the parallel value iteration operator for that process
is a contraction in some weighted max norm. The fixed point of each of these operators is the optimal value
function for the MDP.
Theorem 2.3 Let a nondiscounted Markov decision process have at least one proper policy, and let all
improper policies have expected cost equal to +∞ for at least one initial state. Then the parallel value
iteration operator for that process converges from any initial guess to the optimal value function for that
process.
3 Main results: a simple case
In this section, we will consider only discounted Markov decision processes. The following sections generalize
the results to other interesting cases.
Suppose that T is the parallel value backup operator for a Markov decision process M. In the basic
value iteration algorithm, we start off by setting V_0 to some initial guess at M's value function. Then we
repeatedly set V_{i+1} to be T(V_i) until we either run out of time or decide that some V_n is a sufficiently
accurate approximation to V*. Normally we would represent each V_i as an array of real numbers indexed
by the states of M; this data structure allows us to represent any possible value function exactly.
Now suppose that we wish to represent V_i, not by a lookup table, but by some other more compact data
structure such as a neural net. We immediately run into two difficulties. First, computing T(V_i) generally
requires that we examine V_i(x) for nearly every x in M's state space; and if M has enough states that we
can't afford a lookup table, we probably can't afford to compute V_i that many times either. Second, even if
we can represent V_i exactly with a neural net, there is no guarantee that we can also represent T(V_i).
To address both of these difficulties, we will assume that we have a sample X_0 of states from M. X_0
should be small enough that we can examine each element repeatedly; but it should be large enough and
representative enough that we can learn something about M by examining only the states in X_0. Now we can
define an approximate value iteration algorithm. Rather than setting V_{i+1} to T(V_i), we will first compute
(T(V_i))(x) only for x ∈ X_0; then we will fit our neural net (or other approximator) to these training values
and call the resulting function V_{i+1}.
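Schematically, the loop just described looks as follows (a sketch of the idea only, not the report's implementation; `backup` and `fit` are hypothetical stand-ins for one application of T at a single state and for any supervised regression routine):

    # Approximate (fitted) value iteration: back up only the sampled states X0,
    # then refit the function approximator to those training values.
    def approximate_value_iteration(X0, backup, fit, n_iters):
        V = lambda x: 0.0                          # initial guess V_0
        for _ in range(n_iters):
            targets = [backup(V, x) for x in X0]   # (T(V_i))(x) for each x in X0
            V = fit(X0, targets)                   # V_{i+1}: fitted function, defined on all states
        return V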
In order to reason about approximate value iteration, we will consider function approximation methods
themselves as operators on the space of value functions: given any target value function, the approximator
will produce a fitted value function, as shown in figure 1. In the figure, the sample X_0 is the first five natural
numbers, and the representable functions are the cubic splines with knots in X_0.
The characteristics of the function approximation operator determine how it behaves when combined with
value iteration. One particularly important property is illustrated in figure 2. As the figure shows, linear
regression can exaggerate the difference between two target value functions V_1 and V_2: a small difference
between the targets V_1(x) and V_2(x) can lead to a larger difference between the fitted values V̂_1(x) and
V̂_2(x). Many function approximators, such as neural nets and local weighted regression, can exaggerate this
way; others, such as k-nearest-neighbor, can not. We will show later that this sort of exaggeration can cause
instability in an approximate value iteration algorithm.
Figure 1: Function approximation methods as mappings. In (a) we see the value function for a simple
random walk on the positive real line. (On each transition, the agent has an equal probability of moving left
or right by one step. State 0 is absorbing; transitions from other states have cost 1.) Applying a function
approximator (in this case, fitting a spline with knots at the first five natural numbers) maps the value
function in (a) to the value function in (b). Since the function approximator discards some information, its
mapping can't be 1-to-1: in (c) we see a different value function which the approximator also maps to (b).
Figure 2: The mapping associated with linear regression when samples are taken at the points x = 0, 1, 2. In
(a) we see a target value function (solid line) and its corresponding fitted value function (dotted line). In (b)
we see another target function and another fitted function. The first target function has values y = 0, 0, 0
at the sample points; the second has values y = 0, 1, 1. Regression exaggerates the difference between the
two functions: the largest difference between the two target functions at a sample point is 1 (at x = 1 and
x = 2), but the largest difference between the two fitted functions at a sample point is 7/6 (at x = 2).
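The exaggeration in figure 2 is easy to check numerically (a small verification of our own, not part of the original): ordinary least squares fitted to the targets y = 0, 1, 1 at x = 0, 1, 2 gives the line 1/6 + x/2, while the all-zero targets fit to the zero line, so the fitted values differ by 7/6 at x = 2.

    import numpy as np

    x = np.array([0.0, 1.0, 2.0])
    X = np.column_stack([np.ones_like(x), x])      # design matrix: intercept and slope
    y1 = np.array([0.0, 0.0, 0.0])
    y2 = np.array([0.0, 1.0, 1.0])
    b1, *_ = np.linalg.lstsq(X, y1, rcond=None)
    b2, *_ = np.linalg.lstsq(X, y2, rcond=None)
    gap = X @ b2 - X @ b1                          # difference of fitted values at the sample points
    print(gap.max())                               # 7/6 ~ 1.167, larger than the target gap of 1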
Definition: Suppose we wish to approximate a function from a space S to a vector space R. Fix a sample
vector X_0 of points from S, and fix a function approximation scheme A. Now for each possible vector Y of
target values in R, A will produce a function f̂ from S to R. Define Ŷ to be the vector of fitted values; that
is, the i-th element of Ŷ will be f̂ applied to the i-th element of X_0. Now define M_A, the mapping associated
with A, to be the function which takes each possible Y to its corresponding Ŷ.
Now we can apply the powerful theorems about contraction mappings to function approximation methods.
In fact, it will turn out that if M_A is a nonexpansion in an appropriate norm, the combination of A with
value iteration is stable. (That is, under the usual assumptions, value iteration will converge to some
approximation of the value function.) The rest of this section states the required property more formally,
then proves that some common function approximators have this property.
Definition: Two operators M_F and T on the space S are compatible if repeated application of M_F ∘ T is
guaranteed to converge to some x ∈ S from any initial guess x_0 ∈ S.
Note that compatibility is symmetric: if the sequence of operators M_F, T, M_F, T, M_F, ... causes convergence
from any initial guess x_0, then it also causes convergence from T(x_0); so the sequence T, M_F, T, M_F, ...
also causes convergence.
Theorem 3.1 Let T_M be the parallel value backup operator for some Markov decision process M with state
space S, action space A, and discount γ < 1. Let X = S. Let F be a function approximator (for
functions from X to R) with mapping M_F : R^|X| → R^|X|. Suppose M_F is a nonexpansion in max norm.
Then M_F is compatible with T_M, and M_F ∘ T_M has contraction factor γ.
Figure 3: A Markov process and a CMAC which are incompatible. Part (a) shows the process. Its goal is
state 1. On each step, with probability 95%, the process follows a solid arrow, and with probability 5%,
it follows a dashed arrow. All arc costs are zero. Part (b) shows the CMAC, which has 4 receptive fields
each covering 3 nodes. If the CMAC starts out with all predictions equal to 1, approximate value iteration
produces the series of target values 1, 77/60, (77/60)^2, ... for state 2.
Proof: By the value contraction theorem, T_M is a contraction in max norm with factor γ. By assumption,
M_F is a nonexpansion in max norm. Therefore M_F ∘ T_M is a contraction in max norm with factor γ. □
Corollary 3.1 The approximate value iteration algorithm based on F converges in max norm at the rate γ
when applied to M.
It remains to show which function approximators are compatible with value iteration. Many common
approximators are incompatible with standard parallel value backup operators. For example, as figure 2
demonstrates, linear regression can be an expansion in max norm; and Boyan and Moore [BM95] show that
the combination of value iteration with linear regression can diverge. Other incompatible methods include
standard feedforward neural nets, some forms of spline fitting, and local weighted regression [BM95].
A particularly interesting case is the CMAC. Sutton [SS94] has recently reported success in combining
value iteration with a CMAC, and has suggested that function approximators similar to the CMAC are
likely to allow temporal differencing to converge in the online case. While we have no evidence to support
or refute this suggestion, it is worth mentioning that convergence is not guaranteed in our offline framework,
as the counterexample in figure 3 shows.
In this example, the optimal value function is uniformly zero, since all arc costs are zero. The exact
value backup operator assigns 0 to V(1) (since state 1 is the goal) and .05 V(1) + .95 V(2) to V(i) for i ≠ 1
(since all states except 1 have a 5% chance of transitioning to state 1 and a 95% chance of transitioning to
state 2). The approximate backup operator computes these same 6 numbers as training data for the CMAC;
but since the CMAC can't represent this function exactly, our output is the closest representable function
in the sense of least squared error. If the exact operator produces kv where v = (0, 1, 1, 1, 1, 1)^T, then the
closest representable function will be kw where w = (1/3, 4/3, 5/6, 5/6, 5/6, 5/6)^T. If we repeat the process
from this new value function, the exact backup operator will now produce (77/60) k v, and the CMAC will
produce (77/60) k w. Further iteration causes divergence at the rate 77/60.
The CMAC still diverges even if we choose a small learning rate: with learning rate α, the rate of
divergence is 1 + (17/60) α. However, if we train the CMAC based on actual trajectories in the Markov process, as
Sutton suggests, we no longer diverge: transitions out of state 2, which lower the coefficients of our CMAC
substantially, are as frequent as transitions into state 2, which raise the coefficients somewhat. In fact, since
this example is a Markov process rather than a Markov decision process, convergence of the online algorithm
is guaranteed by a theorem in [Day92].
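The 77/60 rate can be reproduced numerically. Since figure 3(b) is not legible in this copy, the receptive-field layout below is our reconstruction and therefore an assumption: each of the 4 fields covers 3 of the 6 states and each state lies in exactly 2 fields, chosen so that the least-squares fit reproduces the fitted values and growth rate quoted above.

    import numpy as np

    # Hypothetical receptive fields (one row per state, one column per field).
    Phi = np.array([[1, 1, 0, 0],    # state 1 in fields 1, 2
                    [0, 0, 1, 1],    # state 2 in fields 3, 4
                    [1, 0, 1, 0],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0],
                    [0, 1, 0, 1]], dtype=float)

    v = np.array([0., 1., 1., 1., 1., 1.])        # targets produced by the exact backup
    w, *_ = np.linalg.lstsq(Phi, v, rcond=None)   # least-squares CMAC weights
    fitted = Phi @ w                              # (1/3, 4/3, 5/6, 5/6, 5/6, 5/6)

    # One more exact backup: V(1) <- 0, V(i) <- .05*V(1) + .95*V(2) for i != 1.
    growth = 0.05 * fitted[0] + 0.95 * fitted[1]
    print(fitted)
    print(growth)                                 # 77/60 ~ 1.2833: the iteration diverges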
We will prove that a broad class of approximation methods is compatible with value iteration. This
class includes kernel averaging, k-nearest-neighbor, weighted k-nearest-neighbor, Bezier patches, linear
interpolation on a triangular (or tetrahedral, etc.) mesh, bilinear interpolation on a square (or cubical, etc.)
mesh, and many others. (See the Experiments section for a definition of bilinear interpolation. Note that
the square mesh is important: on non-rectangular meshes, bilinear interpolation will sometimes need to
extrapolate.)
Definition: A real-valued function approximation scheme is an averager if every fitted value is the weighted
average of zero or more target values and possibly some predetermined constants. The weights involved in
calculating the fitted value Ŷ_i may depend on the sample vector X_0, but may not depend on the target values
Y. More precisely, for a fixed X_0, if Y has n elements, there must exist n real numbers k_i, n^2 nonnegative
real numbers β_{ij}, and n nonnegative real numbers β_i, so that for each i we have β_i + Σ_j β_{ij} = 1 and

    Ŷ_i = β_i k_i + Σ_j β_{ij} Y_j

It should be obvious that all of the methods mentioned before the definition are averagers: in all cases,
the fitted value at any given coordinate is a weighted average of target values, and the weights are determined
by distances in X values, and so are unaffected by the target Y values.
Theorem 3.2 The mapping M_F associated with any averaging method F is a nonexpansion in max norm,
and is therefore compatible with the parallel value backup operator for any discounted MDP.
Proof: Fix the βs and ks for F as in the above definition. Let Y and Z be two vectors of target values.
Consider a particular coordinate i. Then, letting ‖·‖ denote max norm,

    [M_F(Y) − M_F(Z)]_i = ( β_i k_i + Σ_j β_{ij} Y_j ) − ( β_i k_i + Σ_j β_{ij} Z_j )
                        = Σ_j β_{ij} (Y_j − Z_j)
                        ≤ max_j (Y_j − Z_j)
                        ≤ ‖Y − Z‖

That is, every element of M_F(Y) − M_F(Z) is no larger than ‖Y − Z‖. Therefore, the max norm of
M_F(Y) − M_F(Z) is no larger than ‖Y − Z‖. In other words, M_F is a nonexpansion in max norm. □
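As a quick numerical sanity check on this proof (our addition), the sketch below draws a random averager, i.e., a row-stochastic mixture of target values and predetermined constants, and verifies that its mapping never increases the max-norm distance between two target vectors:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10
    # Random averager coefficients: each row [beta_i, beta_i1, ..., beta_in] is nonnegative and sums to 1.
    W = rng.random((n, n + 1))
    W /= W.sum(axis=1, keepdims=True)
    beta0, B = W[:, 0], W[:, 1:]                  # constant weights beta_i and target weights beta_ij
    k = rng.normal(size=n)                        # predetermined constants k_i

    def M_A(Y):
        return beta0 * k + B @ Y                  # fitted values: beta_i*k_i + sum_j beta_ij*Y_j

    for _ in range(1000):
        Y, Z = rng.normal(size=n), rng.normal(size=n)
        assert np.max(np.abs(M_A(Y) - M_A(Z))) <= np.max(np.abs(Y - Z)) + 1e-12
    print("max-norm nonexpansion held on all sampled pairs")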
4 Nondiscounted processes
Consider now a nondiscounted Markov decision process M. Suppose for the moment that all policies for
M are proper. Then the value contraction theorem states that the parallel value backup operator T_M for
this process is a contraction in some weighted max norm ‖·‖_W. The previous section proved that if the
approximation method A is an averager, then M_A is a nonexpansion in (unweighted) max norm. If we could
prove that M_A were also a nonexpansion in ‖·‖_W, we would have M_A compatible with T_M. Unfortunately,
M_A may be an expansion in ‖·‖_W; in fact parts (a) and (b) of figure 4 show a simple example of a
nondiscounted MDP and an averager which are incompatible. (Part (c) is a simple proof that the MDP and
the averager are incompatible; see below for an explanation.)
Fortunately, there are averagers which are compatible with nondiscounted MDPs. The proof relies on
an intriguing property of averagers: we can view any averager as a Markov process, so that state x has a
transition to state y whenever β_{xy} > 0, i.e., whenever the fitted V(x) depends on the target V(y). Part
(b) of figure 4 shows one example of a simple averager viewed as a Markov process; this averager has
β_{11} = β_{23} = β_{33} = 1 and all other coefficients zero.
If we view an averager as a Markov process, and compose this process with our original MDP, we will
derive a new MDP. Part (c) of figure 4 shows a simple example; a slightly more complicated example is
in figure 5. As the following theorem shows, exact value iteration on this derived MDP is the same as
approximate value iteration on the original MDP.
Theorem 4.1 (Derived MDP) For any averager A with mapping M_A, and for any MDP M (either
discounted or nondiscounted) with parallel value backup operator T_M, the function T_M ∘ M_A is the parallel
value backup operator for a new Markov decision process M'.
Figure 4: A nondiscounted deterministic Markov process, and an averager with which it is incompatible.
The process is shown in (a); the goal is state 1, and all arc costs except at the goal are 1. In (b) we see an
averager, represented as a Markov process: states 1 and 3 are unchanged, while V(2) is replaced by V(3).
The derived Markov process is shown in (c); state 3 has been disconnected, so its value estimate will diverge.
Figure 5: An example of the construction of the derived Markov process. Part (a) shows a deterministic
Markov process: its state space is the unit triangle, and on every step the agent moves a constant distance
towards the origin. The value of each state is simply its distance from the origin, so the value function
is nonlinear. For our function approximator, we will use linear interpolation on the three corners of the
triangle. Part (b) shows a representative transition from the derived process: as before, the agent moves
towards the goal, but then the averager moves the agent randomly to one of the three corners. On average,
this scattering moves the agent back away from the goal, so steps in the derived process don't move the
agent as far on average as they did in the original process. Part (c) shows the expected progress the agent
makes on each step. The value function for the derived process is V(x, y) = x + y.
Proof: Define the derived MDP M' as follows. It will have the same state and action spaces as M, and
it will also have the same discount factor and initial distribution. We can assume without loss of generality
that state 1 of M is cost-free and absorbing: if not, we can renumber the states of M starting at 2, add a
new state 1 which satisfies this property, and make all of its incoming transition probabilities zero. We can
also assume, again without loss of generality, that β_1 = 1 and k_1 = 0 (that is, that A always sets V(1) = 0);
again, if this property does not already hold, we can add a new state 1.
Suppose that, in M, action a in state x takes us to state y with probability p_{axy}. Suppose that A replaces
V(y) by β_y k_y + Σ_z β_{yz} V(z). Then we will define the transition probabilities in M' for state x and action a
to be

    p'_{axz} = Σ_y p_{axy} β_{yz}              (z ≠ 1)

    p'_{ax1} = Σ_y p_{axy} (β_{y1} + β_y)

These transition probabilities make sense: since A is an averager, we know that Σ_z β_{yz} + β_y is 1, so

    Σ_z p'_{axz} = Σ_{z≠1} Σ_y p_{axy} β_{yz} + Σ_y p_{axy} (β_{y1} + β_y)
                 = Σ_y p_{axy} ( Σ_z β_{yz} + β_y )
                 = Σ_y p_{axy} = 1

Now suppose that, in M, performing action a from state x yields expected cost c_{xa}. Then performing
action a from state x in M' yields expected cost

    c'_{xa} = c_{xa} + γ Σ_y p_{axy} β_y k_y

Now the parallel value backup operator T_{M'} for M' is (throughout, we use V(1) = 0, which both algorithms
maintain at the cost-free absorbing state)

    V(x) ← min_a E( c'(x, a) + γ V(δ'(x, a)) )
         = min_a Σ_z p'_{axz} ( c'_{xa} + γ V(z) )
         = min_a ( Σ_{z≠1} Σ_y p_{axy} β_{yz} ( c'_{xa} + γ V(z) ) + Σ_y p_{axy} (β_{y1} + β_y) ( c'_{xa} + γ V(1) ) )
         = min_a Σ_y p_{axy} ( Σ_{z≠1} β_{yz} ( c'_{xa} + γ V(z) ) + (β_{y1} + β_y) c'_{xa} )
         = min_a Σ_y p_{axy} ( c'_{xa} + γ Σ_{z≠1} β_{yz} V(z) )
         = min_a ( c'_{xa} + γ Σ_y p_{axy} Σ_{z≠1} β_{yz} V(z) )
         = min_a ( c_{xa} + γ Σ_y p_{axy} β_y k_y + γ Σ_y p_{axy} Σ_{z≠1} β_{yz} V(z) )

On the other hand, the parallel value backup operator for M is

    V(x) ← min_a E( c(x, a) + γ V(δ(x, a)) )
         = min_a Σ_y p_{axy} ( c_{xa} + γ V(y) )

If we replace V(y) by its approximation under A, the operator becomes T_M ∘ M_A:

    V(x) ← min_a Σ_y p_{axy} ( c_{xa} + γ ( β_y k_y + Σ_z β_{yz} V(z) ) )
         = min_a ( c_{xa} + γ Σ_y p_{axy} β_y k_y + γ Σ_y p_{axy} Σ_{z≠1} β_{yz} V(z) )

which is exactly the same as the expression for T_{M'} above. □
Given an initial estimate V_0 of the value function, approximate value iteration begins by computing
M_A(V_0), the representation of V_0 in A. Then it alternately applies T_M and M_A to produce the series of
functions V_0, M_A(V_0), T_M(M_A(V_0)), M_A(T_M(M_A(V_0))), .... (In an actual implementation, only the functions
M_A(...) would be represented explicitly; the functions T_M(...) would just be sampled at the points X_0.) On
the other hand, exact value iteration on M' produces the series of functions V_0, T_M(M_A(V_0)), T_M(M_A(T_M(M_A(V_0)))), ....
This series obviously contains exactly the same information as the previous one. The only
difference between the two algorithms is that approximate value iteration would stop at one of the functions
M_A(...), while iteration on M' would stop at one of the functions T_M(...).
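The construction is easy to carry out mechanically. The sketch below (our illustration, with arbitrary placeholder numbers and with state index 0 playing the role of the cost-free absorbing state 1) builds p' and c' from p, the averager coefficients β, and the constants k, following the construction above, and checks that one exact backup on M' agrees with T_M ∘ M_A on the original MDP:

    import numpy as np

    rng = np.random.default_rng(2)
    n, n_a, gamma = 4, 2, 0.95
    # Original MDP; state 0 is cost-free and absorbing.
    p = rng.random((n_a, n, n)); p /= p.sum(axis=2, keepdims=True)
    p[:, 0, :] = 0.0; p[:, 0, 0] = 1.0
    c = rng.random((n, n_a)); c[0, :] = 0.0
    # Averager: beta[y, z] weights on targets, beta0[y] weight on the constant k[y]; beta0[0] = 1, k[0] = 0.
    W = rng.random((n, n + 1)); W /= W.sum(axis=1, keepdims=True)
    beta0, beta = W[:, 0].copy(), W[:, 1:].copy()
    beta0[0] = 1.0; beta[0, :] = 0.0
    k = rng.normal(size=n); k[0] = 0.0

    # Derived MDP: p'_{axz} = sum_y p_{axy} beta_{yz} for z != 0; the beta_{y0} + beta_y mass goes to state 0.
    p_d = np.einsum('axy,yz->axz', p, beta)
    p_d[:, :, 0] += np.einsum('axy,y->ax', p, beta0)
    c_d = c + gamma * np.einsum('axy,y->xa', p, beta0 * k)

    V = rng.normal(size=n); V[0] = 0.0             # any value function with V(absorbing state) = 0
    T_Mprime = (c_d + gamma * np.einsum('axz,z->xa', p_d, V)).min(axis=1)
    fitted = beta0 * k + beta @ V                  # M_A(V)
    T_M_MA = (c + gamma * np.einsum('axy,y->xa', p, fitted)).min(axis=1)
    print(np.allclose(T_Mprime, T_M_MA))           # True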
Now we can see why the combination in figure 4 diverges: the derived MDP has a state with infinite
cost. So, in order to prove compatibility, we need to guarantee that the derived MDP is well-behaved.
If the arc costs of a discounted MDP M have finite mean and variance, it is obvious that the arc costs
of M' also have finite mean and variance. That means that T_{M'} = T_M ∘ M_A converges in max norm at the
rate γ; i.e., we have just proven again that M_A is compatible with T_M.
More importantly, if M is a finite nondiscounted process, there are averagers which are compatible with
it. For example, if A uses weight decay (i.e., if β_y > 0 for all y), then M' will have all policies proper, since
any action in any state has a nonzero probability of bringing us immediately to state 1.
More generally, if M has only proper policies, we may partition its state space by distance from state 1,
as follows (see [BT89]). Let S_1 = {1}. Now recursively define U_k = ∪_{j<k} S_j and

    S_k = { x : (x ∉ U_k) ∧ (min_{a∈A} max_{y∈U_k} P(δ(x, a) = y) > 0) }

Bertsekas and Tsitsiklis show that this partitioning is exhaustive; so we may set k(x) to be the unique k so
that x ∈ S_k. If M is finite, there will be finitely many nonempty S_k.
We will say that an averager A is self-weighted for M if for every state y, either β_y > 0 or there exists a
state x so that k(x) ≤ k(y) and β_{yx} > 0.
Now, if A is self-weighted for a finite nondiscounted MDP M, then M' will have only proper policies: if
at some time we are in state x, then no matter what action a we take, there is a nonzero chance that we will
immediately follow an arc to a state in S_{k'} for some k' < k(x). (By definition of the partition, there is a transition
in M under a from x to some y so that k(y) < k(x). By the self-weighting property, either β_{yz} must be
nonzero for some z so that k(z) ≤ k(y), in which case x has a possible transition in M' under a to z; or
β_y > 0, in which case x has a possible transition in M' under a to state 1.) If we follow such an arc, there is
then a nonzero chance that we will immediately follow another arc to a state in S_{k''} for some k'' < k', and so forth
until we eventually (with positive probability) reach partition S_1 and therefore state 1. Let m be the largest
integer so that S_m is nonempty. Then the previous discussion shows that, no matter what state we start
from, we have a positive probability of reaching state 1 in m − 1 steps. Call the smallest such probability
ε. Then in k(m − 1) steps, we have probability at least 1 − (1 − ε)^k of reaching state 1. The limit of this
quantity as k → ∞ is 1; so with probability 1 we eventually absorb from any initial state.
We have just proven
Corollary 4.1 Let M be a finite nondiscounted Markov decision process. Suppose all policies in M are
proper. Let A be an averager which is self-weighted for M. Let T_M and M_A be the parallel backup operator
for M and the mapping associated with A. Then T_M and M_A are compatible, and the approximate value
iteration algorithm based on A will converge if it is applied to M.
If A ignores V(x) for all states x not in some sample X_0, then the states in X_0 will play an important
role in the derived MDP M'. While the initial state of a trajectory may be outside of X_0, all transitions
in M' lead to states in X_0, so after one step the trajectory will enter X_0 and stay there indefinitely. This
means that we can characterize a large part of M' by looking only at its behavior on X_0, which is just what we
need for a tractable algorithm. (It is worth mentioning that, if M is nondiscounted, X_0 should contain the
goal state: if it doesn't, M' will have no transitions into the goal, so all of its values will be infinite.)
5 The online problem and Q-learning
The results of the previous sections carry over directly to a gradual version of the parallel value backup
operator
V (x) x min
E(c(x; a) + V ((x;a)))
a2A
in which, rather than replacing V (x) by its computed update on each step, we take a weighted average of
the old and new values of V (x). (The weights x may dier for each x, and may change from iteration to
iteration.) That is, we can still construct the derived MDP M 0 and perform gradual value iteration on it,
and gradual value iteration on M 0 is still the same as gradual approximate value iteration on M.
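A minimal sketch of this gradual update (our illustration; `expected_backup(V, x)` is a hypothetical routine returning min_{a∈A} E(c(x, a) + γ V(δ(x, a)))):

    # Gradual (averaged) value backup: blend the old estimate with the freshly computed update.
    def gradual_backup(V, states, alpha, expected_backup):
        new_V = dict(V)
        for x in states:
            new_V[x] = (1 - alpha[x]) * V[x] + alpha[x] * expected_backup(V, x)
        return new_V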
The results also apply nearly directly to dynamic programming with Watkins' [Wat89] Q function

    Q*(x, a) = E( c(x, a) + γ V*(δ(x, a)) )

and Q-learning operator

    Q(x, a) ←_{α_{xa}} E( c(x, a) + γ min_{a'} Q(δ(x, a), a') )

(The learning rates α_{xa} may now be random variables, and may depend for each step not only on x and
a but also on the entire past history of the agent's interactions with the MDP, up to but not including the
current values of the random variables c(x, a) and δ(x, a).) That is, we can still define a derived MDP so
that the behavior of the approximate algorithm on the original MDP is the same as the behavior of the exact
algorithm on the derived MDP. The derived MDP is, however, slightly different from the derived MDP for
value iteration; see the Appendix for details.
The previous paragraphs imply that, if we could sample transitions at will from the derived MDP
(and so compute unbiased estimates of the Q-learning updates), we could apply gradual Q-learning to these
transitions and learn a policy. The convergence of this algorithm would be guaranteed, as long as the weights
α_{xa} were chosen appropriately, by any sufficiently general convergence proof for Q-learning [JJS94, Tsi94].
Unfortunately, there is a catch. Q-learning is designed to work for online problems, where we don't know
the cost or transition functions and can only sample transitions from our current state. The power of the
approximate value iteration method, on the other hand, comes from the fact that we can pay attention only
to transitions from a certain small set of states. So, while the derived MDP for Q-learning will still have
only a few relevant states, we won't in general be able to observe many transitions from these states, and so
the approximate Q-learning iteration will take a very long time to converge.
There are two ways that the approximate Q-learning algorithm might still be useful. The first is if we
encounter a problem which is somewhere between online and offline: we can sample any transition at will,
but don't know the cost or transition functions a priori or can't compute the necessary expectations. In
such a problem we still can't use value iteration, since it is difficult to compute an unbiased estimate of the
value iteration update, so the approximate Q-learning algorithm is helpful.
The second way is if we are willing to accept possible lack of convergence. Suppose our function approximator
pays attention to the states in the set X_0. If we pretend that every transition we see from a
state x ∉ X_0 is actually a transition from the nearest state x' ∈ X_0, we will then have enough data to
compute the behavior of the derived MDP on X_0. Unfortunately, following this approximation is equivalent
to introducing hidden state into the derived MDP; so we now run the risk of divergence.
6 Converging to what?
Until now, we have only considered the convergence or divergence of approximate dynamic programming
algorithms. Of course we would like not only convergence, but convergence to a reasonable approximation
of the value function. The next section contains some empirical studies of approximate value iteration;
this section proves some error bounds, then outlines the types of problems we have encountered while
experimenting with approximate value iteration.
6.1 Error bounds
Suppose that M is an MDP with value function V*, and let A be an averager. What if V* is also a fixed
point of M_A? Then V* is a fixed point of T_M ∘ M_A; so, if we can show that T_M ∘ M_A converges, we will
know that it converges to the right answer.
If M is discounted and has bounded costs, T_M ∘ M_A will always converge; the following theorem describes
some situations where nondiscounted problems converge.
Theorem 6.1 Let V* be the optimal value function for a finite nondiscounted Markov decision process M.
Let T_M be the parallel value backup operator for M. Let M_A be the mapping for an averager A, and let V*
also be a fixed point of M_A. Then V* is a fixed point of T_M ∘ M_A; and if either
1. M has all policies proper and A is self-weighted for M, or
2. M has E(c(x, a)) > 0 for all x ≠ 1 and A has k_x ≥ 0 for all x
then iteration of T_M ∘ M_A converges to V*.
Proof: By the value contraction theorem, V* is a fixed point of T_M. So T_M(M_A(V*)) = T_M(V*) = V*;
that is, V* is a fixed point of T_M ∘ M_A.
If (1) holds, then by the corollary to the derived MDP theorem, T_M ∘ M_A is a contraction in some
weighted max norm; so V* is the unique fixed point of T_M ∘ M_A, and iteration of T_M ∘ M_A converges to V*.
If (2) holds, then M' has c'(x, a) > 0 for all x ≠ 1; so every improper policy in M' has infinite cost for
some initial states. If we can show that M' has at least one proper policy, then T_M ∘ M_A must converge
(and therefore must converge to V*).
Note that V*(1) = 0 and V*(x) > 0 for x ≠ 1, since all arc costs in M are positive. Suppose we start
at some non-goal state x in M', and choose an action a so that V*(x) = E(c'(x, a) + V*(δ'(x, a))). (There
must be such an action, since V* is a fixed point of the value backup operator for M'.) Since c'(x, a) > 0,
we know that V*(x) > E(V*(δ'(x, a))). In particular, there must be a possible transition to some state y so
that V*(x) > V*(y). If y is not the goal, we can repeat the argument to find a z so that y has a possible
transition to z and V*(y) > V*(z), and so forth until (with positive probability) we eventually reach the
goal. □
The above theorem is useful only when we know that the optimal value function is a fixed point of our
averager. For example, it shows that bilinear interpolation will converge to the exact value function for a
gridworld, if every arc's cost is equal to its (Manhattan) length, since the value function for this MDP is
linear. If we are trying to solve a discounted MDP, on the other hand, we can prove a much stronger result:
if we only know that the optimal value function is somewhere near a fixed point of our averager, we can still
guarantee an error bound for approximate value iteration.
Theorem 6.2 Let V* be the optimal value function for a finite Markov decision process M with discount
factor γ. Let T_M be the parallel value backup operator for M. Let M_A be the mapping for an averager A.
Let V^A be any fixed point of M_A. Suppose ‖V^A − V*‖ = ε, where ‖·‖ denotes max norm. Then iteration
of T_M ∘ M_A converges to a value function V_0 so that ‖V_0 − V*‖ ≤ 2εγ / (1 − γ).
Proof: By the value contraction theorem, T_M is a contraction in max norm with factor γ. By theorem 3.2,
M_A is a nonexpansion in max norm. So, T_M ∘ M_A is a contraction in max norm with factor γ, and therefore
converges to some V_0. Repeated application of the triangle inequality and the definition of a contraction
give

    ‖V_0 − T_M(M_A(V*))‖ = ‖T_M(M_A(V_0)) − T_M(M_A(V*))‖
                         ≤ γ ‖V_0 − V*‖

    ‖T_M(M_A(V*)) − V*‖ = ‖T_M(M_A(V*)) − T_M(V*)‖
                        ≤ γ ‖M_A(V*) − V*‖
                        ≤ γ ( ‖M_A(V*) − V^A‖ + ‖V^A − V*‖ )
                        = γ ( ‖M_A(V*) − M_A(V^A)‖ + ‖V^A − V*‖ )
                        ≤ γ ( ‖V* − V^A‖ + ‖V^A − V*‖ )

    ‖V_0 − V*‖ ≤ ‖V_0 − T_M(M_A(V*))‖ + ‖T_M(M_A(V*)) − V*‖
              ≤ γ ‖V_0 − V*‖ + 2γ ‖V* − V^A‖
    (1 − γ) ‖V_0 − V*‖ ≤ 2γ ‖V* − V^A‖
    ‖V_0 − V*‖ ≤ 2εγ / (1 − γ)

which is what was required. □
If we let γ → 0, we can make the above error bound arbitrarily small. This result is somewhat counterintuitive,
since A may not even be able to represent V* exactly. The reason for this behavior is that the
final step in computing V_0 is to apply T_M; when γ = 0, this step produces V* immediately.
As mentioned in a previous section, approximate value iteration returns M_A(V_0) rather than V_0 itself. So,
an error bound for M_A(V_0) would be useful. The error bound on V_0 leads directly to a bound for M_A(V_0):

    ‖V* − M_A(V_0)‖ ≤ ‖V* − V^A‖ + ‖V^A − M_A(V_0)‖
                    = ε + ‖M_A(V^A) − M_A(V_0)‖
                    ≤ ε + ‖V^A − V_0‖
                    ≤ ε + ‖V^A − V*‖ + ‖V* − V_0‖
                    ≤ 2ε + 2εγ / (1 − γ)
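As a concrete illustration (the numbers here are ours, not drawn from the experiments): if γ = 0.9 and the nearest fixed point of M_A lies within ε = 0.1 of V* in max norm, the two bounds give

    ‖V_0 − V*‖ ≤ 2(0.1)(0.9) / (1 − 0.9) = 1.8    and    ‖V* − M_A(V_0)‖ ≤ 2(0.1) + 1.8 = 2.0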
By way of comparison, Chow and Tsitsiklis [CT89] bound the error introduced by using a grid of side h
to compute the value function for a type of continuous-state-space MDP. Writing V_h for the approximate
value function computed this way, their bound is

    ‖V* − V_h‖ ≤ (h / (1 − γ)) (K_1 + K_2 ‖V*‖_Q)

(for all sufficiently small h) where ‖·‖_Q is the span quasi-norm

    ‖F‖_Q = sup_x F(x) − inf_x F(x)

and K_1 and K_2 are constants. While their formalism allows them to discretize the available actions as well
as the state space, an approximation which we don't consider, their bound also applies to processes with a
finite number of actions. So, we can compare their bound on V_h for the case of non-discretized actions to
our bound on M_A(V_0).
Write M_h for the mapping associated with discretization on a grid of side h. (For definiteness, assume that
(M_h(V))(x) = V(x') where x' is the center of the grid cell which contains x. Almost any other reasonable
convention would also work.) The processes that Chow and Tsitsiklis consider satisfy

    |V*(x) − V*(x')| ≤ L ‖x − x'‖

for some constant L (i.e., V* is Lipschitz continuous). (We can determine L easily from the cost and
transition functions.) So, we can find a fixed point V_h^A of M_h which is close to V*, as follows. Pick a grid cell
C. Write C_H and C_L for the maximum and minimum values of V* in C; write x_H and x_L for states x ∈ C
which achieve these values. (Such states exist because V* is continuous.) We know that ‖x_H − x_L‖ ≤ h,
since the diameter of C is h. Therefore, by the Lipschitz condition, |C_H − C_L| ≤ hL. So, if we set V_h^A(x) to
be (C_H + C_L)/2 for all x ∈ C, the largest difference between V* and V_h^A on C will be hL/2. Applying our error
bound with ε = hL/2, and writing V_h for the value function that approximate value iteration with M_h returns,
now gives

    ‖V* − V_h‖ ≤ hL (1 + γ / (1 − γ))

which has slightly different constants but the same behavior as h → 0 or γ → 1 as the bound of Chow
and Tsitsiklis. Our bound is valid for any h, rather than just sufficiently small h. Their bound depends on
‖V*‖_Q; since ‖V*‖_Q is bounded by 2r / (1 − γ), where r is the largest single-step expected reward, we have folded
a similar dependence into the constant L.
The sort of error bound which we have proved is particularly useful for function approximators such as
linear interpolation which have many fixed points. (In fact, every function which is representable by a linear
interpolator is a fixed point of that interpolator's mapping. The same is true for bilinear interpolation, grids,
and 1-nearest-neighbor; it is not true for local weighted averaging or k-nearest-neighbor for k > 1.) Because
it depends on the maximum difference between V* and the fixed point of M_A, the error bound is not very
useful if V* may have large discontinuities at unknown locations: if V* has a discontinuity of height d, then
any averager which can't mimic the location of this discontinuity exactly will have no representable functions
(and therefore no fixed points) within d/2 of V*.
6.2 In practice
The most common problem with approximate value iteration is the presence of barriers in the derived MDP.
That is, sometimes the derived MDP can be divided into two pieces so that the first piece contains the goal
and the second piece has no transitions into the first. In this case, the estimated values of the states in the
second piece will be infinite. (We mentioned a special case of this situation above: if the averager ignores
the goal state, then the derived MDP will have no transitions into the goal.) A less drastic but similar
problem occurs when the second piece has only low-probability transitions to the first; in this case, the costs
for states in the second piece will not be infinite, but will still be artificially inflated.
This sort of problem is likely to happen when the MDP has short transitions and when there are large
regions where a single state dominates the averager. For a particularly bad example, suppose our function
approximator is 1-nearest-neighbor. If the transitions out of a sampled state x in M are shorter than half
the distance to the nearest adjacent sampled state, then the only transitions out of x in M' will lead straight
back to x. Similarly, in local weighted averaging with a narrow kernel, a short transition out of x in M will
translate to a high-probability self loop in M'. In both cases, the effect of the averager is to produce a drag
on transitions out of x: actions in M' don't get the agent as far on average as they did in M.
One way to reduce this drag is to make sure that no single state has the dominant weight over a large
region. The best way to do so is to sample the state space more densely; but if we could afford to do
that, we wouldn't need a function approximator in the first place. Another way is to increase a smoothing
parameter such as kernel width or number of neighbors, and so reduce the weight of each sample point in
its immediate neighborhood. Unfortunately, increased smoothing brings its own problems: it can remove
exactly the features of the value function that we are interested in. For example, if the agent must follow a
long, narrow path to the goal, the scattering effect of a wide-kernel averager is almost certain to push it off
of the path long before it reaches the end. We will see an example of this problem in the hill-car experiment
below.
Both of the above problems, too much smoothing and the introduction of barriers, can be reduced
if we can alter our MDP so that the actions move the agent farther. For example, we might look ahead two
or more time steps at each value backup. (This strategy corresponds to the dynamic programming operator
T_M^n ∘ M_A for some n > 1. Since T_M^n is the backup operator for an MDP (derived by composing n copies
of M), the previous sections' convergence theorems also apply to T_M^n ∘ M_A.) While in general the cost
of looking ahead n steps is exponential in n, there are many circumstances where we can reduce this cost
dramatically. For instance, in a physical simulation, we can choose a longer time increment; in a grid world,
we can consider only the compound actions which don't contain two steps in opposite directions; and in the
case of a Markov process, where there's only 1 action, the cost of lookahead is linear rather than exponential
in n. (In the last case, TD(λ) [Sut88] allows us to combine lookaheads at several depths.) If actions are
selected from an interval of R, numerical minimum-finding algorithms such as Newton's method or golden
section search can find a local minimum quickly. In any case, if the depth and branching factor are large
enough, standard heuristic search techniques can at least chip away at the base of the exponential.
7 Experiments
This section describes our experiments with the Markov decision problems from [BM95].
7.1 Puddle world
In this world, the state space is the unit square, and the goal is the upper right corner. The agent has four
actions, which move it up, left, right, or down by .1 per step. The cost of each action depends on the current
state: for most states, it is the distance moved, but for states within the two "puddles," the cost is higher.
See figure 6.
For a function approximator, we will use bilinear interpolation, defined as follows: to find the predicted
value at a point (x, y), first find the corners (x_0, y_0), (x_0, y_1), (x_1, y_0), and (x_1, y_1) of the grid square
containing (x, y). Interpolate along the left edge of the square between (x_0, y_0) and (x_0, y_1) to find the
predicted value at (x_0, y). Similarly, interpolate along the right edge to find the predicted value at (x_1, y).
Now interpolate across the square between (x_0, y) and (x_1, y) to find the predicted value at (x, y).
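A direct transcription of this definition into Python (our sketch; the array of corner values V and the grid spacing h are assumed inputs, with V[i, j] holding the value at the corner (i*h, j*h)):

    def bilinear(V, h, x, y):
        # V is a 2-D array of corner values; V[i, j] is the value at the grid corner (i*h, j*h).
        # Locate the lower-left corner (x0, y0) = (i*h, j*h) of the grid square containing (x, y).
        i, j = int(x // h), int(y // h)
        i, j = min(i, V.shape[0] - 2), min(j, V.shape[1] - 2)   # clamp points on the far boundary
        tx, ty = x / h - i, y / h - j                           # offsets within the square, in [0, 1]
        left  = (1 - ty) * V[i, j]     + ty * V[i, j + 1]       # along the left edge: value at (x0, y)
        right = (1 - ty) * V[i + 1, j] + ty * V[i + 1, j + 1]   # along the right edge: value at (x1, y)
        return (1 - tx) * left + tx * right                     # across the square: value at (x, y)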
Figure 6 shows the cost function for one of the actions, the optimal value function computed on a 100 × 100
grid, an estimate of the optimal value function computed with bilinear interpolation on the corners of a 7 × 7
grid (i.e., on 64 sample points), and the difference between the two estimates. Since the optimal value
function is nearly piecewise linear outside the puddles, but curved inside, the interpolation performs much
better outside the puddles: the root mean squared difference between the two approximations is 2.27 within
one step of the puddles, and .057 elsewhere. (The lowest-resolution grid which beats bilinear interpolation's
performance away from the puddles is 20 × 20; but even a 5 × 5 grid can beat its performance near the
puddles.)
7.2 Car on a hill
In this world, the agent must drive a car up to the top of a steep hill. Unfortunately, the car's motor is weak:
it can't climb the hill from a standing start. So, the agent must back the car up and get a running start. The
state space is [−1, 1] × [−2, 2], which represents the position and velocity of the car; there are two actions,
forward and reverse. (This formulation differs slightly from [BM95]: they allowed a third action, coast. We
expect that the difference makes the problem no more or less difficult.) The cost function measures time
until goal.
There are several interesting features to this world. First, the value function contains a discontinuity
despite the continuous cost and transition functions: there is a sharp transition between states where the
agent has just enough speed to get up the hill and those where it must back up and try again. Since
most function approximators have trouble representing discontinuities, it will be instructive to examine the
performance of approximate value iteration in this situation. Second, there is a long, narrow region of
state space near the goal through which all optimal trajectories must pass (it is the region where the car
is partway up the hill and moving quickly forward). So, excessive smoothing will cause errors over large
regions of the state space. Finally, the physical simulation uses a fairly small time step, .03 seconds, so we
need fine resolution in our function approximator just to make sure that we don't introduce a barrier.
The results of our experiments appear in figure 7. For a reference model, we fit a 128 × 128 grid. While
this model has 16384 parameters, it is still less than perfect: the right end of the discontinuity is somewhat
rough. (Boyan and Moore used a 200 by 200 grid to compute their optimal value function, and it shows
no perceptible roughness at this boundary.) We also fit two smaller grids, one 64 × 64 and one 32 × 32.
Finally, we fit a weighted 4-nearest-neighbor model using the 1024 centers of the cells of the 32 × 32 grid as
sample points, and another using a uniform random sample of 1000 points from the state space.
Figure 6: The puddle world. From top left: the cost of moving up, the optimal value function as seen by a
100 × 100 grid, the optimal value function as seen by bilinear interpolation on the corners of a 7 × 7 grid,
and the difference between the two value functions.
Figure 7: Approximations to the value function for the hill-car problem. From the top: the reference model,
a 32 × 32 grid, a k-nearest-neighbor model, the error of the 32 × 32 grid, and the error of the k-nearest-neighbor
model. In each plot, the horizontal axes represent the agent's position and velocity, and the vertical
axis represents the estimated time remaining until reaching the summit at x = .6.
Figure 8: Two smaller models for the hill-car world: a divergent 12 × 12 grid, and a convergent nearest-neighbor model on the same 144 sample points.
Note that the nearest-neighbor methods are roughly comparable in complexity to the 32 × 32 grid: each one requires us to evaluate about two thousand transitions in the MDP for every value backup.
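As a sketch of how such a fitter plugs into approximate value iteration, the code below implements a weighted 4-nearest-neighbor averager (inverse-distance weights, normalized to sum to one) and a backup sweep over the sample points. The weighting scheme and loop structure are illustrative assumptions, not a transcription of the experimental code; `step` stands for any MDP simulator such as the hill-car sketch above.

```python
import numpy as np

def knn_weights(query, samples, k=4, eps=1e-8):
    """Averager coefficients for one query point: nonnegative, summing to 1.
    Inverse-distance weighting is an assumption; any fixed nonnegative,
    normalized weighting stays within the averager class."""
    d = np.linalg.norm(samples - query, axis=1)
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + eps)
    return nearest, w / w.sum()

def evaluate(query, samples, values, k=4):
    """Fitted value at an arbitrary state."""
    idx, w = knn_weights(np.asarray(query), samples, k)
    return float(w @ values[idx])

def fitted_value_iteration(samples, actions, step, sweeps=200):
    """Approximate value iteration for an undiscounted shortest-time problem:
    back up each sample point by simulating every action and evaluating the
    successor state through the averager."""
    values = np.zeros(len(samples))
    for _ in range(sweeps):
        new = np.empty_like(values)
        for i, s in enumerate(samples):
            backups = []
            for a in actions:
                s2, cost, done = step(tuple(s), a)
                backups.append(cost if done
                               else cost + evaluate(s2, samples, values))
            new[i] = min(backups)
        values = new
    return values
```

With the 1024 cell centers of the 32 × 32 grid as samples and two actions, one sweep of this loop evaluates 2048 transitions of the MDP, consistent with the rough cost quoted above.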
As the difference plots show, most of the error in the smaller models is concentrated around the discontinuity in the value function. Near the discontinuity, the grids perform better than the nearest-neighbor models (as we would expect, since the nearest-neighbor models tend to smooth out discontinuities). But away from the discontinuity, the nearest-neighbor models win. The 32 × 32 nearest-neighbor model also beats the 32 × 32 grid at the right end of the discontinuity: the car is moving slowly enough here that the grid thinks that one of the actions keeps the car in exactly the same place. The nearest-neighbor model, on the other hand, smooths more, so it doesn't introduce as much drag as the grid does and doesn't have this problem. The root mean square error of the 64 × 64 grid (not shown) from the reference model is 0.190 s, and of the 32 × 32 grid is 0.336 s. The RMS error of the 4-nearest-neighbor fitter with samples at the grid points is 0.205 s. The nearest-neighbor fitter with a random sample (not shown) performs slightly worse, but still significantly better than the 32 × 32 grid (one-tailed t-test gives p = .971): its error, averaged over 5 runs, is 0.235 s.
All of the above models are fairly large: the smallest one requires us to evaluate 2000 transitions for every value backup. Figure 8 shows what happens when we try to fit a smaller model. The 12 × 12 grid is shown after 60 iterations; it is in the process of diverging, since the transitions are too short to reach the goal from adjacent grid cells. The 4-nearest-neighbor fitter on the same 144 grid points has converged; its RMS error from the reference model is 0.278 s (better than the 32 × 32 grid, despite needing to simulate fewer than one-seventh as many transitions). A 4-nearest-neighbor fitter on a random sample of size 150 (not shown) also converged, with RMS error 0.423 s.
8 Conclusions and further research
We have proved convergence for a wide class of approximate temporal difference methods, and shown experimentally that these methods can solve Markov decision processes more efficiently than grids of comparable accuracy.
Unfortunately, many popular function approximators, such as neural nets, linear regression, and CMACs, do not fall into this class (and in fact can diverge). The chief reason for divergence is exaggeration: the more a method can exaggerate small changes in its target function, the more often it diverges under temporal differencing. (In some cases, though, it is possible to detect and compensate for this instability. The grow-support algorithm of [BM95], which detects instability by interspersing TD(1) "rollouts," is a good example.)
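One way to see the exaggeration property concretely is to measure, in the max norm, how much a fitter can amplify a small perturbation of its targets: an averager's fitted values can move by no more than the perturbation, whereas a fitter such as linear regression can move by more, especially off its sample points. The sketch below is only illustrative; the two fitters and the evaluation points are my own choices, not ones from the paper.

```python
import numpy as np

def max_norm_amplification(fit, X, targets, trials=200, eps=1e-3, seed=0):
    """Estimate how much a fitter can amplify a small change in its targets,
    measured in the max norm.  Averagers give a ratio of at most 1; fitters
    that exaggerate can exceed 1, and those are the ones that risk divergence
    under temporal differencing."""
    rng = np.random.default_rng(seed)
    base = fit(X, targets)
    worst = 0.0
    for _ in range(trials):
        bumped = targets + eps * rng.uniform(-1.0, 1.0, size=len(targets))
        worst = max(worst, np.max(np.abs(fit(X, bumped) - base)) / eps)
    return worst

def one_nn_fit(X, y):
    # 1-nearest-neighbor evaluated at its own sample points: an averager,
    # so it can never amplify a perturbation.
    return y.copy()

def linear_fit(X, y):
    # Least-squares line evaluated at the samples plus one extrapolation
    # point; extrapolation is a classic source of exaggeration.
    A = np.column_stack([X, np.ones_like(X)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    X_eval = np.append(X, X.max() + (X.max() - X.min()))
    return np.column_stack([X_eval, np.ones_like(X_eval)]) @ coef
```

By construction the averager's ratio can never exceed 1; a ratio above 1, which the regression fitter typically shows, is precisely the exaggeration that the paragraph above blames for divergence.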
There is another important difference between averagers and methods like neural nets. This difference is the ability to allocate structure dynamically: an averager cannot decide to concentrate its resources on
one important region of the state space, whether or not this decision is justified. This ability is important, and it can be grafted onto averagers (for example, adaptive sampling for k-nearest-neighbor, or adaptive meshes for grids or interpolation). The resulting function approximator still does not exaggerate; but it is no longer an averager, and so is not covered by this paper's proofs. Still, methods of this sort have been shown to converge in practice [Moo94, Moo91], so there is hope that a proof is possible.
A The derived process for Q-learning
Here is the analog for Q-learning of the derived MDP theorem. The chief difference is that, where the theorem for value iteration considered the combined operator $T_M M_A$, this version considers $M_A T_M$. The difference is necessary to keep the min operation in the Q-learning backup from getting in the way. Of course, if we show that either $T_M M_A$ or $M_A T_M$ converges from any initial guess, then the other must also converge.
Theorem A.1 (Derived MDP for Q-learning) For any averager $A$ with mapping $M_A$, and for any MDP $M$ (either discounted or nondiscounted) with Q-learning backup operator $T_M$, the function $M_A T_M$ is the Q-learning backup operator for a new Markov decision process $M'$.
Proof: The domain of $A$ will now be pairs of states and actions. Write $\beta_{xayb}$ for the coefficient of $Q(y,b)$ in the approximation of $Q(x,a)$; write $k_{xa}$ and $\beta_{xa}$ for the constant and its coefficient.

Take an initial guess $Q(x,a)$. Write $Q'$ for the result of applying $T_M$ to $Q$; write $Q''$ for the result of applying $M_A$ to $Q'$. Then we have

$$
Q'(x,a) \;=\; E\bigl(c(x,a) + \min_b Q(y,b)\bigr) \;=\; c_{xa} + \sum_y p_{xay} \min_b Q(y,b),
$$

where the expectation is over the immediate cost and the successor state $y$, and

$$
\begin{aligned}
Q''(z,c) \;&=\; \sum_x \sum_a \beta_{zcxa}\, Q'(x,a) \;+\; \beta_{zc} k_{zc}\\
&=\; \sum_x \sum_a \beta_{zcxa}\Bigl(c_{xa} + \sum_y p_{xay} \min_b Q(y,b)\Bigr) \;+\; \beta_{zc} k_{zc}\\
&=\; \sum_x \sum_a \beta_{zcxa} c_{xa} \;+\; \beta_{zc} k_{zc} \;+\; \sum_x \sum_a \beta_{zcxa} \sum_y p_{xay} \min_b Q(y,b)\\
&=\; \Bigl(\sum_x \sum_a \beta_{zcxa} c_{xa} + \beta_{zc} k_{zc}\Bigr) \;+\; \sum_y \Bigl(\sum_x \sum_a \beta_{zcxa}\, p_{xay}\Bigr) \min_b Q(y,b).
\end{aligned}
$$

We now interpret the first parenthesis above as the cost of taking action $c$ from state $z$ in $M'$; the second parenthesis is the transition probability $p'_{zcy}$ for $M'$. Note that the sum $\sum_y p'_{zcy}$ will generally be less than 1; so we will make up the difference by adding a transition in $M'$ from state $z$ with action $c$ to state 1 (which is assumed as before to be cost-free and absorbing and to have $V(1) \equiv 0$). □
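For readers who prefer code to subscripts, here is a minimal sketch of this construction for a finite state-action space. The array names (`p`, `cost`, `beta`, `beta0`, `k`) are mine, chosen to mirror the proof's notation; the paper itself gives only the equations above.

```python
import numpy as np

def q_backup(Q, p, cost):
    """The Q-learning backup T_M for a finite MDP:
    Q'(x,a) = cost(x,a) + sum_y p(x,a,y) * min_b Q(y,b)."""
    return cost + np.einsum('xay,y->xa', p, Q.min(axis=1))

def derived_mdp(p, cost, beta, beta0, k):
    """Build the derived MDP M' whose Q-learning backup is M_A T_M.

    p[x, a, y]           transition probabilities of M
    cost[x, a]           expected immediate costs of M
    beta[z, c, x, a]     averager coefficient of Q(x,a) in the fit of Q(z,c)
    beta0[z, c], k[z, c] coefficient and value of the constant term

    Returns (p_new, cost_new, leak): the transition probabilities and costs
    of M', plus the probability mass `leak` sent to the cost-free absorbing
    state so that each row of p_new sums to one.
    """
    # First parenthesis in the proof: the new cost of action c in state z.
    cost_new = np.einsum('zcxa,xa->zc', beta, cost) + beta0 * k
    # Second parenthesis: the new transition probabilities p'_{zcy}.
    p_new = np.einsum('zcxa,xay->zcy', beta, p)
    leak = 1.0 - p_new.sum(axis=-1)
    return p_new, cost_new, leak
```

Because the absorbing state has value zero, `q_backup(Q, p_new, cost_new)` gives the same result as applying the averager (with coefficients `beta`, `beta0`, `k`) to `q_backup(Q, p, cost)`, which is the content of the theorem.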
Acknowledgements
I would like to thank Rich Sutton, Andrew Moore, Justin Boyan, and Tom Mitchell for their helpful conversations with me. Without their constantly poking holes in my misconceptions, this paper would never
have been written. This material is based on work supported under a National Science Foundation Graduate Research Fellowship, and by NSF grant number BES-9402439. Any opinions, ndings, conclusions, or
recommendations expressed in this publication are those of the author and do not necessarily reect the
views of the National Science Foundation.
References
[BD59] R. Bellman and S. Dreyfus. Functional approximations and dynamic programming. Mathematical Tables and Aids to Computation, 13:247–251, 1959.
[Bel58] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87–90, 1958.
[Bel61] R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.
[BKK63] R. Bellman, R. Kalaba, and B. Kotkin. Polynomial approximation - a new computational technique in dynamic programming: allocation processes. Mathematics of Computation, 17:155–161, 1963.
[Bla65] D. Blackwell. Discounted dynamic programming. Annals of Mathematical Statistics, 36:226–235, 1965.
[BM95] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: safely approximating
the value function. In G. Tesauro and D. Touretzky, editors, Advances in Neural Information
Processing Systems, volume 7. Morgan Kaufmann, 1995.
[BT89] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods.
Prentice Hall, 1989.
[CT89] C.-S. Chow and J. N. Tsitsiklis. An optimal multigrid algorithm for discrete-time stochastic
control. Technical Report P-135, Center for Intelligent Control Systems, 1989.
[Day92] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3–4):341–362, 1992.
[FF62] L. R. Ford, Jr. and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962.
[JJS94] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
[Lin92] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8(3–4):293–322, 1992.
[Moo91] A. W. Moore. Variable resolution dynamic programming: efficiently learning action maps in
multivariate real-valued state-spaces. In L. Birnbaum and G. Collins, editors, Machine Learning:
Proceedings of the eighth international workshop. Morgan Kaufmann, 1991.
[Moo94] A. W. Moore. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural
Information Processing Systems, volume 6. Morgan Kaufmann, 1994.
[Sab93] P. Sabes. Approximating Q-values with basis function representations. In Proceedings of the Fourth
Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.
[Sam59] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
[SS94] S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Manuscript
in preparation, 1994.
[Sut88] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
[Tes90] G. Tesauro. Neurogammon: a neural network backgammon program. In IJCNN Proceedings III, pages 33–39, 1990.
[TS93] S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.
[Tsi94] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
[Wat89] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989.
[WD92] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
[Wit77] I. H. Witten. An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34:286–295, 1977.