Stable Function Approximation in Dynamic Programming

Geoffrey J. Gordon
January 1995
CMU-CS-95-103

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that generalize beyond the observed data. We provide a proof of convergence for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and show experimentally that these methods can be useful. The proof is based on a view of function approximators as expansion or contraction mappings. In addition, we present a novel view of approximate value iteration: an approximate algorithm for one environment turns out to be an exact algorithm for a different environment.

The author's current e-mail address is ggordon@cs.cmu.edu. This material is based on work supported under a National Science Foundation Graduate Research Fellowship, and by NSF grant number BES-9402439. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation or the United States Government.

Keywords: machine learning, dynamic programming, temporal differences, value iteration, function approximation

1 Introduction and background

The problem of temporal credit assignment, deciding which of a series of actions is responsible for a delayed reward or penalty, is an integral part of machine learning. The methods of temporal differences are one approach to this problem. In order to learn how to select actions, they learn how easily the agent can achieve a reward from various states of its environment. Then, they weigh the immediate rewards for an action against its long-term consequences: a small immediate reward may be better than a large one, if the small reward allows the agent to reach a high-payoff state. If a temporal difference method discovers a low-cost path from one state to another, it will remember that the first state can't be much harder to get rewards from than the second; in this way, information propagates backwards from states in which an immediate reward is possible to those from which the agent can only achieve a delayed reward. One of the first examples of a temporal difference method was the Bellman-Ford single-destination shortest paths algorithm [Bel58, FF62], which learns paths through a graph by repeatedly updating the estimated distance-to-goal for each node based on the distances for its neighbors. At around the same time, research on optimal control led to the solution of Markov processes and Markov decision processes (see below) by temporal difference methods [Bel61, Bla65]. More recently [Wit77, Sut88, Wat89], researchers have attacked the problem of solving an unknown Markov process or Markov decision process by experimenting with it. Many of the above methods have proofs of convergence [BT89, WD92, Day92, JJS94, Tsi94]. Unfortunately, most of these proofs assume that we represent our solution exactly and therefore expensively, so that solving a Markov decision problem with n states requires O(n) storage.
On the other hand, it is perfectly possible to perform temporal differencing on an approximate representation of the solution to a decision problem: Bellman discusses quantization and low-order polynomial interpolation in [Bel61], and approximation by orthogonal functions in [BD59, BKK63]. These approximate temporal difference methods are not covered by the above convergence proofs. But, if they do converge, they can allow us to find numerical solutions to problems which would otherwise be too large to solve. Researchers have experimented with a number of approximate temporal difference methods. Results have been mixed: there have been notable successes, including Samuels' checkers player [Sam59], Tesauro's backgammon player [Tes90], and Lin's robot navigation [Lin92]. But these algorithms are notoriously unstable; Boyan and Moore [BM95] list several embarrassingly simple situations where popular approximate algorithms fail miserably. Some possible reasons for these failures are given in [TS93, Sab93]. Several researchers have recently conjectured that some classes of function approximators work more reliably with temporal difference methods than others. For example, Sutton [SS94] has provided experimental evidence that linear functions of coarse codes, such as CMACs, can converge reliably for some problems. He has also suggested that online exploration of a Markov decision process can help to concentrate the representational power of a function approximator in the important regions of the state space. We will prove convergence for a significant class of approximate temporal difference algorithms, including algorithms based on k-nearest-neighbor, linear interpolation, some types of splines, and local weighted averaging. These algorithms will converge when applied either to discounted decision processes or to an important subset of nondiscounted decision processes. We will give sufficient conditions for convergence to the exact value function, and for discounted processes we will bound the maximum error between the estimated and true value functions.

2 Definitions and basic theorems

Our theorems in the following sections will be based on two views of function approximators. First, we will cast function approximators as expansion or contraction mappings; this distinction captures the essential difference between approximators that can exaggerate changes in their training values, like linear regression and neural nets, and those like k-nearest-neighbor that respond conservatively to changes in their inputs. Second, we will show that approximate temporal difference learning with some function approximators is equivalent to exact temporal difference learning for a slightly different problem. To aid the statement of these theorems, we will need several definitions.

Consider a vector space S and a norm ‖·‖ on S. If S is closed under limits in ‖·‖, then S is a complete vector space. Unless otherwise noted, all vector spaces will be assumed to be complete. Examples of complete vector spaces include the real numbers under absolute value and the n-vectors of real numbers under Manhattan (L1), Euclidean (L2), and max (L∞) norms. The max norm, defined as

    ‖a‖_∞ = max_i |a_i|

and the weighted max norm with weight vector W, defined as

    ‖a‖_W = max_i (1/W_i) |a_i|

are particularly important for reasoning about Markov decision problems.

Definition: A function f from a vector space S to itself is a contraction mapping if, for all points a and b in S, ‖f(a) − f(b)‖ ≤ β ‖a − b‖.
Here β, the contraction factor or modulus, is some real number in [0, 1). If we merely have ‖f(a) − f(b)‖ ≤ ‖a − b‖, we call f a nonexpansion. For example, the function f(x) = 5 + x/2 is a contraction with contraction factor 1/2. The identity function is a nonexpansion. All contractions are nonexpansions.

Lemma 2.1 Let C and D be contractions with contraction factors β and δ on some vector space S. Let M and N be nonexpansions on S. Then C∘N and N∘C are each contractions with contraction factor β; C∘D is a contraction with factor βδ; and M∘N is a nonexpansion.

Definition: A point x is a fixed point of the function f if f(x) = x. A function may have any number of fixed points. For example, the function x² on the real line has two fixed points, 0 and 1; any number is a fixed point of the identity function; and x + 1 has no fixed points.

Theorem 2.1 (Contraction Mapping) Let S be a vector space with norm ‖·‖. Suppose f is a contraction mapping on S with contraction factor β. Then f has exactly one fixed point x* in S. For any initial point x0 in S, the sequence x0, f(x0), f(f(x0)), ... converges to x*; the rate of convergence of the above sequence in the norm ‖·‖ is at least β.

For example, the function 5 + x/2 has exactly one fixed point on the real line, namely x = 10. If S is a finite-dimensional vector space, then convergence in any norm implies convergence in all norms.

Definition: A Markov decision process is a tuple (S, A, δ, c, γ, S0). MDPs are a formalism for describing the experiences of an agent interacting with its environment. The set S is the state space; A is the action space. At any given time t, the environment is in some state x_t ∈ S. The agent perceives x_t, and is allowed to choose an action a_t ∈ A. The transition function δ (which may be probabilistic) then acts on x_t and a_t to produce a next state x_{t+1}, and the process repeats. S0 is a distribution on S which gives the probability of being in each state at time 0. The cost function c (which may be probabilistic) measures how well the agent is doing: at each time step t, the agent incurs a cost c(x_t, a_t). The agent must act to minimize the expected discounted cost E[Σ_{t=0}^{∞} γ^t c(x_t, a_t)]; γ ∈ [0, 1] is called the discount factor. We will write V*(x) for the optimal value function, the minimal possible expected discounted cost starting from state x. We introduce conditions below under which V* is unique and well-defined. We will say that an MDP is deterministic if the functions c(x, a) and δ(x, a) are deterministic for all x and a, i.e., if the current state and action uniquely determine the cost and the next state. An MDP is finite if its state and action spaces are finite; it is discounted if γ < 1. We will call an MDP a Markov process if |A| = 1. In a Markov process, we cannot influence the expected discounted cost; our goal is merely to compute it. There are several ways that we can ensure that V* exists. In a finite discounted MDP, it is sufficient to require that the cost function c(x, a) have bounded mean and variance for all x and a. For a nondiscounted MDP, even if c is bounded, cycles may cause the expected total cost for some states to be infinite. So, we are usually interested in the case where some set of states G is absorbing and cost-free: that is, if we are in G at time t, we will be in G at time t + 1, and c(x, a) = 0 for any x ∈ G and a ∈ A. Without loss of generality, we may lump all such states together and replace them by a single state. Suppose that state 1 of an MDP is absorbing.
Call an action selection strategy proper if, no matter what state we start in, following the strategy ensures that P(x_t = 1) → 1 as t → ∞. A finite nondiscounted MDP will have a well-defined optimal value function as long as the cost function has bounded mean and variance, there exists a proper strategy, and there does not exist a strategy which has expected cost equal to −∞ from some initial state. From now on, we will assume that all MDPs that we consider have a well-defined V*. We will also assume that S0 puts a nonzero probability on every state in S. This allows us to avoid worrying about inaccessible states. If we have two MDPs M1 = (S, A1, δ1, c1, γ1, S0) and M2 = (S, A2, δ2, c2, γ2, S0) which share the same state space, we can define a new MDP M12, the composition of M1 and M2, by alternately following transitions from M1 and M2. More formally, let M12 = (S, A1 × A2, δ12, c12, γ1γ2, S0). At each step, the agent will select one action from A1 and one from A2; we define the composite transition function δ12 so that δ12(x, (a1, a2)) = δ2(δ1(x, a1), a2). The cost of the composite action will be c12(x, (a1, a2)) = c1(x, a1) + γ1 c2(δ1(x, a1), a2). A trajectory is a sequence of tuples (x0, a0, c0), (x1, a1, c1), ...; trajectories describe the experiences of an agent acting in an MDP. If the MDP is absorbing, there will be a point t so that c_t = c_{t+1} = ... = 0; we will usually omit the portion of the trajectory after t. Define a policy to be a function π : S → A. An agent may follow policy π by choosing action π(x) whenever it is in state x. It is possible to generalize the above definition to include randomized strategies and strategies which change over time; but the extra generality is unnecessary. It is well known [Bel61, BT89] that every Markov decision process with a well-defined V* has at least one optimal policy π*; an agent which follows π* will do at least as well as any other agent, including agents which choose actions according to non-policies. The policy π* will satisfy Bellman's equation

    (∀x ∈ S)  V*(x) = E(c(x, π*(x)) + γ V*(δ(x, π*(x))))

and every policy which satisfies Bellman's equation is optimal. There are two broad classes of learning problems for Markov decision processes: online and offline. In both cases, we wish to compute an optimal policy for some MDP. In the offline case, we are allowed access to the whole MDP, including the cost and transition functions; in the online case, we are only given S and A, and then must discover what we can about the MDP by interacting with it. (In particular, in the online case, we are not free to try an action from any state; we are limited to acting in the current state.) We can transform an online problem into an offline one by observing one or more trajectories, estimating the cost and transition functions, and then pretending that our estimates are the truth. (This approach is called the certainty equivalent method.) Similarly, we can transform an offline problem into an online one by pretending that we don't know the cost and transition functions. Most of the remainder of the paper deals with offline problems, but we mention online problems again in section 5. In the offline case, the optimal value function tells us the optimal policies: we may set π*(x) to be any a which minimizes E(c(x, a) + γ V*(δ(x, a))). (In the online case, V* is not sufficient, since we can't compute the above expectation.) For a finite MDP, we can find V* by dynamic programming.
With appropriate assumptions, repeated application of the dynamic programming backup operator

    V(x) ← min_{a∈A} E(c(x, a) + γ V(δ(x, a)))

to every state is guaranteed to converge to V* from any initial guess [BT89]. (In the case of a nondiscounted problem with cost-free absorbing state g, we define the backup operator to set V(g) ← 0 as a special case.) This dynamic programming algorithm is called value iteration. If we need to solve an infinite MDP that satisfies certain continuity conditions, we may first approximate it by a finite MDP as described in [CT89], then solve the finite MDP by value iteration. For this reason, the remainder of the paper will focus on finite (although possibly very large) Markov decision processes. We can generalize the above single-state version of the value iteration backup operator to allow parallel updating: instead of merely changing our estimate for one state at a time, we compute the new value for every state before altering any of the estimates. The result of this change is the parallel value iteration operator. The following two theorems imply the convergence of parallel value iteration. See [BT89] for proofs.

Theorem 2.2 (value contraction) The parallel value iteration operator for a discounted Markov decision process is a contraction in max norm, with contraction factor equal to the discount. If all policies in a nondiscounted Markov decision process are proper, then the parallel value iteration operator for that process is a contraction in some weighted max norm. The fixed point of each of these operators is the optimal value function for the MDP.

Theorem 2.3 Let a nondiscounted Markov decision process have at least one proper policy, and let all improper policies have expected cost equal to +∞ for at least one initial state. Then the parallel value iteration operator for that process converges from any initial guess to the optimal value function for that process.

3 Main results: a simple case

In this section, we will consider only discounted Markov decision processes. The following sections generalize the results to other interesting cases. Suppose that T is the parallel value backup operator for a Markov decision process M. In the basic value iteration algorithm, we start off by setting V0 to some initial guess at M's value function. Then we repeatedly set V_{i+1} to be T(V_i) until we either run out of time or decide that some V_n is a sufficiently accurate approximation to V*. Normally we would represent each V_i as an array of real numbers indexed by the states of M; this data structure allows us to represent any possible value function exactly. Now suppose that we wish to represent V_i, not by a lookup table, but by some other more compact data structure such as a neural net. We immediately run into two difficulties. First, computing T(V_i) generally requires that we examine V_i(x) for nearly every x in M's state space; and if M has enough states that we can't afford a lookup table, we probably can't afford to compute V_i that many times either. Second, even if we can represent V_i exactly with a neural net, there is no guarantee that we can also represent T(V_i). To address both of these difficulties, we will assume that we have a sample X0 of states from M. X0 should be small enough that we can examine each element repeatedly; but it should be large enough and representative enough that we can learn something about M by examining only the states in X0. Now we can define an approximate value iteration algorithm.
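Before turning to the approximate version, note that the exact tabular algorithm above can be written in a few lines. The following Python sketch is illustrative only: the small MDP, the array names, and the stopping tolerance are assumptions of this example, not anything taken from the paper.

    import numpy as np

    # Minimal sketch of exact parallel value iteration on a small finite MDP.
    # P[a, x, y] is the (assumed) probability of moving from state x to state y
    # under action a, and C[x, a] is the expected cost of taking a in x.
    P = np.array([[[0.8, 0.2, 0.0],
                   [0.0, 0.8, 0.2],
                   [0.0, 0.0, 1.0]],
                  [[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.5],
                   [0.0, 0.0, 1.0]]])
    C = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [0.0, 0.0]])          # state 2 is cost-free and absorbing
    gamma = 0.9

    def parallel_backup(V):
        """Apply the parallel value iteration operator T to the whole vector V."""
        Q = C + gamma * np.einsum('axy,y->xa', P, V)   # E(c(x,a) + gamma V(delta(x,a)))
        return Q.min(axis=1)

    V = np.zeros(3)                      # any initial guess works
    for _ in range(1000):
        V_new = parallel_backup(V)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    print(V)                             # approximates the optimal value function V*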
Rather than setting V_{i+1} to T(V_i), we will first compute (T(V_i))(x) only for x ∈ X0; then we will fit our neural net (or other approximator) to these training values and call the resulting function V_{i+1}. In order to reason about approximate value iteration, we will consider function approximation methods themselves as operators on the space of value functions: given any target value function, the approximator will produce a fitted value function, as shown in figure 1. In the figure, the sample X0 is the first five natural numbers, and the representable functions are the cubic splines with knots in X0. The characteristics of the function approximation operator determine how it behaves when combined with value iteration. One particularly important property is illustrated in figure 2. As the figure shows, linear regression can exaggerate the difference between two target value functions V1 and V2: a small difference between the targets V1(x) and V2(x) can lead to a larger difference between the fitted values V̂1(x) and V̂2(x). Many function approximators, such as neural nets and local weighted regression, can exaggerate this way; others, such as k-nearest-neighbor, can not. We will show later that this sort of exaggeration can cause instability in an approximate value iteration algorithm.

Figure 1: Function approximation methods as mappings. In (a) we see the value function for a simple random walk on the positive real line. (On each transition, the agent has an equal probability of moving left or right by one step. State 0 is absorbing; transitions from other states have cost 1.) Applying a function approximator (in this case, fitting a spline with knots at the first five natural numbers) maps the value function in (a) to the value function in (b). Since the function approximator discards some information, its mapping can't be 1-to-1: in (c) we see a different value function which the approximator also maps to (b).

Figure 2: The mapping associated with linear regression when samples are taken at the points x = 0, 1, 2. In (a) we see a target value function (solid line) and its corresponding fitted value function (dotted line). In (b) we see another target function and another fitted function. The first target function has values y = 0, 0, 0 at the sample points; the second has values y = 0, 1, 1. Regression exaggerates the difference between the two functions: the largest difference between the two target functions at a sample point is 1 (at x = 1 and x = 2), but the largest difference between the two fitted functions at a sample point is 7/6 (at x = 2).

Suppose we wish to approximate a function from a space S to a vector space R. Fix a sample vector X0 of points from S, and fix a function approximation scheme A. Now for each possible vector Y of target values in R, A will produce a function f̂ from S to R. Define Ŷ to be the vector of fitted values; that is, the i-th element of Ŷ will be f̂ applied to the i-th element of X0. Now define M_A, the mapping associated with A, to be the function which takes each possible Y to its corresponding Ŷ. Now we can apply the powerful theorems about contraction mappings to function approximation methods. In fact, it will turn out that if M_A is a nonexpansion in an appropriate norm, the combination of A with value iteration is stable. (That is, under the usual assumptions, value iteration will converge to some approximation of the value function.)
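To make this operator view concrete, here is a minimal sketch of approximate value iteration with an averager-style approximator, on an assumed toy MDP and with assumed names. The fixed row-stochastic matrix W plays the role of the mapping M_A (each state's fitted value is an average of the values stored at two nearby sample states), and the loop alternates M_A with the backup operator T restricted to the sample X0.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 20, 2, 0.9
    # Illustrative random MDP: P[a, x, y] transition probabilities, C[x, a] costs.
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    C = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    # Sample states X0 and approximator weights: each state's fitted value is the
    # mean of the target values at its 2 nearest sample states (by index distance,
    # purely for illustration).  W is row-stochastic, so the induced mapping is a
    # nonexpansion in max norm.
    X0 = np.arange(0, n_states, 4)
    W = np.zeros((n_states, len(X0)))
    for x in range(n_states):
        nearest = np.argsort(np.abs(X0 - x))[:2]
        W[x, nearest] = 0.5

    def backup_at(V_full, states):
        """(T(V))(x) for x in `states`, where V_full is defined on all states."""
        Q = C[states] + gamma * np.einsum('axy,y->xa', P[:, states], V_full)
        return Q.min(axis=1)

    theta = np.zeros(len(X0))            # fitted values at the sample states
    for _ in range(200):
        V_full = W @ theta               # M_A: extend sample values to all states
        theta = backup_at(V_full, X0)    # T restricted to the sample X0
    print(W @ theta)                     # the fitted approximation of V*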
The rest of this section states the required property more formally, then proves that some common function approximators have this property.

Definition: Two operators M_F and T on the space S are compatible if repeated application of M_F ∘ T is guaranteed to converge to some x ∈ S from any initial guess x0 ∈ S. Note that compatibility is symmetric: if the sequence of operators M_F, T, M_F, T, M_F, ... causes convergence from any initial guess x0, then it also causes convergence from T(x0); so the sequence T, M_F, T, M_F, ... also causes convergence.

Theorem 3.1 Let T_M be the parallel value backup operator for some Markov decision process M with state space S, action space A, and discount γ < 1. Let X = S × A. Let F be a function approximator (for functions from X to R) with mapping M_F : R^|X| → R^|X|. Suppose M_F is a nonexpansion in max norm. Then M_F is compatible with T_M, and M_F ∘ T_M has contraction factor γ.

Figure 3: A Markov process and a CMAC which are incompatible. Part (a) shows the process. Its goal is state 1. On each step, with probability 95%, the process follows a solid arrow, and with probability 5%, it follows a dashed arrow. All arc costs are zero. Part (b) shows the CMAC, which has 4 receptive fields each covering 3 nodes. If the CMAC starts out with all predictions equal to 1, approximate value iteration produces the series of target values 1, 77/60, (77/60)², ... for state 2.

Proof: By the value contraction theorem, T_M is a contraction in max norm with factor γ. By assumption, M_F is a nonexpansion in max norm. Therefore M_F ∘ T_M is a contraction in max norm with factor γ. □

Corollary 3.1 The approximate value iteration algorithm based on F converges in max norm at the rate γ when applied to M.

It remains to show which function approximators are compatible with value iteration. Many common approximators are incompatible with standard parallel value backup operators. For example, as figure 2 demonstrates, linear regression can be an expansion in max norm; and Boyan and Moore [BM95] show that the combination of value iteration with linear regression can diverge. Other incompatible methods include standard feedforward neural nets, some forms of spline fitting, and local weighted regression [BM95]. A particularly interesting case is the CMAC. Sutton [SS94] has recently reported success in combining value iteration with a CMAC, and has suggested that function approximators similar to the CMAC are likely to allow temporal differencing to converge in the online case. While we have no evidence to support or refute this suggestion, it is worth mentioning that convergence is not guaranteed in our offline framework, as the counterexample in figure 3 shows. In this example, the optimal value function is uniformly zero, since all arc costs are zero. The exact value backup operator assigns 0 to V(1) (since state 1 is the goal) and .05 V(1) + .95 V(2) to V(i) for i ≠ 1 (since all states except 1 have a 5% chance of transitioning to state 1 and a 95% chance of transitioning to state 2). The approximate backup operator computes these same 6 numbers as training data for the CMAC; but since the CMAC can't represent this function exactly, our output is the closest representable function in the sense of least squared error. If the exact operator produces kv where v = (0, 1, 1, 1, 1, 1)^T, then the closest representable function will be kw where w = (1/3, 4/3, 5/6, 5/6, 5/6, 5/6)^T.
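This fit can be checked numerically. The sketch below assumes a receptive-field layout of {1,2,3}, {2,3,4}, {3,4,5}, {4,5,6} over the six states, with the goal taken to be the state covered only by the first field and the 95%-successor taken to be the state covered by the last three fields; both index choices are assumptions about the figure's labeling. Under those assumptions, the least-squares fit reproduces the values 1/3, 4/3, and 5/6 above, and each sweep multiplies the non-goal targets by exactly 77/60.

    import numpy as np

    # Assumed CMAC layout: 4 receptive fields, each covering 3 of the 6 states.
    fields = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
    A = np.zeros((6, 4))
    for j, f in enumerate(fields):
        A[list(f), j] = 1.0     # CMAC prediction = sum of weights of covering fields

    goal, hub = 0, 3            # goal state; state that the 95% arrows lead to (assumed)
    target = np.zeros(6)
    target[1:] = 1.0            # first round of exact backups, starting from V = 1

    for i in range(4):
        theta, *_ = np.linalg.lstsq(A, target, rcond=None)   # least-squares CMAC fit
        fitted = A @ theta                # (1/3, 5/6, 5/6, 4/3, 5/6, 5/6) times a scale factor
        new_val = 0.05 * fitted[goal] + 0.95 * fitted[hub]   # exact backup for non-goal states
        print(i, new_val / target[hub])   # prints 77/60 = 1.2833... each sweep
        target = np.zeros(6)
        target[1:] = new_val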
If we repeat the process from this new value function, the exact backup operator will now produce (77/60) k v, and the CMAC will produce (77/60) k w. Further iteration causes divergence at the rate 77/60. The CMAC still diverges even if we choose a small learning rate: with learning rate α, the rate of divergence is 1 + 17α/60. However, if we train the CMAC based on actual trajectories in the Markov process, as Sutton suggests, we no longer diverge: transitions out of state 2, which lower the coefficients of our CMAC substantially, are as frequent as transitions into state 2, which raise the coefficients somewhat. In fact, since this example is a Markov process rather than a Markov decision process, convergence of the online algorithm is guaranteed by a theorem in [Day92]. We will prove that a broad class of approximation methods is compatible with value iteration. This class includes kernel averaging, k-nearest-neighbor, weighted k-nearest-neighbor, Bezier patches, linear interpolation on a triangular (or tetrahedral, etc.) mesh, bilinear interpolation on a square (or cubical, etc.) mesh, and many others. (See the Experiments section for a definition of bilinear interpolation. Note that the square mesh is important: on non-rectangular meshes, bilinear interpolation will sometimes need to extrapolate.)

Definition: A real-valued function approximation scheme is an averager if every fitted value is the weighted average of zero or more target values and possibly some predetermined constants. The weights involved in calculating the fitted value Ŷ_i may depend on the sample vector X0, but may not depend on the target values Y. More precisely, for a fixed X0, if Y has n elements, there must exist n real numbers k_i, n² nonnegative real numbers β_ij, and n nonnegative real numbers β_i, so that for each i we have β_i + Σ_j β_ij = 1 and Ŷ_i = β_i k_i + Σ_j β_ij Y_j.

It should be obvious that all of the methods mentioned before the definition are averagers: in all cases, the fitted value at any given coordinate is a weighted average of target values, and the weights are determined by distances in X values, and so are unaffected by the target Y values.

Theorem 3.2 The mapping M_F associated with any averaging method F is a nonexpansion in max norm, and is therefore compatible with the parallel value backup operator for any discounted MDP.

Proof: Fix the β's and k's for F as in the above definition. Let Y and Z be two vectors of target values. Consider a particular coordinate i. Then, letting ‖·‖ denote max norm,

    [M_F(Y) − M_F(Z)]_i = (β_i k_i + Σ_j β_ij Y_j) − (β_i k_i + Σ_j β_ij Z_j)
                        = Σ_j β_ij (Y_j − Z_j)
                        ≤ max_j (Y_j − Z_j)
                        ≤ ‖Y − Z‖

That is, every element of (M_F(Y) − M_F(Z)) is no larger than ‖Y − Z‖. Therefore, the max norm of (M_F(Y) − M_F(Z)) is no larger than ‖Y − Z‖. In other words, M_F is a nonexpansion in max norm. □

4 Nondiscounted processes

Consider now a nondiscounted Markov decision process M. Suppose for the moment that all policies for M are proper. Then the value contraction theorem states that the parallel value backup operator T_M for this process is a contraction in some weighted max norm ‖·‖_W. The previous section proved that if the approximation method A is an averager, then M_A is a nonexpansion in (unweighted) max norm. If we could prove that M_A were also a nonexpansion in ‖·‖_W, we would have M_A compatible with T_M. Unfortunately, M_A may be an expansion in ‖·‖_W; in fact parts (a) and (b) of figure 4 show a simple example of a nondiscounted MDP and an averager which are incompatible.
(Part (c) is a simple proof that the MDP and the averager are incompatible; see below for an explanation.) Fortunately, there are averagers which are compatible with nondiscounted MDPs. The proof relies on an intriguing property of averagers: we can view any averager as a Markov process, so that state x has a transition to state y whenever β_xy > 0, i.e., whenever the fitted V(x) depends on the target V(y). Part (b) of figure 4 shows one example of a simple averager viewed as a Markov process; this averager has β_11 = β_23 = β_33 = 1 and all other coefficients zero. If we view an averager as a Markov process, and compose this process with our original MDP, we will derive a new MDP. Part (c) of figure 4 shows a simple example; a slightly more complicated example is in figure 5. As the following theorem shows, exact value iteration on this derived MDP is the same as approximate value iteration on the original MDP.

Theorem 4.1 (Derived MDP) For any averager A with mapping M_A, and for any MDP M (either discounted or nondiscounted) with parallel value backup operator T_M, the function T_M ∘ M_A is the parallel value backup operator for a new Markov decision process M′.

Figure 4: A nondiscounted deterministic Markov process, and an averager with which it is incompatible. The process is shown in (a); the goal is state 1, and all arc costs except at the goal are 1. In (b) we see an averager, represented as a Markov process: states 1 and 3 are unchanged, while V(2) is replaced by V(3). The derived Markov process is shown in (c); state 3 has been disconnected, so its value estimate will diverge.

Figure 5: An example of the construction of the derived Markov process. Part (a) shows a deterministic Markov process: its state space is the unit triangle, and on every step the agent moves a constant distance towards the origin. The value of each state is simply its distance from the origin, so the value function is nonlinear. For our function approximator, we will use linear interpolation on the three corners of the triangle. Part (b) shows a representative transition from the derived process: as before, the agent moves towards the goal, but then the averager moves the agent randomly to one of the three corners. On average, this scattering moves the agent back away from the goal, so steps in the derived process don't move the agent as far on average as they did in the original process. Part (c) shows the expected progress the agent makes on each step. The value function for the derived process is V(x, y) = x + y.

Proof: Define the derived MDP M′ as follows. It will have the same state and action spaces as M, and it will also have the same discount factor and initial distribution. We can assume without loss of generality that state 1 of M is cost-free and absorbing: if not, we can renumber the states of M starting at 2, add a new state 1 which satisfies this property, and make all of its incoming transition probabilities zero. We can also assume, again without loss of generality, that β_1 = 1 and k_1 = 0 (that is, that A always sets V(1) = 0); again, if this property does not already hold for M, we can add a new state 1. Suppose that, in M, action a in state x takes us to state y with probability p_axy. Suppose that A replaces V(y) by β_y k_y + Σ_z β_yz V(z).
Then we will define the transition probabilities in M′ for state x and action a to be

    p′_axz = Σ_y p_axy β_yz              (z ≠ 1)
    p′_ax1 = Σ_y p_axy (β_y1 + β_y)

These transition probabilities make sense: since A is an averager, we know that Σ_z β_yz + β_y is 1, so

    Σ_z p′_axz = Σ_{z≠1} Σ_y p_axy β_yz + Σ_y p_axy (β_y1 + β_y)
               = Σ_y p_axy (Σ_z β_yz + β_y)
               = Σ_y p_axy = 1

Now suppose that, in M, performing action a from state x yields expected cost c_xa. Then performing action a from state x in M′ yields expected cost

    c′_xa = c_xa + Σ_y p_axy β_y k_y

Now the parallel value backup operator T_M′ for M′ is

    V(x) ← min_a E(c′(x, a) + V(δ′(x, a)))
         = min_a Σ_z p′_axz (c′_xa + V(z))
         = min_a ( Σ_{z≠1} Σ_y p_axy β_yz (c′_xa + V(z)) + Σ_y p_axy (β_y1 + β_y)(c′_xa + V(1)) )
         = min_a Σ_y p_axy ( Σ_{z≠1} β_yz (c′_xa + V(z)) + (β_y1 + β_y) c′_xa )        (using V(1) = 0)
         = min_a Σ_y p_axy ( c′_xa + Σ_{z≠1} β_yz V(z) )
         = min_a ( c′_xa + Σ_y p_axy Σ_{z≠1} β_yz V(z) )
         = min_a ( c_xa + Σ_y p_axy β_y k_y + Σ_y p_axy Σ_{z≠1} β_yz V(z) )

On the other hand, the parallel value backup operator for M is

    V(x) ← min_a E(c(x, a) + V(δ(x, a)))
         = min_a Σ_y p_axy (c_xa + V(y))

If we replace V(y) by its approximation under A, the operator becomes T_M ∘ M_A:

    V(x) ← min_a Σ_y p_axy ( c_xa + β_y k_y + Σ_z β_yz V(z) )
         = min_a ( c_xa + Σ_y p_axy β_y k_y + Σ_y p_axy Σ_{z≠1} β_yz V(z) )

which is exactly the same as T_M′ above. □

Given an initial estimate V0 of the value function, approximate value iteration begins by computing M_A(V0), the representation of V0 in A. Then it alternately applies T_M and M_A to produce the series of functions V0, M_A(V0), T_M(M_A(V0)), M_A(T_M(M_A(V0))), .... (In an actual implementation, only the functions M_A(...) would be represented explicitly; the functions T_M(...) would just be sampled at the points X0.) On the other hand, exact value iteration on M′ produces the series of functions V0, T_M M_A(V0), T_M M_A(T_M M_A(V0)), .... This series obviously contains exactly the same information as the previous one. The only difference between the two algorithms is that approximate value iteration would stop at one of the functions M_A(...), while iteration on M′ would stop at one of the functions T_M(...). Now we can see why the combination in figure 4 diverges: the derived MDP has a state with infinite cost. So, in order to prove compatibility, we need to guarantee that the derived MDP is well-behaved. If the arc costs of a discounted MDP M have finite mean and variance, it is obvious that the arc costs of M′ also have finite mean and variance. That means that T_M′ = T_M ∘ M_A converges in max norm at the rate γ; i.e., we have just proven again that M_A is compatible with T_M. More importantly, if M is a finite nondiscounted process, there are averagers which are compatible with it. For example, if A uses weight decay (i.e., if β_y > 0 for all y), then M′ will have all policies proper, since any action in any state has a nonzero probability of bringing us immediately to state 1. More generally, if M has only proper policies, we may partition its state space by distance from state 1, as follows (see [BT89]). Let S1 = {1}. Now recursively define U_k = ∪_{j<k} S_j and

    S_k = { x : (x ∉ U_k) and (min_{a∈A} max_{y∈U_k} P(δ(x, a) = y) > 0) }

Bertsekas and Tsitsiklis show that this partitioning is exhaustive; so we may set k(x) to be the unique k so that x ∈ S_k. If M is finite, there will be finitely many nonempty S_k. We will say that an averager A is self-weighted for M if for every state y, either β_y > 0 or there exists a state x so that k(x) ≤ k(y) and β_yx > 0.
Now, if A is self-weighted for a finite nondiscounted MDP M, then M′ will have only proper policies: if at some time we are in state x, then no matter what action a we take, there is a nonzero chance that we will immediately follow an arc to a state in S_k for k < k(x). (By definition of the partition, there is a transition in M under a from x to some y so that k(y) < k(x). By the self-weighting property, either β_yz must be nonzero for some z so that k(z) ≤ k(y), in which case x has a possible transition in M′ under a to z; or β_y > 0, in which case x has a possible transition in M′ under a to state 1.) If we follow such an arc, there is then a nonzero chance that we will immediately follow another arc to a state in S_{k′} for k′ < k, and so forth until we eventually (with positive probability) reach partition S1 and therefore state 1. Let m be the largest integer so that S_m is nonempty. Then the previous discussion shows that, no matter what state we start from, we have a positive probability of reaching state 1 in m − 1 steps. Call the smallest such probability ε. Then in k(m − 1) steps, we have probability at least 1 − (1 − ε)^k of reaching state 1. The limit of this quantity as k → ∞ is 1; so with probability 1 we eventually absorb from any initial state. We have just proven

Corollary 4.1 Let M be a finite nondiscounted Markov decision process. Suppose all policies in M are proper. Let A be an averager which is self-weighted for M. Let T_M and M_A be the parallel backup operator for M and the mapping associated with A. Then T_M and M_A are compatible, and the approximate value iteration algorithm based on A will converge if it is applied to M.

If A ignores V(x) for all states x not in some sample X0, then the states in X0 will play an important role in the derived MDP M′. While the initial state of a trajectory may be outside of X0, all transitions in M′ lead to states in X0, so after one step the trajectory will enter X0 and stay there indefinitely. This means that we can characterize a large part of M′ by looking only at its behavior on X0, which is just what we need for a tractable algorithm. (It is worth mentioning that, if M is nondiscounted, X0 should contain the goal state: if it doesn't, M′ will have no transitions into the goal, so all of its values will be infinite.)

5 The online problem and Q-learning

The results of the previous sections carry over directly to a gradual version of the parallel value backup operator

    V(x) ←_{α_x} min_{a∈A} E(c(x, a) + γ V(δ(x, a)))

in which, rather than replacing V(x) by its computed update on each step, we take a weighted average of the old and new values of V(x). (The weights α_x may differ for each x, and may change from iteration to iteration.) That is, we can still construct the derived MDP M′ and perform gradual value iteration on it, and gradual value iteration on M′ is still the same as gradual approximate value iteration on M. The results also apply nearly directly to dynamic programming with Watkins' [Wat89] Q function

    Q*(x, a) = E(c(x, a) + γ V*(δ(x, a)))

and Q-learning operator

    Q(x, a) ←_{α_xa} E(c(x, a) + γ min_{a′} Q(δ(x, a), a′))

(The learning rates α_xa may now be random variables, and may depend for each step not only on x and a but also on the entire past history of the agent's interactions with the MDP, up to but not including the current values of the random variables c(x, a) and δ(x, a).) That is, we can still define a derived MDP so that the behavior of the approximate algorithm on the original MDP is the same as the behavior of the exact algorithm on the derived MDP.
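As an illustration of the gradual, averager-based Q-learning update just described, here is a minimal sketch for the in-between setting discussed below, in which transitions can be sampled at will but expectations cannot be computed exactly. The chain environment, the averager weights, and all names are assumptions of this example.

    import numpy as np

    rng = np.random.default_rng(1)
    n_states, n_actions, gamma, alpha = 50, 2, 0.95, 0.2

    def step(x, a):
        """Illustrative chain environment; state 0 is the cost-free absorbing goal."""
        if x == 0:
            return 0, 0.0
        move = -1 if a == 0 else 1
        y = int(np.clip(x + move + rng.integers(-1, 2), 0, n_states - 1))
        return y, 1.0                          # every step away from the goal costs 1

    # Q-values are stored only at the sample states X0; an averager with fixed
    # inverse-distance weights over the two nearest sample states extends them
    # to arbitrary successor states, so fitted values never exaggerate targets.
    X0 = np.arange(0, n_states, 5)
    def weights(x):
        d = np.abs(X0 - x).astype(float)
        w = np.zeros(len(X0))
        nearest = np.argsort(d)[:2]
        w[nearest] = 1.0 / (1.0 + d[nearest])
        return w / w.sum()

    Q = np.zeros((len(X0), n_actions))
    for _ in range(20000):
        i = rng.integers(len(X0))              # sample a transition at will...
        a = rng.integers(n_actions)            # ...from a sample state and action
        y, cost = step(int(X0[i]), a)
        q_next = (weights(y) @ Q).min()        # averager extends Q to the successor
        Q[i, a] += alpha * (cost + gamma * q_next - Q[i, a])   # gradual update
    print(Q.min(axis=1))                       # approximate cost-to-go at X0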
The derived MDP is, however, slightly different from the derived MDP for value iteration; see the Appendix for details. The previous paragraphs imply that, if we could sample transitions at will from the derived MDP (and so compute unbiased estimates of the Q-learning updates), we could apply gradual Q-learning to these transitions and learn a policy. The convergence of this algorithm would be guaranteed, as long as the weights α_xa were chosen appropriately, by any sufficiently general convergence proof for Q-learning [JJS94, Tsi94]. Unfortunately, there is a catch. Q-learning is designed to work for online problems, where we don't know the cost or transition functions and can only sample transitions from our current state. The power of the approximate value iteration method, on the other hand, comes from the fact that we can pay attention only to transitions from a certain small set of states. So, while the derived MDP for Q-learning will still have only a few relevant states, we won't in general be able to observe many transitions from these states, and so the approximate Q-learning iteration will take a very long time to converge. There are two ways that the approximate Q-learning algorithm might still be useful. The first is if we encounter a problem which is somewhere between online and offline: we can sample any transition at will, but don't know the cost or transition functions a priori or can't compute the necessary expectations. In such a problem we still can't use value iteration, since it is difficult to compute an unbiased estimate of the value iteration update, so the approximate Q-learning algorithm is helpful. The second way is if we are willing to accept possible lack of convergence. Suppose our function approximator pays attention to the states in the set X0. If we pretend that every transition we see from a state x ∉ X0 is actually a transition from the nearest state x′ ∈ X0, we will then have enough data to compute the behavior of the derived MDP on X0. Unfortunately, following this approximation is equivalent to introducing hidden state into the derived MDP; so we now run the risk of divergence.

6 Converging to what?

Until now, we have only considered the convergence or divergence of approximate dynamic programming algorithms. Of course we would like not only convergence, but convergence to a reasonable approximation of the value function. The next section contains some empirical studies of approximate value iteration; this section proves some error bounds, then outlines the types of problems we have encountered while experimenting with approximate value iteration.

6.1 Error bounds

Suppose that M is an MDP with value function V*, and let A be an averager. What if V* is also a fixed point of M_A? Then V* is a fixed point of T_M ∘ M_A; so, if we can show that T_M ∘ M_A converges, we will know that it converges to the right answer. If M is discounted and has bounded costs, T_M ∘ M_A will always converge; the following theorem describes some situations where nondiscounted problems converge.

Theorem 6.1 Let V* be the optimal value function for a finite nondiscounted Markov decision process M. Let T_M be the parallel value backup operator for M. Let M_A be the mapping for an averager A, and let V* also be a fixed point of M_A. Then V* is a fixed point of T_M ∘ M_A; and if either

1. M has all policies proper and A is self-weighted for M, or
2. M has E(c(x, a)) > 0 for all x ≠ 1 and A has k_x ≥ 0 for all x

then iteration of T_M ∘ M_A converges to V*.

Proof: By the value contraction theorem, V* is a fixed point of T_M.
So T_M(M_A(V*)) = T_M(V*) = V*; that is, V* is a fixed point of T_M ∘ M_A. If (1) holds, then by the corollary to the derived MDP theorem, T_M ∘ M_A is a contraction in some weighted max norm; so V* is the unique fixed point of T_M ∘ M_A, and iteration of T_M ∘ M_A converges to V*. If (2) holds, then M′ has c′(x, a) > 0 for all x ≠ 1; so every improper policy in M′ has infinite cost for some initial states. If we can show that M′ has at least one proper policy, then T_M ∘ M_A must converge (and therefore must converge to V*). Note that V*(1) = 0 and V*(x) > 0 for x ≠ 1, since all arc costs in M are positive. Suppose we start at some non-goal state x in M′, and choose an action a so that V*(x) = E(c′(x, a) + V*(δ′(x, a))). (There must be such an action, since V* is a fixed point of the value backup operator for M′.) Since c′(x, a) > 0, we know that V*(x) > E(V*(δ′(x, a))). In particular, there must be a possible transition to some state y so that V*(x) > V*(y). If y is not the goal, we can repeat the argument to find a z so that y has a possible transition to z and V*(y) > V*(z), and so forth until (with positive probability) we eventually reach the goal. □

The above theorem is useful only when we know that the optimal value function is a fixed point of our averager. For example, it shows that bilinear interpolation will converge to the exact value function for a gridworld, if every arc's cost is equal to its (Manhattan) length, since the value function for this MDP is linear. If we are trying to solve a discounted MDP, on the other hand, we can prove a much stronger result: if we only know that the optimal value function is somewhere near a fixed point of our averager, we can still guarantee an error bound for approximate value iteration.

Theorem 6.2 Let V* be the optimal value function for a finite Markov decision process M with discount factor γ. Let T_M be the parallel value backup operator for M. Let M_A be the mapping for an averager A. Let V_A be any fixed point of M_A. Suppose ‖V_A − V*‖ = ε, where ‖·‖ denotes max norm. Then iteration of T_M ∘ M_A converges to a value function V0 so that ‖V0 − V*‖ ≤ 2γε / (1 − γ).

Proof: By the value contraction theorem, T_M is a contraction in max norm with factor γ. By theorem 3.2, M_A is a nonexpansion in max norm. So, T_M ∘ M_A is a contraction in max norm with factor γ, and therefore converges to some V0. Repeated application of the triangle inequality and the definition of a contraction give

    ‖V0 − T_M(M_A(V*))‖ = ‖T_M(M_A(V0)) − T_M(M_A(V*))‖ ≤ γ ‖V0 − V*‖

    ‖T_M(M_A(V*)) − V*‖ = ‖T_M(M_A(V*)) − T_M(V*)‖
                        ≤ γ ‖M_A(V*) − V*‖
                        ≤ γ (‖M_A(V*) − V_A‖ + ‖V_A − V*‖)
                        = γ (‖M_A(V*) − M_A(V_A)‖ + ‖V_A − V*‖)
                        ≤ γ (‖V* − V_A‖ + ‖V_A − V*‖)

    ‖V0 − V*‖ ≤ ‖V0 − T_M(M_A(V*))‖ + ‖T_M(M_A(V*)) − V*‖
              ≤ γ ‖V0 − V*‖ + 2γ ‖V* − V_A‖

    (1 − γ) ‖V0 − V*‖ ≤ 2γ ‖V* − V_A‖

    ‖V0 − V*‖ ≤ 2γε / (1 − γ)

which is what was required. □

If we let γ → 0, we can make the above error bound arbitrarily small. This result is somewhat counterintuitive, since A may not even be able to represent V* exactly. The reason for this behavior is that the final step in computing V0 is to apply T_M; when γ = 0, this step produces V* immediately. As mentioned in a previous section, approximate value iteration returns M_A(V0) rather than V0 itself. So, an error bound for M_A(V0) would be useful.
The error bound on V0 leads directly to a bound for M_A(V0):

    ‖V* − M_A(V0)‖ ≤ ‖V* − V_A‖ + ‖V_A − M_A(V0)‖
                  = ε + ‖M_A(V_A) − M_A(V0)‖
                  ≤ ε + ‖V_A − V0‖
                  ≤ ε + ‖V_A − V*‖ + ‖V* − V0‖
                  ≤ 2ε + 2γε / (1 − γ)

By way of comparison, Chow and Tsitsiklis [CT89] bound the error introduced by using a grid of side h to compute the value function for a type of continuous-state-space MDP. Writing V_h for the approximate value function computed this way, their bound is

    ‖V* − V_h‖ ≤ (h / (1 − γ)) (K1 + K2 ‖V*‖_Q)        (for all sufficiently small h)

where ‖·‖_Q is the span quasi-norm

    ‖F‖_Q = sup_x F(x) − inf_x F(x)

and K1 and K2 are constants. While their formalism allows them to discretize the available actions as well as the state space, an approximation which we don't consider, their bound also applies to processes with a finite number of actions. So, we can compare their bound on V_h for the case of non-discretized actions to our bound on M_A(V0). Write M_h for the mapping associated with discretization on a grid of side h. (For definiteness, assume that (M_h(V))(x) = V(x′) where x′ is the center of the grid cell which contains x. Almost any other reasonable convention would also work.) The processes that Chow and Tsitsiklis consider satisfy |V*(x) − V*(x′)| ≤ L ‖x − x′‖ for some constant L (i.e., V* is Lipschitz continuous). (We can determine L easily from the cost and transition functions.) So, we can find a fixed point V_h of M_h which is close to V*, as follows. Pick a grid cell C. Write C_H and C_L for the maximum and minimum values of V* in C; write x_H and x_L for states x ∈ C which achieve these values. (Such states exist because V* is continuous.) We know that ‖x_H − x_L‖ ≤ h, since the diameter of C is h. Therefore, by the Lipschitz condition, |C_H − C_L| ≤ hL. So, if we set V_h(x) to be (C_H + C_L)/2 for all x ∈ C, the largest difference between V* and V_h on C will be hL/2. Applying our error bound now gives

    ‖V* − V_h‖ ≤ hL (1 + γ / (1 − γ))

which has slightly different constants but the same behavior as h → 0 or γ → 1 as the bound of Chow and Tsitsiklis. Our bound is valid for any h, rather than just sufficiently small h. Their bound depends on ‖V*‖_Q; since ‖V*‖_Q is bounded by 2r/(1 − γ), where r is the largest single-step expected reward, we have folded a similar dependence into the constant L. The sort of error bound which we have proved is particularly useful for function approximators such as linear interpolation which have many fixed points. (In fact, every function which is representable by a linear interpolator is a fixed point of that interpolator's mapping. The same is true for bilinear interpolation, grids, and 1-nearest-neighbor; it is not true for local weighted averaging or k-nearest-neighbor for k > 1.) Because it depends on the maximum difference between V* and the fixed point of M_A, the error bound is not very useful if V* may have large discontinuities at unknown locations: if V* has a discontinuity of height d, then any averager which can't mimic the location of this discontinuity exactly will have no representable functions (and therefore no fixed points) within d/2 of V*.

6.2 In practice

The most common problem with approximate value iteration is the presence of barriers in the derived MDP. That is, sometimes the derived MDP can be divided into two pieces so that the first piece contains the goal and the second piece has no transitions into the first. In this case, the estimated values of the states in the second piece will be infinite.
(We mentioned a special case of this situation above: if the averager ignores the goal state, then the derived MDP will have no transitions into the goal.) A less drastic but similar problem occurs when the second piece has only low-probability transitions to the first; in this case, the costs for states in the second piece will not be infinite, but will still be artificially inflated. This sort of problem is likely to happen when the MDP has short transitions and when there are large regions where a single state dominates the averager. For a particularly bad example, suppose our function approximator is 1-nearest-neighbor. If the transitions out of a sampled state x in M are shorter than half the distance to the nearest adjacent sampled state, then the only transitions out of x in M′ will lead straight back to x. Similarly, in local weighted averaging with a narrow kernel, a short transition out of x in M will translate to a high-probability self-loop in M′. In both cases, the effect of the averager is to produce a drag on transitions out of x: actions in M′ don't get the agent as far on average as they did in M. One way to reduce this drag is to make sure that no single state has the dominant weight over a large region. The best way to do so is to sample the state space more densely; but if we could afford to do that, we wouldn't need a function approximator in the first place. Another way is to increase a smoothing parameter such as kernel width or number of neighbors, and so reduce the weight of each sample point in its immediate neighborhood. Unfortunately, increased smoothing brings its own problems: it can remove exactly the features of the value function that we are interested in. For example, if the agent must follow a long, narrow path to the goal, the scattering effect of a wide-kernel averager is almost certain to push it off of the path long before it reaches the end. We will see an example of this problem in the hill-car experiment below. Both of the above problems, too much smoothing and the introduction of barriers, can be reduced if we can alter our MDP so that the actions move the agent farther. For example, we might look ahead two or more time steps at each value backup. (This strategy corresponds to the dynamic programming operator T_M^n ∘ M_A for some n > 1. Since T_M^n is the backup operator for an MDP (derived by composing n copies of M), the previous sections' convergence theorems also apply to T_M^n ∘ M_A.) While in general the cost of looking ahead n steps is exponential in n, there are many circumstances where we can reduce this cost dramatically. For instance, in a physical simulation, we can choose a longer time increment; in a grid world, we can consider only the compound actions which don't contain two steps in opposite directions; and in the case of a Markov process, where there's only 1 action, the cost of lookahead is linear rather than exponential in n. (In the last case, TD(λ) [Sut88] allows us to combine lookaheads at several depths.) If actions are selected from an interval of R, numerical minimum-finding algorithms such as Newton's method or golden section search can find a local minimum quickly. In any case, if the depth and branching factor are large enough, standard heuristic search techniques can at least chip away at the base of the exponential.

7 Experiments

This section describes our experiments with the Markov decision problems from [BM95].

7.1 Puddle world

In this world, the state space is the unit square, and the goal is the upper right corner.
The agent has four actions, which move it up, left, right, or down by .1 per step. The cost of each action depends on the current state: for most states, it is the distance moved, but for states within the two "puddles," the cost is higher. See figure 6. For a function approximator, we will use bilinear interpolation, defined as follows: to find the predicted value at a point (x, y), first find the corners (x0, y0), (x0, y1), (x1, y0), and (x1, y1) of the grid square containing (x, y). Interpolate along the left edge of the square between (x0, y0) and (x0, y1) to find the predicted value at (x0, y). Similarly, interpolate along the right edge to find the predicted value at (x1, y). Now interpolate across the square between (x0, y) and (x1, y) to find the predicted value at (x, y). Figure 6 shows the cost function for one of the actions, the optimal value function computed on a 100 × 100 grid, an estimate of the optimal value function computed with bilinear interpolation on the corners of a 7 × 7 grid (i.e., on 64 sample points), and the difference between the two estimates. Since the optimal value function is nearly piecewise linear outside the puddles, but curved inside, the interpolation performs much better outside the puddles: the root mean squared difference between the two approximations is 2.27 within one step of the puddles, and .057 elsewhere. (The lowest-resolution grid which beats bilinear interpolation's performance away from the puddles is 20 × 20; but even a 5 × 5 grid can beat its performance near the puddles.)

7.2 Car on a hill

In this world, the agent must drive a car up to the top of a steep hill. Unfortunately, the car's motor is weak: it can't climb the hill from a standing start. So, the agent must back the car up and get a running start. The state space is [−1, 1] × [−2, 2], which represents the position and velocity of the car; there are two actions, forward and reverse. (This formulation differs slightly from [BM95]: they allowed a third action, coast. We expect that the difference makes the problem no more or less difficult.) The cost function measures time until goal. There are several interesting features to this world. First, the value function contains a discontinuity despite the continuous cost and transition functions: there is a sharp transition between states where the agent has just enough speed to get up the hill and those where it must back up and try again. Since most function approximators have trouble representing discontinuities, it will be instructive to examine the performance of approximate value iteration in this situation. Second, there is a long, narrow region of state space near the goal through which all optimal trajectories must pass (it is the region where the car is partway up the hill and moving quickly forward). So, excessive smoothing will cause errors over large regions of the state space. Finally, the physical simulation uses a fairly small time step, .03 seconds, so we need fine resolution in our function approximator just to make sure that we don't introduce a barrier. The results of our experiments appear in figure 7. For a reference model, we fit a 128 × 128 grid. While this model has 16384 parameters, it is still less than perfect: the right end of the discontinuity is somewhat rough. (Boyan and Moore used a 200 by 200 grid to compute their optimal value function, and it shows no perceptible roughness at this boundary.) We also fit two smaller grids, one 64 × 64 and one 32 × 32.
Finally, we fit a weighted 4-nearest-neighbor model using the 1024 centers of the cells of the 32 × 32 grid as sample points, and another using a uniform random sample of 1000 points from the state space. Note that the nearest-neighbor methods are roughly comparable in complexity to the 32 × 32 grid: each one requires us to evaluate about two thousand transitions in the MDP for every value backup.

Figure 6: The puddle world. From top left: the cost of moving up, the optimal value function as seen by a 100 × 100 grid, the optimal value function as seen by bilinear interpolation on the corners of a 7 × 7 grid, and the difference between the two value functions.

Figure 7: Approximations to the value function for the hill-car problem. From the top: the reference model, a 32 × 32 grid, a k-nearest-neighbor model, the error of the 32 × 32 grid, and the error of the k-nearest-neighbor model. In each plot, the horizontal axes represent the agent's position and velocity, and the vertical axis represents the estimated time remaining until reaching the summit at x = .6.

Figure 8: Two smaller models for the hill-car world: a divergent 12 × 12 grid, and a convergent nearest-neighbor model on the same 144 sample points.

As the difference plots show, most of the error in the smaller models is concentrated around the discontinuity in the value function. Near the discontinuity, the grids perform better than the nearest-neighbor models (as we would expect, since the nearest-neighbor models tend to smooth out discontinuities). But away from the discontinuity, the nearest-neighbor models win. The 32 × 32 nearest-neighbor model also beats the 32 × 32 grid at the right end of the discontinuity: the car is moving slowly enough here that the grid thinks that one of the actions keeps the car in exactly the same place. The nearest-neighbor model, on the other hand, since it smooths more, doesn't introduce as much drag as the grid does and so doesn't have this problem. The root mean square error of the 64 × 64 grid (not shown) from the reference model is 0.190s, and of the 32 × 32 grid is 0.336s. The RMS error of the 4-nearest-neighbor fitter with samples at the grid points is 0.205s. The nearest-neighbor fitter with a random sample (not shown) performs slightly worse, but still significantly better than the 32 × 32 grid (one-tailed t-test gives p = .971): its error, averaged over 5 runs, is 0.235s. All of the above models are fairly large: the smallest one requires us to evaluate 2000 transitions for every value backup. Figure 8 shows what happens when we try to fit a smaller model. The 12 × 12 grid is shown after 60 iterations; it is in the process of diverging, since the transitions are too short to reach the goal from adjacent grid cells.
The 4-nearest-neighbor fitter on the same 144 grid points has converged; its RMS error from the reference model is 0.278 s (better than the 32 × 32 grid, despite needing to simulate fewer than one-seventh as many transitions). A 4-nearest-neighbor fitter on a random sample of size 150 (not shown) also converged, with RMS error 0.423 s.

8 Conclusions and further research

We have proved convergence for a wide class of approximate temporal difference methods, and shown experimentally that these methods can solve Markov decision processes more efficiently than grids of comparable accuracy. Unfortunately, many popular function approximators, such as neural nets, linear regression, and CMACs, do not fall into this class (and in fact can diverge). The chief reason for divergence is exaggeration: the more a method can exaggerate small changes in its target function, the more often it diverges under temporal differencing. (In some cases, though, it is possible to detect and compensate for this instability. The grow-support algorithm of [BM95], which detects instability by interspersing TD(1) "rollouts," is a good example.)

There is another important difference between averagers and methods like neural nets. This difference is the ability to allocate structure dynamically: an averager cannot decide to concentrate its resources on one important region of the state space, whether or not this decision is justified. This ability is important, and it can be grafted onto averagers (for example, adaptive sampling for k-nearest-neighbor, or adaptive meshes for grids or interpolation). The resulting function approximator still does not exaggerate; but it is no longer an averager, and so is not covered by this paper's proofs. Still, methods of this sort have been shown to converge in practice [Moo94, Moo91], so there is hope that a proof is possible.

A The derived process for Q-learning

Here is the analog for Q-learning of the derived MDP theorem. The chief difference is that, where the theorem for value iteration considered the combined operator $T_M M_A$, this version considers $M_A T_M$. The difference is necessary to keep the min operation in the Q-learning backup from getting in the way. Of course, if we show that either $T_M M_A$ or $M_A T_M$ converges from any initial guess, then the other must also converge.

Theorem A.1 (Derived MDP for Q-learning) For any averager $A$ with mapping $M_A$, and for any MDP $M$ (either discounted or nondiscounted) with Q-learning backup operator $T_M$, the function $M_A T_M$ is the Q-learning backup operator for a new Markov decision process $M'$.

Proof: The domain of $A$ will now be pairs of states and actions. Write $\beta_{xayb}$ for the coefficient of $Q(y,b)$ in the approximation of $Q(x,a)$; write $k_{xa}$ and $\beta_{xa}$ for the constant and its coefficient. Take an initial guess $Q(x,a)$. Write $Q'$ for the result of applying $T_M$ to $Q$; write $Q''$ for the result of applying $M_A$ to $Q'$. Then we have

$$ Q'(x,a) = E\Big( c(x,a) + \min_b Q(\delta(x,a), b) \Big) = c_{xa} + \sum_y p_{xay} \min_b Q(y,b) $$

$$ \begin{aligned} Q''(z,c) &= \sum_x \sum_a \beta_{zcxa} Q'(x,a) + \beta_{zc} k_{zc} \\ &= \sum_x \sum_a \beta_{zcxa} \Big( c_{xa} + \sum_y p_{xay} \min_b Q(y,b) \Big) + \beta_{zc} k_{zc} \\ &= \sum_x \sum_a \beta_{zcxa} c_{xa} + \beta_{zc} k_{zc} + \sum_x \sum_a \beta_{zcxa} \sum_y p_{xay} \min_b Q(y,b) \\ &= \Big( \sum_x \sum_a \beta_{zcxa} c_{xa} + \beta_{zc} k_{zc} \Big) + \sum_y \Big( \sum_x \sum_a \beta_{zcxa} p_{xay} \Big) \min_b Q(y,b) \end{aligned} $$

We now interpret the first parenthesis above as the cost of taking action $c$ from state $z$ in $M'$; the second parenthesis is the transition probability $p'_{zcy}$ for $M'$. Note that the sum $\sum_y p'_{zcy}$ will generally be less than 1; we make up the difference by adding a transition in $M'$ from state $z$ with action $c$ to the absorbing state (which is assumed, as before, to be cost-free and to have value 0). $\Box$
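To make the combined operator $M_A T_M$ concrete, here is a minimal Python sketch for a finite MDP, with the averager represented explicitly by its coefficients; the array names and layout are our own, and there is no explicit discount factor, matching the backup written in the proof above.

    import numpy as np

    def q_backup(Q, P, C):
        # Exact Q-learning backup T_M for a finite MDP.
        # P[x, a, y]: transition probability; C[x, a]: expected cost.
        V = Q.min(axis=1)                       # min_b Q(y, b) for every state y
        return C + np.einsum('xay,y->xa', P, V)

    def averager_fit(Qp, beta, beta0, k):
        # Averager M_A: each fitted Q(z, c) is a fixed weighted combination of
        # the backed-up values plus a constant term.
        # beta[z, c, x, a]: coefficient of Q'(x, a) in the fit of Q(z, c);
        # beta0[z, c] and k[z, c]: the constant's coefficient and the constant.
        return np.einsum('zcxa,xa->zc', beta, Qp) + beta0 * k

    def combined_backup(Q, P, C, beta, beta0, k):
        # One application of M_A T_M, the backup operator of the derived MDP M'.
        return averager_fit(q_backup(Q, P, C), beta, beta0, k)

Expanding combined_backup symbolically reproduces the derivation above: the cost of action c in state z of M' is the first parenthesized expression, and the coefficient multiplying min_b Q(y, b) is the transition probability p'_{zcy}.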
Acknowledgements

I would like to thank Rich Sutton, Andrew Moore, Justin Boyan, and Tom Mitchell for their helpful conversations with me. Without their constantly poking holes in my misconceptions, this paper would never have been written. This material is based on work supported under a National Science Foundation Graduate Research Fellowship, and by NSF grant number BES-9402439. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation.

References

[BD59] R. Bellman and S. Dreyfus. Functional approximations and dynamic programming. Mathematical Tables and Aids to Computation, 13:247-251, 1959.

[Bel58] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87-90, 1958.

[Bel61] R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

[BKK63] R. Bellman, R. Kalaba, and B. Kotkin. Polynomial approximation -- a new computational technique in dynamic programming: allocation processes. Mathematics of Computation, 17:155-161, 1963.

[Bla65] D. Blackwell. Discounted dynamic programming. Annals of Mathematical Statistics, 36:226-235, 1965.

[BM95] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: safely approximating the value function. In G. Tesauro and D. Touretzky, editors, Advances in Neural Information Processing Systems, volume 7. Morgan Kaufmann, 1995.

[BT89] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.

[CT89] C.-S. Chow and J. N. Tsitsiklis. An optimal multigrid algorithm for discrete-time stochastic control. Technical Report P-135, Center for Intelligent Control Systems, 1989.

[Day92] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3-4):341-362, 1992.

[FF62] L. R. Ford, Jr. and D. R. Fulkerson. Flows in Networks. Princeton University Press, 1962.

[JJS94] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.

[Lin92] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8(3-4):293-322, 1992.

[Moo91] A. W. Moore. Variable resolution dynamic programming: efficiently learning action maps in multivariate real-valued state-spaces. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann, 1991.

[Moo94] A. W. Moore. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 6. Morgan Kaufmann, 1994.

[Sab93] P. Sabes. Approximating Q-values with basis function representations. In Proceedings of the Fourth Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

[Sam59] A. L. Samuels. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210-229, 1959.

[SS94] S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Manuscript in preparation, 1994.

[Sut88] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.
[Tes90] G. Tesauro. Neurogammon: a neural network backgammon program. In IJCNN Proceedings III, pages 33-39, 1990.

[TS93] S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

[Tsi94] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994.

[Wat89] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989.

[WD92] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

[Wit77] I. H. Witten. An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34:286-295, 1977.