PEGASUS: A policy search method for large MDPs and POMDPs

Andrew Y. Ng, Computer Science Division, UC Berkeley, Berkeley, CA 94720
Michael Jordan, Computer Science Division & Department of Statistics, UC Berkeley, Berkeley, CA 94720

Abstract

We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method applies to arbitrary POMDPs, including ones with infinite state and action spaces. We also present empirical results for our approach on a small discrete problem, and on a complex continuous-state/continuous-action problem involving learning to ride a bicycle.

1 Introduction

In recent years, there has been growing interest in algorithms for approximate planning in (exponentially or even infinitely) large Markov decision processes (MDPs) and partially observable MDPs (POMDPs). For such large domains, the value and Q-functions are sometimes complicated and difficult to approximate, even though there may be simple, compactly representable policies that perform very well. This observation has led to particular interest in direct policy search methods (e.g., [16, 8, 15, 1, 7]), which attempt to choose a good policy from some restricted class of policies.

Most approaches to policy search assume access to the POMDP either in the form of the ability to execute trajectories in the POMDP, or in the form of a black-box "generative model" that enables the learner to try actions from arbitrary states. In this paper, we will assume a stronger model than these: roughly, we assume we have an implementation of a generative model, with the difference that it has no internal random number generator, so that it has to ask us to provide it with random numbers whenever it needs them (such as when it needs a source of randomness to draw samples from the POMDP's transition distributions). This small change to a generative model results in what we will call a deterministic simulative model, and makes it surprisingly powerful.

We show how, given a deterministic simulative model, we can reduce the problem of policy search in an arbitrary POMDP to one in which all the transitions are deterministic, that is, a POMDP in which taking an action in a state always deterministically results in transitioning to some fixed state. (The initial state in this POMDP may still be random.) This reduction is achieved by transforming the original POMDP into an "equivalent" one that has only deterministic transitions. Our policy search algorithm then operates on these "simplified" transformed POMDPs. We call our method PEGASUS (for Policy Evaluation-of-Goodness And Search Using Scenarios, for reasons that will become clear).
Our algorithm also bears some similarity to one used by Van Roy [12] for value determination in the setting of fully observable MDPs.

The remainder of this paper is structured as follows: Section 2 defines the notation that will be used in this paper, and formalizes the concepts of deterministic simulative models and of families of realizable dynamics. Section 3 then describes how we transform POMDPs into ones with only deterministic transitions, and gives our policy search algorithm. Section 4 goes on to establish conditions under which we may give guarantees on the performance of the algorithm, Section 5 describes our experimental results, and Section 6 closes with conclusions.

2 Preliminaries

This section gives our notation, and introduces the concept of the set of realizable dynamics of a POMDP under a policy class.

A Markov decision process (MDP) is a tuple $(S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$ where: $S$ is a set of states; $D$ is the initial-state distribution, from which the start state $s_0$ is drawn; $A$ is a set of actions; $P_{sa}(\cdot)$ are the transition probabilities, with $P_{sa}(\cdot)$ giving the next-state distribution upon taking action $a$ in state $s$; $\gamma \in [0,1)$ is the discount factor; and $R$ is the reward function, bounded by $R_{\max}$. For the sake of concreteness, we will assume, unless otherwise stated, that $S = [0,1]^{d_S}$ is a $d_S$-dimensional hypercube. For simplicity, we also assume rewards are deterministic, and written $R(s)$ rather than $R(s,a)$, the extensions being trivial. Lastly, everything that needs to be measurable is assumed to be measurable.

A policy is any mapping $\pi : S \to A$. The value function of a policy $\pi$ is a map $V^\pi : S \to \mathbb{R}$, so that $V^\pi(s)$ gives the expected discounted sum of rewards for executing $\pi$ starting from state $s$. With some abuse of notation, we also define the value of a policy, with respect to the initial-state distribution $D$, according to

$$V(\pi) = \mathbb{E}_{s_0 \sim D}\left[ V^\pi(s_0) \right] \qquad (1)$$

(where the subscript $s_0 \sim D$ indicates that the expectation is with respect to $s_0$ drawn according to $D$). When we are considering multiple MDPs and wish to make explicit that a value function is for a particular MDP $M$, we will also write $V_M^\pi(\cdot)$, $V_M(\pi)$, etc.

In the policy search setting, we have some fixed class $\Pi$ of policies, and desire to find a good policy $\pi \in \Pi$. More precisely, for a given MDP $M$ and policy class $\Pi$, define

$$\mathrm{opt}(M, \Pi) = \sup_{\pi \in \Pi} V_M(\pi). \qquad (2)$$

Our goal is to find a policy $\hat\pi \in \Pi$ so that $V(\hat\pi)$ is close to $\mathrm{opt}(M, \Pi)$.

Note that this framework also encompasses cases where our family $\Pi$ consists of policies that depend only on certain aspects of the state. In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies. By introducing artificial "memory variables" into the process state, we can also define stochastic limited-memory policies [9] (which certainly permits some belief state tracking). Although we have not explicitly addressed stochastic policies so far, they are a straightforward generalization (e.g. using the transformation to deterministic policies given in [7]).

Since we are interested in the "planning" problem, we assume that we are given a model of the (PO)MDP. Much previous work has studied the case of (PO)MDPs specified via a generative model [7, 13], which is a stochastic function that takes as input any state-action pair $(s,a)$ and outputs $s'$ drawn according to $P_{sa}$ (and the associated reward). In this paper, we assume a stronger model. We assume we have a deterministic function $g : S \times A \times [0,1]^{d_p} \to S$, so that if $p$ is distributed Uniform$[0,1]^{d_p}$, then for any fixed $(s,a)$-pair, $g(s,a,p)$ is distributed according to the transition distribution $P_{sa}$. In other words, to draw a sample from $P_{sa}$ for some fixed $s$ and $a$, we need only draw $p$ uniformly in $[0,1]^{d_p}$, and then take $g(s,a,p)$ to be our sample. We will call such a model a deterministic simulative model for a (PO)MDP.

Since a deterministic simulative model allows us to simulate a generative model, it is clearly a stronger model. However, most computer implementations of generative models also provide deterministic simulative models. Consider a generative model that is implemented via a procedure that takes $s$ and $a$, makes at most $d_p$ calls to a random number generator, and then outputs $s'$ drawn according to $P_{sa}$. Then this procedure is already providing a deterministic simulative model. The only difference is that the deterministic simulative model has to make explicit (or "expose") its interface to the random number generator, via $p$. (A generative model implemented via a physical simulation of an MDP with "resets" to arbitrary states does not, however, readily lead to a deterministic simulative model.)

Let us examine some simple examples of deterministic simulative models. Suppose that for a state-action pair $(s,a)$ and some states $s_1'$ and $s_2'$, we have $P_{sa}(s_1') = 1/3$ and $P_{sa}(s_2') = 2/3$. Then we may choose $d_p = 1$, so that $p$ is just a real number, and let $g(s,a,p) = s_1'$ if $p \le 1/3$ and $g(s,a,p) = s_2'$ otherwise. As another example, suppose $S = \mathbb{R}$, and $P_{sa}$ is a normal distribution with a cumulative distribution function $F_{sa}$. Again letting $d_p = 1$, we may choose $g(s,a,p)$ to be $F_{sa}^{-1}(p)$.
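To make the interface concrete, here is a minimal Python sketch (ours, not from the paper) of the two examples above. The function names, the particular successor states, and the Gaussian mean and standard deviation are illustrative assumptions; the only essential point is that the caller, not the model, supplies the uniform random number $p$.

```python
import numpy as np
from scipy.stats import norm

def g_discrete(s, a, p, s1_prime, s2_prime):
    """Deterministic simulative model for a two-outcome transition with
    P(s1' | s, a) = 1/3 and P(s2' | s, a) = 2/3; the caller supplies
    p ~ Uniform[0, 1]."""
    return s1_prime if p <= 1.0 / 3.0 else s2_prime

def g_gaussian(s, a, p, sigma=0.1):
    """Deterministic simulative model for S = R where P_sa is assumed to be
    Gaussian with mean s + a and standard deviation sigma, implemented by
    inverse-CDF sampling: g(s, a, p) = F_sa^{-1}(p)."""
    return s + a + sigma * norm.ppf(p)

rng = np.random.default_rng(0)
p = rng.uniform()                 # randomness is provided from outside the model
print(g_gaussian(0.0, 1.0, p))    # over repeated draws of p, distributed N(1, sigma^2)
```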
It is a fact of probability and measure theory that, given any transition distribution $P_{sa}$, such a deterministic simulative model $g$ can always be constructed for it. (See, e.g., [4].) Indeed, some texts (e.g. [2]) routinely define POMDPs using essentially deterministic simulative models. However, there will often be many different choices of $g$ for representing a (PO)MDP, and it will be up to the user to decide which one is most "natural" to implement. As we will see later, the particular choice of $g$ that the user makes can indeed impact the performance of our algorithm, and "simpler" (in a sense to be formalized) implementations are generally preferred.

To close this section, we introduce a concept that will be useful later, which captures the family of dynamics that a (PO)MDP and policy class can exhibit. Assume a deterministic simulative model $g$, and fix a policy $\pi$. If we are executing $\pi$ from some state $s$, the successor state is determined by $f_\pi(s,p) = g(s, \pi(s), p)$, which is a function of $s$ and $p$. Varying $\pi$ over $\Pi$, we get a whole family of functions

$$\mathcal{F} = \{\, f_\pi(s,p) = g(s, \pi(s), p) \mid \pi \in \Pi \,\}$$

mapping from $S \times [0,1]^{d_p}$ into successor states. This set of functions should be thought of as the family of dynamics realizable by the POMDP and $\Pi$, though since its definition does depend on the particular deterministic simulative model $g$ that we have chosen, this is "as expressed with respect to $g$." For each function $f_\pi$, also let $f_{\pi,i}$ be its $i$-th coordinate function (so that $f_{\pi,i}(s,p)$ is the $i$-th coordinate of $f_\pi(s,p)$), and let $\mathcal{F}_i$ be the corresponding families of coordinate functions mapping from $S \times [0,1]^{d_p}$ into $[0,1]$. Thus, $\mathcal{F}_i$ captures all the ways that coordinate $i$ of the state can evolve. We are now ready to describe our policy search method.

3 Policy search method

In this section, we show how we transform a (PO)MDP into an "equivalent" one that has only deterministic transitions. This then leads to natural estimates $\hat{V}(\pi)$ of the policies' values $V(\pi)$. Finally, we may search over policies to optimize $\hat{V}(\pi)$, to find a (hopefully) good policy.

3.1 Transformation of (PO)MDPs

Given a (PO)MDP $M = (S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$ and a policy class $\Pi$, we describe how, using a deterministic simulative model $g$ for $M$, we construct our transformed POMDP $M' = (S', D', A, \{P'_{sa}(\cdot)\}, \gamma, R')$ and a corresponding class of policies $\Pi'$, so that $M'$ has only deterministic transitions (though its initial state may still be random). To simplify the exposition, we assume $d_p = 1$, so that the terms $p$ are just real numbers.

$M'$ is constructed as follows. The action space and discount factor for $M'$ are the same as in $M$. The state space for $M'$ is $S' = S \times [0,1]^\infty$. In other words, a typical state in $M'$ can be written as a vector $(s, p_1, p_2, p_3, \ldots)$; this consists of a state $s$ from the original state space, followed by an infinite sequence of real numbers in $[0,1]$. The rest of the transformation is straightforward. Upon taking action $a$ in state $(s, p_1, p_2, \ldots)$ in $M'$, we deterministically transition to the state $(s', p_2, p_3, \ldots)$, where $s' = g(s, a, p_1)$. In other words, the $s$ portion of the state (which should be thought of as the "actual" state) changes to $s'$, and one number in the infinite sequence $p_1, p_2, \ldots$ is used up to generate $s'$ from the correct distribution. By the definition of the deterministic simulative model $g$, we see that so long as $p_1 \sim \mathrm{Uniform}[0,1]$, the "next-state" distribution of $s'$ is the same as if we had taken action $a$ in state $s$ (randomization over $p_1$).

Finally, we choose $D'$, the initial-state distribution over $S'$, so that a state $(s, p_1, p_2, \ldots)$ drawn according to $D'$ has $s$ drawn according to $D$, and the $p_i$'s distributed i.i.d. Uniform$[0,1]$. For each policy $\pi \in \Pi$, also let there be a corresponding $\pi' \in \Pi'$, given by $\pi'((s, p_1, p_2, \ldots)) = \pi(s)$, and let the reward be given by $R'((s, p_1, p_2, \ldots)) = R(s)$.

If one observes only the "$s$"-portion (but not the $p_i$'s) of a sequence of states generated in the POMDP $M'$ using policy $\pi'$, one obtains a sequence that is drawn from the same distribution as would have been generated from the original (PO)MDP $M$ under the corresponding policy $\pi \in \Pi$. It follows that, for corresponding policies $\pi \in \Pi$ and $\pi' \in \Pi'$, we have $V_M(\pi) = V_{M'}(\pi')$. This also implies that the best possible expected returns in both (PO)MDPs are the same: $\mathrm{opt}(M, \Pi) = \mathrm{opt}(M', \Pi')$.

To summarize, we have shown how, using a deterministic simulative model, we can transform any POMDP $M$ and policy class $\Pi$ into an "equivalent" POMDP $M'$ and policy class $\Pi'$, so that the transitions in $M'$ are deterministic; i.e., given a state in $M'$ and an action, the next state in $M'$ is exactly determined. Since corresponding policies in $\Pi$ and $\Pi'$ have the same values, if we can find a policy $\pi' \in \Pi'$ that does well in $M'$, then the corresponding policy $\pi \in \Pi$ will also do well for the original POMDP $M$. Hence, the problem of policy search in general POMDPs is reduced to the problem of policy search in POMDPs with deterministic transition dynamics. In the next section, we show how we can exploit this fact to derive a simple and natural policy search method.
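The following short Python class is a minimal sketch of this transformation (our illustration, with hypothetical names such as TransformedPOMDP; it is not the paper's implementation). It wraps any deterministic simulative model $g(s,a,p)$ into the deterministic-transition process $M'$, representing a state of $M'$ by the current $s$ together with the finite prefix of random numbers that has not yet been consumed.

```python
import numpy as np

class TransformedPOMDP:
    """Deterministic-transition process M' built from a deterministic
    simulative model g(s, a, p). A state of M' is (s, remaining p's);
    only a finite prefix of the infinite p-sequence is ever stored."""

    def __init__(self, g, init_state_sampler, horizon, seed=0):
        self.g = g
        self.init_state_sampler = init_state_sampler  # draws s0 ~ D
        self.horizon = horizon                        # number of p's to keep
        self.rng = np.random.default_rng(seed)

    def sample_initial_state(self):
        # (s0, p_1, ..., p_H) ~ D': s0 ~ D, each p_t i.i.d. Uniform[0,1].
        s0 = self.init_state_sampler(self.rng)
        ps = self.rng.uniform(size=self.horizon)
        return (s0, list(ps))

    def step(self, state, a):
        # Deterministic transition of M': consume one p to produce s'.
        s, ps = state
        s_next = self.g(s, a, ps[0])
        return (s_next, ps[1:])
```

In the terminology of the next subsection, one draw of sample_initial_state() is exactly one "scenario."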
3.2 PEGASUS: A method for policy search

As discussed, it suffices for policy search to find a good policy $\pi' \in \Pi'$ for the transformed POMDP, since the corresponding policy $\pi \in \Pi$ will be just as good. To do this, we first construct an approximation $\hat{V}_{M'}(\cdot)$ to $V_{M'}(\cdot)$, and then search over policies $\pi' \in \Pi'$ to optimize $\hat{V}_{M'}(\pi')$ (as a proxy for optimizing the hard-to-compute $V_{M'}(\pi')$), and thus find a (hopefully) good policy.

Recall that $V_{M'}(\pi')$ is given by

$$V_{M'}(\pi') = \mathbb{E}_{s_0' \sim D'}\left[ V_{M'}^{\pi'}(s_0') \right], \qquad (3)$$

where the expectation is over the initial state $s_0'$ drawn according to $D'$. The first step in the approximation is to replace the expectation over the distribution $D'$ with a finite sample of states. More precisely, we first draw a sample $s_0^{(1)}, s_0^{(2)}, \ldots, s_0^{(m)}$ of $m$ initial states according to $D'$. These states, also called "scenarios" (a term from the stochastic optimization literature; see, e.g., [3]), define an approximation to $V_{M'}(\pi')$:

$$V_{M'}(\pi') \approx \frac{1}{m} \sum_{i=1}^{m} V_{M'}^{\pi'}\!\left(s_0^{(i)}\right). \qquad (4)$$

Since the transitions in $M'$ are deterministic, for a given state $s_0'$ and a policy $\pi' \in \Pi'$, the sequence of states that will be visited upon executing $\pi'$ from $s_0'$ is exactly determined; hence the sum of discounted rewards for executing $\pi'$ from $s_0'$ is also exactly determined. Thus, to calculate one of the terms $V_{M'}^{\pi'}(s_0^{(i)})$ in the summation in Equation (4) corresponding to scenario $s_0^{(i)}$, we need only use our deterministic simulative model to find the sequence of states visited by executing $\pi'$ from $s_0^{(i)}$, and sum up the resulting discounted rewards. Naturally, this would be an infinite sum, so the second (and standard) part of the approximation is to truncate this sum after some number $H$ of steps, where $H$ is called the horizon time. Here, we choose $H$ to be the $\epsilon$-horizon time $H_\epsilon = \log_\gamma(\epsilon(1-\gamma)/R_{\max})$, so that (because of discounting) the truncation introduces at most $\epsilon$ error into the approximation.

To summarize, given $m$ scenarios $s_0^{(1)}, \ldots, s_0^{(m)}$, our approximation to $V_{M'}(\pi')$ is the deterministic function

$$\hat{V}_{M'}(\pi') = \frac{1}{m} \sum_{i=1}^{m} \left( R(s_0^{(i)}) + \gamma R(s_1^{(i)}) + \cdots + \gamma^{H_\epsilon} R(s_{H_\epsilon}^{(i)}) \right),$$

where $s_0^{(i)}, s_1^{(i)}, \ldots, s_{H_\epsilon}^{(i)}$ is the sequence of states deterministically visited by $\pi'$ starting from $s_0^{(i)}$. Given $m$ scenarios, this defines an approximation to $V_{M'}(\pi')$ for all policies $\pi' \in \Pi'$.

The final implementational detail is that, since the states $s_0^{(i)} \in S \times [0,1]^\infty$ are infinite-dimensional vectors, we have no way of representing them (and their successor states) explicitly. But because we will be simulating only $H_\epsilon$ steps, we need only represent the first $H_\epsilon$ of the random numbers, $(s^{(i)}, p_1^{(i)}, \ldots, p_{H_\epsilon}^{(i)})$, and so we will do just that. Viewed in the space of the original, untransformed POMDP, evaluating a policy this way is therefore also akin to generating $m$ Monte Carlo trajectories and taking their empirical average return, but with the crucial difference that all the randomization is "fixed" in advance and "reused" for evaluating different $\pi$.

Having used $m$ scenarios to define $\hat{V}_{M'}(\pi')$ for all $\pi'$, we may search over policies to optimize $\hat{V}_{M'}(\pi')$. We call this policy search method PEGASUS: Policy Evaluation-of-Goodness And Search Using Scenarios. Since $\hat{V}_{M'}(\pi')$ is a deterministic function, the search procedure only needs to optimize a deterministic function, and any number of standard optimization methods may be used.

In the case that the action space is continuous and $\Pi = \{\pi_\theta\}$ is a smoothly parameterized family of policies (so that $\pi_\theta(s)$ is differentiable in $\theta$ for all $s$), then if all the relevant quantities are differentiable, it is also possible to find the derivatives $d\hat{V}_{M'}(\pi_\theta)/d\theta$, and gradient ascent methods can be used to optimize $\hat{V}_{M'}(\pi_\theta)$. One common barrier to doing this is that $R$ is often discontinuous, being (say) 1 within a goal region and 0 elsewhere. One approach to dealing with this problem is to smooth $R$ out, possibly in combination with "continuation" methods that gradually unsmooth it again. An alternative approach that may be useful in the setting of continuous dynamical systems is to alter the reward function to use a continuous-time model of discounting. Assuming that the time at which the agent enters the goal region is differentiable in the system's dynamics, $\hat{V}_{M'}(\pi_\theta)$ is then again differentiable.

Footnote: More precisely, if the agent enters the goal region on some time step, then rather than giving it a reward of 1, we figure out what fraction $\tau \in [0,1]$ of that time step (measured in continuous time) the agent had taken to enter the goal region, and then give it reward $\gamma^\tau$ instead. Assuming $\tau$ is differentiable in the system's dynamics, then $\gamma^\tau$, and hence $\hat{V}_{M'}(\pi_\theta)$, are now also differentiable (other than on a usually measure-0 set, for example arising from truncation at $H_\epsilon$ steps).
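The following short Python sketch (ours; the helper names are hypothetical, and a simple one-dimensional noisy linear system with quadratic cost stands in for a real model) illustrates the resulting estimator. The same $m$ scenarios are drawn once and then reused to score every policy, so the estimate is a deterministic function of the policy.

```python
import numpy as np

def make_scenarios(m, horizon, init_sampler, rng):
    """Draw m scenarios: each is (s0, p_1..p_H), fixed once and reused."""
    return [(init_sampler(rng), rng.uniform(size=horizon)) for _ in range(m)]

def pegasus_value(policy, scenarios, g, reward, gamma):
    """Deterministic estimate of V(policy) from the fixed scenarios."""
    total = 0.0
    for s0, ps in scenarios:
        s, ret, discount = s0, 0.0, 1.0
        for p in ps:                      # truncated, H-step rollout
            ret += discount * reward(s)
            s = g(s, policy(s), p)        # deterministic transition of M'
            discount *= gamma
        total += ret
    return total / len(scenarios)

# Illustrative stand-ins (not the paper's domains):
g = lambda s, a, p: 0.9 * s + a + 0.1 * (2 * p - 1)   # noisy 1-d dynamics
reward = lambda s: -s * s
rng = np.random.default_rng(0)
scenarios = make_scenarios(m=20, horizon=50,
                           init_sampler=lambda r: r.uniform(-1, 1), rng=rng)

# Because the randomness is frozen, comparing policies is noiseless:
for k in (0.0, 0.5, 1.0):
    print(k, pegasus_value(lambda s, k=k: -k * s, scenarios, g, reward, gamma=0.95))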
4 Main theoretical results

PEGASUS samples a number of scenarios $m$, and uses them to form an approximation $\hat{V}(\pi)$ to $V(\pi)$. If $\hat{V}$ is a uniformly good approximation to $V$, then we can guarantee that optimizing $\hat{V}$ will result in a policy with value close to $\mathrm{opt}(M, \Pi)$. This section establishes conditions under which this occurs.

4.1 The case of finite action spaces

We begin by considering the case of two actions, $A = \{a_1, a_2\}$. Studying policy search in a similar setting, Kearns, Mansour and Ng [7] established conditions under which their algorithm gives uniformly good estimates of the values of policies. A key to that result was that uniform convergence can be established so long as the policy class $\Pi$ has low "complexity." This is analogous to the setting of supervised learning, where a learning algorithm that uses a hypothesis class $\mathcal{H}$ that has low complexity (such as in the sense of low VC dimension) will also enjoy uniform convergence of its error estimates to their means. In our setting, since $\Pi$ is just a class of functions mapping from $S$ into $A = \{a_1, a_2\}$, it is just a set of boolean functions. Hence $\mathrm{VC}(\Pi)$, its Vapnik-Chervonenkis dimension [14], is well defined. That is, we say $\Pi$ shatters a set of $m$ states if it can realize each of the $2^m$ possible action combinations on them, and $\mathrm{VC}(\Pi)$ is just the size of the largest set shattered by $\Pi$.
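As an illustration of this shattering condition (our sketch; the one-dimensional threshold policy class is a toy choice, not from the paper), the following brute-force check asks whether a two-action policy class realizes all $2^m$ action labelings on a given set of states.

```python
import numpy as np

def shatters(policies, states):
    """True if the 2-action policy class (each policy maps a state to 0 or 1)
    realizes all 2^m labelings of the given states."""
    realized = {tuple(pi(s) for s in states) for pi in policies}
    return len(realized) == 2 ** len(states)

# Toy class: threshold policies pi_t(s) = 1{s >= t} on 1-d states.
thresholds = np.linspace(-2.0, 2.0, 401)
policies = [lambda s, t=t: int(s >= t) for t in thresholds]

print(shatters(policies, [0.3]))        # True: the VC dimension is at least 1
print(shatters(policies, [0.3, 0.7]))   # False: no threshold labels (1, 0)
```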
The result of Kearns et al. then suffices to give the following theorem.

Footnote: The algorithm of Kearns, Mansour and Ng uses a "trajectory tree" method to find the estimates $\hat{V}(\pi)$; since each trajectory tree is of size exponential in the horizon time $H_\epsilon$, they were very expensive to build. Each scenario in PEGASUS can be viewed as a compact representation of a trajectory tree (with the technical difference that different subtrees are not constructed independently), and the proof given in Kearns et al. then applies without modification to give Theorem 1.

Theorem 1. Let a POMDP with two actions $A = \{a_1, a_2\}$ be given, and let $\Pi$ be a class of strategies for this POMDP, with Vapnik-Chervonenkis dimension $d = \mathrm{VC}(\Pi)$. Also let any $\epsilon, \delta > 0$ be fixed, and let $\hat{V}$ be the policy-value estimates determined by PEGASUS using $m$ scenarios and a horizon time of $H_\epsilon$. If

$$m = O\!\left( \frac{R_{\max}^{2}}{\epsilon^{2}(1-\gamma)^{2}} \left( H_\epsilon\, d \,\log\frac{1}{\epsilon(1-\gamma)} + \log\frac{1}{\delta} \right) \right), \qquad (5)$$

then with probability at least $1-\delta$, $\hat{V}$ will be uniformly close to $V$:

$$\left| \hat{V}(\pi) - V(\pi) \right| \le \epsilon \quad \text{for all } \pi \in \Pi. \qquad (6)$$

Using the transformation given in Kearns et al., the case of a finite action space with $|A| > 2$ also gives rise to essentially the same uniform-convergence result, so long as $\Pi$ has low "complexity." The bound given in the theorem has no dependence on the size of the state space or on the "complexity" of the POMDP's transitions and rewards. Thus, so long as $\Pi$ has low VC dimension, uniform convergence will occur, independently of how complicated the POMDP is. As in Kearns et al., this theorem therefore recovers the best analogous results in supervised learning, in which uniform convergence occurs so long as the hypothesis class has low VC dimension, regardless of the size or "complexity" of the underlying space and target function.

4.2 The case of infinite action spaces: "Simple" $\Pi$ is insufficient for uniform convergence

We now consider the case of infinite action spaces. Whereas, in the 2-action case, $\Pi$ being "simple" was sufficient to ensure uniform convergence, this is not the case in POMDPs with infinite action spaces. Suppose $A$ is a (countably or uncountably) infinite set of actions. A "simple" class of policies would be $\Pi = \{\pi_a \equiv a \mid a \in A\}$, the set of all policies that always choose the same action, regardless of the state. Intuitively, this is the simplest policy class that actually uses an infinite action space; also, any reasonable notion of complexity of policy classes should assign $\Pi$ a low "dimension." If it were true that simple policy classes imply uniform convergence, then certainly this $\Pi$ should always enjoy uniform convergence. Unfortunately, this is not the case, as we now show.

Theorem 2. Let $A$ be an infinite set of actions, and let $\Pi = \{\pi_a \equiv a \mid a \in A\}$ be the corresponding set of all "constant valued" policies. Then there exists a finite-state MDP with action space $A$, and a deterministic simulative model for it, so that PEGASUS' estimates using the deterministic simulative model do not uniformly converge to their means. That is, there is an $\epsilon > 0$, so that for estimates $\hat{V}$ derived using any finite number $m$ of scenarios and any finite horizon time, there is a policy $\pi \in \Pi$ so that

$$\left| \hat{V}(\pi) - V(\pi) \right| > \epsilon. \qquad (7)$$

The proof of this Theorem, which is not difficult, is in Appendix A. This result shows that simplicity of $\Pi$ is not sufficient for uniform convergence in the case of infinite action spaces. However, the counterexample used in the proof of Theorem 2 has a very complex $g$ despite the MDP being quite simple. Indeed, a different choice for $g$ would have made uniform convergence occur. (Footnote: For example, $g(s_0, a, p) = s_1$ if $p \le 0.5$, and $s_2$ otherwise; see Appendix A.) Thus, it is natural to hypothesize that assumptions on the "complexity" of $g$ are also needed to ensure uniform convergence. As we will shortly see, this intuition is roughly correct. Since actions affect transitions only through $g$, the crucial quantity is actually the composition of policies and the deterministic simulative model, in other words, the class $\mathcal{F}$ of the dynamics realizable in the POMDP and policy class, using a particular deterministic simulative model. In the next section, we show how assumptions on the complexity of $\mathcal{F}$ lead to uniform convergence bounds of the type we desire.
4.3 Uniform convergence in the case of infinite action spaces

For the remainder of this section, assume $S = [0,1]^{d_S}$. Then $\mathcal{F}$ is a class of functions mapping from $[0,1]^{d_S} \times [0,1]^{d_p}$ into $[0,1]^{d_S}$, and so a simple way to capture its "complexity" is to capture the complexity of its families of coordinate functions $\mathcal{F}_i$, $i = 1, \ldots, d_S$. Each $\mathcal{F}_i$ is a family of functions mapping from $[0,1]^{d_S} \times [0,1]^{d_p}$ into $[0,1]$, the $i$-th coordinate of the state vector. Thus, $\mathcal{F}_i$ is just a family of real-valued functions: the family of $i$-th coordinate dynamics that $\Pi$ can realize, with respect to $g$.

The complexity of a class of boolean functions is measured by its VC dimension, defined to be the size of the largest set shattered by the class. To capture the "complexity" of real-valued families of functions such as $\mathcal{F}_i$, we need a generalization of the VC dimension. The pseudo-dimension, due to Pollard [10], is defined as follows:

Definition (Pollard, 1990). Let $\mathcal{H}$ be a family of functions mapping from a space $X$ into $\mathbb{R}$. Let a sequence of $d$ points $x_1, \ldots, x_d \in X$ be given. We say $\mathcal{H}$ shatters $x_1, \ldots, x_d$ if there exists a sequence of real numbers $t_1, \ldots, t_d$ such that the subset of $\mathbb{R}^d$ given by $\{ (h(x_1) - t_1, \ldots, h(x_d) - t_d) \mid h \in \mathcal{H} \}$ intersects all $2^d$ orthants of $\mathbb{R}^d$ (equivalently, if for any sequence of bits $(b_1, \ldots, b_d) \in \{0,1\}^d$, there is a function $h \in \mathcal{H}$ such that $h(x_i) \ge t_i$ iff $b_i = 1$, for all $i = 1, \ldots, d$). The pseudo-dimension of $\mathcal{H}$, denoted $\dim_p(\mathcal{H})$, is the size of the largest set that $\mathcal{H}$ shatters, or infinite if $\mathcal{H}$ can shatter arbitrarily large sets.

The pseudo-dimension generalizes the VC dimension, and coincides with it in the case that $\mathcal{H}$ maps into $\{0,1\}$. We will use it to capture the "complexity" of the classes $\mathcal{F}_i$ of the POMDP's realizable dynamics.
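To make the orthant condition concrete, here is a small brute-force check (our sketch, applied to a toy class of affine functions that is not from the paper) of whether a real-valued family pseudo-shatters a given set of points for a given choice of thresholds.

```python
import numpy as np

def pseudo_shatters(functions, points, thresholds):
    """True if for every bit pattern b there is some h with
    h(x_i) >= t_i exactly when b_i = 1."""
    patterns = {tuple(int(h(x) >= t) for x, t in zip(points, thresholds))
                for h in functions}
    return len(patterns) == 2 ** len(points)

# Toy family: affine functions h_{w,b}(x) = w*x + b on the real line.
functions = [lambda x, w=w, b=b: w * x + b
             for w in np.linspace(-3, 3, 61) for b in np.linspace(-3, 3, 61)]

# Two points with thresholds 0: all four sign patterns are realizable, so the
# family pseudo-shatters this pair; the full affine class has pseudo-dimension 2.
print(pseudo_shatters(functions, points=[0.0, 1.0], thresholds=[0.0, 0.0]))
```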
We also remind readers of the definition of Lipschitz continuity.

Definition. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz continuous (with respect to the Euclidean norm on its range and domain) if there exists a constant $B$ such that $\| f(x) - f(y) \| \le B \| x - y \|$ for all $x, y$. Here, $B$ is called a Lipschitz bound. A family of functions mapping from $\mathbb{R}^n$ into $\mathbb{R}^m$ is uniformly Lipschitz continuous with Lipschitz bound $B$ if every function in the family is Lipschitz continuous with Lipschitz bound $B$.

We now state our main theorem, with a corollary regarding when optimizing $\hat{V}$ will result in a provably good policy. The proof of Theorem 3, which uses techniques first introduced by Haussler [6] and Pollard [10], is quite lengthy, and is deferred to Appendix B.

Footnote: Using tools from [5], it is also possible to show similar uniform convergence results without Lipschitz continuity assumptions, by assuming that the family $\Pi$ is parameterized by a small number of real numbers, and that $\pi(s)$ (for all $\pi \in \Pi$), $g$, and $R$ are each implemented by a function that calculates its result using only a bounded number of the usual arithmetic operations on real numbers.

Theorem 3. Let a POMDP with state space $S = [0,1]^{d_S}$, and a possibly infinite action space $A$, be given. Also let a policy class $\Pi$ and a deterministic simulative model $g : S \times A \times [0,1]^{d_p} \to S$ for the POMDP be given. Let $\mathcal{F}$ be the corresponding family of realizable dynamics in the POMDP, and $\mathcal{F}_i$ the resulting families of coordinate functions. Suppose that $\dim_p(\mathcal{F}_i) \le d$ for each $i = 1, \ldots, d_S$, that each family $\mathcal{F}_i$ is uniformly Lipschitz continuous with Lipschitz bound at most $B$, and that the reward function $R : S \to [-R_{\max}, R_{\max}]$ is also Lipschitz continuous with Lipschitz bound at most $B$. Finally, let $\epsilon, \delta > 0$ be given, and let $\hat{V}$ be the policy-value estimates determined by PEGASUS using $m$ scenarios and a horizon time of $H_\epsilon$. If $m$ is at least a fixed polynomial in $d$, $d_S$, $d_p$, $H_\epsilon$, $B$, $R_{\max}$, $1/\epsilon$, $1/(1-\gamma)$ and $\log(1/\delta)$ (the explicit bound, which is polynomial rather than exponential in the horizon time $H_\epsilon$, follows from the proof in Appendix B), then with probability at least $1-\delta$, $\hat{V}$ will be uniformly close to $V$:

$$\left| \hat{V}(\pi) - V(\pi) \right| \le \epsilon \quad \text{for all } \pi \in \Pi. \qquad (8)$$

Corollary 4. Under the conditions of Theorem 1 or Theorem 3, let $m$ be chosen as in the corresponding theorem. Then with probability at least $1-\delta$, the policy $\hat\pi$ chosen by optimizing the value estimates, given by $\hat\pi = \arg\max_{\pi \in \Pi} \hat{V}(\pi)$, will be near-optimal in $\Pi$:

$$V(\hat\pi) \ge \mathrm{opt}(M, \Pi) - 2\epsilon. \qquad (9)$$

Remark. The (Lipschitz) continuity assumptions give a sufficient but not necessary set of conditions for the theorem, and other sets of sufficient conditions can be envisaged. For example, if we assume that the distribution on states induced by any policy at each time step has a bounded density, then we can show uniform convergence for a large class of ("reasonable") discontinuous reward functions, such as $R(s) = 1$ if $s_1 > 0.5$ and $0$ otherwise.

Footnote: Space constraints preclude a detailed discussion, but briefly, this is done by constructing two Lipschitz continuous reward functions that are "close to" $R$ and which respectively upper- and lower-bound $R$ (and which hence give value estimates that upper- and lower-bound our value estimates under $R$); using the assumption of bounded densities to show that the values under these two reward functions are $\epsilon$-close to those under $R$; applying Theorem 3 to show that uniform convergence occurs with each of them; and lastly deducing from this that uniform convergence occurs with $R$ as well.
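Before moving on to the experiments, here is a hedged illustration (our example, not from the paper) of how the assumptions of Theorem 3 can be verified in a simple case: one-dimensional linear dynamics with bounded uniform noise and linearly parameterized policies. The coordinate family is then contained in a low-dimensional vector space of functions, which bounds its pseudo-dimension, and its Lipschitz bound is explicit.

```latex
% Illustrative only: a toy case in which Theorem 3's assumptions are easy to check.
% Assumed dynamics:  s' = g(s, a, p) = \alpha s + \beta a + \sigma (2p - 1),
% with s \in [0,1], p \in [0,1], and policies \pi_\theta(s) = \theta s, |\theta| \le \Theta.
%
% Realizable dynamics:  f_{\pi_\theta}(s, p) = (\alpha + \beta\theta)\, s + \sigma (2p - 1).
%
% Pseudo-dimension: every f_{\pi_\theta} lies in the span of the functions
% (s,p) \mapsto s, (s,p) \mapsto p, and the constant function, a vector space of
% dimension 3, so \dim_p(\mathcal{F}_1) \le 3 by Pollard's result [10].
%
% Lipschitz bound:  |f_{\pi_\theta}(s,p) - f_{\pi_\theta}(\tilde s,\tilde p)|
%   \le (|\alpha| + |\beta|\,\Theta)\,|s - \tilde s| + 2\sigma\,|p - \tilde p|,
% so B = |\alpha| + |\beta|\,\Theta + 2\sigma suffices, uniformly over \theta.
```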
5 Experiments

In this section, we report the results of two experiments. The first, run to examine the behavior of PEGASUS parametrically, involved a simple gridworld POMDP. The second studied a complex continuous-state/continuous-action problem involving riding a bicycle.

Figure 1a shows the finite state and action POMDP used in our first experiment. In this problem, the agent starts in the lower-left corner, and receives a $-1$ reinforcement per step until it reaches the absorbing state in the upper-right corner. The eight possible observations, also shown in the figure, indicate whether each of the squares adjoining the current position contains a wall. The policy class is small, consisting of all $4^8$ functions mapping from the eight possible observations to the four actions corresponding to trying to move in each of the compass directions. Actions are noisy, and result in moving in a random direction 20% of the time. Since the policy class is small enough to exhaustively enumerate, our optimization algorithm for searching over policies was simply exhaustive search, trying all $4^8$ policies on the $m$ scenarios and picking the best one. Our experiments were done with $\gamma = 0.99$ and a horizon time of $H = 100$, and all results reported on this problem are averages over 10000 trials. The deterministic simulative model was

$$g(s, a, p) = \begin{cases} s \oplus a_1 & \text{if } 0.00 \le p < 0.05 \\ s \oplus a_2 & \text{if } 0.05 \le p < 0.10 \\ s \oplus a_3 & \text{if } 0.10 \le p < 0.15 \\ s \oplus a_4 & \text{if } 0.15 \le p < 0.20 \\ s \oplus a & \text{otherwise,} \end{cases}$$

where $a_1, \ldots, a_4$ are the four compass directions, and $s \oplus a$ denotes the result of moving one step from $s$ in the direction indicated by $a$, and is $s$ if this move would result in running into a wall.
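A compact Python sketch of this setup follows (ours, and a simplification: the observation encoding, the reward handling at the absorbing state, and the handful of candidate policies stand in for the paper's eight observations and full class of $4^8$ policies). It shows the threshold-style deterministic simulative model together with exhaustive search over a policy class on a fixed set of scenarios.

```python
import numpy as np

MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
DIRS = ['N', 'S', 'E', 'W']
SIZE, GOAL = 5, (4, 4)

def move(s, a):
    """One step in direction a; stay put if it would hit a wall."""
    x, y = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return s if not (0 <= x < SIZE and 0 <= y < SIZE) else (x, y)

def g(s, a, p):
    """Deterministic simulative model: 5% chance of each compass direction,
    otherwise the intended direction (thresholds as in the text)."""
    for k, d in enumerate(DIRS):
        if p < 0.05 * (k + 1):
            return move(s, d)
    return move(s, a)

def observe(s):
    """Simplified observation: which of the 4 adjacent squares are walls."""
    return tuple(move(s, d) == s for d in DIRS)

def value_estimate(policy, scenarios, gamma=0.99):
    total = 0.0
    for s0, ps in scenarios:
        s, ret, disc = s0, 0.0, 1.0
        for p in ps:
            if s == GOAL:              # absorbing goal state
                break
            ret += disc * (-1.0)
            s, disc = g(s, policy(observe(s)), p), disc * gamma
        total += ret
    return total / len(scenarios)

rng = np.random.default_rng(0)
scenarios = [((0, 0), rng.uniform(size=100)) for _ in range(10)]

# A few hand-written candidate policies (stand-ins for the full 4^8 class):
candidates = [lambda obs: 'N', lambda obs: 'E',
              lambda obs: 'E' if not obs[2] else 'N',
              lambda obs: 'N' if not obs[0] else 'E']
best = max(candidates, key=lambda pi: value_estimate(pi, scenarios))
print(value_estimate(best, scenarios))
```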
Figure 1b shows the result of running this experiment, for different numbers of scenarios. The value of the best policy within $\Pi$ is indicated by the topmost horizontal line, and the solid curve below that is the mean policy value when using our algorithm. As we see, even using surprisingly small numbers of scenarios, the algorithm manages to find good policies, and as $m$ becomes large, the value also approaches the optimal value.

[Figure 1 appears here. Panel (a) shows the 5x5 gridworld, with start S, goal G, and the 8 observations; panel (b) plots mean policy value against the number of scenarios $m$ (from 0 to 30).]

Figure 1: (a) 5x5 gridworld, with the 8 observations. (b) PEGASUS results using the normal and complex deterministic simulative models. The topmost horizontal line shows the value of the best policy in $\Pi$; the solid curve is the mean policy value using the normal model; the lower curve is the mean policy value using the complex model. The (almost negligible) 1 s.e. bars are also plotted.

We had previously predicted that a "complicated" deterministic simulative model $g$ can lead to poor results. For each $(s,a)$-pair, let $h_{sa} : [0,1] \to [0,1]$ be a hash function that maps any Uniform$[0,1]$ random variable into another Uniform$[0,1]$ random variable. Then if $g$ is a deterministic simulative model, $\tilde g(s,a,p) = g(s,a,h_{sa}(p))$ is another one that, because of the presence of the hash function, is a much more "complex" model than $g$. (Here, we appeal to the reader's intuition about complex functions, rather than formal measures of complexity.) We would therefore predict that using PEGASUS with $\tilde g$ would give worse results than with $g$, and indeed this prediction is borne out by the results shown in Figure 1b (dashed curve). The difference between the curves is not large, but this is also not unexpected given the small size of the problem.

Footnote: In our experiments, this was implemented by choosing, for each $(s,a)$ pair, a random integer $k_{sa}$ from a fixed range, and then letting $h_{sa}(p) = \mathrm{frac}(k_{sa}\, p)$, where $\mathrm{frac}(x)$ denotes the fractional part of $x$.

Footnote: Theory also predicts that the difference between the performance obtained with $g$ and with $\tilde g$ should be bounded, and small for a problem of this size; see [7].
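A minimal sketch of this construction (ours; the multiplier range is an illustrative assumption, and it assumes hashable states and actions):

```python
import numpy as np

def make_complex_model(g, seed=0, max_multiplier=1000):
    """Wrap a deterministic simulative model g into a distributionally
    equivalent but more 'complex' one: g_tilde(s, a, p) = g(s, a, h_sa(p)),
    where h_sa(p) = frac(k_sa * p) for a per-(s, a) random integer k_sa.
    frac(k * U) is still Uniform[0, 1) for a positive integer k, so the
    transition distributions are unchanged."""
    rng = np.random.default_rng(seed)
    multipliers = {}

    def g_tilde(s, a, p):
        key = (s, a)
        if key not in multipliers:
            multipliers[key] = int(rng.integers(1, max_multiplier + 1))
        return g(s, a, (multipliers[key] * p) % 1.0)

    return g_tilde
```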
Our second experiment used Randløv and Alstrøm's [11] bicycle simulator, where the objective is to ride to a goal one kilometer away. The actions are the torque $\tau$ applied to the handlebars and the displacement $\upsilon$ of the rider's center-of-gravity from the center. The six-dimensional state used in [11] includes variables for the bicycle's tilt angle and orientation, and the handlebar's angle. If the bicycle's tilt exceeds $\pi/15$, it falls over and enters an absorbing state, receiving a large negative reward. The randomness in the simulator comes from a uniformly distributed term added to the intended displacement of the center-of-gravity. Rescaled appropriately, this became the $p$ term of our deterministic simulative model.

We performed policy search over the following space: we selected a vector $\bar{x}$ of fifteen (simple, manually-chosen but not fine-tuned) features of each state; actions were then chosen with sigmoids: $\tau = \sigma(w_\tau \cdot \bar{x})(\tau_{\max} - \tau_{\min}) + \tau_{\min}$ and $\upsilon = \sigma(w_\upsilon \cdot \bar{x})(\upsilon_{\max} - \upsilon_{\min}) + \upsilon_{\min}$, where $\sigma(z) = 1/(1 + e^{-z})$. Note that since our approach can handle continuous actions directly, we did not, unlike [11], have to discretize the actions. The initial-state distribution was manually chosen to be representative of a "typical" state distribution when riding a bicycle, and was also not fine-tuned. We used only a small number ($m = 30$) of scenarios, together with the continuous-time model of discounting discussed earlier and a horizon of a few hundred steps, and (essentially) gradient ascent to optimize over the weights. Shaping rewards, to reward progress towards the goal, were also used. We ran 10 trials using our policy search algorithm, testing each of the resulting solutions on 50 rides.

Doing so, the median riding distances to the goal of the 10 different policies ranged from about 0.995km to 1.07km. In all 500 evaluation runs for the 10 policies, the worst distance we observed was also about 1.07km. These results are significantly better than those of [11], which reported riding distances of about 7km (since their policies often took very "non-linear" paths to the goal), and a single "best-ever" trial of about 1.7km.

Footnote: Running experiments without the continuous-time model of discounting, we also obtained, using a non-gradient-based hill-climbing algorithm, equally good results as those reported here. Our implementation of gradient ascent, using numerically evaluated derivatives, was run with a bound on the length of the step taken on any iteration, to avoid problems near $\hat{V}_{M'}(\pi_\theta)$'s discontinuities.

Footnote: Other experimental details: The shaping reward was proportional to, and signed the same as, the amount of progress towards the goal. As in [11], we did not include the distance-from-goal as one of the state variables during training; training therefore proceeded "infinitely distant" from the goal.

Footnote: Distances under 1km are possible since, as in [11], the goal has a 10m radius.
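A brief Python sketch of this action parameterization (ours; the feature map is left abstract, and the torque and displacement ranges are placeholders rather than the simulator's actual limits):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_bicycle_policy(w_torque, w_disp,
                        torque_range=(-2.0, 2.0), disp_range=(-0.02, 0.02)):
    """Smoothly parameterized policy over the two continuous actions:
    each action is a sigmoid of a linear function of the state features,
    rescaled into its allowed range."""
    def policy(features):
        t_lo, t_hi = torque_range
        d_lo, d_hi = disp_range
        torque = sigmoid(np.dot(w_torque, features)) * (t_hi - t_lo) + t_lo
        disp = sigmoid(np.dot(w_disp, features)) * (d_hi - d_lo) + d_lo
        return torque, disp
    return policy

# 15 features, as in the text; the weight vectors are what PEGASUS searches over.
rng = np.random.default_rng(0)
policy = make_bicycle_policy(rng.normal(size=15), rng.normal(size=15))
print(policy(rng.normal(size=15)))
```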
6 Conclusions

We have shown how any POMDP can be transformed into an "equivalent" one in which all transitions are deterministic. By approximating the transformed POMDP's initial-state distribution with a sample of scenarios, we defined an estimate for the value of every policy, and finally performed policy search by optimizing these estimates. Conditions were established under which these estimates will be uniformly good, and experimental results showed our method working well. It is also straightforward to extend these methods and results to the cases of finite-horizon undiscounted reward, and infinite-horizon average reward with $\epsilon$-mixing time $H_\epsilon$.

Acknowledgements

We thank Jette Randløv and Preben Alstrøm for the use of their bicycle simulator, and Ben Van Roy for helpful comments. A. Ng is supported by a Berkeley Fellowship. This work was also supported by ARO MURI DAAH0496-0341, ONR MURI N00014-00-1-0637, and NSF grant IIS-9988642.

References

[1] L. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In NIPS 11, 1999.
[2] Dimitri Bertsekas. Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, 1995.
[3] John R. Birge and Francois Louveaux. Introduction to Stochastic Programming. Springer, 1997.
[4] R. Durrett. Probability: Theory and Examples, 2nd edition. Duxbury, 1996.
[5] P. W. Goldberg and M. R. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18:131-148, 1995.
[6] D. Haussler. Decision-theoretic generalizations of the PAC model for neural networks and other applications. Information and Computation, 100:78-150, 1992.
[7] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. (Extended version of the paper in NIPS 12), 1999.
[8] H. Kimura, M. Yamamura, and S. Kobayashi. Reinforcement learning by stochastic hill climbing on discounted reward. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.
[9] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Uncertainty in Artificial Intelligence, Proceedings of the Fifteenth Conference, 1999.
[10] D. Pollard. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 2. Institute of Mathematical Statistics and American Statistical Association, 1990.
[11] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
[12] Benjamin Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998.
[13] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
[14] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
[15] J. K. Williams and S. Singh. Experiments with an algorithm which learns stochastic memoryless policies for POMDPs. In NIPS 11, 1999.
[16] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

Appendix A: Proof of Theorem 2

Proof (of Theorem 2). We construct an MDP with states $s_0$, $s_1$ and $s_2$, plus an absorbing state. The reward function is $R(s_1) = 1$, $R(s_2) = -1$, and $R = 0$ elsewhere. Discounting is ignored in this construction. Both $s_1$ and $s_2$ transition with probability 1 to the absorbing state regardless of the action taken. The initial state $s_0$ has a 0.5 chance of transitioning to each of $s_1$ and $s_2$.

We now construct $g$, which will depend in a complicated way on the $p$ term. Let $\mathcal{B}$ be the countable set of all finite unions of intervals with rational endpoints in $[0,1]$. Let $\mathcal{B}_{1/2}$ be the countable subset of $\mathcal{B}$ that contains all elements of $\mathcal{B}$ that have total length (Lebesgue measure) exactly 0.5; for example, $[0.25, 0.75]$ and $[0, 0.3] \cup [0.5, 0.7]$ are both in $\mathcal{B}_{1/2}$. Let $B_1, B_2, \ldots$ be an enumeration of the elements of $\mathcal{B}_{1/2}$. Also let $a_1, a_2, \ldots$ be an enumeration of (some countably infinite subset of) $A$. The deterministic simulative model on these actions is given by:

$$g(s_0, a_i, p) = \begin{cases} s_1 & \text{if } p \in B_i \\ s_2 & \text{otherwise.} \end{cases}$$

So, $P_{s_0 a_i}(s_1) = P_{s_0 a_i}(s_2) = 0.5$ for all $i$, and this is a correct deterministic simulative model for the MDP. Note also that $V(\pi) = 0$ for all $\pi \in \Pi$.

For any finite sample of $m$ scenarios $(s_0, p_1^{(1)}, \ldots), \ldots, (s_0, p_1^{(m)}, \ldots)$, there exists some $B_i \in \mathcal{B}_{1/2}$ that contains $p_1^{(j)}$ for all $j = 1, \ldots, m$. Thus, evaluating $\pi_{a_i}$ using this set of scenarios, all $m$ simulated trajectories will transition from $s_0$ to $s_1$, so the value estimate for $\pi_{a_i}$ is $\hat{V}(\pi_{a_i}) = 1$ (assuming the horizon time is at least 1). Since this argument holds for any finite number $m$ of scenarios, we have shown that $\hat{V}$ does not uniformly converge to $V(\pi) = 0$ (over $\pi \in \Pi$). $\square$
Appendix B: Proof of Theorem 3

Due to space constraints, this proof will be slightly dense. The proof techniques we use are due to Haussler [6] and Pollard [10]. Haussler [6], to which we will be repeatedly referring, provides a readable introduction to most of the methods used here.

We begin with some standard definitions from [6]. For a subset $\mathcal{D}$ of a space $X$ endowed with a (pseudo-)metric $\rho$, we say $\mathcal{C} \subseteq X$ is an $\epsilon$-cover for $\mathcal{D}$ if, for every $y \in \mathcal{D}$, there is some $y' \in \mathcal{C}$ such that $\rho(y, y') \le \epsilon$. For each $\epsilon > 0$, let $\mathcal{N}(\epsilon, \mathcal{D})$ denote the size of the smallest $\epsilon$-cover for $\mathcal{D}$. Let $\mathcal{H}$ be a family of functions mapping from a set $X$ into a bounded pseudo-metric space, and let $P$ be a probability measure on $X$. Define a pseudo-metric on $\mathcal{H}$ by $d_{L^1(P)}(h, h') = \mathbb{E}_{x \sim P}[\rho(h(x), h'(x))]$. Define the capacity of $\mathcal{H}$ to be $\mathcal{C}(\epsilon, \mathcal{H}) = \sup_P \mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P)})$, where the supremum is over all probability measures $P$ on $X$. The quantity $\mathcal{C}(\epsilon, \mathcal{H})$ thus measures the "richness" of the class $\mathcal{H}$. (Footnote: This differs slightly from the definition used in [6], which includes an additional factor.) Note that $\mathcal{N}$ and $\mathcal{C}$ are both decreasing functions of $\epsilon$, and that $\mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P)}) \le \mathcal{C}(\epsilon, \mathcal{H})$ for any probability measure $P$ and any $\epsilon > 0$.

The main results obtained with the pseudo-dimension are uniform convergence of the empirical means of classes of random variables to their true means. Let $\mathcal{H}$ be a family of functions mapping from $X$ into $[0, M]$, and let $\bar{x} = (x_1, \ldots, x_m)$ (the "training set") be $m$ i.i.d. draws from some probability measure $P$ over $X$. Then for each $h \in \mathcal{H}$, let $\hat{\mathbb{E}}_{\bar{x}}[h] = \frac{1}{m} \sum_{i=1}^{m} h(x_i)$ be the empirical mean of $h$, and let $\mathbb{E}[h] = \mathbb{E}_{x \sim P}[h(x)]$ be the true mean.
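To illustrate the quantities in these definitions (our sketch; the function family and the measure are toy choices, not from the paper), the following code estimates the $d_{L^1(P)}$ pseudo-metric by sampling, and builds a greedy $\epsilon$-cover; being a valid $\epsilon$-cover, its size upper-bounds $\mathcal{N}(\epsilon, \cdot)$.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(size=2000)          # samples from P = Uniform[0, 1]

# Toy family H: functions h_w(x) = |x - w| for w on a grid in [0, 1].
ws = np.linspace(0.0, 1.0, 101)
values = np.abs(xs[None, :] - ws[:, None])   # values[j, i] = h_{w_j}(x_i)

def d_L1(j, k):
    """Monte Carlo estimate of the L1(P) pseudo-metric between h_{w_j}, h_{w_k}."""
    return float(np.mean(np.abs(values[j] - values[k])))

def greedy_cover(eps):
    """Greedy eps-cover of the family; its size upper-bounds N(eps, H, d_L1(P))."""
    centers = []
    for j in range(len(ws)):
        if all(d_L1(j, c) > eps for c in centers):
            centers.append(j)
    return centers

print(len(greedy_cover(0.05)), len(greedy_cover(0.01)))  # cover grows as eps shrinks
```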
Suppose we,êcan for, each ± ·3Êshow, w ¹N ± ³ probability , Ef º > ¤· E º > ~ ² w z ± ³ 2o*+%d-d%d ¹m,¯® > & ± ³ , that with X (13) Then bound, that probability ,·mÊ by the union · we w know ¹ ,with z E 2 º *+> -d%d-d~ ² ± ³ , Ef º > holds & simulta ± ³ and for all > X . This neously for all ± & implies that, for all > X , Ef ¤· ~ > Ef ® ~ > º » ® · > º º ¯$ º» º ¯ $ d ~ E Ef º > !· E º E º ¹ > ® ¹ > º » º ² º ¯$ E º > · E > capacity of Æ (and hence prove uniform converge over Æ ), f W ¥ ; º (over we have & also proved uniform convergence for EF X ). all > 2ò,%d-d-d- w z º» º · ~ E º > E > where we used the fact that º ¯$ w z ² , by construction of the ² -horizon time. But this is exactly the desired result. Thus, we need only prove that 2 Equation (13) holds with high probability for each ± *+%d-d%d ± ³. Part II: Bounding the capacity. Let ± ~ ± ³ be& fixed. pl We now write out the mapping from a scenario ª © ( *+%,67eq¢ to the ± -th step reward. Since this mapping depends only on the first :t ± elements of the “s ”s portion of the scenario, we will, with some abuse of notation, write &NÌlp( *+%,67 q º , and ignore its other the scenario as © ª ª coordinates. Thus, a scenario may now be written as © s s -d-d%d- s 7eq º . h 2 For ( *+%,each 67 q º ,{-d-d%d ¹, ² h 7 9 ² 7 q ¤ h h ; AC (·" .01 " .01 6 © ¹ ª ; t à © ; t à ª ± ³:t 7e79 ì ² z p< ¹ :<; 7e7 9 ± ³:{t ì ² (14) Finally, applying Lemma 7 with each of the ¹ ’s N2 , being the 2 on the appropriate space, , and ² norm ± ¹Ì, º ± $ ²$ , we find ² Æ º ¤ £ ~ h ¯ £ ~ ~ º h ¯ + w ² ± z 7 9 ç ¹N, z p< :<; ¹ º e $ - ± ³:{t ± Æ ¹N, h z 79 º »´³ z p< : ; ¹ ± ³ : t - ± ³ ¹N, 7e7 9 º $ ² $ ² h ² 7 q ¤ f WF ¥ º ? Now, let E be the reward received on the -th step when executing from a scenario ± > &è , this defines a family . As we let > vary over X ( ·"/.01<"/.016 of maps from scenarios into . Clearly, this family of maps is a subset of Æ . Thus, if we can bound the z 7 9 ç ~ ² Æ l º 1¤ 7eq (where the definition of the free product of hª © sets of functions is as given in Lemma such an 2 6); note ¹ Ƹ has Lipschitz2 /$ : ± : t . bound at most ; "G Also let Æ º ¤ be a 2 singleton set containing the ( ·" .01 " .01 6 h reward2 function, and ó -º% ¤ . Finally, Æ ( *5-Æ ,-68º 7 q ¤ ¥Æ º ¥( ·" .¥¢0Æ 1 " be let the family of maps from pln hº h .01 6 into . u :& , define ó ¤ he . For example, ó is just the space of h s scenarios (with only the first elements : ¤ t ± 2 2 ,%d-d-ofd- the ’s kept), and ó º ¤ . For each ± , deh ó ó Q¤ fine 2 a family of maps from into according l lB%-¢l l l l@-%¢to l Æ ± 2 where we have used the]fact is decreasing in its ² _5` that parameter. By taking a over probability measures , ² Ƹ . Now, as metrics over this is also a bound on J 79 ¤ º 7eq h O~ . Thus, this also gives © © ª ª, }l Given ( *+-,-687 q a family ( *5-of ,-6 functions (such as ) mapping Hlk( *+from %,67 q ¤ into ,ø we* extend its domain to for any finite ° simply by having it ignore the extra coordinates. Note this extension of the domain does not change the pseudo-dimension of a family of functions. 2,{-d%d-d ° ° , define a mapping ± from Also, for each Ìlp( *+-,-6kAC ( *+%,6 -d-d%d- s 2 according to ± s s 2 h s . For each ° , let ² be singleton sets. Where ± necessary, ± ’s domain is also extended as we have just described. 
Part II: Bounding the capacity. Let $0 \le t \le H_\epsilon$ be fixed. We now write out the mapping from a scenario to the $t$-th step reward. Since this mapping depends only on the first $d_p t$ elements of the "$p$"-portion of the scenario, we will, with some abuse of notation, write the scenario as $(s, p_1, \ldots, p_t) \in S \times [0,1]^{d_p t}$ (each $p_j \in [0,1]^{d_p}$), and ignore its other coordinates.

Footnote: Given a family of functions (such as $\mathcal{F}_i$) mapping $S \times [0,1]^{d_p}$ into $[0,1]$, we extend its domain to $S \times [0,1]^{d_p \ell}$ for any finite $\ell$ simply by having it ignore the extra coordinates. Note that this extension of the domain does not change the pseudo-dimension of the family.

For each $j = 1, \ldots, t$, define the class $\mathcal{G}_j$ of maps

$$(s, p_j, p_{j+1}, \ldots, p_t) \mapsto (f_\pi(s, p_j), p_{j+1}, \ldots, p_t), \quad \pi \in \Pi,$$

i.e., the free product (in the sense of Lemma 6) of the coordinate families $\mathcal{F}_1, \ldots, \mathcal{F}_{d_S}$ with the singleton classes of coordinate projections onto each of the remaining $p$-coordinates; where necessary, domains are extended as described in the footnote above. Also let $\mathcal{G}_{t+1}$ be the singleton class containing the reward function $R$. Now, consider the reward received on the $t$-th step when executing $\pi'$ from a scenario; as we let $\pi'$ vary over $\Pi'$, this defines a family of maps from scenarios into $[-R_{\max}, R_{\max}]$, and it is clearly a subset of the composition class $\mathcal{G}_{t+1} \circ \mathcal{G}_t \circ \cdots \circ \mathcal{G}_1$ (the same $f_\pi$ appears at every stage, whereas the composition class allows different members at different stages). Thus, if we can bound the capacity of the composition class (and hence prove uniform convergence over it), we will also have proved uniform convergence of $\hat{V}_t$ over all $\pi' \in \Pi'$.

By Lemma 5 and the assumption $\dim_p(\mathcal{F}_i) \le d$, each coordinate family satisfies $\mathcal{C}(\epsilon_0, \mathcal{F}_i) \le 2\left(\frac{2e}{\epsilon_0}\ln\frac{2e}{\epsilon_0}\right)^{d}$, while the projection classes, being singletons, have capacity 1. Lemma 6 then bounds the capacity of each stage $\mathcal{G}_j$ ($j \le t$) by $\left[\, 2\left(\frac{2ek}{\epsilon_0}\ln\frac{2ek}{\epsilon_0}\right)^{d} \right]^{d_S}$, where $k \le d_S + d_p H_\epsilon$ is the number of output coordinates of the stage. Moreover, by the uniform Lipschitz assumption, each stage is uniformly Lipschitz continuous (with respect to the Manhattan metrics on its domain and range) with Lipschitz bound at most $b = d_S B + 1$, and $R$ is Lipschitz continuous with bound at most $B \le b$. Applying Lemma 7 to the $(t+1)$-fold composition, with the scale at every stage set to $\epsilon_0 = \epsilon_1 / \big((t+1)\, b^{t}\big)$ for a target scale $\epsilon_1$, and taking logarithms, gives

$$\ln \mathcal{C}\!\left(\epsilon_1,\; \mathcal{G}_{t+1} \circ \cdots \circ \mathcal{G}_1\right) = O\!\left( d\, d_S\, H_\epsilon \left( H_\epsilon \ln(d_S B + 1) + \ln \frac{d_S\, d_p\, H_\epsilon}{\epsilon_1} \right) \right). \qquad (14)$$

Part III: Proving uniform convergence. The $t$-th step rewards take values in $[-R_{\max}, R_{\max}]$, an interval of length 2 under our normalization $R_{\max} = 1$. Applying Lemma 8 with the bound (14) on the capacity, at scale $\epsilon_1 = \epsilon(1-\gamma)/32$, we find that in order for the estimate of the expected $t$-th step reward to be $\epsilon(1-\gamma)/2$-close to its mean, uniformly over $\Pi'$ and with probability at least $1 - \delta/(H_\epsilon + 1)$, it suffices that

$$m = O\!\left( \frac{1}{\epsilon^2 (1-\gamma)^2} \left( d\, d_S\, H_\epsilon \left( H_\epsilon \ln(d_S B + 1) + \ln \frac{d_S\, d_p\, H_\epsilon}{\epsilon (1-\gamma)} \right) + \ln \frac{H_\epsilon + 1}{\delta} \right) \right).$$

(Undoing the normalization $R_{\max} = 1$ multiplies the leading factor by $R_{\max}^2$.) This is polynomial in all of the quantities named in the statement of Theorem 3, and in particular polynomial rather than exponential in the horizon time $H_\epsilon$. Taking the union bound over $t = 0, \ldots, H_\epsilon$ then gives Equation (13) simultaneously for all $t$ with probability at least $1 - \delta$, and by Part I this completes the proof of the Theorem. $\square$