PEGASUS: A policy search method for large MDPs and POMDPs
Andrew Y. Ng
Computer Science Division
UC Berkeley
Berkeley, CA 94720
Michael Jordan
Computer Science Division
& Department of Statistics
UC Berkeley
Berkeley, CA 94720
Abstract
We propose a new approach to the problem
of searching a space of policies for a Markov
decision process (MDP) or a partially observable Markov decision process (POMDP), given
a model. Our approach is based on the following
observation: Any (PO)MDP can be transformed
into an “equivalent” POMDP in which all state
transitions (given the current state and action) are
deterministic. This reduces the general problem
of policy search to one in which we need only
consider POMDPs with deterministic transitions.
We give a natural way of estimating the value of
all policies in these transformed POMDPs. Policy search is then simply performed by searching
for a policy with high estimated value. We also
establish conditions under which our value estimates will be good, recovering theoretical results
similar to those of Kearns, Mansour and Ng [7],
but with “sample complexity” bounds that have
only a polynomial rather than exponential dependence on the horizon time. Our method applies
to arbitrary POMDPs, including ones with infinite state and action spaces. We also present
empirical results for our approach on a small
discrete problem, and on a complex continuous
state/continuous action problem involving learning to ride a bicycle.
1 Introduction
In recent years, there has been growing interest in algorithms for approximate planning in (exponentially or even infinitely) large Markov decision processes (MDPs) and partially observable MDPs (POMDPs). For such large domains, the value and Q-functions are sometimes complicated and difficult to approximate, even though there may be simple, compactly representable policies that perform very well. This observation has led to particular interest in direct policy search methods (e.g., [16, 8, 15, 1, 7]), which attempt to choose a good policy from some restricted class of policies.

Most approaches to policy search assume access to the POMDP either in the form of the ability to execute trajectories in the POMDP, or in the form of a black-box “generative model” that enables the learner to try actions from arbitrary states. In this paper, we will assume a stronger model than these: roughly, we assume we have an implementation of a generative model, with the difference that it has no internal random number generator, so that it has to ask us to provide it with random numbers whenever it needs them (such as if it needs a source of randomness to draw samples from the POMDP’s transition distributions). This small change to a generative model results in what we will call a deterministic simulative model, and makes it surprisingly powerful.
We show how, given a deterministic simulative model, we can reduce the problem of policy search in an arbitrary POMDP to one in which all the transitions are deterministic—that is, a POMDP in which taking an action $a$ in a state $s$ will always deterministically result in transitioning to some fixed state $s'$. (The initial state in this POMDP may still be random.) This reduction is achieved by transforming the original POMDP into an “equivalent” one that has only deterministic transitions.
Our policy search algorithm then operates on these “simplified” transformed POMDPs. We call our method PEGASUS (for Policy Evaluation-of-Goodness And Search Using Scenarios, for reasons that will become clear). Our algorithm also bears some similarity to one used in Van Roy [12] for value determination in the setting of fully observable MDPs.
The remainder of this paper is structured as follows: Section 2 defines the notation that will be used in this paper, and formalizes the concepts of deterministic simulative
models and of families of realizable dynamics. Section 3
then describes how we transform POMDPs into ones with
only deterministic transitions, and gives our policy search
algorithm. Section 4 goes on to establish conditions under which we may give guarantees on the performance of
the algorithm, Section 5 describes our experimental results,
and Section 6 closes with conclusions.
2 Preliminaries
This section gives our notation, and introduces the concept
of the set of realizable dynamics of a POMDP under a policy class.
A Markov decision process (MDP) is a tuple $(S, D, A, \{P(\cdot|s,a)\}, \gamma, R)$ where: $S$ is a set of states; $D$ is the initial-state distribution, from which the start-state $s_0$ is drawn; $A$ is a set of actions; $\{P(\cdot|s,a)\}$ are the transition probabilities, with $P(\cdot|s,a)$ giving the next-state distribution upon taking action $a$ in state $s$; $\gamma \in [0,1)$ is the discount factor; and $R$ is the reward function, bounded by $R_{\max}$. For the sake of concreteness, we will assume, unless otherwise stated, that $S = [0,1]^{d_S}$ is a $d_S$-dimensional hypercube. For simplicity, we also assume rewards are deterministic, and written $R(s)$ rather than $R(s,a)$, the extensions being trivial. Lastly, everything that needs to be measurable is assumed to be measurable.
A policy is any mapping $\pi : S \to A$. The value function of a policy $\pi$ is a map $V^\pi : S \to \mathbb{R}$, so that $V^\pi(s)$ gives the expected discounted sum of rewards for executing $\pi$ starting from state $s$. With some abuse of notation, we also define the value of a policy, with respect to the initial-state distribution $D$, according to

$V(\pi) = \mathrm{E}_{s_0 \sim D}\left[ V^\pi(s_0) \right]$   (1)

(where the subscript $s_0 \sim D$ indicates that the expectation is with respect to $s_0$ drawn according to $D$). When we are considering multiple MDPs and wish to make explicit that a value function is for a particular MDP $M$, we will also write $V_M^\pi$, $V_M(\pi)$, etc.
In the policy search setting, we have some fixed class $\Pi$ of policies, and desire to find a good policy $\pi \in \Pi$. More precisely, for a given MDP $M$ and policy class $\Pi$, define

$\mathrm{opt}(M, \Pi) = \sup_{\pi \in \Pi} V_M(\pi)$   (2)

Our goal is to find a policy $\hat{\pi} \in \Pi$ so that $V(\hat{\pi})$ is close to $\mathrm{opt}(M, \Pi)$.
Note that this framework also encompasses cases where our family $\Pi$ consists of policies that depend only on certain aspects of the state. In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies.¹ By introducing artificial “memory variables” into the process state, we can also define stochastic limited-memory policies [9] (which certainly permits some belief state tracking).

¹Although we have not explicitly addressed stochastic policies so far, they are a straightforward generalization (e.g. using the transformation to deterministic policies given in [7]).
Since we are interested in the “planning” problem, we assume that we are given a model of the (PO)MDP. Much previous work has studied the case of (PO)MDPs specified via a generative model [7, 13], which is a stochastic function that takes as input any $(s,a)$ state-action pair, and outputs $s'$ drawn according to $P(\cdot|s,a)$ (and the associated reward). In this paper, we assume a stronger model. We assume we have a deterministic function $g : S \times A \times [0,1]^{d_P} \to S$, so that for any fixed $(s,a)$-pair, if $p$ is distributed Uniform$[0,1]^{d_P}$, then $g(s,a,p)$ is distributed according to the transition distribution $P(\cdot|s,a)$. In other words, to draw a sample from $P(\cdot|s,a)$ for some fixed $s$ and $a$, we need only draw $p$ uniformly in $[0,1]^{d_P}$, and then take $g(s,a,p)$ to be our sample. We will call such a model a deterministic simulative model for a (PO)MDP.
Since a deterministic simulative model allows us to simulate a generative model, it is clearly a stronger model. However, most computer implementations of generative models also provide deterministic simulative models. Consider a generative model that is implemented via a procedure that takes $s$ and $a$, makes at most $d_P$ calls to a random number generator, and then outputs $s'$ drawn according to $P(\cdot|s,a)$. Then this procedure is already providing a deterministic simulative model. The only difference is that the deterministic simulative model has to make explicit (or “expose”) its interface to the random number generator, via $p$. (A generative model implemented via a physical simulation of an MDP with “resets” to arbitrary states does not, however, readily lead to a deterministic simulative model.)
Let us examine some simple examples of deterministic simulative models. Suppose that for a state-action pair $(s_1, a_1)$ and some states $s_2$ and $s_3$, $P(s_2|s_1,a_1) = 1/3$ and $P(s_3|s_1,a_1) = 2/3$. Then we may choose $d_P = 1$, so that $p$ is just a real number, and let $g(s_1, a_1, p) = s_2$ if $p \le 1/3$, and $g(s_1, a_1, p) = s_3$ otherwise. As another example, suppose $S = \mathbb{R}$, and $P(\cdot|s,a)$ is a normal distribution with a cumulative distribution function $F_{s,a}$. Again letting $d_P = 1$, we may choose $g$ to be $g(s,a,p) = F_{s,a}^{-1}(p)$.
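To make the two examples concrete, here is a minimal Python sketch (our illustration, not code from the paper; the state labels and the mean/standard-deviation choices in the Gaussian case are arbitrary assumptions) that builds each model from a single uniform number $p$:

```python
from statistics import NormalDist

# Example 1: discrete transition with P(s2|s1,a1) = 1/3 and P(s3|s1,a1) = 2/3.
def g_discrete(s, a, p):
    # p is a single number drawn Uniform[0,1]; thresholding it at 1/3
    # reproduces the desired next-state distribution.
    return "s2" if p <= 1.0 / 3.0 else "s3"

# Example 2: S = R and P(.|s,a) is a normal distribution with CDF F_{s,a}.
# Taking g(s,a,p) = F_{s,a}^{-1}(p) maps a Uniform[0,1] draw to that normal.
def g_gaussian(s, a, p, mean=lambda s, a: s + a, std=1.0):
    return NormalDist(mean(s, a), std).inv_cdf(p)
```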
It is a fact of probability and measure theory that, given any transition distribution $P(\cdot|s,a)$, such a deterministic simulative model $g$ can always be constructed for it. (See, e.g. [4].) Indeed, some texts (e.g. [2]) routinely define POMDPs using essentially deterministic simulative models. However, there will often be many different choices of $g$ for representing a (PO)MDP, and it will be up to the user to decide which one is most “natural” to implement. As we will see later, the particular choice of $g$ that the user makes can indeed impact the performance of our algorithm, and “simpler” (in a sense to be formalized) implementations are generally preferred.
To close this section, we introduce a concept that will be useful later, that captures the family of dynamics that a (PO)MDP and policy class can exhibit. Assume a deterministic simulative model $g$, and fix a policy $\pi$. If we are executing $\pi$ from some state $s$, the successor-state is determined by $f_\pi(s,p) = g(s, \pi(s), p)$, which is a function of $s$ and $p$. Varying $\pi$ over $\Pi$, we get a whole family of functions $\mathcal{F} = \{ f_\pi(s,p) = g(s, \pi(s), p) : \pi \in \Pi \}$ mapping from $S \times [0,1]^{d_P}$ into successor states. This set of functions $\mathcal{F}$ should be thought of as the family of dynamics realizable by the POMDP and $\Pi$, though since its definition does depend on the particular deterministic simulative model $g$ that we have chosen, this is “as expressed with respect to $g$.” For each function $f$, also let $f_i$ be the $i$-th coordinate function (so that $f_i(s,p)$ is the $i$-th coordinate of $f(s,p)$) and let $\mathcal{F}_i$ be the corresponding families of coordinate functions mapping from $S \times [0,1]^{d_P}$ into $[0,1]$. Thus, $\mathcal{F}_i$ captures all the ways that coordinate $i$ of the state can evolve.
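As a small illustrative sketch (assumptions: the model $g$ and the policies are Python callables, and states are indexable vectors), the realizable dynamics are just the compositions $g(s, \pi(s), p)$, and the coordinate families read off single coordinates of them:

```python
def make_f(g, pi):
    """The dynamics f_pi(s, p) = g(s, pi(s), p) realized by policy pi under model g."""
    return lambda s, p: g(s, pi(s), p)

def make_f_coord(g, pi, i):
    """The i-th coordinate of f_pi(s, p), i.e. one member of the family F_i."""
    return lambda s, p: make_f(g, pi)(s, p)[i]

# F   would be {make_f(g, pi) : pi in policy_class}          (realizable dynamics)
# F_i would be {make_f_coord(g, pi, i) : pi in policy_class} (its i-th coordinates)
```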
We are now ready to describe our policy search method.
3 Policy search method
In this section, we show how we transform a (PO)MDP into an “equivalent” one that has only deterministic transitions. This then leads to natural estimates $\hat{V}(\pi)$ of the policies’ values $V(\pi)$. Finally, we may search over policies to optimize $\hat{V}(\pi)$, to find a (hopefully) good policy.
3.1 Transformation of (PO)MDPs
Given a (PO)MDP $M = (S, D, A, \{P(\cdot|s,a)\}, \gamma, R)$ and a policy class $\Pi$, we describe how, using a deterministic simulative model $g$ for $M$, we construct our transformed POMDP $M' = (S', D', A, \{P'(\cdot|s,a)\}, \gamma, R')$ and corresponding class of policies $\Pi'$, so that $M'$ has only deterministic transitions (though its initial state may still be random). To simplify the exposition, we assume $d_P = 1$, so that the terms $p$ are just real numbers.
$M'$ is constructed as follows: The action space and discount factor for $M'$ are the same as in $M$. The state space for $M'$ is $S \times [0,1]^\infty$. In other words, a typical state in $M'$ can be written as a vector $(s, p_1, p_2, \ldots)$ — this consists of a state from the original state space $S$, followed by an infinite sequence of real numbers in $[0,1]$.
The rest of the transformation is straightforward. Upon taking action $a$ in state $(s, p_1, p_2, \ldots)$ in $M'$, we deterministically transition to the state $(s', p_2, p_3, \ldots)$, where $s' = g(s, a, p_1)$. In other words, the $S$ portion of the state (which should be thought of as the “actual” state) changes to $s'$, and one number in the infinite sequence $p_1, p_2, \ldots$ is used up to generate $s'$ from the correct distribution. By the definition of the deterministic simulative model $g$, we see that so long as $p_1 \sim \mathrm{Uniform}[0,1]$, then the “next-state” distribution of $s'$ is the same as if we had taken action $a$ in state $s$ (randomization over $p_1$).
Finally, we choose $D'$, the initial-state distribution over $S'$, so that $(s, p_1, p_2, \ldots)$ drawn according to $D'$ will have $s$ distributed according to $D$, and the $p_i$’s distributed i.i.d. Uniform$[0,1]$. For each policy $\pi \in \Pi$, also let there be a corresponding $\pi' \in \Pi'$, given by $\pi'(s, p_1, p_2, \ldots) = \pi(s)$, and let the reward be given by $R'(s, p_1, p_2, \ldots) = R(s)$.
If one observes only the “$s$”-portion (but not the $p_i$’s) of a sequence of states generated in the POMDP $M'$ using policy $\pi'$, one obtains a sequence that is drawn from the same distribution as would have been generated from the original (PO)MDP $M$ under the corresponding policy $\pi \in \Pi$. It follows that, for corresponding policies $\pi \in \Pi$ and $\pi' \in \Pi'$, we have that $V_M(\pi) = V_{M'}(\pi')$. This also implies that the best possible expected returns in both (PO)MDPs are the same: $\mathrm{opt}(M, \Pi) = \mathrm{opt}(M', \Pi')$.
To summarize, we have shown how, using a deterministic simulative model, we can transform any POMDP $M$ and policy class $\Pi$ into an “equivalent” POMDP $M'$ and policy class $\Pi'$, so that the transitions in $M'$ are deterministic; i.e., given a state $s' \in S'$ and an action $a \in A$, the next-state in $M'$ is exactly determined. Since policies in $\Pi$ and $\Pi'$ have the same values, if we can find a policy $\pi' \in \Pi'$ that does well in $M'$ starting from $D'$, then the corresponding policy $\pi \in \Pi$ will also do well for the original POMDP $M$ starting from $D$. Hence, the problem of policy search in general POMDPs is reduced to the problem of policy search in POMDPs with deterministic transition dynamics. In the next section, we show how we can exploit this fact to derive a simple and natural policy search method.
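A minimal sketch of this transformation (illustrative only; `sample_s0`, the representation of states, and the finite prefix of the $p$-sequence are assumptions, the last of which anticipates Section 3.2):

```python
def initial_scenario(sample_s0, horizon, rng):
    """One state of M': an initial state s0 ~ D plus i.i.d. Uniform[0,1] numbers.
    Only `horizon` of the (conceptually infinite) p's are ever needed."""
    return (sample_s0(rng), [rng.random() for _ in range(horizon)])

def step(g, state, a):
    """Deterministic transition of M': consume one p to produce the next S-portion."""
    s, ps = state
    return (g(s, a, ps[0]), ps[1:])
```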
3.2 PEGASUS: A method for policy search
As discussed, it suffices for policy search to find a good policy $\pi' \in \Pi'$ for the transformed POMDP, since the corresponding policy $\pi \in \Pi$ will be just as good. To do this, we first construct an approximation $\hat{V}_{M'}$ to $V_{M'}$, and then search over policies $\pi' \in \Pi'$ to optimize $\hat{V}_{M'}(\pi')$ (as a proxy for optimizing the hard-to-compute $V_M(\pi)$), and thus find a (hopefully) good policy.

Recall that $V_{M'}$ is given by

$V_{M'}(\pi') = \mathrm{E}_{s'_0 \sim D'}\left[ V_{M'}^{\pi'}(s'_0) \right]$   (3)
where the expectation is over the initial state $s'_0 \in S'$ drawn according to $D'$. The first step in the approximation is to replace the expectation over the distribution $D'$ with a finite sample of states. More precisely, we first draw a sample $s_0^{(1)}, s_0^{(2)}, \ldots, s_0^{(m)}$ of $m$ initial states according to $D'$. These states, also called “scenarios” (a term from the stochastic optimization literature; see, e.g. [3]), define an approximation to $V_{M'}(\pi')$:

$V_{M'}(\pi') \approx \frac{1}{m} \sum_{i=1}^{m} V_{M'}^{\pi'}(s_0^{(i)})$   (4)
Since the transitions in $M'$ are deterministic, for a given state $s' \in S'$ and a policy $\pi' \in \Pi'$, the sequence of states that will be visited upon executing $\pi'$ from $s'$ is exactly determined; hence the sum of discounted rewards for executing $\pi'$ from $s'$ is also exactly determined. Thus, to calculate one of the terms $V_{M'}^{\pi'}(s_0^{(i)})$ in the summation in Equation (4) corresponding to scenario $s_0^{(i)}$, we need only use our deterministic simulative model to find the sequence of states visited by executing $\pi'$ from $s_0^{(i)}$, and sum up the resulting discounted rewards. Naturally, this would be an infinite sum, so the second (and standard) part of the approximation is to truncate this sum after some number $H$ of steps, where $H$ is called the horizon time. Here, we choose $H$ to be the $\epsilon$-horizon time $H_\epsilon = \log_\gamma\!\left(\epsilon(1-\gamma)/R_{\max}\right)$, so that (because of discounting) the truncation introduces at most $\epsilon$ error into the approximation.
To summarize, given $m$ scenarios $s_0^{(1)}, \ldots, s_0^{(m)}$, our approximation to $V_{M'}$ is the deterministic function

$\hat{V}_{M'}(\pi') = \frac{1}{m} \sum_{i=1}^{m} \left[ R(s_0^{(i)}) + \gamma R(s_1^{(i)}) + \cdots + \gamma^{H_\epsilon} R(s_{H_\epsilon}^{(i)}) \right]$

where $s_0^{(i)}, s_1^{(i)}, \ldots, s_{H_\epsilon}^{(i)}$ is the sequence of states deterministically visited by $\pi'$ starting from $s_0^{(i)}$. Given $m$ scenarios, this defines an approximation to $V_{M'}(\pi')$ for all policies $\pi' \in \Pi'$.
The final implementational detail is that, since the states $s_0^{(i)} \in S \times [0,1]^\infty$ are infinite-dimensional vectors, we have no way of representing them (and their successor states) explicitly. But because we will be simulating only $H_\epsilon$ steps, we need only represent the first $H_\epsilon$ of the $p$’s in each scenario — that is, the state $s_0^{(i)} = (s^{(i)}, p_1^{(i)}, p_2^{(i)}, \ldots, p_{H_\epsilon}^{(i)}, \ldots)$ — and so we will do just that. Viewed in the space of the original, untransformed POMDP, evaluating a policy this way is therefore also akin to generating $m$ Monte Carlo trajectories and taking their empirical average return, but with the crucial difference that all the randomization is “fixed” in advance and “reused” for evaluating different $\pi$.

Having used $m$ scenarios to define $\hat{V}_{M'}(\pi')$ for all $\pi'$, we may search over policies to optimize $\hat{V}_{M'}(\pi')$. We call this policy search method PEGASUS: Policy Evaluation-of-Goodness And Search Using Scenarios. Since $\hat{V}_{M'}(\pi')$ is a deterministic function, the search procedure only needs to optimize a deterministic function, and any number of standard optimization methods may be used.
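The following sketch makes this concrete (an illustration, not the authors’ implementation; `sample_s0`, `g` and `R` are assumed callables for the model at hand). The scenarios are drawn once and the same frozen randomness is reused for every policy, so the estimate is a deterministic function of the policy and can be handed to any standard optimizer:

```python
import math, random

def epsilon_horizon(gamma, eps, r_max):
    # Smallest H with gamma^H * r_max / (1 - gamma) <= eps,
    # i.e. H = log_gamma(eps * (1 - gamma) / r_max).
    return math.ceil(math.log(eps * (1.0 - gamma) / r_max, gamma))

def draw_scenarios(sample_s0, m, horizon, seed=0):
    # Drawn once and then frozen: each scenario is an initial state plus
    # `horizon` uniform random numbers.
    rng = random.Random(seed)
    return [(sample_s0(rng), [rng.random() for _ in range(horizon)]) for _ in range(m)]

def v_hat(pi, scenarios, g, R, gamma):
    """PEGASUS estimate: average truncated discounted return over the fixed scenarios."""
    total = 0.0
    for s, ps in scenarios:
        ret, discount = 0.0, 1.0
        for p in ps:                      # H_eps deterministic steps
            ret += discount * R(s)
            s = g(s, pi(s), p)            # one random number is consumed per transition
            discount *= gamma
        ret += discount * R(s)            # reward at the truncation step
        total += ret
    return total / len(scenarios)
```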
In the case that the action space is continuous and $\Pi = \{\pi_\theta : \theta \in \mathbb{R}^n\}$ is a smoothly parameterized family of policies (so $\pi_\theta(s)$ is differentiable in $\theta$ for all $s$), then if all the relevant quantities are differentiable, it is also possible to find the derivatives $\frac{d}{d\theta}\hat{V}_{M'}(\pi_\theta)$, and gradient ascent methods can be used to optimize $\hat{V}_{M'}(\pi_\theta)$. One common barrier to doing this is that $R$ is often discontinuous, being (say) 1 within a goal region and 0 elsewhere. One approach to dealing with this problem is to smooth $R$ out, possibly in combination with “continuation” methods that gradually unsmooth it again. An alternative approach that may be useful in the setting of continuous dynamical systems is to alter the reward function to use a continuous-time model of discounting.² Assuming that the time at which the agent enters the goal region is differentiable, then $\hat{V}_{M'}(\pi_\theta)$ is again differentiable.

4 Main theoretical results

PEGASUS samples a number of scenarios $s_0^{(i)}$ from $D'$, and uses them to form an approximation $\hat{V}(\pi)$ to $V(\pi)$. If $\hat{V}$ is a uniformly good approximation to $V$, then we can guarantee that optimizing $\hat{V}$ will result in a policy with value close to $\mathrm{opt}(M, \Pi)$. This section establishes conditions under which this occurs.

4.1 The case of finite action spaces
We begin by considering the case of two actions, $A = \{a_1, a_2\}$. Studying policy search in a similar setting, Kearns, Mansour and Ng [7] established conditions under which their algorithm gives uniformly good estimates of the values of policies. A key to that result was that uniform convergence can be established so long as the policy class $\Pi$ has low “complexity.” This is analogous to the setting of supervised learning, where a learning algorithm that uses a hypothesis class $\mathcal{H}$ that has low complexity (such as in the sense of low VC-dimension) will also enjoy uniform convergence of its error estimates to their means.
In our setting, since $\Pi$ is just a class of functions mapping from $S$ into $\{a_1, a_2\}$, it is just a set of boolean functions. Hence, $\mathrm{VC}(\Pi)$, its Vapnik-Chervonenkis dimension [14], is well defined. That is, we say $\Pi$ shatters a set of $m$ states if it can realize each of the $2^m$ possible action combinations on them, and $\mathrm{VC}(\Pi)$ is just the size of the largest set shattered by $\Pi$.
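For intuition, shattering can be checked by brute force on small cases (a sketch; the policy class and the candidate set of states are whatever the user supplies):

```python
def shatters(policies, states):
    """True if the two-action policy class realizes all 2^|states| action patterns
    on the given set of states."""
    realized = {tuple(pi(s) for s in states) for pi in policies}
    return len(realized) == 2 ** len(states)

# VC(Pi) is the size of the largest set of states for which `shatters` returns True.
```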
The result of Kearns et al. then suffices to give the following theorem.

Theorem 1 Let a POMDP with actions $A = \{a_1, a_2\}$ be given, and let $\Pi$ be a class of strategies³ for this POMDP, with Vapnik-Chervonenkis dimension $d = \mathrm{VC}(\Pi)$. Also let any $\epsilon, \delta > 0$ be fixed, and let $\hat{V}$ be the policy-value estimates determined by PEGASUS using $m$ scenarios and a horizon time of $H_\epsilon$. If

$m$ is at least a value that is polynomial in $R_{\max}$, $1/\epsilon$, $H_\epsilon$ and $d$, and logarithmic in $1/\delta$,   (5)

then with probability at least $1-\delta$, $\hat{V}$ will be uniformly close to $V$:

$\left| \hat{V}(\pi) - V(\pi) \right| \le \epsilon \quad \forall \pi \in \Pi$   (6)

²More precisely, if the agent enters the goal region on some time step, then rather than giving it a reward of 1, we figure out what fraction $t \in [0,1]$ of that time step (measured in continuous time) the agent had taken to enter the goal region, and then give it reward $\gamma^t$ instead. Assuming $t$ is differentiable in the system’s dynamics, then $\gamma^t$ and hence $\hat{V}_{M'}(\pi_\theta)$ are now also differentiable (other than on a usually-measure 0 set, for example from truncation at $H_\epsilon$ steps).

³The algorithm of Kearns, Mansour and Ng uses a “trajectory tree” method to find the estimates $\hat{V}(\pi)$; since each trajectory tree is of size $O(2^{H_\epsilon})$, they were very expensive to build. Each scenario in PEGASUS can be viewed as a compact representation of a trajectory tree (with a technical difference that different subtrees are not constructed independently), and the proof given in Kearns et al. then applies without modification to give Theorem 1.
Using the transformation given in Kearns et al., the case of a finite action space with $|A| > 2$ also gives rise to essentially the same uniform-convergence result, so long as $\Pi$ has low “complexity.”

The bound given in the theorem has no dependence on the size of the state space or on the “complexity” of the POMDP’s transitions and rewards. Thus, so long as $\Pi$ has low VC-dimension, uniform convergence will occur, independently of how complicated the POMDP is. As in Kearns et al., this theorem therefore recovers the best analogous results in supervised learning, in which uniform convergence occurs so long as the hypothesis class has low VC-dimension, regardless of the size or “complexity” of the underlying space and target function.
4.2 The case of infinite action spaces: “Simple” $\Pi$ is insufficient for uniform convergence
We now consider the case of infinite action spaces. Whereas, in the 2-action case, $\Pi$ being “simple” was sufficient to ensure uniform convergence, this is not the case in POMDPs with infinite action spaces.

Suppose $A$ is a (countably or uncountably) infinite set of actions. A “simple” class of policies would be $\Pi = \{\pi_a : a \in A\}$ — the set of all policies that always choose the same action, regardless of the state. Intuitively, this is the simplest policy class that actually uses an infinite action space; also, any reasonable notion of complexity of policy classes should assign $\Pi$ a low “dimension.” If it were true that simple policy classes imply uniform convergence, then it is certainly true that this $\Pi$ should always enjoy uniform convergence. Unfortunately, this is not the case, as we now show.
Theorem 2 Let $A$ be an infinite set of actions, and let $\Pi = \{\pi_a : a \in A\}$ be the corresponding set of all “constant valued” policies. Then there exists a finite state MDP with action space $A$, and a deterministic simulative model for it, so that PEGASUS’ estimates using the deterministic simulative model do not uniformly converge to their means. I.e., there is an $\epsilon > 0$, so that for estimates $\hat{V}$ derived using any finite number $m$ of scenarios and any finite horizon time, there is a policy $\pi \in \Pi$ so that

$\left| \hat{V}(\pi) - V(\pi) \right| > \epsilon.$   (7)
The proof of this Theorem, which is not difficult, is in Appendix A. This result shows that simplicity of $\Pi$ is not sufficient for uniform convergence in the case of infinite action spaces. However, the counterexample used in the proof of Theorem 2 has a very complex $g$ despite the MDP being quite simple. Indeed, a different choice for $g$ would have made uniform convergence occur.⁴ Thus, it is natural to hypothesize that assumptions on the “complexity” of $g$ are also needed to ensure uniform convergence. As we will shortly see, this intuition is roughly correct. Since actions affect transitions only through $g$, the crucial quantity is actually the composition of policies and the deterministic simulative model — in other words, the class $\mathcal{F}$ of the dynamics realizable in the POMDP and policy class, using a particular deterministic simulative model. In the next section, we show how assumptions on the complexity of $\mathcal{F}$ lead to uniform convergence bounds of the type we desire.
4.3 Uniform convergence in the case of infinite action spaces
For the remainder of this section, assume $S = [0,1]^{d_S}$. Then $\mathcal{F}$ is a class of functions mapping from $S \times [0,1]^{d_P}$ into $[0,1]^{d_S}$, and so a simple way to capture its “complexity” is to capture the complexity of its families of coordinate functions $\mathcal{F}_i$, $i = 1, \ldots, d_S$. Each $\mathcal{F}_i$ is a family of functions mapping from $S \times [0,1]^{d_P}$ into $[0,1]$, the $i$-th coordinate of the state vector. Thus, $\mathcal{F}_i$ is just a family of real-valued functions — the family of $i$-th coordinate dynamics that $\Pi$ can realize, with respect to $g$.

The complexity of a class of boolean functions is measured by its VC dimension, defined to be the size of the largest set shattered by the class. To capture the “complexity” of real-valued families of functions such as $\mathcal{F}_i$, we need a generalization of the VC dimension. The pseudo-dimension, due to Pollard [10], is defined as follows:
Definition (Pollard, 1990). Let $\mathcal{H}$ be a family of functions mapping from a space $X$ into $\mathbb{R}$. Let a sequence of $d$ points $x_1, \ldots, x_d \in X$ be given. We say $\mathcal{H}$ shatters $x_1, \ldots, x_d$ if there exists a sequence of real numbers $t_1, \ldots, t_d$ such that the subset of $\mathbb{R}^d$ given by $\{ (h(x_1) - t_1, \ldots, h(x_d) - t_d) : h \in \mathcal{H} \}$ intersects all $2^d$ orthants of $\mathbb{R}^d$ (equivalently, if for any sequence of bits $b_1, \ldots, b_d \in \{0,1\}$, there is a function $h \in \mathcal{H}$ such that $h(x_i) \ge t_i$ if and only if $b_i = 1$, for all $i = 1, \ldots, d$). The pseudo-dimension of $\mathcal{H}$, denoted $\dim_P(\mathcal{H})$, is the size of the largest set that $\mathcal{H}$ shatters, or infinite if $\mathcal{H}$ can shatter arbitrarily large sets.
The pseudo-dimension generalizes the VC dimension, and coincides with it in the case that $\mathcal{H}$ maps into $\{0,1\}$. We will use it to capture the “complexity” of the classes of the POMDP’s realizable dynamics $\mathcal{F}_i$. We also remind readers of the definition of Lipschitz continuity.

⁴For example, $g(s_0, a, p) = s_1$ if $p \le 1/2$, and $s_2$ otherwise; see Appendix A.
Definition. A function $f : \mathbb{R}^{k} \to \mathbb{R}^{l}$ is Lipschitz continuous (with respect to the Euclidean norm on its range and domain) if there exists a constant $B$ such that for all $x, y \in \mathrm{dom}(f)$, $\|f(x) - f(y)\| \le B\|x - y\|$. Here, $B$ is called a Lipschitz bound. A family of functions $\mathcal{H}$ mapping from $\mathbb{R}^{k}$ into $\mathbb{R}^{l}$ is uniformly Lipschitz continuous with Lipschitz bound $B$ if every function $h \in \mathcal{H}$ is Lipschitz continuous with Lipschitz bound $B$.
We now state our main theorem, with a corollary regarding when optimizing $\hat{V}$ will result in a provably good policy. Using tools from [5], it is also possible to show similar uniform convergence results without Lipschitz continuity assumptions, by assuming that the family $\Pi$ is parameterized by a small number of real numbers, and that $\pi$ (for all $\pi \in \Pi$), $g$, and $R$ are each implemented by a function that calculates their results using only a bounded number of the usual arithmetic operations on real numbers. The proof of Theorem 3, which uses techniques first introduced by Haussler [6] and Pollard [10], is quite lengthy, and is deferred to Appendix B.
Theorem 3 Let a POMDP with state space $S = [0,1]^{d_S}$ and a possibly infinite action space $A$ be given. Also let a policy class $\Pi$ and a deterministic simulative model $g : S \times A \times [0,1]^{d_P} \to S$ for the POMDP be given. Let $\mathcal{F}$ be the corresponding family of realizable dynamics in the POMDP, and $\mathcal{F}_i$ the resulting families of coordinate functions. Suppose that $\dim_P(\mathcal{F}_i) \le n$ for each $i = 1, \ldots, d_S$, that each family $\mathcal{F}_i$ is uniformly Lipschitz continuous with Lipschitz bound at most $B$, and that the reward function $R : S \to [-R_{\max}, R_{\max}]$ is also Lipschitz continuous with Lipschitz bound at most $B$. Finally, let $\epsilon, \delta > 0$ be given, and let $\hat{V}$ be the policy-value estimates determined by PEGASUS using $m$ scenarios and a horizon time of $H_\epsilon$.
If $m$ is at least a value that is polynomial in $R_{\max}$, $1/\epsilon$, $H_\epsilon$, $n$, $d_S$ and $d_P$, and logarithmic in $B$ and $1/\delta$, then with probability at least $1-\delta$, $\hat{V}$ will be uniformly close to $V$:

$\left| \hat{V}(\pi) - V(\pi) \right| \le \epsilon \quad \forall \pi \in \Pi$   (8)

Corollary 4 Under the conditions of Theorem 1 or 3, let $m$ be chosen as in the Theorem. Then with probability at least $1-\delta$, the policy $\hat{\pi}$ chosen by optimizing the value estimates, given by $\hat{\pi} = \arg\max_{\pi \in \Pi} \hat{V}(\pi)$, will be near-optimal in $\Pi$:

$V(\hat{\pi}) \ge \mathrm{opt}(M, \Pi) - 2\epsilon$   (9)

Remark. The (Lipschitz) continuity assumptions give a sufficient but not necessary set of conditions for the theorem, and other sets of sufficient conditions can be envisaged. For example, if we assume that the distribution on states induced by any policy at each time step has a bounded density, then we can show uniform convergence for a large class of (“reasonable”) discontinuous reward functions such as $R(s) = 1$ if $s_1 > 0.5$, and $0$ otherwise.⁵

⁵Space constraints preclude a detailed discussion, but briefly, this is done by constructing two Lipschitz continuous reward functions $R^-$ and $R^+$ that are “close to” $R$ and which upper- and lower-bound it (and which hence give value estimates that also upper- and lower-bound our value estimates under $R$); using the assumption of bounded densities to show our values under $R^-$ and $R^+$ are $\epsilon$-close to that of $R$; applying Theorem 3 to show uniform convergence occurs with $R^-$ and $R^+$; and lastly deducing from this that uniform convergence occurs with $R$ as well.

5 Experiments

In this section, we report the results from two experiments. The first, run to examine the behavior of PEGASUS parametrically, involved a simple gridworld POMDP. The second studied a complex continuous state/continuous action problem involving riding a bicycle.

Figure 1a shows the finite state and action POMDP used in our first experiment. In this problem, the agent starts in the lower-left corner, and receives a $-1$ reinforcement per step until it reaches the absorbing state in the upper-right corner. The eight possible observations, also shown in the figure, indicate whether each of the eight squares adjoining the current position contains a wall. The policy class is small, consisting of all $4^8$ functions mapping from the eight possible observations to the four actions corresponding to trying to move in each of the compass directions. Actions are noisy, and result in moving in a random direction 20% of the time. Since the policy class is small enough to exhaustively enumerate, our optimization algorithm for searching over policies was simply exhaustive search, trying all $4^8$ policies on the $m$ scenarios, and picking the best one. Our experiments were done with $\gamma = 0.99$ and a horizon time of 100, and all results reported on this problem are averages over 10000 trials. The deterministic simulative model was

$g(s, a, p) = \begin{cases} s \oplus N & \text{if } p \le 0.05 \\ s \oplus E & \text{if } 0.05 < p \le 0.10 \\ s \oplus S & \text{if } 0.10 < p \le 0.15 \\ s \oplus W & \text{if } 0.15 < p \le 0.20 \\ s \oplus a & \text{otherwise} \end{cases}$

where $s \oplus b$ denotes the result of moving one step from $s$ in the direction indicated by $b$, and is $s$ if this move would result in running into a wall.

Figure 1b shows the result of running this experiment, for different numbers of scenarios. The value of the best policy within $\Pi$ is indicated by the topmost horizontal line, and the solid curve below that is the mean policy value when using our algorithm. As we see, even using surprisingly small numbers of scenarios, the algorithm manages to find good policies, and as $m$ becomes large, the value also approaches the optimal value.
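A sketch of this experimental setup (illustrative; the grid geometry, the observation encoding `observe`, and the `move` helper are assumptions — only the 5% noise thresholds and the $4^8$ enumeration are taken from the text):

```python
from itertools import product

# Cumulative thresholds on p: 5% chance of being pushed in each compass direction.
NOISE = [("N", 0.05), ("E", 0.10), ("S", 0.15), ("W", 0.20)]

def g_grid(s, a, p, move):
    """Deterministic simulative model for the gridworld.
    `move(s, d)` returns the neighbouring cell in direction d, or s at a wall."""
    for d, threshold in NOISE:
        if p <= threshold:
            return move(s, d)
    return move(s, a)          # with probability 0.8 the intended action is executed

def exhaustive_search(v_hat, observe):
    """Try all 4^8 maps from the 8 observations (indexed 0..7 by `observe`) to the
    4 compass actions; `v_hat` takes a policy and returns its PEGASUS estimate on
    the fixed scenarios."""
    best, best_value = None, float("-inf")
    for assignment in product("NESW", repeat=8):
        pi = lambda s, table=assignment: table[observe(s)]
        value = v_hat(pi)
        if value > best_value:
            best, best_value = assignment, value
    return best, best_value
```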
Figure 1: (a) 5x5 gridworld, with the 8 observations. (b) PEGASUS results using the normal and complex deterministic simulative models. The topmost horizontal line shows the value of the best policy in $\Pi$; the solid curve is the mean policy value using the normal model; the lower curve is the mean policy value using the complex model. The (almost negligible) 1 s.e. bars are also plotted.
We had previously predicted that a “complicated” deterministic simulative model $g$ can lead to poor results. For each $(s,a)$-pair, let $h_{s,a} : [0,1] \to [0,1]$ be a hash function that maps any Uniform$[0,1]$ random variable into another Uniform$[0,1]$ random variable.⁶ Then if $g$ is a deterministic simulative model, $g'(s,a,p) = g(s,a,h_{s,a}(p))$ is another one that, because of the presence of the hash function, is a much more “complex” model than $g$. (Here, we appeal to the reader’s intuition about complex functions, rather than formal measures of complexity.) We would therefore predict that using PEGASUS with $g'$ would give worse results than $g$, and indeed this prediction is borne out by the results as shown in Figure 1b (dashed curve). The difference between the curves is not large, and this is also not unexpected given the small size of the problem.⁷
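The “complex” model can be sketched as a wrapper around the original one (the particular hash below is a placeholder assumption; any map that sends Uniform$[0,1]$ back to Uniform$[0,1]$ for each fixed $(s,a)$ would do):

```python
import math

def make_complex_model(g, hash_fn):
    """g'(s, a, p) = g(s, a, h_{s,a}(p)): the same transition distribution,
    but a much 'rougher' dependence on the random number p."""
    return lambda s, a, p: g(s, a, hash_fn(s, a, p))

# Placeholder hash: scale p by a fixed per-(s, a) integer and keep the fractional
# part, which maps a Uniform[0,1] variable back onto Uniform[0,1].
def example_hash(s, a, p):
    n = 1 + (hash((s, a)) % 97)
    return math.modf(n * p)[0]
```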
Our second experiment used Randløv and Alstrøm’s [11] bicycle simulator, where the objective is to ride to a goal one kilometer away. The actions are the torque $\tau$ applied to the handlebars and the displacement $v$ of the rider’s center-of-gravity from the center. The six-dimensional state used in [11] includes variables for the bicycle’s tilt angle and orientation, and the handlebar’s angle. If the bicycle tilt exceeds $\pi/15$, it falls over and enters an absorbing state, receiving a large negative reward. The randomness in the simulator is from a uniformly distributed term added to the intended displacement of the center-of-gravity. Rescaled appropriately, this became the $p$ term of our deterministic simulative model.
We performed policy search over the following space: We selected a vector $x_s$ of fifteen (simple, manually-chosen but not fine-tuned) features of each state; actions were then chosen with sigmoids: $\tau = \sigma(w_\tau \cdot x_s)(\tau_{\max} - \tau_{\min}) + \tau_{\min}$ and $v = \sigma(w_v \cdot x_s)(v_{\max} - v_{\min}) + v_{\min}$, where $\sigma(z) = 1/(1+e^{-z})$. Note that since our approach can handle continuous actions directly, we did not, unlike [11], have to discretize the actions. The initial-state distribution was manually chosen to be representative of a “typical” state distribution when riding a bicycle, and was also not fine-tuned. We used only a small number of scenarios, $m = 30$, with the continuous-time model of discounting discussed earlier, and (essentially) gradient ascent to optimize over the weights.⁸ Shaping rewards, to reward progress towards the goal, were also used.⁹

⁶In our experiments, this was implemented by choosing, for each $(s,a)$ pair, a random integer $n_{s,a}$ from a fixed finite set, and then letting $h_{s,a}(p)$ be the fractional part of $n_{s,a} \cdot p$.

⁷Theory predicts that the difference between $g$ and $g'$’s performance should be at most $O(\sqrt{\log|\Pi|/m})$; see [7].
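A sketch of this policy parameterization (the feature map and the action ranges are placeholder assumptions; only the sigmoid squashing of a linear score into each action’s range is taken from the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_bicycle_policy(w_torque, w_disp, features,
                        torque_range=(-2.0, 2.0), disp_range=(-0.02, 0.02)):
    """Maps a state to (torque, displacement) by squashing two linear scores of the features."""
    def pi(s):
        x = features(s)                                   # fifteen hand-chosen features
        score_t = sum(wi * xi for wi, xi in zip(w_torque, x))
        score_d = sum(wi * xi for wi, xi in zip(w_disp, x))
        lo_t, hi_t = torque_range
        lo_d, hi_d = disp_range
        return (lo_t + sigmoid(score_t) * (hi_t - lo_t),
                lo_d + sigmoid(score_d) * (hi_d - lo_d))
    return pi
```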
We ran 10 trials using our policy search algorithm, testing each of the resulting solutions on 50 rides. Doing so, the median riding distances to the goal of the 10 different policies ranged from about 0.995km¹⁰ to 1.07km. In all 500 evaluation runs for the 10 policies, the worst distance we observed was also about 1.07km. These results are significantly better than those of [11], which reported riding distances of about 7km (since their policies often took very “non-linear” paths to the goal), and a single “best-ever” trial of about 1.7km.
⁸Running experiments without the continuous-time model of discounting, we also obtained, using a non-gradient based hill-climbing algorithm, equally good results as those reported here. Our implementation of gradient ascent, using numerically evaluated derivatives, was run with a bound on the length of a step taken on any iteration, to avoid problems near $\hat{V}_{M'}(\pi_\theta)$’s discontinuities.

⁹Other experimental details: The shaping reward was proportional to and signed the same as the amount of progress towards the goal. As in [11], we did not include the distance-from-goal as one of the state variables during training; training therefore proceeded “infinitely distant” from the goal.

¹⁰Distances under 1km are possible since, as in [11], the goal has a 10m radius.
6 Conclusions
We have shown how any POMDP can be transformed into
an “equivalent” one in which all transitions are deterministic. By approximating the transformed POMDP’s initial
state distribution with a sample of scenarios, we defined an
estimate for the value of every policy, and finally performed
policy search by optimizing these estimates. Conditions
were established under which these estimates will be uniformly good, and experimental results showed our method
working well. It is also straightforward to extend these
methods and results to the cases of finite-horizon undiscounted reward, and infinite-horizon average reward with
$\epsilon$-mixing time $H_\epsilon$.
Acknowledgements
We thank Jette Randløv and Preben Alstrøm for the use
of their bicycle simulator, and Ben Van Roy for helpful
comments. A. Ng is supported by a Berkeley Fellowship.
This work was also supported by ARO MURI DAAH0496-0341, ONR MURI N00014-00-1-0637, and NSF grant
IIS-9988642.
References
[1] L. Baird and A.W. Moore. Gradient descent for general Reinforcement Learning. In NIPS 11, 1999.
[2] Dimitri Bertsekas. Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, 1995.
[3] John R. Birge and Francois Louveaux. Introduction
to Stochastic Programming. Springer, 1997.
[4] R. Durrett. Probability : Theory and Examples, 2nd
edition. Duxbury, 1996.
[5] P. W. Goldberg and M. R. Jerrum. Bounding the
Vapnik-Chervonenkis dimension of concept classes
parameterized by real numbers. Machine Learning,
18:131–148, 1995.
[6] D. Haussler. Decision-theoretic generalizations of
the PAC model for neural networks and other applications. Information and Computation, 100:78–150,
1992.
[7] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate
planning in large POMDPs via reusable trajectories.
(extended version of paper in NIPS 12), 1999.
[8] H. Kimura, M. Yamamura, and S. Kobayashi. Reinforcement learning by stochastic hill climbing on
discounted reward. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.
[9] N. Meuleau, L. Peshkin, K-E. Kim, and L.P. Kaelbling. Learning finite-state controllers for partially
observable environments. In Uncertainty in Artificial
Intelligence, Proceedings of the Fifteenth conference,
1999.
[10] D. Pollard. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in
Probability and Statistics, Vol. 2. Inst. of Mathematical Statistics and American Statistical Assoc., 1990.
[11] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on
Machine Learning, 1998.
[12] Benjamin Van Roy. Learning and Value Function
Approximation in Complex Decision Processes. PhD
thesis, Massachusetts Institute of Technology, 1998.
[13] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
[14] V.N. Vapnik. Estimation of Dependences Based on
Empirical Data. Springer-Verlag, 1982.
[15] J.K. Williams and S. Singh. Experiments with an algorithm which learns stochastic memoryless policies
for POMDPs. In NIPS 11, 1999.
[16] R.J. Williams. Simple statistical gradient-following
algorithms for connectionist reinforcement learning.
Machine Learning, 8:229–256, 1992.
Appendix A: Proof of Theorem 2
Proof (of Theorem 2). We construct an MDP with states $s_0$, $s_1$ and $s_2$, plus an absorbing state. The reward function is $R(s_1) = 1$, $R(s_2) = -1$, and $R = 0$ elsewhere. Discounting is ignored in this construction. Both $s_1$ and $s_2$ transition with probability 1 to the absorbing state regardless of the action taken. The initial state $s_0$ has a .5 chance of transitioning to each of $s_1$ and $s_2$.

We now construct $g$, which will depend in a complicated way on the $p$ term. Let $\mathcal{B} = \{ \bigcup_{i=1}^{k} [q_i, r_i] : k < \infty,\ q_i, r_i \in \mathbb{Q} \cap [0,1] \}$ be the countable set of all finite unions of intervals with rational endpoints in $[0,1]$. Let $\mathcal{B}^{\#}$ be the countable subset of $\mathcal{B}$ that contains all elements of $\mathcal{B}$ that have total length (Lebesgue measure) exactly 0.5. For example, $[1/3, 5/6]$ and $[0.05, 0.25] \cup [0.6, 0.9]$ are both in $\mathcal{B}^{\#}$. Let $B_1, B_2, \ldots$ be an enumeration of the elements of $\mathcal{B}^{\#}$. Also let $a_1, a_2, \ldots$ be an enumeration of (some countably infinite subset of) $A$. The deterministic simulative model on these actions is given by:

$g(s_0, a_i, p) = s_1$ if $p \in B_i$, and $s_2$ otherwise

(with $s_1$, $s_2$ and the absorbing state always mapping to the absorbing state). So, since each $B_i$ has measure 0.5, this is a correct model for the MDP. Note also that $V(\pi) = 0$ for all $\pi \in \Pi$.

For any finite sample of $m$ scenarios $(s_0, p^{(1)}, \ldots), (s_0, p^{(2)}, \ldots), \ldots, (s_0, p^{(m)}, \ldots)$, there exists some $B_i$ such that $p^{(j)} \in B_i$ for all $j = 1, \ldots, m$. Thus, evaluating the constant policy $\pi_i \equiv a_i$ using this set of scenarios, all $m$ simulated trajectories will transition from $s_0$ to $s_1$, so the value estimate for $\pi_i$ is $\hat{V}(\pi_i) = 1$ (assuming $H_\epsilon \ge 1$). Since this argument holds for any finite number $m$ of scenarios, we have shown that $\hat{V}$ does not uniformly converge to $V \equiv 0$ (over $\pi \in \Pi$). $\Box$
Appendix B: Proof of Theorem 3
Due to space constraints, this proof will be slightly dense.
The proof techniques we use are due to Haussler [6] and
Pollard [10]. Haussler [6], to which we will be repeatedly
referring, provides a readable introduction to most of the
methods used here.
We begin with some standard definitions from [6]. For a subset $W$ of a space $X$ endowed with (pseudo-)metric $d$, we say $W' \subseteq X$ is an $\epsilon$-cover for $W$ if, for every $w \in W$, there is some $w' \in W'$ such that $d(w, w') \le \epsilon$. For each $\epsilon > 0$, let $\mathcal{N}(\epsilon, W, d)$ denote the size of the smallest $\epsilon$-cover for $W$.

Let $\mathcal{H}$ be a family of functions mapping from a set $X$ into a bounded pseudo metric space $(Y, d_Y)$, and let $P$ be a probability measure on $X$. Define a pseudo metric on $\mathcal{H}$ by $d_{L^1(P)}(h, h') = \mathrm{E}_{x \sim P}\left[ d_Y(h(x), h'(x)) \right]$. Define the capacity of $\mathcal{H}$ to be $\mathcal{C}(\epsilon, \mathcal{H}, d_Y) = \sup_P \mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P)})$, where the $\sup$ is over all probability measures $P$ on $X$. The quantity $\mathcal{C}(\epsilon, \mathcal{H}, d_Y)$ thus measures the “richness” of the class $\mathcal{H}$. Note that $\mathcal{N}$ and $\mathcal{C}$ are both decreasing functions of $\epsilon$, and that $\mathcal{C}(\alpha\epsilon, \mathcal{H}, \alpha d_Y) = \mathcal{C}(\epsilon, \mathcal{H}, d_Y)$ for any $\alpha > 0$.

The main results obtained with pseudo-dimension are uniform convergence of the empirical means of classes of random variables to their true means. Let $\mathcal{H}$ be a family of functions mapping from $X$ into $[0, M]$, and let $\bar{x} = (x^{(1)}, \ldots, x^{(m)})$ (the “training set”) be $m$ i.i.d. draws from some probability measure $P$ over $X$. Then for each $h \in \mathcal{H}$, let $\hat{\mathrm{E}}_{\bar{x}}[h] = \frac{1}{m}\sum_{i=1}^{m} h(x^{(i)})$ be the empirical mean of $h$. Also let $\mathrm{E}[h] = \mathrm{E}_{x \sim P}[h(x)]$ be the true mean.

We now state a few results from [6]. In [6], these are Theorem 6 combined with Theorem 12; Lemma 7; Lemma 8; and Theorem 9 (with its parameters instantiated appropriately for our setting). Below, $d_{L^1}$ and $d_{L^2}$ respectively denote the Manhattan and Euclidean metrics on $\mathbb{R}^k$; e.g. $d_{L^1}(x, y) = \sum_i |x_i - y_i|$.¹¹
Lemma 5 Let $\mathcal{H}$ be a family of functions mapping from $X$ into $[0, M]$, and $\dim_P(\mathcal{H}) = d$. Then for any probability measure $P$ on $X$ and any $0 < \epsilon \le M$, we have that $\mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P)}) \le 2\left( \frac{2eM}{\epsilon} \ln\frac{2eM}{\epsilon} \right)^{d}$.

¹¹This is inconsistent with the definition used in [6], which has an additional $1/k$ factor.

Lemma 6 Let $\mathcal{H}_1, \ldots, \mathcal{H}_k$ each be a family of functions mapping from $X$ into $[0,1]$. The free product of the $\mathcal{H}_i$’s is the class of functions $\mathcal{H} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_k = \{ x \mapsto (h_1(x), \ldots, h_k(x)) : h_i \in \mathcal{H}_i \}$ mapping from $X$ into $[0,1]^k$. Then for any probability measure $P$ on $X$ and $\epsilon > 0$,

$\mathcal{N}(\epsilon, \mathcal{H}, d_{L^1(P)}) \le \prod_{i=1}^{k} \mathcal{N}(\epsilon/k, \mathcal{H}_i, d_{L^1(P)})$   (10)

Lemma 7 Let $(X_1, d_1), \ldots, (X_{k+1}, d_{k+1})$ be bounded metric spaces, and for each $i$, let $\mathcal{H}_i$ be a class of functions mapping from $X_i$ into $X_{i+1}$. Suppose that each $\mathcal{H}_i$ is uniformly Lipschitz continuous (with respect to the metric $d_i$ on its domain, $d_{i+1}$ on its range), with some Lipschitz bound $B \ge 1$, and let $\mathcal{H} = \{ h_k \circ \cdots \circ h_1 : h_i \in \mathcal{H}_i \}$ be the class of functions mapping from $X_1$ into $X_{k+1}$ given by composition of the functions in the $\mathcal{H}_i$’s. Let $\epsilon_0 > 0$ be given, and let $\epsilon = \left( \sum_{i=0}^{k-1} B^i \right) \epsilon_0$. Then

$\mathcal{C}(\epsilon, \mathcal{H}, d_{k+1}) \le \prod_{i=1}^{k} \mathcal{C}(\epsilon_0, \mathcal{H}_i, d_{i+1})$   (11)

Lemma 8 Let $\mathcal{H}$ be a family of functions mapping from $X$ into $[0, M]$, and let $P$ be a probability measure on $X$. Let $\bar{x}$ be generated by $m$ independent draws from $P$, and assume $\epsilon > 0$. Then

$\Pr\left[ \exists h \in \mathcal{H} : \left| \hat{\mathrm{E}}_{\bar{x}}[h] - \mathrm{E}[h] \right| > \epsilon \right] \le 8\,\mathcal{C}(\epsilon/16, \mathcal{H}, d_{L^1})\, e^{-\epsilon^2 m / (64 M^2)}$   (12)
We are now ready to prove Theorem 3. No serious attempt
has been made to tighten polynomial factors in the bound.
Proof (of Theorem 3). Our proof is in three parts. First, $\hat{V}$ gives an estimate of the discounted rewards summed over $H_\epsilon + 1$ steps; we reduce the problem of showing uniform convergence of $\hat{V}$ to one of proving that our estimates of the expected rewards on the $t$-th step, $t = 0, \ldots, H_\epsilon$, all converge uniformly. Second, we carefully define the mapping from the scenarios $s_0^{(i)}$ to the $t$-th step rewards, and use Lemmas 5, 6 and 7 to bound its capacity. Lastly, applying Lemma 8 gives our result. To simplify the notation in this proof, assume $R_{\max} = 1$ and $B \ge 1$.
Part I: Reduction to uniform convergence of $t$-th step rewards. $\hat{V}$ was defined by

$\hat{V}(\pi) = \frac{1}{m}\sum_{i=1}^{m}\left[ R(s_0^{(i)}) + \gamma R(s_1^{(i)}) + \cdots + \gamma^{H_\epsilon} R(s_{H_\epsilon}^{(i)}) \right].$

For each $t$, let $\hat{V}_t(\pi) = \frac{1}{m}\sum_{i=1}^{m} R(s_t^{(i)})$ be the empirical mean of the reward on the $t$-th step, and let $V_t(\pi) = \mathrm{E}_{s_0 \sim D'}[R(s_t)]$ be the true expected reward on the $t$-th step (starting from $s_0$ and executing $\pi$). Thus, $V(\pi) = \sum_{t=0}^{\infty} \gamma^t V_t(\pi)$.

Suppose we can show, for each $t = 0, \ldots, H_\epsilon$, that with probability $1 - \delta/(H_\epsilon+1)$,

$\left| \hat{V}_t(\pi) - V_t(\pi) \right| \le \frac{\epsilon}{2(H_\epsilon+1)} \quad \forall \pi \in \Pi$   (13)

Then we know, by the union bound, that with probability $1-\delta$, $|\hat{V}_t(\pi) - V_t(\pi)| \le \epsilon/(2(H_\epsilon+1))$ holds simultaneously for all $t = 0, \ldots, H_\epsilon$ and for all $\pi \in \Pi$. This implies that, for all $\pi \in \Pi$,

$\left| \hat{V}(\pi) - V(\pi) \right| \le \sum_{t=0}^{H_\epsilon} \gamma^t \left| \hat{V}_t(\pi) - V_t(\pi) \right| + \left| \sum_{t=H_\epsilon+1}^{\infty} \gamma^t V_t(\pi) \right| \le \sum_{t=0}^{H_\epsilon} \frac{\epsilon}{2(H_\epsilon+1)} + \frac{\epsilon}{2} \le \epsilon,$

where we used the fact that $\left| \sum_{t=H_\epsilon+1}^{\infty} \gamma^t V_t(\pi) \right| \le \epsilon/2$, by construction of the $\epsilon$-horizon time. But this is exactly the desired result. Thus, we need only prove that Equation (13) holds with high probability for each $t = 0, \ldots, H_\epsilon$.
Part II: Bounding the capacity. Let $t \le H_\epsilon$ be fixed. We now write out the mapping from a scenario $s_0^{(i)} \in S \times [0,1]^\infty$ to the $t$-th step reward. Since this mapping depends only on the first $d_P t$ elements of the “$p$”s portion of the scenario, we will, with some abuse of notation, write the scenario as $s_0^{(i)} \in S \times [0,1]^{d_P t}$, and ignore its other coordinates. Thus, a scenario may now be written as $(s, p_1, p_2, \ldots, p_{d_P t})$.

Given a family of functions (such as $\mathcal{F}_i$) mapping from $S \times [0,1]^{d_P}$ into $[0,1]$, we extend its domain to $S \times [0,1]^{d_P k}$ for any finite $k \ge 0$ simply by having it ignore the extra coordinates. Note this extension of the domain does not change the pseudo-dimension of a family of functions. Also, for each $j$, define a projection map $\mathrm{pr}_j(s, p_1, p_2, \ldots) = p_j$, and let $\{\mathrm{pr}_j\}$ be the corresponding singleton families; where necessary, their domains are also extended as we have just described.

For each $i = 1, \ldots, t+1$, define $X_i = S \times [0,1]^{d_P(t+1-i)}$. For example, $X_1$ is just the space of scenarios (with only the first $d_P t$ elements of the $p$’s kept), and $X_{t+1} = S$. For each $i = 1, \ldots, t$, define a family of maps from $X_i$ into $X_{i+1}$ according to

$\tilde{\mathcal{F}}_i = \mathcal{F} \times \{\mathrm{pr}_{d_P+1}\} \times \cdots \times \{\mathrm{pr}_{d_P(t+1-i)}\}$

(where the definition of the free product of sets of functions is as given in Lemma 6); that is, $\tilde{\mathcal{F}}_i$ applies some realizable dynamics $f_\pi$ to the current state and the next block of $d_P$ random numbers, and copies the remaining random numbers forward unchanged. Note that such an $\tilde{\mathcal{F}}_i$ is uniformly Lipschitz continuous, with a Lipschitz bound that is at most polynomial in $B$, $d_S$ and $d_P$. Also let $\tilde{\mathcal{F}}_{t+1} = \{R\}$ be a singleton set containing the reward function, and $X_{t+2} = [-R_{\max}, R_{\max}]$. Finally, let $\mathcal{H}_t$ be the family of maps from $X_1$ (scenarios) into $[-R_{\max}, R_{\max}]$ given by composition of the functions in the $\tilde{\mathcal{F}}_i$’s.

Now, consider the map taking a scenario to the reward received on the $t$-th step when executing $\pi'$ from that scenario. As we let $\pi'$ vary over $\Pi'$, this defines a family of maps from scenarios into $[-R_{\max}, R_{\max}]$. Clearly, this family of maps is a subset of $\mathcal{H}_t$. Thus, if we can bound the capacity of $\mathcal{H}_t$ (and hence prove uniform convergence over $\mathcal{H}_t$), we have also proved uniform convergence of the $t$-th step reward estimates (over all $\pi \in \Pi$).

For each $j \le d_S$, since $\dim_P(\mathcal{F}_j) \le n$, Lemma 5 implies that $\mathcal{N}(\epsilon, \mathcal{F}_j, d_{L^1(P)}) \le 2\left( \frac{2e}{\epsilon} \ln\frac{2e}{\epsilon} \right)^n$. Moreover, clearly $\mathcal{N}(\epsilon, \{\mathrm{pr}_j\}, d_{L^1(P)}) = 1$, since each $\{\mathrm{pr}_j\}$ is a singleton set. Combined with Lemma 6 (applied with at most $d_S + d_P H_\epsilon$ coordinate families), this implies that, for each $i = 1, \ldots, t$,

$\mathcal{N}(\epsilon, \tilde{\mathcal{F}}_i, d_{L^1(P)}) \le \left[ 2\left( \frac{2e(d_S + d_P H_\epsilon)}{\epsilon} \ln\frac{2e(d_S + d_P H_\epsilon)}{\epsilon} \right)^n \right]^{d_S}$   (14)

where we have used the fact that $\mathcal{N}$ is decreasing in its $\epsilon$ parameter. By taking a $\sup$ over probability measures, this is also a bound on $\mathcal{C}(\epsilon, \tilde{\mathcal{F}}_i, d_{L^1})$.

Finally, applying Lemma 7 with each of the metrics being the $d_{L^1}$ norm on the appropriate space, composing the $t$ dynamics layers and the reward layer (each uniformly Lipschitz continuous as above), and choosing the per-layer scale $\epsilon_0$ so that the accumulated scale $\epsilon = \left(\sum_i B^i\right)\epsilon_0$ matches the target, we find that $\mathcal{C}(\epsilon, \mathcal{H}_t, d_{L^1})$ is bounded by the product of the per-layer capacities. In particular, $\ln \mathcal{C}(\epsilon, \mathcal{H}_t, d_{L^1})$ is at most polynomial in $n$, $d_S$, $d_P$, $H_\epsilon$, $\ln B$ and $\ln(1/\epsilon)$.
Part III: Proving uniform convergence. Applying Lemma 8 with the above bound on $\mathcal{C}(\epsilon, \mathcal{H}_t, d_{L^1})$, we find that for there to be a $1 - \delta/(H_\epsilon+1)$ probability of our estimate of the expected $t$-th step reward being $\frac{\epsilon}{2(H_\epsilon+1)}$-close to its mean, it suffices that

$m \ge \frac{c\,(H_\epsilon+1)^2}{\epsilon^2}\left( \ln \mathcal{C}\!\left( \frac{\epsilon}{32(H_\epsilon+1)}, \mathcal{H}_t, d_{L^1} \right) + \ln\frac{H_\epsilon+1}{\delta} \right)$

for an absolute constant $c$. By the capacity bound of Part II, this is at most a value polynomial in $1/\epsilon$, $H_\epsilon$, $n$, $d_S$ and $d_P$, and logarithmic in $B$ and $1/\delta$ (recall we normalized $R_{\max} = 1$), matching the bound in the statement of the Theorem.

This completes the proof of the Theorem. $\Box$